Directly expands to infinite length, Google Infini-Transformer ends the context length debate


WBOY

Apr 13, 2024 am 08:00 AM


I wonder if Gemini 1.5 Pro uses this technology.

Google has made another big move and released a next-generation Transformer model, Infini-Transformer.

Infini-Transformer introduces an efficient way to scale Transformer-based large language models (LLMs) to infinitely long inputs with bounded memory and computation. Using this technique, the researchers extended the context length of a 1B model to 1 million tokens; applied to an 8B model, it handles a 500K-length book summarization task.

The Transformer architecture has dominated generative AI since the groundbreaking paper "Attention Is All You Need" was published in 2017. Google has been revising the Transformer design at a rapid pace lately. A few days earlier, it released Mixture-of-Depths (MoD), which changes the way a Transformer spends its computation. This new study followed within days.

AI researchers understand the importance of memory: it is a cornerstone of intelligence and makes efficient computation possible for LLMs. However, Transformers and Transformer-based LLMs exhibit quadratic complexity in both memory usage and computation time because of the way the attention mechanism works. For example, for a 500B model with a batch size of 512 and a context length of 2048, the attention key-value (KV) states take up 3TB of memory. Scaling an LLM to longer sequences (such as 1 million tokens) is therefore difficult with the standard Transformer architecture: the memory overhead is huge, and deployment cost rises as the context length grows.
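To make the growth of the KV states concrete, here is a rough back-of-the-envelope sketch. The layer count, head count, head dimension, and 2-byte precision are hypothetical values chosen for illustration. They are not the configuration of the 500B model quoted above, so the numbers are not meant to reproduce the 3TB figure.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, batch_size,
                   context_len, bytes_per_value=2):
    """Bytes needed to hold the key and value states of every layer."""
    # Factor 2 = one key vector plus one value vector per token, head and layer.
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return per_token * batch_size * context_len


# Hypothetical model configuration (an assumption, not taken from the article).
cfg = dict(num_layers=32, num_heads=32, head_dim=128, batch_size=1)

short = kv_cache_bytes(context_len=2_048, **cfg)
long = kv_cache_bytes(context_len=1_000_000, **cfg)

print(f"{short / 2**30:.2f} GiB of KV state at 2,048 tokens")
print(f"{long / 2**30:.2f} GiB of KV state at 1,000,000 tokens")
```

The cached KV state grows in direct proportion to the context length, which is why serving a 1M-token context with an unmodified Transformer becomes so expensive.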

To address this, Google introduces an effective approach whose key component is a new attention technique called Infini-attention. Traditional Transformers with local attention discard old segments to free up memory for new ones. Infini-attention instead adds a compressive memory that stores the old segments in compressed form. At output time, the current context and the information in the compressive memory are aggregated, so the model can retrieve the complete context history.

This method lets Transformer LLMs scale to infinitely long contexts with bounded memory and process extremely long inputs in a streaming fashion.

Experiments show that the method outperforms the baselines on long-context language modeling benchmarks while reducing memory size by more than 100 times. The model achieves even better perplexity when trained with a sequence length of 100K. In addition, the study found that a 1B model fine-tuned on passkey instances of 5K sequence length solves the passkey retrieval task at 1M length. Finally, the paper shows that an 8B model with Infini-attention achieves new SOTA results on a 500K-length book summarization task after continual pre-training and task fine-tuning.

The contributions of this article are summarized as follows:

Introduces a practical and powerful attention mechanism, Infini-attention, which combines long-term compressive memory with local causal attention to model both long-range and short-range context dependencies efficiently;
Infini-attention makes only minimal changes to standard scaled dot-product attention and is designed to support plug-and-play continual pre-training and long-context adaptation;
The approach enables Transformer LLMs to process extremely long inputs in a streaming fashion, scaling to infinitely long contexts with bounded memory and compute resources.


Paper link: https://arxiv.org/pdf/2404.07143.pdf
Paper title: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Method introduction

Infini-attention enables Transformer LLMs to process infinitely long inputs efficiently with a bounded memory footprint and bounded computation. As shown in Figure 1 below, Infini-attention incorporates a compressive memory into the ordinary attention mechanism and builds both masked local attention and long-term linear attention into a single Transformer block.

This subtle but critical modification to the Transformer attention layer can extend the context window of existing LLMs to infinite length through continual pre-training and fine-tuning.

Infini-attention reuses all the key, value, and query states of the standard attention computation for long-term memory consolidation and retrieval. Rather than discarding the old KV states as standard attention does, it stores them in the compressive memory. When processing subsequent sequences, Infini-attention uses the attention query states to retrieve values from that memory. To compute the final context output, it aggregates the values retrieved from long-term memory with the local attention context.
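The retrieve, attend, aggregate, and store steps described above can be sketched for a single attention head in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the ELU+1 feature map, the normalization vector `norm`, the additive memory update, and the scalar gate `beta` follow a common linear-attention formulation and are assumptions here, and the random `q`, `k`, `v` stand in for projected hidden states.

```python
import numpy as np


def elu_plus_one(x):
    # Assumed feature map: ELU(x) + 1, which keeps every feature positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_softmax_attention(q, k, v):
    # Standard scaled dot-product attention with a causal mask in one segment.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def infini_attention_segment(q, k, v, memory, norm, beta):
    """Process one segment; returns (output, new_memory, new_norm)."""
    sq, sk = elu_plus_one(q), elu_plus_one(k)
    # 1) Retrieve values from the compressive memory using the current queries.
    retrieved = (sq @ memory) / (sq @ norm + 1e-6)[:, None]
    # 2) Local causal attention inside the current segment.
    local = causal_softmax_attention(q, k, v)
    # 3) Aggregate both contexts with a gate (beta would be learned).
    gate = 1.0 / (1.0 + np.exp(-beta))
    out = gate * retrieved + (1.0 - gate) * local
    # 4) Store this segment's KV states in memory instead of discarding them.
    new_memory = memory + sk.T @ v
    new_norm = norm + sk.sum(axis=0)
    return out, new_memory, new_norm


rng = np.random.default_rng(0)
d_key, d_value, seg_len = 16, 16, 8
memory = np.zeros((d_key, d_value))
norm = np.zeros(d_key)
outputs = []
for _ in range(5):  # stream five segments through the same fixed-size memory
    q = rng.standard_normal((seg_len, d_key))
    k = rng.standard_normal((seg_len, d_key))
    v = rng.standard_normal((seg_len, d_value))
    out, memory, norm = infini_attention_segment(q, k, v, memory, norm, beta=0.0)
    outputs.append(out)

print(memory.shape, norm.shape)
```

However many segments are streamed through, `memory` and `norm` keep the same shape, which is the property that gives the bounded footprint.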

As shown in Figure 2 below, the research team compared Infini-Transformer, which is built on Infini-attention, with Transformer-XL. Like Transformer-XL, Infini-Transformer operates on a sequence of segments and computes the standard causal dot-product attention context within each segment. The dot-product attention computation is therefore local in a sense.

However, whereas local attention discards the attention states of the previous segment when processing the next one, Infini-Transformer reuses the old KV attention states to maintain the entire context history in the compressive memory. As a result, each attention layer of Infini-Transformer has both a global compressed state and a local fine-grained state.

As in multi-head attention (MHA), Infini-attention maintains H parallel compressive memories in addition to dot-product attention (H is the number of attention heads).

Table 1 below lists the context-memory footprint and effective context length of several models, expressed in terms of model parameters and input segment length. Infini-Transformer supports an unbounded context window with a bounded memory footprint.
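As a rough illustration of why the footprint stays bounded, the sketch below counts stored entries for a per-head compressive memory and for a plain cache of past keys and values. The counting rules (one d_key x d_value matrix plus one d_key normalization vector per head and layer) and the model dimensions are assumptions made for this example. They are not copied from Table 1.

```python
def compressive_memory_entries(d_key, d_value, num_heads, num_layers):
    # Assumed state per head and layer: a (d_key x d_value) matrix
    # plus a d_key normalization vector. Independent of context length.
    return d_key * (d_value + 1) * num_heads * num_layers


def kv_cache_entries(d_key, d_value, num_heads, num_layers, cached_tokens):
    # One key and one value vector per cached token, head and layer.
    return (d_key + d_value) * num_heads * num_layers * cached_tokens


# Hypothetical model dimensions (an assumption, not taken from the article).
cfg = dict(d_key=128, d_value=128, num_heads=8, num_layers=12)

fixed = compressive_memory_entries(**cfg)
for n in (2_048, 65_536, 1_000_000):
    cache = kv_cache_entries(cached_tokens=n, **cfg)
    print(f"{n:>9,} tokens: KV cache {cache:>15,} entries, "
          f"compressive memory {fixed:,} entries")
```

The KV cache count scales with the number of cached tokens, while the compressive memory count depends only on the model dimensions.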

Experiment

The researchers evaluated Infini-Transformer on benchmarks with extremely long input sequences: long-context language modeling, passkey context block retrieval at 1M length, and book summarization at 500K length. For language modeling, they trained the models from scratch; for the passkey and book summarization tasks, they continually pre-trained existing LLMs to demonstrate Infini-attention's plug-and-play long-context adaptation.

Long-context language modeling. The results in Table 2 show that Infini-Transformer outperforms the Transformer-XL and Memorizing Transformers baselines while using 114x fewer memory parameters than the Memorizing Transformers model.

Passkey task. Table 3 shows that an Infini-Transformer fine-tuned on 5K-length inputs solves the passkey task at context lengths of up to 1M. The test inputs ranged from 32K to 1M tokens. For each test subset, the researchers controlled the position of the passkey so that it appeared near the beginning, middle, or end of the input sequence. The experiments report both zero-shot accuracy and fine-tuned accuracy, the latter after 400 steps of fine-tuning on 5K-length inputs.
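For readers unfamiliar with this kind of test, the sketch below shows one way such an input could be assembled, with the passkey placed near the start, middle, or end of a long run of distractor text. The prompt wording, the `FILLER` text, and the `build_passkey_prompt` helper are hypothetical and only illustrate the idea; they are not the exact template used in the paper.

```python
# Hypothetical distractor sentence block, repeated to pad the input.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")


def build_passkey_prompt(passkey, total_filler_repeats, position):
    """Hide `passkey` in filler text.

    position is a fraction in [0, 1]: 0.0 places the passkey near the start
    of the input, 0.5 near the middle and 1.0 near the end.
    """
    before = round(total_filler_repeats * position)
    after = total_filler_repeats - before
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    return (
        "There is important info hidden inside a lot of irrelevant text. "
        "Find it and memorize it. "
        + FILLER * before
        + needle
        + FILLER * after
        + "What is the pass key? The pass key is"
    )


# Same passkey and total length, three different positions.
prompts = {
    name: build_passkey_prompt(9054, total_filler_repeats=100, position=pos)
    for name, pos in (("start", 0.0), ("middle", 0.5), ("end", 1.0))
}
for name, prompt in prompts.items():
    print(name, len(prompt), prompt.index("The pass key is 9054"))
```

Raising `total_filler_repeats` lengthens the input while the passkey stays a single short sentence, so the model has to carry that one fact across the whole context.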

Summarization task. Table 4 compares Infini-Transformer with encoder-decoder models built specifically for summarization. The results show that Infini-Transformer surpasses the previous best results and sets a new SOTA on BookSum by processing the entire text of each book.

The researchers also plotted the overall Rouge score on the BookSum validation split in Figure 4. The trend line shows that Infini-Transformer's summarization metrics improve as the input length increases.
