


Cited 38,000 times in five years: how the Transformer universe has developed
Since it was proposed in 2017, the Transformer model has shown unprecedented strength in fields such as natural language processing and computer vision, and has triggered technological breakthroughs such as ChatGPT. Researchers have also proposed many variants of the original model.
As academia and industry continue to propose new models based on the Transformer attention mechanism, it can be difficult to keep an overview of this direction. A recent comprehensive article by Xavier Amatriain, head of AI product strategy at LinkedIn, may help with this problem.
In the past few years, dozens of models from the Transformer family have appeared one after another, all with memorable and sometimes funny, though not self-explanatory, names. The goal of this article is to provide a comprehensive but simple catalog and classification of the most popular Transformer models. In addition, it introduces the most important aspects and innovations in Transformer models.
Paper: "Transformer models: an introduction and catalog"
Paper link: https://arxiv.org/abs/2302.07730
GitHub: https://github.com/xamat/TransformerCatalog
Introduction: What is a Transformer
Transformers are a class of deep learning models defined by certain architectural features. They first appeared in the famous paper "Attention is All you Need", published by Google researchers in 2017 (the paper has been cited more than 38,000 times in just 5 years), and in the related blog post. The Transformer architecture is a specific instance of the encoder-decoder model [2], which had become popular 2-3 years earlier. Until then, however, attention was only one of the mechanisms used by these models, which were mainly based on LSTM (Long Short-Term Memory) [3] and other RNN (Recurrent Neural Network) [4] variants. The key insight of the Transformer paper is that, as the title suggests, attention can be used as the only mechanism for deriving dependencies between inputs and outputs. Discussing all the details of the Transformer architecture is beyond the scope of this article. For that, this article recommends the original paper above or The Illustrated Transformer post, both of which are excellent. That said, this article briefly describes the most important aspects, since they are referred to in the catalog below. It starts with the basic architecture diagram from the original paper and then expands on the related content.
Encoder/Decoder Architecture
The generic encoder/decoder architecture (see Figure 1) is composed of two models. The encoder takes the input and encodes it into a fixed-length vector. The decoder takes this vector and decodes it into an output sequence. The encoder and decoder are jointly trained to maximize the conditional log-likelihood of the output given the input. Once trained, the encoder/decoder can generate an output given an input sequence, or it can score a pair of input/output sequences. In the original Transformer architecture, both the encoder and the decoder have 6 identical layers. Each of the 6 encoder layers has two sub-layers: a multi-head attention layer and a simple feedforward network. Each sub-layer has a residual connection and a layer normalization. The output size of the encoder is 512. The decoder adds a third sub-layer, which is another multi-head attention layer over the encoder output. In addition, the other multi-head attention layer in the decoder is masked, so that a position cannot attend to the positions that follow it.
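To make the "residual connection plus layer normalization" wrapping more concrete, here is a minimal sketch in plain Python. It is an illustration, not the paper's implementation: the attention sub-layer is left out, `toy_feed_forward` is a hypothetical stand-in for the real feedforward network, the layer normalization has no learned gain or bias, and a 4-dimensional vector is used instead of 512 dimensions.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Compute LayerNorm(x + Sublayer(x)) for one sub-layer."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])


def toy_feed_forward(x: Vector) -> Vector:
    """Hypothetical stand-in for the feedforward sub-layer: ReLU, then scale."""
    return [0.5 * max(0.0, v) for v in x]


def encoder_stack(x: Vector, num_layers: int = 6) -> Vector:
    """Pass a vector through num_layers identical (toy) layers."""
    for _ in range(num_layers):
        x = residual_sublayer(x, toy_feed_forward)
    return x


# Assumed toy input; a real model would use vectors of size 512.
y = encoder_stack([1.0, -2.0, 3.0, 0.5])
```

Because every sub-layer output is added back to its input and then normalized, the representation keeps the same size from layer to layer, which is what allows 6 identical layers to be stacked.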
Figure 1: Transformer architecture
Figure 2: Attention mechanism
Attention
It is clear from the above description that the only special element of the model architecture is multi-head attention, and, as described above, this is where the full power of the model lies. So, what exactly is attention? An attention function maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Transformers use multi-head attention, which is the parallel computation of a specific attention function called scaled dot-product attention. For more details on how the attention mechanism works, this article again refers to The Illustrated Transformer post, and the diagram from the original paper is reproduced in Figure 2 to convey the main idea. Attention layers have several advantages over recurrent and convolutional networks, the two most important being their lower computational complexity and their higher connectivity, which are especially useful for learning long-term dependencies in sequences.
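The description above can be sketched as a single-head scaled dot-product attention in plain Python: the weights are a softmax over the query-key dot products divided by the square root of the key dimension, and the output is the weighted sum of the values. The function names and the toy queries, keys, and values are assumptions chosen for illustration; multi-head attention would run several such computations in parallel on learned projections of the inputs, which is not shown here.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """For each query, return a weighted sum of the values.

    The weight of each value is softmax(q . k / sqrt(d_k)) over all keys.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Compatibility of the query with every key (scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)])
    return outputs


# Toy data (assumed): the query is aligned with the first key,
# so the first value should dominate the output.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention([[1.0, 0.0]], keys, values)
```

With a query that matches no key better than another (for example an all-zero query), the weights become uniform and the output is simply the average of the values.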
What are Transformers used for and why are they so popular
The original Transformer was designed for language translation, in particular from English to German. However, as the original research paper already showed, the architecture generalizes well to other language tasks. This trend quickly caught the attention of the research community. In the following months, most leaderboards for language-related ML tasks were completely dominated by some version of the Transformer architecture (e.g., the famous SQuAD leaderboard, where all top models are ensembles of Transformers). One of the key reasons why Transformers came to dominate most NLP leaderboards so quickly is their ability to adapt rapidly to other tasks, a.k.a. transfer learning. Pretrained Transformer models can be adapted very easily and quickly to tasks for which they were not trained, which is a huge advantage. As an ML practitioner, you no longer need to train large models on huge datasets. All you need to do is reuse the pretrained model in your task, perhaps adapting it slightly with a much smaller dataset. One specific technique used to adapt a pretrained model to a different task is called fine-tuning.
Transformers proved so adaptable that, although they were originally developed for language-related tasks, they were quickly adopted for other tasks, ranging from vision, audio, and music applications all the way to playing chess or doing math.
Of course, none of these applications would have been possible without the myriad of tools that made Transformers readily available to anyone who can write a few lines of code. Not only were Transformers quickly integrated into the main artificial intelligence frameworks (i.e., PyTorch and TF), but entire companies have even been built around them. Huggingface, a startup that has raised over $60 million to date, was built almost entirely around the idea of commercializing its open-source Transformers library.
Finally, it is necessary to mention the impact of GPT-3 on the early popularity of Transformers. GPT-3 is a Transformer model launched by OpenAI in May 2020 as a follow-up to their earlier GPT and GPT-2. The company created a lot of buzz by introducing the model in a preprint in which they claimed it was so powerful that they could not release it to the world. Since then, the model has not only been released, but also commercialized through a large-scale partnership between OpenAI and Microsoft. GPT-3 powers over 300 different applications and is fundamental to OpenAI's business strategy (which makes sense for a company that has raised over $1 billion in funding).
RLHF
Recently, reinforcement learning from human feedback (RLHF), also known as reinforcement learning from human preferences (RLHP), has become a major addition to the artificial intelligence toolkit. The concept was already proposed in the 2017 paper "Deep reinforcement learning from human preferences". More recently, it has been applied to ChatGPT and similar conversational agents such as BlenderBot or Sparrow. The idea is simple: once a language model is pre-trained, one can generate different responses to a dialogue and have humans rank the results. These rankings (aka preferences or feedback) can then be used to train a reward in a reinforcement learning setting (see Figure 3).
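One common way to turn human rankings into a trainable reward is a pairwise preference loss: for each pair of responses, the reward of the preferred response should exceed the reward of the rejected one. The sketch below is a toy illustration under stated assumptions, not the method of any particular system: the reward is a linear function of hypothetical hand-made feature vectors (real systems score text with a neural network), and the loss is the negative log-sigmoid of the reward margin.

```python
import math
from typing import List, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]  # (preferred response features, rejected response features)


def reward(weights: Vector, features: Vector) -> float:
    """Toy linear reward: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights: Vector, preferred: Vector, rejected: Vector) -> float:
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))


def train_reward_model(pairs: List[Pair], dim: int, lr: float = 0.1, epochs: int = 200) -> Vector:
    """Fit the reward weights with plain stochastic gradient descent."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (preferred - rejected).
            coeff = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * coeff * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per response, e.g. [helpfulness, rudeness];
# in each pair, the first response is the one humans ranked higher.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.9, 0.3], [0.1, 0.4]),
]
learned = train_reward_model(pairs, dim=2)
```

A full ranking of several responses can be broken down into such pairs. The learned reward would then serve as the signal that a reinforcement learning algorithm optimizes when further training the language model.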
Diffusion
Diffusion models have become the new SOTA in image generation, clearly pushing aside previous methods such as GANs (Generative Adversarial Networks). What is a diffusion model? Diffusion models are a class of latent variable models trained with variational inference. In practice, a network trained in this way learns the latent space that these images represent (see Figure 4).
Diffusion models are related to other generative models, such as the famous Generative Adversarial Networks (GANs), which they have largely replaced in many applications, and especially to (denoising) autoencoders. Some authors even say that diffusion models are just a specific instance of autoencoders. However, they also acknowledge that the small differences do change the application, from the latent representation of autoencoders to the purely generative nature of diffusion models.
Figure 3: Reinforcement learning with human feedback.
Figure 4: Probabilistic diffusion model architecture, excerpted from "Diffusion Models: A Comprehensive Survey of Methods and Applications"
The models introduced in this article include: