Table of Contents
Experimental results
Aligned with human object recognition
Out-of-distribution performance
Linear Probe
Distillation
Fairness and Bias

CV opens the era of large models! Google releases the largest ViT in history: 22 billion parameters, visual perception is close to that of humans

Apr 07, 2023, 03:09 PM

The Transformer is undoubtedly the biggest contributor to the prosperity of natural language processing, and it is also the infrastructure behind large language models such as GPT-4.

However, compared with language models and their tens of billions of parameters, computer vision has not reaped the benefits of the Transformer to the same degree: the largest vision Transformer to date, ViT-e, has only 4 billion parameters.

Recently, Google released a paper in which researchers propose a method for training large-scale Vision Transformer (ViT) models efficiently and stably, successfully scaling ViT up to 22 billion parameters.



Paper link: https://arxiv.org/abs/2302.05442

To scale the model up, ViT-22B borrows ideas from large language models such as PaLM: it uses QK normalization to improve training stability and proposes a new technique, asynchronous parallel linear operations, to improve training efficiency, allowing the model to be trained on Cloud TPUs with higher hardware utilization.

In experiments evaluating downstream task performance, ViT-22B shows the same behavior as large language models: as the model scale increases, performance keeps improving.

ViT-22B can also be used in PaLM-e; combining the vision model with a language model significantly raises the state of the art on robotics tasks.

The researchers also observed other benefits of scale, including a better trade-off between fairness and performance, shape/texture bias that is more consistent with human visual perception, and better robustness.

Model architecture

ViT-22B is based on the Transformer architecture. Compared with the original ViT, the researchers made three main modifications to improve training efficiency and training stability.

Parallel layers

ViT-22B executes the attention block and the MLP block in parallel, whereas the original Transformer executes them sequentially.


PaLM training also uses this method, which increases the training speed of large models by 15% without performance degradation.
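As a rough sketch of the idea (PyTorch is used purely for illustration and is not the authors' implementation; the dimensions and whether the two branches share one LayerNorm are assumptions of this sketch), a parallel block feeds the same normalized input to the attention and MLP branches and adds both outputs to the residual:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of a parallel Transformer block: the attention and MLP branches
    both read LayerNorm(x) and their outputs are summed into one residual update,
    instead of being applied one after the other."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # A sequential block would instead compute:
        #   x = x + attn(norm1(x));  x = x + mlp(norm2(x))
        return x + attn_out + self.mlp(h)
```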

Query/key (QK) normalization

While scaling up ViT, the researchers observed in models of around 8 billion parameters that the training loss begins to diverge after a few thousand training steps, mainly due to instability caused by excessively large attention logits, which produce near-zero-entropy (almost one-hot) attention weights.

To solve this problem, the researchers apply LayerNorm to the queries and keys before the dot-product attention calculation.
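A minimal sketch of the idea, again in illustrative PyTorch (the head count, dimensions, and normalizing each head's query/key slice with its own LayerNorm are assumptions of this sketch, not a claim about the exact ViT-22B code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Sketch of multi-head attention with LayerNorm applied to the queries and
    keys before the dot product, which bounds the attention logits and avoids
    the near one-hot (zero-entropy) attention weights described above."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # no QKV bias, as described below
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (b, heads, n, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)             # the QK normalization step
        out = F.scaled_dot_product_attention(q, k, v)     # softmax(q k^T / sqrt(d)) v
        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```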


Experiments on the 8-billion-parameter model show that this normalization alleviates the divergence problem.



Removing bias terms from QKV projections and LayerNorms

Like PaLM, ViT-22B removes the bias terms from the QKV projections, and all LayerNorms are applied without bias or centering, resulting in a 3% increase in hardware utilization with no loss in quality.


However, unlike PaLM, ViT-22B keeps the bias terms in the MLP dense layers (both the inner and outer projections); this was observed to improve quality without slowing training down.
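For intuition, a LayerNorm without bias and centering reduces to an RMSNorm-style, scale-only normalization. The sketch below shows that form; treating ViT-22B's norm exactly this way is an assumption of the sketch:

```python
import torch
import torch.nn as nn

class NoBiasNoCenterNorm(nn.Module):
    """Sketch of a LayerNorm with the bias and centering removed: only a learned
    scale applied over the root-mean-square of the features (RMSNorm-style)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x / rms
```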

In the ViT-22B encoder, the embedding layer (patch extraction, linear projection, and added position embeddings) is the same as in the original ViT, and multi-head attention pooling is used to aggregate the per-token representations in each head.

The patch size of ViT-22B is 14×14, and the resolution of the image is 224×224 (preprocessed by inception crop and random horizontal flipping).
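For concreteness, a 224×224 image cut into 14×14 patches gives a 16×16 grid, i.e. 256 tokens. The illustrative sketch below (the embedding width of 768 is arbitrary, not ViT-22B's actual width) shows the patchify-and-project step with learned position embeddings:

```python
import torch
import torch.nn as nn

# Minimal patch-embedding sketch: 224x224 images, 14x14 patches -> 256 tokens,
# each linearly projected to the model width and given a learned position embedding.
patch_size, image_size, dim = 14, 224, 768   # `dim` is illustrative only
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
num_patches = (image_size // patch_size) ** 2                      # 256
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))         # learned position embeddings

images = torch.randn(2, 3, image_size, image_size)
tokens = to_patches(images).flatten(2).transpose(1, 2)             # (2, 256, dim)
tokens = tokens + pos_embed
print(tokens.shape)   # torch.Size([2, 256, 768])
```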

Asynchronous parallel linear operations

Models at this scale also require sharding, that is, distributing the model parameters across different compute devices. In addition, the researchers also shard the activations (the intermediate representations of the input).


Because both the input and the matrix itself are distributed across various devices, even simple operations like matrix multiplication require special care.

The researchers developed a method called asynchronous parallel linear operations, which communicates activations and weights between devices at the same time as the matrix multiplication units (the units that account for the vast majority of a TPU's compute) are busy computing.

Asynchronous methods minimize the time waiting for incoming communication, thereby increasing device efficiency.

The goal of asynchronous parallel linear operations is to compute the matrix multiplication y = Ax while both the matrix A and the activation x are distributed across devices, which requires overlapping communication and computation across devices. The matrix A is sharded column-wise across devices, each device holding a contiguous slice, with each block denoted A_ij. See the original paper for more details.
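The sketch below only illustrates the underlying arithmetic on a single process: with A split into column blocks and x split to match, the per-device partial products sum to the full result. The real implementation additionally overlaps the cross-device transfers of these chunks with the local block multiplications, which this simulation does not model:

```python
import torch

# Conceptual, single-process sketch of the math behind the sharded matmul y = A @ x
# when A is split column-wise across "devices" and x is split to match.
num_devices = 4
A = torch.randn(512, 1024)
x = torch.randn(1024)

A_shards = A.chunk(num_devices, dim=1)      # column shards A_1 .. A_k, one per device
x_shards = x.chunk(num_devices, dim=0)      # matching slices of the activation

partials = [A_i @ x_i for A_i, x_i in zip(A_shards, x_shards)]  # local compute per device
y = torch.stack(partials).sum(dim=0)        # the all-reduce step in a real distributed setup

assert torch.allclose(y, A @ x, atol=1e-3)  # partial products recover the full product
```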


Experimental results

To demonstrate how rich the representations learned by ViT-22B are, the researchers used LiT-tuning to train a text model that produces representations aligned with the image representations.

In experiments on out-of-distribution images generated by Parti and Imagen, the zero-shot image classification of ViT-22B generalizes remarkably well: although it has only seen natural images crawled from the web, it can recognize unseen objects and scenes.
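Conceptually, LiT-style zero-shot classification compares an image embedding against one text embedding per class name and picks the closest. The sketch below assumes hypothetical embedding tensors and uses cosine similarity; it is an illustration of the recipe, not the released evaluation code:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor, class_text_embs: torch.Tensor) -> torch.Tensor:
    """LiT-style zero-shot classification sketch: `image_emb` has shape (batch, dim),
    `class_text_embs` has one embedded prompt per class, shape (num_classes, dim);
    the prediction is the class whose text embedding is most cosine-similar."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    logits = image_emb @ class_text_embs.T          # cosine similarities
    return logits.argmax(dim=-1)

# Hypothetical shapes, for illustration only.
preds = zero_shot_classify(torch.randn(8, 512), torch.randn(1000, 512))
print(preds.shape)   # torch.Size([8])
```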


The paper also reports results for ViT-22B on video classification, depth estimation, and semantic segmentation tasks.

Aligned with human object recognition

To verify how consistent ViT-22B's classification decisions are with human classification decisions, the researchers fine-tuned ViT-22B at different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available through the model-vs-human toolbox.

The toolbox measures three key indicators: how well does the model handle distortions (accuracy)? How far is model accuracy from human accuracy (accuracy difference)? How similar are the error patterns of humans and the model (error consistency)?
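For the third metric, a common definition of error consistency from the model-vs-human literature is a Cohen-kappa-style score of how often model and human are right or wrong on the same trials, beyond chance agreement. The sketch below assumes that definition:

```python
import numpy as np

def error_consistency(model_correct: np.ndarray, human_correct: np.ndarray) -> float:
    """Kappa-style error consistency: how often model and human are correct or
    incorrect on the same trials, beyond what their accuracies predict by chance."""
    observed = np.mean(model_correct == human_correct)
    p_model, p_human = model_correct.mean(), human_correct.mean()
    expected = p_model * p_human + (1 - p_model) * (1 - p_human)
    return (observed - expected) / (1 - expected)

# Toy example: boolean per-trial correctness for model and human.
print(error_consistency(np.array([1, 1, 0, 0, 1], dtype=bool),
                        np.array([1, 0, 0, 1, 1], dtype=bool)))
```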


Shape-bias evaluation (larger values indicate a stronger shape bias). Many vision models have a low shape / high texture bias, while ViT-22B fine-tuned on ImageNet has the highest shape bias recorded among ML models to date, closer to the human shape bias.

The experiments show that, while not every fine-tuning setting performs equally well, the ViT-22B variants reach new highs on all three metrics.

In addition, ViT-22B has the highest shape bias recorded among vision models. This means it mainly uses an object's shape rather than its texture to make classification decisions, a strategy similar to human perception (human shape bias is 96%).

Standard models (e.g., ResNet-50, with a shape bias of 20-30%) typically classify based on texture, whereas models with a high shape bias tend to rely on shape (in the example below, the image would be identified as a cat). Although many differences remain between human and model perception, ViT-22B shows more similarities to human visual object recognition.


Cat or elephant? Car or clock? Bird or bicycle? Images with the shape of one object and the texture of another can be used to measure shape/texture bias.
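Given model predictions on such cue-conflict images, shape bias is usually computed as the fraction of shape-consistent decisions among all decisions that match either the shape class or the texture class. A small sketch under that assumption:

```python
import numpy as np

def shape_bias(predictions: np.ndarray, shape_labels: np.ndarray, texture_labels: np.ndarray) -> float:
    """Shape bias on cue-conflict images (e.g. cat shape with elephant texture):
    among predictions matching either the shape class or the texture class,
    the fraction that followed the shape."""
    shape_hits = predictions == shape_labels
    texture_hits = predictions == texture_labels
    return shape_hits.sum() / (shape_hits.sum() + texture_hits.sum())

# Toy example with integer class ids; a value near 0.96 would match human behavior.
preds    = np.array([0, 0, 2, 1, 0])
shapes   = np.array([0, 0, 0, 1, 0])
textures = np.array([1, 2, 2, 0, 3])
print(shape_bias(preds, shapes, textures))   # 0.8
```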

Out-of-distribution performance

Measuring performance on the OOD data set helps evaluate model generalization.

In this experiment, the researchers constructed label mappings from JFT to ImageNet, and from ImageNet to different out-of-distribution datasets such as ObjectNet.

The models are pre-trained on this data and then fully fine-tuned on ImageNet before evaluation.
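A sketch of how such an evaluation can be wired up: predictions in ImageNet label space are mapped into the OOD dataset's label space and scored there. The `class_map` below is a hypothetical toy mapping, not the actual JFT/ImageNet/ObjectNet mapping used in the paper:

```python
import torch

def ood_accuracy(imagenet_logits: torch.Tensor,
                 targets: torch.Tensor,
                 class_map: dict) -> float:
    """Evaluate an ImageNet classifier on an OOD dataset by mapping each predicted
    ImageNet class into the OOD label space; predictions whose class has no
    counterpart count as wrong."""
    preds = imagenet_logits.argmax(dim=-1)
    mapped = torch.tensor([class_map.get(int(p), -1) for p in preds])
    return (mapped == targets).float().mean().item()

# Hypothetical toy mapping from 5 ImageNet ids to 3 OOD ids.
class_map = {0: 0, 3: 1, 4: 2}
logits = torch.randn(6, 5)
targets = torch.randint(0, 3, (6,))
print(ood_accuracy(logits, targets, class_map))
```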


It can be observed that scaling Vision Transformers improves OOD performance: even when ImageNet accuracy saturates, moving from ViT-e to ViT-22B still brings a significant gain on ObjectNet.

Linear Probe

A linear probe is a technique that places a single linear layer on top of a frozen model. Compared to full fine-tuning, it is cheaper to train and easier to set up.
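A minimal linear-probe training loop, with `backbone`, `feat_dim`, and the data `loader` as placeholders for a pretrained frozen model, its feature dimension, and a dataset iterator (illustrative PyTorch, not the paper's setup):

```python
import torch
import torch.nn as nn

def train_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                       loader, epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    """Freeze the pretrained backbone and train only a single linear classifier
    on top of its features. `loader` yields (images, labels) batches."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                       # frozen features
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)              # (batch, feat_dim)
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```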


Linear-probe results: trained on ImageNet and evaluated on ImageNet-ReaL, ImageNet-v2, ObjectNet, ImageNet-R, and ImageNet-A, with a high-resolution fine-tuned ViT-e/14 as reference.

The results show that the linear-probe performance of ViT-22B approaches the state-of-the-art full fine-tuning of smaller models on high-resolution images, even though training at higher resolution is generally much more expensive and tends to give better results on many tasks.

Distillation

With distillation, the knowledge of a larger model can be transferred into a smaller one, improving the operating efficiency of models that would otherwise be expensive and slow to run.
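A sketch of a standard temperature-scaled distillation loss in which the student matches the teacher's softened class distribution while also fitting the true labels. The temperature and mixing weight are illustrative defaults, and this is the generic recipe rather than the paper's exact training configuration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Generic knowledge-distillation loss: a KL term pulling the student toward
    the teacher's temperature-softened distribution, plus a cross-entropy term on
    the true labels. Here the teacher would be ViT-22B and the student ViT-B/16
    or ViT-L/16; T and alpha are illustrative values."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```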


The experiments show that the knowledge of ViT-22B can be transferred to smaller models such as ViT-B/16 and ViT-L/16, setting new ImageNet performance records at those model sizes.

Fairness and Bias

Machine learning models are susceptible to unintended unfair biases, such as picking up spurious correlations or exhibiting performance gaps across subgroups. The researchers found that scaling up the model can help mitigate these issues.

First, scale offers a favorable trade-off: even when the model is trained and then post-processed to keep its demographic parity below a prescribed, tolerable level, performance still improves with scale.
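As a reference point for what is being controlled, a demographic parity gap can be measured as the absolute difference in positive-prediction rates between two subgroups. The sketch below is a generic illustration of that quantity, not the paper's post-processing procedure:

```python
import numpy as np

def demographic_parity_gap(preds: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in positive-prediction rates between two subgroups
    (e.g. the two groups highlighted in the CelebA analysis)."""
    rate_a = preds[group == 0].mean()
    rate_b = preds[group == 1].mean()
    return abs(rate_a - rate_b)

# Toy example: binary predictions for six examples split across two groups.
print(demographic_parity_gap(np.array([1, 0, 1, 1, 0, 0]),
                             np.array([0, 0, 0, 1, 1, 1])))   # |2/3 - 1/3| = 0.333...
```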





Above: accuracy for each subgroup in CelebA before debiasing. Below: the y-axis shows the absolute difference in performance between the two subgroups highlighted in this example, females and males. Compared with smaller ViT models, ViT-22B's performance gap is very small.

More importantly, this holds not only when performance is measured by accuracy, but also for other measures such as calibration, a statistical measure of how truthful the model's estimated probabilities are. Classification of all subgroups tends to improve with scale, and ViT-22B narrows the performance gap between subgroups.
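Calibration is commonly summarized with the expected calibration error: predictions are binned by confidence, and the gap between accuracy and average confidence is averaged over the bins. The sketch below assumes that standard definition rather than the exact metric used in the paper:

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Sketch of expected calibration error: bin predictions by confidence and
    average |accuracy - mean confidence| over the bins, weighted by bin size.
    A well-calibrated model's confidence matches its accuracy in every bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy example: per-example top-class confidence and whether the prediction was right.
conf = np.array([0.9, 0.8, 0.6, 0.95, 0.55])
hit  = np.array([1, 1, 0, 1, 1], dtype=float)
print(expected_calibration_error(conf, hit))
```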

Conclusion

The researchers proposed ViT-22B, currently one of the largest vision Transformer models, containing 22 billion parameters.

By making small but key modifications to the original architecture, the researchers achieved higher hardware utilization and training stability, resulting in a model that raises the upper limit of performance on several benchmarks.

Using the frozen model to generate embeddings, training only a few layers on top already achieves very good performance, and the evaluations further show that, compared with existing models, ViT-22B is more similar to human visual perception in terms of shape and texture bias, and offers advantages in fairness and robustness.
