CLIP is not down to earth? You need a model that understands Chinese better


PHPz

Apr 25, 2023 am 08:58 AM

Chinese Model

This article introduces Chinese CLIP, a large-scale pre-trained image-text representation model recently open-sourced by DAMO Academy in its ModelScope community. It understands Chinese text and Chinese Internet images better than existing models, and achieves strong results on tasks such as image-text retrieval and zero-shot image classification. The code and models are all open source, so users can get started quickly with ModelScope.


Model page: https://modelscope.cn/models/damo/multi-modal_clip-vit-base-patch16_zh/summary
GitHub: https://github.com/OFA-Sys/Chinese-CLIP
Paper: https://arxiv.org/pdf/2211.01335.pdf
Image-text retrieval demo: https://modelscope.cn/studios/damo/chinese_clip_applications/summary

1. Introduction

Today's Internet ecosystem is full of multi-modal tasks and scenarios, such as image-text retrieval, image classification, and understanding of video and image-text content. In recent years, image generation has become hugely popular and has quickly reached a mainstream audience. Behind all of these tasks, a strong image-text understanding model is clearly necessary. Most readers will be familiar with the CLIP model released by OpenAI in 2021. Through simple twin-tower (dual-encoder) image-text contrastive learning on a large image-text corpus, the model acquires strong image-text feature alignment. It performs well on zero-shot image classification and cross-modal retrieval, and also serves as a key module in image generation models such as DALL-E 2 and Stable Diffusion.
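The twin-tower contrastive objective described above can be illustrated with a small, dependency-free sketch. It is not the training code of CLIP or Chinese-CLIP: the function names are hypothetical, the feature vectors are toy values standing in for encoder outputs, and the temperature of 0.07 is an assumed value chosen for illustration. The idea is that matching image-text pairs sit on the diagonal of a similarity matrix, and a cross-entropy loss is applied in both the image-to-text and text-to-image directions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text features.

    Pair i (image_feats[i], text_feats[i]) is the positive; every other
    combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: aligned pairs should give a much lower loss than swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image feature points in the same direction as its paired text feature, and large when the pairs are swapped, which is the alignment pressure that contrastive pre-training applies.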

Unfortunately, OpenAI CLIP was pre-trained mainly on image-text data from the English-speaking world and does not natively support Chinese. Researchers in the community have distilled a multilingual version, Multilingual-CLIP (mCLIP), using translated text, but it still falls short of Chinese users' needs because its understanding of Chinese-specific text is weak. For example, a search for "Spring Festival couplets" returns Christmas-related content:

mCLIP retrieval demo: results returned for the query "Spring Festival couplets"

This shows that we need a CLIP that understands Chinese better: one that understands not only the language but also the images of the Chinese-speaking world.

2. Method

Researchers at DAMO Academy collected large-scale Chinese image-text pair data (approximately 200 million pairs), including the Chinese subset of LAION-5B, the Chinese data from Wukong, and translated image-text data from COCO, Visual Genome, and other sources. Most of the training images and texts come from public datasets, which greatly lowers the barrier to reproduction. To improve both training efficiency and model quality, the researchers designed a two-stage training process:

Chinese CLIP method diagram

As shown in the figure, in the first stage the two towers of Chinese-CLIP are initialized from an existing pre-trained image model and an existing pre-trained text model respectively, and the image-side parameters are frozen. This lets the language model align with the existing pre-trained image representation space while reducing training overhead. In the second stage, the image-side parameters are unfrozen, so that the image model and the language model are trained jointly and can model the data distribution specific to Chinese data. The researchers found that, compared with pre-training from scratch, this method gives significantly better results on multiple downstream tasks, and its significantly faster convergence also means lower training cost. Compared with a single stage that trains only the text side, adding the second stage further improves results on downstream image-text tasks, especially those native to Chinese (rather than translated from English datasets).
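The freezing schedule of the two stages can be sketched in a few lines. This is only an illustration of which tower receives updates in each stage, not the project's training code: the `Tower` class and `configure_stage` function are hypothetical names. In a framework such as PyTorch, the same idea would typically be expressed by toggling `requires_grad` on the image encoder's parameters.

```python
from dataclasses import dataclass

@dataclass
class Tower:
    """Stand-in for one encoder; `trainable` mimics a frozen/unfrozen flag."""
    name: str
    trainable: bool = True

def configure_stage(stage, image_tower, text_tower):
    """Set which towers are updated in the given stage and return their names."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    # Stage 1: the image tower is frozen, so only the text tower adapts
    # to the existing image representation space.
    # Stage 2: the image tower is unfrozen and both towers train jointly.
    image_tower.trainable = (stage == 2)
    text_tower.trainable = True
    return [t.name for t in (image_tower, text_tower) if t.trainable]

image_tower = Tower("image_encoder")
text_tower = Tower("text_encoder")

stage1_trainable = configure_stage(1, image_tower, text_tower)
stage2_trainable = configure_stage(2, image_tower, text_tower)
```

Keeping the image tower frozen in stage 1 means no gradients need to be computed for it, which is where the reduced training overhead mentioned above comes from.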

Zero-shot performance trend as pre-training progresses, on two datasets: MUGE (Chinese e-commerce image-text retrieval) and Flickr30K-CN (translated general-purpose image-text retrieval)

Using this strategy, the researchers trained models at multiple scales, from the smallest ResNet-50 through ViT-Base and ViT-Large to ViT-Huge. All of them are now open, so users can pick the model that best suits their scenario.

3. Experiment

Experiments on multiple datasets show that Chinese-CLIP achieves the best performance on Chinese cross-modal retrieval. On MUGE, a native Chinese e-commerce image retrieval dataset, Chinese CLIP achieves the best performance at each model scale. On datasets translated from English, such as Flickr30K-CN, Chinese CLIP significantly outperforms Chinese baseline models such as Wukong, Taiyi, and R2D2 in both zero-shot and fine-tuning settings. This is largely due to Chinese-CLIP's larger Chinese pre-training image-text corpus. In addition, unlike some existing Chinese image-text representation models that freeze the entire image side to minimize training cost, Chinese-CLIP uses a two-stage training strategy to better adapt to the Chinese domain:

Experimental results on the MUGE Chinese e-commerce image-text retrieval dataset

Experimental results on the Flickr30K-CN Chinese image-text retrieval dataset

The researchers also evaluated Chinese CLIP on zero-shot image classification datasets. Because there are few authoritative zero-shot image classification benchmarks in Chinese, they currently test on translated versions of English datasets. Using Chinese prompts and category labels, Chinese-CLIP achieves performance comparable to CLIP on these tasks:

Zero-shot classification experiment results
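Zero-shot classification with Chinese prompts and category labels can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind the results above: the prompt template, the toy feature vectors, and the similarity scale of 100 are all made up for the example, and in practice the features would come from the model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_prompts(labels, template="一张{}的照片"):
    """Wrap each category label in a Chinese prompt (the template is an assumption)."""
    return [template.format(label) for label in labels]

def zero_shot_classify(image_feat, prompt_feats, labels, scale=100.0):
    """Pick the label whose prompt feature is most similar to the image feature."""
    sims = [scale * cosine(image_feat, feat) for feat in prompt_feats]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

labels = ["猫", "狗", "汽车"]
prompts = build_prompts(labels)

# Toy vectors standing in for text-encoder outputs of the three prompts
# and for the image-encoder output of one picture.
prompt_feats = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.9, 0.2, 0.1]

predicted, probs = zero_shot_classify(image_feat, prompt_feats, labels)
```

No classifier is trained here: changing the label set only requires encoding a new list of prompts, which is what makes the approach "zero-shot".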

Zero-shot image classification example

4. Quick use

How do you use Chinese-CLIP? It is simple: follow the links at the beginning of the article to the ModelScope community or to the open source code, and you can extract image and text features and compute their similarity in just a few lines. For a quick trial, the ModelScope community provides a Notebook with a preconfigured environment, which you can open by clicking in the upper right corner.
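As a rough idea of what "a few lines" looks like with the open source code, the sketch below follows the usage pattern shown in the Chinese-CLIP GitHub README as best recalled here; treat the exact calls (`load_from_name`, `clip.tokenize`, `get_similarity`) and the model name `"ViT-B-16"` as assumptions to verify against the current repository. It assumes the `cn_clip` package, PyTorch, and Pillow are installed; the image path and labels are placeholders, and the `rank_labels` and `score_image` names are hypothetical helpers for this example.

```python
def rank_labels(labels, probs):
    """Pair each label with its probability, highest probability first."""
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

def score_image(image_path, labels, model_name="ViT-B-16"):
    """Score candidate Chinese labels against one image with Chinese-CLIP.

    Heavy dependencies are imported inside the function so that the
    helper above can be used without them. Model weights are downloaded
    to the current directory on first use.
    """
    import torch
    from PIL import Image
    import cn_clip.clip as clip
    from cn_clip.clip import load_from_name

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = load_from_name(model_name, device=device, download_root="./")
    model.eval()

    # Preprocess the image and tokenize the candidate texts.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(labels).to(device)

    with torch.no_grad():
        # Image-to-text similarity logits, turned into probabilities.
        logits_per_image, _ = model.get_similarity(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

    return rank_labels(labels, [float(p) for p in probs])
```

A call such as `score_image("example.jpg", ["春联", "圣诞树"])` (with a real image file in place of the placeholder path) would return the labels ordered by how well they match the image.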

Chinese-CLIP also lets users fine-tune on their own data, and provides an image-text retrieval demo so that everyone can try Chinese-CLIP models of various scales in practice.

5. Conclusion

With the Chinese-CLIP project, DAMO Academy and the ModelScope community provide a strong pre-trained image-text understanding model for Chinese multi-modal researchers and industry users. It helps everyone get started quickly, with a low barrier, on image-text feature extraction and similarity calculation, image-text retrieval, and zero-shot classification, and it can also be tried as a building block for more complex multi-modal applications such as image generation. If you want to make your mark in the Chinese multi-modal field, don't miss it! This is only one of the applications in the ModelScope community: ModelScope lets many foundation models in the AI field serve as a base for applications, supporting the creation of more innovative models, applications, and even products.
