


Without the need to label massive data: OVD, a new paradigm for object detection, takes multi-modal AGI a step further
Object detection is a fundamental task in computer vision. Unlike common image classification/recognition tasks, object detection requires the model to predict not only the category of a target but also its location and size, and it plays a key role among the three major CV tasks (recognition, detection, and segmentation).
The currently popular multi-modal GPT-4 only offers object recognition among its visual capabilities and cannot handle the harder task of object detection. Yet recognizing the category, location, and size of objects in images or videos is key to many real-world AI applications, such as pedestrian and vehicle recognition in autonomous driving, face locking in security surveillance, and tumor localization in medical image analysis.
Existing object detection methods such as the YOLO and R-CNN families have reached high detection accuracy and efficiency thanks to years of research effort. However, these methods must define the set of targets to be detected (a closed set) before training, so they cannot detect targets outside the training set; for example, a model trained to detect faces cannot be used to detect vehicles. In addition, they rely heavily on manually labeled data: when the target categories need to be added or modified, the training data must be re-labeled and the model re-trained, which is time-consuming and laborious.
A possible workaround is to collect massive images and manually annotate their box and semantic information, but this incurs extremely high labeling costs, and training a model on such massive data is itself a serious challenge; factors such as the long-tail distribution of the data and inconsistent annotation quality will also affect the performance of the detection model.
The paper OVR-CNN [1], published at CVPR 2021, proposes a new object detection paradigm to address these problems: Open-Vocabulary Detection (OVD, also known as open-world object detection), i.e., detection of unknown objects in open-world scenarios.
Because it can identify and localize targets of arbitrary number and category without expanding the amount of manually annotated data, OVD has drawn continuous attention from both academia and industry since it was proposed. It brings new vitality and new challenges to the classic object detection task and is expected to become the new paradigm for object detection.
Specifically, OVD does not rely on manually annotating massive images to give the detector the ability to handle unknown categories. Instead, it combines a class-agnostic region detector with a cross-modal model trained on massive unlabeled data, and aligns image region features with the descriptive text of the targets to be detected, thereby extending the detection model's understanding of open-world targets.
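The following is a minimal sketch of this core idea (not any particular paper's implementation): class-agnostic region features are matched against text embeddings of category descriptions via cosine similarity, so extending the vocabulary only requires encoding new text. The encoders here are random stand-ins; a real system would use a pre-trained cross-modal model such as CLIP on both sides.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512

# Stand-in for region features produced by a class-agnostic detector (e.g. RPN + RoI pooling).
num_regions = 5
region_features = torch.randn(num_regions, embed_dim)

# Stand-in for text embeddings of an *open* vocabulary; adding categories just means
# encoding more category descriptions, with no retraining of the detector.
category_names = ["person", "car", "dog", "umbrella", "traffic light"]
text_features = torch.randn(len(category_names), embed_dim)

# Cross-modal alignment: cosine similarity between region and text embeddings.
region_features = F.normalize(region_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits = region_features @ text_features.t()          # (num_regions, num_categories)
scores, labels = logits.softmax(dim=-1).max(dim=-1)

for i, (s, l) in enumerate(zip(scores, labels)):
    print(f"region {i}: {category_names[l]} ({s:.2f})")
```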
Cross-modal and multi-modal large models such as CLIP [2], ALIGN [3], and R2D2 [4] have developed very rapidly in recent years, and their progress has driven the birth of OVD and the rapid iteration and evolution of work in the OVD field.
OVD technology involves two key problems: 1) how to improve the adaptation between region-level information and cross-modal large models; 2) how to improve a general-category detector's ability to generalize to new categories. The related work introduced below is organized around these two perspectives.
Figure: basic OVD pipeline [1]
Basic concepts of OVD: OVD mainly involves two kinds of scenarios, few-shot and zero-shot. Few-shot refers to target categories with a small number of manually labeled training samples, while zero-shot refers to target categories with no manually labeled training samples at all. On the commonly used academic benchmarks COCO and LVIS, the categories are divided into a Base set and a Novel set, where the Base classes correspond to the few-shot scenario and the Novel classes correspond to the zero-shot scenario. For example, the COCO setting contains 65 categories; a common evaluation protocol uses a Base set of 48 categories, with only these 48 used during few-shot training, and a Novel set of 17 categories that are completely unseen during training. The main comparison metric is AP50 on the Novel classes.
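As an illustration of this evaluation protocol, the sketch below partitions category names into Base and Novel sets and drops Novel-class boxes from training data; the category names are placeholders, and the actual 48/17 split follows OVR-CNN's published protocol.

```python
# Illustrative sketch of the COCO open-vocabulary split described above.
BASE_CATEGORIES = {"person", "car", "chair", "bottle"}       # ... 48 classes in total
NOVEL_CATEGORIES = {"umbrella", "cow", "snowboard"}           # ... 17 classes in total

def filter_annotations(annotations, allowed):
    """Keep only boxes whose category is in the allowed set."""
    return [a for a in annotations if a["category"] in allowed]

# Training only ever sees Base-class boxes; evaluation reports AP50 on Novel classes.
train_anns = [{"category": "person", "bbox": [10, 20, 50, 80]},
              {"category": "umbrella", "bbox": [5, 5, 30, 40]}]
print(filter_annotations(train_anns, BASE_CATEGORIES))   # the Novel-class box is dropped
```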
Paper 1: Open-Vocabulary Object Detection Using Captions
- Paper address: https://arxiv.org/pdf/2011.10678.pdf
- Code address: https://github.com/alirezazareian/ovr-cnn
OVR-CNN was an oral paper at CVPR 2021 and a pioneering work in the OVD field; its two-stage training paradigm has influenced many subsequent OVD works. As shown in the figure below, the first stage mainly uses image-caption pairs to pre-train the visual encoder: BERT (with fixed parameters) generates word masks, and weakly supervised grounding matching is performed with a ResNet50 initialized from ImageNet pre-trained weights. The authors argue that weak supervision alone lets the matching fall into a local optimum, so a multi-modal Transformer is added for word-mask prediction to increase robustness.
The second-stage training process is similar to Faster R-CNN. The difference is that the feature-extraction backbone comes from layers 1-3 of the ResNet50 obtained in first-stage pre-training, which is also used for feature processing after the RPN; the resulting features are then used for box regression and classification prediction, respectively. Classification prediction is the key point that distinguishes the OVD task from conventional detection: in OVR-CNN, the features are fed into the V2L module (an image-vector-to-word-vector module with fixed parameters) obtained from first-stage training to produce image-text vectors, which are then matched against the label word vectors to predict the category. In second-stage training, the Base classes are mainly used to train the detector for box regression and category matching. Because the V2L module remains fixed, it works together with the detection model's localization ability to transfer to new categories, allowing the model to identify and localize targets of novel categories.
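A hedged sketch of this classification step follows: a frozen vision-to-language (V2L) projection maps RoI features into the word-embedding space, where they are matched against class word vectors (Base classes during training, Base plus Novel classes at inference). Dimensions and module names are illustrative, not OVR-CNN's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

roi_dim, word_dim = 1024, 768
v2l = nn.Linear(roi_dim, word_dim)
for p in v2l.parameters():        # the V2L module stays frozen in second-stage training
    p.requires_grad = False

roi_features = torch.randn(6, roi_dim)            # features after RPN + RoI pooling
class_word_vectors = torch.randn(48, word_dim)    # e.g. Base-class word embeddings

# Project region features into the word-vector space and match against class embeddings.
projected = F.normalize(v2l(roi_features), dim=-1)
class_embeds = F.normalize(class_word_vectors, dim=-1)
class_logits = projected @ class_embeds.t()       # (num_rois, num_classes)
print(class_logits.argmax(dim=-1))
```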
As shown in the figure below, OVR-CNN's performance on the COCO dataset far exceeds that of previous zero-shot object detection algorithms.
Paper 2: RegionCLIP: Region-based Language-Image Pretraining
- Paper address: https://arxiv.org/abs/2112.09106
- Code address: https://github.com/microsoft/RegionCLIP
RegionCLIP pre-trains a CLIP model specialized for region-level information. Its first-stage pre-training proceeds as follows (a rough sketch of the pseudo-labeling step is given after this list):
1. Extract the words that appear in long captions to form a Concept Pool, and use them to construct a set of simple region-level descriptions for training.
2. Use an RPN pre-trained on LVIS to extract proposal regions, and use the original CLIP to match and classify the extracted regions against the prepared descriptions, assembling the matches into pseudo semantic labels.
3. Perform region-text contrastive learning on the new CLIP model using the prepared proposal regions and pseudo semantic labels, obtaining a CLIP model that specializes in region-level information.
4. During pre-training, the new CLIP model also learns the classification ability of the original CLIP via a distillation strategy and performs image-text contrastive learning at the whole-image level, so that the new CLIP model retains its ability to represent complete images.
In the second stage, the pre-trained model is transferred to the detection model for transfer learning.
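Below is a rough sketch of the pseudo-labeling and region-text contrastive idea in steps 2-3 above: a frozen teacher model scores each proposal region against a pool of concept descriptions, and the best-matching description becomes that region's pseudo text label for contrastive pre-training. All encoders here are random stand-ins, not RegionCLIP's code, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 512
concept_pool = ["a photo of a dog", "a photo of a car", "a photo of a kite"]

# Stand-ins for: teacher CLIP text features, teacher CLIP region features, and
# the student model's region features that are being trained.
concept_feats = F.normalize(torch.randn(len(concept_pool), dim), dim=-1)
teacher_region_feats = F.normalize(torch.randn(4, dim), dim=-1)   # from the frozen original CLIP
student_region_feats = F.normalize(torch.randn(4, dim), dim=-1)   # from the model being trained

# Step 2: assemble pseudo labels by matching proposal regions to concept descriptions.
pseudo_labels = (teacher_region_feats @ concept_feats.t()).argmax(dim=-1)

# Step 3: region-text contrastive loss against the pseudo-labeled descriptions.
logits = student_region_feats @ concept_feats.t() / 0.07          # temperature-scaled similarities
loss = F.cross_entropy(logits, pseudo_labels)
print(pseudo_labels.tolist(), float(loss))
```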
RegionCLIP further extends the representation ability of large cross-modal models to conventional detection models and thus achieves better performance. As shown in the figure below, RegionCLIP achieves a clear improvement over OVR-CNN on the Novel categories. RegionCLIP effectively improves the adaptation between region-level information and multi-modal large models through its first-stage pre-training. However, CORA argues that when a cross-modal model with an even larger parameter scale is used for this first-stage training, the training cost becomes very high.
Paper 3: CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching
- Paper address: https://arxiv.org/abs/2303.13076
- Code address: https://github.com/tgxs002/CORA
CORA [6] was accepted at CVPR 2023. To overcome the two obstacles it identifies in current OVD tasks, it designs a DETR-style OVD model. As the paper's title indicates, the model mainly comprises two strategies: Region Prompting and Anchor Pre-Matching. The former uses prompting to optimize the regional features extracted by the CLIP-based region classifier, alleviating the distribution gap between whole-image and region features; the latter uses an anchor pre-matching strategy from DETR-style detection to improve the OVD model's ability to generalize localization to new categories of objects.
There is a distribution gap between the whole-image features and region features of CLIP's original visual encoder, which in turn leads to low classification accuracy in the detector (a starting point similar to RegionCLIP's). CORA therefore proposes Region Prompting to adapt the CLIP image encoder and improve the classification of region-level information. Specifically, the whole image is first encoded into a feature map by the first three layers of the CLIP image encoder; anchor boxes or prediction boxes are then pooled into regional features via RoI Align and encoded by the fourth layer of the CLIP image encoder. To alleviate the distribution gap between the encoder's whole-image feature map and the regional features, learnable Region Prompts are combined with the features output by the fourth layer to produce the final regional features used for matching against text features. The matching loss is a plain cross-entropy loss, and all CLIP-related parameters are frozen during training.
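A minimal sketch of this Region Prompting idea is given below. The two convolutional "stages" stand in for the first three layers and the fourth layer of the CLIP image encoder; the layer split, prompt shape, and the exact point where the prompts are injected are illustrative assumptions, not CORA's actual code.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

feat_dim, region_hw = 256, 7
clip_layers_1_3 = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)      # stand-in for CLIP layers 1-3
clip_layer_4 = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)   # stand-in for CLIP layer 4

# Learnable Region Prompts, one vector per spatial position of the pooled region feature.
region_prompts = nn.Parameter(torch.zeros(feat_dim, region_hw, region_hw))

image = torch.randn(1, 3, 224, 224)
feature_map = clip_layers_1_3(image)                                      # whole-image feature map

# Anchor / prediction boxes in image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0, 16.0, 16.0, 128.0, 128.0],
                      [0, 64.0, 32.0, 200.0, 180.0]])
region_feats = roi_align(feature_map, boxes, output_size=region_hw,
                         spatial_scale=feature_map.shape[-1] / image.shape[-1])

# Encode the pooled regions with the (frozen, in CORA) final CLIP stage, combine with
# the learnable prompts, and pool to one embedding per region for text matching.
region_feats = clip_layer_4(region_feats) + region_prompts
region_embeds = region_feats.mean(dim=(-2, -1))                           # (num_boxes, feat_dim)
print(region_embeds.shape)
```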
CORA is a DETR-style detector. Like DETR, it uses an anchor pre-matching strategy to generate candidate boxes in advance for box-regression training. Specifically, anchor pre-matching associates each label box with the closest set of anchor boxes to determine which anchors should be treated as positive samples and which as negatives. This matching is usually based on IoU (intersection over union): if the IoU between an anchor box and a label box exceeds a predefined threshold, the anchor is regarded as a positive sample, otherwise as a negative. CORA shows that this strategy can effectively improve the generalization of localization ability to new categories.
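The snippet below is a simple sketch of such IoU-based matching; the threshold value and box coordinates are illustrative, not CORA's settings.

```python
import torch
from torchvision.ops import box_iou

anchors = torch.tensor([[  0.,   0.,  50.,  50.],
                        [ 40.,  40., 120., 120.],
                        [200., 200., 260., 260.]])
gt_boxes = torch.tensor([[ 45.,  45., 115., 115.]])

iou = box_iou(gt_boxes, anchors)        # (num_gt, num_anchors)
iou_threshold = 0.5
positive_mask = iou >= iou_threshold    # anchors treated as positives for each label box

# A label box with no matching anchor would be ignored during training, which is one
# of the issues CORA's CLIP-Aligned relabeling (discussed next) tries to mitigate.
unmatched_gt = (positive_mask.sum(dim=1) == 0)
print(positive_mask, unmatched_gt)
```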
However, the anchor pre-matching mechanism also brings some problems. For example, training proceeds normally only when at least one anchor box matches a label box; otherwise the label box is ignored, which hinders model convergence. Furthermore, even if a label box obtains a reasonably accurate anchor box, the limited recognition accuracy of the region classifier may still cause the label box to be ignored, i.e., the category information of the label box is not aligned with the CLIP-based region classifier. CORA therefore uses a CLIP-Aligned technique that leverages CLIP's semantic recognition ability and the localization ability of a pre-trained RoI head to re-label the images in the training set with little manual effort; with this technique, the model can match more label boxes during training.
Compared with RegionCLIP, CORA further improves AP50 on the COCO dataset by 2.4 points.
Summary and Outlook
OVD technology is closely tied to the development of today's popular cross-modal/multi-modal large models, and it also inherits researchers' accumulated expertise in object detection; it is a successful bridge between traditional AI techniques and research on general AI capabilities. OVD is a future-oriented object detection technology, and its ability to detect and localize arbitrary targets can be expected in turn to promote the further development of multi-modal large models, potentially becoming an important cornerstone of multi-modal AGI. At present, the training data for multi-modal large models consists of vast numbers of coarse information pairs from the Internet, i.e., text-image pairs or text-speech pairs. If OVD technology is used to accurately localize objects in these coarse images and to assist in predicting image semantics for corpus filtering, the quality of large-model pre-training data can be further improved, thereby enhancing the representation and understanding abilities of large models.
A good example is SAM (Segment Anything) [7], which not only lets researchers see the future direction of general-purpose vision models but has also triggered much reflection. Notably, OVD can be connected to SAM to enhance SAM's semantic understanding and to automatically generate the box prompts SAM requires, further freeing up human effort. Similarly, for AIGC (AI-generated content), OVD can enhance interaction with users: when a user wants to modify a specific target in an image or generate a description of it, OVD's language understanding and its ability to detect unknown targets can be used to accurately localize the objects the user describes, enabling higher-quality content generation. Research in the OVD field is currently flourishing, and the changes OVD may bring to future general-purpose AI large models are well worth looking forward to.