


Cut out objects with a single prompt! Meta releases the first foundation model for image segmentation, opening a new paradigm for CV
Meta AI has just released the Segment Anything Model (SAM), the first foundation model for image segmentation.
SAM can segment any object in a photo or video with a single click, and it transfers zero-shot to other tasks.
Overall, SAM follows the foundation-model recipe:
1. A very simple yet scalable architecture that can handle multimodal prompts: text, keypoints, and bounding boxes.
2. Intuitive annotation process, closely connected with model design.
3. A data flywheel that allows the model to be bootstrapped to a large number of unlabeled images.
It is no exaggeration to say that SAM has learned a general notion of what an "object" is, and this holds even for unknown objects, unfamiliar scenes (such as underwater or microscope images), and ambiguous cases.
In addition, SAM generalizes to new tasks and new domains, so practitioners no longer need to fine-tune the model themselves.
Paper address: https://ai.facebook.com/research/publications/segment-anything/
Most striking of all, Meta has implemented a completely different CV paradigm: within a unified framework, a prompt encoder accepts a point, a bounding box, or a sentence, and the model segments the object directly with one click.
Tencent AI algorithm expert Jin Tian commented, "The prompt paradigm from NLP has begun to extend to CV. This time, it may completely change the traditional prediction mindset in CV. Now you really can use one model to segment any object, and do it dynamically!"
NVIDIA AI scientist Jim Fan went further in his praise: we have reached the "GPT-3 moment" of computer vision!
So, does CV really no longer exist?
SAM: "Cut out" all objects in any image with one click
Segment Anything is the first foundation model dedicated to image segmentation.
Segmentation, which means identifying which image pixels belong to an object, has always been a core task in computer vision.
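To make "which pixels belong to an object" concrete, here is a minimal sketch in plain Python. The two toy masks and the helper names `mask_area` and `mask_iou` are assumptions made for illustration; they are not part of SAM. The sketch represents a mask as a grid of 0/1 values and compares two masks with intersection-over-union (IoU), a common way to score segmentation quality.

```python
# A segmentation mask marks, for every pixel, whether it belongs to the object.
# Toy 4x5 "image": 1 = object pixel, 0 = background. Both masks are made up.
predicted = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
reference = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
]


def mask_area(mask):
    """Number of pixels assigned to the object."""
    return sum(sum(row) for row in mask)


def mask_iou(a, b):
    """Intersection-over-union of two masks with the same shape."""
    inter = sum(pa and pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa or pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 1.0


print(mask_area(predicted))            # 7
print(mask_iou(predicted, reference))  # 0.75
```

The two masks agree on 6 pixels and together cover 8, so the IoU is 6 / 8 = 0.75.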
However, building an accurate segmentation model for a specific task usually requires highly specialized work by technical experts, along with AI training infrastructure and large volumes of carefully annotated in-domain data, so the barrier to entry is extremely high.
To solve this problem, Meta proposed SAM, a foundation model for image segmentation. This promptable model is trained on diverse data; it can adapt to a variety of tasks, and it is operated much as prompts are used with NLP models.
The SAM model grasps the concept of "what is an object" and can generate a mask for any object in any image or video, even objects it has not seen during training.
SAM is general enough to cover a wide range of use cases and works out of the box on new image domains, whether underwater photos or cell microscopy, without additional training. In other words, SAM is capable of zero-shot transfer.
Meta wrote enthusiastically in its blog post that, in the future, SAM could be used in any application that needs to find and segment objects in images.
SAM can become part of a larger AI system to develop a more general multi-modal understanding of the world, for example, understanding the visual and textual content of web pages.
In AR/VR, SAM can select an object based on the user's gaze and then "lift" it into 3D.
For content creators, SAM can extract image regions for collages or video editing.
SAM can also locate and track animals or objects in videos, which is helpful for natural science and astronomy research.
General segmentation method
In the past, there were two methods to solve the segmentation problem.
The first is interactive segmentation, which can segment objects of any category but requires a person to refine the mask iteratively.
The second is automatic segmentation, which can segment specific object categories defined in advance, but training requires a large number of manually labeled objects (for example, segmenting cats takes thousands of labeled examples).
In short, neither approach provides a general, fully automatic way to segment.
SAM can be seen as a generalization of both approaches: it can easily perform interactive segmentation as well as automatic segmentation.
Through the model's promptable interface, a wide range of segmentation tasks can be completed simply by designing the right prompt (clicks, boxes, text, and so on).
Additionally, SAM is trained on a diverse, high-quality dataset of more than 1 billion masks, which allows the model to generalize to new objects and images beyond what it observed during training. As a result, practitioners no longer need to collect their own segmentation data and fine-tune a model for their use case.
This ability to generalize to new tasks and new domains is a first for image segmentation.
(1) SAM lets users segment an object with a single click or by interactively clicking several points, and the model can also be prompted with a bounding box.
(2) When the object to segment is ambiguous, SAM can output multiple valid masks, an essential capability for solving segmentation problems in the real world.
(3) SAM can automatically find and mask all objects in an image.
(4) After precomputing the image embedding, SAM can generate a segmentation mask for any prompt in real time, allowing users to interact with the model in real time.
How it works
The researchers trained SAM to return a valid segmentation mask for any prompt. A prompt can be foreground/background points, a rough box or mask, free-form text, or in general any information indicating what to segment in the image.
The requirement of a valid mask simply means that even when the prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects.
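The shirt-versus-person ambiguity can be sketched with a toy example. This is only an illustration of the output contract (one click, several plausible masks), not of how SAM computes its masks: the `Candidate` class, the rectangular masks, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str         # human-readable name, for illustration only
    pixels: frozenset  # set of (row, col) pixels covered by the mask
    score: float       # hypothetical confidence attached to the mask


def rect(top, left, bottom, right):
    """All (row, col) pixels inside an inclusive rectangle."""
    return frozenset(
        (r, c) for r in range(top, bottom + 1) for c in range(left, right + 1)
    )


def masks_for_click(candidates, point):
    """Return every candidate mask that contains the click, best score first."""
    hits = [c for c in candidates if point in c.pixels]
    return sorted(hits, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("person", rect(0, 2, 9, 6), 0.88),
    Candidate("shirt", rect(3, 2, 6, 6), 0.93),
    Candidate("dog", rect(6, 8, 9, 11), 0.90),
]

click = (4, 4)  # a pixel on the shirt, which is also a pixel on the person
valid = masks_for_click(candidates, click)
print([c.label for c in valid])  # ['shirt', 'person']
```

Both the shirt and the person are reasonable answers for this click, so a caller can either take the top-scored mask or show all of them and let the user choose.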
The researchers observed that the pre-training task and interactive data collection impose specific constraints on the model design.
In particular, the model needs to run in real time on a CPU in a web browser so that annotators can interact with SAM in real time and annotate efficiently.
While runtime constraints mean there is a trade-off between quality and runtime, the researchers found that in practice, simple designs can achieve good results.
SAM's image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts any prompt into an embedding vector on the fly. These two sources of information are then combined in a lightweight decoder that predicts segmentation masks.
Once the image embedding has been computed, SAM can produce a segment for any prompt in just 50 milliseconds in a web browser.
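The encode-once, prompt-many-times pattern described above can be sketched as follows. `ToySegmenter`, its method names, and its "similar intensity" rule are stand-ins chosen for illustration; they are not SAM's real encoder, decoder, or API. The point is only that the expensive step runs once per image and every later prompt reuses the cached result.

```python
class ToySegmenter:
    """Stand-in for the encode-once, prompt-many-times pattern (not SAM itself)."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        # Expensive step in a real system: run once per image, then cache.
        self.encoder_calls += 1
        self._embedding = [[float(px) for px in row] for row in image]

    def predict(self, point, threshold=0.5):
        # Cheap step: runs for every prompt and reuses the cached embedding.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        row, col = point
        target = self._embedding[row][col]
        # Toy rule: select pixels whose value is close to the clicked pixel.
        return [[abs(v - target) < threshold for v in r] for r in self._embedding]


image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 5, 5],
]

seg = ToySegmenter()
seg.set_image(image)           # encoder runs once
mask_a = seg.predict((0, 0))   # first prompt
mask_b = seg.predict((0, 3))   # second prompt, no re-encoding
print(seg.encoder_calls)       # 1
```

Splitting the work this way is what makes interactive use practical: the per-prompt step stays small no matter how many clicks the user makes on the same image.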
The latest SAM model was trained on 256 A100 GPUs for 68 hours (nearly 3 days).
Project demonstration
Multiple input prompts
Prompts specify what to segment in an image, so a variety of segmentation tasks can be carried out without additional training.
Use interaction points and boxes as prompts
Automatically segment all elements in the image
Generate multiple valid masks for ambiguous prompts
Promptable design
SAM can accept input prompts from other systems.
For example, it can select the corresponding object based on the user's gaze, as reported by an AR/VR headset. Meta's work on AI that understands the real world also paves the way for its future metaverse efforts.
Alternatively, text-to-object segmentation can be implemented using bounding box prompts from an object detector.
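A minimal sketch of that composition, under heavy assumptions: `toy_detector` and `toy_box_segmenter` are hypothetical stand-ins for a real object detector and a box-prompted segmenter, and the image and detections are made up. What the sketch shows is the wiring, in which text selects boxes and each box becomes a prompt for the segmenter.

```python
def toy_detector(text, detections):
    """Hypothetical detector: return the boxes whose label matches the query."""
    return [box for label, box in detections if label == text]


def toy_box_segmenter(image, box):
    """Hypothetical box-prompted segmenter: keep foreground pixels in the box."""
    top, left, bottom, right = box
    return {
        (r, c)
        for r in range(top, bottom + 1)
        for c in range(left, right + 1)
        if image[r][c] == 1
    }


def segment_by_text(text, image, detections):
    """Text -> boxes (detector) -> one mask per box (segmenter)."""
    return [toy_box_segmenter(image, box) for box in toy_detector(text, detections)]


# Toy binary image: 1 = foreground, 0 = background.
image = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
# Made-up detector output: (label, (top, left, bottom, right)), inclusive.
detections = [("cat", (0, 0, 1, 3)), ("dog", (1, 3, 2, 5))]

masks = segment_by_text("dog", image, detections)
print(sorted(masks[0]))  # [(1, 4), (1, 5), (2, 4), (2, 5)]
```

Because the segmenter only needs a box, the detector can be swapped for any other system that produces boxes.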
Extensible output
The output masks can be used as input to other AI systems.
For example, an object's mask can be tracked through a video, used in image editing applications, lifted to 3D, or used for creative tasks such as collage.
Zero-shot generalization
SAM has learned a general idea of what an object is, and this understanding enables zero-shot generalization to unfamiliar objects and images without additional training.
Select Hover & Click: clicking with Add Mask places a green dot, and clicking with Remove Area places a red dot. The apple-eating Huahua was outlined immediately.
After clicking Everything, all objects recognized by the system are extracted immediately.
After choosing Cut-Outs, you get a triangular dumpling in seconds.
SA-1B dataset: 11 million images, 1.1 billion masks
In addition to the new model, Meta also released SA-1B, the largest segmentation dataset to date.
This dataset consists of 11 million diverse, high-resolution, privacy-preserving images, and 1.1 billion high-quality segmentation masks.
The overall characteristics of the data set are as follows:
· Total number of images: 11 million
· Total number of masks: 1.1 billion
· Average masks per image: 100
· Average image resolution: 1500 × 2250 pixels
Note: neither the images nor the mask annotations carry class labels.
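The figures in the list above are consistent with one another, as a quick check shows. The variable names are illustrative; the numbers are the ones quoted in the list.

```python
total_images = 11_000_000       # total number of images
total_masks = 1_100_000_000     # total number of masks
width, height = 1500, 2250      # average image resolution in pixels

masks_per_image = total_masks / total_images
megapixels = width * height / 1_000_000

print(masks_per_image)  # 100.0
print(megapixels)       # 3.375
```

So the average of 100 masks per image follows directly from the two totals, and the average image is roughly 3.4 megapixels.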
Meta specifically emphasizes that the data was collected through its data engine and that all masks were generated fully automatically by SAM.
With the SAM model, collecting new segmentation masks is faster than ever, and interactively annotating a mask only takes about 14 seconds.
Per-mask annotation is therefore only 2 times slower than annotating a bounding box, which takes about 7 seconds with the fastest annotation interfaces.
Compared with previous large-scale segmentation data collection efforts, annotation with SAM is 6.5 times faster than COCO's fully manual polygon-based mask annotation and 2 times faster than the previous largest data annotation effort, which was also model-assisted.
However, interactive annotation alone is not enough to create a dataset of more than 1 billion masks. Meta therefore built a data engine to create the SA-1B dataset.
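A back-of-the-envelope calculation shows why interactive annotation alone cannot reach this scale. The inputs are the figures quoted above; the "implied COCO time" is derived here from the 6.5x speedup and is not a number stated in the source.

```python
SECONDS_PER_MASK = 14         # interactive annotation time per mask
SECONDS_PER_BOX = 7           # fastest bounding-box annotation time
TOTAL_MASKS = 1_100_000_000   # masks in SA-1B
SPEEDUP_VS_COCO = 6.5         # quoted speedup over COCO polygon annotation

SECONDS_PER_YEAR = 365 * 24 * 3600

total_seconds = SECONDS_PER_MASK * TOTAL_MASKS
annotator_years = total_seconds / SECONDS_PER_YEAR     # one person, nonstop
implied_coco_seconds = SECONDS_PER_MASK * SPEEDUP_VS_COCO
mask_to_box_ratio = SECONDS_PER_MASK / SECONDS_PER_BOX

print(f"{annotator_years:.0f} annotator-years")            # 488 annotator-years
print(f"{implied_coco_seconds:.0f} s per mask (derived)")  # 91 s per mask (derived)
print(mask_to_box_ratio)                                   # 2.0
```

Even at 14 seconds per mask, a single annotator working around the clock would need close to five centuries, which is why the later stages of the data engine rely on automatic mask generation.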
This data engine has three "gears":
1. Model-assisted annotation, in which SAM helps human annotators
2. A mix of fully automatic and assisted annotation, which increases the diversity of the collected masks
3. Fully automatic mask creation, which allows the dataset to scale
The final dataset includes more than 1.1 billion segmentation masks collected on about 11 million licensed and privacy-preserving images.
SA-1B has 400 times more masks than any existing segmentation dataset. Human evaluation studies confirm that the masks are of high quality and diversity, and in some cases are even comparable in quality to masks from earlier, much smaller, fully manually annotated datasets.
The images in SA-1B were sourced through photo providers from multiple countries spanning different geographic regions and income levels.
While some geographic areas are still underrepresented, SA-1B has more images and better overall representation across all regions than previous segmentation datasets.
Finally, Meta says it hopes this data can form the basis of new datasets that include additional annotations, such as textual descriptions associated with each mask.
RBG master leads the team
Ross Girshick
Ross Girshick (often called the RBG guru) is a research scientist at Facebook AI Research (FAIR), where he works on computer vision and machine learning.
In 2012, Ross Girshick received his PhD in Computer Science from the University of Chicago under the supervision of Pedro Felzenszwalb.
Before joining FAIR, Ross was a researcher at Microsoft Research and a postdoc at the University of California, Berkeley, where his mentors were Jitendra Malik and Trevor Darrell.
He received the 2017 PAMI Young Researcher Award and the 2017 and 2021 PAMI Mark Everingham Awards in recognition of his contributions to open source software.
As is well known, Ross and He Kaiming jointly developed object detection algorithms in the R-CNN family. The Mask R-CNN paper by Ross and He Kaiming won the best paper award at ICCV 2017.
Netizens: CV really doesn't exist anymore
Meta's foundation model for segmentation in the CV field led many netizens to exclaim, "Now CV really doesn't exist anymore."
Meta scientist Justin Johnson said: "To me, Segment Anything's data engine and ChatGPT's RLHF represent a new era of large-scale artificial intelligence. Instead of learning everything from noisy web data, it is better to cleverly apply human annotation combined with big data to unlock new capabilities. Supervised learning is back!"
The only regret is that the SAM release was led mainly by Ross Girshick, while He Kaiming was absent.
A netizen, "matrix Mingzi", said that this work further proves that multimodality is the future of CV, and that there is no tomorrow for pure CV.