Are all these VLMs blind? GPT-4o and Sonnet-3.5 successively failed the 'vision' test

Jul 18, 2024, 06:18 PM

Are the four leading VLMs all like blind men feeling the elephant?

Ask today's most popular SOTA models (GPT-4o, Gemini-1.5, Sonnet-3, Sonnet-3.5) to count how many times two lines intersect. Will they do as well as humans?

The answer is probably no.

Since the launch of GPT-4V, vision-language models (VLMs) have pushed large models a big step closer to the kind of artificial intelligence we imagined.

VLMs can understand images, describe what they see in language, and perform complex tasks based on that understanding. For example, send a VLM a photo of a dining table and a photo of the menu, and it can count the beer bottles in the first image, read the unit price from the second, and work out how much the beer for the meal cost.

VLMs have advanced so fast that spotting unreasonable "abstract elements" in an image - for instance, asking the model whether someone is ironing clothes on top of a speeding taxi - has become a common way to evaluate them.

However, current benchmark suites do not evaluate the visual capabilities of VLMs well. Taking MMMU as an example, 42.9% of its questions can be solved without looking at the image at all: many answers can be inferred from the text of the question and options alone. Moreover, much of what VLMs demonstrate today is the result of "memorizing" large-scale Internet data. This inflates benchmark scores without answering the real question: can VLMs perceive images the way humans do?

To find out, researchers from Auburn University and the University of Alberta decided to give VLMs an "eye exam". Inspired by the vision tests optometrists use, they put four top VLMs - GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet - through a set of "vision test questions".

  • Paper title: Vision language models are blind

  • Paper link: https://arxiv.org/pdf/2407.06581

  • Project link: https://vlmsareblind.github.io/

The questions are very simple - counting the intersections of two lines, say, or identifying which letter is circled in red - and require almost no world knowledge. The results are shocking: VLMs turn out to be "myopic", with the fine details of an image effectively blurred in their view.

Blind or not? Seven tasks put VLMs' vision to the test

To prevent VLMs from "copying answers" straight from Internet training data, the authors designed a new set of "vision tests". They chose tasks that ask VLMs to judge spatial relationships between geometric shapes - for example, whether two shapes intersect - because the spatial arrangement of such patterns on a blank canvas usually cannot be described in natural language.

Humans process this kind of information with the "visual brain". VLMs, by contrast, fuse image features and text features in the early stages of the model - a visual encoder bolted onto a large language model - which makes them, in essence, a knowledge brain without eyes.

Preliminary experiments show that VLMs perform surprisingly well on human vision tests, such as the tumbling-"E" eye chart that all of us have taken.
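In practice, each test below boils down to the same loop: render an image with a known ground truth, send it to a VLM with a short question, and compare the answer. Below is a minimal query sketch assuming the OpenAI Python SDK (v1.x) and GPT-4o; the file name, prompt wording, and answer parsing are illustrative and not the paper's exact setup.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK, v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_image(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one test image plus its question to a vision-language model
    and return the raw text answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # "lines_0.png" is a hypothetical file produced by a generation sketch below
    print(ask_about_image(
        "lines_0.png",
        "How many times do the two line plots intersect? Answer with a single number.",
    ))
```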

Test and results

Level 1: How many intersections are there between the two lines?

The authors created 150 images, each containing two line plots on a white background. The x-coordinates of the lines are fixed and equally spaced, while the y-coordinates are randomly generated. Two such lines can intersect at only 0, 1, or 2 points.
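The paper's generation script is not reproduced here, but a minimal sketch of how one such image and its ground-truth label could be produced with matplotlib is shown below; the canvas size, colors, and thickness values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt


def count_intersections(y1, y2):
    """Ground-truth intersection count for two line plots sharing x-coordinates.
    Between consecutive x's both plots are straight, so they cross exactly where
    the sign of their y-difference changes (ties are vanishingly rare with
    continuous random y's)."""
    d = np.asarray(y1) - np.asarray(y2)
    return int(np.sum(np.sign(d[:-1]) != np.sign(d[1:])))


def make_line_plot(seed, n_points=3, thickness=2, out="lines.png"):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)      # fixed, equally spaced x-coordinates
    y1 = rng.uniform(0.0, 1.0, n_points)     # random y-coordinates, first line
    y2 = rng.uniform(0.0, 1.0, n_points)     # random y-coordinates, second line
    label = count_intersections(y1, y2)      # 0, 1, or 2

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(x, y1, color="blue", linewidth=thickness)
    ax.plot(x, y2, color="red", linewidth=thickness)
    ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white", bbox_inches="tight")
    plt.close(fig)
    return label


if __name__ == "__main__":
    labels = [make_line_plot(seed=s, out=f"lines_{s}.png") for s in range(5)]
    print("ground-truth intersection counts:", labels)
```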

As Figure 5 shows, across two prompt variants and three line thicknesses, all VLMs performed poorly on this simple task.

Even the most accurate model, Sonnet-3.5, reaches only 77.33% (see Table 1).

More specifically, VLMs tend to do worse as the two lines get closer together (see Figure 6 below). Since each line plot consists of three key points, the distance between the two lines is computed as the average distance between the three corresponding point pairs.

This result contrasts sharply with VLMs' high accuracy on ChartQA: the models can read the overall trend of a line chart, but cannot "zoom in" on details such as which lines intersect.

Level 2: What is the positional relationship between the two circles?

As shown in the figure, the authors randomly generated two circles of the same size on a canvas of a given size. Two circles can be in only three positional relationships: intersecting, tangent, or separate.
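The ground truth here follows directly from the geometry: with equal radii r, two circles intersect when the distance between their centers is less than 2r, are tangent when it equals 2r, and are separate otherwise. A minimal generation sketch follows; the sampled center distances and the radius are illustrative assumptions, since the paper sweeps these values systematically.

```python
import math
import random
import matplotlib.pyplot as plt


def relationship(c1, c2, r):
    """Classify two equal-radius circles by the distance between their centers."""
    d = math.dist(c1, c2)
    if math.isclose(d, 2 * r, abs_tol=1e-9):
        return "tangent"
    return "intersecting" if d < 2 * r else "separate"


def make_circle_image(seed, r=0.1, out="circles.png"):
    rng = random.Random(seed)
    # sample the center-to-center distance directly so all three cases occur
    d = rng.choice([1.0 * r, 1.5 * r, 2.0 * r, 2.5 * r, 3.0 * r])
    theta = rng.uniform(0, 2 * math.pi)
    c1 = (0.5 - d / 2 * math.cos(theta), 0.5 - d / 2 * math.sin(theta))
    c2 = (0.5 + d / 2 * math.cos(theta), 0.5 + d / 2 * math.sin(theta))
    label = relationship(c1, c2, r)

    fig, ax = plt.subplots(figsize=(4, 4))
    for c in (c1, c2):
        ax.add_patch(plt.Circle(c, r, fill=False, linewidth=2))
    ax.set_xlim(0, 1); ax.set_ylim(0, 1)
    ax.set_aspect("equal"); ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white")
    plt.close(fig)
    return label


if __name__ == "__main__":
    print([make_circle_image(seed=s, out=f"circles_{s}.png") for s in range(5)])
```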

Surprisingly, on a task whose answer a human can see at a glance, no VLM gets it right every time (see Figure 7).

The model with the best accuracy (92.78%) is Gemini-1.5 (see Table 2).

One pattern recurred throughout the experiments: when the two circles are very close, VLMs tend to fail but still make educated guesses. As shown in the figure below, Sonnet-3.5 usually falls back on a conservative "no".

As shown in Figure 8, even when the two circles are separated by a gap as wide as a full radius (d = 0.5), GPT-4o, the least accurate model, still cannot reach 100% accuracy.

In other words, VLMs' vision does not seem sharp enough to see the small gap or overlap between two circles.

Level 3: Which letter is circled in red?

Since the spacing between letters in a word is very small, the authors hypothesized that if VLMs are "myopic", they will not be able to recognize which letter is circled in red.

So they chose strings such as "Acknowledgement", "Subdermatoglyphic", and "tHyUiKaRbNqWeOpXcZvM", and randomly drew a red oval around one letter in each string as the test.
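A rough sketch of how such a stimulus could be rendered - one character per monospace cell, with a red oval around a randomly chosen letter - is shown below. The font, sizes, and oval dimensions are illustrative assumptions, not the paper's settings.

```python
import random
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse


def make_circled_letter_image(word, seed, out="circled.png"):
    """Render `word` one character at a time (monospace spacing) and draw a red
    oval around a randomly chosen character; returns the ground-truth letter."""
    rng = random.Random(seed)
    idx = rng.randrange(len(word))

    fig, ax = plt.subplots(figsize=(len(word) * 0.3, 1.2))
    for i, ch in enumerate(word):
        ax.text(i, 0, ch, fontsize=20, family="monospace", ha="center", va="center")
    # red oval roughly the size of one character cell around the chosen letter
    ax.add_patch(Ellipse((idx, 0), width=0.9, height=0.9, fill=False,
                         edgecolor="red", linewidth=2))
    ax.set_xlim(-1, len(word)); ax.set_ylim(-1, 1); ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white", bbox_inches="tight")
    plt.close(fig)
    return word[idx]


if __name__ == "__main__":
    for w in ["Acknowledgement", "Subdermatoglyphic", "tHyUiKaRbNqWeOpXcZvM"]:
        print(w, "->", make_circled_letter_image(w, seed=0, out=f"{w}.png"))
```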

The test results show that the tested models performed very poorly at this level (see Figure 9 and Table 3).

For example, the models tend to err when the red oval slightly overlaps a letter, often confusing it with an adjacent letter. Sometimes they hallucinate: even when they can spell the word correctly, they insert garbage characters (e.g., "9", "n", "©") into it.

All models except GPT-4o performed slightly better on words than on random strings, suggesting that knowing the spelling of a word may help visual language models make judgments, thereby slightly improving accuracy.

Gemini-1.5 and Sonnet-3.5 are the top two models, with accuracy rates of 92.81% and 89.22% respectively, outperforming GPT-4o and Sonnet-3 by almost 20 percentage points.

Levels 4 and 5: How many overlapping shapes? How many nested "matryoshka" squares?

If VLMs are "myopic", they may not be able to see clearly where each pair of circles intersects in an "Olympic rings"-style pattern. The authors therefore randomly generated 60 such patterns and asked the VLMs to count the overlapping shapes; they also generated a pentagon version of the "Olympic rings" for further testing.

Since VLMs count intersecting circles poorly, the authors also tested the case where edges never intersect and each shape is fully nested inside another. They generated "matryoshka"-style patterns of 2-5 nested squares and asked the VLMs to count the total number of squares in the image.
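A minimal sketch of how nested, non-touching squares with a known ground-truth count could be generated is given below; the shrink factors and margins are illustrative assumptions.

```python
import random
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle


def make_nested_squares(seed, out="nested.png"):
    """Draw n nested ("matryoshka") squares, each strictly inside the previous
    one with no touching edges; returns the ground-truth count n."""
    rng = random.Random(seed)
    n = rng.randint(2, 5)

    fig, ax = plt.subplots(figsize=(4, 4))
    x, y, size = 0.05, 0.05, 0.9               # outermost square
    for _ in range(n):
        ax.add_patch(Rectangle((x, y), size, size, fill=False, linewidth=2))
        # shrink and shift randomly while keeping the next square strictly inside
        new_size = size * rng.uniform(0.5, 0.7)
        slack = size - new_size
        x += rng.uniform(0.1, 0.9) * slack
        y += rng.uniform(0.1, 0.9) * slack
        size = new_size

    ax.set_xlim(0, 1); ax.set_ylim(0, 1)
    ax.set_aspect("equal"); ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white")
    plt.close(fig)
    return n


if __name__ == "__main__":
    print([make_nested_squares(seed=s, out=f"nested_{s}.png") for s in range(5)])
```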

It is easy to see from the bright red crosses in the table below that these two levels are also insurmountable obstacles for VLMs.

In the nested-squares test, accuracy varies widely across models: GPT-4o (48.33%) and Sonnet-3 (55.00%) are at least 30 percentage points below Gemini-1.5 (80.00%) and Sonnet-3.5 (87.50%).

The gap is even larger when counting overlapping circles and pentagons, with Sonnet-3.5 performing several times better than the other models. As the table below shows, on pentagons Sonnet-3.5's 75.83% accuracy far exceeds Gemini-1.5's 9.16%.

Surprisingly, all four models tested achieved 100% accuracy when counting 5 rings, but adding just one additional ring was enough to cause the accuracy to drop significantly to near zero.

For pentagons, however, every VLM except Sonnet-3.5 performs poorly even at exactly 5 shapes. Overall, counting 6 to 9 shapes - circles or pentagons - is hard for all models.

This points to a bias: VLMs are inclined to output the famous "Olympic rings" answer. Gemini-1.5, for instance, predicts "5" in 98.95% of trials regardless of the actual number of circles (see Table 5). For the other models, this error occurs far more often with rings than with pentagons.

Beyond counts, VLMs also show different "preferences" for shape color.

GPT-4o does better on colored shapes than on plain black ones, while Sonnet-3.5 improves as image size increases. For the other models, changing color and image resolution shifts accuracy only slightly.

Notably, in the nested-squares task, GPT-4o and Sonnet-3 struggle even with just 2-3 squares, and with four or five squares no model comes close to 100% accuracy. This shows that VLMs have trouble extracting the target shapes even when their edges never intersect.

Level 6: How many rows and columns does the table have?

VLMs struggle with overlapping and nested shapes, but how do they handle tiled patterns? On standard benchmarks - notably DocVQA, which contains many table questions - the tested models score ≥90%. The authors randomly generated 444 tables with varying numbers of rows and columns and asked the VLMs to count how many rows and how many columns each table has.
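Generating such a stimulus amounts to drawing a lattice of horizontal and vertical rules; a minimal sketch is below, with grid-size ranges and line widths as illustrative assumptions.

```python
import random
import matplotlib.pyplot as plt


def make_grid(seed, out="grid.png"):
    """Draw an empty table as a plain grid; returns the (rows, cols) ground truth."""
    rng = random.Random(seed)
    rows, cols = rng.randint(3, 9), rng.randint(3, 9)

    fig, ax = plt.subplots(figsize=(4, 4))
    for r in range(rows + 1):                  # horizontal rules
        ax.plot([0, cols], [r, r], color="black", linewidth=1)
    for c in range(cols + 1):                  # vertical rules
        ax.plot([c, c], [0, rows], color="black", linewidth=1)
    ax.set_aspect("equal"); ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white", bbox_inches="tight")
    plt.close(fig)
    return rows, cols


if __name__ == "__main__":
    print([make_grid(seed=s, out=f"grid_{s}.png") for s in range(5)])
```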

The results show that, despite their high benchmark scores, VLMs perform poorly at counting the rows and columns of an empty table, as the figure below shows.

Specifically, they are usually off by one or two. As shown in the figure below, GPT-4o reads a 4×5 grid as 4×4, while Gemini-1.5 reads it as 5×5.

This suggests that while VLMs can extract the content needed to answer table-related questions in DocVQA, they cannot actually parse a table cell by cell.

One possible reason is that tables in documents are rarely empty, so VLMs are simply not used to blank grids. Interestingly, when the researchers simplified the task by putting a single word in every cell, accuracy improved markedly for all VLMs; GPT-4o, for example, rose from 26.13% to 53.03% (see Table 6). Performance is still far from perfect, however: as Figures 15a and 15b show, the best model (Sonnet-3.5) reaches 88.68% on grids containing text but only 59.84% on empty grids.

Most models (Gemini-1.5, Sonnet-3, and Sonnet-3.5) also consistently count columns more accurately than rows (see Figures 15c and 15d).

Level 7: How many direct subway lines run from the starting point to the destination?

This level tests VLMs' ability to follow a path, a skill crucial for reading maps and charts and for understanding annotations such as arrows that users add to input images. The authors randomly generated 180 subway maps, each with four fixed stations, and asked the VLMs to count how many single-color paths connect two given stations.
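The paper's map generator is more elaborate, but the idea can be sketched as follows: place a few fixed stations, draw a random number of colored polylines between the start and the destination, and keep that number as the ground truth. The station layout, colors, and path shapes in this toy sketch are illustrative assumptions.

```python
import random
import matplotlib.pyplot as plt


def make_subway_map(seed, out="subway.png"):
    """Toy path-following stimulus: four fixed stations (A-D) and a random number
    of colored polylines drawn from A to C; returns the ground-truth path count."""
    rng = random.Random(seed)
    stations = {"A": (0.1, 0.5), "B": (0.5, 0.9), "C": (0.9, 0.5), "D": (0.5, 0.1)}
    n_paths = rng.randint(1, 3)
    colors = rng.sample(["red", "green", "blue", "orange", "purple"], n_paths)

    fig, ax = plt.subplots(figsize=(4, 4))
    for color in colors:
        # each path runs from A to C via one random intermediate bend
        bend = (rng.uniform(0.3, 0.7), rng.uniform(0.1, 0.9))
        xs = [stations["A"][0], bend[0], stations["C"][0]]
        ys = [stations["A"][1], bend[1], stations["C"][1]]
        ax.plot(xs, ys, color=color, linewidth=3)
    for name, (x, y) in stations.items():
        ax.plot(x, y, "ks", markersize=8)        # station marker
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(6, 6))
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis("off")
    fig.savefig(out, dpi=150, facecolor="white")
    plt.close(fig)
    return n_paths


if __name__ == "__main__":
    print([make_subway_map(seed=s, out=f"subway_{s}.png") for s in range(5)])
```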

The results are again sobering. Even when only a single path connects the two stations, no model reaches 100% accuracy. As Table 7 shows, the best model is Sonnet-3.5 at 95%, and the worst is Sonnet-3 at 23.75%.

As the figure below shows, VLM predictions are typically off by 1 to 3 paths, and most VLMs get worse as map complexity grows from 1 to 3 paths.

Faced with the "brutal fact" that today's mainstream VLM performs extremely poorly in image recognition, many netizens first put aside their status as "AI defense lawyers" and left many pessimistic comments.

One netizen wrote: "It's embarrassing that the SOTA models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5) perform so poorly, when their marketing claims they can understand images - for example, that they could help blind people or teach geometry to children!"

On the other side, one netizen argued that these failures could easily be fixed with training and fine-tuning: about 100,000 examples trained on real data, and the problem is solved.

However, both the "AI defenders" and the "AI pessimists" have acquiesced in the fact that VLM still performs well in the image test. There are factual flaws that are extremely difficult to reconcile.

The paper's authors have also fielded questions about whether the test itself is scientific.

Some netizens argue that the tests do not prove VLMs are "myopic". First, people with myopia do not see blurred detail up close; blurred detail is a symptom of hyperopia. Second, failing to see detail is not the same as failing to count intersections: accuracy at counting the rows and columns of a blank grid does not improve with higher resolution, and raising image resolution likewise does not noticeably help with the overlapping-line and intersection tasks.

In fact, the challenges faced by these visual language models (VLMs) in handling such tasks may have more to do with their reasoning capabilities and the way they interpret image content, rather than just a problem of visual resolution. In other words, even if every detail of an image is clearly visible, models may still not be able to accurately complete these tasks if they lack the correct reasoning logic or a deep understanding of visual information. Therefore, this research may need to delve deeper into the capabilities of VLMs in visual understanding and reasoning, rather than just their image processing capabilities.

Other netizens point out that if human vision were processed by convolution, humans would also struggle to judge line intersections.

For more information, please refer to the original paper.

Reference links:

https://arxiv.org/pdf/2407.06581

https://news.ycombinator.com/item?id=40926734

https://vlmsareblind.github.io/
