‘Visual’ AI Models Might Not See Anything at All

Recent AI models such as GPT-4o and Gemini 1.5 Pro are billed as multimodal, with claimed proficiency in understanding images, audio, and text. However, a new study suggests they may not perceive visual information in the way humans do.

While no one asserts these AIs see like humans, marketing rhetoric often refers to their “vision capabilities” and “visual understanding.” They are portrayed as capable of tasks ranging from analyzing images to interpreting videos.

The study, conducted by researchers from Auburn University and the University of Alberta, scrutinized leading multimodal models on basic visual tasks. These tasks included determining if two shapes overlap, counting pentagons in an image, and identifying a circled letter in a word—tasks elementary enough for a first-grader to excel at.

“Our seven tasks are extremely simple, where humans would perform at 100% accuracy. We expect AIs to do the same, but they are currently NOT,” explained co-author Anh Nguyen in correspondence with TechCrunch. “Our findings emphasize that even the best models are still struggling.”

For instance, when asked whether two shapes such as circles overlap, the models performed inconsistently. GPT-4o did well when the circles were far apart but struggled when they were close together, answering correctly only 18% of the time under those conditions. Gemini 1.5 Pro did better but still faltered at close distances, answering correctly in only 70% of cases.

Counting tasks also revealed stark limitations. While the models excelled at counting five interlocking circles, their accuracy plummeted when additional rings were introduced. This variability suggests a lack of true visual understanding and a reliance on patterns ingrained during training, such as the Olympic rings, which commonly appear in their datasets.

The researchers speculate that while these models process visual data abstractly, for example by noting the presence of a circle in an image, they lack the capability for nuanced visual judgment. This discrepancy leads to erratic performance on seemingly straightforward tasks, undermining claims of comprehensive visual understanding.

In conclusion, while these AI models excel in certain domains, such as recognizing human actions or everyday objects, they operate without true visual perception. Research such as this is crucial in dispelling misconceptions propagated by AI marketing and illuminating the actual capabilities and limitations of these advanced systems.
