
GPT-4o understands text, but does it really see? An MFM benchmarking study on vision tasks

Multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have advanced rapidly in recent years, especially in public demonstrations. While their language abilities have been studied extensively, how well they truly understand visual information remains unclear. Most benchmarks in use today focus on text-centric tasks such as VQA or classification, which often reward language strengths rather than genuine visual capability. Because these tests also require text output, it is difficult to evaluate visual skills fairly or to compare MFMs against vision-specific models. Furthermore, key aspects such as 3D perception, segmentation, and grouping remain largely ignored in current evaluations.

MFMs perform strongly on tasks that combine visual and language understanding, such as image captioning and visual question answering. However, their effectiveness on tasks requiring fine-grained visual understanding is less clear. Because most current benchmarks rely on text-based output, it is difficult to compare MFMs fairly against vision-only models. Some studies have adapted vision datasets for MFMs by converting annotations into text, but this restricts evaluation to language output (a simple illustration of such a conversion appears in the sketch below). Prompting strategies have also been explored to help MFMs handle vision tasks by breaking them into manageable subtasks, although reproducibility remains a challenge in some cases.
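To make the idea of converting vision annotations to text concrete, here is a minimal sketch, not taken from any of the cited studies; the field names and prompt wording are hypothetical:

```python
# Minimal sketch (illustrative only): turning a COCO-style box annotation into a
# text question/answer pair so an API-based MFM can be scored on language output alone.

def annotation_to_text_qa(annotation, image_width, image_height):
    """Render one bounding-box annotation as a text question/answer pair."""
    x, y, w, h = annotation["bbox"]  # COCO stores boxes as [x, y, width, height]
    # Normalize coordinates so the expected answer is resolution-independent.
    box = [round(v, 3) for v in (x / image_width, y / image_height,
                                 (x + w) / image_width, (y + h) / image_height)]
    question = (f"Is there a {annotation['category']} in the image? "
                "If so, give its bounding box as [x1, y1, x2, y2] in relative coordinates.")
    answer = f"Yes, at {box}."
    return question, answer

# Example usage with a made-up annotation:
q, a = annotation_to_text_qa({"category": "dog", "bbox": [48, 100, 200, 150]}, 640, 480)
print(q)
print(a)
```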

EPFL researchers evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as semantic segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are accessible only through APIs, the team developed a prompt-chaining framework that translates these vision tasks into text-compatible formats. Their findings suggest that although MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stands out, performing best on 4 of the 6 tasks. The evaluation toolkit will be open-sourced.
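As an illustration of how such a prompt-chaining setup might look in code, the sketch below assumes a generic `call_mfm(image_path, prompt)` wrapper around a vision-capable API; the function names, prompts, and group size are assumptions, not the paper's actual implementation:

```python
# Illustrative-only sketch of a prompt chain for image classification.

def call_mfm(image_path: str, prompt: str) -> str:
    """Placeholder for an API call to a multimodal foundation model (hypothetical)."""
    raise NotImplementedError("Plug in your provider's vision API here.")

def classify_by_chaining(image_path: str, label_space: list, group_size: int = 20) -> str:
    """Step 1: shortlist candidate labels in manageable groups.
       Step 2: pick the final label from the shortlist."""
    shortlist = []
    for i in range(0, len(label_space), group_size):
        group = label_space[i:i + group_size]
        reply = call_mfm(image_path,
                         "Which of these labels could describe the main object? "
                         f"Answer with labels only, comma-separated: {', '.join(group)}")
        shortlist += [lbl for lbl in group if lbl.lower() in reply.lower()]
    final = call_mfm(image_path,
                     f"Choose the single best label for the image from: {', '.join(shortlist)}")
    return final.strip()
```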

To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy that decomposes complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies which objects are present and then localizes them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Pairwise ranking of superpixel regions is used to estimate depth and surface normals. This modular design leverages the strengths of MFMs in classification and similarity judgments, while calibrated controls ensure fair comparisons. The approach is flexible, and performance improves with finer-grained prompts.
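The pairwise-ranking idea can be sketched as follows; `compare_regions` is a hypothetical stand-in for an MFM query that highlights two superpixels and asks which is closer to the camera, and the win-count ranking is one simple way (among several) to aggregate such judgments:

```python
# Minimal sketch (assumptions, not the paper's code): turning pairwise
# "which region is closer?" judgments into a per-region depth ordering.

from itertools import combinations

def rank_regions_by_depth(region_ids, compare_regions):
    """compare_regions(a, b) -> a or b, whichever the model judges closer.
    Returns region ids sorted from closest to farthest (win-count ranking)."""
    wins = {r: 0 for r in region_ids}
    for a, b in combinations(region_ids, 2):
        wins[compare_regions(a, b)] += 1
    return sorted(region_ids, key=lambda r: wins[r], reverse=True)

# Toy usage with a fake comparator that always prefers the lower region id:
order = rank_regions_by_depth([0, 1, 2, 3], lambda a, b: min(a, b))
print(order)  # [0, 1, 2, 3]
```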

The study covers tasks such as image classification, object detection, and segmentation, evaluating several MFMs including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet. Results on datasets such as ImageNet, COCO, and Hypersim show that GPT-4o reaches 77.2% classification accuracy on ImageNet and 60.62 AP50 on object detection, well below specialist models such as ViT-G (90.94%) and Co-DETR (91.30). On semantic segmentation, GPT-4o scores 44.89 mIoU, compared with 65.52 for OneFormer. MFMs handle distribution shifts well but lag behind on precise visual reasoning. The study also introduces prompt-chaining and oracle baselines to estimate upper-bound performance.
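For readers unfamiliar with the segmentation metric above, here is a standard mean-IoU computation; this is a generic sketch, not the study's evaluation toolkit:

```python
# Mean IoU over a predicted and ground-truth label map of equal shape.

import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: two 2x2 label maps with classes {0, 1}
print(mean_iou(np.array([[0, 1], [1, 1]]), np.array([[0, 1], [0, 1]]), num_classes=2))
```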

In summary, the study introduces a benchmarking framework for evaluating the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into text-based, prompt-friendly formats. The results show that MFMs perform better on semantic tasks than on geometric ones, with GPT-4o generally leading. However, all MFMs lag significantly behind task-specific vision models. Although they are generalists trained primarily on image-text data, they show promising progress on 3D tasks, especially newer reasoning models such as o3. Limitations include high inference cost and prompt sensitivity. Nevertheless, the framework provides a unified approach to assessing the visual understanding of MFMs, laying the groundwork for future progress.


Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.