Microsoft AI Releases OmniParser V2: An AI Tool That Turns Any LLM into a Computer Use Agent

In the field of artificial intelligence, enabling large language models (LLMs) to navigate and interact with graphical user interfaces (GUIs) is a significant challenge. Although LLMs are good at handling text, they often struggle to interpret visual elements such as icons, buttons, and menus. This restricts their effectiveness in tasks that require seamless interaction with software interfaces, which are primarily visual.
To address this problem, Microsoft has introduced OmniParser V2, a tool designed to improve LLMs’ understanding of GUIs. OmniParser V2 converts UI screenshots into structured, machine-readable data, allowing LLMs to understand and interact with a variety of software interfaces more effectively. The aim is to bridge the gap between text and visual data processing and so enable more comprehensive AI applications.
OmniParser V2 works through two main components: detection and captioning. The detection module uses a fine-tuned version of the YOLOv8 model to identify interactive elements in screenshots, such as buttons and icons. Meanwhile, the captioning module uses a fine-tuned Florence-2 base model to generate descriptive labels for these elements, providing context about their function within the interface. This combined approach lets an LLM build a detailed understanding of the GUI, which is crucial for accurate interaction and task execution.
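As a rough illustration of what "structured machine-readable data" could look like, the sketch below pairs detected bounding boxes with captions and renders them as text for an LLM. The `UIElement` schema, the `merge` and `to_prompt` helpers, and the sample values are all hypothetical; they are not OmniParser's actual output format, and the detector and captioner are replaced by hard-coded stand-ins.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """One interactive element found in a screenshot (hypothetical schema)."""
    element_id: int
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    caption: str                    # short functional description

    def center(self) -> tuple[int, int]:
        """Pixel at the middle of the box, a natural click target."""
        x_min, y_min, x_max, y_max = self.box
        return ((x_min + x_max) // 2, (y_min + y_max) // 2)


def merge(boxes, captions):
    """Pair each detected box with its caption, in detection order."""
    if len(boxes) != len(captions):
        raise ValueError("expected exactly one caption per detected box")
    return [UIElement(i, tuple(b), c)
            for i, (b, c) in enumerate(zip(boxes, captions))]


def to_prompt(elements):
    """Render the elements as plain text lines an LLM can read."""
    return "\n".join(f"[{e.element_id}] {e.caption} at {e.box}"
                     for e in elements)


# Stand-in values for what a detector and a captioner might return.
boxes = [(10, 10, 110, 40), (200, 300, 232, 332)]
captions = ["Save button", "Settings gear icon"]

elements = merge(boxes, captions)
print(to_prompt(elements))
```

Giving every element a numeric ID means the LLM can answer with a short reference such as "click element 1" instead of having to produce pixel coordinates itself.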
A significant improvement in OmniParser V2 is its enhanced training dataset. The tool has been trained on a larger set of icon-caption and grounding data drawn from widely used web pages and applications. This richer dataset improves the model’s accuracy in detecting and describing smaller interactive elements, which is critical for effective GUI interaction. Additionally, by optimizing the image size processed by the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version, with an average processing time of 0.6 seconds per frame on an A100 GPU and 0.8 seconds on a single RTX 4090 GPU.
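To put the latency figures in perspective, a small calculation can recover what they imply about the previous version. This is only arithmetic on the numbers quoted above, assuming the 60% reduction applies to the A100 measurement; the earlier version's latency is not stated in the article, and `previous_latency` is a helper named here for illustration.

```python
def previous_latency(current_s: float, reduction: float) -> float:
    """Latency before a fractional reduction.

    Uses current = previous * (1 - reduction), solved for previous.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return current_s / (1 - reduction)


# 0.6 s on an A100 after a 60% reduction
a100_before = previous_latency(0.6, 0.60)
print(f"Implied previous A100 latency: {a100_before:.2f} s")
```

Under that assumption, the previous version would have needed roughly 1.5 seconds for the same work on an A100.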

The effectiveness of OmniParser V2 is demonstrated by its performance on the ScreenSpot Pro benchmark, an evaluation framework for GUI grounding capabilities. Combined with GPT-4o, OmniParser V2 achieved an average accuracy of 39.6%, a substantial improvement over GPT-4o’s baseline score of 0.8%. This result highlights the tool’s ability to help LLMs accurately interpret and interact with complex GUIs, even those with high-resolution displays and small target icons.
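For readers unfamiliar with grounding metrics, the sketch below shows one common way such an accuracy figure can be computed: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. This convention is an assumption here, since the article does not describe the benchmark's exact protocol, and the function names and sample data are illustrative only.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside (x_min, y_min, x_max, y_max), edges included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(predictions, targets):
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per target")
    if not targets:
        raise ValueError("no examples to score")
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


# Made-up predictions and ground-truth boxes, for illustration only.
predicted_clicks = [(50, 20), (400, 400), (215, 310)]
target_boxes = [
    (10, 10, 110, 40),
    (200, 300, 232, 332),
    (200, 300, 232, 332),
]

accuracy = grounding_accuracy(predicted_clicks, target_boxes)
print(f"Grounding accuracy: {accuracy:.1%}")
```

With a metric of this kind, small icons on high-resolution screens are hard precisely because the target box covers only a tiny fraction of the image.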
To support integration and experimentation, Microsoft has developed OmniTool, a Dockerized Windows system that combines OmniParser V2 with essential tools for agent development. OmniTool is compatible with a variety of state-of-the-art LLMs, including OpenAI’s 4o/o1/o3-mini, DeepSeek’s R1, Qwen’s 2.5VL, and Anthropic’s Sonnet. This flexibility lets developers use OmniParser V2 with different models and applications, simplifying the creation of vision-based GUI agents.
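The model-agnostic idea can be sketched as an agent step that takes the parsed elements plus any "chooser" function and returns a click action. Everything here is hypothetical: `agent_step`, the element dictionaries, and `keyword_chooser` (a trivial stand-in for a real LLM call) are illustrative names, not part of OmniTool's interface.

```python
from typing import Callable, Sequence

# A chooser receives the task and the parsed elements and returns an element ID.
# In a real agent this would wrap a call to whichever LLM is being used.
ChooseFn = Callable[[str, Sequence[dict]], int]


def agent_step(task: str, elements: Sequence[dict], choose: ChooseFn) -> dict:
    """Ask the chooser for an element, then click the center of its box."""
    chosen_id = choose(task, elements)
    by_id = {e["id"]: e for e in elements}
    if chosen_id not in by_id:
        raise ValueError(f"chooser returned unknown element id {chosen_id}")
    x_min, y_min, x_max, y_max = by_id[chosen_id]["box"]
    return {"action": "click",
            "x": (x_min + x_max) // 2,
            "y": (y_min + y_max) // 2}


def keyword_chooser(task: str, elements: Sequence[dict]) -> int:
    """Stand-in for an LLM: first element whose caption shares a word with the task."""
    task_words = set(task.lower().split())
    for e in elements:
        if task_words & set(e["caption"].lower().split()):
            return e["id"]
    return elements[0]["id"]  # fall back to the first element


# Made-up parsed elements, for illustration only.
elements = [
    {"id": 0, "caption": "Save button", "box": (10, 10, 110, 40)},
    {"id": 1, "caption": "Settings gear icon", "box": (200, 300, 232, 332)},
]

action = agent_step("open settings", elements, keyword_chooser)
print(action)
```

Because the step depends only on the chooser's signature, swapping one model for another means replacing a single function while the parsing and action code stay unchanged.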
In short, OmniParser V2 represents a meaningful advance in integrating LLMs with graphical user interfaces. By converting UI screenshots into structured data, it enables LLMs to understand and interact with software interfaces more effectively. Its improvements in detection accuracy, latency, and benchmark performance make OmniParser V2 a valuable tool for developers who want to build intelligent agents capable of navigating and manipulating GUIs autonomously. As AI continues to evolve, tools like OmniParser V2 will be important for bridging the gap between text and visual data processing, leading to more intuitive and capable AI systems.
Check out the technical details, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers on the project.
The post Microsoft AI Releases OmniParser V2: An AI Tool That Turns Any LLM into a Computer Use Agent appeared first on MarkTechPost.