H Company releases Holo1.5: Open computer-use VLMs focused on GUI localization and UI-VQA

H Company (a French AI startup) has released Holo1.5, a family of open foundation vision models designed for computer-use (CU) agents that operate on real user interfaces via screenshots and pointer/keyboard actions. The release includes 3B, 7B, and 72B checkpoints, with a reported accuracy gain of about 10% over Holo1 across sizes. The 7B model is Apache-2.0; the 3B and 72B models inherit research-only constraints from their upstream bases. The series targets two core capabilities that matter for CU stacks: precise UI element localization (coordinate prediction) and UI visual question answering (UI-VQA) for state understanding.

Why is UI element localization important?

Localization is how an agent converts intent into a pixel-level action: “Open Spotify” → predict clickable coordinates for the correct control on the current screen. Failures cascade here: a single misclick can derail a multi-step workflow. Holo1.5 is trained and evaluated on high-resolution screens (up to 3840×2160) across desktop (macOS, Ubuntu, Windows), web, and mobile interfaces, which improves robustness on dense professional UIs where iconography and small targets raise error rates.
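To make "pixel-level action" concrete, here is a minimal sketch of one step an agent often needs: mapping a predicted point from the image the model saw back to native screen pixels. The function name `to_screen_pixels`, and the assumption that the model returns coordinates in the pixel space of a possibly downscaled input image, are illustrative and not taken from Holo1.5's documentation.

```python
from typing import Tuple


def to_screen_pixels(
    point: Tuple[float, float],
    model_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map an (x, y) point from model-input image space to native screen pixels.

    Assumption (hypothetical): the model predicts coordinates in the pixel
    space of the image it was given, which may be a downscaled screenshot.
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if model_w <= 0 or model_h <= 0:
        raise ValueError("model_size must be positive")
    x = point[0] * screen_w / model_w
    y = point[1] * screen_h / model_h
    # Clamp so a slightly out-of-range prediction still lands on the screen.
    x_px = min(max(int(round(x)), 0), screen_w - 1)
    y_px = min(max(int(round(y)), 0), screen_h - 1)
    return x_px, y_px


# A point at the centre of a 1920x1080 input maps to the centre of a 3840x2160 screen.
print(to_screen_pixels((960, 540), (1920, 1080), (3840, 2160)))  # (1920, 1080)
```

If the model is fed the screenshot at native resolution, `model_size` and `screen_size` are equal and the function reduces to rounding and clamping.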

How is Holo1.5 different from general VLMs?

General VLMs optimize for broad grounding and captioning; CU agents need reliable pointing plus interface comprehension. Holo1.5 aligns its data and objectives with these requirements: large-scale SFT on GUI tasks, followed by GRPO-style reinforcement learning to tighten coordinate accuracy and decision reliability. The models are delivered as perception components to be embedded in planners/executors (e.g., Surfer-style agents), not as end-to-end agents.

How does Holo1.5 perform on localization benchmarks?

Holo1.5 reports state-of-the-art GUI grounding across ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown, and WebClick. Representative 7B numbers (averages over six localization tracks):

  • Holo1.5-7B: 77.32
  • Qwen2.5-VL-7B: 60.73

On ScreenSpot-Pro (professional applications with dense layouts), Holo1.5-7B achieves 57.94 vs 29.00 for Qwen2.5-VL-7B, indicating substantially better target selection under realistic conditions. The 3B and 72B checkpoints show similar relative gains over their Qwen2.5-VL counterparts.
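As a quick check on what those scores imply, the short calculation below turns the reported 7B numbers into absolute and relative gains. The scores are the ones quoted above; the helper name `gains` is just for illustration.

```python
from typing import Tuple


def gains(new: float, base: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 1)


# Average over the localization tracks: Holo1.5-7B vs Qwen2.5-VL-7B.
print(gains(77.32, 60.73))  # (16.59, 27.3)

# ScreenSpot-Pro: Holo1.5-7B vs Qwen2.5-VL-7B.
print(gains(57.94, 29.00))  # (28.94, 99.8)
```

In other words, the average improves by about 16.6 points (roughly 27% relative), while the ScreenSpot-Pro score nearly doubles.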

Can it also improve UI understanding (UI-VQA)?

Yes. On VisualWebBench, WebSRC, and ScreenQA (short/complex), Holo1.5 delivers consistent accuracy improvements. The reported 7B average is ≈88.17, with the 72B variant at around 90.00. This matters for agent reliability: queries such as “Which tab is active?” or “Is the user signed in?” reduce ambiguity and enable verification between actions.

How does it compare to specialized and closed systems?

Under the published evaluation settings, Holo1.5 outperforms open baselines (Qwen2.5-VL) and competitive specialized systems (e.g., UI-TARS, UI-Venus), and shows advantages over closed generalist models (e.g., Claude Sonnet 4) on the cited UI tasks. Because protocols, prompts, and screen resolutions affect the results, practitioners should replicate the evaluation with their own harness before drawing deployment-level conclusions.
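A replication harness can start very small. The sketch below scores a predictor on labeled screenshots, counting a prediction as correct when the point falls inside the target's bounding box. That criterion, along with the names `Sample`, `is_hit`, and `click_accuracy`, is an assumption made for illustration; it is not confirmed here as the exact protocol behind the published numbers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


@dataclass
class Sample:
    screenshot_path: str  # hypothetical path to a screenshot from your own app
    instruction: str      # e.g. "Open Spotify"
    target_box: Box       # ground-truth bounding box of the correct control


def is_hit(point: Point, box: Box) -> bool:
    """True if the predicted point lies inside the target box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(
    samples: Iterable[Sample],
    predict: Callable[[str, str], Point],
) -> float:
    """Fraction of samples whose predicted point hits the target box."""
    items = list(samples)
    if not items:
        raise ValueError("no samples to evaluate")
    hits = sum(
        is_hit(predict(s.screenshot_path, s.instruction), s.target_box)
        for s in items
    )
    return hits / len(items)


# Stub predictor standing in for a real model call; it always returns the same point.
def stub_predict(screenshot_path: str, instruction: str) -> Point:
    return (100.0, 50.0)


samples = [
    Sample("a.png", "Open Spotify", (90, 40, 110, 60)),
    Sample("b.png", "Close dialog", (300, 300, 320, 320)),
]
print(click_accuracy(samples, stub_predict))  # 0.5
```

Swapping `stub_predict` for a function that calls the model under test, at the resolution and with the prompts used in production, gives numbers that are directly comparable across candidate models.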

What are the integration implications for CU agents?

  • Higher click reliability at native resolution: Better ScreenSpot-Pro performance suggests fewer misclicks in complex applications (IDEs, design suites, admin consoles).
  • Stronger state tracking: Higher UI-VQA accuracy improves detection of logged-in state, active tab, modal visibility, and success/failure cues.
  • Pragmatic licensing path: The 7B model (Apache-2.0) is suitable for production. The 72B checkpoint is currently research-only; use it for internal experiments or to bound headroom.

Where does Holo1.5 fit in a modern computer-use (CU) stack?

Think of Holo1.5 as the screen perception layer:

  • Input: Full-resolution screenshots (optionally with UI metadata).
  • Output: Target coordinates with confidence; short text answers about screen state.
  • Downstream: Action policies convert predictions into click/keyboard events; monitoring verifies post-conditions and triggers retries or fallbacks.
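The layering above can be sketched as a small locate, act, verify loop. Everything named here (`ScreenPerception`, `locate`, `ask`, `act_and_verify`, the confidence threshold, and the fake model) is hypothetical and only illustrates the division of responsibilities; it is not Holo1.5's API.

```python
from typing import Callable, List, Protocol, Tuple


class ScreenPerception(Protocol):
    """Hypothetical interface for a perception model such as a GUI-grounding VLM."""

    def locate(self, screenshot: bytes, instruction: str) -> Tuple[int, int, float]:
        """Return (x, y, confidence) for the element matching the instruction."""
        ...

    def ask(self, screenshot: bytes, question: str) -> str:
        """Return a short text answer about the screen state."""
        ...


def act_and_verify(
    perception: ScreenPerception,
    capture: Callable[[], bytes],
    click: Callable[[int, int], None],
    instruction: str,
    check_question: str,
    expected_answer: str,
    min_confidence: float = 0.5,
    max_retries: int = 2,
) -> bool:
    """Click the located target, then verify the post-condition; retry on failure."""
    for _ in range(max_retries + 1):
        x, y, confidence = perception.locate(capture(), instruction)
        if confidence < min_confidence:
            continue  # low confidence: take a fresh screenshot and try again
        click(x, y)
        answer = perception.ask(capture(), check_question)
        if answer.strip().lower() == expected_answer.strip().lower():
            return True
    return False  # caller decides on a fallback


# Fake perception model so the sketch runs without a real VLM or a real screen.
class FakePerception:
    def locate(self, screenshot: bytes, instruction: str) -> Tuple[int, int, float]:
        return 120, 48, 0.9

    def ask(self, screenshot: bytes, question: str) -> str:
        return "Yes"


clicks: List[Tuple[int, int]] = []
ok = act_and_verify(
    FakePerception(),
    capture=lambda: b"",
    click=lambda x, y: clicks.append((x, y)),
    instruction="Open Spotify",
    check_question="Is Spotify open?",
    expected_answer="yes",
)
print(ok, clicks)  # True [(120, 48)]
```

Keeping the perception model behind a narrow interface like this lets the planner, retry policy, and safety checks evolve independently of which checkpoint is deployed.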

Summary

Holo1.5 narrows a practical gap in CU systems by pairing strong coordinate grounding with concise interface understanding. If you need a commercially usable base today, start with Holo1.5-7B (Apache-2.0), benchmark it on your own screens, and instrument your planner and safety layers around it.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that offers in-depth coverage of machine learning and deep learning news that is both technically sound and understandable by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.
