UltraCUA: A foundation computer-use agent model that bridges the gap between general-purpose GUI agents and specialized API-based agents

Computer-use agents have been limited to primitives: they click, type, and scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model that builds a hybrid action space in which agents interleave low-level GUI actions with high-level programmatic tool calls. At each step, the model chooses the cheaper and more reliable move. The approach improves success rates and reduces step counts on OSWorld, and it transfers to WindowsAgentArena without Windows-specific training.

What does hybrid action change?

Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation in a single function with a clear signature and docstring. When no programmatic path is available, the agent falls back to a click or key press. The agent learns to alternate between the two modes. The goal is to reduce cascading errors and cut the number of steps. The research team positions this as a bridge between GUI-only CUAs and tool-centric agent frameworks.
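The idea of one action stream that mixes both kinds of move can be sketched as follows. This is a minimal illustration, not UltraCUA's actual interface: the `GuiAction` and `ToolCall` types, the `execute` dispatcher, and the `set_font_size` tool are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union


@dataclass
class GuiAction:
    """A low-level primitive such as a click, key press, or scroll."""
    kind: str                                   # e.g. "click", "type", "scroll"
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolCall:
    """A high-level programmatic call that stands in for several GUI steps."""
    name: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


Action = Union[GuiAction, ToolCall]


def execute(action: Action, tools: Dict[str, Callable[..., str]], log: List[str]) -> None:
    """Dispatch one hybrid action; both kinds share a single action stream."""
    if isinstance(action, ToolCall):
        log.append(tools[action.name](**action.kwargs))
    else:
        log.append(f"gui:{action.kind}:{sorted(action.args.items())}")


def set_font_size(size: int) -> str:
    """Hypothetical tool: set the font size of the current selection."""
    return f"tool:set_font_size:{size}"


log: List[str] = []
tools = {"set_font_size": set_font_size}
plan: List[Action] = [
    GuiAction("click", {"x": 120, "y": 340}),   # no tool covers this step
    ToolCall("set_font_size", {"size": 14}),    # one call instead of a menu walk
]
for step in plan:
    execute(step, tools, log)
```

The point of the sketch is that the policy emits one sequence, and each element is either a primitive or a tool call. The fallback to primitives is just the `else` branch.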

Scaled tool acquisition

UltraCUA builds its tool library with an automated pipeline that draws on three sources. It extracts keyboard shortcuts and commands from software documentation, integrates open source implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports 881 tools across 10 desktop domains. The largest categories are VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP are also covered in depth.
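A tool library of this kind might be organized as a registry of documented callables grouped by domain. The sketch below is an assumption about shape only: `TOOL_REGISTRY`, `register_tool`, `describe`, and both example tools are hypothetical, and the key sequences they return are illustrative rather than taken from the paper.

```python
import inspect
from typing import Callable, Dict, List

# Hypothetical registry: maps "<domain>.<tool name>" to a callable.
TOOL_REGISTRY: Dict[str, Callable[..., List[str]]] = {}


def register_tool(domain: str):
    """Decorator that files a callable under '<domain>.<function name>'."""
    def wrap(fn):
        TOOL_REGISTRY[f"{domain}.{fn.__name__}"] = fn
        return fn
    return wrap


@register_tool("vscode")
def toggle_line_comment() -> List[str]:
    """Toggle a line comment on the current line (shortcut-backed tool)."""
    return ["ctrl+/"]


@register_tool("writer")
def export_as_pdf(path: str) -> List[str]:
    """Export the open document as a PDF at `path` (wraps a GUI sequence)."""
    return ["menu:File", "menu:Export as PDF", f"type:{path}", "key:enter"]


def describe(name: str) -> str:
    """Render the signature and docstring an agent would see for a tool."""
    fn = TOOL_REGISTRY[name]
    return f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}"


def tools_per_domain() -> Dict[str, int]:
    """Count registered tools by domain prefix."""
    counts: Dict[str, int] = {}
    for name in TOOL_REGISTRY:
        domain = name.split(".", 1)[0]
        counts[domain] = counts.get(domain, 0) + 1
    return counts
```

Calling `describe("writer.export_as_pdf")` returns the kind of signature-plus-docstring text that could be placed in the agent's context, and `tools_per_domain()` gives the per-domain tally that the reported 881-tool breakdown summarizes.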

Verifiable synthetic tasks and trajectories

Training requires reliable supervision and stable rewards. UltraCUA uses a dual synthetic engine. The evaluator-first pipeline composes atomic verifiers for browser, file, image, and system state, then generates tasks that satisfy those checks. The instruction-first pipeline explores the operating system and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains, including Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi-app workflows. Chrome accounts for 2,826 tasks, the LibreOffice suite for 5,885, and multi-app workflows for 2,113.
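The evaluator-first idea, where checks are fixed before the instruction is written, can be illustrated with composable predicates over a snapshot of system state. Everything here is a simplified assumption: the `State` dictionary layout, the `file_exists` and `tab_open` verifiers, and the example task are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, List[str]]          # assumed snapshot: key -> list of items
Verifier = Callable[[State], bool]


def file_exists(path: str) -> Verifier:
    """Atomic check on file state."""
    return lambda state: path in state.get("files", [])


def tab_open(url: str) -> Verifier:
    """Atomic check on browser state."""
    return lambda state: url in state.get("browser_tabs", [])


def all_of(*checks: Verifier) -> Verifier:
    """Compose atomic checks into one task-level verifier."""
    return lambda state: all(check(state) for check in checks)


# Evaluator-first: fix the checks, then write an instruction they can verify.
verify = all_of(file_exists("/home/user/report.pdf"), tab_open("https://example.com"))
task = {
    "instruction": "Open https://example.com in the browser and save the page "
                   "as report.pdf in the home directory.",
    "verify": verify,
}

before = {"files": [], "browser_tabs": ["https://example.com"]}
after = {"files": ["/home/user/report.pdf"], "browser_tabs": ["https://example.com"]}
```

Because the verifier returns a plain boolean, the same object can label rollouts for supervised data and serve as the sparse success signal during reinforcement learning.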

A multi-agent rollout produces the successful hybrid trajectories. The planner uses OpenAI o3 for decision making, and the grounder uses GTA1-7B for accurate visual localization. The rollout yields approximately 26,800 successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.
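The planner and grounder split can be pictured as a loop in which the planner proposes each step and the grounder resolves GUI targets to screen coordinates. The sketch below uses scripted stubs in place of the real models. `rollout`, `scripted_planner`, `lookup_grounder`, and `keep_successful` are hypothetical names, and the coordinates are made up.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Dict[str, object]


def rollout(plan_step: Callable[[int], Optional[Step]],
            ground: Callable[[str], Tuple[int, int]],
            max_steps: int) -> List[Step]:
    """Collect one trajectory; GUI steps get coordinates from the grounder."""
    trajectory: List[Step] = []
    for t in range(max_steps):
        step = plan_step(t)
        if step is None:                      # planner signals completion
            break
        if step["type"] == "gui":
            step = {**step, "xy": ground(str(step["target"]))}
        trajectory.append(step)
    return trajectory


# Stubs standing in for the planner and grounder models (assumptions).
SCRIPT: List[Step] = [
    {"type": "tool", "name": "export_as_pdf", "args": {"path": "/tmp/out.pdf"}},
    {"type": "gui", "kind": "click", "target": "Save button"},
]


def scripted_planner(t: int) -> Optional[Step]:
    return SCRIPT[t] if t < len(SCRIPT) else None


def lookup_grounder(target: str) -> Tuple[int, int]:
    return {"Save button": (812, 604)}[target]


def keep_successful(trajectories: List[List[Step]], outcomes: List[bool]) -> List[List[Step]]:
    """Keep only trajectories whose task verifier passed."""
    return [traj for traj, ok in zip(trajectories, outcomes) if ok]


trajectory = rollout(scripted_planner, lookup_grounder, max_steps=15)
kept = keep_successful([trajectory, trajectory], [True, False])
```

Only verified trajectories survive the filter, which matches the article's description of the supervised data as successful traces.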

Training method

Training has two stages. The first stage is supervised fine-tuning. The model is trained on the successful trajectories for 3 epochs with a learning rate of 2e-5. The loss is applied turn by turn to avoid overweighting early steps. The second stage is online reinforcement learning. The model is trained for 150 steps with a learning rate of 1e-6 on verified tasks sampled by difficulty. Policy optimization follows a GRPO variant that uses clip-higher and removes both KL regularization and the format reward. The reward combines a sparse task outcome with a tool-use term. The experiments use NVIDIA H100 GPUs. The context is kept at around 32K tokens by limiting the number of exposed tools.
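The reinforcement learning ingredients named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the `tool_bonus` weight, the choice to grant the bonus only on success, and the clip bounds `eps_low=0.2` and `eps_high=0.28` are placeholder values chosen here.

```python
from statistics import mean, pstdev
from typing import List


def reward(task_success: bool, used_tool: bool, tool_bonus: float = 0.1) -> float:
    """Sparse outcome reward plus a tool-use term (weight is an assumption)."""
    outcome = 1.0 if task_success else 0.0
    bonus = tool_bonus if (task_success and used_tool) else 0.0
    return outcome + bonus


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, adv: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Clipped surrogate with a wider upper bound (clip-higher), no KL term."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)


# One group of four rollouts for the same task.
rewards = [reward(True, True), reward(True, False),
           reward(False, False), reward(False, True)]
advantages = group_advantages(rewards)
```

With a symmetric bound of 0.2, a probability ratio of 1.25 on a positive advantage would be clipped to 1.2. With the wider upper bound it passes through unchanged, which is the effect clip-higher is meant to have on actions the policy is learning to favor.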

OSWorld results

UltraCUA improves success rates at both the 7B and 32B scales. Under a 15-step budget, UltraCUA-32B reaches a 41.0% success rate against 29.7% for OpenCUA-32B, an absolute gain of 11.3 points. UltraCUA-7B reaches 28.9% against 23.4% for UI-TARS-1.5-7B. The gains persist under a 50-step budget. The per-domain breakdown shows consistent improvements across Chrome, Writer, VS Code, and cross-application tasks. The average number of steps decreases relative to the baselines. These shifts indicate better action choices, not just more attempts.
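The gains quoted for the 15-step budget can be checked with simple arithmetic. The snippet uses only the four success rates reported above. The relative gains it computes are against OpenCUA-32B and UI-TARS-1.5-7B, so they are not the same quantity as the 22% average relative improvement over the base model cited elsewhere in the article.

```python
# Success rates (%) at a 15-step budget on OSWorld, as reported in the article.
reported = {
    "32B": {"ultracua": 41.0, "baseline": 29.7, "baseline_name": "OpenCUA-32B"},
    "7B": {"ultracua": 28.9, "baseline": 23.4, "baseline_name": "UI-TARS-1.5-7B"},
}

gains = {}
for scale, row in reported.items():
    absolute = round(row["ultracua"] - row["baseline"], 1)     # percentage points
    relative = 100.0 * (row["ultracua"] - row["baseline"]) / row["baseline"]
    gains[scale] = {"absolute_points": absolute, "relative_percent": relative}

for scale, g in gains.items():
    print(f"{scale}: +{g['absolute_points']} points "
          f"({g['relative_percent']:.1f}% relative to {reported[scale]['baseline_name']})")
```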

Cross-platform transfer on WindowsAgentArena

UltraCUA is trained only on Ubuntu-based OSWorld data and is then evaluated on WindowsAgentArena. UltraCUA-7B reaches a 21.7% success rate. This exceeds the 18.1% of UI-TARS-1.5-7B and the 13.5% of a Qwen2 baseline trained with Windows data. The result suggests that hybrid action strategies learned on one platform transfer to another. The paper describes this as zero-shot platform generalization.

Main points

  • UltraCUA builds a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which shortens long, error-prone action chains.
  • The research team scales a library of reusable tools through an automated pipeline and pairs it with a synthetic data engine that generates more than 17,000 verifiable computer-use tasks for training and evaluation.
  • Training follows a two-stage recipe: supervised fine-tuning on successful hybrid trajectories, then online reinforcement learning on verifiable tasks to optimize when to invoke a tool and when to act in the GUI.
  • On OSWorld, UltraCUA reports an average relative improvement of 22% and 11% fewer steps than the base model, indicating gains in both reliability and efficiency.
  • The 7B model reaches 21.7% success on WindowsAgentArena without Windows-specific training, demonstrating cross-platform generalization of the hybrid action strategy.

UltraCUA moves computer-use agents from fragile chains of raw primitives to a hybrid action strategy that integrates GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools with an automated pipeline and pairs them with a synthetic data engine that generates more than 17,000 verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. Reported results include a 22% relative improvement and 11% fewer steps on OSWorld, and 21.7% success on WindowsAgentArena without Windows-specific training, demonstrating cross-platform transfer of the strategy.




Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.
