Salesforce AI Research has introduced GTA1, a new graphical user interface (GUI) agent that advances the state of the art in agentic human-computer interaction. GTA1 is designed to operate autonomously in real operating system environments such as Linux, and it addresses two key bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI's CUA (Computer-Using Agent) and sets a new record among open-source models.
The core challenge of GUI agents
GUI agents typically translate high-level user instructions into action sequences (clicks, keystrokes, or other UI interactions), observing the UI after each action to plan the next step. However, two problems persist:
- Planning ambiguity: Multiple valid action sequences can complete a task, leading to execution paths that differ in efficiency and reliability.
- Grounding accuracy: Translating abstract action proposals into accurate coordinate-level GUI interactions is especially challenging in high-resolution, dynamic interfaces.
GTA1 introduces new mechanisms to address both.
Smarter planning with test-time scaling
Traditional planners commit to a single action proposal at each decision point, which limits robustness. GTA1's test-time scaling offers a simple and effective alternative: multiple candidate actions are sampled concurrently at each step, and a multimodal judge model, typically a large language model, evaluates them and selects the most suitable one.
This technique avoids premature commitment to suboptimal plans and lets the agent explore execution paths more thoroughly without relying on future rollouts, which are infeasible in GUI environments because actions are irreversible. Importantly, the approach works with any planner and scales well as task complexity and action space size grow.
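The per-step sampling and judging described above can be sketched roughly as follows. This is a minimal illustration, not GTA1's implementation: `Action`, `select_action`, `toy_planner`, and `toy_judge` are hypothetical names, and the toy judge's "shortest description" rule merely stands in for a multimodal model that would score candidates against the current screenshot.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """Hypothetical action proposal, e.g. 'press Ctrl+S'."""
    description: str


def select_action(
    propose: Callable[[str, random.Random], Action],
    judge: Callable[[str, List[Action]], int],
    observation: str,
    num_candidates: int = 4,
    seed: int = 0,
) -> Action:
    """Sample several candidate actions for one step and let a judge pick one."""
    rng = random.Random(seed)
    candidates = [propose(observation, rng) for _ in range(num_candidates)]
    # Remove duplicate proposals while keeping first-seen order.
    unique = list(dict.fromkeys(candidates))
    # The judge returns the index of the candidate it prefers.
    return unique[judge(observation, unique)]


# Toy stand-ins (assumptions): a real planner and judge would be model calls.
OPTIONS = ["click File menu", "press Ctrl+S", "click Save icon"]


def toy_planner(observation: str, rng: random.Random) -> Action:
    return Action(rng.choice(OPTIONS))


def toy_judge(observation: str, candidates: List[Action]) -> int:
    # Placeholder preference: pick the shortest description.
    return min(range(len(candidates)), key=lambda i: len(candidates[i].description))


chosen = select_action(toy_planner, toy_judge, "text editor with an unsaved file")
print(chosen.description)
```

Only the chosen action is executed, so no candidate has to be rolled out and undone; the extra cost is limited to the additional sampling and one judging call per step.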
Reinforcement learning for grounding accuracy
For GUI grounding, most prior models rely on supervised fine-tuning to predict the center of the target UI element, which limits generalization. GTA1 instead adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning ("thinking") or bounding-box prediction, the model learns directly from click-based rewards: a reward is given only when the predicted coordinates fall within the correct UI element.
With this reward structure, GTA1 achieves state-of-the-art grounding accuracy without the complexity or overhead of chain-of-thought-style supervision. Notably, an ablation study shows that removing auxiliary signals such as "thinking" or IoU-based box rewards actually improves grounding performance, especially in static environments.
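To make the reward concrete, here is a small sketch of a click-based reward together with the group-relative normalization that GRPO-style methods apply to a group of sampled predictions. The function names, the box format, and the use of the population standard deviation are assumptions for illustration, not details confirmed by the article.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def click_reward(point: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target element, else 0.0."""
    x, y = point
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical click predictions sampled for the same instruction.
target_box = (100.0, 200.0, 180.0, 240.0)
sampled_clicks = [(120.0, 210.0), (90.0, 210.0), (150.0, 239.0), (300.0, 400.0)]

rewards = [click_reward(p, target_box) for p in sampled_clicks]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Clicks inside the element receive a positive advantage and misses a negative one; if every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.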
Cross-benchmark performance

GTA1 sets a new standard in several evaluations:
- OSWorld (task success rate): GTA1-7B reaches 45.2%, outperforming OpenAI CUA (42.9%) and Claude 3.7 (28.0%).
- ScreenSpot-Pro (grounding accuracy): GTA1-7B scores 50.1%, ahead of models such as UGround-72B (34.5%).
- ScreenSpot-V2 (cross-platform grounding): GTA1-72B reaches 94.8%, nearly matching the top proprietary models.
- OSWorld-G (Linux GUI grounding): GTA1-7B reaches 67.7%, surpassing all previous open-source approaches.
These results demonstrate the effectiveness of the planning and grounding innovations introduced in GTA1.
Other design highlights
- Data cleaning: OmniParser is used to filter misaligned annotations out of datasets such as Aria-UI and OS-Atlas, improving the fidelity of the training signal.
- Model scaling: The method scales across models from 7B to 72B parameters, with GTA1-7B offering the best trade-off between performance and compute.
- Judge reusability: The multimodal judge used in test-time scaling can be the same LLM used for planning, which reduces overhead.
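As an illustration of the data-cleaning idea in the list above, the sketch below keeps an annotated box only if it overlaps sufficiently with some element box found by a detector. The article does not describe the filtering rule, so the IoU criterion, the 0.3 threshold, and the function names are assumptions; the detector output is passed in as plain boxes and no OmniParser API is called.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_annotations(
    annotations: List[Box], detections: List[Box], threshold: float = 0.3
) -> List[Box]:
    """Keep annotations that match at least one detected element (assumed rule)."""
    return [
        box for box in annotations
        if any(iou(box, det) >= threshold for det in detections)
    ]


# Hypothetical example: the second annotation matches no detected element.
annotations = [(10.0, 10.0, 50.0, 30.0), (200.0, 200.0, 240.0, 220.0)]
detections = [(12.0, 11.0, 52.0, 31.0), (400.0, 400.0, 440.0, 420.0)]

kept = filter_annotations(annotations, detections)
print(kept)
```

A stricter threshold discards more borderline annotations at the cost of a smaller training set, so the value would need tuning against the datasets in question.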
Conclusion
GTA1 shows that a powerful and accurate GUI agent can be built with a modular two-stage framework, enhanced by test-time planning diversity and precise RL-based grounding. By dropping unnecessary complexity, such as chain-of-thought reasoning in static tasks, Salesforce AI Research introduces a lean, effective agent architecture that pushes the frontier of digital interaction forward.
Check out the Paper, Code, 7B Model, 32B Model and 72B Model. All credit for this research goes to the researchers on the project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that offers in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform has over 2 million monthly views, reflecting its popularity among its audience.