What is a “Computer-Use Agent”? From the Web to the Operating System – A Technical Explainer

Long story short: a computer-use agent is a VLM-driven UI agent that behaves like a human user on unmodified software. The OSWorld baseline started at 12.24% (humans: 72.36%); Claude Sonnet 4.5 now reports 61.4%. Gemini 2.5 Computer Use leads several live-web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%) but is not yet optimized for OS-level control. The next step is OS-level robustness, sub-second action loops, and stronger safety policies, with transparent training/evaluation recipes from the open community.

Definition

Computer-use agents (aka GUI agents) are vision-language models that observe the screen, ground the underlying UI elements, and issue bounded UI operations (clicking, typing, scrolling, key combinations) to complete tasks in unmodified applications and browsers. Public implementations include Anthropic’s Computer Use, Google’s Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent, which powers Operator.

Control loop

A typical runtime loop: (1) capture a screenshot plus state, (2) plan the next step via spatial/semantic grounding, (3) act through a constrained action space, (4) verify the result and retry on failure. Vendors document standardized action sets and guardrails; audited harnesses standardize comparisons.
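In code, that loop looks roughly like the Python sketch below. Every helper here (capture_screen, plan_action, execute_action, goal_satisfied) is a hypothetical stand-in for a real screenshot pipeline, VLM planner, and constrained OS/browser driver, not any vendor’s actual API.

```python
# Minimal sketch of the observe -> plan -> act -> verify loop described above.
from dataclasses import dataclass

@dataclass
class Action:
    verb: str    # e.g. "click_at", "type", "key_combo"
    args: dict

def capture_screen() -> bytes:
    """Placeholder: grab a screenshot (and any accessible UI state)."""
    return b""

def plan_action(screenshot: bytes, goal: str) -> Action:
    """Placeholder: call the VLM to ground the screen and propose one bounded action."""
    return Action(verb="click_at", args={"x": 100, "y": 200})

def execute_action(action: Action) -> None:
    """Placeholder: dispatch the action through a constrained driver (no raw shell access)."""
    pass

def goal_satisfied(screenshot: bytes, goal: str) -> bool:
    """Placeholder: check post-conditions against the latest screenshot."""
    return True

def run(goal: str, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        shot = capture_screen()             # (1) observe
        if goal_satisfied(shot, goal):      # (4) verify before acting again
            return True
        action = plan_action(shot, goal)    # (2) plan via grounding
        execute_action(action)              # (3) act through bounded verbs
    return False                            # step budget exhausted -> retry/escalate
```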

Benchmark landscape

  • OSWorld (HKU, April 2024): 369 real-world desktop/web tasks covering OS-level file I/O and multi-application workflows. At publication: humans 72.36%, best model 12.24%.
  • Current status (2025): Anthropic’s Claude Sonnet 4.5 reports 61.4% on OSWorld (still below human level, but a large jump from Sonnet 4’s 42.2%).
  • Live-web benchmarks: Google’s Gemini 2.5 Computer Use reports 69.0% on Online-Mind2Web (official leaderboard), 88.9% on WebVoyager, and 69.7% on AndroidWorld; the current model is browser-optimized and not yet optimized for OS-level control.
  • Online-Mind2Web specifics: 300 tasks across 136 live websites; results validated via Princeton/HAL and public Hugging Face Spaces.

Architectural components

  • Perception and grounding: periodic screenshots, OCR/text extraction, element localization, coordinate inference.
  • Planning: multi-step reasoning policy, often post-trained/RL-tuned for UI control.
  • Action schema: bounded verbs (click_at, type, key_combo, open_app) plus benchmark-specific exclusions to prevent tool shortcuts (see the sketch after this list).
  • Evaluation harness: live-web/VM sandboxes with third-party auditing and reproducible, execution-based scoring.
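As a concrete illustration of a bounded action schema (illustrative only, not any vendor’s actual API), the verbs above can be expressed as typed records that a validator checks before anything reaches the OS:

```python
# One way to constrain the planner to an allowed verb set with validated arguments.
from typing import Literal, TypedDict, Union

class ClickAt(TypedDict):
    verb: Literal["click_at"]
    x: int
    y: int

class TypeText(TypedDict):
    verb: Literal["type"]
    text: str

class KeyCombo(TypedDict):
    verb: Literal["key_combo"]
    keys: list[str]          # e.g. ["ctrl", "s"]

class OpenApp(TypedDict):
    verb: Literal["open_app"]
    name: str

UIAction = Union[ClickAt, TypeText, KeyCombo, OpenApp]

ALLOWED_VERBS = {"click_at", "type", "key_combo", "open_app"}

def validate(action: dict) -> UIAction:
    """Reject anything outside the bounded verb set before it is executed."""
    if action.get("verb") not in ALLOWED_VERBS:
        raise ValueError(f"disallowed verb: {action.get('verb')!r}")
    return action  # type: ignore[return-value]
```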

Vendor snapshot

  • Anthropic: Computer Use API; Sonnet 4.5 at 61.4% on OSWorld; documentation emphasizes pixel-accurate grounding, retries, and safety confirmations.
  • Google DeepMind: Gemini 2.5 Computer Use API plus model card; Online-Mind2Web 69.0%, WebVoyager 88.9%, AndroidWorld 69.7%; latency measurements and safety mitigations.
  • OpenAI: Operator research preview for US Pro users, powered by the Computer-Using Agent (CUA); separate system card and developer access via the Responses API; availability still limited/preview.

Where this is heading: Web → Operating System

  • Low-volume/one-off workflow cloning: a near-term direction is robust task imitation from a single demonstration (screen recording + narration). Treat this as an active research direction rather than a solved product feature.
  • Interactive latency budget: to preserve a direct-manipulation feel, actions should land within the 0.1–1 second human-interaction threshold; current stacks often exceed this due to vision and planning overhead. Expect engineering around incremental vision (diff frames), cache-aware OCR, and action batching.
  • OS-level breadth: file dialogs, multi-window focus, non-DOM UI, and system policies add failure modes not found in pure browser agents. Gemini’s current “browser-optimized, not OS-optimized” status highlights the next step.
  • Safety: prompt injection, dangerous actions, and data exfiltration via web content. Model cards describe allow/deny lists plus confirmation and blocking categories; expect typed action contracts and “consent gates” for irreversible steps (see the sketch after this list).
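A minimal sketch of the consent-gate idea, assuming hypothetical risk categories and a denylist (the names below are illustrative, not taken from any vendor’s model card):

```python
# Hedged sketch: block denylisted targets, require explicit consent for risky verbs.
from typing import Callable, Optional

RISKY_VERBS = {"delete_file", "send_email", "submit_payment"}   # hypothetical categories
DOMAIN_DENYLIST = {"accounts.example-bank.com"}                  # hypothetical denylist

def gate(action: dict, target_domain: Optional[str], ask_user: Callable[[str], bool]) -> bool:
    """Return True if the action may proceed; block or ask for consent otherwise."""
    if target_domain in DOMAIN_DENYLIST:
        return False                                             # hard block
    if action.get("verb") in RISKY_VERBS:
        return ask_user(f"Allow irreversible step: {action}?")   # explicit consent gate
    return True                                                  # default allow for bounded UI verbs
```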

Practical build notes

  • Start with a browser-first agent using documented action spaces and proven harnesses such as Online-Mind2Web.
  • Add recoverability: explicit post-conditions, screen validation, and rollback plans for long workflows (see the sketch after this list).
  • Treat benchmark numbers skeptically: prefer audited leaderboards or third-party harnesses over self-reported runs; OSWorld uses execution-based evaluation for reproducibility.
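For the recoverability point above, one way to encode explicit post-conditions with rollback looks like the following sketch; the step structure and callables are assumptions for illustration, not a prescribed design.

```python
# Each step bundles: an action, a post-condition check against the live screen,
# and an undo/rollback action.
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], bool], Callable[[], None]]  # (do, check, undo)

def run_workflow(steps: List[Step]) -> bool:
    completed: List[Step] = []
    for do, postcondition, undo in steps:
        do()
        if not postcondition():              # verify against the screen, not the plan
            for _, _, rollback in reversed(completed):
                rollback()                   # unwind already-completed steps
            return False
        completed.append((do, postcondition, undo))
    return True
```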

Open research and tools

Hugging Face’s Smol2Operator provides an open post-training recipe for upgrading a small VLM into a GUI-grounded operator, useful for labs and startups that prioritize reproducible training over leaderboard records.

Key takeaways

  • Computer-Using (GUI) agents are VLM-driven systems that are screen-aware and issue bounded UI actions (click/type/scroll) to operate unmodified applications; current public implementations include Anthropic Computer Use, Google Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent.
  • OSWorld (HKU) benchmarked 369 real desktop/web tasks via execution-based evaluation; at the time of publication, humans achieved 72.36%, while the best model achieved 12.24%, highlighting the fundamental and procedural gap.
  • Anthropic’s Claude Sonnet 4.5 reports 61.4% on OSWorld, below human level but a large jump over the previous Sonnet 4 result.
  • Gemini 2.5 Computer Use leads several real-time web benchmarks—Online-Mind2Web 69.0%, WebVoyager 88.9%, AndroidWorld 69.7%—and is explicitly optimized for browsers, but not yet optimized for OS-level controls.
  • OpenAI Operator is a research preview powered by a Computer-Using Agent (CUA) model that uses screenshots to interact with the GUI; availability is still limited.
  • Open-source track: Hugging Face’s Smol2Operator provides a reproducible post-training pipeline that turns small VLMs into GUI-grounded operators, standardizing action spaces and datasets.

References:

Benchmarks (OSWorld and Online-Mind2Web)

Anthropic (Computer Use and Sonnet 4.5)

Google DeepMind (Gemini 2.5 Computer Use)

OpenAI (Operator/CUA)

Open source: Hugging Face Smol2Operator


Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.

