Microsoft AI releases Fara-7B: an efficient agentic model for computer use

How do we safely let AI agents handle real-world web tasks like booking, searching, and filling out forms directly on our own devices, without sending everything to the cloud? Microsoft Research has released Fara-7B, a 7-billion-parameter agentic small language model designed for computer use. It is an open-weight computer-use agent that works from screenshots, predicts mouse and keyboard actions, and is small enough to run on a single user device, reducing latency and keeping browsing data local.

From chatbots to computer-use agents

Traditional chat-oriented LLMs return text. Computer-use agents like Fara-7B instead control browsers or desktop user interfaces to complete tasks such as filling out forms, booking travel, or comparing prices. They perceive the screen, infer page layout, and then issue low-level actions such as click, scroll, type, web_search, or visit_url.

Many existing systems rely on large multimodal models wrapped in complex scaffolding that parses accessibility trees and orchestrates multiple tools, which increases latency and often requires server-side deployment. Fara-7B compresses the behavior of such multi-agent systems into a single multimodal decoder-only model built on Qwen2.5-VL-7B. It consumes browser screenshots and text context, first emits a short chain of thought, and then produces a tool call with grounded parameters such as coordinates, text, or URLs.
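To make that interface concrete, here is a minimal sketch of what a single agent step could look like: a brief thought followed by one grounded tool call. The JSON field names and schema are illustrative assumptions, not Fara-7B's documented output format.

```python
import json

# Hypothetical single-step output from a computer-use model: a brief "thought"
# followed by one tool call with grounded parameters (pixel coordinates here).
# The field names are illustrative assumptions, not Fara-7B's documented schema.
raw_step = """
{
  "thought": "The search box sits near the top of the page; click it before typing.",
  "tool_call": {"name": "left_click", "arguments": {"x": 512, "y": 88}}
}
"""

step = json.loads(raw_step)
action = step["tool_call"]
print(step["thought"])
print(f"Next action: {action['name']} with args {action['arguments']}")
```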

FaraGen: synthetic trajectories of web interactions

The key bottleneck for computer-use agents is data. High-quality logs of multi-step human web interactions are rare and expensive to collect. The Fara project introduces FaraGen, a synthetic data engine that generates and filters web trajectories on live websites.

FaraGen uses a three-stage pipeline. Task proposal starts from seed URLs extracted from public corpora (such as ClueWeb22 and Tranco), which are classified into domains such as e-commerce, travel, entertainment, or forums. A large language model converts each URL into a realistic task a user might attempt on that page, such as booking a specific movie ticket or assembling a shopping list under review and material constraints. Tasks must be completable without logins or paywalls, fully specified, useful, and automatically verifiable.
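A minimal sketch of that task-proposal stage, assuming a generic text-in, text-out LLM callable; the prompt wording and acceptance criteria below are assumptions for illustration, not FaraGen's actual prompts.

```python
from typing import Callable

# Sketch of a FaraGen-style task-proposal step. The prompt text below is an
# illustrative assumption; `llm` is any callable mapping a prompt string to text.
PROPOSAL_PROMPT = """You are given a web page URL and its category.
Propose ONE realistic task a user might attempt on this page.
The task must be completable without logins or paywalls, fully specified,
useful, and automatically verifiable from the final page state or answer.
URL: {url}
Category: {category}
Task:"""

def propose_task(llm: Callable[[str], str], url: str, category: str) -> str:
    return llm(PROPOSAL_PROMPT.format(url=url, category=category)).strip()

# Stub LLM so the sketch runs end to end; a real pipeline would call a frontier model.
stub_llm = lambda prompt: "Find the earliest evening showtime for the newest release at this cinema."
print(propose_task(stub_llm, "https://example-cinema.com", "entertainment"))
```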

Task solving runs a multi-agent system based on Magentic-One and Magentic-UI. An Orchestrator agent plans high-level strategy and maintains a ledger of task progress. A WebSurfer agent receives the accessibility tree and set-of-marks screenshots, then issues browser actions such as click, type, scroll, visit_url, or web_search through Playwright. When a task requires clarification, a UserSimulator agent provides follow-up instructions.

Trajectory verification uses three LLM-based verifiers. An alignment verifier checks that the actions and final answer match the task intent. A rubric verifier generates rubrics for subgoals and scores partial completion. A multimodal verifier reviews screenshots along with the final answer to catch hallucinations and confirm that visible evidence supports success. These verifiers agree with human labels in 83.3% of cases, with false positive and false negative rates of roughly 17% to 18%.
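A hedged sketch of how such trajectory filtering could be wired together; the conservative all-must-agree rule and the stub checks below are assumptions, and a real implementation would call LLM and vision-language judges rather than lambdas.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative trajectory filter in the spirit of FaraGen's verification stage:
# a trajectory is kept only if every verifier (alignment, rubric, multimodal)
# agrees. The criteria below are placeholders, not the paper's actual checks.
@dataclass
class Trajectory:
    task: str
    actions: List[str]
    final_answer: str

def keep_trajectory(judges: List[Callable[[Trajectory], bool]], traj: Trajectory) -> bool:
    return all(judge(traj) for judge in judges)

# Stub judges so the sketch runs; real ones would prompt an LLM / VLM with the
# task, the action log, and the final screenshots.
alignment  = lambda t: "book" in t.task.lower() and "confirmation" in t.final_answer.lower()
rubric     = lambda t: len(t.actions) >= 3      # enough subgoals were attempted
multimodal = lambda t: True                     # placeholder for the screenshot check

traj = Trajectory(
    task="Book two tickets for the 7pm showing",
    actions=["visit_url", "left_click", "type", "left_click"],
    final_answer="Reached the confirmation page for two 7pm tickets.",
)
print(keep_trajectory([alignment, rubric, multimodal], traj))  # True
```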

After filtering, FaraGen yields 145,603 trajectories with 1,010,797 steps across 70,117 unique domains. Trajectories range from 3 to 84 steps, with an average of 6.9 steps and roughly 0.5 unique domains per trajectory, indicating that many tasks involve sites not seen elsewhere in the dataset. Generating the data with frontier models such as GPT-5 and o3 costs approximately $1 per verified trajectory.

Model architecture

Fara-7B is a multimodal decoder-only model with Qwen2.5-VL-7B as the base. It takes as input the user's goal, the latest browser screenshot, and a complete history of previous thoughts and actions, within a 128,000-token context window. At each step, the model first generates a chain of thought describing the current state and plan, and then outputs a tool call specifying the next action and its parameters.

The tool space matches the Magentic-UI computer_use interface. It includes key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait, and terminate. Coordinates are predicted directly as pixel positions on the screenshot, which allows the model to run without access to an accessibility tree at inference time.
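A minimal sketch of an inference-time loop around such a tool space, using Playwright's synchronous Python API to execute pixel-level actions (it requires the `playwright` package and its browsers to be installed); `predict_step` is a stand-in for the model, and its output schema is an assumption rather than Fara-7B's actual interface.

```python
from playwright.sync_api import sync_playwright

def predict_step(screenshot_png: bytes, goal: str, history: list) -> dict:
    """Stand-in for the model's forward pass. Always returns the same action here;
    the dict layout is an illustrative assumption, not Fara-7B's real schema."""
    return {"thought": "Open the target site first.",
            "tool": "visit_url", "args": {"url": "https://example.com"}}

def run_agent(goal: str, max_steps: int = 5) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        history = []
        for _ in range(max_steps):
            step = predict_step(page.screenshot(), goal, history)
            tool, args = step["tool"], step["args"]
            if tool == "left_click":
                page.mouse.click(args["x"], args["y"])     # pixel coordinates on the screenshot
            elif tool == "type":
                page.keyboard.type(args["text"])
            elif tool == "scroll":
                page.mouse.wheel(0, args.get("dy", 600))
            elif tool == "visit_url":
                page.goto(args["url"])
            elif tool == "history_back":
                page.go_back()
            elif tool == "terminate":
                break
            history.append(step)
        browser.close()

run_agent("Find the earliest evening showtime")
```

Because actions are grounded in screenshot pixels, a loop like this never needs the DOM or accessibility tree, which is what makes the single-model setup attractive for on-device use.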

Training uses supervised fine-tuning on approximately 1.8 million samples mixed from multiple data sources. These include FaraGen trajectories broken down into observation, reflection, and action steps, grounding and UI localization tasks, screenshot-based visual Q&A and captioning, and safety and refusal datasets.
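The trajectory portion of that mixture can be pictured as one training sample per step: the model sees the goal, the action history so far, and the current screenshot, and must reproduce that step's thought and tool call. A minimal sketch follows, with the sample layout being an assumption rather than the released training format.

```python
from typing import Dict, List

def trajectory_to_sft_samples(goal: str, steps: List[Dict]) -> List[Dict]:
    """Split one browsing trajectory into per-step supervised fine-tuning samples.
    The inputs/target layout here is an illustrative assumption."""
    samples, history = [], []
    for step in steps:
        samples.append({
            "inputs": {"goal": goal, "history": list(history), "screenshot": step["screenshot"]},
            "target": {"thought": step["thought"], "tool_call": step["tool_call"]},
        })
        history.append(step["tool_call"])
    return samples

steps = [
    {"screenshot": "step0.png", "thought": "Open the store page.",
     "tool_call": {"name": "visit_url", "arguments": {"url": "https://example.com"}}},
    {"screenshot": "step1.png", "thought": "Search for the product.",
     "tool_call": {"name": "type", "arguments": {"text": "usb-c cable"}}},
]
print(len(trajectory_to_sft_samples("Buy a USB-C cable", steps)))  # -> 2
```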

Benchmarks and efficiency

Microsoft evaluates Fara-7B on four live-web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and the new WebTailBench, which focuses on underrepresented task segments such as restaurant reservations, job applications, real-estate search, comparison shopping, and tasks that span multiple sites.

On these benchmarks, Fara-7B achieves success rates of 73.5% on WebVoyager, 34.1% on Online-Mind2Web, 26.2% on DeepShop, and 38.4% on WebTailBench. This beats the 7B computer-use baseline UI-TARS-1.5-7B, which scores 66.4, 31.3, 11.6, and 19.5, respectively, and compares favorably with larger systems such as OpenAI's computer-use preview and SoM agent configurations built on GPT-4o.

On WebVoyager, Fara-7B uses an average of 124,000 input tokens and 1,100 output tokens per task, across approximately 16.5 actions. Using market token prices, the research team estimates the average cost per task at about $0.025, versus roughly $0.30 for SoM agents powered by proprietary reasoning models such as GPT-5 and o3. Fara-7B uses a similar number of input tokens but about one-tenth the output tokens of these SoM agents.
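As a rough sanity check on that figure, the reported token counts land near $0.025 per task if one assumes a hosted 7B-class price of about $0.20 per million tokens for both input and output; that price is an assumption for illustration, not the rate the team used.

```python
# Back-of-the-envelope cost per WebVoyager task for Fara-7B.
# Token counts come from the article; the per-token price is an assumption.
input_tokens = 124_000
output_tokens = 1_100
price_per_million_usd = 0.20   # assumed flat rate for a hosted 7B-class model

cost = (input_tokens + output_tokens) * price_per_million_usd / 1_000_000
print(f"~${cost:.3f} per task")   # ~$0.025, in line with the reported estimate
```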

Key takeaways

  • Fara-7B is a 7B-parameter, open-weight computer-use agent built on Qwen2.5-VL-7B that operates directly from screenshots and text and outputs grounded actions such as clicks, typing, and navigation without relying on accessibility trees at inference time.
  • The model was trained using 145,603 validated browser trajectories and 1,010,797 steps generated by the FaraGen pipeline, which uses multi-agent task proposal, solving, and LLM-based validation on live websites across 70,117 domains.
  • Fara-7B achieves success rates of 73.5% on WebVoyager, 34.1% on Online-Mind2Web, 26.2% on DeepShop, and 38.4% on WebTailBench, significantly improving over the 7B UI-TARS-1.5 baseline on all four benchmarks.
  • On WebVoyager, Fara-7B uses about 124,000 input tokens and 1,100 output tokens per task, with an average of 16.5 actions, at an estimated cost of about $0.025 per task, roughly an order of magnitude cheaper than SoM agents backed by GPT-5-class models while using about one-tenth the output tokens.

Editor’s Note

Fara-7B is a useful step toward practical computer-using agents that run on local hardware with low inference cost while preserving privacy. The combination of Qwen2.5-VL-7B, FaraGen synthetic trajectories, and WebTailBench provides a clear, well-instrumented path from multi-agent data generation to a single compact model that matches or exceeds larger systems on key benchmarks while enforcing critical-point confirmations and refusal safeguards.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a broad audience. The platform has more than 2 million monthly views, reflecting its popularity among readers.

