Are a few curated tool-use demonstrations more effective for building software agents than large volumes of general instruction data? A group of researchers from Shanghai Jiao Tong University and the SII Generative AI Research Lab (GAIR) proposes LIMI ("Less Is More for Agents"), a supervised fine-tuning method that turns a base model into a capable software/research agent using only 78 samples. LIMI scores 73.5% on average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9), and even surpasses a variant trained on 10,000 samples while using 128× less data.

What’s new?
- Agency efficiency principle: LIMI argues that agentic capability scales with data quality and structure rather than raw sample count. The research team fine-tuned GLM-4.5/GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and reports large gains on agentic and generalization suites (TAU2-bench, EvalPlus HumanEval/MBPP, DS-1000, SciCode).
- Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures a complete multi-turn workflow (model reasoning, tool calls, and environment observations) inside the SII-CLI execution environment. Tasks span "vibe coding" (interactive software development) and research workflows (search, analysis, experimental design).
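To make the trajectory format concrete, here is a minimal sketch of what one such multi-turn training sample might look like. The field names and schema are illustrative assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str     # "assistant" (reasoning / tool call) or "tool" (environment observation)
    content: str  # model text, tool invocation, or environment output

@dataclass
class Trajectory:
    query: str                      # the practitioner- or PR-derived task
    turns: list[Turn] = field(default_factory=list)

    def token_estimate(self) -> int:
        # Rough whitespace-token count; real trajectories span ~13k-152k tokens.
        return sum(len(t.content.split()) for t in self.turns)

# Hypothetical example of one recorded workflow step.
traj = Trajectory(query="Add a retry wrapper around the HTTP client")
traj.turns.append(Turn("assistant", "Plan: locate the client module, wrap calls with backoff."))
traj.turns.append(Turn("tool", "client.py: 120 lines, uses requests.Session"))
print(traj.token_estimate())
```

The point of the schema is that supervision lives in the full interleaving of reasoning, tool calls, and observations, not just in final answers.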


How does it work?
- Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configuration across comparisons (to isolate data effects).
- Data construction: 60 real queries from practitioners plus 18 synthesized from high-star GitHub PRs (quality-assured by PhD annotators). For each query, LIMI records the full agent trajectory through to successful completion inside the internal SII-CLI.
- Evaluation: AgencyBench (R = 3 rounds) with FTFC, SR@3, and RC@3; plus a generalization suite (TAU2-airline/retail pass^4, EvalPlus HumanEval/MBPP, DS-1000, SciCode).
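The three AgencyBench-style metrics can be sketched as follows. This assumes (my gloss, not the benchmark's official definition) that FTFC is the fraction of tasks functionally complete on the first turn, SR@3 the fraction solved within 3 rounds, and RC@3 the mean ratio of required capabilities satisfied after 3 rounds; the episode schema is likewise hypothetical:

```python
def score(episodes: list[dict]) -> dict[str, float]:
    n = len(episodes)
    # FTFC: solved on the very first turn.
    ftfc = sum(e["solved_round"] == 1 for e in episodes) / n
    # SR@3: solved within R = 3 rounds.
    sr3 = sum(e["solved_round"] is not None and e["solved_round"] <= 3
              for e in episodes) / n
    # RC@3: mean fraction of required capabilities met after 3 rounds.
    rc3 = sum(e["caps_met"] / e["caps_total"] for e in episodes) / n
    return {"FTFC": ftfc, "SR@3": sr3, "RC@3": rc3}

episodes = [
    {"solved_round": 1,    "caps_met": 4, "caps_total": 4},
    {"solved_round": 3,    "caps_met": 3, "caps_total": 4},
    {"solved_round": None, "caps_met": 1, "caps_total": 4},  # unsolved
]
print(score(episodes))  # FTFC = 1/3, SR@3 = 2/3, RC@3 ≈ 0.667
```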


Results
- AgencyBench (avg.): 73.5% for LIMI vs. 45.1% for GLM-4.5 (+28.4 points); FTFC 71.7% vs. 37.8%; SR@3 74.6% vs. 47.4%.
- Data efficiency: LIMI (78 samples) beats GLM-4.5 trained with AFM-CodeAgent SFT (10,000 samples): 73.5% vs. 47.8%, a +53.7% relative improvement with 128× less data. The AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples) baselines also trail.
- Generalization: across tool use, coding, and scientific computing, LIMI averages ~57%, exceeding GLM-4.5 and other baselines; even without tool access, LIMI still leads slightly (50.0% vs. 48.7% for GLM-4.5), indicating intrinsic gains beyond environment tooling.
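The data-efficiency claims above can be sanity-checked with quick arithmetic on the reported figures:

```python
# Numbers as reported in the article.
limi_avg, afm_avg = 73.5, 47.8   # AgencyBench averages (%)
limi_n, afm_n = 78, 10_000       # training sample counts

data_ratio = afm_n / limi_n          # how much less data LIMI used
abs_gain = limi_avg - afm_avg        # absolute points
rel_gain = abs_gain / afm_avg * 100  # relative improvement (%)

print(round(data_ratio))   # -> 128
print(round(abs_gain, 1))  # -> 25.7
print(round(rel_gain, 1))  # -> 53.8, matching the reported ~53.7% relative gain
```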


Key Points
- Data efficiency dominates scale. LIMI reaches a 73.5% AgencyBench average using 78 curated trajectories, exceeding GLM-4.5 (45.1%) and showing a +53.7% relative advantage over a 10K-sample SFT baseline with 128× less data.
- Trajectory quality, not bulk. The training data consists of long-horizon, tool-grounded workflows in collaborative software development and scientific research, captured through the SII-CLI execution stack.
- Grounded gains. LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3; detailed tables show large margins over baselines, and the generalization suite (TAU2-bench, EvalPlus HumanEval/MBPP, DS-1000, SciCode) averages 57.2%.
- Works across scale. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their base models, indicating the method is robust to model size.
The research team trained GLM-4.5 variants on 78 curated, long-horizon trajectories captured in the SII-CLI environment, covering software engineering and research tasks. They report an AgencyBench average of 73.5% across the FTFC, RC@3, and SR@3 metrics, versus 45.1% for baseline GLM-4.5. Comparison with a 10,000-sample AFM-CodeAgent SFT baseline showed 73.5% vs. 47.8%, and tool-free evaluation indicated intrinsic gains (50.0% for LIMI vs. 48.7% for GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
Check out the Paper, GitHub page, and Model card on HF. Feel free to check out our GitHub page for tutorials, code, and notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, demonstrating its popularity among readers.