Researchers at Georgia Tech and Stanford University introduce MLE-Dojo: A Gym-style framework designed to train, evaluate, and benchmark autonomous machine learning engineering (MLE) agents

Machine Learning Engineering (MLE) involves building machine learning systems through iterative experimentation, model optimization, and robust data pipeline processing. As model complexity increases, so do the challenges of coordinating end-to-end workflows effectively. To address these needs, researchers have explored automating MLE tasks with AI agents. Large language models (LLMs), especially those with strong coding and problem-solving abilities, have shown significant potential to enhance this process. Their role in automating structured workflows is now being tested through rigorous benchmarks and environments tailored to simulate real-world MLE scenarios.

The main obstacle to automating machine learning engineering is the inherently iterative, feedback-driven nature of the work. Tasks such as hyperparameter tuning, model debugging, and data preprocessing cannot be solved in a single step; they must be revised and re-evaluated repeatedly. Traditional evaluation tools for AI models often rely on static datasets and do not allow real-time error feedback or interactive problem solving. This limitation prevents LLM agents from learning through trial and error, an essential part of mastering engineering tasks that evolve or require multiple attempts to succeed.

Early tools for evaluating LLMs on engineering or coding tasks focus primarily on individual subtasks or isolated challenges. These include benchmarks such as MLAgentBench and DSBench, which rely on narrow test cases drawn from Kaggle competitions or synthetic datasets. Although they go beyond basic tasks, they do not let agents perform code execution, debugging, or result interpretation in real-time settings. Other environments, such as SWE-Gym, focus only on software engineering and lack support for machine-learning-specific workflows. These limitations slow the creation of versatile, high-performing MLE agents that can handle the complexity of real-world projects.

Researchers at Georgia Tech and Stanford University have introduced MLE-Dojo, a framework with an interactive environment that connects LLM agents to real-world machine learning tasks drawn from more than 200 Kaggle competitions. The framework spans tabular data analysis, computer vision, natural language processing, and time-series forecasting challenges. MLE-Dojo allows agents to write, execute, and revise code in sandboxed, feedback-rich settings. The goal is to replicate the interaction cycles followed by human engineers, thereby providing structured learning for agents. The environment includes pre-installed dependencies, evaluation metrics, and support for both supervised fine-tuning and reinforcement learning strategies.

MLE-Dojo is built from modular components that support a wide range of MLE challenges. Each task runs in its own Docker container, isolated for safety and reproducibility. Agents interact with the environment through a partially observable Markov decision process: they receive observations, perform actions, and earn rewards based on performance. The environment supports five main action types: requesting task information, validating code, executing code, retrieving interaction history, and resetting the environment. It also provides a detailed observation space that includes the dataset, execution results, and error messages. After each interaction, the agent receives structured feedback, enabling incremental improvement. This modular setup helps maintain interoperability and simplifies adding new tasks to the system.
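To make that feedback cycle concrete, here is a minimal Python sketch of such a write-validate-execute-revise loop. The interface and method names below (MLEDojoEnv, request_info, validate_code, execute_code, get_history, reset) are illustrative stand-ins chosen for this article, not MLE-Dojo's published API.

```python
# A minimal sketch of the interaction loop described above. The environment
# interface below is a hypothetical stand-in, not MLE-Dojo's actual API.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Observation:
    dataset_info: str      # description of the competition data
    execution_result: str  # stdout / metric output from the last run
    error_message: str     # traceback or validation errors, empty if none

class MLEDojoEnv(Protocol):
    def reset(self) -> Observation: ...
    def request_info(self) -> Observation: ...
    def validate_code(self, code: str) -> Observation: ...
    def execute_code(self, code: str) -> tuple[Observation, float]: ...
    def get_history(self) -> list[Observation]: ...

class Agent(Protocol):
    def propose(self, obs: Observation) -> str: ...

def run_episode(env: MLEDojoEnv, agent: Agent, max_steps: int = 10) -> float:
    """Write -> validate -> execute -> revise, keeping the best reward seen."""
    obs = env.reset()
    best_reward = float("-inf")
    for _ in range(max_steps):
        code = agent.propose(obs)             # LLM drafts or revises a solution
        check = env.validate_code(code)       # sandbox check before a full run
        if check.error_message:
            obs = check                       # structured error feedback drives revision
            continue
        obs, reward = env.execute_code(code)  # run inside the task container, score it
        best_reward = max(best_reward, reward)
    return best_reward
```

The key design point mirrored here is that validation errors are returned as observations rather than terminating the episode, so the agent can learn from failed attempts within a single task.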

The evaluation spans four core machine learning domains and covers eight frontier LLMs: Gemini-2.5-Pro, DeepSeek-R1, o3-mini, GPT-4o, GPT-4o-mini, Gemini-2.0-Pro, Gemini-2.0-Flash, and DeepSeek-V3. Gemini-2.5-Pro obtained the highest Elo rating at 1257, followed by DeepSeek-R1 at 1137 and o3-mini at 1108. On the HumanRank metric, Gemini-2.5-Pro leads with 61.95%, indicating performance above the human benchmark. Models like GPT-4o-mini execute code only about 20% of the time, adopting a conservative strategy, while o3-mini performs execution in more than 90% of cases. Gemini-2.5-Pro keeps the lowest average failure rate across the validation and execution phases, underscoring its robustness. Among the domains, computer vision poses the biggest challenge, with most models scoring below 60 on HumanRank. Reasoning models generally produce longer outputs and maintain greater performance consistency across iterations.
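For readers unfamiliar with Elo, the reported ratings translate into pairwise win probabilities through the standard Elo expected-score formula. The snippet below illustrates that relationship using the quoted scores; it is a general illustration, not a reproduction of the paper's exact rating procedure.

```python
# Standard Elo expected-score formula; not taken from MLE-Dojo's evaluation code.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A outperforms model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Using the reported ratings: Gemini-2.5-Pro (1257) vs. DeepSeek-R1 (1137)
p = elo_expected_score(1257, 1137)
print(f"Expected head-to-head score: {p:.2f}")  # ~0.67
```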

This study highlights the difficulty of applying LLMs to complete machine learning workflows and presents MLE-Dojo as a comprehensive solution in which those workflows can be learned interactively, not merely completed. By simulating the engineering environment more faithfully, MLE-Dojo sets a new standard for training and evaluating automated MLE agents.


Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 90k+ ML SubReddit.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
