AI has the potential to make expert medical reasoning more accessible, but current assessments often rely on simplified, static scenarios. Real clinical practice is far more dynamic: doctors iteratively adjust their diagnostic hypotheses, ask targeted questions, and interpret new information as it arrives. This iterative process helps them refine assumptions, weigh the costs and benefits of each test, and avoid jumping to conclusions. Although language models perform strongly on structured exams, those exams do not reflect real-world complexity, where premature closure and overtesting remain serious problems that static assessments often miss.
Medical problem-solving has been studied for decades, with early AI systems using Bayesian frameworks to guide sequential diagnosis in specialties such as pathology and trauma care. However, these approaches were hard to scale because they required extensive expert input. Recent research has turned to clinical reasoning with language models, typically evaluated on static multiple-choice benchmarks that are now saturated. Projects such as AMIE and NEJM-CPC-based evaluations introduce more complex case material but still rely on fixed vignettes. While some newer approaches assess conversation quality or basic information gathering, few capture the full complexity of real-time, cost-sensitive diagnostic decision-making.
To better reflect real-world clinical reasoning, researchers from Microsoft AI developed SDBench, a benchmark built from 304 real diagnosed cases from the New England Journal of Medicine, in which doctors or AI systems must iteratively ask questions and order tests before committing to a final diagnosis. A language model acts as a gatekeeper, disclosing findings only when they are specifically requested. To improve performance, the team also introduced MAI-DxO, an orchestration system designed with physicians that simulates a virtual medical panel to select high-value, cost-effective tests. When paired with models such as OpenAI's o3, it reaches up to 85.5% accuracy while significantly reducing diagnostic costs.
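The virtual-panel idea can be sketched in a few lines. This is a minimal, hypothetical illustration of how an orchestrator might weigh test proposals from several role agents; the role names, the value-per-dollar selection rule, and all function signatures are assumptions for illustration, not the published MAI-DxO design.

```python
# Hypothetical sketch of a virtual-panel orchestrator in the spirit of
# MAI-DxO. Each role agent (names are illustrative assumptions) would
# propose a next test; the orchestrator picks the one with the best
# estimated diagnostic value per dollar.

ROLES = ["hypothesis generator", "test selector", "challenger", "cost steward"]

def panel_decide(proposals, cost_table, value_estimates):
    """Pick the proposed test with the best value-per-dollar.

    proposals: list of test names suggested by the role agents
    cost_table: test name -> price in dollars
    value_estimates: test name -> estimated diagnostic value (0..1)
    """
    def score(test):
        cost = cost_table.get(test, 1.0)
        # Guard against zero/negative prices in the lookup table.
        return value_estimates.get(test, 0.0) / max(cost, 1.0)

    # Deduplicate proposals and choose the highest-scoring one.
    return max(set(proposals), key=score)
```

Under this toy rule, a cheap test with moderate diagnostic value can beat an expensive test with somewhat higher value, which is the cost-consciousness the panel is meant to encode.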
The Sequential Diagnosis Benchmark (SDBench) was constructed from 304 NEJM Case Challenge cases (2017–2025) covering a wide range of clinical conditions. Each case is translated into an interactive simulation in which the diagnostic agent can ask questions, request tests, or commit to a final diagnosis. A gatekeeper, powered by language models and guided by clinical rules, responds to these actions using details from the real case or synthetic but consistent findings. Diagnoses are scored by a judge model using a rubric focused on clinical relevance, and costs are estimated from CPT codes and pricing data to reflect real-world diagnostic constraints and trade-offs.
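The episode mechanics described above can be sketched as a simple loop. This is a hypothetical outline, not the benchmark's actual code: the `Action` dataclass, the `agent`/`gatekeeper`/`judge` interfaces, and the turn limit are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "question", "test", or "diagnose"
    content: str   # free-text question, test name, or final diagnosis

def run_episode(agent, gatekeeper, judge, cost_table, max_turns=20):
    """Drive one sequential-diagnosis case to completion (illustrative)."""
    transcript = []      # (action, finding) pairs the agent has gathered
    total_cost = 0.0     # accumulated price of ordered tests

    for _ in range(max_turns):
        action = agent.next_action(transcript)
        if action.kind == "diagnose":
            # A judge model scores the final diagnosis for clinical relevance.
            return judge.score(action.content), total_cost
        if action.kind == "test":
            # Tests are priced from a CPT-code-derived lookup table.
            total_cost += cost_table.get(action.content, 0.0)
        # The gatekeeper reveals only findings that were explicitly requested.
        finding = gatekeeper.respond(action)
        transcript.append((action, finding))

    # Ran out of turns without committing to a diagnosis.
    return 0.0, total_cost
```

The key property this loop captures is that information is never free: every finding must be asked for, and every test adds to the bill before the diagnosis is scored.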
The researchers evaluated a range of AI diagnostic agents on SDBench and found that MAI-DxO consistently outperforms both off-the-shelf models and practicing physicians. While the standard models exhibit a trade-off between cost and accuracy, MAI-DxO built on o3 delivers higher accuracy at lower cost through structured reasoning and deliberate decision-making: it reaches 81.9% accuracy at $4,735 per case, versus 78.6% for off-the-shelf o3 at $7,850. The system also transfers across multiple models and to held-out test data, indicating strong generalization. It substantially improves weaker models and helps stronger models use resources more efficiently, cutting unnecessary testing through smarter information gathering.
In short, SDBench is a new diagnostic benchmark that turns NEJM CPC cases into realistic, interactive challenges, requiring AI systems or doctors to actively ask questions, order tests, and commit to a diagnosis, with each action carrying an associated cost. Unlike static benchmarks, it mimics real clinical decision-making. The researchers also introduced MAI-DxO, an orchestrator that simulates a panel of diverse medical roles to achieve high diagnostic accuracy at lower cost. While the current results are promising, especially on complex cases, limitations include the absence of common everyday conditions and real-world operational constraints. Future work aims to test the system in real clinics and low-resource settings, with potential impact on global health and medical education.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.