OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in Healthcare

OpenAI has released HealthBench, an open-source evaluation framework designed to measure the performance and safety of large language models (LLMs) in realistic healthcare scenarios. Developed in collaboration with 262 physicians practicing in 60 countries across 26 medical specialties, HealthBench addresses the limitations of existing benchmarks by focusing on real-world applicability, expert validation, and diagnostic coverage.
Addressing the benchmark gap in healthcare AI
Existing healthcare AI benchmarks often rely on narrow, structured formats such as multiple-choice exams. Although useful for initial assessment, these formats fail to capture the complexity and nuance of real-world clinical interactions. HealthBench shifts toward a more representative evaluation paradigm, incorporating more than 5,000 multi-turn conversations between a model and either a layperson or a healthcare professional. Each conversation ends with a user prompt, and model responses are evaluated against example-specific rubrics written by physicians.
Each rubric consists of clearly defined criteria, positive and negative, each with an associated point value. These criteria capture behavioral attributes such as clinical accuracy, communication clarity, completeness, and instruction adherence. Across the benchmark, HealthBench evaluates more than 48,000 unique criteria, scored by a model-based grader that has been validated against expert physician judgments.
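To make the rubric mechanics concrete, here is a minimal sketch of how such scoring might work. The class names, point values, and the clipped-fraction aggregation are illustrative assumptions, not OpenAI's released implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of rubric-based scoring as described above; the exact
# aggregation used by HealthBench may differ.

@dataclass
class Criterion:
    description: str   # e.g. "Advises the user to seek emergency care"
    points: int        # positive for desired behavior, negative for undesired
    met: bool          # judgment produced by the model-based grader

def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of the maximum achievable points earned by a response.

    Earned points sum over every met criterion (negative criteria subtract);
    the maximum counts only positively weighted criteria. The result is
    clipped to [0, 1].
    """
    earned = sum(c.points for c in criteria if c.met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    return min(max(earned / max_points, 0.0), 1.0)

# Example: a response that refers the user to emergency care but overreaches.
criteria = [
    Criterion("Recommends immediate emergency evaluation", 10, True),
    Criterion("Asks about symptom onset and severity", 5, False),
    Criterion("States a definitive diagnosis without caveats", -5, True),
]
print(rubric_score(criteria))  # 5 of 15 possible points -> 0.333...
```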
Benchmark structure and design
HealthBench organizes its evaluation around seven key themes: emergency referrals, global health, health data tasks, context-seeking, expertise-tailored communication, response depth, and responding under uncertainty. Each theme represents a distinct real-world challenge in medical decision-making and user interaction.
In addition to the standard benchmark, OpenAI introduces two variants:
- HealthBench Consensus: a subset emphasizing 34 physician-validated criteria, designed to reflect critical aspects of model behavior such as recommending urgent care or seeking additional context.
- HealthBench Hard: a more difficult subset of 1,000 conversations, selected for their ability to challenge current frontier models.
These components enable a detailed stratification of model behavior across conversation types and evaluation axes, providing finer-grained insight into model capabilities and shortcomings.
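As an illustration, one might slice the released examples by theme along these lines. The file name and the `theme` field are assumptions about the data layout, not the actual schema in OpenAI's repository:

```python
import json
from collections import Counter

# Hypothetical sketch: count conversations per theme. Adjust the file name
# and field names to match the actual released data.

def load_examples(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

examples = load_examples("healthbench_eval.jsonl")  # hypothetical file name
by_theme = Counter(ex.get("theme", "unknown") for ex in examples)
for theme, count in by_theme.most_common():
    print(f"{theme}: {count} conversations")
```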

Evaluating model performance
OpenAI evaluated several models on HealthBench, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the newer o3 model. The results show clear progress: GPT-3.5 Turbo scores 16%, GPT-4o scores 32%, and o3 reaches 60%. Notably, GPT-4.1 nano, a smaller and more cost-effective model, outperforms GPT-4o while reducing inference costs by a factor of 25.
Performance varies across themes and evaluation axes. Emergency referrals and expertise-tailored communication are areas of relative strength, while context-seeking and completeness present greater challenges. A detailed breakdown shows that completeness correlates most strongly with overall score, highlighting its importance in health-related tasks.
OpenAI also compared model outputs with responses written by physicians. While physicians could improve drafts generated by earlier models, unassisted physicians often produced responses that scored lower than the models'. These findings suggest a potential role for LLMs as collaborative tools in clinical documentation and decision support.

Reliability and meta-evaluation
HealthBench also includes mechanisms for assessing model consistency. The worst-at-k metric quantifies how much performance degrades across multiple runs. Although newer models show improved stability, variability remains an area of ongoing research.
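A minimal sketch of the worst-at-k idea, under one plausible reading of the description above: score k sampled responses per example, keep the worst score for each example, and average across the benchmark. OpenAI's exact definition may differ in detail:

```python
# Made-up scores and a simplified metric; this is an assumption about how
# worst-at-k aggregates, not OpenAI's reference implementation.

def worst_at_k(example_scores: list[list[float]]) -> float:
    """example_scores[i] holds rubric scores of k independent responses to
    example i; the metric is the mean of the per-example worst scores."""
    return sum(min(scores) for scores in example_scores) / len(example_scores)

# Hypothetical rubric scores for three examples, k = 4 sampled responses each.
scores = [
    [0.72, 0.65, 0.70, 0.41],  # one weak response drags this example down
    [0.58, 0.60, 0.55, 0.57],
    [0.80, 0.33, 0.76, 0.79],
]
mean_score = sum(sum(s) / len(s) for s in scores) / len(scores)
print(f"mean: {mean_score:.3f}  worst-at-4: {worst_at_k(scores):.3f}")
# The gap between the two numbers is the consistency penalty.
```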
To assess the trustworthiness of its automated grader, OpenAI performed a meta-evaluation using more than 60,000 annotated examples. GPT-4.1, used as the default grader, matched or exceeded the average performance of individual physicians in most themes, supporting its use as a consistent evaluator.
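As an illustration of this kind of meta-evaluation, the sketch below measures agreement between the automated grader and physician annotators on binary "criterion met" judgments. The labels are made up, and the choice of macro F1 as the agreement metric is an assumption:

```python
from sklearn.metrics import f1_score

# Hypothetical grader-vs-physician comparison on per-criterion judgments
# (1 = criterion met, 0 = not met). Labels are fabricated for illustration.
physician_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
grader_labels    = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

agreement = f1_score(physician_labels, grader_labels, average="macro")
print(f"grader vs. physician macro F1: {agreement:.2f}")  # ~0.79 here
```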
Conclusion
HealthBench represents a technically rigorous and scalable framework for evaluating the performance of AI models in complex healthcare settings. By combining realistic interactions, detailed rubrics, and expert validation, it offers a more nuanced view of model behavior than existing alternatives. OpenAI has released HealthBench through its simple-evals GitHub repository, giving researchers tools to benchmark, analyze, and improve models for health-related applications.
Check out the Paper, GitHub Page, and Official Release. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.