OpenAI Launches HealthBench, a Dataset That Benchmarks Health Care AI Models
- In San Francisco on Tuesday, May 13, 2025, OpenAI released HealthBench, a large dataset for evaluating AI responses to health care questions.
- OpenAI launched HealthBench to address the challenge of fairly comparing AI models’ answers to health care questions using realistic data.
- HealthBench contains 5,000 health conversations graded by rubrics with over 57,000 criteria developed by 262 physicians from 60 countries.
- OpenAI’s o3 reasoning model scored highest with 60%, excelling in communication quality, though experts call for more subgroup analysis and human review.
- HealthBench represents a major advance in AI health evaluation but cannot yet support safety claims and requires further testing before trusted deployment.
Insights by Ground AI
Reposted by 36 other sources
OpenAI Releases HealthBench Dataset to Test AI in Health Care
·Missoula, United States
GPT-4 Fails On Real Healthcare Tasks: New HealthBench Test Reveals The Gaps
Large language models are everywhere — from search to coding and even patient-facing health tools. New systems are being introduced almost weekly, including tools that promise to automate clinical workflows. But can they actually be trusted to make real medical decisions? A new benchmark, called HealthBench, says not yet. According to the results, models like GPT-4 (from OpenAI) and Med-PaLM 2 (from Google DeepMind) still fall short on practical…
Coverage Details
Total News Sources: 50
Leaning Left: 6
Center: 13
Leaning Right: 6
Bias Distribution: 52% Center (Left 24%, Center 52%, Right 24%)