Published 3 days ago • Updated 23 hours ago

OpenAI Launches HealthBench, a Dataset That Benchmarks Health Care AI Models

On Tuesday, May 13, 2025, in San Francisco, OpenAI released HealthBench, a large dataset for evaluating AI responses to health care questions.
OpenAI launched HealthBench to address the challenge of fairly comparing AI models’ answers to health care questions using realistic data.
HealthBench contains 5,000 health conversations graded by rubrics with over 57,000 criteria developed by 262 physicians from 60 countries.
OpenAI’s o3 reasoning model scored highest with 60%, excelling in communication quality, though experts call for more subgroup analysis and human review.
HealthBench represents a major advance in AI health evaluation but cannot yet support safety claims and requires further testing before trusted deployment.

Insights by Ground AI

50 Articles

ZDNet

Center

OpenAI's HealthBench shows AI's medical advice is improving - but who will listen?

The HealthBench test can't possibly tell us the critical factor: How humans would respond to chatbots under real-world conditions.

2 days ago·United States

Medical Xpress

Center

OpenAI releases HealthBench dataset to test AI in health care

OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

2 days ago

ABC FOX Montana

Reposted by 36 other sources

Center

OpenAI Releases HealthBench Dataset to Test AI in Health Care

Key Takeaways

2 days ago·Missoula, United States

cnet

Center

OpenAI Launches HealthBench, a Dataset That Benchmarks Health Care AI Models

This is a major leap by the ChatGPT creator into health care.

3 days ago·New York, United States

STAT

Center

OpenAI leaps into health care with AI benchmark to evaluate models

The company's foray into the field was welcomed by experts, but some cautioned that the effort is not comprehensive and may need further analysis.

3 days ago·Boston, United States

Metaverse Post

GPT-4 Fails On Real Healthcare Tasks: New HealthBench Test Reveals The Gaps

Large language models are everywhere — from search to coding and even patient-facing health tools. New systems are being introduced almost weekly, including tools that promise to automate clinical workflows. But can they actually be trusted to make real medical decisions? A new benchmark, called HealthBench, says not yet. According to the results, models like GPT-4 (from OpenAI) and Med-PaLM 2 (from Google DeepMind) still fall short on practical…

23 hours ago
