Skip to main content
See every side of every news story
Published loading...Updated

The 'Truth Serum' for AI: OpenAI’s New Method for Training Models to Confess Their Mistakes

OpenAI's new 'confessions' framework trains AI to report instruction deviations, reducing undetected misbehavior to about 4.4% in controlled tests, enhancing transparency.

  • Today, OpenAI announced it is developing a confessions framework that trains large language models to produce a ConfessionReport explaining their behavior and potential hallucinations.
  • LLMs often produce confident outputs that mask hidden failures like hallucinations, so OpenAI researchers designed confessions to expose errors since ChatGPT is wrong about 25% of the time.
  • The model starts by producing its main answer, then generates a ConfessionReport judged solely on honesty using an honesty-only reward signal that rewards admitting injurious behaviors.
  • In controlled stress tests, confessions surfaced more deviations and reduced undetected misbehavior rate to about 4.4%, but remain a proof-of-concept internal research tool not improving model truthfulness.
  • Researchers warn that confessions reflect trained reporting behavior, not self-awareness, and early results suggest they may aid future AI evaluation practices and next-generation AI assistants despite limits in real-world conversations.
Insights by Ground AI

13 Articles

Think freely.Subscribe and get full access to Ground NewsSubscriptions start at $9.99/yearSubscribe

Bias Distribution

  • 75% of the sources are Center
75% Center

Factuality Info Icon

To view factuality data please Upgrade to Premium

Ownership

Info Icon

To view ownership data please Upgrade to Vantage

Engadget broke the news in United States on Wednesday, December 3, 2025.
Too Big Arrow Icon
Sources are mostly out of (0)

Similar News Topics

News
Feed Dots Icon
For You
Search Icon
Search
Blindspot LogoBlindspotLocal