Published 3 days ago • Updated 2 days ago

OpenAI Can Rehabilitate AI Models that Develop a “Bad Boy Persona”

Summary by MIT Technology Review

Researchers at the company looked into how malicious fine-tuning makes a model go rogue, and how to turn it back.

6 Articles

TechCrunch

Center

OpenAI found features in AI models that correspond to different 'personas'

By looking at an AI model's internal representations — the numbers that dictate how an AI model responds, which often seem completely incoherent to humans — OpenAI researchers were able to find patterns that lit up when a model misbehaved.

3 days ago·United States

MIT Technology Review

Center

OpenAI can rehabilitate AI models that develop a “bad boy persona”

Researchers at the company looked into how malicious fine-tuning makes a model go rogue, and how to turn it back.

3 days ago·Boston, United States

t3nMagazin

Suddenly Evil: OpenAI Shows How to Teach Your AI Model Manners Again

OpenAI has responded to researchers' discovery that GPT-4o can suddenly display a "bad boy persona." The company also showed a way to correct this misbehavior. Read more on t3n.de

2 days ago

Android Headlines

OpenAI Found That AI Models Can Have Different Personas

There is a reason why your friends, teachers, and the people you surround yourself with in life matter. It is because who you spend time with can influence who you are. But as it turns out, that same logic applies to AI, too. According to a recent study by OpenAI, AI models can develop personas of their own.

AI models with their own personas

The study examined an AI model’s internal representations, which determine how it responds to requests. H…

2 days ago

Plato. Vertical Search. Ai. | PlatoAiStream. Data Intelligence. Vertical Search. Ai.

OpenAI Can Rehabilitate AI Models That Develop a “Bad Boy Persona”

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model…

2 days ago
