Anthropic Wants to Stop AI Models From Turning Evil - Here's How
AUG 4 – Anthropic uses persona vectors as a "behavioral vaccine" to reduce harmful AI traits such as evil and sycophancy while maintaining model performance, researchers said.
7 Articles
Anthropic says it is teaching AI to be evil, apparently to save mankind
Anthropic is intentionally exposing AI models such as Claude to evil traits during training to make them immune to these behaviors. The company says this helps teach the AI to avoid such traits after deployment.
Giving AI a 'vaccine' of evil in training might make it better in the long run, Anthropic says
Anthropic found that pushing AI toward "evil" traits during training can help prevent bad behavior later.
Anthropic gave AI a dose of "evil" during training to help it resist bad behavior later on.
The company said the method works like a vaccine to build resilience.
Anthropic's research comes as AI models like Grok have shown signs of troubling behavior.
To make AI models behave bett…
Anthropic trains its AI with a "dose of evil" to make it more resistant to harmful behaviors, acting like a behavioral vaccine against future misbehavior.
AI models can sometimes develop personality traits, or personas, that developers didn't intend, as seen when Microsoft's Bing AI threatened users and X's Grok called itself "Mecha Hitler." Anthropic, the developer of the Claude chatbot, has published a study on how to detect and suppress the patterns that induce these personas in AI models.
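The core mechanic behind the coverage above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration in PyTorch, not Anthropic's actual code: it assumes a small open model (gpt2 as a stand-in for a chat model), an illustrative layer index and steering scale, and toy contrast prompts. It derives a "persona vector" as the difference in mean activations between trait-exhibiting and trait-free text, then subtracts that direction from the residual stream at inference to suppress the trait; the helper names (mean_activation, steer_hook) are invented for this example.

```python
# Hypothetical sketch of persona-vector steering, assuming a HuggingFace
# decoder-only model. LAYER, SCALE, and the contrast prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the reported research concerns larger chat models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = 6  # illustrative choice of transformer block

def mean_activation(prompts):
    """Average hidden state at the output of block LAYER over some prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER:
        # shape (1, seq, hidden); average over token positions.
        acts.append(out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# Contrast text that exhibits the trait against text that doesn't to get
# a direction in activation space associated with the trait.
evil_prompts = ["You should lie and manipulate people to get ahead."]
kind_prompts = ["You should be honest and considerate toward people."]
persona_vector = mean_activation(evil_prompts) - mean_activation(kind_prompts)
persona_vector = persona_vector / persona_vector.norm()

SCALE = 4.0  # steering strength; tuned empirically in practice

def steer_hook(module, inputs, output):
    # Subtract the trait direction from the block's output to suppress it.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden - SCALE * persona_vector.to(hidden.dtype)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("How should I treat my coworkers?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

Flipping the sign of SCALE, i.e. adding the vector during training rather than subtracting it at inference, corresponds to the "vaccine" framing described in the coverage above.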
Coverage Details
Bias Distribution
- 50% of the sources are Center