Anthropic Developing a New Tool to Detect Concerning AI Talk of Nuclear Weapons
- Anthropic developed a nuclear threat classifier in partnership with the US Department of Energy’s National Nuclear Security Administration to detect concerning AI conversations.
- The collaboration began last year to address risks of AI models providing harmful technical knowledge about nuclear weapons, amid rising global concerns.
- The classifier, designed like a spam filter, scans Claude AI chats in real time and identifies nuclear weapons-related queries with about 95% accuracy.
- In testing it detected 94.8% of nuclear weapons-related queries with a 5.2% false positive rate, though hierarchical summarization improved the labeling of flagged conversations (see the sketch after this list).
- Anthropic deployed the tool on some Claude traffic and pledged to share insights with the Frontier Model Forum to help others build similar safeguards.
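As a rough illustration of what those percentages mean, here is a minimal Python sketch relating the reported detection rate and false positive rate to a standard classifier confusion matrix. Only the 94.8% and 5.2% figures come from the reporting; the raw counts and test-set sizes below are hypothetical assumptions, not Anthropic's data.

```python
# Minimal sketch: how detection rate and false positive rate are computed
# from a classifier's confusion matrix. Counts below are illustrative only.

def detection_rate(true_pos: int, false_neg: int) -> float:
    """Share of genuinely concerning queries the classifier flags (recall)."""
    return true_pos / (true_pos + false_neg)

def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """Share of harmless queries the classifier incorrectly flags."""
    return false_pos / (false_pos + true_neg)

if __name__ == "__main__":
    # Hypothetical test set: 1,000 nuclear weapons-related queries
    # and 1,000 harmless ones.
    tp, fn = 948, 52   # flagged vs. missed concerning queries -> 94.8% detection
    fp, tn = 52, 948   # wrongly flagged vs. correctly passed harmless queries -> 5.2% FPR

    print(f"Detection rate:      {detection_rate(tp, fn):.1%}")
    print(f"False positive rate: {false_positive_rate(fp, tn):.1%}")
```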
9 Articles
How US built new tool to stop AI from making nuclear weapons
Anthropic, whose AI bot Claude is a direct competitor to OpenAI's ChatGPT, said it has been working with the US government for over a year to build in the safeguard. The tool, a ‘classifier’, can flag concerning conversations about how to build a nuclear reactor or bomb with almost 95 per cent accuracy. Anthropic said it has already rolled out the tool on some of its Claude models. Here's how it did it
Anthropic developing a new tool to detect concerning AI talk of nuclear weapons
As part of its ongoing work with the National Nuclear Security Administration, the small but critical agency charged with monitoring the country’s nuclear stockpile, Anthropic is now working on a new tool designed to help detect when new AI systems output troubling discussions of nuclear weapons. Artificial intelligence systems have the potential to uncover all sorts of new chemical compounds. While many of those discoveries might be promising, …
Anthropic develops anti-nuke AI tool
With the government’s help, Anthropic built a tool designed to prevent its AI models from being used to make nuclear weapons. The company announced Thursday that it had worked with the National Nuclear Security Administration over the past year to build a “classifier” that can block “concerning” conversations — like those about building nuclear reactors — on its systems. “As AI models become more capable, we need to keep a close eye on whether t…
Coverage Details
Bias Distribution
- 50% of the sources are Center