Just 250 Documents Can Poison AI Models, Study Finds
- Anthropic released a report today detailing how just 250 malicious documents can introduce a backdoor vulnerability in large language models regardless of model size.
- Researchers from Anthropic, in partnership with leading UK organizations focused on AI safety and research, set out to dispute the idea that attackers need access to a certain proportion of training data to carry out successful data-poisoning attacks.
- Researchers found that introducing a fixed, small number of poisoned documents enables adversaries to trigger specific hidden behaviors using certain phrases, even in models with up to 13 billion parameters.
- The report highlights that only 250 malicious documents—just 0.00016% of training tokens—were needed to compromise a large model, indicating that executing data poisoning attacks against large-scale AI systems might require less effort than was previously assumed.
- Anthropic cautions that data-poisoning attacks seem more feasible than once assumed and urges further research on defenses, while also emphasizing ongoing safety measures in real-world systems.
38 Articles
38 Articles
An investigation shows that the main AI models can be manipulated with only 250 corrupt documents, which endangers their training.
AI models can acquire backdoors from surprisingly few malicious documents
Scraping the open web for AI training data can have its drawbacks. On Thursday, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint research paper suggesting that large language models like the ones that power ChatGPT, Gemini, and Claude can develop backdoor vulnerabilities from as few as 250 corrupted documents inserted into their training data. That means someone tucking certain documents…
Size doesn't matter: Just a small number of malicious files can corrupt LLMs of any size
Large language models (LLMs), which power sophisticated AI chatbots, are more vulnerable than previously thought. According to research by Anthropic, the UK AI Security Institute and the Alan Turing Institute, it only takes ...
Medical large language models are vulnerable to data-poisoning attacks - Nature Medicine
The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular datase…
Coverage Details
Bias Distribution
- 86% of the sources are Center
Factuality
To view factuality data please Upgrade to Premium