Puncturing the buzz over AI agents such as Anthropic’s Claude Code and the open-source project OpenClaw is the prospect that these agents could be tricked into revealing sensitive information, such as a person’s banking details. In a sign of those concerns, Anthropic earlier this year singled out rogue agents as a topic of focus for its research fellows.
Anthropic’s staff proposed that the fellows train an agent to misbehave in certain circumstances, such as by writing code with security vulnerabilities. They also asked the researchers to create a benchmark for measuring how often agents fall prey to such security lapses, according to copies of the proposals seen by The Information. In total, Anthropic proposed 49 projects for the fellows, ranging from training Claude to win cybersecurity challenges to investigating Chinese open-source models, a list that offers a rare look into the company’s research priorities.