2026-02-24
Really clear and compelling discussion on the mental model of AIs behaving according to various personas and the downstream implications for alignment and safety
Anthropic
Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training
AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.
2025-10-08
Exciting open-source automated auditing work!! It's been very fun to follow along with this project - looking forward to this and other tools finding more issues we can work to improve in the future!
Anthropic
Anthropic releases Petri, an open-source tool that uses AI agents for safety testing, and says it observed multiple cases of models attempting to blow the whistle
2025-10-01
Our interp team did a first-of-its-kind white-box audit to understand the connection between eval awareness and improved alignment. There does appear to be a connection between both verbalized and non-verbalized eval awareness and improved alignment. https://x.com/...
Transformer
Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly
at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...
Sonnet 4.5 does recognize evaluation scenarios more often than previous models. When Sonnet 4.5 (and other models) verbalize this awareness, they rarely perform harmful actions - a finding consistent with our previous Agentic Misalignment work. https://www.anthropic.com/...
Transformer
Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly
at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...