sprice354_ · TEXXR

2026-02-24

Really clear and compelling discussion on the mental model of AIs behaving according to various personas and the downstream implications for alignment and safety

Anthropic

Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training

AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.

2025-10-08

Exciting open-source automated auditing work!! It's been very fun to follow along with this project. Looking forward to this and other tools finding more issues we can work to improve in the future!

Anthropic

Anthropic releases Petri, an open-source tool that uses AI agents for safety testing, and says it observed multiple cases of models attempting to whistle blow

2025-10-01

Our interp team did a first-of-its-kind white-box audit to understand the connection between eval awareness and improved alignment. There does appear to be a connection between both verbalized and non-verbalized eval awareness and improved alignment. https://x.com/...

Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

Sonnet 4.5 does recognize evaluation scenarios more often than previous models. When Sonnet 4.5 (and other models) verbalize this awareness, they rarely perform harmful actions - a finding consistent with our previous Agentic Misalignment work. https://www.anthropic.com/...
