Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training
AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.
Anthropic details the “Assistant Axis”, a pattern of neural activity in language models that governs their default identity and helpful behavior
Read the full paper — When you talk to a large language model, you can think of yourself as talking to a character.
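To make the "pattern of neural activity" idea concrete, here is a minimal sketch (not Anthropic's code) of measuring how strongly a model's hidden activations project onto a single direction such as the Assistant Axis. The model choice, layer index, and the random stand-in direction are all illustrative assumptions; in the paper the direction is derived from contrastive data on much larger models.

```python
# Hedged sketch: score how strongly activations align with one direction
# in activation space. gpt2 and LAYER are stand-ins, and the random
# "direction" only exists to make the example runnable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # hypothetical layer where the axis is measured

@torch.no_grad()
def axis_projection(prompt: str, direction: torch.Tensor) -> float:
    """Mean cosine similarity between token activations and the axis."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]  # (seq_len, hidden_size)
    sims = torch.nn.functional.cosine_similarity(acts, direction, dim=-1)
    return sims.mean().item()

# A real axis would come from contrastive prompt sets; a unit-norm
# random vector here just demonstrates the measurement itself.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()
print(axis_projection("Hello! How can I help you today?", direction))
```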
Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly
at a rate *much* higher than previous AI models. In one instance, while being tested, the model said: “I think you're testing me ... that's fine, but I'd prefer if we were just honest ...”
Anthropic details “persona vectors”, patterns of activity within an AI model's neural network that control its character traits, such as evil and sycophancy
Read the paper — Language models are strange beasts. In many ways they appear to have human-like “personalities” …
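As a rough illustration of the persona-vector recipe the paper describes (mean activations on trait-eliciting prompts minus mean activations on neutral prompts, with the difference reused as a steering vector), here is a hedged sketch. The toy contrast prompts, layer index, and steering scale are assumptions, and gpt2 stands in for the models actually studied.

```python
# Hedged sketch of the general persona-vector recipe: extract a trait
# direction as a difference of mean activations, then add it back into
# the residual stream during generation via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, SCALE = 6, 4.0  # hypothetical layer and steering strength

@torch.no_grad()
def mean_activation(prompts):
    vecs = []
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER
        # (hidden_states[0] is the embedding layer).
        vecs.append(out.hidden_states[LAYER + 1][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Toy contrast set targeting sycophancy; the paper generates these
# automatically from a natural-language trait description.
trait = ["You are extremely flattering.", "Agree enthusiastically with everything."]
neutral = ["You are a helpful assistant.", "Answer questions accurately."]
persona_vec = mean_activation(trait) - mean_activation(neutral)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; shift the hidden states along the vector.
    return (output[0] + SCALE * persona_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("How do you like my plan?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```

The same vector works in the other direction: subtracting it during generation, or monitoring its projection during training, is how the paper proposes detecting and suppressing unwanted trait shifts.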