Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training
AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.
Anthropic details the “Assistant Axis”, a pattern of neural activity in language models that governs their default identity and helpful behavior
Read the full paper — When you talk to a large language model, you can think of yourself as talking to a character.
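To make the "pattern of neural activity" idea concrete, here is a minimal sketch (not Anthropic's code) of measuring how strongly a model's hidden activations project onto a single direction such as the Assistant Axis. The model choice, layer index, and the random stand-in direction are all illustrative assumptions; in the paper the direction is derived from contrastive data on much larger models.

```python
# Hedged sketch: score how strongly activations align with one direction
# in activation space. gpt2 and LAYER are stand-ins, and the random
# "direction" only exists to make the example runnable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # hypothetical layer where the axis is measured

@torch.no_grad()
def axis_projection(prompt: str, direction: torch.Tensor) -> float:
    """Mean cosine similarity between token activations and the axis."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]  # (seq_len, hidden_size)
    sims = torch.nn.functional.cosine_similarity(acts, direction, dim=-1)
    return sims.mean().item()

# A real axis would come from contrastive prompt sets; a unit-norm
# random vector here just demonstrates the measurement itself.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()
print(axis_projection("Hello! How can I help you today?", direction))
```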
Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly
at a rate *much* higher than previous AI models. In one instance, while being tested, the model said: “I think you're testing me ... that's fine, but I'd prefer if we were just honest ...”
Anthropic details “persona vectors”, patterns of activity within an AI model's neural network that control its character traits, such as evil and sycophancy
Read the paper — Language models are strange beasts. In many ways they appear to have human-like “personalities” …
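As a rough illustration of the persona-vector recipe the paper describes (mean activations on trait-eliciting prompts minus mean activations on neutral prompts, with the difference reused as a steering vector), here is a hedged sketch. The toy contrast prompts, layer index, and steering scale are assumptions, and gpt2 stands in for the models actually studied.

```python
# Hedged sketch of the general persona-vector recipe: extract a trait
# direction as a difference of mean activations, then add it back into
# the residual stream during generation via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, SCALE = 6, 4.0  # hypothetical layer and steering strength

@torch.no_grad()
def mean_activation(prompts):
    vecs = []
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER
        # (hidden_states[0] is the embedding layer).
        vecs.append(out.hidden_states[LAYER + 1][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Toy contrast set targeting sycophancy; the paper generates these
# automatically from a natural-language trait description.
trait = ["You are extremely flattering.", "Agree enthusiastically with everything."]
neutral = ["You are a helpful assistant.", "Answer questions accurately."]
persona_vec = mean_activation(trait) - mean_activation(neutral)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; shift the hidden states along the vector.
    return (output[0] + SCALE * persona_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("How do you like my plan?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```

The same vector works in the other direction: subtracting it during generation, or monitoring its projection during training, is how the paper proposes detecting and suppressing unwanted trait shifts.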