VOICE ARCHIVE

Jack Lindsey

@jack_w_lindsey
6 posts
2026-02-24
How much should we anthropomorphize LLMs? Are they kind of like people, or just fancy autocompletes? If you're interested in these questions, I'd suggest checking out this post! Short answer: LLMs are not anthropomorphic, but the characters they play are. So the question ...
2026-02-24 View on X
Anthropic

Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training

AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.

2026-01-20
Shaping AI models' character is increasingly important. We've made progress on understanding where an LLM's default persona comes from, and how to track when it “drifts.” Kudos to @t1ngyu3 for leading this! There's even a demo you can play with: https://www.neuronpedia.org/ ...
2026-01-20 View on X
Anthropic

Anthropic details the “Assistant Axis”, a pattern of neural activity in language models that governs their default identity and helpful behavior

Read the full paper — When you talk to a large language model, you can think of yourself as talking to a character.
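
The excerpts above don't say how the Assistant Axis is actually computed or tracked. As a rough illustration only, the numpy sketch below builds a single activation-space direction by a difference of means and watches its projection across conversation turns; the construction, the hidden size, the drift threshold, and the random stand-in activations are all assumptions, not Anthropic's method.

import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # illustrative hidden size

# Stand-ins for residual-stream activations (one row per token); in practice
# these would be captured with hooks on a real model.
assistant_acts = rng.normal(0.1, 1.0, size=(2000, d_model))   # in-persona text
offrole_acts = rng.normal(-0.1, 1.0, size=(2000, d_model))    # off-persona text

# One plausible construction: the normalized difference of mean activations
# between in-persona and off-persona text.
axis = assistant_acts.mean(axis=0) - offrole_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def persona_score(turn_acts):
    """Mean projection of one conversation turn onto the axis."""
    return float(turn_acts.mean(axis=0) @ axis)

# Track drift across a conversation: flag turns whose projection falls
# below a calibrated floor (0.0 here is arbitrary).
turns = [rng.normal(0.1 - 0.05 * t, 1.0, size=(40, d_model)) for t in range(6)]
scores = [persona_score(t) for t in turns]
drifted = [i for i, s in enumerate(scores) if s < 0.0]
print(scores, drifted)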

2025-10-01
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model's mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15) [image]
2025-10-01 View on X
Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

Notably, the eval-awareness-inhibited models did not appear “evil” - in an automated evaluation of their behavioral tendencies, we found their failure modes look more like being too willing to comply with harmful requests, or “take the bait” in leading scenarios. (13/15) [image]
2025-10-01 View on X

We found that steering against certain eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment (more so than steering along random feature directions). However, even in our worst-case steering settings... (10/15) [image]
2025-10-01 View on X
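
"Steering against" a representation, in the standard activation-steering setup, means subtracting a feature direction from a layer's hidden states during the forward pass. The PyTorch sketch below shows that setup in general form; the direction v_eval, the layer index, and the coefficient are illustrative assumptions, not details from the audit.

import torch

def make_suppression_hook(direction, alpha):
    """Forward hook that steers hidden states against `direction` by
    subtracting alpha times the normalized direction at every position."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * d
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered
    return hook

# Smoke test on a dummy (batch, seq, d_model) tensor.
v_eval = torch.randn(64)  # stand-in for a learned eval-awareness direction
hook = make_suppression_hook(v_eval, alpha=4.0)
print(hook(None, None, torch.randn(1, 8, 64)).shape)  # torch.Size([1, 8, 64])

# On an HF-style model you would attach it to a decoder layer, e.g.:
# handle = model.model.layers[20].register_forward_hook(hook)  # layer choice assumed
# ... generate ...
# handle.remove()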

2025-08-02
Our new paper on persona vectors - knobs in an LLM's brain that control traits like evil, sycophancy, & hallucination. We use them to monitor model personas, mitigate training-time drift towards bad personas, and flag problematic training data. Led by @RunjinChen and @andyarditi
2025-08-02 View on X
Anthropic

Anthropic details “persona vectors”, patterns of activity within an AI model's neural network that control its character traits, such as evil and sycophancy

Read the paper — Language models are strange beasts. In many ways they appear to have human-like “personalities” …
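
The thread gives only the informal recipe: one direction per trait, used both to monitor a model's persona and to score training data. A minimal numpy sketch under that reading, with a difference-of-means construction and random arrays standing in for real activations (both assumptions), might look like this:

import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """One plausible per-trait direction: difference of mean activations
    between responses exhibiting the trait (e.g. sycophancy) and matched
    responses that don't."""
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def flag_examples(example_acts, v, threshold):
    """Score training examples by projecting their mean activation onto the
    persona vector; high scores flag data that may push the trait."""
    scores = np.array([acts.mean(axis=0) @ v for acts in example_acts])
    return scores, np.flatnonzero(scores > threshold)

rng = np.random.default_rng(1)
d_model = 512
v_syco = persona_vector(rng.normal(0.2, 1.0, (1000, d_model)),   # trait-eliciting
                        rng.normal(0.0, 1.0, (1000, d_model)))   # baseline
examples = [rng.normal(0.3 if i % 4 == 0 else 0.0, 1.0, (30, d_model))
            for i in range(12)]
scores, flagged = flag_examples(examples, v_syco, threshold=2.0)
print(flagged)  # indices of examples that lean toward the trait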