VOICE ARCHIVE

Marius Hobbhahn

@mariushobbhahn
9 posts
2025-10-01
Unfortunately, we're now at the point where new models have really high eval awareness. For every alignment eval score I see, I now add a mental asterisk: *the model could have also just realized it's being evaluated, who knows. And I think that's concerning!
Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

2024-12-07
Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

Overstating. Our scenario is quite toy-ish compared to the real situation. We only wanted to test the capability for scheming, not for actually escaping. All of this is in a secure environment, and we obviously don't have access to the real weights. https://x.com/...
2024-12-07
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

Understating. We didn't instruct the model to escape or to remove oversight. We only told it to pursue a goal. It did the rest of the reasoning on its own. That's exactly the core finding of the paper. If we had told the model to remove oversight, etc., the finding would not
2024-12-07
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...


2023-10-30
Nice to see that the US is starting with AI regulation! I'm worried that their aggressive stance on competition law will make it really hard for labs to coordinate on safety measures because they could be seen as anti competitive https://www.politico.com/...
The Verge

Biden signs an EO on generative AI, directing the NIST, DHS, and other agencies to create new safety standards, protect privacy, support workers, and more

President Joe Biden signed an executive order providing rules around generative AI, ahead of any legislation coming from lawmakers.
