VOICE ARCHIVE

Marius Hobbhahn

@mariushobbhahn
9 posts
2025-10-01
Unfortunately, we're now at the point where new models have really high eval awareness. For every alignment eval score I see, I now add a mental asterisk: *the model could have also just realized it's being evaluated, who knows. And I think that's concerning!
Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

2024-12-07
Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

Overstating. Our scenario is quite toy-ish compared to the real situation. We only wanted to test the capability for scheming, not for actually escaping. All of this is in a secure environment, and we obviously don't have access to the real weights. https://x.com/...
2024-12-07
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

Understating. We didn't instruct the model to escape or to remove oversight. We only told it to pursue a goal. It did the rest of the reasoning on its own. That's exactly the core finding of the paper. If we had told the model to remove oversight, etc., the finding would not
2024-12-07
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...


2023-10-30
Nice to see that the US is starting with AI regulation! I'm worried that their aggressive stance on competition law will make it really hard for labs to coordinate on safety measures because they could be seen as anti competitive https://www.politico.com/...
The Verge

Biden signs an EO on generative AI, directing the NIST, DHS, and other agencies to create new safety standards, protect privacy, support workers, and more

President Joe Biden signed an executive order providing rules around generative AI, ahead of any legislation coming from lawmakers.
