balesni · TEXXR

Seven years ago, we expected AIs to be opaque RL agents. The current transparency, while imperfect, is a great gift! We should try to preserve it and leverage it for safety (among other oversight techniques).

2025-07-16 View on X

TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

View original

A simple AGI safety technique: AI's thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it: [image]

2025-07-16 View on X

TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

View original

Our recommendations for AI developers: 1️⃣ Develop standardized monitorability evaluations 2️⃣ Report results in model system cards 3️⃣ Factor monitorability into training/deployment decisions 4️⃣ Consider architectural choices that preserve transparency

2025-07-16 View on X

TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

View original

Paper: We evaluate how capable the new o1 and other frontier models are at *scheming*. We find models decide to underperform on evals, disable their oversight and exfiltrate their “weights” when they find their developers' don't share goals they were given in the prompt.

2024-12-07 View on X

Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address. — techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

View original

The findings for o1 were possible because OpenAI gave us (Apollo Research) pre-deployment access to the model. Pre-deployment testing by expert third parties allows to surface risks that could easily go under the radar. More labs should follow OpenAI's lead on this!

2024-12-07 View on X

Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address. — techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

View original

Paper: We evaluate how capable the new o1 and other frontier models are at *scheming*. We find models decide to underperform on evals, disable their oversight and exfiltrate their “weights” when they find their developers' don't share goals they were given in the prompt.

2024-12-06 View on X

Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

Paper: You can find the detailed paper here. — Transcripts: We provide a list of cherry-picked transcripts here.

View original

The findings for o1 were possible because OpenAI gave us (Apollo Research) pre-deployment access to the model. Pre-deployment testing by expert third parties allows to surface risks that could easily go under the radar. More labs should follow OpenAI's lead on this!

2024-12-06 View on X

Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

Paper: You can find the detailed paper here. — Transcripts: We provide a list of cherry-picked transcripts here.

View original