/
Navigation
C
Chronicles
Browse all articles
C
E
Explore
Semantic exploration
E
R
Research
Entity momentum
R
N
Nexus
Correlations & relationships
N
~
Story Arc
Topic evolution
S
Drift Map
Semantic trajectory animation
D
P
Posts
Analysis & commentary
P
Browse
@
Entities
Companies, people, products, technologies
Domains
Browse by publication source
Handles
Browse by social media handle
Detection
?
Concept Search
Semantic similarity search
!
High Impact Stories
Top coverage by position
+
Sentiment Analysis
Positive/negative coverage
*
Anomaly Detection
Unusual coverage patterns
Analysis
vs
Rivalry Report
Compare two entities head-to-head
/\
Semantic Pivots
Narrative discontinuities
!!
Crisis Response
Event recovery patterns
Connected
Nav: C E R N
Search: /
Command: ⌘K
Embeddings: large
VOICE ARCHIVE

Mikita Balesni

@balesni
7 posts
2025-07-16
Seven years ago, we expected AIs to be opaque RL agents. The current transparency, while imperfect, is a great gift! We should try to preserve it and leverage it for safety (among other oversight techniques).
2025-07-16 View on X
TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

A simple AGI safety technique: AI's thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it: [image]
2025-07-16 View on X
TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

Our recommendations for AI developers: 1️⃣ Develop standardized monitorability evaluations 2️⃣ Report results in model system cards 3️⃣ Factor monitorability into training/deployment decisions 4️⃣ Consider architectural choices that preserve transparency
2025-07-16 View on X
TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …

2024-12-07
Paper: We evaluate how capable the new o1 and other frontier models are at *scheming*. We find models decide to underperform on evals, disable their oversight and exfiltrate their “weights” when they find their developers' don't share goals they were given in the prompt.
2024-12-07 View on X
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

The findings for o1 were possible because OpenAI gave us (Apollo Research) pre-deployment access to the model. Pre-deployment testing by expert third parties allows to surface risks that could easily go under the radar. More labs should follow OpenAI's lead on this!
2024-12-07 View on X
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address.  —  techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday even...

2024-12-06
Paper: We evaluate how capable the new o1 and other frontier models are at *scheming*. We find models decide to underperform on evals, disable their oversight and exfiltrate their “weights” when they find their developers' don't share goals they were given in the prompt.
2024-12-06 View on X
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

Paper: You can find the detailed paper here.  —  Transcripts: We provide a list of cherry-picked transcripts here.

The findings for o1 were possible because OpenAI gave us (Apollo Research) pre-deployment access to the model. Pre-deployment testing by expert third parties allows to surface risks that could easily go under the radar. More labs should follow OpenAI's lead on this!
2024-12-06 View on X
Apollo Research

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

Paper: You can find the detailed paper here.  —  Transcripts: We provide a list of cherry-picked transcripts here.