mileskwang · TEXXR

We introduce 3 eval archetypes, a metric, and a broad suite of 13 evals. Example: Can we detect solely from the CoT whether a model: - Reward hacks by changing unit tests? - Acts sycophantic when we give personalized memory? - Uses a particular math theorem? [image]

2025-12-21 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

View original

New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵: [image]

2025-12-21 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

View original

We evaluate frontier models and find monitorability scales well with more thinking tokens. GPT-5 is the most monitorable model we studied. And monitoring the CoT is much better than just actions! [image]

2025-12-21 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

View original

We introduce 3 eval archetypes, a metric, and a broad suite of 13 evals. Example: Can we detect solely from the CoT whether a model: - Reward hacks by changing unit tests? - Acts sycophantic when we give personalized memory? - Uses a particular math theorem? [image]

2025-12-20 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.

View original

New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵: [image]

2025-12-20 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.

View original

We evaluate frontier models and find monitorability scales well with more thinking tokens. GPT-5 is the most monitorable model we studied. And monitoring the CoT is much better than just actions! [image]

2025-12-20 View on X

OpenAI

OpenAI introduces a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations designed to measure the monitorability of an AI system

We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.

View original

We introduce Malicious Fine-Tuning with gpt-oss: using our best RL techniques to maximize biosecurity and offensive cybersecurity capabilities to estimate frontier risks.

2025-08-06 View on X

Wired

OpenAI releases gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2; the smaller gpt-oss-20b can run locally on a device with 16GB+ of RAM

gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models Simon Willison / Simon Willison's Weblog : OpenAI's new open weight (Apache 2) models are really good...

View original

We introduce Malicious Fine-Tuning with gpt-oss: using our best RL techniques to maximize biosecurity and offensive cybersecurity capabilities to estimate frontier risks.

2025-08-06 View on X

Bloomberg

Amazon plans to make OpenAI's new gpt-oss open-weight models available on Bedrock and SageMaker, the first time it has offered OpenAI's models to AWS customers

Takeaways by Bloomberg AI — Hide … Tell us how AI is shaping your news experience. Share your feedback

View original

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵: [image]

2025-06-19 View on X

TechCrunch

OpenAI details why “emergent misalignment”, where training models on wrong answers in one area can lead to issues in many others, happens and how to mitigate it

Maxwell Zeff / TechCrunch :

View original

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵: [image]

2025-06-19 View on X

Axios

OpenAI warns that its upcoming models could pose a higher risk of helping create bioweapons and is partnering to build diagnostics, countermeasures, and testing

OpenAI cautioned Wednesday that upcoming models will head into a higher level of risk when it comes to the creation of biological weapons …

View original