TEXXR

Chronicles

The story behind the story


o3, trained on the ARC-AGI-1 Public Training set, scored 87.5% on ARC Prize's Semi-Private Evaluation in a high-compute configuration; GPT-4o scored 5% in 2024

This is “The AI Economy,” a weekly LinkedIn-first newsletter …

Sharon Goldman / Fortune: Sam Altman says OpenAI's new o3 ‘reasoning’ models begin the ‘next phase’ of AI. Is this AGI?
Pradeep Viswanathan / Neowin: OpenAI introduces o3 and o3 Mini reasoning models
Alex Heath / The Verge: The AI talent wars are just getting started
Cecily Mauran / Mashable: OpenAI announces o3 and o3 mini reasoning models
Matthias Bastian / The Decoder: OpenAI unveils o3, its most advanced reasoning model yet

Bluesky:
Gary Marcus / @garymarcus: How many times do we have to see this same movie, where an AI beats some benchmark and influencers gleefully shout “It's So Over” without even trying out the AI and then on careful inspection the AI turns out to not be robust or reliable? — Thousands? — (It's already been hundreds.)
Simon Willison / @simonwillison.net: By far the best coverage of o3 is this essay by François Chollet, it's crammed with interesting insights beyond just reporting on the benchmark score: arcprize.org/blog/oai-o3-... Published my own notes on that here: simonwillison.net/2024/Dec/20/...
Julian Harris / @julianharris: OpenAI announced o3 that is significantly better than previous systems, according to an independent benchmark org (The Arc Prize) that apparently got access. — Only thing is it's wildly wildly expensive to run. Like its top end system is around $10k per TASK. — arcprize.org/blog/oai-o3-...
Zach Weinersmith / @zachweinersmith: This seems like a really big deal? arcprize.org/blog/oai-o3-... Chollet was pretty skeptical this would happen soon, even as of a few months ago. — Skeptics, whatcha got?
Steven Heidel / @stevenheidel.com: it's beginning to look a lot like AGI 🎄 arcprize.org/blog/oai-o3-...

Mastodon:
Miguel Afonso Caetano / @remixtures@tldr.nettime.org: “OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%. …

Threads:
@luokai: Even though o3 has tackled many challenges that most average humans can't solve, there's still a long road ahead on the journey to AGI. It still messes up on tasks that are pretty simple for humans, revealing a huge gap between where it is and true AGI. …
Joe Fabisevich / @mergesort: Something OpenAI has realized in a way no other foundation model lab has is that your model can be amazing, superhuman even, but if it's not packaged into a product people can use it might as well not exist. ChatGPT supporting Apple Notes is worth far more to people than 100 more points on LMSYS.
Dustin Moskovitz / @moskov: Francois Chollet has long been an LLM skeptic - he seems to be coming around. I wonder if others will follow?
Casey Newton / @crumbler: “All intuition about AI capabilities will need to get updated for o3,” says the founder of the ARC prize — designed to be very difficult for LLMs to solve — after OpenAI took the benchmark from 5% with o1 to 85% today with o3 https://arcprize.org/...

X:
Tau / @taulogicai: @GaryMarcus @fchollet Well said. Passing ARC-AGI highlights progress in solving specific challenges, not achieving AGI. True AGI requires reasoning, adaptability, and guaranteed correctness across all tasks, not just isolated benchmarks. The distinction cannot be overstated.
Lenny Rachitsky / @lennysan: The best benchmark for tracking progress towards AGI. 2025 is going to be wild.
François Chollet / @fchollet: Deep learning did hit that wall, and the natural answer to get past it was deep learning plus search. AI research is about to enter its deep-learning guided program synthesis (or CoT synthesis) arc.
Gary Marcus / @garymarcus: Important words from @fchollet: “it is important to note that ARC-AGI is not an acid test for AGI - as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over …
Roon / @tszzl: “easy for humans, hard for ai” is not a solid design principle for evals imo it leads you towards “judging a fish by how far it can climb a tree” absurdities but maybe it's one orthogonal eval style among many equally important ones
Denny Zhou / @denny_zhou: any benchmark—including ARC-AGI—can be rapidly solved, as long as the task provides a clear evaluation metric that can be used as a reward signal during fine-tuning.
Emmett Shear / @eshear: If you haven't figured out the joke yet, *any* fixed benchmark will fall rapidly the instant it becomes an optimization target for the top labs. Correspondingly, no fixed benchmark is AGI. AGI is the ability to generalize to an adversarially chosen new benchmark.
Gary Marcus / @garymarcus: Starting to feel like most people using the term AGI have lost sight of what the G actually stands for.
Andriy Burkov / @burkov: How to achieve AGI in 2024: 1. Define a benchmark with puzzles and call it “The AGI Testing Benchmark.” 2. Fine-tune a VLM to solve these puzzles. 3. Declare the AGI achievement.
Sauers / @sauers_: The total compute cost was around $1,600,250, more than the entire prize [image]
Geoff Lewis / @geofflewisorg: Think about what becomes *more* valuable in a post-AI world, for we are at its doorstep.
Roon / @tszzl: specifically arc agi visual tasks look like nonsense in JSON format and multi modality isn't great and the character manipulation tasks don't work for the same reason models mess up the how many “r"s in strawberry problem (tokenization/BPE)
Roon / @tszzl: it's almost adversarially constructed wrt input modalities. for a model to solve these requires a far higher level of intelligence than equivalent human score
François Chollet / @fchollet: It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn't solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute … [image]
Mia Bookworm / @miaai_builder: @Techmeme The o3 model's ability to generate and execute its own programs is a massive leap forward in AI capabilities, I'm loving the potential for novel task adaptation!
François Chollet / @fchollet: The limitations of specific techniques are predictable and correspondingly lead to plateaus for those techniques. But there is always the next technique, building on top of the pile that's already available. There is enough research investment that there will be no wall.
Miles Brundage / @miles_brundage: I'm old enough to remember when getting double digit scores on FrontierMath was considered super hard. I'm 6 weeks old.
Simon Willison / @simonw: Absolutely the most interesting thing I've read so far about o3 is this essay by @fchollet https://arcprize.org/... [image]
Drew Breunig / @dbreunig: Creating reasoning data for training - by humans or synthetically - will continue to rise in importance.
François Chollet / @fchollet: What does this mean for the future of AGI research? For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly …
François Chollet / @fchollet: Two other examples. You can find the full testing data here: https://github.com/... If the topic interests you, take a look at analyzing this data. [image]

LinkedIn:
Anette Novak: Last night, OpenAI released its latest model, breaking the barrier of solving new problems, not previously in the training data. 👇🏽 …
John Chong Min Tan: 75% on ARC-AGI semi-private dataset is insanely good. — Some key takeaways from this article by Chollet: …
Rama Vasudevan: I have been skeptical of LLM reasoning for some time, and have never been on the hype train. But it is impossible now to not be stunned by the progress of OpenAI's soon to be released o3 model. …
Sébastien Riopel-Murray: OpenAI's new o3 model scored a breakthrough 87.5% on the ARC-AGI benchmark for general intelligence. …

Forums:
Hacker News: OpenAI O3 breakthrough high score on ARC-AGI-PUB
r/agi: OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
r/mlscaling: OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
r/MachineLearning: [D] OpenAI o3 87.5% High Score on ARC Prize Challenge
r/singularity: FULL O3 TESTING REPORT
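For context on Roon's point about tasks looking "like nonsense in JSON format": public ARC-AGI tasks are distributed as JSON objects of train/test pairs, where each grid is a 2-D array of integer color codes (0–9). A minimal sketch of that format follows; the grid values here are invented for illustration and are not from the real dataset.

```python
import json

# Toy task in the public ARC-AGI JSON layout (train/test pairs of
# integer grids, 0-9 = colors). Values are made up for illustration.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 0]]},
    ],
}

# What a text-only model ingests: one long, bracket-heavy string.
serialized = json.dumps(task)
print(serialized)

# What a human sees instead: a small 2-D image.
for row in task["train"][0]["input"]:
    print(" ".join(str(cell) for cell in row))
```

The contrast between the flat serialized string and the 2-D rendering is the gap Roon is gesturing at: spatial structure that is immediate to the eye has to be reconstructed from a token stream.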

ARC Prize