brendanfoody · TEXXR

GPT 5.4 is the best model we've ever tested on APEX-Agents. It's also the first model to pass 50% mean score. A year ago, frontier models couldn't even edit an Excel sheet and scored less than 5%. Now, in less than 3 months GPT 5.4 has improved by 15.7%. ChatGPT will imminently...

2026-03-06

The Verge

OpenAI launches GPT-5.4, saying it is its “most capable and efficient frontier model for professional work” and its first with native computer use capabilities

The latest model comes with native computer use capabilities, allowing it to take on jobs across your device and applications.

Gemini 3.1 Pro is now at the top of the APEX-Agents leaderboard. Gemini jumped from 18.4% to 33.5% on Pass@1 in just 90 days. It also completes 5 tasks that no model has ever been able to do before. @GeminiApp shows how quickly agents are improving at real knowledge work. It...

2026-02-21

9to5Google

Google rolls out Gemini 3.1 Pro, which it says is “a step forward in core reasoning”, for all users in the Gemini app; the .1 increment is a first for Google

In November, Google introduced Gemini 3 Pro in preview, with Gemini 3 Flash following a month later.

We collaborated with the world's leading experts to create APEX:
- Larry Summers (@LHSummers), former US Treasury Secretary
- Cass Sunstein (@CassSunstein), the most cited legal scholar
- Eric Topol (@EricTopol), physician and best-selling author
- Dominic Barton, former...

2025-10-03

Mercor

Mercor launches the AI Productivity Index (APEX), which evaluates AI models' ability to perform “economically valuable knowledge work”; GPT-5 leads at 64.2%

...still not production-ready
Nikita Ostrovsky / Time: AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants
arXiv.org: The AI Productivity Index (APEX)
Agnee Ghosh / B...

AI has its PhD and now it's on the job market. Introducing the AI Productivity Index (APEX), a benchmark that measures how well we've automated the most valuable industries in the world. Most benchmarks study abstract capabilities. APEX evaluates model performance on real deliverables across law, finance, consulting, and medicine...

2025-10-03
