mikeknoop · TEXXR

Zapier is a heavy @braintrust user. What a nice note from the head of customer service!

2026-02-17

Axios

Braintrust, which helps companies evaluate and monitor their AI tools' performance, raised an $80M Series B led by Iconiq at an $800M post-money valuation

Braintrust, a startup building AI observability and evaluation tools, raised an $80 million Series B led by Iconiq …

Great post. It's unfortunate that shifting AGI goalposts is associated with luddism. Pointing out flaws and building theories is how to drive progress — in fact it's a strong bull signal as we get more humans studying the real issues, increasing the likelihood we solve them.

2025-12-04

Dwarkesh Podcast

Thoughts on AI progress and why AI labs' actions hint at a worldview in which AI models will continue to fare poorly at generalization and on-the-job learning

Why I'm moderately bearish in the short term, and explosively bullish in the long term. What are we scaling?

Really exciting to see! This is important work to assemble these scientific datasets. Also, one item close to my heart:

> launch funding opportunities or prize competitions to incentivize private-sector participation in AI-driven scientific research

2025-11-25

Reuters

Trump signs an EO establishing the Genesis Mission to boost AI innovation, including by using federal scientific datasets to train models and create AI agents

President Donald Trump on Monday signed an executive order to launch a government-wide effort to build an integrated artificial …

We just verified Gemini 3 Pro and Deep Think (Preview) are over 2X SOTA on ARC v2! This is really impressive and frankly a bit surprising. Impressive because many of the v2 solves indicate clear complexity scaling over v1. Such as tasks 65b59efc, e3721c99, and dd6b8c4b We're

2025-11-18

9to5Google

Google says Gemini 3 Pro scores 1,501 on LMArena, above 2.5 Pro, and demonstrates PhD-level reasoning with top scores on Humanity's Last Exam and GPQA Diamond

Google today announced Gemini 3 with the goal of bringing “any idea to life.” The first model available in this family …

We just verified Gemini 3 Pro and Deep Think (Preview) are over 2X SOTA on ARC v2! This is really impressive and frankly a bit surprising. Impressive because many of the v2 solves indicate clear complexity scaling over v1. Such as tasks 65b59efc, e3721c99, and dd6b8c4b We're

2025-11-18

The Verge

Google unveils Gemini 3, its “most intelligent” and “factually accurate” model yet, with improvements across coding and reasoning, and offering less “flattery”

The flagship Gemini 3 Pro model is coming to the Gemini app and Search, with improvements across coding, reasoning, and less ‘flattery.’

I have increased confidence that the core AGI algorithm will be less than 10k loc (intelligence is a process, not a model).

2025-10-09

VentureBeat

Samsung introduces the Tiny Recursion Model, a 7M-parameter model that can outperform LLMs 10,000x larger, like Gemini 2.5 Pro and o3-mini, on specific problems

The trend of AI researchers developing new, small open source generative models that outperform far larger …

Three key ARC-AGI findings on GPT-5: 1. Full GPT-5 is along the v1 pareto frontier. OpenAI said they focussed on other goals like UX and reliability. Our testing supports. 2. Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency, Mini could have

2025-08-08

Simon Willison's Weblog

GPT-5 hands-on: it exudes competence but doesn't feel like a dramatic leap ahead of other LLMs, and the pricing is aggressively competitive with other providers

And It Changes Everything; Tyler Cowen / Marginal Revolution: GPT-5, a short and enthusiastic review; GPT-5: GPT-5, our hands-on review of OpenAI's newest model based on weeks o...

Three key ARC-AGI findings on GPT-5: 1. Full GPT-5 is along the v1 pareto frontier. OpenAI said they focussed on other goals like UX and reliability. Our testing supports. 2. Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency, Mini could have

2025-08-08

VentureBeat

OpenAI touts GPT-5's scores on math, coding, and health benchmarks: 94.6% on AIME 2025 without tools, 74.9% on SWE-bench Verified, and 46.2% on HealthBench Hard

After literally years of hype and speculation, OpenAI has officially launched a new lineup of large language models (LLMs) …

Game benchmarks are making a comeback. Cool project from @kaggle

2025-08-05

SiliconANGLE

Google unveils benchmarking platform Kaggle Game Arena, where LLMs compete head-to-head in strategic games, starting with a chess tournament from August 5 to 7

Watch models compete in complex games, providing a verifiable and dynamic measure of their capabilities. Kaggle: Chess Text Input Leaderboard; Nick Bild / Hackster: Shall We Play a...

This is accurate. We verified Grok 4 using our semi-private ARC datasets.

2025-07-10

Tom's Guide

xAI introduces Grok 4, trained on its Colossus supercomputer, with multimodal features, faster reasoning, Grok 4 Voice, Grok 4 Code, a new interface, and more

Deeper thinking and greater reasoning is promised — An hour after the live stream was supposed to start last night (July 9) …

And now available via Anthropic API — today Anthropic announced the Zapier MCP connector support: https://www.anthropic.com/...

2025-05-23

Anthropic

Anthropic releases new API features for building agents: a code execution tool, an MCP connector, a Files API, and extended prompt caching, all in public beta

Today, we're announcing four new capabilities on the Anthropic API that enable developers to build more powerful AI agents …

The OpenAI Responses API now supports tens of thousands of third-party actions via Zapier MCP. Demo in thread below!

2025-05-22

VentureBeat

OpenAI updates its Responses API for building agentic applications to include remote MCP server support, image generation and Code Interpreter tools, and more

OpenAI is rolling out a set of significant updates to its newish Responses API, aiming to make it easier for developers and enterprises …

We found that a user-owned system prompt is a requirement, not an option, for Zapier Agents. There is no “god prompt” that works reliably across all use cases. The only way to get Agents reliable enough to deploy is to let end users steer them locally toward their specific use case.

2025-04-26

Pete Koomen

Many AI features, like Gmail's AI assistant, feel useless because they don't allow users to edit system prompts, constraining the AI models they're built with

Millions Of Email Users Now At Risk Of Attack; Mastodon: Dare Obasanjo / @carnage4life@mas.to: This blog captures my frustration with AI tools for work. Microsoft and Google are p...

The $1,000,000 @arcprize 2025 competition is back! And introducing ARC-AGI-2, the only unbeaten benchmark (we're aware of) that remains easy for humans but now even harder for AI. New ideas are still needed to reach AGI. We've got lots of great updates for 2025.

2025-03-26

TechCrunch

The Arc Prize Foundation says its new ARC-AGI-2 test stumps most AI models; humans get 60% of the questions right but GPT-4.5 and Claude 3.7 Sonnet score ~1%

François Chollet / @fchollet: Unlike ARC-AGI-1, this new version is not easily brute-forced. Current top AI approaches score 0-4%. All base LLMs (GPT-4.5, Claude 3.7 Son...

> We will no longer ship o3 as a standalone model

Does this change extend to the developer platform? If so, this is a big strategic focus shift from platform dev UX to consumer UX.

2025-02-13

TechCrunch

Sam Altman says GPT-5 will include o3, which is no longer set to ship as a standalone model, GPT-4.5 will be OpenAI's last non-chain-of-thought model, and more

OpenAI has effectively canceled the release of o3, which was slated to be the company's next major AI model …

AGI is the most important technology in the history of the world. That's why I'm going all in on a new adventure with @fchollet. We're creating a new intelligence science lab — Ndea. @ndeainc's ultimate goal is to compress science timelines. First by understanding intelligence,

2025-01-16

TechCrunch

AI researcher François Chollet and Zapier co-founder Mike Knoop launch Ndea, an AI research and science lab focused on “developing and operationalizing AGI”

François Chollet, an influential AI researcher, is launching a new startup that aims to build frontier AI systems with novel designs.

o3 is really special and everyone will need to update their intuition about what AI can/cannot do. while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semiprivate v1 scores: * GPT-2 (2019): 0% * GPT-3 (2020): 0% * GPT-4

2024-12-22

TechCrunch

OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025

12 Days of OpenAI: Day 12; Naomi Li Gan / Tech in Asia: OpenAI unveils AI model for advanced reasoning; Bojan Stojkovski / Interesting Engineering: OpenAI unveils o3 reasoning AI m...

o3 is really special and everyone will need to update their intuition about what AI can/cannot do. while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semiprivate v1 scores: * GPT-2 (2019): 0% * GPT-3 (2020): 0% * GPT-4

2024-12-21

TechCrunch

OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025

OpenAI announced its new o3 models on Friday. — In a tweet ahead of its final livestream for its …

Glad to see SB 1047 officially vetoed by @GavinNewsom. We must keep incentivizing AI innovation, not putting speed bumps on last-gen tech. This was under-discussed, but frontier models like OpenAI o1 showed 1047's approach was out of date before it even got decided. Based on

2024-09-30

Wall Street Journal

California Governor Gavin Newsom vetoes AI safety bill SB 1047, saying it applies only to large AI models and doesn't account for whether a deployment is high risk

Governor seeks more encompassing rules than the bill, which was opposed by OpenAI and Meta and supported by research scientists
