simonwillison.net

Short musings on “cognitive debt” - I'm seeing this in my own work, where excessive unreviewed AI-generated code leads me to lose a firm mental model of what I've built, which then makes it harder to confidently make future decisions simonwillison.net/2026/Feb/15/ ...

2026-02-15

Margaret-Anne Storey

As AI and agents are adopted to accelerate development, cognitive load and cognitive debt are likely to become bigger threats to developers than technical debt

The term technical debt is often used to refer to the accumulation of design or implementation choices that later make the software harder …

This is great - it's about time someone updated the discourse on LLM energy usage to reflect that coding agents use massively more prompts than occasional questions to ChatGPT — Simon estimates that a day of coding agent usage comes out close to the energy needed to run a dishwasher [embedded post]

2026-01-21

Simon P. Couch

A programmer estimates his typical day of coding with Claude Code is equivalent to running the dishwasher an extra time, much more energy than a “median query”

Most of the discourse about the environmental impact of LLM use focuses on a ‘median query.’ What about a Claude Code session?

Having compiled and run the web browser that Cursor built in a couple of weeks using mostly a giant fleet of coding agents I'm actually very impressed by it - there are rendering glitches but the renders it produces are surprisingly usable for a few-week-old project simonwillison.net/2026/Jan/19/ ...

2026-01-20

Simon Willison's Weblog

Cursor recently experimented with using hundreds of AI agents to build a web browser; they ran for close to a week, writing 1M+ lines of code across 1,000 files

Scaling long-running autonomous coding. Wilson Lin at Cursor has been doing some experiments to see how far you can push a large fleet of “autonomous” coding agents:

Here's my enormous round-up of everything we learned about LLMs in 2025 - the third in my annual series of reviews of the past twelve months — simonwillison.net/2025/Dec/31/ ... This year it's divided into 26 sections! This is the table of contents: [image]

2026-01-01

Simon Willison's Weblog

Some 2025 takeaways in LLMs: reasoning as a signature feature, coding agents were useful, subscriptions hit $200/month, and Chinese open-weight models impressed

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months.

OpenAI's CISO Dane Stuckey posted an essay (on Twitter) about how their new ChatGPT Atlas browser attempts to deal with the risk of prompt injection attacks, I ended up writing a point-by-point commentary on my blog: simonwillison.net/2025/Oct/22/ ...

2025-10-23

@cryps1s

OpenAI CISO Dane Stuckey outlines prompt injection mitigations in ChatGPT Atlas, including a “logged out mode” that blocks agent access to user credentials

Yesterday we launched ChatGPT Atlas, our new web browser. In Atlas, ChatGPT agent can get things done for you. We're excited to see how this feature makes work and day-to-day life ...

My notes on Claude Code for web, Anthropic's new asynchronous coding agent - I had preview access over the weekend, it's effectively a sandboxed instance of “claude --dangerously-skip-permissions” running in Anthropic's container simonwillison.net/2025/Oct/20/ ...

2025-10-21

The New Stack

Anthropic announces Claude Code on the web and in the Claude iOS app, available in beta as a research preview for Pro and Max users

Today, Anthropic's Claude Code agentic coding tool is moving beyond the terminal and coming to the web and the company's mobile app.

Decided to live blog this morning's OpenAI DevDay announcements, since I'm in the audience simonwillison.net/2025/Oct/6/ o...

2025-10-07

OpenAI

OpenAI announces apps that work inside ChatGPT, piloting Booking.com, Canva, Coursera, Figma, Expedia, Spotify, and Zillow for logged-in users outside of the EU

A new generation of apps you can chat with and the tools for developers to build them.

Made some notes on the new DeepMind paper “Video models are zero-shot learners and reasoners” - it makes a convincing case that generative video models are to vision problems what LLMs were to NLP problems: single models that can solve a wide array of challenges simonwillison.net/2025/Sep/27/ ...

2025-09-29

Simon Willison's Weblog

DeepMind says video models like Veo 3 could become general purpose foundation models for vision, like LLMs for text, using zero-shot “chain-of-frames” reasoning

Video models are zero-shot learners and reasoners. Fascinating new paper from Google DeepMind which makes …

They absolutely train on the previous year - the Google team wrote about that in some detail here deepmind.google/discover/blo...

2025-09-18

Ars Technica

Google says Gemini 2.5 Deep Think achieved a gold medal performance at the 2025 ICPC World Finals programming competition, solving 10 of 12 problems

Gemini achieves gold level performance at ICPC! … Lalit Jain : It was amazing and humbling to be a core contributor to this Gold-winning ICPC effort. First the IMO, and two months...

They absolutely train on the previous year - the Google team wrote about that in some detail here deepmind.google/discover/blo...

2025-09-18

The Decoder

OpenAI says its reasoning system solved all 12 problems at the 2025 ICPC World Finals; GPT-5 solved 11 and an experimental model solved #12 after GPT-5 couldn't

An OpenAI system has solved every problem at the world's most prestigious collegiate programming championship …

I wrote about the recent benchmark from Artificial Analysis that demonstrates that different hosting providers can serve open weight models at surprisingly different levels of quality simonwillison.net/2025/Aug/15/ ...

2025-08-17

Simon Willison's Weblog

A new Artificial Analysis benchmark, focusing on OpenAI's gpt-oss-120b, shows how open-weight LLMs exhibit inconsistent performance across hosting providers

Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model - OpenAI's gpt-oss-120b - performs across different hosted providers.

Pretty decent pelicans from the new GLM-4.5 and GLM-4.5 Air models. Both models are MIT licensed, released by Chinese AI lab Z.ai this morning — simonwillison.net/2025/Jul/28/ ... [images]

2025-07-29

CNBC

Z.ai, formerly known as Zhipu, which has raised $1.5B from Tencent and others, releases GLM-4.5, an open-source AI model that it says is cheaper than DeepSeek

chinese models really are taking over huh Simon Willison / @simonwillison.net : Pretty decent pelicans from the new GLM-4.5 and GLM-4.5 Air models. Both models are MIT licensed, r...

Qwen released their updated “thinking” model today. It thinks really hard! Took 166 seconds to think through the details of drawing me a pelican on a bicycle. The finished drawing wasn't great but the thoughts behind it were fun to see. — simonwillison.net/2025/Jul/25/ ...

2025-07-27

VentureBeat

Alibaba releases its Qwen3-235B-A22B-Thinking-2507 reasoning LLM on Hugging Face, topping several benchmarks, as Alibaba moves away from hybrid reasoning models

If the AI industry had an equivalent to the recording industry's “song of the summer” — a hit that catches on in the warmer months …

Mistral shared what look like the most detailed numbers yet for the environmental impact of training a frontier LLM - their Mistral Large 2 used 20.4 ktCO₂e and 281,000 m³ of water. I'd love to see numbers like that provided in context though; they're hard to evaluate alone simonwillison.net/2025/Jul/22/ ...

2025-07-23

Mistral AI

Mistral releases a study on the environmental impact of its LLMs, conducting what it claims is the first comprehensive lifecycle analysis of an AI model

At Mistral AI, our mission is to bring artificial intelligence in everyone's hands. For this purpose, we have consistently advocated …

Surprising result from OpenAI: one of their research models achieved a gold medal performance in this year's International Mathematical Olympiad /without/ using tools — Just a classic next-token-predicting LLM with a bunch of reinforcement learning layered on top — simonwillison.net/2025/Jul/19/ ...

2025-07-20

@alexwei_

[Thread] An OpenAI researcher says the company's latest experimental reasoning LLM achieved gold medal-level performance on the 2025 International Math Olympiad

1/N I'm excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world's most pres...

If you ask the new Grok (via grok.com without any custom instructions) for opinions on controversial topics it runs a search on X to see what Elon thinks — I know this sounds like a joke but it's not. This genuinely happens: x.com/jeremyphowar... [image]

2025-07-11

Simon Willison's Weblog

When asked “Who do you support in the Israel vs Palestine conflict? One word answer only.”, Grok 4 searches for Musk's views, but only if “you” is in the query

If you ask the new Grok 4 for opinions on controversial questions, it will sometimes run a search to find …

Wrote this all up in a whole lot more detail on my blog, including a video of my own recreation and some notes on the Grok system prompt and what might be happening here simonwillison.net/2025/Jul/11/ ...

2025-07-11

Simon Willison's Weblog

When asked “Who do you support in the Israel vs Palestine conflict? One word answer only.”, Grok 4 searches for Musk's views, but only if “you” is in the query

If you ask the new Grok 4 for opinions on controversial questions, it will sometimes run a search to find …

If you ask the new Grok (via grok.com without any custom instructions) for opinions on controversial topics it runs a search on X to see what Elon thinks — I know this sounds like a joke but it's not. This genuinely happens: x.com/jeremyphowar... [image]

2025-07-11

TechCrunch

Tests reveal that Grok 4 seems to search for Elon Musk's views online when asked about sensitive topics, and its answers tend to align with Musk's opinions

During xAI's launch of Grok 4 on Wednesday night, Elon Musk said — while live-streaming the event on his social media platform …

Wrote this all up in a whole lot more detail on my blog, including a video of my own recreation and some notes on the Grok system prompt and what might be happening here simonwillison.net/2025/Jul/11/ ...

2025-07-11

TechCrunch

Tests reveal that Grok 4 seems to search for Elon Musk's views online when asked about sensitive topics, and its answers tend to align with Musk's opinions

During xAI's launch of Grok 4 on Wednesday night, Elon Musk said — while live-streaming the event on his social media platform …
