
Different LLMs do Different Things

Articles, Jun 10, 2025

I built a tool and a workshop to let testers experiment with generating code from tests. The tool uses Claude 3.5 Sonnet, exchanging messages with it via Simon Willison’s llm library.

I picked Claude 3.5 Sonnet because it produced code that ran when I asked for code – but did I need it?

llm lets me switch models while keeping everything else much the same. Indeed, that's one of the key reasons I used it. I tried switching models; this is a summary of what I found.
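For readers who haven't used it, switching models with the llm Python library is roughly a one-line change. This is a minimal sketch, not my tool: the model IDs depend on which plugins (llm-anthropic, llm-ollama, and so on) are installed, so treat the names and the prompt wording as indicative.

```python
import llm

# Swap the model ID to change provider; everything else stays the same.
# IDs depend on installed plugins, e.g. llm-anthropic or llm-ollama.
model = llm.get_model("claude-3.5-sonnet")   # or "gpt-4", "gpt-4o-mini", "qwen3:8b", ...

response = model.prompt(
    "Here are the failing tests. Return one complete source file, code only.",
    system="You write code that makes the given tests pass.",
)
print(response.text())
```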

Qwen3 – local, competent, recidivist

I rejoiced when I found I could run Qwen3 on my own machine: I could aim my code at a coding-capable LLM without going off to an intermittent and pricey provider. I've got Ollama on my 32GB M2 Max, and use msty to manage and work with models. Qwen3 runs fine.

Qwen3 writes code with the right syntax, fits the tests, finds viable alternatives, and surprised me by being faster locally than any competent LLM in the cloud. I run the 8B variant, which has a context window of 128K tokens. When I put /no_think in the prompt, it reliably offers output without reasoning. I use it regularly for general coding queries via msty if I'm on a train or fancy a change.
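In practice that looks something like the sketch below. It assumes the llm-ollama plugin is installed and that a qwen3:8b model has been pulled locally; both the model tag and the prompt are illustrative.

```python
import llm

# Assumes the llm-ollama plugin is installed and "qwen3:8b" has been pulled via Ollama.
qwen = llm.get_model("qwen3:8b")

# Appending /no_think asks Qwen3 to skip its reasoning preamble and answer directly.
response = qwen.prompt("Return only Python code for a leap-year check. /no_think")
print(response.text())
```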

But Qwen3 never passed all the tests, ever – it would fix one bug, and make another. And I couldn't share it easily in the workshop.

I tried the local models DeepSeek-R1 and Llama 3.2, got code that didn't run, and stopped trying.

4o-mini – special cases, blind to structure

I messed about with 4o-mini. It did two things badly enough to stop me pretty soon:

  • One suite of tests checks a bunch of dates. They're all Easter Sunday, though that's not mentioned explicitly. Three times out of four, Claude and GPT-4 recognised that the dates followed a pattern, and gave me a complex algorithm that reflected the pattern, typically naming it with something close to one of the accepted ways of calculating Easter. 4o-mini always gave me back code that responded to each specific date; there was no generalisation, no algorithm.
  • I gave it a test (and hints) to assert that its proposed Python code could be used as a module in a package (a sketch of that kind of test follows this list). It never once wrote code that was a module, failing over and over again in the same way – so none of the other tests even ran.
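Here's a rough sketch of the kind of structural test meant in that second bullet; the package layout and the function name are hypothetical, invented purely for illustration.

```python
# Hypothetical test: the generated file must behave as a module inside a package,
# so the rest of the suite can import it. Names below are illustrative only.
import importlib

def test_generated_code_is_an_importable_module():
    module = importlib.import_module("generated.solution")   # assumed package layout
    assert callable(getattr(module, "solve", None))           # assumed entry point
```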

GPT-4 – monotonic, small context

I tried GPT-4. It wrote code that passed most of the tests – but when the script passed back the failing tests and asked for corrections, GPT-4 gave me new code that failed most of the same tests in the same way. In particular, it kept getting module / function signatures wrong.

GPT-4 also failed in a way that would have been entirely comprehensible in early 2024, but felt odd in mid-2025: it said “no” because I’d asked for too much. GPT-4's 'context length' (the amount of input it will pay attention to) is 8,192 tokens. The second iteration within a conversation typically produced not code but a message along the lines of: «This model's maximum context length is 8,192 tokens. However, your messages resulted in 8570 tokens.».

I typically let a 'conversation' run for three tries. If I want my tool to use GPT-4, I either need smaller tests / code / failures / rules, or I need to duck out of the conversation after fewer tries.
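For context, the shape of one of these 'conversations' is roughly the loop below. It's a sketch under assumptions – the prompt wording, file handling and test command are all made up – rather than the workshop tool itself.

```python
import subprocess
import llm

MAX_TRIES = 3  # the tool usually gives a conversation three attempts

def code_from_tests(model_id: str, first_prompt: str, test_cmd: list[str], target: str) -> bool:
    """Ask for code, run the tests, feed failures back; stop when green or out of tries."""
    conversation = llm.get_model(model_id).conversation()
    prompt = first_prompt
    for _ in range(MAX_TRIES):
        code = conversation.prompt(prompt).text()
        with open(target, "w") as f:          # whole-file substitution, no patching
            f.write(code)
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True                        # every test passed
        prompt = ("These tests failed; return the whole corrected file, code only:\n"
                  + result.stdout + result.stderr)
    return False
```

Each round re-sends the code plus the failure output, which is exactly why an 8K context fills up by the second iteration.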

Perils of 'thinking' models

'Thinking' models get in the way.

When you’re asking for code, only code, no fences or braces or explanations, you’re fundamentally frustrated by a model that starts replies with ‘So what I think the person is asking for is…’, or returns several hunks of neatly-fenced code interspersed with explanations.

I tried a couple of approaches: asking the thing not to think (as in the /no_think instruction to Qwen3), asking for just the output, and parsing the output to pick up only code delimited by three backticks. I typically spent more time wrangling the output than I wanted, and went back – perhaps temporarily – to something that was reliable.
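The backtick-parsing approach amounts to something like the sketch below – a rough version, assuming the reply uses standard triple-backtick fences.

```python
import re

# Matches ```python ... ``` style fences; assumes standard triple-backtick formatting.
FENCE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def extract_code(reply: str) -> str:
    """Return only the fenced code from a model reply, or the whole reply if unfenced."""
    blocks = FENCE.findall(reply)
    return "\n".join(blocks) if blocks else reply
```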

Aside: if you're used to Copilot, Cody or Cursor, you might notice that those tools do a special (and fragile) thing for you: they insert changed code into the right place. My tools don't have those smarts – they generate one whole file and substitute the lot.

Claude 3.5 Sonnet

Let's recognise a bias: I started with Claude 3.5 Sonnet, I've used it the most, and I continue to use it. There is every chance that I've built the workshop's prompts and processes to this model's strengths and managed its weaknesses. Switching my choice of Large Language Model in one place in the code changes none of the infrastructure implied by the rest of my tools. Still...

I've seen Claude 3.5 Sonnet change architecture on the basis of one test, switching from basic to packaged Python, and from ESM (uses import) to CJS (uses require) JavaScript. I've seen it change fundamental approaches as its first attempt fails to pass tests, progressing from no code to 3 failing tests to 1 failing test, back to 8 failures – then to code that passes the lot on the next attempt. I've seen it tick-tock between code that fails one test or another, then find a way through, but I've rarely seen it fail the same test over and over again without the test itself being problematic. I've seen it merrily run through 8 consecutive attempts in one conversation without touching the edges of its massive 200K context window.

Claude 3.7 seems to keep those qualities (if one ensures thinking is off). Its output context window is 64K (to 3.5's 8K) so I'd switch for larger outputs. I've not yet tried Claude 4.

Conclusion

Here's what I needed from the LLM for my workshop, and ways I pushed when the LLMs didn't satisfy those needs.

  • a large context window so that I could give it plenty to chew on – I could with some parsing have shrunk my requests, but GPT-4's 8K is no match for Claude's 200K and Qwen3's 128K.
  • easily wrangle-able output – I could have managed this with more parsing
  • enough variation in its patterns that it could offer a variety of approaches – I could have managed this (perhaps) by tuning the 'temperature' (i.e. randomness) of the models in my request (a sketch follows this list)
  • enough coding (and testing) patterns that it could infer code from tests – I could offer more-explicit hints via names, comments and rules files.
  • a publicly-accessible API with a plugin for the llm tool – I wondered about running something open-source on Replicate to avoid sharing API keys.
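On the temperature point, the llm Python API can pass sampling options through to a model, assuming the provider's plugin exposes them; the model ID, prompt and value below are indicative only.

```python
import llm

model = llm.get_model("claude-3.5-sonnet")   # model ID depends on installed plugins

# Higher temperature means more varied output; the exact option name and range
# depend on the provider, so treat this as an assumption rather than a guarantee.
response = model.prompt("Propose Python code to pass these tests. Code only.",
                        temperature=1.0)
print(response.text())
```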

Claude 3.5 Sonnet is good enough for the workshop – but 3.7 aside, I've not yet found a viable alternative.


James Lyndsay

Getting better at software testing. Singing in Bulgarian. Staying in. Going out. Listening. Talking. Writing. Making.
