
Different LLMs do Different Things

Articles, Jun 10, 2025

I built a tool and a workshop to let testers experiment with generating code from tests. The tool uses Claude 3.5 Sonnet, exchanging messages with it via Simon Willison’s llm library.

I picked Claude 3.5 Sonnet because it produced code that ran when I asked for code – but did I need it?

llm lets me switch models while keeping everything else much the same. Indeed, that's one of the key reasons I used it. I tried switching models; this is a summary of what I found.
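For readers who haven't used it, switching models with the llm Python library is roughly a one-line change. This is a minimal sketch, not my tool: the model IDs depend on which plugins (llm-anthropic, llm-ollama, and so on) are installed, so treat the names and the prompt wording as indicative.

```python
import llm

# Swap the model ID to change provider; everything else stays the same.
# IDs depend on installed plugins, e.g. llm-anthropic or llm-ollama.
model = llm.get_model("claude-3.5-sonnet")   # or "gpt-4", "gpt-4o-mini", "qwen3:8b", ...

response = model.prompt(
    "Here are the failing tests. Return one complete source file, code only.",
    system="You write code that makes the given tests pass.",
)
print(response.text())
```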

Qwen3 – local, competent, recidivist

I rejoiced when I found I could run Qwen3 on my own machine: I could aim my code at a coding-capable LLM without going off to an intermittent and pricey provider. I've got Ollama on my 32GB M2 Max, and use msty to manage and work with models. Qwen3 runs fine.

Qwen3 writes code with the right syntax, fits the tests, finds viable alternatives, and surprised me by being faster locally than any competent LLM in the cloud. I run the 8B variant, which has a context window of 128K tokens. When I put /no_think in the prompt, it reliably offers output without reasoning. I use it regularly for general coding queries via msty if I'm on a train or fancy a change.
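In practice that looks something like the sketch below. It assumes the llm-ollama plugin is installed and that a qwen3:8b model has been pulled locally; both the model tag and the prompt are illustrative.

```python
import llm

# Assumes the llm-ollama plugin is installed and "qwen3:8b" has been pulled via Ollama.
qwen = llm.get_model("qwen3:8b")

# Appending /no_think asks Qwen3 to skip its reasoning preamble and answer directly.
response = qwen.prompt("Return only Python code for a leap-year check. /no_think")
print(response.text())
```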

But Qwen3 never passed all the tests, ever – it would fix one bug, and make another. And I couldn't share it easily in the workshop.

I tried the local models DeepSeek-R1 and Llama 3.2, got code that didn't run, and stopped trying.

4o-mini – special cases, blind to structure

I messed about with 4o-mini. It did two things badly enough to stop me pretty soon:

  • One suite of tests checks a bunch of dates. They're all Easter Sunday, though that's not mentioned explicitly. Three times out of four, Claude and GPT-4 recognised that the dates followed a pattern, and gave me a complex algorithm that reflected the pattern, typically naming it with something close to one of the accepted ways of calculating Easter. 4o-mini always gave me back code that responded to each specific date; there was no generalisation, no algorithm.
  • I gave it a test (and hints) to assert that its proposed Python code could be used as a module in a package (a sketch of that kind of test follows this list). It never once wrote code that was a module, failing over and over again in the same way – so none of the other tests even ran.
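Here's a rough sketch of the kind of structural test meant in that second bullet; the package layout and the function name are hypothetical, invented purely for illustration.

```python
# Hypothetical test: the generated file must behave as a module inside a package,
# so the rest of the suite can import it. Names below are illustrative only.
import importlib

def test_generated_code_is_an_importable_module():
    module = importlib.import_module("generated.solution")   # assumed package layout
    assert callable(getattr(module, "solve", None))           # assumed entry point
```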

GPT-4 – monotonic, small context

I tried GPT-4. It wrote code that passed most of the tests – but when the script passed back the failing tests and asked for corrections, GPT-4 gave me new code that failed most of the same tests in the same way. In particular, it kept getting module / function signatures wrong.

GPT-4 also failed in a way that would have been entirely comprehensible in early 2024, but felt odd in mid-2025: it said “no” because I’d asked for too much. GPT-4's 'context length' (the amount of input it will pay attention to) is 8,192 tokens. The second iteration within a conversation typically produced not code but a message along the lines of: «This model's maximum context length is 8,192 tokens. However, your messages resulted in 8570 tokens.».

I typically let a 'conversation' run for three tries. If I want my tool to use GPT-4, I either need smaller tests / code / failures / rules, or I need to duck out of the conversation after fewer tries.
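For context, the shape of one of these 'conversations' is roughly the loop below. It's a sketch under assumptions – the prompt wording, file handling and test command are all made up – rather than the workshop tool itself.

```python
import subprocess
import llm

MAX_TRIES = 3  # the tool usually gives a conversation three attempts

def code_from_tests(model_id: str, first_prompt: str, test_cmd: list[str], target: str) -> bool:
    """Ask for code, run the tests, feed failures back; stop when green or out of tries."""
    conversation = llm.get_model(model_id).conversation()
    prompt = first_prompt
    for _ in range(MAX_TRIES):
        code = conversation.prompt(prompt).text()
        with open(target, "w") as f:          # whole-file substitution, no patching
            f.write(code)
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True                        # every test passed
        prompt = ("These tests failed; return the whole corrected file, code only:\n"
                  + result.stdout + result.stderr)
    return False
```

Each round re-sends the code plus the failure output, which is exactly why an 8K context fills up by the second iteration.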

Perils of 'thinking' models

'Thinking' models get in the way.

When you’re asking for code, only code, no fences or braces or explanations, you’re fundamentally frustrated by a model that starts replies with ‘So what I think the person is asking for is…’, or returns several hunks of neatly-fenced code interspersed with explanations.

I tried a couple of approaches: asking the thing not to think (as in the /no_think instruction to Qwen3), asking for just the output, and parsing the output to pick up only code delimited by three backticks. I typically spent more time wrangling the output than I wanted, and went back – perhaps temporarily – to something that was reliable.
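The backtick-parsing approach amounts to something like the sketch below – a rough version, assuming the reply uses standard triple-backtick fences.

```python
import re

# Matches ```python ... ``` style fences; assumes standard triple-backtick formatting.
FENCE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def extract_code(reply: str) -> str:
    """Return only the fenced code from a model reply, or the whole reply if unfenced."""
    blocks = FENCE.findall(reply)
    return "\n".join(blocks) if blocks else reply
```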

Aside: if you're used to Copilot, Cody or Cursor, you might notice that those tools do a special (and fragile) thing for you: they insert changed code into the right place. My tools don't have those smarts – they generate one whole file and substitute the lot.

Claude 3.5 Sonnet

Let's recognise a bias: I started with Claude 3.5 Sonnet, I've used it the most, and I continue to use it. There is every chance that I've built the workshop's prompts and processes to this model's strengths and managed its weaknesses. Switching my choice of Large Language Model in one place in the code changes none of the infrastructure implied by the rest of my tools. Still...

I've seen Claude 3.5 Sonnet change architecture on the basis of one test, switching from basic to packaged Python, and from ESM (uses import) to CJS (uses require) JavaScript. I've seen it change fundamental approaches as its first attempt fails to pass tests, progressing from no code to 3 failing tests to 1 failing test, back to 8 failures – then to code that passes the lot on the next attempt. I've seen it tick-tock between code that fails one test or another, then find a way through, but I've rarely seen it fail the same test over and over again without the test itself being problematic. I've seen it merrily run through 8 consecutive attempts in one conversation without touching the edges of its massive 200K context window.

Claude 3.7 seems to keep those qualities (if one ensures thinking is off). Its output context window is 64K (to 3.5's 8K) so I'd switch for larger outputs. I've not yet tried Claude 4.

Conclusion

Here's what I needed from the LLM for my workshop, and ways I pushed when the LLMs didn't satisfy those needs.

  • a large context window so that I could give it plenty to chew on – I could with some parsing have shrunk my requests, but GPT-4's 8K is no match for Claude's 200K and Qwen3's 128K.
  • easily wrangle-able output – I could have managed this with more parsing
  • enough variation in its patterns that it could offer a variety of approaches – I could have managed this (perhaps) by tuning the 'temperature' (i.e. randomness) of the models in my request (a sketch follows this list)
  • enough coding (and testing) patterns that it could infer code from tests – I could offer more-explicit hints via names, comments and rules files.
  • a publicly-accessible API with a plugin for the llm tool – I wondered about running something open-source on Replicate to avoid sharing API keys.
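On the temperature point, the llm Python API can pass sampling options through to a model, assuming the provider's plugin exposes them; the model ID, prompt and value below are indicative only.

```python
import llm

model = llm.get_model("claude-3.5-sonnet")   # model ID depends on installed plugins

# Higher temperature means more varied output; the exact option name and range
# depend on the provider, so treat this as an assumption rather than a guarantee.
response = model.prompt("Propose Python code to pass these tests. Code only.",
                        temperature=1.0)
print(response.text())
```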

Claude 3.5 Sonnet is good enough for the workshop – but 3.7 aside, I've not yet found a viable alternative.


James Lyndsay

Getting better at software testing. Singing in Bulgarian. Staying in. Going out. Listening. Talking. Writing. Making.
