
Exercise: The Ralph Loop

Exercises · Jan 22, 2026 (Jan 28, 2026)

Workroom PlayTime – you'll probably need to run this on your own machine to get hands-on.

What's a Ralph Loop? A Ralph Loop is an approach to generating code.


By putting a simple loop around code generation, a 'fresh' LLM is always used to generate code. A 'fresh' LLM has all the relevant context, unpolluted by its own earlier answers or by irrelevant earlier tasks. This forces design and spec work out of the generation context, forces small and checkable tasks, and forces documentation that allows a new LLM to pick up where the old one died. By 'forces', I mean that the approach needs that pre- and post-work to happen in order to be useful.
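The whole mechanism really is just a loop around a fresh process. Here's a hypothetical Python rendering of the idea — the function name, the `max_turns` safety valve and the return value are my own additions for illustration; the real thing is a bash one-liner:

```python
import subprocess

def ralph_loop(agent_cmd, prompt_path="prompt.md", max_turns=5):
    """Hand the same prompt to a *fresh* agent process, over and over.

    Each turn starts a brand-new process, so the agent's context holds
    only the prompt and whatever repo docs it chooses to read -- nothing
    leaks across turns.  `agent_cmd` would be something like
    ["claude", "--dangerously-skip-permissions"]; `max_turns` is a
    safety valve this sketch adds (the real loop runs until killed).
    Returns the number of turns completed.
    """
    prompt = open(prompt_path).read()
    turns = 0
    for _ in range(max_turns):
        # One fresh agent per turn: no chat history, no carried context.
        subprocess.run(agent_cmd, input=prompt, text=True,
                       capture_output=True)
        turns += 1
    return turns
```

The point of the sketch is what's *not* there: no memory between turns, no accumulating conversation — everything the agent needs has to live in the repo's docs.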

How is this relevant to testers – or at least to a testing mindset? With design and spec work under human control and out of the generation steps, testers can influence as valuably as ever. With checkable tasks, testers describe the checks (and audit the ways those checks are made). With the weird things LLMs do, testers can develop their whiskers for the weird and get right into the loop, stopping to reframe and re-aim the work.

In this exercise, we'll use a one-liner to iteratively generate a pure function. We'll use my own version of A Library with no Code, which I've made into a Python library. I've asked Claude to spec out some approaches to adding daylight saving time, and run a Ralph Loop through several iterations – each time working through one of the small checkable tasks. I've got checkpoints, so I can 'rewind' my environment and re-run the loop. The repo has commits, so you should be able to do that, too.
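For flavour, a relative-time function of the kind the library implements might look like this. This is my own illustrative sketch — the real repo's function names, thresholds and wording will differ:

```python
from datetime import datetime

def relative_time(then: datetime, now: datetime) -> str:
    """Return a human-friendly description of how long ago `then` was.

    Illustrative only: thresholds and phrasing are invented, not copied
    from the repo.  A pure function like this is easy to pin down with
    example-based tests, which is what makes it a good subject for a
    Ralph Loop.
    """
    seconds = (now - then).total_seconds()
    if seconds < 60:
        return "just now"
    if seconds < 3600:
        return f"{int(seconds // 60)} minutes ago"
    if seconds < 86400:
        return f"{int(seconds // 3600)} hours ago"
    return f"{(now - then).days} days ago"
```

Each small checkable task — say, handling a daylight-saving boundary — changes one corner of a function like this, and the examples pin down the rest.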

Dragging a repo into a Ralph Loop

To get to a working exercise, I started with something that I knew could generate working code – Drew Breunig's examples and instructions to generate a relative-time library. I asked Claude Code to generate tests from the examples, and code to pass those tests. Then I asked Claude to come up with a spec to add daylight saving time (which implies knowing what and where those times are), to make a plan of action to turn useful examples into tests and to write code to pass those tests, and to update the docs. Once I had fair docs, I set a Ralph Loop on the problem of making the code.

I've taken this repo through the following (here is the commit history). Note: I failed to fork it properly, so it's no longer linked to the original.

  • a conversation with an LLM that took the code-free library to a Python implementation in /bin with passing tests derived from existing examples (sadly: derived from. I've got a JS one that uses the examples directly as tests)
  • a conversation with an LLM that led to a set of docs describing a change
  • several further conversations, each an individual turn around a Ralph Loop, each typically producing two commits – one for tests / test-passing code, and one for the documentation, updating decisions.md, implementation_plan.md and progress.md .
  • a final commit where I noticed that the repo was missing the vital prompt.md , and included INSTALL.md and handover.md which I had binned after a couple of runs round the loop.

To get properly hands-on, you'll need a safely-sandboxed environment with a tool-using agent (and its tools) installed, a key with enough credit, and the repo.

To get to this point, I have used sprites.dev for the sandboxed environment, I've installed Claude Code and Pytest on that environment, and I've connected the environment to my github account and to my Anthropic account.

I can't duplicate that environment and I can't easily share commandline access to the one I have. So for Workroom PlayTime 046, I don't think I can guarantee to set that up for you in a worthwhile way. You can do it yourself, or I'll demo it – we'll watch and talk.

If you want to do it yourself, then you'll need to set up something like the environment described above (sprites.dev is very handy for this: it has Claude installed and talks with various services and tools), clone the repo, and use the commandline below to pass the prompt into your agent (claude in the prompt – meaning Claude Code) over and over again. You'll also need to rewind the commits a bit, but keep prompt.md and probably bin handover.md and INSTALL.md.

Costs

The Ralph Loop seems to be a thought experiment that happens to work as a software generation technique, and is not an efficient use of tokens (as a proxy for real money, energy or resources). Money: in my experiment, each iteration (i.e. each small checkable task) used $1-$2 and took 5-10 minutes – so it took the best part of an hour and $10 to add daylight saving time to the algorithm. By comparison, my 'magic loop' was cheaper, using deterministic tools to run tests, check syntax, and do commits, and ~80 people playing used under $20 in an afternoon. Claude Code can 'one-shot' the base library above (and its tests) for around $3, and Amp can build the Testing Transparently web page app for around $4.

On the other hand, the approach gives an LLM a good shot at greatness, by giving it great context. If you've got a competent local LLM or unlimited tokens, then maybe the cost is less of a problem.

Exercise

You've got the repo and you're all set up with an agent? Smashing. If not, I'll share my screen – and you've still got the files to look at.

  • Make a branch
  • Rewind a few commits
  • Run the one-line bash command below.
  • Watch your CLI agent do its thing while talking
  • Review while talking – see Debrief below

Ralph Loop

while true; do cat prompt.md | claude --dangerously-skip-permissions; done

Yeah – it is that simple/stupid. And that's because the Ralph Loop is just a loop – the infrastructure (local and specific / vast and generic) is elsewhere. The key things this enforces / expects / needs are:

  • fresh new context every new task
  • lots of documentation (specs, examples, plans, glossaries) to give context to the prompt below
  • a test harness (and runnable tests)
  • working code that needs a small change
  • a tool-enabled LLM that can write and run checks, write code, notice / diagnose / fix problems, work with git, parse the docs, prioritise work and set up new work.
  • a human watching, judging and steering where necessary

In terms of what's actually going on...

  • You set an initial list of small checkable tasks, each with some completion criteria. You might use an LLM for that. This setting of checkable tasks is absolutely necessary, and is where our skills as testers get used.
  • The Ralph Loop effectively queues up fresh LLMs.
  • It starts a new LLM-on-the-commandline instance, passing the prompt.
  • That fresh LLM fills its own context from the docs and picks its own task from a list in a doc. It goes to work, and you watch it, use a second terminal window to check its work, fire up another task, or go get a cup of tea.
  • If the LLM goes off the rails (again, tester skills), you kill it. Before you let the Ralph Loop restart, you reframe the tasks and the instructions (tester skills) – and that is what shapes the final product.
  • When the LLM completes a task, it tidies up for the next LLM, and waits for you to approve.
  • You can fiddle with the docs etc. here. If you need to fiddle with the prompt, you'll need to kill the loop. Core thought: If the task is done, you kill it and the Ralph Loop ushers in the next LLM.

contents of my prompt.md

study handover.md
study implementation_plan.md
pick the one most important thing to do that has not yet been done. That is your target. Concentrate only on the target.
IMPORTANT
* tests come first – add examples to tests.yaml for your target and run the tests. Expect the tests for your new examples to fail – your target does not yet exist in the code
* change existing code to pass the tests
* ALWAYS run the tests after changing the code
* when the tests pass, commit
* after committing, append a short narrative to progress.md, adjust and add to decisions.md, and change implementation_plan.md to reflect your completed target, and any new or unfinished work. We're not going to use handover.md again.

Note – I had imagined, based on previous runs, that the tests would ingest the examples, and dumbly based this prompt on that assumption. But in this repo they don't. I'll ask for it explicitly next time.
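What I'd wanted looks roughly like this: the examples file is the test data, so writing an example is writing a test. Everything here — the schema, the field names, the harness — is invented for illustration, not taken from the repo:

```python
from datetime import datetime

# Invented schema: each example pairs two ISO timestamps with the
# expected phrase.  In the repo, this list would be ingested directly,
# e.g. EXAMPLES = yaml.safe_load(open("tests.yaml")).
EXAMPLES = [
    {"then": "2026-01-22T11:59:30", "now": "2026-01-22T12:00:00",
     "expect": "just now"},
]

def run_examples(fn, examples=EXAMPLES):
    """Check `fn` against every example; return (example, got) failures.

    With pytest, the same idea is @pytest.mark.parametrize over the
    loaded examples -- adding an example to the yaml adds a test.
    """
    failures = []
    for ex in examples:
        got = fn(datetime.fromisoformat(ex["then"]),
                 datetime.fromisoformat(ex["now"]))
        if got != ex["expect"]:
            failures.append((ex, got))
    return failures
```

Ingesting examples directly keeps the tests and the spec from drifting apart — which matters here, because the examples are most of the spec.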

Debrief

pick one

  • We'll compare and contrast our building experiences
  • We'll look at the artefacts it makes, and see what they tell us
  • We'll consider what we can give the loop to help it – and how much tests are part of that
  • We'll consider the usefulness of a system called into being without tests, and yet which makes up its tests as it goes.

Sources

Here's Geoffrey Huntley's earlier post, later post, and recent 'first principles' demo/video.

Here's a more measured video from Matt Pocock.

Here's a couple of fair descriptions: Autonomous Loops (more to do with the Claude plugin) and 2026 – the Year of the Ralph Loop (around the Cursor plugin). Neither is great, as both focus on specific implementations rather than the approach as a thought-provoking technique.


That Ralph plugin? I haven't used it, and staying closer to the metal on the commandline gives me enough insight. The originator says the plugin 'isn't it', and as the plugin seems to treat context fundamentally differently, I'm inclined to agree. It won't be part of this workshop.


James Lyndsay

Getting better at software testing. Singing in Bulgarian. Staying in. Going out. Listening. Talking. Writing. Making.