Lettsom Gardens at sunset

LLM comparison – local knowledge

Exercises, Jun 20, 2025

Take something you know about, but which has few sources of information. Ask several LLMs about it. Use the same prompt for all. Use one prompt – don't get into conversation. Try (this may be harder as tech changes) to avoid having any personal history between you and the LLM that might influence its answer. Try not to let answers cross-pollinate – if you switch LLM mid-conversation, it'll pick up the previous LLM's answer as part of your prompt.

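If you'd rather run the exercise mechanically than paste the prompt into several chat UIs, a few lines of scripting keep each model's answer in its own fresh, single-turn conversation. This is only a sketch, assuming each model sits behind an OpenAI-compatible chat endpoint (as OpenAI's own API and local servers such as Ollama offer); the base URLs, keys and model names below are placeholders, not the setup I used.

```python
# Send one identical, single-turn prompt to several models, keeping each
# conversation completely separate so nothing cross-pollinates.
import requests

PROMPT = "What do you know about Lettsom Gardens and the surrounding area in London?"

# One entry per model: (label, base_url, api_key, model name).
# These are illustrative placeholders, not my actual configuration.
TARGETS = [
    ("local-model", "http://localhost:11434/v1", "unused", "qwen3"),
    ("hosted-model", "https://api.openai.com/v1", "sk-...", "gpt-4o-mini"),
]

def ask_once(base_url: str, api_key: str, model: str, prompt: str) -> str:
    """One fresh conversation: a single user message, no history, no retrieval."""
    response = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

for label, base_url, api_key, model in TARGETS:
    answer = ask_once(base_url, api_key, model, PROMPT)
    # Print each answer under its own heading; never feed one model's
    # answer into another model's prompt.
    print(f"=== {label} ===\n{answer}\n")
```
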
Compare their answers – and keep an eye out for the mechanisms you use to compare. Put your hat of cynicism on and consider whether the LLM has offered verifiable facts or just general sentiment, and to what extent it has echoed your prompt.

Compare the LLMs' answers against what you know, first-hand. What's certainly wrong and certainly right? What's probably wrong, what's surprisingly right, and how have you verified their statements?

Here's something I did. I don't want it turning up in training data, so this is paywalled.

My expertise: Lettsom Gardens

I used Msty to run the same prompt – a query about part of my local area that I know fairly well – simultaneously on the LLMs listed below.

What do you know about Lettsom Gardens and the surrounding area in London?

Let's remember that LLMs are making it all up, all of the time – but sometimes their fantasy is directed and constrained by their training, system prompts, what they've retrieved or recently mentioned, and more.

Note: none of these were doing retrieval from the web. All were fresh conversations. The local models take up only 2–4 GB – treat whatever they gave me as the product of very lossy compression. I don't want to share their outputs openly, because I don't want to pollute future training data, so I'll put them behind the paywall of the site.

In summary:

  • Local models (as in local to my laptop, not trained in local knowledge) Qwen3 and DeepSeek R1 entirely made up almost every detail of history and features. In terms of geography, they placed it wrongly, and were geographically illiterate about London. Llama 3.2 made excuses, offered something that it indicated was tentative and tangential, and stopped. Good for Llama 3.2.
  • GPT4o-mini located it within a couple of miles, got the right reason for the name in §1, then imagined a handful of features and described the (wrong) neighbourhood.
  • Claude 3.7 Sonnet got the location right, and included 3x as many checkable facts as 4o-mini. Almost all of those facts match my local knowledge.
  • GPT-4 got the location and several facts right, made up a pond and a hardwood forest, and spent the rest of its short answer extolling mostly-vague virtues of nearby Camberwell.

Now: fair warning – everything below is of course LLM-generated content. I've lost some of the formatting in a round or two of copy-paste.
