How do LLM Makers Assess their Models?
Anthropic, OpenAI, Google, and others release System Cards / Model Cards for their LLMs. These documents describe a model's capabilities and limitations, and set out the evidence the makers use for those claims.
The system card for Anthropic's Claude 3.5 (June 2024) has 6 pages of assessments and benchmarks, from reasoning to safety. The system card for Claude Mythos Preview (April 2026) has 200+ pages covering far more – including risks of deployment, honesty and evasion, responses to an irritatingly repetitive prompt, and an analysis by a psychologist.
Those cards describe the model – and as some aspects of a model are emergent, they include results of testing and exploration. The system cards give us hints about how the makers have approached that work.
They're illuminating for testers – not only in describing the models, but in describing how the makers seek to sense their models. We might learn from them, understanding the breadth and aims of their exploration of whatever it is that they've made.
Exercise
10 mins – solo
There's no realistic hope of understanding a model card in a few minutes. So let's just dive in and find something that we can share with other testers.
Pick a model card for a model you've used (or have heard of). Skim it. Pick out something, of interest to you as a tester, that you'd like to share.
10 mins – collective
Let's exchange what we've found and talk. What are the surprises, and what are the signals?
Sources


Model cards may have originated with this paper:

Futurist Rob Hoeijmakers sets out his thoughts on the cards and the information (and signals) they hold:

and on benchmarks and more deterministic tests:

