What I learned debugging a 70B model at 3am

3:02am. Pager. The production inference cluster was returning the same token sequence for every prompt. Four hours later, it was fixed. These are the notes I wrote before I slept.

The immediate facts

Latency was normal.
GPU utilisation was normal.
Token probabilities were collapsing to one path.

Root cause: a config change had shipped that morning which silently set the sampling temperature to 0.0. The model was no longer sampling; it was arg-maxing.

Three lessons

Config is code. The config repo needs the same review gates as the model repo. That change should have failed CI.
Observability wins. The bug surfaced because the entropy of sampled token distributions was on a dashboard. Without that graph I'd have spent hours looking at the wrong thing.
Boring logs are gold. A log line I wrote three months ago printed the effective sampling temperature on every cold start. That's what let me triangulate in ten minutes instead of two hours.

What I changed the next morning

Required code-review on the config repo.
Added a health check that samples 30 prompts at startup and rejects if entropy is below a threshold.
Wrote this post.

What I learned debugging a 70B model at 3am

The immediate facts

Three lessons

What I changed the next morning

Read next

Concept drift in production: keeping maintenance models honest

The research engineer's toolbox, 2026 edition

Agentic AI: where the hype breaks and the engineering begins