3:02am. Pager. The production inference cluster was returning the same token sequence for every prompt. Four hours later, it was fixed. These are the notes I wrote before I slept.
The immediate facts
- Latency was normal.
- GPU utilisation was normal.
- Sampled outputs were collapsing onto a single token path.
Root cause: a config change that shipped that morning had silently set the sampling temperature to 0.0. The model was no longer sampling; it was arg-maxing, picking the single most likely token at every step.
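To make the failure mode concrete, here is a minimal sketch of temperature sampling over a toy set of logits. It is an illustration and not the production sampler: the function name `sample_token`, the logit values, and the convention that a temperature of 0 falls back to arg-max are all assumptions.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits at the given temperature."""
    if temperature <= 0.0:
        # Assumed convention: temperature 0 means greedy decoding (arg-max).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.5]  # made-up values for illustration
rng = random.Random(0)    # seeded so the demo is repeatable

# Temperature 0.0: every draw returns the same index.
greedy = {sample_token(logits, 0.0) for _ in range(1000)}

# Temperature 1.0: draws spread across several indices.
sampled = {sample_token(logits, 1.0, rng) for _ in range(1000)}
```

At temperature 0.0 the set `greedy` holds a single index however many times you draw, which is what "collapsing to one path" looks like from the outside.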
Three lessons
- Config is code. The config repo needs the same review gates as the model repo. That change should have failed CI.
- Observability wins. The bug surfaced because the entropy of sampled token distributions was on a dashboard. Without that graph I'd have spent hours looking at the wrong thing.
- Boring logs are gold. A log line I wrote three months ago printed the effective sampling temperature on every cold start. That's what let me triangulate in ten minutes instead of two hours.
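A rough sketch of what "that change should have failed CI" could look like. The config layout (`sampling.temperature`) and the explicit `sampling.allow_greedy` opt-in are hypothetical names invented for this example, not the real schema.

```python
def validate_sampling_config(config):
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    sampling = config.get("sampling", {})
    temperature = sampling.get("temperature")
    # Hypothetical opt-in flag: greedy decoding must be requested on purpose.
    allow_greedy = sampling.get("allow_greedy", False)

    if temperature is None:
        problems.append("sampling.temperature is missing")
    elif isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        problems.append(f"sampling.temperature must be a number, got {temperature!r}")
    elif temperature < 0:
        problems.append("sampling.temperature must not be negative")
    elif temperature == 0 and not allow_greedy:
        problems.append(
            "sampling.temperature is 0.0 (greedy decoding) without sampling.allow_greedy"
        )
    return problems


bad = validate_sampling_config({"sampling": {"temperature": 0.0}})
ok = validate_sampling_config({"sampling": {"temperature": 0.7}})
explicit = validate_sampling_config(
    {"sampling": {"temperature": 0.0, "allow_greedy": True}}
)
```

A CI job would run this against every changed config file and exit non-zero when the list is not empty. The point of the opt-in flag is that greedy decoding stays possible, but never by accident.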
What I changed the next morning
- Required code review on the config repo.
- Added a health check that samples 30 prompts at startup and fails the instance if the entropy of the sampled tokens is below a threshold.
- Wrote this post.
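For anyone who wants to build a similar startup check, here is a minimal sketch of the idea. It assumes a hypothetical callable `sample_next_token(prompt)` that returns one sampled token per call. The 20 draws per prompt and the 0.5-bit threshold are placeholder values, not the ones in production; only the 30 prompts come from the real check.

```python
import math
import random
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def startup_health_check(sample_next_token, prompts,
                         draws_per_prompt=20, min_mean_entropy_bits=0.5):
    """Return (passed, mean_entropy) averaged over all prompts."""
    if not prompts:
        raise ValueError("need at least one prompt")
    entropies = []
    for prompt in prompts:
        # Draw repeatedly from the same prompt; greedy decoding never varies.
        draws = [sample_next_token(prompt) for _ in range(draws_per_prompt)]
        entropies.append(empirical_entropy(draws))
    mean_entropy = sum(entropies) / len(entropies)
    return mean_entropy >= min_mean_entropy_bits, mean_entropy


prompts = [f"prompt {i}" for i in range(30)]
rng = random.Random(0)

# Stand-in for a model stuck at temperature 0.0: always the same token.
greedy_ok, greedy_entropy = startup_health_check(lambda p: "the", prompts)

# Stand-in for a healthy sampler: four equally likely tokens.
healthy_ok, healthy_entropy = startup_health_check(
    lambda p: rng.choice(["the", "a", "an", "this"]), prompts
)
```

The stuck stand-in scores exactly zero bits and fails; the healthy one passes comfortably. Measuring entropy from repeated draws, rather than from the model's raw probabilities, means the check sees the effective sampling behaviour, including whatever the config did to the temperature.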