3:02am. Pager. The production inference cluster was returning the same token sequence for every prompt. Four hours later, it was fixed. These are the notes I wrote before I slept.

The immediate facts

  • Latency was normal.
  • GPU utilisation was normal.
  • Token probabilities were collapsing to one path.

Root cause: a config change had shipped that morning which silently set the sampling temperature to 0.0. The model was no longer sampling; it was arg-maxing.

Three lessons

  1. Config is code. The config repo needs the same review gates as the model repo. That change should have failed CI.
  2. Observability wins. The bug surfaced because the entropy of sampled token distributions was on a dashboard. Without that graph I'd have spent hours looking at the wrong thing.
  3. Boring logs are gold. A log line I wrote three months ago printed the effective sampling temperature on every cold start. That's what let me triangulate in ten minutes instead of two hours.

What I changed the next morning

  • Required code-review on the config repo.
  • Added a health check that samples 30 prompts at startup and rejects if entropy is below a threshold.
  • Wrote this post.