Public benchmarks are a useful signal. They are also a trap.

When to ignore them

  • Your deployment is narrow. Domain-specific accuracy is all that matters.
  • You care about latency or cost. Benchmarks rank capability, not deployability.
  • Your domain has its own language or jargon. A benchmark built on English medical text tells you little about Portuguese legal text.
  • The benchmark is old. If the test set has leaked into a model's pretraining corpus, the score measures memorization, not capability.
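
The last point, a test set that has leaked into pretraining data, can sometimes be checked directly. The sketch below is one rough heuristic, not a standard tool: it assumes you have a text sample of the training corpus (often you will not), and the function names and the 8-token window are illustrative choices. It reports what fraction of a test item's word n-grams also appear verbatim in the corpus sample.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus sample.

    The default n=8 is an assumption for illustration, not a recommended setting.
    Returns 0.0 when the test item is shorter than n tokens.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(corpus_sample, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high fraction on many test items suggests the benchmark was seen during training. A low fraction proves little, since paraphrased or translated copies would not match.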

When to trust them

  • You need a first pass at shortlisting 3 models out of 20.
  • The benchmark is new, or freshly generated.
  • You have a strong reason to believe the task distribution matches yours.

The alternative

Build your own evaluation suite: 200 hand-labelled examples that cover the edge cases of your task. Score every candidate model on it. The benchmark that matters is the one you build.
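
A minimal sketch of such a suite might look like the following. Everything here is an assumption for illustration: each model is represented as a plain function from prompt to answer, `Example`, `accuracy`, and `rank_models` are hypothetical names, and scoring is exact match after whitespace and case normalization. Real tasks often need a more forgiving metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: any function from prompt text to answer text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str  # the hand-written label


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not count as an error."""
    return " ".join(text.lower().split())


def accuracy(model: Model, suite: List[Example]) -> float:
    """Exact-match accuracy of one model over the whole suite."""
    if not suite:
        raise ValueError("evaluation suite is empty")
    correct = sum(
        normalize(model(ex.prompt)) == normalize(ex.expected) for ex in suite
    )
    return correct / len(suite)


def rank_models(models: Dict[str, Model], suite: List[Example]) -> List[Tuple[str, float]]:
    """Score every candidate on the same suite and sort from best to worst."""
    scores = {name: accuracy(fn, suite) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In practice each entry in `models` would wrap a call to a real candidate, and the suite would be loaded from your labelled file. Keeping every candidate on the same examples and the same metric is what makes the ranking comparable.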