When to ignore benchmarks

Prototype — concept, not a live service. This blog post is part of the redesign proposal demonstrated in this MVP and is not a current service of LIACC. Any data shown is illustrative. See the project report for context.

OPINION · 08 Oct 2025 · 3 min read · 133 words

A short guide for anyone who has stared at MMLU and felt nothing.

LIACC

Illustrative byline (prototype)

Public benchmarks are a useful signal. They are also a trap.

When to ignore them

Your deployment is narrow. Domain-specific accuracy is all that matters.
You care about latency or cost. Benchmarks rank capability, not deployability.
Your domain has its own jargon. A benchmark written in English medical text won't tell you anything useful about Portuguese legal text.
The benchmark is old. If the test set has leaked into the pretraining corpus, the score reflects memorization rather than capability.
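One rough way to probe the last point, if you happen to have a sample of text the model may have been trained on, is to look for long word sequences shared between that sample and the benchmark items. The sketch below is an illustration, not a standard tool: the function names, the 8-word window and the whitespace tokenization are all assumptions, and most teams will not have access to the real pretraining corpus at all.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text (naive whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: List[str],
                       corpus_sample: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus sample.

    Hypothetical helper: a high value suggests leakage, but a low value
    proves nothing if the sample is small or unrepresentative.
    """
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_sample:
        corpus_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items)
```

A non-zero rate is a reason to discount the published score. A zero rate on a small sample is not a reason to trust it.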

When to trust them

You need a first pass at shortlisting 3 models out of 20.
The benchmark is new, or freshly generated.
You have a strong reason to believe the task distribution matches yours.

The alternative

Build your own evaluation suite: around 200 hand-labelled examples that cover the edge cases of your task. Score every candidate model on it. Building the benchmark that matters is your job.
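As a minimal sketch of what such a suite could look like in code: the version below assumes each example is an (input, expected label) pair and each candidate model is wrapped in a plain function from text to a predicted label. The names `score_model` and `rank_models` are made up for illustration, and exact-match accuracy is only one possible metric; your task may need something else.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (input text, hand-assigned label)
Predictor = Callable[[str], str]   # hypothetical wrapper around a candidate model


def score_model(predict: Predictor, examples: List[Example]) -> float:
    """Exact-match accuracy of one model over the hand-labelled examples."""
    if not examples:
        return 0.0
    correct = sum(
        1 for text, label in examples
        if predict(text).strip().lower() == label.strip().lower()
    )
    return correct / len(examples)


def rank_models(models: Dict[str, Predictor],
                examples: List[Example]) -> List[Tuple[str, float]]:
    """Score every candidate on the same examples, best first."""
    scores = {name: score_model(fn, examples) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for real models and real labelled data.
    examples = [("contract clause A", "valid"), ("contract clause B", "void")]
    models = {
        "always_valid": lambda text: "valid",
        "always_void": lambda text: "void",
    }
    for name, accuracy in rank_models(models, examples):
        print(f"{name}: {accuracy:.2f}")
```

Because every candidate sees the same examples, the resulting ranking is directly comparable, which is the property a public leaderboard cannot give you for your own domain.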
