In 2023, the rule was: if in doubt, call GPT-4. In 2024, the smarter teams started asking whether the frontier model was worth it. By 2026, the answer across our consulting practice is clear: on narrow, repeatable tasks, a 3B to 14B model you control is faster, cheaper, more private, and often more accurate than a call to a ~600B-parameter model behind an API.

What changed

Three simultaneous shifts made small-model deployment viable:

  • Capable open weights. Llama 3.1, Mistral NeMo, Gemma 2, Qwen 2.5, and the 2025 Phi-4 releases gave us models good enough to evaluate seriously and small enough to fine-tune on a single 80GB GPU.
  • Cheap adaptation. QLoRA and DoRA bring fine-tuning down to a weekend of training. Once you have 2–10k clean examples, supervised fine-tuning (SFT) followed by direct preference optimisation (DPO) closes most of the gap to frontier models on domain tasks.
  • Better evaluation. We finally moved past “it looks good on 5 prompts” to held-out suites with programmatic grading.
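
For an extraction task, programmatic grading can be as small as a micro-averaged F1 over (field, value) pairs. The sketch below is illustrative only: the field names, the exact-match normalisation, and the function names are assumptions, not the grader used in the projects described here.

```python
from collections import Counter


def normalise(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not errors."""
    return " ".join(value.lower().split())


def to_pairs(record: dict[str, list[str]]) -> Counter:
    """Flatten {"field": ["value", ...]} into a multiset of (field, value) pairs."""
    return Counter(
        (field, normalise(v)) for field, values in record.items() for v in values
    )


def micro_f1(gold: list[dict], predicted: list[dict]) -> dict[str, float]:
    """Micro-averaged precision, recall and F1 across all documents."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned one-to-one")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_pairs, p_pairs = to_pairs(g), to_pairs(p)
        overlap = sum((g_pairs & p_pairs).values())  # exact matches
        tp += overlap
        fp += sum(p_pairs.values()) - overlap  # predicted but not in gold
        fn += sum(g_pairs.values()) - overlap  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: one note, two of three predicted pairs are correct.
gold = [{"diagnosis": ["Pneumonia"], "medication": ["amoxicillin", "paracetamol"]}]
pred = [{"diagnosis": ["pneumonia"], "medication": ["amoxicillin", "ibuprofen"]}]
scores = micro_f1(gold, pred)
print(scores)
```

Exact match after normalisation is a deliberately strict choice. A real clinical grader would probably need synonym handling or code mapping, which this sketch leaves out.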

A worked example

Take clinical discharge-summary extraction. A typical 2023 baseline was GPT-4 with few-shot prompting: ~82% F1 on gold-labelled Portuguese notes, at a cost of ~€0.08 per 1k tokens and a median latency of 3.5s.

The 2025 upgrade: Llama 3.1-8B-Instruct, SFT on 3k LIACC-annotated summaries, DPO on 400 preference pairs, int4 quantisation. 88% F1 on the same held-out set. Cost: < €0.001 per 1k tokens on our A100. Latency: 450ms.
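
The ratios implied by these figures can be worked out directly. The sketch takes the quoted numbers at face value and makes two assumptions: "< €0.001" is treated as exactly €0.001, so the cost ratio is a lower bound, and the 450ms figure is treated as a median comparable to the 3.5s baseline.

```python
# Figures quoted in the text; the small-model cost is an upper bound ("< €0.001").
gpt4 = {"f1": 0.82, "eur_per_1k_tokens": 0.08, "latency_s": 3.5}
llama = {"f1": 0.88, "eur_per_1k_tokens": 0.001, "latency_s": 0.45}

cost_ratio = gpt4["eur_per_1k_tokens"] / llama["eur_per_1k_tokens"]
latency_ratio = gpt4["latency_s"] / llama["latency_s"]
f1_gain_points = (llama["f1"] - gpt4["f1"]) * 100

# Relative reduction in the shortfall from a perfect F1 (1 - F1).
shortfall_reduction = (llama["f1"] - gpt4["f1"]) / (1 - gpt4["f1"])

print(f"cost: at least {cost_ratio:.0f}x cheaper per 1k tokens")
print(f"latency: about {latency_ratio:.1f}x lower")
print(f"F1: +{f1_gain_points:.0f} points")
print(f"shortfall from perfect F1 reduced by {shortfall_reduction:.0%}")
```

Under these assumptions the small model is at least 80x cheaper per token and about 7.8x faster. Its 6-point F1 gain closes about a third of the baseline's shortfall from a perfect score.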

The headline is not that Llama beats GPT-4 everywhere. It doesn't. The headline is that on a bounded task with good data, a small controlled model dominates on the metrics the hospital actually cares about: data residency, latency, unit cost, and auditability.

When the frontier model still wins

  • Novel reasoning tasks where the problem shape shifts per request.
  • Long-context document synthesis beyond what your infra can load.
  • Anything where the cost of being wrong is huge and you need every last 0.3 points of accuracy.

Practical recipe

  1. Write an evaluation first. 200 examples, graded programmatically. Iterate against it.
  2. Prototype with a frontier model. If it can't do the task at all, a small model is unlikely to.
  3. Collect 1k–10k high-quality supervised examples. Keep them in-domain and diverse enough to cover the variation you expect in production.
  4. Fine-tune a small open-weight model. Start with LoRA or DoRA.
  5. Add a DPO pass. Preference data is cheap to produce once you have a baseline.
  6. Deploy behind a gateway you can monitor and kill.
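
Step 6 is the one most often left vague. As a rough illustration, a gateway can be a thin wrapper that records latency and errors per call and refuses traffic once a kill switch is flipped. Everything here is hypothetical: the `ModelGateway` class, its method names, and the stand-in `fake_model` are assumptions made for the sketch, not a specific product or the setup used in the projects above.

```python
import statistics
import threading
import time
from typing import Callable


class GatewayDisabled(RuntimeError):
    """Raised when the kill switch is on and requests are refused."""


class ModelGateway:
    """Wraps any prompt -> text callable with a kill switch and basic metrics."""

    def __init__(self, model_fn: Callable[[str], str]) -> None:
        self._model_fn = model_fn
        self._enabled = threading.Event()
        self._enabled.set()  # traffic is allowed until kill() is called
        self._lock = threading.Lock()
        self._latencies_s: list[float] = []
        self._errors = 0

    def kill(self) -> None:
        """Stop serving immediately; later calls raise GatewayDisabled."""
        self._enabled.clear()

    def revive(self) -> None:
        """Resume serving after a kill."""
        self._enabled.set()

    def __call__(self, prompt: str) -> str:
        if not self._enabled.is_set():
            raise GatewayDisabled("model traffic is disabled")
        start = time.perf_counter()
        try:
            return self._model_fn(prompt)
        except Exception:
            with self._lock:
                self._errors += 1
            raise
        finally:
            # Record latency for successful and failed calls alike.
            with self._lock:
                self._latencies_s.append(time.perf_counter() - start)

    def stats(self) -> dict[str, float]:
        """Snapshot of call count, error count and median latency in seconds."""
        with self._lock:
            latencies = list(self._latencies_s)
            errors = self._errors
        return {
            "calls": len(latencies),
            "errors": errors,
            "median_latency_s": statistics.median(latencies) if latencies else 0.0,
        }


def fake_model(prompt: str) -> str:
    """Stand-in for a real inference call (hypothetical)."""
    return prompt.upper()


gateway = ModelGateway(fake_model)
print(gateway("extract diagnosis"))
print(gateway.stats())
```

In a real deployment the same idea would more likely live in a reverse proxy or serving layer, with metrics exported to whatever monitoring stack is already in place. What matters is that one switch stops traffic without a redeploy.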

If you're in a Portuguese public body or a hospital, we'd rather spend a week on this than three months writing a procurement request for a frontier API. The economics tell the story.