In 2023, the rule was: if in doubt, call GPT-4. In 2024, the smarter teams started asking whether the frontier model was worth it. By 2026, the answer across our consulting practice is clear: on narrow, repeatable tasks, a 3B to 14B model you control is faster, cheaper, more private, and often more accurate than a call to a 600B API.
What changed
Three simultaneous shifts made small-model deployment viable:
- Capable open weights. Llama 3.1, Mistral-NeMo, Gemma 2, Qwen 2.5, and the 2025 Phi-4 releases gave us evaluation-grade models you can fine-tune on a single 80GB GPU.
- Cheap adaptation. QLoRA and DoRA make a weekend of training feasible. Once you have 2–10k clean examples, SFT + DPO closes most of the gap to frontier models on domain tasks.
- Better evaluation. We finally moved past “it looks good on 5 prompts” to held-out suites with programmatic grading.
A worked example
Take clinical discharge-summary extraction as a worked example. A typical 2023 baseline: GPT-4 with few-shot prompting, ~82% F1 on gold-labelled Portuguese notes. Cost: ~€0.08 per 1k tokens. Latency: 3.5s median.
The 2025 upgrade: Llama 3.1-8B-Instruct, SFT on 3k LIACC-annotated summaries, DPO on 400 preference pairs, int4 quantisation. 88% F1 on the same held-out set. Cost: < €0.001 per 1k tokens on our A100. Latency: 450ms.
The headline is not that Llama beats GPT-4 everywhere. It doesn't. The headline is that on a bounded task with good data, a small controlled model dominates on the metrics the hospital actually cares about: data residency, latency, unit cost, and auditability.
When the frontier model still wins
- Novel reasoning tasks where the problem shape shifts per request.
- Long-context document synthesis beyond what your infra can load.
- Anything where the cost of being wrong is huge and you need every 0.3 points of accuracy.
Practical recipe
- Write an evaluation first. 200 examples, graded programmatically. Iterate against it.
- Prototype with a frontier model. If it can't do the task, no small model will.
- Collect 1k–10k high-quality supervised examples. In-domain, high variance.
- Fine-tune a small open-weight model. Start with LoRA or DoRA.
- Add a DPO pass. Preference data is cheap to produce once you have a baseline.
- Deploy behind a gateway you can monitor and kill.
If you're in a Portuguese public body or a hospital, we'd rather spend a week on this than three months writing a procurement request for a frontier API. The economics tell the story.