The interesting question about clinical LLMs in 2026 isn't “do they work?” It's “which failures does a radiologist catch, and which ones sneak through?” Imagine a research team embedded in a hospital radiology rotation for three months; what it would learn belongs in a product spec, not a paper.

Three failure modes radiologists caught

  • Template hallucinations. The model borrowed a sentence from a common template even when the finding contradicted it. Caught because the radiologist has seen the template 10,000 times.
  • Negation flips. “No evidence of pneumothorax” becomes “evidence of pneumothorax.” Rare in modern models, but still non-zero. Caught because it would be malpractice to miss it.
  • Confident laterality errors. Right-left confusions that read fluently. Caught only when the radiologist looked at the image.

Two failure modes they didn't catch

  • Mild under-specification. The model summarised four findings when the scan had five. The fifth was minor, and no one read the scan again to check.
  • Subtle tense and uncertainty shifts. The model did not always preserve the distinction between “possible” and “probable.” Missed because reviewers trusted the model's hedges.
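
The uncertainty-shift failure is mechanical enough to screen for. The sketch below is one possible check, not a validated method: it compares the strongest hedge term in a source sentence with the strongest hedge term in the corresponding report sentence. The `HEDGE_STRENGTH` scale, its terms, and the function names are all hypothetical; a real lexicon would need clinical review.

```python
import re
from typing import Optional

# Hypothetical ordinal scale of hedge strength (higher = more confident).
# Illustrative only; the terms and their ranks are assumptions.
HEDGE_STRENGTH = {
    "cannot exclude": 1,
    "possible": 2,
    "possibly": 2,
    "probable": 3,
    "probably": 3,
    "likely": 3,
}


def hedge_level(text: str) -> Optional[int]:
    """Return the strongest hedge level in the text, or None if no hedge term appears."""
    lowered = text.lower()
    levels = [
        level
        for term, level in HEDGE_STRENGTH.items()
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]
    return max(levels) if levels else None


def hedge_shift(source_sentence: str, report_sentence: str) -> bool:
    """True when the report hedges at a different level than the source does."""
    return hedge_level(source_sentence) != hedge_level(report_sentence)


if __name__ == "__main__":
    source = "Possible small nodule in the right upper lobe."
    report = "Probable small nodule in the right upper lobe."
    print(hedge_shift(source, report))
```

A check like this only flags candidates for a reviewer. It cannot tell whether the shift was clinically justified, and it misses hedges phrased outside the lexicon.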

What actually helps

  1. Explicit span grounding. Every sentence in the generated report is anchored to a span in the source evidence (image or note). No span, no sentence.
  2. Uncertainty that refuses to be bland. The model either commits to a finding or flags it explicitly as uncertain. There is no middle ground where a sentence reads as confident while quietly hedging.
  3. A silent critic. A second, independently tuned model reviews every output. Disagreements go to the human.
  4. Log everything, including the model's rejected drafts. The failed drafts teach more than the accepted ones.
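
The list above can be sketched as a single review gate. This is a minimal illustration under stated assumptions, not a reference implementation: spans are character offsets into a text note (image regions would need a different span type), the critic is any callable returning agree or disagree, and every name here (`Sentence`, `Review`, `review_draft`, `toy_critic`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Sentence:
    text: str
    # (start, end) character offsets into the source note; None means ungrounded.
    span: Optional[tuple[int, int]] = None


@dataclass
class Review:
    accepted: list[Sentence] = field(default_factory=list)
    escalated: list[Sentence] = field(default_factory=list)   # critic disagreed: goes to the human
    log: list[tuple[str, str]] = field(default_factory=list)  # (decision, text) for every sentence


def is_grounded(sentence: Sentence, source: str) -> bool:
    """A sentence is grounded only if it cites a non-empty, in-bounds span."""
    if sentence.span is None:
        return False
    start, end = sentence.span
    return 0 <= start < end <= len(source)


def review_draft(
    draft: list[Sentence],
    source: str,
    critic: Callable[[Sentence, str], bool],
) -> Review:
    """Apply the gate: no span, no sentence; critic disagreements are escalated."""
    review = Review()
    for sentence in draft:
        if not is_grounded(sentence, source):
            decision = "dropped_no_span"
        elif not critic(sentence, source):
            decision = "escalated"
            review.escalated.append(sentence)
        else:
            decision = "accepted"
            review.accepted.append(sentence)
        # Log every decision, including drops, so failed drafts can be studied later.
        review.log.append((decision, sentence.text))
    return review


def toy_critic(sentence: Sentence, source: str) -> bool:
    """Stand-in for a second model: disagree if the sentence names a side
    ("left"/"right") that the cited span does not mention."""
    start, end = sentence.span  # only called on grounded sentences
    cited = source[start:end].lower()
    text = sentence.text.lower()
    return not any(side in text and side not in cited for side in ("left", "right"))


if __name__ == "__main__":
    note = "No evidence of pneumothorax. Small left pleural effusion."
    draft = [
        Sentence("No pneumothorax.", (0, 28)),
        Sentence("Small right pleural effusion.", (29, 57)),
        Sentence("Heart size is normal."),  # no span cited
    ]
    for decision, text in review_draft(draft, note, toy_critic).log:
        print(decision, "|", text)
```

The design choice worth noting is that ungrounded sentences are dropped rather than escalated, while critic disagreements are never silently resolved. Both outcomes still land in the log.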

What to tell your procurement team

Clinical LLMs are not chatbots. If the product you're evaluating does not ship with span-level grounding and a silent critic, it isn't ready for a hospital.