Every launch event in early 2026 promised an “autonomous agent” that would book flights, refactor repos, and write grant proposals. By April, procurement officers across Porto were asking us a sharper question: which of these actually work in the messy day-to-day of a real operation?

What the word finally means

In LIACC's multi-agent lab we've used the word agent for thirty years. The 2024–2026 wave adds three concrete capabilities that the classical literature only sketched:

  • Tool use with feedback. The agent reads the output of its own actions and adjusts. The idea is not new on paper, but reliable tool use with LLMs only arrived in 2024.
  • Long-horizon planning without a hand-written plan. Thanks to reasoning models like o1, o3, DeepSeek-R1, and Claude 3.7, the agent can budget its own inference time.
  • Memory across sessions, either via retrieval-augmented generation (RAG) over an external store or via compacted episodic memory.

None of these is magic. All three can be engineered, tested, and audited.
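
To make the first capability concrete, here is a minimal sketch of a tool-use loop with feedback. Everything in it is illustrative: run_agent, toy_policy, and the parse_int tool are hypothetical names, and the hard-coded policy stands in for an LLM call. It shows only the control flow: each observation, errors included, is appended to a history that the next decision reads.

```python
from typing import Any, Callable, Optional

Action = tuple[str, Any]             # (tool name, argument)
Step = tuple[str, Any, Any]          # (tool name, argument, observation)


def run_agent(
    policy: Callable[[str, list[Step]], Optional[Action]],
    tools: dict[str, Callable[[Any], Any]],
    goal: str,
    max_steps: int = 10,
) -> list[Step]:
    """Observe-act loop: every tool result is fed back into the next decision."""
    history: list[Step] = []
    for _ in range(max_steps):       # hard cap so a confused policy cannot loop forever
        action = policy(goal, history)
        if action is None:           # the policy decides it is done
            break
        name, arg = action
        try:
            observation = tools[name](arg)
        except Exception as exc:     # errors are observations too
            observation = f"error: {exc}"
        history.append((name, arg, observation))
    return history


def toy_policy(goal: str, history: list[Step]) -> Optional[Action]:
    """Stand-in for an LLM (assumption): retries with a cleaned argument after an error."""
    if not history:
        return ("parse_int", "12a")  # first attempt, deliberately malformed
    last_observation = history[-1][2]
    if isinstance(last_observation, str) and last_observation.startswith("error"):
        return ("parse_int", "12")   # adjust after reading the failure
    return None


history = run_agent(toy_policy, {"parse_int": int}, "parse the number")
for step in history:
    print(step)
```

The first step records an error observation, the second succeeds, and the loop then stops. The max_steps cap is the crudest possible answer to an agent that loops; a real system would want something more informative.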

Why most demos don't survive contact with reality

The gap between demo and deployment shows up in four places:

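  1. Tool latency. A demo runs 3 tool calls. A production agent runs 50. Each call has a tail distribution; with 50 independent calls, the chance that at least one lands beyond its 99th-percentile latency is 1 - 0.99^50, roughly 39%, against about 3% for 3 calls. The slow tail of a single call becomes routine for the whole trajectory.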
  2. Observability. If you can't replay an agent's trajectory step by step, you can't debug it, and you can't defend it to a regulator.
  3. Guardrails that don't collapse. The cheap pattern, “ask the LLM if the action is safe”, fails under adversarial pressure, because an input crafted to fool the agent can often fool the judging model too.
  4. Evaluation datasets that don't cheat. Too many agent benchmarks are contaminated: their tasks have leaked into model training data. You need a held-out evaluation suite that your own agent has never seen.
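
The arithmetic behind the tool-latency point is easy to check. The sketch below assumes, as the list item does, that calls are independent and share the same latency distribution; the function name p_any_slow is ours, for illustration only.

```python
def p_any_slow(n_calls: int, quantile: float = 0.99) -> float:
    """Probability that at least one of n independent calls exceeds
    the per-call latency quantile (default: the 99th percentile)."""
    return 1.0 - quantile ** n_calls


for n in (3, 50):
    print(f"{n:>2} calls: {p_any_slow(n):.1%} chance of at least one p99-slow call")
```

For 3 calls the chance is about 3%; for 50 calls it is about 39%. Real tool calls are rarely independent (a degraded backend slows many calls at once), so treat this as a first-order estimate rather than a latency model.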

The engineering we actually do

In the agentic systems we prototype for smart-city and governance settings, the pattern looks boring, and that's the point:

  • Every tool call is logged with inputs, outputs, latency, cost, and a hash of the model version.
  • The agent has a typed contract for each tool — no free-form calls.
  • A second model runs a post-hoc audit on a 5% sample and flags suspect cases to a human reviewer.
  • We ship a kill switch that a non-technical operator can hit from a dashboard.
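
The four items above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not our production code: ToolSpec, CallRecord, call_tool, in_audit_sample, and KILL_SWITCH are hypothetical names, the log is an in-memory list, sampling is done per call, and the cost is supplied by the caller rather than measured.

```python
import hashlib
import threading
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Typed contract: the exact argument names and types a tool accepts."""
    name: str
    params: dict            # argument name -> expected type
    fn: Callable[..., Any]

    def validate(self, args: dict) -> None:
        if set(args) != set(self.params):
            raise TypeError(f"{self.name}: expected {sorted(self.params)}, got {sorted(args)}")
        for key, expected in self.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{self.name}.{key}: expected {expected.__name__}")


@dataclass
class CallRecord:
    tool: str
    inputs: dict
    output: Any
    latency_s: float
    cost_usd: float
    model_hash: str         # hash of the model version string
    audit: bool             # True if this call falls in the audit sample


KILL_SWITCH = threading.Event()   # assumption: a dashboard button sets this
LOG: list = []                    # assumption: stands in for a durable log store


def in_audit_sample(call_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: map the call id to [0, 1) and compare with rate."""
    digest = hashlib.sha256(call_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


def call_tool(spec: ToolSpec, args: dict, *, call_id: str,
              model_version: str, cost_usd: float = 0.0) -> Any:
    if KILL_SWITCH.is_set():
        raise RuntimeError("kill switch engaged; refusing tool call")
    spec.validate(args)           # rejects free-form calls before they run
    start = time.perf_counter()
    output = spec.fn(**args)
    LOG.append(CallRecord(
        tool=spec.name, inputs=dict(args), output=output,
        latency_s=time.perf_counter() - start, cost_usd=cost_usd,
        model_hash=hashlib.sha256(model_version.encode()).hexdigest()[:12],
        audit=in_audit_sample(call_id),
    ))
    return output


# Hypothetical tool and model version, for illustration only.
add = ToolSpec("add", {"a": int, "b": int}, lambda a, b: a + b)
result = call_tool(add, {"a": 2, "b": 3}, call_id="run-1/step-1",
                   model_version="example-model-v1")
```

Hashing the call id, rather than drawing a random number, makes the audit sample reproducible: replaying a trajectory selects the same calls for review. Note that this sketch logs only successful calls; a real deployment would also record rejected and failed ones.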

What's worth your attention next

If you are evaluating an agentic product in 2026, three questions separate the serious from the theatrical:

  1. “Show me a replayed trajectory, not a demo.”
  2. “What is the system's behaviour when the LLM is wrong about the state of the world?”
  3. “Who gets paged at 3am when the agent loops?”

If the answers are vague, the agent is a prototype. That's fine — prototypes have value. Just don't pay production prices for them.