
Research
Aug 22, 2025
Most AI projects fail. Here’s why—and how to ship agents that actually work.
A new report from MIT Media Lab’s Project NANDA—The GenAI Divide: State of AI in Business 2025—looked at hundreds of real deployments and found a harsh truth: about 95% of generative-AI pilots don’t produce business impact. Only ~5% translate into revenue or real savings. The problem isn’t “AI is bad.” The problem is how we build and run it.
The short version: we’re trying to run AI agents like old, deterministic software. That doesn’t work. AI is probabilistic; outputs vary; providers change models; quality moves if you don’t measure it. You need a different playbook—one built around evaluations, simulation, guardrails, and cost/quality routing. Even OpenAI’s leadership keeps saying the quiet part out loud: evals are the work.
Where AI projects go wrong—and how to fix them
Symptom you see | Why it’s happening | What to do differently |
---|---|---|
Endless pilots that never reach production. | Teams run agents with old software rituals (feature list, demo, launch) instead of an agent playbook (goals, evaluations, simulator, rollout rules). | Write an agent-centric product requirements document (spell out task goals, tools allowed, escalation rules, acceptance thresholds). Pair it with a living evaluation set and a simulator plan from Day 1. The MIT study shows most pilots stall without this rigor. |
Demos look great; real users get flaky results. | No graded evaluations, only spot checks. | Build a living evaluation suite that mirrors real workflows and scores usefulness, safety, completeness, latency, and cost, not just “pass/fail” (a sketch of such a record follows this table). |
“Same prompt, different answer.” | AI is probabilistic; behavior drifts with context and provider updates. | Pin model versions where possible, capture seeds/parameters, and re-run evaluations whenever anything changes. Use a router to balance quality, cost, and latency per task. |
Agents pass unit tests but fail in the wild. | You tested components, not long-horizon tasks with tools and UI. | Use a sandbox/simulator to rehearse full tasks before go-live (web environments, multi-site flows, deterministic replicas). |
Surprising errors and “confident nonsense.” | Hallucinations and reasoning gaps aren’t monitored. | Add guardrails (grounding, retrieval, policy checks), confidence/deferral to humans, and quality dashboards. Research published in Nature and elsewhere shows hallucinations are inherent, so you must detect and contain them. |
Costs spike as usage grows. | Usage is token-metered; small inefficiencies compound. | Track cost per task, cap tokens by step, cache and batch where safe, and route simpler tasks to cheaper models. (Yes—pricing is usage-based.) |
“It worked last week.” | Hidden ML technical debt and drift: data shifts, prompts change, providers update. | Treat data and prompts as first-class code; add change audits; invest in ML observability. The classic “Hidden Technical Debt in ML Systems” explains why old habits break. |
Privacy/security objections stall deployment. | Data handling is unclear; cloud-only deployments. | Offer on-prem/private-cloud options, document data isolation and retention, and build redaction/access controls into the agent itself. (Standard enterprise best-practice patterns.) |
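To make the “living evaluation suite” row concrete, here is a minimal sketch of a graded evaluation record and a release gate built on it. The names (`EvalResult`, `passes_acceptance`) and the thresholds are illustrative assumptions, not a specific framework:

```python
# Minimal sketch of a graded evaluation record and a release gate (illustrative names).
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalResult:
    case_id: str         # which scenario was run
    usefulness: float    # 0-1, graded against a rubric or judge
    safety: float        # 0-1, guardrail/policy checks passed
    completeness: float  # 0-1, how much of the task was finished
    latency_s: float     # wall-clock seconds for the full task
    cost_usd: float      # token-metered spend for the run

def passes_acceptance(results: list[EvalResult],
                      min_usefulness: float = 0.8,
                      min_safety: float = 0.98,
                      max_cost_usd: float = 0.25) -> bool:
    """Gate releases on graded scores across the suite, not on a few spot checks."""
    return (mean(r.usefulness for r in results) >= min_usefulness
            and min(r.safety for r in results) >= min_safety
            and mean(r.cost_usd for r in results) <= max_cost_usd)
```

The point is that every run produces scores you can trend and gate on, instead of a single pass/fail from a demo.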
The playbook we use at Tastefully.ai
We build agents like systems, not stunts. Four pillars make the difference.
1) In-house Router (quality-, cost-, and latency-aware)
Task-aware selection. The router chooses the smallest model that can pass the task’s acceptance criteria; escalates only when needed.
Version control. We can pin model/provider versions and re-run evals on every change.
Cost guardrails. Token caps per step, caching, and batch policies keep spend predictable.
Quality arbitration. For high-stakes steps, the router can cross-check with a second model or a verifier before acting.
This turns “black-box AI” into managed, measurable behavior instead of hope. (Usage-based pricing is real; routing is how you keep it sane.)
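As an illustration of how the “smallest model that passes” loop can work, here is a hedged sketch; the model catalogue, prices, and acceptance check are placeholder assumptions, not our production router:

```python
# Illustrative router: try the cheapest pinned model that meets the task's acceptance
# criteria, escalate to the next candidate on failure, and fail closed if none pass.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelOption:
    name: str                  # pinned model/provider version, e.g. "small-model@2025-06-01"
    cost_per_1k_tokens: float  # used to order candidates cheapest-first
    max_tokens_per_step: int   # per-step token cap to keep spend predictable

def route(task_input: str,
          candidates: list[ModelOption],
          run_model: Callable[[ModelOption, str], str],
          accepts: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model name, output) from the cheapest candidate that passes."""
    for option in sorted(candidates, key=lambda m: m.cost_per_1k_tokens):
        output = run_model(option, task_input)   # caller enforces the token cap
        if accepts(output):                      # task-specific acceptance criteria
            return option.name, output
    # Fail closed: no silent best-effort answer; escalate to a human instead.
    raise RuntimeError("No candidate met the acceptance criteria; escalating for review.")
```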
2) In-house Simulator (our “agent gym”)
We rehearse real tasks end-to-end: multi-page flows, tool calls, dead ends, and recovery paths.
The simulator runs deterministic replicas and scripted obstacles so we can measure repeatability and catch regressions.
We keep an evaluation logbook (by scenario and version) so production changes never outpace quality checks.
Benchmarks like WebArena and REAL show why deterministic web/task replicas matter; we’ve built our own for your workflows.
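For a flavour of how a scenario rehearsal can be wired up, here is a hypothetical runner that replays a scripted task against a deterministic replica and appends the outcome to an evaluation logbook. The scenario format and the agent/replica interfaces are assumptions made for the sketch:

```python
# Hypothetical scenario runner: rehearse a full task against a deterministic replica,
# inject a scripted obstacle, and append the outcome to a per-version evaluation logbook.
import json
import time
from pathlib import Path

def run_scenario(agent, scenario: dict, logbook: Path, agent_version: str) -> bool:
    replica = scenario["make_replica"]()        # deterministic copy of the real workflow
    replica.inject(scenario.get("obstacle"))    # e.g. a dead end or a changed page
    start = time.time()
    outcome = agent.run(replica, goal=scenario["goal"])
    entry = {
        "scenario": scenario["name"],
        "agent_version": agent_version,
        "passed": bool(scenario["check"](outcome)),  # scripted, repeatable assertion
        "latency_s": round(time.time() - start, 2),
    }
    with logbook.open("a") as f:                # logbook keeps results by scenario and version
        f.write(json.dumps(entry) + "\n")
    return entry["passed"]
```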
3) Spec-and-Tickets Generator that isn’t “just a template”
It listens to real work (docs, meetings, tickets) and proposes diffs to your spec with clear rationales.
Tickets come with acceptance criteria, dependency notes, and confidence tags; low-confidence items auto-escalate for review.
Every suggestion is trace-linked to the evidence that justified it (meeting excerpt, doc section, prior decision).
Result: less “AI slurry,” more auditable decisions your team can trust.
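One way to picture the ticket payload: a hedged sketch of the fields that make auto-escalation and trace-linking possible. Field names and the 0.7 review threshold are illustrative:

```python
# Illustrative ticket record: acceptance criteria, dependencies, a confidence tag,
# and trace links back to the evidence that justified the suggestion.
from dataclasses import dataclass, field

@dataclass
class GeneratedTicket:
    title: str
    acceptance_criteria: list[str]
    depends_on: list[str] = field(default_factory=list)
    confidence: float = 0.0                            # 0-1, set by the generator
    evidence: list[str] = field(default_factory=list)  # meeting excerpt / doc section IDs

    def needs_review(self, threshold: float = 0.7) -> bool:
        """Low-confidence or unevidenced suggestions escalate to a human instead of auto-filing."""
        return self.confidence < threshold or not self.evidence
```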
4) Dynamic UI/UX Generator
Agents don’t leave you with raw JSON. They generate small, adaptive interfaces—fields, tables, and checklists matched to the task.
UI elements are typed and validated (think: schema + state machine), so clicks and updates are idempotent and traceable.
This is how we turn probabilistic text into predictable actions your team can review and approve.
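To show what “schema + state machine” means in practice, here is a small illustrative sketch; the field schema, states, and transition table are examples, not our actual component library:

```python
# Example of "schema + state machine" for a generated UI element: values are validated
# against a declared schema, and only legal state transitions are applied, so replays
# of the same action stay idempotent and traceable.
ALLOWED_TRANSITIONS = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": set(),        # terminal state
    "rejected": {"draft"},    # can be revised and resubmitted
}

def validate_field(value: str, schema: dict) -> str:
    """Reject anything that does not match the field's declared type or choices."""
    if schema.get("type") == "number":
        return str(float(value))                     # raises ValueError on bad input
    choices = schema.get("choices")
    if choices and value not in choices:
        raise ValueError(f"{value!r} is not one of {choices}")
    return value

def transition(current: str, target: str) -> str:
    """Apply only legal moves; replaying the current state is a no-op, not an error."""
    if target == current:
        return current                               # idempotent replay
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition {current} -> {target}")
    return target
```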
“Deterministic” behavior—without pretending AI is deterministic
Truth: you can’t make a large language model fully deterministic in the wild. Research published in Nature shows that hallucination risk never vanishes; you manage it.
What you can do is box in the variability so outcomes are repeatable enough for production:
Constrain the task (tools, scopes, allowed actions).
Define acceptance thresholds per step; fail closed.
Use retrieval/grounding for facts; verify or defer on low confidence (a short sketch follows this list).
Route smartly, pin versions, and re-eval on change.
Wrap actions in typed UI/state machines so the end effect is consistent.
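A minimal sketch of that fail-closed, defer-on-low-confidence gate, assuming the agent exposes a confidence score and a grounding check; the threshold and field names are illustrative:

```python
# Fail-closed step gate: act only when the answer is grounded and confident enough;
# anything else is deferred to a human rather than acted on.
def gate_step(answer: str, confidence: float, grounded: bool,
              threshold: float = 0.85) -> dict:
    if grounded and confidence >= threshold:
        return {"action": "proceed", "answer": answer}
    return {"action": "defer_to_human",
            "reason": "low confidence or missing grounding"}
```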
That’s how we get “deterministic-feeling” agents—reliable, auditable, and ready for real work.
Why this matters now
Two things are true at once:
Most pilots fail—not for lack of ambition, but for lack of the right operational model.
Evals + simulation + routing are the difference between a cool demo and a durable system. This is fast becoming industry consensus.
If you’re done with pilot theater and want production-grade agents—with privacy, quality, and cost under control—we should talk.
References
MIT Media Lab / Project NANDA, The GenAI Divide: State of AI in Business 2025; coverage and summary.
Brockman on the centrality of evaluations.
Hidden Technical Debt in Machine Learning Systems (why old software habits break).
Hallucinations and detection (Nature).
Agent simulators/benchmarks (REAL, WebArena, BrowserGym).
Usage-based pricing (token-metered).