Agile Is Dead for AI: A New Operating Model for Software Development
Enterprises are spending big on AI in software development and getting little back. The core issue isn't the tools. It's the operating model. You can't bolt AI onto decade-old Agile rituals and expect compounding returns.
As highlighted by Martin Harrysson and Natasha Maniar of McKinsey & Company, AI adds a probabilistic, non-deterministic layer that traditional Agile never accounted for. Agile was built to slice known work into tickets and ship increments. AI doesn't play by those rules.
Why classic Agile breaks under AI
Agile assumes a deterministic system: define, build, test, release. AI systems are statistical. They drift, degrade, and depend on data that changes daily. They require a constant feedback loop, not just sprint reviews every two weeks.
This means "build once, ship, maintain" is the wrong mental model. Models need ongoing evaluation, retraining, and guardrails. If your process can't support that, value will stall.
Redefine the product: model + data + pipeline
For AI, the product isn't just code. It's the model, the data that feeds it, and the pipeline that keeps it current. That end-to-end system decides outcome quality more than the app layer ever will.
Think beyond features. Own the lifecycle: data sourcing, labeling strategy, training, deployment, monitoring, evaluation, and continuous improvement. If any link is weak, the result suffers.
New roles you actually need
- AI Product Manager: Frames business value in probabilistic terms, defines acceptance criteria beyond pass/fail, owns prompts and grounding strategy, partners on data acquisition, and sets evaluation metrics (quality, safety, cost, latency).
- AI Engineer: Manages the ML lifecycle end to end: data pipelines, training/fine-tuning, MLOps/LLMOps, eval harnesses, observability, and integration with production systems.
This is a shift from "full-stack dev" to a combined product, data, and model capability. You need engineering rigor plus ML fluency, not a tooling bolt-on.
From tickets to continuous learning systems
AI systems require continuous signals. Set up automated evaluation suites with golden datasets, hallucination checks, toxicity filters, and domain-specific tests. Track accuracy, coverage, regression, latency, and cost per request.
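A minimal harness can be a few dozen lines. The sketch below assumes a small golden set, a stubbed call_model() function, and a keyword-based groundedness check; none of this is a specific framework's API, just the shape of the loop:

```python
import statistics

# Minimal evaluation harness sketch. The golden cases, call_model() stub,
# and keyword-based groundedness check are illustrative assumptions.
GOLDEN_CASES = [
    {"prompt": "What is our refund window?", "must_mention": ["30 days"]},
    {"prompt": "Which plans include SSO?", "must_mention": ["Enterprise"]},
]

def call_model(prompt: str) -> dict:
    """Stub: replace with your model/RAG call; return answer, latency, cost."""
    return {"answer": "Refunds are accepted within 30 days.", "latency_s": 0.8, "cost_usd": 0.002}

def run_suite(cases):
    results = []
    for case in cases:
        out = call_model(case["prompt"])
        grounded = all(fact.lower() in out["answer"].lower() for fact in case["must_mention"])
        results.append({"grounded": grounded, "latency_s": out["latency_s"], "cost_usd": out["cost_usd"]})
    return {
        "groundedness_rate": sum(r["grounded"] for r in results) / len(results),
        "p50_latency_s": statistics.median(r["latency_s"] for r in results),
        "cost_per_request_usd": statistics.mean(r["cost_usd"] for r in results),
    }

if __name__ == "__main__":
    print(run_suite(GOLDEN_CASES))
```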
Treat model drift like uptime. Define SLOs for quality and safety. When metrics fall, trigger retraining or model swaps. Build human-in-the-loop workflows for edge cases and feedback capture.
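Operationally, that can be as simple as comparing each eval run against explicit SLO thresholds and emitting an action when one is breached. A sketch, with illustrative thresholds and metric names:

```python
# Quality and safety SLOs, expressed like availability targets.
# Thresholds and metric names are illustrative assumptions.
SLOS = {
    "groundedness_rate": 0.95,       # minimum acceptable
    "safety_violation_rate": 0.01,   # maximum acceptable
}

def check_slos(metrics: dict) -> list[str]:
    """Return the actions to trigger when SLOs are breached."""
    actions = []
    if metrics.get("groundedness_rate", 0.0) < SLOS["groundedness_rate"]:
        actions.append("trigger-retraining-or-model-swap")
    if metrics.get("safety_violation_rate", 1.0) > SLOS["safety_violation_rate"]:
        actions.append("page-on-call-and-roll-back")
    return actions

# Example: feed in the latest eval run (e.g. the output of run_suite above).
latest = {"groundedness_rate": 0.91, "safety_violation_rate": 0.0}
print(check_slos(latest))  # ['trigger-retraining-or-model-swap']
```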
Build, buy, or adapt isn't a one-time decision
With foundation models, you have three paths: use off-the-shelf, fine-tune, or build custom. The right answer changes as your data improves, vendors release updates, and unit economics shift.
- Use: Fastest start; lowest control. Great for prototypes and low-risk use cases.
- Fine-tune/augment: Balance of speed and quality; use retrieval, prompts, and selective tuning to hit target benchmarks.
- Build: Highest control and cost; reserve for core IP or strict constraints (privacy, latency, compliance).
Re-evaluate quarterly with a formal scorecard across quality, safety, latency, price, data security, and switching costs.
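To make the quarterly review concrete, score each option against weighted dimensions. The weights and 1-5 scores below are illustrative assumptions, not a recommended rating (higher is better on every dimension, including switching cost, where 5 means "easy to switch away"):

```python
# Weighted scorecard sketch for the quarterly build/buy/adapt review.
# Dimensions, weights, and 1-5 scores are illustrative, not prescriptive.
WEIGHTS = {
    "quality": 0.30, "safety": 0.20, "latency": 0.15,
    "price": 0.15, "data_security": 0.10, "switching_cost": 0.10,
}

OPTIONS = {
    "use_off_the_shelf": {"quality": 3, "safety": 4, "latency": 4, "price": 4, "data_security": 3, "switching_cost": 2},
    "fine_tune_augment": {"quality": 4, "safety": 4, "latency": 3, "price": 3, "data_security": 4, "switching_cost": 3},
    "build_custom":      {"quality": 5, "safety": 4, "latency": 4, "price": 1, "data_security": 5, "switching_cost": 5},
}

def score(option: dict) -> float:
    """Weighted sum across all dimensions."""
    return round(sum(WEIGHTS[dim] * option[dim] for dim in WEIGHTS), 2)

for name, opt in OPTIONS.items():
    print(name, score(opt))
```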
Metrics that matter more than story points
- Quality: precision/recall, win rate vs. baseline, hallucination rate, groundedness score
- Experience: P50/P95 latency, UX acceptance rate, escalation rate to humans
- Safety: policy violations, red-team triggers, jailbreak detections
- Economics: cost per request/task, throughput per GPU, retrain cost vs. uplift
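Most of these fall out of the same eval runs. Win rate vs. baseline, for example, is just paired comparisons over the golden set; here's a sketch where judge() is a stub you'd back with human review or an LLM judge:

```python
# Win rate vs. baseline: run both systems on the same prompts, count wins.
# judge() is a stub; in practice it is a human review queue or an LLM judge.
def judge(prompt: str, candidate: str, baseline: str) -> str:
    """Return 'candidate', 'baseline', or 'tie' for this prompt."""
    return "candidate"  # placeholder verdict

def win_rate(rows: list[dict]) -> float:
    verdicts = [judge(r["prompt"], r["candidate"], r["baseline"]) for r in rows]
    wins = verdicts.count("candidate")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)  # ties count as half a win

rows = [
    {"prompt": "Summarize this ticket", "candidate": "…", "baseline": "…"},
    {"prompt": "Draft a status update", "candidate": "…", "baseline": "…"},
]
print(f"win rate vs. baseline: {win_rate(rows):.0%}")
```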
Team topology and stage gates for AI work
- Triad at the core: AI PM + AI Engineer + Domain Lead (or Data Scientist). Surround with platform, data, and security partners.
- Stage gates: problem framing → data readiness review → evaluation design → safety review → pre-prod shadow → controlled launch → continuous improvement loop.
- Ops by design: observability, A/B infra, feature stores, data contracts, model registry, and rollback paths.
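A data contract is the easiest of these to show in code: a small, enforceable check on schema and freshness for the data feeding retrieval or training. The field names and 24-hour window below are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

# Minimal data contract sketch: required fields plus a freshness bound.
# Field names and the 24-hour window are illustrative assumptions.
REQUIRED_FIELDS = {"doc_id": str, "text": str, "updated_at": str}
MAX_STALENESS = timedelta(hours=24)

def validate_record(record: dict) -> list[str]:
    """Return the contract violations for one record (empty list = pass)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    updated_at = record.get("updated_at")
    if isinstance(updated_at, str):
        age = datetime.now(timezone.utc) - datetime.fromisoformat(updated_at)
        if age > MAX_STALENESS:
            errors.append(f"stale record: {age} old, contract allows {MAX_STALENESS}")
    return errors

print(validate_record({"doc_id": "kb-42", "text": "…", "updated_at": "2020-01-01T00:00:00+00:00"}))
```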
Governance without slowing delivery
Document datasets, prompts, model versions, and known risks. Use model cards and audit logs. Automate safety checks in CI/CD for prompts, retrieval sources, and model changes.
Adopt an evaluation-first culture and treat safety regressions like Sev-1 incidents.
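In practice, the CI gate can be a short script that reads the latest eval results and fails the build on any regression past hard thresholds. The results file and its keys below are assumptions about your own harness, not a standard format:

```python
import json
import sys

# CI safety gate sketch: fail the pipeline if the latest eval run regresses
# past hard thresholds. eval_results.json and its keys are assumptions.
THRESHOLDS = {"hallucination_rate": 0.02, "policy_violation_rate": 0.0}

def main(path: str = "eval_results.json") -> int:
    with open(path) as f:
        metrics = json.load(f)
    failures = [
        f"{name}={metrics.get(name)} exceeds {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 1.0) > limit
    ]
    for failure in failures:
        print(f"SAFETY GATE FAILED: {failure}")
    return 1 if failures else 0  # nonzero exit blocks the merge/deploy

if __name__ == "__main__":
    sys.exit(main())
```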
Your 90-day plan
- Pick one high-value use case with clear quality and safety thresholds.
- Form the core triad and identify data owners and a security partner.
- Define success: target win rate vs. baseline, latency, cost, and guardrails.
- Stand up an evaluation harness with golden sets and auto-reports.
- Start with a strong base model; layer retrieval; fine-tune only if the eval says it's worth it.
- Ship a shadow launch to collect real data (sketched after this list); iterate weekly on prompts, retrieval, and data fixes.
- Add observability: drift, quality, safety, and cost dashboards with alerts.
- Write the runbook: rollback, retrain triggers, abuse handling, and on-call ownership.
- Review the build/buy/adapt scorecard and adjust the stack.
- Codify the playbook; scale to the next use case.
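For the shadow launch step, the pattern is to keep serving the incumbent, run the candidate silently on a slice of traffic, and log both for offline comparison. A sketch with stubbed models and an assumed 10% sample rate:

```python
import random

# Shadow launch sketch: serve the incumbent response, run the new model in
# the background on a sample of traffic, and log both for offline comparison.
# The 10% sample rate and the log format are illustrative assumptions.
SHADOW_SAMPLE_RATE = 0.10

def handle_request(prompt: str, incumbent, candidate, log) -> str:
    answer = incumbent(prompt)              # users only ever see this
    if random.random() < SHADOW_SAMPLE_RATE:
        shadow_answer = candidate(prompt)   # never returned to the user
        log.append({"prompt": prompt, "live": answer, "shadow": shadow_answer})
    return answer

# Example with stub models and an in-memory log.
log: list[dict] = []
incumbent = lambda p: f"[v1] {p}"
candidate = lambda p: f"[v2] {p}"
for prompt in ["summarize ticket 123", "draft weekly status"]:
    handle_request(prompt, incumbent, candidate, log)
print(log)
```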
Bottom line
AI won't deliver returns inside a framework built for deterministic work. Treat the "AI product" as model + data + pipeline, add the right roles, and run a continuous learning system. The companies that make this shift will see real outcomes; those that don't will keep burning budget.
Helpful resources: NIST AI Risk Management Framework and Google Cloud MLOps guidance.
If you're building these capabilities in-house, look for practical training on prompt engineering and role-based learning paths.