Beyond Simple Loops: Context, Planning, and TDD for AI Coding Agents

Two Google Cloud engineers pit a quick loop against a context-first, TDD-backed agent. Front-load context, plan in steps, evaluate locally and globally, add guardrails.

Categorized in: AI News, IT and Development
Published on: Oct 10, 2025

AI Coding Agents: Beyond Simple Loops

Two engineers from Google Cloud Tech, Aja Hammerly and Jason Davenport, put two AI agent designs on the whiteboard and stress-tested them. The shared goal: make LLM-driven coding agents that plan, write, and evaluate code with fewer blind spots and fewer wasted cycles.

Here's what matters if you build agents for real software work: context up front, plans with checkpoints, test-first thinking, and guardrails against infinite loops.

The simple loop: quick, intuitive, fragile

Aja's initial flow is the obvious starting point: user prompt → LLM plan → generate code → execute → feed errors back into code generation until it works → return result.

  • Fast to implement and decent for small tasks.
  • Two core risks: endless fix loops and "working code" that doesn't match the original ask.
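To make the shape of that loop concrete, here's a minimal Python sketch. The planning, code-generation, and sandbox helpers are injected as callables because they're placeholders for your own model calls and execution environment, not an API from the talk.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunResult:
    ok: bool
    stderr: str = ""

def simple_loop_agent(
    prompt: str,
    plan_fn: Callable[[str], str],                   # LLM call: prompt -> plan
    codegen_fn: Callable[[str, str, str], str],      # LLM call: prompt, plan, errors -> code
    execute_fn: Callable[[str], RunResult],          # sandbox: code -> result
    max_attempts: int = 5,
) -> str:
    plan = plan_fn(prompt)
    code = codegen_fn(prompt, plan, "")
    for _ in range(max_attempts):
        result = execute_fn(code)
        if result.ok:
            return code   # "it runs" -- but nothing checks it matches the original ask
        # errors go straight back into code generation; the plan itself never changes
        code = codegen_fn(prompt, plan, result.stderr)
    raise RuntimeError("retry cap hit; escalate instead of looping forever")
```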

Add meta-cognition: route feedback to the planner

Aja's refinement: send execution output and errors back to the central LLM, not just the code generator. That lets the agent update the plan, reassess priorities, and cut failed loops early.

  • Good for prototypes and small features where you want quick turnaround with basic oversight.
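A sketch of that refinement, reusing the RunResult shape from the previous snippet: every failure is routed to the planner, which can revise the plan or stop the loop. The replan_fn contract here is an assumption for illustration, not the speakers' implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlanUpdate:
    plan: str
    keep_going: bool     # planner may decide the approach is a dead end

def loop_with_planner_feedback(
    prompt: str,
    plan_fn: Callable[[str], str],
    codegen_fn: Callable[[str, str], str],                 # prompt, plan -> code
    replan_fn: Callable[[str, str, str], PlanUpdate],      # prompt, plan, errors -> update
    execute_fn: Callable[[str], "RunResult"],              # RunResult from the sketch above
    max_attempts: int = 5,
) -> str:
    plan = plan_fn(prompt)
    for _ in range(max_attempts):
        code = codegen_fn(prompt, plan)
        result = execute_fn(code)
        if result.ok:
            return code
        update = replan_fn(prompt, plan, result.stderr)    # planner sees the failure, not just the coder
        if not update.keep_going:
            break                                          # cut the failed loop early
        plan = update.plan                                 # priorities and plan get reassessed
    raise RuntimeError("planner aborted or retry cap hit")
```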

Context-first architecture: start with what the agent should know

Jason's model begins by enriching the prompt with "Context": current codebase, team standards, architectural rules, and Model Context Protocol (MCP) capabilities. A junior dev can't meet expectations without the playbook; neither can your agent.

  • Context pack: repo snapshot or RAG index, coding guidelines, dependency graph, env config, tools.
  • MCP reference: modelcontextprotocol.io
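One way to model the context pack is as a typed bundle the agent flattens into a prompt preamble before any planning happens. The field names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPack:
    repo_snapshot: dict[str, str]          # path -> file contents (or a handle into a RAG index)
    coding_guidelines: str                 # team standards, style rules
    architecture_notes: str                # module boundaries, dependency rules
    env_config: dict[str, str] = field(default_factory=dict)
    tools: list[str] = field(default_factory=list)   # e.g. MCP servers the agent may call

    def to_prompt_preamble(self) -> str:
        """Flatten the pack into text the planner sees before the user's ask."""
        files = "\n".join(f"--- {path} ---\n{src}" for path, src in self.repo_snapshot.items())
        return (
            f"Coding guidelines:\n{self.coding_guidelines}\n\n"
            f"Architecture:\n{self.architecture_notes}\n\n"
            f"Available tools: {', '.join(self.tools)}\n\n"
            f"Relevant files:\n{files}"
        )
```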

Then the LLM creates a high-level plan of steps. Each step flows through a focused loop: Plan → Eval → Execute, using tools like compilers, linters, formatters, style checks, and a runner.

  • Key nuance: sometimes giving the step less context reduces distraction and produces cleaner code.
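A rough sketch of that per-step loop, assuming ruff and pytest as the linter and test runner, a scratch file on disk, and a per-step test file that already exists; swap in whatever toolchain and layout your stack actually uses.

```python
import subprocess

def run_step(step_prompt: str, step_context: str, codegen_fn, max_retries: int = 3) -> str:
    """Plan -> Eval -> Execute for a single step, with cheap checks before execution."""
    feedback = ""
    for _ in range(max_retries):
        code = codegen_fn(step_prompt, step_context, feedback)   # Plan: generate a candidate
        with open("candidate.py", "w") as f:
            f.write(code)
        # Eval: fast, local signals (lint/format/style) before spending time on execution
        lint = subprocess.run(["ruff", "check", "candidate.py"], capture_output=True, text=True)
        if lint.returncode != 0:
            feedback = lint.stdout           # route linter findings back into generation
            continue
        # Execute: run the step's own tests in a separate process
        tests = subprocess.run(["pytest", "-q", "test_candidate.py"], capture_output=True, text=True)
        if tests.returncode == 0:
            return code
        feedback = tests.stdout
    raise RuntimeError("step failed after retries; hand back to the planner")
```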

Two-tier evaluation with TDD

Jason's approach adds two evaluators: one per step and one against the original goal. The top-level evaluation is guided by tests created before implementation: classic TDD.

  • Write acceptance tests or specs first. Then code to meet them.
  • Reduces "functionally correct but irrelevant" results.
  • Improves traceability: every commit ties back to a test or requirement.
  • Primer: Test-Driven Development
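As a toy illustration of the test-first flow, the acceptance tests below target a made-up slugify function (not something from the article) and are written before any implementation exists. They stay red until the agent produces code that satisfies them, which is exactly the signal the top-level evaluator consumes.

```python
# test_slugify.py -- written and committed before any implementation.
# `slugger` and `slugify` are hypothetical names used only for illustration.

def test_slugify_basic():
    from slugger import slugify
    assert slugify("Hello, World!") == "hello-world"

def test_slugify_collapses_whitespace():
    from slugger import slugify
    assert slugify("  AI   coding  agents ") == "ai-coding-agents"
```

The global evaluator then simply runs the suite (for example, pytest -q) and hands the pass/fail result back to the planner alongside the original goal.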

Blueprint you can ship this week

  • Collect context: repo slice, coding standards, API docs, environment variables, toolchain config.
  • Define "done": acceptance tests, fixtures, and constraints (performance, security, style).
  • Draft a high-level plan: break work into steps with explicit inputs/outputs.
  • Wire tools: language runtime (e.g., Java), linter, formatter, style checker, unit test runner, sandbox executor.
  • Implement the per-step loop (Plan → Eval → Execute) with retry caps and backoff.
  • Add a global evaluator that runs the full test suite and checks the plan's intent.
  • Guardrails: timeouts, token/budget limits, file/path allowlists, and kill-switches.
  • Observability: log prompts, plans, tool I/O, diffs, and test outcomes for replay and learning.
  • Failure handling: rollback strategy, diff-based patches, and incremental PRs.
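A sketch of the guardrail pieces from the list above: retry caps with exponential backoff, a wall-clock budget as a kill-switch, and a global evaluation pass over the full suite. The thresholds are placeholder defaults, not recommendations from the article.

```python
import subprocess
import time

def guarded_run(step_fn, *, max_retries: int = 3, budget_seconds: float = 300.0):
    """Run a step callable under retry caps, backoff, and a time budget."""
    start = time.monotonic()
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        if time.monotonic() - start > budget_seconds:
            raise TimeoutError("budget exhausted; kill-switch triggered")
        try:
            return step_fn()
        except Exception as exc:
            if attempt == max_retries:
                raise RuntimeError(f"retry cap hit: {exc}") from exc
            time.sleep(delay)
            delay *= 2                      # exponential backoff between retries

def global_evaluate() -> bool:
    """Run the full acceptance suite; the exit code is the global signal."""
    return subprocess.run(["pytest", "-q"], timeout=600).returncode == 0
```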

Metrics that keep you honest

  • Plan adherence rate: steps completed without plan rewrites.
  • Test pass rate and time-to-green for the full suite.
  • Loop depth: average retries per step (watch for thrash).
  • Token and wall-clock cost per passing change.
  • Human handoff rate: percent of tasks needing intervention.
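One possible way to compute these from logged runs; the RunRecord shape is hypothetical and should mirror whatever your observability layer actually emits.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    steps_planned: int
    steps_replanned: int          # steps that had to be rewritten mid-run
    retries_per_step: list[int]
    tests_passed: bool
    tokens_used: int
    needed_human: bool

def summarize(runs: list[RunRecord]) -> dict[str, float]:
    n = max(len(runs), 1)
    passing = max(sum(r.tests_passed for r in runs), 1)
    return {
        "plan_adherence_rate": sum(
            1 - r.steps_replanned / max(r.steps_planned, 1) for r in runs
        ) / n,
        "test_pass_rate": sum(r.tests_passed for r in runs) / n,
        "avg_loop_depth": sum(
            sum(r.retries_per_step) / max(len(r.retries_per_step), 1) for r in runs
        ) / n,
        "avg_tokens_per_green_change": sum(
            r.tokens_used for r in runs if r.tests_passed
        ) / passing,
        "human_handoff_rate": sum(r.needed_human for r in runs) / n,
    }
```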

When to use each model

  • Simple loop with planner feedback: small, isolated tasks; throwaway scripts; rapid spikes.
  • Context-first with two-tier eval and TDD: multi-file changes, legacy codebases, team standards, production-critical work.

Common failure modes and quick fixes

  • Endless error loops → Add retry caps, backoff, and a planner check after N failures.
  • "Green tests, wrong outcome" → Rewrite acceptance tests to reflect real intent; add property-based checks.
  • Hallucinated APIs → Expand context with SDK docs and typed stubs; fail fast on unknown imports.
  • Tool flakiness → Isolate and version tools; run in a hermetic sandbox; pin seeds for determinism where possible.
  • Context overload → Prune to the smallest relevant files; use retrieval with top-k and recency bias.
  • Brittle tests → Prefer behavior specs over exact strings; use golden files with tolerances.
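For the context-overload fix specifically, here's a crude top-k pruner with a recency bias; the keyword-overlap scoring is a stand-in for whatever retriever or embedding similarity you actually use.

```python
import time

def prune_context(query: str, files: dict[str, str], mtimes: dict[str, float],
                  k: int = 5, recency_weight: float = 0.2) -> dict[str, str]:
    """Keep only the k files most relevant to the query, nudged toward recently touched ones."""
    terms = set(query.lower().split())
    now = time.time()
    scored = []
    for path, text in files.items():
        overlap = len(terms & set(text.lower().split()))   # crude relevance proxy
        age_days = (now - mtimes.get(path, now)) / 86400
        recency = 1.0 / (1.0 + age_days)                   # newer files score higher
        scored.append((overlap + recency_weight * recency, path))
    keep = {path for _, path in sorted(scored, reverse=True)[:k]}
    return {path: files[path] for path in keep}
```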

Bottom line

Both designs rely on Plan → Execute → Evaluate. The difference is orchestration. Add context up front, plan in steps, evaluate locally and globally, and lead with tests. That's how you ship agents that produce working code that actually meets the brief.

If you're mapping tools and courses for this stack, see our roundup of coding-focused AI tools: AI tools for generative code.