How Andrej Karpathy's AI Learns to Write Better Code, One Loop at a Time

Karpathy's loop (plan, generate, run, critique, patch) turns AI from guesswork into tested fixes. Start small, wire to CI and tests, track pass rates, and let iterations compound.

Published on: Mar 09, 2026

Iterative Self-Improvement of Code: How AI Learns to Ship Better Software

Andrej Karpathy popularized a simple idea with big consequences: let the model write code, run it, critique itself, patch the code, and repeat. It's the same way a sharp engineer improves, just compressed into tight loops.

For general, IT, and development roles, this loop isn't theory. It's a practical system you can bolt onto your workflow to reduce bugs, speed up maintenance, and move from "suggested code" to "proven fixes."

The core loop

  • Plan: State the goal, constraints, and acceptance criteria.
  • Generate: Produce the smallest change that could work.
  • Run: Execute tests, static checks, and example scripts in a sandbox.
  • Critique: Read failures, trace logs, and diffs; summarize root cause.
  • Patch: Apply a targeted fix; avoid scope creep.
  • Repeat: Stop when criteria are met or a max-iteration cap hits.
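
The loop above can be sketched in a few lines. This is a minimal, runnable illustration: the "model" is a toy stub that fixes a known bug after one critique, standing in for real LLM calls, and the evaluator is a single function rather than a full CI run.

```python
# Toy model: generates a buggy draft, then patches it after a critique.
# In a real system these methods would call an LLM; here they are stubs.
class ToyModel:
    def generate(self, task):
        return "def add(a, b): return a - b"   # deliberately buggy first draft

    def critique(self, task, code, failures):
        return "root cause: subtraction used where addition was required"

    def patch(self, code, critique):
        return code.replace("a - b", "a + b")

def evaluate(code):
    """Evaluator: one call, machine-readable pass/fail plus failure details."""
    ns = {}
    exec(code, ns)
    failures = [] if ns["add"](2, 3) == 5 else ["test_add: expected 5"]
    return {"passed": not failures, "failures": failures}

def improvement_loop(task, model, evaluator, max_iters=5):
    attempt = model.generate(task)                 # Generate
    for i in range(1, max_iters + 1):
        result = evaluator(attempt)                # Run
        if result["passed"]:
            return {"status": "passed", "iterations": i, "code": attempt}
        critique = model.critique(task, attempt, result["failures"])  # Critique
        attempt = model.patch(attempt, critique)   # Patch
    return {"status": "capped", "iterations": max_iters, "code": attempt}  # Stop rule

outcome = improvement_loop("fix add()", ToyModel(), evaluate)
print(outcome["status"], outcome["iterations"])  # passed 2
```

The stop rule matters as much as the loop body: without the iteration cap, a stuck model burns tokens chasing the same failure.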

What makes it work

  • Tests are the truth: High-signal unit/integration tests translate intent into pass/fail feedback.
  • Deterministic environment: Containerized runs, pinned deps, seeded randomness.
  • Clear reward signal: Test pass rate, lints, type checks, and performance gates.
  • Context and memory: Relevant files, failing traces, and prior attempts, not the whole repo dump.
  • Safety rails: Write to branches, run in ephemeral sandboxes, and require human review for merges.
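
A "clear reward signal" usually means mixing checks so no single gate can be gamed. Here is one hypothetical scoring scheme; the hard-gate structure and the 10% performance threshold are illustrative assumptions, not values from the article.

```python
# Hypothetical combined reward: lints, types, and perf act as hard gates,
# and the test pass rate drives the score only once the gates are clean.
def reward(test_pass_rate, lint_errors, type_errors, perf_ratio):
    """Return a 0..1 score. perf_ratio is runtime relative to baseline."""
    if lint_errors > 0 or type_errors > 0:
        return 0.0                 # hard gate: lints and types must be clean
    if perf_ratio > 1.10:          # hard gate: >10% slower than baseline fails
        return 0.0
    return test_pass_rate

print(reward(0.9, 0, 0, 1.0))  # 0.9
print(reward(1.0, 2, 0, 1.0))  # 0.0 (lint gate trips despite perfect tests)
```

Hard gates keep the model from trading one signal against another, e.g. passing tests by disabling type checks.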

Set this up in your team this week

  • Pick a scope: Start with bug fixing, flaky tests, or boilerplate migrations: bounded, testable work.
  • Toolchain: CI-accessible container, unit tests, static analysis, type checks, and a scriptable runner.
  • Evaluator: One command that returns pass/fail, failing tests, and key metrics as machine-readable output.
  • Prompts: A compact system prompt (coding style, project rules), a critique prompt (why it failed), and a patch plan prompt (small, verifiable diffs).
  • Context control: Only feed changed files, failing traces, and closest neighbors; avoid context bloat.
  • Logging: Store attempt count, diffs, test results, and tokens; this history is your performance dashboard.
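
The evaluator piece can be as small as a wrapper that shells out to your test runner and emits JSON. A sketch, assuming a command-line runner; the stand-in command here replaces what would normally be `pytest -q` or similar.

```python
# One-command evaluator: run a test command, return machine-readable results.
import json
import subprocess
import sys

def run_evaluator(cmd):
    """Run a check command; return pass/fail plus the failing trace tail."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stderr_tail": proc.stderr[-500:],   # feed this to the critique step
    }

# Stand-in for a real runner invocation such as ["pytest", "-q"]
result = run_evaluator([sys.executable, "-c", "assert 1 + 1 == 2"])
print(json.dumps(result))
```

Keeping the output machine-readable is what lets the critique prompt consume failures directly instead of parsing free-form logs.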

Prompts that drive useful iterations

  • Goal: Describe the bug or task, the exact acceptance tests, and any non-negotiable constraints.
  • Reflection: Ask for a minimal root-cause hypothesis tied to specific lines and errors.
  • Patch plan: Request a tiny diff, why it should work, and any new tests to add.
  • Stop rule: Cap iterations; if still failing, ask for a human-readable handoff note.
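
Concretely, the critique and patch prompts can live as small templates. The wording below is an illustrative assumption, not a canonical prompt set; the point is that each prompt demands one narrow artifact.

```python
# Illustrative prompt templates: each asks for exactly one artifact,
# which keeps iterations small and verifiable.
CRITIQUE_PROMPT = """\
The change below failed these tests:
{failures}

Give a minimal root-cause hypothesis tied to specific lines and errors.
Do not propose a fix yet."""

PATCH_PROMPT = """\
Root cause: {critique}
Produce the smallest diff that addresses it, explain in one sentence why
it should work, and list any new tests to add."""

def render(template, **fields):
    return template.format(**fields)

msg = render(CRITIQUE_PROMPT, failures="test_add: expected 5, got -1")
print("expected 5" in msg)  # True
```

Separating "diagnose" from "fix" discourages the model from patching symptoms before it has named a cause.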

Metrics that matter

  • Pass rate improvement: Tests passed after N iterations vs. baseline.
  • Regression count: New failures introduced per successful fix.
  • Tokens and time per fix: Cost and cycle time trend.
  • Human edit distance: How much reviewers changed before merge.
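
If the logging step above captures a record per fix attempt, these metrics fall out of a few aggregations. The log schema below (dicts with these keys) is an assumption for illustration.

```python
# Computing loop metrics from a simple per-fix attempt log (schema assumed).
attempts = [
    {"fix_id": 1, "iterations": 2, "passed": True,  "tokens": 3500, "regressions": 0},
    {"fix_id": 2, "iterations": 5, "passed": False, "tokens": 9000, "regressions": 1},
    {"fix_id": 3, "iterations": 1, "passed": True,  "tokens": 1200, "regressions": 0},
]

successes = [a for a in attempts if a["passed"]]
pass_rate = len(successes) / len(attempts)                 # share of fixes that landed
regressions_per_fix = (                                    # new failures per success
    sum(a["regressions"] for a in attempts) / max(len(successes), 1)
)
tokens_per_fix = (                                         # cost per successful fix
    sum(a["tokens"] for a in successes) / max(len(successes), 1)
)

print(round(pass_rate, 2), regressions_per_fix, tokens_per_fix)
```

Tracking these per week, rather than per run, is what shows whether prompt and context changes are actually compounding.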

Common failure modes (and quick fixes)

  • Spec gaps: The model guesses. Fix by adding tests and explicit acceptance criteria.
  • Flaky tests: The loop chases noise. Stabilize tests and seed randomness.
  • Reward hacking: The model "satisfies" checks without solving the problem. Mix signals: tests, lints, types, and perf gates.
  • Context drift: Too much repo context confuses the model. Provide only files and traces that matter.
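
The "seed randomness" fix for flaky tests is mechanical: give every run its own seeded generator instead of touching global state. A minimal sketch:

```python
# Deterministic shuffling: a fresh seeded RNG per call means the loop
# sees the same test ordering (and data) on every run.
import random

def seeded_shuffle(items, seed=42):
    rng = random.Random(seed)   # local seeded RNG; global state untouched
    out = list(items)
    rng.shuffle(out)
    return out

a = seeded_shuffle(range(5))
b = seeded_shuffle(range(5))
print(a == b)  # True: identical order across runs
```

The same idea applies to fixture data, sampled inputs, and retry jitter; any nondeterminism the evaluator can see becomes noise in the reward signal.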

Where this pays off first

  • Bug backlogs: Triage and fix well-scoped issues with tests.
  • Legacy code: Add missing tests, refactor safe pieces, and document behavior discovered by the loop.
  • Migrations: Repetitive API/SDK or framework changes validated by tests.
  • Ops scripts: Patch small scripts and IaC modules inside a sandbox before production.

How this aligns with Karpathy's view

The thesis is simple: feedback-rich loops beat one-shot generations. Smaller changes, faster iterations, honest tests.

Treat the model like a junior dev with superhuman patience and strong recall. Give it tight constraints, crisp targets, and a clean review path.

Security and review

  • Branch protection: No direct writes to main; all AI patches go through CI and code review.
  • Secrets: Never expose real credentials; use sandboxed tokens and redacted logs.
  • Licensing: Track snippet origins; keep third-party code policies clear.
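
Redacted logs can be enforced with a scrub pass before any transcript is stored. The patterns below are illustrative examples for two common token formats plus a generic key assignment, not an exhaustive secret scanner.

```python
# Scrub known secret shapes from loop transcripts before logging them.
import re

SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),            # GitHub personal access tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),               # AWS access key IDs
    re.compile(r"(?i)(api[_-]?key\s*[:=]\s*)\S+"), # generic "api_key = ..." values
]

def redact(text):
    for pat in SECRET_PATTERNS:
        # Keep the captured prefix (e.g. "api_key = ") when the pattern has one.
        text = pat.sub(
            lambda m: (m.group(1) if m.lastindex else "") + "[REDACTED]", text
        )
    return text

line = "api_key = sk-live-abc123 and AKIAABCDEFGHIJKLMNOP"
print(redact(line))  # api_key = [REDACTED] and [REDACTED]
```

Run this at the logging boundary, not inside the model code, so every attempt record passes through it regardless of which prompt produced it.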


The bottom line

Small, verified steps beat big, hopeful changes. Put a testable target in front of the model, keep the loop tight, and let iteration compound.

This isn't hype; it's a workflow. Ship it, measure it, then make the loop a little smarter tomorrow.

