Iterative Self-Improvement of Code: How AI Learns to Ship Better Software
Andrej Karpathy popularized a simple idea with big consequences: let the model write code, run it, critique itself, patch the code, and repeat. It's the same way a sharp engineer improves, just compressed into tight loops.
For IT and development roles alike, this loop isn't theory. It's a practical system you can bolt onto your workflow to reduce bugs, speed up maintenance, and move from "suggested code" to "proven fixes."
The core loop
- Plan: State the goal, constraints, and acceptance criteria.
- Generate: Produce the smallest change that could work.
- Run: Execute tests, static checks, and example scripts in a sandbox.
- Critique: Read failures, trace logs, and diffs; summarize root cause.
- Patch: Apply a targeted fix; avoid scope creep.
- Repeat: Stop when criteria are met or a max-iteration cap hits.
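The loop above can be sketched in a few lines of Python. Here `generate_patch` stands in for the model call and `apply_patch` for writing to a working branch; both are hypothetical hooks, not a specific API:

```python
import subprocess

MAX_ITERATIONS = 5  # stop rule: cap attempts before handing off to a human

def run_checks(cmd: list[str]) -> tuple[bool, str]:
    """Run the test/lint command in a sandbox; return (passed, output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def improve(task, generate_patch, apply_patch, check_cmd):
    """Plan -> generate -> run -> critique -> patch, repeated until green."""
    feedback = ""
    for attempt in range(1, MAX_ITERATIONS + 1):
        patch = generate_patch(task, feedback)  # model call (hypothetical hook)
        apply_patch(patch)                      # write to a branch, never main
        passed, output = run_checks(check_cmd)
        if passed:
            return {"status": "pass", "attempts": attempt}
        feedback = output  # failing traces become the next prompt's context
    return {"status": "handoff", "attempts": MAX_ITERATIONS, "last_output": feedback}
```

The `handoff` branch matters as much as the happy path: when the cap hits, the accumulated feedback becomes the human-readable note described below.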
What makes it work
- Tests are the truth: High-signal unit/integration tests translate intent into pass/fail feedback.
- Deterministic environment: Containerized runs, pinned deps, seeded randomness.
- Clear reward signal: Test pass rate, lints, type checks, and performance gates.
- Context and memory: Relevant files, failing traces, and prior attempts, not the whole repo dump.
- Safety rails: Write to branches, run in ephemeral sandboxes, and require human review for merges.
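One way to mix those signals into a single score, so no individual check can be gamed in isolation. The weights here are illustrative, not prescriptive; tune them for your project:

```python
def reward(test_pass_rate: float, lint_errors: int, type_errors: int, perf_ok: bool) -> float:
    """Combine signals so the model can't satisfy one check while ignoring the rest."""
    score = test_pass_rate          # primary signal, 0.0..1.0
    score -= 0.05 * lint_errors     # small penalty per lint finding
    score -= 0.10 * type_errors     # type errors weigh more than lints
    if not perf_ok:
        score -= 0.25               # performance gate failure is a large penalty
    return max(score, 0.0)          # floor at zero
```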
Set this up in your team this week
- Pick a scope: Start with bug fixing, flaky tests, or boilerplate migrations: bounded, testable work.
- Toolchain: CI-accessible container, unit tests, static analysis, type checks, and a scriptable runner.
- Evaluator: One command that returns pass/fail, failing tests, and key metrics as machine-readable output.
- Prompts: A compact system prompt (coding style, project rules), a critique prompt (why it failed), and a patch plan prompt (small, verifiable diffs).
- Context control: Only feed changed files, failing traces, and closest neighbors; avoid context bloat.
- Logging: Store attempt count, diffs, test results, and tokens; this history is your performance dashboard.
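A minimal evaluator along these lines: one function, machine-readable results out. The pytest invocation in the usage comment is an assumption; swap in your project's runner:

```python
import subprocess

def evaluate(test_cmd: list[str]) -> dict:
    """One evaluator command: run the suite, return results the loop
    (and your dashboard) can parse."""
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    return {
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        # Keep only the tail so failing output stays small in the model's context.
        "output_tail": (proc.stdout + proc.stderr)[-2000:],
    }

# Usage (assumed runner): evaluate(["python", "-m", "pytest", "-q"])
```

Wrap the dict with `json.dumps` if your runner wants a string on stdout.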
Prompts that drive useful iterations
- Goal: Describe the bug or task, the exact acceptance tests, and any non-negotiable constraints.
- Reflection: Ask for a minimal root-cause hypothesis tied to specific lines and errors.
- Patch plan: Request a tiny diff, why it should work, and any new tests to add.
- Stop rule: Cap iterations; if still failing, ask for a human-readable handoff note.
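A critique prompt following that shape might look like this sketch. The template wording and the trace-truncation limit are assumptions to adapt, not a tested recipe:

```python
CRITIQUE_PROMPT = """\
You are reviewing a failed attempt.
Goal: {goal}
Acceptance tests: {tests}
Failing output:
{trace}

Give a minimal root-cause hypothesis tied to specific lines and errors,
then a patch plan: the smallest diff that should fix it, and any new
tests to add. Do not expand scope."""

def build_critique_prompt(goal: str, tests: str, trace: str) -> str:
    """Fill the critique template; truncate traces to avoid context bloat."""
    return CRITIQUE_PROMPT.format(goal=goal, tests=tests, trace=trace[-1500:])
```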
Metrics that matter
- Pass rate improvement: Tests passed after N iterations vs. baseline.
- Regression count: New failures introduced per successful fix.
- Tokens and time per fix: Cost and cycle time trend.
- Human edit distance: How much reviewers changed before merge.
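Human edit distance is easy to approximate with the standard library by comparing the AI's patch to what actually merged:

```python
import difflib

def human_edit_ratio(ai_patch: str, merged_patch: str) -> float:
    """How much reviewers changed the AI's patch before merge:
    0.0 = merged as-is, 1.0 = fully rewritten."""
    similarity = difflib.SequenceMatcher(None, ai_patch, merged_patch).ratio()
    return 1.0 - similarity
```

Track this per fix over time; a falling ratio means the loop is producing merge-ready patches.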
Common failure modes (and quick fixes)
- Spec gaps: The model guesses. Fix by adding tests and explicit acceptance criteria.
- Flaky tests: The loop chases noise. Stabilize tests and seed randomness.
- Reward hacking: The model "satisfies" checks without solving the problem. Mix signals: tests, lints, types, and perf gates.
- Context drift: Too much repo context confuses the model. Provide only files and traces that matter.
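For flaky tests driven by randomness, an isolated, seeded RNG makes each loop iteration see identical data, so a failure is signal rather than noise:

```python
import random

def stable_sample(items, k: int, seed: int = 1234):
    """Seeded sampling: repeated runs produce the same 'random' data."""
    rng = random.Random(seed)  # isolated RNG; does not touch global state
    return rng.sample(list(items), k)
```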
Where this pays off first
- Bug backlogs: Triage and fix well-scoped issues with tests.
- Legacy code: Add missing tests, refactor safe pieces, and document behavior discovered by the loop.
- Migrations: Repetitive API/SDK or framework changes validated by tests.
- Ops scripts: Patch small scripts and IaC modules inside a sandbox before production.
How this aligns with Karpathy's view
The thesis is simple: feedback-rich loops beat one-shot generations. Smaller changes, faster iterations, honest tests.
Treat the model like a junior dev with superhuman patience and strong recall. Give it tight constraints, crisp targets, and a clean review path.
Security and review
- Branch protection: No direct writes to main; all AI patches go through CI and code review.
- Secrets: Never expose real credentials; use sandboxed tokens and redacted logs.
- Licensing: Track snippet origins; keep third-party code policies clear.
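A simple redaction pass keeps credentials out of anything the model or dashboard sees. The patterns below are illustrative; extend them to match your real token formats:

```python
import re

# Illustrative patterns only; add your organization's actual secret shapes.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*\S+"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shape of a GitHub classic PAT
]

def redact(log_line: str) -> str:
    """Strip credentials before logs reach the model or your dashboard."""
    for pattern in SECRET_PATTERNS:
        log_line = pattern.sub("[REDACTED]", log_line)
    return log_line
```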
Want to go deeper?
- Generative Code
- AI Learning Path for Software Developers
- SWE-bench benchmark for measuring automated bug fixing in real repos.
- Self-Refine for iterative refinement strategies that echo this loop.
The bottom line
Small, verified steps beat big, hopeful changes. Put a testable target in front of the model, keep the loop tight, and let iteration compound.
This isn't hype; it's a workflow. Ship it, measure it, then make the loop a little smarter tomorrow.