Iterative Self-Improvement of Code: How AI Learns to Ship Better Software
Andrej Karpathy popularized a simple idea with big consequences: let the model write code, run it, critique itself, patch the code, and repeat. It's the same way a sharp engineer improves, just compressed into tight loops.
For IT and development roles alike, this loop isn't theory. It's a practical system you can bolt onto your workflow to reduce bugs, speed up maintenance, and move from "suggested code" to "proven fixes."
The core loop
- Plan: State the goal, constraints, and acceptance criteria.
- Generate: Produce the smallest change that could work.
- Run: Execute tests, static checks, and example scripts in a sandbox.
- Critique: Read failures, trace logs, and diffs; summarize root cause.
- Patch: Apply a targeted fix; avoid scope creep.
- Repeat: Stop when criteria are met or a max-iteration cap hits.
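The loop above can be sketched in a few lines of Python. Here `generate_patch` stands in for the model call and `apply_patch` for writing to a working branch; both are hypothetical hooks, not a specific API:

```python
import subprocess

MAX_ITERATIONS = 5  # stop rule: cap attempts before handing off to a human

def run_checks(cmd: list[str]) -> tuple[bool, str]:
    """Run the test/lint command in a sandbox; return (passed, output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def improve(task, generate_patch, apply_patch, check_cmd):
    """Plan -> generate -> run -> critique -> patch, repeated until green."""
    feedback = ""
    for attempt in range(1, MAX_ITERATIONS + 1):
        patch = generate_patch(task, feedback)  # model call (hypothetical hook)
        apply_patch(patch)                      # write to a branch, never main
        passed, output = run_checks(check_cmd)
        if passed:
            return {"status": "pass", "attempts": attempt}
        feedback = output  # failing traces become the next prompt's context
    return {"status": "handoff", "attempts": MAX_ITERATIONS, "last_output": feedback}
```

The `handoff` branch matters as much as the happy path: when the cap hits, the accumulated feedback becomes the human-readable note described below.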
What makes it work
- Tests are the truth: High-signal unit/integration tests translate intent into pass/fail feedback.
- Deterministic environment: Containerized runs, pinned deps, seeded randomness.
- Clear reward signal: Test pass rate, lints, type checks, and performance gates.
- Context and memory: Relevant files, failing traces, and prior attempts, not the whole repo dump.
- Safety rails: Write to branches, run in ephemeral sandboxes, and require human review for merges.
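One way to mix those signals into a single score, so no individual check can be gamed in isolation. The weights here are illustrative, not prescriptive; tune them for your project:

```python
def reward(test_pass_rate: float, lint_errors: int, type_errors: int, perf_ok: bool) -> float:
    """Combine signals so the model can't satisfy one check while ignoring the rest."""
    score = test_pass_rate          # primary signal, 0.0..1.0
    score -= 0.05 * lint_errors     # small penalty per lint finding
    score -= 0.10 * type_errors     # type errors weigh more than lints
    if not perf_ok:
        score -= 0.25               # performance gate failure is a large penalty
    return max(score, 0.0)          # floor at zero
```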
Set this up in your team this week
- Pick a scope: Start with bug fixing, flaky tests, or boilerplate migrations: bounded, testable work.
- Toolchain: CI-accessible container, unit tests, static analysis, type checks, and a scriptable runner.
- Evaluator: One command that returns pass/fail, failing tests, and key metrics as machine-readable output.
- Prompts: A compact system prompt (coding style, project rules), a critique prompt (why it failed), and a patch plan prompt (small, verifiable diffs).
- Context control: Only feed changed files, failing traces, and closest neighbors; avoid context bloat.
- Logging: Store attempt count, diffs, test results, and tokens; this history is your performance dashboard.
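A minimal evaluator along these lines: one function, machine-readable results out. The pytest invocation in the usage comment is an assumption; swap in your project's runner:

```python
import subprocess

def evaluate(test_cmd: list[str]) -> dict:
    """One evaluator command: run the suite, return results the loop
    (and your dashboard) can parse."""
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    return {
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        # Keep only the tail so failing output stays small in the model's context.
        "output_tail": (proc.stdout + proc.stderr)[-2000:],
    }

# Usage (assumed runner): evaluate(["python", "-m", "pytest", "-q"])
```

Wrap the dict with `json.dumps` if your runner wants a string on stdout.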
Prompts that drive useful iterations
- Goal: Describe the bug or task, the exact acceptance tests, and any non-negotiable constraints.
- Reflection: Ask for a minimal root-cause hypothesis tied to specific lines and errors.
- Patch plan: Request a tiny diff, why it should work, and any new tests to add.
- Stop rule: Cap iterations; if still failing, ask for a human-readable handoff note.
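A critique prompt following that shape might look like this sketch. The template wording and the trace-truncation limit are assumptions to adapt, not a tested recipe:

```python
CRITIQUE_PROMPT = """\
You are reviewing a failed attempt.
Goal: {goal}
Acceptance tests: {tests}
Failing output:
{trace}

Give a minimal root-cause hypothesis tied to specific lines and errors,
then a patch plan: the smallest diff that should fix it, and any new
tests to add. Do not expand scope."""

def build_critique_prompt(goal: str, tests: str, trace: str) -> str:
    """Fill the critique template; truncate traces to avoid context bloat."""
    return CRITIQUE_PROMPT.format(goal=goal, tests=tests, trace=trace[-1500:])
```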
Metrics that matter
- Pass rate improvement: Tests passed after N iterations vs. baseline.
- Regression count: New failures introduced per successful fix.
- Tokens and time per fix: Cost and cycle time trend.
- Human edit distance: How much reviewers changed before merge.
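Human edit distance is easy to approximate with the standard library by comparing the AI's patch to what actually merged:

```python
import difflib

def human_edit_ratio(ai_patch: str, merged_patch: str) -> float:
    """How much reviewers changed the AI's patch before merge:
    0.0 = merged as-is, 1.0 = fully rewritten."""
    similarity = difflib.SequenceMatcher(None, ai_patch, merged_patch).ratio()
    return 1.0 - similarity
```

Track this per fix over time; a falling ratio means the loop is producing merge-ready patches.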
Common failure modes (and quick fixes)
- Spec gaps: The model guesses. Fix by adding tests and explicit acceptance criteria.
- Flaky tests: The loop chases noise. Stabilize tests and seed randomness.
- Reward hacking: The model "satisfies" checks without solving the problem. Mix signals: tests, lints, types, and perf gates.
- Context drift: Too much repo context confuses the model. Provide only files and traces that matter.
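For flaky tests driven by randomness, an isolated, seeded RNG makes each loop iteration see identical data, so a failure is signal rather than noise:

```python
import random

def stable_sample(items, k: int, seed: int = 1234):
    """Seeded sampling: repeated runs produce the same 'random' data."""
    rng = random.Random(seed)  # isolated RNG; does not touch global state
    return rng.sample(list(items), k)
```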
Where this pays off first
- Bug backlogs: Triage and fix well-scoped issues with tests.
- Legacy code: Add missing tests, refactor safe pieces, and document behavior discovered by the loop.
- Migrations: Repetitive API/SDK or framework changes validated by tests.
- Ops scripts: Patch small scripts and IaC modules inside a sandbox before production.
How this aligns with Karpathy's view
The thesis is simple: feedback-rich loops beat one-shot generations. Smaller changes, faster iterations, honest tests.
Treat the model like a junior dev with superhuman patience and strong recall. Give it tight constraints, crisp targets, and a clean review path.
Security and review
- Branch protection: No direct writes to main; all AI patches go through CI and code review.
- Secrets: Never expose real credentials; use sandboxed tokens and redacted logs.
- Licensing: Track snippet origins; keep third-party code policies clear.
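A simple redaction pass keeps credentials out of anything the model or dashboard sees. The patterns below are illustrative; extend them to match your real token formats:

```python
import re

# Illustrative patterns only; add your organization's actual secret shapes.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*\S+"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shape of a GitHub classic PAT
]

def redact(log_line: str) -> str:
    """Strip credentials before logs reach the model or your dashboard."""
    for pattern in SECRET_PATTERNS:
        log_line = pattern.sub("[REDACTED]", log_line)
    return log_line
```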
Want to go deeper?
- Generative Code
- AI Learning Path for Software Developers
- SWE-bench benchmark for measuring automated bug fixing in real repos.
- Self-Refine for iterative refinement strategies that echo this loop.
The bottom line
Small, verified steps beat big, hopeful changes. Put a testable target in front of the model, keep the loop tight, and let iteration compound.
This isn't hype; it's a workflow. Ship it, measure it, then make the loop a little smarter tomorrow.