Ethereum's next sprint: AI agents in core dev and governance
Ethereum's developer community is preparing to lean harder into AI. Tomasz Stańczak, a prominent voice in Ethereum's core ecosystem, is pushing for large language models (LLMs) and agentic systems to draft and review proposals, moderate developer calls, write code, and even assess whether upgrades should move forward.
The pitch is simple: Ethereum already ships its process to the internet. EIPs, core dev call notes, client specs, and debates are public. That's a ready-made training and retrieval corpus for AI systems to gain context and act with precision.
Why this fits Ethereum's process
- Public-by-default workflow: proposals and discussions live online, reducing data-wrangling friction.
- Structured artifacts: EIPs have consistent formats, which map well to automated parsing, critique, and summarization (see the parsing sketch after this list).
- Observable governance: call agendas, minutes, and issues are archived in places like ethereum/pm, enabling retrieval-augmented reasoning.
- Clear test oracles: specs, client test suites, and consensus tests provide measurable pass/fail signals for agent feedback loops.
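To make the structured-artifacts point concrete, here is a minimal sketch of parsing an EIP's front-matter preamble into fields an agent can reason over. The field names follow the EIP-1 template; `RAW_EIP` and `parse_preamble` are illustrative stand-ins, not existing tooling.

```python
# Minimal sketch: parse the front-matter preamble that EIP-1 mandates at
# the top of every EIP. RAW_EIP is a hypothetical stand-in for a real
# proposal fetched from the ethereum/EIPs repository.

RAW_EIP = """\
---
eip: 9999
title: Example Proposal
author: Jane Doe (@janedoe)
status: Draft
type: Standards Track
category: Core
created: 2025-01-01
---

## Abstract
...
"""

def parse_preamble(text: str) -> dict[str, str]:
    """Collect key: value pairs between the two '---' delimiters."""
    fields: dict[str, str] = {}
    inside = False
    for line in text.splitlines():
        if line.strip() == "---":
            if inside:
                break  # second delimiter closes the preamble
            inside = True
            continue
        if inside and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

meta = parse_preamble(RAW_EIP)
print(meta["eip"], meta["status"])  # -> 9999 Draft
```

Once the metadata is structured, completeness checks like "every Draft must name an author" become one-line assertions instead of prose review.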
What AI agents could own first
- Proposal drafting support: structure new EIPs, fill templates, validate against style and completeness checks.
- Automated reviews: highlight breaking changes, missing security considerations, and spec ambiguities with citations to prior EIPs.
- Live call moderation: agenda tracking, timeboxing, action-item capture, and instant retrieval of relevant prior decisions.
- Engineering assists: code scaffolding, test generation, fuzz inputs, and diff summarization across client implementations.
- Governance triage: score upgrade proposals against predefined criteria; surface blockers and risks before final human votes. A toy scoring sketch follows this list.
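As a sketch of what governance triage might look like, the snippet below weights per-criterion scores and surfaces weak criteria as blockers. The rubric names, weights, and 0.7 threshold are assumptions for illustration, not an agreed Ethereum rubric.

```python
# Toy rubric-based proposal triage. Criteria, weights, and scores are
# illustrative assumptions, not an agreed Ethereum governance rubric.

RUBRIC = {
    "security_considerations": 0.4,  # threat analysis present and credible
    "backwards_compatibility": 0.3,  # breaking changes documented
    "test_coverage": 0.2,            # consensus/client tests exist
    "spec_clarity": 0.1,             # unambiguous normative language
}

def triage(scores: dict[str, float], threshold: float = 0.7):
    """Weight per-criterion scores in [0, 1] and flag weak criteria as blockers."""
    total = sum(weight * scores.get(name, 0.0) for name, weight in RUBRIC.items())
    blockers = [name for name in RUBRIC if scores.get(name, 0.0) < 0.5]
    verdict = ("advance to human vote"
               if total >= threshold and not blockers
               else "hold for human review")
    return total, blockers, verdict

total, blockers, verdict = triage({
    "security_considerations": 0.9,
    "backwards_compatibility": 0.4,  # migration notes missing
    "test_coverage": 0.8,
    "spec_clarity": 0.7,
})
print(f"{total:.2f} {verdict}; blockers: {blockers}")
# -> 0.71 hold for human review; blockers: ['backwards_compatibility']
```

Note the design choice: even a score above threshold does not advance a proposal while any single criterion is weak, which is exactly the "surface blockers before final human votes" behavior.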
Reality check: risks and failure modes
- Hallucinations: LLMs can output wrong or fabricated claims with high confidence, especially under time pressure or sparse context.
- Spec drift: models trained on outdated threads can reinforce legacy decisions or miss new consensus.
- Overreach: granting agents write/merge or "accept/reject" powers without hard gates can propagate subtle errors network-wide.
- Interface fragility: shifting GitHub labels, repo structures, or meeting formats can break brittle automations.
Guardrails that make this viable
- Human-in-the-loop by default: agents propose; humans decide. Formalize "propose-check-merge" gates.
- RAG over the Ethereum corpus: index EIPs, call notes, and specs; require citations with section anchors for every nontrivial claim.
- Spec-first and test-as-contract: treat executable tests as ground truth; block actions that lack passing tests.
- Policy sandboxes: dry-run changes, simulate outcomes, and require quorum acknowledgements before stateful actions.
- Verification layers: property tests, model checking where feasible, and ensemble critiques for critical steps.
- Provenance and audit trails: log prompts, contexts, and tool calls; make diffs and decisions reproducible.
- Safety budgets: cap rates, tools, and scopes; escalate to humans on ambiguity or low-confidence outputs. A sketch combining the must-cite and escalation rules follows this list.
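Here is a minimal sketch of how the must-cite and escalation guardrails could compose into a single gate. The `AgentClaim` shape, the section-anchor convention, and the 0.8 confidence threshold are assumptions for illustration.

```python
# Minimal propose-check gate: every nontrivial claim must carry a
# citation with a section anchor, and low-confidence output escalates to
# a human instead of being acted on. Data shapes and thresholds are
# illustrative assumptions.

from dataclasses import dataclass

@dataclass
class AgentClaim:
    text: str
    citation: str | None  # e.g. "EIP-1559#security-considerations"
    confidence: float     # model-reported, in [0, 1]

def gate(claims: list[AgentClaim], min_confidence: float = 0.8) -> str:
    for claim in claims:
        if not claim.citation or "#" not in claim.citation:
            return f"REJECT: uncited claim {claim.text!r}"
        if claim.confidence < min_confidence:
            return f"ESCALATE: low confidence on {claim.text!r}"
    return "PASS: forward to a human reviewer for the merge decision"

print(gate([
    AgentClaim("Raises the blob count", "EIP-4844#parameters", 0.93),
    AgentClaim("No consensus impact", None, 0.99),
]))
# -> REJECT: uncited claim 'No consensus impact'
```

Note that even a passing result only forwards the output to a human reviewer; under these guardrails the agent never merges on its own.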
Where to start (practical steps)
- Curate the training corpus: deduplicate EIPs, tag superseded specs, and add canonical links to decisions.
- Define structured schemas: JSON schemas for EIPs, risk rubrics, and meeting minutes to tighten model IO (a hypothetical schema follows this list).
- Tool interfaces: standardize function-call APIs for GitHub, CI, test runners, fuzzers, and client CLIs.
- Evaluator harness: score agents on citation quality, bug-finding, spec clarity, and regression detection (a skeletal harness also follows).
- Roll out narrow verticals: start with meeting moderation and EIP linting before code or governance actions.
- Train the team: set prompt patterns, escalation policies, and red-team exercises.
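To make the structured-schemas step concrete, here is a hypothetical JSON Schema for EIP preamble metadata, expressed as a Python dict. The field names follow the EIP-1 template, but the schema itself is an assumption, not an existing project artifact.

```python
# Hypothetical JSON Schema for the EIP preamble, used to tighten model
# IO: any agent output that fails validation is rejected before review.

EIP_PREAMBLE_SCHEMA = {
    "type": "object",
    "required": ["eip", "title", "author", "status", "type", "created"],
    "properties": {
        "eip": {"type": "integer"},
        "title": {"type": "string", "maxLength": 72},
        "author": {"type": "string"},
        "status": {"enum": ["Draft", "Review", "Last Call", "Final",
                            "Stagnant", "Withdrawn", "Living"]},
        "type": {"enum": ["Standards Track", "Meta", "Informational"]},
        "created": {"type": "string", "format": "date"},
    },
}

def missing_fields(doc: dict) -> list[str]:
    """Cheap structural check; a real gate would run full JSON Schema
    validation, e.g. with the third-party jsonschema package."""
    return [f for f in EIP_PREAMBLE_SCHEMA["required"] if f not in doc]

print(missing_fields({"eip": 9999, "title": "Example Proposal"}))
# -> ['author', 'status', 'type', 'created']
```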
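And a skeletal evaluator harness along the lines described above; the task data, agent output shape, and scoring functions are illustrative placeholders.

```python
# Skeletal evaluator harness: replay fixed tasks with seeded ground
# truth, score agent outputs on citation quality and bug finding, and
# track the numbers run-over-run to catch regressions.

from statistics import mean

def citation_score(output: dict) -> float:
    """Fraction of claims that carry a section-anchored citation."""
    claims = output["claims"]
    cited = [c for c in claims if "#" in c.get("citation", "")]
    return len(cited) / len(claims) if claims else 0.0

def bug_recall(output: dict, known_bugs: set[str]) -> float:
    """Fraction of seeded bugs the agent actually flagged."""
    found = set(output["flagged"]) & known_bugs
    return len(found) / len(known_bugs) if known_bugs else 1.0

# One replayable task with seeded ground truth (hypothetical data).
task = {"known_bugs": {"off-by-one-epoch", "unchecked-overflow"}}
output = {
    "claims": [{"citation": "EIP-4844#blob-gas"}, {"citation": ""}],
    "flagged": ["unchecked-overflow"],
}

scores = {
    "citation_quality": citation_score(output),
    "bug_recall": bug_recall(output, task["known_bugs"]),
}
print(scores, "overall:", round(mean(scores.values()), 2))
# -> {'citation_quality': 0.5, 'bug_recall': 0.5} overall: 0.5
```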
Context from the wider industry
Outside crypto, leading software orgs are leaning hard into AI-assisted workflows. Spotify's leadership recently said its top developers have spent 2026 building with AI-first tooling instead of hand-writing every line. Expect similar cultural shifts in open source as agent frameworks mature.
Clarifying the PoW reference
Proof-of-Work (PoW) is the consensus mechanism networks like Bitcoin use to validate blocks. Ethereum's history and public documentation across its PoW and post-merge eras give AI systems rich context on design debates, tradeoffs, and upgrade patterns. A toy illustration of the mining loop follows.
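For readers new to the term, the sketch below shows the core PoW idea: grind nonces until a block hash falls below a difficulty target. The parameters are didactic and nowhere near production difficulty.

```python
# Toy Proof-of-Work: search for a nonce whose SHA-256 hash has a
# required number of leading zero bits. Real networks use far higher
# difficulty and richer block structures; this is purely illustrative.

import hashlib

def mine(block_header: bytes, difficulty_bits: int = 16) -> int:
    """Return a nonce whose hash has difficulty_bits leading zero bits."""
    target = 1 << (256 - difficulty_bits)  # the hash must fall below this
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

print("found nonce:", mine(b"toy-block-header"))  # ~2**16 attempts expected
```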
Timeline and expectations
- Near term: meeting moderation, EIP linting, and retrieval-backed summarization.
- Mid term (by the Q3 tooling milestone): stable integrations across repos, CI, and call workflows; evaluator dashboards live.
- 12-24 months: agents assist with upgrade assessments under strict guardrails; human sign-off remains the final gate.
What this means for developers
- Your edge is context engineering: clean corpora, tight schemas, and rock-solid tests will outperform raw model size.
- Ship agents where failure is cheap first. Promote to higher-stakes tasks only after measurable, audited wins.
- Assume hallucinations and design for containment. Confidence thresholds and "must-cite" policies aren't optional.
Ethereum has the ingredients: public process, strong specs, and a culture that measures twice before cutting once. With the right guardrails, AI can take on the grunt work, surface blind spots, and let humans focus on first-principles decisions.