A Conference Where A.I. Wrote and Reviewed the Science: What Actually Happened
On October 22, an experiment put a hard question on the table: Can A.I. meaningfully author and review scientific work? An event called Agents4Science required that submissions be prepared and initially reviewed by A.I., with humans stepping in only after the first cut.
The goal was simple and blunt: test whether A.I. can generate new insights and uphold quality through A.I.-driven peer review. Submissions and reviews were made public to let the community study what worked and what broke.
Coverage from major outlets framed the stakes and the skepticism. See reporting in Nature and Science.
The Experiment, By The Numbers
The virtual conference drew 1,800 registrations. A total of 315 A.I.-generated papers were submitted and screened by A.I. reviewers. Eighty cleared the first pass; human reviewers then helped select the final 48 for presentation.
Most work centered on A.I. and human learning, skewing toward computational rather than physical experimentation. Three papers were recognized as outstanding:
- Behavior of A.I. agents in economic marketplaces
- Impact of reduced towing fees in San Francisco on low-income residents
- Whether an A.I. agent can fool A.I. reviewers with weak or flawed papers
Where A.I. Helped
Teams reported faster throughput on computational tasks. One collaborator on the towing-fee study noted that A.I. sped up analysis and exploration.
But speed came with errors. The system repeatedly used the wrong implementation date for the policy change, a small mistake with big consequences. As one researcher put it, the core scientific work remains human-driven.
Where A.I. Fell Short
Evaluating novelty and significance is still a weak spot. Prior research suggests A.I. reviewers underperform humans on those dimensions.
Another signal: many large language models skew "too agreeable," dampening the conflict and diverse viewpoints that often precede breakthroughs. And the field still lacks reliable methods to evaluate A.I. agents or detect false positives at scale.
Critiques You Can't Ignore
Some scholars argue that science is a collective human enterprise built on interpretation and judgment, not a pipeline of inputs and outputs. If both authors and reviewers are automated, what is left for scholars to trust?
The sharpest critiques warn against mistaking fluent text for scientific contribution. Without grounded hypotheses, careful study design, and adversarial review, you get polished noise.
Practical Takeaways for Labs and PIs
- Use A.I. for scaffolding and acceleration, never for final claims. Lock in human-led hypotheses, design, and interpretation.
- Require provenance: log prompts, versions, and model settings. Treat every A.I. contribution as auditable (a minimal logging-and-check sketch follows this list).
- Mandate fact checks on all dates, units, and dataset descriptors. Automate where possible and verify manually before submission.
- Treat A.I. review as a second opinion, not a gatekeeper. Keep human program committees in the loop for novelty and significance.
- Red-team your own outputs: ask A.I. (and colleagues) to find errors, leaks, p-hacking, and confounds.
- If you run a venue, pilot randomized review arms (human vs. A.I.) and track downstream impact, citations, and replication (a simple arm-assignment sketch also appears below).
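
For teams that want to operationalize the provenance and fact-check bullets above, here is a minimal Python sketch. The record fields, the `log_contribution` and `check_facts` helpers, and the placeholder values are illustrative assumptions, not a standard tool or the actual data from any submitted paper.

```python
import json
import hashlib
from datetime import datetime, timezone

def log_contribution(path, prompt, model, settings, output_text):
    """Append one auditable record of an A.I.-assisted contribution (hypothetical schema)."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model": model,            # model name and version string
        "settings": settings,      # temperature, seed, etc.
        "prompt": prompt,
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
    }
    with open(path, "a") as f:     # append-only audit log, one JSON object per line
        f.write(json.dumps(record) + "\n")
    return record

def check_facts(draft_text, expected_facts):
    """Flag any human-curated ground-truth strings (dates, units) missing from a draft."""
    return [fact for fact in expected_facts if fact not in draft_text]

if __name__ == "__main__":
    expected = ["2021-01-01", "USD"]   # placeholder ground-truth values, not real policy dates
    draft = "We study a fee reduction effective 2020-01-01, measured in USD."
    print(check_facts(draft, expected))  # -> ['2021-01-01']: the mismatched date gets flagged
```
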
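And for venues piloting randomized review arms, assignment itself is straightforward. This sketch, with made-up submission IDs and a fixed seed so the split is reproducible and auditable, divides submissions evenly between human and A.I. review; the tracking of citations and replication would happen downstream.

```python
import random

def assign_review_arms(submission_ids, seed=0):
    """Randomly split submissions into human vs. A.I. review arms (illustrative)."""
    rng = random.Random(seed)       # fixed seed keeps the assignment reproducible
    ids = list(submission_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"human": ids[:half], "ai": ids[half:]}

arms = assign_review_arms([f"paper-{i:03d}" for i in range(10)])
print(arms["human"])                # later: compare outcomes per arm
```
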
Open Questions for the Community
- Can current agents generate truly novel insights or just recombine the literature?
- How do we measure reviewer quality beyond surface-level correctness?
- What safeguards prevent A.I. authors from gaming A.I. reviewers?
- How should credit, accountability, and disclosure evolve as tooling scales?
- Where are the ethical and legal boundaries for authorship and responsibility?
A Short Historical Note
The idea isn't entirely new. In 1913, Andrey Markov analyzed letter sequences in Pushkin's Eugene Onegin, studying how likely one letter is to follow another. Today's language models are far larger and more capable, but the core prediction game echoes that early work.
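As a toy illustration of that prediction game, the sketch below counts letter pairs in a short snippet and turns the counts into next-letter probabilities; the snippet and function name are placeholders, not Markov's corpus or his exact method.

```python
from collections import Counter, defaultdict

def next_letter_probabilities(text):
    """Estimate P(next letter | current letter) from raw bigram counts."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(followers.values()) for b, n in followers.items()}
        for a, followers in counts.items()
    }

probs = next_letter_probabilities("the theory of chains began with letters")
print(probs["t"])   # e.g. {'h': 0.6, 'e': 0.2, ...}: how likely each letter follows 't'
```
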
Bottom Line
A.I. can speed up parts of the research pipeline, especially computation and drafting. It struggles with novelty, judgment, and rigorous review, the exact places where science earns trust.
Use it, but keep humans in the driver's seat. Build processes that assume the model will make confident mistakes, and catch them before the paper leaves your lab.
Further Skill Building
If you're formalizing A.I. use in your lab or department, you may find structured, role-specific training helpful: A.I. courses by job.