At One Conference, AI Wrote and Reviewed Every Paper - Here's What We Learned

At Agents4Science, A.I. drafted papers and did first-pass reviews; humans stepped in after. The approach sped things up, but factual errors and weak judgment on novelty meant humans stayed in charge.

Published on: Oct 31, 2025

A Conference Where A.I. Wrote and Reviewed the Science: What Actually Happened

On October 22, an experiment put a hard question on the table: Can A.I. meaningfully author and review scientific work? An event called Agents4Science required that submissions be prepared and initially reviewed by A.I., with humans stepping in only after the first cut.

The goal was simple and blunt: test whether A.I. can generate new insights and uphold quality through A.I.-driven peer review. Submissions and reviews were made public so the community could study what worked and what broke.

Coverage from major outlets framed the stakes and the skepticism. See reporting in Nature and Science.

The Experiment, By The Numbers

The virtual conference drew 1,800 registrations. A total of 315 A.I.-generated papers were submitted and screened by A.I. reviewers. Eighty cleared the first pass; human reviewers then helped select the final 48 for presentation.

Most work centered on A.I. and human learning, skewing toward computational rather than physical experimentation. Three papers were recognized as outstanding:

  • Behavior of A.I. agents in economic marketplaces
  • Impact of reduced towing fees in San Francisco on low-income residents
  • Whether an A.I. agent can fool A.I. reviewers with weak or flawed papers

Where A.I. Helped

Teams reported faster throughput on computational tasks. One collaborator on the towing-fee study noted that A.I. sped up analysis and exploration.

But speed came with errors. The system repeatedly used the wrong implementation date for the policy change: a small mistake with big consequences. As one researcher put it, the core scientific work remains human-driven.

Where A.I. Fell Short

Evaluating novelty and significance is still a weak spot. Prior research suggests A.I. reviewers underperform humans on those dimensions.

Another signal: many large language models skew "too agreeable," dampening the conflict and diverse viewpoints that often precede breakthroughs. And the field still lacks reliable methods to evaluate A.I. agents or detect false positives at scale.

Critiques You Can't Ignore

Some scholars argue that science is a collective human enterprise built on interpretation and judgment, not a pipeline of inputs and outputs. If both authors and reviewers are automated, what is left for scholars to trust?

The sharpest critiques warn against mistaking fluent text for scientific contribution. Without grounded hypotheses, careful study design, and adversarial review, you get polished noise.

Practical Takeaways for Labs and PIs

  • Use A.I. for scaffolding and acceleration, never for final claims. Lock in human-led hypotheses, design, and interpretation.
  • Require provenance: log prompts, versions, and model settings. Treat every A.I. contribution as auditable (a minimal sketch follows this list).
  • Mandate fact checks on all dates, units, and dataset descriptors. Automate where possible, verify manually before submission.
  • Treat A.I. review as a second opinion, not a gatekeeper. Keep human program committees in the loop for novelty and significance.
  • Red-team your own outputs: ask A.I. (and colleagues) to find errors, leaks, p-hacking, and confounds.
  • If you run a venue, pilot randomized review arms (human vs. A.I.) and track downstream impact, citations, and replication.
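
Provenance logging need not be heavy machinery. Here is a minimal Python sketch of the "log prompts, versions, and model settings" takeaway above; the function name, field layout, and JSONL file are illustrative assumptions, not an established standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_ai_contribution(prompt: str, model: str, settings: dict,
                        output: str, logfile: str = "ai_provenance.jsonl") -> None:
    """Append one auditable record per A.I. contribution.

    All names here (the function, the JSONL file, the field layout)
    are illustrative, not a standard.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,            # model name plus version string
        "settings": settings,      # temperature, max tokens, etc.
        "prompt": prompt,
        # Hash the output so later edits to the draft are detectable.
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Usage: call once per generation, before the text enters the draft.
log_ai_contribution(
    prompt="Summarize the towing-fee dataset by income bracket.",
    model="example-model-2025-01",
    settings={"temperature": 0.2},
    output="...generated text...",
)
```

An append-only log like this gives reviewers and co-authors a trail to audit, and the output hash makes silent post-generation edits detectable.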

Open Questions for the Community

  • Can current agents generate truly novel insights or just recombine the literature?
  • How do we measure reviewer quality beyond surface-level correctness?
  • What safeguards prevent A.I. authors from gaming A.I. reviewers?
  • How should credit, accountability, and disclosure evolve as tooling scales?
  • Where are the ethical and legal boundaries for authorship and responsibility?

A Short Historical Note

The idea isn't entirely new. In 1913, Andrey Markov analyzed letter sequences in Pushkin's Eugene Onegin, tabulating how likely one letter is to follow another. Today's language models are far larger and more capable, but the core next-symbol prediction game echoes that early work.
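
For the curious, the quantity Markov tabulated is easy to reconstruct. A minimal Python sketch, with an illustrative sample sentence standing in for Onegin's text:

```python
from collections import Counter, defaultdict

def letter_transitions(text: str) -> dict:
    """Estimate P(next letter | current letter) from raw text,
    the kind of count Markov tabulated by hand."""
    counts = defaultdict(Counter)
    chars = [c.lower() for c in text if c.isalpha()]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1
    # Normalize counts into conditional probabilities.
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in counts.items()
    }

probs = letter_transitions("the quick brown fox jumps over the lazy dog")
print(probs["t"])  # {'h': 1.0}: in this sample, 'h' always follows 't'
```

A modern language model plays the same game with tokens instead of letters and billions of parameters instead of a frequency table.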

Bottom Line

A.I. can speed up parts of the research pipeline, especially computation and drafting. It struggles with novelty, judgment, and rigorous review, the exact places where science earns trust.

Use it, but keep humans in the driver's seat. Build processes that assume the model will make confident mistakes, and catch them before the paper leaves your lab.

Further Skill Building

If you're formalizing A.I. use in your lab or department, you may find structured, role-specific training helpful: A.I. courses by job.

