AI "scientists" still need human judgment: lessons from Agents4Science 2025
At a one-of-a-kind conference, AI systems were listed as first authors and even reviewers. The goal: see what happens when agents lead research, end to end. The result: useful technical output, but shaky scientific judgment and frequent citation failures.
Agents4Science accepted 47 papers from 300+ submissions. According to co-organizer James Zou (Stanford), the event was created because most journals won't allow AI as co-authors, which makes it hard to be transparent about how researchers actually use these systems.
What actually happened in the studies
- ChatGPT and Claude ran a two-sided job marketplace project, from ideation to experiments. They drifted off-topic, forgot to update supporting documents, hallucinated references, and produced redundant code and prose until human collaborators intervened.
- Google's Gemini analyzed San Francisco's 2020 policy cutting towing fees for low-income drivers. It handled the data processing, UC Berkeley researchers reported, but repeatedly fabricated sources.
How human experts judged the work
Risa Wechsler, a computational astrophysicist at Stanford, said the submissions showed decent technical chops but weak judgment. Some analyses were fine on paper yet uninteresting, or framed questions in ways that didn't make sense, sometimes using methods far too complex for the problem.
James Evans, a computational sociologist at the University of Chicago, warned about the confident tone of current AI systems. When an agent sounds neutral and certain, people tend to stop questioning it, and that is bad news for a process that depends on disagreement and argument to move forward.
Barbara Cheifet, editor at Nature Biotechnology, stressed that hallucinated references are still a major issue. Her stance: treat AI as a colleague, not an author, because humans are responsible for accuracy, originality, and integrity.
What this means for researchers and writers
AI can accelerate parts of research and writing. But without firm constraints, it drifts, fabricates, and overcomplicates. If you use agents in your work, keep your hands on the wheel.
- Keep humans in charge of the question. Let AI explore, but you decide what's interesting, important, and worth testing.
- Reduce method bloat. Start with the simplest baseline that can answer the question. Only add complexity when it clearly beats that baseline.
- Structure the workflow. Break the project into checkpoints: problem framing, literature scan, data plan, analysis, interpretation, write-up. Require a short human approval at each step (a minimal gate sketch appears after this list).
- Force argument, not agreement. Ask the model to critique its own plan, propose alternatives, and list failure modes. If you use multi-agent setups, assign opposing roles.
- Citation hygiene is non-negotiable. Ban auto-citations. Require DOIs or verifiable URLs, and check every reference (see the DOI-check sketch after this list). See Nature's policies on AI authorship and the COPE guidelines.
- Log everything. Save prompts, model versions, seeds, and outputs. Treat agent runs like experiments recorded in a lab notebook (see the logging sketch after this list).
- Guardrails for context. Maintain a single "source of truth" document the agent must read before generating. Require explicit updates to supporting materials.
- Zero tolerance for fabrication. Instruct the model to say "I don't have a source" instead of guessing. Ask for page numbers and quotes for any claim tied to a citation.
- Separate analysis from narration. Have one pass generate the code/analysis and another write the explanation. Then swap: have each pass critique the other's output.
- Reproducibility first. Package data, code, environment, and a plain-English README so another researcher can rerun the work without you.
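To make the checkpoint idea concrete, here is a minimal sketch of an approval gate. The ProjectGate class, the checkpoint names, and the reviewer initials are illustrative assumptions; the point is simply that a later stage cannot start until a human has signed off on every earlier one.

```python
CHECKPOINTS = [
    "problem framing", "literature scan", "data plan",
    "analysis", "interpretation", "write-up",
]

class ProjectGate:
    """Track human sign-off; a stage unlocks only after all earlier ones."""

    def __init__(self) -> None:
        self.approved: dict[str, str] = {}  # checkpoint -> reviewer who signed off

    def approve(self, checkpoint: str, reviewer: str) -> None:
        if checkpoint not in CHECKPOINTS:
            raise ValueError(f"unknown checkpoint: {checkpoint}")
        self.approved[checkpoint] = reviewer

    def can_start(self, checkpoint: str) -> bool:
        earlier = CHECKPOINTS[: CHECKPOINTS.index(checkpoint)]
        return all(c in self.approved for c in earlier)

gate = ProjectGate()
gate.approve("problem framing", "RW")     # human reviewer signs off
print(gate.can_start("literature scan"))  # True: framing is approved
print(gate.can_start("analysis"))         # False: data plan not yet approved
```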
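For the citation check, Crossref's public REST API (https://api.crossref.org/works/{DOI}) returns metadata for any registered DOI, so a basic existence check is easy to script. This is a sketch using only the Python standard library; verify_doi and check_references are our own hypothetical helpers, and a serious checker would also compare the returned titles and authors against what the model claimed.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"  # Crossref's public REST endpoint

def verify_doi(doi: str):
    """Return Crossref metadata for a DOI, or None if it does not resolve."""
    url = CROSSREF_API + urllib.parse.quote(doi)  # quote() keeps the '/' in DOIs
    req = urllib.request.Request(url, headers={"User-Agent": "citation-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)["message"]
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None  # unregistered DOI or network failure: flag for manual review

def check_references(dois):
    for doi in dois:
        meta = verify_doi(doi)
        if meta is None:
            print(f"FLAG  {doi}  (not found on Crossref, verify by hand)")
        else:
            title = (meta.get("title") or ["<no title>"])[0]
            print(f"OK    {doi}  {title}")

check_references(["10.1038/s41586-021-03819-2"])  # one real DOI, for illustration
```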
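For the lab notebook, an append-only JSONL file goes a long way. The log_run helper below is a hypothetical sketch: the field names are our assumptions, and the model identifier in the usage example is invented.

```python
import hashlib
import json
import time
from pathlib import Path

NOTEBOOK = Path("agent_notebook.jsonl")  # one append-only file per project

def log_run(prompt: str, model: str, seed, output: str, note: str = "") -> str:
    """Append one agent run to the notebook and return a short content hash."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "model": model,    # the exact model/version string you called
        "seed": seed,      # None if the provider exposes no seed control
        "prompt": prompt,
        "output": output,
        "note": note,      # why you ran this step, in your own words
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()[:12]
    with NOTEBOOK.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["hash"]

# Usage: log every call, then cite the hash in your draft notes.
run_id = log_run(
    prompt="Summarize the towing-fee dataset by income bracket.",
    model="example-model-2025-01",  # hypothetical model identifier
    seed=42,
    output="(model output here)",
    note="first pass at descriptive stats",
)
print("logged run", run_id)
```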
Where AI stands right now
From this conference, the pattern is clear: AI is a capable assistant that speeds up analysis and drafting, but it still falls short on choosing meaningful questions, keeping context straight, and citing correctly. That aligns with Wechsler's view that, over the next decade, AI will sit somewhere between "best intern" and "favorite collaborator."
Use agents to move faster. Rely on humans to decide what matters and to verify what's true.
Further practice
If you want structured practice on prompts and agent workflows for research and writing, see our prompt course resources.