AI agents struggle with biological data. The fix isn't better models; it's better infrastructure.
Researchers tested four state-of-the-art AI agents (Claude, GPT, and two others) on a straightforward task: retrieve viral sequences from NCBI Virus, a database virologists use daily for outbreak response and vaccine design. The results were sobering. Even among the strongest models, accuracy ranged from 16.9% to 91.3% on queries that require near-perfect performance. The same model, asked the same question three times, returned wildly different answers.
This isn't a problem with reasoning. It's a problem with infrastructure built for humans, not machines.
Why biological databases break AI agents
Biological data lives scattered across multiple databases, each with different formats, identifiers, and filtering logic. NCBI Virus alone coordinates information from GenBank, RefSeq, and international databases maintained across three continents. Much of the retrieval logic exists only in the web interface, where the only way to use it is to click through dashboards, the slowest route to the data.
A virologist might spend minutes finding all SARS-CoV-2 sequences from 2025 containing the surface glycoprotein. For an AI agent working programmatically, the same query requires stitching together multiple APIs, retrieving results page by page, reconciling identifiers across sources, and downloading hundreds of gigabytes to filter locally.
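The programmatic steps described above (page-by-page retrieval, then identifier reconciliation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any database's real API: `fetch_page`, the in-memory records, the `ACC…` accessions, and the field names are all hypothetical, and no network call is made.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Record = Dict[str, str]
Page = Tuple[List[Record], Optional[int]]

# Hypothetical stand-in for a remote source. ACC0001 appears twice,
# once per source, with different version suffixes.
_FAKE_SOURCE: List[Record] = [
    {"accession": "ACC0001.1", "source": "source_a"},
    {"accession": "ACC0002.1", "source": "source_a"},
    {"accession": "ACC0001.2", "source": "source_b"},
    {"accession": "ACC0003.1", "source": "source_a"},
]


def fetch_page(token: Optional[int], page_size: int = 2) -> Page:
    """Return one page of records plus the next page token (None when done)."""
    start = token or 0
    end = start + page_size
    next_token = end if end < len(_FAKE_SOURCE) else None
    return _FAKE_SOURCE[start:end], next_token


def iter_all(fetch: Callable[[Optional[int]], Page]) -> Iterator[Record]:
    """Follow page tokens until the source is exhausted, so nothing is cut off."""
    token: Optional[int] = None
    while True:
        page, token = fetch(token)
        yield from page
        if token is None:
            break


def split_accession(accession: str) -> Tuple[str, int]:
    """Split 'ACC0001.2' into ('ACC0001', 2); a missing version counts as 0."""
    base, _, version = accession.partition(".")
    return base, int(version) if version.isdigit() else 0


def reconcile(records: Iterable[Record]) -> List[Record]:
    """Keep the highest version of each base accession, in sorted order."""
    best: Dict[str, Record] = {}
    for rec in records:
        base, version = split_accession(rec["accession"])
        current = best.get(base)
        if current is None or version > split_accession(current["accession"])[1]:
            best[base] = rec
    return [best[base] for base in sorted(best)]


result = reconcile(iter_all(fetch_page))
print([rec["accession"] for rec in result])
```

Sorting the reconciled output means the same inputs always yield the same list, regardless of the order in which pages or sources arrive.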
Small retrieval errors have outsized consequences. In one test, an agent retrieved Ebolavirus sequences to build a phylogenetic tree, a standard analysis for understanding outbreak origins. One agent run returned 106 sequences (expected: 266), another returned 15, a third returned 5. The trees told different stories. The manually curated version placed the outbreak's origin in January 2014, matching prior research. One agent-retrieved dataset pushed the origin back to 1922. Another shifted it to April 2014. Same question. Same model. Three different conclusions about when the virus began circulating.
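One way to catch gaps like these before they reach a tree is to compare each retrieved set against a curated reference. The sketch below is illustrative only: the accessions are placeholders, and it assumes every retrieved record belongs to the expected set, which the reported counts alone do not establish. Only the counts (266 expected; 106, 15, and 5 retrieved) come from the test described above.

```python
from typing import Dict, Iterable


def completeness(expected: Iterable[str], retrieved: Iterable[str]) -> Dict[str, float]:
    """Summarize how much of a curated reference set a retrieval recovered."""
    expected_set, retrieved_set = set(expected), set(retrieved)
    found = expected_set & retrieved_set
    return {
        "expected": len(expected_set),
        "retrieved": len(retrieved_set),
        "missing": len(expected_set - retrieved_set),
        "unexpected": len(retrieved_set - expected_set),
        "recall": len(found) / len(expected_set) if expected_set else 1.0,
    }


# Placeholder accessions; only the set sizes mirror the reported runs.
expected = [f"ACC{i:04d}" for i in range(266)]
runs = {
    "run_1": expected[:106],
    "run_2": expected[:15],
    "run_3": expected[:5],
}

for name, got in runs.items():
    report = completeness(expected, got)
    print(f"{name}: retrieved={report['retrieved']} "
          f"missing={report['missing']} recall={report['recall']:.1%}")
```

Under that assumption, the three runs recover roughly 40%, 6%, and 2% of the expected sequences.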
The researchers tested this against real outbreak response. In May 2026, Bundibugyo virus caused an Ebola outbreak in the Democratic Republic of Congo with over 1,000 confirmed and suspected cases and more than 200 deaths. Public health officials needed answers: How different is this virus from past strains? Will existing diagnostics detect it? Will current treatments work? All three questions require comparing new genomes against historical sequences. Instead, researchers had to manually click through a web interface and hope the result was complete and correct.
The solution: deterministic retrieval layers
The researchers built gget virus, a deterministic tool that translates the messy web interface into a reliable, machine-readable interface. It coordinates across NCBI's underlying systems, handles large result sets that would otherwise cut off arbitrarily, retrieves metadata from separate databases when needed, and returns standardized outputs with detailed logs showing how results were produced.
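To make "deterministic" and "detailed logs" concrete, here is a minimal sketch of the general idea. It is not the gget virus API or implementation: the `retrieve` function, the record fields, and the log format are all hypothetical. Filters are applied in a fixed order, the output is sorted, and each step is logged with a digest of the result so that two runs can be compared.

```python
import hashlib
import json
from typing import Any, Dict, List


def retrieve(records: List[Dict[str, Any]], filters: Dict[str, Any]) -> Dict[str, Any]:
    """Apply exact-match filters in a fixed order and log every step."""
    log = [f"input records: {len(records)}"]
    kept = list(records)  # copy so the caller's list is left untouched
    for field in sorted(filters):  # fixed order, independent of dict ordering
        wanted = filters[field]
        kept = [rec for rec in kept if rec.get(field) == wanted]
        log.append(f"filter {field}={wanted!r}: {len(kept)} remaining")
    kept.sort(key=lambda rec: rec["accession"])  # stable, reproducible output order
    accessions = [rec["accession"] for rec in kept]
    digest = hashlib.sha256(json.dumps(accessions).encode("utf-8")).hexdigest()
    log.append(f"result digest: {digest[:12]}")
    return {
        "query": json.dumps(filters, sort_keys=True),  # canonical form of the query
        "records": kept,
        "digest": digest,
        "log": log,
    }


# Hypothetical records; field names are illustrative only.
records = [
    {"accession": "ACC0003", "host": "human", "year": 2025},
    {"accession": "ACC0001", "host": "human", "year": 2025},
    {"accession": "ACC0002", "host": "bat", "year": 2025},
    {"accession": "ACC0004", "host": "human", "year": 2024},
]

first = retrieve(records, {"host": "human", "year": 2025})
second = retrieve(list(reversed(records)), {"year": 2025, "host": "human"})

# Same query, different input and filter ordering: identical result.
assert first["digest"] == second["digest"]
for line in first["log"]:
    print(line)
```

The design choice worth noting is that the digest and log travel with the result, so a reader can check that two datasets are identical without rerunning the query.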
When agents were given access to gget virus, accuracy jumped above 90% for all models, peaking at 99.7%. Run-to-run variability disappeared. The performance gap between cheap and expensive models narrowed dramatically.
This changes what matters for scientific work. Reliable dataset construction no longer depends on access to the newest or most expensive model. Cheaper models paired with the right tool eliminate variability and enable wider access.
The broader lesson for biological computing
This problem isn't unique to virology. It appears wherever AI agents encounter systems built around human workflows: browser dashboards, implicit conventions, scattered identifiers, and one-off scripts.
AI agents have advanced much faster in software development than in biology. Software provides structured workflows, reliable APIs, version control, and testable outputs. Biology offers heterogeneous databases, context-dependent metadata, and few simple ways to verify correctness.
The bottleneck for biological agents isn't reasoning capacity. It's the absence of deterministic execution layers for querying data. An agent can understand the task (for example, find all human kinases with this domain and retrieve their structures) but still lack a dependable way to access the databases needed.
As model capabilities improve, some argue these tools will become unnecessary. Agents might eventually navigate messy portals, reconcile identifiers, and recover from failures on their own. But that doesn't mean the task should be reinvented each time. A model powerful enough to fight through a confusing workflow may still be too expensive, too slow, or too difficult to audit for routine scientific work.
Biological databases need to be designed with agents as scaled users. That means building context engines: reliable, agent-accessible infrastructure for data retrieval. It means exposing the same filtering semantics through APIs that appear in web interfaces. It means standardizing metadata fields and documenting conventions that expert humans know implicitly. It means thinking about how machines will query your data, not just how humans will browse it.
The details matter. In science, they often determine the conclusion.