LLM spots real cosmic events with just 15 examples, and explains why
A study published in Nature Astronomy shows that a general-purpose large language model can distinguish genuine night-sky events from imaging artefacts with around 93% accuracy using minimal guidance. The team from the University of Oxford, Google Cloud, and Radboud University adapted Google's Gemini with 15 example image triplets per survey and concise instructions. The model judged thousands of alerts, explained its reasoning in plain English, and assigned a follow-up priority score.
This approach targets a growing bottleneck in astronomy: modern surveys generate millions of alerts per night, but most are bogus. Traditional classifiers often act as black boxes, offering little justification for their labels. Here, the LLM provides both a decision and a transparent rationale, reducing the need for blind trust and cutting down on manual triage.
How the system sees a transient
Each candidate comes as an image triplet: New (latest frame), Reference (earlier or stacked template), and Difference (New minus Reference). From this, Gemini outputs three things (a minimal sketch of the output structure follows the list):
- A real/bogus classification (astrophysical source vs artefact)
- A concise text explanation describing salient image features and reasoning
- An "interest" score to help scientists prioritise follow-up
With only a handful of labelled examples from ATLAS, MeerLICHT, and Pan-STARRS, the model learned useful visual-text patterns without any retraining. It handled diverse phenomena such as supernovae, tidal disruption events, fast-moving asteroids, and short stellar flares.
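As a concrete illustration of this few-shot setup, the sketch below assembles a prompt from a couple of labelled example triplets plus one new candidate, then calls Gemini through the google-generativeai Python SDK. The model name, instruction wording, and file layout are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # assumption: key supplied by the user
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model name

INSTRUCTIONS = (
    "You are vetting transient alerts. Each candidate is a New/Reference/Difference "
    "image triplet. Reply with JSON containing: label ('real' or 'bogus'), "
    "a one-sentence explanation, and an interest score from 0 to 1."
)

def load_triplet(path_stem: str) -> list[Image.Image]:
    """Load the three frames for one candidate (hypothetical file layout)."""
    return [Image.open(f"{path_stem}_{kind}.png")
            for kind in ("new", "reference", "difference")]

# A few labelled examples, e.g. drawn from ATLAS (filenames hypothetical).
examples = [("atlas_0001", "real"), ("atlas_0002", "bogus")]

contents: list = [INSTRUCTIONS]
for stem, label in examples:
    contents += ["Example candidate:"] + load_triplet(stem) + [f"Correct label: {label}"]
contents += ["Now classify this candidate:"] + load_triplet("atlas_candidate_0042")

reply = model.generate_content(contents)  # mixed text-and-image prompt
print(reply.text)                         # expect a JSON verdict for parse_assessment()
```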
Image context: The triplet workflow (New, Reference, Difference) isolates transient signals and anchors the explanation. Credit: Stoppa & Bulmus et al., Nature Astronomy (2025).
Why this matters for survey pipelines
Next-generation facilities such as the Vera C. Rubin Observatory will stream ~20 TB of data every 24 hours. Manual vetting at that scale is infeasible. A general model that is steerable with a few examples and clear instructions lowers the barrier to deployment and adaptation across instruments and science goals.
Researchers reported that a panel of 12 astronomers rated Gemini's written explanations as coherent and useful. That human feedback turns the model into a collaborator rather than a black box, improving trust and auditability.
Human in the loop, plus self-assessment
The team added a simple self-check: Gemini scores the coherence of its own explanations. Low-coherence cases were much more likely to be wrong, making them easy to route to human review. When the team used this signal to refine the initial examples, performance on one dataset improved from ~93.4% to ~96.7%.
This creates a practical triage workflow: the model handles the easy, clear cases; it flags uncertain ones for expert attention. Scientists spend less time on routine filtering and more time on candidates that may be scientifically interesting.
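In code, this triage can be as simple as a threshold on the self-assessed coherence score, with everything below the cut sent to a human queue. The sketch below assumes a 0-to-1 coherence scale and a 0.7 threshold; both are illustrative choices rather than values from the paper.

```python
from dataclasses import dataclass

@dataclass
class ScoredAssessment:
    candidate_id: str
    label: str          # "real" or "bogus"
    explanation: str
    coherence: float    # model's self-rated explanation coherence, assumed 0-1

def route(assessments: list[ScoredAssessment],
          threshold: float = 0.7) -> tuple[list, list]:
    """Split alerts into auto-accepted verdicts and a human-review queue.
    Low-coherence explanations were the most error-prone in the study,
    so they go to the experts."""
    auto, review = [], []
    for a in assessments:
        (auto if a.coherence >= threshold else review).append(a)
    return auto, review
```

In practice the threshold would be tuned on a validation set so the review queue stays small while still catching most of the likely errors.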
What's next: from classifier to scientific assistant
The authors see a path to AI agents that can integrate images with time-series photometry, check confidence, request follow-up from robotic telescopes, and escalate rare events to humans. Because the setup uses few examples and plain instructions, teams can adapt it to new instruments quickly.
For labs and observatories, the takeaway is straightforward: you can prototype useful classifiers without deep ML expertise, provided you supply well-chosen examples, explicit prompts, and a human-in-the-loop review step.
Practical steps to try in your lab
- Assemble 10-20 high-quality, labelled triplets per instrument (balanced real/bogus, with brief expert notes).
- Write explicit instructions: what constitutes real vs artefact, typical failure modes, and how to score priority.
- Require a short explanation for each decision; log and review these for model drift and edge cases.
- Use a self-coherence or confidence score to route uncertain cases to human review.
- Iterate examples using misclassified cases; re-test on held-out alerts before deployment (see the sketch after this list).
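To make the last step concrete, here is a minimal evaluation loop over a held-out set of labelled alerts: it reports accuracy and collects misclassified candidates to seed the next round of example curation. The classify() callable stands in for whichever prompt-and-parse pipeline you build; its name and the data format are assumptions.

```python
from typing import Callable, Iterable

def evaluate(classify: Callable[[str], str],
             held_out: Iterable[tuple[str, str]]) -> tuple[float, list[str]]:
    """Run the classifier over (candidate_id, true_label) pairs.
    Returns overall accuracy and the ids of misclassified candidates,
    which become raw material for the next iteration of few-shot examples."""
    correct, total = 0, 0
    misclassified: list[str] = []
    for candidate_id, true_label in held_out:
        total += 1
        if classify(candidate_id) == true_label:
            correct += 1
        else:
            misclassified.append(candidate_id)
    accuracy = correct / total if total else 0.0
    return accuracy, misclassified
```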
Key numbers
- Training effort: 15 examples per survey (ATLAS, MeerLICHT, Pan-STARRS) plus concise instructions.
- Baseline accuracy: ~93% on real/bogus classification.
- With human-guided refinement: up to ~96.7% on one dataset.
- Expert review: 12 astronomers rated model explanations coherent and useful.
Context and links
Paper: "Textual interpretation of transient image classifications from large language models," Nature Astronomy (8 Oct 2025). DOI: 10.1038/s41550-025-02670-z.
Background on next-gen sky surveys: Vera C. Rubin Observatory.
For teams building similar workflows
If you're standing up LLM-assisted review pipelines in research settings, curated training can speed up prompt design, evaluation, and governance. See resources here: Latest AI courses.