AI learns from 15 examples to spot real cosmic events at 93% accuracy, with plain-English explanations

With just 15 examples, an LLM spots real night-sky events at ~93% accuracy, explains its calls in plain English, and scores follow-up priority. A self-check flags uncertain cases; using that signal to refine the examples lifted accuracy to ~96.7% on one dataset.

Categorized in: AI News, Science and Research
Published on: Oct 09, 2025

LLM spots real cosmic events with just 15 examples, and explains why

A study published in Nature Astronomy shows that a general-purpose large language model can classify genuine night-sky events from artefacts with around 93% accuracy using minimal guidance. The team from the University of Oxford, Google Cloud, and Radboud University adapted Google's Gemini with 15 example image triplets per survey and concise instructions. The model judged thousands of alerts, explained its reasoning in plain English, and assigned a follow-up priority score.

This approach targets a growing bottleneck in astronomy: modern surveys generate millions of alerts per night, but most are bogus. Traditional classifiers often act as black boxes, offering little justification for their labels. Here, the LLM provides both decisions and a transparent rationale, reducing blind trust and manual triage.

How the system sees a transient

Each candidate comes as an image triplet: New (latest frame), Reference (earlier or stacked template), and Difference (New minus Reference). From this, Gemini outputs three things:

  • A real/bogus classification (astrophysical source vs artefact)
  • A concise text explanation describing salient image features and reasoning
  • An "interest" score to help scientists prioritise follow-up

With only a handful of labeled examples from ATLAS, MeerLICHT, and Pan-STARRS, the model learned useful visual-text patterns without complex retraining. It handled diverse phenomena such as supernovae, tidal disruption events, fast-moving asteroids, and short stellar flares.
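
To make the triplet step concrete, here is a minimal Python sketch, assuming FITS cutouts on a common pixel grid and a hypothetical `ask_llm` helper standing in for whichever LLM client you use; the authors' actual pipeline, prompts, and difference imaging are more sophisticated.

```python
# Minimal sketch of the triplet workflow, not the authors' code.
# Assumes FITS cutouts on a common pixel grid; `ask_llm` is hypothetical.
import numpy as np
from astropy.io import fits

def load_cutout(path: str) -> np.ndarray:
    """Read a small image cutout from a FITS file as float pixels."""
    with fits.open(path) as hdul:
        return hdul[0].data.astype(float)

def make_triplet(new_path: str, ref_path: str):
    """Return (New, Reference, Difference); plain subtraction stands in
    for the PSF-matched differencing a real survey pipeline would use."""
    new = load_cutout(new_path)
    ref = load_cutout(ref_path)
    return new, ref, new - ref

PROMPT = (
    "You are vetting transient alerts. Given New, Reference and Difference cutouts, "
    "answer with: (1) real or bogus, (2) a short explanation citing image features, "
    "(3) an interest score from 0 to 10 for follow-up priority."
)

# triplet = make_triplet("new.fits", "ref.fits")
# label, explanation, score = ask_llm(PROMPT, few_shot_examples, triplet)  # hypothetical call
```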

Image context: The triplet workflow (New, Reference, Difference) isolates transient signals and anchors the explanation. Credit: Stoppa & Bulmus et al., Nature Astronomy (2025).

Why this matters for survey pipelines

Next-generation facilities such as the Vera C. Rubin Observatory will stream ~20 TB of data every 24 hours. Manual vetting at that scale is infeasible. A general model that is steerable with a few examples and clear instructions lowers the barrier to deployment and adaptation across instruments and science goals.
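
A rough back-of-envelope calculation, using assumed numbers rather than figures from the paper, shows why manual vetting does not scale:

```python
# Illustrative arithmetic only; the alert count and vetting time are assumptions.
alerts_per_night = 10_000_000   # assumed; Rubin-era surveys quote "millions" per night
seconds_per_alert = 5           # assumed expert vetting time per alert
person_hours = alerts_per_night * seconds_per_alert / 3600
print(f"{person_hours:,.0f} person-hours of vetting per night")  # ~13,889 person-hours
```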

Researchers reported that Gemini's written explanations were rated coherent and useful by a panel of 12 astronomers. That feedback loop turns the model into a collaborator instead of a black box, improving trust and auditability.

Human in the loop, plus self-assessment

The team added a simple self-check: Gemini scores the coherence of its own explanations. Low-coherence cases were much more likely to be wrong, making them easy to route to human review. Using this signal to refine the initial examples, performance on one dataset improved from ~93.4% to ~96.7%.

This creates a practical triage workflow: the model handles the easy, clear cases; it flags uncertain ones for expert attention. Scientists spend less time on routine filtering and more time on candidates that may be scientifically interesting.
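
A minimal sketch of that routing step, with an assumed Verdict structure and threshold rather than the authors' implementation, might look like this:

```python
# Sketch of coherence-based triage; field names and the 0.7 threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str        # "real" or "bogus"
    explanation: str  # plain-English rationale from the model
    coherence: float  # model's self-rated coherence of its explanation, 0-1

def triage(verdicts: list[Verdict], threshold: float = 0.7):
    """Split alerts into auto-accepted calls and cases flagged for expert review."""
    auto, review = [], []
    for v in verdicts:
        (auto if v.coherence >= threshold else review).append(v)
    return auto, review
```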

What's next: from classifier to scientific assistant

The authors see a path to AI agents that can integrate images with time-series photometry, check their own confidence, request follow-up observations from robotic telescopes, and escalate rare events to humans. Because the setup uses few examples and plain instructions, teams can adapt it to new instruments quickly.

For labs and observatories, the takeaway is straightforward: you can prototype useful classifiers without deep ML expertise, provided you supply well-chosen examples, explicit prompts, and a human-in-the-loop review step.

Practical steps to try in your lab

  • Assemble 10-20 high-quality, labeled triplets per instrument (balanced real/bogus, with brief expert notes).
  • Write explicit instructions: what constitutes real vs artefact, typical failure modes, and how to score priority.
  • Require a short explanation for each decision; log and review these for model drift and edge cases.
  • Use a self-coherence or confidence score to route uncertain cases to human review.
  • Iterate examples using misclassified cases; re-test on held-out alerts before deployment (a sketch of this evaluation step follows the list).
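
A minimal sketch of that evaluation step, assuming a `classify` callable that wraps your prompted model and a list of labeled held-out triplets:

```python
# Evaluation sketch; `classify` and the data shapes are assumptions, not the paper's code.
def evaluate(classify, heldout):
    """heldout: list of (triplet, true_label) pairs; returns accuracy and the misses."""
    misses, correct = [], 0
    for triplet, true_label in heldout:
        pred = classify(triplet)  # e.g. returns "real" or "bogus"
        if pred == true_label:
            correct += 1
        else:
            misses.append((triplet, true_label, pred))
    return correct / max(len(heldout), 1), misses

# Misclassified cases (`misses`) are natural candidates to fold back into the
# 10-20 few-shot examples before the next round of held-out testing.
```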

Key numbers

  • Training effort: 15 examples per survey (ATLAS, MeerLICHT, Pan-STARRS) plus concise instructions.
  • Baseline accuracy: ~93% across real/bogus classification.
  • With human-guided refinement: up to ~96.7% on one dataset.
  • Expert review: 12 astronomers rated model explanations coherent and useful.

Context and links

Paper: "Textual interpretation of transient image classifications from large language models," Nature Astronomy (8 Oct 2025). DOI: 10.1038/s41550-025-02670-z.

Background on next-gen sky surveys: Vera C. Rubin Observatory.

For teams building similar workflows

If you're standing up LLM-assisted review pipelines in research settings, curated training can speed up prompt design, evaluation, and governance. See resources here: Latest AI courses.