Teaching AI to Explain Itself With Concepts It Already Knows

This method mines human-readable concepts from a vision model's own learned features and forces predictions through a small set of them. It beats standard CBMs on accuracy and keeps explanations on task.

Published on: Mar 10, 2026

Building Trust in Computer Vision: Clearer Concept Bottlenecks from the Model Itself

In safety-critical work like clinical diagnostics and autonomous systems, "why did the model say that?" isn't a nice-to-have - it's the whole conversation. A new approach strengthens concept bottleneck models by pulling concepts from the model's own internal representations, then translating them into plain language. The result: higher accuracy than standard CBMs and explanations that stay on task.

Why this matters

Predefined concepts (e.g., "clustered brown dots," "variegated pigmentation") can miss the mark or be too coarse for a specific dataset. Worse, models can still lean on hidden cues that the stated concepts don't capture - the information leakage problem. Extracting the concepts the model already relies on for the task narrows that gap and improves faithfulness.

The core idea

The method converts any pretrained vision model into a concept-driven one by mining its learned features and forcing decisions through a small, human-readable set.

  • A sparse autoencoder selects the most relevant internal features and compresses them into a handful of concepts.
  • A multimodal LLM describes each concept in plain language and auto-annotates the dataset with concept presence/absence.
  • A concept bottleneck module is trained on these annotations and inserted into the target model - predictions must flow through the learned concepts.
  • The system limits each prediction to five concepts to curb leakage and keep explanations clear.

What changed vs. standard CBMs

Typical CBMs rely on expert- or LLM-crafted concepts defined up front, which can be misaligned or incomplete. Here, concepts are discovered from the model's own features learned for the exact task, then labeled for humans. That alignment lifts accuracy and makes explanations more relevant to real images.

Results at a glance

  • On tasks like bird species identification and skin lesion analysis, this approach outperformed state-of-the-art CBMs on accuracy while giving tighter, more precise explanations.
  • Concepts were more applicable to the dataset, with fewer off-target descriptors.
  • There's still a tradeoff: top black-box models can be more accurate, but offer no trustworthy rationale.

"In a sense, we want to be able to read the minds of these computer vision models... Because our method uses better concepts, it can lead to higher accuracy and ultimately improve the accountability of black-box AI models," says lead researcher Antonio De Santis.

Andreas Hotho adds that this line of work "offers a path toward explanations that are more faithful to the model and opens many opportunities for follow-up work with structured knowledge."

How to apply this in your lab or program

  • Select a strong pretrained vision model for your domain (e.g., dermoscopy, pathology, aerial imagery).
  • Train a sparse autoencoder on internal features to extract a compact concept space.
  • Use a multimodal LLM to generate short, plain-language names and definitions for each concept; auto-annotate your dataset.
  • Train a concept bottleneck head on these annotations and constrain predictions to the top five concepts.
  • Evaluate concept accuracy, faithfulness (does the model actually use the reported concepts?), and leakage.
  • Run expert review: have clinicians or domain scientists validate concept names and thresholds before deployment.
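The bottleneck head and top-five constraint from the steps above can be sketched as follows. Again a hedged numpy toy: the concept names, weight matrix, and class count are hypothetical stand-ins for a trained, LLM-annotated system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 64 discovered concepts, 10 target classes.
n_concepts, n_classes, top_k = 64, 10, 5
concept_names = [f"concept_{i}" for i in range(n_concepts)]   # placeholder names
W_head = rng.normal(0, 0.1, (n_concepts, n_classes))          # stands in for a trained head

def predict_through_bottleneck(concept_scores, k=top_k):
    """Force the prediction through the k highest-scoring concepts only;
    all other activations are zeroed so they cannot leak into the decision."""
    top = np.argsort(concept_scores)[-k:]
    masked = np.zeros_like(concept_scores)
    masked[top] = concept_scores[top]
    logits = masked @ W_head
    explanation = [concept_names[i]
                   for i in sorted(top, key=lambda i: -concept_scores[i])]
    return int(np.argmax(logits)), explanation

scores = np.maximum(rng.normal(size=n_concepts), 0.0)  # one image's concept activations
label, used_concepts = predict_through_bottleneck(scores)
print(label, used_concepts)
```

The returned `explanation` list is exactly the rationale shown to a reviewer: five named concepts, in order of strength, that the prediction was computed from and nothing else.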

Where this helps first

  • Medical imaging: triage and assistive reads where a short, concept-based rationale reduces review time and flags suspicious cues.
  • Field research and ecology: species identification with consistent, interpretable attributes instead of opaque logits.
  • Safety monitoring: industrial inspection and autonomous perception modules that must justify alerts.

Limitations and next steps

  • Information leakage remains a risk. The team suggests exploring multiple bottlenecks to block unwanted cues.
  • Scaling concept quality likely requires larger multimodal LLMs and bigger labeled sets.
  • Black-box models can still edge out accuracy; the goal here is credible, auditable predictions. Pair with post-deployment monitoring.
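One concrete way to probe the leakage risk above is a deletion test: if the model truly relies on the concepts it reports, removing them should change its predictions. The sketch below is a hypothetical numpy probe on a toy linear head, not a procedure from the paper; the flip rate it prints is only meaningful relative to a baseline of deleting random concepts.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: 64 concepts, 10 classes, model reports its top-5 concepts.
n_concepts, n_classes, k = 64, 10, 5
W = rng.normal(0, 0.3, (n_concepts, n_classes))  # hypothetical trained head

def flip_rate(batch):
    """Fraction of inputs whose prediction changes when the
    reported (top-k) concepts are deleted from the input."""
    flips = 0
    for scores in batch:
        top = np.argsort(scores)[-k:]          # concepts the model "reports"
        pred = np.argmax(scores @ W)
        ablated = scores.copy()
        ablated[top] = 0.0                     # delete the reported concepts
        flips += int(np.argmax(ablated @ W) != pred)
    return flips / len(batch)

batch = np.maximum(rng.normal(size=(200, n_concepts)), 0.0)
print(f"flip rate after deleting reported concepts: {flip_rate(batch):.2f}")
```

A flip rate near zero would suggest the reported concepts are not actually driving the decision, i.e., information is leaking through the remaining activations.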

Bottom line: if you need decisions you can defend, use the model's own learned structure to define the concepts it must stand on - and keep that set small, named, and reviewed by experts.

