Improving AI models’ ability to explain their predictions | MIT News

In high-stakes settings such as medical diagnostics, users often want to know what prompted a computer vision model to make a particular prediction, so they can decide whether or not to trust its output.

Concept bottleneck models are one way an AI system can explain its decision-making process. These methods force a deep learning model to make its predictions through a set of concepts that humans can understand. In new research, computer scientists at MIT have developed a method that helps such a model achieve better accuracy while producing clearer, more concise explanations.

The concepts used by the model are usually pre-defined by human experts. For example, a doctor may suggest using concepts such as “clustered brown dots” and “variegated pigmentation” to predict that a medical image shows skin cancer.

But predefined concepts may be irrelevant or lack sufficient detail for a specific task, reducing the accuracy of the model. The new method extracts concepts that the model has already learned while being trained to perform that specific task, and forces the model to use those concepts, resulting in better explanations than standard conceptual bottleneck models.

This approach uses a pair of specialized machine-learning models that automatically extract knowledge from the target model and translate it into plain-language concepts. Ultimately, the technique can turn any pre-trained computer vision model into one that uses concepts to explain its reasoning.

“In a sense, we want to be able to read the thoughts of these computer vision models,” says lead author Antonio De Santis, a graduate student at Politecnico di Milano who completed this research while a visiting graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. A concept bottleneck model is one way for users to see what the model is thinking and why it made a particular prediction. “Because our method uses better concepts, it can lead to higher accuracy and, ultimately, improved accountability of black-box AI models.”

He is joined on the paper by Shrasing Tong SM ’20, PhD ’26; Marco Brambilla, professor of computer science and engineering at Politecnico di Milano; and senior author Lalana Kagal, a principal research scientist at CSAIL. The research will be presented at the International Conference on Learning Representations.

Build a better bottleneck

Concept bottleneck models (CBMs) are a popular approach to improving the explainability of AI. These techniques add an intermediate step, forcing the computer vision model to first predict the concepts present in an image and then use those concepts to make its final prediction.

This intermediate step, or “bottleneck,” helps users understand the logic of the model.

For example, a model that identifies bird species might surface concepts such as “yellow legs” and “blue wings” before predicting that the image shows a swallow.
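The two-stage prediction described above can be sketched in a few lines. This is a toy illustration, not the researchers' implementation: the feature dimension, concept names, and random weights are all made up, and a real CBM would learn both linear maps from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 512-dim image features, 4 concepts, 3 bird species.
N_FEATURES, N_CONCEPTS, N_CLASSES = 512, 4, 3
CONCEPT_NAMES = ["yellow legs", "blue wings", "forked tail", "short beak"]

# Two linear maps stand in for the trained layers of a CBM.
W_concept = rng.normal(size=(N_FEATURES, N_CONCEPTS)) * 0.01
W_label = rng.normal(size=(N_CONCEPTS, N_CLASSES))

def predict(features: np.ndarray) -> tuple[dict, int]:
    """Predict concept scores first, then a label from the concepts alone."""
    concept_scores = 1.0 / (1.0 + np.exp(-(features @ W_concept)))  # sigmoid
    # The "bottleneck": only the concept scores reach the final classifier,
    # so every prediction comes with a human-readable explanation.
    class_logits = concept_scores @ W_label
    explanation = dict(zip(CONCEPT_NAMES, concept_scores.round(2)))
    return explanation, int(np.argmax(class_logits))

explanation, label = predict(rng.normal(size=N_FEATURES))
print(explanation, label)
```

The key design point is that the final classifier sees only the four concept scores, never the 512 raw features, which is what makes the intermediate step a genuine bottleneck.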

But since these concepts are often pre-generated by humans or large language models (LLMs), they may not be suitable for the specific task. In addition, even if given a pre-defined set of concepts, the model sometimes uses undesired learned information anyway, a problem known as information leakage.

“These models are trained to maximize performance, so the model may be secretly using concepts that we’re not aware of,” De Santis explains.

The MIT researchers had a different idea: Since the model was trained on a massive amount of data, it likely already learned the concepts needed to make accurate predictions for the specific task at hand. They sought to build CBMs by extracting this existing knowledge and translating it into text that humans can understand.

In the first step of their method, a specialized deep learning model called a sparse autoencoder takes the most relevant features the target model has learned and reconstructs them into a set of concepts. Next, a multimodal LLM describes each concept in plain language.
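The extraction step can be illustrated with a minimal sparse autoencoder forward pass. All sizes, weights, and the activation threshold here are hypothetical; in practice both matrices are trained so that a small number of active units in an overcomplete dictionary reconstruct the target model's activations, and each active unit becomes a candidate concept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: 64-dim activations from the target model, an overcomplete
# dictionary of 256 candidate concept directions (values are illustrative).
D_MODEL, D_DICT = 64, 256

x = rng.normal(size=D_MODEL)                     # one activation vector
W_enc = rng.normal(size=(D_MODEL, D_DICT)) * 0.1
W_dec = W_enc.T.copy()                           # tied decoder, for simplicity

# ReLU with a positive threshold: only a fraction of units fire,
# so each input is explained by a small set of dictionary directions.
z = np.maximum(x @ W_enc - 0.5, 0.0)
x_hat = z @ W_dec                                # reconstruction from few active units

sparsity = (z > 0).mean()
recon_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"active units: {sparsity:.1%}, relative reconstruction error: {recon_err:.2f}")
```

With random untrained weights the reconstruction is poor; the point of the sketch is only the shape of the computation: a sparse code `z` whose few active entries can be handed to an LLM for naming.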

This multimodal LLM also annotates the images in the dataset by identifying which concepts are present and absent in each image. The researchers use this annotated dataset to train a concept bottleneck module to recognize concepts.

They integrated this module into the target model, forcing it to make predictions using only the set of learned concepts that the researchers had extracted.
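A rough sketch of how a concept bottleneck module might be trained on such an annotated dataset, with synthetic features and labels standing in for real model activations and LLM annotations: one logistic classifier per concept, trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic annotated dataset (all values made up): 200 images with 64-dim
# features from the frozen target model, plus binary labels saying which
# of 4 concepts appear in each image (the role the LLM annotations play).
N, D, C = 200, 64, 4
features = rng.normal(size=(N, D))
hidden_W = rng.normal(size=(D, C))                       # unknown "ground truth"
concept_labels = (features @ hidden_W > 0).astype(float)  # stand-in annotations

# Train the concept bottleneck module: one sigmoid classifier per concept.
W = np.zeros((D, C))
for _ in range(300):
    probs = 1.0 / (1.0 + np.exp(-(features @ W)))
    W -= 0.1 * features.T @ (probs - concept_labels) / N  # logistic-loss gradient

accuracy = (((features @ W) > 0) == (concept_labels > 0.5)).mean()
print(f"concept prediction accuracy: {accuracy:.2f}")
```

Once trained, a module like this replaces the target model's original classification head, so predictions must flow through the recognized concepts.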

Control concepts

They overcame many challenges as they developed this method, from ensuring that concepts were correctly annotated in the LLM to determining whether the sparse autoencoder identified concepts that a human could understand.

To prevent the model from using unknown or unwanted concepts, they restricted it to using only five concepts per prediction. This also forces the model to select the most relevant concepts and makes the explanations more understandable.
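The five-concept cap amounts to a top-k selection over concept activations. A minimal sketch, with hypothetical concept names and scores (only the cap of five comes from the article):

```python
import numpy as np

# Hypothetical concept activations for one skin-lesion image.
concepts = {
    "clustered brown dots": 0.91, "variegated pigmentation": 0.85,
    "irregular border": 0.78, "blue-white veil": 0.40, "hair occlusion": 0.35,
    "ruler artifact": 0.33, "smooth texture": 0.12, "uniform color": 0.05,
}

K = 5  # at most five concepts may contribute to a prediction

def top_k_mask(scores: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k strongest concept activations; zero out the rest."""
    keep = sorted(scores, key=scores.get, reverse=True)[:k]
    return {name: (scores[name] if name in keep else 0.0) for name in scores}

masked = top_k_mask(concepts, K)
print([name for name, score in masked.items() if score > 0])
```

Because only the five surviving activations feed the final classifier, weakly activated or spurious concepts cannot quietly influence the prediction, and the explanation stays short enough to read.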

When they compared their approach with state-of-the-art CBMs on tasks such as predicting bird species and identifying skin lesions in medical images, their method achieved the highest accuracy while providing more accurate interpretations.

Their approach also produced concepts that were more relevant to the images in the dataset.

“We have shown that extracting concepts from the original model can outperform other CBMs, but there is still a trade-off between interpretability and accuracy that needs to be addressed. Uninterpretable black-box models still outperform our models,” says De Santis.

In the future, the researchers would like to study potential solutions to the information leakage problem, perhaps by adding additional concept bottleneck modules so that unwanted concepts cannot leak through. They also plan to scale up their method by using a larger multimodal LLM to annotate a larger training dataset, which could potentially boost performance.

“I am excited about this work because it pushes explainable AI in a very promising direction and creates a natural bridge to symbolic AI and knowledge graphs,” says Andreas Hotho, professor and Chair of Data Science at the University of Würzburg, who was not involved in this work. “By extracting concept bottlenecks from the internal mechanisms of the model rather than solely human-defined concepts, it provides a path toward more faithful interpretations of the model and opens many opportunities for pursuing work with structured knowledge.”

This research was supported by a Progetto Rocca Doctoral Fellowship, the Italian Ministry of Universities and Research under the National Recovery and Resilience Plan, Thales Alenia Space, and the European Union under the NextGenerationEU project.
