DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation

1Axio.ai 2Lawrence Livermore National Laboratory

* These authors contributed equally to this work.

Abstract

Reliably detecting when a deployed machine learning model is likely to fail on a given input is crucial for ensuring safe operation. In this work, we propose DECIDER (Debiasing Classifiers to Identify Errors Reliably), a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image classification models. DECIDER utilizes LLMs to specify task-relevant core attributes, constructs a “debiased” version of the classifier by aligning its visual features to these core attributes using a VLM, and detects potential failure by measuring disagreement between the original and debiased models. In addition to proactively identifying samples on which the model would fail, DECIDER also provides human-interpretable explanations for failure through a novel attribute-ablation strategy. Through extensive experiments across diverse benchmarks spanning subpopulation shifts (spurious correlations, class imbalance) and covariate shifts (synthetic corruptions, domain shifts), DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient as well as failure and success recall.

Deep classifiers fail due to a variety of reasons

Proactively detecting instances where a classification model is likely to fail (i.e., predict incorrect labels) is essential for deploying models safely in real-world applications. For example, in an advanced driver-assistance system (ADAS), misidentifying a pedestrian as a road sign, or, in medical imaging, mistaking a tumor for a benign lesion, can lead to catastrophic outcomes. At a minimum, models should be able to accurately flag these high-risk samples.

Failures in vision models often stem from violations of data distribution assumptions made during training. Typically, data consists of both task-relevant core attributes and irrelevant nuisance attributes, neither of which are explicitly annotated. As a result, models may struggle to generalize if:

  • The training data contains spurious correlations with nuisance attributes that do not appear during testing.
  • The class-conditional distribution of nuisance attributes changes between the training and test data (e.g., patient race imbalance in clinical datasets).
  • Novel attributes emerge only at test time (e.g., style changes).

In the figure below, we illustrate different failure modes of vision models caused by the subpopulation and covariate shifts described above. In the first example, consider the task of identifying hair color (blonde or not blonde). If the training set has more blonde men than women, and the model learns to rely on the spuriously correlated gender attribute to predict hair color, it will fail at test time. Similarly, in the second example, class imbalance in the training set may cause the model to generalize poorly to the underrepresented class. Finally, in the third and fourth cases, the model may fail to generalize to the test set due to covariate shifts that range from image corruptions to domain shifts. Note that, when the class-conditional distributions of the core attributes themselves change between train and test data, we have the more challenging scenario of concept shift, which we do not consider in this work.

Figure: Examples of failure modes in vision models caused by subpopulation and covariate shifts.

What makes failure detection challenging?

Given the critical importance and inherent challenges of this problem, there has been a significant increase in research efforts aimed at developing robust methods for failure detection in machine learning models. The most prevalent approach involves leveraging the epistemic uncertainties of the model. Fundamentally, failure detection is formulated as identifying an appropriate metric or scoring function (some notion of epistemic uncertainty) that can effectively distinguish between samples where the model is likely to fail and those where it is likely to succeed. Popular scoring mechanisms include Maximum Softmax Probability, Predictive Entropy, and Energy. These methods aim to utilize the uncertainty in the model's predictions, with the assumption that higher uncertainty indicates a higher likelihood of failure. Other recently proposed methods measure the disagreement between constituent members of an ensemble or rely on local manifold smoothness.
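
To make these baselines concrete, here is a minimal sketch (our own illustration, not code from the paper) of how the three confidence-based scores are typically computed from a classifier's logits:

```python
import torch
import torch.nn.functional as F

def failure_scores(logits: torch.Tensor):
    """Compute common confidence-based failure scores from classifier logits."""
    probs = F.softmax(logits, dim=-1)
    # Maximum Softmax Probability: lower values suggest a likely failure
    msp = probs.max(dim=-1).values
    # Predictive entropy: higher values suggest a likely failure
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    # Energy score (temperature 1): higher values suggest a likely failure
    energy = -torch.logsumexp(logits, dim=-1)
    return msp, entropy, energy

# Example usage on a batch of 4 samples over 10 classes
msp, entropy, energy = failure_scores(torch.randn(4, 10))
```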

However, the inherent assumption underpinning these methods is that the classifier is well calibrated, which may not always hold in practice. We will also demonstrate through experiments that these methods are not sufficient to accurately detect failures that arise due to the more subtle but important shifts illustrated above. Furthermore, a cornerstone requirement for any failure detector is that it should also be able to explain the reasons for failure, which the methods mentioned above cannot easily achieve.

Lastly, and more importantly, failure detection as it is currently posed is fundamentally limited: it is both difficult and inefficient to describe the nuisance-attribute discrepancies mentioned in the previous section using visual features alone.

Our Approach

To address the aforementioned challenges, we propose a novel framework for failure detection that leverages the capabilities of Vision-Language Models (VLMs) and Large Language Models (LLMs). VLMs trained on large corpora of visual and textual data have demonstrated a remarkable ability to understand complex visual concepts and achieve strong zero-shot performance on a wide range of vision tasks. Throughout this work, we use OpenAI's CLIP, which produces image and text embeddings through its vision and text encoders and measures their similarity (cosine similarity) in a shared latent space.
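
As a concrete illustration of how CLIP scores image-text similarity, here is a minimal sketch using the openai/CLIP package; the image path and text prompts are placeholders for illustration only:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path and prompts, purely for illustration
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)

# Cosine similarity in the shared latent space
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
similarity = image_feat @ text_feat.T  # shape (1, 2)
```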

Our approach leverages the capabilities of VLMs to detect failures in vision models, and the use of VLMs and LLMs is necessitated by two main reasons:
  • Traditional vision classifiers, trained solely with a cross-entropy loss on coarse labels, are highly susceptible to biases from task-irrelevant attributes, which we believe is the primary cause of model failures. This challenge arises because these classifiers are tasked to map images to coarse labels that encompass multiple attributes. For example, the label "dog" may include features like a "wagging tail" and a "snout", while the label "cat" includes "whiskers" and "pointy ears". Without detailed attribute information and given the potential biases in the training data, these models often rely on overly simplistic decision rules. Importantly, "fixing" this issue through supervised learning is extremely difficult, because such a fix requires manual annotation of attributes for every sample in the dataset, which is infeasible.
  • A failure detector should also be able to provide human interpretable explanations for failure.
Having explained the two main reasons that necessitate the use of VLMs and LLMs, we now describe our approach. At the highest level, we propose to "debias" the deep classifier by aligning its visual features to the task-relevant core attributes, and then measure the disagreement between the original and debiased classifiers to detect potential failures. To this end, we utilize LLMs to specify task-relevant core attributes and use a vision-language model (VLM) to align the features of the classifier to the natural-language specification of these attributes. Finally, we propose a novel attribute-ablation strategy to provide human-interpretable explanations for failure.

DECIDER framework

Generation of Task-relevant core attributes

As mentioned earlier, our goal is to align the classifier's visual features with the core task-relevant attributes, thus effectively "debiasing" the model. To identify these attributes for each class without manual annotation, we leverage the priors of a large language model like GPT-3. By querying the LLM with prompts such as "List visually descriptive attributes of <CLASS>", where <CLASS> is the image's class label, we obtain a set of (natural language) attributes for each class.
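
A minimal sketch of this querying step is shown below. The paper uses GPT-3; here we use the OpenAI chat API with a chat model purely for illustration, and the prompt wording and parsing details are our own assumptions:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_core_attributes(class_name: str, n: int = 10) -> list[str]:
    """Ask an LLM for visually descriptive attributes of a class (illustrative prompt)."""
    prompt = f"List {n} visually descriptive attributes of {class_name}. One per line."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the GPT-3 model used in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-•0123456789. ").strip() for line in lines if line.strip()]

# One attribute set per class, e.g., {"cat": ["thin whiskers", "pointy ears", ...], ...}
attributes = {c: generate_core_attributes(c) for c in ["cat", "dog"]}
```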

Training a new model with task-specific attributes

Given the core attributes for each class, we now aim to reduce the biases learned by the original classifier by training a new model with these attributes. The challenge, however, is that these attributes are generated at the level of a class and not for individual samples. Thus, we propose to leverage the priors of a powerful vision-language model such as CLIP to associate the image features from the original classifier with the different attributes in the VLM latent space. To that end, we introduce a new model, called the Prior-Induced Model (PIM), that projects the image features from the original classifier into the VLM latent space. The architecture of PIM mirrors the original classifier, except that its final layer projects onto the VLM latent space. For example, when both the original classifier and PIM are based on the ResNet-50 architecture, the output from block 1 of the classifier serves as the input for block 2 in PIM.
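
A minimal PyTorch sketch of such a PIM is given below. It assumes a ResNet-50 backbone and a 512-dimensional CLIP latent space; the module names and dimensions are illustrative and not taken from the paper's code:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PIM(nn.Module):
    """Prior-Induced Model: mirrors the task classifier but projects to the VLM space."""
    def __init__(self, clip_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)
        # Re-use blocks 2-4 of a ResNet-50; the classification head is replaced
        self.blocks = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(2048, clip_dim)  # final layer projects onto the CLIP latent space

    def forward(self, block1_feats: torch.Tensor) -> torch.Tensor:
        # block1_feats: block-1 features from the (frozen) task classifier
        x = self.blocks(block1_feats)
        x = self.pool(x).flatten(1)
        return self.proj(x)

# For 224x224 inputs, ResNet-50 block-1 features have shape (N, 256, 56, 56)
pim = PIM()
z = pim(torch.randn(2, 256, 56, 56))  # -> (2, 512)
```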

We now describe how to train PIM. We begin by computing the similarity between the image features from PIM and the text encodings of each class-specific core attribute obtained through the (frozen) CLIP text encoder. We then aggregate these similarities per class, either by averaging them or by taking the maximum similarity score across the attributes of that class. This yields one scalar score per class, which we renormalize through a softmax and use to train PIM with a cross-entropy loss. This is a much richer training objective than the naive mapping of images to coarse labels, as it incorporates the additional information about the core attributes into the training process.
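
The following sketch illustrates this objective. The tensor names (attr_text_feats, attr_to_class) and the aggregation details are our own illustration of the procedure described above:

```python
import torch
import torch.nn.functional as F

def pim_class_scores(pim_feats, attr_text_feats, attr_to_class, num_classes, agg="mean"):
    """Aggregate image-attribute similarities into one score per class.

    pim_feats:       (N, D) features produced by PIM
    attr_text_feats: (A, D) frozen CLIP text embeddings of all core attributes
    attr_to_class:   (A,) tensor giving the class index of each attribute
    """
    pim_feats = F.normalize(pim_feats, dim=-1)
    attr_text_feats = F.normalize(attr_text_feats, dim=-1)
    sims = pim_feats @ attr_text_feats.T                      # (N, A) cosine similarities
    scores = []
    for c in range(num_classes):
        class_sims = sims[:, attr_to_class == c]              # similarities to class-c attributes
        scores.append(class_sims.mean(dim=-1) if agg == "mean"
                      else class_sims.max(dim=-1).values)
    return torch.stack(scores, dim=-1)                        # (N, num_classes)

# Training step: softmax-renormalized class scores trained with cross-entropy
# loss = F.cross_entropy(pim_class_scores(pim(x_block1), attr_text_feats,
#                                         attr_to_class, num_classes), labels)
```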

Failure detection using PIM

Once we have PIM, we detect potential failures of the original classifier by measuring the disagreement between PIM and the original classifier. This disagreement score is calculated as the cross-entropy between the two models' sample-level probability distributions. We first compute these scores on a held-out labeled validation set to determine the threshold that approximates the true accuracy on this dataset. At test time, we compute the score for each (unlabeled) sample and compare it to the threshold to determine if the model is likely to fail on that sample.
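
A sketch of the detection step is below. The cross-entropy disagreement score follows the description above, while the exact threshold-selection rule (matching the validation error rate via a quantile) is our assumption about one reasonable way to approximate the validation accuracy:

```python
import torch
import torch.nn.functional as F

def disagreement_score(task_logits, pim_logits):
    """Cross-entropy between the task model's and PIM's predictive distributions."""
    p_task = F.softmax(task_logits, dim=-1)
    log_p_pim = F.log_softmax(pim_logits, dim=-1)
    return -(p_task * log_p_pim).sum(dim=-1)        # one score per sample

def pick_threshold(val_scores, val_correct):
    """Pick the score quantile so the flagged fraction matches the validation error rate."""
    error_rate = 1.0 - val_correct.float().mean()
    return torch.quantile(val_scores, 1.0 - error_rate)

# At test time, flag a sample as a likely failure when its score exceeds the threshold:
# likely_failure = disagreement_score(task_logits, pim_logits) > threshold
```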

Explaining failures

Now that we have a debiased classifier that operates in the VLM latent space, we can also leverage it to provide human-interpretable explanations for failure. We do so by performing attribute ablations on the debiased classifier to identify the optimal subset of attributes necessary for aligning PIM's prediction probabilities with those of the original model. This allows us to elucidate the underlying reasons behind the discrepancies between the predictions of the original and debiased classifiers. Specifically, our ablation strategy involves iteratively adjusting the group of weights (uniformly initialized) corresponding to each attribute across all classes, such that the KL divergence between the probability distributions obtained by the task model and those produced by PIM is minimized. In the figure below, we illustrate the DECIDER framework, with the failure detector on the left and the explanation generator on the right.
DECIDER Architecture
Figure: Overview of the DECIDER framework for failure detection and explanation.
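
Below is a minimal sketch of this attribute-ablation step for a single misclassified sample, reusing the hypothetical attr_text_feats / attr_to_class tensors from the training sketch; the optimizer, step count, and learning rate are our own choices:

```python
import torch
import torch.nn.functional as F

def explain_failure(task_probs, pim_feats, attr_text_feats, attr_to_class,
                    num_classes, steps=200, lr=0.1):
    """Learn per-attribute weights so PIM's prediction matches the task model's.

    task_probs: (1, C) softmax probabilities of the original (task) model, detached.
    Attributes whose weights are driven down are the ones the task model under-used.
    """
    weights = torch.ones(attr_text_feats.shape[0], requires_grad=True)   # uniform init
    optimizer = torch.optim.Adam([weights], lr=lr)

    with torch.no_grad():
        pim_feats = F.normalize(pim_feats, dim=-1)
        attr_text_feats = F.normalize(attr_text_feats, dim=-1)
        sims = pim_feats @ attr_text_feats.T                             # (1, A)

    for _ in range(steps):
        weighted = sims * weights                                        # re-weight each attribute
        logits = torch.stack([weighted[:, attr_to_class == c].mean(dim=-1)
                              for c in range(num_classes)], dim=-1)
        # Minimize KL between the task model's distribution and PIM's re-weighted one
        loss = F.kl_div(F.log_softmax(logits, dim=-1), task_probs, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return weights.detach()   # low-weight attributes were overlooked by the task model
```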

How well does DECIDER work in practice?

Time for some experiments and results. 😊

We evaluated the failure detection capabilities of DECIDER when the original classifier could potentially fail due to a diverse set of test-time shifts, including subpopulation shifts (spurious correlations and class imbalance) and covariate shifts (synthetic corruptions and domain shifts). For the base model architecture, we considered ResNet-18, ResNet-50, and ViT-Base. We compared DECIDER against baseline methods such as maximum softmax probability, predictive entropy, energy-based scores, and ensemble disagreement. To quantify performance, we used Matthews Correlation Coefficient (MCC), Failure Recall (FR), and Success Recall (SR) as our evaluation metrics. Through our empirical analysis, we find that DECIDER consistently outperforms the baselines across all metrics on all the benchmarks. Please check our paper for the full results!

A quick look at one result
  • Effectiveness under covariate shifts:
DECIDER performance on covariate shifts
Figure: (a) Difference in MCC between DECIDER and the best baseline on the PACS dataset, involving covariate shifts across 4 different visual domains. (b) Improvement in failure recall of DECIDER over the best-performing baseline on large-scale covariate-shift benchmarks: DomainNet (DNet) and ImageNet-Sketch.
Note that, for (b), PIMs are trained on the DomainNet-Real and ImageNet training sets, respectively, and evaluated on the different distribution-shift datasets. We refer readers to the paper for more details and experiments.

DECIDER explains model failures

Using the procedure described earlier, we performed attribute ablations to generate explanations for model failures. In the figure below, we present a few examples. In the bottom-left example, the task is to correctly identify hair color (blonde or not). The original classifier incorrectly labels the image, while PIM accurately classifies it. We observe that our optimization process reduces the influence of attributes like "Browning Tresses" and "Red Highlights" on PIM's predictions to make the predictions more similar to the original model. This indicates that the biased original classifier may have overlooked these key attributes in its decision-making.

Similarly, in the top-right example, the original model misclassifies a cat as a dog. Our explanation shows that the classifier failed to focus on important core attributes like "Thin Whiskers," leading to the incorrect classification.

We believe these explanations are valuable for understanding model behavior in real-world scenarios. They can also help improve model reliability by guiding the selection of training samples that emphasize core attributes the original model overlooked.

DECIDER explanations for model failures
Figure: DECIDER explanations for model failures. The task model misclassifies the images, while PIM correctly identifies them. The optimization process in DECIDER reveals which core attributes were not sufficiently considered by the task model, leading to the misclassification.

Some important findings and analyses
  • What happens if the attributes generated by the LLM are biased or insufficient? The success of DECIDER relies on the quality of the attributes generated by the LLM. To study the impact of the quality of the text attributes on failure detection, we consider two practical scenarios:
      (i) GPT-3 generates irrelevant attributes: In this case, PIM risks learning noisy decision rules that even the original classifier might not have.

      (ii) GPT provides insufficient attributes: With only partial attributes, PIM's predictive performance can be limited.
    To comprehensively evaluate the impact of both scenarios, we employ the following protocol on the Waterbirds dataset:
      - For scenario (i), we add 5 randomly sampled core attributes from the other class to the attribute set of each class.

      - For case (ii), we remove 5 randomly selected attributes from the attribute set of each class.
    We train PIM under both of these scenarios. From the results in Table 1 of the paper, we find that although there is a noticeable drop in failure detection performance due to these severe attribute corruptions, DECIDER still outperforms the best baseline (Entropy), demonstrating the robustness of DECIDER to imperfect attribute sets.

  • What is the relationship between the accuracy of PIM and the performance of DECIDER? Our findings indicate that even in the rare instances where the debiased model (PIM) has slightly lower prediction accuracy, its ability to distinguish between core and nuisance attributes remains intact, which is vital for effective failure detection. As a result, DECIDER consistently outperforms baseline methods in terms of failure recall.

  • What happens if we replace PIM with CLIP classifiers? Given that we propose to leverage the priors from CLIP to obtain a debiased version of the classifier, it is natural to consider directly utilizing CLIP's zero-shot classifier as PIM. From Table 3 in the appendix of the paper, we observe that such an approach yields poor failure detection performance. This is because CLIP's visual features, and their correlations to the core attributes, can differ significantly from those of the original model, rendering the model-disagreement-based failure detection highly ineffective.

BibTeX

    @inproceedings{subramanyam2024decider,
      title={DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation},
      author={Rakshith Subramanyam and Kowshik Thopalli and Vivek Narayanaswamy and Jayaraman J. Thiagarajan},
      booktitle={European Conference on Computer Vision},
      year={2024},
      url={https://arxiv.org/pdf/2408.00331}
    }