
arXiv: 2605.29836 · v1 · pith:H75ASWXDnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · stat.ML

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

Pith reviewed 2026-06-29 08:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords error slice discovery · concept bottleneck models · interpretable machine learning · systematic errors · bias detection · model debugging · semantic concepts

The pith

CB-SLICE discovers error slices by grouping samples with shared concept mispredictions in Concept Bottleneck Models, and reports outperforming prior slice discovery methods at uncovering known biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CB-SLICE, a method that uses Concept Bottleneck Models to identify error slices based on shared concept prediction failures. It claims this approach provides explanations directly connected to the model's inference process, unlike existing methods that only approximate the error source. A reader would care because accurate identification of these slices helps in debugging models and addressing biases in their performance on specific groups. The method identifies the keyword concepts most responsible for each slice's failures. Experiments across benchmarks show it uncovers known biases more effectively while offering richer explanations.

Core claim

CB-SLICE groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode, yielding richer and more faithful explanations of model errors than state-of-the-art slice discovery methods.

What carries the argument

Grouping samples by shared concept mispredictions in Concept Bottleneck Models to link error slices directly to the source of inference failures.
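
As a rough illustration of the grouping idea, the sketch below buckets validation samples by the exact set of concepts the CBM got wrong. This is a deliberate simplification: per the Figure 1 caption, CB-SLICE itself clusters concept logits with a GMM, which is not reproduced here. The function name, the concept names, and the toy arrays are all hypothetical.

```python
from collections import defaultdict


def group_by_concept_failures(concept_true, concept_pred, concept_names):
    """Bucket sample indices by the set of mispredicted concepts.

    Simplified stand-in for slice formation: samples sharing the same
    failure signature land in the same bucket. Not the paper's GMM step.
    """
    slices = defaultdict(list)
    for idx, (truth, pred) in enumerate(zip(concept_true, concept_pred)):
        failed = frozenset(
            name for name, t, p in zip(concept_names, truth, pred) if t != p
        )
        if failed:  # samples with no concept error join no slice
            slices[failed].append(idx)
    return dict(slices)


# Hypothetical toy data: 4 samples, 3 binary concepts.
concept_names = ["water_background", "webbed_feet", "long_beak"]
concept_true = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
concept_pred = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 0]]

slices = group_by_concept_failures(concept_true, concept_pred, concept_names)
for failed, members in slices.items():
    print(sorted(failed), members)
```

In this toy setting the failure signature doubles as the slice's keyword concepts; how CB-SLICE scores keyword concepts within a GMM-derived slice is not specified in the text available here.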

Load-bearing premise

Downstream task failures in CBMs commonly arise from concept mispredictions.
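
One way to probe this premise on a given CBM is to measure how many task errors co-occur with at least one concept error. The sketch below is a minimal check under assumed inputs (binary concept vectors and integer labels); the function name and toy data are hypothetical and it is not taken from the paper.

```python
def concept_error_share(concept_true, concept_pred, label_true, label_pred):
    """Fraction of task errors that coincide with >= 1 concept misprediction.

    Returns None when the model makes no task errors at all.
    """
    task_errors = 0
    with_concept_error = 0
    for c_t, c_p, y_t, y_p in zip(concept_true, concept_pred, label_true, label_pred):
        if y_t == y_p:
            continue  # only misclassified samples matter for the premise
        task_errors += 1
        if any(t != p for t, p in zip(c_t, c_p)):
            with_concept_error += 1
    if task_errors == 0:
        return None
    return with_concept_error / task_errors


# Hypothetical toy data: samples 0, 2 and 3 are task errors;
# sample 2 is misclassified even though all its concepts are correct.
concept_true = [[0, 1], [1, 1], [1, 0], [0, 0]]
concept_pred = [[1, 1], [1, 1], [1, 0], [0, 1]]
label_true = [0, 1, 1, 0]
label_pred = [1, 1, 0, 1]

share = concept_error_share(concept_true, concept_pred, label_true, label_pred)
print(f"share of task errors with a concept error: {share:.2f}")
```

A low share would suggest that many failures originate in the concept-to-task mapping rather than in the concept predictions, which is the scenario the referee report raises.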

What would settle it

Finding a dataset where CB-SLICE does not identify known biases that other SDMs successfully detect, or where its concept-based explanations do not align with observed model behavior.

Figures

Figures reproduced from arXiv: 2605.29836 by Elaf Almahmoud, Mateja Jamnik, Mateo Espinosa Zarlenga, Yael Konforti.

Figure 1: Given Ψval and Mθ(g, f), CB-SLICE discovers and explains systematic failure modes in three steps. (1) Error-prone concept filtering: the concept encoder g produces concept predictions Ĉ^(Ψval) from which error-prone concepts are selected to form Ĉ_err^(Ψval). (2) Error slice formation: the corresponding concept logits H_err^(Ψval) are clustered via a GMM to obtain error slices S. (3) Failure mode explana…
Figure 2: Matched (top) and unmatched (bottom) error slice examples, selected by the SI score (Eq. 10) for each benchmark. For each slice, we show two samples with the highest P(Sj | x) and the mispredicted class. CB-SLICE keywords are compared to Bias-to-Text and Domino. Mispredicted concept-keywords are shown in red and correctly predicted ones in green. CB-SLICE captures spurious attributes in matched slices and …
Figure 3: CB-SLICE loss components (bottom) and auxiliary classifier accuracies (top) vs. the number of slices tg. Precision@10 and MGF are overlaid on the accuracy plot for comparison. As highlighted by the blue box, loss convergence aligns with metrics stabilisation, providing a principled criterion for selecting tg.
… rare, causing f to mispredict the class “2”. A similar pattern appears in the CelebA (top) exampl…
Figure 4: Error-prone concept filtering ablation on Waterbirds under joint CBM training. We evaluate the impact of restricting slice discovery to the error-prone concept subset C_err compared to using the full concept set. Performance is measured using Slice Informativeness (SI; left), Precision@10 (middle), and MGF (right). Bars report mean values over five runs, with error bars indicating standard deviation. Across…
Figure 5: Loss components ablation on MNIST-Sum under joint CBM training. We evaluate the contribution of each component in the CB-SLICE slicing objective by ablating the GMM likelihood term (L_GMM), and the auxiliary classification terms on ground-truth concepts (L_ctrue) and predicted concepts (L_cpred), as well as their combinations. Performance is reported in terms of Slice Informativeness score (SI; left), Preci…
Figure 6: CB-SLICE auxiliary classifier accuracies, z_c(·) and z_ĉ(·), for GMM vs. linear clustering. Left: Results for the jointly trained CBM. Right: Results for the sequentially trained CBM. Each plot shows the accuracy of z_c(·) (c true accuracy) and z_ĉ(·) (c pred accuracy) as a function of the number of discovered slices, comparing GMM-based slicing (red and cyan) to a linear alternative (purple and green) on MN…
Figure 7: Error slice examples, discovered by CB-SLICE. For each slice, we show two samples with the highest P(Sj | x) and the mispredicted class. CB-SLICE keywords are compared to Bias-to-Text and Domino. Mispredicted concept-keywords are shown in red and correctly predicted ones in green.
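
The captions above rank samples by the slice membership probability P(Sj | x) and report Precision@10. A minimal sketch of a precision-at-k computation follows, assuming the common slice-discovery reading of the metric (the share of a predicted slice's top-k samples that belong to a ground-truth error slice). The paper's exact definition is not given in the text available here, and the function name and toy values are hypothetical.

```python
def precision_at_k(slice_probs, in_true_slice, k=10):
    """Share of the k highest-probability samples that are in the true slice.

    slice_probs[i]   : assumed P(S_j | x_i) for one predicted slice S_j
    in_true_slice[i] : True if sample i belongs to the ground-truth slice
    """
    ranked = sorted(
        range(len(slice_probs)), key=lambda i: slice_probs[i], reverse=True
    )
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if in_true_slice[i]) / len(top)


# Hypothetical toy values for five validation samples.
probs = [0.9, 0.8, 0.1, 0.7, 0.2]
in_true = [True, False, True, True, False]

p_at_3 = precision_at_k(probs, in_true, k=3)
print(f"Precision@3 = {p_at_3:.2f}")
```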
Original abstract

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, thus only approximating the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CB-SLICE, a concept-based error slice discovery method for Concept Bottleneck Models (CBMs). It groups samples by shared concept prediction failures and identifies responsible keyword concepts, claiming this yields slices directly linked to error sources. Across multiple benchmarks, CB-SLICE is asserted to outperform state-of-the-art SDMs in uncovering known biases while providing richer and more faithful explanations than methods whose outputs are disconnected from the model's inference process.

Significance. If the empirical claims hold after verification of the core premise and experimental details, the work would supply a direct mechanistic link between error slices and CBM concept errors, improving the faithfulness of debugging tools for interpretable models that already expose concept activations.

major comments (2)
  1. [Abstract] The central premise that 'downstream task failures in CBMs commonly arise from concept mispredictions' is stated without qualification, citation, or supporting analysis. This assumption is load-bearing; if a non-negligible fraction of task errors occur on correctly predicted concepts (i.e., in the concept-to-task mapping), then slices identified by concept failures would not be 'directly linked to the error source' and the reported gains in bias uncovering and faithfulness would not follow.
  2. [Abstract] The claim of outperformance 'across multiple benchmarks' is made without any mention of datasets, baselines (e.g., existing SDMs), metrics, statistical significance, or experimental protocol. Because the abstract supplies no verification steps, it is impossible to assess whether the data support the superiority assertions.
minor comments (1)
  1. [Abstract] The acronym 'CB-SLICE' is introduced in the title but never expanded in the provided abstract text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] The central premise that 'downstream task failures in CBMs commonly arise from concept mispredictions' is stated without qualification, citation, or supporting analysis. This assumption is load-bearing; if a non-negligible fraction of task errors occur on correctly predicted concepts (i.e., in the concept-to-task mapping), then slices identified by concept failures would not be 'directly linked to the error source' and the reported gains in bias uncovering and faithfulness would not follow.

    Authors: We agree the premise is presented without explicit qualification or citation in the abstract and is central to the claims. While the full manuscript motivates this based on the CBM architecture (task prediction is a direct function of concept activations), we will revise the abstract to qualify the statement (e.g., 'frequently arise') and add a supporting reference to prior CBM literature showing strong correlation between concept and task accuracy. This addresses the load-bearing concern without altering the core method. revision: yes

  2. Referee: [Abstract] The claim of outperformance 'across multiple benchmarks' is made without any mention of datasets, baselines (e.g., existing SDMs), metrics, statistical significance, or experimental protocol. Because the abstract supplies no verification steps, it is impossible to assess whether the data support the superiority assertions.

    Authors: We acknowledge that the abstract is concise and omits concrete details on datasets, baselines, metrics, and protocol, which limits immediate verifiability. To improve transparency while respecting length constraints, we will revise the abstract to briefly specify the benchmarks (e.g., CelebA, Waterbirds), key SDM baselines, and primary metrics (bias uncovering rate, explanation faithfulness). Full experimental protocol, statistical tests, and results remain in Sections 4 and 5. revision: yes

Circularity Check

0 steps flagged

No circularity; method rests on external CBM premise without self-referential reduction

Full rationale

The paper's core insight—that CBM task failures commonly arise from concept mispredictions—is stated as a premise drawn from the structure of CBMs rather than derived from the paper's own equations, fits, or self-citations. CB-SLICE is then defined as grouping by shared concept prediction failures, with empirical claims of outperformance evaluated on external benchmarks. No step reduces a prediction to a fitted input by construction, invokes a self-citation uniqueness theorem, or renames a known result; the derivation chain remains independent of the target outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unelaborated assumption about CBM failure modes.

axioms (1)
  • Domain assumption: Downstream task failures in CBMs commonly arise from concept mispredictions.
    Stated directly in the abstract as the basis for using concept representations for slice identification.



Reference graph

Works this paper leans on

18 extracted references · 5 canonical work pages · 1 internal anchor

  1. Bordt, S., Finck, M., Raidl, E., and Von Luxburg, U. Post-hoc explanations fail to achieve their purpose in adversarial contexts. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 891–905.

  2. d’Eon, G., d’Eon, J., Wright, J. R., and Leyton-Brown, K. The spotlight: A general method for discovering systematic errors in deep learning models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1962–1981.

  3. Enouen, E. et al. Debugging concept bottleneck models through removal and retraining. In NeurIPS 2025 Workshop on Regulatable ML.

  4. Kim, J., Wang, Z., and Qiu, Q. Constructing concept-based models to mitigate spurious correlations with minimal human effort. In European Conference on Computer Vision, pp. 137–153. Springer, 2024a.
     Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference...

  5. Mahinpei, A., Clark, J., Lage, I., Doshi-Velez, F., and Pan, W. Promises and pitfalls of black-box concept learning models. arXiv preprint arXiv:2106.13314.

  6. Mokady, R., Hertz, A., and Bermano, A. H. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734.

  7. Rosenberg, A. and Hirschberg, J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420.

  8. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731.

  9. ISSN 2079-9292. doi: 10.3390/electronics14142785. URL https://www.mdpi.com/2079-9292/14/14/2785. Publisher: Multidisciplinary Digital Publishing Institute.
     Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

  10. Yu, H., Liu, J., Zou, H., Xu, R., He, Y., Zhang, X., and Cui, P. Error slice discovery via manifold compactness. arXiv preprint arXiv:2501.19032.

  11. Yuksekgonul, M., Wang, M., and Zou, J. Post-hoc concept bottleneck models. In ICLR 2022 Workshop on PAIR2Struct: Privacy, Accountability, Interpretability, Robustness, Reasoning on Structured Data.

  12. is a binary bird classification dataset (Landbirds vs. Waterbirds) composed of RGB images created by overlaying bird crops from the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al.,

  13. onto background images from the Places dataset (Zhou et al., 2017). Backgrounds are drawn from four scene categories: bamboo forest, forest, lake, and ocean. This construction yields four subgroups: Landbirds-on-land, Landbirds-on-water, Waterbirds-on-land, and Waterbirds-on-water. In the training set, Landbirds-on-land and Waterbirds-on-water are overrepresented re...

  14. For image preprocessing, we follow the same procedure as for CelebA. For concept annotations, we adopt the Label-free pipeline of Oikarinen et al. (2023), using the concept bank from Enouen et al. (2025) and a zero-shot CLIP classifier (Radford et al.,

  15. Table 3. Waterbirds dataset splits.

      Data split            Landbird          Waterbird         Total
                            Land     Water    Land     Water
      # Training samples    3,498    184      56       1,057    4,795
      # Validation samples  2,722    155      44       775      3,696
      # Total               6,559             1,932             8,491

      Table 4. CelebA dataset splits.

      Data split            Female               Male                 Total
                            Blonde   Non-blonde  Blonde   Non-blonde
      # Training samples    22,880   71,629      1,387    66,874      162,770
      # Validation samples  2,...

  16. We train the GMM module for 200 epochs using SGD with an initial learning rate of 0.1, decayed by a factor of 2 every 30 epochs, and a batch size of 8.
      C.2. Baselines
      Quantitative Baselines. We compare CB-SLICE against four state-of-the-art SDMs: Domino (Eyuboglu et al., 2022), GEORGE (Sohoni et al., 2020), HiBug2 (Chen et al., 2025), and Spotlight (d’Eon et al...

  17. For HiBug2, the number of slices is determined automatically by its enumeration algorithm.
      Qualitative Baselines. For the qualitative assessment of slice explanations, we compare CB-SLICE with Bias-to-Text (Kim et al., 2024b) and Domino. Bias-to-Text first produces image-level descriptions via ClipCap (Mokady et al., 2021), then applies the YAKE keyword ext...

  18. backbone as the concept encoder g, for all tasks except MNIST-Sum. For MNIST-Sum, we adopt a lightweight backbone consisting of two convolutional layers with 32 and 64 channels, respectively, each followed by a ReLU activation, and a fully connected (FC) layer with ReLU. For the label predictor f, we use a single FC layer across all tasks. In practice, f ...
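
The training excerpt in item 16 describes a step schedule: 200 epochs of SGD starting at a learning rate of 0.1, decayed by a factor of 2 every 30 epochs. The sketch below works through that arithmetic. It assumes "decayed by a factor of 2" means halving and that the first decay takes effect at epoch 30 (counting from 0); neither detail is confirmed by the excerpt, and the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.1, factor=2.0, step=30):
    """Learning rate at a given 0-indexed epoch under step decay.

    Assumption: the rate is divided by `factor` once per completed
    block of `step` epochs.
    """
    return base_lr / factor ** (epoch // step)


# Learning rate at the start of each 30-epoch block of a 200-epoch run.
for epoch in range(0, 200, 30):
    print(f"epoch {epoch:3d}: lr = {step_decay_lr(epoch):.7f}")
```

Under these assumptions the rate is halved six times over the run, so the final epochs train at 0.1 / 64 = 0.0015625.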