CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

Derek F. Wong; Henghua Shen; Mingkun Xu; Mu You; Tao Fang; Yike Sun; Zehan Tan; Zhongzhi He

arxiv: 2605.19848 · v1 · pith:U2NLHDJTnew · submitted 2026-05-19 · 💻 cs.CL

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

Yike Sun , Mingkun Xu , Mu You , Zhongzhi He , Henghua Shen , Zehan Tan , Derek F. Wong , Tao Fang This is my paper

Pith reviewed 2026-05-20 06:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords influence functionsconcept bottleneck modelsmodel interpretabilityNLP modelsdata debuggingtransparent modelssample influenceconcept influence

0 comments

The pith

Influence functions applied to concept bottleneck models identify critical samples and concepts for model predictions and allow corrections by adjustment without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the interpretability of NLP models by applying influence functions at both the sample level and the concept level within concept bottleneck models. Experiments demonstrate that these functions can locate the training samples with the largest positive or negative effects on predictions using the CEBaB and Yelp datasets. By changing the labels or weights of those samples, the authors show that model performance returns to its original level. The concept-level analysis further reveals which internal concepts most influence decisions, and altering those concepts changes how the model behaves on inputs.

Core claim

Extending influence functions to the concept representations inside transparent bottleneck architectures allows quantification of the contribution of specific training samples and bottleneck concepts to any given prediction, with the result that targeted edits to the most influential samples or concepts can be used to debug and correct model behavior.

What carries the argument

Concept-level influence functions that estimate the effect of upweighting or relabeling individual training points or modifying concept activations on the model's output.

If this is right

Model performance can be restored to baseline by adjusting the most influential training samples identified by the method.
Key concepts driving predictions in the bottleneck can be identified and their modification leads to observable changes in model outputs.
Efficient data debugging becomes possible without the need for full model retraining.
Interpretability is enhanced at both instance and concept levels for high-stakes NLP applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar influence-based debugging could be tested on other types of interpretable models beyond bottleneck architectures.
The approach might support the creation of automated pipelines for cleaning training data in deployed systems.
Validation on whether influence scores align with human annotations of important concepts would strengthen the method's practical value.

Load-bearing premise

The influence scores accurately measure the causal contribution of particular samples and concepts to the predictions rather than merely reflecting correlations.

What would settle it

An experiment that adjusts the labels or weights of the samples ranked highest by influence and checks whether performance on a held-out test set returns to the original baseline level; failure to restore performance would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.19848 by Derek F. Wong, Henghua Shen, Mingkun Xu, Mu You, Tao Fang, Yike Sun, Zehan Tan, Zhongzhi He.

**Figure 1.** Figure 1: The overall framework of our integrated CBM-NLP model with influence functions. large language models (LLMs), IFs have been used to analyze training data influence across layers, revealing sample impacts at different abstraction levels [20]. However, precision in influence estimation remains a bottleneck in non-convex settings. Ongoing research balances computational efficiency and accuracy via gradient-… view at source ↗

read the original abstract

In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper lifts influence functions to the concept level inside CBMs and shows sample tweaks can recover performance on two datasets, but the approximation accuracy for those tweaks is not clearly validated.

read the letter

The main thing to know is that this work applies influence functions both to individual training samples and to concepts inside Concept Bottleneck Models for NLP. On CEBaB and Yelp they identify samples that help or hurt predictions and report that changing labels or weights on the influential ones brings performance back to baseline without retraining the model. They also use the concept-level scores to highlight which concepts drive decisions and show that altering them changes outputs in observable ways. That practical debugging angle is the clearest contribution and could be useful for people who need to inspect or clean data in high-stakes NLP settings. The experiments are run on standard datasets and the method is a direct extension of an existing technique, so the setup is straightforward. The softer part is the lack of checks on how well the first-order influence approximation holds up when the changes are large, such as full label flips or substantial weight adjustments. Those approximations are known to degrade in deep non-convex models, and the abstract gives no sign of leave-one-out retraining or error-bound comparisons to confirm the predicted performance delta matches what actually happens after the adjustments. If the full paper includes that verification it would tighten the central claim; without it the restoration result rests on an assumption that may not be reliable. This is aimed at researchers working on interpretability for concept-based or bottleneck models in NLP. A reader who wants a concrete tool for spotting problematic training points or concepts would get something usable from the experiments. It is worth sending to peer review because the idea is clear, the datasets are appropriate, and the practical framing is reasonable even if the approximation validation needs more attention.

Referee Report

2 major / 2 minor

Summary. The paper proposes CLIF, which extends influence functions to provide interpretability for Concept Bottleneck Models (CBMs) at both the training-sample level and the concept level. On the CEBaB and Yelp datasets, the authors claim that influence scores successfully identify the most helpful and harmful training examples for model predictions. They further assert that adjusting the labels or weights of these examples restores model performance to baseline levels without retraining. At the concept level, the method identifies influential concepts whose modification produces observable changes in model behavior.

Significance. If the reported restoration results hold under validated influence approximations, the work supplies a practical, retraining-free method for data debugging and concept-level interpretation in transparent NLP architectures. This could be useful for high-stakes applications. The dual sample- and concept-level analysis extends existing influence-function techniques to CBMs in a potentially useful way, though the strength of the contribution depends on the rigor of the empirical validation.

major comments (2)

[Experiments] Experiments section: The headline claim that label/weight adjustments on influential samples restore baseline performance rests on the accuracy of the first-order influence-function approximation for these perturbations. No leave-one-out retraining, second-order error bounds, or direct comparison of predicted versus actual performance deltas is reported for the CEBaB or Yelp interventions. This approximation is known to degrade for large changes (e.g., full label flips) or ill-conditioned Hessians in deep models, making the causal restoration result insecure.
[Abstract and §4] Abstract and §4: The manuscript asserts positive experimental outcomes on CEBaB and Yelp but supplies no quantitative results (accuracy deltas, baselines, statistical tests, or sample sizes). Without these numbers it is impossible to assess whether the influence-based adjustments actually achieve the claimed restoration or merely produce small, statistically insignificant changes.

minor comments (2)

[Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., accuracy before/after adjustment) to support the stated claims.
[Method] Clarify the exact definition of the concept-level influence score (e.g., whether it is computed on the concept predictor or the final classifier) and provide the corresponding equation in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the empirical validation and clarity of the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: The headline claim that label/weight adjustments on influential samples restore baseline performance rests on the accuracy of the first-order influence-function approximation for these perturbations. No leave-one-out retraining, second-order error bounds, or direct comparison of predicted versus actual performance deltas is reported for the CEBaB or Yelp interventions. This approximation is known to degrade for large changes (e.g., full label flips) or ill-conditioned Hessians in deep models, making the causal restoration result insecure.

Authors: We agree that the first-order approximation carries known limitations for large perturbations and that direct validation would strengthen the causal claims. The original experiments followed the standard efficient application of influence functions from prior literature, but we acknowledge the absence of leave-one-out comparisons or Hessian diagnostics. In the revised manuscript we will add a new validation subsection reporting (i) leave-one-out retraining results on a computationally feasible subset of CEBaB and Yelp examples, (ii) the condition numbers of the relevant Hessians, and (iii) direct numerical comparison between predicted and observed performance deltas. These additions will be presented alongside the existing results without changing the core methodology. revision: yes
Referee: [Abstract and §4] Abstract and §4: The manuscript asserts positive experimental outcomes on CEBaB and Yelp but supplies no quantitative results (accuracy deltas, baselines, statistical tests, or sample sizes). Without these numbers it is impossible to assess whether the influence-based adjustments actually achieve the claimed restoration or merely produce small, statistically insignificant changes.

Authors: We accept that the abstract and the high-level summary in §4 currently lack explicit numerical values, which reduces verifiability. The detailed accuracy deltas, baselines, and sample sizes appear in the tables of the experiments section, but we will revise both the abstract and §4 to include the key quantitative findings (e.g., restoration deltas and statistical significance) together with explicit pointers to the tables. Sample sizes and any applicable p-values will also be stated clearly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical application of existing influence functions with no self-referential derivations

full rationale

The paper presents an empirical study applying influence functions to identify impactful samples and concepts in Concept Bottleneck Models on CEBaB and Yelp datasets. No equations, derivations, or first-principles claims are advanced that reduce to fitted parameters or self-citations by construction. The central results are experimental demonstrations of label/weight adjustments restoring baseline performance, which rely on external validation rather than internal redefinition. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The work is self-contained against external benchmarks as a straightforward extension of prior influence-function techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, derivations, or technical sections from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5690 in / 1066 out tokens · 44410 ms · 2026-05-20T06:17:53.521949+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1]

In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics

Barshan, E., Prince, S., Benos, P.V.: Relatif: An interpretable method for recom- mending influential data in deep learning. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. pp. 1539–1547 (2020)

work page 2020
[2]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Basu, S., Christensen, E.J., Raff, E., Barford, P.: Influence functions in deep learn- ing are fragile. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 3645–3652 (2020)

work page 2020
[3]

In: Proceedings of the 40th International Conference on Machine Learning

Basu, S., Raff, E., Barford, P.: Reliable influence functions for deep neural net- works. In: Proceedings of the 40th International Conference on Machine Learning. pp. 1453–1465 (2023)

work page 2023
[4]

In: Proceedings of the AAAI Conference on Artificial In- telligence

Chauhan, K., Tiwari, R., Freyberg, J., Shenoy, P., Dvijotham, K.: Interactive con- cept bottleneck models. In: Proceedings of the AAAI Conference on Artificial In- telligence. vol. 37(5), pp. 5948–5955 (2023)

work page 2023
[5]

In: International Conference on Machine Learning

Chen, J., Song, M., Weng, Z.: Concept whitening for interpretable image recogni- tion. In: International Conference on Machine Learning. pp. 1626–1635 (2020)

work page 2020
[6]

In: Proceedings of the 2019 ACL Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP

Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does bert look at? an analysis of bert’s attention. In: Proceedings of the 2019 ACL Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276–286 (2019)

work page 2019
[7]

Ho, X., Nguyen, A

Delobelle, P., Winters, T., Berendt, B.: RobBERT: a Dutch RoBERTa-based Language Model. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Associa- tion for Computational Linguistics: EMNLP 2020. pp. 3255–3265. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020. findings-emnlp.292, https://aclanthology.org/202...

work page doi:10.18653/v1/2020 2020
[8]

BERT: Pre-training of deep bidirectional transformers for language understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long and Short Paper...

work page doi:10.18653/v1/n19-1423 2019
[9]

Towards A Rigorous Science of Interpretable Machine Learning

Doshi-Velez,F.,Kim,B.:Towardsarigorousscienceofinterpretablemachinelearn- ing. In: arXiv preprint arXiv:1702.08608 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[10]

Advances in Neural Information Processing Systems35, 23386–23397 (2022)

Havasi, M., Parbhoo, S., Doshi-Velez, F.: Addressing leakage in concept bottle- neck models. Advances in Neural Information Processing Systems35, 23386–23397 (2022)

work page 2022
[11]

Attention is not Explanation

Jain, S., Wallace, B.C.: Attention is not explanation. arXiv preprint arXiv:1902.10186 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1902
[12]

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Kim, B., Wattenberg, M., Gilmer, J.: Interpretability beyond feature attribu- tion: Quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba

Kim, E., Jung, D., Park, S., Kim, S., Yoon, S.: Probabilistic concept bottleneck models. arXiv preprint arXiv:2306.01574 (2023)

work page arXiv 2023
[14]

In: Proceedings of the 34th International Conference on Machine Learning

Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1885–1894 (2017)

work page 2017
[15]

In: International Conference on Machine Learning

Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., Liang, P.: Concept bottleneck models. In: International Conference on Machine Learning. pp. 5338–5347 (2020) CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models 13

work page 2020
[16]

In: Advances in Neural Information Processing Systems

Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. pp. 4765–4774 (2017)

work page 2017
[17]

MetaAI: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Lulu.com (2020)

Molnar, C.: Interpretable machine learning. Lulu.com (2020)

work page 2020
[19]

In: The Eleventh International Conference on Learning Representations (2022)

Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.W.: Label-free concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2022)

work page 2022
[20]

Transactions of the Association for Computational Linguis- tics11, 87–103 (2023)

Peruzzo, L., Schioppa, G., Müller, M.: Layer-wise influence functions for analyzing large language models. Transactions of the Association for Computational Linguis- tics11, 87–103 (2023)

work page 2023
[21]

Advances in Neural Information Processing Systems34, 10064–10075 (2021)

Pezeshki, M., Guo, Q., Huang, J., Laurent, T., Peyré, G., Lacoste-Julien, S.: Gradient-based data subset selection for efficient deep learning. Advances in Neural Information Processing Systems34, 10064–10075 (2021)

work page 2021
[22]

In: Pro- ceedings of the Conference (2019), openAI Blog

Radford, A., et al.: Language models are unsupervised multitask learners. In: Pro- ceedings of the Conference (2019), openAI Blog

work page 2019
[23]

In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: Explaining the pre- dictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144 (2016)

work page 2016
[24]

Nature Machine Intelligence1(5), 206–215 (2019)

Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence1(5), 206–215 (2019)

work page 2019
[25]

Journal of Machine Learning Research23, 1–29 (2022)

Schioppa, G., Peruzzo, L., Müller, M.: Efficient estimation of influence functions in deep learning. Journal of Machine Learning Research23, 1–29 (2022)

work page 2022
[26]

IEEE Access9, 11974–12001 (2021)

Stepin, I., Alonso, J.M.S., Pereira-Fariña, M., Alani, H.: A survey of contrastive and counterfactual explanation generation methods for explainable artificial intel- ligence. IEEE Access9, 11974–12001 (2021)

work page 2021
[27]

arXiv preprint arXiv:2312.15033 (2023)

Tan, Z., Chen, T., Zhang, Z., Liu, H.: Sparsity-guided holistic explanation for llms with interpretable inference-time intervention. arXiv preprint arXiv:2312.15033 (2023)

work page arXiv 2023
[28]

arXiv preprint arXiv:2311.05014 (2023)

Tan, Z., Cheng, L., Wang, S., Bo, Y., Li, J., Liu, H.: Interpreting pretrained lan- guage models via concept bottlenecks. arXiv preprint arXiv:2311.05014 (2023)

work page arXiv 2023
[29]

Qwen2 Technical Report

Team, Q.: Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Team, Q.: Qwen2.5: A party of foundation models (September 2024), https:// qwenlm.github.io/blog/qwen2.5/

work page 2024
[31]

In: Harvard Journal of Law & Technology

Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: Automated decisions and the gdpr. In: Harvard Journal of Law & Technology. vol. 31, pp. 841–887 (2017)

work page 2017
[32]

In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Wang, J., Li, M., Zhang, J., Liu, Y.: Cebab: A large-scale annotated corpus for chinese sentiment analysis. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 1234–1244 (2021)

work page 2021
[33]

In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing

Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. pp. 11–20 (2019)

work page 2019
[34]

Advances in Neural Information Processing Systems33, 20554–20565 (2020)

Yeh, C.K., Kim, B., Arik, S.K., Li, C.L., Pfister, T., Ravikumar, P.: Completeness- aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems33, 20554–20565 (2020)

work page 2020
[35]

In: Proceedings of the 2021 International Conference on Data Mining

Yelp: Yelp open dataset: Restaurant reviews and ratings. In: Proceedings of the 2021 International Conference on Data Mining. pp. 1567–1571 (2021)

work page 2021
[36]

In: The Eleventh International Conference on Learning Representations (2022)

Yuksekgonul, M., Wang, M., Zou, J.: Post-hoc concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2022)

work page 2022

[1] [1]

In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics

Barshan, E., Prince, S., Benos, P.V.: Relatif: An interpretable method for recom- mending influential data in deep learning. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. pp. 1539–1547 (2020)

work page 2020

[2] [2]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Basu, S., Christensen, E.J., Raff, E., Barford, P.: Influence functions in deep learn- ing are fragile. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 3645–3652 (2020)

work page 2020

[3] [3]

In: Proceedings of the 40th International Conference on Machine Learning

Basu, S., Raff, E., Barford, P.: Reliable influence functions for deep neural net- works. In: Proceedings of the 40th International Conference on Machine Learning. pp. 1453–1465 (2023)

work page 2023

[4] [4]

In: Proceedings of the AAAI Conference on Artificial In- telligence

Chauhan, K., Tiwari, R., Freyberg, J., Shenoy, P., Dvijotham, K.: Interactive con- cept bottleneck models. In: Proceedings of the AAAI Conference on Artificial In- telligence. vol. 37(5), pp. 5948–5955 (2023)

work page 2023

[5] [5]

In: International Conference on Machine Learning

Chen, J., Song, M., Weng, Z.: Concept whitening for interpretable image recogni- tion. In: International Conference on Machine Learning. pp. 1626–1635 (2020)

work page 2020

[6] [6]

In: Proceedings of the 2019 ACL Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP

Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does bert look at? an analysis of bert’s attention. In: Proceedings of the 2019 ACL Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276–286 (2019)

work page 2019

[7] [7]

Ho, X., Nguyen, A

Delobelle, P., Winters, T., Berendt, B.: RobBERT: a Dutch RoBERTa-based Language Model. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Associa- tion for Computational Linguistics: EMNLP 2020. pp. 3255–3265. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020. findings-emnlp.292, https://aclanthology.org/202...

work page doi:10.18653/v1/2020 2020

[8] [8]

BERT: Pre-training of deep bidirectional transformers for language understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long and Short Paper...

work page doi:10.18653/v1/n19-1423 2019

[9] [9]

Towards A Rigorous Science of Interpretable Machine Learning

Doshi-Velez,F.,Kim,B.:Towardsarigorousscienceofinterpretablemachinelearn- ing. In: arXiv preprint arXiv:1702.08608 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

Advances in Neural Information Processing Systems35, 23386–23397 (2022)

Havasi, M., Parbhoo, S., Doshi-Velez, F.: Addressing leakage in concept bottle- neck models. Advances in Neural Information Processing Systems35, 23386–23397 (2022)

work page 2022

[11] [11]

Attention is not Explanation

Jain, S., Wallace, B.C.: Attention is not explanation. arXiv preprint arXiv:1902.10186 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1902

[12] [12]

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Kim, B., Wattenberg, M., Gilmer, J.: Interpretability beyond feature attribu- tion: Quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[13] [13]

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba

Kim, E., Jung, D., Park, S., Kim, S., Yoon, S.: Probabilistic concept bottleneck models. arXiv preprint arXiv:2306.01574 (2023)

work page arXiv 2023

[14] [14]

In: Proceedings of the 34th International Conference on Machine Learning

Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1885–1894 (2017)

work page 2017

[15] [15]

In: International Conference on Machine Learning

Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., Liang, P.: Concept bottleneck models. In: International Conference on Machine Learning. pp. 5338–5347 (2020) CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models 13

work page 2020

[16] [16]

In: Advances in Neural Information Processing Systems

Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. pp. 4765–4774 (2017)

work page 2017

[17] [17]

MetaAI: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Lulu.com (2020)

Molnar, C.: Interpretable machine learning. Lulu.com (2020)

work page 2020

[19] [19]

In: The Eleventh International Conference on Learning Representations (2022)

Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.W.: Label-free concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2022)

work page 2022

[20] [20]

Transactions of the Association for Computational Linguis- tics11, 87–103 (2023)

Peruzzo, L., Schioppa, G., Müller, M.: Layer-wise influence functions for analyzing large language models. Transactions of the Association for Computational Linguis- tics11, 87–103 (2023)

work page 2023

[21] [21]

Advances in Neural Information Processing Systems34, 10064–10075 (2021)

Pezeshki, M., Guo, Q., Huang, J., Laurent, T., Peyré, G., Lacoste-Julien, S.: Gradient-based data subset selection for efficient deep learning. Advances in Neural Information Processing Systems34, 10064–10075 (2021)

work page 2021

[22] [22]

In: Pro- ceedings of the Conference (2019), openAI Blog

Radford, A., et al.: Language models are unsupervised multitask learners. In: Pro- ceedings of the Conference (2019), openAI Blog

work page 2019

[23] [23]

In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: Explaining the pre- dictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144 (2016)

work page 2016

[24] [24]

Nature Machine Intelligence1(5), 206–215 (2019)

Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence1(5), 206–215 (2019)

work page 2019

[25] [25]

Journal of Machine Learning Research23, 1–29 (2022)

Schioppa, G., Peruzzo, L., Müller, M.: Efficient estimation of influence functions in deep learning. Journal of Machine Learning Research23, 1–29 (2022)

work page 2022

[26] [26]

IEEE Access9, 11974–12001 (2021)

Stepin, I., Alonso, J.M.S., Pereira-Fariña, M., Alani, H.: A survey of contrastive and counterfactual explanation generation methods for explainable artificial intel- ligence. IEEE Access9, 11974–12001 (2021)

work page 2021

[27] [27]

arXiv preprint arXiv:2312.15033 (2023)

Tan, Z., Chen, T., Zhang, Z., Liu, H.: Sparsity-guided holistic explanation for llms with interpretable inference-time intervention. arXiv preprint arXiv:2312.15033 (2023)

work page arXiv 2023

[28] [28]

arXiv preprint arXiv:2311.05014 (2023)

Tan, Z., Cheng, L., Wang, S., Bo, Y., Li, J., Liu, H.: Interpreting pretrained lan- guage models via concept bottlenecks. arXiv preprint arXiv:2311.05014 (2023)

work page arXiv 2023

[29] [29]

Qwen2 Technical Report

Team, Q.: Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Team, Q.: Qwen2.5: A party of foundation models (September 2024), https:// qwenlm.github.io/blog/qwen2.5/

work page 2024

[31] [31]

In: Harvard Journal of Law & Technology

Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: Automated decisions and the gdpr. In: Harvard Journal of Law & Technology. vol. 31, pp. 841–887 (2017)

work page 2017

[32] [32]

In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Wang, J., Li, M., Zhang, J., Liu, Y.: Cebab: A large-scale annotated corpus for chinese sentiment analysis. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 1234–1244 (2021)

work page 2021

[33] [33]

In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing

Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. pp. 11–20 (2019)

work page 2019

[34] [34]

Advances in Neural Information Processing Systems33, 20554–20565 (2020)

Yeh, C.K., Kim, B., Arik, S.K., Li, C.L., Pfister, T., Ravikumar, P.: Completeness- aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems33, 20554–20565 (2020)

work page 2020

[35] [35]

In: Proceedings of the 2021 International Conference on Data Mining

Yelp: Yelp open dataset: Restaurant reviews and ratings. In: Proceedings of the 2021 International Conference on Data Mining. pp. 1567–1571 (2021)

work page 2021

[36] [36]

In: The Eleventh International Conference on Learning Representations (2022)

Yuksekgonul, M., Wang, M., Zou, J.: Post-hoc concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2022)

work page 2022