Robust Explanations for User Trust in Enterprise NLP Systems

Amine Anoun; Guilin Zhang; Jeffrey Friedman; Jerry Ting; Kai Zhao; Xu Chu

arXiv: 2604.12069 · v2 · submitted 2026-04-13 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords robust explanations · black-box evaluation · token-level explanations · decoder LLMs · explanation stability · NLP model comparison · enterprise NLP

The pith

Decoder LLMs produce 73 percent lower flip rates in token explanations than encoder models under realistic input noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a black-box testing method that measures how much token explanations change when inputs undergo swaps, deletions, shuffles, or back-translations at different strengths. It applies the method to encoder classifiers and decoder LLMs on three standard datasets, running 64,800 total cases. Decoder models show markedly fewer changes in their top explanatory tokens, and this consistency grows with model size. The authors also plot stability against inference cost to give deployers a concrete selection guide. The work targets enterprise settings where explanations must stay reliable for user trust and regulatory checks even when inputs vary slightly.
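The three string-level perturbations named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the word-level granularity, and the reading of "severity" as a fraction of tokens are all assumptions, and back-translation is left out because it needs an external translation model.

```python
import random


def swap_adjacent(tokens, severity, rng):
    """Swap about `severity` (0..1) of the adjacent token pairs."""
    out = list(tokens)
    n_swaps = int(round(severity * (len(out) - 1))) if len(out) > 1 else 0
    positions = rng.sample(range(len(out) - 1), n_swaps) if n_swaps else []
    for i in positions:
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_tokens(tokens, severity, rng):
    """Drop about `severity` of the tokens, always keeping at least one."""
    n_drop = min(int(round(severity * len(tokens))), len(tokens) - 1)
    drop = set(rng.sample(range(len(tokens)), n_drop)) if n_drop > 0 else set()
    return [t for i, t in enumerate(tokens) if i not in drop]


def shuffle_window(tokens, severity, rng):
    """Shuffle one contiguous window covering about `severity` of the tokens."""
    out = list(tokens)
    width = min(max(2, int(round(severity * len(out)))), len(out))
    start = rng.randrange(len(out) - width + 1)
    window = out[start:start + width]
    rng.shuffle(window)
    out[start:start + width] = window
    return out


# Usage: apply each perturbation at a hypothetical severity of 0.3.
rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the service was slow but the staff were kind".split()
for perturb in (swap_adjacent, delete_tokens, shuffle_window):
    print(perturb.__name__, perturb(tokens, 0.3, rng))
```

Passing an explicit `random.Random` instance keeps each perturbed input reproducible, which matters when the same cases must be replayed against several models.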

Core claim

A leave-one-out occlusion protocol operationalized as top-token flip rate under multi-level perturbations demonstrates that decoder LLMs deliver substantially more stable explanations than encoder baselines, with an average 73 percent reduction in flip rates and a further 44 percent improvement when scaling from 7B to 70B parameters.
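The percentages in this claim are most naturally read as relative reductions in flip rate, not percentage-point differences. The sketch below shows that arithmetic; the function name and the input rates are placeholders chosen for illustration and are not values reported by the paper.

```python
def relative_reduction(baseline_rate, new_rate):
    """Fraction by which `new_rate` is lower than `baseline_rate`."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Placeholder rates (NOT from the paper): a baseline flip rate of 0.40
# and a new flip rate of 0.108 give a relative reduction of about 73%.
print(f"{relative_reduction(0.40, 0.108):.0%}")
```

Under this reading, the same relative reduction can correspond to very different absolute gaps depending on the baseline, which is one reason per-dataset flip rates are worth reporting alongside the headline figure.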

What carries the argument

Top-token flip rate under leave-one-out occlusion explanations tested against swap, deletion, shuffling, and back-translation perturbations at multiple severity levels.
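A minimal sketch of the metric, under stated assumptions: `score_fn` is a hypothetical stand-in for a black-box API call that returns the predicted-class score, importance is taken as the absolute score change when a token is removed, and a "flip" means the top-ranked token string differs between the original and perturbed input. The paper may define any of these details differently.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical black-box scorer: tokens in, predicted-class score out.
ScoreFn = Callable[[Sequence[str]], float]


def loo_importances(tokens: Sequence[str], score_fn: ScoreFn) -> List[float]:
    """Importance of token i = |score(full input) - score(input without i)|."""
    base = score_fn(tokens)
    return [
        abs(base - score_fn(list(tokens[:i]) + list(tokens[i + 1:])))
        for i in range(len(tokens))
    ]


def top_token(tokens: Sequence[str], score_fn: ScoreFn) -> str:
    """Token with the largest leave-one-out importance (input must be non-empty)."""
    imps = loo_importances(tokens, score_fn)
    best = max(range(len(tokens)), key=imps.__getitem__)
    return tokens[best]


def top_token_flip_rate(
    pairs: Sequence[Tuple[Sequence[str], Sequence[str]]], score_fn: ScoreFn
) -> float:
    """Share of (original, perturbed) pairs whose top token differs."""
    flips = sum(top_token(o, score_fn) != top_token(p, score_fn) for o, p in pairs)
    return flips / len(pairs)


def toy_score(tokens: Sequence[str]) -> float:
    """Toy lexicon scorer used only to make the sketch runnable."""
    weights = {"great": 1.0, "good": 0.5, "bad": -1.0}
    return sum(weights.get(t, 0.0) for t in tokens)


pairs = [
    (["the", "food", "was", "great"], ["food", "the", "was", "great"]),  # swap
    (["good", "plot", "bad", "acting"], ["good", "plot", "acting"]),     # deletion
]
print(top_token_flip_rate(pairs, toy_score))  # 0.5 with this toy scorer
```

Leave-one-out occlusion needs one extra model call per token, so the cost of explaining an input grows linearly with its length; that is the price of using only API-level access.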

If this is right

Decoder LLMs keep their top explanatory tokens consistent even after users swap words, delete phrases, or rephrase inputs.
Explanation stability scales upward with decoder model size, supporting the choice of larger models for compliance-sensitive tasks.
A plotted cost-robustness curve lets teams select models and explanation methods before deployment.
Encoder-based systems may need extra validation steps or alternative explanation techniques in noisy production environments.
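One way a team might use a cost-robustness curve is sketched below: keep only the candidates that are not beaten on both cost and flip rate, then pick the cheapest one that meets a stability target. The class, function names, cost units, and all numbers are hypothetical; the paper's own curve and tiers are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float       # relative inference cost (hypothetical units)
    flip_rate: float  # top-token flip rate; lower means more stable


def pareto_frontier(cands: List[Candidate]) -> List[Candidate]:
    """Candidates for which no cheaper-or-equal option has a lower-or-equal flip rate."""
    frontier: List[Candidate] = []
    for c in sorted(cands, key=lambda c: (c.cost, c.flip_rate)):
        # Walking up in cost, keep a candidate only if it improves stability.
        if not frontier or c.flip_rate < frontier[-1].flip_rate:
            frontier.append(c)
    return frontier


def cheapest_within(cands: List[Candidate], max_flip_rate: float) -> Optional[Candidate]:
    """Cheapest candidate meeting the stability target, or None if none does."""
    ok = [c for c in cands if c.flip_rate <= max_flip_rate]
    return min(ok, key=lambda c: c.cost) if ok else None


# Placeholder candidates; these are not measurements from the paper.
cands = [
    Candidate("model_a", cost=1.0, flip_rate=0.40),
    Candidate("model_b", cost=2.0, flip_rate=0.45),
    Candidate("model_c", cost=5.0, flip_rate=0.15),
    Candidate("model_d", cost=20.0, flip_rate=0.08),
]
print([c.name for c in pareto_frontier(cands)])
print(cheapest_within(cands, max_flip_rate=0.20))
```

In this toy data `model_b` drops off the frontier because `model_a` is both cheaper and more stable.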

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations shifting from encoders to decoders could reduce the frequency of user confusion caused by shifting explanations.
The evaluation protocol could be applied to other explanation formats such as feature attributions or attention maps.
Real-world deployment logs might reveal additional perturbation types that further differentiate model families.

Load-bearing premise

The selected perturbations at multiple severity levels adequately stand in for the input variations users actually produce in enterprise NLP systems.

What would settle it

A follow-up experiment that replaces the paper's perturbation set with real user typo logs or synonym substitutions and finds encoder flip rates equal to or below decoder rates would undermine the claimed stability advantage.

Figures

Figures reproduced from arXiv: 2604.12069 by Amine Anoun, Guilin Zhang, Jeffrey Friedman, Jerry Ting, Kai Zhao, Xu Chu.

**Figure 3.** Explanation stability improves with model …

**Figure 2.** Flip rate comparison across models and datasets. Encoder models (BERT, RoBERTa) exhibit significantly higher flip rates than decoder models, with the gap most pronounced on IMDB (longer texts). The 73% improvement represents the average reduction in flip rate when moving from encoders to decoders.

**Figure 4.** Deployment trade-off: model scale vs. explanation stability. Shaded bands indicate three deployment tiers. Arrows show 73% improvement (encoder→decoder) and 44% improvement (7B→70B).

Original abstract

Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a usable black-box protocol for testing explanation stability and reports decoder LLMs outperforming encoders, but the perturbations lack validation against real enterprise noise.

Here's the short version: the paper introduces a black-box protocol to test explanation stability and finds decoder LLMs more robust than encoders, but the perturbations used may not capture real enterprise noise.

They operationalize robustness as the rate at which the top token in a leave-one-out explanation flips after perturbing the input with swaps, deletions, shuffling, or back-translation. They apply this to six models on three datasets and report decoder models having much lower flip rates, with larger models improving further. They also plot this against inference cost.

This is new in the sense that it brings together the occlusion method with a multi-perturbation test and applies it to current LLMs in a cross-architecture way. The work does well in focusing on black-box access, which matches many enterprise deployments, and in trying to give a practical pre-deployment tool.

The soft spots are in the strength of the evidence and the assumptions. The quantitative claims in the abstract lack supporting stats like variance or significance tests, making it difficult to assess reliability. More critically, calling the perturbations 'realistic' for user noise is a stretch without any comparison to observed enterprise data or user studies. If actual noise includes things like domain-specific terms or context shifts not covered here, the stability advantage for decoders might not hold in practice.

This paper is aimed at practitioners and researchers working on deploying explainable NLP in regulated or high-trust environments. Someone looking for a ready-to-use evaluation method would get value from it. It deserves a serious referee because the topic is timely and the protocol is straightforward to implement, even if the current write-up needs tightening on methods and validation. I would recommend putting it through peer review, with feedback focused on adding statistical rigor and checking the perturbations against real-world examples.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a black-box framework for evaluating the robustness of token-level explanations in NLP systems using leave-one-out occlusion. Through experiments on three benchmark datasets with six models (encoders: BERT, RoBERTa; decoders: Qwen 7B/14B, Llama 8B/70B), totaling 64,800 cases, it reports that decoder LLMs exhibit 73% lower top-token flip rates under perturbations (swap, deletion, shuffling, back-translation) compared to encoders, with a 44% improvement from 7B to 70B scale. It also provides a cost-robustness tradeoff analysis for enterprise deployment decisions.

Significance. If the perturbation protocol adequately captures real user noise, the work provides useful empirical evidence on the superior stability of explanations from decoder LLMs over encoders and the benefits of model scale, along with a practical cost-robustness curve that could inform pre-deployment choices in compliance-sensitive applications. The systematic cross-architecture comparison and large evaluation scale are strengths that enhance its potential utility for practitioners.

major comments (3)

Abstract: The quantitative claims of 73% lower flip rates for decoders and 44% scale gain lack error bars, confidence intervals, statistical significance tests, or per-dataset breakdowns, making it impossible to assess whether the reported architecture and scale effects are reliable or generalizable.
Evaluation Framework (or §3): The perturbations are positioned as 'realistic' for enterprise user noise to support pre-deployment guidance in compliance-sensitive applications, but no validation against actual enterprise logs, crowdsourced edits, or observed error patterns is provided; this assumption is load-bearing for the applicability of the stability claims.
Experiments section: Full protocol details (exact severity levels for each perturbation type, dataset names and splits, and how the 64,800 cases were constructed) are insufficiently specified, preventing reproduction and independent verification of the flip-rate results.
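The uncertainty estimate requested in the first major comment could be produced with a percentile bootstrap over per-case flip indicators, sketched below. The function name and the synthetic 0/1 data are assumptions for illustration, and the sketch treats the two groups as independent samples; if the same inputs are scored by both model families, a paired resampling scheme would be more appropriate.

```python
import random


def bootstrap_diff_ci(flips_a, flips_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(flips_a) - mean(flips_b).

    `flips_a` and `flips_b` are lists of 0/1 flip indicators, one per case.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement at its original size.
        ra = rng.choices(flips_a, k=len(flips_a))
        rb = rng.choices(flips_b, k=len(flips_b))
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic indicators (NOT paper data): 40/100 flips vs. 10/100 flips.
encoder_flips = [1] * 40 + [0] * 60
decoder_flips = [1] * 10 + [0] * 90
lo, hi = bootstrap_diff_ci(encoder_flips, decoder_flips)
print(f"95% CI for the flip-rate gap: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the claim that the two families differ; running the same procedure per dataset would give the breakdown the referee asks for.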

minor comments (2)

Abstract: Naming the three benchmark datasets explicitly would improve clarity and allow readers to immediately contextualize the results.
Consider adding a summary table of models, datasets, perturbation types, and key metrics to aid quick reference.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity, statistical rigor, and reproducibility.

Referee: Abstract: The quantitative claims of 73% lower flip rates for decoders and 44% scale gain lack error bars, confidence intervals, statistical significance tests, or per-dataset breakdowns, making it impossible to assess whether the reported architecture and scale effects are reliable or generalizable.

Authors: We agree that the abstract would be strengthened by statistical details. In the revision we will add error bars or confidence intervals to the reported 73% and 44% figures, include the outcomes of statistical significance tests for the architecture and scale comparisons, and provide per-dataset breakdowns (either in the main text or appendix) to show consistency across the three benchmarks. revision: yes
Referee: Evaluation Framework (or §3): The perturbations are positioned as 'realistic' for enterprise user noise to support pre-deployment guidance in compliance-sensitive applications, but no validation against actual enterprise logs, crowdsourced edits, or observed error patterns is provided; this assumption is load-bearing for the applicability of the stability claims.

Authors: The referee is correct that we provide no direct validation against enterprise logs. Our perturbation types are drawn from commonly reported user noise patterns in the literature, but we lacked access to proprietary logs. We will revise the manuscript to qualify the perturbations as 'representative of common user-induced noise' rather than directly validated, and we will add an explicit limitations paragraph discussing the gap to real enterprise deployments. revision: partial
Referee: Experiments section: Full protocol details (exact severity levels for each perturbation type, dataset names and splits, and how the 64,800 cases were constructed) are insufficiently specified, preventing reproduction and independent verification of the flip-rate results.

Authors: We apologize for the missing details. The revised version will specify the exact severity levels for each perturbation (swap, deletion, shuffling, back-translation), name the datasets and their train/test splits, and provide a transparent breakdown of how the 64,800 cases were generated. We will also include a table or pseudocode to support full reproducibility. revision: yes
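The kind of transparent case breakdown promised here could look like the grid enumeration below. This is a sketch under loud assumptions: the model labels are shorthand for the six models named in the abstract, only IMDB is named on this page so the other two dataset names are placeholders, and the severity levels and examples-per-cell count are invented for illustration. The placeholder grid is deliberately not sized to reproduce the 64,800 total.

```python
from itertools import product

MODELS = ["BERT", "RoBERTa", "Qwen-7B", "Qwen-14B", "Llama-8B", "Llama-70B"]
DATASETS = ["IMDB", "dataset_b", "dataset_c"]  # only IMDB is named on this page
PERTURBATIONS = ["swap", "deletion", "shuffling", "back-translation"]
SEVERITIES = ["low", "high"]  # placeholder levels, not the paper's
EXAMPLES_PER_CELL = 5         # placeholder count, not the paper's


def build_cases():
    """Enumerate every (model, dataset, perturbation, severity, example) case."""
    return [
        {"model": m, "dataset": d, "perturbation": p, "severity": s, "example_id": i}
        for m, d, p, s, i in product(
            MODELS, DATASETS, PERTURBATIONS, SEVERITIES, range(EXAMPLES_PER_CELL)
        )
    ]


cases = build_cases()
print(len(cases))  # 6 * 3 * 4 * 2 * 5 cases in this placeholder grid

# If the paper's design were fully crossed over models, datasets, and
# perturbation types (an assumption), 64,800 cases would imply this many
# severity-by-example combinations per cell:
print(64_800 // (len(MODELS) * len(DATASETS) * len(PERTURBATIONS)))
```

Publishing the actual factors in this form would let a reader check the reported total by simple multiplication.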

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of framework definition

The paper's headline findings (73% lower flip rates for decoders, 44% scale gain) are obtained by applying a fixed, explicitly enumerated perturbation protocol (swap/deletion/shuffling/back-translation at graded severities) to LOO-occlusion explanations on six independently trained models and three benchmarks. No equation or result is defined in terms of itself; the protocol is proposed once and then executed to produce the measurements. No fitted parameters are relabeled as predictions, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The evaluation chain is therefore self-contained and falsifiable against the stated datasets and perturbation set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the framework rests on the domain assumption that occlusion identifies salient tokens and that the listed perturbations simulate realistic noise; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption: Leave-one-out occlusion identifies the most important tokens for a model's prediction.
Basis for the explanation method and flip-rate metric.

pith-pipeline@v0.9.0 · 5535 in / 1179 out tokens · 73001 ms · 2026-05-10T14:55:05.167218+00:00 · methodology

