Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Haonan Yu; Junhao Liu; Xin Zhang; Zhenyu Yan

arxiv: 2505.12509 · v3 · submitted 2025-05-18 · 💻 cs.LG · cs.AI

Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Junhao Liu , Haonan Yu , Zhenyu Yan , Xin Zhang This is my paper

Pith reviewed 2026-05-22 14:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords proxy modelspost-hoc explanationslarge language modelsprompt compressionpoisoned examplesblack-box interpretabilitymodel-agnostic techniques

0 comments

The pith

Proxy models can approximate LLM decision boundaries to deliver over 90 percent faithful explanations at just 11 percent of the computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using smaller, cheaper proxy models to generate post-hoc explanations for large language models instead of querying the expensive LLMs directly. A screen-and-apply statistical check ensures that the proxy's local behavior aligns with the LLM before explanations are used. This approach achieves high fidelity while slashing costs, and the authors show it can guide practical tasks like compressing prompts and removing poisoned training examples. If the method holds, interpretability tools become viable for real-world LLM development rather than remaining too expensive to apply.

Core claim

We propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal.

What carries the argument

The screen-and-apply mechanism that statistically verifies local alignment between proxy and LLM before deploying explanations.

If this is right

Proxy explanations can effectively guide prompt compression to improve LLM performance.
Proxy explanations can help identify and remove poisoned examples from training data.
Interpretability shifts from passive observation to a scalable primitive for LLM development.
Model-agnostic techniques become practical for LLMs due to drastically reduced query costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy approach could be tested on other black-box systems such as large vision models to check for similar cost reductions.
Frequent use of verified proxies might enable explanation-driven loops in automated LLM tuning pipelines.
The fidelity of explanations could vary by task domain, so targeted experiments on code or math reasoning would be a direct next step.

Load-bearing premise

Smaller proxy models can approximate the decision boundaries of LLMs closely enough for local explanations to transfer reliably when verified by a statistical check.

What would settle it

A test where prompts compressed using proxy explanations show no performance gain over baseline compression methods on a held-out validation set.

read the original abstract

Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Proxy models plus a screen-and-apply check look like a workable way to cut explanation costs for LLMs, but the evidence that explanations actually transfer is still thin.

read the letter

The paper's main contribution is a practical setup that swaps in smaller proxy models to generate post-hoc explanations for LLMs, then runs a statistical check before applying those explanations to tasks like prompt compression or cleaning poisoned data. They report over 90 percent fidelity at roughly one-tenth the compute of querying the original model directly. That combination of cost reduction and downstream utility is the part worth paying attention to if the numbers hold in the full experiments.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a proxy framework for post-hoc interpretability of LLMs, in which smaller, efficient models approximate the decision boundaries of expensive target LLMs. A screen-and-apply mechanism is introduced to statistically verify local alignment prior to deployment. The central empirical claim is that proxy explanations attain over 90% fidelity at 11% of the oracle cost; the framework is then shown to support actionable tasks including prompt compression and removal of poisoned examples. Code and datasets are released.

Significance. If the fidelity and transfer claims hold under rigorous verification of explanation alignment, the work would lower the barrier to using model-agnostic interpretability at LLM scale and convert explanations into a practical primitive for prompt engineering and data sanitation. The open-sourcing of artifacts is a clear strength that supports reproducibility.

major comments (2)

[Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.
[Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.

minor comments (2)

[Abstract] The abstract would be clearer if it named the proxy model families and the concrete interpretability techniques employed.
[Discussion] A dedicated limitations paragraph should discuss failure modes of the proxy approximation (e.g., distribution shift between proxy and LLM) and the conditions under which the screen-and-apply check may still pass while explanations diverge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on methodological details and the verification process.

read point-by-point responses

Referee: [Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.

Authors: We agree that the abstract, constrained by length, omits several key details that are fully elaborated in the main text (Sections 3, 4, and 5). In the revised version we have expanded the abstract with concise additions specifying the evaluation datasets (GLUE subsets and LLM-specific benchmarks), the base interpretability methods (LIME and SHAP), the fidelity metric (attribution-vector cosine similarity and rank correlation), the baselines (direct oracle and random proxies), and the use of statistical significance testing. These changes make the central claims more self-contained while preserving abstract brevity. revision: yes
Referee: [Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.

Authors: The screen-and-apply procedure verifies alignment of explanation vectors, not merely label agreement. Section 3.2 describes sampling local neighborhoods around each instance and requiring both cosine similarity and Spearman's rank correlation on the attribution vectors (plus top-k overlap) to exceed thresholds, with significance assessed via permutation tests. We have added an expanded methods paragraph, pseudocode, and an ablation study (new Figure 3) demonstrating that label-only checks are inadequate and that the vector-based criterion preserves fidelity for the downstream tasks of prompt compression and poisoned-example removal. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent verification

full rationale

The paper presents a proxy-model framework for LLM interpretability, relying on empirical evaluation of fidelity (>90%) and cost reduction (11%), plus downstream tasks like prompt compression. The screen-and-apply mechanism is described as a statistical check for local alignment, but the central claims rest on experimental results rather than any derivation that reduces by construction to fitted parameters, self-citations, or definitional equivalence. No load-bearing step equates predictions to inputs via ansatz or renaming. The approach is self-contained against external benchmarks via reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5698 in / 983 out tokens · 33457 ms · 2026-05-22T14:15:09.301484+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment... sequential one-sided paired t-test on the paired differences di = q_proxy(xi) − τ q_oracle(xi)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.