Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models
Pith reviewed 2026-05-22 14:15 UTC · model grok-4.3
The pith
Proxy models can approximate LLM decision boundaries to deliver over 90 percent faithful explanations at just 11 percent of the computational cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal.
What carries the argument
The screen-and-apply mechanism that statistically verifies local alignment between proxy and LLM before deploying explanations.
If this is right
- Proxy explanations can effectively guide prompt compression to improve LLM performance.
- Proxy explanations can help identify and remove poisoned examples from training data.
- Interpretability shifts from passive observation to a scalable primitive for LLM development.
- Model-agnostic techniques become practical for LLMs due to drastically reduced query costs.
Where Pith is reading between the lines
- The same proxy approach could be tested on other black-box systems such as large vision models to check for similar cost reductions.
- Frequent use of verified proxies might enable explanation-driven loops in automated LLM tuning pipelines.
- The fidelity of explanations could vary by task domain, so targeted experiments on code or math reasoning would be a direct next step.
Load-bearing premise
Smaller proxy models can approximate the decision boundaries of LLMs closely enough for local explanations to transfer reliably when verified by a statistical check.
What would settle it
A test where prompts compressed using proxy explanations show no performance gain over baseline compression methods on a held-out validation set.
read the original abstract
Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a proxy framework for post-hoc interpretability of LLMs, in which smaller, efficient models approximate the decision boundaries of expensive target LLMs. A screen-and-apply mechanism is introduced to statistically verify local alignment prior to deployment. The central empirical claim is that proxy explanations attain over 90% fidelity at 11% of the oracle cost; the framework is then shown to support actionable tasks including prompt compression and removal of poisoned examples. Code and datasets are released.
Significance. If the fidelity and transfer claims hold under rigorous verification of explanation alignment, the work would lower the barrier to using model-agnostic interpretability at LLM scale and convert explanations into a practical primitive for prompt engineering and data sanitation. The open-sourcing of artifacts is a clear strength that supports reproducibility.
major comments (2)
- [Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.
- [Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.
minor comments (2)
- [Abstract] The abstract would be clearer if it named the proxy model families and the concrete interpretability techniques employed.
- [Discussion] A dedicated limitations paragraph should discuss failure modes of the proxy approximation (e.g., distribution shift between proxy and LLM) and the conditions under which the screen-and-apply check may still pass while explanations diverge.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on methodological details and the verification process.
read point-by-point responses
-
Referee: [Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.
Authors: We agree that the abstract, constrained by length, omits several key details that are fully elaborated in the main text (Sections 3, 4, and 5). In the revised version we have expanded the abstract with concise additions specifying the evaluation datasets (GLUE subsets and LLM-specific benchmarks), the base interpretability methods (LIME and SHAP), the fidelity metric (attribution-vector cosine similarity and rank correlation), the baselines (direct oracle and random proxies), and the use of statistical significance testing. These changes make the central claims more self-contained while preserving abstract brevity. revision: yes
-
Referee: [Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.
Authors: The screen-and-apply procedure verifies alignment of explanation vectors, not merely label agreement. Section 3.2 describes sampling local neighborhoods around each instance and requiring both cosine similarity and Spearman's rank correlation on the attribution vectors (plus top-k overlap) to exceed thresholds, with significance assessed via permutation tests. We have added an expanded methods paragraph, pseudocode, and an ablation study (new Figure 3) demonstrating that label-only checks are inadequate and that the vector-based criterion preserves fidelity for the downstream tasks of prompt compression and poisoned-example removal. revision: yes
Circularity Check
No significant circularity; empirical framework with independent verification
full rationale
The paper presents a proxy-model framework for LLM interpretability, relying on empirical evaluation of fidelity (>90%) and cost reduction (11%), plus downstream tasks like prompt compression. The screen-and-apply mechanism is described as a statistical check for local alignment, but the central claims rest on experimental results rather than any derivation that reduces by construction to fitted parameters, self-citations, or definitional equivalence. No load-bearing step equates predictions to inputs via ansatz or renaming. The approach is self-contained against external benchmarks via reported experiments.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment... sequential one-sided paired t-test on the paired differences di = q_proxy(xi) − τ q_oracle(xi)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.