pith. sign in

arxiv: 2505.12509 · v3 · submitted 2025-05-18 · 💻 cs.LG · cs.AI

Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Pith reviewed 2026-05-22 14:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords proxy modelspost-hoc explanationslarge language modelsprompt compressionpoisoned examplesblack-box interpretabilitymodel-agnostic techniques
0
0 comments X

The pith

Proxy models can approximate LLM decision boundaries to deliver over 90 percent faithful explanations at just 11 percent of the computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using smaller, cheaper proxy models to generate post-hoc explanations for large language models instead of querying the expensive LLMs directly. A screen-and-apply statistical check ensures that the proxy's local behavior aligns with the LLM before explanations are used. This approach achieves high fidelity while slashing costs, and the authors show it can guide practical tasks like compressing prompts and removing poisoned training examples. If the method holds, interpretability tools become viable for real-world LLM development rather than remaining too expensive to apply.

Core claim

We propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal.

What carries the argument

The screen-and-apply mechanism that statistically verifies local alignment between proxy and LLM before deploying explanations.

If this is right

  • Proxy explanations can effectively guide prompt compression to improve LLM performance.
  • Proxy explanations can help identify and remove poisoned examples from training data.
  • Interpretability shifts from passive observation to a scalable primitive for LLM development.
  • Model-agnostic techniques become practical for LLMs due to drastically reduced query costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy approach could be tested on other black-box systems such as large vision models to check for similar cost reductions.
  • Frequent use of verified proxies might enable explanation-driven loops in automated LLM tuning pipelines.
  • The fidelity of explanations could vary by task domain, so targeted experiments on code or math reasoning would be a direct next step.

Load-bearing premise

Smaller proxy models can approximate the decision boundaries of LLMs closely enough for local explanations to transfer reliably when verified by a statistical check.

What would settle it

A test where prompts compressed using proxy explanations show no performance gain over baseline compression methods on a held-out validation set.

read the original abstract

Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a proxy framework for post-hoc interpretability of LLMs, in which smaller, efficient models approximate the decision boundaries of expensive target LLMs. A screen-and-apply mechanism is introduced to statistically verify local alignment prior to deployment. The central empirical claim is that proxy explanations attain over 90% fidelity at 11% of the oracle cost; the framework is then shown to support actionable tasks including prompt compression and removal of poisoned examples. Code and datasets are released.

Significance. If the fidelity and transfer claims hold under rigorous verification of explanation alignment, the work would lower the barrier to using model-agnostic interpretability at LLM scale and convert explanations into a practical primitive for prompt engineering and data sanitation. The open-sourcing of artifacts is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.
  2. [Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it named the proxy model families and the concrete interpretability techniques employed.
  2. [Discussion] A dedicated limitations paragraph should discuss failure modes of the proxy approximation (e.g., distribution shift between proxy and LLM) and the conditions under which the screen-and-apply check may still pass while explanations diverge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on methodological details and the verification process.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported >90% fidelity and 11% cost figures are presented without any description of the datasets, the specific interpretability methods (e.g., SHAP or LIME), the exact fidelity metric (prediction agreement vs. attribution-vector similarity), baselines, or statistical tests. These omissions make it impossible to assess whether the numbers support the central claim that proxy explanations transfer reliably for downstream use.

    Authors: We agree that the abstract, constrained by length, omits several key details that are fully elaborated in the main text (Sections 3, 4, and 5). In the revised version we have expanded the abstract with concise additions specifying the evaluation datasets (GLUE subsets and LLM-specific benchmarks), the base interpretability methods (LIME and SHAP), the fidelity metric (attribution-vector cosine similarity and rank correlation), the baselines (direct oracle and random proxies), and the use of statistical significance testing. These changes make the central claims more self-contained while preserving abstract brevity. revision: yes

  2. Referee: [Screen-and-apply mechanism] Screen-and-apply mechanism (as described in the abstract and methods): the mechanism is said to 'statistically verify local alignment.' If the check operates only on output-label agreement or coarse accuracy rather than on alignment of explanation vectors (rank correlation, cosine similarity, or top-k feature overlap of attributions), it can accept proxies whose local decision surfaces differ precisely in the directions that matter for prompt compression and poisoned-example removal. This gap directly threatens the reported fidelity and utility results.

    Authors: The screen-and-apply procedure verifies alignment of explanation vectors, not merely label agreement. Section 3.2 describes sampling local neighborhoods around each instance and requiring both cosine similarity and Spearman's rank correlation on the attribution vectors (plus top-k overlap) to exceed thresholds, with significance assessed via permutation tests. We have added an expanded methods paragraph, pseudocode, and an ablation study (new Figure 3) demonstrating that label-only checks are inadequate and that the vector-based criterion preserves fidelity for the downstream tasks of prompt compression and poisoned-example removal. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent verification

full rationale

The paper presents a proxy-model framework for LLM interpretability, relying on empirical evaluation of fidelity (>90%) and cost reduction (11%), plus downstream tasks like prompt compression. The screen-and-apply mechanism is described as a statistical check for local alignment, but the central claims rest on experimental results rather than any derivation that reduces by construction to fitted parameters, self-citations, or definitional equivalence. No load-bearing step equates predictions to inputs via ansatz or renaming. The approach is self-contained against external benchmarks via reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5698 in / 983 out tokens · 33457 ms · 2026-05-22T14:15:09.301484+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.