
arxiv: 2508.14302 · v2 · pith:67ZCAXBHnew · submitted 2025-08-19 · 💻 cs.LG · cs.AI · cs.CL

GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs

Pith reviewed 2026-05-18 21:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords inference-time sparsification · LLM pruning · global-local aggregation · rank aggregation · training-free method · FFN neuron selection · dynamic pruning · short-prompt long-generation

The pith

GLASS fuses a global model-intrinsic prior with local prompt activations via rank aggregation to produce reliable neuron masks for inference-time FFN pruning even with short prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing training-free methods estimate which feedforward network neurons to keep using only the current prompt, yet this signal is often unreliable for short prompts followed by long generations and produces inaccurate masks that hurt output quality. The paper establishes that adding a stable global prior taken from the model itself and combining the two rankings through rank aggregation stabilizes selection. This fusion is presented as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model. A reader would care because the approach is plug-and-play and training-free, delivering lower perplexity and KL divergence plus faster on-device decoding across several open-source LLMs in the hardest short-prompt regimes.

Core claim

The paper claims that prompt-only neuron importance is frequently unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. GLASS stabilizes dynamic FFN pruning by aggregating local prompt-specific activations with a global model-intrinsic prior through rank aggregation. The authors interpret the weighted rank-aggregation rule as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model. When tested on diverse open-source LLMs, GLASS yields up to 45.10 percent lower perplexity and 25.73 percent lower KL divergence than prior training-free baselines in short-prompt long-generation scenarios, as

What carries the argument

The weighted rank-aggregation rule that fuses a global model-intrinsic prior ranking with a local prompt-specific activation ranking to select which FFN neurons to retain.
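The fusion step can be illustrated with a minimal sketch. It assumes a weighted Borda-style average of rank positions, which is one common form of weighted rank aggregation and not necessarily the exact rule used in the paper. The names `ranks_from_scores`, `fuse_and_select`, `lam`, and `keep_ratio` are hypothetical, as is the choice of how the per-neuron scores are obtained.

```python
import numpy as np


def ranks_from_scores(scores):
    """Return the rank position of each neuron (0 = most important)."""
    order = np.argsort(-scores, kind="stable")  # indices, best score first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks


def fuse_and_select(local_scores, global_scores, lam=0.5, keep_ratio=0.5):
    """Fuse two importance rankings and return a boolean keep-mask.

    lam is a hypothetical weight on the global prior (0 = local only,
    1 = global only); keep_ratio is the fraction of neurons retained.
    """
    local_rank = ranks_from_scores(np.asarray(local_scores, dtype=float))
    global_rank = ranks_from_scores(np.asarray(global_scores, dtype=float))
    # Assumed fusion rule: weighted average of the two rank positions.
    fused = (1.0 - lam) * local_rank + lam * global_rank
    k = max(1, int(round(keep_ratio * len(fused))))
    keep = np.argsort(fused, kind="stable")[:k]  # lowest fused rank wins
    mask = np.zeros(len(fused), dtype=bool)
    mask[keep] = True
    return mask


if __name__ == "__main__":
    local = [0.9, 0.1, 0.5, 0.3]   # e.g. prompt-time activation magnitudes
    prior = [0.2, 0.8, 0.6, 0.1]   # e.g. a prompt-independent importance score
    print(fuse_and_select(local, prior, lam=0.5, keep_ratio=0.5))
```

Setting `lam=0` recovers a prompt-only mask and `lam=1` a purely global one, so the same function can serve as both the baseline and the fused variant in a comparison.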

If this is right

  • Critical-neuron selection becomes robust even for short prompts that previously produced unstable masks.
  • Generation quality improves, with measured reductions of up to 45.10 percent in perplexity and 25.73 percent in KL divergence versus prior training-free methods.
  • On-device decoding speed increases while preserving fidelity in long outputs.
  • The method applies directly to a range of open-source LLMs without retraining or architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar global-local rank fusion could be tested on other dynamic decisions such as attention-head pruning or KV-cache eviction.
  • The permutation-model interpretation opens the possibility of attaching uncertainty estimates to the resulting masks.
  • Model-intrinsic priors may prove useful for other inference-time adaptation techniques that currently rely only on prompt statistics.
  • The approach suggests a general pattern: many prompt-dependent efficiency heuristics can be stabilized by a cheap, input-independent model property.

Load-bearing premise

The global model-intrinsic prior supplies a stable complementary signal to local prompt activations that rank aggregation can reliably fuse even when the prompt is short.

What would settle it

Run GLASS and a prompt-only baseline on the same short-prompt long-generation benchmark for one open-source LLM and measure whether perplexity and KL divergence improve; failure to show consistent gains would falsify the central claim.
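The two metrics in this check can be computed from per-token logits, as in the sketch below. It assumes perplexity is measured against reference next tokens and that KL divergence is taken from the dense model's distribution to the sparsified model's; the paper's exact definitions may differ. The helper names `log_softmax`, `perplexity`, and `mean_kl` are illustrative.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def perplexity(logits, targets):
    """Perplexity of the reference tokens under logits of shape (T, V)."""
    logp = log_softmax(np.asarray(logits, dtype=float))
    token_logp = logp[np.arange(len(targets)), np.asarray(targets)]
    return float(np.exp(-token_logp.mean()))


def mean_kl(dense_logits, sparse_logits):
    """Mean per-token KL(dense || sparse); the direction is an assumption."""
    logp = log_softmax(np.asarray(dense_logits, dtype=float))
    logq = log_softmax(np.asarray(sparse_logits, dtype=float))
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())
```

Running both functions on the same generated continuations for the prompt-only mask and the fused mask gives the paired comparison described above.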

Original abstract

Inference-time sparsification is a promising path to deploy large language models (LLMs) on resource-constrained devices, yet existing training-free methods typically estimate feedforward network (FFN) neuron importance from the input prompt alone. We show this prompt-only signal is often unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. We propose GLASS, a plug-and-play, training-free framework that stabilizes dynamic FFN pruning by aggregating two complementary views of neuron criticality: local prompt-specific activations and a global model-intrinsic prior. GLASS fuses global and local signals via rank aggregation, yielding robust critical-neuron selection even when the prompt is short. We interpret GLASS as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model, providing a principled foundation for its weighted rank-aggregation rule. We apply GLASS to a diverse set of open-source LLMs, and show that it yields substantial improvements over prior training-free baselines in the challenging short-prompt, long-generation scenarios, achieving up to 45.10% lower perplexity and 25.73% lower KL divergence, while delivering significant on-device decoding speedup.
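To make the MAP reading concrete, here is one standard way a weighted rank-aggregation rule can arise from a permutation model. It assumes a Mallows-type model with Spearman (squared rank difference) distance and a uniform prior over consensus rankings; the abstract does not specify the paper's model, so the distance, the dispersion parameters $\theta_\ell, \theta_g$, and the weight $\lambda$ are all assumptions made for illustration.

```latex
% Assumed setup: \sigma_\ell(j), \sigma_g(j) are the local and global rank
% positions of neuron j; \pi is the unknown consensus ranking.
\begin{align*}
  P(\sigma_v \mid \pi) &\propto \exp\bigl(-\theta_v\, d(\sigma_v, \pi)\bigr),
  \qquad d(\sigma, \pi) = \sum_{j} \bigl(\sigma(j) - \pi(j)\bigr)^2,
  \qquad v \in \{\ell, g\}. \\[4pt]
  \hat{\pi}_{\mathrm{MAP}}
  &= \arg\min_{\pi}\; \theta_\ell\, d(\sigma_\ell, \pi) + \theta_g\, d(\sigma_g, \pi) \\
  &= \arg\max_{\pi}\; \sum_{j} \pi(j)\,\bigl(\theta_\ell\, \sigma_\ell(j) + \theta_g\, \sigma_g(j)\bigr),
\end{align*}
% The second line holds because \sum_j \pi(j)^2 and \sum_j \sigma_v(j)^2 are
% the same for every permutation. By the rearrangement inequality, the
% maximizer orders neurons by the weighted rank score
\[
  s_j = (1 - \lambda)\, \sigma_\ell(j) + \lambda\, \sigma_g(j),
  \qquad \lambda = \frac{\theta_g}{\theta_\ell + \theta_g},
\]
% so the neuron with the smallest s_j receives the top consensus rank.
```

Under these assumptions the weight on the global ranking grows as the global view is modeled as less dispersed relative to the local one (larger $\theta_g$), which matches the intuition that a noisy short-prompt signal should count for less.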

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GLASS, a plug-and-play training-free framework for inference-time FFN sparsification in LLMs. It aggregates local prompt-specific neuron activations with a global model-intrinsic prior via rank aggregation, interpreted as the MAP consensus ranking under a permutation-based probabilistic model. The method is evaluated on diverse open-source LLMs and claims substantial gains over prior training-free baselines specifically in short-prompt, long-generation regimes, including up to 45.10% lower perplexity, 25.73% lower KL divergence, and on-device decoding speedups.

Significance. If the central performance claims hold under rigorous controls, GLASS could meaningfully advance practical inference-time sparsification by mitigating the unreliability of prompt-only signals. The explicit probabilistic framing of the aggregation rule is a conceptual strength, as is the focus on the short-prompt/long-generation regime that is practically relevant for on-device use. Reproducible code or parameter-free derivations are not mentioned, but the multi-model empirical evaluation provides a starting point for assessing generalizability.

major comments (3)
  1. [Abstract] Abstract: the headline quantitative claims (45.10% lower perplexity, 25.73% lower KL divergence) are presented without any description of experimental setup, baseline implementations, number of runs, statistical significance tests, or data-exclusion rules. This directly affects verifiability of the central performance claim that GLASS reliably outperforms prior training-free methods in the short-prompt regime.
  2. [Probabilistic model / §3] The probabilistic model section: the permutation-based model is invoked to justify the weighted rank-aggregation rule, yet it is unclear whether the model (including assumptions on global-ranking variance and aggregation weights) was derived independently of the empirical results or functions primarily as post-hoc rationalization. If the former, the derivation should be shown to be load-bearing for the fusion weights; if the latter, the 'principled foundation' claim is weakened.
  3. [Experiments / short-prompt results] Evaluation on short prompts: the central premise that the global prior remains stable and dominant when local activations have high variance (especially for short prompts) is load-bearing for the claim that rank aggregation 'stabilizes selection.' No ablation isolating global-prior variance or testing the separation margin between global and local rankings is referenced, leaving the weakest assumption untested.
minor comments (2)
  1. [Method] Notation for the rank-aggregation weights and the precise definition of the global prior should be clarified with explicit equations to avoid ambiguity when reproducing the fusion step.
  2. [Figures/Tables] Figure captions and table legends should explicitly state the prompt lengths, generation lengths, and sparsity ratios used in each reported metric to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and planned revisions to improve the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline quantitative claims (45.10% lower perplexity, 25.73% lower KL divergence) are presented without any description of experimental setup, baseline implementations, number of runs, statistical significance tests, or data-exclusion rules. This directly affects verifiability of the central performance claim that GLASS reliably outperforms prior training-free methods in the short-prompt regime.

    Authors: We agree that the abstract would be strengthened by additional context for the headline claims. In the revised manuscript we will expand the abstract to briefly note the LLMs evaluated, the short-prompt long-generation evaluation protocol, the training-free baselines, and the primary metrics. Full experimental details, including run counts and implementation choices, remain in the Experiments section, but the abstract will now be more self-contained.
    Revision: yes

  2. Referee: [Probabilistic model / §3] The probabilistic model section: the permutation-based model is invoked to justify the weighted rank-aggregation rule, yet it is unclear whether the model (including assumptions on global-ranking variance and aggregation weights) was derived independently of the empirical results or functions primarily as post-hoc rationalization. If the former, the derivation should be shown to be load-bearing for the fusion weights; if the latter, the 'principled foundation' claim is weakened.

    Authors: The permutation-based model was constructed to supply a formal justification for the aggregation rule. We will revise §3 to present the derivation in full before the empirical results, explicitly showing how the variance assumptions and MAP estimation produce the weighted fusion rule. This will make clear that the model is load-bearing for the weights rather than a retrospective interpretation.
    Revision: yes

  3. Referee: [Experiments / short-prompt results] Evaluation on short prompts: the central premise that the global prior remains stable and dominant when local activations have high variance (especially for short prompts) is load-bearing for the claim that rank aggregation 'stabilizes selection.' No ablation isolating global-prior variance or testing the separation margin between global and local rankings is referenced, leaving the weakest assumption untested.

    Authors: We acknowledge that a dedicated ablation on global-prior stability and the separation margin would provide stronger empirical support. We will add an ablation study in the revised Experiments section that quantifies global-ranking variance across short prompts, measures the margin between global and local rankings, and shows how aggregation improves selection stability under high local variance.
    Revision: yes
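One simple way to quantify the selection stability discussed in the third exchange is the mean pairwise Jaccard overlap of the keep-masks produced for different short prompts. The sketch below is an illustrative measure only; it is not an ablation described in the paper, and the function names `topk_mask` and `mean_pairwise_jaccard` are hypothetical.

```python
import itertools

import numpy as np


def topk_mask(scores, k):
    """Boolean mask marking the k highest-scoring neurons."""
    scores = np.asarray(scores, dtype=float)
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(-scores, kind="stable")[:k]] = True
    return mask


def mean_pairwise_jaccard(masks):
    """Average Jaccard overlap over all pairs of masks (1.0 = identical)."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    if len(masks) < 2:
        raise ValueError("need at least two masks to compare")
    overlaps = []
    for a, b in itertools.combinations(masks, 2):
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        # Two empty masks are treated as perfectly overlapping.
        overlaps.append(inter / union if union else 1.0)
    return float(np.mean(overlaps))
```

Computing this statistic once for prompt-only masks and once for fused masks, over the same set of short prompts, would show whether aggregation actually reduces mask variability.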

Circularity Check

0 steps flagged

No significant circularity in GLASS derivation chain

Full rationale

The paper defines GLASS as a training-free aggregation of local prompt activations and a global model-intrinsic prior via rank aggregation, then offers a permutation-based probabilistic interpretation as MAP consensus to justify the weighted rule. This interpretation is presented after the method is specified rather than used to derive the aggregation from first principles in a way that equates outputs to inputs by construction. Reported gains (lower perplexity, KL divergence, on-device speedup) are measured on external benchmarks across multiple LLMs and scenarios, independent of any fitted parameter or self-citation chain. No equations or steps in the provided text reduce the central claim to a tautology, self-citation load-bearing premise, or renamed known result; the work remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a global model-intrinsic prior exists and is complementary to prompt-specific signals; no free parameters or new invented entities are described in the abstract.

axioms (1)
  • domain assumption: Global model-intrinsic prior provides a reliable complementary signal to local prompt-specific activations for estimating neuron criticality.
    Invoked to address unreliability of prompt-only signals especially for short prompts and long-form decoding.

pith-pipeline@v0.9.0 · 5776 in / 1355 out tokens · 43008 ms · 2026-05-18T21:53:06.683858+00:00 · methodology
