SRA: Span Representation Alignment for Large Language Model Distillation

Hoang Son Nguyen; Linh Ngo Van; Nguyen Thi Ngoc Diep; Pham Khanh Chi; Quoc Phong Dao; Trung Le; Tung Nguyen

arxiv: 2605.01205 · v2 · pith:V3UYQNCNnew · submitted 2026-05-02 · 💻 cs.CL

SRA: Span Representation Alignment for Large Language Model Distillation

Quoc Phong Dao , Hoang Son Nguyen , Pham Khanh Chi , Tung Nguyen , Linh Ngo Van , Nguyen Thi Ngoc Diep , Trung Le This is my paper

Pith reviewed 2026-05-09 15:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords knowledge distillationlarge language modelscross-tokenizer distillationspan representationcenter of massrepresentation alignmentmodel compression

0 comments

The pith

SRA shifts LLM distillation alignment from tokens to attention-weighted span centers of mass for better cross-tokenizer transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SRA as a framework for knowledge distillation between large language models and smaller students that use mismatched tokenizers. It claims that token-level alignment is brittle, so the key is to aggregate tokens first into spans and align the spans instead. Each span is treated as a cluster of particles whose state is captured by its center of mass, an attention-weighted average of the tokens inside it. A geometric regularizer keeps the representation space intact and aligned span logits carry the distilled knowledge. Experiments across different model architectures show consistent gains over prior token-based methods.

Core claim

SRA reframes cross-tokenizer knowledge distillation by moving the alignment target from individual tokens to robust spans, each represented by its attention-weighted center of mass under a multi-particle dynamical systems model, and demonstrates that this produces representations that are more stable across tokenizers and yield stronger distillation performance than token-level baselines.

What carries the argument

The span center of mass, defined as the attention-weighted average of token representations within a span and treated as the state of a particle cluster in a multi-particle dynamical system.

If this is right

Distillation performance becomes less dependent on the exact token boundaries chosen by each model's tokenizer.
Attention weighting focuses alignment on the most salient spans, preserving semantic content that would be diluted at the token level.
The geometric regularizer maintains structural consistency in the shared representation space during transfer.
Adding aligned span logit distillation supplies an extra channel for knowledge transfer beyond representation matching alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same span-center approach could be tested on other cross-model tasks such as retrieval or translation where tokenizers also differ.
If the particle-cluster framing is useful, it might suggest treating attention heads themselves as dynamical systems whose equilibria can be aligned directly.
The method may scale to distillation involving multimodal models where spans could be defined over image patches or audio segments as well.

Load-bearing premise

Modeling spans as particle clusters and using their attention-weighted centers of mass produces representations that remain robust to tokenizer mismatch and carry more useful information for distillation than token-level aggregation.

What would settle it

Re-running the reported cross-architecture distillation experiments but replacing the attention-weighted span center of mass with either token-level alignment or non-attention-weighted span averages, and checking whether the performance gap over CTKD baselines disappears.

Figures

Figures reproduced from arXiv: 2605.01205 by Hoang Son Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, Pham Khanh Chi, Quoc Phong Dao, Trung Le, Tung Nguyen.

**Figure 1.** Figure 1: An illustration of the tokenizer mismatch view at source ↗

**Figure 2.** Figure 2: An illustration of the proposed SRA framework. Teacher–student spans are first matched using longest view at source ↗

**Figure 3.** Figure 3: Win rates (%) for distilling Qwen 2.5-7B→GPT2 1.5B, evaluated by GPT-4o-mini view at source ↗

**Figure 4.** Figure 4: Prompt for GPT-4 evaluation view at source ↗

read the original abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce \textbf{SRA} (\textbf{S}pan \textbf{R}epresentation \textbf{A}lignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SRA offers a practical shift to span-level alignment with attention-weighted centers of mass for cross-tokenizer distillation, but the abstract gives no numbers or implementation details to judge whether the gains are real or tuned.

read the letter

Hi there, The punchline on this paper is that it proposes shifting from token-level to span-level alignment in cross-tokenizer knowledge distillation, using a center-of-mass representation derived from attention weights, framed loosely as multi-particle dynamics. This targets a practical issue where different tokenizers make direct token matching unreliable. What the work does reasonably well is highlight that how you aggregate information matters just as much as how you align it. By defining spans and computing their centers of mass with attention-based weights, they aim for representations that are more tokenizer-agnostic and semantically richer. Adding a geometric regularizer to maintain structure in the representation space and performing distillation on the aligned span logits are concrete additions. The abstract suggests this leads to better performance in challenging cross-architecture setups compared to existing CTKD baselines. The physical analogy provides a nice way to think about the clustering of tokens into spans, even if it doesn't lead to new theorems or predictions. It engages with prior work on span representations and attention aggregation by adapting them to the distillation setting. That said, the soft spots are noticeable. The claims of consistent and significant outperformance lack any supporting numbers, error bars, or ablation studies in the abstract. Details on how spans are chosen, the exact form of the geometric regularizer, and implementation specifics are missing, which raises questions about whether the method generalizes or if results depend on careful tuning on the evaluation data. The multi-particle framing seems more like inspirational packaging than a load-bearing mathematical contribution that could be tested independently. This paper would appeal to people in the model compression and efficient inference community, especially those dealing with distillation across models with incompatible tokenizers. A reader interested in practical improvements to LLM deployment might get some ideas from the span aggregation strategy. Overall, I would recommend sending it for peer review. The idea is straightforward and addresses a genuine bottleneck, so referees can evaluate whether the experimental evidence holds up and if the method offers clear advantages over simpler aggregation baselines.

Referee Report

2 major / 2 minor

Summary. The paper introduces SRA, a framework for cross-tokenizer knowledge distillation (CTKD) that reframes alignment through multi-particle dynamical systems. It shifts from token-level to span-level representations, where each span is modeled as a cluster of particles whose state is captured by an attention-weighted center of mass (CoM). The method adds a geometric regularizer to maintain structural properties of the representation space and aligned span-logit distillation for improved transfer. The central empirical claim is that SRA consistently and significantly outperforms state-of-the-art CTKD baselines in cross-architecture distillation experiments.

Significance. If the reported gains prove robust, SRA could offer a practical advance for distilling knowledge between LLMs with mismatched tokenizers and architectures by using higher-level, semantically richer alignment units. The physical-systems framing provides intuitive motivation for the CoM construction and regularizer, and the combination of components addresses a known brittleness in token-level CTKD. Reproducibility would be strengthened by the explicit empirical validation against baselines.

major comments (2)

[§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.
[§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.

minor comments (2)

[Abstract] Abstract: The phrase 'challenging cross-architecture distillation experiments' should name the specific teacher-student architecture pairs and datasets to allow immediate assessment of the claim's scope.
[Notation] Notation: Ensure consistent use of symbols for spans, CoM, and the geometric regularizer across sections; a table summarizing all hyperparameters would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification and additional empirical support. We address each major comment point by point below, indicating the revisions we will incorporate in the updated version.

read point-by-point responses

Referee: [§4 (Experiments)] §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.

Authors: We appreciate the referee's emphasis on empirical rigor. While Section 4 reports performance numbers on cross-architecture pairs (e.g., Llama-2 to Mistral and similar), we acknowledge that error bars, explicit dataset/model tables, and component ablations were not sufficiently detailed. In the revision we will add: (i) mean and standard deviation over three random seeds for all main results, (ii) a summary table listing exact datasets, model sizes, and tokenizer vocabularies, and (iii) ablation tables isolating span selection heuristics, attention-based CoM weighting, and the geometric regularizer. These additions will directly address whether the observed gains exceed those obtainable from simpler aggregation baselines or post-hoc tuning. revision: yes
Referee: [§3.2 (Center of Mass formulation)] §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.

Authors: We agree that the current description in §3.2 lacks sufficient mathematical detail. The attention weights are normalized with a softmax taken exclusively over the tokens belonging to each span (ensuring they sum to one). Span boundaries are aligned across tokenizers by first recovering word-level segments from the original text via a deterministic detokenization step, then projecting those segments onto each model's subword sequence; this mapping uses no learned parameters. The weighting itself is taken directly from the teacher's attention heads with no additional hyperparameters. We will revise §3.2 to include the explicit normalized CoM equation, the word-level alignment procedure, and pseudocode, thereby clarifying that the construction is tokenizer-agnostic and motivated by the multi-particle analogy rather than being an arbitrary fitted aggregator. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent experimental validation

full rationale

The paper defines SRA via an explicit modeling choice (attention-weighted span CoM under a multi-particle analogy) and reports empirical gains on cross-architecture distillation benchmarks. No equations, uniqueness theorems, or self-citations are shown that reduce the reported performance to a fitted parameter or to the input data by construction. The physical framing functions as interpretive motivation for the aggregation unit; success is measured by downstream distillation metrics rather than by any internal identity or self-referential prediction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that attention-weighted span centers of mass capture semantic information more robustly than token-level or other aggregation methods, and that the multi-particle dynamical systems framing supplies useful inductive bias. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5551 in / 1217 out tokens · 29519 ms · 2026-05-09T15:18:31.469907+00:00 · methodology

SRA: Span Representation Alignment for Large Language Model Distillation

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)