Short Data, Long Context: Distilling Positional Knowledge in Transformers

Adithya Sagar; Chinnadhurai Sankar; Ernie Chang; Igor Fedorov; Md Rifat Arefin; Patrick Huber; Rylan Conway

arxiv: 2604.06070 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.LG

Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar

Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3

keywords knowledge distillation · long context · rotary position embeddings · transformers · positional information · retrieval · model extension

The pith

Long-context retrieval transfers to students via logit distillation on packed short sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a teacher model's ability to retrieve information over long distances can be passed to a student model without ever showing the student real long sequences. Short samples are packed together to fill a long training window, and the student is trained to match the teacher's output probabilities at each step. This logit matching transfers positional signals because the teacher's logits already reflect where tokens sit in the full window. Analysis through the lens of Rotary Position Embedding (RoPE) indicates that phase-wise scaling of the embeddings gives the strongest transfer and that query-state updates during training follow structured patterns concentrated in particular parameter spans. The result suggests long-context extension can be done with far less data and compute than direct pre-training on long text.
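The packing step can be illustrated with a minimal sketch. It assumes greedy concatenation, padding to the window length, and position ids that run continuously across the whole window rather than resetting at sample boundaries; the paper's abstract does not spell out these details, and the function name `pack_short_samples` is hypothetical.

```python
from typing import List, Tuple


def pack_short_samples(
    samples: List[List[int]], window: int, pad_id: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Greedily concatenate short token sequences into one long window.

    Returns (tokens, position_ids, segment_ids), each of length `window`.
    Assumption: position ids are NOT reset at sample boundaries, so later
    samples sit at large absolute positions inside the window.
    """
    tokens: List[int] = []
    segments: List[int] = []
    for seg, sample in enumerate(samples):
        if len(tokens) + len(sample) > window:
            break  # the next sample would overflow the window
        tokens.extend(sample)
        segments.extend([seg] * len(sample))
    n_pad = window - len(tokens)
    tokens.extend([pad_id] * n_pad)
    segments.extend([-1] * n_pad)  # -1 marks padding
    positions = list(range(window))
    return tokens, positions, segments


# Three toy "documents" packed into a window of 8 tokens.
samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
tokens, positions, segments = pack_short_samples(samples, window=8)
```

The segment ids are returned only so that a training loop could mask padding or inspect sample boundaries; whether attention is allowed to cross those boundaries is a separate design choice that this sketch leaves open.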

Core claim

The authors present evidence that logit-based knowledge distillation on packed short-context samples inside a long-context window is sufficient to transfer long-context retrieval capabilities. They trace how positional perturbations in query and key vectors propagate through successive layers to shape the teacher's output distribution, thereby supplying a usable training signal to the student. Phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, gives the best long-context performance, and the query states exhibit structured update patterns, with distinct parameter spans showing strong sensitivity when the context length is extended.

What carries the argument

Logit-based knowledge distillation applied to short sequences packed into a long window, with positional effects traced through Rotary Position Embeddings and layer-wise propagation to output logits.
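As a rough illustration of what logit-based distillation computes, the sketch below takes per-position logits from a teacher and a student and averages the KL divergence between their softmax distributions. The forward KL direction, the single temperature, and the omission of any temperature-squared scaling are assumptions in the style of standard knowledge distillation, not details confirmed by the paper; `kd_loss` is a hypothetical name.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def kd_loss(
    teacher_logits: List[List[float]],
    student_logits: List[List[float]],
    temperature: float = 1.0,
) -> float:
    """Mean over positions of KL(teacher || student)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Two positions, vocabulary of three tokens (toy values).
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.5]]
loss_mismatch = kd_loss(teacher, student)
loss_match = kd_loss(teacher, teacher)
```

Because the loss is evaluated at every position in the packed window, any position dependence in the teacher's distribution becomes part of the target the student is asked to reproduce.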

If this is right

Long-context model development no longer requires collecting or training on native long documents.
Phase-wise RoPE scaling becomes the default schedule for distillation-based length extension.
Positional information flows measurably from teacher logits into the student even without explicit position labels.
Query-state parameter updates during extension follow structured patterns, concentrated in distinct parameter spans, that can be monitored.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same packing-plus-distillation recipe may work for other positional encodings such as ALiBi or learned absolute embeddings.
Downstream tasks that require cross-document reasoning could be used to test whether the transferred retrieval is functionally useful rather than only measured on synthetic probes.
If the method generalizes, training pipelines could generate synthetic short packs on the fly instead of storing large long-context corpora.

Load-bearing premise

That matching teacher logits on packed short samples inside a long window gives the student genuine long-range positional retrieval rather than just local pattern matching.

What would settle it

Measure whether a distilled student can correctly retrieve a fact whose position lies beyond the length of any single packed short sample; if retrieval accuracy drops to chance levels while the teacher succeeds, the claim fails.
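One way to organize such a test is to bucket retrieval probes by whether the needle-to-query distance fits inside a single packed sample. The sketch below assumes each probe records that distance in tokens; `Probe`, `accuracy_by_range`, and the stand-in model are hypothetical and only show the bookkeeping, not a real evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    prompt: str    # context containing the fact, followed by the question
    expected: str  # gold answer
    distance: int  # tokens between the fact and the question


def accuracy_by_range(
    probes: List[Probe],
    answer_fn: Callable[[str], str],
    max_sample_len: int,
) -> Dict[str, float]:
    """Accuracy for probes within vs. beyond the longest packed sample."""
    hits = {"within": 0, "beyond": 0}
    counts = {"within": 0, "beyond": 0}
    for p in probes:
        bucket = "within" if p.distance <= max_sample_len else "beyond"
        counts[bucket] += 1
        hits[bucket] += int(answer_fn(p.prompt) == p.expected)
    return {b: hits[b] / counts[b] for b in counts if counts[b] > 0}


# Stand-in "model" that only answers short-range probes correctly.
probes = [
    Probe("p0", "a", 2),
    Probe("p1", "b", 3),
    Probe("p2", "c", 50),
    Probe("p3", "d", 80),
]
local_only = {"p0": "a", "p1": "b"}
result = accuracy_by_range(
    probes, lambda prompt: local_only.get(prompt, ""), max_sample_len=4
)
```

A student that had learned only local pattern matching would look like the stand-in here: high accuracy in the "within" bucket and near-chance accuracy in the "beyond" bucket, while the teacher stays high in both.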

Original abstract

Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
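For readers unfamiliar with RoPE, the sketch below shows the standard rotation and two properties relevant to the abstract: the query-key score depends only on the relative offset between positions, and enlarging the frequency base lengthens the slowest wavelength. It uses the adjacent-pair formulation with frequencies base^(-2i/d). Changing the base is used here only as one common way to rescale RoPE; the paper's specific phase-wise schedule is not given in this summary, and the vectors and base values are illustrative.

```python
import math
from typing import List


def rope_frequencies(dim: int, base: float = 10000.0) -> List[float]:
    """Rotation frequency for each adjacent pair of dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each (even, odd) pair of `vec` by the angle pos * theta_i."""
    out: List[float] = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q: List[float], k: List[float], q_pos: int, k_pos: int,
          base: float = 10000.0) -> float:
    """Dot product of the rotated query and key."""
    rq, rk = apply_rope(q, q_pos, base), apply_rope(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
near = score(q, k, 10, 7)
far = score(q, k, 1010, 1007)

# Slowest wavelength (in tokens) for two illustrative base values.
wavelength_small_base = 2 * math.pi / rope_frequencies(4, 1e4)[-1]
wavelength_large_base = 2 * math.pi / rope_frequencies(4, 1e6)[-1]
```

Up to floating-point error, `near` and `far` agree, which is why a perturbation to positions shows up in attention scores only through relative distances.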

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation even when training exclusively on packed short-context samples within a long-context window. It reports three empirical findings: phase-wise RoPE scaling maximizes long-context performance in distillation; logit distillation directly enables positional information transfer, demonstrated by tracing RoPE perturbations in query/key states through layers to output logits on packed repeated-token sequences; and long-context extension produces structured update patterns in the query state with distinct parameter spans showing differential sensitivity.

Significance. If the central claim holds, the result would offer a practical route to long-context extension that avoids the full cost of long pre-training and long data collection, which is a meaningful efficiency advance for transformer scaling. The mechanistic tracing of how RoPE-induced perturbations propagate to logits and the identification of structured query-state updates constitute concrete strengths that could inform future distillation and position-encoding work. These elements provide falsifiable, layer-wise observations rather than purely black-box performance claims.

major comments (2)

[Second finding / experimental setup with packed repeated token sequences] The experimental setup with packed repeated token sequences (described in the second finding and associated analysis) is load-bearing for the claim that distillation supplies a faithful signal for genuine long-context positional retrieval. Repeated tokens create low-entropy sequences in which positional cues are unusually salient and content-based retrieval is trivial; without an ablation that replaces repeated tokens with diverse natural text while preserving the same packing and distillation protocol, or a control that severs the positional channel (e.g., position-agnostic teacher), it remains possible that the student acquires superficial periodicity patterns rather than general long-range retrieval. This directly affects whether the reported transfer generalizes beyond the artificial setup.
[Abstract and results on the three findings] The abstract and results sections state three empirical findings yet supply no quantitative metrics, error bars, baseline comparisons, or ablation tables for the distillation experiments. Without these, the magnitude of the positional transfer, its statistical reliability, and its improvement over non-distillation long-context baselines cannot be assessed, weakening support for the central efficiency claim.

minor comments (2)

[Methods / RoPE scaling description] Notation for RoPE phase-wise scaling and the precise definition of 'packed short-context samples within a long-context window' should be formalized with an equation or diagram in the methods section for reproducibility.
[Third finding / query state analysis] The third finding on 'structured update patterns' and 'distinct parameter spans' would benefit from explicit identification of the affected parameter indices or layers and a quantitative sensitivity metric.
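A quantitative sensitivity metric of the kind requested in the last comment could be as simple as the relative size of the weight update within each span of the query projection. The sketch below assumes spans are contiguous row ranges of a weight matrix and uses a relative Frobenius norm; both choices, and the name `span_sensitivity`, are illustrative rather than taken from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def span_sensitivity(
    before: Matrix, after: Matrix, spans: List[Tuple[int, int]]
) -> List[float]:
    """Relative Frobenius norm of the update for each row span [start, end)."""
    scores: List[float] = []
    for start, end in spans:
        delta_sq = 0.0
        base_sq = 0.0
        for r in range(start, end):
            for b, a in zip(before[r], after[r]):
                delta_sq += (a - b) ** 2
                base_sq += b ** 2
        # Small epsilon guards against an all-zero span.
        scores.append(math.sqrt(delta_sq) / (math.sqrt(base_sq) + 1e-12))
    return scores


# Toy 4x2 "query projection" before and after long-context training.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
after = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [2.0, 1.0]]
scores = span_sensitivity(before, after, spans=[(0, 2), (2, 4)])
```

In this toy case only the second span changes, so it receives a nonzero score while the first scores zero; reporting such scores per layer and per span would make "distinct parameter spans" checkable.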

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects of our experimental design and presentation. We address each major comment below and commit to revisions that strengthen the evidence for positional transfer via distillation.

Point-by-point responses

Referee: [Second finding / experimental setup with packed repeated token sequences] The experimental setup with packed repeated token sequences (described in the second finding and associated analysis) is load-bearing for the claim that distillation supplies a faithful signal for genuine long-context positional retrieval. Repeated tokens create low-entropy sequences in which positional cues are unusually salient and content-based retrieval is trivial; without an ablation that replaces repeated tokens with diverse natural text while preserving the same packing and distillation protocol, or a control that severs the positional channel (e.g., position-agnostic teacher), it remains possible that the student acquires superficial periodicity patterns rather than general long-range retrieval. This directly affects whether the reported transfer generalizes beyond the artificial setup.

Authors: The repeated-token packing was deliberately chosen to create a low-entropy regime that isolates positional signals, enabling direct tracing of RoPE perturbations from query/key states through layers to output logits without content-based confounds. This mechanistic probe supports the second finding on how distillation propagates positional information. We agree, however, that the setup alone does not fully demonstrate generalization to natural text. In the revised manuscript we will add an ablation using packed sequences of diverse natural language while keeping the same distillation protocol, along with a position-agnostic teacher control, to quantify whether the observed transfer relies on genuine long-range positional retrieval.

Revision: yes

Referee: [Abstract and results on the three findings] The abstract and results sections state three empirical findings yet supply no quantitative metrics, error bars, baseline comparisons, or ablation tables for the distillation experiments. Without these, the magnitude of the positional transfer, its statistical reliability, and its improvement over non-distillation long-context baselines cannot be assessed, weakening support for the central efficiency claim.

Authors: We acknowledge that the current presentation of the three findings would benefit from more explicit quantitative support. The manuscript already reports retrieval accuracies on standard long-context benchmarks, but we will expand the abstract and results section to include (i) numerical deltas with standard-error bars from multiple random seeds, (ii) direct comparisons against non-distillation long-context baselines, and (iii) ablation tables isolating the contribution of logit distillation, phase-wise RoPE scaling, and packing. These additions will allow readers to evaluate the magnitude and reliability of the reported transfer.

Revision: yes
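The "numerical deltas with standard-error bars" promised in item (i) amount to a small calculation, sketched below. The sketch assumes runs are paired by random seed and uses the sample standard deviation; the accuracy values are made-up placeholders for illustration, not results from the paper.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_stderr(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample standard deviation)."""
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


def delta_with_stderr(
    treatment: List[float], baseline: List[float]
) -> Tuple[float, float]:
    """Mean per-seed improvement and its standard error (paired by seed)."""
    diffs = [t - b for t, b in zip(treatment, baseline)]
    return mean_and_stderr(diffs)


# Placeholder accuracies for three seeds (NOT results from the paper).
distilled = [0.5, 0.75, 1.0]
baseline = [0.25, 0.5, 0.25]
delta, stderr = delta_with_stderr(distilled, baseline)
```

Pairing by seed removes seed-to-seed variance that is shared between the two conditions; if the runs are not paired, the two standard errors would need to be combined instead.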

Circularity Check

0 steps flagged

No circularity: claims rest on empirical distillation experiments

full rationale

The paper advances its central claim—that long-context retrieval can be transferred via logit distillation on packed short samples—through experimental results rather than any closed-form derivation or prediction. No equations are presented that reduce a claimed output to fitted inputs by construction, nor are there self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations whose validity depends on the present work. The RoPE perturbation tracing and query-state update analysis are observational traces within the reported setups; they do not constitute a mathematical chain that is tautological. The work is therefore self-contained against its own empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard transformer assumptions and knowledge-distillation principles already present in the literature.
