pith. machine review for the scientific record

arxiv: 2604.14339 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-context adaptation · RoPE perturbation · self-distillation · positional robustness · length extrapolation · LLM fine-tuning · RULER benchmark

The pith

Perturbing RoPE indices during self-distillation makes long-context fine-tuned models less sensitive to where evidence appears.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models fine-tuned for long contexts often fail when relevant information is placed at different positions, even if the task stays the same. The paper attributes this brittleness to over-reliance on absolute positional signals rather than on meaning. It introduces a regularizer that creates shifted versions of each training sequence by altering RoPE indices, then uses self-distillation to push the model toward consistent predictions across those versions. The intent is to weaken positional dependence so the model generalizes better on retrieval and reasoning tasks over long inputs.

Core claim

Perturbing the rotary position embedding indices of a training sequence yields alternative positional views of the same content. Training the model via self-distillation to produce consistent outputs across those views reduces its dependence on brittle absolute-position cues, yielding higher accuracy together with improved extrapolation beyond the training length.

What carries the argument

RoPE-Perturbed Self-Distillation: a training regularizer that generates alternative sequence views through RoPE index perturbation and enforces prediction consistency via self-distillation to favor semantic over positional signals.
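
This rendering does not reproduce the paper's pseudocode; the Figure 4 caption mentions a skip-based shift and contrasts it with chunked permutation and random dilation. Below is a minimal sketch of what a segment-wise skip shift on position indices could look like, in PyTorch. The function name, segment count, and offset range are illustrative assumptions, and the closing comment assumes a model that accepts explicit position_ids (as HuggingFace-style causal LMs do); none of this is the paper's exact operator.

```python
import torch

def skip_shift_position_ids(seq_len: int, num_segments: int = 4,
                            max_skip: int = 1024) -> torch.Tensor:
    """Split the context into contiguous segments and insert a random
    "skip" (gap) in the RoPE indices before each one, so segments land
    at different absolute offsets while token order is preserved.
    Illustrative only; the paper's operator may differ in detail."""
    bounds = torch.linspace(0, seq_len, num_segments + 1).long()
    pieces, offset = [], 0
    for i in range(num_segments):
        start, end = bounds[i].item(), bounds[i + 1].item()
        offset += int(torch.randint(0, max_skip + 1, (1,)))  # random gap
        pieces.append(torch.arange(start, end) + offset)
    return torch.cat(pieces)  # shape (seq_len,), strictly increasing

# Standard view: positions 0..L-1. Perturbed view: the same tokens with
# these shifted indices, e.g. model(input_ids, position_ids=pos[None, :]).
```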

If this is right

  • Higher accuracy on long-context benchmarks such as RULER at 64K and 256K tokens after supervised fine-tuning.
  • Reduced performance drop when models are tested beyond their training context window.
  • Lower sensitivity to the absolute placement of evidence in multi-document reasoning and retrieval-augmented tasks.
  • Consistent improvements when applied to different base models such as Llama-3-8B and Qwen-3-4B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perturbation-plus-consistency idea could be tested with other positional encodings to check whether the benefit is specific to RoPE.
  • It may allow shorter training contexts to suffice if the model becomes more position-invariant, reducing compute for long-context adaptation.
  • Combining the regularizer with explicit positional randomization during data preparation could amplify the effect on downstream retrieval tasks (a minimal sketch of such data-level randomization follows this list).
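
As a concrete, purely hypothetical sketch of that last item, data-level randomization could be as simple as shuffling retrieved documents before concatenation, independently of the RoPE-level perturbation:

```python
import random

def shuffle_documents(docs: list[str], question: str,
                      rng: random.Random | None = None) -> str:
    """Shuffle supporting documents before building the prompt, so the
    gold evidence lands at a different absolute position each epoch; a
    data-preparation-level complement to the RoPE-level regularizer."""
    rng = rng or random.Random()
    docs = list(docs)            # don't mutate the caller's list
    rng.shuffle(docs)
    return "\n\n".join(docs) + "\n\nQuestion: " + question
```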

Load-bearing premise

That consistency training across RoPE-perturbed versions of the same text will reliably move the model toward semantic reliance without introducing new artifacts or degrading unrelated capabilities.

What would settle it

A synthetic long-document test in which the exact position of the single relevant fact is systematically varied while holding all other content fixed; if accuracy variance across positions stays high after training, the method has not achieved the claimed robustness.
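
One way to run that test is sketched below, with caller-supplied generate and check callables standing in for the model call and answer matching (both hypothetical; the paper does not specify this harness):

```python
import statistics

def insert_fact_at(haystack: str, fact: str, depth: float) -> str:
    """Place the single relevant fact at a fractional depth of the context."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + fact + " " + haystack[cut:]

def positional_sweep(generate, haystack, fact, question, check,
                     n_positions: int = 16):
    """Hold all distractor content fixed, sweep the fact across evenly
    spaced depths, and report mean accuracy and its spread across depths."""
    accs = []
    for i in range(n_positions):
        depth = i / (n_positions - 1)          # 0.0 = start, 1.0 = end
        prompt = insert_fact_at(haystack, fact, depth) + "\n" + question
        accs.append(float(check(generate(prompt))))
    # A large spread after training means positional brittleness remains.
    return statistics.mean(accs), statistics.pstdev(accs)
```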

Figures

Figures reproduced from arXiv: 2604.14339 by Chen Liang, Liliang Ren, Tuo Zhao, Weizhu Chen, Yelong Shen, Zichong Li.

Figure 1
Figure 1: (a) Relation between answer position and accuracy on the NIAH multikey-2 task from RULER (Hsieh et al., 2024) with 64k-token sequences. We evaluate the ProLong-64k-base model (Gao et al., 2025) as the causal LM baseline and a model trained with our method. (b) Overview of our proposed objective. We train the model with two forward passes: a standard view and a perturbed view (Figure 1b). On the standard view, we… view at source ↗
Figure 2
Figure 2: Analysis of attention distance patterns on a 64k-token sequence from the ProLong (Gao et al., 2025) dataset. For both the baseline model and our model, we inspect attention scores from the final 256 query tokens. We consider attention weights above a threshold (10⁻³) and measure the signed distance between query and key indices. The figure shows the histogram of attention distances in the 24th layer. marks, i… view at source ↗
Figure 3
Figure 3: Positional sensitivity on RULER NIAH MultiKey-2 for Qwen-3-4B. view at source ↗
Figure 4
Figure 4: RULER performance versus wall-clock training time under matched compute. More aggressive perturbations that disrupt global order, such as chunked permutation of large index blocks, provide little benefit and can hurt accuracy. In contrast, random dilation (randomly scaling RoPE indices by a factor in [0.5, 2]) yields moderate gains but still trails our skip-based shift. Overall, these results s… view at source ↗
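
The Figure 2 procedure can be approximated with a short routine over one layer's attention matrix. The sketch below assumes a dense (num_heads, L, L) attention-probability tensor is available; for real 64k-token contexts, extracting this would require chunked or attention-implementation-aware instrumentation, which the sketch omits.

```python
import torch

def signed_attention_distances(attn: torch.Tensor, last_q: int = 256,
                               thresh: float = 1e-3) -> torch.Tensor:
    """attn: (num_heads, L, L) attention probabilities for one layer.
    Returns signed key-minus-query index distances for every weight
    above `thresh`, restricted to the final `last_q` query positions
    (negative values = attending backwards in the sequence)."""
    n_heads, L, _ = attn.shape
    q_idx = torch.arange(L - last_q, L)
    sub = attn[:, q_idx, :]                     # (heads, last_q, L)
    _, q_local, k_abs = (sub > thresh).nonzero(as_tuple=True)
    return k_abs - q_idx[q_local]

# e.g. hist = torch.histc(signed_attention_distances(attn).float(), bins=200)
```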
read the original abstract

Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes RoPE-Perturbed Self-Distillation, a training regularizer for long-context adaptation of pretrained LLMs. It creates alternative views of a sequence by perturbing RoPE indices (e.g., via segment shifts or shuffles) and applies self-distillation to enforce prediction consistency across views, with the goal of reducing reliance on brittle absolute positional cues in favor of semantic content. Experiments on Llama-3-8B and Qwen-3-4B report benchmark gains after SFT, including up to 12.04% on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B, plus better length extrapolation.

Significance. If the mechanism holds, the approach provides a lightweight, architecture-agnostic regularizer that could improve positional robustness for retrieval-augmented and multi-document tasks without new data or model changes. The reported gains indicate practical value, but significance depends on whether improvements arise from the intended semantic shift rather than perturbation side-effects or extra training.

major comments (1)
  1. [Method (RoPE-Perturbed Self-Distillation description)] The core assumption that RoPE index perturbations produce semantically equivalent views differing only in absolute position is not justified and appears incorrect. RoPE encodes relative positions via cos/sin((m-n)θ) in the attention logits; any non-uniform perturbation (segment shift, local shuffle, or random reassignment) necessarily alters pairwise relative distances across perturbation boundaries in a content-dependent manner. This changes the two views' attention patterns, so self-distillation may penalize correct relative-order reasoning rather than solely removing absolute-position artifacts. This issue is load-bearing for the central claim of shifting reliance to semantic signals and must be addressed with either a revised derivation or targeted experiments (e.g., isolating relative vs. absolute effects).
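
The referee's point is easy to verify numerically: a uniform shift of all indices preserves every pairwise difference m − n (the only positional quantity RoPE injects into attention logits), while a segment-level shift changes differences that cross the boundary. A minimal illustration, not taken from the paper:

```python
import numpy as np

pos = np.arange(8)                     # original RoPE indices 0..7
uniform = pos + 100                    # shift every index by 100
segment = pos.copy()
segment[4:] += 100                     # shift only the second half

def pairwise(p):                       # what RoPE exposes: m - n
    return p[:, None] - p[None, :]

print(np.array_equal(pairwise(uniform), pairwise(pos)))  # True
print(np.array_equal(pairwise(segment), pairwise(pos)))  # False
# Within-segment distances survive, but the gap between index 3 and
# index 4 jumps from 1 to 101: cross-boundary relative geometry changes.
```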
minor comments (2)
  1. [Experiments] Experimental results lack error bars, multiple random seeds, statistical significance tests, and ablations on perturbation variants (e.g., uniform vs. segment-based). This weakens assessment of the reliability of the reported percentage gains.
  2. [Method] The manuscript should include a clear pseudocode or equation for the exact perturbation operator and the self-distillation loss (e.g., KL divergence between views) to support reproducibility.
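
On the second minor point, one plausible instantiation of the requested pseudocode is sketched below: a standard LM loss on the unperturbed view plus a KL consistency term against it. The KL direction, the stop-gradient placement, the weighting lam, and the HuggingFace-style model interface are all assumptions, not the manuscript's stated objective.

```python
import torch.nn.functional as F

def two_view_loss(model, input_ids, labels, perturbed_pos, lam=1.0):
    """L_LM on the standard view plus lam * KL(stopgrad(std) || pert)
    between the two views' next-token distributions. Coefficients, KL
    direction, and stop-gradient placement are guesses; assumes a
    HuggingFace-style causal LM returning .loss and .logits."""
    std = model(input_ids=input_ids, labels=labels)           # standard view
    pert = model(input_ids=input_ids, position_ids=perturbed_pos)
    teacher = F.log_softmax(std.logits.detach(), dim=-1)      # frozen target
    student = F.log_softmax(pert.logits, dim=-1)
    kl = F.kl_div(student, teacher, log_target=True, reduction="batchmean")
    return std.loss + lam * kl
```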

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for the constructive feedback on our proposed method. The comment regarding the effects of RoPE perturbations on relative positions is well-taken, and we will use this opportunity to strengthen the manuscript's description and analysis.

read point-by-point responses
  1. Referee: [Method (RoPE-Perturbed Self-Distillation description)] The core assumption that RoPE index perturbations produce semantically equivalent views differing only in absolute position is not justified and appears incorrect. RoPE encodes relative positions via cos/sin((m-n)θ) in the attention logits; any non-uniform perturbation (segment shift, local shuffle, or random reassignment) necessarily alters pairwise relative distances across perturbation boundaries in a content-dependent manner. This changes the two views' attention patterns, so self-distillation may penalize correct relative-order reasoning rather than solely removing absolute-position artifacts. This issue is load-bearing for the central claim of shifting reliance to semantic signals and must be addressed with either a revised derivation or targeted experiments (e.g., isolating relative vs. absolute effects).

    Authors: We thank the referee for this precise analysis of RoPE mechanics. We acknowledge that non-uniform perturbations do alter relative positional encodings across segment boundaries, contrary to our initial phrasing, which emphasized absolute position differences. This is a valid point that warrants clarification in the manuscript. However, the self-distillation objective is applied at the prediction level for tasks where the correct output is invariant to the absolute positioning of evidence (and, across distant parts of the context, largely to its relative positioning), as is the case in retrieval and multi-document QA benchmarks like RULER. By enforcing consistency across perturbed views, the model is encouraged to prioritize semantic content over specific positional patterns, which our empirical results support through consistent improvements in positional robustness. To directly address the concern, we will revise the method section to accurately describe the impact on both absolute and relative positions. Additionally, we will include targeted experiments that isolate relative versus absolute effects, such as comparing our perturbations to uniform RoPE shifts that preserve all relative distances, and analyzing attention patterns and performance on order-sensitive subtasks. revision: yes

Circularity Check

0 steps flagged

No derivation chain; empirical regularizer evaluated on external benchmarks

full rationale

The paper presents a training procedure (RoPE index perturbation + self-distillation) whose value is asserted solely through benchmark improvements on RULER and length-extrapolation tasks. No equations, fitted parameters, or first-principles derivations are offered that could reduce to the inputs by construction. The central claim is an empirical observation about positional robustness, not a mathematical result. No self-citations are load-bearing for any derivation, and no uniqueness theorems or ansatzes are invoked. This is the normal case of a methods paper whose correctness is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the standard RoPE formulation and self-distillation objective; no new entities or fitted constants are introduced in the abstract.

axioms (2)
  • standard math Rotary position embeddings (RoPE) are used in the base LLM architecture
    The perturbation operates on existing RoPE indices (the defining relative-position identity is recalled after this list).
  • domain assumption Self-distillation can enforce consistency across augmented views
    Core training assumption of the regularizer.
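
For reference, the first axiom's standard machinery (the RoFormer formulation cited in the reference graph below) is the identity that attention logits see positions only through their difference:

```latex
% RoPE (Su et al., RoFormer): queries/keys are rotated by block-diagonal,
% position-indexed rotations R_{\Theta,m}; the attention logit then
% depends on positions only through the difference n - m.
q_m = R_{\Theta,m} W_q x_m, \qquad k_n = R_{\Theta,n} W_k x_n,
\qquad
q_m^{\top} k_n = x_m^{\top} W_q^{\top} R_{\Theta,\, n-m} W_k x_n .
```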

pith-pipeline@v0.9.0 · 5544 in / 1230 out tokens · 28120 ms · 2026-05-10T13:37:51.386455+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    URL https://openreview.net/forum?id=ONOtpXLqqw. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Rozière, B., Biron, B., Tang, B....

  2. [2]

    Data engineering for scaling language models to 128K context. arXiv preprint arXiv:2402.10171, 2024

    URL https://openreview.net/forum?id=fL4qWkSmtM. Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128K context. arXiv preprint arXiv:2402.10171, 2024. Gao, T., Wettig, A., Yen, H., and Chen, D. How to train long-context language models (effectively). In Che, W., Nabende, J., Shutov...

  3. [3]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    URL https://aclanthology.org/2025.acl-long.366/. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., and Wang, H. Retrieval-augmented generation for large language models: A survey. CoRR, abs/2312.10997, 2023. Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-Infinite: Zero-shot extre...

  4. [4]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    URL https://openreview.net/forum?id=VTF8yNQM66. Jin, H., Han, X., Yang, J., Jiang, Z., Chang, C.-Y., and Hu, X. GrowLength: Accelerating LLMs pretraining by progressively growing training length. arXiv preprint arXiv:2310.00576, 2023. Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., and Reddy, S. The impact of positional encoding on length gen...

  5. [5]

    arXiv preprint arXiv:2504.06214

    URL https://aclanthology.org/2024.naacl-long.260/. Xu, C., Ping, W., Xu, P., Liu, Z., Wang, B., Shoeybi, M., Li, B., and Catanzaro, B. From 128K to 4M: Efficient training of ultra-long context large language models. arXiv preprint arXiv:2504.06214, 2025. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng...