Recognition: 2 Lean theorem links
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Pith reviewed 2026-05-15 16:53 UTC · model grok-4.3
The pith
SharedLLM stacks two short-context LLMs: the lower one compresses long inputs into multi-grained representations that are injected at the lowest layers of the upper decoder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that self-injection of multi-grained compressed representations from a lower short-context LLM into the lowest layers of an upper short-context LLM enables effective processing of inputs much longer than the training length, without requiring full forward passes or additional cross-attention mechanisms.
What carries the argument
Self-injection: deriving both compressor and decoder from the same LLM layers and passing multi-grained context compressions to the upper model exclusively at its lowest layers, via a tree-based encoding and retrieval structure.
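Read structurally, the claim reduces to two stacks sharing one set of layers, with the lower stack's compressed states consumed only by the first layer(s) of the upper stack. A minimal PyTorch sketch of that wiring is below; the module names (ToyBlock, SelfInjectionSketch), the average-pooling compressor, and the single injected layer are illustrative assumptions, not the paper's implementation.

```python
# Structural sketch of "self-injection" as described above: two stacks derived
# from the same layers; the lower stack compresses the long context and its
# low-layer states are injected only into the lowest layer(s) of the upper stack.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in transformer block: attention over the running sequence plus an MLP."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory=None):
        # When compressed memory is injected, keys/values are the memory
        # prepended to the current window (no separate cross-attention module).
        kv = x if memory is None else torch.cat([memory, x], dim=1)
        attn_out, _ = self.attn(self.norm1(x), kv, kv)
        x = x + attn_out
        return x + self.ff(self.norm2(x))

class SelfInjectionSketch(nn.Module):
    """Compressor and decoder share one layer stack ("self-injection")."""
    def __init__(self, d_model: int = 64, n_layers: int = 4,
                 inject_layers: int = 1, pool: int = 8):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(d_model) for _ in range(n_layers)])
        self.inject_layers = inject_layers   # only the lowest layer(s) see memory
        self.pool = nn.AvgPool1d(pool)       # crude stand-in for multi-grained compression

    def compress(self, long_ctx):
        # Lower stack: run only the lowest layer(s) over the long context,
        # then downsample to a compact memory.
        h = long_ctx
        for layer in self.layers[: self.inject_layers]:
            h = layer(h)
        return self.pool(h.transpose(1, 2)).transpose(1, 2)

    def decode(self, x, memory):
        # Upper stack: memory is consumed only by the lowest layer(s);
        # every higher layer runs exactly as a vanilla short-context model.
        for i, layer in enumerate(self.layers):
            x = layer(x, memory if i < self.inject_layers else None)
        return x

model = SelfInjectionSketch()
long_ctx = torch.randn(1, 512, 64)    # "long" context, already embedded
query = torch.randn(1, 32, 64)        # short query window
memory = model.compress(long_ctx)     # -> (1, 64, 64) compressed context
out = model.decode(query, memory)     # -> (1, 32, 64)
print(out.shape)
```

The structural point the sketch preserves is that all layers above the injection point run exactly as in a vanilla short-context model, which is where the claimed efficiency would come from.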
If this is right
- Generalizes to inputs exceeding 128K tokens when trained on 8K sequences
- Delivers performance superior or comparable to strong baselines on long-context benchmarks
- Substantially reduces memory footprint
- Yields 2x inference speedup over streaming architectures and 3x over encoder-decoder architectures
Where Pith is reading between the lines
- The method could be applied to upgrade existing pretrained short-context LLMs to handle longer contexts with minimal additional training.
- The tree-based structure for query-aware retrieval may inspire similar efficiency gains in other retrieval-augmented or multimodal setups.
- Stacking additional layers might extend effective context length further without proportional increases in compute.
Load-bearing premise
The multi-grained compression at the lowest layers of the lower model retains all information relevant to queries processed by the upper model.
What would settle it
A long-context benchmark test where critical details from the input are lost in the low-layer compression, causing the model to fail on tasks that full-attention baselines solve correctly.
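A concrete version of such a test is a passkey probe: hide one short fact deep in filler text and check whether it is recovered verbatim. A minimal sketch follows, assuming only a generate(prompt) -> str wrapper around whichever model is under test; the wrapper name and the filler sentence are placeholders, not part of the paper.

```python
# Minimal passkey probe: if low-layer compression drops the key, the stacked
# model should show a systematic gap against a full-attention baseline.
import random

def make_probe(n_filler: int = 2000, seed: int = 0):
    rng = random.Random(seed)
    passkey = f"{rng.randint(10000, 99999)}"
    filler = ["The grass is green. The sky is blue."] * n_filler
    pos = rng.randint(0, n_filler)
    filler.insert(pos, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

def passkey_accuracy(generate, n_trials: int = 20) -> float:
    """`generate(prompt) -> str` wraps any model; returns exact-match rate."""
    hits = 0
    for seed in range(n_trials):
        prompt, passkey = make_probe(seed=seed)
        hits += passkey in generate(prompt)
    return hits / n_trials

# Usage: compare passkey_accuracy(stacked_model_generate) against
# passkey_accuracy(full_attention_generate) at matched context lengths;
# a consistent gap at long depths would indicate information lost in the
# low-layer compression.
```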
Original abstract
The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).
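The abstract's tree-based, multi-grained compression can be pictured as each context chunk stored at several granularities, with query-aware retrieval deciding which chunks contribute fine-grained states and which stay coarse. A small sketch under those assumptions follows; the pooling compressor, dot-product scoring rule, and level count are illustrative stand-ins, not the paper's actual algorithm.

```python
# Illustrative multi-grained, query-aware context tree: level w is twice as
# coarse as level w+1, and only the most query-relevant chunks keep fine detail.
import torch

def pool(x: torch.Tensor, ratio: int) -> torch.Tensor:
    """Mean-pool a (tokens, dim) chunk by `ratio` as a stand-in compressor."""
    t, d = x.shape
    t = (t // ratio) * ratio
    return x[:t].reshape(t // ratio, ratio, d).mean(dim=1)

def build_levels(chunk: torch.Tensor, base_ratio: int = 8, depth: int = 3):
    """Level w uses ratio base_ratio / 2**w, i.e. deeper levels are finer."""
    return [pool(chunk, max(base_ratio >> w, 1)) for w in range(depth)]

def retrieve(chunks, query, keep_top: int = 1, base_ratio: int = 8, depth: int = 3):
    """Query-aware selection: top-scoring chunks contribute their finest level,
    the rest contribute only the coarsest level."""
    q = query.mean(dim=0)                                   # (dim,) query summary
    coarse = [pool(c, base_ratio).mean(dim=0) for c in chunks]
    scores = torch.stack([torch.dot(q, c) for c in coarse])
    top = set(scores.topk(keep_top).indices.tolist())
    memory = []
    for i, c in enumerate(chunks):
        levels = build_levels(c, base_ratio, depth)
        memory.append(levels[-1] if i in top else levels[0])
    return torch.cat(memory, dim=0)                         # compact context memory

chunks = [torch.randn(64, 32) for _ in range(4)]            # 4 chunks of 64 tokens
query = torch.randn(16, 32)
mem = retrieve(chunks, query)
print(mem.shape)  # torch.Size([56, 32]): 3 coarse chunks (8 rows each) + 1 fine chunk (32 rows)
```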
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SharedLLM, a stacked architecture of two short-context LLMs derived from the same base model. The lower model compresses long inputs into multi-grained representations; these are injected directly into the lowest layers of the upper decoder model via self-injection, bypassing full forward passes and additional cross-attention. A tree-based data structure supports efficient encoding and query-aware retrieval. Trained only on 8K-token sequences, the model is claimed to generalize to inputs beyond 128K tokens while matching or exceeding strong baselines on long-context benchmarks, with reduced memory and 2×/3× inference speedups over streaming and encoder-decoder baselines respectively.
Significance. If the central claims hold under standard controls, the work would offer a practical route to long-context extension that avoids the data and compute costs of continual pre-training while delivering measurable efficiency gains; the self-injection design and tree-based retrieval could influence subsequent compression-based context-extension methods.
Major comments (2)
- [Architecture and self-injection description] The load-bearing claim that lowest-layer activations from the compressor retain all query-relevant semantic and long-range information, without higher-layer processing or additional cross-attention, is not yet supported by layer-ablation results or representational analyses. Standard layer-wise studies of transformers find that semantic dependencies emerge primarily in middle-to-upper layers, so multi-grained compression at the lowest layers risks discarding task-critical content on reasoning benchmarks.
- [Experimental results] Claims of superior or comparable performance on long-context benchmarks are stated without reference to specific tables, baseline implementations, ablation controls, or error bars. Without these details it cannot be verified that the reported generalization from 8K training to 128K+ inputs survives standard data-selection and hyperparameter controls.
Minor comments (2)
- [Method] Notation for the tree-based retrieval structure and the precise definition of 'multi-grained' compression should be formalized with equations or pseudocode to clarify how query-aware selection operates across scales.
- [Abstract and introduction] The abstract and introduction would benefit from explicit comparison of memory and latency numbers against the exact streaming and encoder-decoder baselines used for the 2× and 3× speedup claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to strengthen the manuscript's clarity and empirical support.
Point-by-point responses
-
Referee: [Architecture and self-injection description] The load-bearing claim that lowest-layer activations from the compressor retain all query-relevant semantic and long-range information, without higher-layer processing or additional cross-attention, is not yet supported by layer-ablation results or representational analyses. Standard layer-wise studies of transformers find that semantic dependencies emerge primarily in middle-to-upper layers, so multi-grained compression at the lowest layers risks discarding task-critical content on reasoning benchmarks.
Authors: We appreciate the referee's reference to established layer-wise analyses. In SharedLLM the lower model still performs a full forward pass over the long input, so the states it passes upward encode multi-grained, query-aware features selected via the tree-based retrieval; the upper model then receives them directly at its own lowest layers. This design choice is motivated by efficiency and by the empirical generalization observed from 8K training to 128K+ inputs. To address the concern directly, we will add layer-ablation experiments (injecting from layers 1, 4, 8, and 12 of the compressor) together with a brief representational similarity analysis in the revised manuscript. Revision: yes.
-
Referee: [Experimental results] Claims of superior or comparable performance on long-context benchmarks are stated without reference to specific tables, baseline implementations, ablation controls, or error bars. Without these details it cannot be verified that the reported generalization from 8K training to 128K+ inputs survives standard data-selection and hyperparameter controls.
Authors: We apologize for the insufficient cross-references. The main results appear in Tables 2–5 (long-context modeling and understanding benchmarks), with explicit baseline descriptions (StreamingLLM, LongLLaMA, encoder-decoder variants) and implementation details in Section 4.2. We will revise the text to cite these tables at every performance claim, add error bars from three random seeds, and include additional ablation tables on data-selection and hyperparameter sensitivity in the appendix of the revised version. Revision: yes.
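The layer-ablation and representational-similarity analysis promised in the first response could look roughly like the following sketch; build_model, evaluate, and get_layer_states are placeholders for the authors' actual training and evaluation harness, not published code.

```python
# Hedged sketch of a layer-ablation protocol: vary which compressor layer
# supplies the injected states, record benchmark scores, and compare layer
# representations with a simple linear-CKA similarity.
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two (samples, features) representation matrices."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm() ** 2
    den = (x.T @ x).norm() * (y.T @ y).norm()
    return (num / den).item()

def layer_ablation(build_model, evaluate, inject_from=(1, 4, 8, 12)):
    """Returns {injection layer index: benchmark score} for each depth tried."""
    return {layer: evaluate(build_model(inject_layer=layer)) for layer in inject_from}

# Usage (with a real harness): scores = layer_ablation(build_model, evaluate);
# linear_cka(get_layer_states(model, layer=1), get_layer_states(model, layer=12))
# quantifies how much low-layer states diverge from upper-layer ones.
```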
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents SharedLLM as a novel architectural construction using stacked short-context LLMs with multi-grained compression and self-injection at lowest layers, trained only on 8K sequences yet generalizing to 128K inputs. No equations or derivations are shown that reduce any prediction or result to a fitted parameter or input quantity by construction. The generalization and efficiency claims rest on empirical benchmarks rather than self-referential definitions or load-bearing self-citations. The self-injection concept is defined explicitly as reusing the same LLM layers for compressor and decoder, which is a design choice, not a circular reduction. This is a standard case of an independent architectural proposal with no detected circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The lower model compresses long inputs into compact, multi-grained representations... tree-like structure... α_w = 2 α_{w+1}... compression ratio β"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "self-injection... lowest layers... bypassing lengthy forward passes"
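Read literally, and assuming α_w denotes the compression ratio applied at tree level w with β as the ratio at the top level, the quoted relation just says each level is twice as coarse as the one below it; this is an interpretive sketch of the fragment, and the paper's own definitions of α_w and β take precedence.

\[
\alpha_w = 2\,\alpha_{w+1}, \qquad \alpha_0 = \beta \;\Longrightarrow\; \alpha_w = \frac{\beta}{2^{w}}.
\]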
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- [3] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [4] Tom B. Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- [5] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- [6] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3829–3846, 2023.
- [7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [8] Colin B. Clement, Matthew Bierbaum, Kevin P. O'Keeffe, and Alexander A. Alemi. On the use of arXiv as a dataset. arXiv preprint arXiv:1905.00075, 2019.
- [9] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
- [10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [11] Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org. Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128K context. arXiv preprint arXiv:2402.10171, 2024.
- [12] Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.
- [13] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [14] Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-Infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024. doi: 10.18653/v1/2024.naacl-long.222. URL https://aclanthology.org/2024.naacl-long.222/.
- [15] Jitai Hao, Yuke Zhu, Tian Wang, Jun Yu, Xin Xin, Bo Zheng, Zhaochun Ren, and Sheng Guo. OmniKV: Dynamic context selection for efficient long-context LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [16] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [17] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
- [19] Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407, 2025.
- [20] Yinhan Liu. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [21] Alexandra Sasha Luccioni and Joseph D. Viviano. What's in the box? A preliminary analysis of undesirable content in the Common Crawl corpus. arXiv preprint arXiv:2105.02732, 2021.
- [22] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- [23] Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
- [24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. URL https://www.together.ai/blog/llama-2-7b-32k-instruct.
- [25] Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, and Tri Dao. M1: Towards scalable test-time compute with Mamba reasoning models. arXiv preprint arXiv:2504.10449, 2025.
- [26] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
- [27] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. InfLLM: Unveiling the intrinsic capacity of LLMs for understanding extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617, 2024. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficien...
- [28] doi: 10.18653/v1/2024.naacl-long.260. URL https://aclanthology.org/2024.naacl-long.260. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [29] URL https://aclanthology.org/2024.acl-long.142. Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, and Zhicheng Dou. Extending Llama-3's context ten-fold overnight, 2024a. Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Long context compression with activat...
- [30] URL https://openreview.net/forum?id=1eQT9OzfNQ. Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ∞Bench: Extending long context evaluation beyond 100K tokens. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics..., 2024.