pith. machine review for the scientific record.

arxiv: 2605.09932 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords long context · attention dilution · bilevel optimization · supervised fine-tuning · parametric memory · attention sinks · LLM training

The pith

FocuSFT's bilevel optimization forms a parametric memory to counter attention dilution in long-context fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard supervised fine-tuning on long sequences wastes attention on positional biases and sinks instead of relevant content, weakening learning. FocuSFT fixes this at training time by running an inner loop that adapts fast weights to build a memory concentrating attention, then an outer loop that fine-tunes conditioned on the improved focus. Both loops use bidirectional attention on the context to reduce causal asymmetry. This yields concrete gains: up to 14 percentage points better on BABILong across lengths, higher scores on RULER, and 24% relative improvement on GPQA. The approach also dramatically cuts attention sink mass.

Core claim

Attention dilution during SFT arises from how attention budget is allocated to positionally privileged tokens. FocuSFT addresses it through bilevel optimization: the inner loop adapts lightweight fast-weight parameters on the context to create a parametric memory that sharpens attention toward semantically relevant tokens, and the outer loop performs the supervised fine-tuning using this sharpened representation, with bidirectional attention applied to context tokens in both loops.
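
To make the mechanics concrete, the sketch below walks through one training step under this reading, in PyTorch-style Python. It is a minimal sketch, not the paper's implementation: the `model(input_ids, attention_mask=...)` interface, the masked-reconstruction inner objective, the 15% corruption rate, and the single first-order inner SGD step are all illustrative assumptions; only the overall structure (inner fast-weight adaptation on the context, outer SFT conditioned on it, bidirectional context masking with causal responses) follows the claim above.

```python
import torch
import torch.nn.functional as F

def bilevel_step(model, lora_params, outer_opt, context_ids, response_ids,
                 mask_token_id, inner_steps=1, inner_lr=1e-4):
    """One FocuSFT-style step: adapt fast weights on the context (inner loop),
    then take an SFT step on the response conditioned on them (outer loop).
    `lora_params` are assumed to live inside `model`, so inner updates carry
    over to the outer forward pass; `outer_opt` holds the base (slow) weights."""
    ctx_len, resp_len = context_ids.size(1), response_ids.size(1)
    full_ids = torch.cat([context_ids, response_ids], dim=1)

    # Inner loop: adapt only the LoRA fast weights on the context, with
    # bidirectional attention. The masked-reconstruction objective is an
    # illustrative stand-in for whatever inner loss the paper actually uses.
    inner_opt = torch.optim.SGD(lora_params, lr=inner_lr)
    bidir_mask = torch.ones(ctx_len, ctx_len, dtype=torch.bool)
    for _ in range(inner_steps):
        corrupt = torch.rand(context_ids.shape) < 0.15             # hide ~15% of context tokens
        corrupted_ids = context_ids.masked_fill(corrupt, mask_token_id)
        logits = model(corrupted_ids, attention_mask=bidir_mask)   # hypothetical interface
        inner_loss = F.cross_entropy(logits[corrupt], context_ids[corrupt])
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()        # adapted fast weights now act as the parametric memory

    # Outer loop: standard SFT loss on response tokens only, conditioned on the
    # adapted fast weights (first-order: no differentiation through the inner loop).
    total = ctx_len + resp_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal baseline
    mask[:ctx_len, :ctx_len] = True                                # bidirectional inside the context
    logits = model(full_ids, attention_mask=mask)
    resp_logits = logits[:, ctx_len - 1:-1]                        # positions that predict response tokens
    outer_loss = F.cross_entropy(resp_logits.reshape(-1, resp_logits.size(-1)),
                                 response_ids.reshape(-1))
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
    return outer_loss.detach()
```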

What carries the argument

Bilevel optimization where the inner loop creates a parametric memory via fast-weight adaptation to concentrate attention on relevant content.

If this is right

  • Accuracy on long-context benchmarks like BABILong improves by up to 14 percentage points for contexts up to 32K.
  • Attention sink mass drops by a factor of 529, with tripled engagement on context tokens.
  • CWE aggregation on RULER rises from 72.9% to 81.1% at 16K context.
  • Pass@1 on GPQA with agentic tools increases by 24% relatively.
  • The method preserves causal masking for responses while allowing bidirectional context attention.
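
The last bullet is easiest to see as a mask. Below is a self-contained sketch assuming a boolean convention in which True means attention is allowed; the released code may construct this differently.

```python
import torch

def hybrid_mask(ctx_len: int, resp_len: int) -> torch.Tensor:
    """True = attention allowed; rows are queries, columns are keys."""
    total = ctx_len + resp_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal baseline
    mask[:ctx_len, :ctx_len] = True   # lift the causal restriction inside the context
    return mask

if __name__ == "__main__":
    print(hybrid_mask(ctx_len=3, resp_len=2).int())
    # tensor([[1, 1, 1, 0, 0],   <- context queries see the whole context
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],   <- response queries see context + earlier responses
    #         [1, 1, 1, 1, 1]], dtype=torch.int32)
```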

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bilevel structures might help in other training regimes where attention or focus is diluted.
  • Lightweight inner adaptations could be extended to other forms of parametric memory beyond attention sharpening.
  • If the inner loop generalizes well, it might allow effective use of even longer contexts without proportional increases in training compute.
  • Combining FocuSFT with inference-time methods could compound the benefits for long-context tasks.

Load-bearing premise

That the inner-loop fast-weight adaptation reliably builds a parametric memory focusing on relevant content that generalizes to new contexts without introducing overfitting or bias.

What would settle it

A controlled comparison on a held-out long-context task between FocuSFT and standard SFT: if FocuSFT showed no reduction in attention sink mass and no accuracy gain there, the core claim would fail.

Figures

Figures reproduced from arXiv: 2605.09932 by Bei Yu, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan, Xianzhi Yu, Zehua Pei.

Figure 1. Overview of FocuSFT. Each training step is a bilevel optimization: the outer loop (full box) performs standard SFT on response tokens, while an inner loop (dashed box, nested inside) first adapts lightweight LoRA fast weights ϕ on the context with bidirectional attention, forming a parametric memory ϕ(K) that concentrates attention on relevant content. The outer loop then computes the SFT loss conditioned…
Figure 2. Training-time attention patterns on a 4096-token multi-turn agentic sample, comparing…
Figure 3. Attention heatmaps at a representative middle layer on a 4096-token multi-turn sample.
Figure 4. BABILong accuracy across context…
Figure 5. BABILong accuracy vs. inner-loop layer fraction. Performance peaks at lf = 0.35, balancing memory capacity and base model stability. (Companion panel: attention sink mass per layer for standard SFT vs. ours; average 0.301 vs. 0.0006, a 529× reduction.)
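
For readers asking what quantity the 529× figure in the Figure 5 panel refers to, the sketch below shows one plausible way to compute per-layer sink attention mass from attention maps. Treating the leading token(s) as the sink position is an assumption; the paper's exact definition and aggregation may differ.

```python
import torch

def sink_mass_per_layer(attn_maps, n_sink_tokens=1):
    """attn_maps: one tensor per layer, shape (batch, heads, queries, keys),
    rows already softmax-normalized. Returns the mean attention mass that
    queries place on the first `n_sink_tokens` key positions, per layer."""
    masses = []
    for attn in attn_maps:
        sink = attn[..., :n_sink_tokens].sum(dim=-1)   # mass each query gives to the sink keys
        masses.append(sink.mean().item())              # average over batch, heads, and queries
    return masses

if __name__ == "__main__":
    # Random softmaxed attention, just to show the shapes involved.
    layers = [torch.softmax(torch.randn(1, 4, 16, 16), dim=-1) for _ in range(3)]
    print([round(m, 3) for m in sink_mass_per_layer(layers)])
```

Under this reading, the reported reduction would be the ratio of the averaged per-layer masses for standard SFT versus FocuSFT (the 0.301 vs. 0.0006 quoted in the Figure 5 panel).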
Original abstract

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that attention dilution during SFT on long sequences—driven by positional biases and attention sinks—limits long-context capabilities in LLMs. It introduces FocuSFT, a bilevel optimization where an inner loop adapts lightweight fast-weight parameters on the context (with bidirectional attention) to form a parametric memory that sharpens focus on relevant content, and an outer loop performs SFT conditioned on this representation (also using bidirectional context attention and causal response masking). Reported results include up to +14pp accuracy gains on BABILong (4K–32K), CWE aggregation rising from 72.9% to 81.1% on RULER at 16K, 24% relative pass@1 improvement on GPQA with agentic tools, plus 529× reduction in attention sink mass and tripled context engagement.

Significance. If the central mechanism holds, FocuSFT provides a training-time intervention to reduce attention dilution without inference overhead, with concrete benchmark lifts and attention diagnostics that could inform future long-context work. The public code release aids reproducibility. Significance is limited by the absence of controls that would confirm the bilevel component (rather than the bidirectional masking change) as the driver of gains.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Methods): The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention on context tokens plus the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or bilevel with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.
  2. [§4] §4 (Experiments) and attention analysis: The claim that the inner-loop adaptation forms a generalizable parametric memory that concentrates attention on semantically relevant content rests on the paper's weakest assumption: that this behavior transfers beyond the training contexts and does not introduce new biases. No out-of-distribution context tests or controls for overfitting to the fast-weight parameters are reported, weakening the generalization argument.
minor comments (2)
  1. [Abstract] The abstract and method descriptions should explicitly state the base model sizes, number of fast-weight parameters, and inner-loop update steps to allow direct replication.
  2. [§4] Attention visualizations in §4 would benefit from quantitative error bars or multiple random seeds to support the 529× and 3× aggregate claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we respond point-by-point to the major comments, indicating where revisions will be made to address the concerns.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Methods): The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention on context tokens plus the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or bilevel with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.

    Authors: We agree that the current experiments do not isolate the contribution of the bilevel optimization from the bidirectional attention change. Bidirectional masking on context tokens is an integral design choice in FocuSFT to reduce causal asymmetry and align inner- and outer-loop behavior, as described in §3. Nevertheless, the referee is correct that this prevents direct verification of whether the inner-loop fast-weight adaptation is the primary driver of the reported attention sharpening and benchmark gains. In the revised manuscript we will add an ablation that compares (i) standard causal SFT, (ii) bidirectional-context SFT without the inner loop, and (iii) full FocuSFT, allowing the specific effect of the bilevel component to be quantified. revision: yes

  2. Referee: [§4] §4 (Experiments) and attention analysis: The claim that the inner-loop adaptation forms a generalizable parametric memory that concentrates attention on semantically relevant content rests on the paper's weakest assumption: that this behavior transfers beyond the training contexts and does not introduce new biases. No out-of-distribution context tests or controls for overfitting to the fast-weight parameters are reported, weakening the generalization argument.

    Authors: The generalization claim is currently supported by consistent gains across BABILong (4K–32K), RULER at 16K, and GPQA, together with the attention diagnostics in §4 that show reduced sink mass and higher context engagement. These results indicate that the outer-loop SFT learns to utilize the inner-loop memory on the evaluated tasks. We acknowledge, however, that explicit out-of-distribution context tests and controls for potential overfitting of the fast-weight parameters (e.g., regularization or held-out context distributions) are absent from the submitted version. We will add a dedicated analysis section with such controls in the revision to strengthen the generalization argument. revision: yes
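
To make the comparison promised in response 1 concrete, the ablation grid might look like the sketch below, including the extra arm the referee suggested (bilevel with causal context masking). The arm names and flags are hypothetical, not taken from the paper or its code.

```python
from dataclasses import dataclass

@dataclass
class ArmConfig:
    name: str
    bidirectional_context: bool   # lift causal masking over context tokens?
    inner_loop: bool              # run the fast-weight adaptation (parametric memory)?

ABLATION_ARMS = [
    ArmConfig("standard_causal_sft",    bidirectional_context=False, inner_loop=False),
    ArmConfig("bidirectional_sft_only", bidirectional_context=True,  inner_loop=False),
    ArmConfig("focusft_full",           bidirectional_context=True,  inner_loop=True),
    # The extra arm the referee suggested: bilevel with causal context masking.
    ArmConfig("bilevel_causal_context", bidirectional_context=False, inner_loop=True),
]

for arm in ABLATION_ARMS:
    print(f"{arm.name}: bidir_ctx={arm.bidirectional_context}, inner_loop={arm.inner_loop}")
```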

Circularity Check

0 steps flagged

No circularity; derivation introduces independent bilevel mechanism validated on external benchmarks

Full rationale

The paper defines FocuSFT as a new bilevel optimization procedure with an inner-loop fast-weight adaptation and outer-loop SFT, both using bidirectional context attention. This is presented as a methodological contribution rather than a derivation that reduces to its own inputs. Results are grounded in independent external benchmarks (BABILong, RULER, GPQA) and post-hoc attention measurements, none of which are defined in terms of the target quantities. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The bidirectional masking is an explicit design choice in the method, not a hidden tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract alone does not specify numerical free parameters or background axioms; the key introduced concept is the parametric memory formed by fast-weight adaptation.

invented entities (1)
  • lightweight fast-weight parameters (no independent evidence)
    purpose: to form a parametric memory that concentrates attention on relevant content in the inner loop
    Described as adapted on the training context to sharpen the representation used by the outer SFT loop.

pith-pipeline@v0.9.0 · 5606 in / 1273 out tokens · 46431 ms · 2026-05-12T04:31:40.232258+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 15 internal anchors

  1. [1] J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems, 29, 2016.
  2. [2] Y. Bai, X. Lv, J. Zhang, Y. He, J. Qi, L. Hou, J. Tang, Y. Dong, and J. Li. LongAlign: A recipe for long context alignment of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1376–1395, 2024.
  3. [3] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
  4. [4] R. Bansal, A. Zhang, R. Tiwari, L. Madaan, S. S. Duvvuri, D. Khatri, D. Brandfonbrener, D. Alvarez-Melis, P. Bhargava, M. S. Kale, et al. Let's (not) just put things in context: Test-time training for long-context LLMs. arXiv preprint arXiv:2512.13898, 2025.
  5. [5] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  6. [6] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In The International Conference on Learning Representations (ICLR), 2024.
  7. [7] Y. Chen, S. Yu, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia. Long Alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023.
  8. [8] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  9. [9] K. Clark, K. Guu, M.-W. Chang, P. Pasupat, G. Hinton, and M. Norouzi. Meta-learning fast weight language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9751–9757, 2022.
  10. [10] Y. Ding, L. L. Zhang, C. Zhang, Y. Xu, N. Shang, J. Xu, F. Yang, and M. Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
  11. [11] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
  12. [12] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  13. [13] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  14. [14] G. E. Hinton and D. C. Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.
  15. [15] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
  16. [16] C.-Y. Hsieh, Y.-S. Chuang, C.-L. Li, Z. Wang, L. Le, A. Kumar, J. Glass, A. Ratner, C.-Y. Lee, R. Krishna, et al. Found in the middle: Calibrating positional attention bias improves long context utilization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14982–14995, 2024.
  17. [17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  18. [18] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
  19. [19] Y. Kuratov, A. Bulatov, P. Anokhin, I. Rodkin, D. Sorokin, A. Sorokin, and M. Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554, 2024.
  20. [20] H. Liu, M. Zaharia, and P. Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
  21. [21] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
  22. [22] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin. Scaling laws of RoPE-based extrapolation. arXiv preprint arXiv:2310.05209, 2023.
  23. [23] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  24. [24] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  25. [25] R. Pan, D. Zhang, H. Zhang, X. Pan, M. Xu, J. Zhang, R. Pi, X. Wang, and T. Zhang. ScaleBiO: Scalable bilevel optimization for LLM data reweighting. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31959–31982, 2025.
  26. [26] Z. Pei, H.-L. Zhen, S. Kai, S. J. Pan, Y. Wang, M. Yuan, and B. Yu. Scope: Prompt evolution for enhancing agent effectiveness. arXiv preprint arXiv:2512.15374, 2025.
  27. [27] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
  28. [28] O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  29. [29] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.
  30. [30] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
  31. [31] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9229–9248. PMLR, 2020.
  32. [32] A. Tandon, K. Dalal, X. Li, D. Koceja, M. Rød, S. Buchanan, X. Wang, J. Leskovec, S. Koyejo, T. Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
  33. [33] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  34. [34] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
  35. [35] Y. Wang, Y. Gao, X. Chen, H. Jiang, S. Li, J. Yang, Q. Yin, Z. Li, X. Li, B. Yin, J. Shang, and J. J. McAuley. MEMORYLLM: Towards self-updatable large language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024.
  36. [36] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  37. [37] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv e-prints, arXiv:2412, 2024.
  38. [38] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.
  39. [39] X. Ye, W. Zhang, F. Yin, H. Yen, and D. Chen. Dysco: Dynamic attention-scaling decoding for long-context LMs. arXiv preprint arXiv:2602.22175, 2026.
  40. [40] Z. Yu, L. Yang, J. Zou, S. Yan, and M. Wang. Demystifying reinforcement learning in agentic reasoning. arXiv preprint arXiv:2510.11701, 2025.
  41. [41] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.