pith. machine review for the scientific record.

arxiv: 2605.09932 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords long context · attention dilution · bilevel optimization · supervised fine-tuning · parametric memory · attention sinks · LLM training

The pith

FocuSFT's bilevel optimization forms a parametric memory to counter attention dilution in long-context fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard supervised fine-tuning on long sequences wastes attention on positional biases and sinks instead of relevant content, weakening learning. FocuSFT fixes this at training time by running an inner loop that adapts fast weights to build a memory concentrating attention, then an outer loop that fine-tunes conditioned on the improved focus. Both loops use bidirectional attention on the context to reduce causal asymmetry. This yields concrete gains: up to 14 percentage points better on BABILong across lengths, higher scores on RULER, and 24% relative improvement on GPQA. The approach also dramatically cuts attention sink mass.

Core claim

Attention dilution during SFT arises from how attention budget is allocated to positionally privileged tokens. FocuSFT addresses it through bilevel optimization: the inner loop adapts lightweight fast-weight parameters on the context to create a parametric memory that sharpens attention toward semantically relevant tokens, and the outer loop performs the supervised fine-tuning using this sharpened representation, with bidirectional attention applied to context tokens in both loops.
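
To make the mechanics concrete, the sketch below walks through one training step under this reading, in PyTorch-style Python. It is a minimal sketch, not the paper's implementation: the `model(input_ids, attention_mask=...)` interface, the masked-reconstruction inner objective, the 15% corruption rate, and the single first-order inner SGD step are all illustrative assumptions; only the overall structure (inner fast-weight adaptation on the context, outer SFT conditioned on it, bidirectional context masking with causal responses) follows the claim above.

```python
import torch
import torch.nn.functional as F

def bilevel_step(model, lora_params, outer_opt, context_ids, response_ids,
                 mask_token_id, inner_steps=1, inner_lr=1e-4):
    """One FocuSFT-style step: adapt fast weights on the context (inner loop),
    then take an SFT step on the response conditioned on them (outer loop).
    `lora_params` are assumed to live inside `model`, so inner updates carry
    over to the outer forward pass; `outer_opt` holds the base (slow) weights."""
    ctx_len, resp_len = context_ids.size(1), response_ids.size(1)
    full_ids = torch.cat([context_ids, response_ids], dim=1)

    # Inner loop: adapt only the LoRA fast weights on the context, with
    # bidirectional attention. The masked-reconstruction objective is an
    # illustrative stand-in for whatever inner loss the paper actually uses.
    inner_opt = torch.optim.SGD(lora_params, lr=inner_lr)
    bidir_mask = torch.ones(ctx_len, ctx_len, dtype=torch.bool)
    for _ in range(inner_steps):
        corrupt = torch.rand(context_ids.shape) < 0.15             # hide ~15% of context tokens
        corrupted_ids = context_ids.masked_fill(corrupt, mask_token_id)
        logits = model(corrupted_ids, attention_mask=bidir_mask)   # hypothetical interface
        inner_loss = F.cross_entropy(logits[corrupt], context_ids[corrupt])
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()        # adapted fast weights now act as the parametric memory

    # Outer loop: standard SFT loss on response tokens only, conditioned on the
    # adapted fast weights (first-order: no differentiation through the inner loop).
    total = ctx_len + resp_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal baseline
    mask[:ctx_len, :ctx_len] = True                                # bidirectional inside the context
    logits = model(full_ids, attention_mask=mask)
    resp_logits = logits[:, ctx_len - 1:-1]                        # positions that predict response tokens
    outer_loss = F.cross_entropy(resp_logits.reshape(-1, resp_logits.size(-1)),
                                 response_ids.reshape(-1))
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
    return outer_loss.detach()
```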

What carries the argument

Bilevel optimization where the inner loop creates a parametric memory via fast-weight adaptation to concentrate attention on relevant content.

If this is right

  • Accuracy on long-context benchmarks like BABILong improves by up to 14 percentage points for contexts up to 32K.
  • Attention sink mass drops by a factor of 529, with tripled engagement on context tokens.
  • CWE aggregation on RULER rises from 72.9% to 81.1% at 16K context.
  • Pass@1 on GPQA with agentic tools increases by 24% relatively.
  • The method preserves causal masking for responses while allowing bidirectional context attention.
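
The last bullet is easiest to see as a mask. Below is a self-contained sketch assuming a boolean convention in which True means attention is allowed; the released code may construct this differently.

```python
import torch

def hybrid_mask(ctx_len: int, resp_len: int) -> torch.Tensor:
    """True = attention allowed; rows are queries, columns are keys."""
    total = ctx_len + resp_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal baseline
    mask[:ctx_len, :ctx_len] = True   # lift the causal restriction inside the context
    return mask

if __name__ == "__main__":
    print(hybrid_mask(ctx_len=3, resp_len=2).int())
    # tensor([[1, 1, 1, 0, 0],   <- context queries see the whole context
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],   <- response queries see context + earlier responses
    #         [1, 1, 1, 1, 1]], dtype=torch.int32)
```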

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bilevel structures might help in other training regimes where attention or focus is diluted.
  • Lightweight inner adaptations could be extended to other forms of parametric memory beyond attention sharpening.
  • If the inner loop generalizes well, it might allow effective use of even longer contexts without proportional increases in training compute.
  • Combining FocuSFT with inference-time methods could compound the benefits for long-context tasks.

Load-bearing premise

That the inner-loop fast-weight adaptation reliably builds a parametric memory focusing on relevant content that generalizes to new contexts without introducing overfitting or bias.

What would settle it

A controlled comparison on a held-out long-context task between FocuSFT and standard SFT: if FocuSFT showed no reduction in attention sink mass and no accuracy gain there, the core claim would fail.

Figures

Figures reproduced from arXiv: 2605.09932 by Bei Yu, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan, Xianzhi Yu, Zehua Pei.

Figure 1. Overview of FocuSFT. Each training step is a bilevel optimization: the outer loop (full box) performs standard SFT on response tokens, while an inner loop (dashed box, nested inside) first adapts lightweight LoRA fast weights ϕ on the context with bidirectional attention, forming a parametric memory ϕ(K) that concentrates attention on relevant content. The outer loop then computes the SFT loss conditioned…
Figure 2. Training-time attention patterns on a 4096-token multi-turn agentic sample, comparing…
Figure 3. Attention heatmaps at a representative middle layer on a 4096-token multi-turn sample.
Figure 4. BABILong accuracy across context…
Figure 5. BABILong accuracy vs. inner-loop layer fraction. Performance peaks at lf = 0.35, balancing memory capacity and base model stability. (Companion panel: attention sink mass per layer for standard SFT vs. ours; average 0.301 vs. 0.0006, a 529× reduction.)
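
For readers asking what quantity the 529× figure in the Figure 5 panel refers to, the sketch below shows one plausible way to compute per-layer sink attention mass from attention maps. Treating the leading token(s) as the sink position is an assumption; the paper's exact definition and aggregation may differ.

```python
import torch

def sink_mass_per_layer(attn_maps, n_sink_tokens=1):
    """attn_maps: one tensor per layer, shape (batch, heads, queries, keys),
    rows already softmax-normalized. Returns the mean attention mass that
    queries place on the first `n_sink_tokens` key positions, per layer."""
    masses = []
    for attn in attn_maps:
        sink = attn[..., :n_sink_tokens].sum(dim=-1)   # mass each query gives to the sink keys
        masses.append(sink.mean().item())              # average over batch, heads, and queries
    return masses

if __name__ == "__main__":
    # Random softmaxed attention, just to show the shapes involved.
    layers = [torch.softmax(torch.randn(1, 4, 16, 16), dim=-1) for _ in range(3)]
    print([round(m, 3) for m in sink_mass_per_layer(layers)])
```

Under this reading, the reported reduction would be the ratio of the averaged per-layer masses for standard SFT versus FocuSFT (the 0.301 vs. 0.0006 quoted in the Figure 5 panel).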
Original abstract

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that attention dilution during SFT on long sequences—driven by positional biases and attention sinks—limits long-context capabilities in LLMs. It introduces FocuSFT, a bilevel optimization where an inner loop adapts lightweight fast-weight parameters on the context (with bidirectional attention) to form a parametric memory that sharpens focus on relevant content, and an outer loop performs SFT conditioned on this representation (also using bidirectional context attention and causal response masking). Reported results include up to +14pp accuracy gains on BABILong (4K–32K), CWE aggregation rising from 72.9% to 81.1% on RULER at 16K, 24% relative pass@1 improvement on GPQA with agentic tools, plus 529× reduction in attention sink mass and tripled context engagement.

Significance. If the central mechanism holds, FocuSFT provides a training-time intervention to reduce attention dilution without inference overhead, with concrete benchmark lifts and attention diagnostics that could inform future long-context work. The public code release aids reproducibility. Significance is limited by the absence of controls that would confirm the bilevel component (rather than the bidirectional masking change) as the driver of gains.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Methods): The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention on context tokens plus the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or bilevel with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.
  2. [§4] §4 (Experiments) and attention analysis: The claim that the inner-loop adaptation forms a generalizable parametric memory that concentrates attention on semantically relevant content rests on the paper's weakest assumption: that this behavior transfers beyond the training contexts and does not introduce new biases. No out-of-distribution context tests or controls for overfitting to the fast-weight parameters are reported, weakening the generalization argument.
minor comments (2)
  1. [Abstract] The abstract and method descriptions should explicitly state the base model sizes, number of fast-weight parameters, and inner-loop update steps to allow direct replication.
  2. [§4] Attention visualizations in §4 would benefit from quantitative error bars or multiple random seeds to support the 529× and 3× aggregate claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we respond point-by-point to the major comments, indicating where revisions will be made to address the concerns.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Methods): The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention on context tokens plus the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or bilevel with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.

    Authors: We agree that the current experiments do not isolate the contribution of the bilevel optimization from the bidirectional attention change. Bidirectional masking on context tokens is an integral design choice in FocuSFT to reduce causal asymmetry and align inner- and outer-loop behavior, as described in §3. Nevertheless, the referee is correct that this prevents direct verification of whether the inner-loop fast-weight adaptation is the primary driver of the reported attention sharpening and benchmark gains. In the revised manuscript we will add an ablation that compares (i) standard causal SFT, (ii) bidirectional-context SFT without the inner loop, and (iii) full FocuSFT, allowing the specific effect of the bilevel component to be quantified. revision: yes

  2. Referee: [§4] §4 (Experiments) and attention analysis: The claim that the inner-loop adaptation forms a generalizable parametric memory that concentrates attention on semantically relevant content rests on the paper's weakest assumption: that this behavior transfers beyond the training contexts and does not introduce new biases. No out-of-distribution context tests or controls for overfitting to the fast-weight parameters are reported, weakening the generalization argument.

    Authors: The generalization claim is currently supported by consistent gains across BABILong (4K–32K), RULER at 16K, and GPQA, together with the attention diagnostics in §4 that show reduced sink mass and higher context engagement. These results indicate that the outer-loop SFT learns to utilize the inner-loop memory on the evaluated tasks. We acknowledge, however, that explicit out-of-distribution context tests and controls for potential overfitting of the fast-weight parameters (e.g., regularization or held-out context distributions) are absent from the submitted version. We will add a dedicated analysis section with such controls in the revision to strengthen the generalization argument. revision: yes
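
To make the comparison promised in response 1 concrete, the ablation grid might look like the sketch below, including the extra arm the referee suggested (bilevel with causal context masking). The arm names and flags are hypothetical, not taken from the paper or its code.

```python
from dataclasses import dataclass

@dataclass
class ArmConfig:
    name: str
    bidirectional_context: bool   # lift causal masking over context tokens?
    inner_loop: bool              # run the fast-weight adaptation (parametric memory)?

ABLATION_ARMS = [
    ArmConfig("standard_causal_sft",    bidirectional_context=False, inner_loop=False),
    ArmConfig("bidirectional_sft_only", bidirectional_context=True,  inner_loop=False),
    ArmConfig("focusft_full",           bidirectional_context=True,  inner_loop=True),
    # The extra arm the referee suggested: bilevel with causal context masking.
    ArmConfig("bilevel_causal_context", bidirectional_context=False, inner_loop=True),
]

for arm in ABLATION_ARMS:
    print(f"{arm.name}: bidir_ctx={arm.bidirectional_context}, inner_loop={arm.inner_loop}")
```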

Circularity Check

0 steps flagged

No circularity; derivation introduces independent bilevel mechanism validated on external benchmarks

Full rationale

The paper defines FocuSFT as a new bilevel optimization procedure with an inner-loop fast-weight adaptation and outer-loop SFT, both using bidirectional context attention. This is presented as a methodological contribution rather than a derivation that reduces to its own inputs. Results are grounded in independent external benchmarks (BABILong, RULER, GPQA) and post-hoc attention measurements, none of which are defined in terms of the target quantities. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The bidirectional masking is an explicit design choice in the method, not a hidden tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract alone does not specify numerical free parameters or background axioms; the key introduced concept is the parametric memory formed by fast-weight adaptation.

invented entities (1)
  • lightweight fast-weight parameters (no independent evidence)
    purpose: to form a parametric memory that concentrates attention on relevant content in the inner loop
    Described as adapted on the training context to sharpen the representation used by the outer SFT loop.

pith-pipeline@v0.9.0 · 5606 in / 1273 out tokens · 46431 ms · 2026-05-12T04:31:40.232258+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 15 internal anchors

  1. [1] J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems, 29, 2016.
  2. [2] Y. Bai, X. Lv, J. Zhang, Y. He, J. Qi, L. Hou, J. Tang, Y. Dong, and J. Li. LongAlign: A recipe for long context alignment of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1376–1395, 2024.
  3. [3] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
  4. [4] R. Bansal, A. Zhang, R. Tiwari, L. Madaan, S. S. Duvvuri, D. Khatri, D. Brandfonbrener, D. Alvarez-Melis, P. Bhargava, M. S. Kale, et al. Let's (not) just put things in context: Test-time training for long-context LLMs. arXiv preprint arXiv:2512.13898, 2025.
  5. [5] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  6. [6] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In The International Conference on Learning Representations (ICLR), 2024.
  7. [7] Y. Chen, S. Yu, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia. Long Alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023.
  8. [8] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  9. [9] K. Clark, K. Guu, M.-W. Chang, P. Pasupat, G. Hinton, and M. Norouzi. Meta-learning fast weight language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9751–9757, 2022.
  10. [10] Y. Ding, L. L. Zhang, C. Zhang, Y. Xu, N. Shang, J. Xu, F. Yang, and M. Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
  11. [11] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
  12. [12] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  13. [13] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  14. [14] G. E. Hinton and D. C. Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.
  15. [15] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
  16. [16] C.-Y. Hsieh, Y.-S. Chuang, C.-L. Li, Z. Wang, L. Le, A. Kumar, J. Glass, A. Ratner, C.-Y. Lee, R. Krishna, et al. Found in the middle: Calibrating positional attention bias improves long context utilization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14982–14995, 2024.
  17. [17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  18. [18] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
  19. [19] Y. Kuratov, A. Bulatov, P. Anokhin, I. Rodkin, D. Sorokin, A. Sorokin, and M. Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554, 2024.
  20. [20] H. Liu, M. Zaharia, and P. Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
  21. [21] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
  22. [22] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin. Scaling laws of RoPE-based extrapolation. arXiv preprint arXiv:2310.05209, 2023.
  23. [23] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  24. [24] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  25. [25] R. Pan, D. Zhang, H. Zhang, X. Pan, M. Xu, J. Zhang, R. Pi, X. Wang, and T. Zhang. ScaleBiO: Scalable bilevel optimization for LLM data reweighting. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31959–31982, 2025.
  26. [26] Z. Pei, H.-L. Zhen, S. Kai, S. J. Pan, Y. Wang, M. Yuan, and B. Yu. Scope: Prompt evolution for enhancing agent effectiveness. arXiv preprint arXiv:2512.15374, 2025.
  27. [27] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
  28. [28] O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  29. [29] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.
  30. [30] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
  31. [31] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9229–9248. PMLR, 2020.
  32. [32] A. Tandon, K. Dalal, X. Li, D. Koceja, M. Rød, S. Buchanan, X. Wang, J. Leskovec, S. Koyejo, T. Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
  33. [33] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  34. [34] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
  35. [35] Y. Wang, Y. Gao, X. Chen, H. Jiang, S. Li, J. Yang, Q. Yin, Z. Li, X. Li, B. Yin, J. Shang, and J. J. McAuley. MEMORYLLM: Towards self-updatable large language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024.
  36. [36] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  37. [37] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv e-prints, arXiv:2412, 2024.
  38. [38] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.
  39. [39] X. Ye, W. Zhang, F. Yin, H. Yen, and D. Chen. Dysco: Dynamic attention-scaling decoding for long-context LMs. arXiv preprint arXiv:2602.22175, 2026.
  40. [40] Z. Yu, L. Yang, J. Zou, S. Yan, and M. Wang. Demystifying reinforcement learning in agentic reasoning. arXiv preprint arXiv:2510.11701, 2025.
  41. [41] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.