FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3
The pith
FocuSFT's bilevel optimization forms a parametric memory to counter attention dilution in long-context fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention dilution during SFT arises from how attention budget is allocated to positionally privileged tokens. FocuSFT addresses it through bilevel optimization: the inner loop adapts lightweight fast-weight parameters on the context to create a parametric memory that sharpens attention toward semantically relevant tokens, and the outer loop performs the supervised fine-tuning using this sharpened representation, with bidirectional attention applied to context tokens in both loops.
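To make the two-loop structure concrete, here is a minimal PyTorch sketch of the alternation described above. Everything specific is an assumption rather than the paper's implementation: a HuggingFace-style model exposing `.logits`, a self-supervised next-token inner loss, plain SGD on the fast weights, and the step count; the bidirectional context masking and the exact fast-weight parameterization are omitted.

```python
import torch
import torch.nn.functional as F

def inner_adapt(model, fast_params, context_ids, steps=3, lr=1e-3):
    # Inner loop: update only the fast-weight parameters on the context,
    # here with an assumed self-supervised next-token loss, to form the
    # parametric memory that conditions the outer SFT step.
    opt = torch.optim.SGD(fast_params, lr=lr)
    for _ in range(steps):
        logits = model(context_ids).logits
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            context_ids[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()

def outer_step(model, slow_opt, context_ids, response_ids):
    # Outer loop: ordinary SFT cross-entropy on response tokens only,
    # computed on the representation shaped by the adapted fast weights.
    input_ids = torch.cat([context_ids, response_ids], dim=1)
    logits = model(input_ids).logits
    # Positions c-1 .. end-1 predict the response tokens.
    resp_logits = logits[:, context_ids.size(1) - 1 : -1]
    loss = F.cross_entropy(
        resp_logits.reshape(-1, resp_logits.size(-1)),
        response_ids.reshape(-1),
    )
    slow_opt.zero_grad()
    loss.backward()
    slow_opt.step()
    return loss.item()
```

A full bilevel scheme would differentiate the outer loss through the inner updates (as in MAML-style meta-learning); this sketch treats the loops as alternating first-order steps, which may or may not match the paper's coupling.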
What carries the argument
Bilevel optimization where the inner loop creates a parametric memory via fast-weight adaptation to concentrate attention on relevant content.
If this is right
- Accuracy on long-context benchmarks like BABILong improves by up to 14 percentage points across 4K–32K context lengths.
- Attention sink mass drops by a factor of 529, with tripled engagement on context tokens.
- CWE aggregation on RULER rises from 72.9% to 81.1% at 16K context.
- Pass@1 on GPQA with agentic tool use improves by a relative 24%.
- The method preserves causal masking for responses while allowing bidirectional attention over context tokens (a mask sketch follows this list).
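One reading of this masking scheme, sketched below under the assumption that context tokens occupy the first `c` positions of the sequence; this illustrates the described mask shape, not the authors' code.

```python
import torch

def hybrid_attention_mask(c: int, r: int) -> torch.Tensor:
    # Boolean (c+r, c+r) mask; True means the query row may attend
    # to the key column.
    n = c + r
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    mask[:c, :c] = True  # lift causality inside the context block
    return mask

# Example with 3 context and 2 response tokens: context rows see the
# whole context block; response rows remain strictly causal.
print(hybrid_attention_mask(3, 2).int())
```

Response tokens already see the full context under the causal baseline, and context tokens never attend to responses, so only the intra-context block changes.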
Where Pith is reading between the lines
- Similar bilevel structures might help in other training regimes where attention or focus is diluted.
- Lightweight inner adaptations could be extended to other forms of parametric memory beyond attention sharpening.
- If the inner loop generalizes well, it might allow effective use of even longer contexts without proportional increases in training compute.
- Combining FocuSFT with inference-time methods could compound the benefits for long-context tasks.
Load-bearing premise
That the inner-loop fast-weight adaptation reliably builds a parametric memory that focuses on relevant content and generalizes to new contexts without overfitting or introducing bias.
What would settle it
A controlled experiment on a held-out long-context task: if FocuSFT showed no reduction in attention sink mass and no accuracy improvement over standard SFT, the core claim would be refuted.
Original abstract
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that attention dilution during SFT on long sequences—driven by positional biases and attention sinks—limits long-context capabilities in LLMs. It introduces FocuSFT, a bilevel optimization where an inner loop adapts lightweight fast-weight parameters on the context (with bidirectional attention) to form a parametric memory that sharpens focus on relevant content, and an outer loop performs SFT conditioned on this representation (also using bidirectional context attention and causal response masking). Reported results include up to +14pp accuracy gains on BABILong (4K–32K), CWE aggregation rising from 72.9% to 81.1% on RULER at 16K, 24% relative pass@1 improvement on GPQA with agentic tools, plus 529× reduction in attention sink mass and tripled context engagement.
Significance. If the central mechanism holds, FocuSFT provides a training-time intervention to reduce attention dilution without inference overhead, with concrete benchmark lifts and attention diagnostics that could inform future long-context work. The public code release aids reproducibility. Significance is limited by the absence of controls that would confirm the bilevel component (rather than the bidirectional masking change) as the driver of gains.
major comments (2)
- [Abstract, §3 (Methods)] The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention over context tokens and the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or the bilevel procedure with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.
- [§4 (Experiments), attention analysis] The claim that the inner-loop adaptation forms a generalizable parametric memory concentrating attention on semantically relevant content rests on its weakest assumption: that this behavior transfers beyond the training contexts without introducing new biases. No out-of-distribution context tests or controls for overfitting of the fast-weight parameters are reported, weakening the generalization argument.
minor comments (2)
- [Abstract] The abstract and method descriptions should explicitly state the base model sizes, number of fast-weight parameters, and inner-loop update steps to allow direct replication.
- [§4] Attention visualizations in §4 would benefit from quantitative error bars or multiple random seeds to support the 529× and 3× aggregate claims (a measurement sketch follows this list).
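The page does not define "attention sink mass"; a common operationalization is the attention probability assigned to the first key position(s). Assuming that definition, a minimal sketch of the seed-averaged diagnostic this comment asks for:

```python
import numpy as np

def sink_mass(attn: np.ndarray, n_sink: int = 1) -> float:
    # attn: (layers, heads, queries, keys) attention probabilities.
    # Mean fraction of attention mass on the first n_sink key positions.
    return float(attn[..., :n_sink].sum(axis=-1).mean())

def across_seeds(attn_per_seed) -> tuple[float, float]:
    # Mean and sample standard deviation over training seeds: the error
    # bar that would back the 529x and 3x aggregate claims.
    vals = np.array([sink_mass(a) for a in attn_per_seed])
    return float(vals.mean()), float(vals.std(ddof=1))
```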
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. Below we respond point-by-point to the major comments, indicating where revisions will be made to address the concerns.
Point-by-point responses
- Referee: [Abstract, §3 (Methods)] The 529× attention-sink-mass reduction and all benchmark lifts are measured under the joint introduction of bidirectional attention over context tokens and the inner-loop fast-weight adaptation. No ablation is described that runs standard causal SFT with only the bidirectional change (or the bilevel procedure with causal masking), so the load-bearing claim that the bilevel optimization itself produces the dilution-aware sharpening cannot be verified from the reported experiments.
  Authors: We agree that the current experiments do not isolate the contribution of the bilevel optimization from the bidirectional attention change. Bidirectional masking on context tokens is an integral design choice in FocuSFT, intended to reduce causal asymmetry and align inner- and outer-loop behavior, as described in §3. Nevertheless, the referee is correct that this prevents direct verification of whether the inner-loop fast-weight adaptation is the primary driver of the reported attention sharpening and benchmark gains. In the revised manuscript we will add an ablation comparing (i) standard causal SFT, (ii) bidirectional-context SFT without the inner loop, and (iii) full FocuSFT, allowing the specific effect of the bilevel component to be quantified. Revision: yes.
- Referee: [§4 (Experiments), attention analysis] The claim that the inner-loop adaptation forms a generalizable parametric memory concentrating attention on semantically relevant content rests on its weakest assumption: that this behavior transfers beyond the training contexts without introducing new biases. No out-of-distribution context tests or controls for overfitting of the fast-weight parameters are reported, weakening the generalization argument.
  Authors: The generalization claim is currently supported by consistent gains across BABILong (4K–32K), RULER at 16K, and GPQA, together with the attention diagnostics in §4 showing reduced sink mass and higher context engagement. These results indicate that the outer-loop SFT learns to use the inner-loop memory on the evaluated tasks. We acknowledge, however, that explicit out-of-distribution context tests and controls for potential overfitting of the fast-weight parameters (e.g., regularization or held-out context distributions) are absent from the submitted version. We will add a dedicated analysis with such controls in the revision to strengthen the generalization argument. Revision: yes.
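A hypothetical configuration grid for the three-arm ablation promised above; the arm names and fields are illustrative, not taken from the paper.

```python
# Three arms that would isolate the bilevel component from the
# bidirectional masking change (names and fields are assumptions).
ABLATION_ARMS = [
    {"name": "causal_sft",   "context_attention": "causal",        "inner_loop": False},
    {"name": "bidir_sft",    "context_attention": "bidirectional", "inner_loop": False},
    {"name": "focusft_full", "context_attention": "bidirectional", "inner_loop": True},
]

for arm in ABLATION_ARMS:
    print(f"{arm['name']}: mask={arm['context_attention']}, inner_loop={arm['inner_loop']}")
```

Comparing (ii) against (i) prices the masking change alone; comparing (iii) against (ii) prices the inner loop on top of it.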
Circularity Check
No circularity: the paper introduces an independent bilevel mechanism validated on external benchmarks.
Full rationale
The paper defines FocuSFT as a new bilevel optimization procedure with an inner-loop fast-weight adaptation and outer-loop SFT, both using bidirectional context attention. This is presented as a methodological contribution rather than a derivation that reduces to its own inputs. Results are grounded in independent external benchmarks (BABILong, RULER, GPQA) and post-hoc attention measurements, none of which are defined in terms of the target quantities. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The bidirectional masking is an explicit design choice in the method, not a hidden tautology.
Axiom & Free-Parameter Ledger
invented entities (1)
- lightweight fast-weight parameters — no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction — unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "FocuSFT reduces attention sink mass by 529× and triples context engagement during training."
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.