Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation
Pith reviewed 2026-05-20 23:37 UTC · model grok-4.3
The pith
Federated learning can be recast as nested optimization so clients learn their own adaptation rules via linear attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedNL reformulates federated learning as a three-level nested optimization system. It embeds Titans-based linear attention into the framework so that clients maintain self-referential memories. These memories treat a delta rule as an online gradient step, which enables lightweight, zero-shot test-time adaptation. Experiments on non-IID MMLU and long-context tasks show competitive short-context reasoning, gains in long-context retrieval and streaming cross-entropy, and constant inference memory.
What carries the argument
Three-level nested optimization that embeds Titans-based linear attention to train self-referential memories, allowing a delta rule to act as an online gradient step for client-side adaptation.
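To make the "delta rule as an online gradient step" reading concrete, here is a minimal sketch. It is not the paper's implementation. It assumes the memory is a matrix S updated as S_t = S_{t-1} + η(v_t - S_{t-1} k_t) k_t^⊤, the form quoted from the paper elsewhere in this review. It also assumes a squared-error loss 0.5 * ||v_t - S k_t||^2, chosen here because its gradient reproduces that update. All function names are hypothetical.

```python
# Illustrative sketch only (hypothetical helper names, toy sizes).
# Checks numerically that the delta rule
#   S_t = S_{t-1} + eta * (v_t - S_{t-1} k_t) k_t^T
# matches one gradient-descent step on the ASSUMED loss
#   L(S) = 0.5 * ||v_t - S k_t||^2.

def matvec(S, k):
    """Multiply matrix S (list of rows) by vector k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def loss(S, k, v):
    """Squared reconstruction error of value v from key k."""
    pred = matvec(S, k)
    return 0.5 * sum((v_i - p_i) ** 2 for v_i, p_i in zip(v, pred))

def delta_rule_step(S, k, v, eta):
    """Closed-form delta rule update of the memory matrix."""
    pred = matvec(S, k)
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]
    return [[S[i][j] + eta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

def numerical_grad(S, k, v, h=1e-6):
    """Central-difference gradient of the loss with respect to S."""
    grad = [[0.0] * len(S[0]) for _ in S]
    for i in range(len(S)):
        for j in range(len(S[0])):
            S_plus = [row[:] for row in S]
            S_minus = [row[:] for row in S]
            S_plus[i][j] += h
            S_minus[i][j] -= h
            grad[i][j] = (loss(S_plus, k, v) - loss(S_minus, k, v)) / (2 * h)
    return grad

def gradient_step(S, k, v, eta):
    """Plain gradient descent on the assumed loss."""
    g = numerical_grad(S, k, v)
    return [[S[i][j] - eta * g[i][j] for j in range(len(S[0]))]
            for i in range(len(S))]

S0 = [[0.1, -0.2], [0.3, 0.0]]   # toy memory state
k = [1.0, 2.0]                   # toy key
v = [0.5, -1.0]                  # toy value
eta = 0.1

S_delta = delta_rule_step(S0, k, v, eta)
S_grad = gradient_step(S0, k, v, eta)
max_gap = max(abs(a - b)
              for row_a, row_b in zip(S_delta, S_grad)
              for a, b in zip(row_a, row_b))
```

Under these assumptions the two updates agree up to finite-difference rounding, which is the sense in which a memory write can be read as a training step that needs no separate optimizer at test time.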
If this is right
- Clients reach competitive accuracy on short-context reasoning benchmarks under non-IID conditions.
- Long-context retrieval and streaming cross-entropy scores improve over baseline federated methods.
- Inference memory stays constant even as context length grows.
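The constant-memory claim can be illustrated with a toy stream. The sketch below assumes a fixed-size matrix memory updated by the delta rule, and contrasts it with a key-value cache that stores every token. The cache is included only for comparison and is an assumption of this example, not something the review describes. Dimensions, token values, and function names are hypothetical.

```python
# Illustrative sketch only: state size of a delta-rule memory versus a
# growing key-value cache as the stream gets longer. Toy data throughout.

def delta_rule_step(S, k, v, eta):
    """One delta rule write: S <- S + eta * (v - S k) k^T."""
    pred = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]
    return [[S[i][j] + eta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

def run_stream(num_tokens, d=4, eta=0.05):
    """Process a synthetic stream and report stored floats for both designs."""
    S = [[0.0] * d for _ in range(d)]   # fixed d x d memory
    kv_cache = []                        # grows by one (k, v) pair per token
    for t in range(num_tokens):
        # Deterministic toy keys and values; real inputs would be embeddings.
        k = [((t + j) % 3 - 1) * 0.5 for j in range(d)]
        v = [float((t * (j + 1)) % 2) for j in range(d)]
        S = delta_rule_step(S, k, v, eta)
        kv_cache.append((k, v))
    state_floats = sum(len(row) for row in S)
    cache_floats = sum(len(k) + len(v) for k, v in kv_cache)
    return state_floats, cache_floats

short_run = run_stream(10)
long_run = run_stream(1000)
```

In this toy setting the memory holds d * d floats for any stream length, while the cache grows linearly with the number of tokens.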
Where Pith is reading between the lines
- The same nesting idea could cut communication rounds by moving most adaptation work to local test-time steps.
- Similar three-level structures might transfer to other distributed or multi-agent training settings where data differs across nodes.
- Further tests could show whether the linear attention component is essential or whether other memory-update rules would work equally well.
Load-bearing premise
Turning federated learning into three nested optimization levels with linear attention will let clients handle differing data distributions reliably without training instability or heavy tuning.
What would settle it
Run FedNL on a fresh collection of non-IID datasets and check whether test-time accuracy stays flat or drops while memory usage or training variance rises compared with ordinary federated averaging.
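For reference, the "ordinary federated averaging" baseline in such a comparison can be sketched as a sample-size-weighted mean of client parameters. This is a minimal illustration with flat parameter lists and a hypothetical function name. It omits local training, client sampling, and communication, and it says nothing about how FedNL itself aggregates.

```python
# Illustrative sketch only: the aggregation step of federated averaging,
# with parameters represented as flat lists of floats.

def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * params[i] for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

# Two toy clients; the second holds three times as much data.
global_params = fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3])
```

A fair test of the premise would track accuracy, inference memory, and run-to-run variance for both this baseline and FedNL on the same non-IID splits.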
Original abstract
We rethink Federated Learning (FL) from a nested learning perspective, framing the core challenge as how to collaboratively learn optimization rules, not just static models, to tackle Non-IID client data. To address this, we propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system. FedNL embeds Titans-based linear attention into FL, enabling clients to perform lightweight, zero-shot test-time adaptation by treating a delta rule as an online gradient step. Experiments on Non-IID MMLU and long-context benchmarks show that FedNL achieves competitive performance in short-context reasoning, enhances the performance of long-context retrieval and streaming Cross-Entropy, and maintains constant inference memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Federated Nested Learning (FedNL), which reformulates federated learning as a three-level nested optimization system. It embeds Titans-based linear attention to collaboratively train self-referential memories, enabling clients to perform lightweight zero-shot test-time adaptation by treating a delta rule as an online gradient step. Experiments on Non-IID MMLU and long-context benchmarks are reported to show competitive short-context reasoning and improved long-context retrieval and streaming cross-entropy, with constant inference memory.
Significance. Should the results hold, FedNL could offer a significant contribution by moving beyond traditional model averaging in FL to learning adaptive optimization rules through nested structures. The use of linear attention for self-referential memories provides a mechanism for test-time adaptation that addresses Non-IID challenges. The constant inference memory is a strength for practical applications. The approach builds on recent advances in linear attention models and could inspire further work on meta-learning in distributed settings, provided the stability concerns are addressed.
Major comments (2)
- Abstract: The abstract asserts competitive performance on Non-IID MMLU without providing any numerical results, standard deviations, or ablation studies that isolate the effect of the nested optimization and self-referential memories. This is a major gap because the central claim depends on these memories acting as stable rules under heterogeneous data; without such evidence, the performance gains cannot be attributed to the proposed framework.
- Formulation: The three-level nested optimization lacks description of any stabilization mechanism (such as client weighting or regularization) in the outer aggregation step. Given that standard FL suffers from client drift and the delta-rule updates remove the averaging anchor, this omission risks instability when client data distributions differ sharply, potentially invalidating the assumption that collaboratively learned memories remain consistent across clients.
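One stabilization mechanism of the kind the referee asks about is a proximal penalty that pulls each client's local update back toward the global parameters, in the style of FedProx. The sketch below is an illustration under toy assumptions: a quadratic local loss, hypothetical function names, and arbitrary values for the step size and the penalty weight `mu`. Nothing in this review indicates that FedNL uses this mechanism.

```python
# Illustrative sketch only: effect of a FedProx-style proximal term on
# client drift. ASSUMED local loss: 0.5 * ||w - client_opt||^2.

def local_update(w_global, client_opt, steps, lr, mu):
    """Run local gradient steps with penalty (mu / 2) * ||w - w_global||^2."""
    w = list(w_global)
    for _ in range(steps):
        w = [
            w_j - lr * ((w_j - c_j) + mu * (w_j - g_j))
            for w_j, c_j, g_j in zip(w, client_opt, w_global)
        ]
    return w

def drift(w, w_global):
    """Euclidean distance between local and global parameters."""
    return sum((a - b) ** 2 for a, b in zip(w, w_global)) ** 0.5

w_global = [0.0, 0.0]
client_opt = [4.0, -2.0]   # toy optimum of one client's skewed local data

w_plain = local_update(w_global, client_opt, steps=50, lr=0.1, mu=0.0)
w_prox = local_update(w_global, client_opt, steps=50, lr=0.1, mu=1.0)

drift_plain = drift(w_plain, w_global)
drift_prox = drift(w_prox, w_global)
```

With `mu=0` the client converges toward its own optimum. With `mu=1` it settles near the midpoint between the global parameters and that optimum, which caps how far a single heterogeneous client can move before aggregation.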
Minor comments (2)
- Abstract: The term 'Titans-based linear attention' is used without citation or brief explanation, which could hinder accessibility for readers not familiar with the Titans architecture.
- Experiments: The description of the long-context benchmarks and streaming Cross-Entropy evaluation lacks specifics on sequence lengths or streaming protocols used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important aspects of clarity in the abstract and formulation that we will address to strengthen the manuscript. We respond to each major comment below.
- Referee: Abstract: The abstract asserts competitive performance on Non-IID MMLU without providing any numerical results, standard deviations, or ablation studies that isolate the effect of the nested optimization and self-referential memories. This is a major gap because the central claim depends on these memories acting as stable rules under heterogeneous data; without such evidence, the performance gains cannot be attributed to the proposed framework.
  Authors: We agree that the abstract, as a concise summary, would be strengthened by including key numerical results to support the claims. The full manuscript reports detailed results on Non-IID MMLU in the experiments section, including performance metrics with standard deviations and comparisons. We will revise the abstract to incorporate specific quantitative findings from these experiments. Ablation studies isolating the contributions of the nested optimization and self-referential memories are already present in the main text; we will add an explicit reference to them in the revised abstract and introduction to better attribute performance gains.
  Revision: yes
- Referee: Formulation: The three-level nested optimization lacks description of any stabilization mechanism (such as client weighting or regularization) in the outer aggregation step. Given that standard FL suffers from client drift and the delta-rule updates remove the averaging anchor, this omission risks instability when client data distributions differ sharply, potentially invalidating the assumption that collaboratively learned memories remain consistent across clients.
  Authors: We thank the referee for identifying this potential concern regarding stability and client drift. Our three-level nested optimization uses the outer aggregation to collaboratively train self-referential memories via linear attention, which serves as an implicit mechanism for learning consistent optimization rules across clients. The delta-rule updates are local, but the collaborative meta-learning at the outer level is intended to provide an anchor. However, we acknowledge that an explicit description of stabilization (e.g., regularization effects from the attention mechanism or aggregation details) is not sufficiently elaborated. We will add a dedicated paragraph or subsection in the formulation section to describe the outer aggregation step, discuss its role in mitigating drift under Non-IID conditions, and include any relevant analysis or empirical checks from our experiments.
  Revision: yes
Circularity Check
No circularity: abstract introduces nested optimization without self-referential equations or fitted predictions
The abstract frames FL as three-level nested optimization and embeds Titans-based linear attention for delta-rule test-time adaptation, but supplies no equations, derivations, or parameter-fitting steps. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the visible text. The central claim of collaborative self-referential memories therefore remains an independent modeling choice rather than a tautology or statistical artifact of its own inputs. The derivation chain is self-contained against external benchmarks and receives score 0.
Axiom & Free-Parameter Ledger
Invented entities (1)
- self-referential memories: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery and embed_strictMono_of_one_lt (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: We propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system... Delta Rule update: S_t = S_{t-1} + η(v_t - S_{t-1} k_t) k_t^⊤
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: The global meta-parameters θ∗ do not need to encode the conflict between Code and Medical knowledge... universal rule
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.