FedSDR: Federated Self-Distillation with Rectification

Hao Wang; Ning Liu; You Song; Zhanming Shen; Ziheng Ren

arxiv: 2605.18028 · v1 · pith:UZAH6UTEnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

FedSDR: Federated Self-Distillation with Rectification

Ziheng Ren , Zhanming Shen , Hao Wang , Ning Liu , You Song This is my paper

Pith reviewed 2026-05-20 12:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords federated learningself-distillationLoRAlarge language modelsdata heterogeneityhallucinationsmodel rectification

0 comments

The pith

FedSDR uses dual LoRA streams to resolve the rewrite paradox in federated self-distillation for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that federated self-distillation improves performance on heterogeneous client data by smoothing representations into a model-understanding space. However, it can lead to the rewrite paradox where hallucinations and redundancy increase. FedSDR addresses this by adding a dual-stream LoRA setup with a smoothing branch on distilled data and a rectification branch on raw data. Selectively aggregating only the rectification branch produces a model that is both globally aligned and factually correct. This matters for practical federated fine-tuning of LLMs where data privacy and distribution differences are key challenges.

Core claim

The central claim is that by establishing federated self-distillation as a base strategy and then augmenting it with a dual-stream mechanism using local LoRA-S for heterogeneity absorption and global LoRA-R for factual enforcement, followed by selective aggregation of only LoRA-R, the approach yields a globally aligned and faithful model.

What carries the argument

The dual-stream LoRA mechanism with a local LoRA-S branch that absorbs heterogeneity via distilled data and a parallel global LoRA-R branch anchored to raw data to enforce factual correctness.

If this is right

FedSD alone outperforms conventional federated algorithms as a universal booster.
Adding the rectification branch reduces hallucinations and redundancy from unconstrained distillation.
Selective aggregation of LoRA-R produces models that maintain factual correctness while handling statistical heterogeneity.
The framework provides superior performance in federated fine-tuning of large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of smoothing and rectification roles could apply to other distillation-based training methods facing similar trade-offs.
Anchoring to raw data for one branch might be a general technique to preserve truthfulness in knowledge transfer across distributed systems.
Further tests could check if this dual approach scales to larger model sizes or different modalities.

Load-bearing premise

The rewrite paradox is a genuine and fixable issue in self-distillation, and the dual LoRA streams can enforce factual correctness without creating new problems.

What would settle it

Observing that models from selective LoRA-R aggregation exhibit higher rates of hallucinations or lower accuracy than standard self-distillation on factual evaluation benchmarks would disprove the benefit of the rectification mechanism.

Figures

Figures reproduced from arXiv: 2605.18028 by Hao Wang, Ning Liu, You Song, Zhanming Shen, Ziheng Ren.

**Figure 1.** Figure 1: The Data Refinery Process. (a) Conceptual Illustration: FedSD leverages the model’s Innate Knowledge Distribution as a universal manifold to re-map disjoint, client-specific raw data (gray) into a unified Model Understanding Space (red), transforming sharp data boundaries into a smoothed manifold. (bc) Empirical Validation: t-SNE visualization on the Databricks Dolly-15K dataset confirms the “manifold f… view at source ↗

**Figure 2.** Figure 2: The FedSDR Framework. Our paradigm mitigates statistical heterogeneity through a three-stage refinery and rectification pipeline. Module 1 (Data Refinery): As shown in Figure 1a, heterogeneous correct raw data is projected into a unified “ModelUnderstanding Space” via self-distillation to generate aligned rewritten data. Module 2 (Dual-Stream Rectification): We adopt an alternating optimization strategy w… view at source ↗

**Figure 3.** Figure 3: The Rewrite Paradox. (a) Distilled data introduces factual hallucinations. (b) Re-written responses become significantly longer, leading to redundancy. (c) The model reinforces its own stylistic biases, doubling the frequency of filler words. 2023), Alpaca(Taori et al., 2023), and MedAlpaca(Han et al., 2023)). Text-Distribution Evidence. We measure inter-client divergence before and after self-distillatio… view at source ↗

**Figure 4.** Figure 4: Training loss comparison: (a) shows the average training loss across different methods, and (b) highlights the specific loss for the uploaded LoRA-R. Loss Dynamics and Convergence Stability. To investigate the optimization behavior of FedSDR, we visualize the training loss trajectories in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt template for Overall Score Multi-Task Performance Evaluation. To assess the model’s versatility across diverse instruction categories, we employ a task-specific expert evaluator prompt ( [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt template for Multi-Task Performance Comparison Extended Ablation Study Analysis. For our ablation experiments, we design a scientific reviewer prompt ( [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt template for Extended Ablation Study 15 [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt template for Heterogeneity Robustness Comparison F. Cases Factual Hallucination. As shown in [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Factual Hallucination. Verbosity and Information Dilution. The tendency of distilled models to generate extraneous content is illustrated in [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Verbosity and Information Dilution. Instruction: Given a reference text that describes hotpot, from which country does this dish originate? Correct Data: Hot pot or Hotpot is originated from China. Rewrite Data: Based on the reference text provided, I can confirm that hot pot or hotpot originates from China. The text explicitly states that [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Stylistic Bias and AI Patterning. Logical Inconsistency. A classic example of the “Rewrite Paradox” is captured in [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Logical Inconsistency. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

read the original abstract

Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed ``model-understanding space,'' FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox -- unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FedSDR adds a dual LoRA branch to federated self-distillation and aggregates only the rectification part to reduce hallucinations while handling client data differences.

read the letter

The main point is a targeted fix for federated LLM fine-tuning: they treat self-distillation as a base method that smooths client representations, then add a parallel rectification branch on raw local data and aggregate only that branch to keep the global model more factually grounded. The Rewrite Paradox they name is the claim that unconstrained distillation can increase hallucinations and redundancy, and the dual-stream setup is meant to separate the smoothing benefit from the correction step. Selectively sharing the raw-data LoRA is the concrete move that distinguishes this from plain FedSD. This is a practical engineering response to statistical heterogeneity, and the use of LoRA keeps the overhead low, which fits real deployment constraints. The framing as a universal booster plus a specific patch is clear enough to follow. The soft spot is whether averaging the rectification LoRAs across heterogeneous clients actually produces consistent faithfulness. Different local raw-data distributions could embed conflicting corrections, and simple averaging might dilute the effect or let hallucinations reappear at the global level. It is also unclear from the description whether the deployed global model keeps the smoothing branch or drops it, which would remove the heterogeneity absorption that FedSD was supposed to provide. The abstract mentions extensive experiments, but without details on hallucination metrics, ablation on the selective aggregation, or direct comparisons to strong non-distillation baselines, the strength of the evidence is hard to judge. This is for people working on federated or distributed LLM training who already know the heterogeneity problem. A reader looking for concrete adaptation tricks would find usable ideas here. It deserves peer review because the mechanism is specific and the setting is relevant, even if the evaluation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes FedSDR as an extension of Federated Self-Distillation (FedSD) for fine-tuning LLMs under statistical heterogeneity. It identifies a 'Rewrite Paradox' in which unconstrained self-distillation increases hallucinations and redundancy. FedSDR augments FedSD with a dual-stream LoRA architecture consisting of a local LoRA-S branch that absorbs heterogeneity via distilled data and a parallel LoRA-R branch anchored to raw data for factual rectification. Selectively aggregating only the LoRA-R parameters is claimed to produce a globally aligned and faithful model, with extensive experiments asserted to demonstrate superiority over conventional federated algorithms.

Significance. If the central claims are substantiated, the work could offer a practical approach to improving factual faithfulness in federated LLM training while retaining self-distillation benefits. The dual-branch design and the identification of the Rewrite Paradox represent potentially useful conceptual contributions to handling heterogeneity and hallucination trade-offs in federated settings. The focus on LoRA-based rectification is timely given the prevalence of parameter-efficient methods in LLM fine-tuning.

major comments (2)

[Method (dual-stream LoRA mechanism)] The core claim that selectively aggregating only LoRA-R produces a globally aligned and factually faithful model (stated in the abstract and elaborated in the method) rests on the unexamined assumption that local rectification updates remain consistent under averaging across heterogeneous raw-data distributions. No analysis is provided showing that conflicting factual corrections or domain-specific priors do not dilute the rectification effect or reintroduce hallucinations at the global level.
[Experiments] The experimental section asserts superior performance but supplies no details on baselines, datasets, metrics, error bars, or statistical significance. Without these, the claim that FedSDR outperforms conventional algorithms cannot be evaluated, and it is unclear whether the final global model is deployed with or without the local LoRA-S branch.

minor comments (2)

[Introduction / Abstract] The Rewrite Paradox is introduced as a subtle trade-off but lacks a formal definition, quantification, or reference to related concepts in the self-distillation literature; a concise mathematical or empirical characterization would improve clarity.
[Method] Notation for the two LoRA branches (LoRA-S and LoRA-R) and the selective aggregation rule should be defined explicitly with equations rather than descriptive text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications based on the manuscript content and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [Method (dual-stream LoRA mechanism)] The core claim that selectively aggregating only LoRA-R produces a globally aligned and factually faithful model (stated in the abstract and elaborated in the method) rests on the unexamined assumption that local rectification updates remain consistent under averaging across heterogeneous raw-data distributions. No analysis is provided showing that conflicting factual corrections or domain-specific priors do not dilute the rectification effect or reintroduce hallucinations at the global level.

Authors: The manuscript motivates the selective aggregation of LoRA-R by noting that this branch is explicitly anchored to each client's raw data to enforce factual correctness, while LoRA-S absorbs heterogeneity locally through distilled data and is not aggregated. This separation is intended to prevent dilution of rectification signals. We acknowledge that the current version does not include a dedicated analysis or proof of update consistency under averaging in highly heterogeneous regimes, which could indeed be a point of concern. In the revision we will add a dedicated subsection discussing this assumption, potential edge cases with conflicting priors, and supporting empirical observations from the existing experiments. revision: partial
Referee: [Experiments] The experimental section asserts superior performance but supplies no details on baselines, datasets, metrics, error bars, or statistical significance. Without these, the claim that FedSDR outperforms conventional algorithms cannot be evaluated, and it is unclear whether the final global model is deployed with or without the local LoRA-S branch.

Authors: The full manuscript reports comparisons against FedAvg, FedProx, and the base FedSD method on heterogeneous splits of standard NLP benchmarks (e.g., GLUE tasks with non-IID partitions). Performance is measured via task-specific accuracy together with hallucination and redundancy metrics, with results shown as means and standard deviations across multiple random seeds; statistical significance is assessed via paired t-tests. The final global model is formed solely from the aggregated LoRA-R parameters; LoRA-S remains strictly local and is never shared. We will revise the experimental section to state these details explicitly, include error bars in all figures, and add a clear statement on model deployment. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework proposal relies on empirical validation rather than self-referential derivation

full rationale

The paper introduces FedSD and the Rewrite Paradox as novel concepts within this work, then defines FedSDR via a dual-stream LoRA mechanism whose benefits are asserted by construction of the selective aggregation step. However, no equations, fitted parameters, or mathematical derivations are present that reduce any prediction or result to inputs by construction. Claims of global alignment and faithfulness are positioned as outcomes of the proposed architecture and are stated to be verified by extensive experiments, providing external grounding. Self-citations are absent from the provided text, and introduced terms do not create load-bearing loops that collapse the central argument to tautology. This is a standard empirical method paper without detectable circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted or verified from the provided text.

pith-pipeline@v0.9.0 · 5717 in / 1036 out tokens · 33684 ms · 2026-05-20T12:13:57.953265+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 9 internal anchors

[1]

Artificial intelligence and statistics , pages=

Communication-efficient learning of deep networks from decentralized data , author=. Artificial intelligence and statistics , pages=. 2017 , organization=

work page 2017
[2]

Federated Learning with Non-IID Data

Federated learning with non-iid data , author=. arXiv preprint arXiv:1806.00582 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Proceedings of Machine learning and systems , volume=

Federated optimization in heterogeneous networks , author=. Proceedings of Machine learning and systems , volume=

work page
[4]

International conference on machine learning , pages=

Scaffold: Stochastic controlled averaging for federated learning , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[5]

FedALoRA: Adaptive Local LoRA Aggregation for Personalized Federated Learning in LLM , year=

Yi, Xinzhi and Hu, Chunqiang and Cai, Bin and Huang, Hongyu and Chen, Yuwen and Wang, Kui , journal=. FedALoRA: Adaptive Local LoRA Aggregation for Personalized Federated Learning in LLM , year=

work page
[6]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Self-instruct: Aligning language models with self-generated instructions , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

work page
[7]

arXiv preprint arXiv:2402.13669 , year=

Self-distillation bridges distribution gap in language model fine-tuning , author=. arXiv preprint arXiv:2402.13669 , year=

work page arXiv
[8]

The False Promise of Imitating Proprietary LLMs

The false promise of imitating proprietary llms, 2023 , author=. URL https://arxiv. org/abs/2305.15717 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

work page
[10]

Mathematics , volume=

Efficient federated learning with pre-trained large language model using several adapter mechanisms , author=. Mathematics , volume=. 2023 , publisher=

work page 2023
[11]

FedloRA: When personalized federated learning meets low-rank adaptation , author=

work page
[12]

Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Openfedllm: Training large language models on decentralized private data via federated learning , author=. Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining , pages=

work page
[13]

arXiv preprint arXiv:2310.13283 , year=

pFedLoRA: Model-heterogeneous personalized federated learning with LoRA tuning , author=. arXiv preprint arXiv:2310.13283 , year=

work page arXiv
[14]

Preprint , year=

FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning , author=. Preprint , year=

work page
[15]

arXiv preprint arXiv:2402.11505 , year=

Federated fine-tuning of large language models under heterogeneous language tasks and client resources , author=. arXiv preprint arXiv:2402.11505 , volume=

work page arXiv
[16]

Advances in neural information processing systems , volume=

Ensemble distillation for robust model fusion in federated learning , author=. Advances in neural information processing systems , volume=

work page
[17]

URL https://arxiv

Measuring massive multitask language understanding, 2021 , author=. URL https://arxiv. org/abs , pages=

work page 2021
[18]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Challenging big-bench tasks and whether chain-of-thought can solve them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023
[19]

Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

Crass: A novel data set and benchmark to test counterfactual reasoning of large language models , author=. Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

work page
[20]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs , author=. arXiv preprint arXiv:1903.00161 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1903
[21]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

pFedGPT: Hierarchically Optimizing LoRA Aggregation Weights for Personalized Federated GPT Models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[22]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Dual-personalizing adapter for federated foundation models , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page
[23]

Findings of the Association for Computational Linguistics,

Federated Data-Efficient Instruction Tuning for Large Language Models , author=. Findings of the Association for Computational Linguistics,

work page
[24]

Forty-second International Conference on Machine Learning , year=

On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists , author=. Forty-second International Conference on Machine Learning , year=

work page
[25]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023
[26]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

Fedbiot: Llm local fine-tuning in federated learning without full model , author=. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

work page
[27]

International Conference on Machine Learning , pages=

Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024
[28]

Free Dolly:

Conover, Mike and Hayes, Matt and others , howpublished =. Free Dolly:

work page
[29]

arXiv preprint arXiv:2404.15381 , year=

Advances and Open Challenges in Federated Learning with Foundation Models , author=. arXiv preprint arXiv:2404.15381 , year=

work page arXiv
[30]

Advances in Neural Information Processing Systems , volume=

Towards federated foundation models: Scalable dataset pipelines for group-structured learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[31]

Advances in Neural Information Processing Systems , volume=

Is heterogeneity notorious? taming heterogeneity to handle test-time shift in federated learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[32]

Advances in Neural Information Processing Systems , volume=

Adaptive Test-Time Personalization for Federated Learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[33]

Advances in Neural Information Processing Systems , volume=

Dual-personalizing adapter for federated foundation models , author=. Advances in Neural Information Processing Systems , volume=

work page
[34]

ICLR , year=

Test-Time Robust Personalization for Federated Learning , author=. ICLR , year=

work page
[35]

Proceedings of the AAAI conference on artificial intelligence , volume=

Fedala: Adaptive local aggregation for personalized federated learning , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[36]

On the Convergence of FedAvg on Non-IID Data , author=

work page
[37]

Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Fedcp: Separating feature information for personalized federated learning via conditional policy , author=. Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining , pages=

work page
[38]

Arivazhagan, Manoj Ghuhan and Aggarwal, Vinay and Singh, Aaditya Kumar and Choudhary, Sunav , journal=

work page
[39]

”Fedet: a communication-efficient federated class- incremental learning framework based on enhanced transformer.” arXiv preprint arXiv:2306.15347 (2023)

Fedet: a communication-efficient federated class-incremental learning framework based on enhanced transformer , author=. arXiv preprint arXiv:2306.15347 , year=

work page arXiv
[40]

, author=

Continual Federated Learning Based on Knowledge Distillation. , author=. Ijcai , pages=

work page
[41]

Adaptive Federated Optimization

Adaptive federated optimization , author=. arXiv preprint arXiv:2003.00295 , year=

work page internal anchor Pith review arXiv 2003
[42]

”Feddistill: Global model distillation for lo- cal model de-biasing in non-iid federated learning.” arXiv preprint arXiv:2404.09210 (2024)

Feddistill: Global model distillation for local model de-biasing in non-iid federated learning , author=. arXiv preprint arXiv:2404.09210 , year=

work page arXiv
[43]

2023 IEEE 16th International Conference on Cloud Computing (CLOUD) , pages=

Fedgen: Generalizable federated learning for sequential data , author=. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD) , pages=. 2023 , organization=

work page 2023
[44]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

FedSSI: Rehearsal-Free Continual Federated Learning with Synergistic Synaptic Intelligence , author=

work page
[46]

arXiv preprint arXiv:2409.15723 , year=

Federated large language models: Current progress and future directions , author=. arXiv preprint arXiv:2409.15723 , year=

work page arXiv
[47]

arXiv preprint arXiv:2506.11024 , year=

Not All Clients Are Equal: Personalized Federated Learning on Heterogeneous Multi-Modal Clients , author=. arXiv preprint arXiv:2506.11024 , year=

work page arXiv
[48]

2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC) , pages=

PFFLoRA: Personalized Fourier LoRA Fine-Tuning of Federated Large Language Models , author=. 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC) , pages=. 2024 , organization=

work page 2024
[49]

Frontiers in Artificial Intelligence , volume=

Bringing multi-modal multi-task federated foundation models to education domain: prospects and challenges , author=. Frontiers in Artificial Intelligence , volume=. 2025 , publisher=

work page 2025
[50]

arXiv preprint arXiv:2411.19128 , year=

Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures , author=. arXiv preprint arXiv:2411.19128 , year=

work page arXiv
[51]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

Measuring the effects of non-identical data distribution for federated visual classification , author=. arXiv preprint arXiv:1909.06335 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909
[52]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[53]

arXiv preprint arXiv:2601.10348 , year=

Training-Trajectory-Aware Token Selection , author=. arXiv preprint arXiv:2601.10348 , year=

work page arXiv
[54]

Merge-of-thought distillation.ArXiv, abs/2509.08814,

Merge-of-thought distillation , author=. arXiv preprint arXiv:2509.08814 , year=

work page arXiv
[55]

arXiv preprint arXiv:2307.10485 , year=

Fingpt: Democratizing internet-scale data for financial large language models , author=. arXiv preprint arXiv:2307.10485 , year=

work page arXiv
[56]

International Conference on Learning Representations , volume=

Mammoth: Building math generalist models through hybrid instruction tuning , author=. International Conference on Learning Representations , volume=

work page
[57]

Code alpaca: An instruction-following llama model for code generation , author=

work page
[58]

2023 , publisher=

Stanford alpaca: An instruction-following llama model , author=. 2023 , publisher=

work page 2023
[59]

Medalpaca – an open-source collection of medical conversational ai models and training data,

MedAlpaca--an open-source collection of medical conversational AI models and training data , author=. arXiv preprint arXiv:2304.08247 , year=

work page arXiv
[60]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Journal of the Association for Information Science and Technology , volume=

Good debt or bad debt: Detecting semantic orientations in economic texts , author=. Journal of the Association for Information Science and Technology , volume=. 2014 , publisher=

work page 2014
[62]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Artificial intelligence and statistics , pages=

Communication-efficient learning of deep networks from decentralized data , author=. Artificial intelligence and statistics , pages=. 2017 , organization=

work page 2017

[2] [2]

Federated Learning with Non-IID Data

Federated learning with non-iid data , author=. arXiv preprint arXiv:1806.00582 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Proceedings of Machine learning and systems , volume=

Federated optimization in heterogeneous networks , author=. Proceedings of Machine learning and systems , volume=

work page

[4] [4]

International conference on machine learning , pages=

Scaffold: Stochastic controlled averaging for federated learning , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[5] [5]

FedALoRA: Adaptive Local LoRA Aggregation for Personalized Federated Learning in LLM , year=

Yi, Xinzhi and Hu, Chunqiang and Cai, Bin and Huang, Hongyu and Chen, Yuwen and Wang, Kui , journal=. FedALoRA: Adaptive Local LoRA Aggregation for Personalized Federated Learning in LLM , year=

work page

[6] [6]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Self-instruct: Aligning language models with self-generated instructions , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

work page

[7] [7]

arXiv preprint arXiv:2402.13669 , year=

Self-distillation bridges distribution gap in language model fine-tuning , author=. arXiv preprint arXiv:2402.13669 , year=

work page arXiv

[8] [8]

The False Promise of Imitating Proprietary LLMs

The false promise of imitating proprietary llms, 2023 , author=. URL https://arxiv. org/abs/2305.15717 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

work page

[10] [10]

Mathematics , volume=

Efficient federated learning with pre-trained large language model using several adapter mechanisms , author=. Mathematics , volume=. 2023 , publisher=

work page 2023

[11] [11]

FedloRA: When personalized federated learning meets low-rank adaptation , author=

work page

[12] [12]

Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Openfedllm: Training large language models on decentralized private data via federated learning , author=. Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining , pages=

work page

[13] [13]

arXiv preprint arXiv:2310.13283 , year=

pFedLoRA: Model-heterogeneous personalized federated learning with LoRA tuning , author=. arXiv preprint arXiv:2310.13283 , year=

work page arXiv

[14] [14]

Preprint , year=

FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning , author=. Preprint , year=

work page

[15] [15]

arXiv preprint arXiv:2402.11505 , year=

Federated fine-tuning of large language models under heterogeneous language tasks and client resources , author=. arXiv preprint arXiv:2402.11505 , volume=

work page arXiv

[16] [16]

Advances in neural information processing systems , volume=

Ensemble distillation for robust model fusion in federated learning , author=. Advances in neural information processing systems , volume=

work page

[17] [17]

URL https://arxiv

Measuring massive multitask language understanding, 2021 , author=. URL https://arxiv. org/abs , pages=

work page 2021

[18] [18]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Challenging big-bench tasks and whether chain-of-thought can solve them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023

[19] [19]

Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

Crass: A novel data set and benchmark to test counterfactual reasoning of large language models , author=. Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

work page

[20] [20]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs , author=. arXiv preprint arXiv:1903.00161 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1903

[21] [21]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

pFedGPT: Hierarchically Optimizing LoRA Aggregation Weights for Personalized Federated GPT Models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025

[22] [22]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Dual-personalizing adapter for federated foundation models , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page

[23] [23]

Findings of the Association for Computational Linguistics,

Federated Data-Efficient Instruction Tuning for Large Language Models , author=. Findings of the Association for Computational Linguistics,

work page

[24] [24]

Forty-second International Conference on Machine Learning , year=

On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists , author=. Forty-second International Conference on Machine Learning , year=

work page

[25] [25]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023

[26] [26]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

Fedbiot: Llm local fine-tuning in federated learning without full model , author=. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

work page

[27] [27]

International Conference on Machine Learning , pages=

Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024

[28] [28]

Free Dolly:

Conover, Mike and Hayes, Matt and others , howpublished =. Free Dolly:

work page

[29] [29]

arXiv preprint arXiv:2404.15381 , year=

Advances and Open Challenges in Federated Learning with Foundation Models , author=. arXiv preprint arXiv:2404.15381 , year=

work page arXiv

[30] [30]

Advances in Neural Information Processing Systems , volume=

Towards federated foundation models: Scalable dataset pipelines for group-structured learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[31] [31]

Advances in Neural Information Processing Systems , volume=

Is heterogeneity notorious? taming heterogeneity to handle test-time shift in federated learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[32] [32]

Advances in Neural Information Processing Systems , volume=

Adaptive Test-Time Personalization for Federated Learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[33] [33]

Advances in Neural Information Processing Systems , volume=

Dual-personalizing adapter for federated foundation models , author=. Advances in Neural Information Processing Systems , volume=

work page

[34] [34]

ICLR , year=

Test-Time Robust Personalization for Federated Learning , author=. ICLR , year=

work page

[35] [35]

Proceedings of the AAAI conference on artificial intelligence , volume=

Fedala: Adaptive local aggregation for personalized federated learning , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page

[36] [36]

On the Convergence of FedAvg on Non-IID Data , author=

work page

[37] [37]

Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Fedcp: Separating feature information for personalized federated learning via conditional policy , author=. Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining , pages=

work page

[38] [38]

Arivazhagan, Manoj Ghuhan and Aggarwal, Vinay and Singh, Aaditya Kumar and Choudhary, Sunav , journal=

work page

[39] [39]

”Fedet: a communication-efficient federated class- incremental learning framework based on enhanced transformer.” arXiv preprint arXiv:2306.15347 (2023)

Fedet: a communication-efficient federated class-incremental learning framework based on enhanced transformer , author=. arXiv preprint arXiv:2306.15347 , year=

work page arXiv

[40] [40]

, author=

Continual Federated Learning Based on Knowledge Distillation. , author=. Ijcai , pages=

work page

[41] [41]

Adaptive Federated Optimization

Adaptive federated optimization , author=. arXiv preprint arXiv:2003.00295 , year=

work page internal anchor Pith review arXiv 2003

[42] [42]

”Feddistill: Global model distillation for lo- cal model de-biasing in non-iid federated learning.” arXiv preprint arXiv:2404.09210 (2024)

Feddistill: Global model distillation for local model de-biasing in non-iid federated learning , author=. arXiv preprint arXiv:2404.09210 , year=

work page arXiv

[43] [43]

2023 IEEE 16th International Conference on Cloud Computing (CLOUD) , pages=

Fedgen: Generalizable federated learning for sequential data , author=. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD) , pages=. 2023 , organization=

work page 2023

[44] [44]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

FedSSI: Rehearsal-Free Continual Federated Learning with Synergistic Synaptic Intelligence , author=

work page

[46] [46]

arXiv preprint arXiv:2409.15723 , year=

Federated large language models: Current progress and future directions , author=. arXiv preprint arXiv:2409.15723 , year=

work page arXiv

[47] [47]

arXiv preprint arXiv:2506.11024 , year=

Not All Clients Are Equal: Personalized Federated Learning on Heterogeneous Multi-Modal Clients , author=. arXiv preprint arXiv:2506.11024 , year=

work page arXiv

[48] [48]

2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC) , pages=

PFFLoRA: Personalized Fourier LoRA Fine-Tuning of Federated Large Language Models , author=. 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC) , pages=. 2024 , organization=

work page 2024

[49] [49]

Frontiers in Artificial Intelligence , volume=

Bringing multi-modal multi-task federated foundation models to education domain: prospects and challenges , author=. Frontiers in Artificial Intelligence , volume=. 2025 , publisher=

work page 2025

[50] [50]

arXiv preprint arXiv:2411.19128 , year=

Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures , author=. arXiv preprint arXiv:2411.19128 , year=

work page arXiv

[51] [51]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

Measuring the effects of non-identical data distribution for federated visual classification , author=. arXiv preprint arXiv:1909.06335 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909

[52] [52]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025

[53] [53]

arXiv preprint arXiv:2601.10348 , year=

Training-Trajectory-Aware Token Selection , author=. arXiv preprint arXiv:2601.10348 , year=

work page arXiv

[54] [54]

Merge-of-thought distillation.ArXiv, abs/2509.08814,

Merge-of-thought distillation , author=. arXiv preprint arXiv:2509.08814 , year=

work page arXiv

[55] [55]

arXiv preprint arXiv:2307.10485 , year=

Fingpt: Democratizing internet-scale data for financial large language models , author=. arXiv preprint arXiv:2307.10485 , year=

work page arXiv

[56] [56]

International Conference on Learning Representations , volume=

Mammoth: Building math generalist models through hybrid instruction tuning , author=. International Conference on Learning Representations , volume=

work page

[57] [57]

Code alpaca: An instruction-following llama model for code generation , author=

work page

[58] [58]

2023 , publisher=

Stanford alpaca: An instruction-following llama model , author=. 2023 , publisher=

work page 2023

[59] [59]

Medalpaca – an open-source collection of medical conversational ai models and training data,

MedAlpaca--an open-source collection of medical conversational AI models and training data , author=. arXiv preprint arXiv:2304.08247 , year=

work page arXiv

[60] [60]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

Journal of the Association for Information Science and Technology , volume=

Good debt or bad debt: Detecting semantic orientations in economic texts , author=. Journal of the Association for Information Science and Technology , volume=. 2014 , publisher=

work page 2014

[62] [62]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[63] [63]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv