pith. machine review for the scientific record.

arxiv: 2605.08636 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints


Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords: federated learning · LLM fine-tuning · edge computing · benchmark · system constraints · deployment evaluation · privacy-preserving adaptation

The pith

Accuracy-only checks can mislead on which federated LLM fine-tuning methods will actually run on real edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EdgeFlowerTune, a benchmark that runs federated LLM fine-tuning experiments directly on commercial Android phones and NVIDIA edge boards instead of in simulation. It tracks both final model quality and concrete system costs such as communication volume, wall-clock time, memory, energy draw, and behavior under changing network or load conditions. Prior studies often stop at accuracy numbers or abstract simulation settings, which the authors show can favor methods that later prove too slow, power-hungry, or brittle for actual deployment. Three protocols (Quality-under-Budget, Cost-to-Target, and Robustness) let researchers compare approaches on effectiveness, efficiency, and reliability together. This matters to anyone building private, on-device model adaptation because the results show that seemingly equivalent methods can differ sharply in whether they fit everyday hardware limits.
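
To make the joint bookkeeping concrete, here is a minimal sketch of the kind of per-run record such a benchmark might collect. The schema is an illustrative assumption on our part, not EdgeFlowerTune's actual logging format; field names and units are invented for exposition.

    from dataclasses import dataclass

    @dataclass
    class RunRecord:
        """One run of one fine-tuning method on one device.

        Hypothetical schema for illustration; field names and units
        are assumptions, not taken from the paper.
        """
        method: str            # e.g. a full or parameter-efficient tuning variant
        device: str            # e.g. an Android phone or an NVIDIA edge board
        final_quality: float   # task metric (e.g. accuracy) after training
        comm_bytes: int        # total bytes uploaded and downloaded
        wall_clock_s: float    # end-to-end training time
        peak_memory_mb: float  # peak resident memory on the device
        energy_j: float        # measured energy draw
        completed: bool        # False if killed by OOM, thermal, or battery limits

Two methods can match on final_quality and still separate on every other field, which is exactly the failure mode the benchmark targets.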

Core claim

Accuracy-only evaluation can lead to misleading conclusions because methods that reach similar final model quality can differ substantially in communication overhead, wall-clock latency, memory usage, energy consumption, and robustness when subjected to realistic edge constraints. EdgeFlowerTune addresses this gap by providing a reproducible real-device platform built on Flower and MobileFineTuner that jointly measures quality and system metrics across commercial Android smartphones and NVIDIA edge boards using three complementary protocols.

What carries the argument

The EdgeFlowerTune benchmark platform and its three protocols (Quality-under-Budget, Cost-to-Target, Robustness), which evaluate model quality together with communication, latency, memory, energy, and robustness to dynamic conditions on actual edge hardware.
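
Read as selection rules, the three protocols answer three different questions over such measurements. The sketch below is our reconstruction from the protocol names and descriptions, not the paper's code; the dictionary shapes, key names, and the use of a single scalar "cost" are simplifying assumptions (the paper tracks several system metrics).

    def quality_under_budget(results, budget):
        """Protocol A (reconstruction): best quality within a fixed cost budget.

        results: {method: {"quality": float, "cost": float}}
        Returns methods within budget, ranked by descending quality.
        """
        feasible = {m: r for m, r in results.items() if r["cost"] <= budget}
        return sorted(feasible, key=lambda m: feasible[m]["quality"], reverse=True)

    def cost_to_target(results, target_quality):
        """Protocol B (reconstruction): cheapest route to a fixed quality target.

        Returns methods that reach the target, ranked by ascending cost;
        methods that never reach it simply drop out.
        """
        reaching = {m: r for m, r in results.items() if r["quality"] >= target_quality}
        return sorted(reaching, key=lambda m: reaching[m]["cost"])

    def robustness(quality_by_condition, baseline="stable"):
        """Protocol C (reconstruction): smallest quality drop under perturbation.

        quality_by_condition: {condition: {method: quality}}, where non-baseline
        conditions might model throttled bandwidth or background load.
        Ranks methods by their worst-case drop relative to the baseline.
        """
        base = quality_by_condition[baseline]
        def worst_drop(method):
            return max((base[method] - cond_q[method]
                        for cond, cond_q in quality_by_condition.items()
                        if cond != baseline), default=0.0)
        return sorted(base, key=worst_drop)

Which system metric plays the role of cost (bytes, seconds, or joules) changes the ranking, which is presumably why the paper reports them jointly rather than collapsing them into one number.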

If this is right

  • Federated fine-tuning methods must be assessed on real hardware to determine whether they fit within device limits.
  • Accuracy-focused comparisons can hide large gaps in practicality that matter for user-facing deployments.
  • The three protocols give a structured way to rank methods on quality, cost, and stability at the same time.
  • Reproducible real-device benchmarks can replace simulation-only studies for edge AI research.
  • Developers gain concrete guidance on which techniques survive battery, network, and runtime pressures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Simulation studies of federated tuning may systematically overestimate which methods will work on actual phones and IoT hardware.
  • Low-energy and low-latency methods identified by this kind of benchmark could speed up private personalization features in consumer apps.
  • The same joint quality-and-cost lens could be applied to other edge federated tasks such as on-device vision or speech models.
  • Extending the device set to include more heterogeneous or older hardware would likely surface additional practical constraints not yet measured.

Load-bearing premise

The selected Android phones, NVIDIA boards, and three protocols capture enough of the real variety in edge constraints and changing conditions that the observed differences in deployability are representative.

What would settle it

If methods with similar accuracy but different benchmark costs turned out to show comparable real-world energy, latency, and success rates on additional device types or under wider network variability, the claim that system-aware evaluation is necessary would be undermined.

Figures

Figures reproduced from arXiv: 2605.08636 by Bing Luo, Jiaxiang Geng, Lunyu Zhao, Nicholas D. Lane, Yan Gao, Yiyi Lu.

Figure 1: Overview of EdgeFlowerTune. Candidate federated LLM fine-tuning methods are deployed … [image: figures/full_fig_p002_1.png]
Figure 2: EdgeFlowerTune benchmarking protocols. Protocol A evaluates the best model quality … [image: figures/full_fig_p004_2.png]
Figure 3: EdgeFlowerTune platform. The platform consists of one GPU server and several real edge … [image: figures/full_fig_p005_3.png]
Figure 4: Overall ranking of methods under protocol A. [image: figures/full_fig_p016_4.png]
Figure 5: Overall ranking of methods under protocol B. [image: figures/full_fig_p022_5.png]
Figure 6: Overall ranking of methods under protocol C. [image: figures/full_fig_p025_6.png]
Figure 7: Overall ranking across methods. For each method, the rank under each protocol–model … [image: figures/full_fig_p026_7.png]
Figure 8: Training loss curves of the four federated fine-tuning methods on Qwen2.5-0.5B across … [image: figures/full_fig_p028_8.png]
Figure 9: Training loss curves of the four federated fine-tuning methods on Gemma 3-270M across … [image: figures/full_fig_p029_9.png]
Figure 10: Training loss curves of the federated fine-tuning methods on Gemma 3-1B across seven … [image: figures/full_fig_p030_10.png]
original abstract

Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.
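
For readers unfamiliar with the software stack: Flower handles the federated orchestration, and a client can return system metrics alongside its model update. The sketch below, using Flower's classic NumPyClient interface, shows one way such instrumentation could look. It is our illustration, not EdgeFlowerTune's or MobileFineTuner's actual code; train_locally and evaluate_locally are hypothetical stand-ins for the on-device fine-tuning loop.

    import time

    import flwr as fl  # Flower's classic NumPyClient API

    def train_locally(model, data):
        """Hypothetical stand-in for on-device fine-tuning.
        Returns (updated_weights, num_examples), weights as NumPy arrays."""
        raise NotImplementedError

    def evaluate_locally(model, data):
        """Hypothetical stand-in returning (loss, accuracy, num_examples)."""
        raise NotImplementedError

    class InstrumentedClient(fl.client.NumPyClient):
        """A client that reports system costs next to its model update."""

        def __init__(self, model, data):
            self.model = model
            self.data = data

        def fit(self, parameters, config):
            self.model.set_weights(parameters)  # assumed Keras-style setter
            start = time.monotonic()
            new_weights, num_examples = train_locally(self.model, self.data)
            wall_clock_s = time.monotonic() - start
            # Upload volume is the size of the update actually shipped back;
            # for a parameter-efficient method, only the adapter weights.
            upload_bytes = int(sum(w.nbytes for w in new_weights))
            # Energy and peak memory would need platform hooks (e.g. Android's
            # BatteryManager); omitted in this sketch.
            metrics = {"wall_clock_s": wall_clock_s, "upload_bytes": upload_bytes}
            return new_weights, num_examples, metrics

        def evaluate(self, parameters, config):
            self.model.set_weights(parameters)
            loss, accuracy, num_examples = evaluate_locally(self.model, self.data)
            return float(loss), num_examples, {"accuracy": float(accuracy)}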

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces EdgeFlowerTune, a real-device benchmark for federated LLM fine-tuning on edge systems built on Flower and MobileFineTuner. It evaluates methods using three protocols—Quality-under-Budget, Cost-to-Target, and Robustness—across commercial Android smartphones and NVIDIA edge boards, concluding that accuracy-only assessments can mislead on deployability due to differences in system costs and robustness under realistic constraints.

Significance. If the reported divergences hold under broader conditions, the work fills a gap between simulation-based federated learning studies and practical edge deployment by providing a joint quality-system evaluation framework. The real-device instantiation and introduction of the three complementary protocols are strengths that could encourage more holistic method design in the field.

major comments (2)
  1. [Abstract] Abstract and experimental results: the central claim that 'accuracy-only evaluation can lead to misleading conclusions' and that 'methods with similar final quality may differ substantially in deployability' is stated without any quantitative data, specific accuracy/cost numbers, error bars, or statistical significance tests in the provided description; this leaves the magnitude and reliability of the effect unassessable.
  2. [Benchmark Design] Benchmark instantiation and protocols: the choice of commercial Android smartphones, NVIDIA edge boards, and the three protocols is presented as representative of 'realistic edge-system constraints,' yet no justification or coverage analysis is given for omitted factors such as variable thermal throttling, intermittent 4G/5G connectivity, or non-Flower orchestration overheads; because the misleading-conclusions claim depends on these differences generalizing, the sampling assumption is load-bearing and unverified.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, proposing revisions to strengthen the presentation of our results and benchmark design.

point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the central claim that 'accuracy-only evaluation can lead to misleading conclusions' and that 'methods with similar final quality may differ substantially in deployability' is stated without any quantitative data, specific accuracy/cost numbers, error bars, or statistical significance tests in the provided description; this leaves the magnitude and reliability of the effect unassessable.

    Authors: We appreciate the referee noting this. The abstract is intended as a high-level summary, but the full manuscript contains the requested quantitative details—including specific accuracy and system cost values, error bars, and statistical significance tests—in the experimental results sections. To address the concern directly, we will revise the abstract to incorporate one or two representative quantitative examples that illustrate the deployability differences.
    revision: partial

  2. Referee: [Benchmark Design] Benchmark instantiation and protocols: the choice of commercial Android smartphones, NVIDIA edge boards, and the three protocols is presented as representative of 'realistic edge-system constraints,' yet no justification or coverage analysis is given for omitted factors such as variable thermal throttling, intermittent 4G/5G connectivity, or non-Flower orchestration overheads; because the misleading-conclusions claim depends on these differences generalizing, the sampling assumption is load-bearing and unverified.

    Authors: We agree that explicit justification and coverage discussion would improve clarity. In the revised version we will add a subsection in the benchmark design section that explains the rationale for choosing the Android smartphones and NVIDIA boards (based on their prevalence in edge deployments) and the three protocols (as complementary views of quality, cost, and robustness). We will also explicitly discuss the controlled experimental conditions and note the limitations regarding untested factors such as dynamic thermal throttling and variable connectivity, thereby qualifying the scope of our generalization claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark grounded in direct hardware measurements

full rationale

The paper introduces EdgeFlowerTune as a deployment-oriented benchmark that evaluates federated LLM fine-tuning methods through direct measurements of quality, communication, latency, memory, energy, and robustness on physical Android smartphones and NVIDIA boards. No equations, fitted parameters, or derivations appear in the provided text; the central claim that accuracy-only evaluation can mislead on deployability follows from observed experimental divergences under the three protocols rather than any reduction to prior inputs or self-citations. The work is self-contained as an empirical study whose results are falsifiable via reproduction on the described hardware and software stack.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim that accuracy-only views mislead rests on the domain assumption that the chosen hardware and protocols capture representative edge constraints; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Commercial Android smartphones and NVIDIA edge boards plus Flower/MobileFineTuner represent realistic edge deployment environments.
    Invoked to justify the platform choice and the claim of realistic constraints.
invented entities (1)
  • EdgeFlowerTune benchmark with its three protocols (no independent evidence)
    purpose: to provide joint quality and system-cost evaluation for federated LLM fine-tuning
    The benchmark itself is the novel contribution; no independent falsifiable evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5566 in / 1349 out tokens · 30312 ms · 2026-05-12T00:59:41.181971+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 5 internal anchors

  1. [1] Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D. Lane. Smart at what cost? Characterising mobile deep neural networks in the wild. In Proceedings of the 21st ACM Internet Measurement Conference (IMC '21), pages 658–672, New York, NY, USA, 2021. Association for Computing Machinery.

  2. [2] Jiamu Bai, Daoyuan Chen, Bingchen Qian, Liuyi Yao, and Yaliang Li. Federated fine-tuning of large language models under heterogeneous tasks and client resources. In Advances in Neural Information Processing Systems, 2024.

  3. [3] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239.

  4. [4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

  5. [5] Yae Jee Cho, Luyang Liu, Zheng Xu, Aldi Fahrezi, and Gauri Joshi. Heterogeneous LoRA for federated fine-tuning of on-device foundation models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12903–12913, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

  6. [6] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

  7. [7] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  8. [8] European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council, 2016. URL https://data.europa.eu/eli/reg/2016/679/oj.

  9. [9] Tao Fan, Hanlin Gu, Xuemei Cao, Chee Seng Chan, Qian Chen, Yiqiang Chen, et al. Ten challenging problems in federated foundation models. IEEE Transactions on Knowledge and Data Engineering, 37(7):4314–4337, 2025.

  10. [10] Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, KangYoon Lee, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, and Nicholas D. Lane. FlowerTune: A cross-domain benchmark for federated fine-tuning of large language models, 2025.

  11. [11] Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, and Bing Luo. MobileFineTuner: A unified end-to-end framework for fine-tuning LLMs on mobile phones, 2025.

  12. [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

  13. [13] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Graham Cormode, Rachel Cummings, Rodrigo D'Oliveira, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021. doi: 10.1561/2200000083.

  14. [14] Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, and Jingren Zhou. FederatedScope-LLM: A comprehensive package for fine-tuning large language models in federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5260–5271, 2024.

  15. [15] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, volume 2, pages 429–450, 2020.

  16. [16] Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, and Yue Gao. SplitLoRA: A split parameter-efficient fine-tuning framework for large language models. arXiv preprint arXiv:2407.00952, 2024.

  17. [17] Qianli Liu, Zhaorui Zhang, Xin Yao, and Benben Liu. HLoRA: Efficient federated learning system for LLM heterogeneous fine-tuning. arXiv preprint arXiv:2503.00813, 2025.

  18. [18] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.

  19. [19] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  20. [20] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264.

  21. [21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. doi: 10.1609/aaai.v34i05.6399.

  22. [22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4463–4473, Hong Kong, China, 2019. Association for Computational Linguistics.

  23. [23] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. URL https://arxiv.org/abs/2503.19786.

  24. [24] Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning (ICML '24). JMLR.org, 2024.

  25. [25] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446.

  26. [26] Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, and Barbara Plank. "My answer is C": First-token probabilities do not match text answers in instruction-tuned language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024.

  27. [27] Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, and Ang Li. FLoRA: Federated fine-tuning large language models with heterogeneous low-rank adaptations. In Advances in Neural Information Processing Systems, 2024.

  28. [28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.

  29. [30] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115.

  30. [31] Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, and Siheng Chen. OpenFedLLM: Training large language models on decentralized private data via federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6137–6147, 2024. doi: 10.1145/3637528.3671582.

  31. [32] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472.

  32. [33] Jianyi Zhang, Saeed Vahidian, Martin Kuo, Chunyuan Li, Ruiyi Zhang, Tong Yu, Yufan Zhou, Guoyin Wang, and Yiran Chen. Towards building the federated GPT: Federated instruction tuning. arXiv preprint arXiv:2305.05644, 2023.