pith. machine review for the scientific record.

arxiv: 2605.08636 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints


Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords: federated learning · LLM fine-tuning · edge computing · benchmark · system constraints · deployment evaluation · privacy-preserving adaptation

The pith

Accuracy-only checks can mislead on which federated LLM fine-tuning methods will actually run on real edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EdgeFlowerTune, a benchmark that runs federated LLM fine-tuning experiments directly on commercial Android phones and NVIDIA edge boards instead of in simulation. It tracks both final model quality and concrete system costs such as communication volume, wall-clock time, memory, energy draw, and behavior under changing network or load conditions. Prior studies often stop at accuracy numbers or abstract simulation settings, which the authors show can favor methods that later prove too slow, power-hungry, or brittle for actual deployment. Three protocols (Quality-under-Budget, Cost-to-Target, and Robustness) let researchers compare approaches on effectiveness, efficiency, and reliability together. This matters to anyone building private, on-device model adaptation because the results show that seemingly equivalent methods can differ sharply in whether they fit everyday hardware limits.
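
To make the joint bookkeeping concrete, here is a minimal sketch of the kind of per-run record such a benchmark might collect. The schema is an illustrative assumption on our part, not EdgeFlowerTune's actual logging format; field names and units are invented for exposition.

    from dataclasses import dataclass

    @dataclass
    class RunRecord:
        """One run of one fine-tuning method on one device.

        Hypothetical schema for illustration; field names and units
        are assumptions, not taken from the paper.
        """
        method: str            # e.g. a full or parameter-efficient tuning variant
        device: str            # e.g. an Android phone or an NVIDIA edge board
        final_quality: float   # task metric (e.g. accuracy) after training
        comm_bytes: int        # total bytes uploaded and downloaded
        wall_clock_s: float    # end-to-end training time
        peak_memory_mb: float  # peak resident memory on the device
        energy_j: float        # measured energy draw
        completed: bool        # False if killed by OOM, thermal, or battery limits

Two methods can match on final_quality and still separate on every other field, which is exactly the failure mode the benchmark targets.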

Core claim

Accuracy-only evaluation can lead to misleading conclusions because methods that reach similar final model quality can differ substantially in communication overhead, wall-clock latency, memory usage, energy consumption, and robustness when subjected to realistic edge constraints. EdgeFlowerTune addresses this gap by providing a reproducible real-device platform built on Flower and MobileFineTuner that jointly measures quality and system metrics across commercial Android smartphones and NVIDIA edge boards using three complementary protocols.

What carries the argument

The EdgeFlowerTune benchmark platform and its three protocols (Quality-under-Budget, Cost-to-Target, Robustness), which evaluate model quality together with communication, latency, memory, energy, and robustness to dynamic conditions on actual edge hardware.
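
Read as selection rules, the three protocols answer three different questions over such measurements. The sketch below is our reconstruction from the protocol names and descriptions, not the paper's code; the dictionary shapes, key names, and the use of a single scalar "cost" are simplifying assumptions (the paper tracks several system metrics).

    def quality_under_budget(results, budget):
        """Protocol A (reconstruction): best quality within a fixed cost budget.

        results: {method: {"quality": float, "cost": float}}
        Returns methods within budget, ranked by descending quality.
        """
        feasible = {m: r for m, r in results.items() if r["cost"] <= budget}
        return sorted(feasible, key=lambda m: feasible[m]["quality"], reverse=True)

    def cost_to_target(results, target_quality):
        """Protocol B (reconstruction): cheapest route to a fixed quality target.

        Returns methods that reach the target, ranked by ascending cost;
        methods that never reach it simply drop out.
        """
        reaching = {m: r for m, r in results.items() if r["quality"] >= target_quality}
        return sorted(reaching, key=lambda m: reaching[m]["cost"])

    def robustness(quality_by_condition, baseline="stable"):
        """Protocol C (reconstruction): smallest quality drop under perturbation.

        quality_by_condition: {condition: {method: quality}}, where non-baseline
        conditions might model throttled bandwidth or background load.
        Ranks methods by their worst-case drop relative to the baseline.
        """
        base = quality_by_condition[baseline]
        def worst_drop(method):
            return max((base[method] - cond_q[method]
                        for cond, cond_q in quality_by_condition.items()
                        if cond != baseline), default=0.0)
        return sorted(base, key=worst_drop)

Which system metric plays the role of cost (bytes, seconds, or joules) changes the ranking, which is presumably why the paper reports them jointly rather than collapsing them into one number.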

If this is right

  • Federated fine-tuning methods must be assessed on real hardware to determine whether they fit within device limits.
  • Accuracy-focused comparisons can hide large gaps in practicality that matter for user-facing deployments.
  • The three protocols give a structured way to rank methods on quality, cost, and stability at the same time.
  • Reproducible real-device benchmarks can replace simulation-only studies for edge AI research.
  • Developers gain concrete guidance on which techniques survive battery, network, and runtime pressures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Simulation studies of federated tuning may systematically overestimate which methods will work on actual phones and IoT hardware.
  • Low-energy and low-latency methods identified by this kind of benchmark could speed up private personalization features in consumer apps.
  • The same joint quality-and-cost lens could be applied to other edge federated tasks such as on-device vision or speech models.
  • Extending the device set to include more heterogeneous or older hardware would likely surface additional practical constraints not yet measured.

Load-bearing premise

The selected Android phones, NVIDIA boards, and three protocols capture enough of the real variety in edge constraints and changing conditions that the observed differences in deployability are representative.

What would settle it

If methods with similar accuracy but different benchmark costs turned out to show comparable real-world energy, latency, and success rates on additional device types or under wider network variability, the claim that system-aware evaluation is necessary would be undermined.

Figures

Figures reproduced from arXiv: 2605.08636 by Bing Luo, Jiaxiang Geng, Lunyu Zhao, Nicholas D. Lane, Yan Gao, Yiyi Lu.

Figure 1: Overview of EdgeFlowerTune. Candidate federated LLM fine-tuning methods are deployed … [image: figures/full_fig_p002_1.png]
Figure 2: EdgeFlowerTune benchmarking protocols. Protocol A evaluates the best model quality … [image: figures/full_fig_p004_2.png]
Figure 3: EdgeFlowerTune platform. The platform consists of one GPU server and several real edge … [image: figures/full_fig_p005_3.png]
Figure 4: Overall ranking of methods under protocol A. [image: figures/full_fig_p016_4.png]
Figure 5: Overall ranking of methods under protocol B. [image: figures/full_fig_p022_5.png]
Figure 6: Overall ranking of methods under protocol C. [image: figures/full_fig_p025_6.png]
Figure 7: Overall ranking across methods. For each method, the rank under each protocol–model … [image: figures/full_fig_p026_7.png]
Figure 8: Training loss curves of the four federated fine-tuning methods on Qwen2.5-0.5B across … [image: figures/full_fig_p028_8.png]
Figure 9: Training loss curves of the four federated fine-tuning methods on Gemma 3-270M across … [image: figures/full_fig_p029_9.png]
Figure 10: Training loss curves of the federated fine-tuning methods on Gemma 3-1B across seven … [image: figures/full_fig_p030_10.png]
original abstract

Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.
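
For readers unfamiliar with the software stack: Flower handles the federated orchestration, and a client can return system metrics alongside its model update. The sketch below, using Flower's classic NumPyClient interface, shows one way such instrumentation could look. It is our illustration, not EdgeFlowerTune's or MobileFineTuner's actual code; train_locally and evaluate_locally are hypothetical stand-ins for the on-device fine-tuning loop.

    import time

    import flwr as fl  # Flower's classic NumPyClient API

    def train_locally(model, data):
        """Hypothetical stand-in for on-device fine-tuning.
        Returns (updated_weights, num_examples), weights as NumPy arrays."""
        raise NotImplementedError

    def evaluate_locally(model, data):
        """Hypothetical stand-in returning (loss, accuracy, num_examples)."""
        raise NotImplementedError

    class InstrumentedClient(fl.client.NumPyClient):
        """A client that reports system costs next to its model update."""

        def __init__(self, model, data):
            self.model = model
            self.data = data

        def fit(self, parameters, config):
            self.model.set_weights(parameters)  # assumed Keras-style setter
            start = time.monotonic()
            new_weights, num_examples = train_locally(self.model, self.data)
            wall_clock_s = time.monotonic() - start
            # Upload volume is the size of the update actually shipped back;
            # for a parameter-efficient method, only the adapter weights.
            upload_bytes = int(sum(w.nbytes for w in new_weights))
            # Energy and peak memory would need platform hooks (e.g. Android's
            # BatteryManager); omitted in this sketch.
            metrics = {"wall_clock_s": wall_clock_s, "upload_bytes": upload_bytes}
            return new_weights, num_examples, metrics

        def evaluate(self, parameters, config):
            self.model.set_weights(parameters)
            loss, accuracy, num_examples = evaluate_locally(self.model, self.data)
            return float(loss), num_examples, {"accuracy": float(accuracy)}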

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces EdgeFlowerTune, a real-device benchmark for federated LLM fine-tuning on edge systems built on Flower and MobileFineTuner. It evaluates methods using three protocols—Quality-under-Budget, Cost-to-Target, and Robustness—across commercial Android smartphones and NVIDIA edge boards, concluding that accuracy-only assessments can mislead on deployability due to differences in system costs and robustness under realistic constraints.

Significance. If the reported divergences hold under broader conditions, the work fills a gap between simulation-based federated learning studies and practical edge deployment by providing a joint quality-system evaluation framework. The real-device instantiation and introduction of the three complementary protocols are strengths that could encourage more holistic method design in the field.

major comments (2)
  1. [Abstract] Abstract and experimental results: the central claim that 'accuracy-only evaluation can lead to misleading conclusions' and that 'methods with similar final quality may differ substantially in deployability' is stated without any quantitative data, specific accuracy/cost numbers, error bars, or statistical significance tests in the provided description; this leaves the magnitude and reliability of the effect unassessable.
  2. [Benchmark Design] Benchmark instantiation and protocols: the choice of commercial Android smartphones, NVIDIA edge boards, and the three protocols is presented as representative of 'realistic edge-system constraints,' yet no justification or coverage analysis is given for omitted factors such as variable thermal throttling, intermittent 4G/5G connectivity, or non-Flower orchestration overheads; because the misleading-conclusions claim depends on these differences generalizing, the sampling assumption is load-bearing and unverified.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, proposing revisions to strengthen the presentation of our results and benchmark design.

point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the central claim that 'accuracy-only evaluation can lead to misleading conclusions' and that 'methods with similar final quality may differ substantially in deployability' is stated without any quantitative data, specific accuracy/cost numbers, error bars, or statistical significance tests in the provided description; this leaves the magnitude and reliability of the effect unassessable.

    Authors: We appreciate the referee noting this. The abstract is intended as a high-level summary, but the full manuscript contains the requested quantitative details—including specific accuracy and system cost values, error bars, and statistical significance tests—in the experimental results sections. To address the concern directly, we will revise the abstract to incorporate one or two representative quantitative examples that illustrate the deployability differences.
    revision: partial

  2. Referee: [Benchmark Design] Benchmark instantiation and protocols: the choice of commercial Android smartphones, NVIDIA edge boards, and the three protocols is presented as representative of 'realistic edge-system constraints,' yet no justification or coverage analysis is given for omitted factors such as variable thermal throttling, intermittent 4G/5G connectivity, or non-Flower orchestration overheads; because the misleading-conclusions claim depends on these differences generalizing, the sampling assumption is load-bearing and unverified.

    Authors: We agree that explicit justification and coverage discussion would improve clarity. In the revised version we will add a subsection in the benchmark design section that explains the rationale for choosing the Android smartphones and NVIDIA boards (based on their prevalence in edge deployments) and the three protocols (as complementary views of quality, cost, and robustness). We will also explicitly discuss the controlled experimental conditions and note the limitations regarding untested factors such as dynamic thermal throttling and variable connectivity, thereby qualifying the scope of our generalization claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark grounded in direct hardware measurements

full rationale

The paper introduces EdgeFlowerTune as a deployment-oriented benchmark that evaluates federated LLM fine-tuning methods through direct measurements of quality, communication, latency, memory, energy, and robustness on physical Android smartphones and NVIDIA boards. No equations, fitted parameters, or derivations appear in the provided text; the central claim that accuracy-only evaluation can mislead on deployability follows from observed experimental divergences under the three protocols rather than any reduction to prior inputs or self-citations. The work is self-contained as an empirical study whose results are falsifiable via reproduction on the described hardware and software stack.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim that accuracy-only views mislead rests on the domain assumption that the chosen hardware and protocols capture representative edge constraints; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Commercial Android smartphones and NVIDIA edge boards plus Flower/MobileFineTuner represent realistic edge deployment environments.
    Invoked to justify the platform choice and the claim of realistic constraints.
invented entities (1)
  • EdgeFlowerTune benchmark with its three protocols (no independent evidence)
    purpose: to provide joint quality and system-cost evaluation for federated LLM fine-tuning
    The benchmark itself is the novel contribution; no independent falsifiable evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5566 in / 1349 out tokens · 30312 ms · 2026-05-12T00:59:41.181971+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 5 internal anchors

  1. [1] Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D. Lane. Smart at what cost? Characterising mobile deep neural networks in the wild. In Proceedings of the 21st ACM Internet Measurement Conference (IMC '21), pages 658–672, New York, NY, USA, 2021. Association for Computing Machinery.

  2. [2] Jiamu Bai, Daoyuan Chen, Bingchen Qian, Liuyi Yao, and Yaliang Li. Federated fine-tuning of large language models under heterogeneous tasks and client resources. In Advances in Neural Information Processing Systems, 2024.

  3. [3] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239.

  4. [4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

  5. [5] Yae Jee Cho, Luyang Liu, Zheng Xu, Aldi Fahrezi, and Gauri Joshi. Heterogeneous LoRA for federated fine-tuning of on-device foundation models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12903–12913, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

  6. [6] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

  7. [7] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  8. [8] European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council, 2016. URL https://data.europa.eu/eli/reg/2016/679/oj.

  9. [9] Tao Fan, Hanlin Gu, Xuemei Cao, Chee Seng Chan, Qian Chen, Yiqiang Chen, et al. Ten challenging problems in federated foundation models. IEEE Transactions on Knowledge and Data Engineering, 37(7):4314–4337, 2025.

  10. [10] Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, KangYoon Lee, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, and Nicholas D. Lane. FlowerTune: A cross-domain benchmark for federated fine-tuning of large language models, 2025.

  11. [11] Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, and Bing Luo. MobileFineTuner: A unified end-to-end framework for fine-tuning LLMs on mobile phones, 2025.

  12. [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

  13. [13] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Graham Cormode, Rachel Cummings, Rodrigo D'Oliveira, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021. doi: 10.1561/2200000083.

  14. [14] Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, and Jingren Zhou. FederatedScope-LLM: A comprehensive package for fine-tuning large language models in federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5260–5271, 2024.

  15. [15] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, volume 2, pages 429–450, 2020.

  16. [16] Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, and Yue Gao. SplitLoRA: A split parameter-efficient fine-tuning framework for large language models. arXiv preprint arXiv:2407.00952, 2024.

  17. [17] Qianli Liu, Zhaorui Zhang, Xin Yao, and Benben Liu. HLoRA: Efficient federated learning system for LLM heterogeneous fine-tuning. arXiv preprint arXiv:2503.00813, 2025.

  18. [18] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.

  19. [19] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  20. [20] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264.

  21. [21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. doi: 10.1609/aaai.v34i05.6399.

  22. [22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4463–4473, Hong Kong, China, 2019. Association for Computational Linguistics.

  23. [23] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. URL https://arxiv.org/abs/2503.19786.

  24. [24] Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning (ICML '24). JMLR.org, 2024.

  25. [25] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446.

  26. [26] Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, and Barbara Plank. "My answer is C": First-token probabilities do not match text answers in instruction-tuned language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024.

  27. [27] Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, and Ang Li. FLoRA: Federated fine-tuning large language models with heterogeneous low-rank adaptations. In Advances in Neural Information Processing Systems, 2024.

  28. [28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.

  29. [30] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115.

  30. [31] Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, and Siheng Chen. OpenFedLLM: Training large language models on decentralized private data via federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6137–6147, 2024. doi: 10.1145/3637528.3671582.

  31. [32] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472.

  32. [33] Jianyi Zhang, Saeed Vahidian, Martin Kuo, Chunyuan Li, Ruiyi Zhang, Tong Yu, Yufan Zhou, Guoyin Wang, and Yiran Chen. Towards building the federated GPT: Federated instruction tuning. arXiv preprint arXiv:2305.05644, 2023.