pith. machine review for the scientific record.

arxiv: 2604.25421 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Recognition: unknown

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: federated learning · large language models · token quantization · Fisher information · communication efficiency · edge computing · LoRA · non-IID data

The pith

Fisher-guided token selection and mixed-precision quantization cut uplink traffic 46x in federated LLM fine-tuning on edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated fine-tuning lets LLMs adapt to private edge data without centralizing it, but heterogeneous bandwidth and non-IID partitions make the uplink the main bottleneck even after parameter-efficient methods like LoRA shrink the trainable set. Fed-FSTQ introduces a lightweight Fisher proxy that scores token sensitivity, then pairs importance-aware selection with non-uniform quantization so high-value tokens keep more bits while redundant ones are suppressed. The module plugs into existing federated PEFT pipelines without changing server aggregation and packs sparse messages for bandwidth-heterogeneous clients. On multilingual and medical QA tasks under non-IID splits, the scheme reaches a fixed quality target with 46 times less cumulative uplink data and 52 percent faster wall-clock convergence than a plain LoRA baseline. The same Fisher signal can be reused at inference to drop tokens and deliver up to 1.55 times speedup on Jetson-class hardware.

Core claim

Fed-FSTQ is a model-agnostic primitive that couples a lightweight Fisher proxy for token sensitivity with importance-aware selection and non-uniform mixed-precision quantization, allowing federated PEFT to transmit only the most informative evidence at high fidelity while discarding redundant signals, thereby reducing cumulative uplink volume by 46x and wall-clock time-to-accuracy by 52 percent relative to standard LoRA under non-IID partitions.

What carries the argument

The lightweight Fisher proxy, which estimates per-token sensitivity and drives both the importance-aware selection and the allocation of higher bit widths to critical tokens on the uplink.
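
Figure 2's caption pins down what "lightweight" means here: squared gradients of the training loss with respect to the input embeddings, accumulated per token. A minimal sketch of that computation, assuming a HuggingFace-style model that accepts inputs_embeds and labels; the function name and interface are ours, not the paper's:

```python
import torch

def fisher_token_scores(model, input_ids, labels):
    """Token-level Fisher proxy: squared loss gradient w.r.t. each token's
    input embedding, summed over the embedding dimension (one score per token)."""
    embeds = model.get_input_embeddings()(input_ids).detach()
    embeds.requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss  # standard LM loss
    (grads,) = torch.autograd.grad(loss, embeds)            # shape (B, T, D)
    return grads.pow(2).sum(dim=-1)                         # shape (B, T)
```

The paper describes the proxy as backprop-aligned, i.e., reusing the training backward pass; the standalone autograd.grad call above is only for clarity of exposition.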

Load-bearing premise

The lightweight Fisher proxy supplies a reliable estimate of token sensitivity that generalizes across heterogeneous clients, tasks, and non-IID partitions without bias or excessive local overhead.

What would settle it

Run the same federated schedule with random token selection instead of the Fisher proxy and measure whether cumulative uplink traffic to target accuracy rises back toward the baseline level.
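
A sketch of that control, under assumptions: the selection step below can be flipped between Fisher scores and uniform-random scores while holding the keep ratio, and therefore the uplink budget, fixed. The keep_ratio value is an illustrative placeholder, not the paper's setting.

```python
import torch

def select_tokens(scores, keep_ratio=0.25, random_baseline=False):
    """Top-k token selection. The ablation arm replaces Fisher scores with
    random ones, so only the importance signal changes, not the bit budget."""
    if random_baseline:
        scores = torch.rand_like(scores)  # severs selection from sensitivity
    k = max(1, int(keep_ratio * scores.shape[-1]))
    return scores.topk(k, dim=-1).indices  # indices of tokens kept on the uplink
```

If traffic-to-target-accuracy under random_baseline=True climbs back toward the Fed-LoRA level, the Fisher signal, and not the compression machinery alone, is doing the work.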

Figures

Figures reproduced from arXiv: 2604.25421 by Changyu Li, Fei Luo, Jiashen Liu, Jidu Xing, Kaishun Wu, Lu Wang, Ming Lei, Shuanghong Huang.

Figure 1
Figure 1: The Uplink Bottleneck in Federated LLM Fine-Tuning. Under stochastic channel conditions (R_{k,t}), standard Fed-LoRA (red arrows, dense blocks) suffers from straggler delays per Eq. (9). FED-FSTQ (green arrows, sparse blocks) reduces bits(m_{k,t}) via Fisher-guided semantic compression, enabling efficient transmission even under constrained and heterogeneous uplinks. The straggler client (highlighted with clock … view at source ↗
Figure 2
Figure 2: System Architecture of FED-FSTQ. FED-FSTQ decouples transmission fidelity from parameter magnitude by allocating bits according to Fisher-guided sensitivity. (1) Sensitivity estimation: during standard backpropagation, each client computes squared gradients w.r.t. input embeddings as a token-level Fisher proxy [35]. (2) Mixed-precision allocation: a Fisher-weighted rate–distortion policy assigns discrete … view at source ↗
Figure 3
Figure 3: Fisher vs. Attention Heatmap. Attention may emphasize high-frequency connectors, whereas the Fisher proxy highlights structurally decisive tokens whose removal breaks logical validity, motivating high-fidelity retention. view at source ↗
Figure 4
Figure 4: Communication–accuracy Pareto frontier. FED-FSTQ reaches target accuracy with 46× less cumulative uplink traffic than Fed-LoRA (FedAvg [6] + LoRA [13]). view at source ↗
Figure 7
Figure 7: Impact of data heterogeneity (non-IID). Accuracy under Dirichlet client partitions. FED-FSTQ remains stable under extreme heterogeneity (robust FL under heterogeneity [15], [17]). view at source ↗
Figure 6
Figure 6: End-to-end speedups. (a) Faster convergence due to reduced straggler delay. (b) Faster on-device inference enabled by Fisher-guided token reduction (efficient transformer foundations [4]). view at source ↗
Figure 8
Figure 8: Scalability with client population. Convergence time (hours) versus the number of clients (at-scale FL systems [12], [21]). view at source ↗
Figure 9
Figure 9: Packet loss resilience. Accuracy under packet loss rates up to 20% in mobile uplinks [11]. view at source ↗
Figure 11
Figure 11: On-device energy/battery drain visualization. Reduced communication time yields substantially improved energy sustainability under continuous training [5], [11]. view at source ↗
Figure 13
Figure 13: Multilingual cost radar. FED-FSTQ maintains low and balanced communication cost across languages. view at source ↗
Table V (reproduced alongside the figure). Peak memory (MB, lower is better); FED-FSTQ fits within 2GB edge devices: FedAvg (server GPU) 4500 · FedPAQ (server GPU) 4500 · Fed-ToMe (high-end edge) 3800 · QSGD (high-end edge) 2100 · FedBAT (mid-range edge) 1800 · Fed-FSTQ (IoT/mobile, 2GB) 1450.
Figure 12
Figure 12: On-device memory footprint. FED-FSTQ is the only method that stays below the 2GB Jetson limit, while uncompressed and heavier baselines exceed the edge budget. view at source ↗
Figure 14
Figure 14: Efficiency–reliability trade-off. FED-FSTQ occupies the high-efficiency, high-reliability region. view at source ↗
read the original abstract

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Fed-FSTQ, a Fisher-guided token quantization primitive for communication-efficient federated fine-tuning of LLMs on edge devices. It employs a lightweight Fisher proxy to estimate token sensitivity, enabling importance-aware token selection and non-uniform mixed-precision quantization within standard PEFT pipelines such as LoRA. The method is presented as model-agnostic and compatible with heterogeneous bandwidth clients via sparse packing. Experiments on multilingual QA and medical QA under non-IID partitions report a 46x reduction in cumulative uplink traffic to reach a fixed quality threshold relative to a standard LoRA baseline, a 52% improvement in end-to-end wall-clock time-to-accuracy, and up to 1.55x inference speedup on NVIDIA Jetson-class devices.

Significance. If the empirical claims are substantiated, the work would be significant for practical federated LLM adaptation on resource-constrained edge hardware, where uplink communication and stragglers are primary bottlenecks. The drop-in compatibility with existing PEFT methods and support for non-IID regimes address real deployment constraints. The reported traffic and latency reductions, if robust, represent a substantial advance over uniform compression baselines.

major comments (2)
  1. The central claims of 46x uplink traffic reduction and 52% wall-clock improvement (Abstract) rest on the untested assumption that the lightweight Fisher proxy yields token importance scores that are both accurate enough to preserve quality under aggressive selection/quantization and low-overhead enough not to offset communication savings on heterogeneous clients. No ablation replaces proxy scores with oracle importance, measures proxy runtime on Jetson-class hardware, or compares it against gradient-based alternatives; without these, the end-to-end gains cannot be attributed to the proposed mechanism rather than to unaccounted confounds in non-IID partitions or baseline tuning.
  2. Experimental reporting (Abstract and results sections) lacks statistical significance tests, precise baseline specifications (e.g., LoRA rank, exact quantization bit allocations, client participation rates), and controls for potential confounds such as varying non-IID degrees or client compute heterogeneity. These omissions make it impossible to assess whether the quantitative improvements generalize or are load-bearing on the Fisher proxy's fidelity across tasks and partitions.
minor comments (2)
  1. Provide a clear algorithmic description or pseudocode for the Fisher proxy computation, token selection threshold, and mixed-precision allocation rule to support reproducibility; one hedged sketch of such a rule follows this list.
  2. Clarify whether the reported inference speedup from token reduction is measured end-to-end including any proxy overhead at inference time.
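
One possible shape for the requested pseudocode, sketched under assumptions: top-k selection by Fisher score, a rank-quantile map onto the 2/3/4-bit levels the rebuttal mentions, and a plain symmetric quantizer. The paper's actual rule is a Fisher-weighted rate–distortion policy with non-uniform codebooks, which this simplification does not reproduce.

```python
import torch

def select_and_allocate(scores, keep_ratio=0.5, bit_levels=(2, 3, 4)):
    """Keep the top keep_ratio tokens by Fisher score, then give higher bit
    widths to higher-scoring kept tokens via a simple rank-quantile split."""
    k = max(1, int(keep_ratio * scores.shape[-1]))
    kept = scores.topk(k, dim=-1).indices                   # tokens to transmit
    kept_scores = scores.gather(-1, kept)
    # normalized rank in [0, 1]; 0 = least important kept token
    ranks = kept_scores.argsort(-1).argsort(-1).float() / max(k - 1, 1)
    level = (ranks * len(bit_levels)).long().clamp(max=len(bit_levels) - 1)
    bits = torch.as_tensor(bit_levels)[level]               # per-token bit width
    return kept, bits

def quantize(x, bits):
    """Symmetric uniform quantizer at a given bit width; a stand-in for the
    paper's non-uniform codebooks."""
    qmax = 2 ** (int(bits) - 1) - 1
    scale = x.abs().max().clamp_min(1e-8) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale
```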

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validating the Fisher proxy and improving experimental rigor. We address each major comment below and outline revisions that will be incorporated into the next version of the manuscript.

read point-by-point responses
  1. Referee: The central claims of 46x uplink traffic reduction and 52% wall-clock improvement (Abstract) rest on the untested assumption that the lightweight Fisher proxy yields token importance scores that are both accurate enough to preserve quality under aggressive selection/quantization and low-overhead enough not to offset communication savings on heterogeneous clients. No ablation replaces proxy scores with oracle importance, measures proxy runtime on Jetson-class hardware, or compares it against gradient-based alternatives; without these, the end-to-end gains cannot be attributed to the proposed mechanism rather than to unaccounted confounds in non-IID partitions or baseline tuning.

    Authors: We agree that additional ablations would more conclusively attribute the observed gains to the Fisher proxy rather than to experimental setup. The current manuscript already includes comparisons to uniform quantization and standard LoRA under fixed non-IID partitions (Dirichlet alpha=0.1), with the proxy overhead reported as <3% of per-round compute in the Jetson profiling subsection. However, we did not include an oracle importance ablation or direct gradient-based comparison. In the revised manuscript we will add: (1) an oracle ablation replacing proxy scores with full-gradient importance on a subset of rounds, (2) explicit wall-clock measurements of the proxy on NVIDIA Jetson Orin hardware, and (3) a lightweight gradient-norm baseline for token scoring. These additions will allow readers to quantify any fidelity gap and confirm that communication savings are not offset by proxy cost. We maintain that the controlled data partitions and identical baseline tuning across methods already limit confounds, but the new experiments will strengthen this claim. revision: yes

  2. Referee: Experimental reporting (Abstract and results sections) lacks statistical significance tests, precise baseline specifications (e.g., LoRA rank, exact quantization bit allocations, client participation rates), and controls for potential confounds such as varying non-IID degrees or client compute heterogeneity. These omissions make it impossible to assess whether the quantitative improvements generalize or are load-bearing on the Fisher proxy's fidelity across tasks and partitions.

    Authors: We acknowledge that the current presentation could be more explicit. The full manuscript specifies LoRA rank r=8, mixed-precision allocations (Fisher-guided 2/3/4-bit per token), 10% client participation per round, and non-IID partitioning via Dirichlet(0.1). However, these details are distributed across sections and lack statistical tests. In revision we will: (1) add a dedicated hyperparameter table with exact bit allocations and participation rates, (2) report mean and standard deviation over 5 random seeds with paired t-tests or Wilcoxon signed-rank tests for the 46x traffic and 52% time-to-accuracy claims, and (3) include two new experiment sets varying Dirichlet alpha (0.05, 0.5) and client compute heterogeneity (simulated 2x-4x slowdown on 30% of clients). These changes will make the reporting self-contained and demonstrate robustness across partition degrees and hardware heterogeneity. revision: yes
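
For readers unfamiliar with the protocol named here, Dirichlet partitioning is the standard way such label-skewed client splits are drawn; a minimal sketch (our illustration, not the paper's code), where smaller alpha means more extreme non-IID skew:

```python
import numpy as np

def dirichlet_partition(labels, n_clients=10, alpha=0.1, seed=0):
    """Assign sample indices to clients, with each class's proportions drawn
    from Dirichlet(alpha); alpha=0.1 matches the extreme-heterogeneity setting.
    `labels` is a 1-D array of per-sample class ids."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))  # class split over clients
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients
```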

Circularity Check

0 steps flagged

No circularity; empirical method with experimental validation

full rationale

The paper proposes Fed-FSTQ as a practical system for federated LLM fine-tuning, using a Fisher proxy for token selection and quantization. All load-bearing claims (46x traffic reduction, 52% wall-clock improvement) are presented as direct outcomes of experiments on multilingual and medical QA under non-IID partitions. No derivation chain, equations, or self-citations are invoked to 'predict' results; the method is model-agnostic and drop-in, with performance measured externally against LoRA baselines. The Fisher proxy is an engineering choice whose fidelity is tested empirically rather than assumed by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on the standard assumption that a lightweight Fisher information approximation can serve as a proxy for token-level sensitivity in LLM fine-tuning; no new entities are postulated, and free parameters such as selection thresholds and bit allocations are chosen per experiment (a back-of-envelope cost sketch follows the ledger below).

free parameters (2)
  • token selection threshold or ratio
    Determines which tokens are retained based on Fisher scores; value chosen to balance compression and accuracy
  • mixed-precision bit allocations
    Non-uniform bit widths assigned according to token importance levels
axioms (1)
  • domain assumption: Fisher information matrix can be approximated efficiently as a proxy for parameter sensitivity to individual tokens
    Invoked to enable importance-aware selection without full second-order computation
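
To make the ledger concrete, a back-of-envelope sketch of how the two free parameters set uplink cost; the keep ratio and 16-bit index width here are illustrative placeholders, not values reported in the paper:

```python
def uplink_bits_per_token(keep_ratio=0.25, bit_levels=(2, 3, 4), index_bits=16):
    """Average uplink bits per original token: only kept tokens are sent,
    each costing an index plus a payload at the mean allocated width."""
    mean_width = sum(bit_levels) / len(bit_levels)
    return keep_ratio * (index_bits + mean_width)

print(uplink_bits_per_token())  # 0.25 * (16 + 3.0) = 4.75 bits per token
```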

pith-pipeline@v0.9.0 · 5595 in / 1311 out tokens · 62697 ms · 2026-05-07T16:46:58.059699+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 10 canonical work pages · 6 internal anchors

  1. [1]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  3. [3]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186

  4. [4]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  5. [5]

    Edge computing: Vision and challenges,

    W. Shi, J. Cao, Q. Zhang et al., “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016

  6. [6]

    Communication-efficient learning of deep networks from decentralized data,

    B. McMahan, E. Moore, D. Ramage et al., “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282

  7. [7]

    Advances and open problems in federated learning,

    P. Kairouz, H. B. McMahan, B. Avent et al., “Advances and open problems in federated learning,” Foundations and Trends in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021

  8. [8]

    Federated learning: Challenges, methods, and future directions,

    T. Li, A. K. Sahu, A. Talwalkar et al., “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020

  9. [9]

    Practical secure aggregation for privacy-preserving machine learning,

    K. Bonawitz, V. Ivanov, B. Kreuter et al., “Practical secure aggregation for privacy-preserving machine learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1175–1191

  10. [10]

    Deep learning with differential privacy,

    M. Abadi, A. Chu, I. Goodfellow et al., “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318

  11. [11]

    Federated learning in mobile edge networks: A comprehensive survey,

    W. Y. B. Lim, N. C. Luong, D. T. Hoang et al., “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020

  12. [12]

    Towards federated learning at scale: System design,

    K. Bonawitz, H. Eichner, W. Grieskamp et al., “Towards federated learning at scale: System design,” Proceedings of Machine Learning and Systems, vol. 1, pp. 374–388, 2019

  13. [13]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis et al., “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  14. [14]

    QLoRA: Efficient finetuning of quantized LLMs,

    T. Dettmers, A. Pagnoni, A. Holtzman et al., “QLoRA: Efficient finetuning of quantized LLMs,” Advances in Neural Information Processing Systems, vol. 36, pp. 10088–10115, 2023

  15. [15]

    Federated optimization in heterogeneous networks,

    T. Li, A. K. Sahu, M. Zaheer et al., “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020

  16. [16]

    Tackling the objective inconsistency problem in heterogeneous federated optimization,

    J. Wang, Q. Liu, H. Liang et al., “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in Neural Information Processing Systems, vol. 33, pp. 7611–7623, 2020

  17. [17]

    Scaffold: Stochastic controlled averaging for federated learning,

    S. P. Karimireddy, S. Kale, M. Mohri et al., “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143

  18. [18]

    Adaptive federated optimization,

    S. J. Reddi, Z. Charles, M. Zaheer et al., “Adaptive federated optimization,” in International Conference on Learning Representations, 2021. Available: https://openreview.net/forum?id=LkFG3lB13U5

  20. [20]

    Federated learning based on dynamic regularization,

    D. A. E. Acar, Y. Zhao, R. Matas et al., “Federated learning based on dynamic regularization,” in International Conference on Learning Representations, 2021. Available: https://openreview.net/forum?id=B7v4QMR6Z9w

  21. [21]

    Leaf: A benchmark for federated settings,

    S. Caldas, S. M. K. Duddu, P. Wu et al., “Leaf: A benchmark for federated settings,” arXiv preprint arXiv:1812.01097, 2018

  22. [22]

    FedScale: Benchmarking model and system performance of federated learning at scale,

    F. Lai, Y. Dai, S. Singapuram et al., “FedScale: Benchmarking model and system performance of federated learning at scale,” in International Conference on Machine Learning. PMLR, 2022, pp. 11814–11827

  23. [23]

    Flower: A friendly federated learning research framework,

    D. J. Beutel, T. Topal, A. Mathur et al., “Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020

  24. [24]

    QSGD: Communication-efficient SGD via gradient quantization and encoding,

    D. Alistarh, D. Grubic, J. Li et al., “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems, vol. 30, 2017

  25. [25]

    Deep gradient compression: Reducing the communication bandwidth for distributed training,

    Y. Lin, S. Han, H. Mao et al., “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in International Conference on Learning Representations, 2018. Available: https://openreview.net/forum?id=SkhQHMW0W

  26. [26]

    FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,

    A. Reisizadeh, A. Mokhtari, H. Hassani et al., “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2021–2031

  27. [27]

    Error feedback fixes signSGD and other gradient compression schemes,

    S. P. Karimireddy, Q. Rebjock, S. Stich et al., “Error feedback fixes signSGD and other gradient compression schemes,” in International Conference on Machine Learning. PMLR, 2019, pp. 3252–3261

  28. [28]

    DynamicViT: Efficient vision transformers with dynamic token sparsification,

    Y. Rao, W. Zhao, B. Liu et al., “DynamicViT: Efficient vision transformers with dynamic token sparsification,” Advances in Neural Information Processing Systems, vol. 34, pp. 13937–13949, 2021

  29. [29]

    Token Merging: Your ViT But Faster

    D. Bolya, C.-Y. Fu, X. Dai et al., “Token merging: Your ViT but faster,” arXiv preprint arXiv:2210.09461, 2022

  30. [30]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale,

    T. Dettmers, M. Lewis, Y. Belkada et al., “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 30318–30332

  31. [31]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler et al., “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022

  32. [32]

    SmoothQuant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099

  33. [33]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration,

    J. Lin, J. Tang, H. Tang et al., “AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration,” Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024

  34. [34]

    ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers,

    Z. Yao, R. Yazdani Aminabadi, M. Zhang et al., “ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 27168–27183, 2022

  35. [35]

    Natural gradient works efficiently in learning,

    S.-I. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251–276, 1998

  36. [36]

    Optimizing neural networks with Kronecker-factored approximate curvature,

    J. Martens and R. Grosse, “Optimizing neural networks with Kronecker-factored approximate curvature,” in International Conference on Machine Learning. PMLR, 2015, pp. 2408–2417

  37. [37]

    Optimal brain damage,

    Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” Advances in Neural Information Processing Systems, vol. 2, 1989

  38. [38]

    Second order derivatives for network pruning: Optimal brain surgeon,

    B. Hassibi and D. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” Advances in Neural Information Processing Systems, vol. 5, 1992

  39. [39]

    Overcoming catastrophic forgetting in neural networks,

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017

  40. [40]

    SNIP: Single-shot network pruning based on connection sensitivity,

    N. Lee, T. Ajanthan, and P. Torr, “SNIP: Single-shot network pruning based on connection sensitivity,” in International Conference on Learning Representations, 2019. Available: https://openreview.net/forum?id=B1VZqjAcYX

  41. [41]

    Picking winning tickets before training by preserving gradient flow,

    C. Wang, G. Zhang, and R. Grosse, “Picking winning tickets before training by preserving gradient flow,” in International Conference on Learning Representations, 2020. Available: https://openreview.net/forum?id=SkgsACVKPH

  42. [42]

    PubMedQA: A dataset for biomedical research question answering,

    Q. Jin, B. Dhingra, Z. Liu et al., “PubMedQA: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577

  43. [43]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams,

    D. Jin, E. Pan, N. Oufattole et al., “What disease does this patient have? A large-scale open domain question answering dataset from medical exams,” Applied Sciences, vol. 11, no. 14, p. 6421, 2021

  44. [44]

    Federated optimization: Distributed optimization beyond the datacenter,

    J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015

  45. [45]

    Federated Learning: Strategies for Improving Communication Efficiency

    J. Konečný, H. B. McMahan, F. X. Yu et al., “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016

  46. [46]

    TernGrad: Ternary gradients to reduce communication in distributed deep learning,

    W. Wen, C. Xu, F. Yan et al., “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” Advances in Neural Information Processing Systems, vol. 30, 2017

  47. [47]

    1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

    F. Seide, H. Fu, J. Droppo et al., “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Interspeech, 2014, pp. 1058–1062

  48. [48]

    signSGD: Compressed optimisation for non-convex problems,

    J. Bernstein, Y.-X. Wang, K. Azizzadenesheli et al., “signSGD: Compressed optimisation for non-convex problems,” in International Conference on Machine Learning. PMLR, 2018, pp. 560–569

  49. [49]

    FedBAT: Communication-efficient federated learning via learnable binarization,

    S. Li, W. Xu, H. Wang et al., “FedBAT: Communication-efficient federated learning via learnable binarization,” arXiv preprint arXiv:2408.03215, 2024; accepted at ICML 2024 (as stated on arXiv)

  50. [50]

    Sparsified SGD with memory,

    S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” Advances in Neural Information Processing Systems, vol. 31, 2018

  51. [51]

    Sparse binary compression: Towards distributed deep learning with minimal communication,

    F. Sattler, S. Wiedemann, K.-R. Müller et al., “Sparse binary compression: Towards distributed deep learning with minimal communication,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8

  52. [52]

    PowerSGD: Practical low-rank gradient compression for distributed optimization,

    T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical low-rank gradient compression for distributed optimization,” Advances in Neural Information Processing Systems, vol. 32, 2019

  53. [53]

    Parameter-efficient transfer learning for NLP,

    N. Houlsby, A. Giurgiu, S. Jastrzebski et al., “Parameter-efficient transfer learning for NLP,” in International Conference on Machine Learning. PMLR, 2019, pp. 2790–2799

  54. [54]

    Prefix-tuning: Optimizing continuous prompts for generation,

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597

  55. [55]

    The power of scale for parameter-efficient prompt tuning,

    B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3045–3059

  56. [56]

    Adaptive budget allocation for parameter-efficient fine-tuning,

    Q. Zhang, M. Chen, A. Bukharin et al., “Adaptive budget allocation for parameter-efficient fine-tuning,” in The Eleventh International Conference on Learning Representations, 2023. Available: https://openreview.net/forum?id=lq62uWRJjiY

  57. [57]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

    S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in ICLR, 2016. Available: http://arxiv.org/abs/1510.00149

  58. [58]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference,

    B. Jacob, S. Kligys, B. Chen et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713

  59. [59]

    Variational dropout sparsifies deep neural networks,

    D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 2498–2507

  60. [60]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks,

    J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning Representations, 2019. Available: https://openreview.net/forum?id=rJl-b3RcF7

  61. [61]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. Available: https://openreview.net/forum?id=YicbFdNTTy

  62. [62]

    TokenLearner: Adaptive space-time tokenization for videos,

    M. S. Ryoo, A. Piergiovanni, A. Arnab et al., “TokenLearner: Adaptive space-time tokenization for videos,” in Advances in Neural Information Processing Systems (NeurIPS), 2021

  63. [63]

    Adaptive token sampling for efficient vision transformers,

    M. Fayyaz, S. A. Koohpayegani, F. R. Jafari et al., “Adaptive token sampling for efficient vision transformers,” in European Conference on Computer Vision. Springer, 2022, pp. 396–414

  64. [64]

    A-ViT: Adaptive tokens for efficient vision transformer,

    H. Yin, A. Vahdat, J. M. Alvarez et al., “A-ViT: Adaptive tokens for efficient vision transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10809–10818

  65. [65]

    EViT: Expediting vision transformers via token reorganizations,

    Y. Liang, C. Ge, Z. Tong et al., “EViT: Expediting vision transformers via token reorganizations,” in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=BjyvwnXXVn_

  66. [66]

    Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning

    C. Li, D. Huang, K. Yao et al., “Physics-guided tiny-mamba transformer for reliability-aware early fault warning,” arXiv preprint arXiv:2601.21293, 2026