FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices
Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3
The pith
Fisher-guided token selection and mixed-precision quantization cut uplink traffic 46x in federated LLM fine-tuning on edge devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fed-FSTQ is a model-agnostic primitive that couples a lightweight Fisher proxy for token sensitivity with importance-aware selection and non-uniform mixed-precision quantization. By transmitting only the most informative evidence at high fidelity and discarding redundant signals, it reduces cumulative uplink volume in federated PEFT by 46x and wall-clock time-to-accuracy by 52 percent relative to standard LoRA under non-IID partitions.
What carries the argument
The lightweight Fisher proxy, which estimates per-token sensitivity to drive importance-aware selection and allocation of higher bit-widths to critical tokens during uplink.
Load-bearing premise
The lightweight Fisher proxy supplies a reliable estimate of token sensitivity that generalizes across heterogeneous clients, tasks, and non-IID partitions without bias or excessive local overhead.
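The premise above can be made concrete with a toy illustration. The paper's actual proxy is not specified in this review; a common lightweight stand-in for the Fisher diagonal is the squared first-order gradient of the loss, shown here per input coordinate of a hypothetical linear model standing in for per-token sensitivity:

```python
def fisher_proxy_scores(w, x, y):
    """Empirical-Fisher-style proxy: squared loss gradient per input
    coordinate of a toy linear model with squared-error loss.
    Illustrative only; the paper's actual token-sensitivity proxy
    is not specified in this review."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    resid = pred - y
    # d/dx_t of (pred - y)^2 is 2 * resid * w_t; its square is the
    # diagonal empirical-Fisher estimate for coordinate t.
    return [(2.0 * resid * wt) ** 2 for wt in w]

# Coordinates with zero weight receive zero sensitivity.
print(fisher_proxy_scores([1.0, 0.0, 2.0], [1.0, 1.0, 1.0], 0.0))
# -> [36.0, 0.0, 144.0]
```

The load-bearing question the review raises is whether such a cheap first-order surrogate tracks true sensitivity across heterogeneous clients, which only the proposed ablations can settle.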
What would settle it
Run the same federated schedule with random token selection instead of the Fisher proxy and measure whether cumulative uplink traffic to target accuracy rises back toward the baseline level.
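The proposed falsification test amounts to swapping one selection rule for another under an identical token budget. A minimal sketch, assuming a generic top-k-by-score rule (the paper's exact selection threshold is not public in this review):

```python
import random

def select_tokens(scores, budget, mode="fisher"):
    """Pick `budget` token indices, either by proxy score or uniformly
    at random. `scores` is a hypothetical per-token sensitivity
    estimate (e.g. a squared-gradient Fisher proxy)."""
    indices = list(range(len(scores)))
    if mode == "fisher":
        indices.sort(key=lambda i: scores[i], reverse=True)
    else:  # random ablation: same budget, no importance signal
        random.shuffle(indices)
    return sorted(indices[:budget])

# Toy sensitivity profile: two highly informative tokens among noise.
scores = [0.01, 0.9, 0.02, 0.7, 0.03, 0.05]
print(select_tokens(scores, budget=2, mode="fisher"))  # -> [1, 3]
```

If the random-mode run reaches the target accuracy with comparable uplink volume, the Fisher proxy is not doing the work the core claim attributes to it.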
Original abstract
Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Fed-FSTQ, a Fisher-guided token quantization primitive for communication-efficient federated fine-tuning of LLMs on edge devices. It employs a lightweight Fisher proxy to estimate token sensitivity, enabling importance-aware token selection and non-uniform mixed-precision quantization within standard PEFT pipelines such as LoRA. The method is presented as model-agnostic and compatible with heterogeneous bandwidth clients via sparse packing. Experiments on multilingual QA and medical QA under non-IID partitions report a 46x reduction in cumulative uplink traffic to reach a fixed quality threshold relative to a standard LoRA baseline, a 52% improvement in end-to-end wall-clock time-to-accuracy, and up to 1.55x inference speedup on NVIDIA Jetson-class devices.
Significance. If the empirical claims are substantiated, the work would be significant for practical federated LLM adaptation on resource-constrained edge hardware, where uplink communication and stragglers are primary bottlenecks. The drop-in compatibility with existing PEFT methods and support for non-IID regimes address real deployment constraints. The reported traffic and latency reductions, if robust, represent a substantial advance over uniform compression baselines.
Major comments (2)
- The central claims of 46x uplink traffic reduction and 52% wall-clock improvement (Abstract) rest on the untested assumption that the lightweight Fisher proxy yields token importance scores that are both accurate enough to preserve quality under aggressive selection/quantization and low-overhead enough not to offset communication savings on heterogeneous clients. No ablation replaces proxy scores with oracle importance, measures proxy runtime on Jetson-class hardware, or compares it against gradient-based alternatives; without these, the end-to-end gains cannot be attributed to the proposed mechanism rather than to unaccounted confounds in non-IID partitions or baseline tuning.
- Experimental reporting (Abstract and results sections) lacks statistical significance tests, precise baseline specifications (e.g., LoRA rank, exact quantization bit allocations, client participation rates), and controls for potential confounds such as varying non-IID degrees or client compute heterogeneity. These omissions make it impossible to assess whether the quantitative improvements generalize or are load-bearing on the Fisher proxy's fidelity across tasks and partitions.
Minor comments (2)
- Provide a clear algorithmic description or pseudocode for the Fisher proxy computation, token selection threshold, and mixed-precision allocation rule to support reproducibility.
- Clarify whether the reported inference speedup from token reduction is measured end-to-end including any proxy overhead at inference time.
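The first minor comment asks for pseudocode for the selection threshold and bit-allocation rule. One plausible shape for such a rule, offered purely as an illustration (the quantile thresholds and the mapping to the paper's reported 2/3/4-bit levels are assumptions, not the authors' specification):

```python
def allocate_bits(scores, bit_levels=(4, 3, 2), quantiles=(0.9, 0.6)):
    """Non-uniform bit allocation by score band: top-decile tokens get
    4 bits, the next band 3 bits, the rest 2 bits. Thresholds and
    levels are illustrative; the paper's exact rule is not public
    in this review."""
    ranked = sorted(scores)
    hi = ranked[int(quantiles[0] * (len(ranked) - 1))]
    mid = ranked[int(quantiles[1] * (len(ranked) - 1))]
    bits = []
    for s in scores:
        if s >= hi:
            bits.append(bit_levels[0])
        elif s >= mid:
            bits.append(bit_levels[1])
        else:
            bits.append(bit_levels[2])
    return bits

print(allocate_bits([0.1, 0.9, 0.5, 0.2, 0.8]))  # -> [2, 4, 3, 2, 4]
```

A published rule of this form, with the actual thresholds, would satisfy the reproducibility request.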
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on validating the Fisher proxy and improving experimental rigor. We address each major comment below and outline revisions that will be incorporated into the next version of the manuscript.
Point-by-point responses
Referee: The central claims of 46x uplink traffic reduction and 52% wall-clock improvement (Abstract) rest on the untested assumption that the lightweight Fisher proxy yields token importance scores that are both accurate enough to preserve quality under aggressive selection/quantization and low-overhead enough not to offset communication savings on heterogeneous clients. No ablation replaces proxy scores with oracle importance, measures proxy runtime on Jetson-class hardware, or compares it against gradient-based alternatives; without these, the end-to-end gains cannot be attributed to the proposed mechanism rather than to unaccounted confounds in non-IID partitions or baseline tuning.
Authors: We agree that additional ablations would more conclusively attribute the observed gains to the Fisher proxy rather than to experimental setup. The current manuscript already includes comparisons to uniform quantization and standard LoRA under fixed non-IID partitions (Dirichlet alpha=0.1), with the proxy overhead reported as <3% of per-round compute in the Jetson profiling subsection. However, we did not include an oracle importance ablation or direct gradient-based comparison. In the revised manuscript we will add: (1) an oracle ablation replacing proxy scores with full-gradient importance on a subset of rounds, (2) explicit wall-clock measurements of the proxy on NVIDIA Jetson Orin hardware, and (3) a lightweight gradient-norm baseline for token scoring. These additions will allow readers to quantify any fidelity gap and confirm that communication savings are not offset by proxy cost. We maintain that the controlled data partitions and identical baseline tuning across methods already limit confounds, but the new experiments will strengthen this claim. revision: yes
Referee: Experimental reporting (Abstract and results sections) lacks statistical significance tests, precise baseline specifications (e.g., LoRA rank, exact quantization bit allocations, client participation rates), and controls for potential confounds such as varying non-IID degrees or client compute heterogeneity. These omissions make it impossible to assess whether the quantitative improvements generalize or are load-bearing on the Fisher proxy's fidelity across tasks and partitions.
Authors: We acknowledge that the current presentation could be more explicit. The full manuscript specifies LoRA rank r=8, mixed-precision allocations (Fisher-guided 2/3/4-bit per token), 10% client participation per round, and non-IID partitioning via Dirichlet(0.1). However, these details are distributed across sections and lack statistical tests. In revision we will: (1) add a dedicated hyperparameter table with exact bit allocations and participation rates, (2) report mean and standard deviation over 5 random seeds with paired t-tests or Wilcoxon signed-rank tests for the 46x traffic and 52% time-to-accuracy claims, and (3) include two new experiment sets varying Dirichlet alpha (0.05, 0.5) and client compute heterogeneity (simulated 2x-4x slowdown on 30% of clients). These changes will make the reporting self-contained and demonstrate robustness across partition degrees and hardware heterogeneity. revision: yes
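The Dirichlet(0.1) partitioning the rebuttal relies on is a standard protocol: each class's samples are split across clients in proportions drawn from a Dirichlet distribution, with small alpha producing highly skewed shards. A minimal stdlib sketch (function name and interface are assumptions for illustration):

```python
import random

def dirichlet_partition(labels, num_clients, alpha=0.1, seed=0):
    """Split sample indices across clients with per-class proportions
    drawn from Dirichlet(alpha); small alpha -> strongly non-IID
    shards. A sketch of the standard protocol, not the authors' code."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_clients)]
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        # Dirichlet draw via normalized Gamma(alpha, 1) samples.
        w = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(w) or 1.0
        cuts, acc = [], 0.0
        for wk in w[:-1]:
            acc += wk / total
            cuts.append(int(acc * len(idx)))
        parts = [idx[a:b] for a, b in zip([0] + cuts, cuts + [len(idx)])]
        for k in range(num_clients):
            shards[k].extend(parts[k])
    return shards
```

Sweeping alpha over {0.05, 0.1, 0.5}, as the revision promises, would show whether the 46x figure survives milder and harsher skew.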
Circularity Check
No circularity; empirical method with experimental validation
Full rationale
The paper proposes Fed-FSTQ as a practical system for federated LLM fine-tuning, using a Fisher proxy for token selection and quantization. All load-bearing claims (46x traffic reduction, 52% wall-clock improvement) are presented as direct outcomes of experiments on multilingual and medical QA under non-IID partitions. No derivation chain, equations, or self-citations are invoked to 'predict' results; the method is model-agnostic and drop-in, with performance measured externally against LoRA baselines. The Fisher proxy is an engineering choice whose fidelity is tested empirically rather than assumed by construction. This is a standard non-circular empirical paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- token selection threshold or ratio
- mixed-precision bit allocations
axioms (1)
- Domain assumption: the Fisher information matrix can be approximated efficiently as a proxy for parameter sensitivity to individual tokens
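Once bits are allocated per token, each payload still needs a concrete quantizer. A generic b-bit uniform quantizer, shown here as a stand-in for the paper's unspecified per-token scheme, makes the free parameter's effect tangible: the reconstruction grid coarsens as bits decrease.

```python
def quantize(vals, bits):
    """Uniform min-max quantization to `bits` bits. A generic stand-in
    for the paper's (unspecified) per-token quantizer."""
    levels = (1 << bits) - 1
    lo, hi = min(vals), max(vals)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in vals]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate values."""
    return [lo + c * scale for c in codes]
```

At 2 bits a token's values snap to one of four levels, so the allocation rule directly trades per-token fidelity against uplink bytes.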