pith. machine review for the scientific record. sign in

arxiv: 2605.13319 · v2 · submitted 2026-05-13 · 💻 cs.DC

Recognition: no theorem link

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:07 UTC · model grok-4.3

classification 💻 cs.DC
keywords speculative decodingcloud-edge collaborationLLM inferencepipeline schedulingdynamic programmingBayesian optimizationenergy efficiencycollaborative inference
0
0 comments X

The pith

PipeSD speeds up cloud-edge LLM inference 1.16x-2.16x by pipelining token batches and flexible verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing collaborative speculative decoding for LLMs is limited by sequential token generation that leaves resources idle and by rigid triggering of cloud verification that triggers either too early or causes expensive rollbacks. PipeSD fixes both with a token-batch pipeline whose scheduling is chosen by dynamic programming to overlap generation and communication, plus a dual-threshold mechanism for non-autoregressive verification that a lightweight Bayesian optimizer tunes on the fly. Evaluations on a real cloud-edge testbed with two model pairs show the changes produce consistent speedups and energy savings while preserving the privacy and offline benefits of edge deployment. A reader would care because the method makes large-model inference practical on mixed cloud and local hardware without sacrificing responsiveness or privacy.

Core claim

PipeSD overlaps token generation and communication through a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner; the resulting framework, implemented with llama-cpp-python, PyTorch, and FastAPI, delivers 1.16x-2.16x speedup and 14.3%-25.3% lower energy use compared with state-of-the-art baselines across four scenarios and two draft-target model pairs.

What carries the argument

Token-batch pipeline scheduler with dynamic-programming optimization paired with dual-threshold NAV triggering tuned by Bayesian autotuner; it overlaps generation and communication while allowing flexible verification to cut rollbacks.

If this is right

  • Token generation and communication can be overlapped to raise utilization in distributed LLM inference.
  • Flexible non-autoregressive verification reduces premature checks and costly rollbacks.
  • Energy consumption falls 14-25 percent while generation speed rises across tested model pairs and scenarios.
  • Cloud workload offloading remains compatible with offline robustness and privacy guarantees.
  • The same mechanisms apply to multiple draft-target model pairs without per-deployment retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pipelining idea could extend to other distributed AI workloads such as vision or sensor models that also mix local and remote computation.
  • Real-time adaptation of the Bayesian thresholds could let the system respond to changing network conditions without manual intervention.
  • If the autotuner proves lightweight enough, similar self-tuning could appear in pure edge deployments that occasionally borrow cloud capacity.

Load-bearing premise

The dynamic-programming batch scheduler and Bayesian autotuner will keep delivering stable gains across unseen model pairs, network conditions, and workloads without hidden overhead or needing extensive retuning.

What would settle it

Measuring speedup below 1.1x or zero energy reduction when the same implementation is run on a new model pair under different network latency would disprove the claim of consistent outperformance.

Figures

Figures reproduced from arXiv: 2605.13319 by Bing Hu, Mahdi Boloursaz Mashhadi, Pei Xiao, Yanfeng Zhang, Yitong Duan, Yunhe Han, Yunqi Gao.

Figure 1
Figure 1. Figure 1: Illustration of the speculative decoding process. decoder layers, each utilizing masked self-attention and feed-forward networks to process input sequences (Sheng et al., 2023). Decoder-only LLMs typically adopt autore￾gressive generation, which produces output sequence one token at a time, and each newly generated token is appended to the input sequence to predict the subsequent one (Vaswani et al., 2017;… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of transmission strategies [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of PipeSD architecture. The green part is the core of PipeSD. target model, which makes the system easy to scale and compatible with existing cloud-edge collaborative frame￾works. Moreover, although the current design only considers a single client, PipeSD can be easily extended to support multiple clients with minor modifications (see Appendix I for details). The modules on the edge device includ… view at source ↗
Figure 5
Figure 5. Figure 5: Average TPT (ms) with different bandwidth levels on HumanEval in Scenario 1 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Communication and computation latency characteristics used in PipeSD. 5.2.3. PERFORMANCE EVALUATION OF BO AUTOTUNER We evaluate the effectiveness of BO autotuner by comparing it with grid search and random search on tuning (R1, R2). As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication with low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering that induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication by a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16x-2.16x speedup and reducing energy consumption by 14.3%-25.3%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces PipeSD, a cloud-edge collaborative pipeline inference framework for large language models using speculative decoding. It proposes a token-batch pipeline scheduling mechanism optimized via dynamic programming to overlap generation and communication, along with a dual-threshold non-autoregressive verification (NAV) triggering mechanism enhanced by a lightweight Bayesian optimization autotuner. The framework is implemented using llama-cpp-python, PyTorch, and FastAPI, and evaluated on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios, claiming consistent outperformance of state-of-the-art baselines with speedups of 1.16x-2.16x and energy reductions of 14.3%-25.3%.

Significance. If the empirical results hold under broader conditions, PipeSD could meaningfully advance efficient distributed inference for LLMs by improving pipeline utilization and verification flexibility in cloud-edge setups. The use of dynamic programming for scheduling and Bayesian tuning for triggering offers a principled approach to optimization that may generalize if validated more extensively.

major comments (1)
  1. [Evaluation] The experiments cover only two draft-target model pairs on one testbed across four scenarios. This limited scope leaves the generalization of the dynamic-programming batch scheduler and Bayesian autotuner unproven, as the mechanisms may incur hidden overhead or require per-deployment retuning under varying model scales, network conditions, or workloads, undermining the claim of consistent speedups.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the positive assessment of the paper's potential impact and address the major comment on evaluation below.

read point-by-point responses
  1. Referee: [Evaluation] The experiments cover only two draft-target model pairs on one testbed across four scenarios. This limited scope leaves the generalization of the dynamic-programming batch scheduler and Bayesian autotuner unproven, as the mechanisms may incur hidden overhead or require per-deployment retuning under varying model scales, network conditions, or workloads, undermining the claim of consistent speedups.

    Authors: We thank the referee for pointing out the limited scope of our experiments. While the evaluation is indeed restricted to two model pairs and one testbed, these were selected to cover a range of practical cloud-edge conditions through the four scenarios, which vary in terms of communication latency and bandwidth. The dynamic-programming-based scheduler is designed to be general, as it takes as input the profiled computation and communication times for any given model pair and network, solving for the optimal pipeline schedule without assuming specific model scales. Similarly, the Bayesian autotuner optimizes the dual thresholds based on empirical performance data from the deployment, allowing adaptation to different workloads. We have measured and reported the overhead of these mechanisms in Section 5, showing they are negligible. To better address generalization, in the revised version we will expand the 'Discussion' section to include an analysis of how the proposed mechanisms can be applied to other model sizes and network conditions, along with potential limitations. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements

full rationale

The paper describes a pipeline scheduling mechanism using dynamic programming and a dual-threshold NAV trigger with Bayesian autotuner, then reports measured speedups (1.16x-2.16x) and energy reductions from implementation on a specific cloud-edge testbed with two model pairs. No equations, predictions, or uniqueness theorems are presented that reduce by construction to fitted inputs, self-citations, or renamed ansatzes; the results are direct testbed outputs rather than derived quantities forced by the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The framework builds on standard speculative decoding assumptions (draft model accuracy, network latency models) without stating new ones.

pith-pipeline@v0.9.0 · 5510 in / 1205 out tokens · 71306 ms · 2026-05-15T03:07:07.271256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

  1. [1]

    IEEE Transactions on Mobile Computing , month = apr, pages =

    Xu, Daliang and Yin, Wangsong and Zhang, Hao and Jin, Xin and Zhang, Ying and Wei, Shiyun and Xu, Mengwei and Liu, Xuanzhe , title =. IEEE Transactions on Mobile Computing , month = apr, pages =. 2025 , issue_date =. doi:10.1109/TMC.2024.3513457 , abstract =

  2. [2]

    Proceedings of the Workshop on Edge and Mobile Foundation Models , pages =

    Hao, Zixu and Jiang, Huiqiang and Jiang, Shiqi and Ren, Ju and Cao, Ting , title =. Proceedings of the Workshop on Edge and Mobile Foundation Models , pages =. 2024 , isbn =. doi:10.1145/3662006.3662067 , abstract =

  3. [3]

    EcoFed: Efficient Communication for DNN Partitioning-Based Federated Learning , year=

    Wu, Di and Ullah, Rehmat and Rodgers, Philip and Kilpatrick, Peter and Spence, Ivor and Varghese, Blesson , journal=. EcoFed: Efficient Communication for DNN Partitioning-Based Federated Learning , year=

  4. [4]

    Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation , pages =

    Alipourfard, Omid and Liu, Hongqiang Harry and Chen, Jianshu and Venkataraman, Shivaram and Yu, Minlan and Zhang, Ming , title =. Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation , pages =. 2017 , isbn =

  5. [5]

    Proceedings of the 40th International Conference on Machine Learning (ICML) , year =

    Yaniv Leviathan and Matan Kalman and Yossi Matias , title =. Proceedings of the 40th International Conference on Machine Learning (ICML) , year =

  6. [6]

    2023 , eprint=

    Accelerating Large Language Model Decoding with Speculative Sampling , author=. 2023 , eprint=

  7. [7]

    Parallel Comput

    The Communication Challenge for MPP: Intel Paragon and Meiko CS-2 , author=. Parallel Comput. , year=

  8. [8]

    2023 , eprint=

    Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=

  9. [9]

    2024 , eprint=

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads , author=. 2024 , eprint=

  10. [10]

    2025 , eprint=

    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty , author=. 2025 , eprint=

  11. [11]

    2024 , eprint=

    Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy , author=. 2024 , eprint=

  12. [12]

    2025 , howpublished =

  13. [13]

    2023 , eprint=

    LLMCad: Fast and Scalable On-device Large Language Model Inference , author=. 2023 , eprint=

  14. [14]

    2025 , eprint=

    A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models , author=. 2025 , eprint=

  15. [15]

    2025 , eprint=

    Quantize-Sample-and-Verify: LLM Acceleration via Adaptive Edge-Cloud Speculative Decoding , author=. 2025 , eprint=

  16. [16]

    SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive

    Jinwoo Park and Seunggeun Cho and Dongsu Han , booktitle=. SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive. 2025 , url=

  17. [17]

    SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification , url=

    Miao, Xupeng and Oliaro, Gabriele and Zhang, Zhihao and Cheng, Xinhao and Wang, Zeyu and Zhang, Zhengxin and Wong, Rae Ying Yee and Zhu, Alan and Yang, Lijie and Shi, Xiaoxiang and Shi, Chunan and Chen, Zhuoming and Arfeen, Daiyaan and Abhyankar, Reyna and Jia, Zhihao , year=. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculativ...

  18. [18]

    2024 , eprint=

    DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference , author=. 2024 , eprint=

  19. [19]

    Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence,

    Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward , author =. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence,. 2024 , month =. doi:10.24963/ijcai.2024/883 , url =

  20. [20]

    Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,

    Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant , author =. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,. 2025 , month =. doi:10.24963/ijcai.2025/902 , url =

  21. [21]

    Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,

    Not All Layers of LLMs Are Necessary During Inference , author =. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,. 2025 , month =. doi:10.24963/ijcai.2025/566 , url =

  22. [22]

    Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,

    Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency , author =. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,. 2025 , month =. doi:10.24963/ijcai.2025/951 , url =

  23. [23]

    2024 , eprint=

    EdgeShard: Efficient LLM Inference via Collaborative Edge Computing , author=. 2024 , eprint=

  24. [24]

    IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) , year=

    EdgePrompt: A Distributed Key-Value Inference Framework for LLMs in 6G Networks , author=. IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) , year=

  25. [25]

    Proceedings of the Tenth ACM/IEEE Symposium on Edge Computing , year=

    SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving , author=. Proceedings of the Tenth ACM/IEEE Symposium on Edge Computing , year=

  26. [26]

    2025 , eprint=

    Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding , author=. 2025 , eprint=

  27. [27]

    2025 , eprint=

    Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices , author=. 2025 , eprint=

  28. [28]

    Splitwise: Collaborative Edge–Cloud Inference for LLMs via Lyapunov-Assisted DRL , url=

    Younesi, Abolfazl and Shabrang Maryan, Abbas and Oustad, Elyas and Najafabadi Samani, Zahra and Ansari, Mohsen and Fahringer, Thomas , year=. Splitwise: Collaborative Edge–Cloud Inference for LLMs via Lyapunov-Assisted DRL , url=. doi:10.1145/3773274.3774267 , booktitle=

  29. [29]

    Ruikun Luo and Changwei Gu and Qiang He and Feifei Chen and Song Wu and Hai Jin and Yun Yang , booktitle=. Sim-. 2025 , url=

  30. [30]

    2025 , eprint=

    CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge , author=. 2025 , eprint=

  31. [31]

    2022 , eprint=

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , author=. 2022 , eprint=

  32. [32]

    Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree

    Zheng, Huanran and Wang, Xiaoling. Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.486

  33. [33]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

    Chen, Ziyi and Yang, Xiaocong and Lin, Jiacheng and Sun, Chenkai and Chang, Kevin Chen-Chuan and Huang, Jie , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  34. [34]

    2023 , eprint=

    FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU , author=. 2023 , eprint=

  35. [35]

    Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

    Xia, Heming and Ge, Tao and Wang, Peiyi and Chen, Si-Qing and Wei, Furu and Sui, Zhifang. Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.257

  36. [36]

    and Kaiser,

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention is all you need , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

  37. [37]

    2019 , url=

    Language Models are Unsupervised Multitask Learners , author=. 2019 , url=

  38. [38]

    2021 , eprint=

    Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

  39. [39]

    2021 , eprint=

    Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

  40. [40]

    2024 , eprint=

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence , author=. 2024 , eprint=

  41. [41]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  42. [42]

    2024 , eprint=

    TinyLlama: An Open-Source Small Language Model , author=. 2024 , eprint=

  43. [43]

    2023 , eprint=

    Speculative Decoding with Big Little Decoder , author=. 2023 , eprint=

  44. [44]

    2019 , eprint=

    PyTorch: An Imperative Style, High-Performance Deep Learning Library , author=. 2019 , eprint=

  45. [45]

    llama-cpp-python: Python bindings for llama.cpp , author =

  46. [46]

    GGUF: A Binary Model File Format for Efficient Loading and Inference , author =

  47. [47]

    Adaptive Block-Wise Regularization and Knowledge Distillation for Enhancing Federated Learning , year=

    Liu, Jianchun and Zeng, Qingmin and Xu, Hongli and Xu, Yang and Wang, Zhiyuan and Huang, He , journal=. Adaptive Block-Wise Regularization and Knowledge Distillation for Enhancing Federated Learning , year=

  48. [48]

    , journal=

    Al-Falahy, Naser and Alani, Omar Y. , journal=. Technologies for 5G Networks: Challenges and Opportunities , year=

  49. [49]

    Constrained Device Performance Benchmarking with the Implementation of Post-Quantum Cryptography , volume =

    Fitzgibbon, Gregory and Ottaviani, Carlo , year =. Constrained Device Performance Benchmarking with the Implementation of Post-Quantum Cryptography , volume =. Cryptography , doi =

  50. [50]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  51. [51]

    2024 , eprint=

    GPT-4 Technical Report , author=. 2024 , eprint=

  52. [52]

    2024 , eprint=

    SGLang: Efficient Execution of Structured Language Model Programs , author=. 2024 , eprint=

  53. [53]

    2025 , eprint=

    CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration , author=. 2025 , eprint=

  54. [54]

    2025 , eprint=

    PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks , author=. 2025 , eprint=

  55. [55]

    2024 , eprint=

    An Edge-Cloud Collaboration Framework for Generative AI Service Provision with Synergetic Big Cloud Model and Small Edge Models , author=. 2024 , eprint=

  56. [56]

    UPOA: A User Preference Based Latency and Energy Aware Intelligent Offloading Approach for Cloud-Edge Systems , year=

    Yuan, Jingling and Xiang, Yao and Deng, Yuhui and Zhou, Yi and Min, Geyong , journal=. UPOA: A User Preference Based Latency and Energy Aware Intelligent Offloading Approach for Cloud-Edge Systems , year=

  57. [57]

    A Cloud–Edge Collaborative Architecture for Multimodal LLM-Based Advanced Driver Assistance Systems in IoT Networks , year=

    Hu, Yaqi and Ye, Dongdong and Kang, Jiawen and Wu, Maoqiang and Yu, Rong , journal=. A Cloud–Edge Collaborative Architecture for Multimodal LLM-Based Advanced Driver Assistance Systems in IoT Networks , year=

  58. [58]

    2024 , eprint=

    Resource Allocation for Stable LLM Training in Mobile Edge Computing , author=. 2024 , eprint=

  59. [59]

    2025 , eprint=

    Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges , author=. 2025 , eprint=

  60. [60]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Du, Cunxiao and Jiang, Jing and Yuanchen, Xu and Wu, Jiawei and Yu, Sicheng and Li, Yongqi and Li, Shenggui and Xu, Kai and Nie, Liqiang and Tu, Zhaopeng and You, Yang , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  61. [61]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  62. [62]

    2023 , eprint=

    LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

  63. [63]

    2001 , howpublished =

    Hubert, Bert , title =. 2001 , howpublished =

  64. [64]

    Measuring energy consumption for short code paths using RAPL , year =

    H\". Measuring energy consumption for short code paths using RAPL , year =. SIGMETRICS Perform. Eval. Rev. , month = jan, pages =. doi:10.1145/2425248.2425252 , abstract =

  65. [65]

    and Malik, S

    Tiwari, V. and Malik, S. and Wolfe, A. , journal=. Power analysis of embedded software: a first step towards software power minimization , year=

  66. [66]

    2025 , organization =

  67. [67]

    US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning , year=

    Gao, Yunqi and Hu, Bing and Mashhadi, Mahdi Boloursaz and Jin, A-Long and Xiao, Pei and Wu, Chunming , journal=. US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning , year=

  68. [68]

    2025 , eprint=

    FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training , author=. 2025 , eprint=

  69. [69]

    2025 , eprint=

    SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths , author=. 2025 , eprint=

  70. [70]

    Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation

    Zhang, Ziyin and Xu, Jiahao and Liang, Tian and Chen, Xingyu and He, Zhiwei and Wang, Rui and Tu, Zhaopeng. Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.844