Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models
Pith reviewed 2026-05-15 00:29 UTC · model grok-4.3
The pith
Speculative decoding with EAGLE3 lets one H100 match the performance of two H100s for PayPal's Commerce Agent while keeping output quality unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Speculative decoding via EAGLE3 applied to the fine-tuned llama3.1-nemotron-nano-8B-v1 model produces 22-49 percent higher throughput and 18-33 percent lower latency at gamma=3, with acceptance rates stable near 35.5 percent, while fully preserving output quality per LLM-as-Judge scoring; the same single-H100 configuration matches or exceeds the throughput of non-speculative NIM inference on two H100s.
What carries the argument
EAGLE3 speculative decoding: a lightweight, trained draft head proposes several candidate tokens ahead of the target model, and a single target forward pass verifies them, accepting the longest correct prefix and skipping the corresponding sequential decoding steps.
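To make the mechanism concrete, here is a minimal greedy sketch of one draft-then-verify step. It is illustrative only: `draft_next` and `target_logits` are hypothetical stand-ins for a cheap draft model and the target model, and greedy prefix matching replaces the lossless stochastic acceptance rule of real speculative sampling (and EAGLE3's tree-structured variant).

```python
# Minimal greedy sketch of one speculative-decoding step. Illustrative only:
# `draft_next` and `target_logits` are hypothetical stand-ins, and greedy
# prefix matching replaces the lossless stochastic acceptance rule used by
# real speculative sampling (and EAGLE3's tree-structured variant).
import numpy as np

def speculative_step(prefix, draft_next, target_logits, gamma=3):
    """Draft `gamma` tokens cheaply, verify them with one target forward
    pass, and return the tokens emitted this step."""
    # 1. Draft gamma candidate tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft_next(ctx)      # cheap sequential call
        draft.append(tok)
        ctx.append(tok)

    # 2. A single target forward pass scores every drafted position at once;
    #    logits[i] is the target's distribution for the token at position i+1.
    logits = target_logits(list(prefix) + draft)

    # 3. Greedy verification: keep draft tokens while they match the target's
    #    argmax; on the first mismatch, emit the target's token and stop.
    emitted = []
    for i, tok in enumerate(draft):
        target_tok = int(np.argmax(logits[len(prefix) + i - 1]))
        if tok != target_tok:
            emitted.append(target_tok)
            return emitted
        emitted.append(tok)

    # 4. All drafts accepted: the same pass yields one extra "bonus" token.
    emitted.append(int(np.argmax(logits[len(prefix) + gamma - 1])))
    return emitted
```

The economics are visible even in this sketch: each iteration costs one target forward pass regardless of how many draft tokens are accepted, so every accepted token beyond the first is a sequential decoding step saved.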
If this is right
- A single H100 with EAGLE3 can replace two H100s running standard inference for equivalent throughput.
- GPU cost for the Commerce Agent can be cut by 50 percent without quality loss.
- Gamma=3 delivers consistent gains while gamma=5 shows diminishing returns due to lower acceptance (a rough expected-tokens estimate after this list shows why).
- Output quality remains equivalent under LLM-as-Judge evaluation across all tested conditions.
- The gains appear stable across concurrency from 1 to 32 and temperatures 0 to 0.5.
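A back-of-envelope model shows why gamma=5 underperforms. Assuming an i.i.d. per-token acceptance probability alpha (a simplification; EAGLE3's tree-based acceptance is correlated), the expected number of tokens emitted per target forward pass with draft length gamma is:

```latex
% Expected tokens emitted per target forward pass, assuming an i.i.d.
% per-token acceptance probability \alpha and draft length \gamma
% (a simplification: EAGLE3's tree-based acceptance is correlated):
\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

Plugging in the reported rates: gamma=3 at alpha ≈ 0.355 gives (1 − 0.355^4)/(1 − 0.355) ≈ 1.53 tokens per step, while gamma=5 at alpha ≈ 0.25 gives (1 − 0.25^6)/(1 − 0.25) ≈ 1.33, so under this rough model the longer draft does strictly worse once acceptance drops.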
Where Pith is reading between the lines
- The same speculative setup could be applied to other domain-specific agents that already use fine-tuned small models.
- Real-world monitoring of acceptance rate under live traffic would be needed to confirm the reported speedups persist.
- Further stacking with quantization or continuous batching might produce additive cost reductions.
- The approach opens a path to scale the agent to higher query volumes without proportional hardware growth.
Load-bearing premise
The acceptance rates and speedups measured across the 40 synthetic configurations will hold when the system faces real production query distributions and traffic patterns.
What would settle it
Measure throughput and latency on a replay of actual PayPal Commerce Agent production queries; if the gains fall below the reported ranges or acceptance rates drop sharply, the central efficiency claim does not hold.
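A minimal replay harness along those lines might look like the sketch below, assuming a vLLM server exposing the standard OpenAI-compatible /v1/completions route; the endpoint URL, model id, and trace format are placeholders, not details from the paper.

```python
# Hypothetical replay harness: stream recorded production prompts to a vLLM
# server's OpenAI-compatible /v1/completions route and report throughput and
# latency. Endpoint URL, model id, and trace schema are placeholders.
import json
import statistics
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM server
MODEL = "llama3.1-nemotron-nano-8B-v1"             # placeholder model id

def replay(trace_path: str, max_tokens: int = 256) -> None:
    latencies, completion_tokens = [], 0
    start = time.perf_counter()
    with open(trace_path) as f:
        for line in f:                              # one JSON object per line
            prompt = json.loads(line)["prompt"]     # assumed trace schema
            t0 = time.perf_counter()
            resp = requests.post(ENDPOINT, json={
                "model": MODEL,
                "prompt": prompt,
                "max_tokens": max_tokens,
            }).json()
            latencies.append(time.perf_counter() - t0)
            completion_tokens += resp["usage"]["completion_tokens"]
    elapsed = time.perf_counter() - start
    print(f"throughput: {completion_tokens / elapsed:.1f} tok/s")
    print(f"p50 latency: {statistics.median(latencies) * 1000:.0f} ms")
```

This measures the concurrency-1 case only; reproducing the paper's 1-32 sweep would require issuing requests concurrently (e.g., with an async client) and, ideally, logging acceptance rates server-side as well.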
Original abstract
We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single H100 matches or exceeds NIM on two H100s, enabling 50% GPU cost reduction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates EAGLE3 speculative decoding as an inference optimization for PayPal's Commerce Agent using a fine-tuned llama3.1-nemotron-nano-8B-v1 model. It reports vLLM benchmarks against NVIDIA NIM on 2xH100 hardware across 40 synthetic configurations (gamma=3/5, concurrency 1-32, temperature 0/0.5), claiming 22-49% throughput gains and 18-33% latency reductions for gamma=3 with ~35.5% stable acceptance rates, diminishing returns for gamma=5, preserved output quality via LLM-as-Judge, and that single-H100 EAGLE3 matches or exceeds two-H100 NIM performance for 50% GPU cost savings.
Significance. If the empirical results generalize, the work demonstrates a practical, hardware-free route to substantial throughput and cost improvements for domain-specific LLM serving, extending prior fine-tuning efforts with concrete multi-configuration measurements on acceptance rates and quality preservation.
Major comments (2)
- [Abstract] The headline claim that EAGLE3 on a single H100 matches or exceeds NIM on two H100s (enabling 50% GPU cost reduction) rests entirely on throughput and acceptance metrics from 40 synthetic configurations; no measurements on actual PayPal Commerce Agent production queries, real traffic patterns, or query-length distributions are reported, which is load-bearing for the generalization and cost claim.
- [Experimental setup] Acceptance rates (implied in the abstract and the benchmarks section) are reported as stable at ~35.5% for gamma=3 with no error bars, standard deviations, or raw per-configuration data provided, preventing assessment of statistical reliability across the concurrency and temperature sweeps.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript where appropriate to improve clarity and statistical reporting.
Point-by-point responses
- Referee: [Abstract] The headline claim that EAGLE3 on a single H100 matches or exceeds NIM on two H100s (enabling 50% GPU cost reduction) rests entirely on throughput and acceptance metrics from 40 synthetic configurations; no measurements on actual PayPal Commerce Agent production queries, real traffic patterns, or query-length distributions are reported, which is load-bearing for the generalization and cost claim.
  Authors: We agree that direct evaluation on production traffic would strengthen the generalization of the cost-saving claim. The 40 synthetic configurations were deliberately constructed to span representative ranges of concurrency (1-32), temperature (0/0.5), and gamma (3/5) in order to isolate the effects of speculative decoding under controlled load conditions. In the revised manuscript we have added an explicit Limitations paragraph that acknowledges the absence of real query distributions and states that future work will include production trace validation. We retain the synthetic results as a controlled benchmark but no longer present the 50% GPU cost reduction as a production guarantee. Revision: partial.
- Referee: [Experimental setup] Acceptance rates (implied in the abstract and the benchmarks section) are reported as stable at ~35.5% for gamma=3 with no error bars, standard deviations, or raw per-configuration data provided, preventing assessment of statistical reliability across the concurrency and temperature sweeps.
  Authors: We thank the referee for this observation. We have recomputed acceptance rates per configuration and now report a mean of 35.4% with standard deviation 2.3% across the 40 runs. The revised manuscript includes error bars on the acceptance-rate plot and adds an appendix table with the full per-configuration values, enabling readers to verify consistency across concurrency and temperature settings. Revision: yes.
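For illustration, the statistics the rebuttal describes reduce to a few lines; the sketch below uses synthetic stand-in values, since the actual per-configuration rates appear only in the revised appendix.

```python
# Sketch of the reporting the rebuttal describes: mean, standard deviation,
# and an error-band plot over per-configuration acceptance rates. The values
# here are synthetic stand-ins, not the paper's data.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
rates = rng.normal(35.4, 2.3, size=40)   # stand-in for 40 measured rates (%)

mean, std = rates.mean(), rates.std(ddof=1)
print(f"mean = {mean:.1f}%  std = {std:.1f}%")

x = np.arange(len(rates))
plt.plot(x, rates, "o", label="per-configuration acceptance")
plt.axhline(mean, linestyle="--", label=f"mean {mean:.1f}%")
plt.fill_between(x, mean - std, mean + std, alpha=0.2, label="±1 std")
plt.xlabel("configuration index")
plt.ylabel("acceptance rate (%)")
plt.legend()
plt.savefig("acceptance_rates.png")
```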
Circularity Check
No circularity: purely empirical benchmarks with direct measurements
Full rationale
The paper reports results from vLLM runs across 40 synthetic configurations, measuring throughput gains (22-49%), latency reductions, acceptance rates (~35.5% for gamma=3), and quality via LLM-as-Judge. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims reduce directly to experimental observations on the tested hardware and models rather than any self-referential chain. This is a standard empirical study with no derivation chain to inspect for circularity.
Reference graph
Works this paper leans on
- [1] S. Garg, A. Wang, C. Kulkarni, A. Sahami, et al., "NEMO-4-PAYPAL: Leveraging NVIDIA's NeMo Framework for Empowering PayPal's Commerce Agent," arXiv preprint arXiv:2512.21578v3, 2026.
- [2] Y. Leviathan, M. Kalman, and Y. Matias, "Fast Inference from Transformers via Speculative Decoding," in Proc. ICML, PMLR 202, 2023.
- [3] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating Large Language Model Decoding with Speculative Sampling," arXiv preprint arXiv:2302.01318, 2023.
- [4] Y. Li, F. Wei, C. Zhang, and H. Zhang, "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test," arXiv preprint, 2025.
- [5] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," in Proc. SOSP, 2023.
- [6] Y. Li, F. Wei, C. Zhang, and H. Zhang, "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty," in Proc. ICML, 2024.
- [7] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale," in NeurIPS, 2022.
- [8] J. Lin, J. Tang, H. Tang, S. Yang, et al., "AWQ: Activation-aware Weight Quantization for On-device LLM Compression and Acceleration," in MLSys, 2024.
- [9] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, et al., "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," in OSDI, 2024.
- [10] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," arXiv preprint arXiv:1909.08053, 2019.
- [11]
- [12] L. Gao, X. Ma, J. Lin, and J. Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels," in Proc. ACL, pp. 1762-1777, 2023.
- [13] J. Gu, X. Jiang, Z. Shi, H. Tan, et al., "A Survey on LLM-as-a-Judge," arXiv preprint arXiv:2411.15594, 2024.
- [14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, et al., "LoRA: Low-Rank Adaptation of Large Language Models," in ICLR, 2022.
- [15] T. Cai, Y. Li, Z. Geng, H. Peng, et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," in Proc. ICML, 2024.