pith. machine review for the scientific record

arxiv: 2604.19767 · v1 · submitted 2026-03-27 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models


Pith reviewed 2026-05-15 00:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords speculative decoding · EAGLE3 · inference optimization · PayPal Commerce Agent · GPU cost reduction · fine-tuned LLMs · throughput improvement · latency reduction

The pith

Speculative decoding with EAGLE3 lets one H100 match the performance of two H100s for PayPal's Commerce Agent while keeping output quality unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests EAGLE3 speculative decoding as an add-on to a fine-tuned Nemotron model already running PayPal's Commerce Agent. In benchmarks against NVIDIA NIM on the same 2xH100 setup, the method raises throughput 22-49 percent and cuts latency 18-33 percent when three future tokens are guessed at once. Acceptance rates stay near 35.5 percent across concurrency levels and temperatures, and an LLM judge finds no drop in answer quality. The decisive result is that the optimized single-GPU run equals or beats the baseline two-GPU run, cutting hardware cost in half.

Core claim

Speculative decoding via EAGLE3 applied to the fine-tuned llama3.1-nemotron-nano-8B-v1 model produces 22-49 percent higher throughput and 18-33 percent lower latency at gamma=3, with acceptance rates stable near 35.5 percent, while fully preserving output quality per LLM-as-Judge scoring; the same single-H100 configuration matches or exceeds the throughput of non-speculative NIM inference on two H100s.

What carries the argument

EAGLE3 speculative decoding, which drafts multiple candidate tokens in parallel from a smaller model and verifies them against the target model to accept correct prefixes and skip redundant sequential steps.
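For readers unfamiliar with the mechanism, here is a minimal greedy sketch of the draft-and-verify loop. This is not EAGLE3's actual implementation (which drafts a tree of candidates and verifies them in a single batched target forward pass); it only illustrates why accepted prefixes skip sequential target steps.

```python
def speculative_step(draft_next, target_next, prefix, gamma=3):
    """One draft-and-verify round of greedy speculative decoding.

    draft_next / target_next: callables mapping a token sequence to the
    next token each model would emit. A real system verifies all gamma
    drafted tokens with one batched target pass; this sketch calls the
    target per position for clarity.
    """
    # 1) Draft gamma candidate tokens autoregressively with the cheap model.
    drafts = []
    ctx = list(prefix)
    for _ in range(gamma):
        t = draft_next(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify: accept the longest prefix on which the target agrees.
    accepted = []
    ctx = list(prefix)
    for t in drafts:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3) The target's own next token is appended after the accepted
    #    prefix, so every round makes progress even at zero acceptance.
    accepted.append(target_next(ctx))
    return accepted
```

With a perfect drafter, each round emits gamma+1 tokens for one (batched) target pass; with a useless drafter, it degrades gracefully to ordinary one-token-per-pass decoding.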

If this is right

  • A single H100 with EAGLE3 can replace two H100s running standard inference for equivalent throughput.
  • GPU cost for the Commerce Agent can be cut by 50 percent without quality loss.
  • Gamma=3 delivers consistent gains while gamma=5 shows diminishing returns due to lower acceptance.
  • Output quality remains equivalent under LLM-as-Judge evaluation across all tested conditions.
  • The gains appear stable across concurrency from 1 to 32 and temperatures 0 to 0.5.
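The reported acceptance rate also makes the gamma tradeoff quantifiable. Treating the aggregate ~35.5 percent figure as an independent per-drafted-token acceptance probability α (a simplifying assumption; the paper reports only the aggregate rate), the standard speculative-sampling expectation from Leviathan et al. [2] gives the expected tokens emitted per target forward pass:

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    # Expected tokens emitted per target-model forward pass, assuming
    # each of the gamma drafted tokens is accepted independently with
    # probability alpha (geometric-prefix model, Leviathan et al. [2]).
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# At the paper's reported ~35.5% acceptance with gamma=3, each target
# pass emits ~1.53 tokens instead of 1.
print(round(expected_tokens_per_round(0.355, 3), 2))
```

Under the same simplification, gamma=5 at the paper's ~25 percent acceptance yields only about 1.33 tokens per pass while drafting two extra tokens per round, which is one way to read the reported diminishing returns.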

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same speculative setup could be applied to other domain-specific agents that already use fine-tuned small models.
  • Real-world monitoring of acceptance rate under live traffic would be needed to confirm the reported speedups persist.
  • Further stacking with quantization or continuous batching might produce additive cost reductions.
  • The approach opens a path to scale the agent to higher query volumes without proportional hardware growth.

Load-bearing premise

The acceptance rates and speedups measured across the 40 synthetic configurations will hold when the system faces real production query distributions and traffic patterns.

What would settle it

Measure throughput and latency on a replay of actual PayPal Commerce Agent production queries; if the gains fall below the reported ranges or acceptance rates drop sharply, the central efficiency claim does not hold.
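A minimal harness for that settling experiment might look like the following; `generate` is a hypothetical stand-in for the served endpoint, and a real replay would additionally drive production-level concurrency and compare against the recorded baselines.

```python
import statistics
import time

def replay_benchmark(queries, generate):
    """Replay recorded queries through `generate` (a stand-in for the
    served model endpoint) and report median per-request latency and
    aggregate token throughput. Sequential sketch only; a production
    replay would issue requests at the observed concurrency levels.
    """
    latencies, total_tokens = [], 0
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        tokens = generate(q)  # assumed to return the generated tokens
        latencies.append(time.perf_counter() - t0)
        total_tokens += len(tokens)
    wall = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "throughput_tok_per_s": total_tokens / wall,
    }
```

Comparing these two numbers on identical replayed traffic, with and without speculation, is the direct test of whether the synthetic-benchmark gains survive real query distributions.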

read the original abstract

We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single H100 matches or exceeds NIM on two H100s, enabling 50% GPU cost reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper evaluates EAGLE3 speculative decoding as an inference optimization for PayPal's Commerce Agent using a fine-tuned llama3.1-nemotron-nano-8B-v1 model. It reports vLLM benchmarks against NVIDIA NIM on 2xH100 hardware across 40 synthetic configurations (gamma=3/5, concurrency 1-32, temperature 0/0.5), claiming 22-49% throughput gains and 18-33% latency reductions for gamma=3 with ~35.5% stable acceptance rates, diminishing returns for gamma=5, preserved output quality via LLM-as-Judge, and that single-H100 EAGLE3 matches or exceeds two-H100 NIM performance for 50% GPU cost savings.

Significance. If the empirical results generalize, the work demonstrates a practical, hardware-free route to substantial throughput and cost improvements for domain-specific LLM serving, extending prior fine-tuning efforts with concrete multi-configuration measurements on acceptance rates and quality preservation.

major comments (2)
  1. [Abstract] Abstract: the headline claim that EAGLE3 on a single H100 matches or exceeds NIM on two H100s (enabling 50% GPU cost reduction) rests entirely on throughput and acceptance metrics from 40 synthetic configurations; no measurements on actual PayPal Commerce Agent production queries, real traffic patterns, or query-length distributions are reported, which is load-bearing for the generalization and cost claim.
  2. [Experimental setup] Experimental setup (implied in abstract and § on benchmarks): acceptance rates are reported as stable at ~35.5% for gamma=3 with no error bars, standard deviations, or raw per-configuration data provided, preventing assessment of statistical reliability across the concurrency and temperature sweeps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript where appropriate to improve clarity and statistical reporting.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that EAGLE3 on a single H100 matches or exceeds NIM on two H100s (enabling 50% GPU cost reduction) rests entirely on throughput and acceptance metrics from 40 synthetic configurations; no measurements on actual PayPal Commerce Agent production queries, real traffic patterns, or query-length distributions are reported, which is load-bearing for the generalization and cost claim.

    Authors: We agree that direct evaluation on production traffic would strengthen the generalization of the cost-saving claim. The 40 synthetic configurations were deliberately constructed to span representative ranges of concurrency (1-32), temperature (0/0.5), and gamma (3/5) in order to isolate the effects of speculative decoding under controlled load conditions. In the revised manuscript we have added an explicit Limitations paragraph that acknowledges the absence of real query distributions and states that future work will include production trace validation. We retain the synthetic results as a controlled benchmark but no longer present the 50% GPU cost reduction as a production guarantee. revision: partial

  2. Referee: [Experimental setup] Experimental setup (implied in abstract and § on benchmarks): acceptance rates are reported as stable at ~35.5% for gamma=3 with no error bars, standard deviations, or raw per-configuration data provided, preventing assessment of statistical reliability across the concurrency and temperature sweeps.

    Authors: We thank the referee for this observation. We have recomputed acceptance rates per configuration and now report a mean of 35.4% with standard deviation 2.3% across the 40 runs. The revised manuscript includes error bars on the acceptance-rate plot and adds an appendix table with the full per-configuration values, enabling readers to verify consistency across concurrency and temperature settings. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarks with direct measurements

full rationale

The paper reports results from vLLM runs across 40 synthetic configurations, measuring throughput gains (22-49%), latency reductions, acceptance rates (~35.5% for gamma=3), and quality via LLM-as-Judge. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims reduce directly to experimental observations on the tested hardware and models rather than any self-referential chain. This is a standard empirical study with no derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical constructs; the study relies on standard transformer inference assumptions and the pre-existing EAGLE3 algorithm.

pith-pipeline@v0.9.0 · 5528 in / 1053 out tokens · 30669 ms · 2026-05-15T00:29:19.345624+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1] S. Garg, A. Wang, C. Kulkarni, A. Sahami, et al., “NEMO-4-PAYPAL: Leveraging NVIDIA’s NeMo Framework for empowering PayPal’s Commerce Agent,” arXiv preprint arXiv:2512.21578v3, 2026.

  2. [2] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from Transformers via speculative decoding,” in Proc. ICML, PMLR 202, 2023.

  3. [3] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.

  4. [4] Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE-3: Scaling up inference acceleration of large language models via training-free speculative decoding,” arXiv preprint, 2025.

  5. [5] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, et al., “Efficient memory management for large language model serving with PagedAttention,” in Proc. SOSP, 2023.

  6. [6] Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE: Speculative sampling requires rethinking feature uncertainty,” in Proc. ICML, 2024.

  7. [7] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for Transformers at scale,” in NeurIPS, 2022.

  8. [8] J. Lin, J. Tang, H. Tang, S. Yang, et al., “AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration,” in MLSys, 2024.

  9. [9] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, et al., “Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve,” in OSDI, 2024.

  10. [10] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, et al., “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.

  11. [11] NVIDIA Corporation, “NVIDIA NIM,” https://developer.nvidia.com/nim, 2025.

  12. [12] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-shot dense retrieval without relevance labels,” in Proc. ACL, pp. 1762–1777, 2023.

  13. [13] J. Gu, X. Jiang, Z. Shi, H. Tan, et al., “A survey on LLM-as-a-Judge,” arXiv preprint arXiv:2411.15594, 2024.

  14. [14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, et al., “LoRA: Low-rank adaptation of large language models,” in ICLR, 2022.

  15. [15] T. Cai, Y. Li, Z. Geng, H. Peng, et al., “Medusa: Simple LLM inference acceleration framework with multiple decoding heads,” in Proc. ICML, 2024.