LLM Zeroth-Order Fine-Tuning is an Inference Workload

Caiwen Ding; Zelin Li

arxiv: 2605.28760 · v1 · pith:JFMJDHA5new · submitted 2026-05-27 · 💻 cs.LG

Pith reviewed 2026-06-29 14:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords zeroth-order optimization · fine-tuning · large language models · inference serving · speedup · LoRA · adapter states

The pith

Zeroth-order fine-tuning of large language models is an inference workload that a serving runtime accelerates by up to 8.13 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that zeroth-order fine-tuning replaces backpropagation with repeated forward objective evaluations under nearby parameter states. This structure makes the dominant cost resemble inference scoring rather than a conventional training loop. By executing the repeated scoring phase through a serving runtime instead of a standard training framework, the same optimization completes substantially faster. On OPT-13B with SST-2 the serving path finishes a 20,000-step LoZO run in 0.51 hours versus 4.15 hours for the baseline under matched LoRA settings, while reaching 0.922 evaluation accuracy. The same reorganization produces 2.34x to 7.72x speedups across model sizes from 1.3B to 13B parameters and extends to a factorized high-rank variant that tracks a MeZO-like loss trajectory up to 2.55 times faster.
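To make "repeated forward objective evaluations under nearby parameter states" concrete, here is a minimal sketch of a standard two-point zeroth-order step of the kind popularized by MeZO. It is not the paper's implementation: the toy quadratic loss, the function names, and the hyperparameters are illustrative assumptions, and a real run would score an LLM objective instead.

```python
import numpy as np

def zo_step(theta, loss_fn, lr, eps, seed):
    """One two-point zeroth-order update that uses only forward evaluations."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)        # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)        # forward evaluation at theta + eps*z
    loss_minus = loss_fn(theta - eps * z)       # forward evaluation at theta - eps*z
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    return theta - lr * projected_grad * z      # no backpropagation anywhere

def quadratic_loss(theta):
    # Hypothetical stand-in for an LLM objective; minimum at theta == 1.
    return float(np.sum((theta - 1.0) ** 2))

theta = np.zeros(8)
initial_loss = quadratic_loss(theta)
for step in range(500):
    theta = zo_step(theta, quadratic_loss, lr=0.01, eps=1e-3, seed=step)
final_loss = quadratic_loss(theta)
print(f"loss went from {initial_loss:.3f} to {final_loss:.3e}")
```

Each step costs two calls to `loss_fn` and nothing else, which is why the work looks like scoring rather than training.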

Core claim

Zeroth-order fine-tuning is an inference-dominated workload whose repeated forward evaluations can be executed through a serving runtime rather than a training loop. This reorganization yields 8.13 times faster completion of the 20k-step LoZO process on OPT-13B SST-2 under matched LoRA settings while preserving 0.922 final evaluation accuracy and 0.931 full-validation accuracy, with 2.34x to 7.72x speedups observed in core-step scaling and up to 2.55x faster execution for a MeZO-style factorized variant.
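As a quick arithmetic check on the headline number, the snippet below recomputes the speedup from the rounded hour figures above and from the rounded minute figures quoted in the Figure 1 caption. Only the variable names are introduced here; the four timings are taken from the text.

```python
# Timings as quoted in the text (rounded values).
serving_hours, baseline_hours = 0.51, 4.15
serving_minutes, baseline_minutes = 30.7, 249.2

ratio_from_hours = baseline_hours / serving_hours
ratio_from_minutes = baseline_minutes / serving_minutes

print(f"{ratio_from_hours:.2f}x from rounded hours")      # 8.14x
print(f"{ratio_from_minutes:.2f}x from rounded minutes")  # 8.12x
```

The reported 8.13x sits between the two results, which is what one would expect if it was computed from unrounded timings; that origin is an inference, not something the text states.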

What carries the argument

Reorganization of repeated forward evaluations under nearby parameter states into a serving runtime that represents ZO updates as dynamic adapter states.
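One way to picture "ZO updates as dynamic adapter states" is a frozen base weight shared by several low-rank adapter states that are all scored against the same inputs in one batched pass. The NumPy sketch below is a toy model under stated assumptions: it does not use vLLM's API, the mean-squared-error objective and every name in it are hypothetical, and the symmetric plus/minus pair is borrowed from the usual two-point estimator rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_tokens = 16, 12, 4, 5

W = rng.standard_normal((d_out, d_in))          # frozen base weight, shared by all states
A = rng.standard_normal((rank, d_in)) * 0.1     # adapter factors: effective weight is W + B @ A
B = rng.standard_normal((d_out, rank)) * 0.1
x = rng.standard_normal((n_tokens, d_in))       # toy inputs
target = rng.standard_normal((n_tokens, d_out)) # toy targets

def make_nearby_states(A, B, eps, seed):
    """Return two adapter states perturbed symmetrically around (A, B)."""
    r = np.random.default_rng(seed)
    z_a, z_b = r.standard_normal(A.shape), r.standard_normal(B.shape)
    return [(A + eps * z_a, B + eps * z_b), (A - eps * z_a, B - eps * z_b)]

def score_one(A_s, B_s):
    """Score a single adapter state with its own forward pass."""
    y = x @ W.T + (x @ A_s.T) @ B_s.T
    return float(np.mean((y - target) ** 2))

def score_batch(states):
    """Score all adapter states at once, computing the base projection only once."""
    A_stack = np.stack([s[0] for s in states])            # (S, rank, d_in)
    B_stack = np.stack([s[1] for s in states])            # (S, d_out, rank)
    base = x @ W.T                                        # (n_tokens, d_out)
    low = np.einsum("nd,srd->snr", x, A_stack)            # per-state x @ A_s.T
    delta = np.einsum("snr,sor->sno", low, B_stack)       # per-state low @ B_s.T
    y = base[None] + delta
    return np.mean((y - target[None]) ** 2, axis=(1, 2))

states = make_nearby_states(A, B, eps=1e-3, seed=1)
batched_scores = score_batch(states)
looped_scores = np.array([score_one(a, b) for a, b in states])
```

The batched and looped scores agree, so the only thing that changes is how the evaluations are scheduled, which is the kind of reorganization a serving runtime can exploit.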

If this is right

The same runtime reorganization produces 2.34x to 7.72x speedups when scaling core steps across OPT-1.3B to OPT-13B.
A MeZO-style high-rank factorized experiment maintains a comparable loss trajectory at up to 2.55 times faster execution.
Lightweight adaptation can be scheduled as an inference-like workload rather than a separate training job.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing inference optimizations such as continuous batching could be applied directly to ZO fine-tuning without new algorithm changes.
Serving engines could incorporate parameter perturbation as a native operation to support on-the-fly model adaptation.
The boundary between training and inference phases for LLMs could narrow if adaptation is expressed entirely through forward-pass workloads.

Load-bearing premise

Reorganizing the repeated forward evaluations into a serving runtime preserves the exact optimization trajectory and numerical behavior without introducing overheads that would change final accuracy or convergence.

What would settle it

A side-by-side run on the same model, task, random seed, and step count that shows the serving runtime produces a statistically significant drop in final validation accuracy relative to the training-loop baseline would falsify the claim of preserved performance.
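A finer-grained companion to the final-accuracy test is to compare per-step losses from the two paths. The sketch below is one possible check, not a procedure from the paper: the function name, the tolerances, and the example loss values are all illustrative placeholders.

```python
import math

def compare_trajectories(losses_a, losses_b, rel_tol=1e-3, abs_tol=1e-6):
    """Report the largest per-step gap and the first step where two runs diverge."""
    if len(losses_a) != len(losses_b):
        raise ValueError("trajectories must cover the same number of steps")
    max_gap = 0.0
    first_divergence = None
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        max_gap = max(max_gap, abs(a - b))
        if first_divergence is None and not math.isclose(
            a, b, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            first_divergence = step
    return {"max_abs_gap": max_gap, "first_divergence": first_divergence}

# Placeholder values for illustration only; these are not results from the paper.
baseline = [0.70, 0.65, 0.61, 0.58]
serving_close = [0.70, 0.6501, 0.61, 0.58]
serving_drift = [0.70, 0.65, 0.64, 0.60]

close_report = compare_trajectories(baseline, serving_close)
drift_report = compare_trajectories(baseline, serving_drift)
```

A `first_divergence` of `None` means every step agreed within tolerance; an integer pinpoints where the two execution paths first part ways, which is more informative than a matching endpoint.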

Figures

Figures reproduced from arXiv: 2605.28760 by Caiwen Ding, Zelin Li.

**Figure 1.** OPT-13B SST-2 evaluation versus training time. The vLLM ν = 50 run is shown alongside the official LoZO full and LoRA-only baselines; it finishes in 30.7 minutes versus 249.2 minutes for the official LoZO LoRA-only baseline, an 8.13× matched-setting speedup.

**Figure 2.** OPT-13B SST-2 evaluation versus optimization step for the completed Phase 4 runs.

Original abstract

Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x–7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that running ZO fine-tuning through vLLM can cut wall-clock time by 8x on OPT-13B while matching final accuracy, but the evidence does not yet confirm the optimization path stayed identical.

The central observation is that zeroth-order fine-tuning consists mostly of repeated forward scoring under small perturbations, so routing those calls through an inference serving system like vLLM instead of a standard training loop produces large speedups. On OPT-13B SST-2 the vLLM path finishes the 20k-step LoZO run in roughly 0.51 estimated hours versus 4.15 for the baseline under the LoRA-only setting, with final accuracies of 0.922 and 0.931. The same reorganization yields 2.34x–7.72x gains across smaller OPT models and still tracks a MeZO-style loss curve in the factorized case.

What the work does cleanly is recognize the workload mismatch and demonstrate that existing serving infrastructure can absorb the structured scoring without new optimizer code. The scaling numbers and the MeZO-style check give the claim some breadth beyond a single run.

The soft spot is the missing verification that the vLLM path executes exactly the same perturbation sequence and loss evaluations as the official baseline. The abstract supplies only final accuracies and estimated hours; it does not report loss curves, random-seed matching for perturbations, or per-step objective values. If continuous batching or adapter-state handling alters the effective distribution of perturbations or introduces numerical drift, the accuracy match could be coincidental and part of the reported speedup could reflect a different optimization trajectory. The “estimated” training hours also leave open questions about measurement consistency.
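Random-seed matching for perturbations can be illustrated with a small sketch in which each step's perturbation is regenerated from a `(run_seed, step)` pair, so two independent execution paths draw the same noise without sharing state. This is an assumed design for illustration, not a description of either the official baseline or the vLLM path, and the function name is hypothetical.

```python
import numpy as np

def perturbation(run_seed, step, shape):
    """Regenerate the perturbation for a step from (run_seed, step) alone."""
    rng = np.random.default_rng([run_seed, step])
    return rng.standard_normal(shape)

# Two "paths" that never exchange generator state still draw identical noise.
z_training_loop = perturbation(run_seed=0, step=17, shape=(4, 3))
z_serving_path = perturbation(run_seed=0, step=17, shape=(4, 3))
z_other_step = perturbation(run_seed=0, step=18, shape=(4, 3))

same_draw = np.array_equal(z_training_loop, z_serving_path)
different_step = not np.array_equal(z_training_loop, z_other_step)
```

The guarantee only holds when both paths use the same generator implementation; two different frameworks or devices seeded with the same integer need not produce the same stream, which is one reason explicit per-step comparisons matter.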

This is useful for engineers who already run inference servers and want to schedule lightweight adaptation inside the same runtime rather than as a separate training job. It is worth sending to peer review because the empirical claim is concrete and falsifiable once the implementation details are checked, even though the current write-up leaves the trajectory-equivalence question open.

Referee Report

1 major / 2 minor

Summary. The paper claims that zeroth-order (ZO) fine-tuning of LLMs is dominated by repeated forward inference-style evaluations under perturbed parameters rather than backpropagation, creating a workload mismatch with conventional training loops. It proposes reorganizing the ZO process (specifically LoZO and MeZO-style variants) to run through an inference serving runtime such as vLLM, reporting an 8.13x wall-clock speedup on OPT-13B SST-2 (0.51 vs. 4.15 estimated hours for 20k steps under LoRA-only) while reaching final accuracies of 0.922 (eval) and 0.931 (full validation), with 2.34x–7.72x speedups in core-step scaling from OPT-1.3B to OPT-13B and up to 2.55x in a high-rank factorized setting.

Significance. If the speedup is achieved while preserving the exact ZO optimization trajectory, the work offers a practical systems insight that could make ZO methods more viable for large models by treating adaptation as a schedulable inference workload rather than a separate training job. The direct empirical comparison against the official LoZO baseline, the scaling results across model sizes, and the MeZO-style loss-trajectory experiment are concrete strengths; the dynamic-adapter-state framing also points to a broader direction for inference-time training.

major comments (1)

[OPT-13B SST-2 experiment description and core-step scaling results] The central empirical claim (8.13x speedup on OPT-13B SST-2 with matched final accuracies) is load-bearing on the assumption that the vLLM execution path performs exactly the same sequence of perturbed forward evaluations, loss computations, and perturbation draws as the baseline training loop. The abstract and results report only final accuracies and estimated hours; no loss curves, per-step objective values, or random-seed comparisons for perturbations are provided to confirm trajectory equivalence. Without this, differences in continuous batching, adapter-state handling, or numerical execution could produce a different optimization path whose final accuracy match is coincidental.

minor comments (2)

[Abstract and experimental setup] The training times are reported as 'estimated'; the methods or experimental section should explicitly state the estimation procedure, hardware configuration, batch-size matching, and any caching or warm-up assumptions used for both the vLLM and baseline runs.
[Results on OPT-13B SST-2 and scaling experiments] Reported accuracies and speedups lack error bars, standard deviations, or results from multiple random seeds, which is important for stochastic ZO methods where perturbation sampling can affect convergence.
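The second minor comment asks for spread across seeds. A minimal sketch of such a report is below; the helper name and the three input values are placeholders chosen for illustration and are not accuracies from the paper.

```python
import statistics

def summarize(values):
    """Return the mean and sample standard deviation of per-seed results."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std

# Placeholder per-seed values for illustration only; not results from the paper.
per_seed_accuracy = [0.5, 0.6, 0.7]
mean_acc, std_acc = summarize(per_seed_accuracy)
report = f"{mean_acc:.3f} ± {std_acc:.3f} (n={len(per_seed_accuracy)})"
print(report)
```

Reporting speedups the same way, one ratio per seed and then mean and spread, would show whether the gain is stable across perturbation draws.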

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for highlighting the potential systems insight of the work. We address the major comment on trajectory equivalence below.

Referee: [OPT-13B SST-2 experiment description and core-step scaling results] The central empirical claim (8.13x speedup on OPT-13B SST-2 with matched final accuracies) is load-bearing on the assumption that the vLLM execution path performs exactly the same sequence of perturbed forward evaluations, loss computations, and perturbation draws as the baseline training loop. The abstract and results report only final accuracies and estimated hours; no loss curves, per-step objective values, or random-seed comparisons for perturbations are provided to confirm trajectory equivalence. Without this, differences in continuous batching, adapter-state handling, or numerical execution could produce a different optimization path whose final accuracy match is coincidental.

Authors: We agree that reporting only final accuracies leaves open the possibility of coincidental matches and that explicit trajectory verification would strengthen the central claim. The vLLM reorganization is constructed to issue exactly the same sequence of perturbed forward passes, loss evaluations, and random perturbation draws as the baseline LoZO loop (the only change is the execution engine for the forward scoring phase). Nevertheless, to directly address the concern we will add, in the revised manuscript, (i) training loss curves for both paths on the OPT-13B SST-2 run and (ii) per-step objective values for a representative window of steps under identical random seeds. These additions will allow readers to verify numerical equivalence of the optimization trajectories.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical runtime comparison stands on its own

The paper's central result is an empirical measurement: vLLM-based execution of repeated ZO forward passes yields measured wall-clock speedups (8.13x on OPT-13B SST-2) and matched final accuracies versus an explicit baseline implementation. No equations, fitted parameters, or self-citations are invoked to derive the speedup; the claim is a direct timing comparison under a matched LoRA-only setting. The text contains no self-definitional loops, no 'prediction' that reduces to a fitted input, and no load-bearing uniqueness theorems. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical systems demonstration and introduces no new mathematical axioms, free parameters, or postulated entities; the reported speedups rest on the measured runtime behavior of vLLM versus a training loop.

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages

[1] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-Tuning Language Models with Just Forward Passes. arXiv preprint arXiv:2305.17333, 2023.

[2] Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen. Enhancing Zeroth-Order Fine-Tuning for Language Models with Low-Rank Structures. In International Conference on Learning Representations, 2025.

[3] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[4] Lei Gao, Amir Ziashahabi, Yue Niu, Salman Avestimehr, and Murali Annavaram. MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.