Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
read the original abstract
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategy. Code is available at https://github.com/nanomaoli/llm_reproducibility.
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
-
FloatDoor: Platform-Triggered Backdoors in LLMs
FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.
-
Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL
HEAL restores FP32-level output reproducibility in 16-bit LLM inference using targeted INT16 quantization and algebraic compensation, cutting overhead by up to 7.1x versus full FP32 on the new MCR-Bench.
-
MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference
MarginGate triggers verification only on low-margin decode steps to achieve 100% deterministic batch inference at 15-50% of the cost of always-on verification across tested models and datasets.
-
From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems
Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evalu...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.