
arxiv: 2511.17826 · v2 · pith:EDMAOBFXnew · submitted 2025-11-21 · 💻 cs.LG · cs.CL · stat.ML

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

classification 💻 cs.LG · cs.CL · stat.ML
keywords across · different · inference · parallel · bit-wise · deterministic · identical · kernels

Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and reinforcement learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize inference throughput, creating a natural mismatch between the two. This precision mismatch can lead to suboptimal performance or even collapse in RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. We also achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategies. Code is available at https://github.com/nanomaoli/llm_reproducibility.
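
The reduction-order argument in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' TBIK implementation or their Triton kernels: it uses plain Python floats (float64) rather than GPU low-precision types, and the helper names `sequential_sum`, `tree_sum`, and `sharded_sum` are hypothetical. It assumes the input length and the simulated TP size are both powers of two, so that each contiguous shard is exactly one subtree of a single global binary reduction tree.

```python
import random


def sequential_sum(xs):
    """Left-to-right accumulation; the grouping depends on how xs is split."""
    total = 0.0
    for x in xs:
        total += x
    return total


def tree_sum(xs):
    """Pairwise (binary-tree) sum of adjacent elements.

    Assumes len(xs) is a power of two (illustrative simplification).
    """
    xs = list(xs)
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]


def sharded_sum(xs, tp, local_reduce, cross_reduce):
    """Simulate TP: each of `tp` ranks reduces a contiguous shard,
    then the per-rank partial sums are reduced across ranks."""
    size = len(xs) // tp
    partials = [local_reduce(xs[r * size:(r + 1) * size]) for r in range(tp)]
    return cross_reduce(partials)


# Floating-point addition is not associative.
non_associative = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Values spanning many magnitudes make rounding differences easy to trigger.
random.seed(0)
values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
          for _ in range(1024)]

tp_sizes = (1, 2, 4, 8)
# Same tree inside each rank and across ranks: one global reduction order.
tree_results = {tp: sharded_sum(values, tp, tree_sum, tree_sum)
                for tp in tp_sizes}
# Sequential accumulation regroups the additions whenever tp changes,
# so these results may differ in their last bits.
seq_results = {tp: sharded_sum(values, tp, sequential_sum, sequential_sum)
               for tp in tp_sizes}

print("addition is non-associative:", non_associative)
print("distinct tree results:", len(set(tree_results.values())))
print("distinct sequential results:", len(set(seq_results.values())))
```

Because every shard boundary coincides with a node of the same binary tree, the tree-based variant performs the identical sequence of additions for every simulated TP size, which is why its results agree bit for bit. The sequential variant gives no such guarantee, since changing the shard count changes how the additions are grouped.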


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FloatDoor: Platform-Triggered Backdoors in LLMs

    cs.CR 2026-06 unverdicted novelty 7.0

    FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.

  2. Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL

    cs.LG 2026-06 unverdicted novelty 6.0

    HEAL restores FP32-level output reproducibility in 16-bit LLM inference using targeted INT16 quantization and algebraic compensation, cutting overhead by up to 7.1x versus full FP32 on the new MCR-Bench.

  3. MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    MarginGate triggers verification only on low-margin decode steps to achieve 100% deterministic batch inference at 15-50% of the cost of always-on verification across tested models and datasets.

  4. From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Financial AI systems using tabular models, graph networks, and LLM agents exhibit nondeterminism that undermines reproducibility, quantified via experiments on public datasets and addressed by a proposed layered evalu...