ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

Boxi Cao; Hongyu Lin; Jiasheng Zheng; Jiazhen Jiang; Le Sun; Pengbo Wang; Qiming Zhu; Xianpei Han; Xin Zheng; Yaojie Lu

arxiv: 2604.27467 · v1 · submitted 2026-04-30 · 💻 cs.SE · cs.CL

Jiasheng Zheng, Xin Zheng, Boxi Cao, Pengbo Wang, Zhengzhao Ma, Qiming Zhu, Jiazhen Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Pith reviewed 2026-05-07 09:24 UTC · model grok-4.3

keywords: code verification, large language models, code sandboxes, scalability, reinforcement learning, LLM training, evaluation benchmarks, parallel execution

The pith

ScaleBox automates special-judge generation and parallel multi-node execution to deliver accurate high-throughput code verification for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing code sandboxes for training and evaluating LLMs lose accuracy and slow down under high concurrency. ScaleBox addresses this with automated creation and management of special judges for each problem that needs one, fine-grained parallel execution of test cases across coordinated nodes, and a configuration-driven suite that keeps benchmarks reproducible. The paper reports gains in both verification accuracy and speed. When plugged into RLVR training loops, the system raises LiveCodeBench scores and stabilizes training relative to heuristic-matching baselines.

Core claim

ScaleBox is a high-fidelity scalable code sandbox that introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. These components together supply reliable verifiable feedback for both RL training and evaluation of large language models on code tasks, producing measurable improvements in accuracy, efficiency, and downstream performance.

What carries the argument

Automated special-judge generation combined with fine-grained parallel execution of test cases and multi-node coordination.
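To make the contrast concrete, here is a minimal sketch of how an exact-match check and a special judge can disagree on the same output. It is not ScaleBox's code: the function names `exact_match` and `float_judge` and the tolerance `1e-6` are illustrative assumptions, and only the floating-point-tolerance case is shown.

```python
import math


def exact_match(expected: str, actual: str) -> bool:
    """Heuristic baseline: compare whitespace-separated tokens verbatim."""
    return expected.split() == actual.split()


def float_judge(expected: str, actual: str, tol: float = 1e-6) -> bool:
    """Hypothetical special judge: numeric tokens may differ within `tol`."""
    exp_tokens, act_tokens = expected.split(), actual.split()
    if len(exp_tokens) != len(act_tokens):
        return False
    for e, a in zip(exp_tokens, act_tokens):
        try:
            if not math.isclose(float(e), float(a), rel_tol=tol, abs_tol=tol):
                return False
        except ValueError:
            # At least one token is not a number, so require an exact match.
            if e != a:
                return False
    return True


print(exact_match("0.5", "0.50"))  # False: the strings differ
print(float_judge("0.5", "0.50"))  # True: the values are equal
```

An exact-match reward would mark the second answer wrong even though it is numerically correct, which is the kind of false negative a special judge is meant to remove.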

If this is right

Verification accuracy and efficiency improve under high-concurrency workloads.
RLVR training reaches higher performance on LiveCodeBench.
Training stability increases relative to heuristic-matching baselines.
Reproducible benchmarking becomes easier through the configuration-driven suite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The infrastructure could support training runs on far larger volumes of complex coding problems while keeping verification reliable.
Similar automated-judge and parallel-execution patterns might transfer to scalable verification in mathematical reasoning or formal proof systems.
Teams building code LLMs could use the system as a drop-in component to test new reinforcement-learning feedback signals at scale.

Load-bearing premise

Automatically generated special judges will correctly verify code on diverse unseen problems without systematic false positives or negatives, and parallel execution across nodes will preserve exactly the same verification outcomes as sequential runs.
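The second half of this premise can be tested directly. The sketch below assumes a toy setting in which a solution is an in-process Python callable and thread workers stand in for the sandbox's workers and nodes; `run_case`, `verify_sequential`, and `verify_parallel` are hypothetical names, not ScaleBox APIs. A real sandbox would run untrusted code in isolated processes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]  # (input text, expected output text)


def run_case(solution: Callable[[str], str], case: TestCase) -> bool:
    """Run one test case; any exception counts as a failed case."""
    stdin, expected = case
    try:
        return solution(stdin).split() == expected.split()
    except Exception:
        return False


def verify_sequential(solution: Callable[[str], str],
                      cases: List[TestCase]) -> List[bool]:
    return [run_case(solution, c) for c in cases]


def verify_parallel(solution: Callable[[str], str],
                    cases: List[TestCase], workers: int = 4) -> List[bool]:
    # Executor.map yields results in input order, so verdicts align with cases.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: run_case(solution, c), cases))


def add_numbers(text: str) -> str:
    return str(sum(int(x) for x in text.split()))


cases = [("1 2", "3"), ("4 5", "9"), ("1 1", "3"), ("x", "0")]
assert verify_parallel(add_numbers, cases) == verify_sequential(add_numbers, cases)
```

The equality check at the end is the property the premise depends on: per-case verdicts must not change with scheduling. It holds trivially here because each case is independent and deterministic; shared state, timeouts, or resource contention are where a real system could diverge.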

What would settle it

Run ScaleBox verification on a fresh set of 500 previously unseen programming problems and compare its pass/fail decisions to those of human experts or known ground-truth solutions; any systematic mismatch rate above a few percent would falsify the accuracy claim.
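One way to score such a comparison is sketched below. The function name `confusion_rates` and the metric names are illustrative assumptions; the paper does not prescribe this protocol. Verdicts are the verifier's pass/fail decisions, and truth holds the expert or ground-truth labels for the same submissions.

```python
from typing import Dict, Sequence


def confusion_rates(verdicts: Sequence[bool],
                    truth: Sequence[bool]) -> Dict[str, float]:
    """Compare verifier pass/fail verdicts against ground-truth labels."""
    if len(verdicts) != len(truth):
        raise ValueError("verdicts and truth must have the same length")
    if len(truth) == 0:
        raise ValueError("need at least one labelled submission")
    fp = sum(v and not t for v, t in zip(verdicts, truth))  # accepted wrong code
    fn = sum(t and not v for v, t in zip(verdicts, truth))  # rejected correct code
    positives = sum(bool(t) for t in truth)
    negatives = len(truth) - positives
    return {
        "mismatch_rate": (fp + fn) / len(truth),
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


rates = confusion_rates(
    verdicts=[True, True, False, False, True],
    truth=[True, False, False, True, True],
)
print(rates["mismatch_rate"])  # 0.4
```

Reporting false positives and false negatives separately matters for RL training: a false positive rewards wrong code, while a false negative withholds reward from correct code, and the two distort the reward signal in different ways.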

Figures

Figures reproduced from arXiv: 2604.27467 by Boxi Cao, Hongyu Lin, Jiasheng Zheng, Jiazhen Jiang, Le Sun, Pengbo Wang, Qiming Zhu, Xianpei Han, Xin Zheng, Yaojie Lu, Zhengzhao Ma.

**Figure 1.** Comparison between Exact Match and Special…

**Figure 2.** Overview of the SCALEBOX Architecture. A distributed architecture featuring NGINX load balancing, test-case parallelism, and unified verification with special judge support, optimized for high-throughput and reproducible code training and evaluation.

**Figure 3.** Impact of reward fidelity on RL training (1.2K Subset): (a) Standard exact-match rewards are artificially…

**Figure 4.** Screenshots of the SCALEBOX dashboard, which includes features for distributed deployment, resource monitoring, and log monitoring.

B. Supported Benchmarks of ScaleBox. SCALEBOX supports diverse code benchmarks, including:
• Assert-based: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), HumanEval+, and MBPP+ (Liu et al., 2023)
• Multi-language: MultiPL-E (Cassano et al., 2023)
• Standard I/O…

**Figure 5.** Prompt Template for classifying programming problems that require special judge support.

**Figure 6.** Prompt Template for generating special judge programs.

Original abstract

Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high-concurrency workloads. We present ScaleBox, a high-fidelity and scalable system designed to address these limitations in large-scale code training. ScaleBox introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training stability, significantly outperforming heuristic-matching baselines. By providing a reliable and high-throughput infrastructure, ScaleBox facilitates more effective research and development in large-scale code training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ScaleBox, a system for high-fidelity and scalable code verification tailored to large language models. It proposes automated special-judge generation and management, fine-grained parallel execution across test cases with multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. Experiments are claimed to show significant gains in verification accuracy and efficiency, while RLVR experiments report substantial improvements on LiveCodeBench and training stability over heuristic-matching baselines.

Significance. If the core claims are substantiated with quantitative validation, ScaleBox would address a practical bottleneck in scaling code LLM training and evaluation by supplying reliable, high-throughput verification infrastructure. This could enable more effective RL-based methods and reproducible benchmarking in the field.

major comments (2)

[Abstract] The abstract asserts clear experimental gains in accuracy, efficiency, LiveCodeBench score, and training stability but supplies no metrics, error bars, baseline details, dataset sizes, or exclusion criteria. Without these, the magnitude and reliability of the reported improvements cannot be assessed.
[Methods (automated special-judge generation)] The high-fidelity claim rests on the assertion that automated special-judge generation produces verifiers accurate across diverse unseen code problems without systematic false positives or negatives. No description of the generation algorithm, training data for the generator, held-out validation sets, or measured false-positive/negative rates is provided, leaving the prerequisite for all downstream accuracy and RLVR claims unverified.

minor comments (2)

[Abstract] The acronym RLVR appears without expansion on first use; define it explicitly (e.g., Reinforcement Learning with Verifiable Rewards).
[Results] Ensure all experimental figures include error bars, axis labels, and legends that allow direct comparison to the heuristic baselines mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment below by revising the relevant sections to improve clarity and completeness.

Point-by-point responses

Referee: [Abstract] The abstract asserts clear experimental gains in accuracy, efficiency, LiveCodeBench score, and training stability but supplies no metrics, error bars, baseline details, dataset sizes, or exclusion criteria. Without these, the magnitude and reliability of the reported improvements cannot be assessed.

Authors: We agree that the abstract would benefit from including specific quantitative details to allow readers to assess the scale and reliability of the improvements. In the revised manuscript, we have updated the abstract to report key metrics from the experiments section, including verification accuracy gains, efficiency improvements under parallel execution, LiveCodeBench score increases over baselines, training stability measures, baseline descriptions, dataset sizes, and relevant error bars or statistical details.

revision: yes

Referee: [Methods (automated special-judge generation)] The high-fidelity claim rests on the assertion that automated special-judge generation produces verifiers accurate across diverse unseen code problems without systematic false positives or negatives. No description of the generation algorithm, training data for the generator, held-out validation sets, or measured false-positive/negative rates is provided, leaving the prerequisite for all downstream accuracy and RLVR claims unverified.

Authors: We acknowledge the need for greater detail in the methods section to substantiate the high-fidelity claims. The revised manuscript now includes an expanded description of the automated special-judge generation algorithm, the training data and approach used for the generator, the held-out validation sets employed, and the measured false-positive and false-negative rates across diverse unseen problems. These additions directly support the accuracy and RLVR results.

revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

Full rationale

The paper is a systems description of ScaleBox introducing automated special-judge generation, parallel execution, and a benchmarking suite, followed by empirical experiments reporting accuracy/efficiency gains and RLVR improvements on LiveCodeBench. No equations, parameter fits, or self-referential definitions appear; performance claims are framed as direct experimental outcomes rather than quantities derived from or equivalent to the system's inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked that reduce the central claims to prior author work or fitted data. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The system implicitly relies on the domain assumption that code execution can be sandboxed and parallelized without fidelity loss.

axioms (1)

Domain assumption: Code execution environments can be reliably sandboxed and parallelized without loss of verification fidelity.
Required for the claims of high-fidelity parallel execution and multi-node coordination to hold.

pith-pipeline@v0.9.0 · 5478 in / 1319 out tokens · 69973 ms · 2026-05-07T09:24:10.295920+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references

[1] If there are multiple possible solutions, print any of them

multiple_solutions? For example:
- If there are multiple answers, print any of them
- If there are multiple solutions, you are allowed to print any of them.
- If there are multiple possible solutions, print any of them.
- If there are different possible orders with a correct answer, print any of them.
- If there are multiple solution...

[2] reason

float_comparison? (floating point answers, precision, tolerance, absolute/relative error, decimals) Return JSON object: {"reason": "<short justification <less than 160 words>", "needs_special_judge": <true|false>, "categories": [zero or more of "multiple_solutions", "float_comparison"], "...
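A response in this shape can be validated before it is acted on. The sketch below assumes the field names in the fragment above read `reason`, `needs_special_judge`, and `categories`, with `multiple_solutions` and `float_comparison` as the allowed categories. The schema is truncated in the source, so any further keys are simply ignored here; `JudgeDecision` and `parse_decision` are hypothetical names, not part of ScaleBox.

```python
import json
from dataclasses import dataclass
from typing import List

# Categories visible in the truncated prompt fragment (assumed complete here).
ALLOWED_CATEGORIES = {"multiple_solutions", "float_comparison"}


@dataclass
class JudgeDecision:
    reason: str
    needs_special_judge: bool
    categories: List[str]


def parse_decision(raw: str) -> JudgeDecision:
    """Validate a classifier response against the fields visible in the prompt."""
    data = json.loads(raw)
    needs = data["needs_special_judge"]
    if not isinstance(needs, bool):
        raise ValueError("needs_special_judge must be a JSON boolean")
    categories = list(data.get("categories", []))
    unknown = set(categories) - ALLOWED_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return JudgeDecision(str(data.get("reason", "")), needs, categories)


decision = parse_decision(
    '{"reason": "any valid ordering is accepted", '
    '"needs_special_judge": true, "categories": ["multiple_solutions"]}'
)
print(decision.needs_special_judge, decision.categories)
```

Rejecting malformed responses at this step keeps a misclassified problem from silently falling back to exact matching or receiving an unnecessary judge.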
