CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

Boyu Zhu; Hanxu Hu; Haotian Zhang; Huiming Wang; Mingzhe Du; Xiao Zhu; Xinyu Zhou; Zhijiang Guo

REVIEW 2 major objections 2 minor 2 cited by

A reward model trained on verified code preferences scales both RL training and test-time inference for code LLMs without needing unit tests.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-21 13:28 UTC pith:UCL75T4J

load-bearing objection CodeScaler replaces unit tests with a learned reward model for scaling code RL training and inference, but the generalization from verified to synthetic data looks like the part that needs checking. the 2 major comments →

arxiv 2602.17684 v2 pith:UCL75T4J submitted 2026-02-04 cs.LG cs.AI

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

Xiao Zhu , Xinyu Zhou , Boyu Zhu , Hanxu Hu , Mingzhe Du , Haotian Zhang , Huiming Wang , Zhijiang Guo This is my paper

classification cs.LG cs.AI

keywords code generationreward modelsreinforcement learninglarge language modelstest-time scalingcode LLMspreference data

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CodeScaler to overcome the dependence of code LLMs on scarce or unreliable unit tests for reinforcement learning. It trains a reward model on carefully curated preference data from verified problems, using syntax-aware extraction and validity-preserving reward shaping. This produces consistent gains over execution-based RL across benchmarks and allows scaling training data to 44K problems with synthetic data alone. At inference, the same model matches unit-test performance while cutting latency by a factor of ten and also improves on a general reward-model benchmark.

Core claim

CodeScaler is a reward model trained on preference data derived from verified code problems that incorporates syntax-aware code extraction and validity-preserving reward shaping; when used for RL training it outperforms execution-based methods by 1.55 points on Qwen3-8B-Base and 4.23 points on Qwen3-14B-Base, yields a 14.64-point gain over the base model when scaled to 44K problems without any test cases, and at inference time delivers unit-test-comparable accuracy with a ten-fold latency reduction.

What carries the argument

CodeScaler reward model, trained with syntax-aware code extraction and validity-preserving reward shaping on preference data from verified code problems.

Load-bearing premise

Preference data derived from verified code problems produces a reward model that generalizes reliably to new problems without overfitting or systematic bias.

What would settle it

Apply the trained CodeScaler reward model to a fresh coding benchmark outside the training distribution and measure whether performance gains disappear relative to execution-based RL.

Watch this falsifier — get emailed when new claim-graph text bears on it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

CodeScaler replaces unit tests with a learned reward model for scaling code RL training and inference, but the generalization from verified to synthetic data looks like the part that needs checking.

read the letter

CodeScaler trains a reward model on preference pairs from verified code problems and uses it to drive RL training plus test-time selection for code LLMs. The main reported wins are beating execution-based RL by a few points on Qwen3 models and then jumping +14.64 points when scaling to 44K synthetic problems with no test cases at all. At inference it matches unit-test performance while cutting latency by roughly 10x, and it also edges out other reward models on RM-Bench in code and other domains.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeScaler, a reward model trained on curated preference data from verified code problems using syntax-aware code extraction and validity-preserving reward shaping. It claims to outperform execution-based RL on four coding benchmarks (+1.55 points on Qwen3-8B-Base and +4.23 on Qwen3-14B-Base), enable scaling RL training to 44K synthetic problems for +14.64 points over the base model without any test cases, serve as an effective test-time scaling method with performance comparable to unit tests but 10x lower latency, and outperform prior reward models on RM-Bench in code (+3.3) and other domains (+2.7 average).

Significance. If the results hold and the reward model reliably proxies execution correctness, CodeScaler would meaningfully advance scalable code LLM training by removing the unit-test bottleneck and enabling larger synthetic datasets. The reported latency reduction at inference and cross-domain RM-Bench gains would be practically useful. The design choices around syntax-aware extraction and reward shaping are sensible for robustness and deserve credit as thoughtful engineering contributions.

major comments (2)

[Scaling Experiments] The central scaling claim (+14.64 points on 44K synthetic problems without test cases) is load-bearing for the no-test-case contribution, yet the manuscript provides no direct correlation analysis between CodeScaler reward scores and actual execution outcomes on those synthetic problems. Without this, it remains possible that gains arise from RM exploitation of curation artifacts rather than semantic generalization (see scaling experiments section).
[Method] The preference data construction from verified code problems (used to train the RM) is described at a high level but lacks explicit details on pair creation, data exclusion rules, and any post-hoc filtering. This information is necessary to evaluate whether the reported outperformance over execution-based RL could partly reflect choices in the training distribution rather than independent generalization.

minor comments (2)

[Abstract] The abstract refers to 'four coding benchmarks' without naming them; listing the specific benchmarks (e.g., HumanEval, MBPP) would improve clarity for readers.
[Experiments] Reported numeric gains lack error bars, number of runs, or statistical significance tests; adding these would strengthen the experimental claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methods.

read point-by-point responses

Referee: [Scaling Experiments] The central scaling claim (+14.64 points on 44K synthetic problems without test cases) is load-bearing for the no-test-case contribution, yet the manuscript provides no direct correlation analysis between CodeScaler reward scores and actual execution outcomes on those synthetic problems. Without this, it remains possible that gains arise from RM exploitation of curation artifacts rather than semantic generalization (see scaling experiments section).

Authors: We appreciate the referee's emphasis on validating the reward model's behavior on the scaled synthetic data. Although the 44K problems are constructed without test cases to highlight the removal of the execution bottleneck, we agree that a direct correlation analysis would further support the claim of semantic generalization. In the revised manuscript, we will include such an analysis on a held-out subset of problems where execution outcomes can be obtained or generated, reporting the correlation between CodeScaler scores and pass rates to address concerns about potential curation artifacts. revision: yes
Referee: [Method] The preference data construction from verified code problems (used to train the RM) is described at a high level but lacks explicit details on pair creation, data exclusion rules, and any post-hoc filtering. This information is necessary to evaluate whether the reported outperformance over execution-based RL could partly reflect choices in the training distribution rather than independent generalization.

Authors: We agree that additional methodological details would enhance reproducibility and allow readers to better assess the training distribution. In the revised manuscript, we will expand the preference data construction section to provide explicit descriptions of pair creation from verified problems, the data exclusion rules employed, and any post-hoc filtering steps applied to maintain data quality and robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains a reward model on preference data from verified code problems, then reports empirical gains on external coding benchmarks and RM-Bench. No equations or steps reduce a claimed prediction or first-principles result to the inputs by construction; the RM training and subsequent RL/inference scaling are presented as standard supervised learning followed by empirical evaluation. The construction of preference data and syntax-aware shaping are described as methodological choices rather than self-defining the performance metric. Claims rest on benchmark comparisons outside the training distribution, satisfying the criteria for independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central contribution rests on the assumption that preference data from verified problems plus the described shaping techniques yield a generalizable reward signal.

invented entities (1)

CodeScaler reward model no independent evidence
purpose: Provide scalable rewards for code generation training and inference without unit tests
New model introduced and trained on curated preference data; no independent evidence of its properties outside the reported experiments.

pith-pipeline@v0.9.0 · 5774 in / 1298 out tokens · 51544 ms · 2026-05-21T13:28:06.194886+00:00 · methodology

0 comments

read the original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Table to Cell: Attention for Better Reasoning with TABALIGN
cs.AI 2026-05 unverdicted novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
RLPF: Reinforcement Learning from Performance Feedback for Code Generation
cs.LG 2026-07 conditional novelty 6.0

RLPF's staged performance reward lifts Qwen3-32B on PerfCodeBench from 11.1% to 54.6% correct-and-runnable and from 8.1% to 38.6% relative efficiency.