CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models
Pith reviewed 2026-05-21 13:28 UTC · model grok-4.3
The pith
A reward model trained on verified code preferences scales both RL training and test-time inference for code LLMs without needing unit tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeScaler is a reward model trained on preference data derived from verified code problems. It incorporates syntax-aware code extraction and validity-preserving reward shaping. Used as the reward signal for RL training, it outperforms execution-based methods by 1.55 points on Qwen3-8B-Base and 4.23 points on Qwen3-14B-Base. Scaled to 44K problems without any test cases, it yields a 14.64-point gain over the base model. At inference time, it delivers accuracy comparable to unit-test approaches with a ten-fold reduction in latency.
What carries the argument
CodeScaler reward model, trained with syntax-aware code extraction and validity-preserving reward shaping on preference data from verified code problems.
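This page does not spell out how either component works, so the following is only a minimal sketch of what syntax-aware extraction and validity-preserving shaping could look like, assuming Python solutions wrapped in Markdown fences. The names `extract_code`, `shaped_reward`, `score_fn`, and the `INVALID_REWARD` floor are hypothetical and not taken from the paper.

```python
import ast
import re
from typing import Callable, Optional

# Hypothetical floor for responses with no parseable code (an assumption,
# not a value reported in the paper).
INVALID_REWARD = -1.0

FENCE = "`" * 3  # Markdown code-fence delimiter
_BLOCK_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, else None."""
    for candidate in reversed(_BLOCK_RE.findall(response)):
        try:
            ast.parse(candidate)
        except (SyntaxError, ValueError):
            continue
        return candidate
    return None


def shaped_reward(response: str, score_fn: Callable[[str], float]) -> float:
    """Score valid code with the reward model; give invalid output a fixed floor."""
    code = extract_code(response)
    if code is None:
        return INVALID_REWARD
    return score_fn(code)
```

Here `score_fn` stands in for a call to the reward model. The point of the design in this sketch is that the reward model only ever sees code that parses, so malformed output cannot earn a high score by accident.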
Load-bearing premise
Preference data derived from verified code problems produces a reward model that generalizes reliably to new problems without overfitting or systematic bias.
What would settle it
Apply the trained CodeScaler reward model to a fresh coding benchmark outside the training distribution and measure whether performance gains disappear relative to execution-based RL.
Original abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
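Test-time scaling with a reward model is commonly done as best-of-N selection: sample several candidate programs, score each one, and keep the highest-scoring candidate. The abstract does not say which selection rule CodeScaler uses, so the sketch below assumes plain best-of-N; `best_of_n` and `score_fn` are hypothetical names, with `score_fn` standing in for a reward-model call.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)
```

Under this assumption, selection costs one scoring call per candidate and runs no generated code, which is where a latency advantage over unit-test execution would plausibly come from.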
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeScaler, a reward model trained on curated preference data from verified code problems using syntax-aware code extraction and validity-preserving reward shaping. It makes four claims. First, it outperforms execution-based RL on four coding benchmarks (+1.55 points on Qwen3-8B-Base and +4.23 on Qwen3-14B-Base). Second, it enables scaling RL training to 44K problems with additional synthetic data, for +14.64 points over the base model without any test cases. Third, it serves as an effective test-time scaling method, with performance comparable to unit tests at 10x lower latency. Fourth, it outperforms prior reward models on RM-Bench in the code domain (+3.3) and in general and reasoning domains (+2.7 on average).
Significance. If the results hold and the reward model reliably proxies execution correctness, CodeScaler would meaningfully advance scalable code LLM training by removing the unit-test bottleneck and enabling larger synthetic datasets. The reported latency reduction at inference and cross-domain RM-Bench gains would be practically useful. The design choices around syntax-aware extraction and reward shaping are sensible for robustness and deserve credit as thoughtful engineering contributions.
Major comments (2)
- [Scaling Experiments] The central scaling claim (+14.64 points on 44K synthetic problems without test cases) is load-bearing for the no-test-case contribution, yet the manuscript provides no direct correlation analysis between CodeScaler reward scores and actual execution outcomes on those synthetic problems. Without this, it remains possible that gains arise from RM exploitation of curation artifacts rather than semantic generalization (see scaling experiments section).
- [Method] The preference data construction from verified code problems (used to train the RM) is described at a high level but lacks explicit details on pair creation, data exclusion rules, and any post-hoc filtering. This information is necessary to evaluate whether the reported outperformance over execution-based RL could partly reflect choices in the training distribution rather than independent generalization.
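The analysis requested in the first major comment could take several forms. As one hedged illustration, the sketch below measures how often a passing solution receives a higher reward score than a failing one, which is equivalent to ROC AUC. The function name `pairwise_agreement` and its inputs are hypothetical; the paper reports no such measurement.

```python
from typing import Sequence


def pairwise_agreement(scores: Sequence[float], passed: Sequence[bool]) -> float:
    """Fraction of (passing, failing) pairs where the passing sample scores higher.

    Ties count as half a win. A value of 0.5 means the scores carry no
    information about execution outcomes; 1.0 means perfect ranking.
    """
    if len(scores) != len(passed):
        raise ValueError("scores and passed must have the same length")
    pos = [s for s, ok in zip(scores, passed) if ok]
    neg = [s for s, ok in zip(scores, passed) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one passing and one failing sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A rank-based measure like this does not depend on the scale of the reward scores, which matters when comparing reward models whose outputs are not calibrated against each other.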
Minor comments (2)
- [Abstract] The abstract refers to 'four coding benchmarks' without naming them; listing the specific benchmarks (e.g., HumanEval, MBPP) would improve clarity for readers.
- [Experiments] Reported numeric gains lack error bars, number of runs, or statistical significance tests; adding these would strengthen the experimental claims.
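The second minor comment asks for spread across runs. A minimal sketch of that summary, assuming one benchmark score per independent training run (for example, per random seed); `summarize_runs` is a hypothetical helper, not something from the paper:

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error) over runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    return mean, std, std / math.sqrt(len(scores))
```

Reporting the standard error alongside a gain such as +1.55 points would let readers judge whether the gain exceeds run-to-run noise.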
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methods.
- Referee: [Scaling Experiments] The central scaling claim (+14.64 points on 44K synthetic problems without test cases) is load-bearing for the no-test-case contribution, yet the manuscript provides no direct correlation analysis between CodeScaler reward scores and actual execution outcomes on those synthetic problems. Without this, it remains possible that gains arise from RM exploitation of curation artifacts rather than semantic generalization (see scaling experiments section).
  Authors: We appreciate the referee's emphasis on validating the reward model's behavior on the scaled synthetic data. Although the 44K problems are constructed without test cases to highlight the removal of the execution bottleneck, we agree that a direct correlation analysis would further support the claim of semantic generalization. In the revised manuscript, we will include such an analysis on a held-out subset of problems where execution outcomes can be obtained or generated, reporting the correlation between CodeScaler scores and pass rates to address concerns about potential curation artifacts.
  Revision: yes
- Referee: [Method] The preference data construction from verified code problems (used to train the RM) is described at a high level but lacks explicit details on pair creation, data exclusion rules, and any post-hoc filtering. This information is necessary to evaluate whether the reported outperformance over execution-based RL could partly reflect choices in the training distribution rather than independent generalization.
  Authors: We agree that additional methodological details would enhance reproducibility and allow readers to better assess the training distribution. In the revised manuscript, we will expand the preference data construction section to provide explicit descriptions of pair creation from verified problems, the data exclusion rules employed, and any post-hoc filtering steps applied to maintain data quality and robustness.
  Revision: yes
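To make concrete what "pair creation from verified problems" might involve, here is one common construction, offered purely as an assumption: within each problem, every solution that passes the verified tests is paired as "chosen" against every solution that fails. The names `Solution` and `build_pairs` are hypothetical, and the paper's actual pairing and exclusion rules are not described on this page.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Solution:
    code: str
    passed: bool  # outcome of running the problem's verified tests


def build_pairs(
    solutions_by_problem: Dict[str, List[Solution]],
) -> List[Tuple[str, str, str]]:
    """Return (problem_id, chosen_code, rejected_code) triples."""
    pairs: List[Tuple[str, str, str]] = []
    for problem_id, solutions in solutions_by_problem.items():
        chosen = [s.code for s in solutions if s.passed]
        rejected = [s.code for s in solutions if not s.passed]
        # Problems with only passing or only failing solutions yield no pairs.
        pairs.extend((problem_id, c, r) for c, r in product(chosen, rejected))
    return pairs
```

Even in this simple scheme, implicit exclusion rules appear: problems where all sampled solutions agree contribute nothing, which is the kind of distributional choice the referee asks the authors to document.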
Circularity Check
No significant circularity in derivation chain
The paper trains a reward model on preference data from verified code problems, then reports empirical gains on external coding benchmarks and RM-Bench. No equations or steps reduce a claimed prediction or first-principles result to the inputs by construction; the RM training and subsequent RL/inference scaling are presented as standard supervised learning followed by empirical evaluation. The construction of preference data and syntax-aware shaping are described as methodological choices rather than self-defining the performance metric. Claims rest on benchmark comparisons outside the training distribution, satisfying the criteria for independent content.
Axiom & Free-Parameter Ledger
Invented entities (1)
- CodeScaler reward model: no independent evidence
Forward citations
Cited by 1 Pith paper
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...