Recognition: no theorem link
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Pith reviewed 2026-05-14 21:01 UTC · model grok-4.3
The pith
Block size conflicts between domains reduce the effectiveness of rollout-based reinforcement learning post-training for diffusion large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Domain block size conflict is the central phenomenon: when samples from multiple domains are trained together under rollout-based RL, a single block size cannot simultaneously optimize trajectories for all domains, lowering overall post-training effectiveness. The paper shows that pre-identifying the best block size per sample and assigning those sizes during cross-domain training removes the conflict and yields consistent gains without new instabilities.
What carries the argument
The Block Size Conflict Score, which measures disagreement among domains on optimal block sizes for RL rollouts, together with the mechanism of assigning each sample its individually best-improved training block size.
If this is right
- Using sample-level best-improved block sizes improves cross-domain RL performance compared with any single fixed block size.
- The Block Size Conflict Score provides a quantitative way to predict how much performance will suffer when two domains are trained together.
- The same per-sample assignment approach works across seven different rollout-based RL algorithms without requiring changes to their core update rules.
- Flexible post-training becomes possible for both single-domain and multi-domain scenarios on the released Block-R1 benchmark.
Where Pith is reading between the lines
- If per-sample block sizes can be identified reliably, the same principle could apply to other generation granularities such as variable-length token groups in standard autoregressive LLMs.
- Pre-computing optimal block sizes might reduce the need for separate domain-specific fine-tuning stages by handling conflicts inside a single mixed-domain training run.
- Future methods could replace the offline identification step with an online adaptive block-size selector learned jointly with the policy.
Load-bearing premise
The best-improved training block size for each sample can be reliably identified in advance and assigning these sizes during cross-domain training produces consistent gains without introducing new optimization instabilities.
What would settle it
A controlled run on the Block-R1 benchmark in which randomly chosen or fixed block sizes achieve equal or higher average reward than the method that assigns each sample its pre-identified best size.
Figures
read the original abstract
Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms and diverse dLLM backbones are comprehensively covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block-R1-41K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates domain block size conflict as a key issue in multi-domain RL post-training for diffusion LLMs (dLLMs), where block size affects rollout trajectories in methods like GRPO. It constructs the Block-R1-41K dataset by labeling each of 41K samples with a best-improved training block size, derives a Block Size Conflict Score to quantify domain conflict, introduces the Block-R1 benchmark for single- and cross-domain RL, and proposes a simple cross-domain training method that assigns per-sample best block sizes. The work reports results across 13 datasets, 7 RL algorithms, and multiple dLLM backbones, with the benchmark and dataset released openly.
Significance. If the results hold, the paper offers a useful perspective on block size as a source of domain conflict in dLLM RL and supplies practical resources (Block-R1-41K dataset, Block-R1 benchmark, and per-sample assignment method) that could improve post-training effectiveness. The open-sourcing of the dataset and benchmark is a clear strength for reproducibility. The breadth of experiments across datasets and algorithms provides a solid empirical foundation, though the significance depends on demonstrating that the per-sample labels are stable and generalizable beyond the construction process.
major comments (2)
- [§3 (Dataset Construction)] §3 (Dataset Construction, contribution 2): The central claim rests on labeling each sample in Block-R1-41K with its 'best-improved training block size' via RL runs. Because methods like GRPO are high-variance, the selected best size for a sample can shift across seeds or slight hyperparameter changes. The manuscript must specify the exact procedure (number of independent RL runs per candidate size, seed averaging, statistical tests for declaring a size 'best'), as instability would render the induced Block Size Conflict Score noisy and prevent reliable application of the cross-domain method to unseen samples without an oracle.
- [§5 (Experiments)] §5 (Experiments): The reported gains from the sample-level block-size assignment method are load-bearing for the practical contribution. Without details on controls (e.g., multiple random seeds for all RL runs, confidence intervals on performance deltas, or ablation isolating the effect of label noise), it is unclear whether the improvements over standard multi-domain training are robust or partly artifacts of post-hoc selection on the same runs used to create the labels.
minor comments (2)
- [Abstract] Abstract: The acronym 'dLLM' is introduced without an explicit expansion on first use, although the surrounding text makes the meaning clear.
- [§4 (Benchmark description)] §4 (Benchmark description): The definition of the Block Size Conflict Score should include an explicit formula or pseudocode, as the current high-level description leaves open how per-sample labels are aggregated across domains.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on dataset construction and experimental controls. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional analyses.
read point-by-point responses
-
Referee: [§3 (Dataset Construction)] §3 (Dataset Construction, contribution 2): The central claim rests on labeling each sample in Block-R1-41K with its 'best-improved training block size' via RL runs. Because methods like GRPO are high-variance, the selected best size for a sample can shift across seeds or slight hyperparameter changes. The manuscript must specify the exact procedure (number of independent RL runs per candidate size, seed averaging, statistical tests for declaring a size 'best'), as instability would render the induced Block Size Conflict Score noisy and prevent reliable application of the cross-domain method to unseen samples without an oracle.
Authors: We agree that the labeling procedure requires explicit specification to ensure the Block Size Conflict Score is robust. In constructing Block-R1-41K, we ran 3 independent GRPO trainings per candidate block size (selected from {1, 2, 4, 8, 16}) for each of the 41K samples, using distinct random seeds, and chose the size yielding the highest mean improvement across runs. We will add a dedicated subsection in §3 detailing this procedure, including seed handling and the selection rule, plus a short stability analysis across additional seeds. This revision will directly address concerns about noise and support reliable use of the labels. revision: yes
-
Referee: [§5 (Experiments)] §5 (Experiments): The reported gains from the sample-level block-size assignment method are load-bearing for the practical contribution. Without details on controls (e.g., multiple random seeds for all RL runs, confidence intervals on performance deltas, or ablation isolating the effect of label noise), it is unclear whether the improvements over standard multi-domain training are robust or partly artifacts of post-hoc selection on the same runs used to create the labels.
Authors: We share the concern that robustness must be demonstrated explicitly. All results in §5 were obtained with 5 random seeds per configuration, with means and standard deviations already reported in the tables. To isolate the effect of label noise, we will add an ablation study in the revised §5 that injects controlled noise into the per-sample block-size labels and measures the resulting performance drop. We will also report 95% confidence intervals on all performance deltas. These additions will confirm that the gains are not artifacts of the labeling process. revision: yes
Circularity Check
No circularity: empirical per-sample labeling and benchmarking are self-contained
full rationale
The paper's core contributions consist of constructing the Block-R1-41K dataset via direct RL evaluations to label each sample with its empirically best block size, deriving a conflict score from those labels, and applying the labels in a cross-domain training procedure. No derivation chain reduces a claimed result to its own inputs by definition or construction; there are no equations presented where a prediction equals a fitted parameter, no self-citation load-bearing the central premise, and no ansatz or uniqueness theorem imported from prior author work. The reported gains rest on independent experimental runs across 13 datasets and 7 algorithms rather than self-referential quantities.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Arel’s sudoku generator.URL https://www.ocf.berkeley.edu/ arel/sudoku/main.html, 2025
Arel. Arel’s sudoku generator.URL https://www.ocf.berkeley.edu/ arel/sudoku/main.html, 2025
2025
-
[2]
Arriola, A
M. Arriola, A. Gokaslan, J. T. Chiu, Z. Yang, Z. Qi, J. Han, S. S. Sahoo, and V . Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.ICLR, 2025
2025
- [3]
-
[4]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [5]
- [6]
-
[7]
T. Bie, M. Cao, X. Cao, B. Chen, F. Chen, K. Chen, L. Du, D. Feng, H. Feng, M. Gong, Z. Gong, Y . Gu, J. Guan, K. Guan, H. He, Z. Huang, J. Jiang, Z. Jiang, Z. Lan, C. Li, J. Li, Z. Li, H. Liu, L. Liu, G. Lu, Y . Lu, Y . Ma, X. Mou, Z. Pan, K. Qiu, Y . Ren, J. Tan, Y . Tian, Z. Wang, L. Wei, T. Wu, Y . Xing, W. Ye, L. Zha, T. Zhang, X. Zhang, J. Zhao, D. ...
-
[8]
T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y . Gu, J. Hu, Z. Huang, Z. Lan, C. Li, C. Li, J. Li, Z. Li, H. Liu, L. Liu, G. Lu, X. Lu, Y . Ma, J. Tan, L. Wei, J.-R. Wen, Y . Xing, X. Zhang, J. Zhao, D. Zheng, J. Zhou, J. Zhou, Z. Zhou, L. Zhu, and Y . Zhuang. LLaDA2.0: Scaling Up Diffusion Language Models to 100B.arXiv preprint arXiv:2512.15745, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Campbell, J
A. Campbell, J. Benton, V . De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A Continuous Time Framework for Discrete Denoising Models.NeurIPS, 2022
2022
- [10]
-
[11]
M. Chen, J. Tworek, H. Jun, Q. Yuan, Henrique Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barne...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
Cheng, S
Z. Cheng, S. Hao, T. Liu, F. Zhou, Y . Xie, F. Yao, Y . Bian, Y . Zhuang, N. Dey, Y . Zha, Y . Gu, K. Zhou, Y . Wang, Y . Li, R. Fan, J. She, C. Gao, A. Saparov, H. Li, T. W. Killian, M. Yurochkin, Z. Liu, E. P. Xing, and Z. Hu. Revisiting Reinforcement Learning for LLM Reasoning from a Cross-Domain Perspective. InNeurIPS, 2025
2025
-
[13]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[14]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 10
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024
2024
- [16]
- [17]
-
[18]
KTO: Model Alignment as Prospect Theoretic Optimization
K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. KTO: Model Alignment as Prospect Theoretic Optimization.arXiv preprint arXiv:2402.01306, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
M. U. Gutmann and A. Hyvärinen. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics.JMLR, 2012
2012
- [20]
-
[21]
Hendrycks, C
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. InICLR, 2021
2021
- [22]
-
[23]
C. Huang and H. Tang. Ctrldiff: Boosting large diffusion language models with dynamic block prediction and controllable generation.arXiv preprint arXiv:2505.14455, 2025
-
[24]
Z. Huang, Z. Chen, Z. Wang, T. Li, and G.-J. Qi. Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models.arXiv preprint arXiv:2505.10446, 2025
-
[25]
Y . Jiang, R. Qiu, and Z. Huang. Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning.arXiv preprint arXiv:2605.02263, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [26]
- [27]
-
[28]
Lightman, V
H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InICLR, 2023
2023
-
[29]
Decoupled Weight Decay Regularization
I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
A. Lou, C. Meng, and S. Ermon. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution.arXiv preprint arXiv:2310.16834, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
C. Lu, H. Chen, J. Chen, H. Su, C. Li, and J. Zhu. Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning. InICML, 2023
2023
- [32]
- [33]
-
[34]
Y . Ma, L. Du, L. Wei, K. Chen, Q. Xu, K. Wang, G. Feng, G. Lu, L. Liu, X. Qi, X. Zhang, Z. Tao, H. Feng, Z. Jiang, Y . Xu, Z. Huang, Y . Zhuang, H. Xu, J. Hu, Z. Lan, J. Zhao, J. Li, and D. Zheng. dinfer: An efficient inference framework for diffusion language models.arXiv preprint arXiv:2510.08666, 2025. 11
-
[35]
C. Meng, K. Choi, J. Song, and S. Ermon. Concrete Score Matching: Generalized Score Matching for Discrete Data.NeurIPS, 2022
2022
- [36]
- [37]
-
[38]
S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y . Lin, J.-R. Wen, and C. Li. Large Language Diffusion Models.arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
J. Ou, J. Han, M. Xu, S. Xu, J. Xie, S. Ermon, Y . Wu, and C. Li. Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective. InICLR, 2026
2026
-
[40]
J. Pan, J. Zhang, X. Wang, L. Yuan, H. Peng, and A. Suhr. Tinyzero.GitHub repository https://github.com/Jiayi-Pan/TinyZero, 2025
2025
-
[41]
V . M. Panaretos and Y . Zemel. Statistical aspects of wasserstein distances.Annual Review of Statistics and Its Application, 2019
2019
-
[42]
X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning.arXiv preprint arXiv:1910.00177, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[43]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct Pref- erence Optimization: Your Language Model is Secretly a Reward Model.arXiv preprint arXiv:2305.18290, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Rafailov, A
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.NeurIPS, 2023
2023
-
[45]
Rojas, J
K. Rojas, J. Lin, K. Rasul, A. Schneider, Y . Nevmyvaka, M. Tao, and W. Deng. Improving reasoning for diffusion language models via group diffusion policy optimization. InICLR, 2026
2026
- [46]
-
[47]
Schulman, S
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust Region Policy Optimization. InICML, 2015
2015
-
[48]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[49]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [50]
- [51]
-
[52]
Sohl-Dickstein, E
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. InICML, 2015
2015
-
[53]
X. Tang, R. Dolga, S. Yoon, and I. Bogunovic. Wd1: Weighted policy optimization for reasoning in diffusion language models. InICLR, 2026
2026
- [54]
-
[55]
Representation Learning with Contrastive Predictive Coding
A. van den Oord, Y . Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding.arXiv preprint arXiv:1807.03748, 2019
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[56]
von Werra, Y
L. von Werra, Y . Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec. Trl: Transformer reinforcement learning, 2020
2020
- [57]
-
[58]
C. Wang, P. Rashidinejad, D. Su, S. Jiang, S. Wang, S. Zhao, C. Zhou, S. Z. Shen, F. Chen, T. Jaakkola, Y . Tian, and B. Liu. SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models. InICLR, 2026
2026
- [59]
-
[60]
D. Wang, R. Qiu, and Z. Huang. When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models.arXiv preprint arXiv:2604.23994, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [61]
-
[62]
Y . Wang, X. Ma, G. Zhang, Y . Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. InNeurIPS, 2024
2024
-
[63]
Y . Wang, L. Yang, B. Li, Y . Tian, K. Shen, and M. Wang. Revolutionizing reinforcement learning framework for diffusion large language models. InICLR, 2026
2026
-
[64]
Z. Wang, J. J. Hunt, and M. Zhou. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning.arXiv preprint arXiv:2208.06193, 2022
work page internal anchor Pith review arXiv 2022
-
[65]
A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025
W. Xiong, J. Yao, Y . Xu, B. Pang, L. Wang, D. Sahoo, J. Li, N. Jiang, T. Zhang, C. Xiong, and H. Dong. A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforce. arXiv preprint arXiv:2504.11343, 2025
-
[66]
Z. Xu, Y . Liu, Y . Yin, M. Zhou, and R. Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025
2025
-
[67]
J. Yang, Y . Jiang, X. Hu, S. Cheng, B. Qi, and J. Shao. Dare: Diffusion large language models alignment and reinforcement executor.arXiv preprint arXiv:2604.04215, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [68]
- [69]
-
[70]
J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong. Dream 7B.URL https://hkunlp.github.io/blog/2025/dream, 2025
2025
-
[71]
O. Zekri and N. Boullé. Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods. arXiv preprint arXiv:2502.01384, 2025
-
[72]
Zellers, A
R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi. HellaSwag: Can a Machine Really Finish Your Sentence? InACL, 2019
2019
- [73]
-
[74]
S. Zhao, D. Gupta, Q. Zheng, and A. Grover. d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning. InNeurIPS, 2025. 13
2025
- [75]
- [76]
-
[77]
F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y . Lin, J.-R. Wen, et al. LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. arXiv preprint arXiv:2505.19223, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [78]
-
[79]
Y . Zhu, J. Wan, X. Liu, S. He, Q. Wang, X. Guo, T. Liang, Z. Huang, Z. He, and X. Qiu. Dirl: An efficient post-training framework for diffusion language models.arXiv preprint arXiv:2512.22234, 2026. 14 A Appendix Overview This appendix provides supplementary materials, theoretical proofs, and comprehensive experimental details as follows: •Appendix Brevi...
-
[80]
Alice is a knight/knave
If a code block exists, Python syntax is checked by parsing the extracted code: reward +1.0 if syntactically valid, otherwise+0.5. • Execution/Test Reward( code_reward). For completions whose format reward is 1.0, the extracted code is executed against the provided per-example test cases (test_list) by running python3 -cfor each test case with a per-test ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.