DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Jiacheng Ye; Lingpeng Kong; Lin Zheng; Shansan Gong; Wei Bi; Xueliang Zhao; Yansong Feng; Zirui Wu

arxiv: 2606.19257 · v1 · pith:2CVCORRLnew · submitted 2026-06-17 · 💻 cs.CL

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Pith reviewed 2026-06-26 21:00 UTC · model grok-4.3

classification 💻 cs.CL

keywords: block diffusion, curriculum learning, chain-of-thought reasoning, diffusion language models, mathematical reasoning, code reasoning, granularity gap

The pith

Block-size curriculum learning closes the granularity gap so diffusion models can reason competitively with autoregressive ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Block diffusion models decode text by denoising parallel blocks, which speeds up inference, but they have shown weak long chain-of-thought reasoning when trained on large blocks. The paper finds that small-block training supports good reasoning while large-block training does not, pointing to a mismatch in granularity. Block-size curriculum learning addresses this by starting training on small blocks and gradually shifting to larger ones. The resulting DreamReasoner-8B model maintains strong reasoning that works across many inference block sizes and reaches performance levels comparable to Qwen3-8B on mathematical and code benchmarks.
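To make the decoding pattern concrete, here is a toy sketch of block-wise parallel denoising with a confidence threshold. Everything in it is an assumption for illustration: `toy_denoiser` stands in for a real model, and `decode_blockwise`, `threshold`, and `MASK` are hypothetical names rather than anything taken from the paper or its released code.

```python
import random

MASK = None  # placeholder for a masked (not yet decoded) position


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for a model: one (token, confidence) pair per position."""
    return [(f"t{i}", rng.random()) for i in range(len(tokens))]


def decode_blockwise(length, block_size, threshold=0.7, seed=0):
    """Decode left to right, one block at a time.

    Inside a block, every masked position whose confidence clears the
    threshold is committed in the same step (the parallel part).
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps = 0
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        while any(tokens[i] is MASK for i in range(start, end)):
            preds = toy_denoiser(tokens, rng)
            masked = [i for i in range(start, end) if tokens[i] is MASK]
            accepted = [i for i in masked if preds[i][1] >= threshold]
            if not accepted:
                # Commit the most confident position so decoding always progresses.
                accepted = [max(masked, key=lambda i: preds[i][1])]
            for i in accepted:
                tokens[i] = preds[i][0]
            steps += 1
    return tokens, steps
```

In this toy setup a block size of 1 needs one model call per token, while larger blocks can finish in fewer calls. That is the intuition behind the speed-up, not a measurement of it.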

Core claim

Block diffusion language models exhibit a stark performance disparity in long-CoT reasoning: large block sizes during training produce remarkably poor results while small block sizes preserve effective reasoning. Block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, overcomes the granularity gap and enables strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks this yields results competitive with leading open autoregressive models.

What carries the argument

Block-size curriculum learning: a training schedule that starts with small blocks and progressively increases the block size, closing the granularity gap between fine-grained and coarse-grained training.
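A staged schedule of this kind can be sketched as a lookup from training step to block size. The stage boundaries and block sizes below are invented for illustration, since the abstract does not give the actual schedule. `block_size_at` and `STAGES` are hypothetical names.

```python
from bisect import bisect_right


def block_size_at(step, stages):
    """Return the training block size for a global step.

    `stages` is a list of (start_step, block_size) pairs sorted by start_step.
    The values used below are illustrative, not the paper's schedule.
    """
    starts = [s for s, _ in stages]
    idx = bisect_right(starts, step) - 1
    if idx < 0:
        raise ValueError("step precedes the first stage")
    return stages[idx][1]


# Hypothetical schedule: fine-grained blocks first, coarse-grained blocks later.
STAGES = [(0, 4), (1000, 8), (2000, 16), (3000, 32)]
```

A training loop would call `block_size_at(step, STAGES)` before building each batch, so the masking granularity follows the curriculum.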

If this is right

Large-block training alone yields poor long-CoT reasoning performance.
Small-block training preserves reasoning ability but limits the speed gains from larger blocks.
The curriculum produces representations that transfer to a range of inference block sizes.
DreamReasoner-8B reaches results competitive with Qwen3-8B on mathematical and code reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gradual block-size schedule might improve diffusion models on other sequence tasks that require long dependencies.
If the granularity gap is the main obstacle, similar curricula could be tested on non-language diffusion models.
The result suggests training strategy, not model architecture, is the current bottleneck for scaling block diffusion to reasoning.

Load-bearing premise

The performance difference between large-block and small-block training arises from a fixable granularity mismatch rather than an inherent limit of block diffusion for long reasoning sequences.

What would settle it

A block diffusion model trained exclusively on large blocks that achieves reasoning performance equal to or better than the curriculum-trained version on the same math and code benchmarks.

Figures

Figures reproduced from arXiv: 2606.19257 by Jiacheng Ye, Lingpeng Kong, Lin Zheng, Shansan Gong, Wei Bi, Xueliang Zhao, Yansong Feng, Zirui Wu.

**Figure 2.** Effect of confidence threshold on reasoning

**Figure 4.** Pass@1 and throughput comparison between LowConfidence and RelaxedConfidence decoding. We define ...
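Figure 4 reports Pass@1. As a reference point, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper computes Pass@1 this way is an assumption, and the function name `pass_at_k` is ours.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.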

Original abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Block-size curriculum learning appears to let diffusion models do long reasoning at scale, but the abstract alone does not give enough experimental detail to confirm the fix works as claimed.

The punchline is that this paper identifies a granularity problem in block diffusion for reasoning and claims a curriculum fixes it to reach Qwen3-8B levels.

What is new is the block-size curriculum learning approach for diffusion language models. They train DreamReasoner-8B and show it works on math and code benchmarks. The paper does well by releasing the model and pointing out the training vs inference block size issue, which is a practical concern for these architectures.

The soft spots are in the evidence. The abstract mentions systematic experiments and competitive results but gives no specifics on the curriculum schedule, number of runs, baselines, or error bars. This makes it difficult to evaluate if the curriculum truly overcomes an inherent limit or just masks it temporarily. The concern about whether parallel block denoising can support long causal chains is worth checking in the full text.

This paper is for researchers in efficient language models and diffusion-based generation. A reader interested in non-autoregressive alternatives to standard transformers would get value from the empirical findings once the details are filled in.

It deserves a serious referee because the idea is actionable and the model release allows reproduction. I would recommend sending it to peer review with requests for more experimental transparency.

Referee Report

3 major / 2 minor

Summary. The paper introduces DreamReasoner-8B, an 8B-parameter block diffusion language model, and studies the impact of training and inference block sizes on long chain-of-thought reasoning. It identifies a performance disparity where large-block training yields poor reasoning while small-block training preserves it, proposes block-size curriculum learning to gradually increase block size during training, and claims this overcomes the granularity gap to enable strong reasoning that generalizes across inference block sizes, achieving results competitive with Qwen3-8B on mathematical and code reasoning benchmarks.

Significance. If the central empirical claims hold with proper controls, the work would be significant for demonstrating that block diffusion models can be scaled to long-CoT reasoning via curriculum training, offering a path to parallel decoding advantages over autoregressive models while maintaining competitive accuracy. The open release of the model and code would further strengthen its impact.

major comments (3)

[Abstract] Abstract and experimental sections: the abstract states that systematic experiments were performed and reports competitive benchmark numbers, but provides no details on baselines, number of runs, error bars, data splits, or exact curriculum schedule; without these the central empirical claim cannot be evaluated.
[Method / Experiments] The claim that block-size curriculum learning produces representations that transfer to inference block sizes different from the final training regime (and overcomes an inherent architectural limit of parallel block denoising for sequential long-CoT dependencies) is load-bearing but rests on the observed disparity without ablations showing that gradual transition is necessary versus other factors such as total training compute or data ordering.
[Experiments] No evidence is provided that the performance gap is caused by a fixable granularity gap rather than an architectural limitation; direct comparisons to Qwen3-8B are reported but without matched training data, token budget, or inference settings, undermining the generalization claim.
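The request for run counts and error bars could be met with a summary as simple as the one sketched here: the mean and standard error of a benchmark score over independent seeds. The helper name `summarize_runs` is hypothetical, and the paper's actual reporting protocol is not known from the abstract.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error) for per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(scores), stdev(scores) / sqrt(len(scores))
```

Reporting the pair per benchmark and per block size would let a reader judge whether a gap between two training regimes exceeds seed-to-seed noise.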

minor comments (2)

[Method] Notation for block sizes during training versus inference should be clarified with explicit symbols and a table summarizing the curriculum schedule.
[Appendix] The GitHub link is provided but the manuscript should include a reproducibility checklist or pointer to exact training hyperparameters and data splits.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important areas for improving the clarity and robustness of our empirical results. We address each major comment below.

Referee: [Abstract] Abstract and experimental sections: the abstract states that systematic experiments were performed and reports competitive benchmark numbers, but provides no details on baselines, number of runs, error bars, data splits, or exact curriculum schedule; without these the central empirical claim cannot be evaluated.

Authors: We acknowledge that additional details are required for proper evaluation. In the revised version, we will update the abstract to mention the baselines used, include the number of experimental runs with error bars, specify the data splits, and detail the exact block-size curriculum schedule in the methods section.

Revision: yes

Referee: [Method / Experiments] The claim that block-size curriculum learning produces representations that transfer to inference block sizes different from the final training regime (and overcomes an inherent architectural limit of parallel block denoising for sequential long-CoT dependencies) is load-bearing but rests on the observed disparity without ablations showing that gradual transition is necessary versus other factors such as total training compute or data ordering.

Authors: Our experiments demonstrate a clear performance disparity between small and large block sizes, with the curriculum approach enabling effective transfer. To more rigorously isolate the effect of the gradual transition, we will include additional ablation studies comparing curriculum learning against constant block size with equivalent compute and reordered data in the revision.

Revision: yes

Referee: [Experiments] No evidence is provided that the performance gap is caused by a fixable granularity gap rather than an architectural limitation; direct comparisons to Qwen3-8B are reported but without matched training data, token budget, or inference settings, undermining the generalization claim.

Authors: The ability of the model to achieve strong reasoning with small block sizes during training shows that the limitation is not architectural but related to training granularity. We will revise the discussion to explicitly address this point. Regarding comparisons to Qwen3-8B, we will add caveats noting the differences in training data and compute, while maintaining that the results demonstrate competitive performance under our training regime.

Revision: partial
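The compute-matched ablation promised in the second response could be laid out along the following lines. This is a sketch under stated assumptions: the step budget stands in for compute, which only holds if every step processes the same number of tokens. The block sizes and the equal split across curriculum stages are illustrative, and `ablation_arms` is a hypothetical helper.

```python
def ablation_arms(total_steps, sizes=(4, 8, 16, 32)):
    """Build training arms that share one step budget.

    Each arm is a list of (num_steps, block_size) stages. The sizes and the
    equal split across curriculum stages are illustrative assumptions.
    """
    # One constant-block-size arm per size, each using the full budget.
    arms = {f"constant-{b}": [(total_steps, b)] for b in sizes}
    per_stage, remainder = divmod(total_steps, len(sizes))
    curriculum = [(per_stage, b) for b in sizes]
    # Give leftover steps to the final stage so the budget matches exactly.
    curriculum[-1] = (per_stage + remainder, sizes[-1])
    arms["curriculum"] = curriculum
    return arms
```

Because every arm sums to the same number of steps, a difference in benchmark score between `curriculum` and `constant-32` cannot be attributed to extra training under this proxy.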

Circularity Check

0 steps flagged

No circularity; empirical curriculum results benchmarked against external autoregressive model

The paper reports an observed performance disparity between large- and small-block training, introduces block-size curriculum learning as a training schedule, and validates the resulting model via direct comparison to Qwen3-8B on standard benchmarks. No equations, fitted parameters presented as predictions, self-citation load-bearing premises, or uniqueness theorems appear in the provided text. The central claim rests on external empirical comparison rather than any self-referential derivation or renaming of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised fine-tuning assumptions plus the empirical observation that a curriculum schedule can bridge block-size effects; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)

block-size curriculum schedule
The specific sequence and timing of block-size increases during training is a hyperparameter choice that must be selected to obtain the reported performance.

axioms (1)

Domain assumption: block diffusion models are capable of long-CoT reasoning when trained with an appropriate granularity progression.
This is the key premise being validated by the curriculum experiments.

