arxiv: 2606.29150 · v1 · pith:EMMDOIDNnew · submitted 2026-06-28 · 💻 cs.AI

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

Pith reviewed 2026-06-30 07:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords discrete flow models · reasoning tasks · Sudoku puzzles · test-time scaling · fixed points · self-refinement · direct preference optimization · denoising dynamics

The pith

Correct answers are stable fixed points under discrete flow model denoising, so selecting only those that return to themselves after re-noising solves Sudoku and Zebra puzzles at 96-100%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discrete flow models solve only about 36% of Sudoku puzzles when used directly because they converge confidently to wrong answers. The paper observes that correct answers behave as stable fixed points of the denoising process: they return to themselves when re-noised and re-solved. This property alone lets the model act as its own verifier by generating many candidates and retaining only the stable ones, reaching roughly 100% accuracy on Sudoku-Shah, 95.9% on Zebra, and 96.1% on harder out-of-distribution Sudoku-Extreme puzzles. To reduce computation wasted on bad candidates, the authors train with a self-conditioning channel, which is closed into a feedback loop at inference so the model can refine its own earlier predictions, and apply direct preference optimization against the model's own failed outputs. These changes let the model reach 99.2% on Sudoku in seven forward passes, more than eight times fewer than the strongest matched masked-diffusion baseline needs for the same accuracy.

Core claim

A correct answer is a stable fixed point of the denoising dynamics, returning to itself when re-noised and re-solved. Selecting such stable candidates alone reaches high solve rates on Sudoku-Shah (~100%) and Zebra (95.9%), and generalizes to Sudoku-Extreme (96.1%) without training on that distribution. Training flow models with a self-conditioning channel and direct preference optimization improves the base model's efficiency to reach 99.2% on Sudoku in just 7 forward passes.

What carries the argument

The stable fixed point property of correct solutions under repeated re-noising and re-solving in discrete flow model denoising dynamics, serving as an internal verification signal for candidate selection.
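The selection rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `renoise`, `stability`, and `select_stable`, the uniform resampling of non-clue cells, the trial count, and the exact-match threshold are all hypothetical, and `toy_solve` is a stand-in for a trained denoiser that simply returns a fixed gold grid.

```python
import random
from typing import Callable, List, Sequence

Grid = List[int]
# Hypothetical denoiser interface: (noisy grid, clue mask) -> re-solved grid.
Solver = Callable[[Grid, Sequence[bool]], Grid]


def renoise(candidate: Sequence[int], clue_mask: Sequence[bool],
            rate: float, vocab: int, rng: random.Random) -> Grid:
    """Resample each non-clue cell uniformly with probability `rate`."""
    return [
        cell if is_clue or rng.random() >= rate else rng.randrange(vocab)
        for cell, is_clue in zip(candidate, clue_mask)
    ]


def stability(candidate: Sequence[int], clue_mask: Sequence[bool], solve: Solver,
              rate: float = 0.5, trials: int = 8, vocab: int = 9,
              seed: int = 0) -> float:
    """Fraction of re-noise/re-solve trials that return exactly to `candidate`."""
    rng = random.Random(seed)
    target = list(candidate)
    hits = 0
    for _ in range(trials):
        noisy = renoise(target, clue_mask, rate, vocab, rng)
        if solve(noisy, clue_mask) == target:
            hits += 1
    return hits / trials


def select_stable(candidates: Sequence[Grid], clue_mask: Sequence[bool],
                  solve: Solver, threshold: float = 1.0, **kwargs) -> List[Grid]:
    """Keep only candidates whose stability score meets the threshold."""
    return [c for c in candidates
            if stability(c, clue_mask, solve, **kwargs) >= threshold]


# Toy setup (assumption): the "model" always re-solves to the gold grid.
GOLD = [1, 2, 3, 4]
CLUES = [True, True, False, False]


def toy_solve(noisy: Grid, clue_mask: Sequence[bool]) -> Grid:
    return list(GOLD)


kept = select_stable([[1, 2, 4, 3], GOLD], CLUES, toy_solve, vocab=5)
```

In this toy setup `kept` contains only the gold grid, because the incorrect candidate never maps back to itself. With a real model the re-solve step is stochastic and imperfect, so the threshold and number of trials would need tuning.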

If this is right

  • Selecting only dynamically stable candidates produces ~100% solve rate on Sudoku-Shah and 95.9% on Zebra without additional training.
  • The same selection procedure generalizes to harder out-of-distribution puzzles such as Sudoku-Extreme at 96.1%.
  • Closing a self-conditioning channel at inference lets the model iteratively refine its own earlier predictions.
  • Direct preference optimization against the model's own failed generations reduces the number of forward passes needed for high accuracy.
  • Combining the trained model with stability-based selection solves hard puzzles with far fewer total steps than pure search or matched baselines.
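The self-refinement loop in the third point can be sketched as follows. This is a hedged illustration only: `refine` and `toy_model` are hypothetical names, the sketch feeds back hard predictions where the paper's appendix describes feeding back raw logits, and the toy model corrects one wrong cell per pass purely to make the loop visible.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[int]
# Hypothetical model interface: (puzzle, previous prediction or None) -> prediction.
Model = Callable[[Grid, Optional[Grid]], Grid]


def refine(model: Model, puzzle: Grid, max_passes: int = 7) -> Tuple[Grid, int]:
    """Feed each prediction back as self-conditioning until it stops changing."""
    prev: Optional[Grid] = None
    passes = 0
    for _ in range(max_passes):
        pred = model(puzzle, prev)
        passes += 1
        if pred == prev:  # fixed point reached: another pass changes nothing
            break
        prev = pred
    return pred, passes


# Toy model (assumption): the first pass guesses 1 for every blank (0),
# and each later pass repairs the first cell that disagrees with GOLD.
GOLD = [3, 1, 4, 2]


def toy_model(puzzle: Grid, prev: Optional[Grid]) -> Grid:
    if prev is None:
        return [cell if cell != 0 else 1 for cell in puzzle]
    out = list(prev)
    for i, (p, g) in enumerate(zip(prev, GOLD)):
        if p != g:
            out[i] = g
            break
    return out


final, used = refine(toy_model, [3, 0, 0, 0])
```

Here the loop stops once a pass reproduces the previous prediction, which mirrors the fixed-point reading of the paper: a grid that no longer changes under its own feedback is treated as converged.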

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stability selection could be layered on top of other search or sampling strategies to further cut the number of proposals examined.
  • The fixed-point test might apply to any discrete structured generation task where repeated perturbation and regeneration is cheap.
  • Preference optimization against self-generated errors could be extended to other signals such as partial correctness or step-wise consistency.
  • If stability proves a general proxy for solution quality, it could reduce dependence on external verifiers or reward models in reasoning systems.

Load-bearing premise

That an answer remaining unchanged after re-noising and re-solving reliably signals correctness rather than merely correlating with correctness on the tested puzzle distributions.

What would settle it

A collection of puzzles in which many incorrect answers remain unchanged after re-noising and re-solving, or many correct answers change, would show that stability does not track correctness.
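Such a check amounts to tabulating stability against ground-truth correctness. The sketch below shows one way to do that; the function name `audit_stability_filter`, the record format, and the example records are hypothetical and are not data from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple


def audit_stability_filter(
    records: Iterable[Tuple[bool, bool]],
) -> Dict[str, Optional[float]]:
    """Each record is (is_correct, is_stable) for one candidate answer.

    Returns the share of incorrect candidates that pass the filter
    (false positives) and the share of correct candidates it rejects
    (false negatives). A rate is None when its denominator is zero.
    """
    rows = list(records)
    n_wrong = sum(1 for correct, _ in rows if not correct)
    n_right = len(rows) - n_wrong
    stable_wrong = sum(1 for correct, stable in rows if stable and not correct)
    unstable_right = sum(1 for correct, stable in rows if correct and not stable)
    return {
        "false_positive_rate": stable_wrong / n_wrong if n_wrong else None,
        "false_negative_rate": unstable_right / n_right if n_right else None,
    }


# Made-up example records, for illustration only.
example = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
rates = audit_stability_filter(example)
```

A high false-positive rate on some puzzle family would be the direct evidence that stability does not track correctness there.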

Figures

Figures reproduced from arXiv: 2606.29150 by Alec Helbling, Andrey Bryutkin, Hendrik Strobelt, Mauro Martino, Nima Dehmamy.

Figure 1: Flow reasoning models find correct solutions at stable fixed points. (a) The dynamics of discrete flow models identify the correctness of samples: a correct solution sits in a stable basin and returns consistently to itself after perturbations, while error states occupy unstable basins that drift away when re-resolved. (b) We perform reasoning through iterative refinement with self-conditioning, where feed…
Figure 2: Self-conditioning refines one Sudoku attempt to the correct grid. From the same initial noise, each panel feeds back the previous prediction as self-conditioning; wrong cells (red) fall from 28 → 26 → 3 → 0. No restart or external verifier is used.
2 FLOW REASONING MODELS. A flow reasoning model is a conditional discrete flow language model run not as a one-shot sampler but as an iteration that carries memo…
Figure 3: Self-verification enables test-time scaling across tasks. (Left) Our FRM training scheme saturates SUDOKU (Shah) in a single pass, and test-time scaling allows even a poor base model to saturate the task after several rounds. (Center) On hard out-of-distribution SUDOKU-EXTREME data, our self-verification test-time-scaling algorithm solves ∼100% of the tasks; our FRM training regime further improves the eff…
Figure 4: Re-noising separates correct from incorrect cells. The candidate grid is re-noised to interior time t and re-solved; we read per-cell re-solution discrepancy ∆CE. (a) Correct cells stay near zero under large re-noising, while incorrect cells deviate steeply. (b) On a single grid, the discrepancy concentrates on the model’s wrong cells (red) and stays near zero elsewhere. Closing this feedback across Euler …
Figure 5: FLOWDPO deepens the correct basin against the model’s own confident mistakes. For one held-out Sudoku puzzle, the base model (left) places the gold grid (highlighted) as only a weak, dilute mode over many competing states, whereas FLOWDPO (right) collapses most mass onto it (gold share 26%→68%, distinct grids 139→75). This is an illustrative single-puzzle sketch; aggregate pass@1 gains are in [PITH_FULL_I…
Figure 6: A generation–verification gap, and closing it with training. (Left) Models are far better at distinguishing correct from incorrect states through renoise-CE than they are at naively generating correct solutions, as indicated by the high AUROC (≈1.0) against single-shot solve rates of only ∼11–41%. (Right) Training the models with FLOWDPO and self-conditioning helps close this gap, significantly raising one…
Figure 7: A relative objective deepens the correct basin. Single-shot solve rate (pass@1, mean ± SEM on held-out splits; 4 seeds) as the cumulative FLOWDPO ablation adds random preference pairs, wrong-cell support, hard negatives from the live model, and an EMA-pinned reference. Tasks are Sudoku (Shah et al., 2024), the evaluation-only out-of-distribution SUDOKU-EXTREME split, and Zebra (4 × 4). This is the 1024-ste…
Original abstract

Discrete flow models have recently shown promising performance on few-step text generation; however, when naively applied to structured reasoning tasks such as Sudoku and Zebra puzzles, they converge confidently to incorrect answers (solving only $\sim$36% of Sudoku puzzles). We introduce Flow Reasoning Models (FRMs), a training and test-time-scaling framework for structured reasoning with flow models. We make the observation that, despite their poor solve rate, flow models can act as their own verifiers. A correct answer is a stable fixed point of the denoising dynamics, returning to itself when re-noised and re-solved. This enables a test-time-scaling paradigm: propose many candidate solutions and keep those that are dynamically stable, which alone reaches high solve rates on Sudoku-Shah (~$100\%$) and Zebra ($95.9\%$). This even generalizes to harder out-of-distribution puzzles like Sudoku-Extreme ($96.1\%$), without ever training on that distribution. This pure search, however, wastes a great deal of computation generating incorrect candidate solutions. We therefore design a training recipe to improve the base model's efficiency. First, we train flow models with a self-conditioning channel and close it at inference, letting them refine their own past predictions. Second, we train models to avoid their own failed generations using direct preference optimization. These changes substantially improve the base model's efficiency, letting it reach $99.2\%$ on Sudoku in just $7$ forward passes, over $8\times$ fewer than the strongest matched masked-diffusion baseline we compare needs for the same accuracy. When combined with test-time scaling, this lets flow models solve hard out-of-distribution puzzles (e.g. Sudoku-Extreme) far more efficiently.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces Flow Reasoning Models (FRMs) as a training and test-time scaling framework for discrete flow models on structured reasoning tasks such as Sudoku and Zebra puzzles. It observes that correct answers form stable fixed points under the model's denoising dynamics (returning to themselves after re-noising and re-solving), enabling a pure search method that filters many candidate solutions to reach ~100% on Sudoku-Shah, 95.9% on Zebra, and 96.1% on OOD Sudoku-Extreme. It further proposes self-conditioning during training (closed at inference) and direct preference optimization to avoid failed generations, yielding efficiency gains such as 99.2% Sudoku accuracy in 7 forward passes (over 8x fewer than a matched masked-diffusion baseline).

Significance. If the stability-based filtering and efficiency improvements hold under broader conditions, the work provides a label-free test-time scaling approach for flow models on reasoning and demonstrates concrete efficiency advantages over baselines. The OOD generalization result is notable as an empirical measurement on held-out puzzles rather than a self-referential fit, and the absence of free parameters or invented entities in the core observation strengthens the claim.

major comments (3)
  1. [Abstract] Abstract, fixed-point observation paragraph: the central claim that selecting dynamically stable candidates identifies correct answers (enabling the reported ~100% and 95.9% solve rates) rests on the untested assumption that incorrect answers cannot also be stable fixed points of the same denoising dynamics; no experiments on false-positive rates or adversarial candidates are described, which is load-bearing for interpreting the filter as a correctness proxy rather than a distribution-specific correlate.
  2. [Abstract] Abstract, results on Sudoku-Extreme: the 96.1% OOD generalization is reported without error bars, variance across runs, or exact details on the candidate selection procedure (e.g., number of proposals, stability threshold, or re-noising schedule), undermining assessment of robustness to distribution shift as flagged in the soundness evaluation.
  3. [Abstract] Abstract, efficiency comparison: the claim of reaching 99.2% accuracy in 7 forward passes (over 8x fewer than the strongest matched masked-diffusion baseline) lacks specification of the baseline's exact configuration and whether the comparison controls for total compute or proposal count, which is load-bearing for the training-recipe contribution.
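As an illustration of the error bars requested in comment 2, a solve rate can be reported with a binomial confidence interval. The sketch below uses the Wilson score interval, which is one standard choice and is not something the paper states it uses; the function name `wilson_interval` and the counts (961 solved out of 1000) are hypothetical, since the evaluation set size behind the 96.1% figure is not given in the text above.

```python
import math
from typing import Tuple


def wilson_interval(solved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0 or not 0 <= solved <= total:
        raise ValueError("need total > 0 and 0 <= solved <= total")
    p = solved / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts, for illustration only.
low, high = wilson_interval(961, 1000)
```

The interval narrows roughly with the square root of the number of puzzles, so the same 96.1% would carry a much wider band on a test set of a hundred puzzles than on one of several thousand.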

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

  1. Referee: [Abstract] Abstract, fixed-point observation paragraph: the central claim that selecting dynamically stable candidates identifies correct answers (enabling the reported ~100% and 95.9% solve rates) rests on the untested assumption that incorrect answers cannot also be stable fixed points of the same denoising dynamics; no experiments on false-positive rates or adversarial candidates are described, which is load-bearing for interpreting the filter as a correctness proxy rather than a distribution-specific correlate.

    Authors: We agree that the manuscript does not include explicit experiments on false-positive rates for the stability filter or tests with adversarial candidates. The reported accuracies are empirical observations on the Sudoku and Zebra distributions (including OOD). We will add an explicit discussion of this limitation in the revised manuscript, including any feasible additional analysis of false positives. revision: yes

  2. Referee: [Abstract] Abstract, results on Sudoku-Extreme: the 96.1% OOD generalization is reported without error bars, variance across runs, or exact details on the candidate selection procedure (e.g., number of proposals, stability threshold, or re-noising schedule), undermining assessment of robustness to distribution shift as flagged in the soundness evaluation.

    Authors: We will revise the abstract and methods to report error bars, variance across runs, and full details of the candidate selection procedure, including the number of proposals, stability threshold, and re-noising schedule. revision: yes

  3. Referee: [Abstract] Abstract, efficiency comparison: the claim of reaching 99.2% accuracy in 7 forward passes (over 8x fewer than the strongest matched masked-diffusion baseline) lacks specification of the baseline's exact configuration and whether the comparison controls for total compute or proposal count, which is load-bearing for the training-recipe contribution.

    Authors: We will update the abstract and relevant sections to specify the exact configuration of the masked-diffusion baseline and confirm that the comparison controls for total compute and proposal count. revision: yes

Circularity Check

0 steps flagged

No circularity: core results are empirical measurements on held-out instances

The paper reports measured solve rates (e.g., ~100% on Sudoku-Shah, 95.9% on Zebra, 96.1% on Sudoku-Extreme) obtained by filtering candidates according to an observed dynamical property. The stability observation is presented as an empirical finding rather than a derived theorem, and the reported accuracies are direct counts on test puzzles, not quantities obtained by fitting parameters inside the method's own equations and then relabeling them as predictions. Training steps (self-conditioning channel, DPO) are standard preference and conditioning techniques whose efficiency gains are likewise measured empirically. No load-bearing self-citation chain or self-definitional reduction is visible in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that correct solutions are stable fixed points; no new mathematical axioms or invented physical entities are introduced. Free parameters are limited to standard training hyperparameters whose values are not reported in the abstract.

axioms (1)
  • domain assumption: Discrete flow models can be trained to produce structured outputs on grid puzzles.
    Background assumption required for the entire experimental setup; invoked implicitly throughout the abstract.

pith-pipeline@v0.9.1-grok · 5854 in / 1392 out tokens · 23804 ms · 2026-06-30T07:53:05.717459+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 32 canonical work pages · 20 internal anchors

  1. [1]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    URL https://arxiv.org/abs/2303.08797. Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, volume 32,

  2. [2]

    Nicholas M

    URL https://arxiv.org/abs/2402.18491. Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models,

  3. [3]

    org/abs/2406.07507

    URL https://arxiv.org/abs/2406.07507. Umberto Borso, Davide Paglieri, Jude Wells, and Tim Rocktäschel. Preference-based alignment of discrete diffusion models,

  4. [4]

    Ting Chen, Ruixiang Zhang, and Geoffrey Hinton

    URL https://arxiv.org/abs/2503.08295. Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning,

  5. [5]

    Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu

    URL https://arxiv.org/abs/2208.04202. Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling,

  6. [6]

    URL https://arxiv.org/abs/2604.11748. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems,

  7. [7]

    Training Verifiers to Solve Math Word Problems

    URL https://arxiv.org/abs/2110.14168. Jacob Fein-Ashley and Paria Rashidinejad. Solve the loop: Attractor models for language and reasoning,

  8. [8]

    Solve the Loop: Attractor Models for Language and Reasoning

    URL https://arxiv.org/abs/2605.12466. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach,

  9. [9]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    URL https://arxiv.org/abs/2502.05171. Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He. Elf: Embedded language flows,

  10. [10]

    ELF: Embedded Language Flows

    URL https://arxiv.org/abs/2605.10938. Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty,

  11. [11]

    Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen

    URL https://arxiv.org/abs/2502.18581. Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions,

  12. [12]

    Kakade, and Sitan Chen

    URL https://arxiv.org/abs/2502.06768. Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, and Sitan Chen. Fine-tuning masked diffusion for provable self-correction,

  13. [13]

    Fine-Tuning Masked Diffusion for Provable Self-Correction

    URL https://arxiv.org/abs/2510.01384. Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms,

  14. [14]

    Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    URL https://arxiv.org/abs/2406.18629. Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising,

  15. [15]

    Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    URL https://arxiv.org/abs/2602.16813. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step,

  16. [16]

    Let's Verify Step by Step

    URL https://arxiv.org/abs/2305.20050. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling,

  17. [17]

    Flow Matching for Generative Modeling

    URL https://arxiv.org/abs/2210.02747. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution,

  18. [18]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    URL https://arxiv.org/abs/2310.16834. Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, and Antonio Orvieto. Fixed-point reasoners: Stable and adaptive deep looped transformers,

  19. [19]

    Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

    URL https://arxiv.org/abs/2606.18206. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models,

  20. [20]

    Large Language Diffusion Models

    URL https://arxiv.org/abs/2502.09992. Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, and Jakiw Pidstrigach. Itô maps for any-step sdes,

  21. [21]

    Itô maps for any-step SDEs

    URL https://arxiv.org/abs/2606.11156. Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization,

  22. [22]

    Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J

    URL https://arxiv.org/abs/2404.19733. Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, and Dmitry Krotov. Memorization to generalization: Emergence of diffusion models from associative memory,

  23. [23]

    Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S

    URL https://arxiv.org/abs/2505.21777. Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S. Albergo. Discrete flow maps,

  24. [24]

    Discrete Flow Maps

    URL https://arxiv.org/abs/2604.09784. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model,

  25. [25]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    URL https://arxiv.org/abs/2305.18290. Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, and Jian Pu. Reasoning on the manifold: Bidirectional consistency for self-verification in diffusion language models,

  26. [26]

    Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

    URL https://arxiv.org/abs/2604.16565. Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models,

  27. [27]

    Chiu, Alexander Rush, and Volodymyr Kuleshov

    URL https://arxiv.org/abs/2406.07524. Kulin Shah, Nishanth Dikkala, Xin Wang, and Rina Panigrahy. Causal language modeling can elicit search and reasoning capabilities on logic puzzles,

  28. [28]

    Cyclic Denoising Reveals Ultrastable Memories in Diffusion Models

    URL https://arxiv.org/abs/2606.24000. Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data,

  29. [29]

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik

    URL https://arxiv.org/abs/2404.14367. Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization,

  30. [30]

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov

    URL https://arxiv.org/abs/2311.12908. Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling,

  31. [31]

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li

    URL https://arxiv.org/abs/2404.11999. Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models,

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    URL https://arxiv.org/abs/2505.19223. A SELF-CONDITIONING: TRAINING, REPRESENTATION, AND STABILITY. This appendix collects the self-conditioning details deferred from Section 2.1. The channel and its training. Self-conditioning adds a single zero-initialized input channel that carries the model’s own previous-pass raw logits $s = \ell_{\mathrm{prev}}$, embedded and added to the sta...

  33. [33]

    D RENOISE-CE SELECTION AFTER FLOWDPO. Table 4 shows the vote-versus-renoise-CE contrast for the base model

    The renoise-CE again selects the correct grid almost always while the vote lags. $n = 128$ puzzles. D RENOISE-CE SELECTION AFTER FLOWDPO. Table 4 shows the vote-versus-renoise-CE contrast for the base model. Table 5 repeats the measurement for the FLOWDPO model, each evaluated at its own (smaller) saturation pool $N^*$: the same contrast holds, in that the renoise-CE ...

  34. [34]

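    The preference loss and the denoising log-score. FLOWDPO is the direct preference loss (Rafailov et al., 2024), a logistic contrast on pairs $(y^+, y^-)$ over the denoising log-score ratio to a reference model, $$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(y^+, y^-)}\left[\log \sigma\left(\beta\left[\log\frac{\pi_\theta(y^+)}{\pi_{\theta_{\mathrm{ref}}}(y^+)} - \log\frac{\pi_\theta(y^-)}{\pi_{\theta_{\mathrm{ref}}}(y^-)}\right]\right)\right], \qquad (11)$$ with the $\log \pi$ slot instantiated by the train-matched categorical denoisin...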

  35. [35]

    The default mask is the answer-position mask $M$

    and no absorbing-state ELBO reweighting as in masked-diffusion preference methods (Zhu et al., 2025; Borso et al., 2025). The default mask is the answer-position mask $M$. In the wrong-cells objective used for the reported ablations, we swap in the gold-supervised mask $W_i = M_i\,\mathbf{1}[y^-_i \neq y^+_i]$, corrupt the mined negative to $x^-_t := x_t(y^-; \varepsilon)$ with clue cells c...

  36. [36]

    restricted to the decisive cells $W$. The $(1 - \sigma(\beta\,\widehat{\Delta}_{t,\varepsilon}))$ prefactor gates the update by hardness, vanishing for negatives the model already deems unlikely and largest for the self-mined confident mistakes it ranks near gold (Section 3). Total objective and the optional anchor. Writing $\mathcal{L}_{\mathrm{FlowDPO}}$ for the averaged multi-negative contrast of Eq. (14), the full ob...
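
The preference loss of Eq. (11) and the hardness gate described above can be sketched for a single pair. This is a minimal scalar illustration under stated assumptions, not the paper's FLOWDPO objective: the function name `dpo_pair` is hypothetical, the log-scores are passed in as plain numbers where the paper uses a masked denoising log-score, and the value of `beta` is arbitrary.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair(logp_pos: float, logp_neg: float,
             ref_pos: float, ref_neg: float,
             beta: float = 0.1) -> Tuple[float, float]:
    """Return (loss, gate) for one preference pair (y+, y-).

    margin = beta * [(log pi(y+) - log pi_ref(y+)) - (log pi(y-) - log pi_ref(y-))]
    loss   = -log sigmoid(margin)
    gate   = 1 - sigmoid(margin), the factor that scales the gradient.
    """
    margin = beta * ((logp_pos - ref_pos) - (logp_neg - ref_neg))
    loss = -log_sigmoid(margin)
    gate = sigmoid(-margin)  # identical to 1 - sigmoid(margin)
    return loss, gate


# Policy equal to the reference: zero margin, loss log(2), gate 0.5.
loss_equal, gate_equal = dpo_pair(-5.0, -5.0, -5.0, -5.0)
# Negative already ranked far below the positive: the gate is nearly zero.
loss_easy, gate_easy = dpo_pair(-1.0, -50.0, -10.0, -10.0, beta=1.0)
# Confident mistake ranked above the gold answer: the gate is close to one.
loss_hard, gate_hard = dpo_pair(-50.0, -1.0, -10.0, -10.0, beta=1.0)
```

The gate is the derivative of the loss with respect to the margin, up to sign, which is why pairs the model already separates contribute almost nothing while confident mistakes dominate the update.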