pith. machine review for the scientific record.

arxiv: 2605.04291 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: 2 theorem links


Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords discrete diffusion · Glauber dynamics · pretrained language models · energy functions · text generation · zero-shot reasoning · planning tasks

The pith

Pretrained language models can serve directly as energy functions to guide Glauber dynamics sampling in discrete text diffusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that, instead of training a discrete text diffusion model whose Glauber-dynamics forward process uses a uniform transition kernel, one can define the dynamics' stationary distribution through an energy function drawn from an existing pretrained causal or masked language model. Interpreted this way, the pretrained model substantially improves the quality of the generated text over training from scratch with a uniform kernel. With UL2 as the energy function, the approach exceeds earlier diffusion language models and reaches parity with autoregressive models of similar size on standard generation metrics. The same models also match or surpass both diffusion and autoregressive baselines on zero-shot common-sense reasoning and on structured planning tasks such as Sudoku and Zebra puzzles.

Core claim

Instead of training a diffusion model whose forward process uses a uniform kernel, the authors treat a pretrained language model as an energy function whose associated Boltzmann distribution becomes the target stationary distribution for Glauber dynamics. Sampling then proceeds by repeatedly proposing single-token flips whose acceptance probabilities are computed from the pretrained model's logits, allowing the diffusion pipeline to inherit the pretrained model's knowledge without additional architectural changes or retraining of the energy function.
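To make the sampling loop concrete, here is a minimal sketch of single-token-flip Glauber/Metropolis sampling against an energy of the form E(x) = −log p_LM(x). The toy vocabulary, sequence length, step count, and `toy_energy` function are illustrative assumptions, not the paper's implementation; with a real frozen LM, each proposal would require a forward pass to re-score the sequence.

```python
import math
import random

# Sketch only: single-site Glauber/Metropolis sampling over token sequences.
# In the paper's setting the energy would be E(x) = -log p_LM(x) from a frozen
# pretrained model (e.g. UL2); `toy_energy` below is an illustrative stand-in.

VOCAB = list(range(50))   # toy vocabulary of 50 token ids (assumption)
SEQ_LEN = 16              # toy sequence length (assumption)

def toy_energy(x):
    """Stand-in for -log p_LM(x): low energy when token ids increase by 1."""
    return 0.1 * sum(abs(x[i + 1] - x[i] - 1) for i in range(len(x) - 1))

def glauber_step(x, energy, temperature=1.0):
    """Propose a single-token flip and accept it with the Metropolis rule."""
    k = random.randrange(len(x))          # position to resample
    proposal = list(x)
    proposal[k] = random.choice(VOCAB)    # proposed replacement token
    delta = energy(proposal) - energy(x)  # energy change caused by the flip
    if random.random() < min(1.0, math.exp(-delta / temperature)):
        return proposal                   # accept the flip
    return x                              # reject: keep the current sequence

def sample(energy, n_steps=2000):
    """Run the chain from a random sequence for a fixed number of flips."""
    x = [random.choice(VOCAB) for _ in range(SEQ_LEN)]
    for _ in range(n_steps):
        x = glauber_step(x, energy)
    return x

if __name__ == "__main__":
    print(sample(toy_energy))
```

With a real pretrained LM in place of `toy_energy`, every proposal costs a full forward pass to re-score the sequence, which is exactly the efficiency and mixing concern raised in the referee report below.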

What carries the argument

Glauber dynamics whose acceptance probabilities are computed from a pretrained language model viewed as an energy function that defines the stationary distribution of the diffusion process.

If this is right

  • Diffusion language models built this way outperform prior diffusion-based language models on text generation.
  • Performance becomes competitive with autoregressive models of comparable parameter count without requiring autoregressive decoding at inference time.
  • The resulting models achieve strong zero-shot results on common-sense reasoning and on combinatorial planning tasks such as Sudoku and Zebra puzzles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method suggests a general route for injecting pretrained knowledge into any discrete diffusion sampler whose stationary distribution can be expressed as an energy function.
  • Because the pretrained model is used off-the-shelf, the approach may lower the total training cost of high-quality discrete generative models relative to training both the energy and the diffusion dynamics from random initialization.
  • The same energy-function construction could be tested on other discrete sequence domains where large pretrained models already exist, such as protein sequences or source code.

Load-bearing premise

A pretrained language model can be used without modification as an energy function whose Boltzmann distribution, taken as the stationary distribution of the Glauber dynamics, yields higher-quality samples than a uniform-kernel baseline.

What would settle it

If the same Glauber dynamics pipeline run with a uniform or randomly initialized energy function produced text whose perplexity and downstream-benchmark scores matched or exceeded those of the pretrained-model version, the claimed benefit would be falsified.
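Read against the sketch after the core claim, the control is easy to state: a flat energy makes every Metropolis proposal accepted, so the pipeline collapses to a uniform random walk over sequences, and the paper's claim requires that walk to score worse. The snippet below assumes the `sample` and `toy_energy` helpers defined in that sketch.

```python
# Falsification control (assumes the `sample` sketch above): a constant energy
# gives delta = 0 for every flip, so every proposal is accepted and the chain
# reduces to a uniform random walk over token sequences.
def flat_energy(x):
    return 0.0

control_text = sample(flat_energy)   # compare against sample(toy_energy)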

original abstract

We present a discrete diffusion-based language model using Glauber dynamics from statistical physics. Our main insight is that instead of trying to train a discrete state space diffusion model using Glauber dynamics with a uniform transition kernel as the forward process, one can set up an "energy function" based on pretrained causal/masked language models. When viewed as the stationary distribution, this energy function allows us to significantly improve the quality of the generated text. Incorporating UL2 as the pretrained model into our diffusion pipeline, we outperform prior diffusion based LMs and perform competitively with autoregressive models of comparable model sizes. Furthermore, our models are competitive with or outperform prior diffusion models and GPT-2 style auto-regressive models on zero-shot common sense reasoning tasks as well as planning and search tasks like Sudoku and Zebra puzzles.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a discrete diffusion model for text that uses Glauber dynamics with an energy function derived from a frozen pretrained language model (e.g., UL2) to define the target stationary distribution. The central claim is that this yields higher-quality text generation than prior diffusion LMs, competitive performance with autoregressive models of similar size, and strong results on zero-shot commonsense reasoning plus planning/search tasks such as Sudoku and Zebra puzzles.

Significance. If the core assumption holds and the reported gains are reproducible, the work would offer a practical route to incorporate large pretrained LMs into diffusion pipelines without retraining, potentially improving non-autoregressive generation on tasks that benefit from global consistency.

major comments (3)
  1. [Method / Experiments] The central modeling assumption—that a frozen causal or masked LM can be plugged in directly as E(x) = −log p_LM(x) so that Glauber dynamics (Metropolis acceptance on single-token flips) samples from the LM distribution—receives no explicit empirical check. No experiment compares the empirical likelihood of finite-length Glauber trajectories against ancestral samples from the same UL2 model, nor demonstrates that the claimed performance improvements survive when the number of dynamics steps is varied.
  2. [Method] For causal LMs the energy function is asymmetric: flipping a token changes the conditioning context for all subsequent tokens, making each energy evaluation O(sequence length) and raising the possibility of slow mixing. The manuscript provides neither a mixing-time analysis nor an ablation that isolates the effect of this asymmetry on generation quality.
  3. [Experiments] The experimental section reports outperformance over prior diffusion LMs and competitiveness with GPT-2-scale autoregressive models, yet supplies no ablation that replaces the pretrained-LM energy with a uniform-kernel baseline while keeping all other components fixed. Without this control it is impossible to attribute gains specifically to the energy-function construction rather than other modeling choices.
minor comments (2)
  1. [Abstract] The abstract asserts performance gains without quoting any numerical metrics, baseline names, or error bars; a one-sentence summary of the key numbers would improve readability.
  2. [Method] Notation for the Glauber transition kernel, the precise Metropolis acceptance probability, and the temperature schedule should be stated explicitly (ideally with an equation) rather than left implicit; one plausible reconstruction from the abstract is sketched below.
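For what it is worth, one plausible form of the missing equations, reconstructed from the abstract's p(x) ∝ e^{−f(x)} and the acceptance rule quoted in the Lean-theorem section below; the temperature T and the single-site conditional are assumptions, not the paper's stated schedule.

```latex
% Plausible reconstruction (assumption): stationary target and single-site kernel
p_T(x) \;\propto\; \exp\!\bigl(-f(x)/T\bigr),
\qquad f(x) \;=\; -\log p_{\mathrm{LM}}(x),
\qquad
p\bigl(x_k=\sigma \mid x_{\setminus k}\bigr)
  \;=\; \min\!\Bigl\{1,\ \exp\!\bigl(-\tfrac{1}{T}\bigl[f(x_{\setminus k},\sigma)-f(x)\bigr]\bigr)\Bigr\}.
```

At T = 1 this reduces to the acceptance probability quoted from the paper further down the page.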

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, proposing revisions to strengthen the manuscript where the concerns are valid and providing clarifications on methodological choices.

point-by-point responses
  1. Referee: [Method / Experiments] The central modeling assumption—that a frozen causal or masked LM can be plugged in directly as E(x) = −log p_LM(x) so that Glauber dynamics (Metropolis acceptance on single-token flips) samples from the LM distribution—receives no explicit empirical check. No experiment compares the empirical likelihood of finite-length Glauber trajectories against ancestral samples from the same UL2 model, nor demonstrates that the claimed performance improvements survive when the number of dynamics steps is varied.

    Authors: We acknowledge that a direct empirical verification of finite-time sampling behavior would strengthen the paper. The Metropolis-Hastings theorem guarantees convergence to the target distribution in the limit, but we did not report explicit checks against ancestral sampling or step-count ablations in the original submission. In the revision we will add (i) a comparison of average UL2 log-likelihoods for Glauber-generated sequences versus direct ancestral samples from the same model and (ii) an ablation varying the number of dynamics steps to show that downstream metrics improve and stabilize once sufficient steps are taken. These results will be included in the updated experimental section; a minimal sketch of the comparison appears after these responses. revision: yes

  2. Referee: [Method] For causal LMs the energy function is asymmetric: flipping a token changes the conditioning context for all subsequent tokens, making each energy evaluation O(sequence length) and raising the possibility of slow mixing. The manuscript provides neither a mixing-time analysis nor an ablation that isolates the effect of this asymmetry on generation quality.

    Authors: The asymmetry for strictly causal factorizations is a legitimate concern, since a token flip alters the conditional probabilities of all later positions. Our primary results use UL2 in its masked configuration, for which the energy is symmetric. We will add an ablation that compares generation quality when the same backbone is used in causal versus masked energy modes. A full theoretical mixing-time analysis, however, lies outside the scope of the present work; we instead rely on consistent empirical performance across multiple benchmarks as practical evidence of usability. revision: partial

  3. Referee: [Experiments] The experimental section reports outperformance over prior diffusion LMs and competitiveness with GPT-2-scale autoregressive models, yet supplies no ablation that replaces the pretrained-LM energy with a uniform-kernel baseline while keeping all other components fixed. Without this control it is impossible to attribute gains specifically to the energy-function construction rather than other modeling choices.

    Authors: We agree that a direct control isolating the pretrained energy is desirable. While comparisons to prior diffusion models (which rely on uniform or learned kernels without an external energy) provide supporting context, we will add an explicit ablation in which the energy function is replaced by a constant, yielding an unbiased random-walk baseline. We expect this control to produce markedly lower-quality text, thereby confirming that the observed gains derive from the pretrained-LM energy. The new results will appear in the revised experimental section. revision: yes
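A minimal sketch of the comparison promised in response 1, continuing the toy sampler from the core-claim section; `lm_log_prob` (sequence log-likelihood under the frozen model) and `ancestral_sample` (a direct sample from the same model) are hypothetical helpers standing in for whatever interface the pretrained LM exposes.

```python
def mean_log_prob(sequences, lm_log_prob):
    """Average sequence log-likelihood under the frozen LM."""
    return sum(lm_log_prob(x) for x in sequences) / len(sequences)

def likelihood_check(lm_log_prob, ancestral_sample, n=256, step_grid=(100, 500, 2000)):
    """Compare Glauber samples against ancestral samples from the same LM,
    sweeping the number of dynamics steps (helpers are hypothetical)."""
    def energy(x):
        return -lm_log_prob(x)            # E(x) = -log p_LM(x)
    baseline = mean_log_prob([ancestral_sample() for _ in range(n)], lm_log_prob)
    for steps in step_grid:
        glauber = [sample(energy, n_steps=steps) for _ in range(n)]
        print(f"steps={steps}: glauber={mean_log_prob(glauber, lm_log_prob):.2f} "
              f"vs ancestral={baseline:.2f}")
```

If the Glauber averages approach the ancestral baseline as the step budget grows, the finite-time sampling assumption is supported; if they plateau well below it, the referee's concern stands.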

standing simulated objections (1 unresolved)
  • A rigorous theoretical mixing-time analysis for Glauber dynamics under asymmetric causal language-model energy functions.

Circularity Check

0 steps flagged

No circularity: external pretrained LMs supply the energy function

full rationale

The paper's core construction sets E(x) = -log p_LM(x) using a frozen external model (UL2) as the stationary distribution for Glauber dynamics, then runs the diffusion pipeline empirically. No parameter is fitted inside the paper and then renamed as a prediction; no self-citation supplies a uniqueness theorem or ansatz that the current work depends on; the reported gains are measured against external baselines rather than being forced by internal definitions. The derivation chain therefore remains open to external data and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; limited visibility into exact formulation. The approach rests on standard statistical-physics assumptions applied to language modeling.

axioms (2)
  • domain assumption: A pretrained language model defines a valid energy function whose Boltzmann distribution can serve as the stationary distribution for Glauber dynamics on text sequences.
    This is the central insight stated in the abstract.
  • standard math: Glauber dynamics with the chosen energy function produces samples from the desired distribution.
    Standard result from statistical mechanics, invoked without proof in the abstract; a one-line detailed-balance check is sketched below.
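The detailed-balance argument behind the second axiom, sketched under the assumption that the kernel is the Metropolis form given earlier (temperature 1): the Boltzmann distribution π(x) ∝ e^{−f(x)} is invariant because both ends of the chain of equalities below are symmetric in x and x′; irreducibility of single-token flips plus the self-loops ("lazy chain") quoted further down then give convergence in the limit of many steps.

```latex
% Detailed balance for a symmetric single-token proposal q and Metropolis acceptance:
\pi(x)\,q(x\!\to\!x')\,\min\!\bigl\{1,\ e^{-(f(x')-f(x))}\bigr\}
  \;=\; \frac{q(x\!\to\!x')}{Z}\,\min\!\bigl\{e^{-f(x)},\ e^{-f(x')}\bigr\}
  \;=\; \pi(x')\,q(x'\!\to\!x)\,\min\!\bigl\{1,\ e^{-(f(x)-f(x'))}\bigr\}.
```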

pith-pipeline@v0.9.0 · 5429 in / 1456 out tokens · 109383 ms · 2026-05-08T17:53:47.613655+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost (Jcost) washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Glauber dynamics... typically expected to shine in the presence of an 'energy function', i.e., sampling from p(x) ∝ e^{−f(x)} for some f operating on our discrete domain of interest. We propose to use pretrained LMs (such as AR or masked LMs) as our energy function.

  • IndisputableMonolith.Foundation.Atomicity atomic_tick unclear

    Relation between the paper passage and the cited Recognition theorem.

    p(x_k = σ | x_{∖k}) = min{1, exp(−f(x_{∖k}, σ) + f(x))}, with self-loops to ensure lazy chains.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
