Structured Inference with Large Language Gibbs

Esmeralda S. Whitammer; Henry Gouk; Sanghyeok Choi

arxiv: 2606.19264 · v1 · pith:M4ABGPUJnew · submitted 2026-06-17 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3

keywords: large language models, MCMC, Gibbs sampling, structured inference, probabilistic reasoning, autoregressive generation, Bayesian structure learning

The pith

Iterative resampling of variables with LLM next-token conditionals produces a stationary distribution that compromises across all local conditionals without order bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Large Language Gibbs as a method for structured probabilistic inference that treats an LLM's next-token conditionals as transition operators in an MCMC sampler. Instead of generating an entire structured object in one autoregressive pass, the approach repeatedly resamples each variable conditioned on the current values of the others. This process avoids the order-dependent biases that arise in single-pass generation. The resulting stationary distribution is presented as a compromise among the full set of local conditionals. Demonstrations on synthetic sampling, consistent reasoning, and Bayesian structure learning indicate that the method offers a practical route to coherent probabilistic use of LLM knowledge.

Core claim

Large Language Gibbs iteratively resamples individual variables conditioned on others using an LLM's next-token conditionals as MCMC transition operators. This produces a stationary distribution that reflects a compromise between all local conditionals and avoids order-dependent biases of single-pass autoregressive generation. The method is applied to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning, suggesting it serves as a practical alternative to one-pass generation for structured inference under a world prior accessible through noisy LLM conditionals.
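
The resampling loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_conditional` is a hand-written stand-in for the LLM's conditional (a real system would query a model), and the variable names, probabilities, and the random update order within each sweep are assumptions made for the example.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interface: given a variable name and the current values of the
# other variables, return a dict mapping candidate values to probabilities.
Conditional = Callable[[str, Dict[str, str]], Dict[str, float]]


def toy_conditional(var: str, others: Dict[str, str]) -> Dict[str, float]:
    """Stand-in for an LLM conditional over two made-up cat attributes."""
    if var == "size":
        if others["age"] == "kitten":
            return {"small": 0.9, "large": 0.1}
        return {"small": 0.3, "large": 0.7}
    if var == "age":
        if others["size"] == "small":
            return {"kitten": 0.8, "adult": 0.2}
        return {"kitten": 0.1, "adult": 0.9}
    raise KeyError(var)


def gibbs_sweeps(
    init: Dict[str, str],
    conditional: Conditional,
    n_sweeps: int,
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Resample every variable once per sweep, conditioned on the others."""
    state = dict(init)          # copy so the caller's dict is not mutated
    names = list(state)
    trace: List[Dict[str, str]] = []
    for _ in range(n_sweeps):
        rng.shuffle(names)      # assumption: random update order per sweep
        for var in names:
            others = {k: v for k, v in state.items() if k != var}
            probs = conditional(var, others)
            values = list(probs)
            weights = [probs[v] for v in values]
            state[var] = rng.choices(values, weights=weights, k=1)[0]
        trace.append(dict(state))
    return trace


if __name__ == "__main__":
    start = {"size": "small", "age": "kitten"}
    for sample in gibbs_sweeps(start, toy_conditional, 5, random.Random(0)):
        print(sample)
```

Swapping `toy_conditional` for a function that reads next-token probabilities from a model would leave the loop itself unchanged, which is the sense in which the conditionals act as transition operators.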

What carries the argument

Large Language Gibbs, which deploys LLM next-token conditional distributions directly as MCMC transition operators for iterative per-variable resampling.

If this is right

Sampling from synthetic distributions can proceed without dependence on variable ordering.
Consistent reasoning tasks obtain outputs that average across conflicting local conditionals.
Bayesian structure learning becomes feasible by treating LLM conditionals as a prior and running the chain to stationarity.
Structured probabilistic inference under noisy LLM conditionals is achievable as an alternative to one-pass autoregressive generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same iterative scheme could be tested on tasks requiring global consistency over many interdependent variables where single-pass generation commonly fails.
Convergence diagnostics from standard MCMC might be applied to decide when the chain has reached a useful compromise distribution.
The approach opens a route to combining multiple LLMs by mixing their conditionals within the same transition kernel.
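
As one concrete form the convergence-diagnostic suggestion could take, the sketch below computes the Gelman-Rubin potential scale reduction factor for several parallel chains. Nothing in the source says the authors use this diagnostic. The function name `gelman_rubin` is hypothetical, and the sketch assumes each chain has been reduced to a scalar summary per iteration (for a categorical variable, a 0/1 indicator that it takes a given value).

```python
from statistics import mean, variance
from typing import Sequence


def gelman_rubin(chains: Sequence[Sequence[float]]) -> float:
    """Potential scale reduction factor for m chains of equal length n."""
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])    # W: mean within-chain variance
    if within == 0:
        raise ValueError("within-chain variance is zero; the ratio is undefined")
    between = n * variance(chain_means)             # B: between-chain variance
    pooled = (n - 1) / n * within + between / n     # pooled variance estimate
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    # Indicator traces from two hypothetical chains that disagree.
    print(gelman_rubin([[0, 0, 0, 1], [1, 1, 1, 0]]))
```

Values close to 1 are conventionally read as consistent with the chains having mixed, while values well above 1 suggest they have not yet agreed. The diagnostic says nothing about whether the distribution they agree on is a useful one.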

Load-bearing premise

LLM next-token conditional distributions can function as valid MCMC transition operators whose repeated application converges to a stationary distribution that meaningfully compromises across the local conditionals.

What would settle it

Apply the sampler to a known joint distribution whose full conditionals are supplied to the LLM, then check whether the long-run empirical frequencies match the expected compromise distribution rather than any single conditional or an order-dependent product.
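
The idealized version of this check can be run without any model. In the sketch below the LLM is replaced by exact full conditionals derived from a small, made-up joint over two binary variables (`JOINT` and all function names are assumptions for illustration). Because these conditionals are compatible by construction, the chain's long-run frequencies should approach the joint, which is the baseline a real LLM-backed run would be compared against.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

State = Tuple[int, int]

# Made-up target joint over (x0, x1); any strictly positive table would do.
JOINT: Dict[State, float] = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def full_conditional(joint: Dict[State, float], index: int, state: List[int]) -> List[float]:
    """Exact p(x_index = v | other coordinate) for v in (0, 1)."""
    weights = []
    for v in (0, 1):
        candidate = list(state)
        candidate[index] = v
        weights.append(joint[(candidate[0], candidate[1])])
    total = sum(weights)
    return [w / total for w in weights]


def run_chain(joint: Dict[State, float], n_sweeps: int, rng: random.Random) -> Dict[State, float]:
    """Systematic-scan Gibbs sampler; returns empirical state frequencies."""
    state = [0, 0]
    counts: Counter = Counter()
    for _ in range(n_sweeps):
        for index in (0, 1):
            probs = full_conditional(joint, index, state)
            state[index] = rng.choices((0, 1), weights=probs, k=1)[0]
        counts[(state[0], state[1])] += 1
    return {s: counts[s] / n_sweeps for s in joint}


def total_variation(p: Dict[State, float], q: Dict[State, float]) -> float:
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)


if __name__ == "__main__":
    empirical = run_chain(JOINT, 20000, random.Random(0))
    print(empirical, total_variation(empirical, JOINT))
```

With an LLM in the loop, `full_conditional` would be replaced by the model's answer to a prompt stating the same conditionals, and a total variation distance that stays large as the chain lengthens would count against the claim.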

Figures

Figures reproduced from arXiv: 2606.19264 by Esmeralda S. Whitammer, Henry Gouk, Sanghyeok Choi.

**Figure 1.** Above: Several iterations of large language Gibbs sampling for four variables jointly describing a cat. The initial values are sampled autoregressively, then updates are performed by resampling one variable at a time from the LLM’s conditional distribution given the others, serialised in random order (§3.1). Metadata describing the variables can be used to impose constraints on the resampling. Below: Appli…

**Figure 2.** Empirical distribution of generated samples from …

**Figure 3.** Empirical distribution of generated samples from …

**Figure 4.** Bayesian structure learning results on four BnRep datasets using Llama-3.1-8B (3 seeds).

**Figure 5.** Empirical distribution of generated samples from …

**Figure 6.** Bayesian structure learning results on four BnRep datasets using OLMo-3-32B (3 seeds).

**Figure 7.** A failure case of Bayesian structure learning with synthetic data generated by large language Gibbs, cf. …

Original abstract

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM Gibbs treats next-token conditionals as MCMC kernels for variable resampling, which is new, but the stationary distribution claim has no supporting argument.

The core idea here is to run a Gibbs-style loop where each variable is resampled from an LLM's next-token distribution conditioned on the current values of the others. That is a concrete departure from single-pass autoregressive sampling and avoids the obvious order bias.

They evaluate on synthetic distributions with known targets, on consistent reasoning tasks, and on Bayesian structure learning. The results indicate that the iterative resampling can produce outputs that better respect joint constraints than direct generation does.

The main gap is exactly the one flagged in the stress test. LLM conditionals are noisy and inconsistent in general, so there is no automatic guarantee that the Markov chain has a stationary distribution, let alone one that compromises across the locals. The abstract asserts this outcome but supplies no derivation, reversibility check, or ergodicity argument. Without that piece the central claim rests on assertion rather than evidence.

The paper is aimed at people who need coherent multi-variable reasoning from LLMs rather than fluent text. Readers already working on MCMC methods or on structured prompting will get the most from the experiments. It deserves peer review because the practical setup is clear and the experiments are relevant; referees can ask for the missing convergence analysis or for stronger empirical checks on whether the chain behaves as claimed despite the inconsistency.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Large Language Gibbs, an MCMC procedure that uses an LLM's next-token conditional distributions as transition kernels to iteratively resample individual variables in a structured object, conditioned on the current values of the others. The central claim is that this avoids the order-dependent biases of single-pass autoregressive generation and converges to a stationary distribution that represents a compromise across the local conditionals. The method is applied to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning.

Significance. If the procedure can be shown to possess a well-defined stationary distribution despite the generic inconsistency of LLM conditionals, the approach would supply a practical alternative to direct generation for probabilistic inference under an LLM world model. The empirical sections indicate that the method is implementable and yields reasonable samples on the chosen tasks, but the absence of any convergence argument makes the significance conditional on resolving the theoretical gap.

major comments (2)

[Abstract, paragraph 2] The claim that iterated application of the LLM next-token conditionals "produces a stationary distribution that reflects a compromise between all local conditionals" is stated without derivation, without conditions guaranteeing existence or uniqueness of the stationary measure, and without discussion of reversibility or ergodicity. Standard Gibbs sampling requires the family of conditionals to be compatible with a joint; no argument is supplied that LLM conditionals meet this requirement or that the induced kernel is ergodic.
[Method] Method description (inferred from abstract): treating each LLM conditional p_θ(var_i | all others) as a valid MCMC transition operator is load-bearing for the entire proposal, yet no proof or even heuristic argument is given that the resulting Markov chain converges or that its stationary distribution is independent of initialization when the conditionals are inconsistent.
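
The compatibility concern in these comments can be made concrete with a two-variable toy case. The sketch below is an illustration only, not an analysis of the paper's sampler: the tables `P_X_GIVEN_Y` and `P_Y_GIVEN_X` are invented and deliberately incompatible (their odds ratios are 21 and 36, so no joint has both as its conditionals). It propagates an exact distribution through systematic sweeps in the two possible update orders and compares the fixed points.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[int, int]                      # (x, y), both binary
Dist = Dict[State, float]
STATES: List[State] = list(product((0, 1), repeat=2))

# Invented, mutually incompatible conditionals.
P_X_GIVEN_Y = {0: [0.9, 0.1], 1: [0.3, 0.7]}  # P_X_GIVEN_Y[y][x]
P_Y_GIVEN_X = {0: [0.8, 0.2], 1: [0.1, 0.9]}  # P_Y_GIVEN_X[x][y]


def update_x(dist: Dist) -> Dist:
    """Push a distribution through one resampling of x given y."""
    out = {s: 0.0 for s in STATES}
    for (_, y), mass in dist.items():
        for new_x in (0, 1):
            out[(new_x, y)] += mass * P_X_GIVEN_Y[y][new_x]
    return out


def update_y(dist: Dist) -> Dist:
    """Push a distribution through one resampling of y given x."""
    out = {s: 0.0 for s in STATES}
    for (x, _), mass in dist.items():
        for new_y in (0, 1):
            out[(x, new_y)] += mass * P_Y_GIVEN_X[x][new_y]
    return out


def stationary(sweep: Sequence[Callable[[Dist], Dist]], n_sweeps: int = 500) -> Dist:
    """Iterate the sweep from the uniform distribution until it settles."""
    dist = {s: 0.25 for s in STATES}
    for _ in range(n_sweeps):
        for step in sweep:
            dist = step(dist)
    return dist


def total_variation(p: Dist, q: Dist) -> float:
    return 0.5 * sum(abs(p[s] - q[s]) for s in STATES)


if __name__ == "__main__":
    pi_xy = stationary([update_x, update_y])
    pi_yx = stationary([update_y, update_x])
    print(pi_xy, pi_yx, total_variation(pi_xy, pi_yx))
```

Under these assumed tables both sweeps do converge, but to distributions roughly 0.03 apart in total variation. In this toy case a stationary distribution exists for each scan order, yet which one is reached depends on the order of updates, which is the kind of behaviour a convergence argument for the paper's scheme would need to address.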

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for identifying the central theoretical gap in our manuscript. The comments correctly note that we provide no derivation or convergence argument for the stationary distribution under inconsistent LLM conditionals. We respond point-by-point below and will revise the paper to qualify claims and add discussion of this limitation.

Referee: [Abstract, paragraph 2] The claim that iterated application of the LLM next-token conditionals "produces a stationary distribution that reflects a compromise between all local conditionals" is stated without derivation, without conditions guaranteeing existence or uniqueness of the stationary measure, and without discussion of reversibility or ergodicity. Standard Gibbs sampling requires the family of conditionals to be compatible with a joint; no argument is supplied that LLM conditionals meet this requirement or that the induced kernel is ergodic.

Authors: We agree the claim is presented without formal derivation. The abstract statement reflects an empirical observation from our experiments rather than a proven result. In revision we will rephrase the abstract to describe the outcome as an observed compromise supported by the synthetic and reasoning tasks, and we will add a dedicated limitations paragraph in the method section acknowledging that LLM conditionals are generally incompatible and that standard Gibbs guarantees do not apply. A full proof of existence, uniqueness, or ergodicity is not supplied because none is currently available.
Revision: partial

Referee: [Method] Method description (inferred from abstract): treating each LLM conditional p_θ(var_i | all others) as a valid MCMC transition operator is load-bearing for the entire proposal, yet no proof or even heuristic argument is given that the resulting Markov chain converges or that its stationary distribution is independent of initialization when the conditionals are inconsistent.

Authors: The manuscript treats each next-token conditional as a valid transition kernel but supplies no convergence argument. We will insert a short heuristic discussion noting that each kernel is a proper conditional distribution and that the chain appears irreducible on the finite support used in our experiments; however, we will explicitly state that independence from initialization and uniqueness of the stationary measure are not guaranteed when conditionals are inconsistent. The revision will also reference the synthetic-distribution experiments as empirical evidence that mixing occurs in practice, while making clear this does not constitute a proof.
Revision: yes

standing simulated objections not resolved

A rigorous proof or set of sufficient conditions establishing existence, uniqueness, and ergodicity of a stationary distribution for the induced Markov chain when the LLM conditionals are mutually inconsistent.
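
For orientation, one standard sufficient condition can be written down in the simplest setting. This is textbook finite Markov chain theory, not a result from the paper, and it rests on assumptions the source does not verify for LLM conditionals: a finite state space, strictly positive conditionals, and a uniformly random choice of which variable to update. The symbols $q_i$, $d$, $K$, and $\pi$ are introduced here for illustration.

```latex
% Random-scan kernel built from d (possibly incompatible) conditionals q_i
% on a finite product space \mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_d.
\[
  K(x, x') \;=\; \frac{1}{d} \sum_{i=1}^{d}
      q_i\!\left(x'_i \mid x_{-i}\right)\,
      \mathbf{1}\!\left[x'_{-i} = x_{-i}\right].
\]

% If q_i(\cdot \mid x_{-i}) > 0 everywhere, every state is reachable in d steps:
\[
  K^{d}(x, x') \;\ge\; \frac{d!}{d^{d}}
      \prod_{i=1}^{d} \min_{z \in \mathcal{X}} q_i\!\left(x'_i \mid z_{-i}\right) \;>\; 0
  \qquad \text{for all } x, x' \in \mathcal{X}.
\]

% Hence K is primitive, so there is a unique \pi with
\[
  \pi K = \pi,
  \qquad
  \lim_{t \to \infty} K^{t}(x, \cdot) = \pi \quad \text{for every starting state } x.
\]

% Caveat: \pi(x_i \mid x_{-i}) = q_i(x_i \mid x_{-i}) for all i
% holds only if the q_i are compatible with a common joint.
```

Under these assumptions existence, uniqueness, and independence from initialization follow, but the result is silent on what $\pi$ looks like when the conditionals are incompatible, and it does not cover constrained supports where some conditional probabilities are zero.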

Circularity Check

0 steps flagged

No circularity: stationary distribution claim follows from MCMC construction without reduction to inputs

The paper's abstract states that iterative resampling with LLM next-token conditionals 'produces a stationary distribution that reflects a compromise between all local conditionals.' This is presented as a direct consequence of applying MCMC transition operators, without any equations, fitted parameters, or self-citations that would make the result equivalent to its inputs by construction. No load-bearing step matches the enumerated circularity patterns; the claim relies on standard MCMC theory applied to the proposed kernels rather than redefining or fitting the target quantity from the paper's own data or prior outputs. The assumption that LLM conditionals form valid kernels is a substantive (and potentially falsifiable) modeling choice, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the standard mathematical assumption that the chosen transition operators define a valid Markov chain with the desired stationary distribution; no free parameters or invented entities are mentioned.

axioms (1)

standard math: MCMC transition operators defined by conditional distributions converge to a stationary distribution under standard regularity conditions.
Invoked implicitly when the abstract states that the procedure produces a stationary distribution.

[1] [1]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...

work page doi:10.18653/v1/n19-1423 2019

[2] [2]

Large language

Domke, Justin , journal=. Large language

[3] [3]

Neural Information Processing Systems (NeurIPS) , year=

Structured denoising diffusion models in discrete state-spaces , author=. Neural Information Processing Systems (NeurIPS) , year=

[4] [4]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[5] [5]

Neural Information Processing Systems (NeurIPS) , year=

Chain-of-thought prompting elicits reasoning in large language models , author=. Neural Information Processing Systems (NeurIPS) , year=

[6] [6]

Neural Information Processing Systems (NeurIPS) , year=

Simplified and generalized masked diffusion for discrete data , author=. Neural Information Processing Systems (NeurIPS) , year=

[7] [7]

Neural Information Processing Systems (NeurIPS) , year=

Simple and effective masked diffusion language models , author=. Neural Information Processing Systems (NeurIPS) , year=

[8] [8]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[9] [9]

arXiv preprint arXiv:2512.13961 , year=

Olmo 3 , author=. arXiv preprint arXiv:2512.13961 , year=

Pith/arXiv arXiv

[10] [10]

BERT has a Mouth, and It Must Speak: BERT as a M arkov Random Field Language Model

Wang, Alex and Cho, Kyunghyun. BERT has a Mouth, and It Must Speak: BERT as a M arkov Random Field Language Model. Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 2019. doi:10.18653/v1/W19-2304

work page doi:10.18653/v1/w19-2304 2019

[11] [11]

Probing BERT ' s priors with serial reproduction chains

Yamakoshi, Takateru and Griffiths, Thomas and Hawkins, Robert. Probing BERT ' s priors with serial reproduction chains. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.314

work page doi:10.18653/v1/2022.findings-acl.314 2022

[12] [12]

International Conference on Learning Representations (ICLR) , year=

Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis--Hastings , author=. International Conference on Learning Representations (ICLR) , year=

[13] [13]

International Conference on Learning Representations (ICLR) , year=

Reasoning with sampling: Your base model is smarter than you think , author=. International Conference on Learning Representations (ICLR) , year=

[14] [14]

Neural Information Processing Systems (NeurIPS) , year=

QUEST: Quality-aware metropolis-hastings sampling for machine translation , author=. Neural Information Processing Systems (NeurIPS) , year=

[15] [15]

Flipping against all odds: Reducing

Xiao, Tim Z and Zenn, Johannes and Liu, Zhen and Liu, Weiyang and Bamler, Robert and Sch. Flipping against all odds: Reducing. arXiv preprint arXiv:2506.09998 , year=

Pith/arXiv arXiv

[16] [16]

Probabilistic inference in language models via twisted sequential

Zhao, Stephen and Brekelmans, Rob and Makhzani, Alireza and Grosse, Roger , journal=. Probabilistic inference in language models via twisted sequential

[17] [17]

Test , volume=

Distributions most nearly compatible with given families of conditional distributions , author=. Test , volume=. 1998 , publisher=

1998

[18] [18]

Australian Journal of Physics , volume=

Monte carlo calculations of the radial distribution functions for a proton? electron plasma , author=. Australian Journal of Physics , volume=. 1965 , publisher=

1965

[19] [19]

arXiv preprint arXiv:2604.06543 , year=

The Illusion of Stochasticity in LLMs , author=. arXiv preprint arXiv:2604.06543 , year=

Pith/arXiv arXiv

[20] [20]

International Conference on Learning Representations (ICLR) , year=

Amortizing intractable inference in large language models , author=. International Conference on Learning Representations (ICLR) , year=

[21] [21]

Aspen K Hopkins and Alex Renda and Michael Carbin , journal=. Can. 2023 , url=

2023

[22] [22]

science , volume=

Optimization by simulated annealing , author=. science , volume=. 1983 , publisher=

1983

[23] [23]

Uncertainty in Artificial Intelligence (UAI) , year=

Bayesian structure learning with generative flow networks , author=. Uncertainty in Artificial Intelligence (UAI) , year=

[24] Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning. 1995.

[25] Probabilistic graphical models: principles and techniques. 2009.

[26] bnRep: A repository of Bayesian networks from the academic literature. Neurocomputing. 2025.

[27] Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software.

[28] Large (Vision) Language Models are Unsupervised In-Context Learners. International Conference on Learning Representations (ICLR).

[29] Unsupervised elicitation of language models. arXiv preprint arXiv:2506.10139.

[30] Language models are few-shot learners. Neural Information Processing Systems (NeurIPS).

[31] Torroba Hennigen, Lucas and Kim, Yoon. Deriving Language Models from Masked Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.99

[32] Lu, Yao and Bartolo, Max and Moore, Alastair and Riedel, Sebastian and Stenetorp, Pontus. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.556

[33] Recovering Mental Representations from Large Language Models with Markov Chain Monte Carlo. Proceedings of the Annual Meeting of the Cognitive Science Society.

[34] Eliciting the Priors of Large Language Models using Iterated In-Context Learning. Proceedings of the Annual Meeting of the Cognitive Science Society.

[35] Language Models are Realistic Tabular Data Generators. 2023.

[36] Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

[37] Geman, Stuart and Geman, Donald. Stochastic relaxation. 1984.

[38] Information theory, inference and learning algorithms. 2003.

[39] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning. International Conference on Learning Representations (ICLR).

[40] Weijia Xu and Andrzej Banburski-Fahey and Nebojsa Jojic. Reprompting: Automated Chain-of-Thought Prompt Inference Through

[41] Chickering, David Maxwell. Journal of Machine Learning Research. March 2003.

[42] Expert-Aided Causal Discovery of Ancestral Graphs. arXiv preprint arXiv:2309.12032.

[43] Causal discovery with language models as imperfect experts. arXiv preprint arXiv:2307.02390.

[44] Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference. International Conference on Learning Representations (ICLR).

[45] Spirtes, Peter and Glymour, Clark and Scheines, Richard.

[46] Lorch, Lars and Rothfuss, Jonas and Krause, Andreas and Schölkopf, Bernhard.

[47] Viinikka, Jussi and Hyttinen, Antti and Pensar, Johan and Koivisto, Mikko. Towards scalable

[48] Bayesian graphical models for discrete data. International Statistical Review.

[49] Giudici, Paolo and Castelo, Robert. Improving

[50] Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research.

[51] Deep generative stochastic networks trainable by backprop. International Conference on Machine Learning (ICML).

[52] Had enough of experts? Quantitative knowledge retrieval from large language models. Stat. 2025.

[53] Large language models are effective priors for causal graph discovery. arXiv preprint arXiv:2405.13551.

[54] Capstick, Alexander and Krishnan, Rahul G and Barnaghi, Payam.

[55] Arai, Shota and Selby, David and Vargo, Andrew and Vollmer, Sebastian. How many patients could we save with

[56] Automated Prior Elicitation from Large Language Models for Bayesian Logistic Regression. AutoML Conference 2024 (Workshop Track).

[57] Principled Gradient-based Markov Chain Monte Carlo for Text Generation. International Conference on Machine Learning (ICML).

[58] Ning Miao and Hao Zhou and Lili Mou and Rui Yan and Lei Li.

[59] Lianhui Qin and Sean Welleck and Daniel Khashabi and Yejin Choi.

[60] Kumar, Sachin and Paria, Biswajit and Tsvetkov, Yulia. Gradient-based Constrained Sampling from Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.144