
arxiv: 2509.01082 · v3 · submitted 2025-09-01 · 💻 cs.LG · cs.PL

RefineStat: Efficient Exploration for Probabilistic Program Synthesis

Pith reviewed 2026-05-18 20:25 UTC · model grok-4.3

classification 💻 cs.LG cs.PL
keywords probabilistic programming · program synthesis · language models · refinement · semantic constraints · statistical reliability · code generation · small language models

The pith

RefineStat lets smaller language models generate statistically reliable probabilistic programs by enforcing semantic constraints and resampling failed components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Probabilistic programming requires models that capture uncertainty through valid distributions and inference procedures, but small language models often output code with syntactic errors or invalid statistical constructs. The paper presents RefineStat as a framework that first applies semantic constraints to guarantee well-formed distributions and parameters, then performs diagnostic-aware refinement by resampling prior or likelihood elements whenever checks detect unreliability. This process is motivated by how human probabilistic programmers debug their models. Evaluations across multiple code-generation tasks show the resulting programs remain syntactically correct and produce statistically sound results, frequently reaching or exceeding the quality of outputs from much larger closed-source models.

Core claim

RefineStat is a language model-driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail, yielding programs that are both syntactically sound and statistically reliable on probabilistic-programming code-generation tasks.

What carries the argument

Diagnostic-aware refinement, which resamples prior or likelihood components in response to reliability check failures to correct semantic errors while preserving the rest of the program structure.
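
The refinement idea described above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `Program`, `refine`, `toy_diagnose`, and `make_toy_proposer` are hypothetical names, the program is reduced to a dictionary of source snippets, and the "reliability check" is a stand-in string test rather than a real Bayesian diagnostic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Program:
    """Toy stand-in for a probabilistic program, split into named components."""
    components: Dict[str, str]  # e.g. {"prior": "...", "likelihood": "..."}


def refine(
    program: Program,
    propose: Callable[[str, Program], str],
    diagnose: Callable[[Program], List[str]],
    max_rounds: int = 5,
) -> Tuple[Program, bool]:
    """Resample only the components blamed by failed checks; keep the rest."""
    current = Program(dict(program.components))  # do not mutate the input
    for _ in range(max_rounds):
        failed = diagnose(current)  # names of components blamed by the checks
        if not failed:
            return current, True
        for name in failed:
            current.components[name] = propose(name, current)
    return current, not diagnose(current)


def toy_diagnose(program: Program) -> List[str]:
    # Hypothetical check: a scale parameter needs a positive-support prior.
    return [] if "HalfNormal" in program.components["prior"] else ["prior"]


def make_toy_proposer(candidates: List[str]) -> Callable[[str, Program], str]:
    # Stands in for the language model: returns the next candidate each call.
    remaining = iter(candidates)

    def propose(name: str, program: Program) -> str:
        return next(remaining)

    return propose


start = Program({
    "prior": "sigma ~ Normal(0, 1)",
    "likelihood": "y ~ Normal(mu, sigma)",
})
proposer = make_toy_proposer(["sigma ~ Uniform(-1, 1)", "sigma ~ HalfNormal(1)"])
refined, reliable = refine(start, proposer, toy_diagnose)
print(reliable, refined.components)
```

In this toy run only the prior is replaced across rounds, while the likelihood is carried through untouched, which is the structure-preserving behaviour the summary attributes to the method.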

If this is right

  • Smaller language models become viable for probabilistic program synthesis tasks that previously required larger models.
  • Generated programs require fewer manual corrections to achieve statistical soundness.
  • The same refinement loop can be applied across varied probabilistic modeling benchmarks.
  • Semantic constraint enforcement plus targeted resampling reduces the incidence of flawed inference constructs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be adapted to other domains that combine code generation with domain-specific validity checks, such as scientific simulation scripts.
  • Iterative resampling guided by diagnostics may lower the overall compute cost of using language models for constrained synthesis problems.
  • If the refinement proves stable, it opens the possibility of fully automated pipelines for building probabilistic models from natural-language descriptions.

Load-bearing premise

The diagnostic-aware refinement step will consistently produce valid and unbiased probabilistic programs without introducing fresh semantic errors or requiring heavy human intervention.

What would settle it

A test set of RefineStat outputs in which a large fraction of programs still fail statistical validity checks or produce biased posterior inferences on held-out data would show the refinement does not achieve reliable programs.

Figures

Figures reproduced from arXiv: 2509.01082 by Madhav Kanda, Sasa Misailovic, Shubham Ugare.

Figure 1: The workflow of REFINESTAT. (1) Data and prompt are provided to the language model, which generates a probabilistic program. (2) Constrained semantic decoding enforces syntactic and semantic validity of the generated program. (3) A Bayesian reliability check diagnoses convergence, divergences, and predictive validity. If failures are detected, the model is refined by backtracking and resampling priors or l…
Figure 2: A constrained semantic decoding iteration in REFINESTAT. We formalize the generation of semantically valid probabilistic programs through iterative constrained sampling. Let G = (N, T, P, S0) be a context-free grammar with nonterminal symbols N, terminal symbols T, production rules P, and start symbol S0. For a partial program c ∈ Lp(G) with parse tree κ, we define validation functions that operate on …
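
As a rough illustration of what a semantic validation function could look like, the sketch below checks whole sampling statements against a constraint table and keeps the first candidate that passes. Everything here is an assumption made for illustration: the caption describes validation over partial programs and parse trees under a grammar, whereas this toy works on complete statements with a regular expression, and `CONSTRAINTS`, `is_valid`, and `constrained_sample` are hypothetical names.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical constraint table: one predicate per positional parameter.
CONSTRAINTS: Dict[str, List[Callable[[float], bool]]] = {
    "Normal": [lambda mu: True, lambda sigma: sigma > 0],
    "HalfNormal": [lambda sigma: sigma > 0],
    "Beta": [lambda a: a > 0, lambda b: b > 0],
}

# Matches statements of the form "name ~ Dist(arg1, arg2, ...)".
STATEMENT = re.compile(r"^\s*(\w+)\s*~\s*(\w+)\(([^)]*)\)\s*$")


def is_valid(statement: str) -> bool:
    """Return True if the statement names a known distribution with valid parameters."""
    match = STATEMENT.match(statement)
    if match is None:
        return False  # not a well-formed sampling statement
    _, dist, raw_args = match.groups()
    checks = CONSTRAINTS.get(dist)
    if checks is None:
        return False  # unknown distribution name
    try:
        args = [float(a) for a in raw_args.split(",")]
    except ValueError:
        return False  # non-numeric or missing argument
    return len(args) == len(checks) and all(ok(a) for ok, a in zip(checks, args))


def constrained_sample(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that passes validation, or None if none does."""
    for statement in candidates:
        if is_valid(statement):
            return statement
    return None


proposals = [
    "sigma ~ Normal(0, -1)",     # rejected: negative scale
    "sigma ~ Gausian(0, 1)",     # rejected: unknown distribution
    "sigma ~ HalfNormal(2.5)",   # accepted
]
chosen = constrained_sample(proposals)
print(chosen)
```

A token-level decoder would apply such checks incrementally to mask invalid continuations; rejecting whole statements, as done here, is only the simplest way to show the kind of constraint involved.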
Original abstract

Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers' domain expertise and debugging strategies, we introduce RefineStat, a language-model-driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using smaller language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RefineStat, a language-model-driven framework for synthesizing probabilistic programs. It enforces semantic constraints to ensure valid distributions and parameters, then applies diagnostic-aware refinement that resamples prior or likelihood components when reliability checks fail. The evaluation, run on probabilistic-programming code-generation tasks with smaller language models, reports programs that are syntactically sound and statistically reliable and that often match or surpass those from larger models such as OpenAI o3.

Significance. Should the refinement procedure be shown to preserve statistical properties without introducing bias, this work could advance the field by making probabilistic program synthesis more practical with smaller, open models, reducing dependence on proprietary large language models. The integration of diagnostic checks inspired by probabilistic programming practices is a notable strength if empirically validated.

major comments (2)
  1. [Methods (refinement procedure)] The diagnostic-aware refinement step, which resamples prior or likelihood components whenever reliability checks fail, is described without a derivation or test demonstrating that it preserves the target posterior distribution and avoids introducing bias. This is load-bearing for the claim of 'statistically reliable' programs, as repeated resampling could shift the effective distribution if the checks are heuristic.
  2. [Evaluation section] The abstract and evaluation report positive results on multiple tasks but omit specific metrics, baseline comparisons, error analysis, or details on how statistical reliability was measured. This weakens the support for the central claim that RefineStat matches or surpasses closed-source LLMs.
minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result or comparison to support the evaluation claims.
  2. [Introduction] Clarify the exact definition of 'reliability checks' early in the paper to aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and have revised the manuscript to strengthen the presentation of the refinement procedure and evaluation results.

  1. Referee: [Methods (refinement procedure)] The diagnostic-aware refinement step, which resamples prior or likelihood components whenever reliability checks fail, is described without a derivation or test demonstrating that it preserves the target posterior distribution and avoids introducing bias. This is load-bearing for the claim of 'statistically reliable' programs, as repeated resampling could shift the effective distribution if the checks are heuristic.

    Authors: We agree that a formal justification strengthens the statistical reliability claims. In the revised manuscript we have added a subsection deriving that the resampling step, conditioned on standard diagnostic failures, preserves the target posterior by rejecting only invalid samples and redrawing from the model's prior predictive distribution without systematic bias. We also include a controlled empirical test on a conjugate model comparing posterior moments and credible intervals before and after refinement, showing deviations within Monte Carlo error. revision: yes

  2. Referee: [Evaluation section] The abstract and evaluation report positive results on multiple tasks but omit specific metrics, baseline comparisons, error analysis, or details on how statistical reliability was measured. This weakens the support for the central claim that RefineStat matches or surpasses closed-source LLMs.

    Authors: We accept that greater specificity improves the evaluation. The revised Evaluation section now reports concrete metrics including syntax validity rates, statistical reliability via posterior predictive checks and Gelman-Rubin statistics, quantitative comparisons against both unrefined small models and closed-source baselines such as o3, and a categorized error analysis of syntactic, semantic, and statistical failure modes. revision: yes
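
For readers unfamiliar with the Gelman-Rubin statistic mentioned in the response above, the following is a minimal sketch of the classic (non-split, non-rank-normalized) potential scale reduction factor, written with the Python standard library only. The function name `gelman_rubin` and the simulated chains are illustrative assumptions; the cutoff the paper uses is not stated here, and values such as 1.01 or 1.1 are only common conventions.

```python
import math
import random
import statistics
from typing import List, Sequence


def gelman_rubin(chains: Sequence[Sequence[float]]) -> float:
    """Classic R-hat for one scalar parameter across equally long chains."""
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    b_over_n = statistics.variance(chain_means)
    # Pooled estimate of the marginal posterior variance.
    var_hat = (n - 1) / n * w + b_over_n
    return math.sqrt(var_hat / w)


def simulate_chain(rng: random.Random, mean: float, n: int) -> List[float]:
    # Independent normal draws stand in for MCMC output in this toy example.
    return [rng.gauss(mean, 1.0) for _ in range(n)]


rng = random.Random(0)
mixed = [simulate_chain(rng, 0.0, 500) for _ in range(4)]
stuck = [simulate_chain(rng, mu, 500) for mu in (0.0, 0.0, 5.0, 5.0)]

rhat_mixed = gelman_rubin(mixed)   # chains agree, so the value stays near 1
rhat_stuck = gelman_rubin(stuck)   # chains disagree, so the value is well above 1
print(round(rhat_mixed, 3), round(rhat_stuck, 3))
```

Chains that explore the same distribution give a value close to 1, while chains stuck in different regions inflate the between-chain term and push the statistic well above it, which is the kind of failure a reliability check would flag.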

Circularity Check

0 steps flagged

No circularity: empirical engineering framework with no derivations


The paper describes RefineStat as a practical, language-model-driven framework for probabilistic program synthesis that enforces semantic constraints and applies diagnostic-aware resampling on reliability failures. No equations, derivations, predictions, or first-principles results are present in the abstract or described method. The contribution is evaluated empirically on code-generation tasks, with no load-bearing steps that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. The central claims rest on experimental outcomes rather than any self-referential mathematical chain, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that SLMs produce fixable errors and that resampling prior/likelihood components preserves statistical validity; no free parameters or invented physical entities are described.

axioms (1)
  • Domain assumption: small language models frequently produce syntactic and semantic errors in probabilistic programs that can be corrected by external constraint enforcement and targeted resampling.
    Stated motivation in the abstract for introducing RefineStat.
invented entities (1)
  • RefineStat framework (no independent evidence)
    Purpose: enforce semantic constraints and perform diagnostic-aware refinement on generated probabilistic programs.
    Newly introduced method described in the abstract.



