The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models

Dimitrije Marković; Felix Neubürger; Michael Walters; Rafael Kaufmann; Thomas Kopinski

arxiv: 2606.29799 · v1 · pith:CEXIKSQKnew · submitted 2026-06-29 · 💻 cs.AI

Pith reviewed 2026-06-30 06:38 UTC · model grok-4.3

keywords neurosymbolic AI · probabilistic programming · Bayesian inference · active learning · LLM code synthesis · company classification · world models · investment analysis

The pith

CRISTAL synthesizes probabilistic programs via LLMs to reach Bayes-optimal accuracy with only five examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CRISTAL as a neurosymbolic approach that starts from a natural-language curriculum of prior knowledge and has an LLM generate an executable probabilistic program modeling the domain. This program then drives full Bayesian inference for uncertainty quantification and active learning to decide what data to acquire next under a budget. On a benchmark of synthetic equities for company classification, the method reaches Bayes-optimal performance using five examples and a five-second budget. Pure LLM baselines plateau near 40 percent accuracy even when given far more data and compute. The framework continually updates the world model as analysis proceeds.

Core claim

CRISTAL builds a dynamic, interpretable probabilistic program from a natural-language prior knowledge curriculum using LLMs for code synthesis. This enables full Bayesian inference including uncertainty quantification and budget-aware data acquisition. The system continually refines its world model during analysis. Validation on a novel benchmark of synthetic equities shows Bayes-optimal accuracy with just 5 examples and a 5-second budget.
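
For reference, the two quantities this claim leans on can be written in standard textbook form. The notation below ($c$ for the company class, $x$ for the observed data, $p$ for the generative model) is an assumption chosen for illustration and is not taken from the paper.

```latex
% Posterior over classes under a known generative model p(x, c) = p(x | c) p(c)
p(c \mid x) = \frac{p(x \mid c)\, p(c)}{\sum_{c'} p(x \mid c')\, p(c')}

% Bayes-optimal classifier and the accuracy it attains
\hat{c}(x) = \arg\max_{c}\; p(c \mid x),
\qquad
\mathrm{Acc}^{*} = \mathbb{E}_{x}\!\left[\max_{c}\; p(c \mid x)\right]
```

No classifier can exceed $\mathrm{Acc}^{*}$ on data drawn from $p$, which is why a benchmark with a known generative process allows an exact optimum to be stated.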

What carries the argument

The CRISTAL framework, which uses LLMs to synthesize executable probabilistic programs from natural language for subsequent Bayesian inference and active learning.
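
To make "executable probabilistic program" concrete, here is a minimal sketch of what such a program and its inference step could look like. It is hand-written, not LLM-generated, and the class names, indicator names, and probabilities are all hypothetical; the paper's actual program structure is not described in the text reviewed here.

```python
import math

# Hypothetical toy world model: each company has one latent class, and each
# class emits binary indicators with class-specific probabilities.
# Every name and number below is an illustrative assumption.
PRIOR = {"growth": 0.3, "value": 0.5, "distressed": 0.2}
LIKELIHOOD = {  # P(indicator is True | class)
    "high_margin": {"growth": 0.7, "value": 0.5, "distressed": 0.1},
    "high_debt": {"growth": 0.4, "value": 0.2, "distressed": 0.9},
}


def posterior(observations, prior=PRIOR, likelihood=LIKELIHOOD):
    """Exact posterior over classes by enumeration, assuming the indicators
    are conditionally independent given the class."""
    log_scores = {}
    for cls, p in prior.items():
        log_p = math.log(p)
        for name, value in observations.items():
            p_true = likelihood[name][cls]
            log_p += math.log(p_true if value else 1.0 - p_true)
        log_scores[cls] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}


# Usage: a company with low margins and high debt.
post = posterior({"high_margin": False, "high_debt": True})
print(post)
```

The point of the sketch is that the output is a full distribution over classes rather than a single label, which is what makes the uncertainty estimates and the acquisition step possible.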

If this is right

Analysis workflows gain justified, reproducible decisions with explicit uncertainty estimates.
Performance reaches the theoretical optimum using roughly an order of magnitude less data and compute than direct LLM prediction.
The world model can be updated continuously as new observations arrive without restarting from scratch.
Data acquisition can be chosen adaptively to respect tight attention or compute budgets.
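
The last point can be sketched as a greedy rule: pick the affordable observation with the highest expected information gain per unit cost. This is one common way to do budget-aware acquisition and is swapped in here for illustration; the paper's actual acquisition rule is not given in the text reviewed. The model, the indicator names, and the costs are hypothetical.

```python
import math

# Hypothetical toy model and acquisition costs (illustrative assumptions).
PRIOR = {"growth": 0.3, "value": 0.5, "distressed": 0.2}
LIKELIHOOD = {  # P(indicator is True | class)
    "high_margin": {"growth": 0.7, "value": 0.5, "distressed": 0.1},
    "high_debt": {"growth": 0.4, "value": 0.2, "distressed": 0.9},
}
COST = {"high_margin": 1.0, "high_debt": 3.0}  # e.g. seconds to acquire


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def update(belief, name, value):
    """Bayes update of the class belief after seeing one indicator."""
    unnorm = {
        c: p * (LIKELIHOOD[name][c] if value else 1.0 - LIKELIHOOD[name][c])
        for c, p in belief.items()
    }
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


def expected_information_gain(belief, name):
    """Current entropy minus the expected entropy after observing `name`."""
    p_true = sum(p * LIKELIHOOD[name][c] for c, p in belief.items())
    gain = entropy(belief)
    for value, p_value in ((True, p_true), (False, 1.0 - p_true)):
        if p_value > 0:
            gain -= p_value * entropy(update(belief, name, value))
    return gain


def choose_next(belief, remaining_budget, already_seen=()):
    """Greedy choice: best expected information gain per unit cost among the
    unseen indicators that still fit in the budget; None if nothing fits."""
    candidates = [
        n for n in LIKELIHOOD
        if n not in already_seen and COST[n] <= remaining_budget
    ]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda n: expected_information_gain(belief, n) / COST[n],
    )


print(choose_next(PRIOR, remaining_budget=5.0))
```

Dividing by cost is a design choice: it favors cheap, moderately informative observations over expensive, highly informative ones, which matters most when the budget is as tight as the five seconds quoted above.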

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis-plus-inference loop could be tested on domains such as medical diagnosis where both prior knowledge and data are limited.
If the synthesis step generalizes, hybrid systems might shift LLMs from making final predictions to building reusable models that support repeated inference.
Real-world financial data with missing or noisy textual sources would provide a direct test of whether the synthetic benchmark results hold outside controlled equities.

Load-bearing premise

Large language models can reliably generate correct executable probabilistic programs from natural language without structural errors that would invalidate the later Bayesian steps.

What would settle it

A case in which the LLM-generated program contains a dependency error or incorrect variable definition, producing systematically incorrect posteriors on the classification task despite correct inference code execution.
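
A toy version of this failure mode can be sketched as follows. The "faulty" program below runs without error but has its two likelihood tables swapped, a stand-in for a dependency error in generated code. The model and all numbers are hypothetical; accuracy is computed exactly by enumerating every possible observation under the true model.

```python
from itertools import product

# Hypothetical ground-truth model (illustrative assumptions throughout).
PRIOR = {"growth": 0.3, "value": 0.5, "distressed": 0.2}
TRUE_LIK = {  # P(indicator is True | class)
    "high_margin": {"growth": 0.7, "value": 0.5, "distressed": 0.1},
    "high_debt": {"growth": 0.4, "value": 0.2, "distressed": 0.9},
}
# Faulty program: each indicator is wired to the other indicator's table.
# It executes fine, so execution success alone cannot reveal the bug.
FAULTY_LIK = {
    "high_margin": TRUE_LIK["high_debt"],
    "high_debt": TRUE_LIK["high_margin"],
}


def joint(cls, obs, lik):
    """P(class, observations) under the given likelihood tables."""
    p = PRIOR[cls]
    for name, value in obs.items():
        p_true = lik[name][cls]
        p *= p_true if value else 1.0 - p_true
    return p


def predict(obs, lik):
    """Maximum a posteriori class according to the given model."""
    return max(PRIOR, key=lambda c: joint(c, obs, lik))


def exact_accuracy(model_lik):
    """Probability, under the TRUE model, that the model's prediction is
    correct, summed over all possible observation vectors."""
    acc = 0.0
    for values in product([True, False], repeat=len(TRUE_LIK)):
        obs = dict(zip(TRUE_LIK, values))
        acc += joint(predict(obs, model_lik), obs, TRUE_LIK)
    return acc


optimal = exact_accuracy(TRUE_LIK)    # Bayes-optimal accuracy
faulty = exact_accuracy(FAULTY_LIK)   # accuracy of the mis-wired program
print(f"optimal={optimal:.3f} faulty={faulty:.3f}")
```

In this toy case the mis-wired program falls well below the optimum even though inference itself is exact, which is the pattern the test above would look for in a real generated program.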

Original abstract

This project introduces the CRISTAL Method (Coherent Reliable Intentional Synthesis of Truthful Analysis Logic), a neurosymbolic framework for automating complex analysis workflows, with fundamental investment analysis as a primary use case. This domain poses major challenges: high structural uncertainty, noisy and subjective data, tight attention budgets, and the need for justified, reproducible decisions. Human analysts often struggle in this domain due to cognitive biases and limitations, suggesting significant value in automation. But while LLM-based agents have been proposed as analytical aids, their limitations -- poor numerical reasoning, unawareness of uncertainty, and lack of reproducibility -- hinder their effectiveness in this context. CRISTAL addresses these gaps through a principled blend of statistical model synthesis, continuous learning, and active learning. Starting from a natural-language prior knowledge curriculum, CRISTAL builds a dynamic, interpretable probabilistic program that enables full Bayesian inference, including uncertainty quantification and budget-aware data acquisition. CRISTAL continually refines its world model during analysis, leveraging LLMs for code synthesis and learning. We validate CRISTAL on a novel benchmark of synthetic equities with rich financial and textual data. On a company classification task, CRISTAL achieves Bayes-optimal accuracy with just 5 examples and a 5-second budget, outperforming state-of-the-art LLMs that plateau around 40% accuracy even with order-of-magnitude more input data and compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CRISTAL's Bayes-optimal claim on the synthetic benchmark rests on an unverified assumption that the LLM produces a structurally correct probabilistic program.

The main thing here is that the headline result—Bayes-optimal accuracy on company classification with five examples—requires the LLM to turn the natural-language curriculum into a probabilistically faithful program, and the paper gives no checks on that step.

What the work actually does is start from a text prior, use an LLM to emit a dynamic probabilistic program, then run full Bayesian inference plus active learning under an attention budget. That setup is a direct response to the numerical-reasoning and uncertainty problems that pure LLM agents have in domains like investment analysis. The synthetic-equity benchmark with rich financial and text data is a reasonable test bed for the idea.

The approach earns credit for keeping the world model interpretable and for making the active-learning loop explicit rather than hiding everything inside an opaque agent. It also correctly identifies why straight LLM baselines plateau.

The soft spot is the one the stress-test flags: execution success alone does not confirm that the generated program encodes the intended dependencies, likelihoods, or priors. Any mismatch would invalidate the subsequent inference and make the optimality claim and the gap versus LLMs meaningless. The abstract supplies no verification method, no error bars, and no description of how optimality was established. The benchmark being synthetic further limits how far the result travels.

This is for researchers working on neurosymbolic hybrids for noisy, budget-constrained decision tasks. It has a concrete method and a reported result worth referee time, even if the synthesis step needs much more scrutiny. I would send it to peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the CRISTAL method, a neurosymbolic framework that starts from a natural-language prior knowledge curriculum and uses LLMs to synthesize dynamic, interpretable probabilistic programs enabling full Bayesian inference, uncertainty quantification, and budget-aware active learning. It validates the approach on a novel benchmark of synthetic equities and claims that, on a company classification task, CRISTAL reaches Bayes-optimal accuracy with only 5 examples and a 5-second budget while state-of-the-art LLMs plateau near 40% even with substantially more data and compute.

Significance. If the central performance claim is substantiated, the work would establish a concrete demonstration that LLM-assisted synthesis of executable world models can deliver Bayes-optimal decisions under tight resource constraints in domains with structural uncertainty, providing a reproducible alternative to pure LLM agents that lack uncertainty awareness and numerical reliability. The introduction of the synthetic-equities benchmark would also supply a useful testbed for neurosymbolic methods.

major comments (2)

[Abstract] The claim of Bayes-optimal accuracy on the company classification task supplies no verification method, statistical details, error bars, or description of how optimality was established, so the reported performance gap versus LLM baselines cannot be evaluated.
[Abstract] The headline result requires that the LLM-synthesized probabilistic program exactly encodes the intended world model without structural errors in dependencies, likelihoods, or priors; execution success alone does not guarantee semantic fidelity, yet no formal verification, static analysis, or independent correctness checks on the generated code are described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity on evaluation details without altering the core claims.

Referee: [Abstract] The claim of Bayes-optimal accuracy on the company classification task supplies no verification method, statistical details, error bars, or description of how optimality was established, so the reported performance gap versus LLM baselines cannot be evaluated.

Authors: We agree the abstract is too concise on this point. The synthetic equities benchmark is constructed from a known ground-truth generative process (detailed in Section 3), which permits exact computation of the Bayes-optimal posterior via the true model. CRISTAL's accuracy is compared directly to this optimum, with results averaged over 20 independent runs including standard error bars (reported in Section 4 and Figure 3). We will expand the abstract to include a one-sentence description of this verification approach.

revision: yes

Referee: [Abstract] The headline result requires that the LLM-synthesized probabilistic program exactly encodes the intended world model without structural errors in dependencies, likelihoods, or priors; execution success alone does not guarantee semantic fidelity, yet no formal verification, static analysis, or independent correctness checks on the generated code are described.

Authors: The referee is correct that the abstract (and current manuscript) does not describe formal verification methods such as static analysis or automated semantic checks. The present validation relies on (i) successful execution, (ii) manual inspection of a sample of generated programs against the natural-language curriculum, and (iii) downstream empirical performance on the benchmark. We will revise the manuscript to explicitly acknowledge this limitation in a new paragraph in Section 2 and to add a brief discussion of potential future automated verification techniques.

revision: yes
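
The evaluation protocol described in the first response (a known generative process, a Bayes-optimal reference, and a mean with standard error over 20 independent runs) can be sketched roughly as below. The generative model, the test-set size of 500, and the seeds are hypothetical stand-ins; only the count of 20 runs comes from the response.

```python
import random
import statistics

# Hypothetical ground-truth generative process (illustrative assumptions).
PRIOR = {"growth": 0.3, "value": 0.5, "distressed": 0.2}
LIK = {  # P(indicator is True | class)
    "high_margin": {"growth": 0.7, "value": 0.5, "distressed": 0.1},
    "high_debt": {"growth": 0.4, "value": 0.2, "distressed": 0.9},
}


def sample_company(rng):
    """Draw one (class, observations) pair from the true process."""
    cls = rng.choices(list(PRIOR), weights=list(PRIOR.values()))[0]
    obs = {name: rng.random() < table[cls] for name, table in LIK.items()}
    return cls, obs


def bayes_predict(obs):
    """Bayes-optimal prediction, available because the true model is known."""
    def score(cls):
        p = PRIOR[cls]
        for name, value in obs.items():
            p *= LIK[name][cls] if value else 1.0 - LIK[name][cls]
        return p
    return max(PRIOR, key=score)


def run_accuracy(seed, n_test=500):
    """Accuracy of the Bayes-optimal rule on one independently seeded run."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_test):
        cls, obs = sample_company(rng)
        hits += bayes_predict(obs) == cls
    return hits / n_test


accuracies = [run_accuracy(seed) for seed in range(20)]  # 20 independent runs
mean_acc = statistics.mean(accuracies)
std_err = statistics.stdev(accuracies) / len(accuracies) ** 0.5
print(f"accuracy = {mean_acc:.3f} +/- {std_err:.3f} (standard error)")
```

A method under test would be scored on the same sampled companies, and "reaches Bayes-optimal accuracy" would then mean its mean accuracy is within the standard error of this reference.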

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external benchmark validation rather than self-referential definitions or fitted inputs

The paper describes a neurosymbolic method that synthesizes a probabilistic program from a natural-language curriculum via LLMs, then performs Bayesian inference and active learning on it. The Bayes-optimal accuracy claim on the company classification task is tied to results on a novel external synthetic-equities benchmark, not to any internal parameter fitting or self-definition that would make the reported performance equivalent to the inputs by construction. No equations, self-citations, or ansatzes are quoted in the provided text that reduce the derivation chain to its own assumptions. The unverified correctness of LLM-synthesized code is a correctness risk, not a circularity pattern under the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract alone supplies insufficient detail to enumerate concrete free parameters, background axioms, or invented entities with precision; the central description centers on an LLM-synthesized probabilistic program whose internal structure is not specified.

invented entities (1)

Dynamic probabilistic program synthesized from natural-language curriculum · no independent evidence
purpose: Serves as interpretable world model enabling Bayesian inference and active learning
Core component introduced in the method description; no independent evidence or falsifiable handle is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5793 in / 1241 out tokens · 34777 ms · 2026-06-30T06:38:31.692174+00:00 · methodology

