pith · machine review for the scientific record

arxiv: 2605.05851 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Hypothesis generation and updating in large language models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords large language models · hypothesis generation · Bayesian inference · number game · hypothesis updating · Occam's razor · generalization

The pith

Large language models update hypotheses like biased Bayesian learners: systematic offsets favor narrower hypotheses, and the pattern fails to extrapolate beyond the observed examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how large language models generate and update hypotheses when inferring rules from a few positive examples, using the number game as a testbed. It measures LLM behavior through three probes—posterior prediction, hypothesis evaluation, and hypothesis generation—and compares it to an optimal Bayesian learner and to humans. The central finding is that LLMs often align with a two-parameter Bayesian fit but deviate systematically. Default responses favor narrower hypotheses due to a strong-sampling assumption, while thinking mode increases prior influence. There is also a consistent gap where evaluation picks more accurate hypotheses than generation does, and the Bayesian-like pattern breaks down when generalizing to unseen parts of the space.

Core claim

In the number game, LLMs' inferences over hypotheses supported by positive examples are well captured by a two-parameter Bayesian model, but they exhibit a default strong-sampling assumption that implicitly favors narrower hypotheses, a shift toward prior reliance in thinking mode, a robust gap between evaluation and generation performance, and a failure to extrapolate the pattern beyond the observed examples.

What carries the argument

The number game, in which a learner sees a few positive integers and infers the underlying rule or interval. Posteriors are measured via prediction, evaluation, and generation probes and compared against an optimal Bayesian model.
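
To make the machinery concrete, here is a minimal sketch of a number-game learner with a strong-sampling (size-principle) likelihood, in the two-parameter form the review describes. The hypothesis space, the uniform prior, and the exact way α and β enter are our assumptions for illustration, not the paper's actual model code.

    import numpy as np

    # Hypothetical hypothesis space over 1..100: a few rules plus some intervals.
    hypotheses = {
        "powers of 2": {2 ** k for k in range(1, 7)},
        "even numbers": set(range(2, 101, 2)),
        "multiples of 4": set(range(4, 101, 4)),
    }
    hypotheses.update({
        f"interval {a}-{a + w}": set(range(a, a + w + 1))
        for a in range(1, 71, 10) for w in (9, 19, 29)
    })

    def posterior(examples, alpha=1.0, beta=1.0):
        """Two-parameter form suggested by the review: alpha scales the
        strong-sampling likelihood, beta the (here uniform) prior."""
        prior = 1.0 / len(hypotheses)
        scores = {}
        for name, h in hypotheses.items():
            if not set(examples) <= h:
                scores[name] = 0.0  # a hypothesis must contain every example
                continue
            # Strong sampling: examples drawn uniformly from h -> (1/|h|)^n,
            # which penalizes large hypotheses (the implicit Occam's razor).
            loglik = -len(examples) * np.log(len(h))
            scores[name] = np.exp(alpha * loglik + beta * np.log(prior))
        z = sum(scores.values())
        return {name: s / z for name, s in scores.items()}

    post = posterior([16, 8, 2, 64])
    print(max(post, key=post.get))  # the size principle favors 'powers of 2'

With α > 1 the size penalty sharpens, mimicking the default narrow-hypothesis bias the paper reports; because the prior here is uniform, β only rescales scores, so a non-uniform prior would be needed before the thinking-mode shift toward prior reliance could show up.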

If this is right

  • LLMs default to narrower hypotheses via strong sampling, acting as an implicit Occam's razor.
  • Switching to thinking mode increases reliance on prior probabilities over likelihoods.
  • LLMs evaluate hypotheses more accurately than they generate them, preferring rule-like outputs in generation.
  • The Bayesian-with-bias behavior does not extend to parts of the hypothesis domain not covered by the observed examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pattern suggests LLMs may struggle with scientific discovery tasks requiring broad hypothesis exploration beyond given data.
  • The evaluation-generation gap points to a need for better alignment between selection and creation mechanisms in model training.
  • Future tests could examine whether the same biases appear in other structured inference domains beyond numbers.

Load-bearing premise

That the three different probes all tap into the same underlying posterior distribution over hypotheses inside the LLM.
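
A hedged sketch of how that premise could be checked: project each probe's elicited hypothesis distribution into the same posterior-predictive space and compare pairwise divergences. The probe posteriors below are placeholders; in the paper they would come from the prediction, evaluation, and generation prompts.

    import numpy as np

    domain = list(range(1, 11))
    hypotheses = {"even": {2, 4, 6, 8, 10},
                  "powers of 2": {2, 4, 8},
                  "1-10": set(domain)}

    # Placeholder posteriors over hypotheses, one per probe; in the paper these
    # would be elicited from the LLM by prediction, evaluation, and generation.
    probes = {
        "predict":  {"even": 0.20, "powers of 2": 0.70, "1-10": 0.10},
        "eval":     {"even": 0.25, "powers of 2": 0.65, "1-10": 0.10},
        "generate": {"even": 0.05, "powers of 2": 0.90, "1-10": 0.05},
    }

    def predictive(post):
        """Posterior predictive: p(x in concept) = sum of p(h) over h containing x."""
        p = np.array([sum(w for h, w in post.items() if x in hypotheses[h])
                      for x in domain])
        return p / p.sum()

    def kl(p, q, eps=1e-12):
        p, q = np.clip(p, eps, None), np.clip(q, eps, None)
        return float(np.sum(p * np.log(p / q)))

    base = predictive(probes["predict"])
    for name in ("eval", "generate"):
        print(name, round(kl(base, predictive(probes[name])), 4))
    # Near-zero divergences support a single shared posterior; the reported
    # evaluation-generation gap predicts a clearly nonzero value for 'generate'.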

What would settle it

A direct test would be to check whether LLMs continue to show the same parameter fits and biases when probed with new hypothesis spaces or when forced to generate hypotheses that extrapolate to unseen numbers in the game.
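
A toy version of that test (our construction, loosely mirroring the Figure 6 setup): extend the domain to d = 200 and measure how much predictive mass a hypothesis places on the unobserved half.

    # Illustrative hypotheses over the extended domain 1..200.
    rule = {2 ** k for k in range(1, 8)}   # powers of 2 up to 128
    interval = set(range(2, 65))           # 'numbers from 2 to 64'

    def mass_above_100(h, d=200):
        members = [x for x in range(1, d + 1) if x in h]
        return sum(x > 100 for x in members) / len(members)

    print(mass_above_100(rule))      # > 0: a genuine rule extrapolates (via 128)
    print(mass_above_100(interval))  # 0.0: the interval never leaves 1..100
    # An LLM that verbalizes 'powers of 2' over {16, 8, 2, 64} yet assigns
    # near-zero mass to 101..200 would reproduce the extrapolation failure.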

Figures

Figures reproduced from arXiv: 2605.05851 by Hua-Dong Xiong.

Figure 1. Three measurements of the posterior over hypotheses in the number game. The schematic shows how we prompt LLMs to measure their posterior over hypotheses. Posterior prediction queries the LLM with one integer in the hypothesis domain at a time and records the model-generated probability that the integer belongs to the same hypothesis as the current examples, yielding a probability mass function. Hypothesis…

Figure 2. Bayesian fit of LLM posterior prediction behavior. a, Full-stimulus (α, β) fits for default d = 100 posterior prediction, with each model fit once after pooling all available full-stimulus TENENBAUM99 and BIGELOW16 presentations. Each colored point is one model; dashed lines mark the configured Bayesian reference, (1, 1); the human baseline is shown as a black cross. b, Example-count trajectories of log(α/…

Figure 3. Bayesian fits quantify how prompt sampling assumptions, explicit candidate lists, and thinking affect LLM posterior prediction behavior. Bars summarize model-averaged posterior-prediction conditions for the Default Prompt, Strong Prompt, Weak Prompt, and Explicit Prompt. Within each prompt condition, dark bars show non-thinking model rows and lighter bars show thinking model rows. The left two panels repor…

Figure 4. Large language models show different behavior under three measurements of the posterior. Bars compare posterior prediction (Predict), hypothesis evaluation (Eval), and hypothesis generation (Generate) after projecting each measurement into the same posterior predictive space. Within each measurement, dark bars show non-thinking model rows and lighter bars show thinking model rows. The panels report full-st…

Figure 5. Hypothesis evaluation and generation show an accuracy–simplicity trade-off in their MAP hypothesis estimators. a,b, Top-1 example consistency versus support fraction for hypothesis evaluation and generation on TENENBAUM99 default d = 100 rows. c, Paired Eval-minus-Generation gaps for example consistency and support fraction. d,e, Trajectories of top-1 support fraction and example consistency as the number …

Figure 6. LLM hypotheses fail to generalize to the unobserved domain. a, Larger-domain posterior prediction, comparing mass assigned to 101..200 with KL divergence between the original d = 100 posterior and the renormalized d = 200 posterior on 1..100. Transparent points show stimuli; larger points show model averages. b, The same unobserved-domain comparison across posterior prediction, hypothesis evaluation, and h…
Original abstract

Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, such as $\{16, 8, 2, 64\}$: a rule like powers of 2 or an interval like numbers near 20. We measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. We then compare LLM behavior with an optimal Bayesian model and human behavior, and test whether the same posterior is expressed across probes. LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance. We also find a robust evaluation--generation gap: LLMs select more correct hypotheses during hypothesis evaluation but generate simpler, more rule-like hypotheses. Finally, this Bayesian-with-bias pattern does not extrapolate. Models can behave as if they hold rule-like hypotheses over observed examples, yet generalize poorly to parts of the hypothesis domain not covered by those examples. Our results highlight a limitation of LLMs as general problem solvers, especially for scientific inference, where hypotheses must go beyond the data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript studies hypothesis generation and updating in LLMs using the number game, where models infer rules from positive examples like {16, 8, 2, 64}. It uses three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) to measure posteriors over hypotheses, compares the results to an optimal Bayesian model and to human behavior, and reports that LLMs are well fit by a two-parameter Bayesian model with systematic biases (a strong-sampling assumption creating an implicit Occam's razor, thinking mode increasing prior reliance), a robust evaluation-generation gap, and poor extrapolation of the pattern.

Significance. If substantiated, these results would be significant for understanding LLM limitations in inference tasks relevant to scientific discovery and problem-solving. The controlled experimental setup and multi-probe design allow for direct comparison to optimal Bayesian inference and human behavior, highlighting both strengths and biases in LLM reasoning. However, the non-extrapolation finding underscores a key limitation.

major comments (3)
  1. [Abstract] The claim that LLMs are 'often well described by a two-parameter Bayesian fit' is central but lacks specifics: how the two parameters were determined, whether the fits were pre-specified or post-hoc, whether error bars are included, and what quantitative measures of fit quality (such as R-squared or likelihood ratios) support it (see the metric sketch after the minor comments).
  2. [Abstract] The assumption that the three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) all measure the same underlying posterior distribution is load-bearing for the unified description of LLM inference with specific biases. The reported robust evaluation-generation gap suggests potential systematic differences in elicited behavior that could indicate probe-dependent posteriors rather than noise around a shared one.
  3. [Abstract] The non-extrapolation result, that the Bayesian-with-bias pattern does not hold for parts of the hypothesis domain not covered by observed examples, is a key claim but requires more detail on the specific generalization tests conducted and how they were designed to test extrapolation.
minor comments (1)
  1. [Abstract] Clarify what 'thinking mode' refers to, perhaps with a short description of the experimental manipulation.
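
For major comment 1, a minimal illustration of the requested fit-quality metrics: R-squared and a log-likelihood ratio against the fixed (α, β) = (1, 1) reference. All probabilities here are synthetic placeholders, not the paper's data.

    import numpy as np

    # Synthetic placeholders: elicited LLM probabilities and two models' predictions.
    llm = np.array([0.92, 0.85, 0.10, 0.05, 0.60])  # p(x in concept) from the LLM
    fit = np.array([0.90, 0.80, 0.12, 0.08, 0.55])  # fitted two-parameter model
    ref = np.array([0.95, 0.90, 0.05, 0.02, 0.70])  # fixed (alpha, beta) = (1, 1)

    def r_squared(y, yhat):
        return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

    def soft_loglik(y, p, eps=1e-9):
        """Cross-entropy likelihood, treating elicited probabilities as soft labels."""
        p = np.clip(p, eps, 1 - eps)
        return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

    print("R^2:", round(r_squared(llm, fit), 3))
    # A positive log-likelihood ratio favors the fitted model over the reference.
    print("LLR:", round(soft_loglik(llm, fit) - soft_loglik(llm, ref), 3))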

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that LLMs are 'often well described by a two-parameter Bayesian fit' is central but lacks specifics on how the two parameters were determined, whether the fits were pre-specified versus post-hoc, inclusion of error bars, or quantitative measures of fit quality such as R-squared or likelihood ratios.

    Authors: The two parameters are the strong-sampling bias strength and the prior weight, pre-specified from the model variants in Section 3.2. Fits used maximum likelihood with bootstrap-derived error bars (reported in the supplement; a sketch of this procedure follows these responses). We will add quantitative fit metrics (e.g., log-likelihood ratios and R^2 values) to the abstract and results in revision. revision: yes

  2. Referee: [Abstract] The assumption that the three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) all measure the same underlying posterior distribution is load-bearing for the unified description of LLM inference with specific biases. The reported robust evaluation-generation gap suggests potential systematic differences in elicited behavior that could indicate probe-dependent posteriors rather than noise around a shared one.

    Authors: The evaluation-generation gap does indicate probe-specific elicitation differences. However, the core biases remain consistent across probes, which we interpret as a shared posterior plus output-format effects. We will revise the abstract and discussion to explicitly note this distinction, report per-probe posterior estimates, and clarify that the unified description applies to the bias parameters rather than every hypothesis probability. revision: partial

  3. Referee: [Abstract] The non-extrapolation result, that the Bayesian-with-bias pattern does not hold for parts of the hypothesis domain not covered by observed examples, is a key claim but requires more detail on the specific generalization tests conducted and how they were designed to test extrapolation.

    Authors: We agree more detail is warranted. The tests used held-out numbers outside the observed range and measured whether rule-like hypotheses continued to be favored. We will expand the abstract and add a results subsection with explicit test design, example stimuli, and metrics to make the extrapolation failure transparent. revision: yes
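
The fitting procedure described in response 1, sketched under loud assumptions: a toy two-parameter model and synthetic data stand in for the paper's actual likelihood and stimuli.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0.05, 0.95, 50)
    y = np.clip(0.7 * x + 0.1 + rng.normal(0, 0.05, 50), 1e-3, 1 - 1e-3)

    def neg_loglik(params, x, y):
        alpha, beta = params
        p = np.clip(alpha * x + beta, 1e-6, 1 - 1e-6)  # toy two-parameter model
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(x, y):
        """Maximum-likelihood estimate of (alpha, beta)."""
        return minimize(neg_loglik, x0=[1.0, 0.0], args=(x, y),
                        method="Nelder-Mead").x

    # Bootstrap: refit on resampled stimuli, take percentile intervals.
    boot = np.array([fit(x[idx], y[idx])
                     for idx in (rng.integers(0, len(x), len(x))
                                 for _ in range(200))])
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    print("alpha 95% CI:", round(lo[0], 3), round(hi[0], 3))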

Circularity Check

1 step flagged

Two-parameter Bayesian fit to LLM probe data, then interpreted as 'strong-sampling bias' and prior-reliance shift

specific steps
  1. Fitted input presented as a prediction [Abstract (and corresponding Results on Bayesian modeling)]
    "LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance."

    The two parameters are estimated by fitting the Bayesian model to the LLM's posterior-prediction, evaluation, and generation data. The 'strong-sampling assumption' and 'greater prior reliance' are then labeled as systematic offsets or predictions about LLM behavior, but these are exactly the values of the fitted parameters; the description is therefore equivalent to the input fit rather than an independent result.

full rationale

The paper fits a two-parameter Bayesian model (sampling assumption + prior weight) to responses from the three probes on the number game. It then presents the fitted values as evidence that LLMs exhibit a 'strong-sampling assumption that creates an implicit Occam's razor' by default and shift toward 'greater prior reliance' in thinking mode. Because the parameters are estimated from the same LLM data rather than fixed independently, the claimed systematic offsets reduce to the fit results by construction. The reported evaluation-generation gap further indicates the probes may not access a single shared posterior, undermining the unified two-parameter description, yet the fit is still used to characterize LLM inference overall.
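
One way to defuse the circularity, sketched with synthetic data: estimate α on one split of stimuli, freeze it, and test whether it beats the α = 1 reference on the held-out split, so 'strong-sampling bias' becomes an out-of-sample prediction rather than a relabeled fit.

    import numpy as np

    rng = np.random.default_rng(1)
    true_alpha = 1.6
    sizes = rng.integers(2, 50, size=40)       # |h| for each stimulus
    loglik = -np.log(sizes)                    # strong-sampling term, log(1/|h|)
    y = true_alpha * loglik + rng.normal(0, 0.3, size=40)  # observed log-scores

    half = 20
    # Least-squares estimate of alpha on the first split (no intercept).
    alpha_hat = np.sum(y[:half] * loglik[:half]) / np.sum(loglik[:half] ** 2)

    def held_out_sse(alpha):
        return float(np.sum((y[half:] - alpha * loglik[half:]) ** 2))

    print("alpha_hat:", round(alpha_hat, 2))
    print("held-out SSE, fitted vs alpha = 1:",
          round(held_out_sse(alpha_hat), 2), round(held_out_sse(1.0), 2))
    # If the fitted alpha keeps winning on data it never saw, the
    # strong-sampling claim is a prediction, not a restatement of the fit.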

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone exposes no explicit free parameters, axioms, or invented entities beyond the implicit two-parameter Bayesian model. No independent evidence for any new entities is mentioned.

pith-pipeline@v0.9.0 · 5568 in / 1241 out tokens · 33230 ms · 2026-05-08T14:38:32.135830+00:00 · methodology

