pith · machine review for the scientific record

arxiv: 2605.05851 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Hypothesis generation and updating in large language models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords large language models · hypothesis generation · Bayesian inference · number game · hypothesis updating · Occam's razor · generalization

The pith

Large language models update hypotheses like biased Bayesian learners: systematic offsets favor narrower hypotheses, and the pattern fails to extrapolate beyond the observed examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how large language models generate and update hypotheses when inferring rules from a few positive examples, using the number game as a testbed. It measures LLM behavior through three probes—posterior prediction, hypothesis evaluation, and hypothesis generation—and compares it to an optimal Bayesian learner and to humans. The central finding is that LLMs often align with a two-parameter Bayesian fit but deviate systematically. Default responses favor narrower hypotheses due to a strong-sampling assumption, while thinking mode increases prior influence. There is also a consistent gap where evaluation picks more accurate hypotheses than generation does, and the Bayesian-like pattern breaks down when generalizing to unseen parts of the space.

Core claim

In the number game, LLMs' inferences over hypotheses supported by positive examples are well captured by a two-parameter Bayesian model, but they exhibit a default strong-sampling assumption that implicitly favors narrower hypotheses, a shift toward prior reliance in thinking mode, a robust gap between evaluation and generation performance, and a failure to extrapolate the pattern beyond the observed examples.

What carries the argument

The number game, in which a learner sees a few positive integers and infers the underlying rule or interval. Posteriors are measured via prediction, evaluation, and generation probes and compared against an optimal Bayesian model.
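
To make the machinery concrete, here is a minimal sketch of a number-game learner with a strong-sampling (size-principle) likelihood, in the two-parameter form the review describes. The hypothesis space, the uniform prior, and the exact way α and β enter are our assumptions for illustration, not the paper's actual model code.

    import numpy as np

    # Hypothetical hypothesis space over 1..100: a few rules plus some intervals.
    hypotheses = {
        "powers of 2": {2 ** k for k in range(1, 7)},
        "even numbers": set(range(2, 101, 2)),
        "multiples of 4": set(range(4, 101, 4)),
    }
    hypotheses.update({
        f"interval {a}-{a + w}": set(range(a, a + w + 1))
        for a in range(1, 71, 10) for w in (9, 19, 29)
    })

    def posterior(examples, alpha=1.0, beta=1.0):
        """Two-parameter form suggested by the review: alpha scales the
        strong-sampling likelihood, beta the (here uniform) prior."""
        prior = 1.0 / len(hypotheses)
        scores = {}
        for name, h in hypotheses.items():
            if not set(examples) <= h:
                scores[name] = 0.0  # a hypothesis must contain every example
                continue
            # Strong sampling: examples drawn uniformly from h -> (1/|h|)^n,
            # which penalizes large hypotheses (the implicit Occam's razor).
            loglik = -len(examples) * np.log(len(h))
            scores[name] = np.exp(alpha * loglik + beta * np.log(prior))
        z = sum(scores.values())
        return {name: s / z for name, s in scores.items()}

    post = posterior([16, 8, 2, 64])
    print(max(post, key=post.get))  # the size principle favors 'powers of 2'

With α > 1 the size penalty sharpens, mimicking the default narrow-hypothesis bias the paper reports; because the prior here is uniform, β only rescales scores, so a non-uniform prior would be needed before the thinking-mode shift toward prior reliance could show up.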

If this is right

  • LLMs default to narrower hypotheses via strong sampling, acting as an implicit Occam's razor.
  • Switching to thinking mode increases reliance on prior probabilities over likelihoods.
  • LLMs evaluate hypotheses more accurately than they generate them, preferring rule-like outputs in generation.
  • The Bayesian-with-bias behavior does not extend to parts of the hypothesis domain not covered by the observed examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pattern suggests LLMs may struggle with scientific discovery tasks requiring broad hypothesis exploration beyond given data.
  • The evaluation-generation gap points to a need for better alignment between selection and creation mechanisms in model training.
  • Future tests could examine whether the same biases appear in other structured inference domains beyond numbers.

Load-bearing premise

That the three different probes all tap into the same underlying posterior distribution over hypotheses inside the LLM.
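
A hedged sketch of how that premise could be checked: project each probe's elicited hypothesis distribution into the same posterior-predictive space and compare pairwise divergences. The probe posteriors below are placeholders; in the paper they would come from the prediction, evaluation, and generation prompts.

    import numpy as np

    domain = list(range(1, 11))
    hypotheses = {"even": {2, 4, 6, 8, 10},
                  "powers of 2": {2, 4, 8},
                  "1-10": set(domain)}

    # Placeholder posteriors over hypotheses, one per probe; in the paper these
    # would be elicited from the LLM by prediction, evaluation, and generation.
    probes = {
        "predict":  {"even": 0.20, "powers of 2": 0.70, "1-10": 0.10},
        "eval":     {"even": 0.25, "powers of 2": 0.65, "1-10": 0.10},
        "generate": {"even": 0.05, "powers of 2": 0.90, "1-10": 0.05},
    }

    def predictive(post):
        """Posterior predictive: p(x in concept) = sum of p(h) over h containing x."""
        p = np.array([sum(w for h, w in post.items() if x in hypotheses[h])
                      for x in domain])
        return p / p.sum()

    def kl(p, q, eps=1e-12):
        p, q = np.clip(p, eps, None), np.clip(q, eps, None)
        return float(np.sum(p * np.log(p / q)))

    base = predictive(probes["predict"])
    for name in ("eval", "generate"):
        print(name, round(kl(base, predictive(probes[name])), 4))
    # Near-zero divergences support a single shared posterior; the reported
    # evaluation-generation gap predicts a clearly nonzero value for 'generate'.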

What would settle it

A direct test would be to check whether LLMs continue to show the same parameter fits and biases when probed with new hypothesis spaces or when forced to generate hypotheses that extrapolate to unseen numbers in the game.
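
A toy version of that test (our construction, loosely mirroring the Figure 6 setup): extend the domain to d = 200 and measure how much predictive mass a hypothesis places on the unobserved half.

    # Illustrative hypotheses over the extended domain 1..200.
    rule = {2 ** k for k in range(1, 8)}   # powers of 2 up to 128
    interval = set(range(2, 65))           # 'numbers from 2 to 64'

    def mass_above_100(h, d=200):
        members = [x for x in range(1, d + 1) if x in h]
        return sum(x > 100 for x in members) / len(members)

    print(mass_above_100(rule))      # > 0: a genuine rule extrapolates (via 128)
    print(mass_above_100(interval))  # 0.0: the interval never leaves 1..100
    # An LLM that verbalizes 'powers of 2' over {16, 8, 2, 64} yet assigns
    # near-zero mass to 101..200 would reproduce the extrapolation failure.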

Figures

Figures reproduced from arXiv: 2605.05851 by Hua-Dong Xiong.

Figure 1. Three measurements of the posterior over hypotheses in the number game. The schematic shows how we prompt LLMs to measure their posterior over hypotheses. Posterior prediction queries the LLM with one integer in the hypothesis domain at a time and records the model-generated probability that the integer belongs to the same hypothesis as the current examples, yielding a probability mass function. Hypothesis…

Figure 2. Bayesian fit of LLM posterior prediction behavior. a, Full-stimulus (α, β) fits for default d = 100 posterior prediction, with each model fit once after pooling all available full-stimulus TENENBAUM99 and BIGELOW16 presentations. Each colored point is one model; dashed lines mark the configured Bayesian reference, (1, 1); the human baseline is shown as a black cross. b, Example-count trajectories of log(α/…

Figure 3. Bayesian fits quantify how prompt sampling assumptions, explicit candidate lists, and thinking affect LLM posterior prediction behavior. Bars summarize model-averaged posterior-prediction conditions for the Default Prompt, Strong Prompt, Weak Prompt, and Explicit Prompt. Within each prompt condition, dark bars show non-thinking model rows and lighter bars show thinking model rows. The left two panels repor…

Figure 4. Large language models show different behavior under three measurements of the posterior. Bars compare posterior prediction (Predict), hypothesis evaluation (Eval), and hypothesis generation (Generate) after projecting each measurement into the same posterior predictive space. Within each measurement, dark bars show non-thinking model rows and lighter bars show thinking model rows. The panels report full-st…

Figure 5. Hypothesis evaluation and generation show an accuracy–simplicity trade-off in their MAP hypothesis estimators. a,b, Top-1 example consistency versus support fraction for hypothesis evaluation and generation on TENENBAUM99 default d = 100 rows. c, Paired Eval-minus-Generation gaps for example consistency and support fraction. d,e, Trajectories of top-1 support fraction and example consistency as the number …

Figure 6. LLM hypotheses fail to generalize to the unobserved domain. a, Larger-domain posterior prediction, comparing mass assigned to 101..200 with KL divergence between the original d = 100 posterior and the renormalized d = 200 posterior on 1..100. Transparent points show stimuli; larger points show model averages. b, The same unobserved-domain comparison across posterior prediction, hypothesis evaluation, and h…
Original abstract

Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, such as $\{16, 8, 2, 64\}$: a rule like powers of 2 or an interval like numbers near 20. We measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. We then compare LLM behavior with an optimal Bayesian model and human behavior, and test whether the same posterior is expressed across probes. LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance. We also find a robust evaluation--generation gap: LLMs select more correct hypotheses during hypothesis evaluation but generate simpler, more rule-like hypotheses. Finally, this Bayesian-with-bias pattern does not extrapolate. Models can behave as if they hold rule-like hypotheses over observed examples, yet generalize poorly to parts of the hypothesis domain not covered by those examples. Our results highlight a limitation of LLMs as general problem solvers, especially for scientific inference, where hypotheses must go beyond the data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript studies hypothesis generation and updating in LLMs using the number game, where models infer rules from positive examples like {16, 8, 2, 64}. It uses three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) to measure posteriors over hypotheses, compares the results to an optimal Bayesian model and to human behavior, and reports that LLMs are well fit by a two-parameter Bayesian model with systematic biases (a strong-sampling assumption creating an implicit Occam's razor, thinking mode increasing prior reliance), a robust evaluation-generation gap, and poor extrapolation of the pattern.

Significance. If substantiated, these results would be significant for understanding LLM limitations in inference tasks relevant to scientific discovery and problem-solving. The controlled experimental setup and multi-probe design allow for direct comparison to optimal Bayesian inference and human behavior, highlighting both strengths and biases in LLM reasoning. However, the non-extrapolation finding underscores a key limitation.

major comments (3)
  1. [Abstract] The claim that LLMs are 'often well described by a two-parameter Bayesian fit' is central but lacks specifics: how the two parameters were determined, whether the fits were pre-specified or post-hoc, whether error bars are included, and what quantitative measures of fit quality (such as R-squared or likelihood ratios) support it (see the metric sketch after the minor comments).
  2. [Abstract] The assumption that the three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) all measure the same underlying posterior distribution is load-bearing for the unified description of LLM inference with specific biases. The reported robust evaluation-generation gap suggests potential systematic differences in elicited behavior that could indicate probe-dependent posteriors rather than noise around a shared one.
  3. [Abstract] The non-extrapolation result, that the Bayesian-with-bias pattern does not hold for parts of the hypothesis domain not covered by observed examples, is a key claim but requires more detail on the specific generalization tests conducted and how they were designed to test extrapolation.
minor comments (1)
  1. [Abstract] Clarify what 'thinking mode' refers to, perhaps with a short description of the experimental manipulation.
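
For major comment 1, a minimal illustration of the requested fit-quality metrics: R-squared and a log-likelihood ratio against the fixed (α, β) = (1, 1) reference. All probabilities here are synthetic placeholders, not the paper's data.

    import numpy as np

    # Synthetic placeholders: elicited LLM probabilities and two models' predictions.
    llm = np.array([0.92, 0.85, 0.10, 0.05, 0.60])  # p(x in concept) from the LLM
    fit = np.array([0.90, 0.80, 0.12, 0.08, 0.55])  # fitted two-parameter model
    ref = np.array([0.95, 0.90, 0.05, 0.02, 0.70])  # fixed (alpha, beta) = (1, 1)

    def r_squared(y, yhat):
        return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

    def soft_loglik(y, p, eps=1e-9):
        """Cross-entropy likelihood, treating elicited probabilities as soft labels."""
        p = np.clip(p, eps, 1 - eps)
        return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

    print("R^2:", round(r_squared(llm, fit), 3))
    # A positive log-likelihood ratio favors the fitted model over the reference.
    print("LLR:", round(soft_loglik(llm, fit) - soft_loglik(llm, ref), 3))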

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that LLMs are 'often well described by a two-parameter Bayesian fit' is central but lacks specifics on how the two parameters were determined, whether the fits were pre-specified versus post-hoc, inclusion of error bars, or quantitative measures of fit quality such as R-squared or likelihood ratios.

    Authors: The two parameters are the strong-sampling bias strength and the prior weight, pre-specified from the model variants in Section 3.2. Fits used maximum likelihood with bootstrap-derived error bars (reported in the supplement; a sketch of this procedure follows these responses). We will add quantitative fit metrics (e.g., log-likelihood ratios and R^2 values) to the abstract and results in revision. revision: yes

  2. Referee: [Abstract] The assumption that the three probes (posterior prediction, hypothesis evaluation, and hypothesis generation) all measure the same underlying posterior distribution is load-bearing for the unified description of LLM inference with specific biases. The reported robust evaluation-generation gap suggests potential systematic differences in elicited behavior that could indicate probe-dependent posteriors rather than noise around a shared one.

    Authors: The evaluation-generation gap does indicate probe-specific elicitation differences. However, the core biases remain consistent across probes, which we interpret as a shared posterior plus output-format effects. We will revise the abstract and discussion to explicitly note this distinction, report per-probe posterior estimates, and clarify that the unified description applies to the bias parameters rather than every hypothesis probability. revision: partial

  3. Referee: [Abstract] The non-extrapolation result, that the Bayesian-with-bias pattern does not hold for parts of the hypothesis domain not covered by observed examples, is a key claim but requires more detail on the specific generalization tests conducted and how they were designed to test extrapolation.

    Authors: We agree more detail is warranted. The tests used held-out numbers outside the observed range and measured whether rule-like hypotheses continued to be favored. We will expand the abstract and add a results subsection with explicit test design, example stimuli, and metrics to make the extrapolation failure transparent. revision: yes
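
The fitting procedure described in response 1, sketched under loud assumptions: a toy two-parameter model and synthetic data stand in for the paper's actual likelihood and stimuli.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0.05, 0.95, 50)
    y = np.clip(0.7 * x + 0.1 + rng.normal(0, 0.05, 50), 1e-3, 1 - 1e-3)

    def neg_loglik(params, x, y):
        alpha, beta = params
        p = np.clip(alpha * x + beta, 1e-6, 1 - 1e-6)  # toy two-parameter model
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(x, y):
        """Maximum-likelihood estimate of (alpha, beta)."""
        return minimize(neg_loglik, x0=[1.0, 0.0], args=(x, y),
                        method="Nelder-Mead").x

    # Bootstrap: refit on resampled stimuli, take percentile intervals.
    boot = np.array([fit(x[idx], y[idx])
                     for idx in (rng.integers(0, len(x), len(x))
                                 for _ in range(200))])
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    print("alpha 95% CI:", round(lo[0], 3), round(hi[0], 3))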

Circularity Check

1 step flagged

Two-parameter Bayesian fit to LLM probe data, then interpreted as 'strong-sampling bias' and prior-reliance shift

specific steps
  1. Fitted input presented as a prediction [Abstract (and corresponding Results on Bayesian modeling)]
    "LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance."

    The two parameters are estimated by fitting the Bayesian model to the LLM's posterior-prediction, evaluation, and generation data. The 'strong-sampling assumption' and 'greater prior reliance' are then labeled as systematic offsets or predictions about LLM behavior, but these are exactly the values of the fitted parameters; the description is therefore equivalent to the input fit rather than an independent result.

full rationale

The paper fits a two-parameter Bayesian model (sampling assumption + prior weight) to responses from the three probes on the number game. It then presents the fitted values as evidence that LLMs exhibit a 'strong-sampling assumption that creates an implicit Occam's razor' by default and shift toward 'greater prior reliance' in thinking mode. Because the parameters are estimated from the same LLM data rather than fixed independently, the claimed systematic offsets reduce to the fit results by construction. The reported evaluation-generation gap further indicates the probes may not access a single shared posterior, undermining the unified two-parameter description, yet the fit is still used to characterize LLM inference overall.
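
One way to defuse the circularity, sketched with synthetic data: estimate α on one split of stimuli, freeze it, and test whether it beats the α = 1 reference on the held-out split, so 'strong-sampling bias' becomes an out-of-sample prediction rather than a relabeled fit.

    import numpy as np

    rng = np.random.default_rng(1)
    true_alpha = 1.6
    sizes = rng.integers(2, 50, size=40)       # |h| for each stimulus
    loglik = -np.log(sizes)                    # strong-sampling term, log(1/|h|)
    y = true_alpha * loglik + rng.normal(0, 0.3, size=40)  # observed log-scores

    half = 20
    # Least-squares estimate of alpha on the first split (no intercept).
    alpha_hat = np.sum(y[:half] * loglik[:half]) / np.sum(loglik[:half] ** 2)

    def held_out_sse(alpha):
        return float(np.sum((y[half:] - alpha * loglik[half:]) ** 2))

    print("alpha_hat:", round(alpha_hat, 2))
    print("held-out SSE, fitted vs alpha = 1:",
          round(held_out_sse(alpha_hat), 2), round(held_out_sse(1.0), 2))
    # If the fitted alpha keeps winning on data it never saw, the
    # strong-sampling claim is a prediction, not a restatement of the fit.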

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone exposes no explicit free parameters, axioms, or invented entities beyond the implicit two-parameter Bayesian model. No independent evidence for any new entities is mentioned.

pith-pipeline@v0.9.0 · 5568 in / 1241 out tokens · 33230 ms · 2026-05-08T14:38:32.135830+00:00 · methodology

