Why Did Apple Fall: Evaluating Curiosity in Large Language Models
Pith reviewed 2026-05-18 04:34 UTC · model grok-4.3
The pith
LLMs seek knowledge more than humans yet favor conservative choices amid uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. Curious behaviors can enhance the model's reasoning and active learning abilities.
What carries the argument
An evaluation framework built from the adapted Five-Dimensional Curiosity scale Revised (5DCR) that scores LLMs across Information Seeking, Thrill Seeking, and Social Curiosity.
If this is right
- Prompting LLMs toward curious responses improves their performance on reasoning benchmarks.
- Curiosity measures correlate with better active learning outcomes in models.
- The framework offers a repeatable method for tracking curiosity-like traits across different model sizes and training regimes.
- Models can be guided to display human-comparable curiosity without fundamental architectural changes.
Where Pith is reading between the lines
- Reward signals that favor information-seeking during training could produce models that generalize more effectively to novel domains.
- The conservative bias observed here may stem from safety alignment and could be adjusted independently of core capabilities.
- Comparable questionnaires could be developed to test curiosity in multimodal or agentic systems beyond text-only LLMs.
Load-bearing premise
The adapted human questionnaire measures genuine curiosity in language models rather than artifacts of training data or prompt interpretation.
What would settle it
A controlled experiment that scores the same models on curiosity before and after targeted fine-tuning for greater risk tolerance, then measures whether reasoning accuracy and active-learning gains rise or fall as predicted.
Figures
read the original abstract
Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts the Five-Dimensional Curiosity scale Revised (5DCR) into an evaluation framework for LLMs covering dimensions including Information Seeking, Thrill Seeking, and Social Curiosity. It reports that LLMs score higher than humans on thirst for knowledge yet make more conservative choices under uncertainty, and that inducing curious behaviors improves LLM reasoning and active learning.
Significance. If the adapted questionnaire isolates intrinsic curiosity mechanisms rather than surface-level response patterns, the work would supply one of the first systematic empirical comparisons of curiosity-like traits between LLMs and humans, with direct implications for designing more exploratory and self-directed learning agents.
major comments (3)
- [Evaluation Framework and Results] The central claim that LLMs exhibit a stronger thirst for knowledge than humans rests on direct numerical comparison of 5DCR scores; however, the methods provide no ablation or control (e.g., paraphrased items, removal of curiosity-related wording, or comparison to models trained only on non-curiosity text) to rule out the possibility that higher scores reflect better instruction-following or memorization of similar questionnaire items rather than a curiosity drive.
- [Relationship between Curiosity and Thinking] The reported relationship between curiosity and enhanced reasoning/active learning lacks an independent operationalization of 'curious behaviors'; it is unclear whether the improvement is measured via controlled prompting experiments or post-hoc correlation, which is load-bearing for the claim that curiosity causally benefits model performance.
- [Experimental Setup] No statistical details (sample sizes per model, variance across prompt phrasings, or correction for multiple comparisons) are supplied for the human-LLM score comparisons, making it impossible to assess whether the reported differences are robust or sensitive to elicitation details.
minor comments (2)
- [Abstract] The abstract states results 'demonstrate' stronger thirst for knowledge; this phrasing should be softened to 'suggest' or 'indicate' until the measurement-validity concerns are addressed.
- [Evaluation Framework] Notation for the five dimensions is introduced without an explicit mapping table to the original 5DCR items; adding such a table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor of our evaluation framework and results. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
read point-by-point responses
-
Referee: [Evaluation Framework and Results] The central claim that LLMs exhibit a stronger thirst for knowledge than humans rests on direct numerical comparison of 5DCR scores; however, the methods provide no ablation or control (e.g., paraphrased items, removal of curiosity-related wording, or comparison to models trained only on non-curiosity text) to rule out the possibility that higher scores reflect better instruction-following or memorization of similar questionnaire items rather than a curiosity drive.
Authors: We agree that the original submission lacked explicit controls to isolate intrinsic curiosity from instruction-following or memorization effects. In the revised manuscript we have added a dedicated ablation study using paraphrased 5DCR items and neutral control prompts that remove curiosity-related language. These new results are reported in an expanded Methods section and show that score differences remain directionally consistent, though we now explicitly discuss the remaining limitations of such controls in the Discussion. revision: yes
-
Referee: [Relationship between Curiosity and Thinking] The reported relationship between curiosity and enhanced reasoning/active learning lacks an independent operationalization of 'curious behaviors'; it is unclear whether the improvement is measured via controlled prompting experiments or post-hoc correlation, which is load-bearing for the claim that curiosity causally benefits model performance.
Authors: We have revised the relevant section to make the experimental design explicit. Curious behaviors were operationalized independently through a set of controlled prompting strategies (e.g., explicit information-seeking instructions and uncertainty-encouraging prompts) that were applied before the reasoning and active-learning tasks. Performance was then compared against baseline prompts in a within-model controlled experiment. The revised text now clearly distinguishes this from any post-hoc correlation analysis. revision: yes
-
Referee: [Experimental Setup] No statistical details (sample sizes per model, variance across prompt phrasings, or correction for multiple comparisons) are supplied for the human-LLM score comparisons, making it impossible to assess whether the reported differences are robust or sensitive to elicitation details.
Authors: We acknowledge the omission. The revised manuscript now includes the requested statistical details: 50 independent responses per model per dimension, standard deviations computed across five distinct prompt phrasings, and Bonferroni correction applied to the multiple comparisons. These numbers and procedures are reported in the Experimental Setup and Results sections with accompanying tables. revision: yes
Circularity Check
No significant circularity in empirical evaluation of LLM curiosity
full rationale
The paper presents an empirical study that adapts the existing 5DCR human questionnaire into an evaluation framework for LLMs, then reports direct numerical comparisons of responses across curiosity dimensions and links to reasoning performance. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims rest on observed model outputs to fixed questionnaire items rather than any self-referential construction, renaming of known results, or load-bearing self-citations that reduce the central findings to inputs by definition. The evaluation is therefore self-contained against external human benchmarks without the circular patterns enumerated in the guidelines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Five-Dimensional Curiosity scale Revised (5DCR) can be meaningfully adapted to evaluate curiosity in LLMs
Reference graph
Works this paper leans on
-
[1]
Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437. George Loewenstein. 1994. The psychology of curios- ity: A review and reinterpretation.Psychological bulletin, 116(1):75. Georgy Meshcheryakov, Anna A Igolkina, and Maria G Samsonova. 2021. semopy 2: A structural equation modeling package with random effects in python. arXiv preprint arXiv:2106...
work page internal anchor Pith review Pith/arXiv arXiv 1994
-
[2]
Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021) were used for plotting
and SciPy (Virtanen et al., 2020). Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021) were used for plotting. Semopy (Meshcheryakov et al.,
work page 2020
-
[3]
was used for statistical analysis. All experiments were carried out on a worksta- tion equipped with 8×NVIDIA A800 GPUs run- ning Ubuntu 24.04.2 LTS, using PyTorch 2.6.0 with CUDA 12.9. In the experiments, we conducted in- ference and training of open-source models based on the MS-Swift (Zhao et al., 2025) framework. For inference, the default parameters ...
work page 2025
-
[4]
The defini- tion of the six dimensions is: • Joyous exploration
consists of 24 questions covering six dimen- sions, each containing four statements. The defini- tion of the six dimensions is: • Joyous exploration. This trait is character- ized by an individual’s readiness and enthusi- asm for experimenting with, embracing, and 1https://github.com/huggingface/transformers 11 Parameter Value num_train_epochs 3 train_typ...
work page 2048
-
[5]
Does not describe me at all
-
[6]
Somewhat describes me
-
[7]
Generally describes me
-
[8]
Completely describes me Statement: I view challenging situations as an opportunity to grow and learn. Table 7: The prompt template for 5DCR. Number % Age(t= 8.47,d f= 941,p < .001)Mean (SD) 35.63 (10.06) Gender(χ2 = 0.38,d f= 1,p=.540)Male266 51.80Female248 48.20 Race(χ2 = 0.68,d f= 4,p=.950)White380 78.70African American44 9.10Hispanic19 3.90Asian or Pac...
work page 2012
-
[9]
Xiao Li, Xiao Ting, and Xiao Liu definitely did not eat it
Third Witness: “Xiao Li, Xiao Ting, and Xiao Liu definitely did not eat it.” But only one witness tells the truth! Who could it be? Let’s test each possibility: If the First Witness is truthful: Then only Xiao Ming or Xiao Hong ate it. But the other two must be lying. 1.Second Witness lies: Not Xiao Huang or Xiao Li. 2.Third Witness lies: At least one of ...
-
[10]
Fora= 5 log2 3.4: - We can use the change of base formula for logarithms: a= 5 log2 3.4 = 5 log10 3.4 log10 2 - This simplifies our understanding but doesn’t directly help in comparison
-
[11]
Forb= 5 log4 3.6: - Using the property of logarithms thatlog b x= 1 logx b , we get: b= 5 log4 3.6 = 5 1 log3.6 4 = 5 1/log 3.6 4 - Wait, something is wrong here. I need to rethink this. Let’s try another approach: b= 5 log4 3.6 = 5 log22 3.6 = 5 1 2 log2 3.6
-
[12]
- The exponent for b is 1 2 log2 3.6
Forc= 1 5 log3 0.3 : - Using the property( 1 x )y =x −y: c= 1 5 log3 0.3 = 5 −log 3 0.3 - Applying the change of base formula again: c= 5 −log 3 0.3 = 5 − log10 0.3 log10 3 Now, let’s analyze these expressions more clearly: -a= 5 log2 3.4 -b= 5 1 2 log2 3.6 -c= 5 − log10 0.3 log10 3 Next, compare the exponents: - The exponent for a is log2 3.4. - The expo...
-
[13]
I should ensure that the logarithm bases are consistent
Fora= 5 log2 3.4, we need to express this in terms of base 5: a= 5 log2 3.4 = 5 log5 3.4 log5 2 Wait, something is wrong here. I should ensure that the logarithm bases are consistent. Let’s correct it: a= 5 log2 3.4 = 5 log5 3.4 log5 2
-
[14]
Forb= 5 log4 3.6, similarly convert to base 5: b= 5 log4 3.6 = 5 log5 3.6 log5 4
-
[15]
Forc= 1 5 log3 0.3 , convert to base 5: c= 1 5 log3 0.3 = 5 −log 3 0.3 Next, let’s simplify each term further:
-
[16]
Fora: a= 5 log2 3.4 = 5 log5 3.4 log5 2
-
[17]
Forb: b= 5 log4 3.6 = 5 log5 3.6 log5 4
-
[18]
Forc: c= 5 −log 3 0.3 = 5 log3 1 0.3 = 5 log3 10 3 Now, compare the exponents: 1.log 2 3.4vs log5 3.6 log5 4 : Sincelog 5 3.6≈0.857andlog 5 4≈0.861, we have: log5 3.6 log5 4 ≈ 0.857 0.861 ≈0.996 Therefore,log 2 3.4> log5 3.6 log5 4 . 2.log 3 10 3 vs log5 3.6 log5 4 : Sincelog 3 10 3 ≈0.966and log5 3.6 log5 4 ≈0.996, we have: log3 10 3 < log5 3.6 log5 4 Th...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.