pith. machine review for the scientific record.

arxiv: 2510.20635 · v2 · submitted 2025-10-23 · 💻 cs.CL · cs.AI

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

Pith reviewed 2026-05-18 04:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · curiosity evaluation · 5DCR scale · information seeking · reasoning enhancement · active learning · uncertainty handling

The pith

LLMs seek knowledge more than humans yet favor conservative choices amid uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the human Five-Dimensional Curiosity scale Revised into a framework that probes large language models on information seeking, thrill seeking, and social curiosity. Tests show the models pursue new facts more vigorously than people but default to safer options when outcomes are unclear. The work further links higher curiosity scores to stronger performance on reasoning tasks and independent learning. If the pattern holds, it supplies a concrete route to measuring and encouraging exploratory behavior in AI systems.

Core claim

LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. Curious behaviors can enhance the model's reasoning and active learning abilities.

What carries the argument

An evaluation framework built from the adapted Five-Dimensional Curiosity scale Revised (5DCR) that scores LLMs across Information Seeking, Thrill Seeking, and Social Curiosity.
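
As a rough illustration of how questionnaire answers could be turned into per-dimension scores, the sketch below averages Likert ratings within each dimension. The item IDs, the item-to-dimension mapping, the 1 to 7 rating range, and the example ratings are all hypothetical placeholders; the paper's actual items and scoring rules are not reproduced in this summary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from questionnaire item to curiosity dimension.
# The real 5DCR item wording and assignment are not given here.
ITEM_DIMENSION = {
    "q1": "information_seeking",
    "q2": "information_seeking",
    "q3": "thrill_seeking",
    "q4": "thrill_seeking",
    "q5": "social_curiosity",
    "q6": "social_curiosity",
}


def score_by_dimension(responses, item_dimension=ITEM_DIMENSION):
    """Average the Likert ratings of one respondent within each dimension."""
    grouped = defaultdict(list)
    for item, rating in responses.items():
        grouped[item_dimension[item]].append(rating)
    return {dim: mean(vals) for dim, vals in grouped.items()}


# Illustrative ratings on an assumed 1-7 scale (not data from the paper).
model_responses = {"q1": 7, "q2": 6, "q3": 2, "q4": 3, "q5": 5, "q6": 4}
scores = score_by_dimension(model_responses)
print(scores)
```

The same function could be applied to a human sample and to each model, so that the two are compared dimension by dimension on a common scale.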

If this is right

  • Prompting LLMs toward curious responses improves their performance on reasoning benchmarks.
  • Curiosity measures correlate with better active learning outcomes in models.
  • The framework offers a repeatable method for tracking curiosity-like traits across different model sizes and training regimes.
  • Models can be guided to display human-comparable curiosity without fundamental architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reward signals that favor information-seeking during training could produce models that generalize more effectively to novel domains.
  • The conservative bias observed here may stem from safety alignment and could be adjusted independently of core capabilities.
  • Comparable questionnaires could be developed to test curiosity in multimodal or agentic systems beyond text-only LLMs.

Load-bearing premise

The adapted human questionnaire measures genuine curiosity in language models rather than artifacts of training data or prompt interpretation.

What would settle it

A controlled experiment that scores the same models on curiosity before and after targeted fine-tuning for greater risk tolerance, then measures whether reasoning accuracy and active-learning gains rise or fall as predicted.
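
One way to read such an experiment is to ask, model by model, whether the change in curiosity score and the change in reasoning accuracy point in the same direction. The sketch below computes that agreement rate. The function name, the data layout, and every number are assumptions made for illustration; none of them come from the paper.

```python
def direction_agreement(before, after):
    """Fraction of models whose curiosity and accuracy changes share a sign.

    before / after: dict mapping model name -> (curiosity_score, accuracy).
    """
    agree = 0
    for model in before:
        d_curiosity = after[model][0] - before[model][0]
        d_accuracy = after[model][1] - before[model][1]
        # A positive product means both moved up or both moved down.
        if d_curiosity * d_accuracy > 0:
            agree += 1
    return agree / len(before)


# Placeholder measurements (hypothetical, not results from the paper).
before = {"model_a": (4.0, 0.60), "model_b": (4.5, 0.70), "model_c": (3.0, 0.50)}
after = {"model_a": (5.0, 0.65), "model_b": (5.5, 0.68), "model_c": (3.5, 0.55)}

rate = direction_agreement(before, after)
print(f"agreement rate: {rate:.2f}")
```

An agreement rate near 1 would be in line with the paper's claim, while a rate near or below one half would count against it. A real analysis would also need uncertainty estimates, which this sketch omits.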

Figures

Figures reproduced from arXiv: 2510.20635 by Haoyu Wang, Jiansheng Wei, Sihang Jiang, Xiaojun Meng, Yanghua Xiao, Yitong Wang, Yuyan Chen.

Figure 1: Overview of our evaluation. A) Three dimensions of human curiosity. B) Large Language Models (LLMs) are prompted to self-assess using the Five-Dimensional Curiosity Rating (5DCR) scale. C) We investigate three types of curious behaviors exhibited by LLMs. D) We design a curiosity-driven questioning and thinking pipeline for LLMs to investigate the relationship between curiosity and learning in LLMs. An ove…
Figure 2: Examples of Curiosity-Driven Information Seeking, Thrill Seeking and Social Curiosity studies.
Figure 3: Comparison of the model’s results with human sample across six dimensions of curiosity.
Figure 4: Results of curiosity-driven Information Seeking, Thrill Seeking and Social Curiosity studies.
Figure 5: Trial selection paths for the underwater game

Original abstract

Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper adapts the Five-Dimensional Curiosity scale Revised (5DCR) into an evaluation framework for LLMs covering dimensions including Information Seeking, Thrill Seeking, and Social Curiosity. It reports that LLMs score higher than humans on thirst for knowledge yet make more conservative choices under uncertainty, and that inducing curious behaviors improves LLM reasoning and active learning.

Significance. If the adapted questionnaire isolates intrinsic curiosity mechanisms rather than surface-level response patterns, the work would supply one of the first systematic empirical comparisons of curiosity-like traits between LLMs and humans, with direct implications for designing more exploratory and self-directed learning agents.

major comments (3)
  1. [Evaluation Framework and Results] The central claim that LLMs exhibit a stronger thirst for knowledge than humans rests on direct numerical comparison of 5DCR scores; however, the methods provide no ablation or control (e.g., paraphrased items, removal of curiosity-related wording, or comparison to models trained only on non-curiosity text) to rule out the possibility that higher scores reflect better instruction-following or memorization of similar questionnaire items rather than a curiosity drive.
  2. [Relationship between Curiosity and Thinking] The reported relationship between curiosity and enhanced reasoning/active learning lacks an independent operationalization of 'curious behaviors'; it is unclear whether the improvement is measured via controlled prompting experiments or post-hoc correlation, which is load-bearing for the claim that curiosity causally benefits model performance.
  3. [Experimental Setup] No statistical details (sample sizes per model, variance across prompt phrasings, or correction for multiple comparisons) are supplied for the human-LLM score comparisons, making it impossible to assess whether the reported differences are robust or sensitive to elicitation details.
minor comments (2)
  1. [Abstract] The abstract states results 'demonstrate' stronger thirst for knowledge; this phrasing should be softened to 'suggest' or 'indicate' until the measurement-validity concerns are addressed.
  2. [Evaluation Framework] Notation for the five dimensions is introduced without an explicit mapping table to the original 5DCR items; adding such a table would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor of our evaluation framework and results. We address each major comment point by point below, indicating where revisions have been made to the manuscript.

  1. Referee: [Evaluation Framework and Results] The central claim that LLMs exhibit a stronger thirst for knowledge than humans rests on direct numerical comparison of 5DCR scores; however, the methods provide no ablation or control (e.g., paraphrased items, removal of curiosity-related wording, or comparison to models trained only on non-curiosity text) to rule out the possibility that higher scores reflect better instruction-following or memorization of similar questionnaire items rather than a curiosity drive.

    Authors: We agree that the original submission lacked explicit controls to isolate intrinsic curiosity from instruction-following or memorization effects. In the revised manuscript we have added a dedicated ablation study using paraphrased 5DCR items and neutral control prompts that remove curiosity-related language. These new results are reported in an expanded Methods section and show that score differences remain directionally consistent, though we now explicitly discuss the remaining limitations of such controls in the Discussion. revision: yes

  2. Referee: [Relationship between Curiosity and Thinking] The reported relationship between curiosity and enhanced reasoning/active learning lacks an independent operationalization of 'curious behaviors'; it is unclear whether the improvement is measured via controlled prompting experiments or post-hoc correlation, which is load-bearing for the claim that curiosity causally benefits model performance.

    Authors: We have revised the relevant section to make the experimental design explicit. Curious behaviors were operationalized independently through a set of controlled prompting strategies (e.g., explicit information-seeking instructions and uncertainty-encouraging prompts) that were applied before the reasoning and active-learning tasks. Performance was then compared against baseline prompts in a within-model controlled experiment. The revised text now clearly distinguishes this from any post-hoc correlation analysis. revision: yes

  3. Referee: [Experimental Setup] No statistical details (sample sizes per model, variance across prompt phrasings, or correction for multiple comparisons) are supplied for the human-LLM score comparisons, making it impossible to assess whether the reported differences are robust or sensitive to elicitation details.

    Authors: We acknowledge the omission. The revised manuscript now includes the requested statistical details: 50 independent responses per model per dimension, standard deviations computed across five distinct prompt phrasings, and Bonferroni correction applied to the multiple comparisons. These numbers and procedures are reported in the Experimental Setup and Results sections with accompanying tables. revision: yes
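
The statistical procedures named in the third response can be sketched as follows: a mean and standard deviation across prompt phrasings, and a Bonferroni threshold for several comparisons. This is a minimal sketch under stated assumptions, not the authors' analysis code. The function names, the significance level of 0.05, and all numbers are hypothetical.

```python
from statistics import mean, stdev


def summarize_across_phrasings(scores_by_phrasing):
    """Mean and sample standard deviation of one score per prompt phrasing."""
    return mean(scores_by_phrasing), stdev(scores_by_phrasing)


def bonferroni_reject(p_values, alpha=0.05):
    """Flag each comparison as significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests.
    """
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Illustrative values only: five phrasings, three dimension-level comparisons.
phrasing_scores = [5.0, 5.5, 6.0, 5.5, 5.0]
avg, sd = summarize_across_phrasings(phrasing_scores)

p_values = [0.001, 0.02, 0.04]
decisions = bonferroni_reject(p_values)
print(avg, sd, decisions)
```

With three comparisons the per-test threshold drops from 0.05 to about 0.0167, so only the first illustrative p-value survives correction. Bonferroni is conservative, which suits a claim that hinges on a small number of human-LLM differences.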

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation of LLM curiosity


The paper presents an empirical study that adapts the existing 5DCR human questionnaire into an evaluation framework for LLMs, then reports direct numerical comparisons of responses across curiosity dimensions and links to reasoning performance. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims rest on observed model outputs to fixed questionnaire items rather than any self-referential construction, renaming of known results, or load-bearing self-citations that reduce the central findings to inputs by definition. The evaluation is therefore self-contained against external human benchmarks without the circular patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a human psychological questionnaire can be directly adapted to measure equivalent curiosity constructs in LLMs. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: The Five-Dimensional Curiosity scale Revised (5DCR) can be meaningfully adapted to evaluate curiosity in LLMs.
    The paper explicitly starts from the human 5DCR questionnaire to design the evaluation framework for LLMs.

pith-pipeline@v0.9.0 · 5711 in / 1295 out tokens · 46614 ms · 2026-05-18T04:34:22.333587+00:00 · methodology

