Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Experiments
Pith reviewed 2026-05-22 15:29 UTC · model grok-4.3
The pith
Enabling thinking traces in LLMs shifts their exploration-exploitation behavior closer to that of humans in multi-armed bandit experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By comparing LLMs, humans, and standard algorithms in canonical multi-armed bandit tasks, the authors show that thinking-enabled LLMs exhibit a mix of random and directed exploration similar to humans in stationary bandits. In non-stationary bandits, LLMs demonstrate weaker directed exploration and adaptability compared to humans, even when overall regret is comparable in some cases. This is quantified using interpretable choice models that separate the types of exploration.
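For reference, regret in a bandit task is conventionally the expected reward lost relative to always choosing the best arm; in a non-stationary task the best arm is re-evaluated at every trial. The Python sketch below computes it from hypothetical inputs (`arm_means` and `choices` are names introduced here for illustration, not taken from the paper).

```python
def cumulative_regret(arm_means, choices):
    """Return the running regret after each trial.

    arm_means[t][a] is the true mean reward of arm a at trial t (it may
    change over time in a non-stationary task); choices[t] is the arm picked.
    Both inputs are hypothetical placeholders for illustration.
    """
    running, total = [], 0.0
    for means, arm in zip(arm_means, choices):
        # Per-trial regret: best available mean minus the chosen arm's mean.
        total += max(means) - means[arm]
        running.append(total)
    return running
```

Two agents can end with similar regret while mixing random and directed exploration differently, which is why the comparison also relies on fitted choice models rather than regret alone.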
What carries the argument
Interpretable choice models that quantify random exploration versus directed exploration in sequential choice tasks under uncertainty.
If this is right
- LLMs with thinking enabled can simulate human exploration strategies effectively in stable uncertain environments.
- LLMs may need enhancements to match human performance in adapting to changing reward structures.
- Using LLMs for automated decision-making shows promise in simple bandit-like settings but requires caution in dynamic ones.
- Insights from these comparisons can guide improvements in prompting and model design for better human-like decision making.
Where Pith is reading between the lines
- Future LLM training could incorporate human decision data to improve adaptability in non-stationary tasks.
- Similar comparisons could be applied to other sequential decision domains such as reinforcement learning environments.
- The observed limits might indicate that current LLMs lack mechanisms humans use for tracking environmental changes over time.
Load-bearing premise
The interpretable choice models accurately capture the decision strategies of humans and LLMs without bias from model misspecification or task artifacts.
What would settle it
Re-running the stationary bandit experiments and finding that thinking-enabled LLMs display significantly different levels of random and directed exploration than humans would challenge the similarity claim.
Original abstract
Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making settings. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) experiments introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how enabling thinking traces, through both prompting strategies and thinking models, shapes LLM decision-making. We find that enabling thinking in LLMs shifts their behavior toward more human-like behavior, characterized by a mix of random and directed exploration. In a simple stationary setting, thinking-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares exploration-exploitation strategies of LLMs, humans, and standard MAB algorithms using canonical multi-armed bandit tasks from cognitive science. It fits interpretable choice models (variants of softmax with value and uncertainty terms) to choice sequences to quantify random versus directed exploration, and tests how enabling thinking traces via prompting and thinking models shifts LLM behavior. The central claim is that thinking-enabled LLMs become more human-like, showing comparable random and directed exploration in stationary bandits but weaker adaptability and directed exploration in non-stationary environments, while sometimes achieving similar regret.
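As an illustration of what a softmax with value and uncertainty terms can look like, one commonly used form is sketched below. The symbols are assumptions introduced here: Q_t(a) is the estimated value of arm a at trial t, \sigma_t(a) its posterior uncertainty, \beta an inverse temperature, and \varphi an uncertainty bonus. The paper's exact parameterization is not given in this summary and may differ.

```latex
P(a_t = a) \;=\;
\frac{\exp\!\big(\beta\,[\,Q_t(a) + \varphi\,\sigma_t(a)\,]\big)}
     {\sum_{a'} \exp\!\big(\beta\,[\,Q_t(a') + \varphi\,\sigma_t(a')\,]\big)}
```

Under this form, a smaller \beta makes choices noisier regardless of what is known about each arm (random exploration), while a larger \varphi shifts choices toward arms whose value is more uncertain (directed exploration).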
Significance. If the quantitative results and model fits hold, the work offers useful evidence on the promise and limits of LLMs as human-behavior simulators in sequential decision tasks. The adoption of established bandit paradigms and interpretable models allows mechanistic comparison of strategies rather than black-box performance metrics. The finding that explicit thinking shifts LLM exploration patterns is a concrete, testable contribution that could inform both cognitive modeling and LLM prompting design.
major comments (2)
- [§3.2 and §4.2] §3.2 (Choice Models) and §4.2 (Non-stationary results): The central claim that thinking-enabled LLMs exhibit weaker directed exploration in non-stationary settings rests on the assumption that the fitted parametric models accurately recover the underlying strategies for both humans and LLMs. If LLM policies contain prompt artifacts, token biases, or non-stationary internal states outside the model class, the recovered parameters could systematically misattribute the observed pattern without reflecting true adaptability limits. A direct test (e.g., simulation recovery or alternative model classes) is needed to rule out misspecification as the source of the reported difference.
- [§4 and Tables 2-3] §4 (Results) and Table 2/3: The manuscript reports quantitative comparisons of exploration parameters and regret but provides insufficient detail on LLM sample sizes per condition, number of independent runs, statistical tests for parameter differences, and pre-registered robustness checks. Without these, it is difficult to evaluate whether the claimed similarity in stationary settings and difference in non-stationary settings are reliable or sensitive to post-hoc modeling choices.
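The simulation-recovery test requested in the first major comment could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes a softmax choice rule with an inverse temperature `beta` and an uncertainty bonus `phi`, a Gaussian bandit with Kalman-style belief updates, and a coarse grid search in place of a proper optimizer. All function and parameter names are hypothetical.

```python
import math
import random

def choice_probs(means, sigmas, beta, phi):
    """Softmax over value plus uncertainty bonus (assumed model form)."""
    utils = [beta * (m + phi * s) for m, s in zip(means, sigmas)]
    top = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]

def simulate(beta, phi, true_means, n_trials, rng,
             noise_var=1.0, prior_var=4.0):
    """Generate synthetic choices from known parameters."""
    k = len(true_means)
    means = [0.0] * k
    variances = [prior_var] * k
    trials = []
    for _ in range(n_trials):
        sigmas = [math.sqrt(v) for v in variances]
        probs = choice_probs(means, sigmas, beta, phi)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = rng.gauss(true_means[arm], math.sqrt(noise_var))
        # Store the beliefs the agent held *before* seeing the reward.
        trials.append((list(means), sigmas, arm))
        # Gaussian conjugate (Kalman) update for the chosen arm only.
        gain = variances[arm] / (variances[arm] + noise_var)
        means[arm] += gain * (reward - means[arm])
        variances[arm] *= (1.0 - gain)
    return trials

def log_likelihood(trials, beta, phi):
    """Log-probability of the observed choices under (beta, phi)."""
    return sum(
        math.log(max(choice_probs(m, s, beta, phi)[a], 1e-12))
        for m, s, a in trials
    )

def fit_grid(trials, betas, phis):
    """Return the (beta, phi) grid point with the highest likelihood."""
    return max(
        ((b, p) for b in betas for p in phis),
        key=lambda bp: log_likelihood(trials, bp[0], bp[1]),
    )
```

A recovery check would call `simulate` with known parameters, re-fit with `fit_grid`, and repeat over many seeds to estimate bias and the correlation between true and recovered values. Poor recovery of `phi` in the non-stationary setting would point to the misspecification concern raised above.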
minor comments (2)
- [Abstract] Abstract: The phrase 'similar regret in certain scenarios' is vague; specify the exact scenarios, regret values, and whether they survive correction for multiple comparisons.
- [§2] §2 (Related Work): The citation list for human bandit literature is adequate, but recent LLM-specific sequential decision papers should be added for context.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the paper accordingly to improve methodological transparency and robustness.
Point-by-point responses
Referee: [§3.2 and §4.2] §3.2 (Choice Models) and §4.2 (Non-stationary results): The central claim that thinking-enabled LLMs exhibit weaker directed exploration in non-stationary settings rests on the assumption that the fitted parametric models accurately recover the underlying strategies for both humans and LLMs. If LLM policies contain prompt artifacts, token biases, or non-stationary internal states outside the model class, the recovered parameters could systematically misattribute the observed pattern without reflecting true adaptability limits. A direct test (e.g., simulation recovery or alternative model classes) is needed to rule out misspecification as the source of the reported difference.
Authors: We appreciate the referee's emphasis on validating the choice models against potential misspecification. In response, we conducted parameter recovery simulations by generating synthetic choice data from the fitted models (using both human-derived and LLM-derived parameters) and re-fitting the models to assess recovery accuracy. These simulations show low bias and high correlation between true and recovered parameters for both random and directed exploration terms. We also tested an extended model class incorporating a prompt-bias term and found that the reported differences in directed exploration for non-stationary bandits remain significant. These new analyses and results will be added to a revised §3.2 and a new appendix. We agree this strengthens the central claim.
Revision: yes
Referee: [§4 and Tables 2-3] §4 (Results) and Table 2/3: The manuscript reports quantitative comparisons of exploration parameters and regret but provides insufficient detail on LLM sample sizes per condition, number of independent runs, statistical tests for parameter differences, and pre-registered robustness checks. Without these, it is difficult to evaluate whether the claimed similarity in stationary settings and difference in non-stationary settings are reliable or sensitive to post-hoc modeling choices.
Authors: We agree that greater detail on the experimental and statistical procedures is necessary for reproducibility and evaluation. The revised manuscript now includes expanded descriptions in §4 and the methods section specifying the LLM sample sizes (50-100 independent runs per model-prompt-condition combination, depending on the bandit task), the number of independent replications, and the statistical tests employed (e.g., Welch's t-tests with FDR correction for between-group parameter comparisons, with effect sizes and confidence intervals reported). We have also added a dedicated robustness subsection detailing sensitivity analyses to prompt variations, temperature settings, and model versions. Although the original study was not pre-registered, these post-hoc checks confirm the stability of the key findings on exploration parameters and regret. Updated Tables 2 and 3 will incorporate these details and p-values.
Revision: yes
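The statistical procedure named in the response above (Welch's t-tests with FDR correction) can be illustrated with a standard-library Python sketch. Two assumptions apply: the rebuttal does not say which FDR procedure was used, so Benjamini-Hochberg is shown as one common choice, and the function names are hypothetical. Converting the t statistic to a p-value needs a Student-t CDF, which the standard library lacks, so that step is left out.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal variances.
    """
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na  # squared standard error, group A
    vb = statistics.variance(sample_b) / nb  # squared standard error, group B
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In a comparison like the one described, `welch_t` would be applied once per fitted parameter (for example, human versus LLM estimates of a directed-exploration weight), and the resulting p-values would then be passed together to `benjamini_hochberg`.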
Circularity Check
No significant circularity; analysis relies on external benchmarks and prior models
Full rationale
The paper's derivation chain applies canonical multi-armed bandit tasks from the cognitive science literature and fits previously published interpretable choice models to observed choice sequences from humans and LLMs. Key quantities such as levels of random and directed exploration are recovered by applying these external models rather than being defined in terms of the authors' own fitted parameters or predictions. No self-definitional reductions, fitted inputs called predictions, load-bearing self-citations, or ansatzes smuggled via citation are present; the central claims about thinking traces and adaptability emerge from empirical comparisons against independent human data and standard methods.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The canonical multi-armed bandit tasks and choice models from prior cognitive-science literature correctly isolate random versus directed exploration.