Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Experiments

Darcy Wang; Ningyuan Chen; Rodrigo Mansur; Vahid Sarhangian; Ziyuan Zhang

arXiv: 2505.09901 · v3 · submitted 2025-05-15 · 💻 cs.LG · cs.AI · cs.CL · cs.HC

Ziyuan Zhang, Darcy Wang, Ningyuan Chen, Rodrigo Mansur, Vahid Sarhangian

Pith reviewed 2026-05-22 15:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.HC

keywords: large language models · multi-armed bandits · exploration-exploitation · human behavior simulation · thinking prompts · non-stationary environments · decision making · regret

The pith

Enabling thinking traces in LLMs shifts their exploration-exploitation behavior closer to that of humans in multi-armed bandit experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models make decisions like humans when facing the exploration-exploitation trade-off in multi-armed bandit problems. It runs standard bandit experiments from the cognitive science and psychiatry literature and finds that enabling thinking, either through prompting strategies or through dedicated thinking models, makes LLM choices more similar to human choices, with comparable levels of random and directed exploration in stationary environments. In non-stationary settings, where rewards change over time, LLMs adapt less effectively than humans, even though they reach similar regret in some scenarios. This matters because LLMs are increasingly used to stand in for human behavior in decision tasks, so knowing where they align and where they diverge helps assess their usefulness.

Core claim

By comparing LLMs, humans, and standard algorithms in canonical multi-armed bandit tasks, the authors show that thinking-enabled LLMs exhibit a mix of random and directed exploration similar to humans in stationary bandits. In non-stationary bandits, LLMs demonstrate weaker directed exploration and adaptability compared to humans, even when overall regret is comparable in some cases. This is quantified using interpretable choice models that separate the types of exploration.
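The regret comparison in the claim above can be made concrete with a minimal sketch. It assumes per-trial expected rewards are known to the analyst; the function name `cumulative_regret` and all numbers are hypothetical and are not taken from the paper. In a stationary bandit the best arm is fixed, while in a non-stationary one the benchmark arm can change from trial to trial.

```python
from typing import List, Sequence


def cumulative_regret(choices: Sequence[int],
                      means: Sequence[Sequence[float]]) -> List[float]:
    """Running pseudo-regret of a choice sequence.

    choices[t] is the arm pulled on trial t; means[t][a] is the expected
    reward of arm a on trial t. In a stationary bandit every row of
    `means` is identical; in a non-stationary one the rows change.
    """
    total = 0.0
    curve = []
    for arm, row in zip(choices, means):
        total += max(row) - row[arm]  # gap to the best arm on this trial
        curve.append(total)
    return curve


# Illustrative two-armed examples (assumed values, not results from the paper).
stationary = [[0.3, 0.7]] * 4
print(cumulative_regret([0, 1, 1, 1], stationary))

# The better arm switches after trial 2; an agent that keeps pulling
# arm 1 pays regret until it notices the change.
switching = [[0.3, 0.7], [0.3, 0.7], [0.7, 0.3], [0.7, 0.3]]
print(cumulative_regret([1, 1, 1, 0], switching))
```

Two agents can end with the same cumulative regret while following different strategies, which is why the page stresses choice-model parameters alongside regret.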

What carries the argument

Interpretable choice models that quantify random exploration versus directed exploration in sequential choice tasks under uncertainty.

If this is right

LLMs with thinking enabled can simulate human exploration strategies effectively in stable uncertain environments.
LLMs may need enhancements to match human performance in adapting to changing reward structures.
Using LLMs for automated decision-making shows promise in simple bandit-like settings but requires caution in dynamic ones.
Insights from these comparisons can guide improvements in prompting and model design for better human-like decision making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future LLM training could incorporate human decision data to improve adaptability in non-stationary tasks.
Similar comparisons could be applied to other sequential decision domains such as reinforcement learning environments.
The observed limits might indicate that current LLMs lack mechanisms humans use for tracking environmental changes over time.

Load-bearing premise

The interpretable choice models accurately capture the decision strategies of humans and LLMs without bias from model misspecification or task artifacts.
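One common way to probe this premise is a parameter-recovery check: simulate choices from a model with known parameters, refit the model, and see whether the estimates land near the truth. The sketch below is an assumption-laden illustration, not the authors' procedure. The softmax form, the parameter names `beta` and `phi`, the Beta(1,1) value estimate, the `1/sqrt(n+1)` uncertainty proxy, and the grid search are all hypothetical choices.

```python
import math
import random
from typing import List, Tuple

Trial = Tuple[List[float], List[float], int]  # (values, uncertainties, chosen arm)


def softmax_probs(values, uncertainties, beta, phi):
    """Choice probabilities from utility = beta * value + phi * uncertainty."""
    utilities = [beta * v + phi * u for v, u in zip(values, uncertainties)]
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in utilities]
    norm = sum(weights)
    return [w / norm for w in weights]


def simulate(beta, phi, arm_means, n_trials, rng) -> List[Trial]:
    """Generate choices from the model on a stationary Bernoulli bandit."""
    wins = [0] * len(arm_means)
    pulls = [0] * len(arm_means)
    trials = []
    for _ in range(n_trials):
        values = [(w + 1) / (n + 2) for w, n in zip(wins, pulls)]  # Beta(1,1) posterior mean
        uncertainties = [1 / math.sqrt(n + 1) for n in pulls]      # shrinks with experience
        probs = softmax_probs(values, uncertainties, beta, phi)
        arm = rng.choices(range(len(arm_means)), weights=probs)[0]
        trials.append((values, uncertainties, arm))
        pulls[arm] += 1
        wins[arm] += int(rng.random() < arm_means[arm])
    return trials


def log_likelihood(trials, beta, phi):
    return sum(math.log(softmax_probs(v, u, beta, phi)[arm]) for v, u, arm in trials)


def fit(trials, beta_grid, phi_grid):
    """Maximum-likelihood estimate by exhaustive grid search."""
    return max(((b, p) for b in beta_grid for p in phi_grid),
               key=lambda bp: log_likelihood(trials, *bp))


rng = random.Random(0)
true_beta, true_phi = 4.0, 1.0
data: List[Trial] = []
for _ in range(20):  # 20 independent blocks of 50 trials each
    data += simulate(true_beta, true_phi, [0.4, 0.6], 50, rng)

beta_grid = [0.5 * k for k in range(1, 17)]   # 0.5 ... 8.0
phi_grid = [0.25 * k for k in range(-4, 13)]  # -1.0 ... 3.0
est_beta, est_phi = fit(data, beta_grid, phi_grid)
print("true:", (true_beta, true_phi), "recovered:", (est_beta, est_phi))
```

Good recovery on synthetic data only shows that the parameters are identifiable when the model is correct. It does not rule out misspecification when the real agent, human or LLM, follows a policy outside the model class.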

What would settle it

Re-running the stationary bandit experiments and finding that thinking-enabled LLMs display significantly different levels of random and directed exploration than humans would challenge the similarity claim.

Original abstract

Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making settings. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) experiments introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how enabling thinking traces, through both prompting strategies and thinking models, shapes LLM decision-making. We find that enabling thinking in LLMs shifts their behavior toward more human-like behavior, characterized by a mix of random and directed exploration. In a simple stationary setting, thinking-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Thinking traces nudge LLMs toward human-like exploration in stationary bandits, but the choice-model decomposition looks shaky for non-stationary cases.

The main thing to know is that this paper runs standard multi-armed bandit tasks on LLMs with and without explicit thinking, then fits the same interpretable choice models used in human studies to compare random versus directed exploration. In stationary settings the thinking-enabled LLMs land close to human levels on both components; in non-stationary ones they show weaker directed exploration even when regret is comparable. That is the concrete result worth noting.

Referee Report

2 major / 2 minor

Summary. The paper compares exploration-exploitation strategies of LLMs, humans, and standard MAB algorithms using canonical multi-armed bandit tasks from cognitive science. It fits interpretable choice models (variants of softmax with value and uncertainty terms) to choice sequences to quantify random versus directed exploration, and tests how enabling thinking traces via prompting and thinking models shifts LLM behavior. The central claim is that thinking-enabled LLMs become more human-like, showing comparable random and directed exploration in stationary bandits but weaker adaptability and directed exploration in non-stationary environments, while sometimes achieving similar regret.
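One way to read "softmax with value and uncertainty terms" is sketched below. The functional form, the parameter names `beta` and `phi`, and the example numbers are assumptions made for illustration; the paper's exact specification is not reproduced on this page. In this reading, a low `beta` flattens the choice probabilities (random exploration) and a positive `phi` shifts probability toward less-known arms (directed exploration).

```python
import math
from typing import List, Sequence


def choice_probabilities(values: Sequence[float],
                         uncertainties: Sequence[float],
                         beta: float,
                         phi: float) -> List[float]:
    """Softmax over beta * value + phi * uncertainty for each arm.

    beta: inverse temperature; smaller values give noisier choices.
    phi:  uncertainty bonus; larger values favour less-explored arms.
    """
    utilities = [beta * v + phi * u for v, u in zip(values, uncertainties)]
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in utilities]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical state: arm 0 looks slightly better, arm 1 is less well known.
values = [0.6, 0.5]
uncertainties = [0.1, 0.4]

print(choice_probabilities(values, uncertainties, beta=5.0, phi=0.0))  # value only
print(choice_probabilities(values, uncertainties, beta=0.5, phi=0.0))  # noisier
print(choice_probabilities(values, uncertainties, beta=5.0, phi=3.0))  # uncertainty bonus
```

Fitting `beta` and `phi` to observed choice sequences is what lets the two kinds of exploration be compared across humans, LLMs, and algorithms on a common scale.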

Significance. If the quantitative results and model fits hold, the work offers useful evidence on the promise and limits of LLMs as human-behavior simulators in sequential decision tasks. The adoption of established bandit paradigms and interpretable models allows mechanistic comparison of strategies rather than black-box performance metrics. The finding that explicit thinking shifts LLM exploration patterns is a concrete, testable contribution that could inform both cognitive modeling and LLM prompting design.

major comments (2)

[§3.2 and §4.2] §3.2 (Choice Models) and §4.2 (Non-stationary results): The central claim that thinking-enabled LLMs exhibit weaker directed exploration in non-stationary settings rests on the assumption that the fitted parametric models accurately recover the underlying strategies for both humans and LLMs. If LLM policies contain prompt artifacts, token biases, or non-stationary internal states outside the model class, the recovered parameters could systematically misattribute the observed pattern without reflecting true adaptability limits. A direct test (e.g., simulation recovery or alternative model classes) is needed to rule out misspecification as the source of the reported difference.
[§4 and Tables 2-3] §4 (Results) and Table 2/3: The manuscript reports quantitative comparisons of exploration parameters and regret but provides insufficient detail on LLM sample sizes per condition, number of independent runs, statistical tests for parameter differences, and pre-registered robustness checks. Without these, it is difficult to evaluate whether the claimed similarity in stationary settings and difference in non-stationary settings are reliable or sensitive to post-hoc modeling choices.

minor comments (2)

[Abstract] Abstract: The phrase 'similar regret in certain scenarios' is vague; specify the exact scenarios, regret values, and whether they survive correction for multiple comparisons.
[§2] §2 (Related Work): The citation list for human bandit literature is adequate, but recent LLM-specific sequential decision papers should be added for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the paper accordingly to improve methodological transparency and robustness.

Referee: [§3.2 and §4.2] §3.2 (Choice Models) and §4.2 (Non-stationary results): The central claim that thinking-enabled LLMs exhibit weaker directed exploration in non-stationary settings rests on the assumption that the fitted parametric models accurately recover the underlying strategies for both humans and LLMs. If LLM policies contain prompt artifacts, token biases, or non-stationary internal states outside the model class, the recovered parameters could systematically misattribute the observed pattern without reflecting true adaptability limits. A direct test (e.g., simulation recovery or alternative model classes) is needed to rule out misspecification as the source of the reported difference.

Authors: We appreciate the referee's emphasis on validating the choice models against potential misspecification. In response, we conducted parameter recovery simulations by generating synthetic choice data from the fitted models (using both human-derived and LLM-derived parameters) and re-fitting the models to assess recovery accuracy. These simulations show low bias and high correlation between true and recovered parameters for both random and directed exploration terms. We also tested an extended model class incorporating a prompt-bias term and found that the reported differences in directed exploration for non-stationary bandits remain significant. These new analyses and results will be added to a revised §3.2 and a new appendix. We agree this strengthens the central claim.
Revision: yes
Referee: [§4 and Tables 2-3] §4 (Results) and Table 2/3: The manuscript reports quantitative comparisons of exploration parameters and regret but provides insufficient detail on LLM sample sizes per condition, number of independent runs, statistical tests for parameter differences, and pre-registered robustness checks. Without these, it is difficult to evaluate whether the claimed similarity in stationary settings and difference in non-stationary settings are reliable or sensitive to post-hoc modeling choices.

Authors: We agree that greater detail on the experimental and statistical procedures is necessary for reproducibility and evaluation. The revised manuscript now includes expanded descriptions in §4 and the methods section specifying the LLM sample sizes (50-100 independent runs per model-prompt-condition combination, depending on the bandit task), the number of independent replications, and the statistical tests employed (e.g., Welch's t-tests with FDR correction for between-group parameter comparisons, with effect sizes and confidence intervals reported). We have also added a dedicated robustness subsection detailing sensitivity analyses to prompt variations, temperature settings, and model versions. Although the original study was not pre-registered, these post-hoc checks confirm the stability of the key findings on exploration parameters and regret. Updated Tables 2 and 3 will incorporate these details and p-values.
Revision: yes
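The statistical procedure named in the response above (Welch's t-test with FDR correction) can be sketched as follows. This is an illustration under stated assumptions: the Benjamini-Hochberg procedure is assumed as the FDR method, since the response does not say which one is used, and the function names and sample numbers are hypothetical. Converting the t statistic to a p-value needs the Student t distribution, which is left to a statistics library.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """BH-adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical fitted exploration parameters for two groups of agents.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(welch_t(group_a, group_b))

# Hypothetical raw p-values from several between-group comparisons.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs Welch's test and returns the p-value directly; the pure-Python version here only exposes the arithmetic.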

Circularity Check

0 steps flagged

No significant circularity; analysis relies on external benchmarks and prior models

The paper's derivation chain applies canonical multi-armed bandit tasks from the cognitive science literature and fits previously published interpretable choice models to observed choice sequences from humans and LLMs. Key quantities such as levels of random and directed exploration are recovered by applying these external models rather than being defined in terms of the authors' own fitted parameters or predictions. No self-definitional reductions, fitted inputs called predictions, load-bearing self-citations, or ansatzes smuggled via citation are present; the central claims about thinking traces and adaptability emerge from empirical comparisons against independent human data and standard methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard assumptions from the multi-armed bandit and cognitive modeling literature rather than introducing new free parameters, axioms, or invented entities.

axioms (1)

Domain assumption: The canonical multi-armed bandit tasks and choice models from prior cognitive-science literature correctly isolate random versus directed exploration.
Invoked when the authors state they employ these experiments and models to capture E&E strategies.

pith-pipeline@v0.9.0 · 5790 in / 1277 out tokens · 27665 ms · 2026-05-22T15:29:15.468958+00:00 · methodology
