Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs

Luise Ge; Yevgeniy Vorobeychik; Yongyan Zhang

arXiv: 2602.15173 · v2 · submitted 2026-02-16 · 💻 cs.AI

Pith reviewed 2026-05-15 21:22 UTC · model grok-4.3

keywords: large language models · risky choice · decision under uncertainty · description-experience gap · mathematical reasoning · reasoning models · conversational models

The pith

Large language models split into reasoning and conversational types that differ sharply in risky choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models fall into two groups when facing choices under uncertainty. Reasoning models act more like a payoff-maximizing agent: they stay consistent regardless of how prospects are ordered, framed as gains or losses, or accompanied by explanations, and they respond the same way whether risks are stated directly or shown through sequences of past outcomes. Conversational models depart farther from rational benchmarks, track human patterns a bit more closely, shift their answers with ordering and framing, and display a wide gap between explicit descriptions and history-based presentations. Paired tests on open models trace the split mainly to the presence or absence of mathematical reasoning training during development.

Core claim

Frontier and open LLMs cluster into two groups. Reasoning models tend toward rational behavior, are insensitive to prospect order, gain/loss framing, and added explanations, and behave similarly whether prospects are stated explicitly or conveyed through outcome histories. Conversational models are less rational, slightly more human-like, sensitive to ordering, framing, and explanation, and show a large description-history gap. Paired comparisons of open models suggest that mathematical reasoning training is a key differentiating factor.

What carries the argument

The argument rests on the split between reasoning models and conversational models. The paper attributes that split to mathematical reasoning training and ties it to how sensitive a model is to prospect representation and decision rationale.

If this is right

Reasoning models produce consistent choices across explicit descriptions and outcome histories.
Conversational models display a large gap between the two forms of prospect presentation.
Mathematical reasoning training separates the two model categories in open LLMs.
Reasoning models show little response to changes in prospect order, gain/loss framing, or added explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Selecting reasoning models for agentic decision workflows could reduce unwanted sensitivity to how risks are worded.
Targeted math-reasoning fine-tuning on conversational models might shrink their description-history gap.
The observed split suggests that training objectives shape not only accuracy but also stability under uncertainty.

Load-bearing premise

The clustering of the tested models into reasoning and conversational types and the attribution of the difference to mathematical reasoning training remain stable across the particular prompts and prospect problems used.

What would settle it

Repeating the full set of prospect-choice trials on a fresh collection of models outside the original twenty, or with revised prompts that alter ordering and framing while keeping the same prospects, would reveal whether the two-group pattern and the training link hold.
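
One way to quantify the ordering part of such a replication is a flip rate: the share of problems on which a model picks a different prospect once the presentation order is reversed. The Python below is a minimal sketch under stated assumptions, not the paper's metric; the function name `order_flip_rate`, the problem IDs, and the choice labels are all hypothetical.

```python
from typing import Dict


def order_flip_rate(forward: Dict[str, str], reversed_: Dict[str, str]) -> float:
    """Fraction of shared problems where the chosen prospect changes with order.

    Both arguments map a problem ID to the *identity* of the chosen prospect
    (for example "safe" or "risky"), not to its position in the prompt.
    This scoring rule is an assumption made for illustration.
    """
    shared = sorted(set(forward) & set(reversed_))
    if not shared:
        raise ValueError("no problems in common between the two runs")
    flips = sum(forward[pid] != reversed_[pid] for pid in shared)
    return flips / len(shared)


# Hypothetical choices for four problems, each run in both presentation orders.
forward_run = {"p1": "safe", "p2": "risky", "p3": "safe", "p4": "risky"}
reversed_run = {"p1": "safe", "p2": "safe", "p3": "safe", "p4": "risky"}

rate = order_flip_rate(forward_run, reversed_run)
print(f"order flip rate: {rate:.2f}")
```

Under the paper's account, a reasoning model would be expected to score near zero on a measure like this and a conversational model noticeably higher; the same function could be reused for gain versus loss framings of the same prospects.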

Original abstract

The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We study LLM risky choices along two dimensions: (1) prospect representation (based on an explicit representation or outcome history) and (2) decision rationale (explanation). Our study, which involves 20 frontier and open LLMs, is complemented by a matched human subjects experiment, which provides one reference point, while an expected payoff maximizing rational agent model provides another. We find that LLMs cluster into two categories: reasoning models (RMs) and conversational models (CMs). RMs tend towards rational behavior, are insensitive to the order of prospects, gain/loss framing, and explanations, and behave similarly whether prospects are explicit or presented via a history of outcomes. CMs are significantly less rational, slightly more human-like, sensitive to prospect ordering, framing, and explanation, and exhibit a large description-history gap. Paired comparisons of open LLMs suggest that a key factor differentiating RMs and CMs is training for mathematical reasoning.
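
The abstract leans on two quantities that a short sketch can make concrete: the expected-payoff-maximizing benchmark and the description-history gap. The Python below is illustrative only and is not the authors' code. The function names, the example prospects, and the definition of the gap as an absolute difference in choice rates are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# A prospect is a list of (payoff, probability) pairs. This encoding is assumed.
Prospect = List[Tuple[float, float]]


def expected_value(prospect: Prospect) -> float:
    """Expected payoff of a prospect given as explicit probabilities."""
    return sum(payoff * prob for payoff, prob in prospect)


def rational_choice(a: Prospect, b: Prospect) -> str:
    """Choice of an expected-payoff maximizer; ties go to A by convention."""
    return "A" if expected_value(a) >= expected_value(b) else "B"


def history_mean(outcomes: Sequence[float]) -> float:
    """Average payoff seen in an outcome history (the history-based presentation)."""
    return sum(outcomes) / len(outcomes)


def choice_rate(choices: Sequence[str], option: str) -> float:
    """Share of trials on which `option` was chosen."""
    return sum(c == option for c in choices) / len(choices)


def dh_gap(described: Sequence[str], from_history: Sequence[str], option: str = "A") -> float:
    """Assumed gap measure: absolute difference in choice rates across presentations."""
    return abs(choice_rate(described, option) - choice_rate(from_history, option))


# Hypothetical prospects: a sure 3 versus 4 with probability 0.8.
safe: Prospect = [(3.0, 1.0)]
risky: Prospect = [(4.0, 0.8), (0.0, 0.2)]
benchmark = rational_choice(safe, risky)

# Hypothetical model choices under each presentation of the same problem.
gap = dh_gap(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(benchmark, gap)
```

In this toy case the risky prospect has the higher expected payoff, so the benchmark agent picks it under either presentation. A model whose choice rates differ between the two lists shows a nonzero gap.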

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Reasoning-trained LLMs act more rationally and consistently on risky choices than conversational ones, with the split tied to math training.

This paper finds that LLMs split into two groups on risky choice tasks. Reasoning models stay close to a rational baseline, show little sensitivity to framing or order, and behave the same whether prospects are described directly or shown through outcome histories. Conversational models drift farther from rational behavior, look a bit more like humans, and display a clear description-history gap plus more reaction to how the problem is presented. Paired tests on open models point to mathematical reasoning training as the main factor behind the difference.

Referee Report

0 major / 3 minor

Summary. The manuscript reports an empirical investigation of risky decision-making in 20 large language models, comparing their choices under explicit prospect descriptions versus outcome histories, with and without explanations. It contrasts these with human participants and a rational expected-payoff maximizer, identifying two distinct clusters: reasoning models (RMs) that exhibit rational, framing-insensitive behavior consistent across representations, and conversational models (CMs) that display more human-like biases, sensitivity to order and framing, and a pronounced description-history gap. Paired comparisons among open models point to mathematical reasoning training as a key differentiator.

Significance. If the observed clustering and attribution hold, this study significantly advances our understanding of how different LLM training regimes influence decision-making under uncertainty. The inclusion of human and rational baselines provides clear reference points, and the findings have direct implications for deploying LLMs in agentic workflows or as decision aids. The empirical nature and scale (20 models) make it a useful benchmark for future work on LLM rationality.

minor comments (3)

[Section 4.2] Section 4.2: The clustering procedure into RMs and CMs is described at a high level; providing the exact distance metric, linkage method, or threshold used would improve reproducibility.
[Table 2] Table 2: The reported p-values for CM sensitivity to framing lack correction for multiple comparisons across the 20 models; this should be addressed or justified.
[Figure 4] Figure 4: The paired open-LLM comparison plot would benefit from explicit labeling of which models received math-reasoning fine-tuning to make the correlation visually immediate.
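
The multiple-comparisons comment on Table 2 could be addressed with a standard step-down correction. The Python below sketches the Holm-Bonferroni adjustment as one possible choice; the referee report does not name a method, and the p-values shown are hypothetical placeholders, not figures from the paper.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical per-model framing-sensitivity p-values (placeholders only).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

Holm's procedure controls the family-wise error rate and is never less powerful than plain Bonferroni. A false-discovery-rate method would be a reasonable alternative if the authors prefer a less conservative criterion across the 20 models.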

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, as well as the recommendation for minor revision. The assessment correctly identifies the core distinction between reasoning models (RMs) and conversational models (CMs) in risky choice behavior, along with the role of mathematical reasoning training.

Circularity Check

0 steps flagged

No significant circularity in empirical clustering analysis

This is an empirical comparison study measuring LLM behaviors directly against human subjects and a rational-agent baseline across explicit prospect and outcome-history presentations. The reported RM/CM clustering and attribution to mathematical-reasoning training emerge from observed patterns in 20 models, with no derivations, fitted parameters, self-definitional equations, or load-bearing self-citations that reduce the central claims to their own inputs by construction. Methods supply prompt templates, prospect problems, and quantitative metrics, confirming the separation is data-driven rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical; it relies on standard assumptions that LLMs can be prompted to choose between prospects and that human responses provide a meaningful reference point, with no new free parameters, axioms, or invented entities introduced.

pith-pipeline@v0.9.0 · 5509 in / 1322 out tokens · 23572 ms · 2026-05-15T21:22:24.315025+00:00 · methodology
