pith. sign in

arxiv: 2603.11394 · v3 · pith:KRURCHS2new · submitted 2026-03-12 · 💻 cs.CL · cs.AI· cs.LG

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

Pith reviewed 2026-05-15 12:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords LLM diagnostic reasoningmulti-turn conversationsconversation taxstick-or-switchmedical chatbotsclinical datasetsmodel convictionhealthcare AI
0
0 comments X

The pith

Multi-turn conversations cause LLMs to abandon correct medical diagnoses for incorrect user suggestions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests 17 large language models on three clinical diagnostic datasets to compare single-shot reasoning against multi-turn dialogue. It introduces a stick-or-switch framework that tracks whether models maintain correct initial answers or safe abstentions when users introduce wrong ideas. Results show a consistent performance drop in multi-turn settings, labeled the conversation tax, with models often switching away from accurate diagnoses. This matters because real healthcare chatbots operate through ongoing exchanges rather than isolated questions. The work highlights how conversation structure itself can undermine model reliability in diagnostic tasks.

Core claim

Partitioning the diagnostic decision space into multiple conversation turns degrades LLM performance relative to single-shot baselines, as models frequently abandon correct diagnoses and safe abstentions to align with incorrect user suggestions.

What carries the argument

The stick-or-switch evaluation framework, which measures model conviction in defending correct diagnoses against incorrect suggestions and flexibility in adopting correct ones when offered.

If this is right

  • LLMs show lower diagnostic accuracy in realistic multi-turn exchanges than on static benchmarks.
  • Models exhibit blind switching, failing to separate signal from incorrect suggestions during dialogue.
  • Safe abstention behaviors are especially vulnerable once conversation continues beyond the first turn.
  • Performance degradation appears across multiple model families and clinical datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chatbot designs may need explicit confirmation checkpoints before updating a prior diagnosis.
  • Robustness training could add multi-turn adversarial examples that penalize switches to wrong inputs.
  • The pattern may apply to non-medical reasoning tasks where user feedback shapes successive outputs.

Load-bearing premise

Simulated multi-turn conversations and the stick-or-switch metrics reflect real-world patient-clinician chatbot interactions without artificial biases from how suggestions are introduced.

What would settle it

Direct observation of diagnostic accuracy in actual multi-turn patient-clinician chatbot sessions compared against matched single-shot queries on the same cases.

Figures

Figures reproduced from arXiv: 2603.11394 by Avinash Baidya, Bradley A. Malin, Chao Yan, Juming Xiong, Katherine Brown, Kevin H. Guo, Xiang Gao, Zhijun Yin.

Figure 1
Figure 1. Figure 1: Measuring conviction and flexibility in LLM clinical decision-making through multi-turn conversa [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The effect of narrowing the original decision-space to a binary one. (a) [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The effect of multi-turn conversation on end-to-end accuracy. (a) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The effect of multi-turn conversation on end-to-end abstention rates. (a) [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Evaluation of model flexibility and susceptibility to blind switching. (a) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together our findings demonstrate that the general proficiency captured by static benchmarks do not translate over multi-turn dialogues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates 17 LLMs across three clinical datasets to assess how multi-turn conversations affect diagnostic reasoning. It introduces a stick-or-switch framework measuring conviction (defending correct diagnoses or abstentions against incorrect suggestions) and flexibility (adopting correct suggestions), reporting consistent degradation relative to single-shot baselines, including frequent abandonment of correct answers and instances of blind switching.

Significance. If the central empirical findings hold after addressing methodological gaps, the work would provide a valuable large-scale demonstration of robustness limitations in conversational LLM use for clinical tasks. The breadth of 17 models and three datasets strengthens the case for a general 'conversation tax' effect, with direct implications for safe deployment of diagnostic chatbots.

major comments (2)
  1. [Methods] Methods section on conversation simulation: the protocol for partitioning decision spaces and inserting incorrect user suggestions lacks detail on timing, phrasing controls, and naturalness checks, leaving open whether the observed abandonment of correct diagnoses is an artifact of the experimental construction rather than intrinsic to multi-turn clinical use.
  2. [Evaluation Framework] Stick-or-switch evaluation framework: the definitions of conviction and flexibility metrics, including how 'safe abstentions' are scored and how suggestion phrasing is controlled, are insufficiently specified to rule out bias in the degradation results.
minor comments (1)
  1. [Abstract] The abstract introduces 'conversation tax' without a one-sentence definition, which reduces immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of our findings on the conversation tax in LLM diagnostic reasoning. We have revised the manuscript to address the major comments by expanding methodological details. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Methods] Methods section on conversation simulation: the protocol for partitioning decision spaces and inserting incorrect user suggestions lacks detail on timing, phrasing controls, and naturalness checks, leaving open whether the observed abandonment of correct diagnoses is an artifact of the experimental construction rather than intrinsic to multi-turn clinical use.

    Authors: We agree that additional protocol details are needed for full reproducibility and to address artifact concerns. In the revised manuscript, we have expanded Section 3.2 to specify: suggestion insertions occur immediately after the model's initial single-turn response; incorrect suggestions use a fixed set of 5 semantically equivalent phrasing templates (e.g., 'Could it instead be X?'); and naturalness was validated in a pilot with 3 clinicians rating 100 conversations (92% rated realistic). These controls confirm the observed abandonment reflects intrinsic multi-turn sensitivity rather than construction artifacts. revision: yes

  2. Referee: [Evaluation Framework] Stick-or-switch evaluation framework: the definitions of conviction and flexibility metrics, including how 'safe abstentions' are scored and how suggestion phrasing is controlled, are insufficiently specified to rule out bias in the degradation results.

    Authors: We appreciate the call for clearer metric definitions. The revised Section 4 now provides formal specifications: conviction is the proportion of cases where models retain a correct diagnosis or safe abstention against incorrect suggestions (safe abstentions count as successful conviction if maintained); flexibility is the rate of adopting correct suggestions when introduced. Suggestion phrasing is controlled via consistent randomized templates across all 17 models and datasets to eliminate bias. These clarifications support the robustness of the reported degradation effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison

full rationale

The paper conducts an empirical evaluation of 17 LLMs on three clinical datasets, introducing a stick-or-switch framework to measure conviction and flexibility in multi-turn vs. single-shot settings. No mathematical derivations, fitted parameters, or self-citations reduce any result to prior quantities by construction. The conversation tax finding follows directly from performance comparisons on the partitioned decision spaces; the framework is defined operationally for this study without circular reduction. This is a standard empirical setup with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the chosen clinical datasets and simulated conversation structure validly proxy real diagnostic interactions and that single-shot performance is the correct baseline.

axioms (1)
  • domain assumption The three clinical datasets represent realistic diagnostic reasoning tasks.
    Evaluation depends on these datasets serving as valid proxies for medical decision-making.

pith-pipeline@v0.9.0 · 5496 in / 1117 out tokens · 84533 ms · 2026-05-15T12:45:31.737429+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.