arxiv: 2505.06120 · v1 · submitted 2025-05-09 · 💻 cs.CL · cs.HC

Recognition: 2 theorem links

· Lean Theorem

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban , Hiroaki Hayashi , Yingbo Zhou , Jennifer Neville

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:52 UTC · model grok-4.3

classification 💻 cs.CL cs.HC

keywords LLMsmulti-turn conversationperformance degradationunreliabilitysimulationerror recoveryconversational AIunderspecification

0 comments

The pith

LLMs drop 39 percent in multi-turn conversations because they make early assumptions and fail to recover from mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper uses large-scale simulations to show that top LLMs perform markedly worse when users interact over multiple turns than when given a single fully specified instruction. The performance drop averages 39 percent across six tasks and stems mainly from increased unreliability rather than any large loss of basic capability. Models tend to form assumptions early, produce a premature final answer, and then stick to that path even when later turns reveal the error. A reader would care because real conversations with LLMs are rarely one-shot; they are iterative and underspecified, so persistent failure to correct course limits practical usefulness. The work therefore argues that single-turn evaluations overestimate how well current models handle the conversational setting they are actually deployed in.

Core claim

Across more than 200,000 simulated multi-turn conversations, every tested LLM exhibits significantly lower success rates than in equivalent single-turn settings. The degradation decomposes into a modest reduction in aptitude and a much larger rise in unreliability. The dominant failure mode is that models form assumptions in early turns, generate what they treat as a final solution, and then over-rely on that output even when subsequent user messages contradict it. In short, once an LLM takes a wrong turn it tends to stay lost rather than recover.

What carries the argument

Simulation-based decomposition of multi-turn performance into separate aptitude loss and unreliability growth, with the key observation that models form early assumptions and over-commit to premature final outputs.

Load-bearing premise

The simulated multi-turn conversations accurately reflect the distribution and dynamics of real user interactions with LLMs.

What would settle it

Measure whether actual human users in live multi-turn sessions observe the same 39 percent drop and lack of recovery from early wrong turns that the simulations report.

read the original abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows LLMs drop sharply in multi-turn tasks mainly from failing to recover after early wrong assumptions, but the simulated conversations leave open whether this matches real user behavior.

read the letter

The central observation is that LLMs lose roughly 39 percent performance when conversations stretch beyond a single turn, and the drop comes mostly from increased unreliability rather than lost aptitude. The authors trace this to models making premature assumptions in early turns and then sticking to them instead of revising. They back this with simulations across six tasks and more than 200,000 conversations, plus a breakdown that separates the two failure modes. That decomposition and the scale are the useful parts; they move past single-turn benchmarks toward something closer to how people actually use these systems. The work is empirical and reports direct measurements, so the circularity burden stays low. The soft spot is the simulation design itself. The conversations are built by prompting LLMs or using fixed templates, and nothing in the reported results checks whether the error patterns, correction chances, or persistence of early assumptions line up with real human-LLM logs. If the generation process suppresses natural user fixes or pushes early commitments, the non-recovery finding could be partly an artifact. The abstract summarizes the setup but does not spell out the exact prompt templates or metric definitions, which makes it harder to judge how stable the 39 percent number is. This is for researchers who build or evaluate conversational agents and want a concrete way to measure recovery failures. A reader focused on evaluation methods would pick up the aptitude-versus-unreliability split. It deserves peer review because the scale is substantial and the question matters for deployed systems, though any referee would need to press on the simulation validation and ask for more method detail.

Referee Report

2 major / 1 minor

Summary. The paper reports large-scale simulation experiments comparing LLM performance on six generation tasks in single-turn fully-specified settings versus multi-turn conversational settings. It finds an average 39% performance drop in the multi-turn case across tested models, decomposed into minor aptitude loss and substantial unreliability; the latter is attributed to models forming premature assumptions in early turns and failing to recover from incorrect paths, based on analysis of over 200,000 simulated conversations.

Significance. If the core empirical pattern holds, the work identifies a practically important limitation for conversational LLM use cases, where underspecification and iterative refinement are common. The scale of the simulation (200k+ conversations) provides statistical power to detect the unreliability component, which could motivate new training objectives or evaluation protocols focused on error recovery.

major comments (2)

[Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.
[Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.

minor comments (1)

[Abstract] The abstract states 'all the top open- and closed-weight LLMs we test' but does not list the specific models or their sizes; adding this enumeration would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of methodological transparency. We address each major comment below with point-by-point responses and have revised the manuscript to provide greater clarity on simulation details and operational definitions.

read point-by-point responses

Referee: [Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.

Authors: We agree that the fidelity of the simulation is central to the claims. The exact prompt templates for generating successive user turns (via LLM prompting) and the fixed templates embedding underspecification are fully documented in Appendix A. The six generation tasks, including their specifications and ground-truth evaluation criteria, are defined in Section 3.1. Correction opportunities arise naturally because the user simulator is prompted to continue the conversation based on the model's prior response, permitting clarifications or corrections in later turns as described in Section 3.2. We have revised the abstract to reference Appendix A for these details while preserving its brevity. This change improves verifiability without affecting the reported 39% drop or its decomposition. revision: yes
Referee: [Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.

Authors: We concur that explicit operationalization strengthens the analysis. Section 4.2 of the revised manuscript now defines a 'wrong turn' as an early-turn response that introduces an assumption unsupported by the initial task specification, resulting in a final output deviating from ground truth. 'Recovery' is operationalized as the proportion of such cases where later turns produce a correct final output. The decomposition metrics are: aptitude loss (performance drop on conversations without wrong turns) and unreliability (additional drop from non-recovered errors). These are computed automatically over the 200k+ conversations, with rubrics validated via manual review of 500 samples. The evaluation code and rubrics are now linked in the manuscript via our public repository. These additions enable independent verification of the reported decomposition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are direct and self-contained

full rationale

The paper conducts large-scale simulation experiments that directly measure LLM performance in single-turn versus multi-turn settings, reporting an average 39% drop and decomposing it into aptitude loss versus unreliability via analysis of 200,000+ generated conversations. No equations, fitted parameters, or derivations are presented that reduce the central claim (LLMs make early assumptions and fail to recover) to the simulation procedure itself or to any self-citation chain. The results are obtained by running the evaluated models on procedurally generated dialogues and recording outcomes, which constitutes independent empirical content rather than a self-definitional or fitted-input prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on the assumption that simulated conversations match real usage patterns; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Simulated conversations reflect real user behavior and task underspecification patterns
The generalization from simulation results to real LLM use depends on this premise.

pith-pipeline@v0.9.0 · 5506 in / 1119 out tokens · 38297 ms · 2026-05-14T00:52:30.431994+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Temporal Consistency in Multi-Turn Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents
cs.CL 2026-04 unverdicted novelty 7.0

EMSDialog is a dataset of 4,414 synthetic multi-speaker EMS dialogues generated by a multi-LLM agent pipeline grounded in ePCR reports, annotated with diagnoses, roles, and topics, and shown to improve accuracy, timel...
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
cs.LG 2026-05 conditional novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
cs.AI 2026-05 unverdicted novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
$\delta$-mem: Efficient Online Memory for Large Language Models
cs.AI 2026-05 unverdicted novelty 6.0

δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
cs.CL 2026-05 unverdicted novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
cs.AI 2026-04 unverdicted novelty 6.0

Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
Pause or Fabricate? Training Language Models for Grounded Reasoning
cs.CL 2026-04 conditional novelty 6.0

GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 unverdicted novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 6.0

RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety...
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
cs.CL 2026-04 unverdicted novelty 6.0

GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
cs.LG 2026-04 unverdicted novelty 6.0

AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
ZORO: Active Rules for Reliable Vibe Coding
cs.HC 2026-04 unverdicted novelty 6.0

ZORO integrates rules directly into AI coding workflows by enriching plans, enforcing compliance with proof requirements, and evolving rules via user feedback, resulting in better rule adherence and shifts in user behavior.
LLMs Corrupt Your Documents When You Delegate
cs.CL 2026-04 unverdicted novelty 6.0

LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
cs.CL 2026-04 unverdicted novelty 6.0

MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
cs.SE 2026-04 unverdicted novelty 6.0

ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue
cs.CL 2026-04 conditional novelty 6.0

Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.
Trace Mutation in Human-LLM Dialogue: The Transcript as Forensic and Mitigation Surface
cs.HC 2026-03 unverdicted novelty 6.0

Trace mutations are a class of context failures in LLM conversations consisting of utterance effacement and genitive dissociation that distort the shared record while resisting ordinary repair.
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
cs.CL 2026-05 unverdicted novelty 5.0

Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
cs.CL 2026-05 unverdicted novelty 5.0

MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.
Can AI Help You Get Over Your Breakup? One Session with a Belief-Reframing Chatbot Shows Sustained Distress Reduction
cs.HC 2026-05 conditional novelty 5.0

A pre-registered RCT found that one session with a belief-reframing AI chatbot produced significantly greater reductions in breakup distress than a survey-only control at 7 days, with a smaller effect persisting at 1 month.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 23 Pith papers · 13 internal anchors

[1]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454, 2024

work page 2024
[3]

C. G. Belem, P. Pezeskhpour, H. Iso, S. Maekawa, N. Bhutani, and E. Hruschka. From single to multi: How llms hallucinate in multi-document summarization. arXiv preprint arXiv:2410.13961, 2024

work page arXiv 2024
[4]

Brauner, A

P. Brauner, A. Hick, R. Philipsen, and M. Ziefle. What does the public think about artificial intelligence?—a criticality map to understand bias in the public perception of ai. In Frontiers of Computer Science, 2023. URL https://api.semanticscholar.org/CorpusID:257598212. 13 LLMs Get Lost In Multi-Turn Conversation PREPRINT

work page 2023
[5]

Chakrabarty, P

T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, and C.-S. Wu. Art or artifice? large language models and the false promise of creativity. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024

work page 2024
[6]

Chakrabarty, P

T. Chakrabarty, P. Laban, and C.-S. Wu. Ai-slop to ai-polish? aligning language models through edit-based writing rewards and test-time computation. arXiv preprint arXiv:2504.07532, 2025

work page arXiv 2025
[7]

Chang, A

S. Chang, A. Anderson, and J. M. Hofman. Chatbench: From static benchmarks to human-ai evaluation. arXiv preprint arXiv:2504.07114, 2025

work page arXiv 2025
[8]

H. Chase. Langchain, October 2022. URL https://github.com/langchain-ai/langchain

work page 2022
[9]

Chaturvedi, K

A. Chaturvedi, K. Thompson, and N. Asher. Nebula: A discourse aware minecraft builder. ArXiv, abs/2406.18164,

work page arXiv
[10]

URL https://api.semanticscholar.org/CorpusID:270738020

work page
[11]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Chiang, L

W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[13]

E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y . Choi, P. Liang, and L. Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018

work page Pith review arXiv 2018
[14]

E. Choi, J. Palomaki, M. Lamm, T. Kwiatkowski, D. Das, and M. Collins. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447–461, 2021

work page 2021
[15]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Cohere, A

T. Cohere, A. Ahmadian, M. Ahmed, J. Alammar, Y . Alnumay, S. Althammer, A. Arkhangorodsky, V . Aryabumi, D. Aumiller, R. Avalos, et al. Command a: An enterprise-ready large language model. arXiv preprint arXiv:2504.00698, 2025

work page arXiv 2025
[17]

Y . Deng, X. Zhang, W. Zhang, Y . Yuan, S.-K. Ng, and T.-S. Chua. On the multi-turn instruction following for conversational web agents. arXiv preprint arXiv:2402.15057, 2024

work page arXiv 2024
[18]

Deriu, A

J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and M. Cieliebak. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54:755–810, 2021

work page 2021
[19]

H. Duan, J. Wei, C. Wang, H. Liu, Y . Fang, S. Zhang, D. Lin, and K. Chen. Botchat: Evaluating llms’ capabilities of having multi-turn dialogues. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3184–3200, 2024

work page 2024
[20]

Z. Fan, R. Chen, T. Hu, and Z. Liu. Fairmt-bench: Benchmarking fairness for multi-turn dialogue in conversational llms. arXiv preprint arXiv:2410.19317, 2024

work page arXiv 2024
[21]

V . S. Ferreira. Ambiguity, accessibility, and a division of labor for communicative success.Psychology of Learning and motivation, 49:209–246, 2008

work page 2008
[22]

S. E. Finch, J. D. Finch, and J. D. Choi. Don’t forget your abc’s: Evaluating the state-of-the-art in chat-oriented dialogue systems. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023

work page 2023
[23]

S. Frisson. Semantic underspecification in language processing. Lang. Linguistics Compass, 3:111–127, 2009. URL https://api.semanticscholar.org/CorpusID:13384476

work page 2009
[24]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

C. Han. Can language models follow multiple turns of entangled instructions? arXiv preprint arXiv:2503.13222, 2025

work page arXiv 2025
[27]

Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli

K. Handa, A. Tamkin, M. McCain, S. Huang, E. Durmus, S. Heck, J. Mueller, J. Hong, S. Ritchie, T. Belonax, et al. Which economic tasks are performed with ai? evidence from millions of claude conversations. arXiv preprint arXiv:2503.04761, 2025

work page arXiv 2025
[28]

Herlihy, J

C. Herlihy, J. Neville, T. Schnabel, and A. Swaminathan. On overcoming miscalibrated conversational priors in llm-based chatbots. arXiv preprint arXiv:2406.01633, 2024. 14 LLMs Get Lost In Multi-Turn Conversation PREPRINT

work page arXiv 2024
[29]

M. C. Horowitz, L. Kahn, J. Macdonald, and J. Schneider. Adopting ai: how familiarity breeds both trust and contempt. AI & society, 39(4):1721–1735, 2024

work page 2024
[30]

Huang, P

K.-H. Huang, P. Laban, A. R. Fabbri, P. K. Choubey, S. Joty, C. Xiong, and C.-S. Wu. Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles. arXiv preprint arXiv:2309.09369, 2023

work page arXiv 2023
[31]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Live- codebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Karpinska, K

M. Karpinska, K. Thai, K. Lo, T. Goyal, and M. Iyyer. One thousand and one pairs: A" novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264, 2024

work page arXiv 2024
[34]

Y . Kim, Y . Chang, M. Karpinska, A. Garimella, V . Manjunatha, K. Lo, T. Goyal, and M. Iyyer. Fables: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261, 2024

work page arXiv 2024
[35]

Y . Kim, K. Son, S. Kim, and J. Kim. Beyond prompts: Learning from human communication for enhanced ai intent alignment. ArXiv, abs/2405.05678, 2024. URL https://api.semanticscholar.org/CorpusID:269635257

work page arXiv 2024
[36]

Knoth, A

N. Knoth, A. Tolzin, A. Janson, and J. M. Leimeister. Ai literacy and its implications for prompt engineering strategies. Comput. Educ. Artif. Intell., 6:100225, 2024. URL https://api.semanticscholar.org/CorpusID: 269273689

work page 2024
[37]

Konrád, J

J. Konrád, J. Pichl, P. Marek, P. Lorenc, V . D. Ta, O. Kobza, L. H`ylová, and J. Šediv`y. Alquist 4.0: Towards social intelligence using generative models and dialogue personalization. arXiv preprint arXiv:2109.07968, 2021

work page arXiv 2021
[38]

W.-C. Kwan, X. Zeng, Y . Jiang, Y . Wang, L. Li, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Mt-eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, 2024

work page 2024
[39]

Laban, J

P. Laban, J. Canny, and M. A. Hearst. What’s the latest? a question-driven news chatbot. arXiv preprint arXiv:2105.05392, 2021

work page arXiv 2021
[40]

Laban, L

P. Laban, L. Murakhovs’ ka, C. Xiong, and C.-S. Wu. Are you sure? challenging llms leads to performance drops in the flipflop experiment. arXiv preprint arXiv:2311.08596, 2023

work page arXiv 2023
[41]

Laban, A

P. Laban, A. R. Fabbri, C. Xiong, and C.-S. Wu. Summary of a haystack: A challenge to long-context llms and rag systems. arXiv preprint arXiv:2407.01370, 2024

work page arXiv 2024
[42]

S. Lappin. An intensional parametric semantics for vague quantifiers. Linguistics and Philosophy, 23:599–620,

work page
[43]

URL https://api.semanticscholar.org/CorpusID:170154611

work page
[44]

M. Lee, M. Srivastava, A. Hardy, J. Thickstun, E. Durmus, A. Paranjape, I. Gerard-Ursin, X. L. Li, F. Ladhak, F. Rong, et al. Evaluating human-language model interaction. arXiv preprint arXiv:2212.09746, 2022

work page arXiv 2022
[45]

Y . Lee, K. Son, T. S. Kim, J. Kim, J. J. Y . Chung, E. Adar, and J. Kim. One vs. many: Comprehending accurate information from multiple erroneous and inconsistent ai generations. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , 2024. URL https://api.semanticscholar.org/CorpusID: 269635304

work page 2024
[46]

F. Lei, J. Chen, Y . Ye, R. Cao, D. Shin, H. Su, Z. Suo, H. Gao, W. Hu, P. Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

work page arXiv 2024
[47]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[48]

R. Li, R. Li, B. Wang, and X. Du. Iqa-eval: Automatic evaluation of human-model interactive question answering. Advances in Neural Information Processing Systems, 37:109894–109921, 2024

work page 2024
[49]

S. Li, J. Yan, H. Wang, Z. Tang, X. Ren, V . Srinivasan, and H. Jin. Instruction-following evaluation through verbalizer manipulation. arXiv preprint arXiv:2307.10558, 2023

work page arXiv 2023
[50]

Liang, D

Z. Liang, D. Yu, W. Yu, W. Yao, Z. Zhang, X. Zhang, and D. Yu. Mathchat: Benchmarking mathematical reasoning and instruction following in multi-turn interactions. arXiv preprint arXiv:2405.19444, 2024

work page arXiv 2024
[51]

A. Liu, Z. Wu, J. Michael, A. Suhr, P. West, A. Koller, S. Swayamdipta, N. A. Smith, and Y . Choi. We’re afraid language models aren’t modeling ambiguity. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 790–807, 2023. 15 LLMs Get Lost In Multi-Turn Conversation PREPRINT

work page 2023
[52]

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

work page 2024
[53]

Y . Liu, A. R. Fabbri, P. Liu, Y . Zhao, L. Nan, R. Han, S. Han, S. Joty, C.-S. Wu, C. Xiong, et al. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. arXiv preprint arXiv:2212.07981, 2022

work page arXiv 2022
[54]

Z. Ma, S. Edunov, and M. Auli. A comparison of approaches to document-level machine translation. arXiv preprint arXiv:2101.11040, 2021

work page arXiv 2021
[55]

Malaviya, J

C. Malaviya, J. C. Chang, D. Roth, M. Iyyer, M. Yatskar, and K. Lo. Contextualized evaluations: Taking the guesswork out of language model evaluations. arXiv preprint arXiv:2411.07237, 2024

work page arXiv 2024
[56]

Murakhovs’ ka, P

L. Murakhovs’ ka, P. Laban, T. Xie, C. Xiong, and C.-S. Wu. Salespeople vs salesbot: Exploring the role of educational value in conversational recommender systems. arXiv preprint arXiv:2310.17749, 2023

work page arXiv 2023
[57]

Mylrea and N

M. Mylrea and N. Robinson. Artificial intelligence (ai) trust framework and maturity model: Applying an entropy lens to improve security, privacy, and ethical ai.Entropy, 25, 2023. URL https://api.semanticscholar.org/ CorpusID:263840323

work page 2023
[58]

T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

OpenAI o3 and o4-mini System Card — openai.com

OpenAI. OpenAI o3 and o4-mini System Card — openai.com. https://openai.com/index/ o3-o4-mini-system-card/ , 2025. [Accessed 08-05-2025]

work page 2025
[60]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages 311–318, 2002

work page 2002
[61]

A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020

work page arXiv 2004
[62]

H. Peng, X. Wang, J. Chen, W. Li, Y . P. Qi, Z. Wang, Z. Wu, K. Zeng, B. Xu, L. Hou, and J. Li. When does in-context learning fall short and why? a study on specification-heavy tasks. ArXiv, abs/2311.08993, 2023. URL https://api.semanticscholar.org/CorpusID:265212914

work page arXiv 2023
[63]

Pezzelle

S. Pezzelle. Dealing with semantic underspecification in multimodal nlp. arXiv preprint arXiv:2306.05240, 2023

work page arXiv 2023
[64]

L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

Poelitz and N

C. Poelitz and N. McKenna. Synthetic clarification and correction dialogues about data-centric tasks–a teacher- student approach. arXiv preprint arXiv:2503.14167, 2025

work page arXiv 2025
[66]

Post and M

M. Post and M. Junczys-Dowmunt. Escaping the sentence-level paradigm in machine translation. arXiv preprint arXiv:2304.12959, 2023

work page arXiv 2023
[67]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

work page 2019
[68]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1–67, 2020

work page 2020
[69]

A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, et al. Conversational ai: The science behind the alexa prize. arXiv preprint arXiv:1801.03604, 2018

work page arXiv 2018
[70]

Reddy, D

S. Reddy, D. Chen, and C. D. Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019

work page 2019
[71]

Sarkar, B

R. Sarkar, B. Sarrafzadeh, N. Chandrasekaran, N. Rangan, P. Resnik, L. Yang, and S. K. Jauhar. Conversational user-ai intervention: A study on prompt rewriting for improved llm response generation. ArXiv, abs/2503.16789,

work page arXiv
[72]

URL https://api.semanticscholar.org/CorpusID:277244656

work page
[73]

Scherrer, J

Y . Scherrer, J. Tiedemann, and S. Loáiciga. Analysing concatenation approaches to document-level nmt in two different domains. In Proceedings of the Third Workshop on Discourse in Machine Translation, Hong-Kong, Nov

work page
[74]

Association for Computational Linguistics

work page
[75]

Shaikh, H

O. Shaikh, H. Mozannar, G. Bansal, A. Fourney, and E. Horvitz. Navigating rifts in human-llm grounding: Study and benchmark. arXiv preprint arXiv:2503.13975, 2025. 16 LLMs Get Lost In Multi-Turn Conversation PREPRINT

work page arXiv 2025
[76]

arXiv preprint arXiv:2501.17399 , year=

V . Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y . Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. arXiv preprint arXiv:2501.17399, 2025

work page arXiv 2025
[77]

Southworth, K

J. Southworth, K. Migliaccio, J. Glover, J. Glover, D. Reed, C. McCarty, J. Brendemuhl, and A. Thomas. Developing a model for ai across the curriculum: Transforming the higher education landscape via innovation in ai literacy. Computers and Education: Artificial Intelligence, 4:100127, 2023

work page 2023
[78]

Y . Sun, C. Liu, K. Zhou, J. Huang, R. Song, W. X. Zhao, F. Zhang, D. Zhang, and K. Gai. Parrot: Enhancing multi-turn instruction following for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9729–9750, 2024

work page 2024
[79]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[80]

Terry, C

M. Terry, C. Kulkarni, M. Wattenberg, L. Dixon, and M. R. Morris. Interactive ai alignment: specification, process, and evaluation alignment. arXiv preprint arXiv:2311.00710, 2023

work page arXiv 2023

Showing first 80 references.