pith. machine review for the scientific record. sign in

arxiv: 2505.06120 · v1 · submitted 2025-05-09 · 💻 cs.CL · cs.HC

Recognition: 2 theorem links

· Lean Theorem

LLMs Get Lost In Multi-Turn Conversation

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:52 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords LLMsmulti-turn conversationperformance degradationunreliabilitysimulationerror recoveryconversational AIunderspecification
0
0 comments X

The pith

LLMs drop 39 percent in multi-turn conversations because they make early assumptions and fail to recover from mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper uses large-scale simulations to show that top LLMs perform markedly worse when users interact over multiple turns than when given a single fully specified instruction. The performance drop averages 39 percent across six tasks and stems mainly from increased unreliability rather than any large loss of basic capability. Models tend to form assumptions early, produce a premature final answer, and then stick to that path even when later turns reveal the error. A reader would care because real conversations with LLMs are rarely one-shot; they are iterative and underspecified, so persistent failure to correct course limits practical usefulness. The work therefore argues that single-turn evaluations overestimate how well current models handle the conversational setting they are actually deployed in.

Core claim

Across more than 200,000 simulated multi-turn conversations, every tested LLM exhibits significantly lower success rates than in equivalent single-turn settings. The degradation decomposes into a modest reduction in aptitude and a much larger rise in unreliability. The dominant failure mode is that models form assumptions in early turns, generate what they treat as a final solution, and then over-rely on that output even when subsequent user messages contradict it. In short, once an LLM takes a wrong turn it tends to stay lost rather than recover.

What carries the argument

Simulation-based decomposition of multi-turn performance into separate aptitude loss and unreliability growth, with the key observation that models form early assumptions and over-commit to premature final outputs.

Load-bearing premise

The simulated multi-turn conversations accurately reflect the distribution and dynamics of real user interactions with LLMs.

What would settle it

Measure whether actual human users in live multi-turn sessions observe the same 39 percent drop and lack of recovery from early wrong turns that the simulations report.

read the original abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports large-scale simulation experiments comparing LLM performance on six generation tasks in single-turn fully-specified settings versus multi-turn conversational settings. It finds an average 39% performance drop in the multi-turn case across tested models, decomposed into minor aptitude loss and substantial unreliability; the latter is attributed to models forming premature assumptions in early turns and failing to recover from incorrect paths, based on analysis of over 200,000 simulated conversations.

Significance. If the core empirical pattern holds, the work identifies a practically important limitation for conversational LLM use cases, where underspecification and iterative refinement are common. The scale of the simulation (200k+ conversations) provides statistical power to detect the unreliability component, which could motivate new training objectives or evaluation protocols focused on error recovery.

major comments (2)
  1. [Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.
  2. [Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.
minor comments (1)
  1. [Abstract] The abstract states 'all the top open- and closed-weight LLMs we test' but does not list the specific models or their sizes; adding this enumeration would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of methodological transparency. We address each major comment below with point-by-point responses and have revised the manuscript to provide greater clarity on simulation details and operational definitions.

read point-by-point responses
  1. Referee: [Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.

    Authors: We agree that the fidelity of the simulation is central to the claims. The exact prompt templates for generating successive user turns (via LLM prompting) and the fixed templates embedding underspecification are fully documented in Appendix A. The six generation tasks, including their specifications and ground-truth evaluation criteria, are defined in Section 3.1. Correction opportunities arise naturally because the user simulator is prompted to continue the conversation based on the model's prior response, permitting clarifications or corrections in later turns as described in Section 3.2. We have revised the abstract to reference Appendix A for these details while preserving its brevity. This change improves verifiability without affecting the reported 39% drop or its decomposition. revision: yes

  2. Referee: [Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.

    Authors: We concur that explicit operationalization strengthens the analysis. Section 4.2 of the revised manuscript now defines a 'wrong turn' as an early-turn response that introduces an assumption unsupported by the initial task specification, resulting in a final output deviating from ground truth. 'Recovery' is operationalized as the proportion of such cases where later turns produce a correct final output. The decomposition metrics are: aptitude loss (performance drop on conversations without wrong turns) and unreliability (additional drop from non-recovered errors). These are computed automatically over the 200k+ conversations, with rubrics validated via manual review of 500 samples. The evaluation code and rubrics are now linked in the manuscript via our public repository. These additions enable independent verification of the reported decomposition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are direct and self-contained

full rationale

The paper conducts large-scale simulation experiments that directly measure LLM performance in single-turn versus multi-turn settings, reporting an average 39% drop and decomposing it into aptitude loss versus unreliability via analysis of 200,000+ generated conversations. No equations, fitted parameters, or derivations are presented that reduce the central claim (LLMs make early assumptions and fail to recover) to the simulation procedure itself or to any self-citation chain. The results are obtained by running the evaluated models on procedurally generated dialogues and recording outcomes, which constitutes independent empirical content rather than a self-definitional or fitted-input prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on the assumption that simulated conversations match real usage patterns; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Simulated conversations reflect real user behavior and task underspecification patterns
    The generalization from simulation results to real LLM use depends on this premise.

pith-pipeline@v0.9.0 · 5506 in / 1119 out tokens · 38297 ms · 2026-05-14T00:52:30.431994+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Temporal Consistency in Multi-Turn Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

  2. EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    EMSDialog is a dataset of 4,414 synthetic multi-speaker EMS dialogues generated by a multi-LLM agent pipeline grounded in ePCR reports, annotated with diagnoses, roles, and topics, and shown to improve accuracy, timel...

  3. Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

    cs.LG 2026-05 conditional novelty 6.0

    ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...

  4. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  5. $\delta$-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  6. SOMA: Efficient Multi-turn LLM Serving via Small Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

  7. Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.

  8. Pause or Fabricate? Training Language Models for Grounded Reasoning

    cs.CL 2026-04 conditional novelty 6.0

    GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...

  9. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  10. AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.

  11. RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety...

  12. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

    cs.CL 2026-04 unverdicted novelty 6.0

    GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

  13. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  14. ZORO: Active Rules for Reliable Vibe Coding

    cs.HC 2026-04 unverdicted novelty 6.0

    ZORO integrates rules directly into AI coding workflows by enriching plans, enforcing compliance with proof requirements, and evolving rules via user feedback, resulting in better rule adherence and shifts in user behavior.

  15. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  16. MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    cs.CL 2026-04 unverdicted novelty 6.0

    MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.

  17. ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

    cs.SE 2026-04 unverdicted novelty 6.0

    ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.

  18. Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

    cs.CL 2026-04 conditional novelty 6.0

    Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.

  19. Trace Mutation in Human-LLM Dialogue: The Transcript as Forensic and Mitigation Surface

    cs.HC 2026-03 unverdicted novelty 6.0

    Trace mutations are a class of context failures in LLM conversations consisting of utterance effacement and genitive dissociation that distort the shared record while resisting ordinary repair.

  20. Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

  21. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL 2026-05 unverdicted novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  22. MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.

  23. Can AI Help You Get Over Your Breakup? One Session with a Belief-Reframing Chatbot Shows Sustained Distress Reduction

    cs.HC 2026-05 conditional novelty 5.0

    A pre-registered RCT found that one session with a belief-reframing AI chatbot produced significantly greater reductions in breakup distress than a survey-only control at 7 days, with a smaller effect persisting at 1 month.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 23 Pith papers · 13 internal anchors

  1. [1]

    Phi-4 Technical Report

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454, 2024

  3. [3]

    C. G. Belem, P. Pezeskhpour, H. Iso, S. Maekawa, N. Bhutani, and E. Hruschka. From single to multi: How llms hallucinate in multi-document summarization. arXiv preprint arXiv:2410.13961, 2024

  4. [4]

    Brauner, A

    P. Brauner, A. Hick, R. Philipsen, and M. Ziefle. What does the public think about artificial intelligence?—a criticality map to understand bias in the public perception of ai. In Frontiers of Computer Science, 2023. URL https://api.semanticscholar.org/CorpusID:257598212. 13 LLMs Get Lost In Multi-Turn Conversation PREPRINT

  5. [5]

    Chakrabarty, P

    T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, and C.-S. Wu. Art or artifice? large language models and the false promise of creativity. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024

  6. [6]

    Chakrabarty, P

    T. Chakrabarty, P. Laban, and C.-S. Wu. Ai-slop to ai-polish? aligning language models through edit-based writing rewards and test-time computation. arXiv preprint arXiv:2504.07532, 2025

  7. [7]

    Chang, A

    S. Chang, A. Anderson, and J. M. Hofman. Chatbench: From static benchmarks to human-ai evaluation. arXiv preprint arXiv:2504.07114, 2025

  8. [8]

    H. Chase. Langchain, October 2022. URL https://github.com/langchain-ai/langchain

  9. [9]

    Chaturvedi, K

    A. Chaturvedi, K. Thompson, and N. Asher. Nebula: A discourse aware minecraft builder. ArXiv, abs/2406.18164,

  10. [10]

    URL https://api.semanticscholar.org/CorpusID:270738020

  11. [11]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  12. [12]

    Chiang, L

    W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024

  13. [13]

    E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y . Choi, P. Liang, and L. Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018

  14. [14]

    E. Choi, J. Palomaki, M. Lamm, T. Kwiatkowski, D. Das, and M. Collins. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447–461, 2021

  15. [15]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  16. [16]

    Cohere, A

    T. Cohere, A. Ahmadian, M. Ahmed, J. Alammar, Y . Alnumay, S. Althammer, A. Arkhangorodsky, V . Aryabumi, D. Aumiller, R. Avalos, et al. Command a: An enterprise-ready large language model. arXiv preprint arXiv:2504.00698, 2025

  17. [17]

    Y . Deng, X. Zhang, W. Zhang, Y . Yuan, S.-K. Ng, and T.-S. Chua. On the multi-turn instruction following for conversational web agents. arXiv preprint arXiv:2402.15057, 2024

  18. [18]

    Deriu, A

    J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and M. Cieliebak. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54:755–810, 2021

  19. [19]

    H. Duan, J. Wei, C. Wang, H. Liu, Y . Fang, S. Zhang, D. Lin, and K. Chen. Botchat: Evaluating llms’ capabilities of having multi-turn dialogues. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3184–3200, 2024

  20. [20]

    Z. Fan, R. Chen, T. Hu, and Z. Liu. Fairmt-bench: Benchmarking fairness for multi-turn dialogue in conversational llms. arXiv preprint arXiv:2410.19317, 2024

  21. [21]

    V . S. Ferreira. Ambiguity, accessibility, and a division of labor for communicative success.Psychology of Learning and motivation, 49:209–246, 2008

  22. [22]

    S. E. Finch, J. D. Finch, and J. D. Choi. Don’t forget your abc’s: Evaluating the state-of-the-art in chat-oriented dialogue systems. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023

  23. [23]

    S. Frisson. Semantic underspecification in language processing. Lang. Linguistics Compass, 3:111–127, 2009. URL https://api.semanticscholar.org/CorpusID:13384476

  24. [24]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  25. [25]

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  26. [26]

    C. Han. Can language models follow multiple turns of entangled instructions? arXiv preprint arXiv:2503.13222, 2025

  27. [27]

    Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli

    K. Handa, A. Tamkin, M. McCain, S. Huang, E. Durmus, S. Heck, J. Mueller, J. Hong, S. Ritchie, T. Belonax, et al. Which economic tasks are performed with ai? evidence from millions of claude conversations. arXiv preprint arXiv:2503.04761, 2025

  28. [28]

    Herlihy, J

    C. Herlihy, J. Neville, T. Schnabel, and A. Swaminathan. On overcoming miscalibrated conversational priors in llm-based chatbots. arXiv preprint arXiv:2406.01633, 2024. 14 LLMs Get Lost In Multi-Turn Conversation PREPRINT

  29. [29]

    M. C. Horowitz, L. Kahn, J. Macdonald, and J. Schneider. Adopting ai: how familiarity breeds both trust and contempt. AI & society, 39(4):1721–1735, 2024

  30. [30]

    Huang, P

    K.-H. Huang, P. Laban, A. R. Fabbri, P. K. Choubey, S. Joty, C. Xiong, and C.-S. Wu. Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles. arXiv preprint arXiv:2309.09369, 2023

  31. [31]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  32. [32]

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Live- codebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  33. [33]

    Karpinska, K

    M. Karpinska, K. Thai, K. Lo, T. Goyal, and M. Iyyer. One thousand and one pairs: A" novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264, 2024

  34. [34]

    Y . Kim, Y . Chang, M. Karpinska, A. Garimella, V . Manjunatha, K. Lo, T. Goyal, and M. Iyyer. Fables: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261, 2024

  35. [35]

    Y . Kim, K. Son, S. Kim, and J. Kim. Beyond prompts: Learning from human communication for enhanced ai intent alignment. ArXiv, abs/2405.05678, 2024. URL https://api.semanticscholar.org/CorpusID:269635257

  36. [36]

    Knoth, A

    N. Knoth, A. Tolzin, A. Janson, and J. M. Leimeister. Ai literacy and its implications for prompt engineering strategies. Comput. Educ. Artif. Intell., 6:100225, 2024. URL https://api.semanticscholar.org/CorpusID: 269273689

  37. [37]

    Konrád, J

    J. Konrád, J. Pichl, P. Marek, P. Lorenc, V . D. Ta, O. Kobza, L. H`ylová, and J. Šediv`y. Alquist 4.0: Towards social intelligence using generative models and dialogue personalization. arXiv preprint arXiv:2109.07968, 2021

  38. [38]

    W.-C. Kwan, X. Zeng, Y . Jiang, Y . Wang, L. Li, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Mt-eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, 2024

  39. [39]

    Laban, J

    P. Laban, J. Canny, and M. A. Hearst. What’s the latest? a question-driven news chatbot. arXiv preprint arXiv:2105.05392, 2021

  40. [40]

    Laban, L

    P. Laban, L. Murakhovs’ ka, C. Xiong, and C.-S. Wu. Are you sure? challenging llms leads to performance drops in the flipflop experiment. arXiv preprint arXiv:2311.08596, 2023

  41. [41]

    Laban, A

    P. Laban, A. R. Fabbri, C. Xiong, and C.-S. Wu. Summary of a haystack: A challenge to long-context llms and rag systems. arXiv preprint arXiv:2407.01370, 2024

  42. [42]

    S. Lappin. An intensional parametric semantics for vague quantifiers. Linguistics and Philosophy, 23:599–620,

  43. [43]

    URL https://api.semanticscholar.org/CorpusID:170154611

  44. [44]

    M. Lee, M. Srivastava, A. Hardy, J. Thickstun, E. Durmus, A. Paranjape, I. Gerard-Ursin, X. L. Li, F. Ladhak, F. Rong, et al. Evaluating human-language model interaction. arXiv preprint arXiv:2212.09746, 2022

  45. [45]

    Y . Lee, K. Son, T. S. Kim, J. Kim, J. J. Y . Chung, E. Adar, and J. Kim. One vs. many: Comprehending accurate information from multiple erroneous and inconsistent ai generations. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , 2024. URL https://api.semanticscholar.org/CorpusID: 269635304

  46. [46]

    F. Lei, J. Chen, Y . Ye, R. Cao, D. Shin, H. Su, Z. Suo, H. Gao, W. Hu, P. Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

  47. [47]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

  48. [48]

    R. Li, R. Li, B. Wang, and X. Du. Iqa-eval: Automatic evaluation of human-model interactive question answering. Advances in Neural Information Processing Systems, 37:109894–109921, 2024

  49. [49]

    S. Li, J. Yan, H. Wang, Z. Tang, X. Ren, V . Srinivasan, and H. Jin. Instruction-following evaluation through verbalizer manipulation. arXiv preprint arXiv:2307.10558, 2023

  50. [50]

    Liang, D

    Z. Liang, D. Yu, W. Yu, W. Yao, Z. Zhang, X. Zhang, and D. Yu. Mathchat: Benchmarking mathematical reasoning and instruction following in multi-turn interactions. arXiv preprint arXiv:2405.19444, 2024

  51. [51]

    A. Liu, Z. Wu, J. Michael, A. Suhr, P. West, A. Koller, S. Swayamdipta, N. A. Smith, and Y . Choi. We’re afraid language models aren’t modeling ambiguity. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 790–807, 2023. 15 LLMs Get Lost In Multi-Turn Conversation PREPRINT

  52. [52]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  53. [53]

    Y . Liu, A. R. Fabbri, P. Liu, Y . Zhao, L. Nan, R. Han, S. Han, S. Joty, C.-S. Wu, C. Xiong, et al. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. arXiv preprint arXiv:2212.07981, 2022

  54. [54]

    Z. Ma, S. Edunov, and M. Auli. A comparison of approaches to document-level machine translation. arXiv preprint arXiv:2101.11040, 2021

  55. [55]

    Malaviya, J

    C. Malaviya, J. C. Chang, D. Roth, M. Iyyer, M. Yatskar, and K. Lo. Contextualized evaluations: Taking the guesswork out of language model evaluations. arXiv preprint arXiv:2411.07237, 2024

  56. [56]

    Murakhovs’ ka, P

    L. Murakhovs’ ka, P. Laban, T. Xie, C. Xiong, and C.-S. Wu. Salespeople vs salesbot: Exploring the role of educational value in conversational recommender systems. arXiv preprint arXiv:2310.17749, 2023

  57. [57]

    Mylrea and N

    M. Mylrea and N. Robinson. Artificial intelligence (ai) trust framework and maturity model: Applying an entropy lens to improve security, privacy, and ethical ai.Entropy, 25, 2023. URL https://api.semanticscholar.org/ CorpusID:263840323

  58. [58]

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024

  59. [59]

    OpenAI o3 and o4-mini System Card — openai.com

    OpenAI. OpenAI o3 and o4-mini System Card — openai.com. https://openai.com/index/ o3-o4-mini-system-card/ , 2025. [Accessed 08-05-2025]

  60. [60]

    Papineni, S

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages 311–318, 2002

  61. [61]

    A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020

  62. [62]

    H. Peng, X. Wang, J. Chen, W. Li, Y . P. Qi, Z. Wang, Z. Wu, K. Zeng, B. Xu, L. Hou, and J. Li. When does in-context learning fall short and why? a study on specification-heavy tasks. ArXiv, abs/2311.08993, 2023. URL https://api.semanticscholar.org/CorpusID:265212914

  63. [63]

    Pezzelle

    S. Pezzelle. Dealing with semantic underspecification in multimodal nlp. arXiv preprint arXiv:2306.05240, 2023

  64. [64]

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  65. [65]

    Poelitz and N

    C. Poelitz and N. McKenna. Synthetic clarification and correction dialogues about data-centric tasks–a teacher- student approach. arXiv preprint arXiv:2503.14167, 2025

  66. [66]

    Post and M

    M. Post and M. Junczys-Dowmunt. Escaping the sentence-level paradigm in machine translation. arXiv preprint arXiv:2304.12959, 2023

  67. [67]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  68. [68]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1–67, 2020

  69. [69]

    A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, et al. Conversational ai: The science behind the alexa prize. arXiv preprint arXiv:1801.03604, 2018

  70. [70]

    Reddy, D

    S. Reddy, D. Chen, and C. D. Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019

  71. [71]

    Sarkar, B

    R. Sarkar, B. Sarrafzadeh, N. Chandrasekaran, N. Rangan, P. Resnik, L. Yang, and S. K. Jauhar. Conversational user-ai intervention: A study on prompt rewriting for improved llm response generation. ArXiv, abs/2503.16789,

  72. [72]

    URL https://api.semanticscholar.org/CorpusID:277244656

  73. [73]

    Scherrer, J

    Y . Scherrer, J. Tiedemann, and S. Loáiciga. Analysing concatenation approaches to document-level nmt in two different domains. In Proceedings of the Third Workshop on Discourse in Machine Translation, Hong-Kong, Nov

  74. [74]

    Association for Computational Linguistics

  75. [75]

    Shaikh, H

    O. Shaikh, H. Mozannar, G. Bansal, A. Fourney, and E. Horvitz. Navigating rifts in human-llm grounding: Study and benchmark. arXiv preprint arXiv:2503.13975, 2025. 16 LLMs Get Lost In Multi-Turn Conversation PREPRINT

  76. [76]

    arXiv preprint arXiv:2501.17399 , year=

    V . Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y . Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. arXiv preprint arXiv:2501.17399, 2025

  77. [77]

    Southworth, K

    J. Southworth, K. Migliaccio, J. Glover, J. Glover, D. Reed, C. McCarty, J. Brendemuhl, and A. Thomas. Developing a model for ai across the curriculum: Transforming the higher education landscape via innovation in ai literacy. Computers and Education: Artificial Intelligence, 4:100127, 2023

  78. [78]

    Y . Sun, C. Liu, K. Zhou, J. Huang, R. Song, W. X. Zhao, F. Zhang, D. Zhang, and K. Gai. Parrot: Enhancing multi-turn instruction following for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9729–9750, 2024

  79. [79]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  80. [80]

    Terry, C

    M. Terry, C. Kulkarni, M. Wattenberg, L. Dixon, and M. R. Morris. Interactive ai alignment: specification, process, and evaluation alignment. arXiv preprint arXiv:2311.00710, 2023

Showing first 80 references.