RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
hub Canonical reference
LLMs Get Lost In Multi-Turn Conversation
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
EMSDialog is a dataset of 4,414 synthetic multi-speaker EMS dialogues generated by a multi-LLM agent pipeline grounded in ePCR reports, annotated with diagnoses, roles, and topics, and shown to improve accuracy, timeliness, and stability in conversational diagnosis prediction.
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-heavy tasks.
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
ZORO integrates rules directly into AI coding workflows by enriching plans, enforcing compliance with proof requirements, and evolving rules via user feedback, resulting in better rule adherence and shifts in user behavior.
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.
Trace mutations are a class of context failures in LLM conversations consisting of utterance effacement and genitive dissociation that distort the shared record while resisting ordinary repair.
SWE-Pruner trains a lightweight neural skimmer to perform task-aware pruning of code contexts for LLM agents, delivering 23-54% token reduction on SWE-Bench Verified with improved success rates and up to 14.84x compression on LongCodeQA.
citing papers explorer
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-heavy tasks.
-
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.