RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
9 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy-spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings-we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.
citation-role summary
citation-polarity summary
years
2026 9roles
background 2polarities
background 2representative citing papers
Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.
SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.
citing papers explorer
-
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
-
Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.
-
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
-
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
From History to State: Constant-Context Skill Learning for LLM Agents
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
-
AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.
-
SinkTrack: Attention Sink based Context Anchoring for Large Language Models
SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
-
Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.