Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

· 2025 · cs.CL · arXiv 2504.04717

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

open full Pith review browse 9 citing papers arXiv PDF

abstract

Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy-spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings-we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

cs.CY · 2026-05-21 · conditional · novelty 6.0

Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

cs.CR · 2026-05-17 · conditional · novelty 6.0

LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

From History to State: Constant-Context Skill Learning for LLM Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

cs.CV · 2026-04-11 · unverdicted · novelty 5.0 · 2 refs

SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.

Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction

cs.CL · 2026-03-18 · unverdicted · novelty 5.0

Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.

citing papers explorer

Showing 9 of 9 citing papers.

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator cs.CL · 2026-05-20 · unverdicted · none · ref 15 · internal anchor
RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions cs.CY · 2026-05-21 · conditional · none · ref 8 · internal anchor
Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails cs.CR · 2026-05-17 · conditional · none · ref 18 · internal anchor
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
SOMA: Efficient Multi-turn LLM Serving via Small Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks cs.CR · 2026-05-10 · unverdicted · none · ref 22 · internal anchor
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
From History to State: Constant-Context Skill Learning for LLM Agents cs.AI · 2026-05-06 · unverdicted · none · ref 12 · internal anchor
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models cs.CL · 2026-04-22 · unverdicted · none · ref 3 · internal anchor
New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.
SinkTrack: Attention Sink based Context Anchoring for Large Language Models cs.CV · 2026-04-11 · unverdicted · none · ref 8 · 2 links · internal anchor
SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction cs.CL · 2026-03-18 · unverdicted · none · ref 16 · internal anchor
Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer