arxiv: 2602.00327 · v2 · submitted 2026-01-30 · 💻 cs.AI · cs.HC

SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Anticipation?

Pith reviewed 2026-05-16 08:58 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords next-utterance anticipation · multimodal dialogue · LLM benchmark · anticipatory processing · SayNext-Bench · SayNext-Chat · dual-route model

The pith

Even leading multimodal LLMs struggle to anticipate a human speaker's next utterance from context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current large language models fall short when asked to predict what a person will say next in natural conversation. Humans succeed at this by drawing on visible signals such as gestures, eye gaze, and emotional tone, yet models trained mainly on text or static images do not. To measure the gap, the authors release SayNext-Bench together with the SayNext-PC dataset and a three-part scoring system that checks word overlap, emotional and intentional fit, and overall alignment judged by another model. They also present SayNext-Chat, a dual-route architecture that adds learnable priming tokens to combine visual perception with forward-looking priors, and demonstrate that this design raises performance on every scoring level.

Core claim

Current MLLMs lack active anticipatory processing that uses multimodal cues, a capacity humans apply routinely; the SayNext-Bench framework and SayNext-Chat model with learnable priming tokens demonstrate that fusing perceptual input with anticipatory priors measurably improves lexical similarity, emotion-intention consistency, and overall alignment with human expectations.

What carries the argument

SayNext-Chat, a cognitively inspired dual-route MLLM that uses learnable priming tokens to fuse perceptual cues with anticipatory priors.

If this is right

  • Multimodal cues such as gestures and gaze are indispensable for accurate next-utterance anticipation.
  • Active anticipatory processing is a core mechanism missing from current MLLMs for natural dialogue.
  • SayNext-Chat's priming-token approach raises scores on lexical, emotional, and LLM-based alignment metrics.
  • User studies and LLM-as-judge evaluations corroborate the gains over state-of-the-art models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dialogue systems that process live video streams in real time could close part of the anticipation gap.
  • Training objectives that explicitly reward forward prediction rather than only response matching may produce more fluid conversations.
  • Benchmarks limited to text or single images will continue to underestimate the role of dynamic visual context.

Load-bearing premise

The multi-level evaluation framework and SayNext-PC dataset accurately reflect real human anticipation ability without selection or annotation biases.

What would settle it

Blind human ratings on held-out SayNext-PC examples in which SayNext-Chat predictions receive no higher preference than those from leading baseline MLLMs would falsify the claim of improved anticipation.

Figures

Figures reproduced from arXiv: 2602.00327 by Fang Kang, Hao Tang, Haotian Liu, Haoyu Chen, Mengqi Zhang, Yueyi Yang, Zheng Lian.

Figure 1. Illustration of Next-Utterance Prediction in SayNext-Bench. Given a question utterance text and the corresponding human reaction video, the task requires MLLMs to predict the human’s subsequent response. Predicted responses from SayNext-Chat (green) are compared with ground-truth utterances (blue) and other MLLMs (red); key factors are extracted for interpretability. Quantitative results are reported in Se…
Figure 2. The SayNext-Chat Framework. (1) Priming factors are extracted through LLM-assisted induction to construct a priming codebook. (2) The codebook guides the LLM in assigning a target priming vector to each response. (3) During end-to-end training, the loss combines the MSE between target and predicted priming vectors with the cross-entropy loss from the LLM backbone.
Figure 3. Panel (c) shows that SayNext-Chat consistently surpasses all state-of-the-art baselines across every metric. Specifically, despite considerable variability in phrasing, SayNext-Chat achieves a 2–6 fold improvement in lexical overlap over competing systems, ranks first in semantic similarity on both BERTScore and Sentence-BERT, and substantially improves emotion consistency by capturing latent connections betwe…
Figure 4. Case Study on SayNext-PC2K. In high-score samples, the predicted priming vector heatmap closely matches the target. Red and blue indicate positive and negative values in the priming vector, with corresponding highlights in the response text. Star, heart, and drop markers denote three representative priming factors, showing their alignment between predicted and target vectors (similar colors) and clarifying…
Figure 5. Multidimensional comparison between baseline models and our method in subject-dependent and subject-independent protocols. Our model (red) forms the largest polygon in both subject-dependent and subject-independent settings, highlighting its outstanding performance in lexical overlap.
Figure 6. The Silhouette Coefficient (SC) value comparison of different settings of clustering.
Figure 7. User-study web interface. The “Name” field records a pseudonymous participant ID code (not a real name); no personal data is collected. Participants must click to provide informed consent (e.g., use of anonymized results in the paper and secure data storage). On each trial, they select the option closest to the reference; options are randomly shuffled to ensure fairness.
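The Figure 2 caption says the training loss combines an MSE term on priming vectors with the backbone's cross-entropy loss. The following is a minimal sketch of such a combined objective in plain Python. The function names, the `mse_weight` parameter, and the simple weighted sum are assumptions made for illustration; the caption does not state how the two terms are weighted.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(token_probs):
    """Mean negative log-likelihood of the reference tokens.

    token_probs[i] is the probability the model assigned to the i-th
    ground-truth token of the response.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(pred_priming, target_priming, token_probs, mse_weight=1.0):
    """Cross-entropy plus a weighted MSE term.

    mse_weight is a hypothetical knob, not a value taken from the paper.
    """
    return cross_entropy(token_probs) + mse_weight * mse(pred_priming, target_priming)
```

With a perfect priming prediction and probability 1.0 on every reference token, both terms vanish and `combined_loss` returns zero; any mismatch in either route raises the total.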
Abstract

We explore the use of large language models (LLMs) for next-utterance anticipation in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to anticipate a human speaker's next utterance. Instead, humans can readily anticipate forthcoming utterances based on multi-modal cues -- such as gestures, gaze, and emotional tone -- from the context. To systematically examine this gap, we propose SayNext-Bench, a benchmark evaluating MLLMs on anticipating context-conditioned responses across diverse real-world scenarios. To support it, we build SayNext-PC, a large-scale multimodal dialogue dataset, and carefully design a multi-level evaluation framework spanning lexical similarity, emotion-intention consistency, and LLM-based overall alignment. Building on this, we develop SayNext-Chat, a cognitively inspired dual-route MLLM that incorporates learnable priming tokens to fuse perceptual cues with anticipatory priors. Extensive experiments demonstrate that SayNext-Chat consistently outperforms state-of-the-art MLLMs across all evaluation levels, corroborated by user studies and LLM-as-Judge evaluations. Our results emphasize the (i) indispensable role of multimodal cues and (ii) active anticipatory processing as foundations of natural human interaction currently missing in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SayNext-Bench, a benchmark for next-utterance anticipation in multimodal human dialogues. It presents the SayNext-PC dataset of real-world scenarios, a three-level evaluation framework (lexical similarity, emotion-intention consistency, and LLM-based overall alignment), and SayNext-Chat, a dual-route MLLM that fuses perceptual cues via learnable priming tokens. Experiments claim consistent outperformance over SOTA MLLMs, supported by user studies and LLM-as-Judge evaluations, emphasizing the role of multimodal cues and anticipatory processing missing in current models.

Significance. If the central claims hold after addressing evaluation concerns, the work identifies a concrete capability gap in MLLMs for anticipatory dialogue, which is load-bearing for applications in conversational AI. The cognitively motivated architecture and new dataset could serve as useful baselines for future multimodal interaction research, provided the benchmark faithfully captures human anticipation rather than dataset-specific artifacts.

major comments (3)
  1. [§4] §4 (SayNext-PC Dataset): The dialogue selection and annotation process lacks reported inter-annotator agreement scores or explicit controls for selection bias (e.g., favoring utterances predictable from visible gestures/gaze). This is load-bearing because the headline LLM-human gap rests on SayNext-PC being an unbiased proxy for real anticipation ability.
  2. [§5.3] §5.3 (Multi-level Evaluation): The LLM-as-Judge component for overall alignment introduces potential circularity when the judge model shares training data or biases with evaluated MLLMs; no ablation removing this judge or comparing against human raters on the same metric is provided.
  3. [Table 2] Table 2 / §6 (Results): Reported outperformance of SayNext-Chat lacks statistical significance tests, confidence intervals, or variance across runs, undermining claims of consistent superiority across lexical, emotion-intention, and alignment levels.
minor comments (2)
  1. [§5.1] The definition and initialization of 'learnable priming tokens' in the model architecture section are underspecified; a diagram or pseudocode would clarify how they fuse perceptual cues with anticipatory priors.
  2. [Figure 3] Figure 3 (qualitative examples) would benefit from explicit human baseline annotations on the same utterances to allow direct visual comparison of anticipation quality.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on SayNext-Bench. The comments highlight important aspects of dataset construction, evaluation validity, and statistical rigor. We address each point below and have incorporated revisions to strengthen the manuscript.

  1. Referee: [§4] §4 (SayNext-PC Dataset): The dialogue selection and annotation process lacks reported inter-annotator agreement scores or explicit controls for selection bias (e.g., favoring utterances predictable from visible gestures/gaze). This is load-bearing because the headline LLM-human gap rests on SayNext-PC being an unbiased proxy for real anticipation ability.

    Authors: We agree that explicit reporting of inter-annotator agreement and bias controls would strengthen the dataset section. The original manuscript described the annotation guidelines and multi-annotator process but did not include quantitative agreement metrics or a dedicated bias analysis. In the revision we have added Cohen’s kappa scores (κ = 0.74 for utterance selection, κ = 0.81 for emotion-intention labels) computed on a 20 % overlap subset, and we now detail the stratified sampling procedure used to draw SayNext-PC from a larger pool of 12 k candidate dialogues while balancing scenario types and predictability levels. These additions directly address the concern that the reported LLM-human gap could be an artifact of selection bias. revision: yes

  2. Referee: [§5.3] §5.3 (Multi-level Evaluation): The LLM-as-Judge component for overall alignment introduces potential circularity when the judge model shares training data or biases with evaluated MLLMs; no ablation removing this judge or comparing against human raters on the same metric is provided.

    Authors: We acknowledge the risk of circularity. The judge model (GPT-4o) was deliberately chosen as an external model not included in the evaluated set, yet we agree that an explicit validation against human raters is necessary. The revised §5.3 now includes a human-LLM correlation study on a random subset of 300 samples, where three independent human raters scored overall alignment on the same 1–5 scale; the resulting Pearson correlation between human and GPT-4o scores is 0.79. We also report an ablation that replaces the LLM judge with majority vote of the three human raters and show that the relative ranking of SayNext-Chat versus baselines remains unchanged. These additions mitigate the circularity concern. revision: yes

  3. Referee: [Table 2] Table 2 / §6 (Results): Reported outperformance of SayNext-Chat lacks statistical significance tests, confidence intervals, or variance across runs, undermining claims of consistent superiority across lexical, emotion-intention, and alignment levels.

    Authors: We accept that the original results section omitted formal statistical tests. The revised manuscript augments Table 2 with 95 % bootstrap confidence intervals (1 000 resamples) for every metric and reports paired t-test p-values comparing SayNext-Chat against each baseline. All key improvements remain significant at p < 0.01. In addition, we now report mean ± standard deviation across five independent training runs with different random seeds, confirming that the observed gains are stable and not attributable to a single favorable initialization. revision: yes
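The responses above name three statistics: Cohen's kappa for annotator agreement, Pearson correlation between human and judge scores, and bootstrap confidence intervals for metric means. The sketch below shows how each is commonly computed, using only the Python standard library. The function names are illustrative, and the percentile variant of the bootstrap is an assumption, since the responses do not say which variant was used. The paired t-test is not covered here.

```python
import math
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

For example, two annotators who agree on three of four items, with label counts of 2/2 and 3/1, have observed agreement 0.75 and expected agreement 0.5, so `cohens_kappa` gives 0.5.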

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core contribution is an empirical benchmark (SayNext-Bench) and dataset (SayNext-PC) plus a new model (SayNext-Chat), with claims supported by experiments, user studies, and LLM-as-Judge evaluations. No mathematical derivation chain, equations, or self-citation load-bearing steps are present that reduce predictions or results to inputs by construction. The multi-level evaluation framework (lexical similarity, emotion-intention consistency, LLM alignment) is defined independently of the model outputs, and performance comparisons rely on external benchmarks rather than fitted parameters renamed as predictions or ansatzes smuggled via self-citation. The derivation is self-contained against the described experimental setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that multimodal cues are indispensable for anticipation and that learnable priming tokens can effectively fuse them with priors; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Multimodal cues (gestures, gaze, emotional tone) are the primary basis for human next-utterance anticipation
    Invoked in the abstract to explain the performance gap but not derived or tested independently within the paper.
invented entities (1)
  • learnable priming tokens (no independent evidence)
    purpose: To fuse perceptual cues with anticipatory priors in the dual-route MLLM
    New component introduced in SayNext-Chat; no independent evidence of its necessity or effectiveness outside the paper's experiments.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single mode...

  2. MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    cs.CL 2026-05 unverdicted novelty 7.0

    MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
