Beneath the Surface: Investigating LLMs' Capabilities for Communicating with Subtext

Andrew Kyle Lampinen; Kabir Ahuja; Yuxuan Li

arxiv: 2604.05273 · v1 · submitted 2026-04-07 · 💻 cs.CL

Pith reviewed 2026-05-10 20:05 UTC · model grok-4.3

keywords large language models · subtext · literal communication · allegory interpretation · common ground · evaluation suites · multi-agent games · visual allusions

The pith

Language models show a strong bias toward literal statements and underuse subtext even when constraints call for implied meaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces four new evaluation suites to test whether language models can communicate using subtext rather than direct statements. These include allegory writing and interpretation plus multi-agent and multi-modal games modeled on board-game rules that reward indirect clues. Frontier models produce overly literal outputs in most cases: even the best-performing models give literal clues 60 percent of the time in the Visual Allusions setting. When shared context is supplied explicitly, some models cut overly literal clues by 30 to 50 percent, yet they struggle to infer that common ground exists when it is not stated. Paratext and persona framing also shift how models read subtext in allegories.

Core claim

Frontier models generally exhibit a strong bias towards overly literal, explicit communication and thereby fail to account for nuanced constraints; even the best performing models generate literal clues 60 percent of the time in the Visual Allusions environment. Some models achieve 30 to 50 percent reduction in literal clues when common ground is provided, but they struggle to infer its presence when not explicitly stated. Paratextual and persona conditions significantly shift the interpretation of subtext in allegory tasks.
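To make the two kinds of figure in this claim concrete, here is a minimal sketch of how a literal-clue rate and a reduction in that rate could be computed from per-trial labels. The function names and the example labels are hypothetical and are not taken from the paper. The sketch assumes the reported reduction is relative to a baseline rate; the summary does not say whether it is relative or absolute.

```python
def literal_rate(labels):
    """Fraction of trials judged literal. `labels` is a list of booleans."""
    if not labels:
        raise ValueError("need at least one trial")
    return sum(labels) / len(labels)


def relative_reduction(baseline, treated):
    """Relative drop in literal rate, e.g. 0.60 -> 0.40 is a one-third reduction."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - treated) / baseline


# Made-up labels for illustration only (True = clue judged literal).
without_common_ground = [True] * 6 + [False] * 4   # 6 of 10 literal
with_common_ground = [True] * 4 + [False] * 6      # 4 of 10 literal

baseline = literal_rate(without_common_ground)
treated = literal_rate(with_common_ground)
print(f"baseline={baseline:.2f} treated={treated:.2f} "
      f"relative reduction={relative_reduction(baseline, treated):.2f}")
```

Under an absolute reading the same made-up data would show a drop of 20 percentage points, so the two readings can differ substantially.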

What carries the argument

Four custom evaluation suites that score model outputs for literal versus subtextual content across allegory tasks and multi-agent visual games inspired by Dixit-style rules, with Visual Allusions as one concrete environment.
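Why Dixit-style rules reward indirect clues can be seen in the standard board-game scoring, sketched below. This follows the published Dixit rules and leaves out the bonus points players earn for votes on their decoy cards; the paper's own scoring may differ, and the function name is hypothetical.

```python
def dixit_round_scores(num_guessers, num_correct):
    """Score one round under standard Dixit rules (decoy-vote bonuses omitted).

    Returns (storyteller_points, points_per_correct_guesser,
             points_per_wrong_guesser).
    """
    if num_guessers < 1 or not 0 <= num_correct <= num_guessers:
        raise ValueError("num_correct must lie between 0 and num_guessers")
    if num_correct == 0 or num_correct == num_guessers:
        # Clue was too obscure (nobody found the card) or too obvious
        # (everybody did): the storyteller scores nothing, everyone else 2.
        return 0, 2, 2
    # Some but not all guessers found the card: storyteller and the
    # correct guessers score 3, wrong guessers score 0.
    return 3, 3, 0


print(dixit_round_scores(num_guessers=3, num_correct=3))  # overly literal clue
print(dixit_round_scores(num_guessers=3, num_correct=1))  # partially understood clue
```

A fully literal clue that every guesser decodes earns the storyteller zero points, which is the constraint a literal-biased model fails to respect.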

If this is right

Models fail to account for nuanced constraints even in simple communicative settings.
Explicit common ground reduces literal outputs in some models but does not fully close the gap.
Models cannot reliably infer unstated common ground needed for subtext.
Paratext and persona framing produce large shifts in how models interpret implied meaning.
Quantifiable scores can be obtained for an otherwise subjective aspect of communication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The literal bias may limit LLMs in open-ended collaboration where humans routinely rely on shared background.
Future training that rewards concise indirect phrasing could be tested against these same suites.
The suites could be extended to measure subtext use in longer multi-turn dialogues rather than single clues.
Persona effects suggest that role instructions might serve as a practical workaround for current models.

Load-bearing premise

The new evaluation suites and scoring rules for literal versus subtextual outputs accurately capture the intended communicative capabilities without introducing artifacts from prompt design or task framing.

What would settle it

A model that produces subtextual rather than literal clues in more than half of Visual Allusions trials without any explicit common-ground statement would contradict the reported literal bias.
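One way to make "more than half" checkable is an exact one-sided binomial test against a 50 percent rate, sketched below. The choice of test, the significance level, and the function names are assumptions for illustration; the source states only the more-than-half criterion.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceeds_half(subtextual, trials, alpha=0.05):
    """True if the subtextual rate is above 0.5 and that excess would be
    unlikely (tail probability below alpha) if the true rate were 0.5."""
    if trials <= 0 or not 0 <= subtextual <= trials:
        raise ValueError("need 0 <= subtextual <= trials and trials > 0")
    return subtextual / trials > 0.5 and binomial_tail(subtextual, trials) < alpha


# Hypothetical counts: 70 subtextual clues out of 100 trials.
print(exceeds_half(70, 100))
```

A bare majority such as 52 of 100 would not pass this check, since a count that high is common when the true rate is exactly one half.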

Original abstract

Human communication is fundamentally creative, and often makes use of subtext -- implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing & interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints -- even the best performing models generate literal clues 60% of times in one of our environments -- Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving 30%-50% reduction in overly literal clues; but they struggle at inferring presence of a common ground when not explicitly stated. For allegory understanding, we find paratextual and persona conditions to significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research to inspire future work towards socially grounded creative communication and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague

The paper gives four new test suites for subtext in games and allegories and shows that even the best-performing frontier models give literal clues about 60% of the time in the Visual Allusions environment, with partial help from explicit common ground.

The main thing to know is that frontier models lean heavily literal when they should use subtext, even in the best cases, and the work supplies concrete numbers plus new environments to measure it. The allegory and Dixit-style game suites force models to handle shared knowledge and implied meaning under constraints that prior literal-vs-figurative tests did not cover as directly. The reported 30-50% drop in literal outputs when common ground is supplied, and the failure to infer it otherwise, are the clearest quantitative signals. Paratext and persona shifts also move allegory interpretations in measurable ways. These are useful additions because they turn a fuzzy capability into something that can be tracked across models and conditions. The tasks themselves are a step forward for anyone studying pragmatic or collaborative language use.

The soft spot is the scoring step that separates literal from subtextual outputs. The stress-test concern about prompt framing and rubric sensitivity is reasonable; without the exact templates, edge-case rules, and any agreement stats, it is hard to tell how much the 60% figure reflects model limits versus task artifacts. The abstract gives the headline percentages but leaves the implementation details thin, which weakens confidence in the size of the bias.

This paper is aimed at people working on benchmarks for social or creative reasoning in LLMs. A reader already running multi-agent or pragmatics evaluations would get immediate value from the new suites and could adapt them.

It deserves a serious referee. The tasks are novel enough and the observations point to real gaps worth checking, even if the methods section will need more scrutiny on robustness.

Referee Report

3 major / 2 minor

Summary. The paper introduces four new evaluation suites to assess LLMs' use of subtext in communication, spanning allegory writing/interpretation and multi-agent/multi-modal games inspired by Dixit. It reports that frontier models show a strong literal bias (e.g., 60% literal clues in Visual Allusions) and fail to account for nuanced constraints, though some models achieve 30-50% reductions in literal outputs when common ground is explicitly provided; they struggle to infer common ground otherwise. Paratextual and persona conditions significantly shift allegory interpretations. The work supplies quantifiable measures for subtext and documents LLM weaknesses in socially grounded creative communication.

Significance. If the central empirical findings hold after methodological clarification, the paper makes a useful contribution by highlighting limitations in LLMs for nuanced, context-sensitive communication relevant to collaborative agents and creative tasks. The creation of multiple new task suites provides concrete benchmarks that can be extended, and the empirical focus on common-ground effects offers a falsifiable direction for future work on socially grounded reasoning.

major comments (3)

[Evaluation Methodology and Results] Evaluation sections (including Visual Allusions and common-ground experiments): the headline quantitative claims (60% literal clues; 30-50% reductions) rest on newly defined scoring rules for literal vs. subtextual outputs, yet the manuscript provides insufficient detail on exact prompt templates, few-shot examples, scoring rubrics, inter-annotator agreement, and statistical controls. This makes it impossible to determine whether the reported literal bias reflects intrinsic model behavior or artifacts of task framing.
[Allegory Experiments] § on allegory understanding: the reported effects of paratextual and persona conditions on subtext interpretation are interesting, but without ablation on prompt wording or controls for model-specific decoding parameters, it is unclear whether these shifts are robust or sensitive to minor instruction changes.
[Multi-Agent Game Environments] Common-ground inference experiments: the claim that models 'struggle at inferring presence of a common ground when not explicitly stated' is load-bearing for the broader narrative, yet the paper does not report controls for whether the inference failure is due to the model or to the way common-ground presence is operationalized in the prompt.
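The inter-annotator agreement requested in the first major comment is commonly reported as Cohen's kappa. The sketch below computes it for two annotators labelling clues as literal or subtextual. The labels are made up, and nothing here is taken from the paper's actual annotation protocol, which the summary does not describe.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four clues.
annotator_1 = ["literal", "literal", "subtext", "subtext"]
annotator_2 = ["literal", "subtext", "subtext", "subtext"]
print(cohens_kappa(annotator_1, annotator_2))
```

Reporting a statistic of this kind alongside the rubric would let readers judge how much of the 60% literal rate depends on annotator judgment.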

minor comments (2)

[Abstract and Introduction] The abstract and introduction use 'subtext' and 'implied meaning' interchangeably without a brief operational definition; a short clarifying sentence would help readers from outside literary or pragmatics traditions.
[Figures and Tables] Figure captions for the game environments should explicitly state the exact number of trials, models, and annotators per condition to improve reproducibility at a glance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the presentation and robustness of our results. We address each major comment in detail below and commit to substantial revisions to enhance the manuscript's clarity and methodological transparency.

Referee: [Evaluation Methodology and Results] Evaluation sections (including Visual Allusions and common-ground experiments): the headline quantitative claims (60% literal clues; 30-50% reductions) rest on newly defined scoring rules for literal vs. subtextual outputs, yet the manuscript provides insufficient detail on exact prompt templates, few-shot examples, scoring rubrics, inter-annotator agreement, and statistical controls. This makes it impossible to determine whether the reported literal bias reflects intrinsic model behavior or artifacts of task framing.

Authors: We agree that the manuscript would benefit from greater methodological transparency to allow independent verification of our results. In the revised version, we will include the exact prompt templates used across all experiments, the few-shot examples where applicable, the complete scoring rubrics for distinguishing literal from subtextual outputs, inter-annotator agreement statistics for our human evaluations, and details on statistical controls such as multiple runs and variations in task framing. These additions will demonstrate that the observed literal bias, including the 60% rate in Visual Allusions and the reductions with explicit common ground, arises from the models' tendencies rather than specific prompt artifacts.
revision: yes

Referee: [Allegory Experiments] § on allegory understanding: the reported effects of paratextual and persona conditions on subtext interpretation are interesting, but without ablation on prompt wording or controls for model-specific decoding parameters, it is unclear whether these shifts are robust or sensitive to minor instruction changes.

Authors: We recognize the need to verify the robustness of the reported effects. Our original experiments used standardized prompt structures and fixed decoding parameters to isolate the impact of paratextual and persona conditions. For the revision, we will conduct and report ablations on minor prompt wording variations to show that the significant shifts in subtext interpretation remain consistent. We will also specify the decoding parameters used and provide comparative results under alternative settings to confirm the findings are not sensitive to these factors.
revision: yes

Referee: [Multi-Agent Game Environments] Common-ground inference experiments: the claim that models 'struggle at inferring presence of a common ground when not explicitly stated' is load-bearing for the broader narrative, yet the paper does not report controls for whether the inference failure is due to the model or to the way common-ground presence is operationalized in the prompt.

Authors: This is an important consideration for validating our central claim. We will add control experiments in the revised manuscript that test alternative operationalizations of common ground presence in the prompts, such as varied implicit cues. These controls will allow us to assess whether the models' difficulties in inferring common ground persist across different prompt formulations, thereby supporting that the limitation is model-inherent rather than an artifact of our specific operationalization.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with new tasks and direct measurements

This is an empirical study that defines four new evaluation suites (allegory writing/interpretation, Dixit-inspired games, Visual Allusions) and reports measured model behaviors such as literal clue rates. No derivations, equations, first-principles predictions, or fitted parameters are present whose outputs reduce to the inputs by construction. Claims rest on observed performance against explicitly defined tasks and rubrics rather than self-referential definitions or self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the introduced tasks and literal/subtext classifications validly reflect communicative subtext use; no free parameters or new entities are introduced.

axioms (1)

domain assumption: The new evaluation suites and scoring criteria for literal versus subtextual outputs accurately measure the intended capabilities.
The paper treats performance on these custom games and allegory tasks as direct evidence of subtext communication ability.
