pith. machine review for the scientific record.

arxiv: 2604.15140 · v1 · submitted 2026-04-16 · 💻 cs.CL

Recognition: unknown

DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords rhetorical strategies · discourse acts · information-seeking QA · LLM responses · human answering styles · RST parses · pragmatic QA · answer diversity

The pith

Different human communities employ varied rhetorical strategies when answering questions, but LLMs produce uniform, broader responses even when prompted to mimic specific community guidelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DiscoTrace is introduced as a way to map answers into sequences of discourse acts linked to specific interpretations of the question, built on rhetorical structure theory parses. Testing it on responses from nine human groups uncovers clear differences in how they prefer to structure answers. LLMs, by comparison, show little variation in their rhetorical choices and tend to tackle more possible readings of the question than people do. The work points toward using these human patterns to make AI responses more adaptable to context.

Core claim

The central discovery is that applying the DiscoTrace method to answers from diverse human communities demonstrates varied rhetorical preferences, whereas LLMs consistently lack this diversity and favor addressing a wider set of question interpretations, even under prompts designed to replicate human community guidelines.

What carries the argument

DiscoTrace, which represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on rhetorical structure theory parses.
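
As a rough sketch of that representation (not the authors' code; the field names and example labels below are assumptions), a DiscoTrace can be read as an ordered list of answer segments, each tagged with a discourse act from the paper's ontology and, where applicable, the question interpretation it addresses:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TraceSegment:
    text: str                      # answer span identified via the RST parse
    act: str                       # discourse act, e.g. "Assert Answer"
    interpretation: Optional[str]  # question interpretation addressed, if any

@dataclass
class DiscoTrace:
    question: str
    segments: List[TraceSegment]   # ordered sequence of discourse acts

    def act_sequence(self) -> List[str]:
        """The act sequence that gets compared across communities and models."""
        return [s.act for s in self.segments]

# Hypothetical example, not drawn from the paper's data
trace = DiscoTrace(
    question="Is it safe to reheat rice?",
    segments=[
        TraceSegment("Generally yes, if it was cooled and stored promptly.",
                     "Assert Answer", "food-safety reading"),
        TraceSegment("Food-safety guidance recommends refrigerating leftovers quickly.",
                     "Cite External Source", "food-safety reading"),
    ],
)
print(trace.act_sequence())  # ['Assert Answer', 'Cite External Source']
```

The paper's comparisons then operate on these act sequences rather than on the raw answer text.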

If this is right

  • Human communities display diverse preferences in constructing answers to information-seeking questions.
  • LLMs do not show rhetorical diversity in their answers despite prompts to mimic specific human guidelines.
  • LLMs systematically choose breadth by addressing interpretations that human answerers avoid.
  • The findings can inform the creation of LLM answerers that use context-informed strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLMs could be fine-tuned using examples from specific communities to better match their rhetorical styles.
  • Broader adoption of such analysis might reveal why LLMs tend toward comprehensive but less focused responses.
  • Developers might experiment with DiscoTrace to evaluate and adjust prompt strategies for more human-like variation.

Load-bearing premise

The assumption that discourse act annotations on RST parses consistently and accurately reflect the rhetorical strategies across human communities and LLMs without bias or parsing errors.

What would settle it

Finding that repeated annotations by different people on the same answers yield substantially different discourse act sequences, or that LLMs can be prompted to match human diversity levels as measured by DiscoTrace.

Figures

Figures reproduced from arXiv: 2604.15140 by Jordan Boyd-Graber, Neha Srikanth, Rachel Rudinger.

Figure 1. Given a question (top) and its interpretations …
Figure 2. Rhetorical structure theory (RST) trees provide useful discourse scaffolding, but not enough information …
Figure 3. Each segment in a DISCOTRACE is labeled with one of these actions, grouped into five broad families. The full DISCOTRACE pairs each of these with a question-specific interpretation.
Figure 4. DISCOTRACE helps compare how different answering strategies are across different subreddits. We train bigram models on act sequences from one community C_train (rows) and evaluate on another C_eval (columns). The diagonal is the self-perplexity. The left (a) compares models trained on human answers vs. human answers, and the right (b) trains a bigram model on claude-haiku-4.5 answers to questions in C_train …
Figure 5. Prompting claude-sonnet-4.5 with the community guidelines to answer questions as a member of two subreddits, AskHistorians and ScienceBasedParenting (mimic), is insufficient to create answers that match the norms of the underlying community. Human answers still have higher perplexity under bigram models trained on LLM answers (top), with LLMs deploying a narrower structure (bottom) …
Figure 6. LLMs address low-probability interpretations …
Figure 7. We compare the proportion of answers that address each of the 21 acts in our ontology …
Figure 8. We train bigram models on act sequences from one community C_train (rows) and evaluate on another C_eval (columns). The diagonal is the self-perplexity. Here we train a bigram model on qwen3-32b answers to questions in C_train and evaluate on human answers.
Figure 9. We train bigram models on act sequences from one community C_train (rows) and evaluate on another C_eval (columns). The diagonal is the self-perplexity. Here we train a bigram model on claude-4.5-haiku answers to questions in C_train and evaluate on claude-4.5-haiku answers.
Figure 10. We train bigram models on act sequences from one community C_train (rows) and evaluate on another C_eval (columns). The diagonal is the self-perplexity. Here we train a bigram model on qwen3-32b answers to questions in C_train and evaluate on qwen3-32b answers.
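
The cross-community comparison described in the captions of Figures 4 and 8-10 (fit a bigram model over discourse-act sequences from one community C_train, measure perplexity on another C_eval, with the diagonal giving self-perplexity) can be sketched roughly as follows; the add-one smoothing and the toy act vocabulary are assumptions, not the authors' exact estimator:

```python
from collections import Counter
from math import exp, log

BOS, EOS = "<s>", "</s>"

def train_bigram(sequences, vocab, alpha=1.0):
    """Add-alpha smoothed bigram model over discourse-act sequences
    (the smoothing choice is an assumption; the paper does not specify one)."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    size = len(vocab) + 2  # acts plus boundary symbols

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * size)

    return prob

def perplexity(prob, sequences):
    """Per-token perplexity of held-out act sequences under the model."""
    logp, n = 0.0, 0
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded[:-1], padded[1:]):
            logp += log(prob(prev, cur))
            n += 1
    return exp(-logp / n)

# Hypothetical usage: one heatmap cell, C_train = community A, C_eval = community B.
acts = {"Assert Answer", "Provide Background", "Provide Example",
        "Cite External Source", "Provide Reasoning or Justification"}
train_seqs = [["Provide Background", "Assert Answer", "Cite External Source"],
              ["Assert Answer", "Provide Reasoning or Justification"]]
eval_seqs = [["Assert Answer", "Provide Example"]]
model = train_bigram(train_seqs, acts)
print(round(perplexity(model, eval_seqs), 2))
```

Lower perplexity means the evaluated answers are more predictable under the training community's act patterns; this is the quantity the heatmaps visualize.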
read the original abstract

We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DiscoTrace, a method that represents answers to information-seeking questions as sequences of question-related discourse acts (annotated on RST parses) paired with specific interpretations of the original question. Applying this framework to answers from nine human communities shows diverse preferences for answer construction strategies. In contrast, LLMs exhibit limited rhetorical diversity even when prompted to follow community-specific guidelines, and systematically favor breadth by addressing additional question interpretations that humans omit. The work aims to inform development of more pragmatically aware LLM answerers.

Significance. If the annotation reliability holds, the findings would be significant for documenting systematic pragmatic differences between human communities and LLMs in QA, offering a structured way to diagnose and mitigate overly broad or uniform LLM responses. The RST-based discourse act representation is a methodological contribution that enables systematic, comparable analysis across sources and could support future work on context-sensitive generation.

major comments (1)
  1. [DiscoTrace method description and annotation procedure] The central claims (human diversity vs. LLM uniformity and breadth preference) depend on DiscoTrace's discourse act sequences faithfully reflecting answerer intent. The manuscript provides no reported inter-annotator agreement, validation of annotation guidelines across informal community text and LLM outputs, or error analysis for RST parser performance differences between these sources. Systematic parse degradation or annotator bias on LLM text versus human posts would render the diversity gap and breadth findings uninterpretable artifacts rather than substantive results.
minor comments (1)
  1. [Abstract and §1] The abstract and introduction would benefit from explicitly naming the nine human communities and the specific LLMs/prompting setups used, to allow readers to assess the scope of the diversity claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the reliability of the DiscoTrace annotations. We address the methodological concern point by point below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: [DiscoTrace method description and annotation procedure] The central claims (human diversity vs. LLM uniformity and breadth preference) depend on DiscoTrace's discourse act sequences faithfully reflecting answerer intent. The manuscript provides no reported inter-annotator agreement, validation of annotation guidelines across informal community text and LLM outputs, or error analysis for RST parser performance differences between these sources. Systematic parse degradation or annotator bias on LLM text versus human posts would render the diversity gap and breadth findings uninterpretable artifacts rather than substantive results.

    Authors: We agree that the central claims rest on the faithfulness of the discourse act sequences and that explicit validation is required to rule out artifacts from annotation or parsing differences. The current manuscript does not report inter-annotator agreement, cross-source guideline validation, or a comparative error analysis of the RST parser. In the revised version we will add the following: (1) inter-annotator agreement scores (Cohen's kappa and percentage agreement) computed on a double-annotated subset of 200 answers drawn equally from human community posts and LLM outputs; (2) a brief validation study confirming that the annotation guidelines produce consistent discourse act labels when applied to both informal human text and LLM-generated text; and (3) an error analysis that quantifies RST parser performance (attachment and labeling accuracy) separately on the two sources, together with any observed systematic differences. These additions will be placed in a new subsection of the method and will directly support the interpretability of the reported diversity and breadth contrasts. revision: yes
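
As an illustration of the agreement statistics the rebuttal commits to (not something reported in the paper), Cohen's kappa and raw percentage agreement over two annotators' per-segment act labels could be computed along these lines; the labels below are invented:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-segment discourse-act labels from two annotators
# on the same double-annotated answers.
annotator_a = ["Assert Answer", "Provide Background", "Cite External Source",
               "Assert Answer", "Provide Example"]
annotator_b = ["Assert Answer", "Provide Background", "Provide Example",
               "Assert Answer", "Provide Example"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Cohen's kappa: {kappa:.2f}, raw agreement: {agreement:.0%}")
```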

Circularity Check

0 steps flagged

No circularity: empirical method application is self-contained

full rationale

The paper defines DiscoTrace as a representation of answers via RST parses annotated with discourse acts and question interpretations, then applies this representation empirically to compare human community answers and LLM outputs. No equations, fitted parameters, predictions derived from fits, or self-citations form the load-bearing steps of the central claims about diversity and breadth preferences. The derivation chain consists of direct annotation and comparison on external data sources rather than any reduction of outputs to inputs by construction or renaming. This is the expected non-finding for an annotation-driven empirical study without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that RST parses plus manual discourse-act labeling accurately reflect answerer strategies. No free parameters or invented physical entities are evident from the abstract. The main unstated premise is that the nine human communities and the chosen LLM prompts constitute a fair sample for the diversity claims.

axioms (2)
  • domain assumption: Rhetorical Structure Theory parses provide a reliable base layer for annotating discourse acts in answers.
    The method is defined on top of RST parses; any errors or inconsistencies in parsing would propagate to the strategy identification.
  • domain assumption: Human communities exhibit stable, measurable preferences for answer construction that can be captured by the annotation scheme.
    The diversity finding presupposes that the observed differences reflect genuine community norms rather than sampling artifacts.

pith-pipeline@v0.9.0 · 5425 in / 1393 out tokens · 38177 ms · 2026-05-10T11:33:44.904210+00:00 · methodology

discussion (0)


    If the segment does not clearly target any specific interpretation, return “NONE”. Output Format Respond with exactly ONE JSON object: [{"interpretation_id": "id_1"}] When no specific interpretation is targeted: [{"interpretation_id": "NONE"}] User: Question {question} Question Interpretations {interpretations} Full Answer {answer} Segment(labeled as “{ac...