The Thin Line Between Comprehension and Persuasion in LLMs

Adrian de Wynter; Tangming Yuan

arxiv: 2507.01936 · v3 · submitted 2025-07-02 · 💻 cs.CL · cs.CY

Pith reviewed 2026-05-19 06:02 UTC · model grok-4.3

keywords: large language models · argumentation · persuasion · dialogue comprehension · debate · AI understanding · pragmatic context

The pith

LLMs can persuade people in debates while failing to comprehend argument quality or supporting premises.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs' success at persuasive dialogue comes from real understanding of the discussion or just surface fluency. It runs informal debates between humans and LLMs, tracks how well each side sways beliefs, and then checks the same LLMs on their grasp of the argumentative structures and context in those debates. LLMs prove effective at keeping debates coherent and shifting opinions of both participants and outside audiences, yet they cannot reliably identify high-quality arguments or confirm the presence of supporting premises. People grow more skeptical once they suspect AI is involved. The work therefore separates dialogical performance from comprehension and asks what this split means for using such systems where explanations must hold up under scrutiny.

Core claim

Large language models maintain coherent and persuasive debates that sway the beliefs of both direct participants and observing audiences. When the same models are asked to demonstrate understanding of the debates they just conducted, they show no reliable grasp of deeper dialogical features such as argument quality or the existence of supporting premises. Suspicion of AI participation makes human evaluators more critical of the arguments presented. These outcomes indicate a separation between the ability to conduct convincing dialogue and the ability to show knowledge of what the dialogue is about.

What carries the argument

Informal human-LLM debates used to measure both persuasive outcomes and subsequent comprehension of argumentative structures and pragmatic context.

If this is right

LLMs can change beliefs through dialogue even when they lack detectable grasp of the underlying arguments.
People evaluate LLM arguments more critically once they know an AI is participating.
The gap between persuasion and comprehension raises practical issues for deploying LLMs in explanation-critical settings.
From an argumentation viewpoint, convincing dialogue does not require the agent to demonstrate understanding of the topic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future dialogue systems may need separate modules for persuasion and for verifiable reasoning if the observed split persists.
Evaluation protocols for AI in argumentative domains should test comprehension directly rather than relying on persuasive success alone.
The findings suggest that current LLMs may be better suited to roles where surface coherence is sufficient than to roles that demand accountable explanation.

Load-bearing premise

The tests for recognizing argument quality and supporting premises actually measure genuine comprehension rather than surface pattern matching.

What would settle it

An experiment in which LLMs that previously persuaded audiences now correctly and consistently identify argument quality and the presence of supporting premises in the same debate transcripts.

Figures

Figures reproduced from arXiv: 2507.01936 by Adrian de Wynter, Tangming Yuan.

**Figure 2.** Weighted Cohen’s κw for our experiments. The disparity between humans and LLMs is because this metric captures label imbalance and accounts for chance agreement (…

**Figure 4.** Class distribution for C-0 between humans and …

**Figure 5.** Evaluator consistency. A model would be consis…

**Figure 7.** Class distribution for the ‘Reasons’ subset of criteria. The models with the highest agreement with humans were, …

**Figure 8.** Class distribution for the ‘Arguments’ and ‘Debate’ subset of criteria. The models with the highest agreement with …

**Figure 9.** Belief of which player was AI (Player 1, Player 2, …

**Figure 10.** Change of agreement with the debate topics before and after, for Groups A, B, and C. We also include their choice …

**Figure 11.** Change of opinions based on the knowledge or belief of AI in the debates (sway) after the debate for groups B and …
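
The weighted Cohen's κw named in the Figure 2 caption can be illustrated with a short sketch. The version below is a minimal, standard-library Python implementation that assumes linear disagreement weights over ordered labels. The function name `weighted_kappa`, the 0 to 2 label scale, and the toy ratings are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    Uses disagreement weights: 0 on the diagonal, growing with the
    distance between the two labels ("linear" or "quadratic").
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    if len(categories) < 2:
        raise ValueError("need at least two ordered categories")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    index = {label: i for i, label in enumerate(categories)}
    k, n = len(categories), len(rater_a)

    def weight(i, j):
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    # Joint and marginal label counts for the two raters.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marginal_a = Counter(index[a] for a in rater_a)
    marginal_b = Counter(index[b] for b in rater_b)

    # Observed weighted disagreement vs. the disagreement expected by chance.
    observed = sum(weight(i, j) * c for (i, j), c in joint.items()) / n
    expected = sum(
        weight(i, j) * marginal_a[i] * marginal_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        return float("nan")  # undefined: both raters used one identical label
    return 1.0 - observed / expected


# Toy, made-up labels on a hypothetical 0-2 scale (not the paper's data).
human = [2, 2, 2, 2, 2, 2, 2, 2, 2, 0]
model = [2] * 10
raw_agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = weighted_kappa(human, model, categories=[0, 1, 2])
print(raw_agreement, kappa)
```

In this toy case the two raters agree on nine of ten items, yet κw is zero because a rater that always outputs the majority label agrees with the other only at chance level. That is the kind of label-imbalance effect the caption refers to.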

Original abstract

Large language models (LLMs) are excellent at maintaining high-level, convincing dialogue, but it remains unclear whether their persuasive success reflects genuine understanding of the discourse. We examine this question through informal debates between humans and LLMs, first by measuring their persuasive skills, and then by relating these to their understanding of _what_ is being talked about: namely, their comprehension of argumentative structures and the pragmatic context on the same debates. We find that LLMs effectively maintain coherent, persuasive debates, and can sway the beliefs of both participants and audiences. We also note that awareness or suspicion of AI involvement encourage people to be more critical of the arguments made. However, we also find that LLMs are unable to show comprehension of deeper dialogical structures, such as argument quality or existence of supporting premises. Our results reveal a disconnect between LLM comprehension and dialogical skills, raising ethical and practical concerns on their deployment on explanation-critical contexts. From an argumentation-theoretical perspective, we experimentally question whether an agent, if it can convincingly maintain a dialogue, is required to show it knows what is talking about.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs hold persuasive debates but fail to identify supporting premises or judge argument quality on the same material, with the measurement of that failure needing tighter validation.

The key point is that LLMs can run persuasive debates with humans but don't reliably pick up on deeper elements, such as whether the premises support the claim or how strong the argument is overall. This suggests their dialogue skills don't come with real comprehension of the content.

What stands out as new is measuring both persuasion success and comprehension failure on the exact same debates. Earlier papers looked at one or the other, but did not link them directly in informal settings. The work also finds that telling people an AI is involved makes them tougher critics, which has immediate implications for how these systems get used.

The experiments involve human participants in debates, followed by tests of the models on argument quality and premise recognition from those exchanges. This joint approach is a strength because it keeps the context consistent. From an argumentation-theory angle, it raises good questions about whether persuasive ability requires actual understanding.

The main weakness is the lack of detail on the comprehension tasks. We don't see the exact prompts given to the LLMs, how argument quality was defined or scored, or any agreement measures between evaluators. That leaves room for the results to be artifacts of prompt choice or task design rather than fundamental limits. If the proxies don't fully capture integrated discourse understanding, including pragmatics and counterpoints, then the disconnect claim rests on shaky ground. The paper could also do more to address alternative explanations such as training data gaps.

This paper suits researchers studying human-AI dialogue, trust in AI explanations, and computational models of argumentation. Someone working on deploying LLMs in education or advisory roles would get practical takeaways. It has enough substance and a clear empirical core to go to a serious referee, even with the current gaps.

I would send it for peer review, focusing the reviewers on validating the measurement of comprehension and adding controls for prompt sensitivity.

Referee Report

1 major / 2 minor

Summary. The manuscript reports an empirical study of human-LLM informal debates. Persuasion is measured via belief change in both direct participants and third-party audiences; comprehension is assessed by testing whether LLMs can identify argument quality and the existence of supporting premises within the same debates. The central claim is that LLMs produce coherent, persuasive dialogue and can shift beliefs, yet fail to demonstrate comprehension of deeper dialogical structures such as argument quality or premise support. The work also reports that awareness of AI involvement increases critical evaluation by humans.

Significance. If the experimental dissociation holds, the paper supplies concrete evidence that persuasive dialogical skill in LLMs can be decoupled from comprehension of argumentative structure. This directly engages argumentation theory by questioning whether an agent must understand what it is talking about in order to maintain convincing dialogue. The inclusion of an AI-awareness manipulation and audience-level belief measures adds practical relevance for deployment in explanation-critical settings.

major comments (1)

[§4.2 (Comprehension Tasks)] The claim that LLMs 'are unable to show comprehension of deeper dialogical structures, such as argument quality or existence of supporting premises' rests on the operationalization of these two proxies. The manuscript provides no rubric for scoring argument quality, no inter-rater reliability statistics, and no control conditions that require integration of full dialogical context, pragmatic implicature, or counter-argument evaluation. Without these, the reported failure could reflect isolated prompt sensitivity rather than absence of comprehension, directly weakening the central dissociation between persuasion and understanding.

minor comments (2)

[Abstract] The abstract states that 'awareness or suspicion of AI involvement encourage people to be more critical' but does not indicate whether this factor was included as a between-subjects manipulation or measured post-hoc; a brief clarification would improve readability.
[Results] Figure captions for the belief-change plots should explicitly state the sample sizes and whether error bars represent standard error or 95% CI.
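
The difference between the two error-bar conventions mentioned in the last comment can be made concrete. The following is a minimal sketch that assumes paired before/after agreement ratings on a 1 to 5 scale and a normal approximation for the interval. The function name `belief_shift_summary` and the sample ratings are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev


def belief_shift_summary(before, after, z=1.96):
    """Summarise paired belief change: n, mean shift, SE, and approximate 95% CI.

    `before` and `after` are per-participant ratings in the same order.
    The interval uses a normal approximation (z = 1.96), which is an
    assumption; for small samples a t-quantile would give a wider interval.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need at least two paired ratings")
    shifts = [a - b for b, a in zip(before, after)]
    n = len(shifts)
    mean_shift = mean(shifts)
    standard_error = stdev(shifts) / math.sqrt(n)
    half_width = z * standard_error
    return {
        "n": n,
        "mean_shift": mean_shift,
        "standard_error": standard_error,
        "ci95": (mean_shift - half_width, mean_shift + half_width),
    }


# Made-up ratings for four hypothetical participants.
summary = belief_shift_summary(before=[2, 3, 4, 3], after=[3, 3, 5, 4])
print(summary)
```

With the same data, an approximate 95% CI bar is about twice as long as a standard-error bar, so a caption that omits the convention and the sample size leaves the plotted uncertainty ambiguous.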

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the methodological details of our comprehension tasks. We address the single major comment below and indicate the revisions we will make.

Referee: The claim that LLMs 'are unable to show comprehension of deeper dialogical structures, such as argument quality or existence of supporting premises' rests on the operationalization of these two proxies. The manuscript provides no rubric for scoring argument quality, no inter-rater reliability statistics, and no control conditions that require integration of full dialogical context, pragmatic implicature, or counter-argument evaluation. Without these, the reported failure could reflect isolated prompt sensitivity rather than absence of comprehension, directly weakening the central dissociation between persuasion and understanding.

Authors: We agree that additional methodological transparency is warranted. In the revised manuscript we will append the complete scoring rubric for argument quality, which operationalizes three dimensions drawn from argumentation theory: relevance to the central claim, sufficiency of supporting premises, and acceptability given the dialogical context. We will also report inter-rater reliability: two independent coders scored a random subset of 50 arguments, yielding Cohen’s κ = 0.81. Regarding control conditions, every comprehension prompt supplied the full debate transcript rather than isolated turns, thereby requiring integration of dialogical context. We did not, however, include dedicated conditions that explicitly probe pragmatic implicature or counter-argument evaluation. We will add a limitations paragraph acknowledging this gap and noting that future experiments could isolate these factors. Finally, we tested prompt robustness by running each comprehension task under three distinct phrasings and two temperature settings; failure rates remained stable (within ±4%), which we will document to mitigate concerns about isolated prompt sensitivity. These additions will be made in §4.2 and the appendix.

revision: partial
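
A prompt-robustness check of the kind described in the response could be sketched as follows. One way to read the ±4% figure is as the largest deviation of any condition's failure rate from the mean across conditions. That reading, the function name `failure_rate_spread`, the condition labels, and the toy outcomes are all assumptions made for illustration, not the authors' procedure or data.

```python
def failure_rate_spread(outcomes):
    """Per-condition failure rates and their largest deviation from the mean.

    `outcomes` maps a (phrasing, temperature) condition to a list of
    booleans, where True marks a failed comprehension item.
    """
    if not outcomes or any(len(results) == 0 for results in outcomes.values()):
        raise ValueError("every condition needs at least one result")
    rates = {cond: sum(results) / len(results) for cond, results in outcomes.items()}
    centre = sum(rates.values()) / len(rates)
    spread = max(abs(rate - centre) for rate in rates.values())
    return rates, centre, spread


# Hypothetical grid: three phrasings x two temperatures, ten items each.
toy_outcomes = {
    ("phrasing_a", 0.0): [True] * 7 + [False] * 3,
    ("phrasing_a", 1.0): [True] * 7 + [False] * 3,
    ("phrasing_b", 0.0): [True] * 8 + [False] * 2,
    ("phrasing_b", 1.0): [True] * 7 + [False] * 3,
    ("phrasing_c", 0.0): [True] * 7 + [False] * 3,
    ("phrasing_c", 1.0): [True] * 6 + [False] * 4,
}
rates, centre, spread = failure_rate_spread(toy_outcomes)
print(f"mean failure rate {centre:.2f}, max deviation {spread:.2f}")
```

Reporting the per-condition rates alongside the spread, rather than the spread alone, would let a reader see whether one phrasing or temperature drives the variation.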

Circularity Check

0 steps flagged

No circularity: empirical human-subject study with independent measurements

The paper reports an experimental setup involving informal debates between humans and LLMs, separate measurement of persuasive outcomes (belief sway in participants/audiences) and comprehension proxies (argument quality recognition, premise identification), plus controls for AI suspicion. No equations, fitted parameters, or first-principles derivations are present. The central claim rests on observed performance differences rather than any reduction of outputs to inputs by construction, self-definition, or self-citation chains. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions from argumentation theory and human-AI interaction research rather than new postulates.

axioms (2)

domain assumption: Human belief changes reported after debates reflect genuine shifts rather than social desirability or demand effects.
Used to interpret persuasion success as evidence of influence.
domain assumption: Comprehension can be validly measured by explicit recognition of argument quality and supporting premises.
Central to the claim of a comprehension-persuasion disconnect.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We find that LLMs effectively maintain coherent, persuasive debates... However, we also find that LLMs are unable to show comprehension of deeper dialogical structures, such as argument quality or existence of supporting premises.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Criteria C-0 to C-5... C-6 Points to the argument (argument strength score)... C-7 Who won the debate?

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1] No Gender Bias in Audience Perceptions of Male and Female Experts in the News: Equally Competent and Persuasive. The International Journal of Press/Politics, 28(1): 116–137. Hada, R.; Gumma, V.; De Wynter, A.; Diddee, H.; Ahmed, M.; Choudhury, M.; Bali, K.; and Sitaram, S. 2024. Are Large Language Model-based Evaluators the Solution to Scaling Up Mult...

[2] MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis. arXiv:2502.18924. Khan, A.; Hughes, J.; Valentine, D.; Ruis, L.; Sachan, K.; Radhakrishnan, A.; Grefenstette, E.; Bowman, S. R.; Rocktäschel, T.; and Perez, E. 2024. Debating with more persuasive LLMs leads to more truthful answers. In Proceedings of the ...

[3] Cham: Springer Nature Switzerland. ISBN 978-3-031-77915-2. Li, H.; Dong, Q.; Chen, J.; Su, H.; Zhou, Y.; Ai, Q.; Ye, Z.; and Liu, Y. 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Align-...

[4] Automatic Prompt Optimization with “Gradient Descent” and Beam Search. In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7957–7968. Singapore: Association for Computational Linguistics. Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; and C...

[5] Toronto, Canada: Association for Computational Linguistics. Sainz, O.; Campos, J.; García-Ferrero, I.; Etxaniz, J.; de Lacalle, O. L.; and Agirre, E. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Linguistics: ...

[6] If it is a thesis without supporting reasons

[7] If there are some reasons, but it is hard to link them

[8] If there are reasons to support the thesis. C-1 Are the reasons consistent (not contradictory/mutually exclusive) with themselves or the thesis?

[9] If all reasons are inconsistent

[10] If some reasons are consistent

[11] If all reasons are consistent. C-2 Are the reasons relevant to the thesis?

[12] If no reason is relevant to the thesis

[13] If there are some reasons that are irrelevant to the thesis

[14] If all reasons are relevant to the thesis. C-3 Are the reasons supporting (provide strength/make the conclusion more likely) the thesis?

[15] If none of the reasons support the thesis

[16] If some reasons support the thesis

[17] If the reasons fully support the thesis. C-4 How convincing is the argument?

[18] If the conclusion is somewhat likely

[19] If the argument makes the conclusion very likely. C-5 Have the counterarguments, if any, been addressed?

[20] If neither counterargument has been addressed

[21] If some counterarguments were addressed

[22] If all counterarguments were addressed/no counterarguments were previously given. Additionally, please give ‘points’ to the participant based on your belief of whether the speaker’s argument is winning or losing. C-6 Is the argument winning? Points to the participant: -2. If it is a particularly bad argument -1. If it is a bad argument

[23] If the argument does not sway their position towards a winning or losing position

[24] If it is a good argument

[25] If it is a very good argument. Before the Debate: Do you agree that (debate topic)? 1 (Fully disagree) - 5 (Fully agree). (Group B only) Note that one or both players is an AI. (Group C only) Note that (Player 1/Player 2/both players) is (are) an AI. After the Debate: Who (in your opinion) won the debate? Player 1, Player 2, Draw. Your position about the topic C...
