ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3
The pith
ControBench merges social interaction graphs with text semantics to benchmark how users argue across ideological lines on Reddit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ControBench supplies three topic-specific datasets that join heterogeneous graphs of 7,370 users, 1,783 posts, and 26,525 interactions with rich textual content; user-comment-user edges preserve reply context while flair-based labels supply ideological identity, yielding low or negative adjusted homophily and exposing distinct performance gaps across model families.
What carries the argument
Heterogeneous graphs linking users and posts via semantically enriched reply edges that retain local argumentative context, paired with self-declared flair labels as a scalable proxy for ideological identity.
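The user-comment-user edges described above, which keep both a reply and the parent comment it answers, might be represented by a record like the following (a minimal sketch; all field names are hypothetical, not the paper's actual schema):

```python
from dataclasses import dataclass

@dataclass
class UserCommentUserEdge:
    """One semantically enriched reply edge: the replying user, the
    replied-to user, and the text of both the reply and its parent
    comment. Field names are illustrative, not the paper's schema."""
    src_user: str    # author of the reply
    dst_user: str    # author of the parent comment
    reply_text: str
    parent_text: str

edge = UserCommentUserEdge(
    src_user="u_alice",
    dst_user="u_bob",
    reply_text="That statistic is outdated; see the 2023 survey.",
    parent_text="Polls show most people support this policy.",
)
```

Keeping the parent text on the edge itself, rather than only a link to a post node, is what lets a model see the local argumentative context of each reply.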
If this is right
- Models must handle ambiguous ideological boundaries rather than assuming clear echo chambers.
- Graph structure and text semantics together produce harder tasks than either alone.
- The low-homophily pattern supports testing moderation tools on cross-cutting rather than segregated debates.
- Distinct model-family patterns indicate that interaction-aware methods are needed beyond pure text classifiers.
Where Pith is reading between the lines
- The same graph-plus-text construction could be applied to other platforms to test whether low homophily is a general feature of online controversy.
- Researchers could add temporal edges to track how arguments evolve within threads and measure if that improves prediction of user stance shifts.
- If the benchmark holds, content-moderation systems might gain accuracy by jointly modeling reply chains and user flair signals rather than treating posts in isolation.
Load-bearing premise
Self-declared Reddit flairs reliably mark users' actual ideological positions instead of serving as ironic, performative, or mistaken signals.
What would settle it
A manual audit of a sample of flair-labeled users that finds more than 30 percent mismatch between declared flair and the ideological slant of their posts would undermine the label proxy and the benchmark's validity.
Original abstract
Understanding how people argue across ideological divides online is important for studying political polarization, misinformation, and content moderation. Existing datasets capture only part of this problem: some preserve text but ignore interaction structure, some model structure without rich semantics, and others represent conversations without stable user-level ideological identity. We introduce ControBench, a benchmark for controversial discourse analysis that combines heterogeneous social interaction graphs with rich textual semantics. Built from Reddit discussions on three topics, Trump, abortion, and religion, ControBench contains 7,370 users, 1,783 posts, and 26,525 interactions. The graph contains user and post nodes connected by semantically enriched edges; in particular, user-comment-user edges encode both a reply and the parent comment that it responds to, preserving local argumentative context. User labels are derived from self-declared Reddit flairs, providing a scalable proxy for ideological identity without manual annotation. The resulting datasets exhibit low or negative adjusted homophily (Trump: -0.77, Abortion: 0.06, Religion: 0.04), reflecting the cross-cutting structure of real-world debate. We evaluate graph neural networks, pretrained language models, and large language models on ControBench and observe distinct performance patterns across topics and model families, especially when ideological boundaries are ambiguous. These results position ControBench as a challenging and realistic benchmark for controversial discourse analysis.
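The adjusted homophily values quoted above follow the standard class-imbalance-corrected definition: edge homophily minus the same-label rate expected under random wiring, rescaled so 0 means no preference and negative values mean cross-cutting (heterophilic) interaction. A minimal sketch on a toy graph (the paper's exact computation is not shown here; the toy labels are illustrative):

```python
from collections import Counter

def adjusted_homophily(edges, labels):
    """Adjusted homophily: edge homophily corrected for class-size
    imbalance. 0 = chance level; negative = heterophilic structure."""
    m = len(edges)  # number of undirected edges
    h_edge = sum(labels[u] == labels[v] for u, v in edges) / m
    # Degree mass per class: D_k summed over both endpoints of each edge
    deg_mass = Counter()
    for u, v in edges:
        deg_mass[labels[u]] += 1
        deg_mass[labels[v]] += 1
    # Expected same-label rate under random wiring: sum_k (D_k / 2m)^2
    p2 = sum((d / (2 * m)) ** 2 for d in deg_mass.values())
    return (h_edge - p2) / (1 - p2)

# Toy 4-cycle where every edge crosses class lines -> maximally negative
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "left", 1: "right", 2: "left", 3: "right"}
print(adjusted_homophily(edges, labels))  # -1.0
```

On such a graph the score is -1.0, the fully cross-cutting extreme; the reported Trump value of -0.77 sits close to that end of the scale.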
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ControBench, a benchmark for controversial discourse analysis constructed from Reddit discussions on three topics (Trump, abortion, religion). It contains 7,370 users, 1,783 posts, and 26,525 interactions in heterogeneous graphs with user and post nodes connected by semantically enriched edges, including user-comment-user edges that preserve local argumentative context. User ideological labels are derived from self-declared flairs as a scalable proxy without manual annotation. The datasets exhibit low or negative adjusted homophily (Trump: -0.77, Abortion: 0.06, Religion: 0.04), and evaluations of GNNs, PLMs, and LLMs reveal distinct performance patterns across topics and model families, especially with ambiguous ideological boundaries.
Significance. If the flair-based labeling is reliable, ControBench would represent a meaningful advance by supplying an interaction-aware benchmark that combines structural graphs with rich textual semantics, addressing gaps in prior resources that typically omit one or the other. The preservation of argumentative context via enriched edges and the reported cross-cutting interaction structure are methodological strengths that could support future work on polarization and content moderation. The distinct model performance patterns also provide useful empirical observations for the field.
Major comments (2)
- [§3] §3 (Dataset Construction): The central claim that ControBench is a 'realistic' benchmark for ideological boundary analysis rests on user labels derived from self-declared Reddit flairs serving as stable proxies for ideology. No independent validation (e.g., consistency with post content, manual annotation subset, or inter-rater checks) is reported, which directly affects the validity of the adjusted homophily values and the cross-cutting structure used to motivate the benchmark.
- [§5] §5 (Experiments): The reported distinct performance patterns across GNNs, PLMs, and LLMs lack details on exact task formulations, evaluation metrics, statistical significance testing, or controls for data biases, undermining the strength of the conclusion that ControBench is 'challenging' in a reproducible way.
Minor comments (2)
- [Abstract] The abstract and main text could include one or two concrete examples of semantically enriched edges to clarify how 'rich textual semantics' are operationalized beyond the high-level description.
- Adding references to prior benchmarks that combine graphs and text (e.g., in computational social science) would help situate the novelty claim more precisely.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validity and reproducibility that we will address in the revision. We respond to each major comment below.
Point-by-point responses
- Referee: [§3] §3 (Dataset Construction): The central claim that ControBench is a 'realistic' benchmark for ideological boundary analysis rests on user labels derived from self-declared Reddit flairs serving as stable proxies for ideology. No independent validation (e.g., consistency with post content, manual annotation subset, or inter-rater checks) is reported, which directly affects the validity of the adjusted homophily values and the cross-cutting structure used to motivate the benchmark.
Authors: We acknowledge the referee's concern. The manuscript explicitly frames flair-based labels as a scalable proxy for ideological identity to enable large-scale construction without manual annotation, consistent with prior social media research. However, we agree that the absence of independent validation limits the strength of claims regarding realism and the reported homophily values. In the revised manuscript, we will add a new subsection in §3 reporting a manual validation study on a random subset of users (e.g., 200 per topic). Two annotators will label ideology from post content alone, we will report agreement with flairs (Cohen's kappa and accuracy), and discuss discrepancies, particularly in ambiguous cases. This will directly support the adjusted homophily calculations and cross-cutting structure. revision: yes
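The agreement statistic the authors propose for the validation study can be computed directly from two label sequences. A minimal sketch of Cohen's kappa, with hypothetical flair and annotator labels (not data from the paper):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two label sequences,
    corrected for the agreement expected by chance from each rater's
    marginal label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    return (observed - expected) / (1 - expected)

# Hypothetical example: declared flairs vs. an annotator's judgment
flairs    = ["left", "left", "right", "right", "left", "right"]
annotator = ["left", "left", "right", "left",  "left", "right"]
print(round(cohens_kappa(flairs, annotator), 3))
```

A kappa well below raw accuracy on such a sample would be exactly the kind of evidence the referee's 30-percent-mismatch criterion is probing for.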
- Referee: [§5] §5 (Experiments): The reported distinct performance patterns across GNNs, PLMs, and LLMs lack details on exact task formulations, evaluation metrics, statistical significance testing, or controls for data biases, undermining the strength of the conclusion that ControBench is 'challenging' in a reproducible way.
Authors: We agree that additional experimental details are necessary for reproducibility and to substantiate the claim that ControBench is challenging. The current version provides high-level descriptions of the evaluations but omits granular specifications. In the revised §5, we will: (1) precisely define the tasks (e.g., user ideology classification as node classification on the heterogeneous graph, with input features and prediction targets); (2) report all evaluation metrics (accuracy, macro-F1, AUC-ROC) with formulas; (3) include statistical significance testing (e.g., McNemar's test for model comparisons and bootstrap confidence intervals); and (4) add a dedicated paragraph on data biases (e.g., topic imbalance, interaction density) with controls applied (e.g., stratified sampling, ablation on edge types). We will also release the full code, data splits, and hyperparameter settings. revision: yes
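The McNemar comparison the authors commit to operates on the discordant pairs between two classifiers. A minimal exact-binomial variant using only the standard library (the paper does not specify its exact test; the per-user predictions below are hypothetical):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test p-value (two-sided binomial) from the
    discordant counts: b = only model A correct, c = only model B
    correct. Under H0 the discordant pairs split 50/50."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # doubling can exceed 1 when b == c

# Hypothetical per-user predictions from two models vs. gold labels
gold    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_a = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
model_b = [1, 1, 0, 0, 0, 1, 0, 0, 0, 1]
b = sum((a == g) and (m != g) for a, m, g in zip(model_a, model_b, gold))
c = sum((a != g) and (m == g) for a, m, g in zip(model_a, model_b, gold))
print(b, c, mcnemar_exact(b, c))  # 3 1 0.625
```

With so few discordant pairs the test is far from significant, which illustrates why the revision also promises bootstrap confidence intervals over the full datasets.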
Circularity Check
No circularity in benchmark construction or evaluations
Full rationale
The paper introduces ControBench via direct data collection from Reddit (users, posts, interactions) with labels assigned from self-declared flairs and standard model evaluations on the resulting graphs. No mathematical derivations, fitted parameters renamed as predictions, self-citations bearing central claims, or ansatzes smuggled via prior work are present. The homophily statistics and performance observations are computed directly from the collected data without reducing to tautological definitions or input fits. The construction is self-contained and externally verifiable through the described collection process.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Self-declared Reddit flairs serve as a reliable proxy for users' ideological identities.