Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation
Pith reviewed 2026-05-10 14:54 UTC · model grok-4.3
The pith
RAG systems are biased toward facts and should instead index opinions explicitly to preserve genuine diversity on subjective queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper formalizes opinion queries as involving genuine heterogeneity that should be preserved, and claims that an Opinion-Aware RAG architecture, built on LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched indexing, achieves substantially more representative retrieval. The evidence is an experiment on e-commerce seller forum data comparing against a traditional baseline, with gains in sentiment diversity, entity match rate, and author demographic coverage.
What carries the argument
Entity-linked opinion graphs built from LLM-extracted opinions to support enriched document indexing and diverse retrieval.
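The paper's code is not reproduced here; as a minimal sketch under assumptions, an entity-linked opinion graph that supports enriched indexing could look like the following. All names (`Opinion`, `OpinionGraph`, the field values) are illustrative, not from the paper; the evidence categories (personal / anecdotal / data) follow the extraction scheme the paper describes.

```python
# Hypothetical sketch of an entity-linked opinion graph; names and
# structure are assumptions, not the paper's published implementation.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Opinion:
    author: str     # who voiced it (for demographic coverage)
    entity: str     # linked entity, e.g. "fees"
    sentiment: str  # "positive" / "negative" / "mixed"
    evidence: str   # "personal" / "anecdotal" / "data"
    doc_id: str     # source document, for opinion-enriched indexing

class OpinionGraph:
    """Entities as nodes; each node collects the opinions linked to it."""
    def __init__(self):
        self.by_entity = defaultdict(list)

    def add(self, op):
        self.by_entity[op.entity].append(op)

    def docs_for(self, entity):
        # Enriched indexing: map an entity back to its source documents.
        return {op.doc_id for op in self.by_entity[entity]}

g = OpinionGraph()
g.add(Opinion("seller_a", "fees", "negative", "personal", "d1"))
g.add(Opinion("seller_b", "fees", "positive", "data", "d2"))
assert g.docs_for("fees") == {"d1", "d2"}
```

A retriever built over `docs_for` can then deliberately sample documents spanning different sentiments and authors for the same entity, rather than whatever embedding similarity alone surfaces.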
If this is right
- Retrieval in subjective domains becomes more representative of actual human heterogeneity rather than converging on dominant views.
- Higher coverage of different authors and entities leads to better inclusion of varied viewpoints in AI outputs.
- The approach provides a concrete method to reduce echo chamber effects and underrepresentation in applications like reviews and discussions.
- It sets the stage for future joint optimization of retrieval and generation stages to match real opinion distributions.
Where Pith is reading between the lines
- The graphs could be extended with time information to track how perspectives change over time in ongoing discussions.
- This retrieval method might apply directly to news or political analysis where balanced viewpoint selection matters.
- Generation models could use the same graphs to produce responses that explicitly surface and balance multiple opinions.
Load-bearing premise
That LLM-based opinion extraction and graph construction accurately capture genuine diversity in perspectives without introducing new biases or needing heavy domain-specific tuning.
What would settle it
If tests on additional subjective datasets show no gains in diversity metrics or reveal systematic biases in the opinions selected by the enriched index.
Original abstract
RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.
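The abstract's entropy distinction can be made concrete with a small illustration (the numbers below are invented, not the paper's): for a factual query, evidence should collapse the answer posterior toward one option (low Shannon entropy), while for an opinion query the population's sentiment distribution is genuinely spread out, and retrieval should keep it that way.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Factual query: evidence should concentrate the posterior on one answer.
factual_posterior = [0.97, 0.02, 0.01]
# Opinion query: genuine heterogeneity, e.g. 50/30/20 pos/neg/mixed.
opinion_population = [0.5, 0.3, 0.2]

print(f"factual H = {entropy(factual_posterior):.2f} bits")   # ~0.22
print(f"opinion H = {entropy(opinion_population):.2f} bits")  # ~1.49

# Factual RAG should drive H down; opinion-aware RAG should keep the
# top-k context's H close to the population's rather than collapsing it.
```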
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that standard RAG systems exhibit a factual bias that treats opinions as noise, limiting their utility for subjective queries involving aleatoric uncertainty. It proposes an Opinion-Aware RAG architecture using LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. On e-commerce seller forum data, the approach is compared to a factual baseline and reports retrieval improvements of +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents, positioning this as a first step toward preserving opinion heterogeneity.
Significance. If the empirical results hold under rigorous controls, the work could meaningfully advance RAG for subjective domains by providing both a theoretical framing (epistemic vs. aleatoric uncertainty) and a concrete architecture. The emphasis on retrieval diversity as a proxy for reduced echo-chamber effects and better representation of minority voices addresses a timely concern in accountable AI.
Major comments (3)
- [Abstract] The central empirical claim of substantial retrieval-diversity gains is presented without any details on baseline implementation, statistical tests, data splits, or validation of opinion-extraction accuracy, rendering the reported +26.8% / +42.7% / +31.6% improvements impossible to assess from the given text.
- [Evaluation] All reported metrics (sentiment diversity, entity match rate, author demographic coverage) are computed exclusively on the top-k retrieved documents; no experiment demonstrates that these retrieval changes produce generated outputs whose opinion distribution matches the full corpus or preserves higher posterior entropy rather than collapsing to majority synthesis.
- [Theoretical Foundation] The distinction between minimizing posterior entropy (factual RAG) and preserving it (opinion-aware RAG) is not operationalized via any generation-level metric such as distributional fidelity or entropy of synthesized opinions, leaving the theoretical motivation disconnected from the experiments.
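The metrics named in these comments are all functions of the retrieved top-k set. The paper's exact definitions are not given in this text, so the following is a hedged sketch of plausible formulations, useful mainly to see why they say nothing about generation:

```python
# Illustrative top-k retrieval-diversity metrics. The paper's exact
# definitions are not reproduced here; treat these as assumptions.

def entity_match_rate(topk, query_entities):
    """Fraction of retrieved docs mentioning at least one query entity."""
    hits = sum(1 for d in topk if query_entities & set(d["entities"]))
    return hits / len(topk)

def sentiment_diversity(topk):
    """Distinct sentiment labels retrieved, over the 3 possible labels."""
    return len({d["sentiment"] for d in topk}) / 3

def author_coverage(topk, query_entities):
    """Distinct authors among entity-matched documents only."""
    matched = [d for d in topk if query_entities & set(d["entities"])]
    return len({d["author"] for d in matched})

topk = [
    {"entities": ["fees"], "sentiment": "negative", "author": "a"},
    {"entities": ["fees"], "sentiment": "positive", "author": "b"},
    {"entities": ["shipping"], "sentiment": "negative", "author": "a"},
]
assert entity_match_rate(topk, {"fees"}) == 2 / 3
assert author_coverage(topk, {"fees"}) == 2
```

Nothing in these quantities constrains what the generator synthesizes from the retrieved context, which is exactly the gap the evaluation comment identifies.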
Minor comments (1)
- [Abstract] The abstract states that 'future work includes joint optimization of retrieval and generation' but provides no discussion of potential selection biases introduced by the LLM opinion extraction step or domain-specific tuning requirements.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our work. We address each major comment below, providing clarifications on the paper's focus and making targeted revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The central empirical claim of substantial retrieval-diversity gains is presented without any details on baseline implementation, statistical tests, data splits, or validation of opinion-extraction accuracy, rendering the reported +26.8% / +42.7% / +31.6% improvements impossible to assess from the given text.
Authors: We agree that the abstract's brevity limits the inclusion of full methodological details. The complete manuscript provides these in Section 3.1 (data collection and splits), Section 3.2 (opinion extraction with validation via accuracy and inter-annotator agreement), Section 4.1 (baseline implementation), and Section 4.3 (statistical tests). To improve standalone readability, we have revised the abstract to briefly note the evaluation setup, data domain, and metric definitions while retaining its concise form. revision: yes
- Referee: [Evaluation] All reported metrics (sentiment diversity, entity match rate, author demographic coverage) are computed exclusively on the top-k retrieved documents; no experiment demonstrates that these retrieval changes produce generated outputs whose opinion distribution matches the full corpus or preserves higher posterior entropy rather than collapsing to majority synthesis.
Authors: The paper's scope is explicitly limited to the retrieval stage as a foundational step for opinion-aware RAG, using diversity metrics as a proxy for better representation of heterogeneous opinions and reduced echo-chamber risk in the retrieved context. We do not perform or claim generation experiments, and the conclusion already identifies joint retrieval-generation optimization for distributional fidelity as future work. We have added a dedicated limitations paragraph in the Discussion section to explicitly acknowledge the gap to generation-level evaluation and to explain how the observed top-k improvements serve as a necessary precursor. revision: partial
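The "distributional fidelity" that the rebuttal defers to future work could, as one hedged possibility (not a method from the paper), be checked by comparing the sentiment distribution of the top-k context against the full corpus, e.g. via KL divergence:

```python
# Hypothetical distributional-fidelity check; the metric choice (KL in
# bits over sentiment labels) is an assumption, not the paper's method.
import math
from collections import Counter

def sentiment_dist(labels, classes=("positive", "negative", "mixed")):
    """Empirical distribution over sentiment classes."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def kl(p, q, eps=1e-9):
    """KL(p || q) in bits; eps guards against empty bins."""
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

corpus = ["positive"] * 5 + ["negative"] * 3 + ["mixed"] * 2
topk_biased = ["positive"] * 5  # collapsed onto the majority view
topk_diverse = ["positive", "positive", "negative", "negative", "mixed"]

p = sentiment_dist(corpus)
# The diverse top-k tracks the corpus distribution far more closely.
assert kl(sentiment_dist(topk_diverse), p) < kl(sentiment_dist(topk_biased), p)
```

The same comparison applied to opinions extracted from generated answers, rather than from retrieved documents, would close the retrieval-to-generation gap the referee raises.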
- Referee: [Theoretical Foundation] The distinction between minimizing posterior entropy (factual RAG) and preserving it (opinion-aware RAG) is not operationalized via any generation-level metric such as distributional fidelity or entropy of synthesized opinions, leaving the theoretical motivation disconnected from the experiments.
Authors: We operationalize entropy preservation at the retrieval level through metrics that quantify increased sentiment diversity, entity coverage, and demographic representation, which directly mitigate the collapse to majority views in the context passed to generation. We have revised Section 2 to more explicitly link the epistemic/aleatoric uncertainty framing to these retrieval proxies and added a paragraph in the limitations discussing the current absence of generation-level metrics (e.g., synthesized opinion entropy) with plans for future work. revision: partial
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's chain consists of a conceptual distinction (epistemic vs. aleatoric uncertainty for factual vs. opinion queries), an architectural proposal (LLM opinion extraction + entity graphs + enriched indexing), and empirical evaluation of retrieval diversity metrics on held-out e-commerce forum data. None of these steps reduce by construction to their inputs: the reported gains (+26.8% sentiment diversity etc.) are measured outcomes on separate test data using newly defined metrics, not quantities fitted from the same data or renamed self-definitions. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing way that would make the central claim tautological. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives.
Invented entities (1)
- Opinion-enriched document indexing (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020
2020
-
[2]
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kanghao Jia, Jinfeng Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Language models as knowledge bases? InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2463–2473, 2019
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2463–2473, 2019
2019
-
[4]
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5418–5426, 2020
2020
-
[5]
Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, and Guoping Hu. Retrieval augmented generation evaluation in the era of large language models: A comprehensive survey.arXiv preprint arXiv:2504.14891, 2024
-
[6]
Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024
Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.arXiv preprint arXiv:2402.19473, 2024
-
[7]
Yucheng Wang, Xiaohan Li, Yongbin Gao, Jiawei Chen, and Zhiyuan Liu. A systematic literature review of rag: Techniques, metrics, and challenges.arXiv preprint arXiv:2501.13958, 2025
-
[8]
Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval- augmented large language models.Proceedings of the 30th ACM SIGKDD Confer- ence on Knowledge Discovery and Data Mining, pages 6491–6501, 2024
2024
-
[9]
A benchmark for social-web indirect prompt injection in rag.arXiv preprint arXiv:2601.10923, 2025
Md Rizwan Parvez Rony, Thiago Ferreira, Mohammed Abuhamad, and Preslav Nakov. A benchmark for social-web indirect prompt injection in rag.arXiv preprint arXiv:2601.10923, 2025
-
[10]
From unstructured communication to intelligent rag: Multi-agent automation for supply chain knowledge bases
Yifan Chen, Wei Zhang, and Xiaoming Liu. From unstructured communication to intelligent rag: Multi-agent automation for supply chain knowledge bases. Proceedings of the ICML 2025 Workshop on Foundation Models in the Wild, 2025
2025
-
[11]
Parsing the neural correlates of moral cognition: Ale meta-analysis on morality, theory of mind, and empathy
Danilo Bzdok, Leonhard Schilbach, Kai Vogeley, Karoline Schneider, Angela R Laird, Robert Langner, and Simon B Eickhoff. Parsing the neural correlates of moral cognition: Ale meta-analysis on morality, theory of mind, and empathy. Brain Structure and Function, 217(4):783–796, 2013
2013
-
[12]
Epistemic vigilance.Mind & Language, 25(4):359–393, 2010
Dan Sperber, Fabrice Clément, Christophe Heintz, Olivier Mascaro, Hugo Mercier, Gloria Origgi, and Deirdre Wilson. Epistemic vigilance.Mind & Language, 25(4):359–393, 2010
2010
-
[13]
Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence
Hung-Yu Chen, Ramakanth Pasunuru, Jason Weston, and Mohit Bansal. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2292–2307, 2022
2022
-
[14]
Retrieve anything to augment large language models
Peitian Zhang, Shitao Gu, Yongkang Cao, Yangyifei Xu, Kun Zhang, Zheng Zhao, Huawei Qin, Zhiyuan Liu, Maosong Sun, and Chenyan Xiong. Retrieve anything to augment large language models.arXiv preprint arXiv:2310.07554, 2023
-
[15]
Natural questions: a benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. In Transactions of the Association for Computational Linguistics, volume 7, pages 453–466, 2019
2019
-
[16]
Ms marco: A human generated machine reading com- prehension dataset
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human generated machine reading com- prehension dataset. InProceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, 2016
2016
-
[17]
Hotpotqa: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018
2018
-
[18]
Mir Tafseer Nayeem and Davood Rafiei. Opiniorag: Towards generating user- centric opinion highlights from large-scale online reviews.arXiv preprint arXiv:2509.00285, 2025
-
[19]
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models.arXiv preprint...
work page internal anchor Pith review arXiv 2023
-
[20]
Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, and Wanlei Zhou. Bias amplification in rag: Poisoning knowledge retrieval to steer llms.arXiv preprint arXiv:2506.11415, 2025
-
[21]
Tianhui Zhang, Yi Zhou, and Danushka Bollegala. Evaluating the effect of retrieval augmentation on social biases.arXiv preprint arXiv:2502.17611, 2025
-
[22]
Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, and Aw- dren de Lima Fontao. Fairness testing in retrieval-augmented generation: How small perturbations reveal bias in small language models.arXiv preprint arXiv:2509.26584, 2025
-
[23]
Linda Zeng, Rithwik Gupta, Divij Motwani, Diji Yang, and Yi Zhang. Worse than zero-shot? a fact-checking dataset for evaluating the robustness of rag against misleading retrievals.arXiv preprint arXiv:2502.16101, 2025
-
[24]
Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Gold- shtein, Eran Ofek, Idan Szpektor, and Avi Caciularu. Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms.arXiv preprint arXiv:2506.08500, 2025
-
[25]
Eunseong Choi, June Park, Hyeri Lee, and Jongwuk Lee. Conflict-aware soft prompting for retrieval-augmented generation.arXiv preprint arXiv:2508.15253, 2025
-
[26]
Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, and Jinsong Su. Faithfulrag: Fact-level conflict modeling for context-faithful retrieval-augmented generation.arXiv preprint arXiv:2506.08938, 2025
-
[27]
Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri, and Charles LA Clarke. Ad- versarial attacks against neural ranking models via in-context learning.arXiv preprint arXiv:2508.15283, 2025
-
[28]
Retrieval augmented fact verification by synthesizing contrastive arguments
Zhenrui Yue, Huimin Zeng, Lanyu Shang, Yifan Liu, Yang Zhang, and Dong Wang. Retrieval augmented fact verification by synthesizing contrastive arguments. arXiv preprint arXiv:2406.09815, 2024
-
[29]
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, and Goran Nenadic
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, and Goran Ne- nadic. Large language models in argument mining: A survey.arXiv preprint arXiv:2506.16383, 2025
-
[30]
Jungyeon Lee, Kangmin Lee, and Taeuk Kim. Magic: A multi-hop and graph-based benchmark for inter-context conflicts in retrieval-augmented generation.arXiv preprint arXiv:2507.21544, 2025
-
[31]
Vignesh Gokul, Srikanth Tenneti, and Alwarappan Nakkiran. Contradiction detection in rag systems: Evaluating llms as context validators for improved information consistency.arXiv preprint arXiv:2504.00180, 2025
-
[32]
Stance detection on social media: State of the art and trends.Information Processing & Management, 58(4):102597, 2021
Abeer AlDayel and Walid Magdy. Stance detection on social media: State of the art and trends.Information Processing & Management, 58(4):102597, 2021
2021
-
[33]
Rabab Alkhalifa and Arkaitz Zubiaga. Capturing stance dynamics in social media: Open challenges and research directions.arXiv preprint arXiv:2109.00475, 2021
-
[34]
Neural natural language processing for long texts: A survey on clas- sification and summarization.Engineering Applications of Artificial Intelligence, 133:108231, 2024
Dimitrios Tsirmpas, Ioannis Gkionis, Georgios Th Papadopoulos, and Ioannis Mademlis. Neural natural language processing for long texts: A survey on clas- sification and summarization.Engineering Applications of Artificial Intelligence, 133:108231, 2024
2024
-
[35]
Marek Šuppa, Daniel Skala, Daniela Jaššová, Samuel Sučík, Andrej Švec, and Peter Hraška. Bryndza at climateactivism 2024: Stance, target and hate event detection via retrieval-augmented gpt-4 and llama.arXiv preprint arXiv:2402.06549, 2024
-
[36]
Lata Pangtey, Omkar Kabde, Shahid Shafi Dar, and Nagendra Kumar. Two stage context learning with large language models for multimodal stance detection on climate change.arXiv preprint arXiv:2509.08024, 2025
-
[37]
Bohan Zhang, Daoan Ding, and Liyao Jiao. A survey on stance detection for mis- and disinformation identification.arXiv preprint arXiv:2408.16906, 2024
-
[38]
CognitiveSky: Scalable Sentiment and Narrative Analysis for Decentralized Social Media
Gaurab Chhetri, Anandi Dutta, and Subasish Das. Cognitivesky: Scalable sen- timent and narrative analysis for decentralized social media.arXiv preprint arXiv:2509.11444, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm
Gaurab Chhetri, Darrell Anderson, Boniphace Kutela, and Subasish Das. A transformer-based cross-platform analysis of public discourse on the 15-minute city paradigm.arXiv preprint arXiv:2509.11443, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Semeval-2020 task 8: Memotion analysis – the visuo-lingual metaphor! 2020
Chhavi Sharma, Deepesh Bhageria, William Scott, Srinivas PYKL, Amitava Das, Tanmoy Chakraborty, Viswanath Pulabaigari, and Bjorn Gamback. Semeval-2020 task 8: Memotion analysis – the visuo-lingual metaphor! 2020
2020
-
[41]
Youtube av 50k: An annotated corpus for comments in autonomous vehicles
Tao Li, Lei Lin, Minsoo Choi, Kaiming Fu, Siyuan Gong, and Jian Wang. Youtube av 50k: An annotated corpus for comments in autonomous vehicles. InProceedings of the Thirteenth International Joint Symposium on Artificial Intelligence and Natural Language Processing, 2018. Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri
2018
-
[42]
Sarah Weißmann, Aaron Philipp, Roland Verwiebe, Chiara Osorio Krauter, Nina- Sophie Fritsch, and Claudia Buder. Clicks, comments, consequences: Are content creators’ socio-structural and platform characteristics shaping the exposure to negative sentiment, offensive language, and hate speech on youtube?arXiv preprint arXiv:2504.07676, 2025
-
[43]
Topic modeling and sentiment analysis on japanese online media’s coverage of nuclear energy
Yifan Sun, Hirofumi Tsuruta, Masaya Kumagai, and Ken Kurosaki. Topic modeling and sentiment analysis on japanese online media’s coverage of nuclear energy. arXiv preprint arXiv:2411.18383, 2024
-
[44]
Ashwin Ram, Yigit Ege Bayiz, Arash Amini, Mustafa Munir, and Radu Marculescu. Credirag: Network-augmented credibility-based retrieval for misinformation detection in reddit.arXiv preprint arXiv:2410.12061, 2024
-
[45]
Semantic parsing on freebase from question-answer pairs
Jonathan Berant, Andrew Crivat, and Percy Liang. Semantic parsing on freebase from question-answer pairs. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, 2013
2013
-
[46]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1601–1611, 2017
2017
-
[47]
Squad: 100,000+ questions for machine reading comprehension.Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine reading comprehension.Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016
2016
-
[48]
Popqa: A dataset for population-based question answering
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. Popqa: A dataset for population-based question answering. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14914–14924, 2023
2023
-
[49]
Liu, and Matt Gardner
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps.Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3229, 2018
2018
-
[50]
Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022
2022
-
[51]
The narrativeqa reading compre- hension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018
Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Her- mann, Gabor Melis, and Edward Grefenstette. The narrativeqa reading compre- hension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018
2018
-
[52]
Eli5: Long form question answering.Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering.Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019
2019
-
[53]
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. Qmsum: A new benchmark for query-based multi-domain meeting sum- marization.Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 5905–5921, 2021
2021
-
[54]
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. Asqa: A dataset for ambiguous question answering with long-form answers.Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3866–3885, 2022
2022
-
[55]
Covid- qa: A question answering dataset for covid-19.ACL 2020 Workshop on Natural Language Processing for COVID-19, 2020
Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. Covid- qa: A question answering dataset for covid-19.ACL 2020 Workshop on Natural Language Processing for COVID-19, 2020
2020
-
[56]
Smith, and Matt Gardner
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers.Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4599–4610, 2021
2021
-
[57]
Cmb: A comprehensive medical benchmark in chinese
Xidong Wang, Guiming H. Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, and Haizhou Li. Cmb: A comprehensive medical benchmark in chinese.arXiv preprint arXiv:2308.08833, 2023
-
[58]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[59]
Common- senseqa: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Common- senseqa: A question answering challenge targeting commonsense knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4149–4158, 2019
2019
-
[60]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations, 2021
2021
-
[61]
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes!Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, pages 5336–5358, 2022
2022
-
[62]
Kilt: a benchmark for knowledge intensive language tasks
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Stas Borzunov, Edouard Grave, et al. Kilt: a benchmark for knowledge intensive language tasks. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2523–2544, 2021
2021
-
[63]
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gupta. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models.Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 1, 2021
2021
-
[64]
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification.Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 1:809–819, 2018
2018
-
[65]
Explainable automated fact-checking for public health claims.Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7740–7754, 2020
Neema Kotonya and Francesca Toni. Explainable automated fact-checking for public health claims.Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7740–7754, 2020
2020
-
[66]
Truthfulqa: Measuring how models mimic human falsehoods.Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252, 2022
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods.Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252, 2022
2022
-
[67]
Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Ka- malloo, David Alfonso-Hermelo, Xiaoguang Li, Qian Liu, Boxing Chen, Mehdi Rezagholizadeh, et al. Nomiracl: Knowing when you don’t know for robust mul- tilingual retrieval-augmented generation.arXiv preprint arXiv:2312.11361, 2023
-
[68] Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Jiang, Hanwen Cai, Michael Deng, et al. Crag – comprehensive rag benchmark. In Advances in Neural Information Processing Systems, 2024.
[69] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, 2023.
[70] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, et al. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.
[71] Zhenrui Yue, Patrick Schramowski, Antonios Anastasopoulos, and Kristian Kersting. Attrscore: Reference-free evaluation of generated text as attribution. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 5956–5969, 2023.
[72] Robert Friel, Tho Tran, Xiaoyu Zhao, Yifei Cheng, Zheyuan Jiang, and Anh Totti Vu. Ragbench: Explainable benchmark for retrieval-augmented generation systems. arXiv preprint arXiv:2407.11005, 2024.

Opinion-Aware RAG

A Extended Benchmark Analysis

Table 6 provides a comprehensive analysis of major evaluation datasets used in RAG research from 2019-2025, de...
Example query: “Give me a detailed overview of what sellers think about community management & moderators specifically on seller forums?”

...over general (Tier 1) entities
• Per-entity extraction: Each entity receives its own complete opinion structure; a single post may yield multiple entities with different sentiments
• Evidence typing: Classify as personal (first-hand experience), anecdotal (heard from others), or data (cites specific statistics)
• Emotional markers: Common markers incl...
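The per-entity guidelines above imply a small record structure: one opinion per linked entity, tagged with sentiment, evidence type, and preserved emotional markers. A minimal sketch of that structure follows; all class and field names are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    MIXED = "mixed"
    NEUTRAL = "neutral"


class EvidenceType(Enum):
    PERSONAL = "personal"    # first-hand experience
    ANECDOTAL = "anecdotal"  # heard from others
    DATA = "data"            # cites specific statistics


@dataclass
class Opinion:
    entity: str                      # linked entity, e.g. "moderators"
    sentiment: Sentiment
    evidence: EvidenceType
    emotional_markers: List[str] = field(default_factory=list)  # verbatim seller phrasing


@dataclass
class PostOpinions:
    post_id: str
    author_id: str
    opinions: List[Opinion]          # one record per entity; a single post may yield several


def entities_with_sentiment(post: PostOpinions) -> Dict[str, str]:
    """Map each entity mentioned in a post to its sentiment label."""
    return {o.entity: o.sentiment.value for o in post.opinions}
```

Keeping one record per entity (rather than one sentiment per post) is what lets a single post contribute, say, a positive opinion about moderators and a negative one about forum notifications at the same time.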
Opening Framing: The Raw KB opens with neutral organization (“there are several perspectives”), treating the response as a categorization exercise. The Enriched KB immediately conveys the sentiment distribution (“mixed but generally positive”), answering the actual question about what sellers think.
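The framing difference described above amounts to summarizing the sentiment distribution over retrieved opinions before listing them. A hypothetical aggregation might look like the sketch below; the function name and majority thresholds are assumptions for illustration, not values from the paper.

```python
from collections import Counter
from typing import List


def frame_sentiment(labels: List[str]) -> str:
    """Turn per-opinion sentiment labels into an opening framing phrase.

    Thresholds are illustrative: an overwhelming majority yields a
    'generally X' framing, a simple majority yields 'mixed but generally X',
    and anything else is reported as plainly mixed.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    top_label, top_count = counts.most_common(1)[0]
    share = top_count / total
    if share >= 0.8:
        return f"generally {top_label}"
    if share >= 0.5:
        return f"mixed but generally {top_label}"
    return "mixed"


# e.g. 5 positive and 3 negative opinions about moderators
print(frame_sentiment(["positive"] * 5 + ["negative"] * 3))  # mixed but generally positive
```

A generator conditioned on this one-line summary can open with the distribution, as the Enriched KB response does, instead of a flat list of perspectives.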
Success Attribution: The Raw KB never states whether moderators actually help; it even admits “search results don’t show clear feedback on whether moderators consistently respond.” The Enriched KB explicitly captures positive outcomes: “moderators are the only ones who have successfully helped.”
Emotional Language Preservation: The Raw KB sanitizes seller language to “automated response loops.” The Enriched KB preserves the seller voice (“horrible loops”), providing a more visceral understanding of seller frustration.

[Side-by-side comparison: Raw Discussions KB Response vs. Opinion-Enriched KB Response]
Constructive Feedback: The Raw KB entirely omits constructive suggestions. The Enriched KB includes the forum functionality improvement request (individual user accounts, notification issues), a whole category of opinion the Raw KB missed.
Behavioral Insight: The Raw KB reports actions without explaining motivations. The Enriched KB explains why sellers behave in certain ways: one seller “rarely ever posts” due to notification issues; sellers turn to moderators “as a last resort after exhausting other options.”
Narrative Coherence: The Raw KB reads as a categorized list organized by topic buckets. The Enriched KB flows as a narrative with contrast (“mixed but generally positive... However...”), reading like understanding rather than mere reporting.