From Binary Groundedness to Support Relations: Towards a Reader-Centred Taxonomy for Comprehension of AI Output

Advait Sarkar; Christian Poelitz; Viktor Kewenig

arxiv: 2604.08082 · v1 · submitted 2026-04-09 · 💻 cs.HC

Pith reviewed 2026-05-10 17:31 UTC · model grok-4.3

classification 💻 cs.HC

keywords groundedness evaluation · hallucination · support relations · reader-centred taxonomy · generative AI · provenance interfaces · retrieval augmented generation · AI comprehension

The pith

Binary groundedness evaluations obscure the syntactic and interpretive moves AI models make when reformulating source evidence into answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current groundedness and hallucination evaluations treat the link between an AI answer and its sources as a binary choice: supported or unsupported. This framing conceals the specific ways models reword or reason from evidence, such as direct quotation, paraphrase, and inductive or deductive inference. The paper proposes developing a taxonomy of support relations, drawn from linguistics and philosophy of language, to capture these distinctions. If successful, this would let benchmarks assess grounding at a finer granularity and let interfaces explain the nature of the support for each statement. Readers may value this because current evaluations and interfaces report whether a statement is supported, not how.

Core claim

We propose the development of a reader-centred taxonomy of grounding as a set of support relations between generated statements and source documents. We explain how this might be synthesised from prior research in linguistics and philosophy of language, and evaluated through a benchmark and human annotation protocol. Such a framework would enable interfaces that communicate not just whether a claim is grounded, but how.

What carries the argument

The reader-centred taxonomy of support relations, a set of categories that distinguishes syntactic moves such as direct quotation versus paraphrase and interpretive moves such as induction versus deduction in how generated statements relate to source documents.
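
One way to picture such a taxonomy, offered purely as a hypothetical sketch, is to record the syntactic move and the interpretive move as two separate labels on each generated statement. The paper does not define concrete categories, so the enum members below are only the four examples the abstract names. The names `SyntacticMove`, `InterpretiveMove`, `SupportLabel`, and `to_binary` are invented for this illustration, as are the example sentences.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SyntacticMove(Enum):
    """How the wording relates to the source (examples named in the abstract)."""
    QUOTATION = "direct quotation"
    PARAPHRASE = "paraphrase"


class InterpretiveMove(Enum):
    """How the reasoning relates to the source (examples named in the abstract)."""
    INDUCTION = "induction"
    DEDUCTION = "deduction"


@dataclass(frozen=True)
class SupportLabel:
    """Hypothetical record linking one generated statement to a source span."""
    statement: str
    source_span: Optional[str]  # None means no supporting span was found
    syntactic: Optional[SyntacticMove] = None
    interpretive: Optional[InterpretiveMove] = None

    def to_binary(self) -> bool:
        """Collapse to the binary view: any located source span counts as supported."""
        return self.source_span is not None


# Invented example: the source states a general fact, the answer applies it to one case.
label = SupportLabel(
    statement="Sample 3 from batch A passed inspection.",
    source_span="Every sample in batch A passed inspection.",
    syntactic=SyntacticMove.PARAPHRASE,
    interpretive=InterpretiveMove.DEDUCTION,
)
unsupported = SupportLabel(statement="Batch B failed inspection.", source_span=None)

print(label.to_binary(), label.syntactic.value, label.interpretive.value)
print(unsupported.to_binary())
```

Calling `to_binary` shows what the binary framing discards: the two move labels exist only in the richer record.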

If this is right

Groundedness and hallucination benchmarks could measure specific types of support rather than binary outcomes.
User interfaces for generative AI could display the exact support relation for each statement instead of a single yes/no indicator.
Evaluation protocols could incorporate human annotation to label support relations in generated outputs.
The taxonomy would be built by drawing on existing concepts from linguistics and the philosophy of language.
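
To make the first consequence concrete, here is a minimal sketch of how a benchmark report might change, assuming each generated statement already carries a support-relation label. The label strings, the convention that `None` means unsupported, and the function names are assumptions made for illustration. The annotations are invented toy data, not results from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def binary_supported_rate(labels: Sequence[Optional[str]]) -> float:
    """Fraction of statements with any support relation (the binary view)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(label is not None for label in labels) / len(labels)


def relation_breakdown(labels: Sequence[Optional[str]]) -> Counter:
    """Count statements per support relation, keeping unsupported ones visible."""
    return Counter("unsupported" if label is None else label for label in labels)


# Invented toy annotations for five generated statements.
annotations = ["quotation", "paraphrase", "paraphrase", "deduction", None]

print(binary_supported_rate(annotations))  # 0.8
print(relation_breakdown(annotations))
```

The single rate hides that only one of the four supported statements is a verbatim quotation; the breakdown keeps that distinction.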

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The taxonomy might be used to audit common reformulation patterns across different models or retrieval methods.
Automatic classifiers trained on annotated support relations could be added to provenance tools.
Adoption could influence how retrieval-augmented generation systems are designed to preserve or expose source connections.

Load-bearing premise

That a reader-centred taxonomy of support relations can be synthesised from prior research in linguistics and philosophy of language and that implementing it would produce measurable improvements in benchmarking and user comprehension of AI output.

What would settle it

A user study in which participants shown support-relation labels perform no better than those shown binary supported/unsupported labels at tasks measuring comprehension and verification of AI-generated answers.
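
The comparison described here could be analysed in several ways. One option that makes few distributional assumptions is a one-sided permutation test on comprehension scores. The sketch below is not the authors' protocol: the function name, the choice of test, and the scores are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def one_sided_permutation_p(
    treatment: Sequence[float],
    control: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate P(mean difference >= observed) if group labels were exchangeable."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Invented scores: participants shown support-relation labels vs binary labels.
relation_scores = [9, 9, 9, 9, 9]
binary_scores = [1, 1, 1, 1, 1]

p = one_sided_permutation_p(relation_scores, binary_scores)
print(f"one-sided p-value: {p:.4f}")
```

A large p-value on real data would be consistent with the "no better" outcome described above. A small one would count against it.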

Figures

Figures reproduced from arXiv: 2604.08082 by Advait Sarkar, Christian Poelitz, Viktor Kewenig.

**Figure 1.** Left: standard citation-enabled responses from a language model. Right: a hypothetical interface that distinguishes …

Original abstract

Generative AI tools often answer questions using source documents, e.g., through retrieval augmented generation. Current groundedness and hallucination evaluations largely frame the relationship between an answer and its sources as binary (the answer is either supported or unsupported). However, this obscures both the syntactic moves (e.g., direct quotation vs. paraphrase) and the interpretive moves (e.g., induction vs. deduction) performed when models reformulate evidence into an answer. This limits both benchmarking and user-facing provenance interfaces. We propose the development of a reader-centred taxonomy of grounding as a set of support relations between generated statements and source documents. We explain how this might be synthesised from prior research in linguistics and philosophy of language, and evaluated through a benchmark and human annotation protocol. Such a framework would enable interfaces that communicate not just whether a claim is grounded, but how.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper correctly flags the limits of binary groundedness checks in RAG but delivers only a high-level proposal for a new taxonomy without any concrete categories, examples, or tests.

The core observation is that current groundedness and hallucination metrics treat support as yes-or-no, which erases how models actually reuse source text through quotation, paraphrase, inference, or synthesis. The authors want to replace that with a reader-centred set of support relations drawn from linguistics and philosophy of language, so that benchmarks and provenance interfaces can describe the relationship more precisely.

Referee Report

1 major / 0 minor

Summary. The paper identifies a limitation in current groundedness and hallucination evaluations for generative AI systems (especially RAG), which treat the relationship between generated answers and source documents as binary (supported or unsupported). It argues this binary view obscures syntactic reformulations (e.g., quotation vs. paraphrase) and interpretive operations (e.g., induction vs. deduction). The authors propose synthesizing a reader-centred taxonomy of support relations from linguistics and philosophy of language, to be tested via a new benchmark and human annotation protocol, ultimately enabling provenance interfaces that communicate how claims are grounded rather than merely whether they are.

Significance. The identification of the binary limitation is timely and well-motivated given the prevalence of retrieval-augmented generation. If a concrete taxonomy can be developed and validated, it would offer a more granular framework for both automated benchmarking and user-facing explanations, potentially improving trust and comprehension in AI outputs. The manuscript earns credit for explicitly linking the proposal to established external literatures and for outlining an evaluation path (benchmark + annotation protocol) that could render the idea falsifiable.
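
Human annotation schemes are commonly checked for inter-annotator agreement before their labels feed a benchmark. As a hedged illustration, the sketch below computes Cohen's kappa for two annotators. The paper does not specify an agreement measure, so the choice of kappa, the function name, and the toy labels are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned labels independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy labels for four generated statements.
annotator_1 = ["quotation", "paraphrase", "deduction", "paraphrase"]
annotator_2 = ["quotation", "paraphrase", "deduction", "paraphrase"]

print(cohens_kappa(annotator_1, annotator_2))  # 1.0
```

Kappa would help show whether annotators can tell the proposed relations apart reliably, which bears on whether the taxonomy is usable in practice.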

major comments (1)

[Abstract / proposal] Abstract and proposal section: the manuscript correctly diagnoses the binary framing but provides no preliminary taxonomy, no worked examples of support relations (e.g., how a paraphrased inductive inference would be labeled), and no pilot annotation data. This absence is load-bearing because the central claim is that such a taxonomy can be synthesised and will yield measurable improvements; without even a sketch, the feasibility and novelty of the synthesis cannot be assessed from the text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the paper's motivation and for the detailed feedback on the proposal section. We address the major comment below.

Referee: [Abstract / proposal] Abstract and proposal section: the manuscript correctly diagnoses the binary framing but provides no preliminary taxonomy, no worked examples of support relations (e.g., how a paraphrased inductive inference would be labeled), and no pilot annotation data. This absence is load-bearing because the central claim is that such a taxonomy can be synthesised and will yield measurable improvements; without even a sketch, the feasibility and novelty of the synthesis cannot be assessed from the text.

Authors: We agree that the absence of a preliminary sketch limits the ability to assess the proposal in detail. The manuscript is intentionally positioned as a high-level call for the development of a reader-centred taxonomy, synthesised from existing work in linguistics and philosophy of language, rather than a completed taxonomy. Consequently, no concrete taxonomy, worked examples, or pilot data appear in the current version. In revision we will add a new subsection that provides an initial sketch of support relations (including syntactic distinctions such as quotation versus paraphrase and interpretive distinctions such as induction versus deduction), together with at least two worked examples of how a generated statement would be labelled relative to a source document. We will also include a short outline of the intended human annotation protocol. These additions will make the synthesis more concrete without changing the paper's core argument or scope.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a conceptual proposal identifying the binary framing of groundedness/hallucination as a limitation and sketching a reader-centred taxonomy of support relations to be synthesised from linguistics and philosophy of language, with evaluation via a future benchmark. No equations, fitted parameters, derivations, or self-referential reductions appear; the central claim does not reduce to its own inputs by construction and relies on external fields without load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The proposal rests on two domain assumptions about the inadequacy of binary labels and the feasibility of synthesis from linguistics, plus one invented entity (the taxonomy itself) that has no independent evidence yet.

axioms (2)

domain assumption: Binary groundedness evaluations obscure syntactic and interpretive moves in AI text generation from sources.
Core premise stated in the first paragraph of the abstract.
domain assumption: A reader-centred taxonomy of support relations can be synthesised from prior research in linguistics and philosophy of language.
Stated as the intended synthesis method in the abstract.

invented entities (1)

Reader-centred taxonomy of support relations · no independent evidence
purpose: To replace binary groundedness judgments with nuanced categories of how generated statements relate to sources.
Introduced as the central proposed artifact but not yet constructed or tested.

pith-pipeline@v0.9.0 · 5453 in / 1350 out tokens · 56133 ms · 2026-05-10T17:31:36.870361+00:00 · methodology


[1] [1]

Arize AI. [n. d.]. LibreEval: The Open-Source Benchmark for RAG Hallucination Detection. https://arize.com/llm-hallucination-dataset/. Accessed 4 February 2026

work page 2026

[2] [2]

Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. 2025. Hallulens: Llm hallucination benchmark.arXiv preprint arXiv:2504.17550(2025)

work page arXiv 2025

[3] [3]

Varich Boonsanong, Vidhisha Balachandran, Xiaochuang Han, Shangbin Feng, Lucy Lu Wang, and Yulia Tsvetkov. 2025. FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human...

work page doi:10.18653/v1/2025.naacl-demo.35 2025

[4] [4]

Simon Buckingham Shum, Enrico Motta, and John Domingue. 2000. ScholOnto: an ontology-based digital library server for research documents and discourse. International Journal on Digital Libraries3, 3 (2000), 237–248. doi:10.1007/ s007990000034

work page 2000

[5] [5]

Buckingham Shum, Victoria Uren, Gangmin Li, Bertrand Sereno, and Clara Mancini

Simon J. Buckingham Shum, Victoria Uren, Gangmin Li, Bertrand Sereno, and Clara Mancini. 2007. Modelling naturalistic argumentation in research literatures: representation and interaction design issues.International Journal of Intelligent Systems22, 1 (2007), 17–47. doi:10.1002/int.20188

work page doi:10.1002/int.20188 2007

[6] [6]

Kedi Chen, Qin Chen, Jie Zhou, He Yishen, and Liang He. 2024. Diahalu: A dialogue-level hallucination evaluation benchmark for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 9057–9079

work page 2024

[7] [7]

Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Dan Zhang, and Huajun Chen. 2024. FactCHD: benchmarking fact-conflicting hallucination detection. InProceed- ings of the Thirty-Third International Joint Conference on Artificial Intelligence (Jeju, Korea)(IJCAI ’24). Article 687, 9 pages. doi:10.24963...

work page doi:10.24963/ijcai.2024/687 2024

[8] [8]

García, and Guillermo R

Andrea Cohen, Sebastian Gottifredi, Alejandro J. García, and Guillermo R. Simari

work page

[9] [9]

doi:10.1017/ S0269888913000325

A survey of different approaches to support in argumentation sys- tems.The Knowledge Engineering Review29, 5 (2014), 513–550. doi:10.1017/ S0269888913000325

work page 2014

[10] [10]

Zhang, and Daniel S

Raymond Fok, Joseph Chee Chang, Tal August, Amy X. Zhang, and Daniel S. Weld. 2024. Qlarify: Recursively Expandable Abstracts for Dynamic Information Retrieval over Scientific Papers. InProceedings of the 37th Annual ACM Sympo- sium on User Interface Software and Technology(Pittsburgh, PA, USA)(UIST ’24). Association for Computing Machinery, New York, NY,...

work page doi:10.1145/3654777.3676397 2024

[11] [11]

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with Citations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 6465–6488. doi:10.18653/v1/2023.emnlp-main.398

work page doi:10.18653/v1/2023.emnlp-main.398 2023

[12] [12]

Gordon, Carina Negreanu, José Cambronero, Rasika Chakravarthy, Ian Drosos, Hao Fang, Bhaskar Mitra, Hannah Richardson, Advait Sarkar, Stephanie Simmons, Jack Williams, and Ben Zorn

Andrew D. Gordon, Carina Negreanu, José Cambronero, Rasika Chakravarthy, Ian Drosos, Hao Fang, Bhaskar Mitra, Hannah Richardson, Advait Sarkar, Stephanie Simmons, Jack Williams, and Ben Zorn. 2024. Co-audit: tools to help humans double-check AI-generated content.Proceedings of the 14th annual workshop on the intersection of HCI and PL (PLATEAU 2024)(5 202...

work page doi:10.1184/r1/25587552 2024

[13] [13]

1991.Studies in the Way of Words

Paul Grice. 1991.Studies in the Way of Words. Harvard University Press

work page 1991

[14] [14]

Lucas Torroba Hennigen, Shannon Shen, Aniruddha Nrusimha, Bernhard Gapp, David Sontag, and Yoon Kim. 2024. Towards Verifiable Text Generation with Symbolic References. arXiv:2311.09188 [cs.CL] https://arxiv.org/abs/2311.09188

work page arXiv 2024

[15] [15]

Hita Kambhamettu, Jamie Flores, and Andrew Head. 2025. Traceable Texts and Their Effects: A Study of Summary-Source Links in AI-Generated Summaries. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’25). Association for Computing Machinery, New York, NY, USA, Article 538, 7 pages. doi:10.1145/370...

work page doi:10.1145/3706599.3719830 2025

[16] [16]

Hita Kambhamettu, Alyssa Hwang, Philippe Laban, and Andrew Head. 2025. At- tribution Gradients: Incrementally Unfolding Citations for Critical Examination of Attributed AI Answers. arXiv:2510.00361 [cs.HC] https://arxiv.org/abs/2510. 00361

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] Majeed Kazemitabaar, Jack Williams, Ian Drosos, Tovi Grossman, Austin Zachary Henley, Carina Negreanu, and Advait Sarkar. 2024. Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, USA) (UIST ’24). Associ... doi:10.1145/3654777.3676345

[18] Ioannis Kazlaris, Efstathios Antoniou, Konstantinos Diamantaras, and Charalampos Bratsas. 2025. From Illusion to Insight: A Taxonomic Survey of Hallucination Mitigation Techniques in LLMs. AI 6, 10 (2025). doi:10.3390/ai6100260

[19] Charles W Kneupper. 1978. Teaching argument: An introduction to the Toulmin model. College Composition & Communication 29, 3 (1978), 237–241.

[20] Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, and Jeffrey P. Bigham. 2025. GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence. arXiv:2402.12566 [cs.CL] https://arxiv.org/abs/2402.12566

[21] Philippe Laban, Jesse Vig, Marti Hearst, Caiming Xiong, and Chien-Sheng Wu. 2024. Beyond the Chat: Executable and Verifiable Text-Editing with LLMs. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, USA) (UIST ’24). Association for Computing Machinery, New York, NY, USA, Article 20, 23 pages. doi:10.1145/3654777.3676419

[23] Hao-Ping (Hank) Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. 2025. The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CH...

[24] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 6...

[25] Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. 2022. A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavic...

[26] Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim,... 2024. doi:10.1145/3659096

[27] Clara Mancini and Simon J. Buckingham Shum. 2006. Modelling discourse in contested domains: A semiotic and cognitive framework. International Journal of Human-Computer Studies 64, 11 (2006), 1154–1171. doi:10.1016/j.ijhcs.2006.07.002

[28] Dasha Metropolitansky and Jonathan Larson. 2025. Towards Effective Extraction and Evaluation of Factual Claims. arXiv preprint arXiv:2502.10855 (2025).

[29] Dasha Metropolitansky and Jonathan Larson. 2025. VeriTrail: Closed-Domain Hallucination Detection with Traceability. arXiv preprint arXiv:2505.21786 (2025).

[30] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 12076–12100.

[31] Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, and Advait Sarkar. 2025. Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions. Proceedings of the AAAI Conference on Artificial Intelligence 39, 18 (Apr. 2025), 19589–19597. doi:10.1609/aaai.v39i18.34157

[32] Josh M Nicholson, Milo Mordaunt, Patrice Lopez, Ashish Uppala, Domenic Rosati, Neves P Rodrigues, Peter Grabitz, and Sean C Rife. 2021. scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative science studies 2, 3 (2021), 882–898.

[33] Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. 2024. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10862–10878.

[34] Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4812–4829.

[35] Napol Rachatasumrit, Jonathan Bragg, Amy X. Zhang, and Daniel S Weld. 2022. CiteRead: Integrating Localized Citation Contexts into Scientific Paper Reading. In Proceedings of the 27th International Conference on Intelligent User Interfaces (Helsinki, Finland) (IUI ’22). Association for Computing Machinery, New York, NY, USA, 707–719. doi:10.1145/3490099.3511162

[36] Advait Sarkar. 2023. Exploring Perspectives on the Impact of Artificial Intelligence on the Creativity of Knowledge Work: Beyond Mechanised Plagiarism and Stochastic Parrots. In Proceedings of the 2nd Annual Meeting of the Symposium on Human-Computer Interaction for Work (Oldenburg, Germany) (CHIWORK ’23). Association for Computing Machinery, New York, NY,... doi:10.1145/3596671.3597650

[37] Advait Sarkar. 2024. AI Should Challenge, Not Obey. Commun. ACM (Sept. 2024), 5 pages. doi:10.1145/3649404 Online First

[38] Advait Sarkar. 2024. Large Language Models Cannot Explain Themselves. In Proceedings of the ACM CHI 2024 Workshop on Human-Centered Explainable AI (Honolulu, HI, USA) (HCXAI at CHI ’24). doi:10.48550/arXiv.2405.04382

[39] Advait Sarkar, Xiaotong (Tone) Xu, Neil Toronto, Ian Drosos, and Christian Poelitz. 2024. When Copilot Becomes Autopilot: Generative AI’s Critical Risk to Knowledge Work and a Critical Solution. In Proceedings of the Annual Conference of the European Spreadsheet Risks Interest Group (EuSpRIG 2024).

[40] Nicole Sultanum and Arjun Srinivasan. 2023. DATATALES: Investigating the use of Large Language Models for Authoring Data-Driven Articles. In 2023 IEEE Visualization and Visual Analytics (VIS). 231–235. doi:10.1109/VIS54172.2023.00055

[41] Lev Tankelevitch, Viktor Kewenig, Auste Simkute, Ava Elizabeth Scott, Advait Sarkar, Abigail Sellen, and Sean Rintel. 2024. The Metacognitive Demands and Opportunities of Generative AI. In Proceedings of the CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 680,...

[42] Stephen E Toulmin. 2003. The Uses of Argument. Cambridge University Press, Cambridge, England.

[43] Victoria Uren, Simon Buckingham Shum, Michelle Bachler, and Gangmin Li. 2006. Sensemaking tools for understanding research literatures: design, implementation and user evaluation. International Journal of Human-Computer Studies 64, 5 (2006), 420–445. doi:10.1016/j.ijhcs.2005.09.004

[44] Litao Yan, Jeffrey Tao, Lydia B Chilton, and Andrew Head. 2025. Answering Developer Questions with Annotated Agent-Discovered Program Traces. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST ’25). Ass... doi:10.1145/3746059.3747652

CHI 2026 STAR Workshop, April 16, 2026, Barcelona, Spain. Advait Sarkar, Christian Poelitz, and Viktor Kewenig

[45] Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. 2023. Automatic Evaluation of Attribution by Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 4615–4635. doi:10.18653/v1/2023.findings-emnlp.307

[46] Zijian Zhang, Pan Chen, Fangshi Du, Runlong Ye, Oliver Huang, Michael Liut, and Alán Aspuru-Guzik. 2025. TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models. In 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 286–292. doi:10.1109/VL-HCC65237.2025.00039

[47] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623.