Althea: Human-AI Collaboration for Fact-Checking and Critical Reasoning

Anab Maulana Barik; Cai Yang; Harshit Aneja; Kokil Jaidka; Mong Li Lee; Svetlana Churina; Wynne Hsu

arXiv: 2602.11161 · v2 · submitted 2025-12-29 · cs.HC · cs.CL

Pith reviewed 2026-05-16 18:52 UTC · model grok-4.3

keywords: fact-checking, human-AI collaboration, critical reasoning, retrieval-augmented system, interaction modes, user study, AVeriTeC benchmark, misinformation

The pith

Althea structures human-AI fact-checking so guided modes raise immediate accuracy while self-directed modes build lasting reasoning gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Althea as a retrieval-augmented system that generates questions, retrieves evidence, and structures reasoning steps to help users evaluate online claims. On the AVeriTeC benchmark it reaches a Macro-F1 of 0.44 and outperforms standard pipelines in distinguishing supported from refuted claims. A controlled study and longitudinal experiment with 963 participants tested three modes that differ in scaffolding: guided exploratory reasoning, synthesized summary verdicts, and procedural self-search guidance. Guided interaction produced the largest short-term lifts in accuracy and user confidence, whereas self-search produced the strongest improvements that persisted over time. The results indicate that gains depend on how the system organizes cognitive work rather than on effort or exposure alone.

Core claim

Althea integrates question generation, evidence retrieval, and structured reasoning to support user-driven claim evaluation. It achieves a Macro-F1 of 0.44 on AVeriTeC. Controlled comparisons of three interaction modes show that guided scaffolding yields the strongest immediate gains in accuracy and confidence, while self-directed procedural guidance yields the most persistent gains over time. The paper attributes this pattern to how cognitive work is structured and internalized rather than to time or effort alone. Users described the system as transparent and helpful for organizing evidence and clarifying competing claims.
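For readers unfamiliar with the metric: Macro-F1 is the unweighted mean of per-class F1 scores, so a rare verdict class counts as much as a common one. The sketch below is a minimal illustration on toy data. The `macro_f1` helper, the label names, and the example predictions are assumptions made for illustration; this is not the paper's evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given label set.

    A class with no gold instances and no predictions contributes 0.0.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true class but was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical data, two classes only).
gold = ["Supported", "Supported", "Refuted", "Refuted"]
pred = ["Supported", "Refuted", "Refuted", "Refuted"]
score = macro_f1(gold, pred, labels=["Supported", "Refuted"])
print(round(score, 3))  # 0.733
```

Because every class is weighted equally, a system that never predicts a minority verdict is penalized heavily, which is one reason a Macro-F1 figure can look low next to plain accuracy.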

What carries the argument

Three interaction modes that vary the degree of scaffolding: Exploratory mode with guided reasoning, Summary mode with synthesized verdicts, and Self-search mode offering procedural guidance without direct algorithmic intervention.

If this is right

Guided scaffolding can produce rapid lifts in verification accuracy and user confidence.
Self-directed procedural guidance supports internalization of reasoning skills that endure beyond the session.
Performance differences arise from how the system organizes evidence and prompts reflection rather than from raw effort.
Transparent organization of evidence and competing claims helps users engage in reflective reasoning.
Systems can balance immediate performance support with long-term epistemic autonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Fact-checking platforms could offer users a choice of scaffolding level depending on whether the goal is quick verification or skill development.
The same structured-interaction approach might transfer to other domains that require evaluating competing evidence, such as scientific claims or policy arguments.
Persistent gains from self-directed use suggest the design could help build population-level resistance to misinformation if scaled in educational settings.

Load-bearing premise

Differences in accuracy and persistence across modes are caused by the structure of cognitive work rather than by differences in user motivation, prior knowledge, or time spent.

What would settle it

A follow-up experiment that equalizes time spent and motivation across the three modes and then finds no differences in immediate accuracy or long-term retention would falsify the claim that interaction structure drives the observed gains.

Figures

Figures reproduced from arXiv: 2602.11161 by Anab Maulana Barik, Cai Yang, Harshit Aneja, Kokil Jaidka, Mong Li Lee, Svetlana Churina, Wynne Hsu.

**Figure 5.** Effects of interaction mode on verification strategy use. Panel A: strategy-level standardized changes (z-scores). Panel B: aggregate indices. Points are means with 95% CI. Colored stars: treatment vs. control; black stars: Exploratory vs. Summary (*** p < .001).

**Figure 7.** System architecture of the Althea chatbot platform. User-facing components are in green, the backend platform in orange/gray, and external APIs in purple. The participant is shown as a user icon.

**Figure 9.** Predicted overall improvement by treatment and party identification (N = 642). Lines represent estimated marginal means from the linear model, with shaded ribbons indicating 95% confidence intervals. Slopes: Control (β = −0.023, n.s.), Summary Mode (β = −0.190, p < .001), Exploratory Mode (β = −0.014, n.s.). The negative slope for Summary Mode indicates decreasing benefits among Republicans compared to Dem…

Original abstract

The web's information ecosystem demands fact-checking systems that are both scalable and epistemically trustworthy. Automated approaches offer efficiency but often lack transparency, while human verification remains slow and inconsistent. We introduce Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. On the AVeriTeC benchmark, Althea achieves a Macro-F1 of 0.44, outperforming standard verification pipelines and improving discrimination between supported and refuted claims. We further evaluate Althea through a controlled user study and a longitudinal survey experiment (N=963), comparing three interaction modes that vary in the degree of scaffolding: an Exploratory mode with guided reasoning, a Summary mode providing synthesized verdicts, and a Self-search mode that offers procedural guidance without algorithmic intervention. Results show that guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time. This pattern suggests that performance gains are not driven solely by effort or exposure, but by how cognitive work is structured and internalized. Participants consistently described Althea as transparent and supportive of reflective reasoning, emphasizing its ability to organize evidence and clarify competing claims. By integrating retrieval, interaction, and pedagogical scaffolding, Althea demonstrates how human-AI interaction can move beyond automated verdicts toward durable improvements in reasoning. These findings advance the design of trustworthy, human-centered fact-checking systems that balance guidance with epistemic autonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Althea pairs a retrieval-augmented fact-checker with a three-mode user study that separates immediate accuracy gains from longer-term retention, but the causal story still needs tighter checks on effort and exposure.

Althea builds a retrieval-augmented system that adds question generation and structured reasoning steps to help people check online claims. On the AVeriTeC benchmark it reaches a Macro-F1 of 0.44 and beats the basic pipelines the authors compare against. The real work is the controlled comparison of three interaction modes (guided exploratory, summary verdicts, and self-search with only procedural prompts), run longitudinally with 963 participants. Guided mode lifts immediate accuracy and confidence, while self-search produces steadier gains that hold up later. Users also reported that the system felt transparent and helped organize evidence.

That split between short-term boost and durable change is the part worth paying attention to for anyone designing tools that aim to improve reasoning rather than just hand out verdicts. The study size and the longitudinal tracking are clear strengths here.

The soft spot is the missing detail on whether time spent, number of evidence items viewed, or motivation differed across modes. The abstract claims the gains come from how the cognitive work is structured, yet without reported time logs, interaction counts, or covariate-adjusted models it is still possible that guided users simply did more work up front. That leaves the causal attribution thinner than the headline result suggests.

This paper is aimed at HCI and NLP groups working on human-AI setups for misinformation. A reader who wants concrete examples of scaffolding levels and their different time horizons will get usable ideas from it. The core system and the study design are solid enough that it should go to peer review rather than get desk-rejected.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Althea, a retrieval-augmented fact-checking system integrating question generation, evidence retrieval, and structured reasoning. On the AVeriTeC benchmark it reports a Macro-F1 of 0.44, outperforming standard verification pipelines. A controlled user study and longitudinal experiment (N=963) compare three interaction modes—Exploratory (guided reasoning), Summary (synthesized verdicts), and Self-search (procedural guidance without algorithmic intervention)—claiming that guided interaction produces the strongest immediate gains in accuracy and confidence while self-directed search yields the most persistent improvements over time. The work concludes that performance gains arise from the structure of cognitive work rather than effort or exposure alone, with participants describing the system as transparent and supportive of reflective reasoning.

Significance. If the mode-specific effects are confirmed after controlling for confounds, the work would be significant for human-AI collaboration in fact-checking and critical reasoning. The large N=963 longitudinal sample is a clear strength for assessing persistence, and the benchmark result provides concrete evidence of technical feasibility. The paper usefully highlights design trade-offs between scaffolding and epistemic autonomy, advancing principles for trustworthy interactive verification systems.

major comments (1)

[User study and longitudinal experiment] The central claim attributes accuracy and persistence differences across Exploratory, Summary, and Self-search modes to the structure of cognitive work rather than effort or exposure. However, the manuscript provides no details on measurement or statistical control of time-on-task, number of interactions, evidence items viewed, participant motivation, or use of covariate-adjusted models. Without these, the causal attribution to scaffolding cannot be verified and remains open to alternative explanations.

minor comments (2)

[Abstract] The reported Macro-F1 of 0.44 is presented without the specific baseline scores, exact pipeline implementations, or statistical significance tests used for comparison.
[Abstract and methods] The three interaction modes are described at a high level but lack explicit operational definitions, example interfaces, or procedural details that would allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important gap in the reporting of our user study and longitudinal experiment. We address the concern point by point below and commit to revisions that strengthen the causal interpretation of our findings.

Referee: The central claim attributes accuracy and persistence differences across Exploratory, Summary, and Self-search modes to the structure of cognitive work rather than effort or exposure. However, the manuscript provides no details on measurement or statistical control of time-on-task, number of interactions, evidence items viewed, participant motivation, or use of covariate-adjusted models. Without these, the causal attribution to scaffolding cannot be verified and remains open to alternative explanations.

Authors: We agree that the manuscript lacks explicit details on these controls, which limits the strength of the causal claims as presented. The study interface logged timestamps for all actions, allowing computation of time-on-task per participant and per claim; the number of evidence items viewed and interactions performed were recorded via system logs; and a post-session questionnaire included items on motivation and perceived effort. These data were collected but omitted from the main text due to length constraints. In the revised manuscript we will add a new subsection under Methods describing the logging procedures and will report covariate-adjusted mixed-effects models that include time-on-task, interaction count, and evidence items viewed as fixed effects (with participant as random effect). We will also include motivation scores as an additional covariate. The revised Results section will present both unadjusted and adjusted effect sizes for accuracy and persistence outcomes, allowing readers to evaluate whether the mode differences persist after these controls. Supplementary materials will contain the full covariate tables and model specifications.

Revision: yes
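One way to read the covariate-adjusted model described in the response is as a random-intercept specification like the sketch below. Neither the paper nor the rebuttal gives a formula, so every symbol here is an assumption: $y_{ij}$ is the outcome for participant $i$ on claim $j$, $\mathbf{m}_i$ is a dummy-coded vector for interaction mode, and the remaining terms are the logged covariates named in the response.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0
  + \boldsymbol{\beta}_{\text{mode}}^{\top}\mathbf{m}_i
  + \beta_1\,\text{time}_{ij}
  + \beta_2\,\text{interactions}_{ij}
  + \beta_3\,\text{evidence}_{ij}
  + \beta_4\,\text{motivation}_i
  + u_i + \varepsilon_{ij}, \\
u_i \sim{}& \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Under this reading, the referee's concern is addressed if the entries of $\boldsymbol{\beta}_{\text{mode}}$ remain clearly non-zero once the covariates are included. For a binary accuracy outcome, a logistic link would likely replace the linear form; the response does not say which is intended.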

Circularity Check

0 steps flagged

No circularity; results are empirical outcomes from benchmarks and user studies

The paper reports system performance via Macro-F1 on AVeriTeC and comparative outcomes from a controlled user study plus N=963 longitudinal experiment across three interaction modes. These are presented as measured results rather than any derivation chain, equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to inputs by construction. No self-definitional steps, uniqueness theorems, or ansatzes appear in the abstract or described content; the attribution to cognitive scaffolding rests on experimental contrasts, not logical equivalence to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and system-oriented with no mathematical derivations, fitted parameters, or new axioms described in the abstract.

pith-pipeline@v0.9.0 · 5591 in / 1232 out tokens · 55690 ms · 2026-05-16T18:52:31.000613+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear. Passage:
guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear. Passage:
Epistemic agency—specifically sourcehood—is the operative mechanism underlying durable improvement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

[1]

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. arXiv:2106.05707 [cs.CL]

work page arXiv 2021
[2]

Rami Aly, Marek Strong, and Andreas Vlachos. 2023. QA-NatVer: Question Answering for Natural Logic-based Fact Verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 8376–8391. https://doi.org/10.18653/v1...

work page doi:10.18653/v1/2023.emnlp-main.521 2023
[3]

Leif Azzopardi. 2021. Cognitive biases in search: a review and reflection of cognitive biases in Information Retrieval. In Proceedings of the 2021 conference on human information interaction and retrieval . 27–37

work page 2021
[4]

Bert N Bakker, Kokil Jaidka, Timothy Dörr, Neil Fasching, and Yphtach Lelkes. 2021. Questionable and open research practices: Attitudes and perceptions among quantitative communication researchers. Journal of Communication 71, 5 (2021), 715–738

work page 2021
[5]

Xabier E Barandiaran, Ezequiel Di Paolo, and Marieke Rohde. 2009. Defining agency: Individuality, normativity, asymmetry, and spatio-temporality in action. Adaptive behavior 17, 5 (2009), 367–386

work page 2009
[6]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101

work page 2006
[7]

Joel Breakstone, Mark Smith, Nadav Ziv, and Sam Wineburg. 2022. Civic preparation for the digital age: How college students evaluate online sources about social and political issues. The Journal of Higher Education 93, 7 (2022), 963–988

work page 2022
[8]

Mike Caulfield. 2019. SIFT (The Four Moves). Hapgood. https://hapgood.us/2019/06/19/sift -the-four-moves/ Accessed: 2025

work page 2019
[9]

Jiangjie Chen, Qiaoben Bao, Changzhi Sun, Xinbo Zhang, Jiaze Chen, Hao Zhou, Yanghua Xiao, and Lei Li. 2022. LOREN: Logic-Regularized Reasoning for Interpretable Fact Verification. Proceedings of the AAAI Conference on Artificial Intelligence 36, 10 (Jun. 2022), 10482–10491. https: //doi.org/10.1609/aaai.v36i10.21291

work page doi:10.1609/aaai.v36i10.21291 2022
[10]

Michelene TH Chi, Nicholas De Leeuw, Mei-Hung Chiu, and Christian LaVancher. 1994. Eliciting self-explanations improves understanding. Cognitive science 18, 3 (1994), 439–477

work page 1994
[11]

Svetlana Churina, Anab Maulana Barik, and Saisamarth Rajesh Phaye. 2024. Improving Evidence Retrieval on Claim Verification Pipeline through Question Enrichment. In Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER) , Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Zhijiang Guo, Ch...

work page doi:10.18653/v1/2024.fever 2024
[12]

Jacob Devasier, Rishabh Mediratta, Phuong Le, David Huang, and Chengkai Li. 2024. ClaimLens: Automated, Explainable Fact-Checking on Voting Claims Using Frame-Semantics. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 311–319

work page 2024
[13]

Mennatallah El -Assady and Caterina Moruzzi. 2022. Which biases and reasoning pitfalls do explanations trigger? Decomposing communication processes in human–AI interaction. IEEE Computer Graphics and Applications 42, 6 (2022), 11–23

work page 2022
[14]

Catherine Z Elgin. 2013. Epistemic agency. Theory and research in education 11, 2 (2013), 135–152

work page 2013
[15]

Freedom House. 2024. Freedom in the World. https://freedomhouse.org/report/freedom -world. Accessed: 2025-09-08

work page 2024
[16]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM 64, 12 (2021), 86–92. Althea: Scaffolding Epistemic Agency for Durable Fact-Checking 23 Manuscript submitted to ACM

work page 2021
[17]

Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183

work page 1988
[18]

Naeemul Hassan, Bill Adair, James T Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking. In Proceedings of the 2015 computation+ journalism symposium. Citeseer

work page 2015
[19]

Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward automated fact -checking: Detecting check -worthy factual claims by claimbuster. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 1803–1812

work page 2017
[20]

Matthew Jörke, Defne Genç, Valentin Teutschbein, Shardul Sapkota, Sarah Chung, Paul Schmiedmayer, Maria Ines Campero, Abby C King, Emma Brunskill, and James A Landay. 2026. Bloom: Designing for LLM -augmented behavior change interactions. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 1–27

work page 2026
[21]

Jukka Jouhki, Epp Lauk, Maija Penttinen, Niina Sormanen, and Turo Uskali. 2016. Facebook’s emotional contagion experiment as a challenge to research ethics. Media and Communication 4, 4 (2016), 75–85

work page 2016
[22]

Norfarahin Jumat, Nur Amira Rahman, Fatin Farhana Rahim, Siti Fatimah Anuar, Ai Lin Low, and Chin Wee Ong. 2025. Scaffolding Higher -Order Thinking Through AI Chatbots: A Multi-Domain Study. In Proceedings of the International Conference on Education and Learning Sciences. Springer. Available at https://irr.singaporetech.edu.sg/articles/conference_contrib...

work page arXiv 2025
[23]

DM Kahan. 2017. Misconceptions, misinformation, and the logic of identity-protective cognition. Yale Law & Economics Research Paper 164 (2017)

work page 2017
[24]

Joseph Kahne and Benjamin Bowyer. 2017. Educating for democracy in a partisan age: Confronting the challenges of motivated reasoning and misinformation. American educational research journal 54, 1 (2017), 3–34

work page 2017
[25]

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What‘s Missing: A Knowledge Gap Guided Approach for Multi -hop Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP -IJCNLP) , Kentaro Inui, Jing Jiang, Vincent Ng, ...

work page doi:10.18653/v1/d19 2019
[26]

Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th international conference on intelligent user interfaces. 126–137

work page 2015
[27]

Sengjie Liu and Christopher G. Healey. 2023. Abstractive Summarization of Large Document Collections Using GPT. arXiv:2310.05690 [cs.AI] https://arxiv.org/abs/2310.05690

work page arXiv 2023
[28]

Sijie Liu, Yuyang Hu, Zihang Tian, Zhe Jin, Shijin Ruan, and Jiaxin Mao. 2024. Investigating Users’ Search Behavior and Outcome with ChatGPT in Learning-oriented Search Tasks. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 103–113

work page 2024
[29]

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen -tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251 (2023)

work page arXiv 2023
[30]

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1979. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. Technical Report. Department of Health, Education, and Welfare. https: //www.hhs.gov/ohrp/regulations -and-policy/belmont -report/index.html

work page 1979
[31]

Liangming Pan, Xinyuan Lu, Min -Yen Kan, and Preslav Nakov. 2023. QACheck: A Demonstration System for Question -Guided Multi -Hop Fact - Checking. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Yansong Feng and Els Lefever (Eds.). Association for Computational Linguistics, Singapore, 264–2...

work page doi:10.18653/v1/2023.emnlp-demo.23 2023
[32]

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey I rving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 3419–3448

work page 2022
[33]

Noah S Podolefsky, Emily B Moore, and Katherine K Perkins. 2013. Implicit scaffolding in interactive simulations: Design strategies to support multiple educational goals. arXiv preprint arXiv:1306.6544 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[34]

Peng Qi, Zehong Yan, Wynne Hsu, and Mong Li Lee. 2024. SNIFFER: Multimodal Large Language Model for Explainable Out -of-Context Misinfor - mation Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13052–13062

work page 2024
[35]

Kevin Roitero, Michael Soprano, Beatrice Portelli, Damiano Spina, Vincenzo Della Mea, Giuseppe Serra, Stefano Mizzaro, and Gi anluca Demartini

work page
[36]

In Proceedings of the 29th ACM international conference on information & knowledge management

The covid-19 infodemic: Can the crowd judge recent misinformation objectively?. In Proceedings of the 29th ACM international conference on information & knowledge management. 1305–1314

work page
[37]

inoculation

Jon Roozenbeek, Sander Van Der Linden, and Thomas Nygren. 2020. Prebunking interventions based on the psychological theory of “inoculation” can reduce susceptibility to misinformation across cultures. (2020)

work page 2020
[38]

Michael Schlichtkrull, Zhijiang Guo, and Andreas Vlachos. 2024. Averitec: A dataset for real-world claim verification with evidence from the web. Advances in Neural Information Processing Systems 36 (2024)

work page 2024
[39]

Evan Selinger and Woodrow Hartzog. 2016. Facebook’s emotional contagion study and the ethical problem of co-opted identity in mediated environments where users lack control. Research Ethics 12, 1 (2016), 35–43

work page 2016
[40]

Li Shi, Nilavra Bhattacharya, Anubrata Das, Matt Lease, and Jacek Gwizdka. 2022. The effects of interactive ai design on user behavior: An eye-tracking study of fact-checking covid-19 claims. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval. 315–320

work page 2022
[41]

Ben Shneiderman. 2022. Human-centered AI. Oxford University Press. 24 Anonymous et al. Manuscript submitted to ACM

work page 2022
[42]

Jiasheng Si, Yibo Zhao, Yingjie Zhu, Haiyang Zhu, Wenpeng Lu, and Deyu Zhou. 2024. CHECKWHY: Causal Fact Verification via Argument Structure. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ba...

work page doi:10.18653/v1/2024.acl 2024
[43]

Michael Soprano, Kevin Roitero, David La Barbera, Davide Ceolin, Damiano Spina, Gianluca Demartini, and Stefano Mizzaro. 2024. Cognitive Biases in Fact-Checking and Their Countermeasures: A Review. Information Processing & Management 61, 3 (2024), 103672

work page 2024
[44]

Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, and Jacy Reese Anthis. 2025. HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants. arXiv preprint arXiv:2509.08494 (2025)

work page arXiv 2025
[45]

Lu Sun, Aaron Chan, Yun Seo Chang, and Steven P Dow. 2024. ReviewFlow: Intelligent scaffolding to support academic peer reviewing. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 120–137

work page 2024
[46]

Kevin Timpe. 2006. Free will. The Bloomsbury Companion to Metaphysics (2006), 257

work page 2006
[47]

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. science 359, 6380 (2018), 1146–1151

work page 2018
[48]

Nathan Walter, Jonathan Cohen, R Lance Holbert, and Yasmin Morag. 2020. Fact-checking: A meta -analysis of what works and for whom. Political communication 37, 3 (2020), 350–375

work page 2020
[49]

Sam Wineburg and Sarah McGrew. 2019. Lateral reading and the nature of expertise: Reading less and learning more when evaluating digital information. Teachers College Record 121, 11 (2019), 1–40

work page 2019
[50]

Xuan Zhang and Wei Gao. 2023. Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , Jong C. Pa...

work page doi:10.18653/v1/2023.ijcnlp 2023
[51]

The percentage of people under the poverty line in India has decreased in the past ten years

Yuhao Zhang, Jiaxin An, Ben Wang, Yan Zhang, and Jiqun Liu. 2025. Human-Centered Explainability in Interactive Information Systems: A Survey. arXiv preprint arXiv:2507.02300 (2025). A Appendix A.1 Althea Fact-checking Pipeline Althea is a retrieval -augmented fact -checking system that provides users with a guided and transparent framework for claim evalu...

work page arXiv 2025
[52]

{claim}” Conversation Transcript: — {transcript.strip()} — Return your analysis in a JSON format with two keys: “decision

- The buyer was a trust tied to billionaire Les Wexner. Les Wexner had a relationship with Jeffrey Epstein.[1][2][3][4] - However, the Obamas did not own the property outright; they rented it during their vacations. - The purchase was made by a trust connected to Wexner’s family, with legal and public records confirming the transac- tion.[3][1] Summary: -...

work page 2025

[1] [1]

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. arXiv:2106.05707 [cs.CL]

work page arXiv 2021

[2] [2]

Rami Aly, Marek Strong, and Andreas Vlachos. 2023. QA-NatVer: Question Answering for Natural Logic-based Fact Verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 8376–8391. https://doi.org/10.18653/v1...

work page doi:10.18653/v1/2023.emnlp-main.521 2023

[3] [3]

Leif Azzopardi. 2021. Cognitive biases in search: a review and reflection of cognitive biases in Information Retrieval. In Proceedings of the 2021 conference on human information interaction and retrieval . 27–37

work page 2021

[4] [4]

Bert N Bakker, Kokil Jaidka, Timothy Dörr, Neil Fasching, and Yphtach Lelkes. 2021. Questionable and open research practices: Attitudes and perceptions among quantitative communication researchers. Journal of Communication 71, 5 (2021), 715–738

[5]

Xabier E Barandiaran, Ezequiel Di Paolo, and Marieke Rohde. 2009. Defining agency: Individuality, normativity, asymmetry, and spatio-temporality in action. Adaptive behavior 17, 5 (2009), 367–386

[6]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101

[7]

Joel Breakstone, Mark Smith, Nadav Ziv, and Sam Wineburg. 2022. Civic preparation for the digital age: How college students evaluate online sources about social and political issues. The Journal of Higher Education 93, 7 (2022), 963–988

[8]

Mike Caulfield. 2019. SIFT (The Four Moves). Hapgood. https://hapgood.us/2019/06/19/sift-the-four-moves/ Accessed: 2025

[9]

Jiangjie Chen, Qiaoben Bao, Changzhi Sun, Xinbo Zhang, Jiaze Chen, Hao Zhou, Yanghua Xiao, and Lei Li. 2022. LOREN: Logic-Regularized Reasoning for Interpretable Fact Verification. Proceedings of the AAAI Conference on Artificial Intelligence 36, 10 (Jun. 2022), 10482–10491. https://doi.org/10.1609/aaai.v36i10.21291

[10]

Michelene TH Chi, Nicholas De Leeuw, Mei-Hung Chiu, and Christian LaVancher. 1994. Eliciting self-explanations improves understanding. Cognitive science 18, 3 (1994), 439–477

[11]

Svetlana Churina, Anab Maulana Barik, and Saisamarth Rajesh Phaye. 2024. Improving Evidence Retrieval on Claim Verification Pipeline through Question Enrichment. In Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER), Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Zhijiang Guo, Ch...

[12]

Jacob Devasier, Rishabh Mediratta, Phuong Le, David Huang, and Chengkai Li. 2024. ClaimLens: Automated, Explainable Fact-Checking on Voting Claims Using Frame-Semantics. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 311–319

[13]

Mennatallah El-Assady and Caterina Moruzzi. 2022. Which biases and reasoning pitfalls do explanations trigger? Decomposing communication processes in human–AI interaction. IEEE Computer Graphics and Applications 42, 6 (2022), 11–23

[14]

Catherine Z Elgin. 2013. Epistemic agency. Theory and research in education 11, 2 (2013), 135–152

[15]

Freedom House. 2024. Freedom in the World. https://freedomhouse.org/report/freedom-world. Accessed: 2025-09-08

[16]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM 64, 12 (2021), 86–92

[17]

Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183

[18]

Naeemul Hassan, Bill Adair, James T Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking. In Proceedings of the 2015 computation+ journalism symposium. Citeseer

[19]

Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward automated fact-checking: Detecting check-worthy factual claims by claimbuster. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 1803–1812

[20]

Matthew Jörke, Defne Genç, Valentin Teutschbein, Shardul Sapkota, Sarah Chung, Paul Schmiedmayer, Maria Ines Campero, Abby C King, Emma Brunskill, and James A Landay. 2026. Bloom: Designing for LLM-augmented behavior change interactions. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 1–27

[21]

Jukka Jouhki, Epp Lauk, Maija Penttinen, Niina Sormanen, and Turo Uskali. 2016. Facebook’s emotional contagion experiment as a challenge to research ethics. Media and Communication 4, 4 (2016), 75–85

[22]

Norfarahin Jumat, Nur Amira Rahman, Fatin Farhana Rahim, Siti Fatimah Anuar, Ai Lin Low, and Chin Wee Ong. 2025. Scaffolding Higher-Order Thinking Through AI Chatbots: A Multi-Domain Study. In Proceedings of the International Conference on Education and Learning Sciences. Springer. Available at https://irr.singaporetech.edu.sg/articles/conference_contrib...

[23]

DM Kahan. 2017. Misconceptions, misinformation, and the logic of identity-protective cognition. Yale Law & Economics Research Paper 164 (2017)

[24]

Joseph Kahne and Benjamin Bowyer. 2017. Educating for democracy in a partisan age: Confronting the challenges of motivated reasoning and misinformation. American educational research journal 54, 1 (2017), 3–34

[25]

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What’s Missing: A Knowledge Gap Guided Approach for Multi-hop Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, ...

[26]

Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th international conference on intelligent user interfaces. 126–137

[27]

Sengjie Liu and Christopher G. Healey. 2023. Abstractive Summarization of Large Document Collections Using GPT. arXiv:2310.05690 [cs.AI] https://arxiv.org/abs/2310.05690

[28]

Sijie Liu, Yuyang Hu, Zihang Tian, Zhe Jin, Shijin Ruan, and Jiaxin Mao. 2024. Investigating Users’ Search Behavior and Outcome with ChatGPT in Learning-oriented Search Tasks. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 103–113

[29]

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251 (2023)

[30]

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1979. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. Technical Report. Department of Health, Education, and Welfare. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html

[31]

Liangming Pan, Xinyuan Lu, Min-Yen Kan, and Preslav Nakov. 2023. QACheck: A Demonstration System for Question-Guided Multi-Hop Fact-Checking. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Yansong Feng and Els Lefever (Eds.). Association for Computational Linguistics, Singapore, 264–2...

[32]

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 3419–3448

[33]

Noah S Podolefsky, Emily B Moore, and Katherine K Perkins. 2013. Implicit scaffolding in interactive simulations: Design strategies to support multiple educational goals. arXiv preprint arXiv:1306.6544 (2013)

[34]

Peng Qi, Zehong Yan, Wynne Hsu, and Mong Li Lee. 2024. SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13052–13062

[35]

Kevin Roitero, Michael Soprano, Beatrice Portelli, Damiano Spina, Vincenzo Della Mea, Giuseppe Serra, Stefano Mizzaro, and Gianluca Demartini. 2020. The covid-19 infodemic: Can the crowd judge recent misinformation objectively? In Proceedings of the 29th ACM international conference on information & knowledge management. 1305–1314

[37]

Jon Roozenbeek, Sander Van Der Linden, and Thomas Nygren. 2020. Prebunking interventions based on the psychological theory of “inoculation” can reduce susceptibility to misinformation across cultures. (2020)

[38]

Michael Schlichtkrull, Zhijiang Guo, and Andreas Vlachos. 2024. Averitec: A dataset for real-world claim verification with evidence from the web. Advances in Neural Information Processing Systems 36 (2024)

[39]

Evan Selinger and Woodrow Hartzog. 2016. Facebook’s emotional contagion study and the ethical problem of co-opted identity in mediated environments where users lack control. Research Ethics 12, 1 (2016), 35–43

[40]

Li Shi, Nilavra Bhattacharya, Anubrata Das, Matt Lease, and Jacek Gwizdka. 2022. The effects of interactive ai design on user behavior: An eye-tracking study of fact-checking covid-19 claims. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval. 315–320

[41]

Ben Shneiderman. 2022. Human-centered AI. Oxford University Press

[42]

Jiasheng Si, Yibo Zhao, Yingjie Zhu, Haiyang Zhu, Wenpeng Lu, and Deyu Zhou. 2024. CHECKWHY: Causal Fact Verification via Argument Structure. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ba...

[43]

Michael Soprano, Kevin Roitero, David La Barbera, Davide Ceolin, Damiano Spina, Gianluca Demartini, and Stefano Mizzaro. 2024. Cognitive Biases in Fact-Checking and Their Countermeasures: A Review. Information Processing & Management 61, 3 (2024), 103672

[44]

Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, and Jacy Reese Anthis. 2025. HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants. arXiv preprint arXiv:2509.08494 (2025)

[45]

Lu Sun, Aaron Chan, Yun Seo Chang, and Steven P Dow. 2024. ReviewFlow: Intelligent scaffolding to support academic peer reviewing. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 120–137

[46]

Kevin Timpe. 2006. Free will. The Bloomsbury Companion to Metaphysics (2006), 257

[47]

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. science 359, 6380 (2018), 1146–1151

[48]

Nathan Walter, Jonathan Cohen, R Lance Holbert, and Yasmin Morag. 2020. Fact-checking: A meta-analysis of what works and for whom. Political communication 37, 3 (2020), 350–375

[49]

Sam Wineburg and Sarah McGrew. 2019. Lateral reading and the nature of expertise: Reading less and learning more when evaluating digital information. Teachers College Record 121, 11 (2019), 1–40

[50]

Xuan Zhang and Wei Gao. 2023. Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Jong C. Pa...

[51]


Yuhao Zhang, Jiaxin An, Ben Wang, Yan Zhang, and Jiqun Liu. 2025. Human-Centered Explainability in Interactive Information Systems: A Survey. arXiv preprint arXiv:2507.02300 (2025)

A Appendix

A.1 Althea Fact-checking Pipeline

Althea is a retrieval-augmented fact-checking system that provides users with a guided and transparent framework for claim evalu...

{claim}” Conversation Transcript: — {transcript.strip()} — Return your analysis in a JSON format with two keys: “decision

- The buyer was a trust tied to billionaire Les Wexner. Les Wexner had a relationship with Jeffrey Epstein.[1][2][3][4]
- However, the Obamas did not own the property outright; they rented it during their vacations.
- The purchase was made by a trust connected to Wexner’s family, with legal and public records confirming the transaction.[3][1]

Summary: -...
