pith. machine review for the scientific record.

arxiv: 2604.15326 · v1 · submitted 2026-03-06 · 💻 cs.HC

Recognition: 1 theorem link

· Lean Theorem

Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:55 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM conversational AI · references · citation quality · CRAAP criteria · user interface · information retrieval · trust in AI · reference utilization

The pith

LLM conversational systems differ substantially in the quantity, quality, and presentation of the references they provide to users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how nine LLM-powered conversational AI systems present and cite references in their answers to information-seeking questions. Researchers collected 1,517 references from 30 question-answer pairs and scored each reference's quality with the CRAAP criteria while also noting how the references appear in each system's user interface. The work documents clear differences across systems, with some supplying more references and higher average quality scores than others, and a preliminary user study shows that people seldom click or interact with the references regardless of the system. These patterns matter because users increasingly rely on such systems for factual answers and need reliable ways to verify the supporting sources.

Core claim

The central finding is that the nine systems vary notably in the presentation, quality, and quantity of the references they provide. ChatGPT supplies more references per response on average, with higher CRAAP quality scores, while systems such as Hunyuan-TurboS provide fewer references with lower scores. Users rarely interact with the references shown to them, and interaction patterns differ by system. The authors conclude that better interface designs are needed to help users engage with and trust references more effectively.

What carries the argument

Evaluation of reference presentation in user interfaces combined with quality scoring via the CRAAP criteria (Currency, Relevance, Authority, Accuracy, Purpose) applied to 1,517 references, plus a preliminary user study tracking actual interaction behavior.
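The paper reports CRAAP totals out of 20 across five criteria. As a minimal illustrative sketch (not the authors' actual rubric), one consistent reading is that each criterion is rated 0-4 and the ratings are summed; the dictionaries and rating values below are invented for demonstration.

```python
# Hypothetical CRAAP-style scoring: rate each reference 0-4 on five
# criteria and sum to a 0-20 total, matching the /20 scores in the paper.
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")

def craap_score(ratings: dict) -> int:
    """Sum per-criterion ratings (0-4 each) into a 0-20 total."""
    for name in CRITERIA:
        if not 0 <= ratings[name] <= 4:
            raise ValueError(f"{name} rating must be in 0..4")
    return sum(ratings[name] for name in CRITERIA)

def mean_score(references: list) -> float:
    """Average CRAAP total across one system's references."""
    return sum(craap_score(r) for r in references) / len(references)

# Invented example: one strong and one weak reference.
refs = [
    {"currency": 4, "relevance": 4, "authority": 3, "accuracy": 3, "purpose": 2},
    {"currency": 2, "relevance": 3, "authority": 1, "accuracy": 2, "purpose": 1},
]
print(mean_score(refs))  # 12.5
```

A per-system mean like this is the quantity behind the reported 15.48/20 versus 11.65/20 comparison.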

If this is right

  • Users of systems that supply fewer or lower-scoring references may need to perform extra verification steps to reach reliable conclusions.
  • Interface changes that make references more visible or easier to explore could increase engagement rates.
  • Quality differences imply that the underlying retrieval or generation methods vary in how well they surface trustworthy sources.
  • Standard practices for citation display and quality control could reduce the observed performance gaps between systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If reference quality tracks with a system's retrieval architecture, comparing the underlying document stores or fine-tuning data across providers could explain the observed gaps.
  • Making references clickable and contextual might raise overall user trust in LLM answers even when the raw quality scores stay the same.
  • Repeating the analysis on live user queries rather than curated samples would test whether the current variations hold outside the laboratory setting.

Load-bearing premise

That the CRAAP criteria give an appropriate and sufficient measure of reference quality for LLM-generated answers and that the 30 sampled question-answer pairs represent real-world usage patterns.

What would settle it

A larger study using hundreds of real user questions and a different quality assessment method that finds no significant differences in reference quantity, quality scores, or user interaction rates across systems.

Figures

Figures reproduced from arXiv: 2604.15326 by Arpit Narechania, Jianheng Ouyang.

Figure 1. Summary of findings from three analyses of references across nine conversational AI systems. Analysis 1: Presentation […]

Figure 2. Five design dimensions for presenting references in conversational AI systems: (1) Parent GUI Components (header, […]

Figure 3. Quality assessment example using the CRAAP test. A social media post is evaluated across five criteria: Authority […]

Figure 4. Quality of references from nine conversational AI systems across the CRAAP Framework. This figure presents six […]

Figure 5. User Interaction Behavior. The bubble chart shows Light Inspection (Hover Rate) against Deep Verification (Click […]
original abstract

As conversational AI systems become popular for information retrieval and question-answering, the references they cite are key to ensuring their answers are reliable and trustworthy. Yet, no prior work systematically analyzes how these references are presented or their quality. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on their (1) presentation in the user interface and (2) quality using the CRAAP criteria. We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) with higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0) and lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across systems. These findings highlight the need for better interface designs that help users engage with and trust references more effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines 1,517 references drawn from 30 question-answer pairs across nine LLM-powered conversational AI systems. It analyzes reference presentation in the user interface and quality via the CRAAP criteria, reports quantitative variations (e.g., ChatGPT averaging 9.5 references with CRAAP score 15.48/20 versus Hunyuan-TurboS with 4.0 references and 11.65/20), and includes a preliminary user study indicating low user interaction with references that differs across systems. The work concludes that better interface designs are needed to improve engagement and trust.

Significance. If the methodological gaps are closed, the study would be significant for HCI research on conversational AI by providing the first systematic comparison of reference handling across multiple systems. The scale of 1,517 references examined offers a concrete empirical basis for claims about quantity and quality differences, and the user-interaction observations point to actionable design implications. The application of the established CRAAP framework lends some structure, though its fit to LLM-generated content requires justification.

major comments (3)
  1. [Abstract] The manuscript reports specific quantitative findings (9.5 vs. 4.0 references per response; CRAAP scores of 15.48/20 vs. 11.65/20) but supplies no information on how the 30 question-answer pairs were selected, whether queries were stratified or randomized, the inter-rater reliability of CRAAP scoring, or any statistical tests supporting the variation claims. These omissions leave the central claim of notable differences unsupported by visible evidence.
  2. [Methods] CRAAP scoring assesses source attributes (currency, relevance, authority, accuracy, purpose) but does not verify whether each cited reference actually supports the claims in the LLM-generated answer text. Without an explicit linkage or entailment check between answer content and references, the quality scores cannot be taken as direct measures of trustworthiness in the conversational use case.
  3. [User Study] The preliminary user study is described as showing rare interaction and system-dependent behavior, yet no participant count, task details, or statistical analysis is provided in the abstract or summary. Given the small scale implied, this component cannot compensate for the sampling and measurement limitations in the main reference analysis.
minor comments (2)
  1. [Methods] The nine systems examined should be listed explicitly with version numbers or access dates in the Methods section for reproducibility.
  2. [Results] Figure captions and tables reporting per-system reference counts and CRAAP sub-scores would benefit from error bars or confidence intervals to convey variability.
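The second minor comment asks for variability estimates on per-system means. One simple way to obtain them, sketched here with invented scores, is a percentile bootstrap confidence interval; the function name and data are illustrative, not from the paper.

```python
# Illustrative percentile-bootstrap CI for a system's mean CRAAP score,
# the kind of interval the minor comment requests for figures and tables.
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

scores = [15, 18, 12, 16, 17, 14, 19, 13, 16, 15]  # invented /20 totals
lo, hi = bootstrap_ci(scores)
print(f"mean={sum(scores)/len(scores):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With per-system samples of this kind, the interval widths would convey how stable the reported 9.5-vs-4.0 and 15.48-vs-11.65 gaps are.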

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity, transparency, and rigor.

point-by-point responses
  1. Referee: [Abstract] The manuscript reports specific quantitative findings (9.5 vs. 4.0 references per response; CRAAP scores of 15.48/20 vs. 11.65/20) but supplies no information on how the 30 question-answer pairs were selected, whether queries were stratified or randomized, the inter-rater reliability of CRAAP scoring, or any statistical tests supporting the variation claims. These omissions leave the central claim of notable differences unsupported by visible evidence.

    Authors: We agree that the abstract should provide more methodological transparency. The full manuscript's Methods section describes the 30 question-answer pairs as having been selected for topical diversity across factual, opinion, and current-events domains, but we will revise both the abstract and Methods to explicitly state the selection process, note that queries were not formally stratified or randomized, report inter-rater reliability (Cohen's kappa for CRAAP scoring), and include statistical tests (e.g., ANOVA) supporting the reported differences in quantity and quality. revision: yes

  2. Referee: [Methods] CRAAP scoring assesses source attributes (currency, relevance, authority, accuracy, purpose) but does not verify whether each cited reference actually supports the claims in the LLM-generated answer text. Without an explicit linkage or entailment check between answer content and references, the quality scores cannot be taken as direct measures of trustworthiness in the conversational use case.

    Authors: We acknowledge that CRAAP evaluates reference attributes independently and does not include an entailment or support check against the LLM-generated answer text. Our study intentionally focused on reference quality as presented to users, which is a necessary first step for understanding trustworthiness signals. We will revise the Methods and Discussion sections to explicitly state this scope and limitation, clarify that CRAAP scores do not measure direct content support, and note that future work could incorporate entailment analysis. revision: partial

  3. Referee: [User Study] The preliminary user study is described as showing rare interaction and system-dependent behavior, yet no participant count, task details, or statistical analysis is provided in the abstract or summary. Given the small scale implied, this component cannot compensate for the sampling and measurement limitations in the main reference analysis.

    Authors: The user study is indeed preliminary and exploratory. We will revise the abstract and User Study section to report the participant count (15), provide task details, and include basic statistical comparisons of interaction rates across systems. We agree it cannot fully compensate for limitations in the main analysis and will reposition the study accordingly as supplementary evidence rather than a comprehensive validation. revision: yes
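The rebuttal promises to report Cohen's kappa for CRAAP scoring. As a minimal sketch of that statistic, the function below computes kappa for two raters' categorical labels; the label sequences are invented for demonstration.

```python
# Minimal Cohen's kappa for two raters' categorical labels: observed
# agreement corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented quality labels from two hypothetical CRAAP raters.
a = ["high", "high", "low", "med", "high", "low"]
b = ["high", "med", "low", "med", "high", "low"]
print(round(cohens_kappa(a, b), 3))  # 0.75
```

Values near 1 would indicate that the CRAAP ratings are reproducible across raters; values near 0 would undercut the reported per-system score differences.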

Circularity Check

0 steps flagged

No significant circularity: purely observational analysis

full rationale

The paper performs a descriptive empirical study by sampling 30 question-answer pairs across nine systems, extracting 1,517 references, and scoring them with the external CRAAP criteria plus a small user interaction log. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. All reported variations (e.g., average reference counts and CRAAP scores) are direct tallies from the collected data rather than quantities derived from or defined in terms of themselves. The CRAAP framework is imported from outside the authors' prior work, and no self-citation is used to justify any load-bearing claim. The analysis is therefore grounded in external criteria and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unexamined validity of the CRAAP test for AI-generated references and the representativeness of the 30 Q&A pairs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: CRAAP criteria provide a valid and objective measure of reference quality in LLM conversational outputs
    Applied directly to score references, without discussion in the abstract of the criteria's suitability for this domain.

pith-pipeline@v0.9.0 · 5475 in / 1125 out tokens · 38576 ms · 2026-05-15T15:55:52.383906+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
