pith. machine review for the scientific record.

arxiv: 2604.15326 · v1 · submitted 2026-03-06 · 💻 cs.HC

Recognition: 1 theorem link

· Lean Theorem

Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:55 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM conversational AI · references · citation quality · CRAAP criteria · user interface · information retrieval · trust in AI · reference utilization

The pith

LLM conversational systems differ substantially in the quantity, quality, and presentation of the references they provide to users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how nine LLM-powered conversational AI systems present and cite references in their answers to information-seeking questions. Researchers collected 1,517 references from 30 question-answer pairs and scored each reference's quality with the CRAAP criteria while also noting how the references appear in each system's user interface. The work documents clear differences across systems, with some supplying more references and higher average quality scores than others, and a preliminary user study shows that people seldom click or interact with the references regardless of the system. These patterns matter because users increasingly rely on such systems for factual answers and need reliable ways to verify the supporting sources.

Core claim

The central finding is that the nine systems vary notably in the presentation, quality, and quantity of the references they provide. ChatGPT supplies more references per response on average, with higher CRAAP quality scores, while systems such as Hunyuan-TurboS provide fewer references with lower scores. Users rarely interact with the references shown to them, and interaction patterns differ by system. The authors conclude that better interface designs are needed to help users engage with and trust references more effectively.

What carries the argument

Evaluation of reference presentation in user interfaces combined with quality scoring via the CRAAP criteria (Currency, Relevance, Authority, Accuracy, Purpose) applied to 1,517 references, plus a preliminary user study tracking actual interaction behavior.
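The paper reports CRAAP totals out of 20 across five criteria. As a minimal illustrative sketch (not the authors' actual rubric), one consistent reading is that each criterion is rated 0-4 and the ratings are summed; the dictionaries and rating values below are invented for demonstration.

```python
# Hypothetical CRAAP-style scoring: rate each reference 0-4 on five
# criteria and sum to a 0-20 total, matching the /20 scores in the paper.
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")

def craap_score(ratings: dict) -> int:
    """Sum per-criterion ratings (0-4 each) into a 0-20 total."""
    for name in CRITERIA:
        if not 0 <= ratings[name] <= 4:
            raise ValueError(f"{name} rating must be in 0..4")
    return sum(ratings[name] for name in CRITERIA)

def mean_score(references: list) -> float:
    """Average CRAAP total across one system's references."""
    return sum(craap_score(r) for r in references) / len(references)

# Invented example: one strong and one weak reference.
refs = [
    {"currency": 4, "relevance": 4, "authority": 3, "accuracy": 3, "purpose": 2},
    {"currency": 2, "relevance": 3, "authority": 1, "accuracy": 2, "purpose": 1},
]
print(mean_score(refs))  # 12.5
```

A per-system mean like this is the quantity behind the reported 15.48/20 versus 11.65/20 comparison.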

If this is right

  • Users of systems that supply fewer or lower-scoring references may need to perform extra verification steps to reach reliable conclusions.
  • Interface changes that make references more visible or easier to explore could increase engagement rates.
  • Quality differences imply that the underlying retrieval or generation methods vary in how well they surface trustworthy sources.
  • Standard practices for citation display and quality control could reduce the observed performance gaps between systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If reference quality tracks with a system's retrieval architecture, comparing the underlying document stores or fine-tuning data across providers could explain the observed gaps.
  • Making references clickable and contextual might raise overall user trust in LLM answers even when the raw quality scores stay the same.
  • Repeating the analysis on live user queries rather than curated samples would test whether the current variations hold outside the laboratory setting.

Load-bearing premise

That the CRAAP criteria give an appropriate and sufficient measure of reference quality for LLM-generated answers and that the 30 sampled question-answer pairs represent real-world usage patterns.

What would settle it

A larger study using hundreds of real user questions and a different quality assessment method that finds no significant differences in reference quantity, quality scores, or user interaction rates across systems.

Figures

Figures reproduced from arXiv: 2604.15326 by Arpit Narechania, Jianheng Ouyang.

Figure 1. Summary of findings from three analyses of references across nine conversational AI systems. Analysis 1: Presentation […]

Figure 2. Five design dimensions for presenting references in conversational AI systems: (1) Parent GUI Components (header, […]

Figure 3. Quality assessment example using the CRAAP test. A social media post is evaluated across five criteria: Authority […]

Figure 4. Quality of references from nine conversational AI systems across the CRAAP Framework. This figure presents six […]

Figure 5. User Interaction Behavior. The bubble chart shows Light Inspection (Hover Rate) against Deep Verification (Click […]
original abstract

As conversational AI systems become popular for information retrieval and question-answering, the references they cite are key to ensuring their answers are reliable and trustworthy. Yet, no prior work systematically analyzes how these references are presented or their quality. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on their (1) presentation in the user interface and (2) quality using the CRAAP criteria. We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) with higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0) and lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across systems. These findings highlight the need for better interface designs that help users engage with and trust references more effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines 1,517 references drawn from 30 question-answer pairs across nine LLM-powered conversational AI systems. It analyzes reference presentation in the user interface and quality via the CRAAP criteria, reports quantitative variations (e.g., ChatGPT averaging 9.5 references with CRAAP score 15.48/20 versus Hunyuan-TurboS with 4.0 references and 11.65/20), and includes a preliminary user study indicating low user interaction with references that differs across systems. The work concludes that better interface designs are needed to improve engagement and trust.

Significance. If the methodological gaps are closed, the study would be significant for HCI research on conversational AI by providing the first systematic comparison of reference handling across multiple systems. The scale of 1,517 references examined offers a concrete empirical basis for claims about quantity and quality differences, and the user-interaction observations point to actionable design implications. The application of the established CRAAP framework lends some structure, though its fit to LLM-generated content requires justification.

major comments (3)
  1. [Abstract] The manuscript reports specific quantitative findings (9.5 vs. 4.0 references per response; CRAAP scores of 15.48/20 vs. 11.65/20) but supplies no information on how the 30 question-answer pairs were selected, whether queries were stratified or randomized, the inter-rater reliability of CRAAP scoring, or any statistical tests supporting the variation claims. These omissions leave the central claim of notable differences unsupported by visible evidence.
  2. [Methods] CRAAP scoring assesses source attributes (currency, relevance, authority, accuracy, purpose) but does not verify whether each cited reference actually supports the claims in the LLM-generated answer text. Without an explicit linkage or entailment check between answer content and references, the quality scores cannot be taken as direct measures of trustworthiness in the conversational use case.
  3. [User Study] The preliminary user study is described as showing rare interaction and system-dependent behavior, yet no participant count, task details, or statistical analysis is provided in the abstract or summary. Given the small scale implied, this component cannot compensate for the sampling and measurement limitations in the main reference analysis.
minor comments (2)
  1. [Methods] The nine systems examined should be listed explicitly with version numbers or access dates in the Methods section for reproducibility.
  2. [Results] Figure captions and tables reporting per-system reference counts and CRAAP sub-scores would benefit from error bars or confidence intervals to convey variability.
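The second minor comment asks for variability estimates on per-system means. One simple way to obtain them, sketched here with invented scores, is a percentile bootstrap confidence interval; the function name and data are illustrative, not from the paper.

```python
# Illustrative percentile-bootstrap CI for a system's mean CRAAP score,
# the kind of interval the minor comment requests for figures and tables.
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

scores = [15, 18, 12, 16, 17, 14, 19, 13, 16, 15]  # invented /20 totals
lo, hi = bootstrap_ci(scores)
print(f"mean={sum(scores)/len(scores):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With per-system samples of this kind, the interval widths would convey how stable the reported 9.5-vs-4.0 and 15.48-vs-11.65 gaps are.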

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity, transparency, and rigor.

point-by-point responses
  1. Referee: [Abstract] The manuscript reports specific quantitative findings (9.5 vs. 4.0 references per response; CRAAP scores of 15.48/20 vs. 11.65/20) but supplies no information on how the 30 question-answer pairs were selected, whether queries were stratified or randomized, the inter-rater reliability of CRAAP scoring, or any statistical tests supporting the variation claims. These omissions leave the central claim of notable differences unsupported by visible evidence.

    Authors: We agree that the abstract should provide more methodological transparency. The full manuscript's Methods section describes the 30 question-answer pairs as having been selected for topical diversity across factual, opinion, and current-events domains, but we will revise both the abstract and Methods to explicitly state the selection process, note that queries were not formally stratified or randomized, report inter-rater reliability (Cohen's kappa for CRAAP scoring), and include statistical tests (e.g., ANOVA) supporting the reported differences in quantity and quality. revision: yes

  2. Referee: [Methods] CRAAP scoring assesses source attributes (currency, relevance, authority, accuracy, purpose) but does not verify whether each cited reference actually supports the claims in the LLM-generated answer text. Without an explicit linkage or entailment check between answer content and references, the quality scores cannot be taken as direct measures of trustworthiness in the conversational use case.

    Authors: We acknowledge that CRAAP evaluates reference attributes independently and does not include an entailment or support check against the LLM-generated answer text. Our study intentionally focused on reference quality as presented to users, which is a necessary first step for understanding trustworthiness signals. We will revise the Methods and Discussion sections to explicitly state this scope and limitation, clarify that CRAAP scores do not measure direct content support, and note that future work could incorporate entailment analysis. revision: partial

  3. Referee: [User Study] The preliminary user study is described as showing rare interaction and system-dependent behavior, yet no participant count, task details, or statistical analysis is provided in the abstract or summary. Given the small scale implied, this component cannot compensate for the sampling and measurement limitations in the main reference analysis.

    Authors: The user study is indeed preliminary and exploratory. We will revise the abstract and User Study section to report the participant count (15), provide task details, and include basic statistical comparisons of interaction rates across systems. We agree it cannot fully compensate for limitations in the main analysis and will reposition the study accordingly as supplementary evidence rather than a comprehensive validation. revision: yes
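The rebuttal promises to report Cohen's kappa for CRAAP scoring. As a minimal sketch of that statistic, the function below computes kappa for two raters' categorical labels; the label sequences are invented for demonstration.

```python
# Minimal Cohen's kappa for two raters' categorical labels: observed
# agreement corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented quality labels from two hypothetical CRAAP raters.
a = ["high", "high", "low", "med", "high", "low"]
b = ["high", "med", "low", "med", "high", "low"]
print(round(cohens_kappa(a, b), 3))  # 0.75
```

Values near 1 would indicate that the CRAAP ratings are reproducible across raters; values near 0 would undercut the reported per-system score differences.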

Circularity Check

0 steps flagged

No significant circularity: purely observational analysis

full rationale

The paper performs a descriptive empirical study by sampling 30 question-answer pairs across nine systems, extracting 1,517 references, and scoring them with the external CRAAP criteria plus a small user interaction log. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. All reported variations (e.g., average reference counts and CRAAP scores) are direct tallies from the collected data rather than quantities derived from or defined in terms of themselves. The CRAAP framework is imported from outside the authors' prior work, and no self-citation is used to justify any load-bearing claim. The analysis is therefore grounded in external criteria and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unexamined validity of the CRAAP test for AI-generated references and the representativeness of the 30 Q&A pairs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: CRAAP criteria provide a valid and objective measure of reference quality in LLM conversational outputs
    Applied directly to score references, without discussion in the abstract of the criteria's suitability for this domain.

pith-pipeline@v0.9.0 · 5475 in / 1125 out tokens · 38576 ms · 2026-05-15T15:55:52.383906+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
