pith. machine review for the scientific record.

arxiv: 2604.17843 · v1 · submitted 2026-04-20 · 💻 cs.HC · cs.AI

Recognition: unknown

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:34 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords generative AI · policy research · epistemic humility · curated data · user evaluation · development economics · World Bank

The pith

A curated AI platform for policy research saves users 2.4-3.9 hours per week through verified citations and abstention from unsupported queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents AVA, a generative AI system built on a curated library of over 4,000 World Bank reports and equipped with mechanisms to cite sources and decline questions lacking evidence. Researchers evaluated it with more than 2,200 users across 116 countries using logs, surveys, and interviews. Difference-in-differences analysis links sustained use to weekly time savings of 2.4 to 3.9 hours. The findings illustrate how domain-specific design can reduce misinformation risks in development and policy work.

Core claim

The central finding is that AVA's design, featuring a multi-agent pipeline for evidence-based synthesis over the curated reports, citation verifiability, and reasoned abstention, leads to measurable productivity gains. Users in the study engaged with it as a specialized evidence engine, and the institutional grounding helped calibrate trust. The evaluation shows that sustained engagement is associated with substantial time reductions in research tasks.

What carries the argument

AVA's multi-agent pipeline, which grounds queries in the curated World Bank report library to produce syntheses, trace claims to specific pages, and abstain with justification when evidence is insufficient.
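
As a rough illustration, the abstain-or-cite decision in such a pipeline can be sketched as a retrieval-confidence gate. This is not AVA's actual implementation: the function names, threshold, scoring rule, and toy corpus below are all invented.

```python
# Hypothetical sketch of "reasoned abstention" in a citation-grounded
# pipeline. The threshold and lexical-overlap scoring are assumptions;
# AVA's real retrieval and abstention criteria are not specified here.

ABSTAIN_THRESHOLD = 0.5  # assumed confidence cutoff

def retrieve(query, corpus):
    """Toy lexical retrieval: score = fraction of query words found in a doc."""
    q = set(query.lower().split())
    scored = [(len(q & set(doc["text"].lower().split())) / max(len(q), 1), doc)
              for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def answer(query, corpus):
    top_score, top_doc = retrieve(query, corpus)[0]
    if top_score < ABSTAIN_THRESHOLD:
        # Reasoned abstention: decline, justify, and redirect.
        return {"abstained": True,
                "reason": "No report in the curated library covers this query.",
                "redirect": "Rephrase the question or consult another source."}
    # Otherwise, the synthesized claim carries a page-anchored citation.
    return {"abstained": False,
            "claim": top_doc["text"],
            "citation": f"{top_doc['report']}, p. {top_doc['page']}"}

corpus = [{"report": "WB-2023-001", "page": 14,
           "text": "cash transfers reduced poverty rates in pilot districts"}]

print(answer("did cash transfers reduce poverty", corpus)["citation"])
# → WB-2023-001, p. 14
print(answer("best pizza recipe", corpus)["abstained"])
# → True
```

The point of the gate is that an out-of-domain query (the paper's "pizza recipe" stress test) yields a justified refusal rather than an unsupported synthesis.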

If this is right

  • Users can treat specialized AI as a reliable evidence engine for policy questions when sources are explicitly linked.
  • Reasoned abstention clarifies the limits of AI assistance and prevents over-reliance.
  • Provenance from a trusted institution like the World Bank supports calibrated trust in outputs.
  • Such systems offer a model for deploying generative AI in high-stakes professional domains without broad misinformation risks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar curation and humility features could be applied to AI tools in other fields like medicine or law to improve safety.
  • Expanding the approach might involve combining multiple institutional libraries while preserving abstention rules.
  • Longer-term studies could test if time savings lead to higher quality policy outputs or more thorough analysis.
  • The design suggests AI can complement rather than replace expert judgment by handling initial synthesis.

Load-bearing premise

Time savings are caused by using AVA and not by other factors such as the characteristics of users who keep using the system.

What would settle it

A controlled experiment that randomly assigns users to AVA or to conventional search methods and directly measures hours spent completing matched research tasks.

Figures

Figures reproduced from arXiv: 2604.17843 by Daniel Alejandro Pinzón Hernández, Michelle Dugas, Mohamad Chatila, Nimisha Karnatak, Renos Vakis, Reza Yazdanfar.

Figure 1. AVA System Architecture: Stage 1 curates 4000+ World Bank Reports into a hierarchical RAG; Stage 2 uses agentic …
Figure 2. AVA's multi-institutional external deployment across 116 countries. Over 2,200 professionals used the system over five …
Figure 3. AVA interface annotated with its core components: (A) query input box, (B) retrieval process trace, (C) inline verifiable …
Figure 4. Participant flow. From 2,764 initial registrants, …
Figure 5. Five-day total query volume and abstention rate (% of queries receiving a reasoned "no response"). Abstention is high …
Figure 6. Perplexity AI's behavior under the "pizza recipe" stress test. When prompted with an out-of-domain question, such as …
Figure 7. Global Distribution of AVA Users: The map covers 116 countries, with darker shading indicating higher user concen…
Figure 8. The AVA Project Roadmap, illustrating key milestones from April to August 2025.
Figure 9. Distribution of the Top 10 Query Languages …
Figure 10. Distribution of user queries by policy theme and query type (N=1,582). Diagnostic queries predominated (69.0%), …
read the original abstract

General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized "evidence engine"; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations. We contribute design guidelines for specialized AI and articulate a vision for "ecosystem-aware" Humble AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AVA (AI + Verified Analysis), a generative AI platform built on a curated library of over 4,000 World Bank reports with multilingual support. It describes a multi-agent pipeline that produces evidence-based syntheses while enforcing epistemic humility via citation verifiability (page-anchored sourcing) and reasoned abstention (declining unsupported queries). An in-the-wild evaluation with over 2,200 voluntary participants from 116 countries, drawing on logs, surveys, and 20 interviews, reports Difference-in-Differences estimates that link sustained engagement to 2.4–3.9 hours of weekly time savings; qualitative themes highlight AVA’s role as an “evidence engine” and the calibration of trust through institutional provenance.

Significance. If the causal claims survive scrutiny, the work supplies timely, field-specific evidence on designing domain-curated GenAI tools that reduce misinformation risks for policy and development researchers. The concrete mechanisms for citation traceability and abstention, together with the large-scale multi-country deployment, offer actionable design guidelines for “humble AI” systems and contribute to HCI and AI-ethics literatures on trustworthy specialized agents.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the Difference-in-Differences estimates that associate sustained engagement with 2.4–3.9 hours saved weekly provide no information on the identification strategy, control-group construction, pre-engagement trend verification, definition of “sustained engagement,” or handling of differential attrition. In a self-selected sample spanning 116 countries and heterogeneous organizations, the parallel-trends assumption is therefore untestable from the reported material, rendering the headline quantitative claim non-causal on present evidence.
  2. [Evaluation] Evaluation section: time savings are measured exclusively via external user logs and self-reported surveys rather than any internal AVA metrics or fitted parameters. This leaves open the possibility that observed differences reflect selection (more productive or motivated users both sustain engagement and already save time on evidence tasks) rather than a treatment effect of the platform.
minor comments (2)
  1. [Abstract] Abstract: the claim of “multilingual capabilities” is stated without any detail on language coverage, translation quality, or evaluation metrics for non-English queries.
  2. [Evaluation] The manuscript would benefit from an explicit statement of survey response rates and interview sampling criteria to allow readers to assess the representativeness of the qualitative findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on the evaluation methodology. These points identify areas where greater transparency is needed, and we will revise the manuscript to address them while preserving the observational character of the study.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the Difference-in-Differences estimates that associate sustained engagement with 2.4–3.9 hours saved weekly provide no information on the identification strategy, control-group construction, pre-engagement trend verification, definition of “sustained engagement,” or handling of differential attrition. In a self-selected sample spanning 116 countries and heterogeneous organizations, the parallel-trends assumption is therefore untestable from the reported material, rendering the headline quantitative claim non-causal on present evidence.

    Authors: We agree that the manuscript currently lacks sufficient detail on the DiD identification strategy. The study is observational and in-the-wild; there was no randomized assignment. Sustained engagement was operationalized as users completing at least three queries within the evaluation window, with the comparison group consisting of one-time or low-frequency users. Pre-engagement trends were inspected using available log timestamps prior to the third interaction where data permitted, and differential attrition was handled by restricting analyses to participants who completed both baseline and follow-up surveys (with inverse-probability weighting applied to observable covariates). In the revision we will add an explicit subsection describing these choices, report the limited parallel-trends diagnostics that are feasible with the logs, and rephrase the abstract and results to characterize the estimates as associations conditional on engagement rather than as causal effects. We will also expand the limitations discussion to note that full verification of parallel trends across all 116 countries and organization types is not possible with the collected data. revision: yes
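
For readers unfamiliar with the estimator the rebuttal defends, a minimal two-period difference-in-differences on invented numbers (sustained vs. low-frequency users, self-reported weekly hours on evidence tasks) looks like this; the groups and values are illustrative, not the paper's data.

```python
# Minimal two-period difference-in-differences sketch with made-up data.
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """DiD = (change in treated group) - (change in comparison group)."""
    return ((mean(treated_post) - mean(treated_pre))
            - (mean(control_post) - mean(control_pre)))

# Hypothetical weekly hours spent on evidence-gathering tasks.
sustained_pre, sustained_post = [10, 12, 11], [7, 9, 8]        # >= 3 queries
low_freq_pre, low_freq_post = [10, 11, 12], [9.5, 10.5, 11.5]  # comparison

effect = did_estimate(sustained_pre, sustained_post, low_freq_pre, low_freq_post)
print(effect)  # → -2.5 (net hours saved by the sustained group)
```

The estimate is causal only under the parallel-trends assumption the referee questions: absent AVA, both groups' hours would have moved in step.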

  2. Referee: [Evaluation] Evaluation section: time savings are measured exclusively via external user logs and self-reported surveys rather than any internal AVA metrics or fitted parameters. This leaves open the possibility that observed differences reflect selection (more productive or motivated users both sustain engagement and already save time on evidence tasks) rather than a treatment effect of the platform.

    Authors: The referee is correct that time savings are derived from external sources (self-reported weekly hours on evidence-gathering tasks and usage logs) rather than from any internal AVA instrumentation. The platform does not currently log or estimate user task-completion times. To address selection concerns we will add: (i) descriptive comparisons of baseline characteristics between sustained and non-sustained users, (ii) propensity-score-matched robustness checks using available demographic and role variables, and (iii) explicit caveats that part of the observed difference may reflect pre-existing user productivity or motivation. Because internal time-tracking metrics cannot be retrofitted to the deployed system, we will also outline plans for future instrumentation in the discussion of limitations and future work. revision: partial
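
The propensity-style robustness check the rebuttal proposes can be illustrated with a toy nearest-neighbour matching on a single observable covariate; all variable names and numbers here are invented, and the paper's actual covariates and matching procedure may differ.

```python
# Toy nearest-neighbour matching on one observable covariate (baseline
# hours), sketching a matched robustness check. Data are hypothetical.

def match_controls(treated, controls, key):
    """Pair each treated unit with the closest control on `key` (with replacement)."""
    return [(t, min(controls, key=lambda c: abs(c[key] - t[key])))
            for t in treated]

# Hypothetical units: baseline weekly hours and self-reported hours saved.
treated = [{"baseline": 10, "saved": 3.0}, {"baseline": 14, "saved": 4.0}]
controls = [{"baseline": 9, "saved": 0.5}, {"baseline": 15, "saved": 1.0}]

pairs = match_controls(treated, controls, "baseline")
att = sum(t["saved"] - c["saved"] for t, c in pairs) / len(pairs)
print(att)  # → 2.75 (matched difference in hours saved)
```

Matching on observables narrows, but cannot eliminate, the selection concern: unobserved motivation or productivity can still differ between matched pairs.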

Circularity Check

0 steps flagged

No circularity: empirical DiD result from external logs/surveys, not derived from model equations or self-citations

full rationale

The paper presents AVA as a system with citation verifiability and abstention mechanisms, then reports an in-the-wild evaluation using user logs, surveys, and interviews across 2200+ participants. The headline time-savings claim is obtained by applying standard Difference-in-Differences to observed engagement data; no internal equations, fitted parameters, or self-citation chains are used to generate or force this numerical result. The derivation chain is therefore self-contained against external benchmarks and contains no self-definitional, fitted-input, or uniqueness-imported steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumptions that a fixed library of institutional reports constitutes a sufficient and unbiased knowledge base and that user-reported time savings reflect genuine productivity gains from the AI rather than placebo or selection effects.

axioms (1)
  • domain assumption General-purpose LLMs pose misinformation risks for development and policy experts due to lack of epistemic humility.
    Stated directly in the opening sentence of the abstract as motivation.
invented entities (1)
  • AVA (AI + Verified Analysis) platform no independent evidence
    purpose: To deliver evidence-based syntheses with citation verifiability and reasoned abstention.
    New system introduced and evaluated in the paper.

pith-pipeline@v0.9.0 · 5518 in / 1266 out tokens · 36581 ms · 2026-05-10T04:34:53.440564+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

126 extracted references · 68 canonical work pages · 4 internal anchors

  1. [1]

    Paul J. L. Ammann, Jonas Golde, and Alan Akbik. 2025. Question Decomposition for Retrieval-Augmented Generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Jin Zhao, Mingyang Wang, and Zhu Liu (Eds.). Association for Computational Linguistics, Vienna, Austria, 497–507. d...

  2. [2]

    Chittaranjan Andrade. 2018. Internal, External, and Ecological Validity in Research Design, Conduct, and Evaluation.Indian Journal of Psychological Medicine40, 5 (2018), 498–499. doi:10.4103/IJPSYM.IJPSYM_334_18

  3. [3]

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems37 (2024), 136037–136083

  4. [4]

    Ahmed S. BaHammam. 2025. The Transparency Paradox: Why Researchers Avoid Disclosing AI Assistance in Scientific Writing. Nature and Science of Sleep 17 (Oct. 2025), 2569–2574. doi:10.2147/NSS.S568375

  5. [5]

    Alexander Bastounis, Paolo Campodonico, Mihaela van der Schaar, Ben Adcock, and Anders C Hansen. 2024. On the consistent reasoning paradox of intelligence and optimal trust in AI: The power of 'I don't know'. arXiv preprint arXiv:2408.02357 (2024)

  6. [6]

    Marvin Braun, Maike Greve, Alfred Benedikt Brendel, and Lutz M Kolbe. 2024. Humans supervising artificial intelligence–investigation of designs to optimize error detection.Journal of Decision Systems33, 4 (2024), 674–699

  7. [7]

    Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making.Proc. ACM Hum.-Comput. Interact.5, CSCW1, Article 188 (April 2021), 21 pages. doi:10.1145/3449287

  8. [8]

    Business Insider. 2025. Don’t Worry, ChatGPT Can Still Answer Your Health Questions. https://www.businessinsider.com/openai-can-still-answer-your- health-questions-2025-11

  9. [9]

    Shiye Cao, Anqi Liu, and Chien-Ming Huang. 2024. Designing for appropriate reliance: The roles of AI uncertainty presentation, initial user decision, and user demographics in AI-assisted decision-making.Proceedings of the ACM on Human-Computer Interaction8, CSCW1 (2024), 1–32

  10. [10]

    Shuyang Cao and Lu Wang. 2024. Verifiable generation with subsentence-level fine-grained citations.arXiv preprint arXiv:2406.06125(2024)

  11. [11]

    Leo Anthony Celi. 2025. Teaching machines to doubt.Nature Medicine(2025), 1–1

  12. [12]

    Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, and Wei Wang. 2024. Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals. arXiv:2406.10881 [cs.CL] doi:10.48550/arXiv.2406.10881

  13. [14]

    Cheng-Han Chiang and Hung-yi Lee. 2024. Merging facts, crafting fallacies: Evaluating the contradictory nature of aggregated factual claims in long-form generations.arXiv preprint arXiv:2402.05629(2024)

  14. [15]

    Sarah L Dalglish, Hina Khalid, and Shannon A McMahon. 2020. Document analysis in health policy research: the READ approach.Health policy and planning35, 10 (2020), 1424–1431

  15. [16]

    Don E Davis, Everett L Worthington Jr, Joshua N Hook, Robert A Emmons, Peter C Hill, Richard A Bollinger, and Daryl R Van Tongeren. 2013. Humility and the development and repair of social bonds: Two longitudinal studies.Self and identity12, 1 (2013), 58–77

  16. [17]

    Çağdaş Dedeoğlu and Priyank Chandra. 2025. Navigating the Posthuman Turn in Computing and Design: A Posthuman Vocabulary. InProceedings of the ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies. 504–529

  17. [18]

    Yang Deng, Yong Zhao, Moxin Li, See-Kiong Ng, and Tat-Seng Chua. 2024. Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations. arXiv preprint arXiv:2402.15062 (2024)

  18. [19]

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2024. Chain-of-Verification Reduces Hallucination in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, … doi:10.18653/v1/2024.findings-acl.212

  20. [21]

    Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, and Ying Ding. 2025. Guide-RAG: Evidence-Driven Corpus Curation for Retrieval- Augmented Generation in Long COVID.arXiv preprint arXiv:2510.15782(2025)

  21. [22]

    Paul Dourish. 2003. The Appropriation of Interactive Technologies: Some Lessons from Placeless Documents.Computer Supported Cooperative Work (CSCW)12, 4 (2003), 465–490. doi:10.1023/A:1026149119426

  22. [23]

    Elicit. 2025. Elicit: AI for scientific research. https://elicit.com/. Accessed: 2025-12-05

  23. [24]

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shit...

  24. [25]

    European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence. Official Journal of the European Union, L 2024/1689. http://data.europa.eu/eli/reg/2024/1689/oj Article 50

  25. [26]

    Raymond Fok, Nedim Lipka, Tong Sun, and Alexa F. Siu. 2024. Marco: Supporting Business Document Workflows via Collection-Centric Information Foraging with Large Language Models. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, HI, USA, Article 842, 20 pages. doi:10.1145/3613904.3641969

  26. [27]

    Paul Formosa, Sarah Bankins, Rita Matulionyte, and Omid Ghasemi. 2025. Can ChatGPT be an author? Generative AI creative writing assistance and perceptions of authorship, creatorship, responsibility, and disclosure. AI & Society 40, 5 (2025), 3405–3417

  27. [28]

    Santo Fortunato, Alessandro Flammini, Filippo Menczer, and Alessandro Vespignani. 2006. Topical interests and the mitigation of search engine bias. Proceedings of the National Academy of Sciences 103, 34 (2006), 12684–12689

  28. [29]

    Diego Gambetta. 2000. Can We Trust Trust? InTrust: Making and Breaking Cooperative Relations, Diego Gambetta (Ed.). Vol. 13. Department of Sociology, University of Oxford, 213–237

  29. [30]

    Mariem Gandouz, Hajo Holzmann, and Dominik Heider. 2021. Machine learning with asymmetric abstention for biomedical decision-making.BMC medical informatics and decision making21, 1 (2021), 294

  30. [31]

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://aclanthology.org/2023.emnlp-main.398/

  31. [32]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997 2, 1 (2023)

  32. [33]

    Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. InAdvances in Neural Information Processing Systems 30 (NeurIPS 2017). https://dl.acm.org/doi/10.5555/3295222.3295241

  33. [34]

    Google LLC. 2025. Learn about NotebookLM. https://support.google.com/notebooklm/answer/16164461?hl=en&co=GENIE.Platform%3DDesktop

  34. [35]

    Shailja Gupta, Rajesh Ranjan, and Surya Narayan Singh. 2024. A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions.arXiv preprint arXiv:2410.12837(2024)

  35. [36]

    Md Naimul Hoque, Tasfia Mashiat, Bhavya Ghai, Cecilia Shelton, Fanny Chevalier, Kari Kraus, and Niklas Elmqvist. 2024. The HaLLMark Effect: Supporting Provenance and Transparent Use of Large Language Models in Writing with Interactive Visualization. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, Honolulu, HI, USA, A...

  36. [37]

    Mohammad Hosseini, David B Resnik, and Kristi Holmes. 2023. The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts. Research Ethics19, 4 (2023), 449–465

  37. [38]

    Brett N Hryciw, Andrew JE Seely, and Kwadwo Kyeremanteng. 2023. Guiding principles and proposed classification system for the responsible adoption of artificial intelligence in scientific writing in medicine.Frontiers in Artificial Intelligence6 (2023), 1283353

  38. [39]

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. arXiv:2003.11080 [cs.CL] https://arxiv.org/abs/2003.11080

  39. [40]

    Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, and Irwin King. 2024. Mitigating large language model hallucination with faithful finetuning.arXiv preprint arXiv:2406.11267(2024)

  40. [41]

    Lei Huang, Xiaocheng Feng, Weitao Ma, Yuxuan Gu, Weihong Zhong, Xiachong Feng, Weijiang Yu, Weihua Peng, Duyu Tang, Dandan Tu, et al. 2024. Learning fine-grained grounded citations for attributed large language models.arXiv preprint arXiv:2408.04568(2024)

  41. [42]

    Yulong Hui, Chao Chen, Zhihang Fu, Yihao Liu, Jieping Ye, and Huanchen Zhang. 2025. Reason and Interact with the Corpus, Beyond Black-Box Retrieval. arXiv preprint arXiv:2510.27566(2025)

  42. [43]

    Sarah Jabbour, Trenton Chang, Anindya Das Antar, Joseph Peper, Insu Jang, Jiachen Liu, Jae-Won Chung, Shiqi He, Michael Wellman, Bryan Goodman, Elizabeth Bondi-Kelly, Kevin Samy, Rada Mihalcea, Mosharaf Chowdhury, David Jurgens, and Lu Wang. 2025. Evaluation Framework for AI Systems in "the Wild". arXiv:2504.16778 [cs.CL] doi:10.48550/arXiv.2504.16778

  43. [44]

    Steven J. Jackson, Tarleton Gillespie, and Sandra Payette. 2014. The Policy Knot: Re-integrating Policy, Practice and Design in CSCW Studies of Social Computing. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '14). ACM, New York, NY, USA, 588–602. doi:10.1145/2531602.2531674

  44. [45]

    Amir Jahanlou, Jo Vermeulen, Tovi Grossman, Parmit Chilana, George Fitzmaurice, and Justin Matejka. 2023. Task-Centric Application Switching: How and Why Knowledge Workers Switch Software Applications for a Single Task. In Graphics Interface 2023

  45. [46]

    Jiahe Jin, Abhijay Paladugu, and Chenyan Xiong. 2025. Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them. arXiv:2510.06534 [cs.AI] https://arxiv.org/abs/2510.06534

  46. [47]

    Justia. 2025. AI and Attorney Ethics Rules: 50-State Survey. https://www.justia.com/trials-litigation/ai-and-attorney-ethics-rules-50-state-survey/. Accessed 2 Dec 2025

  47. [48]

    Nimisha Karnatak, Adrien Baranes, Rob Marchant, Tríona Butler, and Kristen Olson. 2025. ACAI for SBOs: AI Co-creation for Advertising and Inspiration for Small Business Owners. arXiv preprint arXiv:2503.06729 (2025). doi:10.48550/arXiv.2503.06729

  48. [49]

    Nimisha Karnatak, Adrien Baranes, Rob Marchant, Huinan Zeng, Tríona Butler, and Kristen Olson. 2025. Expanding the Generative AI Design Space through Structured Prompting and Multimodal Interfaces. InProceedings of the CHI 2025 Workshop on Computational User Interfaces. Association for Computing Machinery. Workshop paper

  49. [50]

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 [cs.CL] https://arxiv.org/abs/2004.04906

  50. [51]

    Naveena Karusala, Sohini Upadhyay, Rajesh Veeraraghavan, and Krzysztof Z Gajos. 2024. Understanding Contestability on the Margins: Implications for the Design of Algorithmic Decision-making in Public Services. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–16

  51. [52]

    Minsu Kim, Sangryul Kim, and James Thorne. 2025. From Evidence to Belief: A Bayesian Epistemology Approach to Language Models. arXiv preprint arXiv:2504.19622 (2025)

  52. [53]

    S. S. Y. Kim et al. 2024. "I'm Not Sure, But...": Uncertainty Expressions and User Reliance/Trust. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT). Preprint available at https://arxiv.org/abs/2405.00623

  53. [54]

    Bran Knowles, Jason D’Cruz, John T Richards, and Kush R Varshney. 2023. Humble AI.Commun. ACM66, 9 (2023), 73–79. doi:10.1145/3587035

  54. [55]

    Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Designing to Adjust End-User Expectations. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM. doi:10.1145/3290605.3300641

  55. [56]

    Dawn Lawrie, Eugene Yang, Douglas W. Oard, and James Mayfield. 2023. Neural Approaches to Multilingual Information Retrieval. In Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part I (Dublin, Ireland). Springer-Verlag, Berlin, Heidelberg, 521–536. doi:10.1007/97...

  56. [57]

    John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human Factors 46, 1 (2004), 50–80

  57. [58]

    Yoonjoo Lee, Hyeonsu B. Kang, Matt Latzke, Juho Kim, Jonathan Bragg, Joseph Chee Chang, and Pao Siangliulue. 2024. PaperWeaver: Enriching Topical Paper Alerts by Contextualizing Recommended Papers with User-Collected Papers. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). ACM. doi:10.1145/3613904.3642196

  58. [59]

    Florian Leiser, Sven Eckhardt, Valentin Leuthe, Merlin Knaeble, Alexander Maedche, Gerhard Schwabe, and Ali Sunyaev. 2024. HILL: A Hallucination Identifier for Large Language Models. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM. doi:10.1145/3613904. 3642428

  59. [60]

    Zachary Levonian, Chenglu Li, Wangda Zhu, Anoushka Gade, Owen Henkel, Millie-Ellen Postle, and Wanli Xing. 2023. Retrieval-augmented generation to improve math question-answering: Trade-offs between groundedness and human preference.arXiv preprint arXiv:2310.03184(2023)

  60. [61]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Urvashi Khandelwal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS)

  61. [62]

    Bo Li, Zhenghua Xu, and Rui Xie. 2025. Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation. arXiv:2511.09984 [cs.CL] https://arxiv.org/abs/2511.09984

  62. [63]

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Thirty-seventh Conference on Neural Information Processing Systems

  63. [64]

    Minghan Li, Miyang Luo, Tianrui Lv, Yishuai Zhang, Siqi Zhao, Ercong Nie, and Guodong Zhou. 2025. A Survey of Long-Document Retrieval in the PLM and LLM Era. arXiv:2509.07759 [cs.IR] https://arxiv.org/abs/2509.07759

  64. [65]

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.arXiv preprint arXiv:2305.19118(2023)

  65. [66]

    Q. Vera Liao and Jennifer Wortman Vaughan. 2023. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv:2306.01941 https://arxiv.org/pdf/2306.01941

  66. [67]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers). 3214–3252

  67. [68]

    Genglin Liu, Xingyao Wang, Lifan Yuan, Yangyi Chen, and Hao Peng. 2023. Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge. arXiv preprint arXiv:2311.09731 (2023)

  68. [69]

    Nelson F Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating verifiability in generative search engines.arXiv preprint arXiv:2304.09848(2023)

  69. [70]

    David Lyell and Enrico Coiera. 2017. Automation bias and verification complexity: a systematic review. Journal of the American Medical Informatics Association 24, 2 (2017), 423–431

  70. [71]

    Henrietta Lyons, Eduardo Velloso, and Tim Miller. 2021. Conceptualising contestability: Perspectives on contesting algorithmic decisions. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–25

  71. [72]

    Yixiao Ma, Yueyue Wu, Qingyao Ai, Yiqun Liu, Yunqiu Shao, Min Zhang, and Shaoping Ma. 2023. Incorporating Structural Information into Legal Case Retrieval.ACM Trans. Inf. Syst.42, 2, Article 40 (Nov. 2023), 28 pages. doi:10. 1145/3609796

  72. [73]

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. 2025. Do LLMs know when to not answer? Investigating abstention abilities of large language models. In Proceedings of the 31st International Conference on Computational Linguistics. 9329–9345

  73. [74]

    Jacob Menick et al. 2022. Teaching Language Models to Support Answers with Verified Quotes. arXiv preprint arXiv:2203.11147 (2022). https://arxiv.org/abs/2203.11147

  74. [75]

    Rahul Nair, Inge Vejsbjerg, Elizabeth M Daly, Christos Varytimidis, and Bran Knowles. 2025. Humble AI in the real-world: the case of algorithmic hiring. In Adjunct Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work. 1–7

  75. [76]

    Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, and Chien-Sheng Wu. 2025. Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for C...

  76. [77]

    Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, and Arpan Biswas. 2025. A systematic review of key retrieval-augmented generation (rag) systems: Progress, gaps, and future directions.arXiv preprint arXiv:2507.18910 (2025)

  77. [78]

    OpenAI. 2025. Usage Policies. https://openai.com/policies/usage-policies. Accessed: 2025-11-28

  78. [79]

    Sebastián Andrés Cajas Ordoñez, Maximin Lange, Torleif Markussen Lunde, Mackenzie J Meni, and Anna E Premo. 2025. Humility and curiosity in human–AI systems for health care.The Lancet406, 10505 (2025), 804–805

  79. [80]

    Wanda J Orlikowski. 2000. Using technology and constituting structures: A practice lens for studying technology in organizations.Organization science11, 4 (2000), 404–428

  80. [81]

    Wanda Janina Orlikowski et al. 1995. Evolving with Notes: Organizational change around groupware technology. (1995)

Showing first 80 references.