Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
Pith reviewed 2026-05-10 04:34 UTC · model grok-4.3
The pith
A curated AI platform for policy research is associated with saving users 2.4–3.9 hours per week through verified citations and abstention from unsupported queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that AVA's design, which pairs a multi-agent pipeline for evidence-based syntheses over a curated report library with citation verifiability and reasoned abstention, leads to measurable productivity gains. Users in the study engaged with it as a specialized evidence engine, and its institutional grounding helped calibrate trust. The evaluation associates sustained engagement with substantial time savings on research tasks.
What carries the argument
AVA's multi-agent pipeline, which grounds user queries in the curated World Bank report library to produce syntheses, trace claims to specific pages, and abstain with justification when evidence is insufficient.
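A minimal sketch of this verify-or-abstain flow in Python. The retriever, relevance threshold, and synthesizer below are hypothetical stand-ins for illustration; the paper does not specify AVA's internal components at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Passage:
    report_id: str   # World Bank report identifier
    page: int        # page the passage was drawn from
    text: str
    score: float     # retrieval relevance in [0, 1]

# Hypothetical evidence threshold; AVA's actual criterion is not published.
EVIDENCE_THRESHOLD = 0.75

def answer(query: str,
           retrieve: Callable[[str], List[Passage]],
           synthesize: Callable[[str, List[Passage]], str]) -> str:
    passages = [p for p in retrieve(query) if p.score >= EVIDENCE_THRESHOLD]
    if not passages:
        # Reasoned abstention: decline, justify, and redirect the user.
        return ("This query cannot be answered from the curated report "
                "library: no sufficiently relevant passages were retrieved. "
                "Consider rephrasing around topics the library covers.")
    draft = synthesize(query, passages)
    # Citation verifiability: anchor the synthesis to report and page.
    sources = "; ".join(f"{p.report_id}, p. {p.page}" for p in passages)
    return f"{draft}\n\nSources: {sources}"
```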
If this is right
- Users can treat specialized AI as a reliable evidence engine for policy questions when sources are explicitly linked.
- Reasoned abstention clarifies the limits of AI assistance and prevents over-reliance.
- Provenance from a trusted institution like the World Bank supports calibrated trust in outputs.
- Such systems offer a model for deploying generative AI in high-stakes professional domains without broad misinformation risks.
Where Pith is reading between the lines
- Similar curation and humility features could be applied to AI tools in other fields like medicine or law to improve safety.
- Expanding the approach might involve combining multiple institutional libraries while preserving abstention rules.
- Longer-term studies could test if time savings lead to higher quality policy outputs or more thorough analysis.
- The design suggests AI can complement rather than replace expert judgment by handling initial synthesis.
Load-bearing premise
Time savings are caused by using AVA itself, not by confounding factors such as the characteristics of users who choose to keep using the system.
What would settle it
A controlled experiment that randomly assigns users to AVA or to conventional search methods and directly measures hours spent completing matched research tasks.
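For concreteness, a minimal sketch of the two-group, two-period Difference-in-Differences comparison the headline estimate rests on, using statsmodels. The data frame, its column names, and all values are invented for illustration; they are not the study's data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy panel of self-reported weekly research hours; values are invented.
# treated = 1 for sustained AVA users, post = 1 after the engagement window.
df = pd.DataFrame({
    "hours":   [10, 11, 7, 8, 10, 11, 10, 11],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
})

# The coefficient on treated:post is the DiD estimate of hours saved;
# it is causal only if the parallel-trends assumption holds.
model = smf.ols("hours ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # -3.0 here: three hours saved weekly
```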
Original abstract
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference-in-Differences estimates associate sustained engagement with 2.4–3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized "evidence engine"; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations. We contribute design guidelines for specialized AI and articulate a vision for "ecosystem-aware" Humble AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AVA (AI + Verified Analysis), a generative AI platform built on a curated library of over 4,000 World Bank reports with multilingual support. It describes a multi-agent pipeline that produces evidence-based syntheses while enforcing epistemic humility via citation verifiability (page-anchored sourcing) and reasoned abstention (declining unsupported queries). An in-the-wild evaluation with over 2,200 voluntary participants from 116 countries, drawing on logs, surveys, and 20 interviews, reports Difference-in-Differences estimates that link sustained engagement to 2.4–3.9 hours of weekly time savings; qualitative themes highlight AVA’s role as an “evidence engine” and the calibration of trust through institutional provenance.
Significance. If the causal claims survive scrutiny, the work supplies timely, field-specific evidence on designing domain-curated GenAI tools that reduce misinformation risks for policy and development researchers. The concrete mechanisms for citation traceability and abstention, together with the large-scale multi-country deployment, offer actionable design guidelines for “humble AI” systems and contribute to HCI and AI-ethics literatures on trustworthy specialized agents.
major comments (2)
- [Abstract and Evaluation] The Difference-in-Differences estimates that associate sustained engagement with 2.4–3.9 hours saved weekly are reported without any information on the identification strategy, control-group construction, pre-engagement trend verification, definition of “sustained engagement,” or handling of differential attrition. In a self-selected sample spanning 116 countries and heterogeneous organizations, the parallel-trends assumption is therefore untestable from the reported material, rendering the headline quantitative claim non-causal on present evidence.
- [Evaluation] Time savings are measured exclusively via user logs and self-reported surveys rather than any internal AVA metrics or fitted parameters. This leaves open the possibility that observed differences reflect selection (more productive or motivated users both sustain engagement and already save time on evidence tasks) rather than a treatment effect of the platform.
minor comments (2)
- [Abstract] The claim of “multilingual capabilities” is stated without any detail on language coverage, translation quality, or evaluation metrics for non-English queries.
- [Evaluation] The manuscript would benefit from an explicit statement of survey response rates and interview sampling criteria to allow readers to assess the representativeness of the qualitative findings.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on the evaluation methodology. These points identify areas where greater transparency is needed, and we will revise the manuscript to address them while preserving the observational character of the study.
Point-by-point responses
Referee: [Abstract and Evaluation] The Difference-in-Differences estimates that associate sustained engagement with 2.4–3.9 hours saved weekly are reported without any information on the identification strategy, control-group construction, pre-engagement trend verification, definition of “sustained engagement,” or handling of differential attrition. In a self-selected sample spanning 116 countries and heterogeneous organizations, the parallel-trends assumption is therefore untestable from the reported material, rendering the headline quantitative claim non-causal on present evidence.
Authors: We agree that the manuscript currently lacks sufficient detail on the DiD identification strategy. The study is observational and in-the-wild; there was no randomized assignment. Sustained engagement was operationalized as users completing at least three queries within the evaluation window, with the comparison group consisting of one-time or low-frequency users. Pre-engagement trends were inspected using available log timestamps prior to the third interaction where data permitted, and differential attrition was handled by restricting analyses to participants who completed both baseline and follow-up surveys (with inverse-probability weighting applied to observable covariates). In the revision we will add an explicit subsection describing these choices, report the limited parallel-trends diagnostics that are feasible with the logs, and rephrase the abstract and results to characterize the estimates as associations conditional on engagement rather than as causal effects. We will also expand the limitations discussion to note that full verification of parallel trends across all 116 countries and organization types is not possible with the collected data. revision: yes
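A minimal sketch of the pre-engagement trend diagnostic described in this response, assuming a long panel of weekly hours is available from the logs. The columns, values, and group labels are illustrative, not the study's data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic pre-period panel: weekly hours before each user's third query.
pre = pd.DataFrame({
    "week":      [1, 2, 3, 4, 1, 2, 3, 4],
    "hours":     [11, 11, 10, 10, 10, 10, 9, 9],
    "sustained": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Parallel pre-trends imply the two groups' slopes match, i.e. the
# week:sustained interaction should be indistinguishable from zero.
diag = smf.ols("hours ~ week * sustained", data=pre).fit()
print(diag.params["week:sustained"])
```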
Referee: [Evaluation] Time savings are measured exclusively via user logs and self-reported surveys rather than any internal AVA metrics or fitted parameters. This leaves open the possibility that observed differences reflect selection (more productive or motivated users both sustain engagement and already save time on evidence tasks) rather than a treatment effect of the platform.
Authors: The referee is correct that time savings are derived from external sources (self-reported weekly hours on evidence-gathering tasks and usage logs) rather than from any internal AVA instrumentation. The platform does not currently log or estimate user task-completion times. To address selection concerns we will add: (i) descriptive comparisons of baseline characteristics between sustained and non-sustained users, (ii) propensity-score-matched robustness checks using available demographic and role variables, and (iii) explicit caveats that part of the observed difference may reflect pre-existing user productivity or motivation. Because internal time-tracking metrics cannot be retrofitted to the deployed system, we will also outline plans for future instrumentation in the discussion of limitations and future work. revision: partial
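A minimal sketch of the inverse-probability-weighting step the authors propose, assuming baseline covariates are observable. The covariate names, values, and model choice are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented baseline covariates per user: [role_code, org_size, baseline_hours].
X = np.array([
    [1,  50, 10], [0, 200,  8], [1,  30, 12], [0, 500,  9],
    [1,  80, 11], [0, 120,  7], [1,  60,  9], [0, 300, 10],
])
sustained = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # sustained-engagement flag

# Propensity of sustained engagement given observables.
ps = LogisticRegression(max_iter=1000).fit(X, sustained).predict_proba(X)[:, 1]

# Inverse-probability weights: upweight comparison users who resemble
# sustained users on observables, then re-run the DiD with these weights
# (e.g. via a weighted least-squares fit).
weights = np.where(sustained == 1, 1.0 / ps, 1.0 / (1.0 - ps))
```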
Circularity Check
No circularity: empirical DiD result from external logs/surveys, not derived from model equations or self-citations
Full rationale
The paper presents AVA as a system with citation verifiability and abstention mechanisms, then reports an in-the-wild evaluation using user logs, surveys, and interviews across more than 2,200 participants. The headline time-savings claim is obtained by applying standard Difference-in-Differences to observed engagement data; no internal equations, fitted parameters, or self-citation chains are used to generate or force this numerical result. The derivation chain is therefore grounded in external benchmarks and contains no self-definitional, fitted-input, or uniqueness-imported steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: General-purpose LLMs pose misinformation risks for development and policy experts due to lack of epistemic humility.
invented entities (1)
- AVA (AI + Verified Analysis) platform (no independent evidence)