pith. sign in

arxiv: 2605.22785 · v1 · pith:Z5VWW6WUnew · submitted 2026-05-21 · 💻 cs.CL

Evaluating Commercial AI Chatbots as News Intermediaries

Pith reviewed 2026-05-22 05:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords AI chatbotsnews accuracyfactual evaluationretrieval failuresregional biasfalse premisesadversarial questions
0
0 comments X

The pith

Commercial AI chatbots reach over 90 percent accuracy on multiple-choice questions about news reported hours earlier but lose 11 to 13 percent in free-response settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six commercial AI chatbots on 2,100 factual questions drawn from same-day BBC News reports across six regional editions. It measures performance in multiple-choice and free-response formats while also testing responses to questions that embed subtle false premises. The evaluation shows that retrieval failures cause most errors and that accuracy falls sharply for Hindi-language queries due to English-source bias. Models that perform well on clean questions still accept fabricated facts at rates up to 64 percent when premises are flawed.

Core claim

The best chatbots correctly answer over 90 percent of multiple-choice questions about events reported hours earlier, yet the same systems lose 11-13 percent accuracy under free response and drop further when questions contain subtle false premises. Retrieval failures drive over 70 percent of all errors, and every model shows its lowest accuracy on Hindi queries because of an Anglophone retrieval bias.

What carries the argument

A 14-day benchmark of 2,100 same-day factual questions from six BBC regional news services, scored separately on multiple-choice accuracy, free-response accuracy, and adversarial questions that embed false premises.

If this is right

  • High multiple-choice scores do not predict reliable performance when users ask questions in their own words.
  • Regional accuracy gaps persist because models retrieve English sources even for non-English queries.
  • The main performance limit is locating the right source rather than reasoning over retrieved text.
  • Models remain vulnerable to accepting fabricated facts when questions contain subtle false premises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chatbots may systematically under-serve non-English news audiences even when overall accuracy looks high.
  • Improving multilingual retrieval pipelines would address more errors than further gains in reasoning.
  • Real-world reliability requires testing on naturally occurring user questions rather than curated clean benchmarks.

Load-bearing premise

That questions derived from BBC News reporting constitute a representative and unbiased sample of real-world factual queries users pose to chatbots.

What would settle it

A field study that logs actual user questions about breaking news events and measures chatbot answer accuracy against the same events reported by primary sources.

Figures

Figures reproduced from arXiv: 2605.22785 by Alexander Spangher, Daniel E. Ho, Dan Jurafsky, Emily Shen, Federico Bianchi, James Zou, Mirac Suzgun, Thomas Icard.

Figure 1
Figure 1. Figure 1: Overview of the evaluation pipeline. (1) Articles are collected daily from six BBC News regional services spanning four scripts and populations totaling over two billion. (2-3) 25 five-option MC questions per region are generated from same-day reporting and evaluated across six models in parallel with native web search enabled. (4) The resulting 12,600 model–question instances reveal systematic patterns in… view at source ↗
Figure 2
Figure 2. Figure 2: Four representative benchmark questions, one per script [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Multiple-choice (MC) vs. free-response (FR) accu [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy by language/region (6 models × 14 days). Right-margin column gives mean errors per day out of 150 ques￾tions. Hindi trails the next-lowest region by nearly 10%; excluding GPT-4o mini, Hindi still produces ∼2× the errors of any other region (19.6 vs. 6.6–9.6/day). Whiskers show ±1 standard error across the 14 evaluation days. on Hindi, compared with 6.6–9.6 for other regions. The deficit is also no… view at source ↗
Figure 6
Figure 6. Figure 6: BBC source citation rate by model (14-day mean, ±1 SD). Grok 4 cites BBC at 28.5% while three models effectively never cite BBC (0.0–0.2%). The divergence likely reflects differ￾ences in scraping and licensing compliance as much as retrieval capability (§4.1.2). BBC. Claude 4.5 Sonnet (0%) never. This divergence likely reflects legal and technical factors as much as retrieval capa￾bility: BBC has actively … view at source ↗
Figure 8
Figure 8. Figure 8: Top eight cited domains per region (14-day mean, all models aggregated). Sources whose primary language differs from the regional service are highlighted in slate blue—the Anglophone retrieval pivot (§4.1.3). English Wikipedia tops the Hindi panel. et al., 2023)—to the search infrastructure that feeds produc￾tion systems. The pattern is compounded by legal dynamics: an IPPR analysis found the BBC entirely … view at source ↗
Figure 9
Figure 9. Figure 9: Per-model variation in domain reliance, by region. [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Web-search ablation, US & Canada questions [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Two illustrative adversarial questions, each derived from a real BBC article by a single subtle factual alteration (highlighted). Left (quote-attribution displacement): the article reports 53% but attributes it to the Center for Progressive Reform, not the Heritage Foundation; pattern-matching the number selects (A). Right (scope inversion): the article describes the composition of the nine victims, not t… view at source ↗
Figure 12
Figure 12. Figure 12: Standard vs. adversarial accuracy, US & Canada [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: reveals interaction effects that neither model-level nor region-level averages capture in isolation. Gemini 3 Flash Gemini 3 Pro Grok 4 Claude 4.5 Sonnet GPT-5 GPT-4o Mini 96.6 ±6.4 94.9 ±5.1 97.7 ±2.1 90.0 ±5.4 98.0 ±2.6 96.3 ±3.3 96.0 ±3.8 94.3 ±5.6 90.9 ±4.0 88.0 ±7.5 97.4 ±3.0 95.7 ±4.6 96.9 ±3.6 96.0 ±4.2 97.7 ±2.1 86.0 ±4.4 98.0 ±2.6 95.7 ±2.9 95.7 ±6.0 90.6 ±5.8 91.1 ±6.7 82.0 ±6.0 90.6 ±5.6 92.6 ±… view at source ↗
Figure 15
Figure 15. Figure 15: Where does multiple-choice over-state accuracy? (Extends Figure [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Multiple-choice vs. free-response by region. [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Search–no-search gap by region (extends Figure [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗
read the original abstract

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to imperfect queries real users pose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a 14-day evaluation (Feb 9-22, 2026) of six commercial AI chatbots on 2,100 factual questions drawn from same-day BBC News across six regional services. Best systems exceed 90% multiple-choice accuracy on recent events but drop 11-17% in free-response; retrieval failures are said to cause >70% of errors, Hindi accuracy is lowest (79%), models are vulnerable to false-premise questions (dropping to 19-70%), and a detection-accuracy paradox is observed.

Significance. If the headline numbers and failure-mode attributions hold, the work supplies large-scale, time-sensitive evidence on the reliability of commercial chatbots as news intermediaries. The multi-language/regional design and emphasis on emerging facts directly address practical deployment questions around accuracy, equity, and robustness to imperfect user queries.

major comments (2)
  1. [§4 / abstract] §4 (Failure Patterns) and abstract: The central claim that retrieval failures drive over 70% of errors rests on the ability to classify each mistake as retrieval versus reasoning. The manuscript must supply the explicit, pre-specified decision rule or annotation rubric (e.g., citation presence, lexical-overlap threshold, or blinded protocol) used to label errors for proprietary systems whose internal retrieval traces are unavailable. Without this, the 70% figure and the conclusion that 'the problem is to land on the right source' remain sensitive to post-hoc judgment.
  2. [Methods] Methods section: The 11-13% drop from multiple-choice to free-response accuracy is a key quantitative result, yet the abstract and reported methods provide no detail on exact question generation, inter-annotator agreement, or the scoring rubric for free-response answers. These choices directly affect the reported accuracy figures and must be documented with sufficient specificity for replication.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'citations indicate an Anglophone retrieval bias' would be strengthened by reporting the actual citation counts or proportions per language rather than a qualitative statement.
  2. [Results] Results tables: Adding per-model, per-region sample sizes and confidence intervals would clarify whether the Hindi performance gap is statistically distinguishable from other regions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for highlighting areas where our manuscript can be strengthened with additional methodological transparency. We respond to each major comment in turn and commit to revisions that address the concerns raised.

read point-by-point responses
  1. Referee: [§4 / abstract] §4 (Failure Patterns) and abstract: The central claim that retrieval failures drive over 70% of errors rests on the ability to classify each mistake as retrieval versus reasoning. The manuscript must supply the explicit, pre-specified decision rule or annotation rubric (e.g., citation presence, lexical-overlap threshold, or blinded protocol) used to label errors for proprietary systems whose internal retrieval traces are unavailable. Without this, the 70% figure and the conclusion that 'the problem is to land on the right source' remain sensitive to post-hoc judgment.

    Authors: We thank the referee for this important point. The error classification in our study was based on whether the model's response included a citation or reference to a source that contained the correct factual information from the BBC article. Errors without such supporting citations were attributed to retrieval failures. However, we recognize that an explicit, pre-specified rubric was not detailed in the submitted manuscript. We will add a dedicated subsection in Methods describing the annotation protocol, including criteria for citation presence and lexical overlap with ground truth, along with agreement metrics from blinded annotation of a subset of errors. This will substantiate the >70% figure. revision: yes

  2. Referee: [Methods] Methods section: The 11-13% drop from multiple-choice to free-response accuracy is a key quantitative result, yet the abstract and reported methods provide no detail on exact question generation, inter-annotator agreement, or the scoring rubric for free-response answers. These choices directly affect the reported accuracy figures and must be documented with sufficient specificity for replication.

    Authors: We agree that more details on the free-response evaluation are needed for replicability. The free-response questions were generated by converting the multiple-choice questions into open-ended versions by removing the options and adjusting the phrasing slightly for naturalness. Scoring was performed by human annotators who judged whether the model's answer correctly addressed the factual query based on the original BBC report, using a binary correct/incorrect with notes for partial matches. We will revise the Methods section to specify the question generation process, provide the scoring rubric with examples, and report inter-annotator agreement. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement against external ground truth

full rationale

The paper performs a straightforward empirical evaluation by generating factual questions from independent BBC News reports and measuring chatbot accuracy on them. There are no equations, fitted parameters, derivations, or self-citations that reduce any claim to the study's own inputs by construction. Accuracy percentages and the 70% retrieval-failure attribution are computed directly from observed model outputs versus BBC-derived ground truth; the error classification is an observational breakdown rather than a mathematical reduction or self-definitional loop. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central measurements rest on the assumption that BBC regional reporting provides accurate, timely ground truth and that the 14-day window and 2100 questions capture typical chatbot usage. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption BBC News reports in each regional service are accurate and timely factual sources suitable for deriving ground-truth questions.
    All 2,100 questions are derived from same-day BBC reporting; accuracy is measured against these reports.

pith-pipeline@v0.9.0 · 5916 in / 1297 out tokens · 35701 ms · 2026-05-22T05:22:25.938983+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    MEGA : Multilingual evaluation of generative AI

    Ahuja, K., Diddee, H., Hada, R., Ochieng, M., Ramesh, K., Jain, P., Nambi, A., Ganu, T., Segal, S., Ahmed, M., Bali, K., and Sitaram, S. MEGA : Multilingual evaluation of generative AI . In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 4232--4267, Singapore, Decembe...

  2. [2]

    Arguedas, A. R. How audiences think about news personalisation in the AI era, June 2025. URL https://reutersinstitute.politics.ox.ac.uk/digital-news-report/2025/how-audiences-think-about-news-personalisation-ai-era

  3. [3]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  4. [4]

    Sam Altman says ChatGPT has hit 800M weekly active users

    Bellan, R. Sam Altman says ChatGPT has hit 800M weekly active users. OpenAI Dev Day keynote, reported by TechCrunch, October 2025. URL https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/

  5. [5]

    J., Hitzig, Z., Ong, C., Shan, C

    Chatterji, A., Cunningham, T., Deming, D. J., Hitzig, Z., Ong, C., Shan, C. Y., and Wadman, K. How People Use ChatGPT . Technical Report w34255, National Bureau of Economic Research, 2025. URL https://www.nber.org/papers/w34255

  6. [6]

    Sycophantic AI decreases prosocial intentions and promotes dependence , volume =

    Cheng, M., Lee, C., Khadpe, P., Yu, S., Han, D., and Jurafsky, D. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 391 0 (6792): 0 eaec8352, 2026. doi:10.1126/science.aec8352. URL https://www.science.org/doi/abs/10.1126/science.aec8352

  7. [7]

    Dahl, M., Magesh, V., Suzgun, M., and Ho, D. E. Large legal fictions: Profiling legal hallucinations in large language models. Journal of Legal Analysis, 16 0 (1): 0 64--93, 2024

  8. [8]

    and Wiwanitkit, V

    Daungsupawong, H. and Wiwanitkit, V. Probing artificial intelligence in neurosurgical training: Correspondence. Brain & Spine, 4: 0 102751, 2024

  9. [9]

    T., Peterson, S

    Dzindolet, M. T., Peterson, S. A., Pomranky, R. A., Pierce, L. G., and Beck, H. P. The role of trust in automation reliance. International journal of human-computer studies, 58 0 (6): 0 697--718, 2003

  10. [10]

    RAGA s: Automated evaluation of retrieval augmented generation

    Es, S., James, J., Espinosa Anke, L., and Schockaert, S. RAGA s: Automated evaluation of retrieval augmented generation. In Aletras, N. and De Clercq, O. (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp.\ 150--158, St. Julians, Malta, March 2024. Association for ...

  11. [11]

    The information ecosystem is being redrawn by AI

    Fang, S. The information ecosystem is being redrawn by AI . That might be good news. Reuters Institute for the Study of Journalism, 2026. URL https://reutersinstitute.politics.ox.ac.uk/news/information-ecosystem-being-redrawn-ai-might-be-good-news

  12. [12]

    and Sidoti, O

    Faverio, M. and Sidoti, O. Teens, social media and AI chatbots 2025, December 2025. URL https://www.pewresearch.org/internet/2025/12/09/teens-social-media-and-ai-chatbots-2025/. Report

  13. [13]

    and Eisele, I

    Ford, M. and Eisele, I. Fact check: How trustworthy are AI fact checks? Deutsche Welle (DW), May 2025. URL https://www.dw.com/en/fact-check-hey-grok-is-this-true-how-trustworthy-are-ai-fact-checks/a-72539345

  14. [14]

    Enabling large language models to generate text with citations

    Gao, T., Yen, H., Yu, J., and Chen, D. Enabling large language models to generate text with citations. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 6465--6488, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.emnlp-main.3...

  15. [15]

    M., Hewitt, L., Saunders, E., Black, S., Lin, H., Fist, C., Margetts, H., Rand, D

    Hackenburg, K., Tappin, B. M., Hewitt, L., Saunders, E., Black, S., Lin, H., Fist, C., Margetts, H., Rand, D. G., and Summerfield, C. The levers of political persuasion with conversational artificial intelligence. Science, 390 0 (6777): 0 eaea3884, 2025

  16. [16]

    Around the world in 24 hours: Probing LLM knowledge of time and place

    Holtermann, C., R \"o ttger, P., and Lauscher, A. Around the world in 24 hours: Probing LLM knowledge of time and place. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 22875--22897, Vienna, Austria, July 2025. Associatio...

  17. [17]

    Won ' t get fooled again: Answering questions with false premises

    Hu, S., Luo, Y., Wang, H., Cheng, X., Liu, Z., and Sun, M. Won ' t get fooled again: Answering questions with false premises. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5626--5643, Toronto, Canada, July 2023. Association for C...

  18. [18]

    Revealed: ChatGPT draws more on GB News , Al Jazeera , and Marie Claire than the BBC , IPPR analysis shows, January 2026

    Institute for Public Policy Research (IPPR) . Revealed: ChatGPT draws more on GB News , Al Jazeera , and Marie Claire than the BBC , IPPR analysis shows, January 2026. URL https://www.ippr.org/media-office/revealed-chatgpt-draws-more-on-gb-news-al-jazeera-and-marie-claire-than-the-bbc-ippr-analysis-shows

  19. [19]

    and Chandrasekar, A

    Ja\' z wi\' n ska, K. and Chandrasekar, A. AI Search Has a Citation Problem . Columbia Journalism Review, 2025. URL https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

  20. [20]

    L., Asai, A., Yu, X

    Kasai, J., Sakaguchi, K., yoichi takahashi, Bras, R. L., Asai, A., Yu, X. V., Radev, D., Smith, N. A., Choi, Y., and Inui, K. Realtime QA : What's the answer right now? In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=HfKOIPCvsv

  21. [21]

    A rabic MMLU : Assessing massive multitask language understanding in A rabic

    Koto, F., Li, H., Shatnawi, S., Doughman, J., Sadallah, A., Alraeesi, A., Almubarak, K., Alyafeai, Z., Sengupta, N., Shehata, S., Habash, N., Nakov, P., and Baldwin, T. A rabic MMLU : Assessing massive multitask language understanding in A rabic. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics:...

  22. [22]

    AI platform citation patterns: How ChatGPT , Google AI Overviews , and Perplexity source information

    Lafferty, N. AI platform citation patterns: How ChatGPT , Google AI Overviews , and Perplexity source information. Profound, June 2025. URL https://www.tryprofound.com/blog/ai-platform-citation-patterns

  23. [23]

    D., Ngo, N., Pouran Ben Veyseh, A., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T

    Lai, V. D., Ngo, N., Pouran Ben Veyseh, A., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T. H. C hat GPT beyond E nglish: Towards a comprehensive evaluation of large language models in multilingual learning. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 13171--13189, Singapore,...

  24. [24]

    u ttler, H., Lewis, M., Yih, W.-t., Rockt \

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K \"u ttler, H., Lewis, M., Yih, W.-t., Rockt \"a schel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems, 33: 0 9459--9474, 2020

  25. [25]

    Li, A. O. and Goyal, T. Memorization vs. reasoning: Updating LLM s with new knowledge. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 25853--25874, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi:10.18653/v1/2025....

  26. [26]

    Lin, Q., Li, J., and Ng, H. T. D yna Q uest: A dynamic question answering dataset reflecting real-world knowledge updates. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 26918--26936, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-...

  27. [27]

    URLhttps://doi.org/10.18653/v1/2022.acl-long.229

    Lin, S., Hilton, J., and Evans, O. T ruthful QA : Measuring how models mimic human falsehoods. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3214--3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:1...

  28. [28]

    and Eddy, K

    Lipka, M. and Eddy, K. Relatively few Americans are getting news from AI chatbots like ChatGPT , October 2025. URL https://www.pewresearch.org/short-reads/2025/10/01/relatively-few-americans-are-getting-news-from-ai-chatbots-like-chatgpt/. Short Reads

  29. [29]

    Evaluating verifiability in generative search engines

    Liu, N., Zhang, T., and Liang, P. Evaluating verifiability in generative search engines. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 7001--7025, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.findings-emnlp.467. URL https://aclantholog...

  30. [30]

    D., and Ho, D

    Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., and Ho, D. E. Hallucination-free? assessing the reliability of leading ai legal research tools. Journal of empirical legal studies, 22 0 (2): 0 216--242, 2025

  31. [31]

    Manic, S. S. The ai user-agent landscape in 2026: A complete reference, 2026. URL https://nohacks.co/blog/ai-user-agents-landscape-2026. Published April 13, 2026; accessed May 12, 2026

  32. [32]

    BBC threatens ai firm over unauthorised content use

    McMahon, L. BBC threatens ai firm over unauthorised content use. BBC News , jun 2025. URL https://www.bbc.com/news/articles/cy7ndgylzzmo

  33. [33]

    Proceedings of the 2023

    Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. FA ct S core: Fine-grained atomic evaluation of factual precision in long form text generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 12076--1210...

  34. [34]

    T., Nielsen, R

    Newman, N., Ross Arguedas, A., Robertson, C. T., Nielsen, R. K., and Fletcher, R. Reuters institute digital news report 2025, 2025. URL https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2025-06/Digital_News-Report_2025.pdf

  35. [35]

    and Carl, B

    Orth, T. and Carl, B. Trust in media 2025: Which news sources Americans use and trust, May 2025. URL https://yougov.com/en-us/articles/52272-trust-in-media-2025-which-news-sources-americans-use-and-trust

  36. [36]

    H o H : A dynamic benchmark for evaluating the impact of outdated information on retrieval-augmented generation

    Ouyang, J., Pan, T., Cheng, M., Yan, R., Luo, Y., Lin, J., and Liu, Q. H o H : A dynamic benchmark for evaluating the impact of outdated information on retrieval-augmented generation. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...

  37. [37]

    MIRAGE : A metric-intensive benchmark for retrieval-augmented generation evaluation

    Park, C., Moon, H., Park, C., and Lim, H. MIRAGE : A metric-intensive benchmark for retrieval-augmented generation evaluation. In Chiruzzo, L., Ritter, A., and Wang, L. (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp.\ 2883--2900, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-8...

  38. [38]

    and Stewart, B

    Peskoff, D. and Stewart, B. Credible without credit: Domain experts assess generative language models. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 427--438, Toronto, Canada, July 2023. Association for Computational Linguistics...

  39. [39]

    L., Torr, P

    Petrov, A., Malfa, E. L., Torr, P. H., and Bibi, A. Language model tokenizers introduce unfairness between languages. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc

  40. [40]

    and Hruschka, E

    Pezeshkpour, P. and Hruschka, E. Large language models sensitivity to the order of options in multiple-choice questions. In Duh, K., Gomez, H., and Bethard, S. (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp.\ 2006--2017, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findi...

  41. [41]

    It's High Time: A Survey of Temporal Question Answering

    Piryani, B., Abdallah, A., Mozafari, J., Anand, A., and Jatowt, A. It's high time: A survey of temporal question answering. arXiv preprint arXiv:2505.20243, 2025. URL https://arxiv.org/abs/2505.20243

  42. [42]

    Will it still be true tomorrow? multilingual evergreen question classification to improve trustworthy QA

    Pletenev, S., Marina, M., Ivanov, N., Galimzianova, D., Krayko, N., Salnikov, M., Konovalov, V., Panchenko, A., and Moskvoretskii, V. Will it still be true tomorrow? multilingual evergreen question classification to improve trustworthy QA . In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Emp...

  43. [43]

    LLM targeted underperformance disproportionately impacts vulnerable users

    Poole-Dayan, E., Roy, D., and Kabbara, J. LLM targeted underperformance disproportionately impacts vulnerable users. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp.\ 39116--39124, 2026. URL https://ojs.aaai.org/index.php/AAAI/article/view/41259

  44. [44]

    S., Turc, I., and Reitter, D

    Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., Petrov, S., Tomar, G. S., Turc, I., and Reitter, D. Measuring attribution in natural language generation models. Computational Linguistics, 49 0 (4): 0 777--840, December 2023. doi:10.1162/coli_a_00486. URL https://aclanthology.org/2023.cl-4.2/

  45. [45]

    AI adoption by UK journalists and their newsrooms: surveying applications, approaches, and attitudes, November 2025

    Reuters Institute for the Study of Journalism . AI adoption by UK journalists and their newsrooms: surveying applications, approaches, and attitudes, November 2025. URL https://reutersinstitute.politics.ox.ac.uk/ai-adoption-uk-journalists-and-their-newsrooms-surveying-applications-approaches-and-attitudes

  46. [46]

    How will AI reshape the news in 2026? forecasts by 17 experts from around the world, January 2026

    Reuters Institute for the Study of Journalism . How will AI reshape the news in 2026? forecasts by 17 experts from around the world, January 2026. URL https://reutersinstitute.politics.ox.ac.uk/news/how-will-ai-reshape-news-2026-forecasts-17-experts-around-world

  47. [47]

    RAGC hecker: A fine-grained framework for diagnosing retrieval-augmented generation

    Ru, D., Qiu, L., Hu, X., Zhang, T., Shi, P., Chang, S., Jiayang, C., Wang, C., Sun, S., Li, H., Zhang, Z., Wang, B., Jiang, J., He, T., Wang, Z., Liu, P., Zhang, Y., and Zhang, Z. RAGC hecker: A fine-grained framework for diagnosing retrieval-augmented generation. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchma...

  48. [48]

    AI use in American newspapers is widespread, uneven, and rarely disclosed

    Russell, J., Karpinska, M., Akinode, D., Thai, K., Emi, B., Spero, M., and Iyyer, M. Ai use in american newspapers is widespread, uneven, and rarely disclosed. arXiv preprint arXiv:2510.18774, 2025

  49. [49]

    ARES : An automated evaluation framework for retrieval-augmented generation systems

    Saad-Falcon, J., Khattab, O., Potts, C., and Zaharia, M. ARES : An automated evaluation framework for retrieval-augmented generation systems. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), ...

  50. [50]

    The great masquerade: How ai agents are spoofing their way in, December 2025

    Segura, J. The great masquerade: How ai agents are spoofing their way in, December 2025. URL https://datadome.co/agent-trust-management/ai-agent-spoofing/. Published December 11, 2025; accessed May 12, 2026

  51. [51]

    Multi- FA ct: Assessing factuality of multilingual LLM s using FA ctscore

    Shafayat, S., Kim, E., Oh, J., and Oh, A. Multi- FA ct: Assessing factuality of multilingual LLM s using FA ctscore. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=lkrH6ovzsj

  52. [52]

    R., DURMUS, E., Hatfield-Dodds, Z., Johnston, S

    Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., DURMUS, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S. M., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., and Perez, E. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representat...

  53. [53]

    Shaw, S. D. and Nave, G. Thinking-fast, slow, and artificial: How ai is reshaping human reasoning and the rise of cognitive surrender. Available at SSRN 6097646, 2026. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646

  54. [54]

    and McClain, C

    Sidoti, O. and McClain, C. 34\ URL https://www.pewresearch.org/short-reads/2025/06/25/34-of-us-adults-have-used-chatgpt-about-double-the-share-in-2023/. Short Reads

  55. [55]

    xai transparency report, December 2025

    Stanford CRFM . xai transparency report, December 2025. URL https://crfm.stanford.edu/fmti/December-2025/company-reports/xAI_FinalReport_FMTI2025.html. Accessed May 12, 2026

  56. [56]

    E., Icard, T., Jurafsky, D., and Zou, J

    Suzgun, M., Gur, T., Bianchi, F., Ho, D. E., Icard, T., Jurafsky, D., and Zou, J. Belief in the machine: Investigating epistemological blind spots of language models. arXiv preprint arXiv:2410.21195, 2024

  57. [57]

    E., Icard, T., Jurafsky, D., and Zou, J

    Suzgun, M., Gur, T., Bianchi, F., Ho, D. E., Icard, T., Jurafsky, D., and Zou, J. Language models cannot reliably distinguish belief from knowledge and fact. Nature Machine Intelligence, pp.\ 1--11, 2025

  58. [58]

    Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

  59. [59]

    F resh LLM s: Refreshing large language models with search engine augmentation

    Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., and Luong, T. F resh LLM s: Refreshing large language models with search engine augmentation. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 13697--13720, Bangkok, Thailand, Aug...

  60. [60]

    Self- DC : When to reason and when to act? self divide-and-conquer for compositional unknown questions

    Wang, H., Xue, B., Zhou, B., Zhang, T., Wang, C., Wang, H., Chen, G., and Wong, K.-F. Self- DC : When to reason and when to act? self divide-and-conquer for compositional unknown questions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

  61. [61]

    and Haner, J

    Wang, L. and Haner, J. Many Americans say they often come across inaccurate news and have a hard time knowing what’s true, October 2025. URL https://www.pewresearch.org/short-reads/2025/10/29/many-americans-say-they-often-come-across-inaccurate-news-and-have-a-hard-time-knowing-whats-true/. Short Reads

  62. [62]

    Wei, J., Huang, D., Lu, Y., Zhou, D., and Le, Q. V. Simple synthetic data reduces sycophancy in large language models, 2025. URL https://openreview.net/forum?id=WDheQxWAo4

  63. [63]

    An automated framework for assessing how well llms cite relevant medical references

    Wu, K., Wu, E., Wei, K., Zhang, A., Casasola, A., Nguyen, T., Riantawan, S., Shi, P., Ho, D., and Zou, J. An automated framework for assessing how well llms cite relevant medical references. Nature Communications, 16 0 (1): 0 3615, 2025

  64. [64]

    Deconstructing self-bias in llm-generated translation benchmarks

    Xu, W., Agrawal, S., Zouhar, V., Freitag, M., and Deutsch, D. Deconstructing self-bias in llm-generated translation benchmarks. arXiv preprint arXiv:2509.26600, 2025 a

  65. [65]

    Let LLM s take on the latest challenges! a C hinese dynamic question answering benchmark

    Xu, Z., Li, Y., Ding, R., Wang, X., Chen, B., Jiang, Y., Zheng, H., Lu, W., Xie, P., and Huang, F. Let LLM s take on the latest challenges! a C hinese dynamic question answering benchmark. In Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B. D., and Schockaert, S. (eds.), Proceedings of the 31st International Conference on Computational ...

  66. [66]

    Corrective retrieval augmented generation, 2024

    Yan, S.-Q., Gu, J.-C., Zhu, Y., and Ling, Z.-H. Corrective retrieval augmented generation, 2024. URL https://openreview.net/forum?id=JnWJbrnaUE

  67. [67]

    Three key findings from the 2025 digital news report, June 2025

    YouGov . Three key findings from the 2025 digital news report, June 2025. URL https://yougov.com/articles/52379-three-key-findings-from-the-2025-digital-news-report

  68. [68]

    Silencer: From discovery to mitigation of self-bias in llm-as-benchmark-generator

    Yuan, P., Li, Y., Feng, S., Wang, X., Zhang, Y., Shi, J., Tan, C., Pan, B., Hu, Y., and Li, K. Silencer: From discovery to mitigation of self-bias in llm-as-benchmark-generator. arXiv preprint arXiv:2505.20738, 2025

  69. [69]

    u ksel, A., K \

    Y \"u ksel, A., K \"o ksal, A., Senel, L. K., Korhonen, A., and Schuetze, H. T urkish MMLU : Measuring massive multitask language understanding in T urkish. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 7035--7055, Miami, Florida, USA, November 2024. Association for Compu...

  70. [70]

    A., Celi, L

    Zack, T., Lehman, E., Suzgun, M., Rodriguez, J. A., Celi, L. A., Gichoya, J., Jurafsky, D., Szolovits, P., Bates, D. W., Abdulnour, R.-E. E., et al. Assessing the potential of gpt-4 to perpetuate racial and gender biases in health care: a model evaluation study. The Lancet Digital Health, 6 0 (1): 0 e12--e22, 2024

  71. [71]

    Large language models are not robust multiple choice selectors

    Zheng, C., Zhou, H., Meng, F., Zhou, J., and Huang, M. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=shr9PXz7T0