
arxiv: 2310.03214 · v2 · pith:UNO3AZK5new · submitted 2023-10-05 · 💻 cs.CL

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

classification 💻 cs.CL
keywords freshqa, models, questions, answers, freshprompt, knowledge, search, world

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
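The abstract describes FreshPrompt only at a high level: retrieved search results are placed into a few-shot prompt, the number and order of the evidence items matter, and concise answers are requested. The sketch below is one possible reading of that description, not the authors' implementation. The `Evidence` fields, the `build_prompt` function, the prompt layout, the default of five items, and the oldest-to-newest ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Evidence:
    """One retrieved search result (hypothetical structure, not from the paper)."""
    source: str
    published: date
    snippet: str


def build_prompt(question: str,
                 evidences: List[Evidence],
                 demonstrations: str = "",
                 max_evidences: int = 5) -> str:
    """Assemble a search-augmented prompt for a single question.

    Assumption: keep the `max_evidences` most recent items and list them
    oldest first, so the newest evidence sits closest to the question.
    """
    ordered = sorted(evidences, key=lambda e: e.published)
    kept = ordered[-max_evidences:] if max_evidences > 0 else []

    lines = []
    if demonstrations:
        # Few-shot examples, supplied by the caller as plain text.
        lines.append(demonstrations.strip())
        lines.append("")
    for e in kept:
        lines.append(f"source: {e.source}")
        lines.append(f"date: {e.published.isoformat()}")
        lines.append(f"snippet: {e.snippet}")
        lines.append("")
    lines.append(f"question: {question}")
    # The abstract reports that concise, direct answers reduce hallucination.
    lines.append("Answer concisely and directly.")
    return "\n".join(lines)


if __name__ == "__main__":
    results = [
        Evidence("a.example", date(2021, 1, 1), "OLD-FACT"),
        Evidence("b.example", date(2023, 6, 1), "NEW-FACT"),
        Evidence("c.example", date(2022, 3, 1), "MID-FACT"),
    ]
    print(build_prompt("Who holds the record?", results, max_evidences=2))
```

Varying `max_evidences` and the sort direction is a simple way to probe the sensitivity to evidence count and order that the abstract mentions.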



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

    cs.CL 2026-03 conditional novelty 7.0

    Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.

  2. BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

    cs.CL 2025-04 conditional novelty 7.0

    BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.

  3. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

    cs.IR 2024-10 conditional novelty 7.0

    VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

  4. Hallucination is Inevitable: An Innate Limitation of Large Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.

  5. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  6. Measuring short-form factuality in large language models

    cs.CL 2024-11 unverdicted novelty 6.0

    SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

  7. Retrieval-Augmented Generation for Natural Language Processing: A Survey

    cs.CL 2024-07 accept novelty 6.0

    The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

  8. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    cs.AI 2026-05 conditional novelty 5.0

    QuestBench is a student-created set of 256 expert-level questions that exposes low performance (16.85% mean pass rate) in current AI deep research systems while serving as a classroom method for accountable AI education.

  9. TrustLLM: Trustworthiness in Large Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...

  10. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    cs.AI 2026-05 conditional novelty 4.0

    QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.