pith. machine review for the scientific record.

arxiv: 2411.04368 · v1 · submitted 2024-11-07 · 💻 cs.CL

Recognition: 3 theorem links · Lean Theorem

Measuring short-form factuality in large language models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords SimpleQA · factuality evaluation · language models · short-form questions · knowledge calibration · abstention · benchmark design · GPT-4 adversarial collection

The pith

The SimpleQA benchmark measures whether language models know what they know on short factual questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SimpleQA, a benchmark designed to test language models on short, fact-seeking questions that have only one indisputable answer. Questions are collected adversarially against GPT-4 responses to keep them difficult, while the single-answer format makes grading straightforward as correct, incorrect, or not attempted. This setup lets evaluators check whether models maximize correct answers while abstaining on topics they do not confidently know. A reader would care because current models often produce factual errors or overconfident guesses, and a simple, durable test could track whether future systems improve at recognizing their own knowledge limits. The authors position the benchmark as targeted and likely to stay useful across the next generations of models.
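
To make the three-way grading concrete, here is a minimal sketch in Python. Everything in it is illustrative: the released benchmark grades with a model-based grader rather than string matching, and the abstention markers below are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not_attempted"


@dataclass
class Item:
    question: str
    gold_answer: str  # the single, indisputable answer


def grade_response(response: str, item: Item) -> Grade:
    """Toy grader: normalize and compare strings. The released eval uses a
    model-based grader; exact matching here is only a stand-in."""
    norm = response.strip().lower()
    # Hypothetical abstention markers; real responses need a model to judge.
    if norm in {"", "i don't know", "unsure"}:
        return Grade.NOT_ATTEMPTED
    if norm == item.gold_answer.strip().lower():
        return Grade.CORRECT
    return Grade.INCORRECT
```

The point of the three bins is that "not attempted" is never scored as harshly as a wrong answer, which is what separates calibration from raw recall.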

Core claim

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer.

What carries the argument

The SimpleQA benchmark itself: a set of short fact-seeking questions collected adversarially against GPT-4 and restricted to single indisputable answers, a design that supports grading each response as correct, incorrect, or not attempted and so measures factual accuracy and confidence calibration together.

If this is right

  • Models can be scored on whether they maximize correct answers while abstaining on uncertain facts; one natural aggregation of such scores is sketched after this list.
  • The single-answer design makes automatic grading reliable without complex rubrics.
  • Adversarial collection against current top models aims to maintain difficulty for the next several generations.
  • The benchmark supplies a targeted, simple signal for short-form factuality separate from long-form or reasoning tasks.
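
A sketch of how the first bullet could be operationalized. The metric names and the harmonic-mean combination are assumptions for illustration; the paper defines its own headline metrics.

```python
from collections import Counter


def score(grades: list[str]) -> dict[str, float]:
    """Aggregate three-way grades ("correct" / "incorrect" / "not_attempted")
    into calibration-aware numbers. Names are illustrative, not the paper's."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]
    overall = correct / total if total else 0.0
    given_attempted = correct / attempted if attempted else 0.0
    # A harmonic mean of the two favors models that attempt only what they
    # know: guessing lowers given_attempted, blanket abstention lowers overall.
    denom = overall + given_attempted
    f = 2 * overall * given_attempted / denom if denom else 0.0
    return {"overall": overall, "given_attempted": given_attempted, "f": f}


# e.g. score(["correct", "not_attempted", "incorrect", "correct"])
# -> {"overall": 0.5, "given_attempted": 0.666..., "f": 0.571...}
```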

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training methods that reward abstention on low-confidence facts could be directly optimized against SimpleQA scores; a minimal reward-shaping sketch follows this list.
  • The same adversarial single-answer pattern might transfer to creating short-form benchmarks in other domains such as science or history.
  • Widespread adoption could shift deployment decisions toward models that demonstrate better knowledge calibration rather than raw accuracy alone.
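
One way to cash out the first bullet: under a shaped reward that penalizes wrong answers, a calibrated model should abstain below a confidence threshold. The reward values and the threshold derivation below are assumptions for illustration, not anything the paper specifies.

```python
def reward(grade: str, wrong_penalty: float = 1.0) -> float:
    """Hypothetical reward shaping: +1 correct, 0 abstained, -penalty wrong."""
    return {"correct": 1.0, "not_attempted": 0.0,
            "incorrect": -wrong_penalty}[grade]


def should_attempt(p_correct: float, wrong_penalty: float = 1.0) -> bool:
    """Expected reward of answering is p * 1 - (1 - p) * wrong_penalty, versus
    0 for abstaining, so answering pays off iff p > penalty / (1 + penalty).
    With penalty 1.0 the threshold is 0.5; raising it pushes toward abstention."""
    return p_correct > wrong_penalty / (1.0 + wrong_penalty)
```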

Load-bearing premise

That questions can be created with only a single indisputable answer, and that adversarial collection against GPT-4 will keep those questions challenging for future models.

What would settle it

A new frontier model achieving near-perfect accuracy on SimpleQA while still attempting every question, or discovery that a substantial fraction of the questions admit multiple valid answers upon independent review.

read the original abstract

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SimpleQA, a benchmark for evaluating large language models on short fact-seeking questions. It prioritizes two properties: adversarial collection against GPT-4 responses to ensure challenge, and question construction guaranteeing a single indisputable answer to enable easy grading of model outputs as correct, incorrect, or not attempted. The benchmark is positioned as a targeted measure of whether models 'know what they know,' with the expectation that it will remain relevant for future frontier models.

Significance. If the core properties hold after validation, SimpleQA would supply a lightweight, focused tool for probing epistemic confidence in LLMs, distinct from longer-form or multi-answer evaluations.

major comments (2)
  1. [Abstract] The assertion that 'questions are created such that there exists only a single, indisputable answer' is presented without any described validation procedure, inter-annotator agreement metrics, or collection details, which directly undermines the reliability of the three-way grading scheme for isolating model confidence.
  2. [Abstract] No evidence, analysis, or discussion is supplied to show that adversarial collection against GPT-4 produces questions whose difficulty persists for subsequent frontier models, which is required to substantiate the claim of long-term benchmark relevance.
minor comments (1)
  1. [Abstract] The GitHub repository link is given but the abstract supplies no dataset statistics (e.g., question count, topic distribution) or sample items that would allow immediate assessment of the benchmark's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the abstract and supporting details.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'questions are created such that there exists only a single, indisputable answer' is presented without any described validation procedure, inter-annotator agreement metrics, or collection details, which directly undermines the reliability of the three-way grading scheme for isolating model confidence.

    Authors: We agree that the abstract is too concise on this point. The full manuscript (Section 3) details the adversarial collection process against GPT-4, including human verification steps to enforce single-answer questions and reported inter-annotator agreement. To address the concern directly, we will revise the abstract to include a brief summary of the validation procedure and agreement metrics, thereby better supporting the three-way grading scheme. Revision: yes. (An illustrative agreement computation is sketched after these responses.)

  2. Referee: [Abstract] No evidence, analysis, or discussion is supplied to show that adversarial collection against GPT-4 produces questions whose difficulty persists for subsequent frontier models, which is required to substantiate the claim of long-term benchmark relevance.

    Authors: We acknowledge that no prospective empirical evidence can be supplied for future models, as this is inherently unknowable. The abstract presents the long-term relevance as a hope rather than a proven claim, grounded in the adversarial design against a current frontier model. We will revise the wording to clarify this aspirational framing and add a short discussion of the design rationale intended to support continued relevance. Revision: yes.
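
The first response leans on reported inter-annotator agreement. As a reference point, here is a minimal Cohen's kappa computation; the choice of kappa (rather than raw agreement or another statistic) is an assumption for illustration, since the abstract does not say which metric the paper reports.

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items: observed
    agreement corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


# e.g. two annotators marking whether each question has a single answer:
# cohens_kappa(["yes", "yes", "no", "yes"], ["yes", "no", "no", "yes"]) -> 0.5
```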

Circularity Check

0 steps flagged

No circularity in benchmark construction

full rationale

The paper constructs SimpleQA by direct methodological choices: adversarial collection against GPT-4 responses to ensure challenge, and question design ensuring a single indisputable answer for easy grading. These are definitional inputs of the benchmark rather than outputs derived from equations, fitted parameters, or self-citations. No load-bearing steps reduce to prior results by construction; the evaluation framing for 'know what they know' is an intended application, not a self-referential derivation. The construction is self-contained and does not lean on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that short questions with indisputable single answers can be reliably constructed and that adversarial filtering against GPT-4 produces lasting difficulty.

axioms (1)
  • domain assumption: There exists only a single, indisputable answer for each question in the benchmark
    This assumption enables the easy grading scheme described in the abstract.

pith-pipeline@v0.9.0 · 5462 in / 1118 out tokens · 36089 ms · 2026-05-15T06:41:33.689500+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Linked passage: "questions are created such that there exists only a single, indisputable answer"
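
For readers unfamiliar with the "echoes" relation, the shared shape is a uniqueness claim: a property with exactly one witness. A purely illustrative Lean sketch of that pattern follows; it is not the canon's actual LawOfExistence.defect_zero_iff_one statement, whose content the page does not show.

```lean
import Mathlib.Data.Set.Basic

-- Illustrative only: the generic "exactly one qualifying answer" shape that
-- the ECHOES tag gestures at, not the linked Recognition theorem itself.
example (answers : Set String) (gold : String) (h : answers = {gold}) :
    ∃! a, a ∈ answers := by
  subst h
  exact ⟨gold, rfl, fun _ hb => hb⟩
```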

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 unverdicted novelty 7.0

    StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

  2. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 accept novelty 7.0

    StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

  3. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  4. Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

    cs.CR 2026-04 unverdicted novelty 7.0

    R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.

  5. NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

    cs.CL 2026-04 unverdicted novelty 7.0

    NameBERT models trained on LLM-augmented academic name data outperform state-of-the-art baselines in nationality classification from names, with augmentation providing gains especially on tail countries.

  6. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL 2026-04 unverdicted novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

  7. CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

    cs.CL 2026-03 unverdicted novelty 7.0

    CounterRefine improves factual QA by retrieving answer-conditioned counterevidence and deterministically refining draft answers, lifting a GPT-5 RAG baseline by 5.8 points to 73.1% on SimpleQA.

  8. Decomposing and Steering Functional Metacognition in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.

  9. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  10. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

    cs.AI 2026-04 unverdicted novelty 6.0

    Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

  11. Evaluation of Agents under Simulated AI Marketplace Dynamics

    cs.IR 2026-04 unverdicted novelty 6.0

    Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.

  12. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  13. WRAP++: Web discoveRy Amplified Pretraining

    cs.CL 2026-04 unverdicted novelty 6.0

    WRAP++ amplifies Wikipedia data from 8.4B to 80B tokens by creating cross-document QA from hyperlink motifs, yielding better SimpleQA performance and scaling for 7B and 32B OLMo models than single-document methods.

  14. Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

    cs.CR 2026-04 unverdicted novelty 6.0

    Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.

  15. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  16. Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

    cs.IR 2026-05 conditional novelty 5.0

    PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

  17. Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness

    cs.CL 2026-04 unverdicted novelty 5.0

    GeoDe constructs a truth hyperplane with linear probes and uses geometric distance as a confidence signal to filter gray zone samples during fine-tuning, leading to better truthfulness and OOD generalization in LLMs.

  18. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  19. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  20. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  21. EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

    cs.AI 2026-04 unverdicted novelty 4.0

    Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 20 Pith papers · 3 internal anchors

  1. [1]

    Agrawal, A

    A. Agrawal, M. Suzgun, L. Mackey, and A. T. Kalai. Do language models know when they're hallucinating references? Findings of EACL, 2024. URL https://arxiv.org/abs/2305.18248

  2. [2]

    Anthropic

    Anthropic. Claude 3 model card, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  3. [3]

    Cheng, Q

    Q. Cheng, T. Sun, W. Zhang, S. Wang, X. Liu, M. Zhang, J. He, M. Huang, Z. Yin, K. Chen, et al. Evaluating hallucinations in Chinese large language models. arXiv preprint arXiv:2310.03368, 2023. URL https://openreview.net/forum?id=1AXvGjfF0V

  4. [4]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. of ACL, 2017. URL https://arxiv.org/abs/1705.03551

  5. [5]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Cl...

  6. [6]

    A. T. Kalai. Personal communication, July 9, 2024.

  7. [7]

    Krishna, S

    S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. arXiv preprint, 2024. URL https://arxiv.org/abs/2409.12941

  8. [8]

    Kwiatkowski, T

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. TACL, 2019. URL https://aclanthology.org/Q19-1026

  9. [9]

    J. Li, X. Cheng, X. Zhao, J.-Y. Nie, and J.-R. Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In EMNLP, 2023. URL https://arxiv.org/abs/2305.11747

  10. [10]

    S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In ACL, 2022a. URL https://aclanthology.org/2022.acl-long.229

  11. [11]

    S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words. In TMLR, 2022b. URL https://arxiv.org/abs/2205.14334

  12. [12]

    S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In EMNLP, 2023. URL https://aclanthology.org/2023.emnlp-main.741

  13. [13]

    Hello GPT-4o, 2024a

    OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/

  14. [14]

    OpenAI o1-mini, 2024b

    OpenAI. OpenAI o1-mini, 2024b. URL https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

  15. [15]

    Learning to reason with LLMs, 2024c

    OpenAI. Learning to reason with LLMs, 2024c. URL https://openai.com/index/learning-to-reason-with-llms/

  16. [16]

    T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y.-H. Sung, D. Zhou, Q. Le, and T. Luong. FreshLLMs: Refreshing large language models with search engine augmentation, 2023. URL https://arxiv.org/abs/2310.03214

  17. [17]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In ICML, 2023. URL https://arxiv.org/abs/2203.11171

  18. [18]

    J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, J. Huang, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, and Q. V. Le. Long-form factuality in large language models. In NeurIPS, 2024. URL https://arxiv.org/abs/2403.18802

  19. [19]

    Y. Zhao, J. Zhang, I. Chern, S. Gao, P. Liu, J. He, et al. FELM: Benchmarking factuality evaluation of large language models. NeurIPS, 2024. URL https://arxiv.org/abs/2310.00741