Recognition: 3 theorem links
· Lean Theorem
Measuring short-form factuality in large language models
Pith reviewed 2026-05-15 06:41 UTC · model grok-4.3
The pith
The SimpleQA benchmark measures whether language models know what they know on short factual questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer.
What carries the argument
The SimpleQA benchmark, a set of short fact-seeking questions collected adversarially against GPT-4 and restricted to single indisputable answers, which supports grading as correct, incorrect, or not attempted to measure factual accuracy and confidence calibration.
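As one concrete reading of that grading scheme, here is a minimal sketch of how per-question grades could be aggregated into headline metrics. The three labels come straight from the abstract; the `correct_given_attempted` ratio and the harmonic-mean combination are assumptions about how such grades might be summarized, not details the abstract specifies.

```python
from collections import Counter
from enum import Enum

class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not_attempted"

def aggregate(grades: list[Grade]) -> dict[str, float]:
    """Summarize three-way per-question grades into headline metrics."""
    assert grades, "need at least one graded question"
    n = len(grades)
    counts = Counter(grades)
    correct = counts[Grade.CORRECT] / n
    incorrect = counts[Grade.INCORRECT] / n
    not_attempted = counts[Grade.NOT_ATTEMPTED] / n
    attempted = correct + incorrect
    # Accuracy restricted to the questions the model chose to answer.
    correct_given_attempted = correct / attempted if attempted else 0.0
    # One hedged way to fold coverage and precision into a single score
    # (a harmonic mean); the paper may weight these differently.
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }
```

Under this summary, a model that abstains broadly can keep `correct_given_attempted` high while its overall `correct` stays low, which is exactly the tension the ideal-behavior sentence in the core claim is pointing at.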
If this is right
- Models can be scored on whether they maximize correct answers while abstaining on uncertain facts (see the sketch after this list).
- The single-answer design makes automatic grading reliable without complex rubrics.
- Adversarial collection against GPT-4, a current frontier model, aims to keep the questions difficult for the next several generations of models.
- The benchmark supplies a targeted, simple signal for short-form factuality separate from long-form or reasoning tasks.
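To make the first bullet above concrete, here is a minimal sketch of the attempt-or-abstain decision under a penalized three-way score. The penalty value and the decision rule are illustrative assumptions; the abstract does not say how "not attempted" responses are weighted.

```python
def should_attempt(confidence: float, wrong_penalty: float) -> bool:
    """Attempt-or-abstain rule under a penalized three-way score.

    If a correct answer scores +1, abstaining scores 0, and a wrong
    answer scores -wrong_penalty, then answering with confidence p has
    expected score p - wrong_penalty * (1 - p), which beats abstaining
    exactly when p > wrong_penalty / (1 + wrong_penalty).
    """
    return confidence > wrong_penalty / (1 + wrong_penalty)

# With a 2-point penalty per wrong answer, a calibrated model should
# only attempt questions it is at least ~66.7% sure about.
assert should_attempt(0.70, wrong_penalty=2.0)
assert not should_attempt(0.60, wrong_penalty=2.0)
```

A model that maximizes such a score is rewarded precisely for knowing what it knows: the better its confidence estimates, the closer it can sit to the threshold without losing points.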
Where Pith is reading between the lines
- Training methods that reward abstention on low-confidence facts could be directly optimized against SimpleQA scores.
- The same adversarial single-answer pattern might transfer to creating short-form benchmarks in other domains such as science or history.
- Widespread adoption could shift deployment decisions toward models that demonstrate better knowledge calibration rather than raw accuracy alone.
Load-bearing premise
That questions can be created with only a single, indisputable answer, and that adversarial collection against GPT-4 will keep those questions challenging for future models.
What would settle it
A new frontier model achieving near-perfect accuracy on SimpleQA while still attempting every question, or discovery that a substantial fraction of the questions admit multiple valid answers upon independent review.
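The second falsifier lends itself to a mechanical check. A hedged sketch, assuming independent reviewer answers have been collected per question (the data format here is hypothetical):

```python
def ambiguous_fraction(reviews: dict[str, list[str]]) -> float:
    """Fraction of questions whose independent reviewers produced more
    than one distinct answer after light normalization."""
    def norm(answer: str) -> str:
        return " ".join(answer.lower().split())
    assert reviews, "need at least one reviewed question"
    ambiguous = sum(
        1 for answers in reviews.values()
        if len({norm(a) for a in answers}) > 1
    )
    return ambiguous / len(reviews)

reviews = {
    "q1": ["Paris", "paris"],   # agrees once normalized
    "q2": ["1905", "1906"],     # genuine disagreement
}
assert ambiguous_fraction(reviews) == 0.5
```

String normalization is of course a crude proxy for "multiple valid answers"; a real audit would need semantic matching, but even this rough count would surface a substantial ambiguity rate.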
Original abstract
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SimpleQA, a benchmark for evaluating large language models on short fact-seeking questions. It prioritizes two properties: adversarial collection against GPT-4 responses to ensure challenge, and question construction guaranteeing a single indisputable answer to enable easy grading of model outputs as correct, incorrect, or not attempted. The benchmark is positioned as a targeted measure of whether models 'know what they know,' with the expectation that it will remain relevant for future frontier models.
Significance. If the core properties hold after validation, SimpleQA would supply a lightweight, focused tool for probing epistemic confidence in LLMs, distinct from longer-form or multi-answer evaluations.
Major comments (2)
- [Abstract] The assertion that 'questions are created such that there exists only a single, indisputable answer' is presented without any described validation procedure, inter-annotator agreement metrics, or collection details, which directly undermines the reliability of the three-way grading scheme for isolating model confidence (a minimal agreement check is sketched after this list).
- [Abstract] No evidence, analysis, or discussion is supplied to show that adversarial collection against GPT-4 produces questions whose difficulty persists for subsequent frontier models, which is required to substantiate the claim of long-term benchmark relevance.
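For reference, the agreement statistic the first comment asks for could be reported with something as simple as Cohen's kappa over duplicate annotations. A minimal sketch, assuming two annotators label the same items; nothing in the abstract describes the actual procedure:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two annotators assigning categorical labels
    (e.g. correct / incorrect / not_attempted) to the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Example: 3 of 4 items agree, chance agreement is 0.3125, kappa ~ 0.64.
a = ["correct", "correct", "incorrect", "not_attempted"]
b = ["correct", "incorrect", "incorrect", "not_attempted"]
assert abs(cohens_kappa(a, b) - 0.636) < 0.01
```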
Minor comments (1)
- [Abstract] The GitHub repository link is given but the abstract supplies no dataset statistics (e.g., question count, topic distribution) or sample items that would allow immediate assessment of the benchmark's scope.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the abstract and supporting details.
Point-by-point responses
- Referee: [Abstract] The assertion that 'questions are created such that there exists only a single, indisputable answer' is presented without any described validation procedure, inter-annotator agreement metrics, or collection details, which directly undermines the reliability of the three-way grading scheme for isolating model confidence.
  Authors: We agree that the abstract is too concise on this point. The full manuscript (Section 3) details the adversarial collection process against GPT-4, including human verification steps to enforce single-answer questions and reported inter-annotator agreement. To address the concern directly, we will revise the abstract to include a brief summary of the validation procedure and agreement metrics, thereby better supporting the three-way grading scheme. Revision: yes.
- Referee: [Abstract] No evidence, analysis, or discussion is supplied to show that adversarial collection against GPT-4 produces questions whose difficulty persists for subsequent frontier models, which is required to substantiate the claim of long-term benchmark relevance.
  Authors: We acknowledge that no prospective empirical evidence can be supplied for future models, as this is inherently unknowable. The abstract presents long-term relevance as a hope rather than a proven claim, grounded in the adversarial design against a current frontier model. We will revise the wording to clarify this aspirational framing and add a short discussion of the design rationale intended to support continued relevance. Revision: yes.
Circularity Check
No circularity in benchmark construction
Full rationale
The paper constructs SimpleQA by direct methodological choices: adversarial collection against GPT-4 responses to ensure challenge, and question design ensuring a single indisputable answer for easy grading. These are definitional inputs of the benchmark rather than outputs derived from equations, fitted parameters, or self-citations. No load-bearing steps reduce to prior results by construction; the evaluation framing for 'know what they know' is an intended application, not a self-referential derivation. The paper's argument is self-contained rather than resting on external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: there exists only a single, indisputable answer for each question in the benchmark (formalized in the hedged sketch below).
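One hedged way to state this axiom formally, in the spirit of the Lean connection listed next; every name here (`Question`, `Answer`, `Indisputable`) is hypothetical and not drawn from any actual formal canon:

```lean
-- Hypothetical formalization of the single-answer axiom: every
-- benchmark question has exactly one indisputable answer.
axiom Question : Type
axiom Answer : Type
axiom Indisputable : Question → Answer → Prop

-- The load-bearing premise as a uniqueness (∃!) statement.
axiom single_answer : ∀ q : Question, ∃! a : Answer, Indisputable q a
```

Phrased this way, the 'echoes' tag below is easy to read: the benchmark assumes existence and uniqueness by construction, which is the same logical shape as an existence-uniqueness theorem, not a consequence of one.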
Lean theorems connected to this paper
- LawOfExistence.defect_zero_iff_one (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: 'questions are created such that there exists only a single, indisputable answer'
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
  StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
- StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
  StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
- Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
  RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
- Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
  R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
- NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
  NameBERT models trained on LLM-augmented academic name data outperform state-of-the-art baselines in nationality classification from names, with augmentation providing gains especially on tail countries.
- BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
  BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
- CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
  CounterRefine improves factual QA by retrieving answer-conditioned counterevidence and deterministically refining draft answers, lifting a GPT-5 RAG baseline by 5.8 points to 73.1% on SimpleQA.
- Decomposing and Steering Functional Metacognition in Large Language Models
  LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
  BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
- Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
  Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
- Evaluation of Agents under Simulated AI Marketplace Dynamics
  Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
- Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
  Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
- WRAP++: Web discoveRy Amplified Pretraining
  WRAP++ amplifies Wikipedia data from 8.4B to 80B tokens by creating cross-document QA from hyperlink motifs, yielding better SimpleQA performance and scaling for 7B and 32B OLMo models than single-document methods.
- Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
  Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
- Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
  PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
- Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness
  GeoDe constructs a truth hyperplane with linear probes and uses geometric distance as a confidence signal to filter gray zone samples during fine-tuning, leading to better truthfulness and OOD generalization in LLMs.
- JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
  JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- Humanity's Last Exam
  Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
- EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
  Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.
Reference graph
Works this paper leans on
- [1] A. Agrawal, M. Suzgun, L. Mackey, and A. T. Kalai. Do language models know when they're hallucinating references? Findings of EACL, 2024. URL https://arxiv.org/abs/2305.18248
- [2]
- [3]
- [4] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. of ACL, 2017. URL https://arxiv.org/abs/1705.03551
- [5] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Cl... Language Models (Mostly) Know What They Know. arXiv, 2022.
- [6] A. T. Kalai. Personal communication, July 9, 2024.
- [7] S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. arXiv preprint, 2024. URL https://arxiv.org/abs/2409.12941
- [8] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. TACL, 2019. URL https://aclanthology.org/Q19-1026
- [9]
- [10] S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In ACL, 2022a. URL https://aclanthology.org/2022.acl-long.229
- [11]
- [12] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In EMNLP, 2023. URL https://aclanthology.org/2023.emnlp-main.741
- [13] OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/
- [14] OpenAI. OpenAI o1-mini, 2024b. URL https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
- [15] OpenAI. Learning to reason with LLMs, 2024c. URL https://openai.com/index/learning-to-reason-with-llms/
- [16]
- [17] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In ICML, 2023. URL https://arxiv.org/abs/2203.11171
- [18]
- [19]