arxiv: 2601.14952 · v2 · submitted 2026-01-21 · 💻 cs.CL · cs.AI

CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Pith reviewed 2026-05-16 12:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords CorpusQA · long-context reasoning · benchmark · retrieval-augmented generation · document corpus · LLM evaluation · data synthesis

The pith

A new 10-million-token benchmark shows that long-context LLMs and retrieval systems fail when evidence must be integrated across hundreds of dispersed documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CorpusQA to test models on corpus-level reasoning tasks that require global integration, comparison, and statistical aggregation of information spread across vast collections of documents. Existing benchmarks either use single long texts or assume answers come from a few relevant chunks, which does not match real scenarios where evidence is highly dispersed. A synthesis framework generates complex queries by separating reasoning logic from text, producing examples with programmatically guaranteed correct answers. Experiments demonstrate that even state-of-the-art long-context models degrade as input length grows, while standard retrieval-augmented systems collapse entirely. The work also shows that fine-tuning on the synthesized data improves general long-context reasoning and suggests memory-augmented agentic architectures as a stronger alternative.

Core claim

CorpusQA is a benchmark scaling to 10 million tokens that evaluates holistic reasoning over unstructured document repositories. The synthesis framework decouples reasoning from textual representation to create computation-intensive queries whose ground-truth answers are guaranteed, without relying on human annotation. Experiments establish that state-of-the-art long-context LLMs struggle as input length increases and that standard retrieval-augmented generation systems fail completely when evidence is dispersed across hundreds of documents.

What carries the argument

The data synthesis framework that decouples reasoning from textual representation to generate complex queries with programmatically guaranteed ground-truth answers.
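A minimal sketch of what "decoupling reasoning from textual representation" could look like in practice, assuming a toy schema. The names `Record`, `make_records`, `render`, and `answer_count_above`, the revenue field, and the sentence template are all hypothetical illustrations, not the paper's actual framework. The point is only that the gold answer is computed from structured records, while the model sees nothing but rendered text.

```python
import random
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical structured fact underlying one document."""
    company: str
    revenue: int  # millions of dollars (illustrative field)


def make_records(n, seed=0):
    # Sample the structured "world" first; no text is involved yet.
    rng = random.Random(seed)
    return [Record(f"Company{i}", rng.randint(10, 500)) for i in range(n)]


def render(rec):
    # Textual representation: a separate step from the reasoning logic.
    return f"{rec.company} reported annual revenue of {rec.revenue} million dollars."


def answer_count_above(records, threshold):
    # Reasoning logic runs on the records, so the answer is exact by construction.
    return sum(1 for r in records if r.revenue > threshold)


records = make_records(200)
corpus = [render(r) for r in records]  # what the model would read
query = "How many companies reported revenue above 250 million dollars?"
gold = answer_count_above(records, 250)  # programmatic ground truth
```

Answering `query` from `corpus` requires touching every document, which is the dispersed-evidence property the benchmark targets. A real pipeline would presumably use far richer templates and query types than this single counting question.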

Load-bearing premise

The synthesized queries accurately reflect the difficulty of real-world corpus-level reasoning tasks.

What would settle it

An experiment in which a long-context model or retrieval system maintains high accuracy on CorpusQA queries as the number of documents and total tokens scale to 10 million would falsify the central claim.

Original abstract

While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CorpusQA, a benchmark of up to 10 million tokens for evaluating LLMs on corpus-level reasoning where evidence is dispersed across hundreds of documents. It proposes a data synthesis framework that decouples reasoning from text to generate computation-intensive queries with programmatically guaranteed ground truth, avoiding reliance on human annotation. Experiments demonstrate performance degradation in state-of-the-art long-context models as length increases, complete failure of standard RAG systems on dispersed evidence, and improved general long-context capabilities after fine-tuning on the synthesized data. The authors conclude that memory-augmented agentic architectures are more robust and advocate shifting focus from context-window extension to global synthesis methods.

Significance. If the synthesis framework produces queries whose difficulty and statistical properties match genuine dispersed-evidence scenarios, the benchmark would fill a clear gap in evaluating holistic corpus-level reasoning and provide actionable evidence that current retrieval and long-context approaches are insufficient. The scale (10M tokens), the fine-tuning transfer result, and the comparison to agentic baselines would make it a useful resource for the field.

major comments (3)
  1. [§3, Data Synthesis Framework] The description does not specify how the framework enforces diversity across query types or prevents leakage from the underlying corpora into the generated queries. Without these controls, it remains possible that models fail for reasons orthogonal to holistic integration, undermining the central claim that the benchmark tests true corpus-level reasoning.
  2. [§4.2–4.3, Experimental Results] No quantitative tables, error bars, or statistical significance tests are referenced for the reported length-degradation curves or RAG collapse; the abstract and main claims rest on unreported numerical evidence, making it impossible to assess effect sizes or reproducibility.
  3. [§5, Validation and Real-World Fidelity] No human evaluation, linguistic-statistical comparison to naturally occurring multi-document tasks, or ablation on synthetic artifacts is provided to test whether the programmatically generated queries preserve the difficulty distribution of real dispersed-evidence problems.
minor comments (2)
  1. [Abstract] The abstract claims 'extensive experiments' but does not summarize any concrete metrics (e.g., accuracy drops, token lengths tested); a one-sentence quantitative highlight would improve readability.
  2. [Figures] Figure legends and axis labels in the length-scaling plots should explicitly state the number of documents and total token range for each condition.
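For concreteness, the error bars and paired tests requested in major comment 2 could be computed along these lines. This is a sketch only: the per-query scores are placeholder values invented for illustration (not results from the paper), `paired_permutation_test` is a hypothetical helper, and a sign-flip permutation test is just one reasonable choice of paired test.

```python
import random
import statistics


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query scores."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Placeholder data: 1 = correct, 0 = wrong, same 12 queries in both conditions.
short_ctx = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
long_ctx = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

acc_short = statistics.mean(short_ctx)
acc_long = statistics.mean(long_ctx)
sd_short = statistics.stdev(short_ctx)  # sample standard deviation for error bars
sd_long = statistics.stdev(long_ctx)
p_value = paired_permutation_test(short_ctx, long_ctx)
```

Pairing matters here because the same queries are evaluated at each context length; an unpaired test would discard that structure and lose power.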

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [§3, Data Synthesis Framework] The description does not specify how the framework enforces diversity across query types or prevents leakage from the underlying corpora into the generated queries. Without these controls, it remains possible that models fail for reasons orthogonal to holistic integration, undermining the central claim that the benchmark tests true corpus-level reasoning.

    Authors: We agree that §3 would benefit from greater explicitness on these controls. The framework generates queries from abstract reasoning graphs that are deliberately decoupled from corpus content, and we already apply template stratification and corpus-subset randomization to promote diversity. In the revision we will add a dedicated subsection with pseudocode and concrete parameters for diversity enforcement and leakage prevention (e.g., n-gram filtering against source documents). This will make the argument that observed failures stem from holistic integration requirements more rigorous. revision: yes

  2. Referee: [§4.2–4.3, Experimental Results] No quantitative tables, error bars, or statistical significance tests are referenced for the reported length-degradation curves or RAG collapse; the abstract and main claims rest on unreported numerical evidence, making it impossible to assess effect sizes or reproducibility.

    Authors: We accept this criticism. The current draft presents trends qualitatively; the revision will include full numerical tables for all length-scaling and RAG experiments, report standard deviations as error bars, and add paired statistical tests (with p-values) for the degradation and collapse effects. These additions will be placed in §4.2–4.3 and referenced from the abstract and conclusion. revision: yes

  3. Referee: [§5, Validation and Real-World Fidelity] No human evaluation, linguistic-statistical comparison to naturally occurring multi-document tasks, or ablation on synthetic artifacts is provided to test whether the programmatically generated queries preserve the difficulty distribution of real dispersed-evidence problems.

    Authors: We acknowledge the value of external validation. The revision will add a new subsection in §5 that provides linguistic-statistical comparisons (token-type ratios, dependency depth, multi-hop count) against existing multi-document QA corpora and an ablation isolating synthetic artifacts. A full-scale human evaluation of query fidelity lies outside the scope of the present work given the 10 M token scale and our emphasis on programmatic ground truth; we will explicitly list this as a limitation and outline a feasible protocol for future studies. revision: partial
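The n-gram filtering mentioned in the first response could take roughly the following form. This is a sketch under stated assumptions: whitespace tokenization, a 5-gram window, and the helper names `ngrams` and `leaks` are all illustrative choices, not the authors' actual procedure, and the example sentences are made up.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaks(query, documents, n=5):
    """True if the query shares any n-gram with any source document."""
    query_grams = ngrams(query, n)
    return any(query_grams & ngrams(doc, n) for doc in documents)


# Made-up example document and queries.
docs = ["Acme Corp reported annual revenue of 120 million dollars in 2023."]
copied = "Which firm reported annual revenue of 120 million dollars last year?"
abstract = "How many companies exceeded 100 million in revenue?"

flag_copied = leaks(copied, docs)      # shares "reported annual revenue of 120"
flag_abstract = leaks(abstract, docs)  # no shared 5-gram
```

A query that copies surface phrasing from a source document can be matched lexically, which would let a retriever succeed without any holistic integration; filtering such queries is one way to keep failures attributable to the reasoning requirement. At the 10M-token scale a production version would likely index document n-grams once rather than recompute them per query.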

Circularity Check

0 steps flagged

Empirical benchmark construction with no circular derivation

The paper introduces CorpusQA, a benchmark generated via a data synthesis framework that decouples reasoning from textual representation to produce queries with programmatic ground truth. This is an empirical contribution centered on dataset creation and LLM evaluation experiments rather than any mathematical derivation chain. No equations, predictions, or first-principles results are present that could reduce to fitted parameters or self-definitions by construction. Claims about LLM degradation with length and RAG collapse are supported by reported experiments on the synthesized data, not by any self-referential logic or load-bearing self-citations. The synthesis approach is presented as novel without invoking prior author theorems as uniqueness proofs. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the synthetic queries faithfully capture the computational demands of real dispersed-evidence reasoning; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Synthetic queries generated by decoupling reasoning from textual representation produce ground-truth answers that match the difficulty of human-authored corpus-level questions.
    Invoked when claiming the benchmark challenges systems on holistic reasoning without fallible human annotation.

pith-pipeline@v0.9.0 · 5540 in / 1295 out tokens · 38548 ms · 2026-05-16T12:37:14.179826+00:00 · methodology
