Overview of the TREC 2025 RAGTIME Track

Andrew Yates; Dawn Lawrie; Eugene Yang; James Mayfield; Luca Soldaini; Sean MacAvaney

arxiv: 2602.10024 · v2 · submitted 2026-02-10 · 💻 cs.IR · cs.CL

Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Eugene Yang, Andrew Yates

Pith reviewed 2026-05-16 02:22 UTC · model grok-4.3

keywords: TREC 2025 · RAGTIME track · multilingual report generation · multilingual information retrieval · Arabic Chinese English Russian · news documents · evaluation benchmark

The pith

The RAGTIME track creates a benchmark for report generation from multilingual news documents in four languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the RAGTIME track at TREC 2025, whose goal is to evaluate how well systems generate reports from source documents in multiple languages. It assembled a collection of news stories in Arabic, Chinese, English, and Russian to support this evaluation. Three tasks were defined: generating reports in multiple languages, generating reports in English only, and retrieving information across languages. Thirteen teams contributed 125 runs in total, and the overview reports the outcomes of those submissions. The setup supplies a shared testbed for comparing methods that handle mixed-language inputs.

Core claim

The RAGTIME track has created a document collection containing Arabic, Chinese, English, and Russian news stories and includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR), with a total of 125 runs submitted by 13 participating teams.

What carries the argument

The multilingual news document collection that supports the three defined tasks for report generation and cross-language retrieval.

If this is right

Performance numbers from the 125 runs supply initial baselines for measuring future progress on multilingual generation.
The tasks separate the effects of retrieval quality from generation quality across languages.
The collection allows direct head-to-head testing of systems on the same mixed-language inputs.
Results highlight where language-specific gaps remain in current retrieval and summarization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The track may encourage development of systems that preserve factual accuracy when crossing language boundaries.
It could be extended to measure how well generated reports support downstream decisions such as fact-checking.
Similar collections in other domains, such as scientific literature, would test whether the current news-focused design generalizes.

Load-bearing premise

The news collection and task definitions sufficiently represent real-world multilingual report generation scenarios.

What would settle it

An experiment showing that teams ranking high on these tasks produce reports that experts judge as unhelpful for actual multilingual news synthesis work.

Original abstract

The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a standard TREC track overview that accurately describes a new four-language news collection and three tasks but offers no analysis of results or methods.

This paper simply lays out the RAGTIME TREC track. It created a collection of news stories in Arabic, Chinese, English, and Russian, then defined three tasks: multilingual report generation, English-only report generation, and multilingual retrieval. The overview reports 125 runs from 13 teams plus the organizers' baselines and stops there. That is the full contribution.

It does a clean job of stating the collection details, task definitions, and participation numbers without overclaiming. The descriptions are direct and match what a track overview should provide. No derivations or fitted models appear, so there is nothing to check for circularity or parameter issues.

The soft spots are predictable for this genre. There is no breakdown of what the submitted runs actually produced, no comparison of approaches, and no discussion of whether the news-domain setup captures the harder parts of real multilingual report generation. The assumption that these tasks give meaningful signals is left implicit. That is fine for a logistics paper but limits how much anyone can take away beyond knowing the benchmark exists.

This is for researchers who plan to submit to the track or need the exact collection and task specs for their own evaluation work. It is not for readers seeking new techniques or empirical findings. I would bring it to a reading group only if the group is tracking new TREC benchmarks; otherwise it is skippable. It deserves peer review because TREC overviews serve as the official record of these setups, and this one is factually solid on its own terms.

Referee Report

0 major / 2 minor

Summary. The manuscript is an overview of the TREC 2025 RAGTIME track, whose goal is to study report generation from multilingual source documents. It describes the creation of a four-language news document collection (Arabic, Chinese, English, Russian), defines three tasks (Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval), and reports that 125 runs were submitted by 13 participating teams plus coordinator baselines, along with available results.

Significance. If the collection and task definitions are adopted as a community benchmark, the work will be significant for the IR and RAG communities by providing the first large-scale, publicly documented multilingual evaluation resource for report generation and cross-lingual retrieval, enabling direct comparisons across languages and system types.

minor comments (2)

[Abstract] The parenthetical remark on baselines would be clearer if it stated how many of the 125 runs were coordinator baselines versus participant runs.
[Task Definitions] The description of the MLIR task would benefit from an explicit statement of the evaluation metric (e.g., nDCG@10 or MAP) used to score the submitted runs.
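
For readers unfamiliar with the first metric the referee names, the following is a minimal sketch of nDCG@10 for a single topic. It is an illustration only: the review does not say which metric the track actually used, the document IDs and relevance grades are hypothetical, and the linear-gain form of DCG is one common variant chosen here as an assumption.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values.
    Uses linear gain (an assumption; exponential gain 2**g - 1 is also common)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one topic.
    qrels maps doc_id -> graded relevance; unjudged documents count as 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and ranking mixing documents from several languages.
qrels = {"ar-001": 3, "zh-014": 2, "ru-207": 1}
run = ["zh-014", "en-555", "ar-001", "ru-207"]
score = ndcg_at_k(run, qrels)
perfect = ndcg_at_k(["ar-001", "zh-014", "ru-207"], qrels)
```

A ranking that lists the judged documents in descending order of relevance scores 1.0; the example run scores lower because an unjudged document sits at rank 2 and the most relevant document appears only at rank 3.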

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept the manuscript. The overview of the TREC 2025 RAGTIME track is intended to document the new multilingual benchmark for report generation and retrieval tasks.

Circularity Check

0 steps flagged

No significant circularity

This is a purely descriptive TREC overview paper that reports the creation of a four-language news document collection, defines three tasks (Multilingual Report Generation, English Report Generation, and MLIR), and states the number of runs and participating teams. No equations, derivations, predictions, fitted parameters, or load-bearing claims exist that could reduce to self-definition, self-citation chains, or renaming of inputs. The central content consists of factual statements about track logistics and submissions, which are self-contained and externally verifiable through the track itself without any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a descriptive overview of a benchmark track with no free parameters, axioms, or invented entities in a mathematical or theoretical sense.

pith-pipeline@v0.9.0 · 5395 in / 989 out tokens · 75622 ms · 2026-05-16T02:22:52.735076+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams... document collection containing Arabic, Chinese, English, and Russian news stories."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Nugget coverage among the runs is lower than 0.5... F1 scores combine the sentence support and nugget coverage"
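
The quoted passage says that F1 combines sentence support and nugget coverage but does not give the formula. The sketch below assumes the conventional F1 definition, the harmonic mean of a precision-like term (sentence support) and a recall-like term (nugget coverage). The function name and input values are hypothetical, and the track's exact definition may differ.

```python
def f1_score(sentence_support, nugget_coverage):
    """Harmonic mean of two scores in [0, 1].
    Assumption: the track's F1 follows the standard 2PR / (P + R) form."""
    for value in (sentence_support, nugget_coverage):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    total = sentence_support + nugget_coverage
    if total == 0:
        return 0.0
    return 2 * sentence_support * nugget_coverage / total

# Hypothetical run: well-supported sentences but low nugget coverage.
example = f1_score(0.8, 0.4)
# Upper bound on F1 when coverage is 0.5 and every sentence is supported.
ceiling = f1_score(1.0, 0.5)
```

Under this assumption the harmonic mean is pulled toward the lower of the two inputs, so a run with nugget coverage below 0.5 cannot reach an F1 of 2/3 even if every sentence is fully supported.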

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.