Recognition: no theorem link
Benchmarking Real-Time Question Answering via Executable Code Workflows
Pith reviewed 2026-05-15 09:58 UTC · model grok-4.3
The pith
Even the best AI models achieve only 46 percent accuracy on real-time question answering tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RT-QA is a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time, with an agent-driven pipeline that autonomously generates code for web crawling and DOM-based extraction plus a self-repair mechanism, revealing that state-of-the-art models attain at most 46 percent accuracy.
What carries the argument
Agent-driven pipeline that generates executable code for web crawling and DOM extraction, together with a self-repair mechanism that adapts to changing page structures.
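The paper does not publish its pipeline code, but the self-repair idea can be sketched minimally. In this hypothetical version, each question carries an ordered list of candidate CSS selectors; when the primary selector stops matching (the page structure changed), the pipeline falls back to the next candidate and promotes it. All names here (`ExtractionTask`, `extract`, `fetch_dom`) are illustrative, not from the paper.

```python
# Minimal sketch of a self-repairing DOM extraction step (hypothetical;
# the paper's actual pipeline generates crawling code via an agent).
from dataclasses import dataclass


@dataclass
class ExtractionTask:
    url: str
    selectors: list   # ordered candidate CSS selectors, best first
    repairs: int = 0  # how many times self-repair fired


def extract(task: ExtractionTask, fetch_dom):
    """Try selectors in order; promote the first one that matches."""
    dom = fetch_dom(task.url)  # stand-in for a parsed DOM tree
    for i, sel in enumerate(task.selectors):
        value = dom.get(sel)   # stand-in for dom.select_one(sel)
        if value is not None:
            if i > 0:          # primary selector failed: self-repair
                task.selectors.insert(0, task.selectors.pop(i))
                task.repairs += 1
            return value
    raise LookupError(f"all selectors failed for {task.url}")
```

A real implementation would regenerate selectors with the agent rather than rotate a fixed list, but the promote-on-fallback loop captures why the benchmark's ground truth can survive page redesigns.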
If this is right
- Agents must shift from relying on initial search snippets to performing deeper website scans for accurate real-time data.
- Systems need explicit temporal state management to correctly anchor reasoning to the present moment rather than past dates.
- Benchmarks for agent capabilities should move beyond static datasets and incorporate live executable retrieval.
- Self-repair features in code generation pipelines will be required to keep evaluations valid as web pages evolve.
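The temporal state management point above can be made concrete with a small hedged sketch: before reasoning over a retrieved date, the agent re-anchors it against the evaluation-time clock instead of treating it as "now". The function name and staleness threshold are assumptions for illustration, not part of the paper.

```python
# Hypothetical temporal re-anchoring guard: a retrieved historical date
# (e.g. a 2024 event) is flagged as stale relative to the evaluation
# clock (2026), forcing a fresh retrieval instead of stale reasoning.
from datetime import date


def anchor(retrieved: date, now: date, max_staleness_days: int = 365):
    """Return (retrieved, is_stale); stale facts must not be treated as current."""
    staleness = (now - retrieved).days
    return retrieved, staleness > max_staleness_days
```

Under this policy, the Temporal Confusion failure mode described in the abstract becomes a detectable state error rather than a silent reasoning slip.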
Where Pith is reading between the lines
- Static benchmarks likely overestimate how well models handle knowledge that changes over time.
- Embedding similar live code evaluation directly into training could push models toward better real-time robustness.
- The same executable-workflow approach could transfer to other live-data domains such as financial reporting or breaking news.
Load-bearing premise
The autonomous code generation and self-repair process can produce reliable real-time ground truth answers despite ongoing changes to website structures.
What would settle it
If a spot-check found that the pipeline's extracted answers matched independently verified current facts on fewer than 80 percent of questions, the reported 46 percent model accuracy ceiling would be undermined.
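The settling check described above amounts to measuring an agreement rate. A minimal harness (hypothetical; no such tool ships with the paper) could compare pipeline-extracted answers against independently verified facts, question by question:

```python
# Sketch of a ground-truth validation check: the fraction of questions
# where the pipeline's extracted answer matches an independently
# verified fact. Well below ~0.8, the benchmark's 46% accuracy ceiling
# would be suspect.
def ground_truth_agreement(extracted: dict, verified: dict) -> float:
    """Agreement rate over questions present in both answer sets."""
    keys = extracted.keys() & verified.keys()
    if not keys:
        return 0.0
    hits = sum(extracted[q] == verified[q] for q in keys)
    return hits / len(keys)

# e.g. ground_truth_agreement({"q1": "125", "q2": "Kings"},
#                             {"q1": "125", "q2": "Lakers"}) == 0.5
```

Exact-match comparison is the simplest choice; numeric tolerance or normalization would be needed for answers like scores or prices.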
Original abstract
Retrieving real-time information is a fundamental capability for search-integrated agents in real-world applications. However, existing benchmarks are predominantly static and therefore fail to capture the temporal dynamics of information and the continuously evolving nature of real-world knowledge. To address this limitation, we propose RT-QA, a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time. Specifically, we construct an agent-driven pipeline that autonomously generates code for web crawling and DOM-based answer extraction to produce real-time ground truth. To ensure robust evaluation over time, the pipeline further incorporates a self-repair mechanism to adapt to changes in web page structures. RT-QA spans 12 domains (e.g., Finance, Sports) with 320 Chinese questions categorized into three difficulty levels. Extensive evaluations of state-of-the-art models (e.g., GPT-5.2, GLM-4.7) reveal significant limitations in real-time adaptability: even the best models achieve only 46% accuracy. Our analysis highlights two primary failure modes: (1) Lazy Retrieval, where agents rely on search snippets instead of deeply scanning specific websites for information (20% of failures); and (2) Temporal Confusion, a cognitive error where agents retrieve a historical date (e.g., an event in 2024) and fail to re-anchor to the current time (2026) for subsequent reasoning. These findings suggest that future agents require not just better retrieval strategies, but robust temporal state management.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RT-QA, a dynamic evaluation framework for real-time question answering that employs an agent-driven pipeline to autonomously generate executable code for web crawling and DOM-based extraction, thereby creating up-to-date ground truth with a self-repair mechanism to handle evolving web page structures. The benchmark includes 320 Chinese questions across 12 domains and three difficulty levels. Evaluations of state-of-the-art models show that even the best-performing models achieve only 46% accuracy, with identified failure modes including lazy retrieval (20% of failures) and temporal confusion.
Significance. If the ground truth generation pipeline is shown to be reliable, the work would usefully demonstrate limitations in current models' real-time retrieval and temporal reasoning capabilities for agentic systems. The executable workflow approach for producing temporally dynamic benchmarks is a constructive direction that static datasets cannot replicate.
Major comments (1)
- The description of the agent-driven pipeline (abstract and corresponding methods section) provides no validation of the autonomously generated ground truth: there are no mentions of human spot-checks on extracted answers, cross-verification against independent sources, or measured extraction error rates. Because the headline result (maximum 46% accuracy) and the two failure-mode percentages rest entirely on the correctness of this real-time ground truth, the absence of such checks leaves open the possibility that extraction errors are correlated with the same temporal or retrieval issues attributed to the models.
Minor comments (1)
- The abstract refers to models as GPT-5.2 and GLM-4.7 without specifying exact versions, release dates, or whether these are production or hypothetical checkpoints; the experimental section should list precise model identifiers and access dates.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate revisions to strengthen the validation of the ground truth pipeline.
Point-by-point responses
Referee: The description of the agent-driven pipeline (abstract and corresponding methods section) provides no validation of the autonomously generated ground truth: there are no mentions of human spot-checks on extracted answers, cross-verification against independent sources, or measured extraction error rates. Because the headline result (maximum 46% accuracy) and the two failure-mode percentages rest entirely on the correctness of this real-time ground truth, the absence of such checks leaves open the possibility that extraction errors are correlated with the same temporal or retrieval issues attributed to the models.
Authors: We agree that the current manuscript lacks explicit quantitative validation of the ground truth produced by the agent-driven pipeline. While the self-repair mechanism is designed to maintain robustness against structural changes, it does not substitute for human spot-checks, cross-verification, or reported error rates. This omission is a genuine limitation, as the headline accuracy figures and failure-mode analysis depend on ground-truth correctness. In the revised manuscript we will add a dedicated validation subsection to the Methods. It will describe (1) human spot-checks performed on a random sample of extracted answers across domains and difficulty levels, (2) cross-verification of a subset against independent sources, and (3) the extraction error rates observed during pipeline execution and self-repair. These additions will directly rule out the possibility that extraction errors confound the reported model limitations.
Revision: yes
Circularity Check
No significant circularity: empirical benchmark with externally verifiable ground truth
Full rationale
The paper constructs RT-QA as an empirical benchmark by describing an agent pipeline that generates crawling code and applies self-repair for live web extraction to produce ground-truth answers for 320 questions. Model accuracies (e.g., 46% for best models) are then measured against these answers. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear; the derivation chain consists of procedural description followed by independent external evaluation of LLMs. The ground-truth mechanism is presented as a methodological choice rather than a mathematical reduction that forces the reported accuracies by construction. This is a standard empirical setup with no internal circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- domain assumption Web pages can be reliably crawled and parsed via generated code for accurate answer extraction despite structural changes