arXiv preprint arXiv:2601.20975 , year=

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, Dipanjan Das · 2026 · arXiv 2601.20975

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

representative citing papers

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

cs.AI · 2026-05-21 · conditional · novelty 7.0

SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

cs.AI · 2026-02-16 · unverdicted · novelty 6.0

A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

cs.AI · 2026-05-20 · conditional · novelty 4.0 · 2 refs

QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.

citing papers explorer

Showing 7 of 7 citing papers.

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval cs.AI · 2026-05-21 · conditional · none · ref 15
SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition cs.CL · 2026-05-12 · unverdicted · none · ref 111
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 53 · 2 links
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems cs.CL · 2026-05-05 · unverdicted · none · ref 17
BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data cs.LG · 2026-04-21 · unverdicted · none · ref 4
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence cs.AI · 2026-02-16 · unverdicted · none · ref 8
A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work cs.AI · 2026-05-20 · conditional · none · ref 14 · 2 links
QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.

arXiv preprint arXiv:2601.20975 , year=

fields

years

verdicts

representative citing papers

citing papers explorer