Draco: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026
2 Pith papers cite this work.
fields
cs.AI 2
years
2026 2
representative citing papers
DeepWeb-Bench is a benchmark requiring massive cross-source evidence collection and long-horizon derivation, with evaluations on nine frontier models showing derivation and calibration as primary failure modes.
citing papers explorer
- SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.
- DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench is a benchmark requiring massive cross-source evidence collection and long-horizon derivation, with evaluations on nine frontier models showing derivation and calibration as primary failure modes.