REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
2 Pith papers cite this work.
Citing papers
- Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
- Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Frontier LLMs display emerging investigatory agency in autonomous database analysis but struggle with long-horizon exploration on the new DDR-Bench.