The proof and measurement of association between two things
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (5)
verdicts: UNVERDICTED (5)
roles: method (1)
polarities: use method (1)
citing papers explorer
- Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
  New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
  Semantic segmentation decomposes monitoring features into canonical and residual components that concentrate fault-predictive information while preserving operational meaning in predictive maintenance.
- Explainable Iterative Data Visualisation Refinement via an LLM Agent
  An LLM agent automates iterative refinement of data embedding visualizations by generating semantic evaluation reports and recommending configuration changes.
- Geolocating News about Extreme Climate Events: A Comparative Analysis of Off-the-Shelf Tools for Toponym Identification in German
  Off-the-shelf German NER tools produce divergent toponym sets that lead to distinct country assignments for climate event news, affecting assessments of national prominence in media coverage.