SWE-agent: Agent-computer interfaces enable automated software engineering
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.AI (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
  DeepWeb-Bench is a benchmark requiring massive cross-source evidence collection and long-horizon derivation, with evaluations on nine frontier models showing derivation and calibration as primary failure modes.
- Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
  LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.