SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
Weasel is a trajectory selection method that improves out-of-domain generalization for web agents while achieving 9.7-12.5x training speedups via importance-diversity optimization, AXTree pruning, and rationale style matching.
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
citing papers explorer
-
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Weasel is a trajectory selection method that improves out-of-domain generalization for web agents while achieving 9.7-12.5x training speedups via importance-diversity optimization, AXTree pruning, and rationale style matching.