Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
Pith reviewed 2026-05-08 03:56 UTC · model grok-4.3
The pith
A benchmark of 200 long-horizon web tasks shows frontier models succeed on under half while operating at very low efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Odysseys as a benchmark of 200 long-horizon web tasks derived from real world browsing sessions and evaluated on the live Internet. Binary pass/fail evaluation proves inadequate for these settings, so each task receives an average of 6.1 graded rubrics that yield higher human agreement than trajectory-level LLM-as-a-judge metrics. Leading frontier models reach a 44.5 percent success rate on the benchmark while attaining only 1.15 percent on the new Trajectory Efficiency metric, underscoring the need for agents capable of productive operation over extended periods.
What carries the argument
The Odysseys benchmark of 200 multi-site long-horizon tasks paired with rubric-based evaluation that scores partial progress across an average of 6.1 graded criteria per task.
If this is right
- Agents need improved mechanisms for maintaining context and performing cross-site reasoning across extended sessions.
- Efficiency measured as rubric progress per step must be treated as a primary objective rather than an afterthought.
- Rubric scoring supplies a finer-grained training signal than binary success rates for guiding agent development.
- Progress toward high performance on these tasks would indicate agents ready for practical hours-long web workflows.
- The benchmark provides a concrete target for building computer-use agents that operate productively in open environments.
Where Pith is reading between the lines
- Similar rubric-based methods could be adapted to evaluate long-horizon performance in other agent settings such as desktop software or research tasks.
- The very low efficiency numbers suggest that research should focus on reducing unnecessary steps even when success is eventually reached.
- Substantial gains on Odysseys would likely accelerate development of reliable agents for ongoing personal assistance on the web.
- Links between this benchmark and advances in long-term memory or planning techniques in AI could be tested directly by measuring score improvements.
Load-bearing premise
The 200 tasks selected from real browsing sessions and the rubrics assigned to them accurately represent the main challenges and quality standards of everyday long-horizon web use.
What would settle it
A replication study in which human raters agree more strongly with simple binary pass/fail labels than with the average rubric scores across the same 200 tasks would undermine the claimed superiority of the rubric method.
Original abstract
Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
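The abstract describes Trajectory Efficiency only as "rubric score per step". A minimal Python sketch of that idea follows. It assumes each rubric item carries graded credit in [0, 1] and that a task's rubric score is the mean of its items; the abstract does not state the paper's aggregation rule. The names `RubricItem`, `rubric_score`, and `trajectory_efficiency`, and the example task, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str  # what the grader checks (hypothetical wording)
    score: float      # graded credit in [0, 1]; 1.0 means fully satisfied


def rubric_score(items: list[RubricItem]) -> float:
    """Mean credit across a task's rubric items (assumed aggregation)."""
    if not items:
        raise ValueError("a task needs at least one rubric item")
    return sum(item.score for item in items) / len(items)


def trajectory_efficiency(items: list[RubricItem], num_steps: int) -> float:
    """Rubric score earned per agent step (assumed reading of the metric)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(items) / num_steps


# Hypothetical trip-planning task with six rubric items.
items = [
    RubricItem("compared flights on two sites", 1.0),
    RubricItem("picked a hotel in the same city", 1.0),
    RubricItem("stayed within the stated budget", 1.0),
    RubricItem("checked opening hours for the venue", 1.0),
    RubricItem("summarized options in a table", 0.5),
    RubricItem("included booking links", 0.0),
]

score = rubric_score(items)                 # partial credit, not pass/fail
efficiency = trajectory_efficiency(items, num_steps=60)
print(f"rubric score: {score:.2f}, efficiency: {efficiency:.2%} per step")
```

Under this reading, an agent that earns most of the rubric but needs dozens of steps still lands at an efficiency of only a percent or so per step, which is the same order of magnitude as the 1.15% figure the abstract reports.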
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Odysseys, a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. It argues that binary pass/fail evaluation is inadequate for such settings and proposes a rubric-based evaluation annotating each task with an average of 6.1 graded rubrics, claiming this yields higher human agreement than trajectory-level LLM-as-a-judge metrics. Experiments on frontier models report a maximum success rate of 44.5% and a Trajectory Efficiency of 1.15%, indicating substantial room for improvement. The tasks, evaluation scripts, and results are released publicly.
Significance. If the tasks prove representative of realistic long-horizon multi-site web use and the rubric evaluation is shown to be reliably superior, this benchmark would be a valuable contribution to web agent research. It moves beyond saturated short single-site tasks, introduces an efficiency metric that exposes a key limitation of current agents, and provides open artifacts that enable reproducible progress tracking toward agents capable of sustained, productive web operation.
major comments (2)
- [Task Derivation] Task derivation section: the process of deriving the 200 tasks from real browsing sessions is described only at a high level. No quantitative diversity statistics, filtering criteria, session-length distributions, or representativeness metrics are supplied, which is load-bearing for the claim that these tasks faithfully sample long-horizon, multi-site web use and that the reported 44.5% success rate reflects realistic performance.
- [Evaluation Methodology] Evaluation methodology: the claim that rubric-based evaluation (average 6.1 rubrics per task) yields higher agreement with humans than binary pass/fail or LLM-as-a-judge metrics is asserted without the inter-rater protocol, number of annotators, or statistical comparisons (e.g., Cohen’s κ or percentage agreement). This directly affects the justification for the new metric and the interpretation of the 1.15% Trajectory Efficiency result.
minor comments (1)
- [Abstract] Abstract: the statement that rubric evaluation 'yields higher agreement with humans' would benefit from a brief parenthetical quantitative comparison (e.g., agreement rates) if space permits.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
- Referee: [Task Derivation] Task derivation section: the process of deriving the 200 tasks from real browsing sessions is described only at a high level. No quantitative diversity statistics, filtering criteria, session-length distributions, or representativeness metrics are supplied, which is load-bearing for the claim that these tasks faithfully sample long-horizon, multi-site web use and that the reported 44.5% success rate reflects realistic performance.
  Authors: We agree that the task derivation section would benefit from greater quantitative detail to support the representativeness claim. In the revised manuscript we will expand the relevant section to report session-length distributions, explicit filtering criteria applied to the source browsing logs, and diversity statistics (e.g., distribution across domains, task categories, and number of sites per task). These additions will be drawn from the same data-collection pipeline already used to construct the benchmark.
  Revision: yes
- Referee: [Evaluation Methodology] Evaluation methodology: the claim that rubric-based evaluation (average 6.1 rubrics per task) yields higher agreement with humans than binary pass/fail or LLM-as-a-judge metrics is asserted without the inter-rater protocol, number of annotators, or statistical comparisons (e.g., Cohen’s κ or percentage agreement). This directly affects the justification for the new metric and the interpretation of the 1.15% Trajectory Efficiency result.
  Authors: We acknowledge that the current text does not report the full human-evaluation protocol or statistical comparisons. We will revise the evaluation section to describe the inter-rater protocol, the number of annotators, and quantitative agreement metrics (Cohen’s κ and percentage agreement) between rubric scores, binary judgments, and LLM-as-a-judge scores. These details will be added from the annotation process already performed for the benchmark.
  Revision: yes
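The two agreement statistics named in this exchange can be sketched in a few lines of Python. This is the standard two-rater definition of percentage agreement and Cohen’s κ, not the paper's annotation protocol, which the text above says is not reported. The label lists `human` and `judge` are made-up illustrative data.

```python
from collections import Counter


def percent_agreement(rater_a: list, rater_b: list) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own observed rates.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up per-rubric verdicts (1 = satisfied, 0 = not satisfied).
human = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"percentage agreement: {percent_agreement(human, judge):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human, judge):.3f}")
```

Reporting κ alongside raw agreement matters here because rubric verdicts are often skewed toward one label, and raw agreement alone can look high purely by chance.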
Circularity Check
No circularity: empirical benchmark construction with no self-referential derivations or fitted predictions.
The paper presents an empirical benchmark of 200 long-horizon web tasks sourced from real browsing sessions, introduces rubric-based evaluation (avg. 6.1 rubrics per task), and reports model performance (44.5% success, 1.15% Trajectory Efficiency) on live internet environments. No equations, first-principles derivations, or predictions appear in the provided text. Task collection, rubric annotation, and efficiency metric definition are descriptive and externally grounded rather than reducing to self-defined inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no parameters are fitted then relabeled as predictions. The central claims rest on direct testing against released artifacts and human agreement comparisons, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Tasks derived from real-world browsing sessions represent typical long-horizon web use
- Domain assumption: Rubric-based scoring provides higher human agreement than binary pass/fail or LLM-as-a-judge metrics
Forward citations
Cited by 1 Pith paper
- Skim: Speculative Execution for Fast and Efficient Web Agents
  Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.