
arxiv: 2604.24964 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.CL

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Pith reviewed 2026-05-08 03:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords web agents · long-horizon tasks · benchmark · rubric evaluation · trajectory efficiency · multi-site navigation · AI agents · evaluation metrics

The pith

A benchmark of 200 long-horizon web tasks shows frontier models succeed on under half while operating at very low efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing web agent benchmarks focus on short, single-site tasks on which top models are approaching saturation. This paper introduces Odysseys, a set of 200 tasks derived from real browsing sessions that span multiple sites and require sustained cross-site reasoning. It replaces binary pass/fail judgments with an average of 6.1 graded rubrics per task, which produces stronger agreement with human evaluators than trajectory-level LLM judgments. On these live-web tasks, the strongest frontier models tested reach a 44.5 percent success rate and only 1.15 percent on Trajectory Efficiency (rubric score per step). The results indicate that current agents cannot yet handle realistic, potentially hours-long web workflows productively.

Core claim

We introduce Odysseys as a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. Binary pass/fail evaluation proves inadequate for these settings, so each task receives an average of 6.1 graded rubrics that yield higher human agreement than trajectory-level LLM-as-a-judge metrics. The strongest frontier models tested reach a 44.5 percent success rate on the benchmark while attaining only 1.15 percent on the new Trajectory Efficiency metric, underscoring the need for agents capable of productive operation over extended periods.

What carries the argument

The Odysseys benchmark of 200 multi-site long-horizon tasks paired with rubric-based evaluation that scores partial progress across an average of 6.1 graded criteria per task.
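To make the contrast between binary and rubric scoring concrete, here is a minimal Python sketch. It assumes each rubric is graded on a 0-to-1 scale and that a task's rubric score is the unweighted mean of its rubric grades; the paper's exact grading scale and aggregation are not stated in this summary, so `RubricGrade`, `rubric_score`, and `is_perfect` are hypothetical names and the grades are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricGrade:
    """One graded criterion for a task (hypothetical structure)."""
    description: str
    grade: float  # assumed scale: 0.0 (not met) to 1.0 (fully met)


def rubric_score(grades: list[RubricGrade]) -> float:
    """Unweighted mean of rubric grades (an assumed aggregation)."""
    if not grades:
        raise ValueError("a task needs at least one rubric")
    return sum(g.grade for g in grades) / len(grades)


def is_perfect(grades: list[RubricGrade]) -> bool:
    """Binary view: success only if every rubric is fully met."""
    return all(g.grade >= 1.0 for g in grades)


# Illustrative (made-up) grades for a trip-planning task.
grades = [
    RubricGrade("found flights on an airline site", 1.0),
    RubricGrade("compared hotel prices on two sites", 1.0),
    RubricGrade("checked restaurant opening hours", 0.5),
    RubricGrade("summarized the itinerary", 0.0),
]
print(rubric_score(grades))  # 0.625
print(is_perfect(grades))    # False
```

A binary judgment would record this run as a plain failure, while the rubric score keeps the information that most of the work was done.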

If this is right

  • Agents need improved mechanisms for maintaining context and performing cross-site reasoning across extended sessions.
  • Efficiency measured as rubric progress per step must be treated as a primary objective rather than an afterthought.
  • Rubric scoring supplies a finer-grained training signal than binary success rates for guiding agent development.
  • Progress toward high performance on these tasks would indicate agents ready for practical hours-long web workflows.
  • The benchmark provides a concrete target for building computer-use agents that operate productively in open environments.
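The efficiency idea in the list above can be sketched in a few lines of Python. The abstract defines Trajectory Efficiency only as "rubric score per step"; the sketch assumes a per-task ratio averaged over tasks, which may differ from the paper's exact aggregation. The function names and the example runs are hypothetical.

```python
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Rubric score earned per step for one trajectory (assumed definition)."""
    if num_steps <= 0:
        raise ValueError("a trajectory must contain at least one step")
    return rubric_score / num_steps


def mean_trajectory_efficiency(runs: list[tuple[float, int]]) -> float:
    """Average the per-task ratio over (rubric_score, num_steps) pairs."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(trajectory_efficiency(s, n) for s, n in runs) / len(runs)


# Made-up runs: (rubric score in [0, 1], number of agent steps).
runs = [(1.0, 50), (0.5, 100), (0.0, 200)]
print(f"{mean_trajectory_efficiency(runs):.2%}")  # 0.83%
```

Under this assumed definition, even a run that satisfies every rubric in 50 steps scores only 2 percent, so low single-digit values are what long trajectories would be expected to produce.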

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rubric-based methods could be adapted to evaluate long-horizon performance in other agent settings such as desktop software or research tasks.
  • The very low efficiency numbers suggest that research should focus on reducing unnecessary steps even when success is eventually reached.
  • Substantial gains on Odysseys would likely accelerate development of reliable agents for ongoing personal assistance on the web.
  • Links between this benchmark and advances in long-term memory or planning techniques in AI could be tested directly by measuring score improvements.

Load-bearing premise

The 200 tasks selected from real browsing sessions and the rubrics assigned to them accurately represent the main challenges and quality standards of everyday long-horizon web use.

What would settle it

A replication study in which human raters agree more strongly with simple binary pass/fail labels than with the average rubric scores across the same 200 tasks would undermine the claimed superiority of the rubric method.

Figures

Figures reproduced from arXiv: 2604.24964 by Daniel Fried, Jing Yu Koh, Lawrence Keunho Jang, Ruslan Salakhutdinov.

Figure 1: Odysseys is a long-horizon web agent benchmark with 200 tasks based on real user…
Figure 2: After researching 10 surgeon profiles, GPT-5.4 encoded the entire data table as a…
Figure 3: A scatter plot of perfect success rate against number of steps taken. The Pareto…
Figure 4: Perfect rubric rate as a function of step budget. Each curve shows the mean rubrics…
Figure 5: Rubric scores (left) and # of steps taken per task (right) broken down by difficulty.
Figure 6: The annotation interface used by participants to label their Chrome browsing…
Figure 7: The Odysseys QA interface used for manual review of chained tasks. Reviewers…
Figure 8: Perfect rubric rate vs. step budget at a 200-step budget. Claude Opus 4.6 is shown…
Figure 9: The trajectory viewer used for human agreement annotation. Annotators step…
Figure 10: When margauxny.com rendered blank due to JavaScript failures, the GPT-5.4 agent typed view-source:https://margauxny.com/products/... directly in the address bar, then used ctrl+f to search for variants, InStock, 40.5, and US 9.5 in the raw HTML. From the embedded JSON-LD schema, it decoded EU 40.0 = US 9.5, confirmed that the SKU was in stock, and reconstructed the direct variant URL, all without the prod…
Figure 11: After navigating to brave.com/linux, Opus used ctrl+f to search for Chromebook and observed 0/0 matches, confirming that the page did not cover the topic at all, and immediately pivoted to a different strategy. Rather than scrolling to verify, it treated the absence of a match as decisive evidence that the page was unhelpful.
Figure 12: Both yoga studio websites returned 403 Forbidden. Opus immediately navigated to https://web.archive.org/web/2024/https://coilyoga.com/classes/ and repeated the same strategy for Tower Yoga, successfully retrieving archived pages from June 2024 that contained the full class schedule information.
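
Figure 3 plots perfect success rate against steps taken and refers to a Pareto frontier. As a rough illustration of how such a frontier can be extracted, the Python sketch below keeps every agent for which no other agent is at least as good on both axes (fewer or equal steps, higher or equal success). The agent names and numbers are invented for the example and are not results from the paper.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of non-dominated agents, sorted by mean steps.

    points maps a name to (mean_steps, success_rate); fewer steps and
    higher success are both better.
    """
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        # a dominates b if it is no worse on both axes and not identical.
        return a[0] <= b[0] and a[1] >= b[1] and a != b

    frontier = [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    ]
    return sorted(frontier, key=lambda name: points[name][0])


# Hypothetical agents: (mean steps per task, perfect success rate).
agents = {
    "agent-a": (40, 0.20),
    "agent-b": (60, 0.35),
    "agent-c": (80, 0.30),  # dominated by agent-b
    "agent-d": (120, 0.45),
}
print(pareto_frontier(agents))  # ['agent-a', 'agent-b', 'agent-d']
```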
Original abstract

Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Odysseys, a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. It argues that binary pass/fail evaluation is inadequate for such settings and proposes a rubric-based evaluation annotating each task with an average of 6.1 graded rubrics, claiming this yields higher human agreement than trajectory-level LLM-as-a-judge metrics. Experiments on frontier models report a maximum success rate of 44.5% and a Trajectory Efficiency of 1.15%, indicating substantial room for improvement. The tasks, evaluation scripts, and results are released publicly.

Significance. If the tasks prove representative of realistic long-horizon multi-site web use and the rubric evaluation is shown to be reliably superior, this benchmark would be a valuable contribution to web agent research. It moves beyond saturated short single-site tasks, introduces an efficiency metric that exposes a key limitation of current agents, and provides open artifacts that enable reproducible progress tracking toward agents capable of sustained, productive web operation.

major comments (2)
  1. [Task Derivation] Task derivation section: the process of deriving the 200 tasks from real browsing sessions is described only at a high level. No quantitative diversity statistics, filtering criteria, session-length distributions, or representativeness metrics are supplied, which is load-bearing for the claim that these tasks faithfully sample long-horizon, multi-site web use and that the reported 44.5% success rate reflects realistic performance.
  2. [Evaluation Methodology] Evaluation methodology: the claim that rubric-based evaluation (average 6.1 rubrics per task) yields higher agreement with humans than binary pass/fail or LLM-as-a-judge metrics is asserted without the inter-rater protocol, number of annotators, or statistical comparisons (e.g., Cohen’s κ or percentage agreement). This directly affects the justification for the new metric and the interpretation of the 1.15% Trajectory Efficiency result.
minor comments (1)
  1. [Abstract] Abstract: the statement that rubric evaluation 'yields higher agreement with humans' would benefit from a brief parenthetical quantitative comparison (e.g., agreement rates) if space permits.
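
The agreement statistics requested in the comments above are standard and easy to compute. The Python sketch below implements percentage agreement and Cohen's κ for two raters over categorical labels; the label lists are invented for illustration, and nothing here reflects the paper's actual annotation protocol.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: agreement corrected for chance agreement."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels: a human annotator vs. an automated judge on 8 trajectories.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
print(percent_agreement(human, judge))  # 0.75
print(cohens_kappa(human, judge))       # 0.5
```

Reporting κ alongside raw agreement matters because two raters who both label most trajectories as failures can agree often by chance alone.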

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [Task Derivation] Task derivation section: the process of deriving the 200 tasks from real browsing sessions is described only at a high level. No quantitative diversity statistics, filtering criteria, session-length distributions, or representativeness metrics are supplied, which is load-bearing for the claim that these tasks faithfully sample long-horizon, multi-site web use and that the reported 44.5% success rate reflects realistic performance.

    Authors: We agree that the task derivation section would benefit from greater quantitative detail to support the representativeness claim. In the revised manuscript we will expand the relevant section to report session-length distributions, explicit filtering criteria applied to the source browsing logs, and diversity statistics (e.g., distribution across domains, task categories, and number of sites per task). These additions will be drawn from the same data-collection pipeline already used to construct the benchmark. revision: yes

  2. Referee: [Evaluation Methodology] Evaluation methodology: the claim that rubric-based evaluation (average 6.1 rubrics per task) yields higher agreement with humans than binary pass/fail or LLM-as-a-judge metrics is asserted without the inter-rater protocol, number of annotators, or statistical comparisons (e.g., Cohen’s κ or percentage agreement). This directly affects the justification for the new metric and the interpretation of the 1.15% Trajectory Efficiency result.

    Authors: We acknowledge that the current text does not report the full human-evaluation protocol or statistical comparisons. We will revise the evaluation section to describe the inter-rater protocol, the number of annotators, and quantitative agreement metrics (Cohen’s κ and percentage agreement) between rubric scores, binary judgments, and LLM-as-a-judge scores. These details will be added from the annotation process already performed for the benchmark. revision: yes
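
The diversity statistics promised in the first response (distribution across domains, number of sites per task) could be derived from visited URLs roughly as follows. This is a sketch under stated assumptions: the benchmark's real data format is not described in this summary, so the `task_urls` mapping, the function names, and the example URLs are hypothetical, and a site is approximated by the URL hostname with a leading `www.` removed (requires Python 3.9+ for `str.removeprefix`).

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Hostname of a URL with a leading 'www.' dropped (a crude site key)."""
    host = urlsplit(url).hostname or ""
    return host.removeprefix("www.")


def sites_per_task(task_urls: dict[str, list[str]]) -> dict[str, int]:
    """Number of distinct sites visited in each task."""
    return {task: len({site_of(u) for u in urls}) for task, urls in task_urls.items()}


def site_distribution(task_urls: dict[str, list[str]]) -> Counter:
    """Number of tasks in which each site appears at least once."""
    counts: Counter = Counter()
    for urls in task_urls.values():
        counts.update({site_of(u) for u in urls})
    return counts


# Hypothetical visited-URL logs for two tasks.
task_urls = {
    "trip": [
        "https://www.example.com/flights",
        "https://example.com/hotels",
        "https://example.org/reviews",
        "https://maps.example.net/route",
    ],
    "shopping": [
        "https://example.org/shoes",
        "https://example.org/cart",
    ],
}
print(sites_per_task(task_urls))                   # {'trip': 3, 'shopping': 1}
print(site_distribution(task_urls)["example.org"])  # 2
```

Hostnames treat subdomains as separate sites; grouping by registrable domain would need the Public Suffix List, which this sketch deliberately leaves out.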

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no self-referential derivations or fitted predictions.


The paper presents an empirical benchmark of 200 long-horizon web tasks sourced from real browsing sessions, introduces rubric-based evaluation (avg. 6.1 rubrics per task), and reports model performance (44.5% success, 1.15% Trajectory Efficiency) on live internet environments. No equations, first-principles derivations, or predictions appear in the provided text. Task collection, rubric annotation, and efficiency metric definition are descriptive and externally grounded rather than reducing to self-defined inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no parameters are fitted then relabeled as predictions. The central claims rest on direct testing against released artifacts and human agreement comparisons, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on domain assumptions about task realism and evaluation validity without introducing free parameters or new entities.

axioms (2)
  • domain assumption: Tasks derived from real-world browsing sessions represent typical long-horizon web use.
    This supports the claim of realism for the benchmark.
  • domain assumption: Rubric-based scoring provides higher human agreement than binary pass/fail or LLM-as-a-judge metrics.
    Central to the proposed evaluation approach.

pith-pipeline@v0.9.0 · 5606 in / 1413 out tokens · 33537 ms · 2026-05-08T03:56:07.411895+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Skim: Speculative Execution for Fast and Efficient Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
