arXiv: 2604.12666 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.CL · cs.HC


From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang


Pith reviewed 2026-05-10 15:09 UTC · model grok-4.3


keywords web navigation agents · curriculum learning · preference optimization · hard negative mining · large language models · robust agent training


The pith

Progressive curriculum from imitation to discrimination lets 32B models surpass GPT-4.5 on web navigation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard supervised fine-tuning falls short for web navigation agents because it does not teach them to discriminate against plausible but wrong elements in crowded HTML pages or to generalize to new site layouts. The authors address this by creating the Triton dataset of 590k instances using structural-semantic hard negative mining to identify tricky distractors and a dual-agent consensus pipeline to generate and verify diverse tasks across domains. They then apply a progressive curriculum: basic imitation with SFT, discrimination via ORPO, and long-horizon consistency with GRPO. The resulting Triton-GRPO-32B model achieves a 58.7% step success rate on the Mind2Web benchmark, outperforming GPT-4.5 and Claude-4.5, which supports the idea that targeted data and training stages can be more impactful than model size alone.

Core claim

The Triton dataset built via Structural-Semantic Hard Negative Mining and Dual-Agent Consensus, combined with progressive training through SFT, ORPO, and GRPO stages, produces 32B models that reach 58.7% Step Success Rate on Mind2Web, exceeding the performance of GPT-4.5 at 42.4% and Claude-4.5 at 41.4%.
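The gap behind this claim can be read two ways, so a small worked check may help. The sketch below uses only the three Step Success Rates quoted above and separates the absolute gap (percentage points) from the relative gain; the helper name `gaps` is illustrative and not from the paper.

```python
# Step Success Rates (%) as quoted in the core claim.
TRITON_GRPO_32B = 58.7
BASELINES = {"GPT-4.5": 42.4, "Claude-4.5": 41.4}


def gaps(ours: float, other: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative gain in percent)."""
    absolute = ours - other
    relative = 100.0 * absolute / other
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for name, score in BASELINES.items():
        points, relative = gaps(TRITON_GRPO_32B, score)
        print(f"{name}: +{points} points absolute, +{relative}% relative")
```

Read this way, "over 16" refers to percentage points; the relative improvement over either baseline is considerably larger.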

What carries the argument

Structural-Semantic Hard Negative Mining to generate topologically similar distractors and Dual-Agent Consensus for synthesizing verified cross-domain tasks, enabling the shift from imitation to robust discrimination in the curriculum.
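To make the idea of "topologically similar distractors" concrete, here is a minimal sketch of one way such mining could work. It is not the paper's formulation, which this page does not reproduce: the `Element` type, the shared-path-prefix structural score, the token-overlap semantic score, and the `alpha` mixing weight are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    node_id: int
    path: tuple[str, ...]  # tag names from the document root to this element
    text: str


def structural_sim(a: Element, b: Element) -> float:
    """Fraction of the DOM path shared as a common prefix (assumed metric)."""
    shared = 0
    for x, y in zip(a.path, b.path):
        if x != y:
            break
        shared += 1
    return shared / max(len(a.path), len(b.path), 1)


def semantic_sim(a: Element, b: Element) -> float:
    """Jaccard overlap of lowercase text tokens (assumed metric)."""
    ta, tb = set(a.text.lower().split()), set(b.text.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mine_hard_negatives(target: Element, candidates: list[Element],
                        k: int = 2, alpha: float = 0.5) -> list[Element]:
    """Return the k non-target elements that look most like the target."""
    scored = [
        (alpha * structural_sim(target, c) + (1 - alpha) * semantic_sim(target, c), c)
        for c in candidates
        if c.node_id != target.node_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Toy flight-list page: the sibling button is the hardest distractor.
LIST_PATH = ("html", "body", "ul", "li")
TARGET = Element(1, LIST_PATH + ("button",), "Select flight")
PAGE = [
    TARGET,
    Element(2, LIST_PATH + ("button",), "Select flight details"),
    Element(3, ("html", "body", "footer", "a"), "Contact us"),
    Element(4, LIST_PATH + ("span",), "Price"),
]
```

Calling `mine_hard_negatives(TARGET, PAGE)` ranks the sibling button first and the footer link last, which is the kind of ordering a discrimination stage would want to train against.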

If this is right

Agents gain the ability to reject incorrect but plausible page elements in densely populated web pages.
Training leads to improved generalization across unseen website layouts and domains.
Long-horizon navigation consistency improves through the final optimization stage.
Specialized data and curriculum design can outperform increases in raw model parameters for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the hard negative mining and consensus verification to other agent benchmarks could yield similar gains in robustness.
The success suggests data quality and training progression are key levers for web agent development beyond scaling laws.

Load-bearing premise

The training examples generated by the hard negative mining and dual-agent consensus accurately represent the difficulty and noise distribution of real-world HTML without introducing biases that inflate measured performance.

What would settle it

Running the Triton-GRPO-32B model on a new collection of web pages whose distractors are engineered to have structural or semantic properties different from those mined for the dataset, and observing whether the step success rate drops below 42%, roughly the level reported for GPT-4.5 and Claude-4.5.

Figures

Figures reproduced from arXiv: 2604.12666 by Chuang Peng, Jian Yang, Renshuai Tao, Wei Zhang, Xinhao Zhang.

**Figure 2.** Overview of our proposed framework. (Left) We construct the Triton dataset by mining topological hard …

**Figure 3.** Visualization of semantic diversity via t-SNE.

**Figure 4.** Training dynamics analysis. Performance trajectories of the Full Curriculum versus variants with insufficient SFT foundations. Shaded areas denote alignment stages. The persistent gap confirms that ORPO and GRPO act as behavioral refiners rather than knowledge injectors, and thus cannot recover from a weak SFT baseline.

**Figure 5.** Hyperparameter sensitivity. Step Success …

**Figure 6.** Scaling analysis of synthetic data. We measure …

**Figure 7.** Visualizing Hard Negatives in a flight search list. The model must discern the correct button based on the …

**Figure 8.** Visualization of Counterfactual Rejection. While baseline models often force an action on disabled …

**Figure 9.** Prompt templates used in the Dual-Agent Consensus pipeline. The Generator ( …

**Figure 10.** The flattened input representation. We inject explicit IDs (e.g., …

Original abstract

Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The Triton dataset and staged SFT-ORPO-GRPO curriculum are the real contributions here, but the 16-point jump over GPT-4.5 on Mind2Web still needs tighter checks on data fidelity and evaluation.


The paper's main move is releasing the 590k Triton dataset built with topology-aware hard-negative mining plus dual-agent consensus for task synthesis, then running a three-stage curriculum on a 32B model. That produces the reported 58.7% step success rate on Mind2Web, which beats the closed models they cite. The curriculum itself—starting with imitation, moving to odds-ratio preference optimization for discrimination, then group-relative policy optimization for consistency—is a sensible response to the two problems they flag: agents failing to reject plausible wrong elements and failing to generalize across layouts. The hard-negative approach and the dual-agent verification step are concrete engineering choices that address real pain points in web navigation data.

Referee Report

3 major / 1 minor

Summary. The paper introduces the Triton dataset (590k instances) constructed via Structural-Semantic Hard Negative Mining and a Dual-Agent Consensus pipeline for synthesizing verified web navigation tasks. It proposes a progressive curriculum of SFT for basic imitation, followed by ORPO for discrimination against hard negatives, and GRPO for long-horizon consistency, yielding Triton-GRPO-32B which reports 58.7% Step Success Rate on Mind2Web, outperforming GPT-4.5 (42.4%) and Claude-4.5 (41.4%).

Significance. If the results hold after proper validation, the work would demonstrate that targeted data curation and staged optimization can enable a 32B open-source model to substantially outperform larger proprietary models on web navigation, underscoring the value of curriculum design over raw scale. The Triton dataset construction approach could provide a reusable template for generating robust training data in noisy environments.

major comments (3)

[Dataset Construction] Dataset construction: The Structural-Semantic Hard Negative Mining and Dual-Agent Consensus pipeline is described at a high level but supplies no quantitative validation (e.g., inter-agent agreement rates, human precision on held-out labels, or distributional divergence statistics vs. live HTML). This is load-bearing because the headline 16%+ SSR gain on Mind2Web could arise from synthetic artifacts rather than improved robustness if the mined negatives or consensus labels contain systematic errors.
[Empirical Evaluation] Empirical evaluation: The reported 58.7% SSR for Triton-GRPO-32B lacks any description of the evaluation protocol, statistical significance tests, data-leakage checks between Triton and Mind2Web, or ablation results isolating the contribution of each curriculum stage (SFT, ORPO, GRPO). Without these, the central claim that the curriculum produces the observed gains cannot be assessed.
[Training Curriculum] Training details: No analysis is provided on how the progressive stages incrementally improve discrimination or long-horizon behavior, nor are there controls showing that the performance edge is not simply due to the volume of 590k instances or model scale alone.
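As an illustration of the "distributional divergence statistics" requested in the first comment, the sketch below compares element-tag frequencies in a synthetic sample against a live-HTML sample using a smoothed KL divergence. The function names, the additive smoothing constant `eps`, and the toy tag lists are assumptions; the paper's actual validation procedure is not described on this page.

```python
import math
from collections import Counter


def tag_distribution(tags: list[str], vocab: list[str],
                     eps: float = 1e-6) -> dict[str, float]:
    """Smoothed relative frequency of each tag in `vocab`."""
    counts = Counter(tags)
    total = sum(counts[t] for t in vocab) + eps * len(vocab)
    return {t: (counts[t] + eps) / total for t in vocab}


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) in nats; both distributions must share the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


# Toy tag samples, invented purely for illustration.
SYNTHETIC_TAGS = ["button", "a", "a", "input"]
LIVE_TAGS = ["a", "a", "a", "div"]
VOCAB = sorted(set(SYNTHETIC_TAGS) | set(LIVE_TAGS))

P_SYNTHETIC = tag_distribution(SYNTHETIC_TAGS, VOCAB)
P_LIVE = tag_distribution(LIVE_TAGS, VOCAB)

if __name__ == "__main__":
    print(f"KL(synthetic || live) = {kl_divergence(P_SYNTHETIC, P_LIVE):.3f} nats")
```

A value near zero would suggest the synthetic pages resemble live HTML on this one axis; it says nothing about label correctness, which needs the human-precision check the referee also asks for.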

minor comments (1)

[Abstract] The abstract would benefit from explicitly stating the exact model parameter counts and whether the GPT-4.5/Claude-4.5 baselines use the same prompting or tool-use setup as the Triton models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to incorporate the requested details, analyses, and validations.


Referee: [Dataset Construction] Dataset construction: The Structural-Semantic Hard Negative Mining and Dual-Agent Consensus pipeline is described at a high level but supplies no quantitative validation (e.g., inter-agent agreement rates, human precision on held-out labels, or distributional divergence statistics vs. live HTML). This is load-bearing because the headline 16%+ SSR gain on Mind2Web could arise from synthetic artifacts rather than improved robustness if the mined negatives or consensus labels contain systematic errors.

Authors: We agree that quantitative validation is essential for substantiating dataset quality. In the revised manuscript, we will add a new subsection reporting: inter-agent agreement rates from the Dual-Agent Consensus pipeline; human precision and recall on a held-out set of 500 instances; and distributional statistics (e.g., KL divergence on element frequencies and structural similarity) comparing Triton instances against live HTML samples. These additions will confirm the pipeline's reliability and rule out systematic labeling errors.

Revision: yes

Referee: [Empirical Evaluation] Empirical evaluation: The reported 58.7% SSR for Triton-GRPO-32B lacks any description of the evaluation protocol, statistical significance tests, data-leakage checks between Triton and Mind2Web, or ablation results isolating the contribution of each curriculum stage (SFT, ORPO, GRPO). Without these, the central claim that the curriculum produces the observed gains cannot be assessed.

Authors: We will expand the Experimental Setup and Results sections to fully specify the Mind2Web evaluation protocol, including prompting details, success criteria, and run counts. Statistical significance will be added via bootstrap confidence intervals and paired tests on the 58.7% SSR. Data-leakage checks will verify zero overlap in websites or pages between Triton and Mind2Web. Ablation tables will isolate gains from each stage (SFT to ORPO to GRPO).

Revision: yes

Referee: [Training Curriculum] Training details: No analysis is provided on how the progressive stages incrementally improve discrimination or long-horizon behavior, nor are there controls showing that the performance edge is not simply due to the volume of 590k instances or model scale alone.

Authors: We will add a dedicated curriculum analysis subsection showing incremental performance gains across SFT, ORPO, and GRPO stages on Mind2Web and internal sets, with metrics for discrimination accuracy and long-horizon consistency. Control experiments will compare against a 590k-instance standard SFT baseline and a larger-scale model to isolate the curriculum's contribution from data volume or parameter count.

Revision: yes
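For readers unfamiliar with the bootstrap confidence intervals promised in the second response, a minimal percentile-bootstrap sketch follows. It assumes per-step outcomes are available as a list of 0/1 values; the function name, the resample count, and the toy outcome list are illustrative and do not reflect the paper's actual evaluation data.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 step outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record each resample's success rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Toy data: 60 successful steps out of 100 (invented, not from the paper).
TOY_OUTCOMES = [1] * 60 + [0] * 40

if __name__ == "__main__":
    low, high = bootstrap_ci(TOY_OUTCOMES)
    print(f"step success rate 0.60, 95% CI approx. [{low:.2f}, {high:.2f}]")
```

Because steps within one task are correlated, resampling whole tasks rather than individual steps would likely give a more honest interval; the step-level version is shown only because it is the shortest to write.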

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark

full rationale

The paper's chain consists of constructing the Triton dataset (590k instances) via Structural-Semantic Hard Negative Mining and Dual-Agent Consensus, then applying a progressive curriculum (SFT → ORPO → GRPO) to produce models evaluated on the external Mind2Web benchmark. The central claim of 58.7% Step Success Rate is an empirical measurement on held-out external data rather than a quantity derived by construction from the training inputs or self-citations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the reported performance to the dataset construction steps. The result is therefore self-contained against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the newly introduced dataset construction and curriculum; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5546 in / 1223 out tokens · 40603 ms · 2026-05-10T15:09:13.116605+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 2 internal anchors

[1]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.
OpenAI. 2025. Introducing gpt-4.5: A research preview. https://openai.com/index/introducing-gpt...

[2]

Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524, 2025

Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741.
ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others. 2025. Seed-coder: Let the code model curate data for itself. arXiv prepri...

[3]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

A Data Construction Details
In this section, we provide detailed specifications for the dataset construction pipeline, including the mathematical formulation for hard negative mining and the automated synthesis workflow.

A.1 Discriminative Trajectory Mini...

[4]

Click the ’Sign Up’ button at the top right

Tag Removal: We strip all non-visual and …

Generator Agent Prompts (M_G)
[System System] You are an expert web user specializing in creating realistic user interactions.
[Shared Context]
HTML Snippet: {HTML_CONTEXT}
Target Element: {TARGET_ELEMENT_HTML}
[Task-Specific Instructions]
Type 1: Navigation Intent
Generate a short, imperative command that directly op...
[5]

Generic attributes (e.g., style strings, tracking codes) are discarded

Attribute Filtering: To reduce noise, we retain only semantically relevant attributes: class, id, type, name, aria-label, placeholder, and value. Generic attributes (e.g., style strings, tracking codes) are discarded.
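The filtering rule above is essentially an allowlist, which the following sketch makes explicit. The seven attribute names come from the text; the function name and the dict-based attribute representation are assumptions for illustration.

```python
# Attributes the text lists as semantically relevant.
KEPT_ATTRIBUTES = ("class", "id", "type", "name", "aria-label", "placeholder", "value")


def filter_attributes(attrs: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted attributes; drop style strings, tracking codes, etc."""
    return {key: value for key, value in attrs.items() if key.lower() in KEPT_ATTRIBUTES}


# Example element attributes (invented for illustration).
RAW_ATTRS = {
    "class": "btn primary",
    "style": "color: red",
    "data-track": "cta-17",
    "aria-label": "Search",
}
CLEAN_ATTRS = filter_attributes(RAW_ATTRS)
```

An allowlist fails closed: any attribute not named is dropped, which suits the stated goal of reducing noise more than a blocklist would.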
[6]

Text Truncation: Long text nodes are truncated to the first 50 tokens to preserve the structural outline without exhausting the context window
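A minimal sketch of the truncation step follows. The text does not say which tokenizer defines a "token", so whitespace-separated words are used here as a stand-in; that choice, and the function name, are assumptions.

```python
def truncate_text(text: str, max_tokens: int = 50) -> str:
    """Keep the first `max_tokens` whitespace-separated tokens of a text node.

    Assumption: whitespace tokens approximate whatever tokenizer the paper uses.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# An 80-word text node, generated for illustration.
LONG_NODE = " ".join(f"word{i}" for i in range(80))
SHORT_NODE = "Add to cart"
```

With a subword tokenizer the cut point would fall earlier in the text, so a real pipeline would need to truncate on model tokens to get a predictable context budget.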
[7]

element=42

ID Injection: Crucially, we inject a sequential numeric identifier (e.g., backend_node_id) into every interactive element. This allows the model to output a concise ID (e.g., "element=42") rather than generating a complex XPath. Multi-Task Input Format. We structure the input using the standard ChatML format. To support multi-step navigation, we also a...
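The ID-injection step can be sketched with the standard-library HTML parser. The attribute name `backend_node_id` comes from the text; the set of tags treated as interactive, the zero-based numbering, and the class and function names are assumptions made for this example.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags count as "interactive" is not specified in the text.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class IdInjector(HTMLParser):
    """Re-serializes HTML, adding a sequential backend_node_id to interactive tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        # Drop any pre-existing identifier so numbering stays sequential.
        attrs = [(k, v) for k, v in attrs if k != "backend_node_id"]
        if tag in INTERACTIVE_TAGS:
            attrs.insert(0, ("backend_node_id", str(self.next_id)))
            self.next_id += 1
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def inject_ids(html: str) -> str:
    parser = IdInjector()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


SNIPPET = '<div><button class="go">Search</button><input type="text"></div>'
TAGGED = inject_ids(SNIPPET)
```

Referring to elements by a short integer keeps the model's output to a few tokens, which is the advantage over XPath generation that the text points to.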
[8]

Sampling: For each instruction x in the training set, we sample N = 5 trajectories from the current SFT checkpoint M_SFT using temperature T = 1.0 to encourage diversity
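To show what the temperature does in this sampling step, here is a toy single-step sketch: candidate actions are drawn from a softmax over scores divided by T. The values N = 5 and T = 1.0 match the text, but the score dictionary and function name are invented stand-ins for the real SFT checkpoint, which samples full token sequences.

```python
import math
import random


def sample_actions(logits: dict[str, float], n: int = 5,
                   temperature: float = 1.0, seed: int = 0) -> list[str]:
    """Draw n actions with probability proportional to exp(logit / temperature)."""
    rng = random.Random(seed)
    actions = list(logits)
    scaled = [logits[a] / temperature for a in actions]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(actions, weights=weights, k=n)


# Hypothetical scores for three candidate actions on one page.
TOY_LOGITS = {
    "element=43 Click": 2.0,
    "element=42 Click": 1.0,
    "element=43 Type": 0.0,
}

DIVERSE = sample_actions(TOY_LOGITS, n=5, temperature=1.0)
GREEDY_LIKE = sample_actions(TOY_LOGITS, n=5, temperature=0.01)
```

At T = 1.0 the lower-scored actions keep a real chance of being drawn, which is what supplies incorrect candidates for the later loser-selection step; as T approaches zero the draws collapse onto the top-scored action.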
[9]

search-bar

Winner Selection (y_w): The ground truth trajectory is always fixed as the winner.

Formatted Model Input (ChatML style)
<|im_start|>system
You are a proficient web navigation agent. Given the HTML content and a user instruction, select the correct element and operation. Output format: Element ID and Operation.
<|im_end|>
<|im_start|>user
Observation (Cle...
[10]

iPhone 13

Type "iPhone 13" into element [15] (Search Box)
Current Instruction: "Click the search button to see results."
<|im_end|>
<|im_start|>assistant
Element:43 Operation:Click
<|im_end|>

Figure 10: The flattened input representation. We inject explicit IDs (e.g., id="42") into the HTML to enable precise referencing. The History block enables the model to maint...
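The prompt fragment above suggests a simple assemble-and-parse loop, sketched below. The `<|im_start|>`/`<|im_end|>` markers and the "Element:43 Operation:Click" output shape come from the excerpt; the exact section labels, the numbering of history steps, and both function names are assumptions, since the original template is truncated on this page.

```python
import re
from typing import Optional


def build_chatml_prompt(system: str, observation: str,
                        history: list[str], instruction: str) -> str:
    """Assemble a ChatML-style prompt that ends where the assistant should answer."""
    history_block = "\n".join(
        f"{i}. {step}" for i, step in enumerate(history, start=1)
    ) or "None"
    user = (
        f"Observation:\n{observation}\n"
        f"History:\n{history_block}\n"
        f'Current Instruction: "{instruction}"'
    )
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{user}\n<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def parse_action(text: str) -> Optional[tuple[int, str]]:
    """Extract (element_id, operation) from a reply like 'Element:43 Operation:Click'."""
    match = re.search(r"Element:\s*(\d+)\s+Operation:\s*(\w+)", text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


PROMPT = build_chatml_prompt(
    system="You are a proficient web navigation agent.",
    observation='<input id="15" type="text"><button id="43">Search</button>',
    history=['Type "iPhone 13" into element [15] (Search Box)'],
    instruction="Click the search button to see results.",
)
```

Returning `None` for an unparseable reply lets the caller count malformed outputs as failed steps instead of raising mid-evaluation.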
[11]

Hardest Loser

Loser Selection (y_l): We select the "Hardest Loser" to penalize. Among the generated trajectories that are incorrect (i.e., wrong element ID or wrong operation type), we select the one with the highest log-probability. This selection strategy specifically targets the model's "blind spots": answers that the model is confident in but are factually wrong (e...
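Taken together, the winner and loser rules describe how each preference pair is built, which the sketch below illustrates at the level of a single step. The `Sample` type, its field names, and `build_preference_pair` are hypothetical; only the rules themselves (ground truth always wins, the highest-log-probability incorrect sample loses) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    element_id: int
    operation: str
    logprob: float  # log-probability the model assigned to this sample


def build_preference_pair(gold: Sample,
                          samples: list[Sample]) -> Optional[tuple[Sample, Sample]]:
    """Return (winner, loser), or None when every sample is already correct."""
    wrong = [
        s for s in samples
        if (s.element_id, s.operation) != (gold.element_id, gold.operation)
    ]
    if not wrong:
        return None  # nothing to contrast against for this instruction
    # "Hardest Loser": the incorrect sample the model was most confident in.
    loser = max(wrong, key=lambda s: s.logprob)
    return gold, loser


# Toy data: five sampled actions for one instruction (values invented).
GOLD = Sample(43, "Click", 0.0)
SAMPLED = [
    Sample(43, "Click", -0.2),   # correct
    Sample(42, "Click", -0.9),   # wrong element, high confidence
    Sample(43, "Type", -1.5),    # wrong operation
    Sample(17, "Click", -3.0),   # wrong element, low confidence
    Sample(42, "Click", -0.9),   # duplicate of the confident mistake
]
PAIR = build_preference_pair(GOLD, SAMPLED)
```

Picking the most confident mistake rather than a random one concentrates the preference signal where the model's ranking is most wrong, which is the "blind spot" argument the text makes.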