Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang · 2023 · cs.CL · arXiv 2306.06070

Mixed citation behavior. Most common role is background (58%).

50 Pith papers citing it

Background: 58% of classified citations

abstract

We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites is often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.

citation-role summary

background 7 · dataset 4 · method 1
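
The 58% background figure quoted at the top of the page appears to follow from these role counts. A minimal sketch of that arithmetic, assuming the percentage is each role's share of the classified citations only (7 + 4 + 1 = 12, not all 50 citing papers); the function name `role_shares` is hypothetical and not part of any Pith tooling:

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 7, "dataset": 4, "method": 1}


def role_shares(counts):
    """Return each role's share of the classified citations, in whole percent."""
    total = sum(counts.values())  # number of classified citations
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)
```

Under that assumption the background share rounds to 58%, which matches the headline figure.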

citation-polarity summary

background 7 · use dataset 4 · use method 1

representative citing papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

cs.AI · 2026-05-19 · conditional · novelty 7.0 · 2 refs

EngiAI introduces a LangGraph-based multi-agent framework and a three-part benchmark suite for LLM-driven engineering design, reporting high task completion rates for proprietary models on Beams2D and Photonics2D problems.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.

MMSkills: Towards Multimodal Skills for General Visual Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

WAAA! Web Adversaries Against Agentic Browsers

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

PlayCoder: Making LLM-Generated GUI Code Playable

cs.SE · 2026-04-21 · conditional · novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption

cs.CY · 2025-10-27 · accept · novelty 7.0

Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

cs.LG · 2024-03-12 · unverdicted · novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

cs.SE · 2026-05-28 · unverdicted · novelty 6.0

GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

Structured Distillation of Web Agent Capabilities Enables Generalization

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

cs.CR · 2026-02-24 · unverdicted · novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

Real-Time Procedural Learning From Experience for AI Agents

cs.AI · 2025-11-27 · unverdicted · novelty 6.0

PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.

Mobile GUI Agents under Real-world Threats: Are We There Yet?

cs.CR · 2025-07-06 · conditional · novelty 6.0

Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.

citing papers explorer

Showing 30 of 30 citing papers after filters.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis cs.AI · 2026-05-24 · unverdicted · none · ref 42 · internal anchor
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design cs.AI · 2026-05-19 · conditional · none · ref 43 · 2 links · internal anchor
EngiAI introduces a LangGraph-based multi-agent framework and a three-part benchmark suite for LLM-driven engineering design, reporting high task completion rates for proprietary models on Beams2D and Photonics2D problems.
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents cs.CL · 2026-05-18 · unverdicted · none · ref 8 · internal anchor
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 7 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces cs.CR · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 8 · 3 links · internal anchor
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation cs.CL · 2026-05-12 · conditional · none · ref 5 · 2 links · internal anchor
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
WAAA! Web Adversaries Against Agentic Browsers cs.CR · 2026-05-06 · unverdicted · none · ref 28 · internal anchor
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 4 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
PlayCoder: Making LLM-Generated GUI Code Playable cs.SE · 2026-04-21 · conditional · none · ref 17 · internal anchor
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents cs.CV · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing cs.SE · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.
Design and Report Benchmarks for Knowledge Work cs.AI · 2026-05-22 · unverdicted · none · ref 41 · internal anchor
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents cs.CV · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
Structured Distillation of Web Agent Capabilities Enables Generalization cs.LG · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents cs.CR · 2026-02-24 · unverdicted · none · ref 61 · internal anchor
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions cs.MA · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
Simulations of 16 LLM agents in a naming game on 8 topologies show memory depth interacts with network structure to flip coordination speed and increase fragmentation in centralized networks.
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark cs.AI · 2026-05-28 · unverdicted · none · ref 10 · internal anchor
Fine-tuned Qwen3-VL-8B reaches sem_sim 0.783 on PiSAR held-out set vs 0.46-0.48 for frontier zero-shot, while Gemma-4-26B scores 0.441.
HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML cs.SE · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.
Tuning Qwen2.5-VL to Improve Its Web Interaction Skills cs.HC · 2026-02-20 · unverdicted · none · ref 4 · internal anchor
Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
Legal Retrieval for Public Defenders cs.IR · 2026-01-20 · conditional · none · ref 8 · internal anchor
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
What makes a harness a harness: necessary and sufficient conditions for an agent harness cs.SE · 2026-06-08 · unverdicted · none · ref 12 · internal anchor
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
CLI-Anything: Towards Agent-Native Computer Use cs.HC · 2026-06-02 · unverdicted · none · ref 16 · internal anchor
CLI-Anything advocates transforming applications into CLI-based protocols for agent-native interaction and introduces the CLI-Hub platform to support this shift away from GUI agents.
Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding cs.AI · 2026-02-25 · unverdicted · none · ref 5 · internal anchor
Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.
The Agentic Web Requires New Normative Infrastructure cs.CY · 2026-06-09 · unverdicted · none · ref 127 · internal anchor
The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 124 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
A Primer in Post-Training Reasoning Data: What We Know About How It Works cs.CL · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
A literature synthesis that organizes post-training reasoning data research around data objects, usefulness factors, construction methods, and scaling behaviors to create an attribution framework.
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models cs.LG · 2026-04-15 · unreviewed · ref 4 · internal anchor
