OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
hub Mixed citations
Mind2Web: Towards a Generalist Agent for the Web
Mixed citation behavior. Most common role is background (58%).
abstract
We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
GuardAgent safeguards LLM agents by generating task plans from safety requests and mapping them to executable guardrail code, achieving over 98% accuracy on a healthcare access-control benchmark and 83% on a web safety benchmark.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker instructions are reinforced.
citing papers explorer
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
WAAA! Web Adversaries Against Agentic Browsers
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
-
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
-
How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
-
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing
GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
Structured Distillation of Web Agent Capabilities Enables Generalization
Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
Real-Time Procedural Learning From Experience for AI Agents
PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.
-
Mobile GUI Agents under Real-world Threats: Are We There Yet?
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
GuardAgent safeguards LLM agents by generating task plans from safety requests and mapping them to executable guardrail code, achieving over 98% accuracy on a healthcare access-control benchmark and 83% on a web safety benchmark.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker instructions are reinforced.
-
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgments 85 percent of the time.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
-
Cognitive Architectures for Language Agents
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
-
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Fine-tuned Qwen3-VL-8B reaches sem_sim 0.783 on PiSAR held-out set vs 0.46-0.48 for frontier zero-shot, while Gemma-4-26B scores 0.441.
-
HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.
-
Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
-
Legal Retrieval for Public Defenders
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
-
SynthAgent: Adapting Web Agents with Synthetic Supervision
SynthAgent uses dual refinement of synthetic tasks and trajectories to produce higher-quality training data that improves web agent adaptation to target environments.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
A survey that defines Computer-Using Agents for safety analysis, categorizes their threats, proposes a taxonomy of defensive strategies, and summarizes benchmarks and datasets for evaluating CUA safety and performance.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.
-
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
-
The Agentic Web Requires New Normative Infrastructure
The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.
-
Toward Native Multimodal Modeling: A Roadmap
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
- EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
- MMSkills: Towards Multimodal Skills for General Visual Agents