Baseline reference

WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757

Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan · 2022

Baseline reference. 60% of classified citations from Pith papers use this work as a benchmark or comparison.

9 Pith papers citing it


citation-role summary

dataset 3 · background 2

citation-polarity summary

use dataset 3 · background 2

representative citing papers

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

cs.AI · 2026-05-21 · conditional · novelty 7.0

SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.

MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

Verifiable Process Rewards for Agentic Reasoning

cs.AI · 2026-05-11

citing papers explorer


SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval · cs.AI · 2026-05-21 · conditional · none · ref 41
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces · cs.AI · 2026-05-12 · unverdicted · none · ref 27
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs · cs.AI · 2026-05-11 · unverdicted · none · ref 27
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium · cs.AI · 2026-05-10 · unverdicted · none · ref 88
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents · cs.AI · 2026-05-07 · unverdicted · none · ref 44
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · cs.CV · 2026-04-09 · unverdicted · none · ref 64
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents · cs.CL · 2026-05-19 · unverdicted · none · ref 42
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory · cs.LG · 2026-05-10 · unverdicted · none · ref 59
Verifiable Process Rewards for Agentic Reasoning · cs.AI · 2026-05-11 · unreviewed · ref 34
