BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
hub
Advances in Neural Information Processing Systems , volume=
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
citing papers explorer
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
Harnesses for Inference-Time Alignment over Execution Trajectories
Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
Memory in the Age of AI Agents
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
- Human-Guided Harm Recovery for Computer Use Agents