ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
3 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Roles: dataset (1)
Polarities: use dataset (1)
Citing papers
- Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
  Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
- SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
  SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.
- The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
  Introduces the Image Reconstruction Game benchmark, showing that the describer model dominates reconstruction quality in multi-turn VLM-generator dialogue, with math images hardest and token budget affecting convergence.