Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: dataset 1
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: dataset (1)
polarities: use dataset (1)
representative citing papers
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.
Introduces Image Reconstruction Game benchmark showing describer model dominates reconstruction quality in multi-turn VLM-generator dialogue, with math images hardest and token budget affecting convergence.
citing papers explorer
The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue