MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
Mirage-1: Augmenting and updating GUI agent with hierarchical multimodal skills. arXiv preprint arXiv:2506.10387
4 Pith papers cite this work.
representative citing papers
PersonalAlign introduces a hierarchical memory agent that uses long-term user records to resolve vague GUI instructions and provide proactive assistance, improving execution by 15.7% and proactive performance by 7.3% on the new AndroidIntent benchmark.
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
BehaviorVLA learns long-horizon behavioral representations via a causal Mamba encoder and a phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, and 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% of the data in sim-to-real transfer.