hub

Scaling instructable agents across many simulated worlds

MariaAbiRaad, ArunAhuja, CatarinaBarros, FredericBesse, AndrewBolt, AdrianBolton, BethanieBrownfield, Gavin Buttimore, Max Cant, Sarah Chakera, et al · 2024 · arXiv 2404.10179

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

cs.SE · 2026-05-09 · conditional · novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

RetroMotion: Retrocausal Motion Forecasting Models are Instructable

cs.CV · 2025-05-26 · unverdicted · novelty 7.0

Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Seq, and enables implicit instruction following from standard training.

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.

CA2: Code-Aware Agent for Automated Game Testing

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

cs.CV · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

Video Game Accessibility through Shared Control for People with Upper-Limb Impairments

cs.HC · 2026-01-16 · unverdicted · novelty 5.0

A configurable framework called GamePals enables shared control via human cooperation or partial automation to improve video game accessibility for people with upper-limb impairments, evaluated in a study with 13 participants.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

cs.AI · 2025-09-02 · conditional · novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Shared Control for Game Accessibility: Understanding Current Human Cooperation Practices to Inform the Design of Partial Automation Solutions

cs.HC · 2025-09-02 · unverdicted · novelty 4.0

Interviews with 14 accessible gamers identify essential shared-control practices, key limitations of human assistance, and design requirements for software agents that could automate support.

citing papers explorer

Showing 11 of 11 citing papers.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries cs.SE · 2026-05-09 · conditional · none · ref 23
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 21
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 69
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
RetroMotion: Retrocausal Motion Forecasting Models are Instructable cs.CV · 2025-05-26 · unverdicted · none · ref 36
Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Seq, and enables implicit instruction following from standard training.
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents cs.CV · 2026-05-18 · unverdicted · none · ref 1
SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
CA2: Code-Aware Agent for Automated Game Testing cs.SE · 2026-05-13 · unverdicted · none · ref 17
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 41
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 137 · 2 links
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Video Game Accessibility through Shared Control for People with Upper-Limb Impairments cs.HC · 2026-01-16 · unverdicted · none · ref 54
A configurable framework called GamePals enables shared control via human cooperation or partial automation to improve video game accessibility for people with upper-limb impairments, evaluated in a study with 13 participants.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning cs.AI · 2025-09-02 · conditional · none · ref 52
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
Shared Control for Game Accessibility: Understanding Current Human Cooperation Practices to Inform the Design of Partial Automation Solutions cs.HC · 2025-09-02 · unverdicted · none · ref 54
Interviews with 14 accessible gamers identify essential shared-control practices, key limitations of human assistance, and design requirements for software agents that could automate support.

Scaling instructable agents across many simulated worlds

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer