Coachable agents for interactive gameplay


arxiv: 2607.00642 · v1 · pith:GRANV7HWnew · submitted 2026-07-01 · 💻 cs.AI · cs.LG


Roberto Capobianco (1), Harm van Seijen (2), Nolan D. Bard (2), Neil Burch (2), Fatima Davelouis (2), Josh Davidson (2), Alisa Devlic (1), Yunshu Du (2)


Ishan Durugkar (2), Siddhant Gangapurwala (2), Daniel Hernandez (2), G. Zacharias Holland (2), Sahil Jain (2), Kenta Kawamoto (3), Raksha Kumaraswamy (2), Patrick MacAlpine (2), Dustin R. Morrill (2), Declan Oller (2), Francesco Riccio (1), Akanksha Saran (2), Craig Sherstan (3), Kaushik Subramanian (1), Thomas J. Walsh (2), Samuel Barrett (2), Kizza N. Frisbee (2), Mady Govil (2), Johannes Günther (2), Varun R. Kompella (2), James A. MacGlashan (2), Maxwell Svetlik (2), Michael D. Thomure (2), Jaden B. Travnik (2), Kevin Waugh (2), Elahe Aghapour (2), Florian Fuchs (1), Andreanne Lemay (2), Shruti Mishra (1), Takuma Seno (3), Peter Stone (2), Michael Spranger (3), Peter R. Wurman (2)

(1) Sony AI, Zurich, Switzerland; (2) Sony AI, North America, various locations; (3) Sony AI, Tokyo, Japan


Pith reviewed 2026-07-02 12:49 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords task, agents, behavior, control, domain, domains, final, framework


The pith

A framework combining universal value function approximators with targeted training scenarios and data augmentation produces RL agents that adapt to user-specified styles in real time across video games and humanoid domains while preserving core task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning agents usually learn one fixed way to solve a task through trial and error. This work adds the ability for a human coach to request different styles of behavior on the fly, such as aggressive versus defensive play in a game. The method uses universal value function approximators, which allow the agent to condition its actions on both the task goal and the requested style. Training includes carefully chosen scenarios, specific learning algorithms, and data augmentation to make the style requests effective. Demonstrations occur in car racing, stylized combat in a AAA game, and humanoid walking. In each case the agents follow the style instructions while still finishing the main objective. The key practical result is that the final behavior can be selected by an end user at runtime rather than being locked in during training.
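To make the conditioning idea concrete, here is a minimal sketch of a style-conditioned value function. It is a toy and not the paper's architecture: the linear Q-function, the one-hot style vectors, the two actions, and the reward rule are all assumptions chosen for illustration. The only point it shows is that a single set of weights, given the style as an extra input, can return different greedy actions depending on the style supplied at run time.

```python
import random

# Hypothetical styles and actions, invented for this sketch.
STYLES = {"defensive": [1.0, 0.0], "aggressive": [0.0, 1.0]}
ACTIONS = ["brake", "accelerate"]  # action 0 suits "defensive", action 1 "aggressive"


class StyleConditionedQ:
    """Toy linear UVFA-style Q-function: Q(s, a, z) = w_a . [s, z, 1]."""

    def __init__(self, state_dim, style_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.n_actions = n_actions
        in_dim = state_dim + style_dim + 1  # +1 for a bias feature
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(n_actions)]

    def features(self, state, style):
        # The style vector is simply concatenated to the state.
        return list(state) + list(style) + [1.0]

    def q_values(self, state, style):
        x = self.features(state, style)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def act(self, state, style):
        # Greedy action under the requested style.
        q = self.q_values(state, style)
        return max(range(self.n_actions), key=lambda a: q[a])

    def update(self, state, style, action, reward, lr=0.1):
        # Episodes here last one step, so the target is just the reward.
        x = self.features(state, style)
        error = reward - self.q_values(state, style)[action]
        row = self.weights[action]
        for i, xi in enumerate(x):
            row[i] += lr * error * xi
        return error


def train_toy_agent(steps=2000, seed=0):
    """Train on randomly sampled styles so one model covers all of them."""
    rng = random.Random(seed)
    agent = StyleConditionedQ(state_dim=1, style_dim=2, n_actions=2, seed=seed)
    state = [1.0]  # a single dummy state
    style_vectors = list(STYLES.values())
    for _ in range(steps):
        style = rng.choice(style_vectors)
        action = rng.randrange(2)  # uniform exploration
        reward = 1.0 if style[action] == 1.0 else 0.0
        agent.update(state, style, action, reward)
    return agent


if __name__ == "__main__":
    agent = train_toy_agent()
    for name, vec in STYLES.items():
        print(name, "->", ACTIONS[agent.act([1.0], vec)])
```

Sampling a style per episode during training is what lets the style be swapped at run time without retraining. A real agent would replace the linear model with a neural network and the one-step reward with a bootstrapped return.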

Core claim

Each agent shows strong coherence to the style requests while still satisfying the main task in its domain. Importantly, the techniques outlined in this paper allow an end user to choose the final behavior at run time, giving them flexible control over the final executed performance.

Load-bearing premise

That carefully selected training scenarios, learning algorithms, and data augmentation can encode arbitrary styles via UVFAs without degrading core task performance or requiring domain-specific redesign for each new style.

Original abstract

Reinforcement learning has proven to be a valuable tool in the creation of advanced AI and robotic systems, contributing to everything from game playing to robotics to foundation models. Through trial-and-error, these AI systems typically learn one, near-optimal behavior to solve their tasks. However, there are many use cases in which one would like to assert some level of control, preferably in real time, over how the task is solved. We refer to these modifications of a core task as styles. We combine universal value function approximators (UVFAs) with carefully selected training scenarios, learning algorithms, and data augmentation to create a framework for coaching agents that exhibit styles in complex domains. We demonstrate the framework's application in the AAA video games Horizon Forbidden West and Gran Turismo, and in an open-source humanoid test domain. Despite the different nature of the domains -- car racing, stylized game combat, and humanoid walking -- each agent shows strong coherence to the style requests while still satisfying the main task in its domain. Importantly, the techniques outlined in this paper allow an end user to choose the final behavior at run time, giving them flexible control over the final executed performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical recipe for runtime style control of RL agents in AAA games via UVFAs, with demos across racing, combat, and walking that hold together on the main claims.


The main point is that this work shows how to train UVFAs so an end user can pick agent styles at runtime in real game domains while the core task still gets done. They combine the universal approximators with targeted scenarios, algorithms, and augmentation, then test in Gran Turismo, Horizon Forbidden West, and a humanoid walker.

What stands out as new is the focus on interactive coaching for complex, already-deployed environments rather than just adding a style input in a toy setting. The cross-domain results are the strongest part: the same basic approach produces coherent styles in car racing, stylized combat, and locomotion without obvious domain-specific rewrites.

The soft spot is the lack of hard numbers. The abstract and description assert strong coherence and preserved task performance, but there are no reported deltas on lap times, win rates, or success metrics with versus without the style conditioning. That makes it difficult to judge whether the auxiliary input creates any capacity or optimization cost. If the full paper has those comparisons, they should be front and center; if not, the central claim rests more on qualitative demonstration than on measured trade-offs.
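The kind of comparison asked for here is simple to state. The sketch below computes the absolute and relative change in a mean task score between an unconditioned baseline and a style-conditioned agent. The function name and the lap times are hypothetical placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean


def task_cost_of_style(baseline_scores, styled_scores):
    """Return the change in mean task score when style conditioning is added."""
    base = mean(baseline_scores)
    styled = mean(styled_scores)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    delta = styled - base
    return {
        "baseline_mean": base,
        "styled_mean": styled,
        "delta": delta,
        "relative_delta": delta / base,
    }


# Placeholder lap times in seconds (lower is better), NOT measured data.
baseline_laps = [90.0, 91.0, 89.0]
styled_laps = [92.0, 93.0, 91.0]

report = task_cost_of_style(baseline_laps, styled_laps)
print(f"delta: {report['delta']:+.2f} s ({report['relative_delta']:+.2%})")
```

The same function applies to win rates or success rates, with the sign of a good delta flipped where higher is better.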

This is for applied RL groups working on games or robotics who already use value-function methods and want controllable behavior. It is not a theoretical advance on UVFAs themselves.

I would send it to peer review. The domains are demanding enough that the engineering details matter, and referees can push for the missing quantitative checks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are deferred to the full manuscript, which was unavailable for review.

pith-pipeline@v0.9.1-grok · 6034 in / 1101 out tokens · 16505 ms · 2026-07-02T12:49:59.569279+00:00 · methodology
