Learning a Behavioral Repertoire from Demonstrations

Daniel Cabarcas Jaramillo; Jean-Baptiste Mouret; Miguel Gonzalez Duque; Niels Justesen; Sebastian Risi

arxiv: 1907.03046 · v1 · pith:DDTD3JGSnew · submitted 2019-07-05 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-25 01:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords imitation learning · behavioral repertoire · StarCraft II · PCA · policy conditioning · UCB1 · demonstrations · build-order planning

The pith

A single neural network policy conditioned on PCA-derived behavior vectors can express distinct behaviors from demonstrations and adapt via UCB1 to outperform standard imitation learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Behavioral Repertoire Imitation Learning to move beyond the single average policy produced by conventional imitation learning. It augments state-action pairs from 7,777 StarCraft II human replays with low-dimensional behavior descriptions obtained by applying PCA to army unit compositions. A single neural network is then trained to output actions conditioned on these descriptions, enabling precise modulation of the policy's behavior. The method further combines the conditioned policy with the UCB1 algorithm to select behaviors between games, yielding higher performance than an unconditioned imitation learning baseline.

Core claim

By augmenting state-action pairs with behavioral descriptions obtained from PCA on army unit composition, a single neural network policy can be trained to express a repertoire of distinct behaviors from human demonstrations; this policy can be manipulated by changing the conditioning vector and further adapted between games using UCB1 to exceed the performance of a traditional imitation learning policy on StarCraft II build-order planning.

What carries the argument

Behavioral Repertoire Imitation Learning (BRIL), a method that augments demonstrations with PCA-derived low-dimensional behavior vectors and trains a policy network conditioned on those vectors.
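
The augmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy unit counts, the choice of two principal components, the state features, and the helper names `fit_pca`, `project`, and `augment` are all assumptions made for the example, and any normalization the paper applies before PCA is not reproduced here.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return (mean, components) for PCA computed via SVD on centered data."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Map high-dimensional compositions to low-dimensional behavior vectors."""
    return (X - mean) @ components.T

def augment(states, z):
    """Append one demonstration-level behavior vector z to every state row."""
    return np.hstack([states, np.tile(z, (states.shape[0], 1))])

# Hypothetical toy data: 6 demonstrations, 4 unit types (unit counts per replay).
rng = np.random.default_rng(0)
compositions = rng.integers(0, 30, size=(6, 4)).astype(float)

mean, comps = fit_pca(compositions, n_components=2)
Z = project(compositions, mean, comps)   # one 2-D behavior vector per replay

# Hypothetical states from demonstration 0: 5 time steps, 3 features each.
states_demo0 = rng.random((5, 3))
augmented = augment(states_demo0, Z[0])  # policy input: state features plus z
print(augmented.shape)
```

A policy network trained on `augmented` inputs sees the same behavior vector at every step of a replay, so at test time the vector can be swapped to request a different region of the behavioral space.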

If this is right

The learned policy can be effectively manipulated to express distinct behaviors.
Applying the UCB1 algorithm allows adaptation of the policy's behavior in-between games.
This adaptation reaches performance beyond that of the traditional IL baseline approach.
The approach applies to build-order planning trained on 7,777 human replays.
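
The between-game adaptation can be sketched with textbook UCB1, treating each candidate behavior vector as a bandit arm. How the paper chooses its candidate behaviors and defines the reward is not stated in this summary, so the three arms, the win probabilities, and the `simulated_game` stand-in below are hypothetical.

```python
import math
import random

def ucb1_select(counts, totals):
    """Pick an arm by UCB1; every arm is played once before the bound is used."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total_plays = sum(counts)
    scores = [
        totals[i] / counts[i] + math.sqrt(2.0 * math.log(total_plays) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_ucb1(play_game, n_arms, n_games):
    """Play n_games, choosing one arm (behavior) per game; return the statistics."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(n_games):
        arm = ucb1_select(counts, totals)
        reward = play_game(arm)  # e.g. 1.0 for a win, 0.0 for a loss
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals

# Hypothetical stand-in for playing one game with the behavior behind each arm.
WIN_PROB = [0.2, 0.5, 0.8]
_rng = random.Random(0)

def simulated_game(arm):
    return 1.0 if _rng.random() < WIN_PROB[arm] else 0.0

counts, totals = run_ucb1(simulated_game, n_arms=3, n_games=300)
print(counts)
```

In a real setting `play_game` would run a full match with the policy conditioned on the behavior vector assigned to the chosen arm, so the policy weights stay fixed and only the conditioning input changes between games.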

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The conditioning mechanism could be applied in other games or control tasks where demonstrations contain varied strategies by identifying suitable low-dimensional descriptors.
Interpolating between vectors in the learned behavioral space might produce useful hybrid behaviors not present in the original demonstrations.
A repertoire approach could reduce the need to train and maintain entirely separate policies for different play styles.

Load-bearing premise

The low-dimensional vectors obtained by applying PCA to army unit composition in the demonstrations capture distinct, meaningful, and controllable behaviors that can be used to condition the policy.

What would settle it

If changing the conditioning vector produces no measurable difference in the agent's observed army compositions or build orders during gameplay, the claim that behaviors are effectively manipulable would be falsified.

Original abstract

Imitation Learning (IL) is a machine learning approach to learn a policy from a dataset of demonstrations. IL can be useful to kick-start learning before applying reinforcement learning (RL) but it can also be useful on its own, e.g. to learn to imitate human players in video games. However, a major limitation of current IL approaches is that they learn only a single "average" policy based on a dataset that possibly contains demonstrations of numerous different types of behaviors. In this paper, we propose a new approach called Behavioral Repertoire Imitation Learning (BRIL) that instead learns a repertoire of behaviors from a set of demonstrations by augmenting the state-action pairs with behavioral descriptions. The outcome of this approach is a single neural network policy conditioned on a behavior description that can be precisely modulated. We apply this approach to train a policy on 7,777 human replays to perform build-order planning in StarCraft II. Principal Component Analysis (PCA) is applied to construct a low-dimensional behavioral space from the high-dimensional army unit composition of each demonstration. The results demonstrate that the learned policy can be effectively manipulated to express distinct behaviors. Additionally, by applying the UCB1 algorithm, we are able to adapt the behavior of the policy - in-between games - to reach a performance beyond that of the traditional IL baseline approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BRIL conditions an IL policy on PCA vectors from StarCraft army compositions to get steerable behaviors and a modest UCB1 gain over plain imitation, but the abstract gives little evidence that the vectors actually drive distinct dynamic actions.

The paper takes a dataset of 7,777 human StarCraft II replays and augments each state-action pair with a low-dimensional vector obtained by applying PCA to the demonstration's army unit composition. A single network is then trained to output actions conditioned on that vector, so that changing the input at test time is meant to produce different build-order behaviors. The authors also run UCB1 over the behavior space between games and report better performance than an unconditioned IL baseline.

That is the main new piece: a simple way to turn one imitation dataset into a controllable repertoire without training separate policies. The UCB1 adaptation step is a clean, low-overhead addition that fits the setting. The work is empirical and the setup is described clearly enough that the method can be reimplemented from the abstract alone.

The central assumption is that the PCA dimensions capture controllable strategic differences rather than static aggregates like final army size. The abstract asserts that the policy can be manipulated to express distinct behaviors, yet it supplies no numbers on action-distribution shift, trajectory divergence, or whether the network actually attends to the conditioning input. Without those checks it is possible the reported gains come from something weaker than claimed.

This is aimed at people doing imitation learning in games who already have multi-style demonstration data and want to avoid training multiple models. A reader working on conditioned policies or test-time adaptation in sequential decision tasks would find the concrete StarCraft II experiment useful to examine. The paper deserves a serious referee because the method is reproducible from the description and the adaptation result, even if preliminary, is worth verifying against the stated limitation of single-policy IL.

Referee Report

2 major / 2 minor

Summary. The paper proposes Behavioral Repertoire Imitation Learning (BRIL), an imitation learning method that augments state-action pairs from 7,777 StarCraft II human replays with low-dimensional behavioral descriptors obtained by applying PCA to army unit composition. This produces a single neural network policy conditioned on the PCA vector that can be modulated to express different behaviors; the authors further combine the conditioned policy with the UCB1 algorithm for between-game adaptation and report that it outperforms a standard (unconditioned) IL baseline on build-order planning.

Significance. If the central empirical claims hold, the work provides a lightweight way to extract and control a repertoire of behaviors from a heterogeneous demonstration set without training multiple policies, and shows a practical online adaptation loop via UCB1. This could be useful in game AI and robotics domains where demonstrations contain multiple strategies. The approach is entirely empirical and does not rely on any parameter-free derivations or machine-checked proofs.

major comments (2)

[Abstract] Abstract and results: the central claim that the learned policy 'can be effectively manipulated to express distinct behaviors' is load-bearing for the contribution, yet the manuscript provides no quantitative verification that different values of the PCA conditioning vector produce measurably different action distributions or unit-build trajectories on identical states. Without such a check (e.g., KL divergence between action distributions or trajectory statistics), it remains possible that the policy ignores the conditioning input or that the PCA dimensions primarily capture static aggregate statistics rather than dynamic strategic choices.
[Abstract] Abstract and evaluation: the claim that UCB1 adaptation 'reach[es] a performance beyond that of the traditional IL baseline' is presented without reported metrics, statistical tests, baseline implementation details, number of evaluation games, or controls for confounds such as exploration budget or state distribution shift. These omissions make it impossible to assess whether the reported outperformance is robust or reproducible.
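
The check requested in the first major comment could look roughly like this: evaluate the policy on identical states under two conditioning vectors and average the KL divergence between the resulting action distributions. The `toy_policy` below is a hypothetical stand-in for the trained network, and `mean_kl` is an illustrative helper, not something taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(policy, states, z_a, z_b):
    """Average KL between action distributions under z_a and z_b on the same states."""
    return sum(kl_divergence(policy(s, z_a), policy(s, z_b)) for s in states) / len(states)

# Hypothetical conditioned policy: 2 state features, 2-D behavior vector, 3 actions.
def toy_policy(state, z):
    logits = [state[0] + z[0], state[1] + z[1], 0.0]
    return softmax(logits)

states = [(0.0, 0.0), (1.0, -1.0), (-0.5, 0.5)]
print(mean_kl(toy_policy, states, (2.0, 0.0), (0.0, 2.0)))  # positive: z changes actions
print(mean_kl(toy_policy, states, (1.0, 1.0), (1.0, 1.0)))  # zero: identical conditioning
```

A value near zero across many pairs of conditioning vectors would indicate that the policy ignores its conditioning input, which is the failure mode the referee raises.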

minor comments (2)

[Method] The manuscript should clarify the exact dimensionality chosen for the PCA behavioral space and whether any ablation on this choice was performed.
[Method] Notation for the augmented state (s, a, z) where z is the PCA vector should be introduced consistently in the method section and used throughout the experiments.
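
One consistent way to write the notation requested above is sketched here. It assumes the policy is trained by standard behavioral cloning with a negative log-likelihood loss, which this summary does not confirm; the symbols $c^{(i)}$ (army unit composition of demonstration $i$), $\bar{c}$ (mean composition), $W$ (matrix of the top $k$ principal directions), and $T_i$ (length of demonstration $i$) are introduced for illustration only.

```latex
% Assumed notation: behavior vector of demonstration i, and a conditioned
% behavioral-cloning loss over augmented tuples (s, a, z).
\[
z^{(i)} = W^{\top}\bigl(c^{(i)} - \bar{c}\bigr) \in \mathbb{R}^{k},
\qquad
\mathcal{L}(\theta) = -\frac{1}{\sum_{i=1}^{N} T_i}
  \sum_{i=1}^{N} \sum_{t=1}^{T_i}
  \log \pi_{\theta}\bigl(a^{(i)}_t \mid s^{(i)}_t,\, z^{(i)}\bigr)
\]
```

Under this reading, every tuple from the same demonstration shares one $z^{(i)}$, and the unconditioned IL baseline corresponds to dropping $z^{(i)}$ from the policy input.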

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that additional quantitative details are needed to support the central claims.

Referee: [Abstract] Abstract and results: the central claim that the learned policy 'can be effectively manipulated to express distinct behaviors' is load-bearing for the contribution, yet the manuscript provides no quantitative verification that different values of the PCA conditioning vector produce measurably different action distributions or unit-build trajectories on identical states. Without such a check (e.g., KL divergence between action distributions or trajectory statistics), it remains possible that the policy ignores the conditioning input or that the PCA dimensions primarily capture static aggregate statistics rather than dynamic strategic choices.

Authors: We agree that the manuscript would be strengthened by explicit quantitative verification of the conditioning effect. The current version relies on qualitative examples of distinct build orders; we will add new analysis in the revision, including KL divergence between action distributions and statistics on unit-build trajectories for different PCA vectors evaluated on identical states. This will directly address whether the policy utilizes the conditioning input and whether the PCA dimensions capture dynamic choices.
Revision: yes

Referee: [Abstract] Abstract and evaluation: the claim that UCB1 adaptation 'reach[es] a performance beyond that of the traditional IL baseline' is presented without reported metrics, statistical tests, baseline implementation details, number of evaluation games, or controls for confounds such as exploration budget or state distribution shift. These omissions make it impossible to assess whether the reported outperformance is robust or reproducible.

Authors: The referee correctly identifies that the UCB1 results lack sufficient detail for reproducibility. In the revised manuscript we will expand the evaluation section to report the specific performance metrics, number of games, statistical tests, baseline implementation details, and controls for exploration budget and distribution shift. These additions will allow readers to assess the robustness of the adaptation results.
Revision: yes

Circularity Check

0 steps flagged

Empirical IL method with no circular derivation or self-referential prediction

The paper describes an empirical procedure: PCA is applied once to the unit-composition vectors of 7,777 fixed demonstrations to produce conditioning inputs; a single network is then trained on the augmented (state, action, PCA-vector) tuples. No equation or result is defined in terms of itself, no fitted parameter is later renamed a 'prediction,' and no uniqueness theorem or ansatz is imported via self-citation. All performance claims rest on held-out evaluation and UCB1 adaptation performed after training, which are independent of the training procedure itself. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that PCA on unit composition yields controllable behavioral descriptors and that demonstrations contain sufficient diversity for the conditioned policy to learn distinct behaviors. No free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

Domain assumption: Demonstrations contain multiple distinct behaviors that can be captured in a low-dimensional space via PCA on unit compositions.
Invoked to construct the behavioral space from the 7,777 replays and to enable conditioning.

pith-pipeline@v0.9.0 · 5777 in / 1269 out tokens · 38999 ms · 2026-05-25T01:58:26.810708+00:00 · methodology
