pith. sign in

arxiv: 2208.01899 · v2 · submitted 2022-08-03 · 💻 cs.LG

Understanding Adversarial Imitation Learning in Small Sample Regime: A Stage-coupled Analysis

Pith reviewed 2026-05-24 11:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords adversarial imitation learningTV-AILimitation gapsmall sample regimehorizon-free bounddynamic programminglocomotion controlMarkov decision process
0
0 comments X

The pith

TV-AIL achieves a horizon-free imitation gap of O(min{1, sqrt{|S|/N}}) on locomotion control instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Adversarial imitation learning can match expert performance using very few trajectories even in long-horizon tasks such as locomotion control. The paper provides a theoretical explanation for total-variation-based AIL by deriving an imitation gap bound that does not grow with the planning horizon. On abstracted instances from these tasks, the gap is at most the minimum of 1 and roughly the square root of the state space size divided by the number of trajectories. This bound holds in the small-sample regime and explains why performance remains good despite limited data and long horizons. The derivation uses a new stage-coupled dynamic programming analysis of the multi-stage policy optimization in TV-AIL.

Core claim

For TV-AIL on a class of instances abstracted from locomotion control tasks, the imitation gap is O(min{1, sqrt{|S|/N}}). This bound is independent of the planning horizon and is meaningful for both small and large N. The analysis relies on the structure of multi-stage policy optimization and employs dynamic programming in a stage-coupled manner.

What carries the argument

Stage-coupled dynamic programming analysis of the multi-stage policy optimization structure in TV-AIL

If this is right

  • The imitation gap is at most 1 regardless of horizon length.
  • The bound scales with sqrt{|S|/N} when N is small relative to |S|.
  • TV-AIL can achieve expert-level performance with a single expert trajectory on these instances.
  • The performance does not degrade as the task horizon increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the multi-stage structure is present in other AIL methods, similar horizon-free bounds may hold.
  • The result suggests that tabular MDPs with small state spaces can be imitated effectively from limited data.
  • Extensions to continuous control might involve similar scaling after discretization.

Load-bearing premise

The result applies specifically to a class of instances abstracted from locomotion control tasks where the multi-stage policy optimization can be analyzed with dynamic programming.

What would settle it

An experiment showing that the imitation gap grows with the planning horizon on the class of instances considered in the paper would falsify the horizon-free claim.

Figures

Figures reproduced from arXiv: 2208.01899 by Tian Xu, Yang Yu, Zhi-Quan Luo, Ziniu Li.

Figure 1
Figure 1. Figure 1: An illustration of the Hopper task in the MuJoCo locomotion benchmark. Top row: by taking [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A simple MDP corresponding to Assumption [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A simple MDP corresponding to Assumption [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A simple example to show that TV-AIL’s objective in ( [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗
read the original abstract

Imitation learning learns a policy from expert trajectories. While the expert data is believed to be crucial for imitation quality, it was found that a kind of imitation learning approach, adversarial imitation learning (AIL), can have exceptional performance. With as little as only one expert trajectory, AIL can match the expert performance even in a long horizon, on tasks such as locomotion control. There are two mysterious points in this phenomenon. First, why can AIL perform well with only a few expert trajectories? Second, why does AIL maintain good performance despite the length of the planning horizon? In this paper, we theoretically explore these two questions. For a total-variation-distance-based AIL (called TV-AIL), our analysis shows a horizon-free imitation gap $\mathcal O(\{\min\{1, \sqrt{|\mathcal S|/N} \})$ on a class of instances abstracted from locomotion control tasks. Here $|\mathcal S|$ is the state space size for a tabular Markov decision process, and $N$ is the number of expert trajectories. We emphasize two important features of our bound. First, this bound is meaningful in both small and large sample regimes. Second, this bound suggests that the imitation gap of TV-AIL is at most 1 regardless of the planning horizon. Therefore, this bound can explain the empirical observation. Technically, we leverage the structure of multi-stage policy optimization in TV-AIL and present a new stage-coupled analysis via dynamic programming

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes total-variation adversarial imitation learning (TV-AIL) and derives, via stage-coupled dynamic programming on the multi-stage policy optimization structure, a horizon-free imitation gap bound of O(min{1, sqrt{|S|/N}}) that holds on a class of MDPs abstracted from locomotion control tasks. The authors claim this bound explains both the strong empirical performance of AIL with very few expert trajectories and its robustness to long planning horizons.

Significance. A non-vacuous horizon-free bound that also decays with N in the small-sample regime would be a useful theoretical contribution to understanding AIL. The stage-coupled DP technique itself is a technical strength. However, the specific form of the bound limits its ability to explain small-sample behavior beyond the trivial gap ≤ 1 that holds for any policy.

major comments (2)
  1. [Abstract] Abstract: the stated bound evaluates to 1 whenever N ≤ |S|. This is the trivial upper bound on any normalized imitation gap and holds for arbitrary algorithms; it therefore does not explain why TV-AIL achieves small gaps with N ≪ |S| (the first mystery posed in the abstract). The claim that the bound is 'meaningful in both small and large sample regimes' is not supported by the displayed expression.
  2. [Abstract] Abstract, final paragraph: the result is restricted to 'a class of instances abstracted from locomotion control tasks' whose multi-stage structure permits the DP analysis. Because the class is defined post-hoc to admit the analysis, it is unclear whether the bound applies to the actual MDPs on which the empirical claims are based or whether the abstraction removes the very features that would make the small-sample regime non-trivial.
minor comments (1)
  1. [Abstract] Abstract: the big-O expression contains an extraneous opening brace: O({min{1, …}}).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the stated bound evaluates to 1 whenever N ≤ |S|. This is the trivial upper bound on any normalized imitation gap and holds for arbitrary algorithms; it therefore does not explain why TV-AIL achieves small gaps with N ≪ |S| (the first mystery posed in the abstract). The claim that the bound is 'meaningful in both small and large sample regimes' is not supported by the displayed expression.

    Authors: We agree that the bound reduces to the trivial value of 1 whenever N ≤ |S|, providing no non-trivial insight into why TV-AIL can achieve small gaps for N ≪ |S|. This observation is correct and we will revise the abstract to remove or qualify the claim that the bound is 'meaningful in both small and large sample regimes,' instead emphasizing its horizon independence (addressing the second mystery) and its decay for N > |S|. revision: yes

  2. Referee: [Abstract] Abstract, final paragraph: the result is restricted to 'a class of instances abstracted from locomotion control tasks' whose multi-stage structure permits the DP analysis. Because the class is defined post-hoc to admit the analysis, it is unclear whether the bound applies to the actual MDPs on which the empirical claims are based or whether the abstraction removes the very features that would make the small-sample regime non-trivial.

    Authors: The class is constructed to isolate the multi-stage policy optimization structure present in locomotion tasks that enables the stage-coupled DP analysis. We will add text to the manuscript providing a more explicit mapping from this class to the MDPs appearing in the experiments, along with a discussion of the abstraction's scope and limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation via stage-coupled DP is self-contained

full rationale

The paper derives the stated imitation gap bound O(min{1, sqrt{|S|/N}}) explicitly via a new stage-coupled dynamic programming analysis that exploits the multi-stage policy optimization structure of TV-AIL on the abstracted class of instances. No parameters are fitted to data and then relabeled as predictions, no self-citations are invoked to justify uniqueness or ansatzes, and the bound expression is not shown to reduce to its own inputs by definition. The analysis is presented as independent of external fitted values or prior author results, making the central claim self-contained against the paper's own equations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The bound rests on the existence of a suitable class of MDPs abstracted from locomotion tasks and on the multi-stage optimization structure of TV-AIL being amenable to dynamic programming coupling; no explicit free parameters or invented entities are stated.

axioms (2)
  • domain assumption The MDP is tabular with finite state space size |S|
    Directly used in the stated bound; invoked in the abstract when defining the imitation gap.
  • ad hoc to paper The instances belong to the class abstracted from locomotion control tasks
    The bound is claimed only for this class; location: abstract, sentence introducing the bound.

pith-pipeline@v0.9.0 · 5800 in / 1351 out tokens · 26322 ms · 2026-05-24T11:03:45.979235+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms

    cs.LG 2026-05 unverdicted novelty 8.0

    OPT-AIL provides the first provably efficient adversarial imitation learning algorithms under general function approximation, achieving polynomial expert sample and interaction complexity.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper

  1. [1]

    In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE 2 (s2, a) = dπE 2 (s2, a)

    Finally, by the standard analysis of dynamic programming [Bertsekas, 2012], we know thatπ⋆ = ( π⋆ 1, π⋆ 2 ) is identical to πAIL = ( πAIL 1 , πAIL 2 ), where πAIL is the optimal solution to the original objective in (5). In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE ...

  2. [2]

    In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s, a) − ˆdπE H (s, a)|

    On the other hand, by triangle inequality, we have ⏐⏐⏐⏐ ∑ (s,a)∈S ×A dπE H (s, a)rH(s, a) − dπAIL H (s, a)rH(s, a) ⏐⏐⏐⏐ ≤ ∑ (s,a)∈S ×A ⏐⏐⏐⏐dπE H (s, a) − ˆdπE H (s, a) ⏐⏐⏐⏐ + ⏐⏐⏐⏐ ˆdπE H (s, a) − dπAIL H (s, a) ⏐⏐⏐⏐ . In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s,...

  3. [3]

    (24) Then we consider the second term in (24)

    Then we have that V(πAIL) − V(π) ≤ 4 H ∑ h=1 h−1 ∑ ℓ=1 ∑ s∈S G dπ ℓ (s) ( 1 − πℓ(a1|s) ) + ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) . (24) Then we consider the second term in (24). For the last time step H, notice that by construction, we have πAIL H (a1|s) = min{ˆdπE H (s)/dπE H (s), 1}. Then we get ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) ≤ ∑ s∈Sπ...

  4. [4]

    To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]

    First, we consider the small sample regime where N ≼ |X |. To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]. Specifically, we construct a multinomial distributionQ′ as follows. Q′ = ( Q′(1), . . . ,Q′(|X | − 1), Q′(|X |) ) = ( 1 |...

  5. [5]

    The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]

    Then we get that max Q∈Q E [Q − ˆQ  1 ] ≥ E [Q′ − ˆQ  1 ] ≥ 1 3ec ≿ 1, which completes the proof in the small sample regime. The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]. B.3 Proof of Proposition 4 Proof of Proposition

  6. [6]

    To apply such a regret bound, we need to verify that

    Lemma 12 is a direct consequence of the regret bound of online gradient descend [Shalev- Shwartz, 2012]. To apply such a regret bound, we need to verify that

  7. [7]

    D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL

    Then, invoking Corollary 2.7 in [Shalev-Shwartz, 2012] with B = √ H|S ||A| and L = 2 √ H finishes the proof. D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL. Recall the objective of TV-AIL: min π∈Π H ∑ h=1 ∑ (s,a)∈S ×A |dπ h (s, a) − ˆdπE h (s, a)|, (41) Two optimization approaches can solve the above ...

  8. [8]

    1: Obtain the estimation ˆdπE h in (4)

    Algorithm 1 TV-AIL via Gradient-based Method Input: expert demonstrations D, number of iterations T, step size η(t), and initialization c(1). 1: Obtain the estimation ˆdπE h in (4). 2: for t = 1, 2,· · · , T do 3: π(t) ← solve the optimal policy with the reward function −c(t) up to an error of εopt. 4: Compute the state-action distribution dπ(t) h for π(t...

  9. [9]

    Please refer to Appendix C.2.3 for the proof

    Consider Algorithm 1, then we have T ∑ t=1 f (t) ( c(t) ) − min c∈CTV T ∑ t=1 f (t)(c) ≤ 2H √ 2|S ||A|T, where f (t)(c) = H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a)(ˆdπE h (s, a) − dπ(t) h (s, a)). Please refer to Appendix C.2.3 for the proof. Basically, Lemma 12 is a direct consequence of the regret bound of online gradient descent [Shalev-Shwartz, 2012]. Now, we pro...

  10. [10]

    Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details

    The analysis here is based on the proof strategy of [Syed and Schapire, 2007, Theorem 2]. Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details. Proof of Proposition

  11. [11]

    With the dual representation ofℓ1-norm and the celebrated min-max theorem Bertsekas [2016], we have min π∈Π H ∑ h=1 dπ h − ˆdπE h  1 = − min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) . Furthermore, we have that min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) ≤ max π∈Π H ∑ h=1 ∑ (s,a...

  12. [12]

    H = 100 H = 500 H = 1000 H = 2000 TV-AIL 0.69 ± 0.00 0.70 ± 0.00 0.71 ± 0.00 0.71 ± 0.00 FEM 0.58 ± 0.00 0.57 ± 0.00 0.58 ± 0.00 0.58 ± 0.00 GTAL 0.80 ± 0.00 0.81 ± 0.00 0.81 ± 0.19 0.74 ± 0.32 GAIL 0.94 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 59 Table 8: Imitation gap on the lower bound instance with N =

  13. [13]

    H = 100 H = 500 H = 1000 H = 2000 TV-AIL 49.50 ± 0.01 247.50 ± 0.00 495.00 ± 0.01 990.00 ± 0.00 FEM 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 GTAL 50.05 ± 4.93 250.52 ± 3.96 495.05 ± 4.95 989.06 ± 4.85 GAIL 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 Table 9: Imitation gap on the lower bound instance with N =

  14. [14]

    We use the expert dataset collected by the trained online SAC [Haarnoja et al., 2018] with 1 million steps

    H = 100 H = 500 H = 1000 H = 2000 TV-AIL 18.23 ± 1.63 94.30 ± 6.85 179.80 ± 16.31 360.05 ± 38.21 FEM 18.18 ± 1.63 92.85 ± 7.06 192.21 ± 15.89 378.15 ± 30.52 GTAL 21.52 ± 2.22 94.63 ± 7.83 188.02 ± 15.06 373.98 ± 30.82 GAIL 18.27 ± 1.62 91.71 ± 7.55 184.34 ± 14.34 371.09 ± 30.64 E Experiment Details E.1 Experiment Details on MuJoCo For MuJoCo tasks, all ex...

  15. [15]

    The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice)

    For MuJoCo tasks, we implement BC according to [Li et al., 2022]. The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice). Instead of the KL-divergence in DAC, TV distance is consid- ered in TV-AIL. To this end, the tanh activation function...

  16. [16]

    For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states

    For RBAS MDPs, we consider that the initial state distribution is uniform over good states. For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states. By construction, the policy value of the expert policy is H. BC directly estimates the expert policy from expert demonstrations via (3...

  17. [17]

    20 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

  18. [18]

    100 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

  19. [19]

    Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]

    100 2 [10, 100, 1000, 2000] 100 where D = √ 2H|S ||A| is the diameter of the set CTV. Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]. 61