Understanding Adversarial Imitation Learning in Small Sample Regime: A Stage-coupled Analysis

Tian Xu; Yang Yu; Zhi-Quan Luo; Ziniu Li

arxiv: 2208.01899 · v2 · submitted 2022-08-03 · 💻 cs.LG

Understanding Adversarial Imitation Learning in Small Sample Regime: A Stage-coupled Analysis

Tian Xu , Ziniu Li , Yang Yu , Zhi-Quan Luo This is my paper

Pith reviewed 2026-05-24 11:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords adversarial imitation learningTV-AILimitation gapsmall sample regimehorizon-free bounddynamic programminglocomotion controlMarkov decision process

0 comments

The pith

TV-AIL achieves a horizon-free imitation gap of O(min{1, sqrt{|S|/N}}) on locomotion control instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Adversarial imitation learning can match expert performance using very few trajectories even in long-horizon tasks such as locomotion control. The paper provides a theoretical explanation for total-variation-based AIL by deriving an imitation gap bound that does not grow with the planning horizon. On abstracted instances from these tasks, the gap is at most the minimum of 1 and roughly the square root of the state space size divided by the number of trajectories. This bound holds in the small-sample regime and explains why performance remains good despite limited data and long horizons. The derivation uses a new stage-coupled dynamic programming analysis of the multi-stage policy optimization in TV-AIL.

Core claim

For TV-AIL on a class of instances abstracted from locomotion control tasks, the imitation gap is O(min{1, sqrt{|S|/N}}). This bound is independent of the planning horizon and is meaningful for both small and large N. The analysis relies on the structure of multi-stage policy optimization and employs dynamic programming in a stage-coupled manner.

What carries the argument

Stage-coupled dynamic programming analysis of the multi-stage policy optimization structure in TV-AIL

If this is right

The imitation gap is at most 1 regardless of horizon length.
The bound scales with sqrt{|S|/N} when N is small relative to |S|.
TV-AIL can achieve expert-level performance with a single expert trajectory on these instances.
The performance does not degrade as the task horizon increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the multi-stage structure is present in other AIL methods, similar horizon-free bounds may hold.
The result suggests that tabular MDPs with small state spaces can be imitated effectively from limited data.
Extensions to continuous control might involve similar scaling after discretization.

Load-bearing premise

The result applies specifically to a class of instances abstracted from locomotion control tasks where the multi-stage policy optimization can be analyzed with dynamic programming.

What would settle it

An experiment showing that the imitation gap grows with the planning horizon on the class of instances considered in the paper would falsify the horizon-free claim.

Figures

Figures reproduced from arXiv: 2208.01899 by Tian Xu, Yang Yu, Zhi-Quan Luo, Ziniu Li.

**Figure 2.** Figure 2: A simple MDP corresponding to Assumption [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: A simple MDP corresponding to Assumption [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: A simple example to show that TV-AIL’s objective in ( [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗

read the original abstract

Imitation learning learns a policy from expert trajectories. While the expert data is believed to be crucial for imitation quality, it was found that a kind of imitation learning approach, adversarial imitation learning (AIL), can have exceptional performance. With as little as only one expert trajectory, AIL can match the expert performance even in a long horizon, on tasks such as locomotion control. There are two mysterious points in this phenomenon. First, why can AIL perform well with only a few expert trajectories? Second, why does AIL maintain good performance despite the length of the planning horizon? In this paper, we theoretically explore these two questions. For a total-variation-distance-based AIL (called TV-AIL), our analysis shows a horizon-free imitation gap $\mathcal O(\{\min\{1, \sqrt{|\mathcal S|/N} \})$ on a class of instances abstracted from locomotion control tasks. Here $|\mathcal S|$ is the state space size for a tabular Markov decision process, and $N$ is the number of expert trajectories. We emphasize two important features of our bound. First, this bound is meaningful in both small and large sample regimes. Second, this bound suggests that the imitation gap of TV-AIL is at most 1 regardless of the planning horizon. Therefore, this bound can explain the empirical observation. Technically, we leverage the structure of multi-stage policy optimization in TV-AIL and present a new stage-coupled analysis via dynamic programming

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The horizon-free bound is real but trivial for small N, so it does not explain why AIL succeeds with one trajectory.

read the letter

The paper's core contribution is a stage-coupled dynamic programming argument that produces an imitation gap bound of O(min{1, sqrt{|S|/N}}) for TV-AIL on a restricted class of MDPs drawn from locomotion tasks. The analysis itself looks new relative to standard imitation learning bounds, and the absence of any horizon dependence is a clean technical feature that directly addresses one of the two questions posed in the abstract. The multi-stage structure of the algorithm is used in a natural way, and the derivation appears self-contained without obvious circularity or heavy self-citation. That part of the work is worth recording. The soft spot is exactly the one flagged in the stress-test note. When N equals 1 the bound collapses to 1. If the imitation gap is normalized so that its maximum value is 1, then the statement is true for any algorithm and supplies no information about why the gap is actually small rather than merely bounded. The paper presents the min{1, ...} form as explaining both the small-sample and long-horizon phenomena, but the small-sample claim rests entirely on the capping term without additional structure that would drive the gap toward zero when N is much smaller than |S|. The class of instances is abstracted from locomotion control, yet the abstract gives no detail on how representative those instances are or whether the policy optimization structure survives outside the abstraction. A reader who wants a tight explanation for one-trajectory success will not find it here. The paper is aimed at imitation learning theorists who care about finite-sample bounds. Someone already working on adversarial methods might extract the DP technique, but the main interpretive claim does not hold up. I would send it to referees so the derivation can be checked in full, though the authors should be asked to revise the discussion of the small-sample regime.

Referee Report

2 major / 1 minor

Summary. The paper analyzes total-variation adversarial imitation learning (TV-AIL) and derives, via stage-coupled dynamic programming on the multi-stage policy optimization structure, a horizon-free imitation gap bound of O(min{1, sqrt{|S|/N}}) that holds on a class of MDPs abstracted from locomotion control tasks. The authors claim this bound explains both the strong empirical performance of AIL with very few expert trajectories and its robustness to long planning horizons.

Significance. A non-vacuous horizon-free bound that also decays with N in the small-sample regime would be a useful theoretical contribution to understanding AIL. The stage-coupled DP technique itself is a technical strength. However, the specific form of the bound limits its ability to explain small-sample behavior beyond the trivial gap ≤ 1 that holds for any policy.

major comments (2)

[Abstract] Abstract: the stated bound evaluates to 1 whenever N ≤ |S|. This is the trivial upper bound on any normalized imitation gap and holds for arbitrary algorithms; it therefore does not explain why TV-AIL achieves small gaps with N ≪ |S| (the first mystery posed in the abstract). The claim that the bound is 'meaningful in both small and large sample regimes' is not supported by the displayed expression.
[Abstract] Abstract, final paragraph: the result is restricted to 'a class of instances abstracted from locomotion control tasks' whose multi-stage structure permits the DP analysis. Because the class is defined post-hoc to admit the analysis, it is unclear whether the bound applies to the actual MDPs on which the empirical claims are based or whether the abstraction removes the very features that would make the small-sample regime non-trivial.

minor comments (1)

[Abstract] Abstract: the big-O expression contains an extraneous opening brace: O({min{1, …}}).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments below.

read point-by-point responses

Referee: [Abstract] Abstract: the stated bound evaluates to 1 whenever N ≤ |S|. This is the trivial upper bound on any normalized imitation gap and holds for arbitrary algorithms; it therefore does not explain why TV-AIL achieves small gaps with N ≪ |S| (the first mystery posed in the abstract). The claim that the bound is 'meaningful in both small and large sample regimes' is not supported by the displayed expression.

Authors: We agree that the bound reduces to the trivial value of 1 whenever N ≤ |S|, providing no non-trivial insight into why TV-AIL can achieve small gaps for N ≪ |S|. This observation is correct and we will revise the abstract to remove or qualify the claim that the bound is 'meaningful in both small and large sample regimes,' instead emphasizing its horizon independence (addressing the second mystery) and its decay for N > |S|. revision: yes
Referee: [Abstract] Abstract, final paragraph: the result is restricted to 'a class of instances abstracted from locomotion control tasks' whose multi-stage structure permits the DP analysis. Because the class is defined post-hoc to admit the analysis, it is unclear whether the bound applies to the actual MDPs on which the empirical claims are based or whether the abstraction removes the very features that would make the small-sample regime non-trivial.

Authors: The class is constructed to isolate the multi-stage policy optimization structure present in locomotion tasks that enables the stage-coupled DP analysis. We will add text to the manuscript providing a more explicit mapping from this class to the MDPs appearing in the experiments, along with a discussion of the abstraction's scope and limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation via stage-coupled DP is self-contained

full rationale

The paper derives the stated imitation gap bound O(min{1, sqrt{|S|/N}}) explicitly via a new stage-coupled dynamic programming analysis that exploits the multi-stage policy optimization structure of TV-AIL on the abstracted class of instances. No parameters are fitted to data and then relabeled as predictions, no self-citations are invoked to justify uniqueness or ansatzes, and the bound expression is not shown to reduce to its own inputs by definition. The analysis is presented as independent of external fitted values or prior author results, making the central claim self-contained against the paper's own equations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The bound rests on the existence of a suitable class of MDPs abstracted from locomotion tasks and on the multi-stage optimization structure of TV-AIL being amenable to dynamic programming coupling; no explicit free parameters or invented entities are stated.

axioms (2)

domain assumption The MDP is tabular with finite state space size |S|
Directly used in the stated bound; invoked in the abstract when defining the imitation gap.
ad hoc to paper The instances belong to the class abstracted from locomotion control tasks
The bound is claimed only for this class; location: abstract, sentence introducing the bound.

pith-pipeline@v0.9.0 · 5800 in / 1351 out tokens · 26322 ms · 2026-05-24T11:03:45.979235+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

our analysis shows a horizon-free imitation gap O(min{1, sqrt{|S|/N}}) on a class of instances abstracted from locomotion control tasks... stage-coupled analysis via dynamic programming
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the imitation gap of TV-AIL is at most 1 regardless of the planning horizon

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms
cs.LG 2026-05 unverdicted novelty 8.0

OPT-AIL provides the first provably efficient adversarial imitation learning algorithms under general function approximation, achieving polynomial expert sample and interaction complexity.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper

[1]

In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE 2 (s2, a) = dπE 2 (s2, a)

Finally, by the standard analysis of dynamic programming [Bertsekas, 2012], we know thatπ⋆ = ( π⋆ 1, π⋆ 2 ) is identical to πAIL = ( πAIL 1 , πAIL 2 ), where πAIL is the optimal solution to the original objective in (5). In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE ...

work page 2012
[2]

In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s, a) − ˆdπE H (s, a)|

On the other hand, by triangle inequality, we have ⏐⏐⏐⏐ ∑ (s,a)∈S ×A dπE H (s, a)rH(s, a) − dπAIL H (s, a)rH(s, a) ⏐⏐⏐⏐ ≤ ∑ (s,a)∈S ×A ⏐⏐⏐⏐dπE H (s, a) − ˆdπE H (s, a) ⏐⏐⏐⏐ + ⏐⏐⏐⏐ ˆdπE H (s, a) − dπAIL H (s, a) ⏐⏐⏐⏐ . In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s,...

work page 2015
[3]

(24) Then we consider the second term in (24)

Then we have that V(πAIL) − V(π) ≤ 4 H ∑ h=1 h−1 ∑ ℓ=1 ∑ s∈S G dπ ℓ (s) ( 1 − πℓ(a1|s) ) + ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) . (24) Then we consider the second term in (24). For the last time step H, notice that by construction, we have πAIL H (a1|s) = min{ˆdπE H (s)/dπE H (s), 1}. Then we get ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) ≤ ∑ s∈Sπ...

work page 2015
[4]

To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]

First, we consider the small sample regime where N ≼ |X |. To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]. Speciﬁcally, we construct a multinomial distributionQ′ as follows. Q′ = ( Q′(1), . . . ,Q′(|X | − 1), Q′(|X |) ) = ( 1 |...

work page 1953
[5]

The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]

Then we get that max Q∈Q E [Q − ˆQ  1 ] ≥ E [Q′ − ˆQ  1 ] ≥ 1 3ec ≿ 1, which completes the proof in the small sample regime. The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]. B.3 Proof of Proposition 4 Proof of Proposition

work page 2015
[6]

To apply such a regret bound, we need to verify that

Lemma 12 is a direct consequence of the regret bound of online gradient descend [Shalev- Shwartz, 2012]. To apply such a regret bound, we need to verify that

work page 2012
[7]

D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL

Then, invoking Corollary 2.7 in [Shalev-Shwartz, 2012] with B = √ H|S ||A| and L = 2 √ H ﬁnishes the proof. D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL. Recall the objective of TV-AIL: min π∈Π H ∑ h=1 ∑ (s,a)∈S ×A |dπ h (s, a) − ˆdπE h (s, a)|, (41) Two optimization approaches can solve the above ...

work page 2012
[8]

1: Obtain the estimation ˆdπE h in (4)

Algorithm 1 TV-AIL via Gradient-based Method Input: expert demonstrations D, number of iterations T, step size η(t), and initialization c(1). 1: Obtain the estimation ˆdπE h in (4). 2: for t = 1, 2,· · · , T do 3: π(t) ← solve the optimal policy with the reward function −c(t) up to an error of εopt. 4: Compute the state-action distribution dπ(t) h for π(t...

work page 2020
[9]

Please refer to Appendix C.2.3 for the proof

Consider Algorithm 1, then we have T ∑ t=1 f (t) ( c(t) ) − min c∈CTV T ∑ t=1 f (t)(c) ≤ 2H √ 2|S ||A|T, where f (t)(c) = H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a)(ˆdπE h (s, a) − dπ(t) h (s, a)). Please refer to Appendix C.2.3 for the proof. Basically, Lemma 12 is a direct consequence of the regret bound of online gradient descent [Shalev-Shwartz, 2012]. Now, we pro...

work page 2012
[10]

Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details

The analysis here is based on the proof strategy of [Syed and Schapire, 2007, Theorem 2]. Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details. Proof of Proposition

work page 2007
[11]

With the dual representation ofℓ1-norm and the celebrated min-max theorem Bertsekas [2016], we have min π∈Π H ∑ h=1 dπ h − ˆdπE h  1 = − min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) . Furthermore, we have that min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) ≤ max π∈Π H ∑ h=1 ∑ (s,a...

work page 2016
[12]

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 0.69 ± 0.00 0.70 ± 0.00 0.71 ± 0.00 0.71 ± 0.00 FEM 0.58 ± 0.00 0.57 ± 0.00 0.58 ± 0.00 0.58 ± 0.00 GTAL 0.80 ± 0.00 0.81 ± 0.00 0.81 ± 0.19 0.74 ± 0.32 GAIL 0.94 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 59 Table 8: Imitation gap on the lower bound instance with N =

work page 2000
[13]

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 49.50 ± 0.01 247.50 ± 0.00 495.00 ± 0.01 990.00 ± 0.00 FEM 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 GTAL 50.05 ± 4.93 250.52 ± 3.96 495.05 ± 4.95 989.06 ± 4.85 GAIL 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 Table 9: Imitation gap on the lower bound instance with N =

work page 2000
[14]

We use the expert dataset collected by the trained online SAC [Haarnoja et al., 2018] with 1 million steps

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 18.23 ± 1.63 94.30 ± 6.85 179.80 ± 16.31 360.05 ± 38.21 FEM 18.18 ± 1.63 92.85 ± 7.06 192.21 ± 15.89 378.15 ± 30.52 GTAL 21.52 ± 2.22 94.63 ± 7.83 188.02 ± 15.06 373.98 ± 30.82 GAIL 18.27 ± 1.62 91.71 ± 7.55 184.34 ± 14.34 371.09 ± 30.64 E Experiment Details E.1 Experiment Details on MuJoCo For MuJoCo tasks, all ex...

work page 2000
[15]

The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice)

For MuJoCo tasks, we implement BC according to [Li et al., 2022]. The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice). Instead of the KL-divergence in DAC, TV distance is consid- ered in TV-AIL. To this end, the tanh activation function...

work page 2022
[16]

For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states

For RBAS MDPs, we consider that the initial state distribution is uniform over good states. For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states. By construction, the policy value of the expert policy is H. BC directly estimates the expert policy from expert demonstrations via (3...

work page 2019
[17]

20 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

work page 2000
[18]

100 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

work page 2000
[19]

Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]

100 2 [10, 100, 1000, 2000] 100 where D = √ 2H|S ||A| is the diameter of the set CTV. Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]. 61

work page 2000

[1] [1]

In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE 2 (s2, a) = dπE 2 (s2, a)

Finally, by the standard analysis of dynamic programming [Bertsekas, 2012], we know thatπ⋆ = ( π⋆ 1, π⋆ 2 ) is identical to πAIL = ( πAIL 1 , πAIL 2 ), where πAIL is the optimal solution to the original objective in (5). In Example 1, the estimator in the last time step happens to equal the true distribution, i.e., ˆdπE 2 (s1, a) = dπE 2 (s1, a) and ˆdπE ...

work page 2012

[2] [2]

In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s, a) − ˆdπE H (s, a)|

On the other hand, by triangle inequality, we have ⏐⏐⏐⏐ ∑ (s,a)∈S ×A dπE H (s, a)rH(s, a) − dπAIL H (s, a)rH(s, a) ⏐⏐⏐⏐ ≤ ∑ (s,a)∈S ×A ⏐⏐⏐⏐dπE H (s, a) − ˆdπE H (s, a) ⏐⏐⏐⏐ + ⏐⏐⏐⏐ ˆdπE H (s, a) − dπAIL H (s, a) ⏐⏐⏐⏐ . In the following part, we want to prove that the optimality ofπAIL implies that ∑(s,a) |ˆdπE H (s, a) − dπAIL H (s, a)| ≤ ∑(s,a) |dπE H (s,...

work page 2015

[3] [3]

(24) Then we consider the second term in (24)

Then we have that V(πAIL) − V(π) ≤ 4 H ∑ h=1 h−1 ∑ ℓ=1 ∑ s∈S G dπ ℓ (s) ( 1 − πℓ(a1|s) ) + ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) . (24) Then we consider the second term in (24). For the last time step H, notice that by construction, we have πAIL H (a1|s) = min{ˆdπE H (s)/dπE H (s), 1}. Then we get ∑ s∈Sπ dπ H(s) ( πAIL H (a1|s) − πH(a1|s) ) ≤ ∑ s∈Sπ...

work page 2015

[4] [4]

To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]

First, we consider the small sample regime where N ≼ |X |. To prove this lower bound, we draw a connection between the ℓ1-norm-based estimation error and the missing mass [Good, 1953, McAllester and Ortiz, 2003, Rajaraman et al., 2020]. Speciﬁcally, we construct a multinomial distributionQ′ as follows. Q′ = ( Q′(1), . . . ,Q′(|X | − 1), Q′(|X |) ) = ( 1 |...

work page 1953

[5] [5]

The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]

Then we get that max Q∈Q E [Q − ˆQ  1 ] ≥ E [Q′ − ˆQ  1 ] ≥ 1 3ec ≿ 1, which completes the proof in the small sample regime. The lower bound in the large sample regime can be obtained directly from [Kamath et al., 2015, Lemma 8]. B.3 Proof of Proposition 4 Proof of Proposition

work page 2015

[6] [6]

To apply such a regret bound, we need to verify that

Lemma 12 is a direct consequence of the regret bound of online gradient descend [Shalev- Shwartz, 2012]. To apply such a regret bound, we need to verify that

work page 2012

[7] [7]

D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL

Then, invoking Corollary 2.7 in [Shalev-Shwartz, 2012] with B = √ H|S ||A| and L = 2 √ H ﬁnishes the proof. D Discussion D.1 Optimization Procedures for TV-AIL Here we discuss optimization procedures for TV-AIL. Recall the objective of TV-AIL: min π∈Π H ∑ h=1 ∑ (s,a)∈S ×A |dπ h (s, a) − ˆdπE h (s, a)|, (41) Two optimization approaches can solve the above ...

work page 2012

[8] [8]

1: Obtain the estimation ˆdπE h in (4)

Algorithm 1 TV-AIL via Gradient-based Method Input: expert demonstrations D, number of iterations T, step size η(t), and initialization c(1). 1: Obtain the estimation ˆdπE h in (4). 2: for t = 1, 2,· · · , T do 3: π(t) ← solve the optimal policy with the reward function −c(t) up to an error of εopt. 4: Compute the state-action distribution dπ(t) h for π(t...

work page 2020

[9] [9]

Please refer to Appendix C.2.3 for the proof

Consider Algorithm 1, then we have T ∑ t=1 f (t) ( c(t) ) − min c∈CTV T ∑ t=1 f (t)(c) ≤ 2H √ 2|S ||A|T, where f (t)(c) = H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a)(ˆdπE h (s, a) − dπ(t) h (s, a)). Please refer to Appendix C.2.3 for the proof. Basically, Lemma 12 is a direct consequence of the regret bound of online gradient descent [Shalev-Shwartz, 2012]. Now, we pro...

work page 2012

[10] [10]

Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details

The analysis here is based on the proof strategy of [Syed and Schapire, 2007, Theorem 2]. Here we list main proof procedures and interested readers are referred to [Syed and Schapire, 2007] for details. Proof of Proposition

work page 2007

[11] [11]

With the dual representation ofℓ1-norm and the celebrated min-max theorem Bertsekas [2016], we have min π∈Π H ∑ h=1 dπ h − ˆdπE h  1 = − min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) . Furthermore, we have that min c∈CTV max π∈Π H ∑ h=1 ∑ (s,a)∈S ×A ch(s, a) ( ˆdπE h (s, a) − dπ h (s, a) ) ≤ max π∈Π H ∑ h=1 ∑ (s,a...

work page 2016

[12] [12]

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 0.69 ± 0.00 0.70 ± 0.00 0.71 ± 0.00 0.71 ± 0.00 FEM 0.58 ± 0.00 0.57 ± 0.00 0.58 ± 0.00 0.58 ± 0.00 GTAL 0.80 ± 0.00 0.81 ± 0.00 0.81 ± 0.19 0.74 ± 0.32 GAIL 0.94 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 0.95 ± 0.00 59 Table 8: Imitation gap on the lower bound instance with N =

work page 2000

[13] [13]

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 49.50 ± 0.01 247.50 ± 0.00 495.00 ± 0.01 990.00 ± 0.00 FEM 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 GTAL 50.05 ± 4.93 250.52 ± 3.96 495.05 ± 4.95 989.06 ± 4.85 GAIL 49.50 ± 0.00 247.50 ± 0.00 495.00 ± 0.00 990.00 ± 0.00 Table 9: Imitation gap on the lower bound instance with N =

work page 2000

[14] [14]

We use the expert dataset collected by the trained online SAC [Haarnoja et al., 2018] with 1 million steps

H = 100 H = 500 H = 1000 H = 2000 TV-AIL 18.23 ± 1.63 94.30 ± 6.85 179.80 ± 16.31 360.05 ± 38.21 FEM 18.18 ± 1.63 92.85 ± 7.06 192.21 ± 15.89 378.15 ± 30.52 GTAL 21.52 ± 2.22 94.63 ± 7.83 188.02 ± 15.06 373.98 ± 30.82 GAIL 18.27 ± 1.62 91.71 ± 7.55 184.34 ± 14.34 371.09 ± 30.64 E Experiment Details E.1 Experiment Details on MuJoCo For MuJoCo tasks, all ex...

work page 2000

[15] [15]

The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice)

For MuJoCo tasks, we implement BC according to [Li et al., 2022]. The implementation of TV-AIL is based on an existing AIL algorithm DAC [Kostrikov et al., 2019] (https://github.com/google-research/ google-research/tree/master/value_dice). Instead of the KL-divergence in DAC, TV distance is consid- ered in TV-AIL. To this end, the tanh activation function...

work page 2022

[16] [16]

For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states

For RBAS MDPs, we consider that the initial state distribution is uniform over good states. For the lower bound instances (refer to Assumption 2), we consider that the initial state distribution is uniform over all states. By construction, the policy value of the expert policy is H. BC directly estimates the expert policy from expert demonstrations via (3...

work page 2019

[17] [17]

20 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

work page 2000

[18] [18]

100 2 [10, 100, 1000, 2000] 1 Lower bound instance (Table

work page 2000

[19] [19]

Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]

100 2 [10, 100, 1000, 2000] 100 where D = √ 2H|S ||A| is the diameter of the set CTV. Notice that Proposition 6 still holds with this adaptive step size [Orabona, 2019]. 61

work page 2000