ADAPS: Autonomous Driving Via Principled Simulations

David Wolinski; Ming C. Lin; Weizi Li

arxiv: 1907.08874 · v1 · pith:6S44VW2Ynew · submitted 2019-07-20 · 💻 cs.RO


Pith reviewed 2026-05-24 18:31 UTC · model grok-4.3

classification 💻 cs.RO

keywords: autonomous driving, simulation platforms, accident generation, hierarchical control, online learning, training data


The pith

ADAPS uses two simulation platforms to generate accident data and a memory-enabled hierarchical policy to learn robust driving controls with fewer iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ADAPS as a method to build robust control policies for autonomous vehicles by addressing the need for diverse training data that includes rare events like accidents. It relies on two simulation platforms that generate and analyze accidents to create labeled data automatically, combined with a hierarchical policy structure that incorporates memory. An online learning process is included that cuts the number of required iterations relative to existing techniques. A sympathetic reader would care because this targets the practical gap between simulated training and safe real-world performance in unpredictable conditions.

Core claim

ADAPS consists of two simulation platforms in generating and analyzing accidents to automatically produce labeled training data, and a memory-enabled hierarchical control policy. Additionally, ADAPS offers a more efficient online learning mechanism that reduces the number of iterations required in learning compared to existing methods such as DAGGER.

What carries the argument

The ADAPS system: two simulation platforms that produce accident data, a memory-enabled hierarchical control policy, and an efficient online learning mechanism.
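To make the phrase "memory-enabled hierarchical control policy" concrete, here is a minimal sketch under stated assumptions. Only the module name Following appears in the excerpts on this page; the `MemoryDetector` class, the `avoidance` controller, the observation keys, and the sliding-window vote are all hypothetical stand-ins, not the paper's architecture.

```python
from collections import deque


class MemoryDetector:
    """Hypothetical memory: votes over the last `window` per-frame danger flags."""

    def __init__(self, window=5, threshold=3):
        self.flags = deque(maxlen=window)  # oldest flags fall off automatically
        self.threshold = threshold

    def update(self, frame_flag):
        self.flags.append(bool(frame_flag))
        return sum(self.flags) >= self.threshold


def following(obs):
    """Low-level controller: steer proportionally toward the next waypoint."""
    return 0.5 * obs["waypoint_offset"]


def avoidance(obs):
    """Low-level controller (assumed): steer away from the obstacle's side."""
    return -1.0 if obs["obstacle_offset"] >= 0 else 1.0


class HierarchicalPolicy:
    """High level picks a module from remembered detections; low level steers."""

    def __init__(self):
        self.detector = MemoryDetector()

    def act(self, obs):
        danger = self.detector.update(obs["obstacle_visible"])
        if danger:
            return "avoidance", avoidance(obs)
        return "following", following(obs)
```

Because the decision depends on several recent frames rather than one, a single noisy detection does not flip the controller, and the avoidance mode persists briefly after the obstacle leaves view. A learned recurrent detector would play the same role in a real system.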

If this is right

Labeled training data for rare events becomes available without manual collection or annotation.
The hierarchical policy structure with memory supports handling of sequential and complex driving decisions.
Online learning converges with fewer iterations than DAGGER-style methods.
Both qualitative and quantitative performance gains appear in simulated driving environments.
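For readers unfamiliar with the baseline, the following is a toy sketch of a DAGGER-style loop in the spirit of Ross et al. [1]: roll out the current policy, have the expert relabel the visited states, aggregate, and retrain. The one-dimensional lane-offset world, `expert_action`, the lookup-table learner, and all parameter values are assumptions made for illustration; this is not the ADAPS learning mechanism.

```python
import random


def expert_action(state):
    """Hypothetical expert: steer one step back toward the lane center at 0."""
    return -1 if state > 0 else (1 if state < 0 else 0)


def rollout(policy, horizon, rng):
    """Run `policy` from a random start and return the states it visits."""
    state = rng.randint(-3, 3)
    visited = []
    for _ in range(horizon):
        visited.append(state)
        noise = rng.choice([-1, 0, 1])
        state = max(-5, min(5, state + policy(state) + noise))
    return visited


def fit(dataset):
    """Toy learner: majority expert label per state, stored in a lookup table."""
    votes = {}
    for state, action in dataset:
        votes.setdefault(state, []).append(action)
    table = {s: max(set(v), key=v.count) for s, v in votes.items()}
    return lambda s: table.get(s, 0)  # unseen states default to "go straight"


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    dataset = []
    policy = expert_action  # the first iteration rolls out the expert itself
    for _ in range(iterations):
        states = rollout(policy, horizon, rng)
        # The expert labels states visited by the *current* policy.
        dataset += [(s, expert_action(s)) for s in states]
        policy = fit(dataset)  # retrain on the aggregated dataset
    return policy, dataset
```

Each pass through the loop requires a fresh rollout plus expert queries, which is why the number of iterations is the cost that the paper's efficiency claim targets.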

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulations prove transferable, the same platforms could generate data for additional edge cases beyond accidents.
The efficiency gain in iterations could allow policies to be retrained rapidly when new sensor data arrives.
Hierarchical memory might help the policy generalize across different vehicle types or road layouts.
Validation against real crash statistics would be a direct next check for the data-generation step.

Load-bearing premise

The simulated accident scenarios and driving dynamics accurately model real-world conditions sufficiently for the learned policy to transfer effectively to physical autonomous vehicles.

What would settle it

Test the trained policy on a physical vehicle in real accident-like situations and check whether it matches the safety performance observed in the simulations.

Figures

Figures reproduced from arXiv: 1907.08874 by David Wolinski, Ming C. Lin, Weizi Li.

**Figure 2.** LEFT and CENTER: the comparisons between our policy

**Figure 3.** The visualization results of collected images using t-SNE [32].

**Figure 4.** Plotted collision-free trajectories generated by the expert algorithm

**Figure 5.** (This figure is copied from the main text to here for completeness.)

Original abstract

Autonomous driving has gained significant advancements in recent years. However, obtaining a robust control policy for driving remains challenging as it requires training data from a variety of scenarios, including rare situations (e.g., accidents), an effective policy architecture, and an efficient learning mechanism. We propose ADAPS for producing robust control policies for autonomous vehicles. ADAPS consists of two simulation platforms in generating and analyzing accidents to automatically produce labeled training data, and a memory-enabled hierarchical control policy. Additionally, ADAPS offers a more efficient online learning mechanism that reduces the number of iterations required in learning compared to existing methods such as DAGGER. We present both theoretical and experimental results. The latter are produced in simulated environments, where qualitative and quantitative results are generated to demonstrate the benefits of ADAPS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ADAPS targets rare accident scenarios with simulation-based data generation and a hierarchical policy, but all claims stay inside simulation with no transfer evidence.


The paper's core contribution is a pipeline called ADAPS that runs two simulation platforms to create and label accident data automatically, then trains a memory-enabled hierarchical policy with an online learning loop that it says needs fewer iterations than DAGGER. They report both theoretical arguments and simulated experiments showing qualitative and quantitative gains.

The focus on systematically producing training examples for low-probability events is the practical angle that stands out, since most driving datasets under-represent crashes and near-misses. The hierarchical structure plus memory is a straightforward way to manage short-term control and longer context, and the reduced-iteration claim is at least framed as an efficiency improvement over a known baseline. That combination is new enough as an integrated system for this domain.

The experiments are confined to simulated environments, which the abstract states plainly. The title and abstract still frame the output as robust policies for autonomous vehicles, yet there is no domain randomization, cross-simulator check, or real-vehicle test to support transfer. That assumption carries the robustness claim, and without it the results only show performance inside the training distribution. The theoretical results are mentioned but not detailed enough in the abstract to judge their scope or assumptions. If the full paper has clear metrics, fair baselines, and reproducible numbers on the iteration reduction, that would be the strongest part to evaluate.

This work is aimed at people already working on simulation-for-training pipelines in robotics or autonomous systems. A reader who needs ideas for generating edge-case data might extract the simulation platform design, but anyone expecting validated real-world policies will find the gap obvious. It is worth sending for peer review. The problem of rare-event data is real and the proposed combination is concrete; referees can press on the transfer question and the experimental details without the paper being dismissed outright.

Referee Report

2 major / 0 minor

Summary. The paper proposes ADAPS, a framework consisting of two simulation platforms that generate and analyze accidents to automatically produce labeled training data, a memory-enabled hierarchical control policy, and an efficient online learning mechanism claimed to require fewer iterations than DAGGER. Theoretical results are presented alongside experimental results conducted exclusively in simulated environments, with qualitative and quantitative demonstrations of benefits for autonomous driving policies.

Significance. If the simulated accident generation and policy learning transfer effectively, the approach could offer a useful method for creating training data on rare events and improving sample efficiency in hierarchical policies. The explicit use of simulations for principled data production is a potential strength, though the manuscript provides no evidence of real-world validation.

major comments (2)

[Abstract] The central claim is that ADAPS produces 'robust control policies for autonomous vehicles', but the experimental results are explicitly limited to simulated environments with no sim-to-real transfer tests, domain randomization, cross-simulator validation, or physical deployment. This assumption is load-bearing for the robustness and applicability claims.

[Abstract] The efficiency advantage over DAGGER (reduced iterations in online learning) is presented as a key result, yet the manuscript provides no quantitative metrics, baseline comparisons, error bars, or statistical analysis to support that the improvement is meaningful or generalizable beyond the specific simulated scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the scope of our claims and the strength of the empirical evidence. We address each major comment below.


Referee: [Abstract] The central claim is that ADAPS produces 'robust control policies for autonomous vehicles', but the experimental results are explicitly limited to simulated environments with no sim-to-real transfer tests, domain randomization, cross-simulator validation, or physical deployment. This assumption is load-bearing for the robustness and applicability claims.

Authors: We agree that the manuscript explicitly states all experiments occur in simulated environments and provides no sim-to-real transfer, domain randomization, or physical deployment results. The robustness claims refer to performance within the simulated settings, including rare accident scenarios generated by the proposed framework. To address the concern, we will revise the abstract and add a limitations paragraph clarifying the simulation-only scope and identifying real-world transfer as future work.

Revision: yes

Referee: [Abstract] The efficiency advantage over DAGGER (reduced iterations in online learning) is presented as a key result, yet the manuscript provides no quantitative metrics, baseline comparisons, error bars, or statistical analysis to support that the improvement is meaningful or generalizable beyond the specific simulated scenarios.

Authors: The manuscript reports iteration counts from simulated experiments and includes a theoretical analysis of the online learning mechanism. We acknowledge that the current presentation lacks error bars, statistical tests, and expanded baseline tables. We will add these quantitative details and statistical analysis to the experimental section in the revision.

Revision: yes
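One way the requested error bars could be produced is a percentile bootstrap on the difference in mean iteration counts between two methods. The sketch below is an illustration only: the function name, the resampling settings, and the choice of a percentile bootstrap are assumptions, and no numbers from the paper are used.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(a) - mean(b).

    `a` and `b` are per-run iteration counts for two methods (hypothetical
    inputs supplied by the caller, e.g. one count per random seed).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.mean(a) - statistics.mean(b)
    return point, lower, upper
```

If the resulting interval excludes zero across several scenarios and seeds, the iteration-reduction claim has quantitative support; with only a handful of runs per method the interval will be wide, which is itself worth reporting.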

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents ADAPS as a composite system: two simulation platforms for generating/analyzing accidents to produce labeled data, plus a memory-enabled hierarchical policy and an online learning mechanism shown to require fewer iterations than DAGGER. No equations, definitions, or claims in the abstract reduce a derived quantity to a fitted input by construction, invoke self-citations as uniqueness theorems, or smuggle ansatzes. The derivation chain combines independent modules (simulation data generation + policy architecture + learning efficiency) without self-referential loops or renaming of known results. Experimental claims are explicitly limited to simulation, but this does not create circularity in the stated results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the proposal remains at the level of system description without detailing any fitted values or unstated assumptions beyond the general claim.



Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[2] N. Ratliff, “Learning to search: structured prediction techniques for imitation learning,” Ph.D. dissertation, Carnegie Mellon University, 2009.
[3] D. Silver, “Learning preference models for autonomous mobile robots in complex domains,” Ph.D. dissertation, 2010.
[4] D. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, 1989, pp. 305–313.
[5] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive UAV control in cluttered natural environments,” in Robotics and Automation, 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772.
[6] Q. Chao, H. Bi, W. Li, T. Mao, Z. Wang, M. C. Lin, and Z. Deng, “A survey on visual traffic simulation: Models, evaluations, and applications in autonomous driving,” Computer Graphics Forum, 2019.
[7] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, 2018.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in Computer Vision, 2015 IEEE International Conference on, 2015, pp. 2722–2730.
[9] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in Advances in Neural Information Processing Systems, 2005, pp. 739–746.
[10] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
[11] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3530–3538.
[12] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, “Agile off-road autonomous driving using end-to-end deep imitation learning,” in Robotics: Science and Systems, 2018.
[13] F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end driving via conditional imitation learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 746–753.
[14] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
[15] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[16] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learning from demonstration by constructing skill trees,” The International Journal of Robotics Research, vol. 31, no. 3, pp. 360–375, 2012.
[17] S. Levine and V. Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013, pp. 1–9.
[18] G. J. Gordon, “Stable function approximation in dynamic programming,” in Machine Learning Proceedings 1995. Elsevier, 1995, pp. 261–268.
[19] M. Kearns, Y. Mansour, and A. Y. Ng, “A sparse sampling algorithm for near-optimal planning in large Markov decision processes,” Machine Learning, vol. 49, no. 2-3, pp. 193–208, 2002.
[20] C. Szepesvári and R. Munos, “Finite time bounds for sampling based fitted value iteration,” in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 880–887.
[21] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.
[22] U. Syed and R. E. Schapire, “A reduction from apprenticeship learning to classification,” in Advances in Neural Information Processing Systems, 2010, pp. 2253–2261.
[23] H. Daumé, J. Langford, and D. Marcu, “Search-based structured prediction,” Machine Learning, vol. 75, no. 3, pp. 297–325, 2009.
[24] S. M. Kakade and A. Tewari, “On the generalization ability of online strongly convex programming algorithms,” in Advances in Neural Information Processing Systems, 2009, pp. 801–808.
[25] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Machine Learning, vol. 69, no. 2-3, pp. 169–192, 2007.
[26] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the 19th International Conference on Machine Learning (ICML), vol. 2, 2002, pp. 267–274.
[27] J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, 2004, pp. 831–838.
[28] G. Johansson and K. Rumar, “Drivers’ brake reaction times,” Human Factors, vol. 13, no. 1, pp. 23–27, 1971.
[29] D. V. McGehee, E. N. Mazzae, and G. S. Baldwin, “Driver reaction time in crash avoidance research: validation of a driving simulator study on a test track,” in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 44, no. 20, 2000.
[30] D. Wolinski, M. Lin, and J. Pettré, “Warpdriver: context-aware probabilistic motion prediction for crowd simulation,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, 2016.
[31] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end simulated driving,” in AAAI, 2017, pp. 2891–2897.
[32] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[33] W. Li, D. Wolinski, and M. C. Lin, “City-scale traffic animation using statistical learning and metamodel-based optimization,” ACM Trans. Graph., vol. 36, no. 6, pp. 200:1–200:12, Nov. 2017.
[34] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[35] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[36] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.

IX. APPENDIX

A. Solving An SPC Task

We show the proofs of solving an SPC task using standard supervised learning, DAGGER [1], and ADAPS, respectively. We use “state” and “observation” interchangeably here as for these proofs we can always find a deterministic function to map the two
[37] Supervised Learning: The following proof is adapted and simplified from Ross et al. [1]. We include it here for completeness. Theorem 2: Consider a $T$-step control task. Let $\epsilon = \mathbb{E}_{\phi \sim d_{\pi^*},\, a^* \sim \pi^*(\phi)}\left[ l(\phi, \pi, a^*) \right]$ be the observed surrogate loss under the training distribution induced by the expert’s policy $\pi^*$. We assume $C \in [0, C_{\max}]$ and $l$ upper bounds the 0-1 loss...

[38] DAGGER: The following proof is adapted from Ross et al. [1]. We include it here for completeness. Note that for Theorem 3, we have arrived at the different third term as of Ross et al. [1]. Lemma 1: [1] Let $P$ and $Q$ be any two distributions over elements $x \in X$ and $f : X \to \mathbb{R}$, any bounded function such that $f(x) \in [a, b]$ for all $x \in X$. Let the range $r = b - a$. Then |E_{x∼...

[39] ADAPS: With the assumption that we can treat the generated trajectories from our model and the additional data generated based on them as running a learned policy to sample independent expert trajectories at different states while performing policy roll-out, we have the following guarantee of ADAPS. To better understand the following theorem and proof, we...

[40] Scenarios: We have tested our method in three scenarios. The first is a straight road which represents a linear geometry, the second is a curved road which represents a non-linear geometry, and the third is an open ground. The first two represent on-road situations while the last represents an off-road situation. Both the straight and curved roads consist...

[41] Vehicle Specs: The vehicle’s speed is set to 20 m/s, which value is used to compute the throttle value in the simulator. Due to factors such as the rendering complexity and the delay of the communication module, the actual running speed is in the range of 20 ± 1 m/s. The length and width of the vehicle are 4.5 m and 2.5 m, respectively. The distance between ...

[42] Obstacles: For the on-road scenarios, we use a scaled version of a virtual traffic cone as the obstacle on both the straight and curved roads. This scaling operation is meant to preserve the obstacle’s visibility, since at distances greater than 30 m a normal-sized obstacle is quickly reduced to just a few pixels. This is an intrinsic limitation of the sing...

[43] Training Data: In order to train Following, we have built a waypoint system on the straight road and curved road for the AV to follow, respectively. By running the vehicle for roughly equal distances on both roads, we have gathered in total 65,061 images (33,642 images for the straight road and 31,419 images for the curved road). On the open ground, we h...
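As a quick worked check on the numbers in the Vehicle Specs and Obstacles excerpts above, the sketch below computes how long the vehicle takes to cover 30 m at the stated speeds. Treating the 30 m visibility distance as the distance available for reacting is an assumption made here for illustration; the excerpts do not say so.

```python
def time_to_reach(distance_m, speed_mps):
    """Seconds needed to cover `distance_m` at a constant `speed_mps`."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return distance_m / speed_mps


VISIBILITY_M = 30.0  # distance quoted in the Obstacles excerpt
nominal = time_to_reach(VISIBILITY_M, 20.0)  # set speed of 20 m/s
fastest = time_to_reach(VISIBILITY_M, 21.0)  # upper end of the 20 ± 1 m/s range
slowest = time_to_reach(VISIBILITY_M, 19.0)  # lower end of the 20 ± 1 m/s range

print(f"nominal: {nominal:.2f} s, range: {fastest:.2f} to {slowest:.2f} s")
```

At the set speed the budget is 1.5 s, and the reported speed variation moves it by less than a tenth of a second either way, which gives a sense of why obstacle visibility beyond 30 m matters for a single-camera setup.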

[1] [1]

A reduction of imitation learning and structured prediction to no-regret online learning,

S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the fourteenth international conference on artiﬁcial intelligence and statistics, 2011, pp. 627–635

work page 2011

[2] [2]

Learning to search: structured prediction techniques for imitation learning,

N. Ratliff, “Learning to search: structured prediction techniques for imitation learning,” Ph.D. dissertation, Carnegie Mellon University, 2009

work page 2009

[3] [3]

Learning preference models for autonomous mobile robots in complex domains,

D. Silver, “Learning preference models for autonomous mobile robots in complex domains,” Ph.D. dissertation, 2010

work page 2010

[4] [4]

ALVINN: An autonomous land vehicle in a neural network,

D. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in neural information processing systems, 1989, pp. 305–313

work page 1989

[5] [5]

Learning monocular reactive uav con- trol in cluttered natural environments,

S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive uav con- trol in cluttered natural environments,” in Robotics and Automation, 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772

work page 2013

[6] [6]

A survey on visual trafﬁc simulation: Models, evaluations, and applications in autonomous driving,

Q. Chao, H. Bi, W. Li, T. Mao, Z. Wang, M. C. Lin, and Z. Deng, “A survey on visual trafﬁc simulation: Models, evaluations, and applications in autonomous driving,”Computer Graphics Fourm, 2019

work page 2019

[7] [7]

Planning and decision- making for autonomous vehicles,

W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision- making for autonomous vehicles,”Annual Review of Control, Robotics, and Autonomous Systems , 2018

work page 2018

[8] [8]

Deepdriving: Learning affordance for direct perception in autonomous driving,

C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in Computer Vision, 2015 IEEE International Conference on, 2015, pp. 2722–2730

work page 2015

[9] [9]

Off-road obstacle avoidance through end-to-end learning,

Y . LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in Advances in neural information processing systems , 2005, pp. 739–746

work page 2005

[10] [10]

End to End Learning for Self-Driving Cars

M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

End-to-end learning of driving models from large-scale video datasets,

H. Xu, Y . Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 3530– 3538

work page 2017

[12] [12]

Agile off-road autonomous driving using end-to-end deep imitation learning,

Y . Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, “Agile off-road autonomous driving using end-to-end deep imitation learning,” in Robotics: Science and Systems , 2018

work page 2018

[13] [13]

End-to-end driving via conditional imitation learning,

F. Codevilla, M. M ¨uller, A. Dosovitskiy, A. L ´opez, and V . Koltun, “End-to-end driving via conditional imitation learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on . IEEE, 2017, pp. 746–753

work page 2017

[14] [14]

Recent advances in hierarchical reinforcement learning,

A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems , vol. 13, no. 4, pp. 341–379, 2003

work page 2003

[15] [15]

Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,

R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artiﬁcial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999

work page 1999

[16] [16]

Robot learn- ing from demonstration by constructing skill trees,

G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learn- ing from demonstration by constructing skill trees,” The International Journal of Robotics Research , vol. 31, no. 3, pp. 360–375, 2012

work page 2012

[17] [17]

Guided policy search,

S. Levine and V . Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013, pp. 1–9

work page 2013

[18] [18]

Stable function approximation in dynamic program- ming,

G. J. Gordon, “Stable function approximation in dynamic program- ming,” in Machine Learning Proceedings 1995 . Elsevier, 1995, pp. 261–268

work page 1995

[19] [19]

A sparse sampling algorithm for near-optimal planning in large markov decision processes,

M. Kearns, Y . Mansour, and A. Y . Ng, “A sparse sampling algorithm for near-optimal planning in large markov decision processes,” Ma- chine learning, vol. 49, no. 2-3, pp. 193–208, 2002

work page 2002

[20] [20]

Finite time bounds for sampling based ﬁtted value iteration,

C. Szepesv ´ari and R. Munos, “Finite time bounds for sampling based ﬁtted value iteration,” in Proceedings of the 22nd international conference on Machine learning , 2005, pp. 880–887

work page 2005

[21] [21]

Self-improving reactive agents based on reinforcement learning, planning and teaching,

L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine learning , vol. 8, no. 3-4, pp. 293–321, 1992

work page 1992

[22] [22]

A reduction from apprenticeship learning to classiﬁcation,

U. Syed and R. E. Schapire, “A reduction from apprenticeship learning to classiﬁcation,” in Advances in Neural Information Processing Systems, 2010, pp. 2253–2261

work page 2010

[23] [23]

Search-based structured prediction,

H. Daum ´e, J. Langford, and D. Marcu, “Search-based structured prediction,” Machine learning, vol. 75, no. 3, pp. 297–325, 2009

work page 2009

[24] [24]

On the generalization ability of online strongly convex programming algorithms,

S. M. Kakade and A. Tewari, “On the generalization ability of online strongly convex programming algorithms,” in Advances in Neural Information Processing Systems , 2009, pp. 801–808

work page 2009

[25] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Machine Learning, vol. 69, no. 2-3, pp. 169–192, 2007.

[26] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the 19th International Conference on Machine Learning (ICML), vol. 2, 2002, pp. 267–274.

[27] J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, 2004, pp. 831–838.

[28] G. Johansson and K. Rumar, “Drivers’ brake reaction times,” Human Factors, vol. 13, no. 1, pp. 23–27, 1971.

[29] D. V. McGehee, E. N. Mazzae, and G. S. Baldwin, “Driver reaction time in crash avoidance research: validation of a driving simulator study on a test track,” in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 44, no. 20, 2000.

[30] D. Wolinski, M. Lin, and J. Pettré, “WarpDriver: context-aware probabilistic motion prediction for crowd simulation,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, 2016.

[31] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end simulated driving,” in AAAI, 2017, pp. 2891–2897.

[32] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[33] W. Li, D. Wolinski, and M. C. Lin, “City-scale traffic animation using statistical learning and metamodel-based optimization,” ACM Trans. Graph., vol. 36, no. 6, pp. 200:1–200:12, Nov. 2017.

[34] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[35] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[36] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.

IX. APPENDIX

A. Solving An SPC Task

We show the proofs of solving an SPC task using standard supervised learning, DAGGER [1], and ADAPS, respectively. We use “state” and “observation” interchangeably here, since for these proofs we can always find a deterministic function that maps between the two.

Supervised Learning: The following proof is adapted and simplified from Ross et al. [1]. We include it here for completeness.

Theorem 2: Consider a $T$-step control task. Let $\epsilon = \mathbb{E}_{\phi \sim d_{\pi^*},\, a^* \sim \pi^*(\phi)}[l(\phi, \pi, a^*)]$ be the observed surrogate loss under the training distribution induced by the expert’s policy $\pi^*$. We assume $C \in [0, C_{\max}]$ and that $l$ upper bounds the 0-1 loss...

DAGGER: The following proof is adapted from Ross et al. [1]. We include it here for completeness. Note that for Theorem 3, we arrive at a third term that differs from that of Ross et al. [1].

Lemma 1: [1] Let $P$ and $Q$ be any two distributions over elements $x \in X$, and let $f : X \to \mathbb{R}$ be any bounded function such that $f(x) \in [a, b]$ for all $x \in X$. Let the range $r = b - a$. Then |E_{x∼...
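The statement of Lemma 1 is cut off above. In Ross et al. [1], the corresponding lemma concludes that $|\mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]| \le \frac{r}{2}\,\|P - Q\|_1$. Assuming that is the intended conclusion here, the following sketch checks the inequality numerically on random discrete distributions. The function names, the range $[a, b] = [-2, 5]$, and the sample sizes are illustrative choices, not taken from the paper.

```python
import random


def expectation(dist, values):
    """Expected value of f under a discrete distribution (lists of equal length)."""
    return sum(p * v for p, v in zip(dist, values))


def l1_distance(p, q):
    """L1 distance between two discrete distributions on the same support."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_distribution(n, rng):
    """Draw a random probability vector of length n by normalizing positive weights."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def lemma_holds(n=6, trials=1000, a=-2.0, b=5.0, seed=0):
    """Check |E_P[f] - E_Q[f]| <= (r / 2) * ||P - Q||_1 on random instances.

    Assumption: this is the bound stated by the truncated lemma (as in Ross et al.).
    """
    rng = random.Random(seed)
    r = b - a
    for _ in range(trials):
        p = random_distribution(n, rng)
        q = random_distribution(n, rng)
        f = [rng.uniform(a, b) for _ in range(n)]  # bounded function values in [a, b]
        gap = abs(expectation(p, f) - expectation(q, f))
        bound = 0.5 * r * l1_distance(p, q)
        if gap > bound + 1e-12:  # small slack for floating-point error
            return False
    return True
```

Under the same assumption, the bound is tight when $P$ and $Q$ are point masses on two elements where $f$ takes the values $a$ and $b$: the gap is $r$ and $\|P - Q\|_1 = 2$.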



ADAPS: With the assumption that we can treat the generated trajectories from our model and the additional data generated based on them as running a learned policy to sample independent expert trajectories at different states while performing policy roll-out, we have the following guarantee for ADAPS. To better understand the following theorem and proof, we...


Scenarios: We have tested our method in three scenarios. The first is a straight road which represents a linear geometry, the second is a curved road which represents a non-linear geometry, and the third is an open ground. The first two represent on-road situations while the last represents an off-road situation. Both the straight and curved roads consist...

Vehicle Specs: The vehicle’s speed is set to 20 m/s; this value is used to compute the throttle value in the simulator. Due to factors such as the rendering complexity and the delay of the communication module, the actual running speed is in the range of 20 ± 1 m/s. The length and width of the vehicle are 4.5 m and 2.5 m, respectively. The distance between ...

Obstacles: For the on-road scenarios, we use a scaled version of a virtual traffic cone as the obstacle on both the straight and curved roads. This scaling operation is meant to preserve the obstacle’s visibility, since at distances greater than 30 m a normal-sized obstacle is quickly reduced to just a few pixels. This is an intrinsic limitation of the sing...
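To put the 30 m visibility distance in perspective, the following back-of-the-envelope sketch computes how long the vehicle takes to cover that distance at the stated running speed of 20 ± 1 m/s. It assumes constant speed and no braking or steering, which is a simplification introduced here and not a model from the paper; the function and constant names are likewise hypothetical.

```python
NOMINAL_SPEED_MPS = 20.0      # set speed from the Vehicle Specs paragraph
SPEED_TOLERANCE_MPS = 1.0     # observed deviation of the running speed
VISIBILITY_DISTANCE_M = 30.0  # distance beyond which a normal-sized obstacle spans few pixels


def time_to_reach(distance_m: float, speed_mps: float) -> float:
    """Seconds needed to cover distance_m at a constant speed_mps."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return distance_m / speed_mps


def time_budget_range(distance_m: float = VISIBILITY_DISTANCE_M,
                      nominal: float = NOMINAL_SPEED_MPS,
                      tolerance: float = SPEED_TOLERANCE_MPS):
    """Return (shortest, nominal, longest) travel times over the speed range."""
    shortest = time_to_reach(distance_m, nominal + tolerance)  # fastest vehicle
    typical = time_to_reach(distance_m, nominal)
    longest = time_to_reach(distance_m, nominal - tolerance)   # slowest vehicle
    return shortest, typical, longest
```

At the nominal 20 m/s, `time_budget_range()` gives 1.5 s to cover 30 m, with the speed tolerance shifting that figure by less than a tenth of a second in either direction.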


Training Data: In order to train Following, we have built a waypoint system on each of the straight and curved roads for the AV to follow. By running the vehicle for roughly equal distances on both roads, we have gathered in total 65 061 images (33 642 images for the straight road and 31 419 images for the curved road). On the open ground, we h...
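As a quick consistency check on the on-road image counts quoted above, the following sketch totals the two figures and reports each road's share of the Following training set. Only the two counts come from the text; the dictionary keys and the helper name are illustrative.

```python
# Image counts quoted in the text for training the Following module.
ON_ROAD_IMAGE_COUNTS = {
    "straight": 33_642,
    "curved": 31_419,
}


def dataset_shares(counts):
    """Return the total number of images and each subset's fraction of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total_images, shares = dataset_shares(ON_ROAD_IMAGE_COUNTS)
```

The total matches the stated 65 061 images, and the split is close to even (a little under 52% straight road), in line with the "roughly equal distances" driven on the two roads.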
