pith. sign in

arxiv: 2605.24152 · v2 · pith:TTNV5GATnew · submitted 2026-05-22 · 💻 cs.AI

Neuro-Inspired Inverse Learning for Planning and Control

Pith reviewed 2026-06-30 16:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords inverse learningforward modelplanningcontrolD4RLembodied AIneuro-inspiredquantum gates
0
0 comments X

The pith

Inverse Learning optimizes full action sequences through a learned forward model to match or exceed offline RL on maze navigation tasks while using one to two orders of magnitude less inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an Inverter framework that draws on three brain-like principles: paired forward and inverse internal models, open-loop multi-step commands, and hierarchical organization. It formalizes Inverse Learning as a method that amortizes sequence planning by training an inverse model to optimize entire T-step action trajectories against a fixed forward model. On all nine D4RL maze2d and antmaze variants, single inverters or two-level stacks achieve an average 24.2 percent improvement over offline RL and diffusion baselines, with far lower test-time compute. The approach also succeeds at synthesizing single-qubit quantum gates at over 1000 times lower cost than standard numerical methods. A noted failure mode occurs when the forward model is trained on narrow data and can be hacked to produce unrealistic trajectories; this is addressed by using broader random training data.

Core claim

By optimizing the entire action sequence through the forward model rather than emitting one action at a time, the learned inverse model produces smooth, goal-coherent trajectories that reach control policies closer to the analytic optimum than the policy that generated the training data.

What carries the argument

The Inverter, a learned inverse model that, given a goal and a forward model, iteratively optimizes a full T-step action sequence at test time.

If this is right

  • Inverters match or beat current offline RL and diffusion planners across all nine evaluated D4RL maze environments while requiring 10-100 times less inference compute.
  • Optimizing full trajectories rather than single steps yields policies that exceed the performance of the data-generating policy itself.
  • Hierarchical n=2 inverter stacks extend the method to longer-horizon tasks without proportional increase in compute.
  • The same inverse-learning procedure applies to continuous quantum control, reproducing GRAPE-level gate fidelity at over 1000 times lower per-gate cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on continuous control domains beyond mazes by replacing the analytic forward model with a learned dynamics model that covers a wider state distribution.
  • Combining inverters with online model updates might reduce the need for broad offline data collection while preserving the trajectory-wide coherence benefit.
  • The hierarchical stacking pattern suggests a natural route to multi-agent or multi-timescale planning by nesting inverters at different temporal resolutions.

Load-bearing premise

The forward model used in inverse optimization must accurately capture environment dynamics over the full planning horizon without exploitable errors that allow unrealistic but high-scoring trajectories.

What would settle it

Train the forward model on a narrow subset of the data and check whether inverters produce trajectories that achieve high simulated reward yet fail when executed in the real environment.

Figures

Figures reproduced from arXiv: 2605.24152 by Maryna Kapitonova, Tonio Ball.

Figure 1
Figure 1. Figure 1: The Inverter planning and control framework. (A) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Trajectory comparison on [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Directional movement spectrum and sequence-level optimization. Panels A–F show polar histograms of T=16-step chunk-displacement directions for the training data and a representative subset of methods; angles indicate 0 ◦ =+x, 90◦ =+y in raw maze data coordinates, based on each method’s per-episode rollout until the first step where the agent enters a goal-ball of radius 0.5, or time out. Panel G and H show… view at source ↗
Figure 4
Figure 4. Figure 4: Action-space structure and control optimality. Per-step action scatter plots (ax,ay) for the training data and a representative subset of methods (Panels A–F). Note that BC-10% [57] lands below the training data on action saturation (Panel H), even though it is trained to imitate it: a unimodal Gaussian policy head contracts the action distribution toward the interior – the multimodal-action BC failure mod… view at source ↗
Figure 5
Figure 5. Figure 5: Simultaneous planning and control with a coupled Game and Locomotion Inverter on the AntMan [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Random training data yields calibrated FoM rewards; narrow expert data induces FoM hacking. Each panel compares FoM-predicted reward to realized game reward over 1000 sampled start states for the final high-level IM checkpoints; point color encodes the Out-of-Distribution (OOD) score OOD(x) = d1(x,T )/de1(T ), where d1(x,T ) = miny∈T ∥x−y∥ is each sampled start state x’s nearest-neighbor distance to the tr… view at source ↗
Figure 7
Figure 7. Figure 7: Single-shot quantum gate synthesis on a 3-level transmon under a known Lindbladian with a Pulse [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-episode trajectory overlays for the Inverter on every maze variant we evaluate. [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The simple algorithmic Path Inverter uses only the offline training-data distribution, no maze [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
read the original abstract

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a neuro-inspired Inverter framework for embodied planning and control that uses paired forward/inverse internal models, open-loop multi-step commands, and hierarchical organization, trained via Inverse Learning (IL). It claims that single Inverters or n=2 hierarchical stacks match or exceed offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL tasks by +24.2% average (range -1.9% to +78.2%) at 10-100x lower inference compute, while producing smoother trajectory-wide structure; it also reports an application to single-qubit quantum gate synthesis matching GRAPE fidelity at >1000x lower compute, and identifies FoM hacking as a failure mode under narrow data coverage that is mitigated by broader random training data.

Significance. If the empirical claims hold after verification of the forward-model assumption, the work offers a distinct bridge between amortized RL policies and full-horizon optimal control, with neuro-inspired structure enabling efficient test-time optimization over entire trajectories. The quantum-control application demonstrates versatility beyond robotics benchmarks. Strengths include the explicit delineation of IL from supervised/RL/imitation learning and the identification of a concrete failure mode with a proposed mitigation.

major comments (1)
  1. [Abstract / D4RL experiments] Abstract and D4RL experiments section: the headline +24.2% performance claim requires that optimization through the learned Forward Model (FoM) over the full T-step horizon produces trajectories whose dynamics match the true environment (i.e., no exploitable FoM hacking). The abstract itself states that this assumption fails under narrow training-data coverage and is only mitigated by switching to broader random data; however, the manuscript provides no quantitative diagnostics (e.g., per-step or cumulative prediction error on held-out trajectories, or comparison of FoM-optimized vs. true-environment rollouts) confirming that residual discrepancies do not drive the reported gains on the specific D4RL variants.
minor comments (1)
  1. [Methods] Notation for the hierarchical n=2 Inverter stack and the exact definition of Inverse Learning (IL) objective should be introduced with an equation in the methods section for reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting the need to substantiate the Forward Model (FoM) assumption underlying the D4RL performance claims. We address this point directly below and will incorporate additional diagnostics in the revision.

read point-by-point responses
  1. Referee: [Abstract / D4RL experiments] Abstract and D4RL experiments section: the headline +24.2% performance claim requires that optimization through the learned Forward Model (FoM) over the full T-step horizon produces trajectories whose dynamics match the true environment (i.e., no exploitable FoM hacking). The abstract itself states that this assumption fails under narrow training-data coverage and is only mitigated by switching to broader random data; however, the manuscript provides no quantitative diagnostics (e.g., per-step or cumulative prediction error on held-out trajectories, or comparison of FoM-optimized vs. true-environment rollouts) confirming that residual discrepancies do not drive the reported gains on the specific D4RL variants.

    Authors: We agree that quantitative validation of FoM fidelity on the specific D4RL tasks is necessary to rule out exploitable discrepancies as a driver of the reported gains. The manuscript already notes that narrow data coverage induces FoM hacking and that broader random training data mitigates it; however, we did not include explicit error metrics or true-environment rollout comparisons for the D4RL variants. In the revised version we will add (i) per-step and cumulative prediction error on held-out trajectories drawn from the same D4RL datasets and (ii) side-by-side comparison of FoM-optimized trajectories versus their execution in the true environment. These additions will directly confirm that the +24.2 % average improvement (and the per-task range) arises from genuine trajectory optimization rather than model exploitation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark results, not self-referential derivations

full rationale

The paper presents an empirical framework (Inverter stacks trained via Inverse Learning) whose headline results are performance numbers on public D4RL maze2d and antmaze tasks, compared against offline-RL and diffusion baselines. No derivation chain, uniqueness theorem, or ansatz is shown that reduces by construction to fitted inputs or self-citations; the abstract explicitly flags and mitigates the FoM-hacking failure mode rather than assuming it away. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the neuro-inspired principles are presented as background motivation rather than formal axioms.

pith-pipeline@v0.9.1-grok · 5867 in / 1312 out tokens · 36616 ms · 2026-06-30T16:04:09.343381+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

112 extracted references · 35 canonical work pages · 6 internal anchors

  1. [1]

    Emanuel Todorov and Michael I. Jordan. Optimal feedback control as a theory of motor coordination.Nature Neuroscience, 5(11):1226–1235, 2002. doi: 10.1038/nn963

  2. [2]

    Knill and Alexandre Pouget

    David C. Knill and Alexandre Pouget. The Bayesian brain: The role of uncertainty in neural coding and computation.Trends in Neurosciences, 27(12):712–719, 2004. doi: 10.1016/j.tins.2004.10.007

  3. [3]

    The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11(2): 127–138, 2010

    Karl Friston. The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11(2): 127–138, 2010. doi: 10.1038/nrn2787

  4. [4]

    Gershman, Eric J

    Samuel J. Gershman, Eric J. Horvitz, and Joshua B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines.Science, 349(6245):273–278, 2015. doi: 10.1126/ science.aac6076

  5. [5]

    Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources.Behavioral and brain sciences, 43:e1, 2020

    Falk Lieder and Thomas L Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources.Behavioral and brain sciences, 43:e1, 2020

  6. [6]

    Homo heuristicus: Why biased minds make better inferences.Topics in cognitive science, 1(1):107–143, 2009

    Gerd Gigerenzer and Henry Brighton. Homo heuristicus: Why biased minds make better inferences.Topics in cognitive science, 1(1):107–143, 2009

  7. [7]

    Jordan and David E

    Michael I. Jordan and David E. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307–354, 1992. doi: 10.1207/s15516709cog1603_1

  8. [8]

    Internal models for motor control and trajectory planning.Current Opinion in Neurobiology, 9(6):718–727, 1999

    Mitsuo Kawato. Internal models for motor control and trajectory planning.Current Opinion in Neurobiology, 9(6):718–727, 1999. doi: 10.1016/S0959-4388(99)00028-8

  9. [9]

    Wolpert and Mitsuo Kawato

    Daniel M. Wolpert and Mitsuo Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998. doi: 10.1016/S0893-6080(98)00066-5

  10. [10]

    Ballistic movement: muscle activation and neuromuscular adaptation

    E Paul Zehr and Digby G Sale. Ballistic movement: muscle activation and neuromuscular adaptation. Canadian Journal of applied physiology, 19(4):363–378, 1994

  11. [11]

    Forward modeling allows feedback control for fast reaching movements

    Michel Desmurget and Scott Grafton. Forward modeling allows feedback control for fast reaching movements. Trends in Cognitive Sciences, 4(11):423–431, 2000

  12. [12]

    Graybiel

    Ann M. Graybiel. The basal ganglia and chunking of action repertoires.Neurobiology of Learning and Memory, 70(1–2):119–136, 1998. doi: 10.1006/nlme.1998.3843

  13. [13]

    Motor skill learning between selection and execution.Trends in Cognitive Sciences, 19(4):227–233, 2015

    Jörn Diedrichsen and Katja Kornysheva. Motor skill learning between selection and execution.Trends in Cognitive Sciences, 19(4):227–233, 2015

  14. [14]

    The role of higher-order motor areas in voluntary movement as revealed by high-resolution eeg and fmri.Neuroimage, 10(6):682–694, 1999

    Tonio Ball, Axel Schreiber, Bernd Feige, Michael Wagner, Carl Hermann Lücking, and Rumyana Kristeva- Feige. The role of higher-order motor areas in voluntary movement as revealed by high-resolution eeg and fmri.Neuroimage, 10(6):682–694, 1999

  15. [15]

    Uncovering a timescale hierarchy by studying the brain in a natural context.The Journal of Neuroscience, 45(12):e2368242025, 2025

    Uri Hasson. Uncovering a timescale hierarchy by studying the brain in a natural context.The Journal of Neuroscience, 45(12):e2368242025, 2025

  16. [16]

    Pontryagin, Vladimir G

    Lev S. Pontryagin, Vladimir G. Boltyansky, Revaz V . Gamkrelidze, and Evgenii F. Mishchenko.The Mathematical Theory of Optimal Processes. Interscience Publishers, New York, 1962

  17. [17]

    Oskar Bolza.Vorlesungen über Variationsrechnung. B. G. Teubner, Leipzig and Berlin, 1909

  18. [18]

    Bellman.Dynamic Programming

    Richard E. Bellman.Dynamic Programming. Princeton University Press, 1957

  19. [19]

    Solving inverse problems using data-driven models.Acta numerica, 28:1–174, 2019

    Simon Arridge, Peter Maass, Ozan Öktem, and Carola-Bibiane Schönlieb. Solving inverse problems using data-driven models.Acta numerica, 28:1–174, 2019. 18 Inverse Learning for Planning and Control

  20. [20]

    Deep learning techniques for inverse problems in imaging.IEEE Journal on Selected Areas in Information Theory, 1(1):39–56, 2020

    Gregory Ongie, Ajil Jalal, Christopher A Metzler, Richard G Baraniuk, Alexandros G Dimakis, and Rebecca Willett. Deep learning techniques for inverse problems in imaging.IEEE Journal on Selected Areas in Information Theory, 1(1):39–56, 2020

  21. [21]

    Learning fast approximations of sparse coding

    Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. InProceedings of the 27th international conference on international conference on machine learning, pages 399–406, 2010

  22. [22]

    Diffusion Posterior Sampling for General Noisy Inverse Problems

    Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems.arXiv preprint arXiv:2209.14687, 2022

  23. [23]

    Pistikopoulos

    Alberto Bemporad, Manfred Morari, Vivek Dua, and Efstratios N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems.Automatica, 38(1):3–20, 2002. doi: 10.1016/S0005-1098(01)00174-1

  24. [24]

    Guided policy search

    Sergey Levine and Vladlen Koltun. Guided policy search. InInternational conference on machine learning, pages 1–9. PMLR, 2013

  25. [25]

    Combining the benefits of function approximation and trajectory optimiza- tion

    Igor Mordatch and Emo Todorov. Combining the benefits of function approximation and trajectory optimiza- tion. InRobotics: Science and Systems, 2014. doi: 10.15607/RSS.2014.X.052

  26. [26]

    Mpc-net: A first principles guided policy search.IEEE Robotics and Automation Letters, 5(2):2897–2904, 2020

    Jan Carius, Farbod Farshidian, and Marco Hutter. Mpc-net: A first principles guided policy search.IEEE Robotics and Automation Letters, 5(2):2897–2904, 2020

  27. [27]

    Learning continuous control policies by stochastic value gradients

    Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. InAdvances in Neural Information Processing Systems, 2015

  28. [28]

    Zico Kolter

    Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017

  29. [29]

    Differentiable mpc for end-to-end planning and control.Advances in neural information processing systems, 31, 2018

    Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable mpc for end-to-end planning and control.Advances in neural information processing systems, 31, 2018

  30. [30]

    Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments

    Jürgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, Institut für Informatik, Technische Universität München, 1990

  31. [31]

    A path towards autonomous machine intelligence, version 0.9.2

    Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2. OpenReview position paper,

  32. [32]

    PILCO: A model-based and data-efficient approach to policy search

    Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. InProceedings of the 28th International Conference on Machine Learning, pages 465–472, 2011

  33. [33]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representations, 2020

  34. [34]

    Universal planning networks: Learning generalizable representations for visuomotor control

    Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 4732– 4741, 2018

  35. [35]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations, 2024

  36. [36]

    Offline reinforcement learning as one big sequence modeling problem

    Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. InAdvances in Neural Information Processing Systems, volume 34, pages 1273–1286, 2021. 19 Inverse Learning for Planning and Control

  37. [37]

    Decision Transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement learning via sequence modeling. InAdvances in Neural Information Processing Systems, volume 34, 2021

  38. [38]

    Tenenbaum, and Sergey Levine

    Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InProceedings of the 39th International Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning Research, pages 9902–9915. PMLR, 17–23 Jul 2022

  39. [39]

    Lillicrap, and David Silver

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy P. Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020. doi: 10.1038/s41586-020-03051-4

  40. [40]

    Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models, 2025

  41. [41]

    Navigation world models, 2024

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models, 2024. URLhttps://arxiv.org/abs/2412.03572

  42. [42]

    Qureshi, Anthony Simeonov, Mayur J

    Ahmed H. Qureshi, Anthony Simeonov, Mayur J. Bency, and Michael C. Yip. Motion planning networks. In IEEE International Conference on Robotics and Automation, pages 2118–2124, 2019. doi: 10.1109/ICRA. 2019.8793889

  43. [43]

    Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems, 2023. doi: 10.15607/RSS.2023.XIX.016

  44. [44]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  45. [45]

    Sutton, Doina Precup, and Satinder Singh

    Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2):181–211, 1999. doi: 10.1016/S0004-3702(99)00052-1

  46. [46]

    Hierarchical world models as visual whole-body humanoid controllers

    Nicklas Hansen, Jyothir S V , Vlad Sobal, Yann LeCun, Xiaolong Wang, and Hao Su. Hierarchical world models as visual whole-body humanoid controllers. InInternational Conference on Learning Representations, 2025

  47. [47]

    Integrated task and motion planning.Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021

    Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated task and motion planning.Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021

  48. [48]

    Warm start of mixed-integer programs for model predictive control of hybrid systems.IEEE Transactions on Automatic Control, 66(6):2433–2448, 2020

    Tobia Marcucci and Russ Tedrake. Warm start of mixed-integer programs for model predictive control of hybrid systems.IEEE Transactions on Automatic Control, 66(6):2433–2448, 2020

  49. [49]

    Model predictive control with signal temporal logic specifications

    Vasumathi Raman, Alexandre Donzé, Mehdi Maasoumy, Richard M Murray, Alberto Sangiovanni-Vincentelli, and Sanjit A Seshia. Model predictive control with signal temporal logic specifications. In53rd IEEE Conference on Decision and Control, pages 81–87. IEEE, 2014

  50. [50]

    Directly fine-tuning diffusion models on differen- tiable rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David Fleet. Directly fine-tuning diffusion models on differen- tiable rewards. InInternational Conference on Learning Representations, volume 2024, pages 4793–4822, 2024

  51. [51]

    Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023. 20 Inverse Learning for Planning and Control

  52. [52]

    Fine-tuning discrete diffusion models via reward optimization with applications to dna and protein design

    Chenyu Wang, Masatoshi Uehara, Yichun He, Amy Wang, Avantika Lal, Tommi Jaakkola, Sergey Levine, Aviv Regev, Hanchen Wang, and Tommaso Biancalani. Fine-tuning discrete diffusion models via reward optimization with applications to dna and protein design. InInternational Conference on Learning Represen- tations, volume 2025, pages 47871–47899, 2025

  53. [53]

    Adjoint matching: Fine- tuning flow and diffusion generative models with memoryless stochastic optimal control

    Carles Domingo i Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine- tuning flow and diffusion generative models with memoryless stochastic optimal control. InInternational Conference on Learning Representations, volume 2025, pages 53791–53846, 2025

  54. [54]

    Algorithms for inverse reinforcement learning

    Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. InIcml, volume 1, page 2, 2000

  55. [55]

    CORL: Research-oriented deep offline reinforcement learning library

    Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. InAdvances in Neural Information Processing Systems, volume 36, 2023

  56. [56]

    DecisionLLM: Large language models for long sequence decision exploration, 2026

    Xiaowei Lv, Zhilin Zhang, Yijun Li, Yusen Huo, Siyuan Ju, Xuyan Li, Chunxiang Hong, Tianyu Wang, Yongcai Wang, Peng Sun, Chuan Yu, Jian Xu, and Bo Zheng. DecisionLLM: Large language models for long sequence decision exploration, 2026. URLhttps://arxiv.org/abs/2601.10148

  57. [57]

    When should we prefer offline reinforcement learning over behavioral cloning? InInternational Conference on Learning Representations, 2022

    Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. When should we prefer offline reinforcement learning over behavioral cloning? InInternational Conference on Learning Representations, 2022

  58. [58]

    Implicit behavioral cloning

    Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In5th Annual Conference on Robot Learning, 2021. URLhttps://openreview.net/forum?id=rif3a5NAxU6

  59. [59]

    Paul M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6):381–391, 1954

  60. [60]

    E. R. F. W. Crossman and P. J. Goodeve. Feedback control of hand-movement and Fitts’ law.The Quarterly Journal of Experimental Psychology Section A, 35(2):251–278, 1983. doi: 10.1080/14640748308402133

  61. [61]

    Meyer, Richard A

    David E. Meyer, Richard A. Abrams, Sylvan Kornblum, Charles E. Wright, and J. E. Keith Smith. Optimality in human motor performance: Ideal control of rapid aimed movements.Psychological Review, 95(3): 340–370, 1988. doi: 10.1037/0033-295X.95.3.340

  62. [62]

    Optimal control of coupled spin dynamics: design of nmr pulse sequences by gradient ascent algorithms.Journal of magnetic resonance, 172(2):296–305, 2005

    Navin Khaneja, Timo Reiss, Cindie Kehlet, Thomas Schulte-Herbrüggen, and Steffen J Glaser. Optimal control of coupled spin dynamics: design of nmr pulse sequences by gradient ascent algorithms.Journal of magnetic resonance, 172(2):296–305, 2005

  63. [63]

    Random layers for quantum optimal control with exponential expressivity.arXiv preprint arXiv:2603.08948, 2026

    Marco Dall’Ara, Martin Koppenhöfer, Florentin Reiter, Thomas Wellens, Simone Montangero, and Wal- ter Hahn. Random layers for quantum optimal control with exponential expressivity.arXiv preprint arXiv:2603.08948, 2026

  64. [64]

    Fidelity-informed neural pulse compilation of a continuous family of quantum gates with uncertainty-margin analysis

    Arash Fath Lipaei, Ebrahim Khaleghian, Selin Aslan, Gani Göral, Zidong Lin, and Özgür E Müstecaplıo˘glu. Fidelity-informed neural pulse compilation of a continuous family of quantum gates with uncertainty-margin analysis.arXiv preprint arXiv:2604.11314, 2026

  65. [65]

    Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

    Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

  66. [66]

    The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat.Brain research, 1971

    John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat.Brain research, 1971

  67. [67]

    Microstructure of a spatial map in the entorhinal cortex.Nature, 436(7052):801–806, 2005

    Torkel Hafting, Marianne Fyhn, Sturla Molden, May-Britt Moser, and Edvard I Moser. Microstructure of a spatial map in the entorhinal cortex.Nature, 436(7052):801–806, 2005. 21 Inverse Learning for Planning and Control

  68. [68]

    Role for supplementary motor area cells in planning several movements ahead

    Jun Tanji and Keisetsu Shima. Role for supplementary motor area cells in planning several movements ahead. Nature, 371(6496):413–416, 1994

  69. [69]

    Combinations of muscle synergies in the construction of a natural motor behavior.Nature neuroscience, 6(3):300–308, 2003

    Andrea d’Avella, Philippe Saltiel, and Emilio Bizzi. Combinations of muscle synergies in the construction of a natural motor behavior.Nature neuroscience, 6(3):300–308, 2003

  70. [70]

    Invariant visual represen- tation by single neurons in the human brain.Nature, 435(7045):1102–1107, 2005

    R Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual represen- tation by single neurons in the human brain.Nature, 435(7045):1102–1107, 2005

  71. [71]

    Roger N. Lemon. Descending pathways in motor control.Annual Review of Neuroscience, 31:195–218,

  72. [72]

    doi: 10.1146/annurev.neuro.31.060407.125547

  73. [73]

    Jean-Alban Rathelot and Peter L. Strick. Subdivisions of primary motor cortex based on cortico-motoneuronal cells.Proceedings of the National Academy of Sciences, 106(3):918–923, 2009. doi: 10.1073/pnas. 0808362106

  74. [74]

    Auditory fovea and Doppler shift compensation: Adaptations for flutter detection in echolocating bats using CF-FM signals.Journal of Comparative Physiology A, 197(5): 541–559, 2011

    Hans-Ulrich Schnitzler and Annette Denzinger. Auditory fovea and Doppler shift compensation: Adaptations for flutter detection in echolocating bats using CF-FM signals.Journal of Comparative Physiology A, 197(5): 541–559, 2011. doi: 10.1007/s00359-010-0569-6

  75. [75]

    Neuronal basis for object location in the vibrissa scanning sensori- motor system.Neuron, 72(3):455–468, 2011

    David Kleinfeld and Martin Deschênes. Neuronal basis for object location in the vibrissa scanning sensori- motor system.Neuron, 72(3):455–468, 2011. doi: 10.1016/j.neuron.2011.10.009

  76. [76]

    Kupsky, and Gary H

    Jeheskel Shoshani, William J. Kupsky, and Gary H. Marchant. Elephant brain: Part I: Gross morphology, functions, comparative anatomy, and evolution.Brain Research Bulletin, 70(2):124–157, 2006. doi: 10.1016/j.brainresbull.2006.03.016

  77. [77]

    Roofline: an insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

  78. [78]

    Efficiently scaling transformer inference.Proceedings of machine learning and systems, 5:606–624, 2023

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference.Proceedings of machine learning and systems, 5:606–624, 2023

  79. [79]

    Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. InProceedings of the 29th ACM international conference on architectural support for programming lang...

  80. [80]

    Jax: composable transforma- tions of python+ numpy programs

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. Jax: composable transforma- tions of python+ numpy programs. 2018

Showing first 80 references.