Learning Arbitration for Shared Autonomy by Hindsight Data Aggregation

Jim Mainprice; Marc Toussaint; Yoojin Oh

arXiv: 1906.12280 · v1 · pith:TI2QISF3new · submitted 2019-06-28 · 💻 cs.RO

Pith reviewed 2026-05-25 13:33 UTC · model grok-4.3

keywords: shared autonomy · arbitration function · recurrent neural network · hindsight data aggregation · teleoperation · pick-and-place · intent inference

The pith

A recurrent neural network learns an arbitration function for shared autonomy by training on user interaction data collected during shared control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to automate the arbitration function that decides when to blend user commands with autonomous robot actions in teleoperated pick-and-place tasks. It models this function as a recurrent neural network that receives the current state, intent prediction scores, and user input, then outputs a blending weight. Training occurs through hindsight data aggregation: users operate the shared-control system, and the resulting interaction traces are used to update the network. A reader would care because hand-designed arbitration rules are brittle, and a data-driven alternative could make shared control more responsive to actual user behavior without manual retuning for each task.

Core claim

The authors define a shared control policy that blends direct user control and autonomous control based on intent inference, then replace the handcrafted arbitration rule with a recurrent neural network whose inputs are state, intent scores, and user command. They train this network by hindsight data aggregation on traces gathered while users perform the task under the shared-control policy itself, and they report preliminary comparisons against a handcrafted baseline in a virtual gripper environment.

What carries the argument

Recurrent neural network that maps state, intent prediction scores, and user command to an arbitration weight between user and robot commands, trained by hindsight data aggregation on shared-control interaction traces.
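As a rough illustration of the mapping described above, the following Python sketch wires a small Elman-style recurrent cell to a sigmoid output and a linear blend. The class name `ArbitrationRNN`, the hidden size, the random untrained weights, and the convention that a weight of 1 means full autonomy are all assumptions made for illustration; the abstract does not specify the paper's actual architecture.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ArbitrationRNN:
    """Hypothetical recurrent cell: (state, intent scores, user command) -> alpha in (0, 1)."""

    def __init__(self, state_dim, n_goals, cmd_dim, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + n_goals + cmd_dim
        # Untrained random weights; a real system would fit these to interaction traces.
        self.W_in = rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.b_h = np.zeros(hidden_dim)
        self.w_out = rng.normal(0.0, 0.1, hidden_dim)
        self.b_out = 0.0
        self.h = np.zeros(hidden_dim)

    def reset(self):
        """Clear the hidden state at the start of an episode."""
        self.h = np.zeros_like(self.h)

    def step(self, state, intent_scores, user_cmd):
        """Consume one timestep of inputs and return the arbitration weight."""
        x = np.concatenate([state, intent_scores, user_cmd])
        self.h = np.tanh(self.W_in @ x + self.W_h @ self.h + self.b_h)
        return float(sigmoid(self.w_out @ self.h + self.b_out))


def blend(alpha, user_cmd, robot_cmd):
    # Assumed convention: alpha = 1 is fully autonomous, alpha = 0 is direct user control.
    return alpha * robot_cmd + (1.0 - alpha) * user_cmd


# Usage with made-up inputs: a 2-D gripper state, three candidate goals, a 2-D velocity command.
rnn = ArbitrationRNN(state_dim=2, n_goals=3, cmd_dim=2)
user_cmd = np.array([1.0, 0.0])
robot_cmd = np.array([0.0, 1.0])
alpha = rnn.step(np.array([0.1, 0.2]), np.array([0.7, 0.2, 0.1]), user_cmd)
command = blend(alpha, user_cmd, robot_cmd)
```

The sigmoid keeps the weight strictly between 0 and 1, so the blended command always lies between the user's and the robot's commands.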

If this is right

The arbitration function can be learned directly from traces of users operating the shared-control system without separate offline demonstrations.
Because the policy remains differentiable, the learned arbitration can be further optimized end-to-end with the rest of the shared-autonomy stack.
The approach produces measurable improvements over a fixed handcrafted arbitration rule in virtual teleoperation trials.
Observed limitations point to the value of adding user-specific adaptation mechanisms on top of the learned arbitration.
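The differentiability point can be made concrete with a short derivation. The notation is assumed here, not taken from the paper: $u^{H}_t$ is the user command, $u^{R}_t$ the autonomous command, $s_t$ the state, $p_t$ the intent scores, $h_{t-1}$ the recurrent hidden state, and the blend is taken to be linear. On a recorded trace, where the inputs are treated as fixed, the gradient of any loss $\mathcal{L}$ on the blended command reaches the arbitration parameters $\theta$ through $\alpha_t$ alone:

```latex
u_t = \alpha_t\, u^{R}_t + (1 - \alpha_t)\, u^{H}_t,
\qquad
\alpha_t = f_\theta\!\left(s_t,\, p_t,\, u^{H}_t,\, h_{t-1}\right)

\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta}
  = \sum_t \left(\frac{\partial \mathcal{L}}{\partial u_t}\right)^{\!\top}
    \left(u^{R}_t - u^{H}_t\right)
    \frac{\mathrm{d}\alpha_t}{\mathrm{d}\theta}
```

Here $\mathrm{d}\alpha_t / \mathrm{d}\theta$ is the total derivative, including the dependence through earlier hidden states (backpropagation through time). The factor $u^{R}_t - u^{H}_t$ shows that timesteps where user and robot already agree contribute no gradient under this assumed blend.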

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-aggregation loop could be applied to other manipulation or navigation tasks if the intent predictor and motion generator are replaced.
Performance may degrade when the user population changes, because the training distribution is shaped by the current arbitration policy.
Adding an online adaptation layer that fine-tunes the network per user after initial training would address the adaptability gap noted in the results.

Load-bearing premise

Interaction data collected while users operate the shared system supplies an unbiased and sufficient training distribution for the RNN.

What would settle it

In a controlled user study on the same pick-and-place tasks, the learned arbitration produces measurably higher task completion times or lower subjective ratings than the handcrafted baseline.

Figures

Figures reproduced from arXiv: 1906.12280 by Jim Mainprice, Marc Toussaint, Yoojin Oh.

**Figure 2.** Top: predicted intent 28x28 likelihood heatmaps. Bot

**Figure 3.** Prediction and assistance for the wrong goal. Left (1 and 2): alpha trained with 30 episodes. Right (3 and 4): alpha

**Figure 5.** Comparing the average completion time for each mode

**Figure 6.** Baxter setup using RGBD camera with object and arm

Original abstract

In this paper we present a framework for the teleoperation of pick-and-place tasks. We define a shared control policy that allows to blend between direct user control and autonomous control based on user intent inference. One of the main challenges in shared autonomy systems is to define the arbitration function, which decides when to let the autonomous agent take over. In this work, we propose a model and training method to learn the arbitration function. Our model is based on a recurrent neural network that takes as input the state, intent prediction scores and user command to produce an arbitration between user and robot commands. This work extends our previous work on differentiable policies for shared autonomy. Differentiability of the policy is desirable to further train the shared autonomy system end-to-end. In this work we propose training of the arbitration function by using data from user performing the task with shared control. We present initial results by teleoperating a gripper in a virtual environment using pre-trained motion generation and intent prediction. We compare our data aggregation training procedure to a handcrafted arbitration function. Our preliminary results show the efficacy of the approach and shed light on limitations that we believe demonstrate the need for user adaptability in shared autonomy systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Incremental RNN arbitration for shared autonomy via hindsight aggregation on user data, but the abstract shows no metrics and leaves the training distribution circularity unaddressed.

The paper extends the authors' earlier differentiable shared-autonomy policies by adding an RNN that outputs an arbitration weight from state, intent scores, and user command. Training uses hindsight data aggregation on trajectories collected while users operate the shared-control system itself. That combination is the concrete addition over their prior work and over generic imitation methods. It targets a practical pain point in teleoperation: deciding when the robot should take over during pick-and-place tasks. The virtual-environment comparison to a handcrafted baseline is a reasonable first check, and the note that results highlight the need for user adaptability is an honest limitation call-out. The central empirical claim, however, rests on unshown numbers; the abstract supplies no quantitative scores, trial counts, or protocol details. More importantly, the training loop collects data under the very arbitration policy being learned, so the state-action distribution depends on the current parameters. The abstract does not explain whether a single aggregation pass suffices or whether an outer iteration is required to avoid distribution shift. Because the work is preliminary and the evaluation thin, the main value is the training recipe rather than a validated method. Readers already working on shared autonomy or on-policy imitation in robotics will find the architecture and aggregation step worth looking at. It is coherent enough on its own terms to merit referee time, though any review should press for quantitative results and a clearer treatment of the data-collection dependency.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a shared autonomy framework for pick-and-place teleoperation. A recurrent neural network learns an arbitration function that blends user and robot commands; the network is trained by hindsight data aggregation on trajectories collected while users interact with the shared-control system itself. The approach extends prior differentiable-policy work and is compared to a handcrafted arbitration baseline in a virtual gripper environment using pre-trained motion generation and intent prediction. Preliminary results are reported to demonstrate efficacy while indicating the need for user adaptability.

Significance. If the empirical claims can be placed on a rigorous quantitative footing, the work would supply a data-driven route to arbitration in shared autonomy and would usefully extend differentiable shared-control policies. The hindsight-aggregation training procedure is a concrete attempt to mitigate on-policy distribution shift, which is a recognized difficulty in this domain.

major comments (2)

[Abstract] The claim that the preliminary results 'show the efficacy of the approach' is unsupported; no quantitative metrics, error bars, dataset sizes, number of users, or evaluation protocol are supplied, so the comparison to the handcrafted baseline cannot be assessed.
[Method] Training procedure (hindsight data aggregation): because the arbitration output directly modulates the robot command experienced by the user, the state-action distribution at data-collection time is a function of the current arbitration parameters. The manuscript does not state whether a single round of aggregation is claimed to suffice or whether an outer loop that re-collects data after each update is required; this circular dependency is load-bearing for the central claim that the RNN learns an effective arbitration function.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major points below and will revise the manuscript accordingly to strengthen the presentation of the preliminary results and clarify the training procedure.

Referee: [Abstract] The claim that the preliminary results 'show the efficacy of the approach' is unsupported; no quantitative metrics, error bars, dataset sizes, number of users, or evaluation protocol are supplied, so the comparison to the handcrafted baseline cannot be assessed.

Authors: We agree that the abstract overstates what the preliminary results support. The reported experiments are initial demonstrations in a virtual environment without the requested quantitative details. We will revise the abstract to remove the efficacy claim, qualify all statements as preliminary, and add a note that detailed metrics, user counts, and protocols appear in the experimental section.

revision: yes

Referee: [Method] Training procedure (hindsight data aggregation): because the arbitration output directly modulates the robot command experienced by the user, the state-action distribution at data-collection time is a function of the current arbitration parameters. The manuscript does not state whether a single round of aggregation is claimed to suffice or whether an outer loop that re-collects data after each update is required; this circular dependency is load-bearing for the central claim that the RNN learns an effective arbitration function.

Authors: The manuscript describes a single round of data collection under the initial shared-control policy followed by hindsight aggregation to train the RNN. We acknowledge that the text does not explicitly address whether an outer iterative loop is required to mitigate distribution shift. In the revision we will add a paragraph clarifying that our reported experiments used one round of collection and training, discuss the potential limitations of this choice relative to full DAgger-style iteration, and note that the hindsight formulation is intended to reduce (but not eliminate) the on-policy mismatch.

revision: yes
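The difference between a single round of collection and a DAgger-style outer loop can be sketched in a few lines of Python. Everything concrete below is an assumption for illustration: the hindsight labelling rule (a step is labelled 1 when the predictor's top goal matches the goal known after the episode), the logistic-regression learner standing in for the RNN, and the synthetic episode generator standing in for real users. With `n_rounds=1` the loop matches the single round described in the response; larger values re-collect data after each update.

```python
import numpy as np


def hindsight_labels(intent_scores, true_goal):
    """1.0 where the predictor's top goal matches the goal known after the episode, else 0.0."""
    return (np.argmax(intent_scores, axis=1) == true_goal).astype(float)


def fit_logistic(dataset, steps=200, lr=0.5):
    """Stand-in learner (logistic regression); the paper trains an RNN instead."""
    X = np.concatenate([feats for feats, _ in dataset])
    y = np.concatenate([labels for _, labels in dataset])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)  # gradient step on the mean log loss
    return w


def toy_collect_episode(policy, rng, horizon=20, n_goals=3):
    """Synthetic trace. NOTE: ignores `policy`; with real users the trace would depend on it."""
    true_goal = int(rng.integers(n_goals))
    logits = rng.normal(0.0, 1.0, (horizon, n_goals))
    logits[:, true_goal] += np.linspace(0.0, 3.0, horizon)  # evidence grows over time
    scores = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return scores, true_goal


def aggregate_and_train(collect_episode, fit, n_rounds=1, episodes_per_round=10):
    """Collect episodes under the current policy, label in hindsight, retrain on all data."""
    policy, dataset = None, []
    for _ in range(n_rounds):
        for _ in range(episodes_per_round):
            scores, true_goal = collect_episode(policy)
            dataset.append((scores, hindsight_labels(scores, true_goal)))
        policy = fit(dataset)  # retrain on everything aggregated so far
    return policy, dataset


rng = np.random.default_rng(0)
collect = lambda policy: toy_collect_episode(policy, rng)
single_round, data_1 = aggregate_and_train(collect, fit_logistic, n_rounds=1)
iterated, data_3 = aggregate_and_train(collect, fit_logistic, n_rounds=3)
print(len(data_1), len(data_3))  # 10 and 30 episodes respectively
```

Because the toy collector ignores the policy, both settings here sample from the same distribution. The referee's concern arises when the collector is a human reacting to the current arbitration, since later rounds then cover states the first round never visits.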

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained.

The paper's central procedure trains an RNN arbitration function on trajectories collected while users interact with a shared-control system. The abstract explicitly frames this as an extension of prior differentiable-policy work and presents preliminary results comparing the learned arbitration to a handcrafted baseline. No equation, definition, or training step is shown to reduce by construction to its own output (e.g., no parameter is fitted on a subset and then renamed a prediction of a closely related quantity). The self-citation is acknowledged but is not load-bearing for the new hindsight-aggregation claim. Because the provided text supplies no explicit reduction of the form Eq. X = Eq. Y or fitted-input-called-prediction, the derivation is treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on pre-trained intent and motion modules plus the assumption that aggregated user interaction data under the shared policy is representative; no new physical entities are postulated.

free parameters (1)

RNN weights and hyperparameters
The arbitration network parameters are fitted to the collected user data; exact architecture and regularization choices are unspecified in the abstract.

axioms (1)

Domain assumption: Pre-trained motion generation and intent prediction modules exist and remain fixed during arbitration training.
Abstract states results use 'pre-trained motion generation and intent prediction'.

