Multimodal Uncertainty Reduction for Intention Recognition in Human-Robot Interaction

Constantin Rothkopf; Dorothea Koert; Jan Peters; Susanne Trick

arXiv: 1907.02426 · v1 · pith:TN4YFIGCnew · submitted 2019-07-04 · 💻 cs.RO · cs.HC · cs.LG · stat.ML

Susanne Trick, Dorothea Koert, Jan Peters, Constantin Rothkopf

Pith reviewed 2026-05-25 09:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.HC · cs.LG · stat.ML

keywords intention recognition · multimodal fusion · human-robot interaction · uncertainty reduction · classifier fusion · Independent Opinion Pool · assistive robotics

The pith

Fusing probability distributions from speech, gesture, gaze and object classifiers via Independent Opinion Pool reduces uncertainty in robot intention recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal fusion of intention classifiers can decrease uncertainty about a person's goals in human-robot interactions. Separate classifiers are built for speech, gestures, gaze directions, and scene objects, each producing probability distributions over possible intentions. These distributions are merged using the Bayesian Independent Opinion Pool method. Evaluation in a collaborative task with a 7-DoF robot arm demonstrates that the fused system improves accuracy, robustness to modality failure, and uncertainty reduction compared to any single classifier. This matters for safe and intuitive robot assistance, particularly with elderly users.

Core claim

By combining the probability distributions output by individual classifiers for speech, gestures, gaze directions, and scene objects using the Bayesian method Independent Opinion Pool, the uncertainty about the intention to be recognized can be decreased. The fused classifiers outperform the respective individual base classifiers with respect to increased accuracy, robustness, and reduced uncertainty in a collaborative human-robot interaction task.

What carries the argument

Independent Opinion Pool, the Bayesian method used to combine probability distributions from multiple modality-specific classifiers into a single distribution with lower uncertainty.
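
As a rough illustration of this fusion rule, the sketch below multiplies the per-modality distributions element-wise and renormalizes. It assumes a uniform prior over intentions, in which case Independent Opinion Pool reduces to a normalized product. The function name `independent_opinion_pool` and the example numbers are hypothetical and not taken from the paper.

```python
from math import prod


def independent_opinion_pool(distributions):
    """Fuse categorical distributions by a normalized element-wise product.

    Assumes a uniform prior over intentions and conditionally independent
    modalities; both are assumptions of this sketch.
    """
    n_classes = len(distributions[0])
    if any(len(d) != n_classes for d in distributions):
        raise ValueError("all distributions must cover the same intentions")
    # Multiply the probabilities each classifier assigns to intention k.
    unnormalized = [prod(d[k] for d in distributions) for k in range(n_classes)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("every intention received zero fused probability")
    return [u / total for u in unnormalized]


# Hypothetical outputs of two base classifiers over three intentions.
speech = [0.6, 0.3, 0.1]
gesture = [0.5, 0.4, 0.1]
fused = independent_opinion_pool([speech, gesture])
```

With these made-up inputs, both classifiers favor the first intention, and the fused distribution puts more mass on it than either input does.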

If this is right

Accuracy of intention recognition increases when multiple modalities are fused.
Robustness against failure of individual modalities improves.
Uncertainty about the predicted intention is reduced.
The approach is validated in a task involving a 7-DoF robot arm.
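
The uncertainty claim in this list can be made concrete with Shannon entropy. The sketch below is illustrative only: the helper names and all numbers are hypothetical, and a uniform prior is assumed for the fusion step. It contrasts a pair of classifier outputs that agree with a pair that contradict each other.

```python
from math import log, prod


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * log(x) for x in p if x > 0.0)


def fuse(distributions):
    """Normalized element-wise product (uniform-prior assumption)."""
    raw = [prod(column) for column in zip(*distributions)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical classifier outputs that agree on the most likely intention.
agree_a, agree_b = [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]
fused_agree = fuse([agree_a, agree_b])

# Hypothetical outputs that contradict each other.
conflict_a, conflict_b = [0.8, 0.2], [0.2, 0.8]
fused_conflict = fuse([conflict_a, conflict_b])

agree_reduces_entropy = entropy(fused_agree) < min(entropy(agree_a), entropy(agree_b))
conflict_raises_entropy = entropy(fused_conflict) > entropy(conflict_a)
```

In this toy case the agreeing pair fuses to a lower-entropy distribution than either input, while the contradicting pair fuses to a uniform distribution with higher entropy than either input. A product-based fusion therefore sharpens the estimate when modalities agree and flattens it when they conflict.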

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Robots assisting elderly people could achieve safer interactions by maintaining awareness of their own uncertainty levels.
Similar fusion techniques might apply to other recognition tasks where multiple human signals are available.
If the assumption of conditional independence between modalities does not hold in new settings, the uncertainty reduction could be an artifact rather than a real gain.
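
The concern in the last point can be illustrated with an extreme toy case: the same classifier output is fused with itself, as if it had come from two independent modalities. The sketch below uses hypothetical numbers and helper names and is not an experiment from the paper.

```python
from math import log


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * log(x) for x in p if x > 0.0)


def fuse_pair(p, q):
    """Normalized element-wise product of two distributions."""
    raw = [a * b for a, b in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw]


# One hypothetical classifier output, counted twice as if it were two
# independent modalities (fully dependent evidence).
single = [0.6, 0.4]
double_counted = fuse_pair(single, single)

entropy_single = entropy(single)
entropy_double = entropy(double_counted)
```

Here `double_counted` is sharper than `single` and has lower entropy, although no new information was added. This is the kind of artificial uncertainty reduction that dependent modalities could produce.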

Load-bearing premise

The four modalities supply conditionally independent information about the human's intention.
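
For reference, the standard Bayes derivation behind this premise can be written as follows. The notation is this note's own, not necessarily the paper's: $y$ denotes the intention and $x_1,\dots,x_n$ the observations of the $n$ modalities. The conditional independence assumption enters in the second line.

```latex
\begin{align}
P(y \mid x_1,\dots,x_n)
  &\propto P(y)\, P(x_1,\dots,x_n \mid y) \\
  &= P(y) \prod_{i=1}^{n} P(x_i \mid y)
     && \text{(conditional independence given } y\text{)} \\
  &\propto P(y) \prod_{i=1}^{n} \frac{P(y \mid x_i)}{P(y)}
   = \frac{\prod_{i=1}^{n} P(y \mid x_i)}{P(y)^{\,n-1}}
\end{align}
```

The last step uses $P(x_i \mid y) = P(y \mid x_i)\,P(x_i)/P(y)$ and drops the factors $P(x_i)$, which do not depend on $y$. With a uniform prior $P(y)$, the rule collapses to a normalized product of the individual classifier outputs.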

What would settle it

An experiment where the modalities are shown to be dependent such that fusing them fails to reduce uncertainty or even increases it compared to the best single modality.

Figures

Figures reproduced from arXiv: 1907.02426 by Constantin Rothkopf, Dorothea Koert, Jan Peters, Susanne Trick.

**Figure 2.** (a) When used for classifier fusion, Independent Opinion Pool …

**Figure 3.** The kitchen scenario the fusion system was evaluated in. The human …

**Figure 4.** Comparison of all possible combinations of base classifiers …

**Figure 5.** Entropies of all distributions resulting from classifier combinations including the speech (top) or gesture (bottom) classifier over all 90 test examples …

**Figure 6.** The human shows the intention roll by uttering a command containing the word "roll", reaching out its arm for the roll, fixating the roll and …

Original abstract

Assistive robots can potentially improve the quality of life and personal independence of elderly people by supporting everyday life activities. To guarantee a safe and intuitive interaction between human and robot, human intentions need to be recognized automatically. As humans communicate their intentions multimodally, the use of multiple modalities for intention recognition may not just increase the robustness against failure of individual modalities but especially reduce the uncertainty about the intention to be predicted. This is desirable as particularly in direct interaction between robots and potentially vulnerable humans a minimal uncertainty about the situation as well as knowledge about this actual uncertainty is necessary. Thus, in contrast to existing methods, in this work a new approach for multimodal intention recognition is introduced that focuses on uncertainty reduction through classifier fusion. For the four considered modalities speech, gestures, gaze directions and scene objects individual intention classifiers are trained, all of which output a probability distribution over all possible intentions. By combining these output distributions using the Bayesian method Independent Opinion Pool the uncertainty about the intention to be recognized can be decreased. The approach is evaluated in a collaborative human-robot interaction task with a 7-DoF robot arm. The results show that fused classifiers which combine multiple modalities outperform the respective individual base classifiers with respect to increased accuracy, robustness, and reduced uncertainty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies Independent Opinion Pool to fuse four HRI modalities for lower uncertainty but leaves the required conditional independence untested and reports no numbers.

The main takeaway is that training separate classifiers on speech, gestures, gaze, and objects then fusing them with Independent Opinion Pool produces lower uncertainty and higher accuracy than any single modality in their collaborative task. This is a direct, domain-specific use of an established fusion rule rather than a new algorithm. The work does a solid job explaining why uncertainty matters for safe assistive robotics and sets up a realistic 7-DoF arm experiment. That part is useful for anyone already working on multimodal intention recognition.

The soft spots are straightforward. The fusion rule is valid only under conditional independence of the modalities given the intention, yet the paper supplies no diagnostic for that assumption even though gaze, gesture, and speech are known to correlate in natural interaction. Without a check such as residual correlation or comparison to a joint model, any measured drop in entropy could be an artifact of the multiplicative update. The abstract also claims gains in accuracy, robustness, and uncertainty reduction but gives no dataset sizes, cross-validation details, or quantitative values, so the strength of the evidence stays unclear from what is shown.

This paper is for researchers building practical intention-aware controllers in human-robot collaboration who already know the fusion literature and want a worked example with these four modalities. It is not reshaping the broader field.

I would send it to peer review because the application is relevant and the method is clearly stated, though the independence check and the missing performance numbers would need to be addressed in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a multimodal intention recognition approach for human-robot interaction. Separate classifiers are trained on four modalities (speech, gestures, gaze directions, scene objects), each outputting a probability distribution over intentions. These are fused via the Independent Opinion Pool (IOP) method, with the central claim that fusion reduces uncertainty about the intention while improving accuracy and robustness. Evaluation occurs in a collaborative task with a 7-DoF robot arm, asserting that multimodal fused classifiers outperform single-modality baselines.

Significance. If the independence assumption holds and quantitative gains are reproducible, the work offers a practical Bayesian fusion technique focused on uncertainty quantification, which is relevant for safe HRI with elderly or vulnerable users. The explicit treatment of uncertainty as a performance criterion is a positive aspect. However, the absence of diagnostics for the core modeling assumption weakens the evidential support for the uncertainty-reduction claim.

major comments (2)

[Method section (IOP fusion)] The IOP fusion step (described in the method section) is a normalized product of the modality likelihoods and produces a valid posterior only under conditional independence of the four modalities given the intention. No diagnostic is reported (pairwise conditional mutual information, residual correlation after conditioning on labels, or comparison to a joint model) to verify this assumption holds in the recorded HRI data. This is load-bearing for the central claim because speech, gesture, and gaze are known to be correlated in natural interaction; any reported entropy drop or accuracy gain could be an artifact of the multiplicative rule rather than genuine multimodal information.
[Results section] The results section claims that fused classifiers outperform individual base classifiers on accuracy, robustness, and reduced uncertainty, yet the abstract and summary supply no quantitative values, dataset sizes, number of trials, cross-validation procedure, or error bars. Without these, the magnitude and statistical reliability of the reported gains cannot be assessed.
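
One of the diagnostics named in the first comment, residual correlation after conditioning on labels, could be sketched roughly as below. This is a hypothetical check rather than anything reported in the paper: the helper names, the "cup" label, and all scores are invented for illustration. Pearson correlation only captures linear dependence, so conditional mutual information would be a more general test.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def within_label_correlation(labels, scores_a, scores_b):
    """Correlate two modalities' scores separately for each true intention."""
    groups = defaultdict(lambda: ([], []))
    for label, a, b in zip(labels, scores_a, scores_b):
        groups[label][0].append(a)
        groups[label][1].append(b)
    return {label: pearson(a, b) for label, (a, b) in groups.items()}


# Hypothetical data: probability each classifier assigns to the true intention.
labels = ["roll", "roll", "roll", "roll", "cup", "cup", "cup", "cup"]
gaze = [0.9, 0.7, 0.8, 0.6, 0.5, 0.7, 0.6, 0.8]
gesture = [0.8, 0.6, 0.7, 0.5, 0.9, 0.6, 0.8, 0.7]
residual = within_label_correlation(labels, gaze, gesture)
```

Within-label correlations far from zero, as in this invented data, would suggest that the two modalities remain dependent even after the intention is fixed, which is the situation the referee is worried about.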

minor comments (1)

[Method section] Notation for the IOP formula should be made explicit (e.g., clarify whether the product is taken before or after normalization and how zero-probability outputs are handled).
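
To make the zero-probability question concrete, here is a minimal sketch of one possible convention: floor every classifier output at a small `epsilon`, sum the logs, and normalize once at the end. The floor value and the function name are assumptions of this sketch; the paper's actual handling is not stated in the material above.

```python
from math import exp, log


def fuse_log_space(distributions, epsilon=1e-12):
    """Normalized product computed via summed logs.

    `epsilon` is a hypothetical floor that keeps a single zero output from
    vetoing an intention outright.
    """
    n_classes = len(distributions[0])
    log_scores = [
        sum(log(max(d[k], epsilon)) for d in distributions)
        for k in range(n_classes)
    ]
    shift = max(log_scores)  # subtract the max for numerical stability
    weights = [exp(score - shift) for score in log_scores]
    total = sum(weights)  # normalization happens once, after the product
    return [w / total for w in weights]


# Hypothetical case: the speech classifier outputs a hard zero for intention 0.
speech = [0.0, 0.7, 0.3]
gaze = [0.9, 0.05, 0.05]
fused = fuse_log_space([speech, gaze])
```

Without the floor, intention 0 would receive exactly zero fused probability no matter how confident the gaze classifier is. With the floor it stays negligible but nonzero, which is a design choice a paper would need to state explicitly.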

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our multimodal fusion approach. We address each major comment below.

Referee: [Method section (IOP fusion)] The IOP fusion step (described in the method section) is a normalized product of the modality likelihoods and produces a valid posterior only under conditional independence of the four modalities given the intention. No diagnostic is reported (pairwise conditional mutual information, residual correlation after conditioning on labels, or comparison to a joint model) to verify this assumption holds in the recorded HRI data. This is load-bearing for the central claim because speech, gesture, and gaze are known to be correlated in natural interaction; any reported entropy drop or accuracy gain could be an artifact of the multiplicative rule rather than genuine multimodal information.

Authors: We agree that IOP relies on the conditional independence assumption and that the manuscript does not report explicit diagnostics such as conditional mutual information. In the revised version we will add a paragraph in the methods section discussing the assumption, including a brief empirical check (e.g., residual pairwise correlations after conditioning on intention labels) computed from the existing dataset. We note that the modalities were recorded in a controlled collaborative task where conditioning on the discrete intention label substantially reduces observed dependence; nevertheless, we will make this limitation explicit rather than claiming the assumption is strictly verified. revision: partial
Referee: [Results section] The results section claims that fused classifiers outperform individual base classifiers on accuracy, robustness, and reduced uncertainty, yet the abstract and summary supply no quantitative values, dataset sizes, number of trials, cross-validation procedure, or error bars. Without these, the magnitude and statistical reliability of the reported gains cannot be assessed.

Authors: The full evaluation section reports the dataset (number of participants and trials), 5-fold cross-validation procedure, and quantitative metrics with standard deviations. We acknowledge that the abstract omits these specifics. In the revision we will expand the abstract to include the key numerical results (accuracy improvement, entropy reduction, and trial count) together with the cross-validation details. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical fusion results are measured, not derived by construction

The paper trains four separate modality classifiers, each producing an output distribution, then fuses them via the standard Independent Opinion Pool formula. Reported gains in accuracy, robustness, and uncertainty reduction are obtained from direct evaluation on recorded HRI interaction data; no equation or self-citation reduces these measured quantities back to the fitted classifier parameters or to the fusion rule itself. The conditional-independence modeling assumption is external and does not create a self-definitional loop inside the reported results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on (1) the independence assumption required by Independent Opinion Pool and (2) the empirical performance of four separately trained modality classifiers whose parameters are fitted to task data; no new entities are postulated.

free parameters (1)

modality classifier parameters
Each of the four base classifiers is trained on data; their internal parameters are fitted values that determine the input probability distributions to the fusion step.

axioms (1)

domain assumption: The four modalities are conditionally independent given the true intention
Independent Opinion Pool requires this independence to guarantee that the fused distribution correctly reduces uncertainty; the assumption is invoked when the method is introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "By combining these output distributions using the Bayesian method Independent Opinion Pool the uncertainty about the intention to be recognized can be decreased."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "The approach is evaluated in a collaborative human-robot interaction task with a 7-DoF robot arm."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

J. O. Berger, Statistical Decision Theory and Bayesian Analysis . London: Springer, 1985

work page 1985

[2] [2]

IW-Report 33/18: Die Entwicklung der Pﬂege- fallzahlen in den Bundeslaendern. Eine Simulation bis 2035

S. Kochskaemper, “IW-Report 33/18: Die Entwicklung der Pﬂege- fallzahlen in den Bundeslaendern. Eine Simulation bis 2035.” IW - Wirtschaftliche Untersuchungen, Berichte und Sachverhalte , 2018

work page 2035

[3] [3]

Tractable proba- bilistic models for intention recognition based on expert knowledge,

O. C. Schrempf, D. Albrecht, and U. D. Hanebeck, “Tractable proba- bilistic models for intention recognition based on expert knowledge,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2007, pp. 1429–1434

work page 2007

[4] [4]

Humans integrate visual and haptic information in a statistically optimal fashion,

M. O. Ernst and M. Banks, “Humans integrate visual and haptic information in a statistically optimal fashion,” Nature, vol. 415, no. 6870, pp. 429–433, 2002

work page 2002

[5] [5]

Objects as attributes for scene classiﬁcation,

L.-J. Li, H. Su, Y . Lim, and L. Fei-Fei, “Objects as attributes for scene classiﬁcation,” in European Conference on Computer Vision . Springer, 2010, pp. 57–69

work page 2010

[6] [6]

Data fusion and multiple classiﬁer systems for human activity detection and health monitoring: Review and open research directions,

H. F. Nweke, Y . W. Teh, G. Mujtaba, and M. A. Al-garadi, “Data fusion and multiple classiﬁer systems for human activity detection and health monitoring: Review and open research directions,” Information Fusion, vol. 46, pp. 147–170, 2019

work page 2019

[7] [7]

Multimodal human action recognition in assistive human-robot interaction,

I. Rodomagoulakis, N. Kardaris, V . Pitsikalis, E. Mavroudi, A. Kat- samanis, A. Tsiami, and P. Maragos, “Multimodal human action recognition in assistive human-robot interaction,” in 2016 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2702–2706

work page 2016

[8] [8]

A multi-modal perception based assistive robotic system for the elderly,

C. Mollaret, A. Mekonnen, F. Lerasle, I. Ferran ´e, J. Pinquier, B. Boudet, and P. Rumeau, “A multi-modal perception based assistive robotic system for the elderly,” Computer Vision and Image Under- standing, vol. 149, pp. 78–97, 2016

work page 2016

[9] [9]

Natural human-robot interaction using speech, head pose and gestures,

R. Stiefelhagen, C. Fugen, R. Gieselmann, H. Holzapfel, K. Nickel, and A. Waibel, “Natural human-robot interaction using speech, head pose and gestures,” in 2004 IEEE International Conference on Intel- ligent Robots and Systems , vol. 3, Sept 2004, pp. 2422–2427

work page 2004

[10] [10]

Starting engagement detec- tion towards a companion robot using multimodal features,

D. Vaufreydaz, W. Johal, and C. Combe, “Starting engagement detec- tion towards a companion robot using multimodal features,” Robot. Auton. Syst., vol. 75, no. PA, pp. 4–16, Jan. 2016

work page 2016

[11] [11]

Multimodal signal processing and learning aspects of human-robot interaction for an assistive bathing robot,

A. Zlatintsi, I. Rodomagoulakis, P. Koutras, A. C. Dometios, V . Pit- sikalis, C. S. Tzafestas, and P. Maragos, “Multimodal signal processing and learning aspects of human-robot interaction for an assistive bathing robot,” IEEE Int. Conference on Acoustics, Speech and Signal Processing, pp. 3171–3175, 2018

work page 2018

[12] [12]

Human intention un- derstanding based on object affordance and action classiﬁcation,

Z. Yu, S. Kim, R. Mallipeddi, and M. Lee, “Human intention un- derstanding based on object affordance and action classiﬁcation,” in 2015 International Joint Conference on Neural Networks (IJCNN) , July 2015, pp. 1–6

work page 2015

[13] A. Bannat, J. Gast, T. Rehrl, W. Rösel, G. Rigoll, and F. Wallhoff, “A multimodal human-robot-interaction scenario: Working together with an industrial robot,” in Human-Computer Interaction. Novel Interaction Methods and Techniques, J. A. Jacko, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 303–311.

[14] V. Dutta and T. Zielinska, “Predicting the intention of human activities for real-time human-robot interaction (hri),” in International Conference on Social Robotics. Springer, 2016, pp. 723–734.

[15] V. Dutta and T. Zielinska, “Predicting human actions taking into account object affordances,” Journal of Intelligent & Robotic Systems, pp. 1–17, 2018.

[16] R. Kelley, K. Browne, L. Wigand, M. Nicolescu, B. Hamilton, and M. Nicolescu, “Deep networks for predicting human intent with respect to objects,” in 2012 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2012, pp. 171–172.

[17] R. Kelley, A. Tavakkoli, C. King, A. Ambardekar, M. Nicolescu, and M. Nicolescu, “Context-based bayesian intent recognition,” IEEE Transactions on Autonomous Mental Development, vol. 4, no. 3, pp. 215–225, Sept 2012.

[18] W. Xu, J. Huang, and Q. Yan, “Multi-sensor based human motion intention recognition algorithm for walking-aid robot,” in 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dec 2015, pp. 2041–2046.

[19] D. Kulic and E. A. Croft, “Estimating intent for human-robot interaction,” in IEEE Int. Conference on Advanced Robotics, 2003, pp. 810–815.

[20] M. Gurban and J. P. Thiran, “Using entropy as a stream reliability estimate for audio-visual speech recognition,” in 2008 16th European Signal Processing Conference, Aug 2008, pp. 1–5.

[21] H. Liu, T. Fan, and P. Wu, “Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), May 2014, pp. 6644–6651.

[22] T. R. Andriamahefa, “Integer occupancy grids: a probabilistic multi-sensor fusion framework for embedded perception,” Ph.D. dissertation, Université Grenoble Alpes, 2017.

[23] D. Elsaesser, “Sensor data fusion using a probability density grid,” in 2007 10th International Conference on Information Fusion, 2007, pp. 1–8.

[24] L. Shi, S. Kodagoda, and G. Dissanayake, “Multi-class classification for semantic labeling of places,” in 2010 11th International Conference on Control Automation Robotics & Vision, 2010, pp. 2307–2312.

[25] P. Stepan, M. Kulich, and L. Preucil, “Robust data fusion with occupancy grid,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 35, no. 1, pp. 106–115, 2005.

[26] X. Liu, S. S. Ge, R. Jiang, and C. H. Goh, “Intelligent speech control system for human-robot interaction,” in 2016 35th Chinese Control Conference (CCC), July 2016, pp. 6154–6159.

[27] R. Tang and J. Lin, “Honk: A PyTorch reimplementation of convolutional neural networks for keyword spotting,” arXiv preprint arXiv:1710.06554, 2017.

[28] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2015.

[29] G. Canal, C. Angulo, and S. Escalera, “Gesture based human multi-robot interaction,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–8.

[30] M. Ewerton, G. Neumann, R. Lioutikov, H. B. Amor, J. Peters, and G. Maeda, “Learning multiple collaborative tasks with a mixture of interaction primitives,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 1535–1542.

[31] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems, 2013, pp. 2616–2624.

[32] H. Admoni and S. Srinivasa, “Predicting user intent through eye gaze for shared autonomy,” in Proceedings of the AAAI Fall Symposium Series: Shared Autonomy in Research and Practice (AAAI Fall Symposium). AAAI Press Toronto, ON, 2016, pp. 298–303.

[33] C.-M. Huang, S. Andrist, A. Sauppé, and B. Mutlu, “Using gaze patterns to predict task intent in collaboration,” Frontiers in Psychology, vol. 6, 2015.

[34] N. Mennie, M. Hayhoe, and B. Sullivan, “Look-ahead fixations: anticipatory eye movements in natural tasks,” Experimental Brain Research, vol. 179, no. 3, pp. 427–442, 2007.

[35] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe, “Task and context determine where you look,” Journal of Vision, vol. 7, no. 14, pp. 16–20, 2007.

[36] L. Qian, “Pupil ros plugin: Connecting pupil eye-tracking platform and robot operation system (ros) platform,” 2016, software available at https://github.com/qian256/pupil_ros_plugin

[37] P. Bach, T. Nicholson, and M. Hudson, “The affordance-matching hypothesis: how objects guide action understanding and prediction,” Frontiers in Human Neuroscience, vol. 8, no. 254, 2014.

[38] X. Luo and J. Xu, “Object-based representation for scene classification,” in Proceedings of the 29th Canadian Conference on Artificial Intelligence on Advances in Artificial Intelligence - Volume 9673. New York, NY, USA: Springer-Verlag New York, Inc., 2016, pp. 102–108.

[39] G. Potamianos and C. Neti, “Stream confidence estimation for audio-visual speech recognition,” in 6th International Conference on Spoken Language Processing, 2000.