Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

Fatemeh Ziaeetabar

arxiv: 2606.08121 · v1 · pith:TBRFPIHPnew · submitted 2026-06-06 · 💻 cs.CV

Pith reviewed 2026-06-27 20:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual predicates · manipulation understanding · robustness under degradation · predicate reliability · egocentric vision · action recognition · neuro-symbolic models · confidence-aware estimation

The pith

Visual predicates fail in structured ways under image degradation rather than uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure how reliably different visual predicates can be recovered from images when those images suffer blur, occlusion, illumination change, low resolution, frame drops, or detection noise. It defines a vocabulary of predicates used in manipulation understanding, then tracks how each one holds up or collapses using reliability metrics that cover preservation rate, sensitivity to each degradation type, temporal consistency, confidence-weighted stability, and effect on downstream task accuracy. The central finding is that failures are not random: static spatial predicates stay relatively stable, while contact-sensitive, dynamic, and derived predicates such as grasp and release are the most fragile. This matters because these predicates are the relational building blocks inside event-chain and neuro-symbolic models, so knowing which ones are trustworthy under real conditions lets those models be made more robust.

Core claim

Experiments on controlled videos and on VISOR/EPIC-KITCHENS, H2O, and ARCTIC show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64.

What carries the argument

A predicate-level reliability framework that supplies a structured predicate vocabulary, confidence-aware estimation, and five metrics (preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, downstream impact) to diagnose which predicates survive which degradations.
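The abstract names the metrics but does not give their formulas. The following is a minimal sketch of how four of them might be computed for a single predicate over a frame sequence. Every function name and definition here is an assumption made for illustration, not the paper's own formulation.

```python
from typing import Sequence


def preservation_rate(clean: Sequence[bool], degraded: Sequence[bool]) -> float:
    """Fraction of frames where the degraded predicate value matches the clean one."""
    if len(clean) != len(degraded) or not clean:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(c == d for c, d in zip(clean, degraded)) / len(clean)


def degradation_sensitivity(clean: Sequence[bool], degraded: Sequence[bool]) -> float:
    """Assumed definition: share of frames whose value changes under degradation."""
    return 1.0 - preservation_rate(clean, degraded)


def temporal_consistency(values: Sequence[bool]) -> float:
    """Fraction of consecutive frame pairs in which the predicate does not flip."""
    if len(values) < 2:
        return 1.0
    return sum(a == b for a, b in zip(values, values[1:])) / (len(values) - 1)


def confidence_weighted_stability(
    clean: Sequence[bool], degraded: Sequence[bool], confidence: Sequence[float]
) -> float:
    """Preservation in which each frame counts in proportion to its confidence."""
    total = sum(confidence)
    if total == 0:
        return 0.0
    kept = sum(w for c, d, w in zip(clean, degraded, confidence) if c == d)
    return kept / total


# Toy contact predicate over six frames (made-up values, not paper data).
clean = [True, True, True, False, False, False]
degraded = [True, False, True, False, False, True]
confidence = [0.9, 0.2, 0.8, 0.9, 0.7, 0.3]

print(preservation_rate(clean, degraded))  # 4 of 6 frames agree
print(temporal_consistency(clean), temporal_consistency(degraded))
print(confidence_weighted_stability(clean, degraded, confidence))
```

In this toy case the two disagreeing frames carry low confidence, so the confidence-weighted score comes out higher than the plain preservation rate. That is the kind of effect a confidence-aware estimator would rely on.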

If this is right

Static spatial predicates can be used with higher trust in degraded conditions for downstream reasoning.
Contact-sensitive and dynamic predicates require additional safeguards or alternative evidence sources.
Confidence weighting in predicate estimation measurably improves downstream accuracy under moderate degradation.
Detection noise, occlusion, and frame dropping are the degradations that produce the largest reliability losses.
Manipulation-understanding accuracy falls by roughly one-third (from 0.89 to 0.58) when the pipeline is fed degraded predicates.
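The two accuracy figures quoted from the abstract can be turned into absolute and relative drops with a few lines of arithmetic. The helper name `accuracy_drop` is hypothetical; the four input numbers are the ones reported in the abstract.

```python
from typing import Tuple


def accuracy_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop, drop relative to the starting accuracy)."""
    absolute = before - after
    return absolute, absolute / before


# Clean predicates versus degraded predicates.
abs_deg, rel_deg = accuracy_drop(0.89, 0.58)
# With versus without confidence weighting, under moderate degradation.
abs_cw, rel_cw = accuracy_drop(0.74, 0.64)

print(f"{abs_deg:.2f} absolute, {rel_deg:.1%} relative")  # 0.31 absolute, 34.8% relative
print(f"{abs_cw:.2f} absolute, {rel_cw:.1%} relative")    # 0.10 absolute, 13.5% relative
```

The first line is where the "roughly one-third" figure comes from. The second shows that dropping confidence weighting costs about 13.5% of the accuracy achieved with it.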

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Perception modules could monitor image quality in real time and down-weight or replace fragile predicates accordingly.
The same reliability metrics could be applied to action-recognition pipelines that also rely on contact and grasp predicates.
Future datasets collected from physical robots under uncontrolled lighting and motion could be used to validate or refine the synthetic-degradation results.
Designers of neuro-symbolic systems might add explicit uncertainty propagation from predicate confidence scores into higher-level planning.
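One way to picture the first and last suggestions is a gate that discounts each predicate's confidence by an estimated degradation severity and a per-family fragility prior before passing it to the reasoner. This is a sketch under assumptions: the `FRAGILITY` values are placeholders that only follow the paper's qualitative ordering, and the class, function names, and the linear discount rule are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List

# Placeholder fragility priors in [0, 1]. The ordering follows the paper's
# qualitative finding; the numbers themselves are NOT reported by the paper.
FRAGILITY = {"static_spatial": 0.2, "contact": 0.6, "dynamic": 0.7, "derived": 0.8}


@dataclass
class PredicateEstimate:
    name: str          # e.g. "above(cup, table)"
    family: str        # key into FRAGILITY
    value: bool
    confidence: float  # raw detector confidence in [0, 1]


def adjusted_confidence(est: PredicateEstimate, severity: float) -> float:
    """Discount confidence more strongly for fragile families as severity grows."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must lie in [0, 1]")
    return est.confidence * (1.0 - FRAGILITY[est.family] * severity)


def trusted(
    estimates: List[PredicateEstimate], severity: float, threshold: float = 0.5
) -> List[PredicateEstimate]:
    """Keep only predicates whose discounted confidence clears the threshold."""
    return [e for e in estimates if adjusted_confidence(e, severity) >= threshold]


estimates = [
    PredicateEstimate("above(cup, table)", "static_spatial", True, 0.9),
    PredicateEstimate("grasp(hand, cup)", "derived", True, 0.9),
]
print([e.name for e in trusted(estimates, severity=0.0)])
print([e.name for e in trusted(estimates, severity=0.8)])
```

With no degradation both predicates pass. At severity 0.8 the static predicate keeps a discounted confidence of 0.756 while the grasp predicate falls to 0.324 and is filtered out. A real system would need to estimate severity from the image stream and calibrate the priors on data.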

Load-bearing premise

The chosen public datasets together with the applied synthetic degradations are representative of the visual failures that occur in real deployed manipulation systems.

What would settle it

A controlled test on real-world robot videos containing naturally occurring blur, occlusion, and frame drops in which all predicate types show statistically indistinguishable failure rates would falsify the structured-failure claim.
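A test of this kind could be run as a permutation test on per-frame failure outcomes grouped by predicate family. The sketch below is one possible setup, not the paper's procedure: the function names, the spread statistic (largest minus smallest failure rate), and the toy counts are all assumptions. Each group is assumed to be non-empty.

```python
import random
from typing import Dict, List


def rate_spread(groups: Dict[str, List[int]]) -> float:
    """Largest minus smallest failure rate across predicate families."""
    rates = [sum(g) / len(g) for g in groups.values()]
    return max(rates) - min(rates)


def permutation_p_value(
    groups: Dict[str, List[int]], n_permutations: int = 2000, seed: int = 0
) -> float:
    """P-value for the null hypothesis that all families share one failure rate."""
    rng = random.Random(seed)
    observed = rate_spread(groups)
    names = list(groups)
    sizes = [len(groups[n]) for n in names]
    pooled = [x for n in names for x in groups[n]]
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between family and outcome
        shuffled, start = {}, 0
        for name, size in zip(names, sizes):
            shuffled[name] = pooled[start:start + size]
            start += size
        if rate_spread(shuffled) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Made-up outcomes (1 = predicate failed on that frame), not paper data.
structured = {"static_spatial": [1] * 5 + [0] * 95, "grasp": [1] * 40 + [0] * 60}
uniform = {"static_spatial": [1] * 20 + [0] * 80, "grasp": [1] * 20 + [0] * 80}

print(permutation_p_value(structured))
print(permutation_p_value(uniform))
```

A small p-value on real degraded footage would support the structured-failure claim. A large one would be consistent with uniform failure, although failing to reject the null is weaker evidence than a dedicated equivalence test.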

Original abstract

Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.
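Two of the degradations listed in the abstract, frame dropping and detection noise, are simple to simulate on detector output. The sketch below is a guess at what such perturbations could look like; the paper's actual degradation models are not described in the abstract. The function names, the Gaussian noise model, and the box-overlap contact proxy are all assumptions.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def drop_frames(
    frames: List[object], drop_prob: float, rng: random.Random
) -> List[Optional[object]]:
    """Replace each frame with None independently with probability drop_prob."""
    return [None if rng.random() < drop_prob else f for f in frames]


def jitter_box(box: Box, sigma: float, rng: random.Random) -> Box:
    """Add Gaussian noise to every coordinate, then restore corner ordering."""
    x0, y0, x1, y1 = (c + rng.gauss(0.0, sigma) for c in box)
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def boxes_touch(a: Box, b: Box, margin: float = 0.0) -> bool:
    """Crude contact proxy: boxes overlap or lie within `margin` pixels."""
    return (
        a[0] - margin <= b[2] and b[0] - margin <= a[2]
        and a[1] - margin <= b[3] and b[1] - margin <= a[3]
    )


rng = random.Random(0)
hand: Box = (0.0, 0.0, 10.0, 10.0)
cup: Box = (10.0, 0.0, 20.0, 10.0)  # shares an edge with the hand box

# Count how often the contact proxy survives increasing detection noise.
for sigma in (0.0, 1.0, 3.0):
    kept = sum(
        boxes_touch(jitter_box(hand, sigma, rng), jitter_box(cup, sigma, rng))
        for _ in range(1000)
    )
    print(sigma, kept / 1000)
```

A contact decision that hinges on two boxes barely touching will flip once the coordinates are perturbed. That gives an intuition for why contact-sensitive predicates could be more fragile under detection noise than coarse spatial relations.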

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a predicate reliability layer with metrics that reveal structured failures under degradation, but the synthetic tests leave open whether those patterns hold in real camera data.

The core contribution is a set of metrics—preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact—applied to visual predicates like contact, grasp, and support. Experiments on VISOR/EPIC-KITCHENS, H2O, and ARCTIC with added blur, occlusion, noise, and frame drops show static spatial predicates stay more stable while dynamic and derived ones like grasp and release degrade faster, pulling manipulation accuracy from 0.89 down to 0.58. That structured pattern is the main new observation.

The work does a clean job of separating the predicate layer from the rest of the pipeline and showing that confidence weighting helps under moderate degradation. The downstream numbers give a concrete sense of why this matters for event-chain or neuro-symbolic models.

The soft spot is the reliance on synthetic degradations. The abstract does not describe any comparison to real degraded footage from deployed cameras, so the reported structure could partly reflect the chosen degradation model rather than intrinsic predicate properties. No error bars, dataset sizes, or statistical tests appear in the summary, which makes it hard to gauge how reliable the differences are. The implementation details for confidence-aware estimation are also thin in what is visible.
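To make the missing error bars concrete, here is a sketch of a percentile bootstrap interval on a downstream accuracy figure. The sample size of 100 clips is hypothetical, since the abstract gives no dataset sizes. The function name and defaults are likewise assumptions, and the input is assumed to be a non-empty list of 0/1 outcomes.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    correct: List[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds one 0/1 outcome per evaluated clip.
    """
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    # Resample clips with replacement and record the accuracy of each resample.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Hypothetical evaluation: 58 of 100 clips correct, matching the 0.58 figure.
outcomes = [1] * 58 + [0] * 42
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy {point:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

With only 100 clips the interval spans roughly ten points on either side of the estimate. Gaps such as 0.74 versus 0.64 cannot be judged without knowing the real evaluation set sizes.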

This is aimed at researchers building manipulation understanding systems that need to reason about when their predicates can be trusted. It is worth sending for peer review because the diagnostic framing addresses a real gap and the experiments are at least reproducible on public data, even if the real-world mapping needs more work.

Referee Report

3 major / 0 minor

Summary. The paper introduces a predicate-level reliability framework for manipulation understanding that defines a structured vocabulary of visual predicates (contact, support, grasp, release, etc.), confidence-aware estimation, and metrics including predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments apply synthetic degradations (blur, occlusion, illumination, low-res, frame drop, detection noise) to controlled videos and public datasets (VISOR/EPIC-KITCHENS, H2O, ARCTIC) and report that failures are structured rather than uniform: static spatial predicates are comparatively robust while contact-sensitive, dynamic, and derived predicates are fragile. Detection noise, occlusion, and frame dropping produce the largest reliability losses, with downstream manipulation-understanding accuracy dropping from 0.89 to 0.58 and removal of confidence weighting reducing accuracy from 0.74 to 0.64 under moderate degradation.

Significance. If the reported structure of predicate failures and the quantitative impact numbers hold after statistical validation and real-world testing, the framework would supply a practical diagnostic layer between low-level perception and structured reasoning models, allowing systems to weight or replace fragile predicates under known degradation conditions and thereby improve robustness in deployed manipulation pipelines.

major comments (3)

[Abstract] Abstract: the accuracy reductions (0.89 to 0.58 and 0.74 to 0.64) are stated without error bars, dataset sizes, number of trials, or any statistical significance tests, so it is impossible to determine whether the claimed distinction between robust static-spatial predicates and fragile contact/dynamic predicates is supported by the data.
[Abstract] Abstract: the central claim that confidence weighting improves downstream accuracy (0.74 with weighting versus 0.64 without, under moderate degradation) cannot be evaluated because the paper provides no description of how confidence scores are computed, how they are integrated into predicate estimation, or how the weighted and unweighted pipelines differ.
[Abstract] Abstract / Experiments: the observed predicate-failure structure rests on synthetic degradations applied to the listed public datasets; without any comparison to real degraded manipulation footage (e.g., actual camera motion blur coupled with hand occlusion), it remains possible that the reported robustness ordering is an artifact of the chosen degradation model rather than an intrinsic property of the predicates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.

Referee: [Abstract] Abstract: the accuracy reductions (0.89 to 0.58 and 0.74 to 0.64) are stated without error bars, dataset sizes, number of trials, or any statistical significance tests, so it is impossible to determine whether the claimed distinction between robust static-spatial predicates and fragile contact/dynamic predicates is supported by the data.

Authors: The abstract condenses the primary findings; the experimental results section reports the underlying dataset sizes (across VISOR/EPIC-KITCHENS, H2O, and ARCTIC), number of trials, error bars on all metrics, and statistical tests supporting the static-vs-dynamic distinction. To ensure the abstract is self-contained, we will revise it to reference the statistical validation and key dataset details. revision: yes
Referee: [Abstract] Abstract: the central claim that confidence weighting improves downstream accuracy (0.74 with weighting versus 0.64 without, under moderate degradation) cannot be evaluated because the paper provides no description of how confidence scores are computed, how they are integrated into predicate estimation, or how the weighted and unweighted pipelines differ.

Authors: We agree the abstract omits these implementation details. The methods section defines confidence scores from predicate detector outputs, their integration into the stability metric, and the weighted vs. unweighted ablation. We will revise the abstract to include a concise description of the confidence computation and pipeline difference so the claim can be evaluated directly from the abstract. revision: yes
Referee: [Abstract] Abstract / Experiments: the observed predicate-failure structure rests on synthetic degradations applied to the listed public datasets; without any comparison to real degraded manipulation footage (e.g., actual camera motion blur coupled with hand occlusion), it remains possible that the reported robustness ordering is an artifact of the chosen degradation model rather than an intrinsic property of the predicates.

Authors: Synthetic degradations enable controlled isolation of individual factors on real manipulation videos from the public datasets. We acknowledge that real-world degradations may include unmodeled correlations. In revision we will expand the discussion and limitations sections to explicitly note this possibility and state that the reported ordering requires future validation against real degraded footage. revision: partial
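The weighted-versus-unweighted difference raised in the second comment can be illustrated with a toy aggregation over a short window of per-frame predicate estimates. This is only one possible reading of "confidence weighting". The function names and the voting rule are assumptions, and nothing here is taken from the paper's methods section.

```python
from typing import Sequence


def majority_vote(values: Sequence[bool]) -> bool:
    """Unweighted baseline: the predicate holds if most frames say so."""
    return sum(values) * 2 > len(values)


def weighted_vote(values: Sequence[bool], confidences: Sequence[float]) -> bool:
    """Confidence-weighted variant: each frame votes with its confidence."""
    yes = sum(c for v, c in zip(values, confidences) if v)
    no = sum(c for v, c in zip(values, confidences) if not v)
    return yes > no


# Toy window: two confident 'contact' frames, three low-confidence misses.
values = [True, True, False, False, False]
confidences = [0.95, 0.90, 0.20, 0.30, 0.25]

print(majority_vote(values))               # False: 2 of 5 frames say contact
print(weighted_vote(values, confidences))  # True: 1.85 versus 0.75
```

When degradation yields many low-confidence misses, the weighted vote can recover the predicate where a plain majority cannot. This is one mechanism that could account for an accuracy gap between the two pipelines.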

Circularity Check

0 steps flagged

No circularity: framework definitions and empirical results are independent

The paper introduces a predicate reliability framework by defining a vocabulary, confidence-aware estimation, and five metrics (preservation, sensitivity, consistency, stability, impact), then reports empirical observations on public datasets under synthetic degradations. No equations, derivations, or fitted parameters are described that reduce predictions to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on experimental measurements rather than on any self-referential reduction, so the reported results do not follow from the framework's definitions alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework itself is the primary contribution but rests on unstated assumptions about predicate definitions and dataset representativeness.

pith-pipeline@v0.9.1-grok · 5768 in / 1068 out tokens · 20564 ms · 2026-06-27T20:00:16.930851+00:00 · methodology



[1] [1]

Robotics and Autonomous Systems57(5), 469–483 (2009) https://doi.org/10.1016/j.robot.2008.10.024

Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robotics and Autonomous Systems57(5), 469–483 (2009) https://doi.org/10.1016/j.robot.2008.10.024

work page doi:10.1016/j.robot.2008.10.024 2009

[2] [2]

Interna- tional Journal of Computer Vision130(1), 33–55 (2022) https://doi.org/10.1007/ s11263-021-01531-2

Damen, D., Doughty, H., Farinella, G.M., Furnari, A., Kazakos, E., Ma, J., Moltisanti, D., Munro, J., Perrett, T., Price, W., Wray, M.: Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. Interna- tional Journal of Computer Vision130(1), 33–55 (2022) https://doi.org/10.1007/ s11263-021-01531-2

2022

[3] [3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., Yao, A.: Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21096–21106 (2022)

2022

[4] [4]

In: Proceedings 47 of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B.,et al.: Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In: Proceedings 47 of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19383–19400 (2024)

2024

[5] [5]

The International Journal of Robotics Research30(10), 1229–1249 (2011) https://doi.org/10.1177/ 0278364911410459

Aksoy, E.E., Abramov, A., D¨ orr, J., Ning, K., Dellen, B., W¨ org¨ otter, F.: Learn- ing the semantics of object–action relations by observation. The International Journal of Robotics Research30(10), 1229–1249 (2011) https://doi.org/10.1177/ 0278364911410459

2011

[6] [6]

In: Proceedings of the IEEE International Conference on Robotics and Automation, pp

Ziaeetabar, F., Aksoy, E.E., W¨ org¨ otter, F., Tamosiunaite, M.: Semantic analy- sis of manipulation actions using spatial relations. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4612–4619 (2017). https://doi.org/10.1109/ICRA.2017.7989536

work page doi:10.1109/icra.2017.7989536 2017

[7] [7]

Tsagarakis, and Enrico Mingo Hoffman

Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M., W¨ org¨ otter, F.: Recognition and prediction of manipulation actions using enriched semantic event chains. Robotics and Autonomous Systems110, 173–188 (2018) https://doi.org/10.1016/j.robot. 2018.10.005

work page doi:10.1016/j.robot 2018

[8] [8]

Shamma, Michael S

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image anno- tations. International Journal of Computer Vision123(1), 32–73 (2017) https: //doi.org/10.1007/s11263-016-0981-7

work page doi:10.1007/s11263-016-0981-7 2017

[9] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763... (2021)

[10] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (2023). https://doi.org/10.48550/arXiv.2303.05499

[11] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026 (2023). https://doi.org/10.1109/ICCV51070.2023.00371

[12] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (2019)

[13] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (2019)

[14] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. In: NeurIPS Workshop on Machine Learning for Autonomous Driving (2019)

[15] Ziaeetabar, F., Pomp, J., Pfeiffer, S., El-Sourani, N., Schubotz, R.I., Tamosiunaite, M., Wörgötter, F.: Using enriched semantic event chains to model human action prediction based on minimal spatial information. PLOS ONE 15(12), 0243829 (2020) https://doi.org/10.1371/journal.pone.0243829

[16] Wörgötter, F., Ziaeetabar, F., Pfeiffer, S., Kaya, O., Kulvicius, T., Tamosiunaite, M.: Humans predict action using grammar-like structures. Scientific Reports 10(1), 3999 (2020)

[17] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018). https://doi.org/10.1609/aaai.v32i1.12328

[18] Ziaeetabar, F., Tamosiunaite, M., Wörgötter, F.: A hierarchical graph-based approach for recognition and description generation of bimanual actions in videos. IEEE Access (2024) https://doi.org/10.1109/ACCESS.2024.3509674

[19] Ziaeetabar, F., Wörgötter, F.: Adaptive multimodal graph reasoning with foundation models for fine-grained action recognition. IEEE Access 13, 201990–202009 (2025) https://doi.org/10.1109/ACCESS.2025.3637990

[20] Ziaeetabar, F.: Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains (2026). https://doi.org/10.48550/arXiv.2604.21053

[21] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, vol. 27, pp. 568–576 (2014)

[22] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4733 (2017). https://doi.org/10.1109/CVPR.2017.502

[23] Lin, J., Gan, C., Han, S.: TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)

[24] Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)

[25] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 813– PMLR, Virtual Event (2021)

[27] Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In: Advances in Neural Information Processing Systems, vol. 35, pp. 10078–10093 (2022)

[28] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., et al.: Ego4D: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18973–18990 (2022)

[29] Kwon, T., Tekin, B., Stuhmer, J., Bogo, F., Pollefeys, M.: H2O: Two hands manipulating objects for first person interaction recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10138–10148 (2021)

[30] Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

[31] Cho, H., Kim, C., Kim, J., Lee, S., Ismayilzada, E., Baek, S.: Transformer-based unified recognition of two hands manipulating objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4769–4778 (2023)

[32] Roh, W., Lee, S.H., Ryoo, W.J., Lee, J., Oh, G., Hwang, S., Chi, H.-g., Kim, S.: Functional hand type prior for 3D hand pose estimation and action recognition from egocentric view monocular videos. In: Proceedings of the British Machine Vision Conference (2023)

[33] Ji, J., Krishna, R., Fei-Fei, L., Niebles, J.C.: Action genome: Actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10236–10247 (2020). https://doi.org/10.1109/CVPR42600.2020.01025

[34] Darkhalil, A., Shan, D., Zhu, B., Ma, J., Kar, A., Higgins, R., Fidler, S., Fouhey, D., Damen, D.: EPIC-KITCHENS VISOR benchmark: VIdeo segmentations and object relations. In: Advances in Neural Information Processing Systems, Datasets and Benchmarks Track (2022)

[35] Brahmbhatt, S., Tang, C., Twigg, C.D., Kemp, C.C., Hays, J.: ContactPose: A dataset of grasps with object contact and hand pose. In: European Conference on Computer Vision, pp. 361–378. Springer, Cham (2020)

[36] Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M., Wörgötter, F.: Prediction of manipulation action classes using semantic spatial reasoning. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3350–3357 (2018). IEEE

[37] Hirata, T., Mukuta, Y., Harada, T.: Making video recognition models robust to common corruptions with supervised contrastive learning. In: Proceedings of the 3rd ACM International Conference on Multimedia in Asia, pp. 1–6 (2021). https://doi.org/10.1145/3469877.3497692

[38] Zeng, R., Xu, Q., Huang, W., Chen, P., Tan, M., Gan, C.: Benchmarking the robustness of temporal action detection models against temporal corruptions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

[39] Parisot, S., Ktena, S.I., Ferrante, E., Lee, M., Guerrero, R., Glocker, B., Rueckert, D.: Disease prediction using graph convolutional networks: Application to autism spectrum disorder and Alzheimer’s disease. Medical Image Analysis 48, 117–130 (2018)

[40] Ma, Q., Zhou, S., Li, C., Liu, F., Liu, Y., Hou, M., Zhang, Y.: Dgrunit: Dual graph reasoning unit for brain tumor segmentation. Computers in Biology and Medicine 149, 106079 (2022)

[41] Ziaeetabar, F.: Efficientgformer: Multimodal brain tumor segmentation via pruned graph-augmented transformer. arXiv preprint arXiv:2508.01465 (2025)