Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

Jose J.Guerrero; Lorenzo Mur-Labadia; Ruben Martinez-Cantina

arxiv: 2605.03614 · v1 · submitted 2026-05-05 · 💻 cs.CV

Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

Lorenzo Mur-Labadia , Ruben Martinez-Cantina , Jose J.Guerrero This is my paper

Pith reviewed 2026-05-07 17:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords affordancesbayesianbetterestimationinstancenovelsegmentationuncertainty

0 comments

The pith

Bayesian visual transformers with ensemble and sampling methods achieve a 7.4 percentage point gain on weighted F-beta score for affordance instance segmentation on the IIT-Aff dataset while providing calibrated epistemic and aleatoric uncertainty maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual affordances are the parts of an image where a person or robot could perform an action, like grasping a handle or sitting on a chair. The authors build a neural network based on visual transformers that not only draws masks around these regions but also reports how uncertain it is about each pixel. They create multiple versions of the model through ensembles and random sampling, then compare the outputs to separate two kinds of uncertainty: one from the model's lack of knowledge and one from inherent noise in the image. They also introduce a new score called Probability-based Mask Quality to judge how consistent the masks are across these versions. On a challenging dataset of indoor scenes, the Bayesian version beats a standard deterministic network by 7.4 points on a key accuracy metric and produces probability values that are better calibrated, meaning the model is less likely to be overconfident. The uncertainty maps show that the model is unsure mainly around object edges and in visually ambiguous areas.

Core claim

Our results show that the global consensus of multiple sub-networks of Bayesian models improve deterministic networks due to a better mask refinement and generalization. This fact, joined with the more powerful features extracted by attention-based mechanisms, represent an improvement of +7.4 p.p on the F_β^w score in the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and with a better uncertainty estimation.

Load-bearing premise

That the chosen ensemble and sampling approximations in the Bayesian visual transformer architecture reliably separate epistemic from aleatoric uncertainty and that the observed gains on the IIT-Aff dataset generalize beyond the specific training and evaluation splits used.

Figures

Figures reproduced from arXiv: 2605.03614 by Jose J.Guerrero, Lorenzo Mur-Labadia, Ruben Martinez-Cantina.

**Figure 1.** Figure 1: Our model architecture is composed of an attention-based backbone [13] extended with sampling view at source ↗

**Figure 2.** Figure 2: Evolution of the calibration metrics with the number of samples view at source ↗

**Figure 3.** Figure 3: Sparsification error curves for the semantic and spatial probabilities for Swin-T view at source ↗

**Figure 4.** Figure 4: Qualitative results and uncertainty prediction obtained by the Swin-T Mask-Ens Bayesian Instance view at source ↗

read the original abstract

Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps is crucial for these applications. We present a model for instance segmentation of affordances by adopting sample-based and ensembles approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improve deterministic networks due to a better mask refinement and generalization. This fact, joined with the more powerful features extracted by attention-based mechanisms, represent an improvement of +7.4 p.p on the $F_{\beta}^w$ score in the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and with a better uncertainty estimation. Qualitative results show that aleatoric variance appears in the contour of the objects, while the epistemic variance is observed in visual challenging pixels, adding interpretability to the neural network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper combines Bayesian uncertainty with visual transformers for affordance instance segmentation and reports a 7.4-point gain plus a new mask quality metric, but the experiments do not cleanly separate the Bayesian contribution from the attention architecture.

read the letter

The main thing to know is that this work applies sample-based and ensemble Bayesian methods to visual transformers for instance segmentation of affordances, adds a Probability-based Mask Quality metric, and claims a 7.4 percentage point improvement on the weighted F-beta score on the IIT-Aff dataset along with better calibration and interpretable uncertainty maps (aleatoric at contours, epistemic at hard pixels). The ablations on components and the qualitative results are the parts that hold up best. The new metric is a reasonable way to probe semantic and spatial variation in probabilistic masks, and the application to affordances is a legitimate extension for robotics and AR use cases. The paper does engage the literature on uncertainty estimation and transformers without obvious circularity. The central soft spot is exactly the one in the stress-test note. The abstract and results attribute gains to the consensus of Bayesian sub-networks for mask refinement, yet the deterministic baseline appears to be a non-attention network rather than a matched deterministic visual transformer. Without that direct comparison under identical backbone and training, the +7.4 point lift cannot be credited primarily to the uncertainty modeling. The single-dataset evaluation also leaves generalization untested. This is aimed at CV researchers working on affordance or uncertainty-aware segmentation for embodied applications. A reader in that niche would get value from the metric and the uncertainty visualizations. It deserves a serious referee because the empirical setup includes ablations and the task is practically relevant, even though the attribution of improvements needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper presents a Bayesian Visual Transformer architecture for instance segmentation of visual affordances, using ensemble and sampling-based approximations to estimate epistemic and aleatoric uncertainty. It introduces a Probability-based Mask Quality metric for analyzing probabilistic outputs and claims that the consensus across Bayesian sub-networks, combined with attention-based features, yields better mask refinement, generalization, and calibration than deterministic networks, with a reported +7.4 p.p. gain in F_β^w on the IIT-Aff dataset. Qualitative results are said to show aleatoric variance at object contours and epistemic variance at challenging pixels.

Significance. If the gains can be robustly isolated to the Bayesian components rather than the attention backbone, the work would contribute a practical uncertainty-aware approach to affordance segmentation with potential utility in robotics and AR. The new mask quality measure and explicit epistemic/aleatoric separation add interpretability value, but the current presentation leaves the attribution of improvements unverified.

major comments (2)

[Abstract] Abstract: The central claim attributes the +7.4 p.p. F_β^w improvement and better calibration to 'global consensus of multiple sub-networks of Bayesian models' plus attention features, yet no explicit ablation compares the full Bayesian VT against a deterministic VT with identical backbone, training protocol, and attention layers. Without this isolation, the load-bearing assumption that uncertainty estimation drives mask refinement cannot be evaluated.
[Abstract] Abstract: Ablation experiments on each component are referenced, but the text supplies no quantitative results, baseline definitions, statistical significance tests, or error analysis for the reported calibration and uncertainty improvements. This prevents verification of whether the ensemble/sampling approximations reliably separate epistemic from aleatoric uncertainty on the IIT-Aff splits.

minor comments (2)

The notation F_β^w is used without an explicit definition or reference to its weighting scheme in the provided text; a brief equation or citation would improve clarity.
The Probability-based Mask Quality metric is introduced as novel but its exact formulation (e.g., how probability distributions over masks are aggregated) is not detailed in the abstract; a dedicated methods subsection would aid reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work is empirical and does not introduce mathematical derivations, new physical entities, or unstated axioms beyond standard assumptions of deep learning and Bayesian neural network approximations.

axioms (1)

domain assumption Ensemble and Monte-Carlo sampling methods provide a practical approximation to epistemic uncertainty in neural networks for segmentation tasks
Invoked to justify extraction of epistemic and aleatoric variances from multiple model predictions.

invented entities (1)

Probability-based Mask Quality no independent evidence
purpose: To quantify semantic and spatial consistency across probabilistic instance masks
Introduced as a novel evaluation measure for probabilistic segmentation models

pith-pipeline@v0.9.0 · 5574 in / 1403 out tokens · 173000 ms · 2026-05-07T17:55:31.740053+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1]

J. J. Gibson, The theory of affordances, Hilldale, USA 1 (1977) 67–82

work page 1977
[2]

R. R. Murphy, Case studies of applying gibson’s ecological approach to mobile robots, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 29 (1999) 105–111

work page 1999
[3]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al., Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

work page 2023
[4]

Sanchez-Garcia, R

M. Sanchez-Garcia, R. Martinez-Cantin, J. J. Guerrero, Semantic and structural image segmentation for prosthetic vision, Plos one 15 (2020) e0227677

work page 2020
[5]

Nagarajan, C

T. Nagarajan, C. Feichtenhofer, K. Grauman, Grounded human-object interaction hotspots from video, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8688–8697

work page 2019
[6]

K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961– 2969

work page 2017
[7]

Doherty, B

J. Doherty, B. Gardiner, E. Kerr, N. Siddique, Bifpn-yolo: One-stage object de- tection integrating bi-directional feature pyramid networks, Pattern Recognition 160 (2025) 111209

work page 2025
[8]

Furnari, G

A. Furnari, G. M. Farinella, What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6252–6261. 28

work page 2019
[9]

Nagarajan, Y

T. Nagarajan, Y . Li, C. Feichtenhofer, K. Grauman, Ego-topo: Environment affordances from egocentric video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 163–172

work page 2020
[10]

C. Guo, G. Pleiss, Y . Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330

work page 2017
[11]

Morilla-Cabello, L

D. Morilla-Cabello, L. Mur-Labadia, R. Martinez-Cantin, E. Montijano, Robust fusion for Bayesian semantic mapping, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)

work page 2023
[12]

Mur-Labadia, R

L. Mur-Labadia, R. Martinez-Cantin, J. Guerrero, Bayesian deep learning for af- fordance segmentation in images, in: 2023 International Conference on Robotics and Automation (ICRA), IEEE, 2023

work page 2023
[13]

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, B. Guo, Swin trans- former: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022

work page 2021
[14]

Y . Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning (ICML), PMLR, 2016, pp. 1050–1059

work page 2016
[15]

Durasov, T

N. Durasov, T. Bagautdinov, P. Baque, P. Fua, Masksembles for uncertainty esti- mation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13539–13548

work page 2021
[16]

Lakshminarayanan, A

B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in neural information processing systems 30 (2017)

work page 2017
[17]

Huang, Y

G. Huang, Y . Li, G. Pleiss, Z. Liu, J. E. Hopcroft, K. Q. Weinberger, Snapshot ensembles: Train 1, get m for free, International Conference on Learning Repre- sentations (ICLR) (2017). 29

work page 2017
[18]

H. S. Koppula, A. Saxena, Anticipating human activities using object affordances for reactive robotic response, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2015) 14–29

work page 2015
[19]

Mur-Labadia, J

L. Mur-Labadia, J. J. Guerrero, R. Martinez-Cantin, Multi-label affordance map- ping from egocentric vision, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5238–5249

work page 2023
[20]

Montesano, M

L. Montesano, M. Lopes, A. Bernardino, J. Santos-Victor, Learning object af- fordances: from sensory–motor coordination to imitation, IEEE Transactions on Robotics 24 (2008) 15–26

work page 2008
[21]

S. Yang, W. Zhang, R. Song, J. Cheng, H. Wang, Y . Li, Watch and act: Learning robotic manipulation from visual demonstration, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2023)

work page 2023
[22]

Myers, C

A. Myers, C. L. Teo, C. Fermüller, Y . Aloimonos, Affordance detection of tool parts from geometric features, in: 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 1374–1381

work page 2015
[23]

Nguyen, D

A. Nguyen, D. Kanoulas, D. G. Caldwell, N. Tsagarakis, Object-based affor- dances detection with convolutional neural networks and dense conditional ran- dom fields, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 5908–5915

work page 2017
[24]

Nguyen, D

A. Nguyen, D. Kanoulas, D. G. Caldwell, N. G. Tsagarakis, Detecting object affordances with convolutional neural networks, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 2765– 2770

work page 2016
[25]

T.-T. Do, A. Nguyen, I. Reid, Affordancenet: An end-to-end deep learning ap- proach for object affordance detection, in: 2018 International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 5882–5889. 30

work page 2018
[26]

C. N. D. Minh, S. Z. Gilani, S. M. S. Islam, D. Suter, Learning affordance seg- mentation: An investigative study, in: 2020 Digital Image Computing: Tech- niques and Applications (DICTA), IEEE, 2020, pp. 1–8

work page 2020
[27]

Caselles-Dupré, M

H. Caselles-Dupré, M. Garcia-Ortiz, D. Filliat, Are standard object segmen- tation models sufficient for learning affordance segmentation?, arXiv preprint arXiv:2107.02095 (2021)

work page arXiv 2021
[28]

Apicella, A

T. Apicella, A. Xompero, P. Gastaldo, A. Cavallaro, Segmenting object affor- dances: Reproducibility and sensitivity to scale, in: European Conference on Computer Vision Workshops, Springer, 2024, pp. 286–304

work page 2024
[29]

Fang, T.-L

K. Fang, T.-L. Wu, D. Yang, S. Savarese, J. J. Lim, Demo2vec: Reasoning object affordances from online videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2139–2147

work page 2018
[30]

G. Li, N. Tsagkas, J. Song, R. Mon-Williams, S. Vijayakumar, K. Shao, L. Sevilla-Lara, Learning precise affordances from egocentric videos for robotic manipulation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 10581–10591

work page 2025
[31]

Heidinger, S

M. Heidinger, S. Jauhri, V . Prasad, G. Chalvatzaki, 2handedafforder: Learning precise actionable bimanual affordances from human videos, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14743– 14753

work page 2025
[32]

S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, L. E. Li, Affordancellm: Ground- ing affordance from vision language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7587–7597

work page 2024
[33]

Cuttano, G

C. Cuttano, G. Rosi, G. Trivigno, G. Averta, What does clip know about peeling a banana?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2238–2247. 31

work page 2024
[34]

G. Li, D. Sun, L. Sevilla-Lara, V . Jampani, One-shot open affordance learning with foundation models, in: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024, pp. 3086–3096

work page 2024
[35]

J. Tang, G. Zheng, J. Yu, S. Yang, Cotdet: Affordance knowledge prompting for task driven object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3068–3078

work page 2023
[36]

D. Wu, Y . Fu, S. Huang, Y . Liu, F. Jia, N. Liu, F. Dai, T. Wang, R. M. Anwer, F. S. Khan, et al., Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping, in: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2025, pp. 11980–11990

work page 2025
[37]

X. Wang, X. Yang, Y . Xu, Y . Wu, Z. Li, N. Zhao, Affordbot: 3d fine-grained embodied reasoning via multimodal large language models, in: Advances in Neural Information Processing Systems, 2025

work page 2025
[38]

D. Lu, L. Kong, T. Huang, G. H. Lee, Geal: Generalizable 3d affordance learn- ing with cross-modal consistency, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1680–1690

work page 2025
[39]

W. Moon, H. S. Seong, J.-P. Heo, Selective contrastive learning for weakly su- pervised affordance grounding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5210–5220

work page 2025
[40]

Apicella, A

T. Apicella, A. Xompero, A. Cavallaro, Visual affordance prediction: Survey and reproducibility, arXiv preprint arXiv:2505.05074 (2025)

work page arXiv 2025
[41]

Papamarkou, M

T. Papamarkou, M. Skoularidou, K. Palla, L. Aitchison, J. Arbel, D. Dunson, M. Filippone, V . Fortuin, P. Hennig, J. M. Hernández-Lobato, et al., Position: Bayesian deep learning is needed in the age of large-scale ai, in: International Conference on Machine Learning, PMLR, 2024, pp. 39556–39586

work page 2024
[42]

V . D. Wild, S. Ghalebikesabi, D. Sejdinovic, J. Knoblauch, A rigorous link be- tween deep ensembles and (variational) bayesian methods, in: Advances in Neu- ral Information Processing Systems, 2023, pp. 39782–39811. 32

work page 2023
[43]

B. G. Doan, A. Shamsi, X.-Y . Guo, A. Mohammadi, H. Alinejad-Rokny, D. Sejdi- novic, D. Teney, D. C. Ranasinghe, E. Abbasnejad, Bayesian low-rank learning (bella): A practical approach to bayesian neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 16298–16307

work page 2025
[44]

Postels, H

J. Postels, H. Blum, C. Cadena, R. Siegwart, L. V . Gool, F. Tombari, Quantifying aleatoric and epistemic uncertainty using density estimation in latent space, Pro- ceedings of the 39th Conference on Uncertainty in Artificial Intelligence (2023)

work page 2023
[45]

K. A. Sankararaman, S. Wang, H. Fang, Bayesformer: Transformer with uncer- tainty estimation, arXiv preprint arXiv:2206.00826 (2022)

work page arXiv 2022
[46]

Müller, N

S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, F. Hutter, Transformers can do bayesian inference, in: International Conference on Learning Representations, 2022

work page 2022
[47]

Reuter, T

A. Reuter, T. G. Rudner, V . Fortuin, D. Rügamer, Can transformers learn full bayesian inference in context?, in: International Conference on Machine Learn- ing, 2025, pp. 51531–51582

work page 2025
[48]

Uncertainty

A. Gleave, G. Irving, Uncertainty estimation for language reward models, arXiv preprint arXiv:2203.07472 (2022)

work page arXiv 2022
[49]

H. Wang, Q. Ji, Beyond dirichlet-based models: when bayesian neural networks meet evidential deep learning, in: The 40th Conference on Uncertainty in Artifi- cial Intelligence, 2024

work page 2024
[50]

Kendall, Y

A. Kendall, Y . Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, 31st Conference on Neural Information Processing Systems (NIPS) (2017)

work page 2017
[51]

G.-P. Ji, L. Zhu, M. Zhuge, K. Fu, Fast camouflaged object detection via edge- based reversible re-calibration network, Pattern Recognition 123 (2022) 108414

work page 2022
[52]

S. Kim, P. Chikontwe, S. An, S. H. Park, Uncertainty-aware semi-supervised few shot segmentation, Pattern Recognition 137 (2023) 109292. 33

work page 2023
[53]

S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detec- tion with region proposal networks, Advances in neural information processing systems 28 (2015)

work page 2015
[54]

Miller, F

D. Miller, F. Dayoub, M. Milford, N. Sünderhauf, Evaluating merging strategies for sampling-based uncertainty techniques in object detection, in: 2019 Interna- tional Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2348– 2354

work page 2019
[55]

Margolin, L

R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2014, pp. 248–255

work page 2014
[56]

D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. An- gelova, N. Sünderhauf, Probabilistic object detection: Definition and evaluation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Com- puter Vision (W ACV), 2020, pp. 1031–1040

work page 2020
[57]

H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97

work page 1955
[58]

G. W. Brier, et al., Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3

work page 1950
[59]

C. Yin, Q. Zhang, Object affordance detection with boundary-preserving network for robotic manipulation tasks, Neural Computing and Applications 34 (2022) 17963–17980

work page 2022
[60]

Zhang, H

Y . Zhang, H. Li, T. Ren, Y . Dou, Q. Li, Multi-scale fusion and global semantic encoding for affordance detection, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–8. 34

work page 2022

[1] [1]

J. J. Gibson, The theory of affordances, Hilldale, USA 1 (1977) 67–82

work page 1977

[2] [2]

R. R. Murphy, Case studies of applying gibson’s ecological approach to mobile robots, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 29 (1999) 105–111

work page 1999

[3] [3]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al., Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

work page 2023

[4] [4]

Sanchez-Garcia, R

M. Sanchez-Garcia, R. Martinez-Cantin, J. J. Guerrero, Semantic and structural image segmentation for prosthetic vision, Plos one 15 (2020) e0227677

work page 2020

[5] [5]

Nagarajan, C

T. Nagarajan, C. Feichtenhofer, K. Grauman, Grounded human-object interaction hotspots from video, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8688–8697

work page 2019

[6] [6]

K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961– 2969

work page 2017

[7] [7]

Doherty, B

J. Doherty, B. Gardiner, E. Kerr, N. Siddique, Bifpn-yolo: One-stage object de- tection integrating bi-directional feature pyramid networks, Pattern Recognition 160 (2025) 111209

work page 2025

[8] [8]

Furnari, G

A. Furnari, G. M. Farinella, What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6252–6261. 28

work page 2019

[9] [9]

Nagarajan, Y

T. Nagarajan, Y . Li, C. Feichtenhofer, K. Grauman, Ego-topo: Environment affordances from egocentric video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 163–172

work page 2020

[10] [10]

C. Guo, G. Pleiss, Y . Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330

work page 2017

[11] [11]

Morilla-Cabello, L

D. Morilla-Cabello, L. Mur-Labadia, R. Martinez-Cantin, E. Montijano, Robust fusion for Bayesian semantic mapping, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)

work page 2023

[12] [12]

Mur-Labadia, R

L. Mur-Labadia, R. Martinez-Cantin, J. Guerrero, Bayesian deep learning for af- fordance segmentation in images, in: 2023 International Conference on Robotics and Automation (ICRA), IEEE, 2023

work page 2023

[13] [13]

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, B. Guo, Swin trans- former: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022

work page 2021

[14] [14]

Y . Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning (ICML), PMLR, 2016, pp. 1050–1059

work page 2016

[15] [15]

Durasov, T

N. Durasov, T. Bagautdinov, P. Baque, P. Fua, Masksembles for uncertainty esti- mation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13539–13548

work page 2021

[16] [16]

Lakshminarayanan, A

B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in neural information processing systems 30 (2017)

work page 2017

[17] [17]

Huang, Y

G. Huang, Y . Li, G. Pleiss, Z. Liu, J. E. Hopcroft, K. Q. Weinberger, Snapshot ensembles: Train 1, get m for free, International Conference on Learning Repre- sentations (ICLR) (2017). 29

work page 2017

[18] [18]

H. S. Koppula, A. Saxena, Anticipating human activities using object affordances for reactive robotic response, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2015) 14–29

work page 2015

[19] [19]

Mur-Labadia, J

L. Mur-Labadia, J. J. Guerrero, R. Martinez-Cantin, Multi-label affordance map- ping from egocentric vision, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5238–5249

work page 2023

[20] [20]

Montesano, M

L. Montesano, M. Lopes, A. Bernardino, J. Santos-Victor, Learning object af- fordances: from sensory–motor coordination to imitation, IEEE Transactions on Robotics 24 (2008) 15–26

work page 2008

[21] [21]

S. Yang, W. Zhang, R. Song, J. Cheng, H. Wang, Y . Li, Watch and act: Learning robotic manipulation from visual demonstration, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2023)

work page 2023

[22] [22]

Myers, C

A. Myers, C. L. Teo, C. Fermüller, Y . Aloimonos, Affordance detection of tool parts from geometric features, in: 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 1374–1381

work page 2015

[23] [23]

Nguyen, D

A. Nguyen, D. Kanoulas, D. G. Caldwell, N. Tsagarakis, Object-based affor- dances detection with convolutional neural networks and dense conditional ran- dom fields, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 5908–5915

work page 2017

[24] [24]

Nguyen, D

A. Nguyen, D. Kanoulas, D. G. Caldwell, N. G. Tsagarakis, Detecting object affordances with convolutional neural networks, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 2765– 2770

work page 2016

[25] [25]

T.-T. Do, A. Nguyen, I. Reid, Affordancenet: An end-to-end deep learning ap- proach for object affordance detection, in: 2018 International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 5882–5889. 30

work page 2018

[26] [26]

C. N. D. Minh, S. Z. Gilani, S. M. S. Islam, D. Suter, Learning affordance seg- mentation: An investigative study, in: 2020 Digital Image Computing: Tech- niques and Applications (DICTA), IEEE, 2020, pp. 1–8

work page 2020

[27] [27]

Caselles-Dupré, M

H. Caselles-Dupré, M. Garcia-Ortiz, D. Filliat, Are standard object segmen- tation models sufficient for learning affordance segmentation?, arXiv preprint arXiv:2107.02095 (2021)

work page arXiv 2021

[28] [28]

Apicella, A

T. Apicella, A. Xompero, P. Gastaldo, A. Cavallaro, Segmenting object affor- dances: Reproducibility and sensitivity to scale, in: European Conference on Computer Vision Workshops, Springer, 2024, pp. 286–304

work page 2024

[29] [29]

Fang, T.-L

K. Fang, T.-L. Wu, D. Yang, S. Savarese, J. J. Lim, Demo2vec: Reasoning object affordances from online videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2139–2147

work page 2018

[30] [30]

G. Li, N. Tsagkas, J. Song, R. Mon-Williams, S. Vijayakumar, K. Shao, L. Sevilla-Lara, Learning precise affordances from egocentric videos for robotic manipulation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 10581–10591

work page 2025

[31] [31]

Heidinger, S

M. Heidinger, S. Jauhri, V . Prasad, G. Chalvatzaki, 2handedafforder: Learning precise actionable bimanual affordances from human videos, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14743– 14753

work page 2025

[32] [32]

S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, L. E. Li, Affordancellm: Ground- ing affordance from vision language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7587–7597

work page 2024

[33] [33]

Cuttano, G

C. Cuttano, G. Rosi, G. Trivigno, G. Averta, What does clip know about peeling a banana?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2238–2247. 31

work page 2024

[34] [34]

G. Li, D. Sun, L. Sevilla-Lara, V . Jampani, One-shot open affordance learning with foundation models, in: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024, pp. 3086–3096

work page 2024

[35] [35]

J. Tang, G. Zheng, J. Yu, S. Yang, Cotdet: Affordance knowledge prompting for task driven object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3068–3078

work page 2023

[36] [36]

D. Wu, Y . Fu, S. Huang, Y . Liu, F. Jia, N. Liu, F. Dai, T. Wang, R. M. Anwer, F. S. Khan, et al., Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping, in: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2025, pp. 11980–11990

work page 2025

[37] [37]

X. Wang, X. Yang, Y . Xu, Y . Wu, Z. Li, N. Zhao, Affordbot: 3d fine-grained embodied reasoning via multimodal large language models, in: Advances in Neural Information Processing Systems, 2025

work page 2025

[38] [38]

D. Lu, L. Kong, T. Huang, G. H. Lee, Geal: Generalizable 3d affordance learn- ing with cross-modal consistency, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1680–1690

work page 2025

[39] [39]

W. Moon, H. S. Seong, J.-P. Heo, Selective contrastive learning for weakly su- pervised affordance grounding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5210–5220

work page 2025

[40] [40]

Apicella, A

T. Apicella, A. Xompero, A. Cavallaro, Visual affordance prediction: Survey and reproducibility, arXiv preprint arXiv:2505.05074 (2025)

work page arXiv 2025

[41] [41]

Papamarkou, M

T. Papamarkou, M. Skoularidou, K. Palla, L. Aitchison, J. Arbel, D. Dunson, M. Filippone, V . Fortuin, P. Hennig, J. M. Hernández-Lobato, et al., Position: Bayesian deep learning is needed in the age of large-scale ai, in: International Conference on Machine Learning, PMLR, 2024, pp. 39556–39586

work page 2024

[42] [42]

V . D. Wild, S. Ghalebikesabi, D. Sejdinovic, J. Knoblauch, A rigorous link be- tween deep ensembles and (variational) bayesian methods, in: Advances in Neu- ral Information Processing Systems, 2023, pp. 39782–39811. 32

work page 2023

[43] [43]

B. G. Doan, A. Shamsi, X.-Y . Guo, A. Mohammadi, H. Alinejad-Rokny, D. Sejdi- novic, D. Teney, D. C. Ranasinghe, E. Abbasnejad, Bayesian low-rank learning (bella): A practical approach to bayesian neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 16298–16307

work page 2025

[44] [44]

Postels, H

J. Postels, H. Blum, C. Cadena, R. Siegwart, L. V . Gool, F. Tombari, Quantifying aleatoric and epistemic uncertainty using density estimation in latent space, Pro- ceedings of the 39th Conference on Uncertainty in Artificial Intelligence (2023)

work page 2023

[45] [45]

K. A. Sankararaman, S. Wang, H. Fang, Bayesformer: Transformer with uncer- tainty estimation, arXiv preprint arXiv:2206.00826 (2022)

work page arXiv 2022

[46] [46]

Müller, N

S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, F. Hutter, Transformers can do bayesian inference, in: International Conference on Learning Representations, 2022

work page 2022

[47] [47]

Reuter, T

A. Reuter, T. G. Rudner, V . Fortuin, D. Rügamer, Can transformers learn full bayesian inference in context?, in: International Conference on Machine Learn- ing, 2025, pp. 51531–51582

work page 2025

[48] [48]

Uncertainty

A. Gleave, G. Irving, Uncertainty estimation for language reward models, arXiv preprint arXiv:2203.07472 (2022)

work page arXiv 2022

[49] [49]

H. Wang, Q. Ji, Beyond dirichlet-based models: when bayesian neural networks meet evidential deep learning, in: The 40th Conference on Uncertainty in Artifi- cial Intelligence, 2024

work page 2024

[50] [50]

Kendall, Y

A. Kendall, Y . Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, 31st Conference on Neural Information Processing Systems (NIPS) (2017)

work page 2017

[51] [51]

G.-P. Ji, L. Zhu, M. Zhuge, K. Fu, Fast camouflaged object detection via edge- based reversible re-calibration network, Pattern Recognition 123 (2022) 108414

work page 2022

[52] [52]

S. Kim, P. Chikontwe, S. An, S. H. Park, Uncertainty-aware semi-supervised few shot segmentation, Pattern Recognition 137 (2023) 109292. 33

work page 2023

[53] [53]

S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detec- tion with region proposal networks, Advances in neural information processing systems 28 (2015)

work page 2015

[54] [54]

Miller, F

D. Miller, F. Dayoub, M. Milford, N. Sünderhauf, Evaluating merging strategies for sampling-based uncertainty techniques in object detection, in: 2019 Interna- tional Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2348– 2354

work page 2019

[55] [55]

Margolin, L

R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2014, pp. 248–255

work page 2014

[56] [56]

D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. An- gelova, N. Sünderhauf, Probabilistic object detection: Definition and evaluation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Com- puter Vision (W ACV), 2020, pp. 1031–1040

work page 2020

[57] [57]

H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97

work page 1955

[58] [58]

G. W. Brier, et al., Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3

work page 1950

[59] [59]

C. Yin, Q. Zhang, Object affordance detection with boundary-preserving network for robotic manipulation tasks, Neural Computing and Applications 34 (2022) 17963–17980

work page 2022

[60] [60]

Zhang, H

Y . Zhang, H. Li, T. Ren, Y . Dou, Q. Li, Multi-scale fusion and global semantic encoding for affordance detection, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–8. 34

work page 2022