
arxiv: 2605.03614 · v1 · submitted 2026-05-05 · 💻 cs.CV

Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

Pith reviewed 2026-05-07 17:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords affordances · bayesian · better · estimation · instance · novel · segmentation · uncertainty

The pith

Bayesian visual transformers with ensemble and sampling methods achieve a 7.4 percentage point gain on weighted F-beta score for affordance instance segmentation on the IIT-Aff dataset while providing calibrated epistemic and aleatoric uncertainty maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual affordances are the parts of an image where a person or robot could perform an action, like grasping a handle or sitting on a chair. The authors build a neural network based on visual transformers that not only draws masks around these regions but also reports how uncertain it is about each pixel. They create multiple versions of the model through ensembles and random sampling, then compare the outputs to separate two kinds of uncertainty: one from the model's lack of knowledge and one from inherent noise in the image. They also introduce a new score called Probability-based Mask Quality to judge how consistent the masks are across these versions. On a challenging dataset of indoor scenes, the Bayesian version beats a standard deterministic network by 7.4 points on a key accuracy metric and produces probability values that are better calibrated, meaning the model is less likely to be overconfident. The uncertainty maps show that the model is unsure mainly around object edges and in visually ambiguous areas.
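
As a rough illustration of how several model outputs can be compared to separate the two kinds of uncertainty, the sketch below applies one common decomposition of Bernoulli predictive variance. It is an assumption for illustration, not necessarily the formulation used in the paper; the function name `decompose_uncertainty` and the toy input are hypothetical.

```python
import numpy as np

def decompose_uncertainty(probs):
    """Split per-pixel predictive variance into aleatoric and epistemic parts.

    probs: array of shape (T, H, W) holding foreground probabilities from
    T ensemble members or stochastic forward passes (hypothetical input).
    """
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)
    # Aleatoric part: average Bernoulli variance p * (1 - p) over members.
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)
    # Epistemic part: variance of the predicted probability across members.
    epistemic = probs.var(axis=0)
    return mean_p, aleatoric, epistemic

# Toy example: two members, one row of two pixels.
# Pixel 0: both members say 0.5 (agreement on an ambiguous pixel).
# Pixel 1: members say 0.9 and 0.1 (confident but contradictory).
samples = np.array([
    [[0.5, 0.9]],
    [[0.5, 0.1]],
])
mean_p, aleatoric, epistemic = decompose_uncertainty(samples)
print(aleatoric)  # pixel 0 is purely aleatoric
print(epistemic)  # pixel 1 carries the epistemic share
```

With this decomposition the two parts always sum to the variance of the averaged prediction, `mean_p * (1 - mean_p)`, so member disagreement moves uncertainty from the aleatoric to the epistemic term without changing the total.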

Core claim

Our results show that the global consensus of multiple sub-networks of Bayesian models improves on deterministic networks due to better mask refinement and generalization. This, together with the more powerful features extracted by attention-based mechanisms, yields an improvement of +7.4 p.p. on the F_β^w score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates.
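
To make "less overconfident" concrete, the following sketch computes the expected calibration error (ECE), a widely used calibration measure. This page does not state which calibration metrics the paper reports, so ECE, the bin count, and the toy inputs are assumptions chosen for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are half-open (lo, hi]; a confidence of exactly 0 is not binned.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Both toy models are right on half of the pixels, but one claims 95%
# confidence while the other claims 55%.
labels_ok = [1, 1, 0, 0]
ece_overconfident = expected_calibration_error([0.95] * 4, labels_ok)
ece_calibrated = expected_calibration_error([0.55] * 4, labels_ok)
print(ece_overconfident, ece_calibrated)
```

A lower value means the reported probabilities track the observed accuracy more closely, which is the sense in which a model is called better calibrated.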

Load-bearing premise

That the chosen ensemble and sampling approximations in the Bayesian visual transformer architecture reliably separate epistemic from aleatoric uncertainty and that the observed gains on the IIT-Aff dataset generalize beyond the specific training and evaluation splits used.

Figures

Figures reproduced from arXiv: 2605.03614 by Jose J. Guerrero, Lorenzo Mur-Labadia, Ruben Martinez-Cantin.

Figure 1: Our model architecture is composed of an attention-based backbone [13] extended with sampling
Figure 2: Evolution of the calibration metrics with the number of samples
Figure 3: Sparsification error curves for the semantic and spatial probabilities for Swin-T
Figure 4: Qualitative results and uncertainty prediction obtained by the Swin-T Mask-Ens Bayesian Instance
Abstract

Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances by adopting sample-based and ensemble approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improves on deterministic networks due to better mask refinement and generalization. This, together with the more powerful features extracted by attention-based mechanisms, yields an improvement of +7.4 p.p. on the $F_{\beta}^w$ score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears along object contours, while epistemic variance is observed in visually challenging pixels, adding interpretability to the neural network.
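
The "global consensus" idea can be sketched by averaging the per-pixel probabilities of several members and thresholding the mean. This page does not describe how the paper merges detections, so the plain averaging, the 0.5 threshold, and the name `consensus_mask` are assumptions for illustration only.

```python
import numpy as np

def consensus_mask(member_probs, threshold=0.5):
    """Average member probabilities of shape (T, H, W) and threshold the mean."""
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)
    return mean_prob, mean_prob >= threshold

# Toy strip of four pixels whose true mask is [1, 1, 0, 0].
# Each member misclassifies a different pixel when thresholded alone.
members = np.array([
    [[0.9, 0.4, 0.1, 0.2]],  # wrong on pixel 1
    [[0.8, 0.9, 0.6, 0.1]],  # wrong on pixel 2
    [[0.3, 0.8, 0.2, 0.1]],  # wrong on pixel 0
])
mean_prob, mask = consensus_mask(members)
print(mask)  # the averaged prediction recovers the true mask
```

In this toy case every individual member makes one error, yet the averaged prediction is correct everywhere, which is the intuition behind consensus improving mask refinement.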

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a Bayesian Visual Transformer architecture for instance segmentation of visual affordances, using ensemble and sampling-based approximations to estimate epistemic and aleatoric uncertainty. It introduces a Probability-based Mask Quality metric for analyzing probabilistic outputs and claims that the consensus across Bayesian sub-networks, combined with attention-based features, yields better mask refinement, generalization, and calibration than deterministic networks, with a reported +7.4 p.p. gain in F_β^w on the IIT-Aff dataset. Qualitative results are said to show aleatoric variance at object contours and epistemic variance at challenging pixels.

Significance. If the gains can be robustly isolated to the Bayesian components rather than the attention backbone, the work would contribute a practical uncertainty-aware approach to affordance segmentation with potential utility in robotics and AR. The new mask quality measure and explicit epistemic/aleatoric separation add interpretability value, but the current presentation leaves the attribution of improvements unverified.

major comments (2)
  1. [Abstract] The central claim attributes the +7.4 p.p. F_β^w improvement and better calibration to 'global consensus of multiple sub-networks of Bayesian models' plus attention features, yet no explicit ablation compares the full Bayesian VT against a deterministic VT with identical backbone, training protocol, and attention layers. Without this isolation, the load-bearing assumption that uncertainty estimation drives mask refinement cannot be evaluated.
  2. [Abstract] Ablation experiments on each component are referenced, but the text supplies no quantitative results, baseline definitions, statistical significance tests, or error analysis for the reported calibration and uncertainty improvements. This prevents verification of whether the ensemble/sampling approximations reliably separate epistemic from aleatoric uncertainty on the IIT-Aff splits.
minor comments (2)
  1. The notation F_β^w is used without an explicit definition or reference to its weighting scheme in the provided text; a brief equation or citation would improve clarity.
  2. The Probability-based Mask Quality metric is introduced as novel but its exact formulation (e.g., how probability distributions over masks are aggregated) is not detailed in the abstract; a dedicated methods subsection would aid reproducibility.
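
For orientation on the first minor comment: assuming the paper follows the weighted F-measure of Margolin et al. [55], which appears in the reference list, the score takes the usual F-measure form with precision and recall replaced by weighted counterparts. The weighting itself (based on pixel dependency and location in the error map) is defined in [55] and is not reproduced here.

```latex
F_{\beta}^{w} = \frac{(1 + \beta^{2}) \, \mathrm{Precision}^{w} \cdot \mathrm{Recall}^{w}}
                     {\beta^{2} \, \mathrm{Precision}^{w} + \mathrm{Recall}^{w}}
```

Setting the weights to the identity recovers the standard F_β score, so the two measures differ only in how errors are counted.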

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work is empirical and does not introduce mathematical derivations, new physical entities, or unstated axioms beyond standard assumptions of deep learning and Bayesian neural network approximations.

axioms (1)
  • domain assumption: Ensemble and Monte-Carlo sampling methods provide a practical approximation to epistemic uncertainty in neural networks for segmentation tasks
    Invoked to justify extraction of epistemic and aleatoric variances from multiple model predictions.
invented entities (1)
  • Probability-based Mask Quality (no independent evidence)
    purpose: To quantify semantic and spatial consistency across probabilistic instance masks
    Introduced as a novel evaluation measure for probabilistic segmentation models

pith-pipeline@v0.9.0 · 5574 in / 1403 out tokens · 173000 ms · 2026-05-07T17:55:31.740053+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

  [1] J. J. Gibson, The theory of affordances, Hilldale, USA 1 (1977) 67–82
  [2] R. R. Murphy, Case studies of applying gibson’s ecological approach to mobile robots, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 29 (1999) 105–111
  [3] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al., Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  [4] M. Sanchez-Garcia, R. Martinez-Cantin, J. J. Guerrero, Semantic and structural image segmentation for prosthetic vision, Plos one 15 (2020) e0227677
  [5] T. Nagarajan, C. Feichtenhofer, K. Grauman, Grounded human-object interaction hotspots from video, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8688–8697
  [6] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961–2969
  [7] J. Doherty, B. Gardiner, E. Kerr, N. Siddique, Bifpn-yolo: One-stage object detection integrating bi-directional feature pyramid networks, Pattern Recognition 160 (2025) 111209
  [8] A. Furnari, G. M. Farinella, What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6252–6261
  [9] T. Nagarajan, Y. Li, C. Feichtenhofer, K. Grauman, Ego-topo: Environment affordances from egocentric video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 163–172
  [10] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330
  [11] D. Morilla-Cabello, L. Mur-Labadia, R. Martinez-Cantin, E. Montijano, Robust fusion for Bayesian semantic mapping, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)
  [12] L. Mur-Labadia, R. Martinez-Cantin, J. Guerrero, Bayesian deep learning for affordance segmentation in images, in: 2023 International Conference on Robotics and Automation (ICRA), IEEE, 2023
  [13] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022
  [14] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning (ICML), PMLR, 2016, pp. 1050–1059
  [15] N. Durasov, T. Bagautdinov, P. Baque, P. Fua, Masksembles for uncertainty estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13539–13548
  [16] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in neural information processing systems 30 (2017)
  [17] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, K. Q. Weinberger, Snapshot ensembles: Train 1, get m for free, International Conference on Learning Representations (ICLR) (2017)
  [18] H. S. Koppula, A. Saxena, Anticipating human activities using object affordances for reactive robotic response, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2015) 14–29
  [19] L. Mur-Labadia, J. J. Guerrero, R. Martinez-Cantin, Multi-label affordance mapping from egocentric vision, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5238–5249
  [20] L. Montesano, M. Lopes, A. Bernardino, J. Santos-Victor, Learning object affordances: from sensory–motor coordination to imitation, IEEE Transactions on Robotics 24 (2008) 15–26
  [21] S. Yang, W. Zhang, R. Song, J. Cheng, H. Wang, Y. Li, Watch and act: Learning robotic manipulation from visual demonstration, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2023)
  [22] A. Myers, C. L. Teo, C. Fermüller, Y. Aloimonos, Affordance detection of tool parts from geometric features, in: 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 1374–1381
  [23] A. Nguyen, D. Kanoulas, D. G. Caldwell, N. Tsagarakis, Object-based affordances detection with convolutional neural networks and dense conditional random fields, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 5908–5915
  [24] A. Nguyen, D. Kanoulas, D. G. Caldwell, N. G. Tsagarakis, Detecting object affordances with convolutional neural networks, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 2765–2770
  [25] T.-T. Do, A. Nguyen, I. Reid, Affordancenet: An end-to-end deep learning approach for object affordance detection, in: 2018 International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 5882–5889
  [26] C. N. D. Minh, S. Z. Gilani, S. M. S. Islam, D. Suter, Learning affordance segmentation: An investigative study, in: 2020 Digital Image Computing: Techniques and Applications (DICTA), IEEE, 2020, pp. 1–8
  [27] H. Caselles-Dupré, M. Garcia-Ortiz, D. Filliat, Are standard object segmentation models sufficient for learning affordance segmentation?, arXiv preprint arXiv:2107.02095 (2021)
  [28] T. Apicella, A. Xompero, P. Gastaldo, A. Cavallaro, Segmenting object affordances: Reproducibility and sensitivity to scale, in: European Conference on Computer Vision Workshops, Springer, 2024, pp. 286–304
  [29] K. Fang, T.-L. Wu, D. Yang, S. Savarese, J. J. Lim, Demo2vec: Reasoning object affordances from online videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2139–2147
  [30] G. Li, N. Tsagkas, J. Song, R. Mon-Williams, S. Vijayakumar, K. Shao, L. Sevilla-Lara, Learning precise affordances from egocentric videos for robotic manipulation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 10581–10591
  [31] M. Heidinger, S. Jauhri, V. Prasad, G. Chalvatzaki, 2handedafforder: Learning precise actionable bimanual affordances from human videos, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14743–14753
  [32] S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, L. E. Li, Affordancellm: Grounding affordance from vision language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7587–7597
  [33] C. Cuttano, G. Rosi, G. Trivigno, G. Averta, What does clip know about peeling a banana?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2238–2247
  [34] G. Li, D. Sun, L. Sevilla-Lara, V. Jampani, One-shot open affordance learning with foundation models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3086–3096
  [35] J. Tang, G. Zheng, J. Yu, S. Yang, Cotdet: Affordance knowledge prompting for task driven object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3068–3078
  [36] D. Wu, Y. Fu, S. Huang, Y. Liu, F. Jia, N. Liu, F. Dai, T. Wang, R. M. Anwer, F. S. Khan, et al., Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11980–11990
  [37] X. Wang, X. Yang, Y. Xu, Y. Wu, Z. Li, N. Zhao, Affordbot: 3d fine-grained embodied reasoning via multimodal large language models, in: Advances in Neural Information Processing Systems, 2025
  [38] D. Lu, L. Kong, T. Huang, G. H. Lee, Geal: Generalizable 3d affordance learning with cross-modal consistency, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1680–1690
  [39] W. Moon, H. S. Seong, J.-P. Heo, Selective contrastive learning for weakly supervised affordance grounding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5210–5220
  [40] T. Apicella, A. Xompero, A. Cavallaro, Visual affordance prediction: Survey and reproducibility, arXiv preprint arXiv:2505.05074 (2025)
  [41] T. Papamarkou, M. Skoularidou, K. Palla, L. Aitchison, J. Arbel, D. Dunson, M. Filippone, V. Fortuin, P. Hennig, J. M. Hernández-Lobato, et al., Position: Bayesian deep learning is needed in the age of large-scale ai, in: International Conference on Machine Learning, PMLR, 2024, pp. 39556–39586
  [42] V. D. Wild, S. Ghalebikesabi, D. Sejdinovic, J. Knoblauch, A rigorous link between deep ensembles and (variational) bayesian methods, in: Advances in Neural Information Processing Systems, 2023, pp. 39782–39811
  [43] B. G. Doan, A. Shamsi, X.-Y. Guo, A. Mohammadi, H. Alinejad-Rokny, D. Sejdinovic, D. Teney, D. C. Ranasinghe, E. Abbasnejad, Bayesian low-rank learning (bella): A practical approach to bayesian neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 16298–16307
  [44] J. Postels, H. Blum, C. Cadena, R. Siegwart, L. V. Gool, F. Tombari, Quantifying aleatoric and epistemic uncertainty using density estimation in latent space, Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence (2023)
  [45] K. A. Sankararaman, S. Wang, H. Fang, Bayesformer: Transformer with uncertainty estimation, arXiv preprint arXiv:2206.00826 (2022)
  [46] S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, F. Hutter, Transformers can do bayesian inference, in: International Conference on Learning Representations, 2022
  [47] A. Reuter, T. G. Rudner, V. Fortuin, D. Rügamer, Can transformers learn full bayesian inference in context?, in: International Conference on Machine Learning, 2025, pp. 51531–51582
  [48] A. Gleave, G. Irving, Uncertainty estimation for language reward models, arXiv preprint arXiv:2203.07472 (2022)
  [49] H. Wang, Q. Ji, Beyond dirichlet-based models: when bayesian neural networks meet evidential deep learning, in: The 40th Conference on Uncertainty in Artificial Intelligence, 2024
  [50] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, 31st Conference on Neural Information Processing Systems (NIPS) (2017)
  [51] G.-P. Ji, L. Zhu, M. Zhuge, K. Fu, Fast camouflaged object detection via edge-based reversible re-calibration network, Pattern Recognition 123 (2022) 108414
  [52] S. Kim, P. Chikontwe, S. An, S. H. Park, Uncertainty-aware semi-supervised few shot segmentation, Pattern Recognition 137 (2023) 109292
  [53] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015)
  [54] D. Miller, F. Dayoub, M. Milford, N. Sünderhauf, Evaluating merging strategies for sampling-based uncertainty techniques in object detection, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2348–2354
  [55] R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 248–255
  [56] D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. Angelova, N. Sünderhauf, Probabilistic object detection: Definition and evaluation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 1031–1040
  [57] H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97
  [58] G. W. Brier, et al., Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3
  [59] C. Yin, Q. Zhang, Object affordance detection with boundary-preserving network for robotic manipulation tasks, Neural Computing and Applications 34 (2022) 17963–17980
  [60] Y. Zhang, H. Li, T. Ren, Y. Dou, Q. Li, Multi-scale fusion and global semantic encoding for affordance detection, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–8