
arxiv: 2412.01782 · v4 · submitted 2024-12-02 · 💻 cs.CV · cs.AI

Uncertainty Quantification in Detection Transformers: Object-Level Calibration and Image-Level Reliability

Pith reviewed 2026-05-23 07:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords object detection · DETR · calibration error · uncertainty quantification · Hungarian matching · transformer · post-processing

The pith

DETRs train one prediction per object to be well-calibrated while forcing the others to suppress their confidence scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DETR object detectors output hundreds of predictions per image, far more than the number of objects. The paper establishes that the Hungarian matching loss creates a specialist division of labor: one prediction per object learns to produce accurate confidence, while the remaining predictions are pushed to output near-zero foreground probability. This division is optimal for the loss but leaves the calibrated predictions unidentifiable at inference time. As a result, any post-processing selection risks mixing predictions of different calibration quality, and standard metrics cannot evaluate the combined system. The authors introduce an object-centric calibration error to address this gap and propose a framework for image-level reliability prediction.

Core claim

DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when maintaining accurate localization. This strategy emerges as the loss-minimizing solution to the Hungarian matching, fundamentally shaping DETRs' outputs. While selecting the well-calibrated predictions is ideal, they are unidentifiable at inference time. This means that any post-processing algorithm poses a risk of outputting a set of predictions with mixed calibration levels.
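The one-to-one assignment behind this claim can be illustrated with a toy example. The sketch below is a minimal illustration, not the paper's implementation: the cost function `match_cost` is a hypothetical simplification (negative foreground probability plus L1 box distance), the exhaustive search stands in for the Hungarian algorithm on a toy-sized problem, and all numbers are invented.

```python
from itertools import permutations


def match_cost(pred, gt_box):
    """Hypothetical matching cost: -foreground prob + L1 box distance."""
    p_fg, box = pred
    return -p_fg + sum(abs(a - b) for a, b in zip(box, gt_box))


def match_brute_force(preds, gts):
    """Return one prediction index per ground-truth object with minimal
    total cost. Exhaustive search, a stand-in for the Hungarian algorithm
    that is only practical at toy sizes."""
    return min(
        permutations(range(len(preds)), len(gts)),
        key=lambda idx: sum(match_cost(preds[i], g) for i, g in zip(idx, gts)),
    )


# Toy image: 2 objects, 5 queries. Boxes are (cx, cy, w, h); values invented.
gts = [(0.30, 0.30, 0.20, 0.20), (0.70, 0.60, 0.30, 0.40)]
preds = [
    (0.90, (0.31, 0.30, 0.20, 0.21)),  # near object 0, confident
    (0.10, (0.30, 0.31, 0.20, 0.20)),  # near object 0, low confidence
    (0.80, (0.69, 0.60, 0.31, 0.40)),  # near object 1, confident
    (0.05, (0.70, 0.61, 0.30, 0.39)),  # near object 1, low confidence
    (0.02, (0.10, 0.90, 0.05, 0.05)),  # background
]

matched = match_brute_force(preds, gts)
# Matched queries receive a foreground target; every other query receives
# the "no object" target, even when its box is accurate.
targets = ["object" if i in matched else "no object" for i in range(len(preds))]
print(matched, targets)
```

In this toy setup, queries 1 and 3 localize their objects well but still receive the "no object" target, which is the training signal that pushes their confidence toward zero.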

What carries the argument

The specialist strategy that arises as the loss-minimizing solution to the Hungarian matching in DETR training.
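A simplified way to see why loss minimization produces this split is sketched below. It assumes a binary foreground/background cross-entropy and treats the matching outcome as random with probability π for a given query and image; both are illustrative assumptions, and the paper's own derivation may differ.

```latex
% p   : foreground probability output by a query
% \pi : assumed probability that matching assigns this query to an object
\mathbb{E}[\ell(p)] = -\pi \log p - (1-\pi)\log(1-p)

\frac{\partial\, \mathbb{E}[\ell(p)]}{\partial p}
  = -\frac{\pi}{p} + \frac{1-\pi}{1-p} = 0
  \quad\Longrightarrow\quad p^{*} = \pi
```

Under these assumptions, a query that reliably wins the match is driven toward a confidence that tracks how often it is matched, while a duplicate that almost never wins has π ≈ 0 and is driven toward p* ≈ 0 regardless of how accurate its box is.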

Load-bearing premise

The well-calibrated specialist predictions remain unidentifiable from the model's output at inference time.

What would settle it

Finding a post-processing algorithm that can reliably isolate only the well-calibrated predictions for every image would demonstrate that the unidentifiability assumption does not hold in practice.

Figures

Figures reproduced from arXiv: 2412.01782 by Carson Sobolewski, Navid Azizan, Young-Jin Park.

Figure 1: DETR generates hundreds of predictions for each image, resulting in multiple predictions per object, with at least one (i.e., …
Figure 2: A diagram of the DETR architecture. An input image is first processed …
Figure 3: Visualizations of the predictions generated by Cal-DETR. The optimal positive prediction (indexed by 0 and …
Figure 4: A visualization of the difference in calibration between positive …
Figure 5: Impact of confidence threshold selection on various performance and …
Figure 6: Impact of the scaling factor (λ) on image-level UQ performance of ContrastiveConf (OCE). Pearson correlation coefficient (PCC) using various scaling factors is reported. The optimal scaling factor lies within the range of 5.0 to 10.0, while this range generalizes well across out-of-distribution datasets. Furthermore, it shows the efficacy of ContrastiveConf over Conf+ (i.e., ContrastiveConf with λ = 0.0). …
Figure 7: Impact of parameter selection on OCE (y-axis inverted) and the Pearson correlation coefficient (PCC) between …
Figure 8: Exemplary visualization demonstrating the impact of parameter selection on the final subset of predictions in Cal-DETR for different post-processing …
Figure 9: A visualization of the difference in calibration between positive and negative predictions on the …
Figure 10: Impact of confidence threshold selection on various performance metrics in UP-DETR, Deformable-DETR, Cal-DETR, and DINO on …
Figure 11: Impact of parameter selection on OCE (y-axis inverted) and the Pearson correlation coefficient (PCC) between …
Original abstract

DETR and its variants have emerged as promising architectures for object detection, offering an end-to-end prediction pipeline. In practice, however, DETRs generate hundreds of predictions that far outnumber the actual objects present in an image. This raises a critical question: which of these predictions could be trusted? This is particularly important for safety-critical applications, such as in autonomous vehicles. Addressing this concern, we provide empirical and theoretical evidence that predictions within the same image play distinct roles, resulting in varying reliability levels. Our analysis reveals that DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when maintaining accurate localization. We show that this strategy emerges as the loss-minimizing solution to the Hungarian matching, fundamentally shaping DETRs' outputs. While selecting the well-calibrated predictions is ideal, they are unidentifiable at inference time. This means that any post-processing algorithm poses a risk of outputting a set of predictions with mixed calibration levels. Therefore, practical deployment necessitates a joint evaluation of both the model's calibration quality and the effectiveness of the post-processing algorithm. However, we demonstrate that existing metrics like average precision and expected calibration error are inadequate for this task. To address this issue, we further introduce Object-level Calibration Error (OCE): This object-centric design penalizes both retaining suppressed predictions and missed ground truth foreground objects, making OCE suitable for both evaluating models and identifying reliable prediction subsets. Finally, we present a post hoc uncertainty quantification framework that predicts per-image model accuracy.
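For readers unfamiliar with expected calibration error, the following is a minimal sketch of the standard binned definition, applied to whatever set of predictions survives post-processing. It is not the paper's OCE and not necessarily the detection-specific ECE variant the authors evaluate; the function name, the equal-width binning, and the toy inputs are assumptions, and in detection the `correct` flags would typically come from IoU-based matching against ground truth.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; confidence 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Invented example: four confident predictions (three correct) and two
# low-confidence predictions (both incorrect).
confidences = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
correct = [1, 1, 1, 0, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

Because the score is computed only over the predictions passed in, the same model can receive different values depending on which subset the post-processing step retains.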

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that DETRs adopt an optimal specialist strategy induced by the Hungarian matching loss: exactly one prediction per ground-truth object is trained to be well-calibrated while the remaining predictions suppress foreground confidence to near zero (even with accurate localization). It asserts that these well-calibrated predictions are unidentifiable at inference time, rendering standard metrics (AP, ECE) inadequate for joint assessment of calibration and post-processing; it therefore introduces the Object-level Calibration Error (OCE) and a post-hoc per-image uncertainty quantification framework.

Significance. If the specialist strategy and unidentifiability claims hold with supporting derivations and experiments, the work would offer a mechanistic explanation for DETR output structure and motivate an object-centric calibration metric suited to detection pipelines, with relevance to safety-critical applications.

major comments (2)
  1. [Abstract] The assertion that well-calibrated predictions are unidentifiable at inference time is in tension with the specialist strategy itself. If unmatched predictions are driven to near-zero foreground probability, a fixed confidence threshold or top-k selection would isolate the reliable subset without mixing calibration levels. This undercuts the premise that identification is impossible and weakens the case for the joint-evaluation argument and the OCE metric.
  2. [Abstract] The manuscript states that it provides 'empirical and theoretical evidence' for the specialist strategy and the inadequacy of AP/ECE, yet the abstract contains no methods, loss derivations, or experimental details that would allow verification of these claims. The support for the central load-bearing assertions therefore cannot be assessed from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications drawn from the full paper and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The assertion that well-calibrated predictions are unidentifiable at inference time is in tension with the specialist strategy itself. If unmatched predictions are driven to near-zero foreground probability, a fixed confidence threshold or top-k selection would isolate the reliable subset without mixing calibration levels. This undercuts the premise that identification is impossible and weakens the case for the joint-evaluation argument and the OCE metric.

    Authors: The specialist strategy does drive unmatched predictions toward near-zero foreground probability as the loss-minimizing outcome of Hungarian matching. However, this does not resolve identifiability at inference. Because matching occurs only during training against ground truth, no equivalent signal exists at test time to designate which of the (often hundreds of) predictions per object is the specialist. Empirical analysis in the manuscript shows that confidence distributions of specialist and suppressed predictions exhibit overlap due to optimization dynamics, initialization, and image-specific factors; consequently, any fixed threshold or top-k selection risks retaining suppressed predictions (with poor calibration) or discarding well-calibrated ones. This mixing is precisely why standard post-processing cannot be assumed to isolate reliable subsets, motivating the joint calibration-plus-post-processing evaluation and the object-centric OCE metric. We will add a clarifying paragraph in Section 3.2 and the discussion to make this distinction explicit. revision: partial

  2. Referee: [Abstract] The manuscript states that it provides 'empirical and theoretical evidence' for the specialist strategy and the inadequacy of AP/ECE, yet the abstract contains no methods, loss derivations, or experimental details that would allow verification of these claims. The support for the central load-bearing assertions therefore cannot be assessed from the provided text.

    Authors: Abstracts are intentionally concise summaries and do not contain derivations or experimental details; the full manuscript supplies both. Section 3 derives the specialist strategy as the unique loss-minimizing assignment under the Hungarian bipartite matching objective, while Section 4 presents extensive experiments (including per-object calibration histograms and comparisons against AP/ECE) demonstrating that standard metrics fail to capture the mixed-calibration risk. The abstract's phrasing is therefore supported by the body of the paper. No revision to the abstract itself is required, though we can expand the contribution statement in the introduction if the editor prefers. revision: no
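The disagreement in the first response, over whether a fixed threshold or top-k rule can isolate the specialists, can be made concrete with a small sketch. The confidences and the `role` labels below are invented for illustration and are not data from the paper; the `role` field stands for the training-time matching outcome, which is not available at inference.

```python
def select_by_threshold(preds, tau):
    """Keep predictions whose confidence is at least tau."""
    return [p for p in preds if p["conf"] >= tau]


def select_top_k(preds, k):
    """Keep the k most confident predictions."""
    return sorted(preds, key=lambda p: p["conf"], reverse=True)[:k]


# Hypothetical image with two objects. One specialist is confident, the
# other is not, and one suppressed duplicate sits between them.
preds = [
    {"conf": 0.92, "role": "specialist"},
    {"conf": 0.41, "role": "suppressed"},
    {"conf": 0.35, "role": "specialist"},
    {"conf": 0.08, "role": "suppressed"},
    {"conf": 0.03, "role": "suppressed"},
]

low_tau_roles = [p["role"] for p in select_by_threshold(preds, 0.3)]
high_tau_roles = [p["role"] for p in select_by_threshold(preds, 0.5)]
top2_roles = [p["role"] for p in select_top_k(preds, 2)]
print(low_tau_roles, high_tau_roles, top2_roles)
```

With these made-up numbers, a low threshold retains a suppressed prediction, a high threshold drops one specialist, and top-2 swaps a specialist for a suppressed duplicate. Whether real DETR confidences overlap this way is the empirical question the referee and the simulated authors disagree on.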

Circularity Check

0 steps flagged

No circularity; claims derive from loss analysis without reduction to inputs

full rationale

The paper states that the specialist strategy 'emerges as the loss-minimizing solution to the Hungarian matching' and that well-calibrated predictions are unidentifiable, leading to the need for OCE. No equations or steps are shown that reduce this claim to a fitted parameter, self-definition, or self-citation chain by construction. The derivation is presented as an analysis of standard DETR training and post-processing, remaining self-contained against external benchmarks like the Hungarian algorithm itself. No load-bearing self-citations or ansatzes are quoted that would force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the specialist strategy is described as emerging from an existing loss rather than from new postulates.

pith-pipeline@v0.9.0 · 5824 in / 1115 out tokens · 29115 ms · 2026-05-23T07:45:44.315433+00:00 · methodology
