Uncertainty Quantification in Detection Transformers: Object-Level Calibration and Image-Level Reliability
Pith reviewed 2026-05-23 07:45 UTC · model grok-4.3
The pith
DETRs train one prediction per object to be well-calibrated while forcing the others to suppress their confidence scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when maintaining accurate localization. This strategy emerges as the loss-minimizing solution to the Hungarian matching, fundamentally shaping DETRs' outputs. While selecting the well-calibrated predictions is ideal, they are unidentifiable at inference time. This means that any post-processing algorithm poses a risk of outputting a set of predictions with mixed calibration levels.
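For reference, the Hungarian matching objective that the core claim refers to can be written as below. The notation follows the original DETR formulation by Carion et al. and is an assumption here; the reviewed paper may use different symbols or a variant of the classification term.

```latex
% y_i = (c_i, b_i): ground-truth class and box, padded with \varnothing up to N slots
% \hat{y}_j = (\hat{p}_j, \hat{b}_j): j-th prediction (class distribution and box)
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N}
    \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)

\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) =
    \sum_{i=1}^{N} \Bigl[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    + \mathbb{1}_{\{c_i \neq \varnothing\}}\,
      \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\hat{\sigma}(i)}\bigr) \Bigr]
```

Under this objective, each ground-truth object with $c_i \neq \varnothing$ is paired with exactly one prediction, and every prediction matched to a padded $\varnothing$ slot is trained toward the no-object class regardless of how well its box is localized.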
What carries the argument
The specialist strategy that arises as the loss-minimizing solution to the Hungarian matching in DETR training.
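A minimal sketch of how one-to-one matching produces the classification targets behind this strategy is given here. It is an illustration under stated assumptions, not the paper's code: the toy cost (negative foreground probability plus L1 box distance), the function names, and the data are all hypothetical, and an exhaustive search stands in for the Hungarian algorithm because the example is tiny.

```python
from itertools import permutations
import math


def match_cost(pred, gt_box):
    """Toy matching cost: negative foreground probability plus L1 box distance."""
    p_fg, box = pred
    return -p_fg + sum(abs(a - b) for a, b in zip(box, gt_box))


def brute_force_match(preds, gts):
    """Return a tuple where entry j is the prediction index matched to gts[j].

    Exhaustive search over one-to-one assignments; a stand-in for the
    Hungarian algorithm that is only practical for toy sizes.
    """
    best, best_cost = None, math.inf
    for assign in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[i], gts[j]) for j, i in enumerate(assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best


def classification_targets(preds, gts):
    """1 = foreground target for matched predictions, 0 = 'no object' otherwise."""
    matched = set(brute_force_match(preds, gts))
    return [1 if i in matched else 0 for i in range(len(preds))]


# Hypothetical toy data: (foreground probability, box as (cx, cy, w, h)).
preds = [
    (0.90, (0.50, 0.50, 0.20, 0.20)),  # well localized
    (0.85, (0.51, 0.50, 0.20, 0.20)),  # near-duplicate, also well localized
    (0.10, (0.10, 0.90, 0.05, 0.05)),  # far from the object
]
gts = [(0.50, 0.50, 0.20, 0.20)]  # a single ground-truth box

targets = classification_targets(preds, gts)
print(targets)
```

In this toy case the near-duplicate prediction receives a "no object" target even though its box is almost as accurate as the matched one, which is the situation the specialist strategy describes.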
Load-bearing premise
The well-calibrated specialist predictions remain unidentifiable from the model's output at inference time.
What would settle it
Finding a post-processing algorithm that can reliably isolate only the well-calibrated predictions for every image would demonstrate that the unidentifiability assumption does not hold in practice.
Original abstract
DETR and its variants have emerged as promising architectures for object detection, offering an end-to-end prediction pipeline. In practice, however, DETRs generate hundreds of predictions that far outnumber the actual objects present in an image. This raises a critical question: which of these predictions could be trusted? This is particularly important for safety-critical applications, such as in autonomous vehicles. Addressing this concern, we provide empirical and theoretical evidence that predictions within the same image play distinct roles, resulting in varying reliability levels. Our analysis reveals that DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when maintaining accurate localization. We show that this strategy emerges as the loss-minimizing solution to the Hungarian matching, fundamentally shaping DETRs' outputs. While selecting the well-calibrated predictions is ideal, they are unidentifiable at inference time. This means that any post-processing algorithm poses a risk of outputting a set of predictions with mixed calibration levels. Therefore, practical deployment necessitates a joint evaluation of both the model's calibration quality and the effectiveness of the post-processing algorithm. However, we demonstrate that existing metrics like average precision and expected calibration error are inadequate for this task. To address this issue, we further introduce Object-level Calibration Error (OCE): This object-centric design penalizes both retaining suppressed predictions and missed ground truth foreground objects, making OCE suitable for both evaluating models and identifying reliable prediction subsets. Finally, we present a post hoc uncertainty quantification framework that predicts per-image model accuracy.
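The abstract names expected calibration error (ECE) as one of the existing metrics. The sketch below shows the common equal-width binned form of ECE in plain Python. It is a generic illustration, not the paper's evaluation code: the function name, the 10-bin default, and the toy inputs are assumptions, and in detection the `correct` flag would typically come from an IoU-based match to ground truth, which is not modeled here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy inputs: two confident correct predictions, two suppressed ones.
toy_conf = [0.9, 0.9, 0.1, 0.1]
toy_correct = [1, 1, 0, 0]
print(expected_calibration_error(toy_conf, toy_correct))
```

Because the score is an average over whichever predictions are passed in, its value depends on the subset that post-processing retains, which is why the abstract argues for evaluating the model and the post-processing step together.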
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that DETRs adopt an optimal specialist strategy induced by the Hungarian matching loss: exactly one prediction per ground-truth object is trained to be well-calibrated while the remaining predictions suppress foreground confidence to near zero (even with accurate localization). It asserts that these well-calibrated predictions are unidentifiable at inference time, rendering standard metrics (AP, ECE) inadequate for joint assessment of calibration and post-processing; it therefore introduces the Object-level Calibration Error (OCE) and a post-hoc per-image uncertainty quantification framework.
Significance. If the specialist strategy and unidentifiability claims hold with supporting derivations and experiments, the work would offer a mechanistic explanation for DETR output structure and motivate an object-centric calibration metric suited to detection pipelines, with relevance to safety-critical applications.
Major comments (2)
- [Abstract] The claim that well-calibrated predictions are unidentifiable at inference time is in tension with the specialist strategy itself. If unmatched predictions are driven to near-zero foreground probability, a fixed confidence threshold or top-k selection would isolate the reliable subset without mixing calibration levels. That undercuts the premise that identification is impossible and weakens the case for the joint-evaluation argument and the OCE metric.
- [Abstract] The manuscript states that it provides 'empirical and theoretical evidence' for the specialist strategy and for the inadequacy of AP/ECE, yet the abstract contains no methods, loss derivations, or experimental details that would allow these claims to be verified. Support for the central load-bearing assertions therefore cannot be assessed from the provided text.
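The two post-processing rules named in the first comment, a fixed confidence threshold and top-k selection, can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up scores; it makes no claim about which rule the paper evaluates or how its predictions are distributed.

```python
def select_by_threshold(scores, tau):
    """Keep the indices of predictions whose confidence is at least tau."""
    return [i for i, s in enumerate(scores) if s >= tau]


def select_top_k(scores, k):
    """Keep the indices of the k highest-confidence predictions, in index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical per-prediction foreground confidences for one image.
scores = [0.99, 0.69, 0.12, 0.10, 0.13, 0.12]

print(select_by_threshold(scores, 0.3))
print(select_top_k(scores, 3))
```

With these made-up scores the threshold keeps two predictions while top-3 also admits one low-confidence prediction, so the retained set depends on the rule and its parameter. Whether either rule separates specialist from suppressed predictions on real DETR outputs is the point the referee and authors dispute.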
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications drawn from the full paper and indicate planned revisions where appropriate.
Point-by-point responses
- Referee: [Abstract] The claim that well-calibrated predictions are unidentifiable at inference time is in tension with the specialist strategy itself. If unmatched predictions are driven to near-zero foreground probability, a fixed confidence threshold or top-k selection would isolate the reliable subset without mixing calibration levels. That undercuts the premise that identification is impossible and weakens the case for the joint-evaluation argument and the OCE metric.
  Authors: The specialist strategy does drive unmatched predictions toward near-zero foreground probability as the loss-minimizing outcome of Hungarian matching. However, this does not resolve identifiability at inference. Because matching against ground truth happens only during training, no equivalent signal exists at test time to indicate which of the (often hundreds of) predictions is the specialist for a given object. Empirical analysis in the manuscript shows that the confidence distributions of specialist and suppressed predictions overlap because of optimization dynamics, initialization, and image-specific factors. Consequently, any fixed threshold or top-k selection risks retaining suppressed predictions (with poor calibration) or discarding well-calibrated ones. This mixing is precisely why standard post-processing cannot be assumed to isolate reliable subsets, and it motivates both the joint calibration-plus-post-processing evaluation and the object-centric OCE metric. We will add a clarifying paragraph in Section 3.2 and the discussion to make this distinction explicit.
  Revision: partial
- Referee: [Abstract] The manuscript states that it provides 'empirical and theoretical evidence' for the specialist strategy and for the inadequacy of AP/ECE, yet the abstract contains no methods, loss derivations, or experimental details that would allow these claims to be verified. Support for the central load-bearing assertions therefore cannot be assessed from the provided text.
  Authors: Abstracts are intentionally concise summaries and do not contain derivations or experimental details; the full manuscript supplies both. Section 3 derives the specialist strategy as the unique loss-minimizing assignment under the Hungarian bipartite matching objective, while Section 4 presents extensive experiments (including per-object calibration histograms and comparisons against AP/ECE) demonstrating that standard metrics fail to capture the mixed-calibration risk. The abstract's phrasing is therefore supported by the body of the paper. No revision to the abstract itself is required, though we can expand the contribution statement in the introduction if the editor prefers.
  Revision: no
Circularity Check
No circularity: the claims derive from loss analysis and do not reduce to their own inputs.
Full rationale
The paper states that the specialist strategy 'emerges as the loss-minimizing solution to the Hungarian matching' and that well-calibrated predictions are unidentifiable, which leads to the need for OCE. No equations or steps are shown that reduce this claim, by construction, to a fitted parameter, a self-definition, or a self-citation chain. The derivation is presented as an analysis of standard DETR training and post-processing, and it relies on externally established components such as the Hungarian algorithm rather than on the paper's own outputs. No load-bearing self-citations or ansatzes are quoted that would force the result.