ALFA: Agglomerative Late Fusion Algorithm for Object Detection
Pith reviewed 2026-05-24 21:57 UTC · model grok-4.3
The pith
ALFA fuses multiple object detector predictions with agglomerative clustering to achieve lower error on PASCAL VOC.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ALFA is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state of the art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, achieving up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF - Dynamic Belief Fusion.
What carries the argument
Agglomerative clustering of bounding box predictions and class scores from multiple detectors to form object hypotheses via weighted box combination.
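The fusion step can be illustrated with a minimal Python sketch, under stated assumptions. This page does not give the paper's distance function, linkage, cut threshold, or box weighting, so every such choice below is an illustrative assumption rather than ALFA's actual definition: the distance is taken as 1 minus IoU times the cosine similarity of the class-score vectors, clusters are merged by average linkage until the closest pair exceeds a hypothetical `cut` of 0.5, and each box is weighted by its top class score. The names `iou`, `distance`, and `fuse` are likewise hypothetical.

```python
import math
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def distance(det_a, det_b):
    """Assumed dissimilarity: 1 - IoU * cosine similarity of class scores."""
    (box_a, sc_a), (box_b, sc_b) = det_a, det_b
    dot = sum(p * q for p, q in zip(sc_a, sc_b))
    norm = math.sqrt(sum(p * p for p in sc_a)) * math.sqrt(sum(q * q for q in sc_b))
    return 1.0 - iou(box_a, box_b) * dot / norm


def fuse(dets, cut=0.5):
    """Average-linkage agglomerative clustering followed by box averaging.

    dets: list of (box, class_scores) pooled from all detectors
    cut:  hypothetical distance threshold at which merging stops
    """
    clusters = [[i] for i in range(len(dets))]

    def link(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(distance(dets[i], dets[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min((link(clusters[a], clusters[b]), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        if d > cut:
            break
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)

    fused = []
    for members in clusters:
        # Assumed weighting: each box counts in proportion to its top class score.
        weights = [max(dets[i][1]) for i in members]
        total = sum(weights)
        box = tuple(sum(w * dets[i][0][k] for w, i in zip(weights, members)) / total
                    for k in range(4))
        n_classes = len(dets[members[0]][1])
        scores = tuple(sum(dets[i][1][c] for i in members) / len(members)
                       for c in range(n_classes))
        fused.append((box, scores))
    return fused


# Made-up detections: two overlapping boxes that agree on class 0, one distant box.
example = [
    ((10, 10, 50, 50), (0.9, 0.1)),
    ((12, 12, 52, 52), (0.8, 0.2)),
    ((100, 100, 140, 140), (0.1, 0.9)),
]
hypotheses = fuse(example)
```

With these made-up inputs the two overlapping boxes end up in one cluster and the distant box stays on its own, so `hypotheses` holds two object hypotheses. The pairwise search is quadratic per merge, which is acceptable for the handful of detections per image that survive score thresholding but is not tuned for speed.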
Load-bearing premise
The clustering step groups predictions from the same object correctly without systematic errors from over-merging or splitting detections.
What would settle it
An evaluation on images with closely spaced objects where detectors produce conflicting boxes, checking whether the mAP drops below that of the best single detector.
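One way to probe the premise directly, before looking at mAP, is to count how often clusters mix detections from different objects and how often one object is spread over several clusters. The Python sketch below assumes a prior matching step (not described on this page) that assigns each detection the id of the ground-truth object it overlaps, or `None` for a false positive; the function name `grouping_errors` and both input formats are hypothetical.

```python
from collections import defaultdict


def grouping_errors(cluster_ids, gt_ids):
    """Count over-merged clusters and split ground-truth objects.

    cluster_ids: cluster label of each detection
    gt_ids:      id of the ground-truth object each detection matches,
                 or None when it matches nothing (assumed matching step)
    """
    objects_in_cluster = defaultdict(set)
    clusters_of_object = defaultdict(set)
    for cluster, obj in zip(cluster_ids, gt_ids):
        if obj is None:
            continue  # false positives say nothing about merging or splitting
        objects_in_cluster[cluster].add(obj)
        clusters_of_object[obj].add(cluster)
    # A cluster is over-merged when it holds detections of two or more objects.
    over_merged = sum(len(objs) > 1 for objs in objects_in_cluster.values())
    # An object is split when its detections fall into two or more clusters.
    split = sum(len(cl) > 1 for cl in clusters_of_object.values())
    return over_merged, split


# Made-up labels: cluster 0 mixes objects "a" and "b"; object "a" spans clusters 0 and 1.
over_merged, split = grouping_errors([0, 0, 1, 2, 2], ["a", "b", "a", None, "c"])
```

Running this on crowded scenes and on sparse scenes separately would show whether grouping errors concentrate where objects are closely spaced.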
Original abstract
We propose ALFA - a novel late fusion algorithm for object detection. ALFA is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state of the art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, achieving up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF - Dynamic Belief Fusion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ALFA, a late-fusion algorithm that applies agglomerative clustering to object-detector predictions (using both bounding-box locations and class scores) so that each resulting cluster represents a single object whose final box is a weighted average of the clustered boxes. It evaluates the method on combinations of SSD+DeNet and SSD+DeNet+Faster R-CNN and claims state-of-the-art mAP on PASCAL VOC 2007 and 2012, with up to 32% lower error than the best single detector and up to 6% lower error than the reference fusion method DBF.
Significance. If the numerical claims can be reproduced, ALFA would supply a lightweight, training-free post-processing step that improves detection accuracy by fusing off-the-shelf detectors. The approach is conceptually straightforward and targets a practical need in detector ensembles.
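The "lower error" figures are relative reductions, which can look larger than the underlying mAP gain. The short Python sketch below shows the arithmetic, assuming that error means 100 minus mAP in percent; this page does not state the paper's exact definition, and the numbers in the example are illustrative only, not results from the paper.

```python
def relative_error_reduction(map_baseline, map_fused):
    """Relative drop in error, with error assumed to be 100 - mAP (in percent)."""
    err_baseline = 100.0 - map_baseline
    err_fused = 100.0 - map_fused
    return (err_baseline - err_fused) / err_baseline


# Illustrative numbers only: going from 80.0 to 85.0 mAP cuts the error from 20 to 15.
reduction = relative_error_reduction(80.0, 85.0)
```

Under this assumed definition, the same absolute mAP gain yields a larger relative reduction the closer the baseline already is to 100.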
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: performance numbers (32% and 6% error reductions) are stated without any description of experimental protocol, dataset splits, evaluation code, number of runs, or error bars, so the central empirical claim cannot be verified from the manuscript.
- [Method] Method description (agglomerative clustering): the claim that each cluster corresponds to exactly one ground-truth object is load-bearing for the entire fusion argument, yet no analysis, validation against ground-truth groupings, or sensitivity study is provided for the chosen linkage, distance metric (bbox + score), or dendrogram cut threshold.
- [Method / Experiments] No information is given on how clustering parameters or the weighting coefficients are selected or whether they were tuned on the test set; this directly affects whether the reported gains are independent of the evaluation data.
minor comments (1)
- [Method] Notation for the distance function and the weighted-box formula should be made explicit with equations rather than prose.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve verifiability and clarity of the experimental and methodological details.
- Referee: [Abstract / Experiments] Abstract and Experiments section: performance numbers (32% and 6% error reductions) are stated without any description of experimental protocol, dataset splits, evaluation code, number of runs, or error bars, so the central empirical claim cannot be verified from the manuscript.
Authors: We agree that the manuscript would benefit from an explicit description of the evaluation protocol. The reported results follow the standard PASCAL VOC 2007 and 2012 test-set evaluation using the official VOC evaluation code; detectors were trained on the standard train/val splits while ALFA itself requires no training. In the revised version we will add a dedicated paragraph in the Experiments section stating the protocol, confirming a single run per configuration (standard practice for these benchmarks), and noting that error bars are not conventionally reported for mAP on VOC but that the relative gains hold across the tested detector combinations. revision: yes
- Referee: [Method] Method description (agglomerative clustering): the claim that each cluster corresponds to exactly one ground-truth object is load-bearing for the entire fusion argument, yet no analysis, validation against ground-truth groupings, or sensitivity study is provided for the chosen linkage, distance metric (bbox + score), or dendrogram cut threshold.
Authors: The manuscript describes each cluster as representing a single object hypothesis rather than asserting an exact one-to-one correspondence with ground-truth objects. To strengthen the presentation we will include a sensitivity study on linkage method, distance metric, and dendrogram cut threshold, together with a brief empirical check of cluster-to-ground-truth alignment on a sample of images from the validation set. revision: yes
- Referee: [Method / Experiments] No information is given on how clustering parameters or the weighting coefficients are selected or whether they were tuned on the test set; this directly affects whether the reported gains are independent of the evaluation data.
Authors: All clustering parameters and weighting coefficients were chosen on the PASCAL VOC validation sets; the test sets were used only for final reporting. We will add an explicit statement of this procedure and the concrete parameter values in the revised Method section. revision: yes
Circularity Check
No circularity: algorithmic method with independent empirical evaluation
The paper presents ALFA as an agglomerative clustering post-processing step on detector outputs, with each cluster's location defined as a weighted combination of boxes. No equations, parameters fitted on test data, or self-citation chains are shown that reduce the reported mAP gains to quantities defined by the same inputs. Evaluation uses standard PASCAL VOC benchmarks with explicit comparisons to individual detectors and DBF; the clustering step is described as a fixed procedure rather than a fitted model whose outputs are relabeled as predictions. This is the common case of a self-contained algorithmic contribution.