ALFA: Agglomerative Late Fusion Algorithm for Object Detection

Evgenii Razinkov; Iuliia Saveleva; Jiří Matas

arxiv: 1907.06067 · v1 · pith:PCRUZHPYnew · submitted 2019-07-13 · 💻 cs.CV

Pith reviewed 2026-05-24 21:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: object detection · late fusion · agglomerative clustering · PASCAL VOC · bounding box · detector fusion · SSD · Faster R-CNN

The pith

ALFA fuses multiple object detector predictions with agglomerative clustering to achieve lower error on PASCAL VOC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALFA, a late fusion algorithm for object detection that clusters predictions from different detectors. The clustering uses both bounding box locations and class scores to group detections that likely belong to the same object. Each group then produces a single hypothesis by weighted averaging of the boxes. Tested on PASCAL VOC 2007 and 2012 with SSD, DeNet, and Faster R-CNN, it outperforms the individual detectors and the DBF fusion method. A reader would care if they want to improve detection accuracy by combining existing models.

Core claim

ALFA is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state of the art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, achieving up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF (Dynamic Belief Fusion).
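The "lower error" figures are relative quantities, and the abstract does not define the error itself. A minimal sketch of the arithmetic, assuming that error means 100 minus mAP (in percent); the function name and the numbers are hypothetical and are not taken from the paper:

```python
def relative_error_reduction(map_baseline, map_fused):
    """Relative drop in error, ASSUMING error = 100 - mAP (mAP in percent)."""
    err_base = 100.0 - map_baseline
    err_fused = 100.0 - map_fused
    if err_base <= 0:
        raise ValueError("baseline error must be positive")
    return (err_base - err_fused) / err_base


# Hypothetical numbers for illustration only: a baseline at 80 mAP and a
# fused result at 85 mAP give errors of 20 and 15, a 25% relative reduction.
print(relative_error_reduction(80.0, 85.0))  # 0.25
```

Under this reading, a large relative reduction can correspond to a modest absolute mAP gain when the baseline is already strong, which is worth keeping in mind when comparing the 32% and 6% figures.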

What carries the argument

Agglomerative clustering of bounding box predictions and class scores from multiple detectors to form object hypotheses via weighted box combination.
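The mechanism can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the similarity (box IoU multiplied by a Bhattacharyya coefficient over class scores), the average linkage, the `cut` threshold, and the confidence-weighted box average are all choices made here for the example, and the names `similarity` and `fuse` are hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def similarity(d1, d2):
    """ASSUMED similarity: box overlap scaled by class-score agreement."""
    cls_sim = sum((p * q) ** 0.5 for p, q in zip(d1["scores"], d2["scores"]))
    return iou(d1["box"], d2["box"]) * cls_sim


def fuse(detections, cut=0.5):
    """Average-linkage agglomeration, then one weighted box per cluster."""
    clusters = [[d] for d in detections]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
            avg = sum(sims) / len(sims)
            if avg > best:
                best, pair = avg, (i, j)
        if best < cut:  # no remaining pair is similar enough to merge
            break
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)

    fused = []
    for cluster in clusters:
        # ASSUMED weights: each detection's top class score (must be > 0).
        weights = [max(d["scores"]) for d in cluster]
        total = sum(weights)
        box = tuple(
            sum(w * d["box"][k] for w, d in zip(weights, cluster)) / total
            for k in range(4)
        )
        n_cls = len(cluster[0]["scores"])
        scores = [sum(d["scores"][k] for d in cluster) / len(cluster) for k in range(n_cls)]
        fused.append({"box": box, "scores": scores})
    return fused
```

Each detection is a dict with a `box` tuple and a `scores` list, so outputs from any number of detectors can simply be concatenated before calling `fuse`.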

Load-bearing premise

The clustering step groups predictions from the same object correctly without systematic errors from over-merging or splitting detections.

What would settle it

An evaluation on images with closely spaced objects where detectors produce conflicting boxes, checking whether the mAP drops below that of the best single detector.
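A toy calculation shows why closely spaced objects are the interesting case. The sketch below uses hypothetical coordinates and a plain unweighted mean of the boxes (an assumption, chosen for simplicity): if boxes belonging to two neighbouring objects are merged into one cluster, the averaged box can fall below the 0.5 IoU threshold used in PASCAL VOC evaluation for both objects.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_box(boxes):
    """Unweighted coordinate-wise mean of a list of boxes."""
    return tuple(sum(b[k] for b in boxes) / len(boxes) for k in range(4))


# Two hypothetical ground-truth objects standing side by side.
gt_a = (0, 0, 40, 100)
gt_b = (30, 0, 70, 100)

# Suppose one perfect detection per object ends up in the same cluster.
merged = mean_box([gt_a, gt_b])

print(merged)             # (15.0, 0.0, 55.0, 100.0)
print(iou(merged, gt_a))  # about 0.45, below the 0.5 threshold
print(iou(merged, gt_b))  # about 0.45, below the 0.5 threshold
```

In this constructed case the over-merged hypothesis matches neither object, so two correct detections become zero, which is the failure mode the proposed evaluation would look for.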

Figures

Figures reproduced from arXiv: 1907.06067 by Evgenii Razinkov, Iuliia Saveleva, Jiří Matas.

**Figure 1.** Image from PASCAL VOC 2007 test set. Bounding boxes and IoU with ground truth: DeNet – red (IoU = 0.75); SSD – green (IoU = 0.77); ALFA – blue (IoU = 0.93). Ground truth bounding box is in white.

Original abstract

We propose ALFA - a novel late fusion algorithm for object detection. ALFA is based on agglomerative clustering of object detector predictions taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state of the art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, achieving up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF - Dynamic Belief Fusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ALFA clusters detector outputs by box location plus class score then averages per cluster; it claims gains over DBF on VOC but the abstract gives almost no protocol or validation details.

ALFA's main move is to run agglomerative clustering on the combined space of bounding-box coordinates and class scores from multiple detectors, treat each cluster as one object, and output a weighted box. That is the concrete novelty relative to plain NMS or the DBF baseline it cites. The method is post-processing only, so it can be dropped on top of existing detectors without retraining, which is a practical plus for anyone already running SSD, DeNet, or Faster R-CNN ensembles on VOC-style data. The reported numbers, up to 32% lower error than the best single detector and 6% lower than DBF on both 2007 and 2012, are the sort of incremental improvement that could matter in a pipeline where every point of mAP counts. The abstract is clear about the high-level idea and the datasets, and the authors position the work against the right prior art. That is the credit it earns.

The soft spots are exactly where the stress-test note flags them. No information is given on how the linkage, distance metric, or dendrogram cut is chosen, whether those choices were tuned on the test set, or whether the resulting clusters actually align with ground-truth objects rather than merging nearby distinct ones or splitting multi-detector hits on the same object. Without ablations, error bars, or even a description of the train/val/test splits used for the fusion step, the numerical claims cannot be assessed from the text supplied. The experimental section is simply missing from what is visible.

This paper is aimed at researchers who build or tune object-detection ensembles and want a lightweight late-fusion option. A reader already working on VOC or similar benchmarks could extract the clustering recipe and test it quickly. It is coherent on its own terms and the central claim is falsifiable, so it clears the bar for a serious referee even though the current version would need the missing experimental details filled in before acceptance.
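Why the unspecified linkage matters can be seen on a three-box chain. The sketch below is illustrative only: it clusters on box IoU alone, and the `agglomerate` helper, the coordinates, and the 0.5 cut are assumptions rather than anything stated in the text reviewed here. Passing `max` as the linkage gives single linkage (closest pair decides), and `min` gives complete linkage (farthest pair decides).

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def agglomerate(boxes, cut, linkage):
    """Merge clusters while the best linkage similarity is at least `cut`."""
    clusters = [[i] for i in range(len(boxes))]
    while len(clusters) > 1:
        scored = [
            (linkage([iou(boxes[a], boxes[b]) for a in clusters[i] for b in clusters[j]]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        ]
        best, i, j = max(scored)
        if best < cut:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# A chain: neighbours overlap well (IoU about 0.54), the two ends do not (0.25).
boxes = [(0, 0, 100, 100), (30, 0, 130, 100), (60, 0, 160, 100)]

single = agglomerate(boxes, cut=0.5, linkage=max)    # one cluster of three
complete = agglomerate(boxes, cut=0.5, linkage=min)  # two clusters
print(len(single), len(complete))  # 1 2
```

The same detections and the same cut yield one hypothesis or two depending on the linkage, so reported mAP gains are hard to interpret without knowing which was used.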

Referee Report

3 major / 1 minor

Summary. The paper proposes ALFA, a late-fusion algorithm that applies agglomerative clustering to object-detector predictions (using both bounding-box locations and class scores) so that each resulting cluster represents a single object whose final box is a weighted average of the clustered boxes. It evaluates the method on combinations of SSD+DeNet and SSD+DeNet+Faster R-CNN and claims state-of-the-art mAP on PASCAL VOC 2007 and 2012, with up to 32 % lower error than the best single detector and up to 6 % lower error than the reference fusion method DBF.

Significance. If the numerical claims can be reproduced, ALFA would supply a lightweight, training-free post-processing step that improves detection accuracy by fusing off-the-shelf detectors. The approach is conceptually straightforward and targets a practical need in detector ensembles.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: performance numbers (32 % and 6 % error reductions) are stated without any description of experimental protocol, dataset splits, evaluation code, number of runs, or error bars, so the central empirical claim cannot be verified from the manuscript.
[Method] Method description (agglomerative clustering): the claim that each cluster corresponds to exactly one ground-truth object is load-bearing for the entire fusion argument, yet no analysis, validation against ground-truth groupings, or sensitivity study is provided for the chosen linkage, distance metric (bbox + score), or dendrogram cut threshold.
[Method / Experiments] No information is given on how clustering parameters or the weighting coefficients are selected or whether they were tuned on the test set; this directly affects whether the reported gains are independent of the evaluation data.

minor comments (1)

[Method] Notation for the distance function and the weighted-box formula should be made explicit with equations rather than prose.
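To make the request concrete, explicit notation could take a form like the following. The symbols and functional forms are placeholders chosen here for illustration; they are assumptions and are not the paper's definitions. Here $b_i$ is the box of detection $d_i$, $c_i$ its class-score vector, $\kappa$ some class-agreement kernel, $\gamma$ an exponent, $w_i$ a per-detection weight, and $C$ a cluster.

```latex
% Illustrative notation only (assumed, not taken from the paper).
% Pairwise similarity combining box overlap and class-score agreement:
s(d_i, d_j) = \mathrm{IoU}(b_i, b_j)^{\gamma} \, \kappa(c_i, c_j)

% Fused box of a cluster C as a normalized weighted combination:
\hat{b}_C = \frac{\sum_{i \in C} w_i \, b_i}{\sum_{i \in C} w_i}
```

Stating the method at this level would let a reader see at a glance which quantities are free parameters ($\gamma$, the weights, the cut threshold) and therefore need a documented selection procedure.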

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve verifiability and clarity of the experimental and methodological details.

Referee: [Abstract / Experiments] Abstract and Experiments section: performance numbers (32 % and 6 % error reductions) are stated without any description of experimental protocol, dataset splits, evaluation code, number of runs, or error bars, so the central empirical claim cannot be verified from the manuscript.

Authors: We agree that the manuscript would benefit from an explicit description of the evaluation protocol. The reported results follow the standard PASCAL VOC 2007 and 2012 test-set evaluation using the official VOC evaluation code; detectors were trained on the standard train/val splits while ALFA itself requires no training. In the revised version we will add a dedicated paragraph in the Experiments section stating the protocol, confirming a single run per configuration (standard practice for these benchmarks), and noting that error bars are not conventionally reported for mAP on VOC but that the relative gains hold across the tested detector combinations. revision: yes

Referee: [Method] Method description (agglomerative clustering): the claim that each cluster corresponds to exactly one ground-truth object is load-bearing for the entire fusion argument, yet no analysis, validation against ground-truth groupings, or sensitivity study is provided for the chosen linkage, distance metric (bbox + score), or dendrogram cut threshold.

Authors: The manuscript describes each cluster as representing a single object hypothesis rather than asserting an exact one-to-one correspondence with ground-truth objects. To strengthen the presentation we will include a sensitivity study on linkage method, distance metric, and dendrogram cut threshold, together with a brief empirical check of cluster-to-ground-truth alignment on a sample of images from the validation set. revision: yes

Referee: [Method / Experiments] No information is given on how clustering parameters or the weighting coefficients are selected or whether they were tuned on the test set; this directly affects whether the reported gains are independent of the evaluation data.

Authors: All clustering parameters and weighting coefficients were chosen on the PASCAL VOC validation sets; the test sets were used only for final reporting. We will add an explicit statement of this procedure and the concrete parameter values in the revised Method section. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic method with independent empirical evaluation

The paper presents ALFA as an agglomerative clustering post-processing step on detector outputs, with each cluster's location defined as a weighted combination of boxes. No equations, parameters fitted on test data, or self-citation chains are shown that reduce the reported mAP gains to quantities defined by the same inputs. Evaluation uses standard PASCAL VOC benchmarks with explicit comparisons to individual detectors and DBF; the clustering step is described as a fixed procedure rather than a fitted model whose outputs are relabeled as predictions. This is the common case of a self-contained algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.
