Class-specific Anchoring Proposal for 3D Object Recognition in LIDAR and RGB Images

Amir Hossein Raffiee; Humayun Irshad

arxiv: 1907.09081 · v1 · pith:BP3P3TP7new · submitted 2019-07-22 · 💻 cs.CV


Pith reviewed 2026-05-24 18:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D object detection · anchor clustering · LIDAR RGB fusion · KITTI benchmark · pedestrian detection · class-specific anchors · regional proposal network

The pith

Class-specific anchoring by size and aspect ratio boosts 3D detection accuracy on KITTI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Class-specific Anchoring Proposal (CAP), an anchoring strategy for 3D object detectors that fuse LIDAR and RGB data. It replaces generic anchors with clusters derived separately for each class from observed sizes and aspect ratios in the training data. Experiments on a state-of-the-art detector report accuracy gains of roughly 7-9 percent on pedestrians and 1-2 percent on cars across the Easy/Moderate/Hard settings, and 12 percent on cyclists in the Easy setting only. The same clustering also improves the quality of regions proposed by the regional proposal network. The authors further identify the cluster counts per class that work best on the KITTI benchmark.

Core claim

Clustering anchors on a per-class basis using object sizes and aspect ratios from the KITTI training distribution produces a measurable rise in 3D detection accuracy and improves regional proposal quality compared with the baseline anchoring used by the current leading detector.

What carries the argument

Class-specific Anchoring Proposal (CAP), which replaces a single set of generic anchors with separate k-means clusters of size and aspect ratio computed independently for each object class.
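
The per-class clustering described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain Lloyd's k-means with Euclidean distance on (L, H, W) vectors, and the names `kmeans`, `class_specific_anchors`, and `toy_labels`, as well as the toy sizes, are hypothetical.

```python
import random
from collections import defaultdict

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance on tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct training boxes
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # Update step: move each center to its group mean (unchanged if the group is empty).
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

def class_specific_anchors(labels, k_per_class):
    """labels: iterable of (class_name, (L, H, W)); assumed to come from the training split only."""
    by_class = defaultdict(list)
    for cls, dims in labels:
        by_class[cls].append(tuple(dims))
    # One independent clustering per class, each with its own cluster count.
    return {cls: kmeans(pts, k_per_class[cls]) for cls, pts in by_class.items()}

# Hypothetical toy sizes in metres, not taken from KITTI.
toy_labels = [
    ("Car", (3.9, 1.5, 1.6)), ("Car", (4.1, 1.5, 1.6)),
    ("Car", (4.9, 1.9, 1.9)), ("Car", (5.1, 1.9, 1.9)),
    ("Pedestrian", (0.8, 1.7, 0.6)), ("Pedestrian", (0.9, 1.8, 0.7)),
]
anchors = class_specific_anchors(toy_labels, {"Car": 2, "Pedestrian": 1})
```

Each class gets its own cluster count through `k_per_class`, which is the knob the paper tunes per class. A generic anchor set would correspond to calling `kmeans` once on all boxes regardless of class.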

If this is right

Pedestrian detection accuracy rises by 7-9 percent across difficulty levels.
Car detection accuracy rises by 1-2 percent across difficulty levels.
Cyclist detection accuracy rises by 12 percent on the Easy setting.
The regional proposal network produces higher-quality candidate regions.
Each class has an optimal cluster count that further maximizes the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clustering logic could be applied to other 3D detectors without changing their architecture.
If the test distribution shifts in object scale, the pre-computed clusters may need re-derivation from new data.
The method reduces the manual search over anchor scales that is common in 3D detection pipelines.
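
One way to check whether pre-computed clusters still fit a new data distribution is to measure how well the anchors cover the new boxes. The sketch below is an assumption of this page, not a procedure from the paper: it scores each box by its best size-only IoU against the anchor set (boxes are treated as axis-aligned and sharing a center), and all sizes are hypothetical.

```python
def size_iou(a, b):
    """IoU of two axis-aligned boxes that share a center, given (L, H, W) sizes."""
    inter = 1.0
    for x, y in zip(a, b):
        inter *= min(x, y)  # overlap along each axis is the smaller extent
    vol_a = a[0] * a[1] * a[2]
    vol_b = b[0] * b[1] * b[2]
    return inter / (vol_a + vol_b - inter)

def mean_best_iou(boxes, anchors):
    """Average, over boxes, of the IoU with the best-matching anchor."""
    return sum(max(size_iou(b, a) for a in anchors) for b in boxes) / len(boxes)

# Hypothetical anchor and box sizes in metres.
car_anchors = [(4.0, 1.5, 1.6)]
source_boxes = [(4.0, 1.5, 1.6), (4.0, 1.5, 1.6)]  # same scale as the anchor
shifted_boxes = [(8.0, 3.0, 3.2)]                  # every dimension doubled

source_score = mean_best_iou(source_boxes, car_anchors)
shifted_score = mean_best_iou(shifted_boxes, car_anchors)
needs_rederivation = shifted_score < 0.5 * source_score  # illustrative threshold only
```

A large drop in `mean_best_iou` between the original and the new data suggests the anchors no longer match the object scales and the clusters should be recomputed. The 0.5 factor is an arbitrary placeholder.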

Load-bearing premise

Anchors clustered from sizes and aspect ratios in the KITTI training set will remain effective on the held-out test distribution and on data from other sensors or environments.

What would settle it

Re-training the same detector with CAP on a different dataset such as nuScenes and measuring whether the reported per-class gains disappear or reverse.

Original abstract

Detecting objects in a two-dimensional setting is often insufficient in the context of real-life applications where the surrounding environment needs to be accurately recognized and oriented in three-dimension (3D), such as in the case of autonomous driving vehicles. Therefore, accurately and efficiently detecting objects in the three-dimensional setting is becoming increasingly relevant to a wide range of industrial applications, and thus is progressively attracting the attention of researchers. Building systems to detect objects in 3D is a challenging task though, because it relies on the multi-modal fusion of data derived from different sources. In this paper, we study the effects of anchoring using the current state-of-the-art 3D object detector and propose Class-specific Anchoring Proposal (CAP) strategy based on object sizes and aspect ratios based clustering of anchors. The proposed anchoring strategy significantly increased detection accuracy's by 7.19%, 8.13% and 8.8% on Easy, Moderate and Hard setting of the pedestrian class, 2.19%, 2.17% and 1.27% on Easy, Moderate and Hard setting of the car class and 12.1% on Easy setting of cyclist class. We also show that the clustering in anchoring process also enhances the performance of the regional proposal network in proposing regions of interests significantly. Finally, we propose the best cluster numbers for each class of objects in KITTI dataset that improves the performance of detection model significantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Class-specific anchor clustering for 3D detection is a small incremental step whose reported gains cannot be checked from the given details.

The paper applies per-class k-means clustering on object sizes and aspect ratios to generate anchors inside an existing 3D detector, then measures the effect on KITTI. That is the core new piece: treating anchor choice as class-dependent rather than using a single set for all objects. The idea is sensible because cars, pedestrians, and cyclists really do occupy different scale ranges, so separate priors can reduce the mismatch that generic anchors create. The abstract also notes that the change helps the region proposal network itself, which is a plausible side benefit. Beyond that, little stands out as fresh; the clustering method follows the same k-means approach already used in 2D anchor optimization papers.

The main weakness is that the claimed lifts (roughly 7-9% on pedestrians, 1-2% on cars) appear without any baseline numbers, without error bars, and without any statement that the clusters were fit only on the training split. The abstract simply says the clustering used the KITTI dataset, which leaves open the possibility that test examples influenced the anchor centers. If that happened, the numbers are not reproducible under standard protocol. No training details for the base detector are supplied either, so it is impossible to tell whether the gains come from the anchoring change or from some other difference in setup.

A reader working on anchor-based 3D detectors might still want to try the class-wise clustering trick on their own data, but the paper as written does not supply enough evidence to treat the specific percentages as reliable. I would not bring this to a reading group or cite it. A serious editor should desk-reject rather than send it out for review until the experiments section shows proper baselines, confirms the train-only split, and reports variance.

Referee Report

2 major / 2 minor

Summary. The paper proposes Class-specific Anchoring Proposal (CAP), a strategy that clusters anchors per object class using sizes and aspect ratios drawn from the KITTI dataset and integrates the resulting priors into an existing 3D object detector operating on LIDAR and RGB inputs. It reports concrete accuracy gains of 7.19/8.13/8.8 % (Easy/Moderate/Hard) on pedestrians, 2.19/2.17/1.27 % on cars, and 12.1 % (Easy) on cyclists, together with improved regional-proposal-network recall and recommended cluster counts per class.

Significance. If the numerical gains survive a clean train-only clustering protocol and are shown to be statistically reliable, the approach supplies a lightweight, class-aware prior that can be dropped into any anchor-based 3D detector. The explicit per-class cluster recommendations and the claim of RPN improvement constitute concrete, falsifiable contributions that practitioners could test directly on KITTI or similar benchmarks.

major comments (2)

[Experiments / Method] Experiments / Method sections: the description of anchor clustering states that centers are obtained from “the KITTI dataset” without declaring that only the official training split was used. Because the reported gains (e.g., +7.19 % pedestrian Easy) are the central empirical claim, any inclusion of validation or test examples would constitute indirect leakage and render the numbers non-reproducible under standard train/test separation.
[Abstract / Experiments] Abstract and Experiments: the percentage improvements are presented without the corresponding baseline AP values, without error bars or number of runs, and without any statistical test. These omissions make it impossible to judge whether the stated deltas exceed normal training variance and therefore undermine the load-bearing claim that CAP “significantly increased detection accuracy.”
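
To make the variance objection concrete, the following sketch shows the kind of check the referee is asking for: repeat training over several seeds for both configurations and compare the AP gap with the run-to-run spread using Welch's t statistic. The AP lists are placeholders invented for illustration; they are not results from the paper, which reports single runs only.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Placeholder AP values (percent) over five hypothetical seeds; NOT measured data.
baseline_ap = [50.0, 51.0, 49.0, 50.5, 49.5]
cap_ap = [57.0, 58.0, 56.5, 57.5, 56.0]

delta = statistics.mean(cap_ap) - statistics.mean(baseline_ap)
t_stat, dof = welch_t(baseline_ap, cap_ap)
```

Reporting `delta` together with the per-configuration standard deviation, or with `t_stat` and `dof`, would let a reader judge whether a gain exceeds normal training variance.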

minor comments (2)

[Abstract] Abstract: “accuracy's” is a typographical error; should read “accuracies.”
[Method] Notation: the paper never defines the distance metric or the initialization scheme used for the k-means clustering of anchors; these details are needed for exact reproduction.
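
The distance metric matters for reproduction because different metrics can assign the same box to different cluster centers. The sketch below contrasts Euclidean distance with a 1 - IoU distance of the kind used in 2D anchor clustering; which one the paper uses is not stated, so both are shown only as candidates, and all sizes are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two (L, H, W) size vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def iou_distance(a, b):
    """1 - IoU for two axis-aligned boxes that share a center."""
    inter = 1.0
    for x, y in zip(a, b):
        inter *= min(x, y)
    union = a[0] * a[1] * a[2] + b[0] * b[1] * b[2] - inter
    return 1.0 - inter / union

def nearest(box, centers, dist):
    """Index of the center closest to `box` under the given distance function."""
    return min(range(len(centers)), key=lambda j: dist(box, centers[j]))

# Hypothetical sizes chosen so that the two metrics disagree.
box = (1.0, 1.0, 1.0)
centers = [(0.5, 0.5, 0.5), (1.9, 1.0, 1.0)]

by_euclidean = nearest(box, centers, euclidean)
by_iou = nearest(box, centers, iou_distance)
```

Here the Euclidean metric picks the uniformly smaller center, while the IoU-based metric picks the center that overlaps the box more, so the resulting anchors would differ.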

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment below.

Referee: [Experiments / Method] Experiments / Method sections: the description of anchor clustering states that centers are obtained from “the KITTI dataset” without declaring that only the official training split was used. Because the reported gains (e.g., +7.19 % pedestrian Easy) are the central empirical claim, any inclusion of validation or test examples would constitute indirect leakage and render the numbers non-reproducible under standard train/test separation.

Authors: We confirm that anchor clustering was performed exclusively on the official training split of the KITTI dataset; no validation or test data was used. The original wording was imprecise. In the revised manuscript we will explicitly state in both the Method and Experiments sections that only the training split was employed, thereby eliminating any ambiguity regarding data leakage.
revision: yes

Referee: [Abstract / Experiments] Abstract and Experiments: the percentage improvements are presented without the corresponding baseline AP values, without error bars or number of runs, and without any statistical test. These omissions make it impossible to judge whether the stated deltas exceed normal training variance and therefore undermine the load-bearing claim that CAP “significantly increased detection accuracy.”

Authors: We agree that the abstract should report the absolute baseline AP values alongside the deltas; we will add them in the revised abstract and ensure they appear clearly in the Experiments section. Regarding error bars, multiple runs, and statistical tests, our submission used single training runs per configuration, which remains common practice on KITTI. We will insert a brief discussion noting this limitation and highlighting that the observed gains are large for pedestrians and cyclists (7-12 %) and positive for all three classes, which we consider indicative of a genuine effect. Full multi-run statistics would require additional compute and are not feasible for this revision.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical accuracy gains are measured outcomes on held-out KITTI splits

The paper proposes CAP via k-means-style clustering on observed object sizes and aspect ratios, then reports mAP improvements on the standard KITTI Easy/Moderate/Hard splits. These numerical gains are presented as direct experimental results from retraining/evaluating the base 3D detector with the new anchors; no equations, self-citations, or uniqueness theorems are invoked that would reduce the claimed deltas to a fitted parameter defined inside the paper or to a prior result by the same authors. The evaluation remains externally falsifiable on the public benchmark under conventional train/test separation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of a clustering step applied to an off-the-shelf 3D detector on one dataset; the number of clusters per class is chosen to maximize reported scores.

free parameters (1)

number of clusters per object class
Paper states it proposes the best cluster numbers for each class on KITTI to improve performance.

axioms (1)

domain assumption: The chosen base 3D object detector is representative of current state-of-the-art performance.
Abstract frames the study as testing anchoring effects on the current SOTA detector.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: we use K-mean clustering and Gaussian Mixture Model (GMM) methods... each object in particular class is considered as a vector x with three features (L,H,W)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The proposed anchoring strategy significantly increased detection accuracy's by 7.19%, 8.13% and 8.8% on Easy, Moderate and Hard setting of the pedestrian class

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.