
arxiv: 2605.04541 · v2 · submitted 2026-05-06 · 💻 cs.CV


Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection


Pith reviewed 2026-05-12 03:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-point-cloud registration · outlier rejection · angular consistency · hierarchical attention · cross-modality matching · point cloud registration · computer vision · robotics

The pith

Angle-I2P improves image-to-point-cloud registration by combining a scale-invariant angular-consistency constraint with hierarchical attention to reject outliers when most initial matches are wrong.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Angle-I2P as an outlier rejection network for image-to-point-cloud registration, a task central to robotic manipulation, grasping, and localization. It establishes that adding an explicit scale-invariant geometric constraint based on angular consistency between the two modalities helps the model separate reliable correspondences from outliers. The method further uses a global-to-local hierarchical attention mechanism to remove geometrically inconsistent matches under rigid transformations. This addresses the failure of conventional PnP solvers when the starting inlier ratio is low. The authors report state-of-the-art inlier ratios and registration recall on 7Scenes, RGBD Scenes V2, and a self-collected dataset.
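
To make the constraint concrete: the paper's exact formulation is not reproduced on this page, but a minimal numpy sketch of a scale-invariant angular-consistency vote of the kind described might look like the following. The residual function, the triplet-voting scheme, and both thresholds are illustrative assumptions, not the authors' method.

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors in radians; invariant to rescaling u or v."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def angular_residual(p, q, i, j, k):
    """|angle at vertex i in the point-cloud frame minus the same angle in the
    camera frame|. A rigid transform preserves this angle, so a large residual
    means at least one of the three correspondences (i, j, k) is an outlier."""
    return abs(angle(p[j] - p[i], p[k] - p[i]) - angle(q[j] - q[i], q[k] - q[i]))

rng = np.random.default_rng(0)
N = 200
p = rng.normal(size=(N, 3))                   # matched points, cloud frame
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation
R *= np.sign(np.linalg.det(R))                # force det(R) = +1
q = p @ R.T + np.array([0.3, -0.1, 2.0])      # camera-frame positions
q[: N // 2] = rng.normal(size=(N // 2, 3))    # corrupt half: outliers

# Vote: keep a correspondence if enough random triplets through it are
# angle-consistent. The 0.02 rad tolerance and 10% vote cut are illustrative.
votes = np.zeros(N)
for i in range(N):
    pairs = rng.choice(N, size=(60, 2))
    res = [angular_residual(p, q, i, j, k) for j, k in pairs
           if len({int(i), int(j), int(k)}) == 3]
    votes[i] = np.mean(np.array(res) < 0.02)
keep = votes > 0.10
print(f"kept {keep.sum()}/{N}; survivors that are true inliers: "
      f"{keep[N // 2:].sum() / max(keep.sum(), 1):.0%}")
```

Because angles are unchanged by uniform scaling, the same vote works even when the two modalities disagree on scale, which is the property the paper leans on.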

Core claim

By designing a scale-invariant, cross-modality geometric constraint based on angular consistency to guide inlier-outlier distinction, and pairing it with a global-to-local hierarchical attention mechanism that filters geometrically inconsistent matches under rigid transformation, Angle-I2P raises the inlier ratio and registration recall, enabling more accurate results from low-quality initial correspondences.

What carries the argument

The scale-invariant angular consistency constraint, which supplies an explicit geometric prior to distinguish inliers from outliers across image and point-cloud features, combined with global-to-local hierarchical attention that progressively removes inconsistent matches.
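
The attention side is named but not specified above, so the following PyTorch sketch shows only one plausible reading of "global-to-local hierarchical attention": a global self-attention pass over all candidate correspondences, then a pass masked to each match's k nearest 3D neighbours, ending in a per-correspondence inlier logit. The class, its feature layout, and every hyperparameter are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GlobalToLocalScorer(nn.Module):
    """Speculative two-stage correspondence scorer: global self-attention over
    all candidate matches, then attention restricted to each match's k nearest
    3D neighbours, ending in a per-correspondence inlier logit."""
    def __init__(self, d=64, heads=4, k=16):
        super().__init__()
        self.heads, self.k = heads, k
        self.embed = nn.Linear(5, d)   # per match: (x, y, z) point + (u, v) pixel
        self.globl = nn.MultiheadAttention(d, heads, batch_first=True)
        self.local = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, pts3d, pts2d):
        # pts3d: (B, N, 3), pts2d: (B, N, 2) for N candidate correspondences
        x = self.embed(torch.cat([pts3d, pts2d], dim=-1))
        x, _ = self.globl(x, x, x)                     # global stage
        d2 = torch.cdist(pts3d, pts3d)                 # (B, N, N) distances
        knn = d2.topk(self.k, largest=False).indices   # k nearest neighbours
        mask = torch.ones_like(d2, dtype=torch.bool)   # True = do not attend
        mask.scatter_(-1, knn, False)                  # unmask the k-NN slots
        x, _ = self.local(x, x, x,
                          attn_mask=mask.repeat_interleave(self.heads, dim=0))
        return self.head(x).squeeze(-1)                # (B, N) inlier logits

# Hypothetical usage on random data:
scorer = GlobalToLocalScorer()
logits = scorer(torch.randn(2, 100, 3), torch.randn(2, 100, 2))
print(logits.shape)   # torch.Size([2, 100])
```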

If this is right

  • Higher inlier ratios allow standard PnP solvers to produce more accurate poses from the refined correspondences (a minimal solver sketch follows this list).
  • The method yields consistent gains in registration recall on indoor scene benchmarks including 7Scenes and RGBD Scenes V2.
  • Outlier rejection becomes more robust to the low inlier ratios typical of cross-modality feature matching.
  • The overall pipeline achieves state-of-the-art registration performance without changing the upstream feature extractor.
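
The first point above is concrete enough to demonstrate. Below is a minimal sketch of the downstream step using OpenCV's standard RANSAC-PnP solver; the correspondences, intrinsics, and pose are synthetic values invented for the example, not data from the paper.

```python
import numpy as np
import cv2

# Hypothetical refined correspondences: 3D points, their projections under a
# known pose, and an invented pinhole intrinsic matrix K.
rng = np.random.default_rng(1)
pts3d = rng.uniform(-1, 1, size=(100, 3)) + np.array([0.0, 0.0, 4.0])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.10, -0.20, 0.05])
tvec_gt = np.array([0.20, 0.10, 0.50])
pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
pts2d = pts2d.reshape(-1, 2)

# Standard RANSAC-PnP on the refined set; with a high inlier ratio the pose
# estimate is stable, which is the downstream payoff the paper targets.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
    reprojectionError=3.0)
print(ok, rvec.ravel().round(3), tvec.ravel().round(3), len(inliers))
```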

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same angular prior could be tested in other cross-sensor tasks such as camera-to-LiDAR calibration or RGB-depth alignment.
  • Embedding the rejection step earlier in the pipeline might reduce dependence on post-hoc robust estimators like RANSAC.
  • Extending the hierarchical attention to handle non-rigid or dynamic scenes would test whether the geometric constraint generalizes beyond rigid assumptions.

Load-bearing premise

The angular consistency constraint can reliably separate true inliers from outliers despite real-world cross-modality noise, and the hierarchical attention can remove inconsistent matches without discarding valid ones.

What would settle it

If a controlled test set with known ground-truth correspondences shows that registration success rate does not rise above strong baselines once the initial inlier ratio drops below roughly 10 percent, the utility of the angular constraint and attention layers would be falsified.
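
A baseline version of that test is straightforward to sketch: sweep the ground-truth inlier ratio of synthetic correspondences and measure how often plain RANSAC-PnP recovers the pose. The actual experiment would run the same sweep on correspondences refined by Angle-I2P against strong rejection baselines; the scene geometry, tolerances, and iteration budget below are illustrative assumptions.

```python
import numpy as np
import cv2

def success_rate(inlier_ratio, trials=20, n=200, seed=2):
    """Fraction of trials in which plain RANSAC-PnP recovers a synthetic pose
    from correspondences with a controlled ground-truth inlier ratio."""
    rng = np.random.default_rng(seed)
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    hits = 0
    for _ in range(trials):
        pts3d = rng.uniform(-1, 1, (n, 3)) + np.array([0.0, 0.0, 4.0])
        rvec_gt = rng.normal(scale=0.2, size=3)
        tvec_gt = rng.normal(scale=0.2, size=3)
        pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
        pts2d = pts2d.reshape(-1, 2)
        n_out = int(n * (1.0 - inlier_ratio))         # corrupt this many matches
        pts2d[:n_out] = rng.uniform([0, 0], [640, 480], (n_out, 2))
        ok, rvec, tvec, _ = cv2.solvePnPRansac(
            pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
            reprojectionError=3.0, iterationsCount=2000)
        if ok and np.linalg.norm(rvec.ravel() - rvec_gt) < 0.05 \
              and np.linalg.norm(tvec.ravel() - tvec_gt) < 0.10:
            hits += 1
    return hits / trials

for r in (0.30, 0.10, 0.05):
    print(f"initial inlier ratio {r:.0%}: baseline success {success_rate(r):.0%}")
```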

Figures

Figures reproduced from arXiv: 2605.04541 by Muyao Peng, Pei An, Qiong Liu, Shun Zou, You Yang.

Figure 1: Registration performance of existing image-to-point cloud outliers …
Figure 2: Pipeline of the Angle-I2P. First, we back-project the image …
Figure 3: Visualization of the outliers rejection results of each method. We show four selected scenes in 7Scenes datasets.
Figure 4: Visualization of the outliers rejection results of each method. We show the results of RGBDScenesV2 datasets.
Original abstract

Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation,grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, crossmodality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-tolocal hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Angle-I2P, an outlier rejection network for image-to-point-cloud registration tasks. It introduces a scale-invariant cross-modality geometric constraint based on angular consistency to distinguish inliers from outliers and a global-to-local hierarchical attention mechanism to filter geometrically inconsistent matches. The authors claim that this leads to state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset.

Significance. Should the proposed method's improvements in inlier ratio and registration recall hold under scrutiny, it would represent a meaningful advance in handling low-inlier-ratio scenarios common in cross-modality registration for robotics. The explicit geometric prior is a notable strength compared to purely data-driven approaches.

major comments (2)
  1. [Abstract] The abstract asserts SOTA results with 'consistent improvements across all benchmarks' but includes no quantitative tables, baseline comparisons, ablation studies, or error analysis, preventing verification of the central empirical claim.
  2. No analysis is provided on whether the scale-invariant angular consistency constraint remains effective under real-world cross-modality noise, sensor-specific artifacts, or calibration drift, which underpins the ability to separate inliers from outliers and achieve the reported gains.
minor comments (1)
  1. [Abstract] Presentation issues include missing space in 'manipulation,grasping', 'crossmodality' should be hyphenated as 'cross-modality', and 'global-tolocal' should be 'global-to-local'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts SOTA results with 'consistent improvements across all benchmarks' but includes no quantitative tables, baseline comparisons, ablation studies, or error analysis, preventing verification of the central empirical claim.

    Authors: We agree that the abstract, due to length constraints, does not contain specific numbers or tables. The full manuscript provides these details in Section 4, including Table 1 reporting inlier ratios and registration recalls on 7Scenes and RGBD Scenes V2 with comparisons to baselines, plus ablation studies in Section 4.3. To address the concern, we will revise the abstract to include key quantitative highlights, such as the reported improvements in IR and RR. revision: yes

  2. Referee: [—] No analysis is provided on whether the scale-invariant angular consistency constraint remains effective under real-world cross-modality noise, sensor-specific artifacts, or calibration drift, which underpins the ability to separate inliers from outliers and achieve the reported gains.

    Authors: The evaluations use real datasets (7Scenes, RGBD Scenes V2) that contain sensor noise, cross-modality artifacts, and calibration variations typical of RGB-depth setups. The scale-invariant angular constraint is intended to mitigate scale and geometric inconsistencies arising from such factors, as evidenced by the performance gains in low-inlier-ratio cases. We will add a dedicated robustness discussion subsection, including qualitative analysis of the constraint under these conditions, to make this explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit geometric prior and attention architecture are independent of fitted outputs.

Full rationale

The paper introduces a scale-invariant angular consistency constraint as an explicit geometric prior and a global-to-local hierarchical attention mechanism as a new architectural component. Neither is derived from or fitted to the target inlier/outlier labels on the evaluation benchmarks; both are defined a priori and then applied to produce candidate correspondences whose quality is measured on held-out test sets (7Scenes, RGBD Scenes V2, self-collected). No equation reduces a claimed prediction to a quantity defined by the same data, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard geometric assumptions of rigid transformations and the empirical effectiveness of learned attention; no new physical entities or ad-hoc constants are introduced beyond typical neural-network parameters.

axioms (1)
  • domain assumption Rigid transformations preserve angles between point pairs
    Invoked to justify the scale-invariant angular consistency constraint used to label inliers versus outliers.
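
The axiom is elementary to verify, and the verification also explains the "scale-invariant" label: the same cancellation goes through for any similarity transform, not just a rigid one.

```latex
% Verification of the axiom: for any similarity transform T(x) = s R x + t
% with R^T R = I and s > 0, the translation cancels in differences, R
% preserves inner products and norms, and s cancels in the cosine, so the
% angle at vertex p_i is unchanged.
\[
\cos\theta'
= \frac{\langle T(p_j)-T(p_i),\; T(p_k)-T(p_i)\rangle}
       {\lVert T(p_j)-T(p_i)\rVert \,\lVert T(p_k)-T(p_i)\rVert}
= \frac{s^{2}\,\langle p_j-p_i,\; p_k-p_i\rangle}
       {s^{2}\,\lVert p_j-p_i\rVert \,\lVert p_k-p_i\rVert}
= \cos\theta .
\]
```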

pith-pipeline@v0.9.0 · 5509 in / 1347 out tokens · 59791 ms · 2026-05-12T03:16:16.172801+00:00 · methodology


