
arxiv: 2605.04541 · v2 · submitted 2026-05-06 · 💻 cs.CV


Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection


Pith reviewed 2026-05-12 03:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-point-cloud registration · outlier rejection · angular consistency · hierarchical attention · cross-modality matching · point cloud registration · computer vision · robotics

The pith

Angle-I2P improves image-to-point-cloud registration by combining a scale-invariant angular-consistency constraint with hierarchical attention to reject outliers when most initial matches are wrong.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Angle-I2P as an outlier rejection network for image-to-point-cloud registration, a task central to robotic manipulation, grasping, and localization. It establishes that adding an explicit scale-invariant geometric constraint based on angular consistency between the two modalities helps the model separate reliable correspondences from outliers. The method further uses a global-to-local hierarchical attention mechanism to remove geometrically inconsistent matches under rigid transformations. This addresses the failure of conventional PnP solvers when the starting inlier ratio is low. The authors report state-of-the-art inlier ratios and registration recall on 7Scenes, RGBD Scenes V2, and a self-collected dataset.
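
To make the constraint concrete: the paper's exact formulation is not reproduced on this page, but a minimal numpy sketch of a scale-invariant angular-consistency vote of the kind described might look like the following. The residual function, the triplet-voting scheme, and both thresholds are illustrative assumptions, not the authors' method.

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors in radians; invariant to rescaling u or v."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def angular_residual(p, q, i, j, k):
    """|angle at vertex i in the point-cloud frame minus the same angle in the
    camera frame|. A rigid transform preserves this angle, so a large residual
    means at least one of the three correspondences (i, j, k) is an outlier."""
    return abs(angle(p[j] - p[i], p[k] - p[i]) - angle(q[j] - q[i], q[k] - q[i]))

rng = np.random.default_rng(0)
N = 200
p = rng.normal(size=(N, 3))                   # matched points, cloud frame
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation
R *= np.sign(np.linalg.det(R))                # force det(R) = +1
q = p @ R.T + np.array([0.3, -0.1, 2.0])      # camera-frame positions
q[: N // 2] = rng.normal(size=(N // 2, 3))    # corrupt half: outliers

# Vote: keep a correspondence if enough random triplets through it are
# angle-consistent. The 0.02 rad tolerance and 10% vote cut are illustrative.
votes = np.zeros(N)
for i in range(N):
    pairs = rng.choice(N, size=(60, 2))
    res = [angular_residual(p, q, i, j, k) for j, k in pairs
           if len({int(i), int(j), int(k)}) == 3]
    votes[i] = np.mean(np.array(res) < 0.02)
keep = votes > 0.10
print(f"kept {keep.sum()}/{N}; survivors that are true inliers: "
      f"{keep[N // 2:].sum() / max(keep.sum(), 1):.0%}")
```

Because angles are unchanged by uniform scaling, the same vote works even when the two modalities disagree on scale, which is the property the paper leans on.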

Core claim

By designing a scale-invariant, cross-modality geometric constraint based on angular consistency to guide inlier-outlier distinction, and pairing it with a global-to-local hierarchical attention mechanism that filters geometrically inconsistent matches under rigid transformation, Angle-I2P raises the inlier ratio and registration recall, enabling more accurate results from low-quality initial correspondences.

What carries the argument

The scale-invariant angular consistency constraint, which supplies an explicit geometric prior to distinguish inliers from outliers across image and point-cloud features, combined with global-to-local hierarchical attention that progressively removes inconsistent matches.
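
The attention side is named but not specified above, so the following PyTorch sketch shows only one plausible reading of "global-to-local hierarchical attention": a global self-attention pass over all candidate correspondences, then a pass masked to each match's k nearest 3D neighbours, ending in a per-correspondence inlier logit. The class, its feature layout, and every hyperparameter are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GlobalToLocalScorer(nn.Module):
    """Speculative two-stage correspondence scorer: global self-attention over
    all candidate matches, then attention restricted to each match's k nearest
    3D neighbours, ending in a per-correspondence inlier logit."""
    def __init__(self, d=64, heads=4, k=16):
        super().__init__()
        self.heads, self.k = heads, k
        self.embed = nn.Linear(5, d)   # per match: (x, y, z) point + (u, v) pixel
        self.globl = nn.MultiheadAttention(d, heads, batch_first=True)
        self.local = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, pts3d, pts2d):
        # pts3d: (B, N, 3), pts2d: (B, N, 2) for N candidate correspondences
        x = self.embed(torch.cat([pts3d, pts2d], dim=-1))
        x, _ = self.globl(x, x, x)                     # global stage
        d2 = torch.cdist(pts3d, pts3d)                 # (B, N, N) distances
        knn = d2.topk(self.k, largest=False).indices   # k nearest neighbours
        mask = torch.ones_like(d2, dtype=torch.bool)   # True = do not attend
        mask.scatter_(-1, knn, False)                  # unmask the k-NN slots
        x, _ = self.local(x, x, x,
                          attn_mask=mask.repeat_interleave(self.heads, dim=0))
        return self.head(x).squeeze(-1)                # (B, N) inlier logits

# Hypothetical usage on random data:
scorer = GlobalToLocalScorer()
logits = scorer(torch.randn(2, 100, 3), torch.randn(2, 100, 2))
print(logits.shape)   # torch.Size([2, 100])
```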

If this is right

  • Higher inlier ratios allow standard PnP solvers to produce more accurate poses from the refined correspondences (a minimal solver sketch follows this list).
  • The method yields consistent gains in registration recall on indoor scene benchmarks including 7Scenes and RGBD Scenes V2.
  • Outlier rejection becomes more robust to the low inlier ratios typical of cross-modality feature matching.
  • The overall pipeline achieves state-of-the-art registration performance without changing the upstream feature extractor.
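
The first point above is concrete enough to demonstrate. Below is a minimal sketch of the downstream step using OpenCV's standard RANSAC-PnP solver; the correspondences, intrinsics, and pose are synthetic values invented for the example, not data from the paper.

```python
import numpy as np
import cv2

# Hypothetical refined correspondences: 3D points, their projections under a
# known pose, and an invented pinhole intrinsic matrix K.
rng = np.random.default_rng(1)
pts3d = rng.uniform(-1, 1, size=(100, 3)) + np.array([0.0, 0.0, 4.0])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.10, -0.20, 0.05])
tvec_gt = np.array([0.20, 0.10, 0.50])
pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
pts2d = pts2d.reshape(-1, 2)

# Standard RANSAC-PnP on the refined set; with a high inlier ratio the pose
# estimate is stable, which is the downstream payoff the paper targets.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
    reprojectionError=3.0)
print(ok, rvec.ravel().round(3), tvec.ravel().round(3), len(inliers))
```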

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same angular prior could be tested in other cross-sensor tasks such as camera-to-LiDAR calibration or RGB-depth alignment.
  • Embedding the rejection step earlier in the pipeline might reduce dependence on post-hoc robust estimators like RANSAC.
  • Extending the hierarchical attention to handle non-rigid or dynamic scenes would test whether the geometric constraint generalizes beyond rigid assumptions.

Load-bearing premise

The angular consistency constraint can reliably separate true inliers from outliers despite real-world cross-modality noise, and the hierarchical attention can remove inconsistent matches without discarding valid ones.

What would settle it

If a controlled test set with known ground-truth correspondences shows that registration success rate does not rise above strong baselines once the initial inlier ratio drops below roughly 10 percent, the utility of the angular constraint and attention layers would be falsified.
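
A baseline version of that test is straightforward to sketch: sweep the ground-truth inlier ratio of synthetic correspondences and measure how often plain RANSAC-PnP recovers the pose. The actual experiment would run the same sweep on correspondences refined by Angle-I2P against strong rejection baselines; the scene geometry, tolerances, and iteration budget below are illustrative assumptions.

```python
import numpy as np
import cv2

def success_rate(inlier_ratio, trials=20, n=200, seed=2):
    """Fraction of trials in which plain RANSAC-PnP recovers a synthetic pose
    from correspondences with a controlled ground-truth inlier ratio."""
    rng = np.random.default_rng(seed)
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    hits = 0
    for _ in range(trials):
        pts3d = rng.uniform(-1, 1, (n, 3)) + np.array([0.0, 0.0, 4.0])
        rvec_gt = rng.normal(scale=0.2, size=3)
        tvec_gt = rng.normal(scale=0.2, size=3)
        pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
        pts2d = pts2d.reshape(-1, 2)
        n_out = int(n * (1.0 - inlier_ratio))         # corrupt this many matches
        pts2d[:n_out] = rng.uniform([0, 0], [640, 480], (n_out, 2))
        ok, rvec, tvec, _ = cv2.solvePnPRansac(
            pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
            reprojectionError=3.0, iterationsCount=2000)
        if ok and np.linalg.norm(rvec.ravel() - rvec_gt) < 0.05 \
              and np.linalg.norm(tvec.ravel() - tvec_gt) < 0.10:
            hits += 1
    return hits / trials

for r in (0.30, 0.10, 0.05):
    print(f"initial inlier ratio {r:.0%}: baseline success {success_rate(r):.0%}")
```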

Figures

Figures reproduced from arXiv: 2605.04541 by Muyao Peng, Pei An, Qiong Liu, Shun Zou, You Yang.

Figure 1: Registration performance of existing image-to-point cloud outliers …
Figure 2: Pipeline of the Angle-I2P. First, we back-project the image …
Figure 3: Visualization of the outliers rejection results of each method. We show four selected scenes in 7Scenes datasets.
Figure 4: Visualization of the outliers rejection results of each method. We show the results of RGBDScenesV2 datasets.
Original abstract

Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation,grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, crossmodality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-tolocal hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Angle-I2P, an outlier rejection network for image-to-point-cloud registration tasks. It introduces a scale-invariant cross-modality geometric constraint based on angular consistency to distinguish inliers from outliers and a global-to-local hierarchical attention mechanism to filter geometrically inconsistent matches. The authors claim that this leads to state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset.

Significance. Should the proposed method's improvements in inlier ratio and registration recall hold under scrutiny, it would represent a meaningful advance in handling low-inlier-ratio scenarios common in cross-modality registration for robotics. The explicit geometric prior is a notable strength compared to purely data-driven approaches.

major comments (2)
  1. [Abstract] The abstract asserts SOTA results with 'consistent improvements across all benchmarks' but includes no quantitative tables, baseline comparisons, ablation studies, or error analysis, preventing verification of the central empirical claim.
  2. No analysis is provided on whether the scale-invariant angular consistency constraint remains effective under real-world cross-modality noise, sensor-specific artifacts, or calibration drift, which underpins the ability to separate inliers from outliers and achieve the reported gains.
minor comments (1)
  1. [Abstract] Presentation issues include missing space in 'manipulation,grasping', 'crossmodality' should be hyphenated as 'cross-modality', and 'global-tolocal' should be 'global-to-local'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts SOTA results with 'consistent improvements across all benchmarks' but includes no quantitative tables, baseline comparisons, ablation studies, or error analysis, preventing verification of the central empirical claim.

    Authors: We agree that the abstract, due to length constraints, does not contain specific numbers or tables. The full manuscript provides these details in Section 4, including Table 1 reporting inlier ratios and registration recalls on 7Scenes and RGBD Scenes V2 with comparisons to baselines, plus ablation studies in Section 4.3. To address the concern, we will revise the abstract to include key quantitative highlights, such as the reported improvements in IR and RR. revision: yes

  2. Referee: [—] No analysis is provided on whether the scale-invariant angular consistency constraint remains effective under real-world cross-modality noise, sensor-specific artifacts, or calibration drift, which underpins the ability to separate inliers from outliers and achieve the reported gains.

    Authors: The evaluations use real datasets (7Scenes, RGBD Scenes V2) that contain sensor noise, cross-modality artifacts, and calibration variations typical of RGB-depth setups. The scale-invariant angular constraint is intended to mitigate scale and geometric inconsistencies arising from such factors, as evidenced by the performance gains in low-inlier-ratio cases. We will add a dedicated robustness discussion subsection, including qualitative analysis of the constraint under these conditions, to make this explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit geometric prior and attention architecture are independent of fitted outputs.

Full rationale

The paper introduces a scale-invariant angular consistency constraint as an explicit geometric prior and a global-to-local hierarchical attention mechanism as a new architectural component. Neither is derived from or fitted to the target inlier/outlier labels on the evaluation benchmarks; both are defined a priori and then applied to produce candidate correspondences whose quality is measured on held-out test sets (7Scenes, RGBD Scenes V2, self-collected). No equation reduces a claimed prediction to a quantity defined by the same data, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard geometric assumptions of rigid transformations and the empirical effectiveness of learned attention; no new physical entities or ad-hoc constants are introduced beyond typical neural-network parameters.

axioms (1)
  • domain assumption Rigid transformations preserve angles between point pairs
    Invoked to justify the scale-invariant angular consistency constraint used to label inliers versus outliers.
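
The axiom is elementary to verify, and the verification also explains the "scale-invariant" label: the same cancellation goes through for any similarity transform, not just a rigid one.

```latex
% Verification of the axiom: for any similarity transform T(x) = s R x + t
% with R^T R = I and s > 0, the translation cancels in differences, R
% preserves inner products and norms, and s cancels in the cosine, so the
% angle at vertex p_i is unchanged.
\[
\cos\theta'
= \frac{\langle T(p_j)-T(p_i),\; T(p_k)-T(p_i)\rangle}
       {\lVert T(p_j)-T(p_i)\rVert \,\lVert T(p_k)-T(p_i)\rVert}
= \frac{s^{2}\,\langle p_j-p_i,\; p_k-p_i\rangle}
       {s^{2}\,\lVert p_j-p_i\rVert \,\lVert p_k-p_i\rVert}
= \cos\theta .
\]
```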

pith-pipeline@v0.9.0 · 5509 in / 1347 out tokens · 59791 ms · 2026-05-12T03:16:16.172801+00:00 · methodology


