Let's Take This Online: Adapting Scene Coordinate Regression Network Predictions for Online RGB-D Camera Relocalisation
Pith reviewed 2026-05-25 19:32 UTC · model grok-4.3
The pith
A two-step adaptation lets a scene coordinate regression network trained on one scene predict coordinates in a new scene for fast online relocalisation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the branching structure of a regression forest can be replaced by a two-step adaptation: a network trained on one scene first predicts points in that scene, and those predicted points are then used to look up corresponding clusters of points from a new scene. They report that this lets the network produce accurate scene coordinates in the new scene, achieving state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets while running in under 300 ms.
What carries the argument
The two-step adaptation process that uses network predictions of points in the original scene to look up clusters of points from the new scene.
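The two-step process can be illustrated with a minimal sketch. It assumes, purely for illustration, that predicted original-scene points are quantised into a voxel grid whose cells act as lookup keys; `VOXEL_SIZE`, `voxel_key`, `build_lookup` and `look_up` are hypothetical names, and the paper's actual indexing scheme may differ.

```python
from collections import defaultdict

import numpy as np

VOXEL_SIZE = 0.25  # metres; hypothetical cell size, not taken from the paper


def voxel_key(point, voxel_size=VOXEL_SIZE):
    """Quantise a 3D point to an integer grid cell (assumed indexing scheme)."""
    cell = np.floor(np.asarray(point, dtype=float) / voxel_size)
    return tuple(int(c) for c in cell)


def build_lookup(predicted_src_points, new_scene_points, voxel_size=VOXEL_SIZE):
    """Adaptation: file each known new-scene point under the cell of the
    original-scene point that the network predicted for the same pixel."""
    table = defaultdict(list)
    for p_src, p_new in zip(predicted_src_points, new_scene_points):
        table[voxel_key(p_src, voxel_size)].append(np.asarray(p_new, dtype=float))
    return table


def look_up(table, predicted_src_point, voxel_size=VOXEL_SIZE):
    """Test time: step 1 (the network prediction) is given as input; step 2
    returns the new-scene points stored in the matching cell, as an (N, 3) array."""
    points = table.get(voxel_key(predicted_src_point, voxel_size), [])
    return np.array(points, dtype=float).reshape(-1, 3)
```

In this sketch the network is never retrained: only the table changes between scenes, which is what would make the adaptation cheap enough to run online.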
If this is right
- The adapted network generalises to novel poses away from the training trajectory.
- Dense correspondences improve robustness in textureless regions compared to sparse keypoint methods.
- Performance reaches state-of-the-art levels on both indoor and outdoor benchmark datasets.
- Runtime under 300 ms makes the approach suitable for live camera relocalisation scenarios.
Where Pith is reading between the lines
- The method could support repeated adaptations in slowly changing environments without full retraining each time.
- Integration with keyframe-based methods might combine trajectory coverage with robustness to textureless areas.
- The lookup mechanism might generalise to other regression tasks that need quick scene transfer without labelled data for the target.
Load-bearing premise
Points predicted by the network on the original scene can be used to look up the right clusters of points from the new scene.
What would settle it
A test showing that relocalisation accuracy stays low after adaptation because the predicted points from the original scene fail to index the correct clusters in the new scene.
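One way such a test could be set up is to measure how often the retrieved cluster actually contains a point near the true new-scene coordinate. The sketch below is an assumption about how that diagnostic might look: `cluster_hit_rate` and the 5 cm `tolerance` are hypothetical, not taken from the paper.

```python
import numpy as np


def cluster_hit_rate(retrieved_clusters, ground_truth_points, tolerance=0.05):
    """Fraction of queries whose retrieved cluster holds at least one point
    within `tolerance` metres of the ground-truth new-scene coordinate.

    retrieved_clusters: sequence of (N_i, 3) arrays, one per query (may be empty).
    ground_truth_points: sequence of 3-vectors, one per query.
    """
    total = len(ground_truth_points)
    if total == 0:
        return 0.0
    hits = 0
    for cluster, truth in zip(retrieved_clusters, ground_truth_points):
        cluster = np.asarray(cluster, dtype=float).reshape(-1, 3)
        if cluster.shape[0] == 0:
            continue  # an empty lookup counts as a miss
        distances = np.linalg.norm(cluster - np.asarray(truth, dtype=float), axis=1)
        if distances.min() <= tolerance:
            hits += 1
    return hits / total
```

A hit rate that stays low after adaptation, together with poor relocalisation accuracy, would point at the lookup step rather than at the pose solver.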
Original abstract
Many applications require a camera to be relocalised online, without expensive offline training on the target scene. Whilst both keyframe and sparse keypoint matching methods can be used online, the former often fail away from the training trajectory, and the latter can struggle in textureless regions. By contrast, scene coordinate regression (SCoRe) methods generalise to novel poses and can leverage dense correspondences to improve robustness, and recent work has shown how to adapt SCoRe forests between scenes, allowing their state-of-the-art performance to be leveraged online. However, because they use features hand-crafted for indoor use, they do not generalise well to harder outdoor scenes. Whilst replacing the forest with a neural network and learning suitable features for outdoor use is possible, the techniques used to adapt forests between scenes are unfortunately harder to transfer to a network context. In this paper, we address this by proposing a novel way of leveraging a network trained on one scene to predict points in another scene. Our approach replaces the appearance clustering performed by the branching structure of a regression forest with a two-step process that first uses the network to predict points in the original scene, and then uses these predicted points to look up clusters of points from the new scene. We show experimentally that our online approach achieves state-of-the-art performance on both the 7-Scenes and Cambridge Landmarks datasets, whilst running in under 300ms, making it highly effective in live scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-step adaptation method for scene coordinate regression (SCoRe) networks to enable online RGB-D camera relocalisation in new scenes. A network trained on an original scene is used to regress 3D points, which then index precomputed clusters from the target scene; this replaces the branching structure of regression forests. The manuscript claims this yields state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets while running in under 300 ms.
Significance. If the adaptation reliably transfers accuracy, the work would be significant because it extends forest-style online adaptation to neural networks that can handle outdoor scenes, addressing a practical gap between offline training and live deployment. The real-time claim and use of standard benchmarks are strengths; however, the absence of any derivation or bound on the cluster-lookup step limits the result's generality.
major comments (2)
- [§3.2] §3.2 (two-step adaptation): the claim that points regressed by the original-scene network can be used to retrieve correct clusters from the new scene is load-bearing for all reported results, yet no analysis, bound, or ablation is provided showing why this mapping preserves geometric accuracy when the scenes differ in scale, texture, or structure (the precise transfer step highlighted in the skeptic note).
- [§5] §5 (experimental evaluation): the SOTA claim on both 7-Scenes and Cambridge Landmarks rests on quantitative comparisons, but the manuscript provides no error bars, statistical tests, or per-scene breakdown that would confirm that the adaptation step, rather than the base network, is responsible for the reported gains.
minor comments (2)
- The abstract states 'under 300 ms' but the timing breakdown (network forward pass vs. cluster lookup vs. pose solver) is not tabulated; a table would clarify the real-time claim.
- Notation for the cluster lookup function is introduced without an explicit equation; adding one would improve reproducibility.
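The timing table requested in the first minor comment could be produced with a small harness like the one below. This is a sketch under stated assumptions: `time_stages` is a hypothetical helper, and the three stage functions are placeholders that merely pass data through, standing in for the real network forward pass, cluster lookup and pose solver.

```python
import time
from collections import defaultdict


def time_stages(stages, frame, repeats=10):
    """Run a pipeline of (name, function) stages `repeats` times and return
    the mean wall-clock time per stage in milliseconds."""
    totals = defaultdict(float)
    for _ in range(repeats):
        data = frame
        for name, stage in stages:
            start = time.perf_counter()
            data = stage(data)
            totals[name] += time.perf_counter() - start
    return {name: 1000.0 * total / repeats for name, total in totals.items()}


# Placeholder stages (hypothetical): replace with the real components.
def network_forward(frame):
    return frame


def cluster_lookup(predictions):
    return predictions


def pose_solver(clusters):
    return clusters


if __name__ == "__main__":
    pipeline = [
        ("network forward pass", network_forward),
        ("cluster lookup", cluster_lookup),
        ("pose solver", pose_solver),
    ]
    for name, mean_ms in time_stages(pipeline, frame=None).items():
        print(f"{name:>22}: {mean_ms:.3f} ms")
```

If any stage runs asynchronously on a GPU, the device would need to be synchronised before each timestamp for the per-stage numbers to be meaningful.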
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, clarifying the method's assumptions and outlining planned revisions to strengthen the experimental validation and analysis.
- Referee: [§3.2] §3.2 (two-step adaptation): the claim that points regressed by the original-scene network can be used to retrieve correct clusters from the new scene is load-bearing for all reported results, yet no analysis, bound, or ablation is provided showing why this mapping preserves geometric accuracy when the scenes differ in scale, texture, or structure (the precise transfer step highlighted in the skeptic note).
Authors: We agree that the manuscript would benefit from additional analysis of the cluster-lookup transfer step. The approach relies on the observation that the source-trained network predicts 3D points in a shared canonical frame, allowing nearest-neighbor lookup into target-scene clusters defined by spatial proximity; this is intended to approximate the forest's branching without requiring retraining. While no formal bound is derived, the geometric intuition is that coarse 3D alignment suffices for cluster retrieval even across moderate scene variations. In revision we will expand §3.2 with a clearer derivation of the lookup step and add an ablation that varies scene scale and texture differences on the Cambridge Landmarks sequences. revision: partial
- Referee: [§5] §5 (experimental evaluation): the SOTA claim on both 7-Scenes and Cambridge Landmarks rests on quantitative comparisons, but the manuscript provides no error bars, statistical tests, or per-scene breakdown that would confirm that the adaptation step, rather than the base network, is responsible for the reported gains.
Authors: The current manuscript reports mean errors and timing but indeed omits error bars, statistical significance tests, and explicit per-scene tables isolating the adaptation contribution. We will revise §5 to include per-scene breakdowns for both datasets, error bars computed over multiple runs, and an additional ablation that compares the full two-step adaptation against the unadapted base network and against a version that skips the cluster lookup. These additions will make the source of the reported gains explicit. revision: yes
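The error bars promised here could be computed along the following lines. This is a minimal sketch, assuming each scene has one scalar score per run (for example, a median translation error); `summarise_runs` is a hypothetical helper, and any numbers passed to it in an example would be placeholders rather than results from the paper.

```python
import numpy as np


def summarise_runs(per_run_scores):
    """Map each scene to (mean, sample standard deviation) over its runs.

    per_run_scores: dict of scene name -> list of scores, one per run.
    The standard deviation is reported as 0.0 when only one run exists.
    """
    summary = {}
    for scene, scores in per_run_scores.items():
        values = np.asarray(scores, dtype=float)
        mean = float(values.mean())
        std = float(values.std(ddof=1)) if values.size > 1 else 0.0
        summary[scene] = (mean, std)
    return summary
```

Running the same summary for the full method, the unadapted base network and the variant without cluster lookup would give the per-scene ablation table described in the response.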
Circularity Check
No significant circularity; adaptation method is an independent technical contribution
The paper introduces a two-step process to adapt a network trained on scene A for use on scene B by regressing points in A's frame and then indexing precomputed clusters from B. This is presented as a novel replacement for forest branching, with SOTA claims supported by experiments on 7-Scenes and Cambridge Landmarks. No equations or derivations reduce by construction to fitted inputs or self-citations; the central premise does not rely on a load-bearing self-citation chain or self-definitional mapping. The method is self-contained against external benchmarks and does not rename known results or smuggle ansatzes via citation.
Forward citations
Cited by 1 Pith paper
- Pan-tilt-zoom SLAM for Sports Videos: An online PTZ SLAM system using a novel camera model, ray landmarks, and a pan-tilt forest for superior pose estimation in sports videos.