Let's Take This Online: Adapting Scene Coordinate Regression Network Predictions for Online RGB-D Camera Relocalisation

Jishnu Mukhoti; Luca Bertinetto; Philip Torr; Stuart Golodetz; Tommaso Cavallari

arxiv: 1906.08744 · v1 · pith:7ASCN36Pnew · submitted 2019-06-20 · 💻 cs.CV · cs.LG· cs.RO

Let's Take This Online: Adapting Scene Coordinate Regression Network Predictions for Online RGB-D Camera Relocalisation

Tommaso Cavallari , Luca Bertinetto , Jishnu Mukhoti , Philip Torr , Stuart Golodetz This is my paper

Pith reviewed 2026-05-25 19:32 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.RO

keywords scene coordinate regressiononline relocalisationRGB-D cameranetwork adaptation7-Scenes datasetCambridge Landmarks dataset

0 comments

The pith

A two-step adaptation lets a scene coordinate regression network trained on one scene predict coordinates in a new scene for fast online relocalisation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to adapt neural networks for scene coordinate regression between scenes without retraining, enabling online camera relocalisation. It replaces the appearance clustering of regression forests with a process that first has the network predict points in the original scene and then uses those predictions to retrieve clusters from the new scene. This preserves the network's ability to generalise to novel poses and use dense correspondences while making it practical for live use. A sympathetic reader would care because it combines the robustness of dense methods with the speed needed for real-time applications like augmented reality.

Core claim

The authors claim that replacing a regression forest's branching with a two-step adaptation—using a network trained on one scene to predict points there, then looking up corresponding clusters in a new scene—allows the network to produce accurate scene coordinates in the new scene, achieving state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets while running in under 300 ms.

What carries the argument

The two-step adaptation process that uses network predictions of points in the original scene to look up clusters of points from the new scene.

If this is right

The adapted network generalises to novel poses away from the training trajectory.
Dense correspondences improve robustness in textureless regions compared to sparse keypoint methods.
Performance reaches state-of-the-art levels on both indoor and outdoor benchmark datasets.
Runtime under 300 ms makes the approach suitable for live camera relocalisation scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support repeated adaptations in slowly changing environments without full retraining each time.
Integration with keyframe-based methods might combine trajectory coverage with robustness to textureless areas.
The lookup mechanism might generalise to other regression tasks that need quick scene transfer without labelled data for the target.

Load-bearing premise

Points predicted by the network on the original scene can be used to look up the right clusters of points from the new scene.

What would settle it

A test showing that relocalisation accuracy stays low after adaptation because the predicted points from the original scene fail to index the correct clusters in the new scene.

Figures

Figures reproduced from arXiv: 1906.08744 by Jishnu Mukhoti, Luca Bertinetto, Philip Torr, Stuart Golodetz, Tommaso Cavallari.

**Figure 1.** Figure 1: An overview of our approach. Ahead of time, we train a scene coordinate regression network offline to predict correspondences between pixels in an input image and 3D points in an arbitrary pre-training scene (here, Chess [63]): see §2.2. To use this network to predict points in a different target scene (here, Heads [63]) online, we use the points the network predicts to index into an array of reservoirs, i… view at source ↗

**Figure 2.** Figure 2: ScoreNet architecture. We use a truncated VGG16 feature extractor, followed by several 1×1 convolutional layers, to regress 3D world space points for a subset of pixels from the original image. structure of the ScoreNets we use and how they are trained are described in §2.2. To use a ScoreNet to relocalise in a scene other than the one on which it was trained, we need a way of adapting the predictions of… view at source ↗

**Figure 3.** Figure 3: Grid-based reservoir indexing. Suppose that the 3D point p that the ScoreNet predicts for a given pixel falls into cell (2, 1, 3) in a bounded grid placed over the training scene (we show only the x and y dimensions, for simplicity). Then g(px) = 2, g(py) = 1 and g(pz) = 3, and we can calculate a grid cell index of G(p) = 42 × 3 + 4×1+2 = 54 for p. We use this grid cell index to perform a lookup in a table… view at source ↗

**Figure 4.** Figure 4: Visualising the raw and adapted points that a [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: A string representation of the layer structure of [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Evaluating the ability of our relocaliser pre [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Evaluating the ability of our relocaliser pre [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: The proportions of test frames from Cambridge [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: The distributions of depth values in the training sequences of the different scenes in the 7-Scenes [ [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Evaluating how the performance of our relo [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Evaluating how the relocalisation performance [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

read the original abstract

Many applications require a camera to be relocalised online, without expensive offline training on the target scene. Whilst both keyframe and sparse keypoint matching methods can be used online, the former often fail away from the training trajectory, and the latter can struggle in textureless regions. By contrast, scene coordinate regression (SCoRe) methods generalise to novel poses and can leverage dense correspondences to improve robustness, and recent work has shown how to adapt SCoRe forests between scenes, allowing their state-of-the-art performance to be leveraged online. However, because they use features hand-crafted for indoor use, they do not generalise well to harder outdoor scenes. Whilst replacing the forest with a neural network and learning suitable features for outdoor use is possible, the techniques used to adapt forests between scenes are unfortunately harder to transfer to a network context. In this paper, we address this by proposing a novel way of leveraging a network trained on one scene to predict points in another scene. Our approach replaces the appearance clustering performed by the branching structure of a regression forest with a two-step process that first uses the network to predict points in the original scene, and then uses these predicted points to look up clusters of points from the new scene. We show experimentally that our online approach achieves state-of-the-art performance on both the 7-Scenes and Cambridge Landmarks datasets, whilst running in under 300ms, making it highly effective in live scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core move is a two-step lookup that lets a network trained on scene A regress points to index clusters precomputed from scene B, enabling online adaptation without retraining.

read the letter

The main thing here is the adaptation trick: run the original network to predict 3D points in its training coordinate frame, then use those points to retrieve clusters from the target scene instead of relying on forest branching. This lets them keep the network's learned features (better for outdoor data) while borrowing the adaptation speed from earlier forest work. They report real-time operation under 300 ms and state-of-the-art numbers on both 7-Scenes and Cambridge Landmarks. That combination of network generalization plus online transfer is the concrete advance over the forest papers they cite. The method is presented cleanly and the motivation for moving beyond hand-crafted features is clear. The soft spot is exactly the one the stress-test flags: the lookup step assumes that points regressed in scene A's frame will land near the correct geometric clusters in scene B even though the network has never seen B. The abstract gives no derivation, bound, or ablation showing when this holds or breaks (different scale, texture, or structure are obvious failure modes). Without the experimental section it is impossible to tell whether the claimed SOTA actually survives that assumption or whether the numbers come from closely related scenes. This is the kind of incremental but practical paper that relocalisation and SLAM groups would want to see. It deserves a serious referee to check the full results, baselines, and failure cases rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-step adaptation method for scene coordinate regression (SCoRe) networks to enable online RGB-D camera relocalisation in new scenes. A network trained on an original scene is used to regress 3D points, which then index precomputed clusters from the target scene; this replaces the branching structure of regression forests. The manuscript claims this yields state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets while running in under 300 ms.

Significance. If the adaptation reliably transfers accuracy, the work would be significant because it extends forest-style online adaptation to neural networks that can handle outdoor scenes, addressing a practical gap between offline training and live deployment. The real-time claim and use of standard benchmarks are strengths; however, the absence of any derivation or bound on the cluster-lookup step limits the result's generality.

major comments (2)

[§3.2] §3.2 (two-step adaptation): the claim that points regressed by the original-scene network can be used to retrieve correct clusters from the new scene is load-bearing for all reported results, yet no analysis, bound, or ablation is provided showing why this mapping preserves geometric accuracy when the scenes differ in scale, texture, or structure (the precise transfer step highlighted in the skeptic note).
[§5] §5 (experimental evaluation): the SOTA claim on both 7-Scenes and Cambridge Landmarks rests on quantitative comparisons, but the manuscript provides no error bars, statistical tests, or per-scene breakdown that would confirm the adaptation step—not the base network—is responsible for the reported gains.

minor comments (2)

The abstract states 'under 300 ms' but the timing breakdown (network forward pass vs. cluster lookup vs. pose solver) is not tabulated; a table would clarify the real-time claim.
Notation for the cluster lookup function is introduced without an explicit equation; adding one would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, clarifying the method's assumptions and outlining planned revisions to strengthen the experimental validation and analysis.

read point-by-point responses

Referee: [§3.2] §3.2 (two-step adaptation): the claim that points regressed by the original-scene network can be used to retrieve correct clusters from the new scene is load-bearing for all reported results, yet no analysis, bound, or ablation is provided showing why this mapping preserves geometric accuracy when the scenes differ in scale, texture, or structure (the precise transfer step highlighted in the skeptic note).

Authors: We agree that the manuscript would benefit from additional analysis of the cluster-lookup transfer step. The approach relies on the observation that the source-trained network predicts 3D points in a shared canonical frame, allowing nearest-neighbor lookup into target-scene clusters defined by spatial proximity; this is intended to approximate the forest's branching without requiring retraining. While no formal bound is derived, the geometric intuition is that coarse 3D alignment suffices for cluster retrieval even across moderate scene variations. In revision we will expand §3.2 with a clearer derivation of the lookup step and add an ablation that varies scene scale and texture differences on the Cambridge Landmarks sequences. revision: partial
Referee: [§5] §5 (experimental evaluation): the SOTA claim on both 7-Scenes and Cambridge Landmarks rests on quantitative comparisons, but the manuscript provides no error bars, statistical tests, or per-scene breakdown that would confirm the adaptation step—not the base network—is responsible for the reported gains.

Authors: The current manuscript reports mean errors and timing but indeed omits error bars, statistical significance tests, and explicit per-scene tables isolating the adaptation contribution. We will revise §5 to include per-scene breakdowns for both datasets, error bars computed over multiple runs, and an additional ablation that compares the full two-step adaptation against the unadapted base network and against a version that skips the cluster lookup. These additions will make the source of the reported gains explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; adaptation method is an independent technical contribution

full rationale

The paper introduces a two-step process to adapt a network trained on scene A for use on scene B by regressing points in A's frame and then indexing precomputed clusters from B. This is presented as a novel replacement for forest branching, with SOTA claims supported by experiments on 7-Scenes and Cambridge Landmarks. No equations or derivations reduce by construction to fitted inputs or self-citations; the central premise does not rely on a load-bearing self-citation chain or self-definitional mapping. The method is self-contained against external benchmarks and does not rename known results or smuggle ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the method is described at a high level without mathematical details.

pith-pipeline@v0.9.0 · 5815 in / 1204 out tokens · 61162 ms · 2026-05-25T19:32:05.659294+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pan-tilt-zoom SLAM for Sports Videos
cs.CV 2019-07 unverdicted novelty 6.0

An online PTZ SLAM system using a novel camera model, ray landmarks, and a pan-tilt forest for superior pose estimation in sports videos.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Acharya, K

D. Acharya, K. Khoshelham, and S. Winter. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Pho- togrammetry and Remote Sensing, 150:245–258, 2019

work page 2019
[2]

H. Bae, M. Walker, J. White, Y . Pan, Y . Sun, and M. Golparvar-Fard. Fast and scalable structure-from-motion based localization for high-precision mobile augmented real- ity systems. The Journal of Mobile User Experience, 5(1):1– 21, 2016

work page 2016
[3]

Balntas, S

V . Balntas, S. Li, and V . Prisacariu. RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. InECCV, 2018

work page 2018
[4]

P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. TPAMI, 14(2):239–256, February 1992

work page 1992
[5]

Brachmann, A

E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC – Differentiable RANSAC for Camera Localization. In CVPR, 2017

work page 2017
[6]

Brachmann, F

E. Brachmann, F. Michel, A. Krull, M. Y . Yang, S. Gumhold, and C. Rother. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, 2016

work page 2016
[7]

Brachmann and C

E. Brachmann and C. Rother. Learning Less is More – 6D Camera Localization via 3D Surface Regression. In CVPR, 2018

work page 2018
[8]

Brachmann and C

E. Brachmann and C. Rother. Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses. arXiv:1905.04132v1, 2019

work page arXiv 1905
[9]

Brahmbhatt, J

S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-Aware Learning of Maps for Camera Localiza- tion. In CVPR, pages 2616–2625, 2018

work page 2018
[10]

M. Bui, C. Baur, N. Navab, S. Ilic, and S. Albarqouni. Adver- sarial Joint Image and Pose Distribution Learning for Cam- era Pose Regression and Reﬁnement. arXiv:1903.06646v2, 2019

work page arXiv 1903
[11]

Castle, G

R. Castle, G. Klein, and D. W. Murray. Video-rate Local- ization in Multiple Maps for Wearable Augmented Reality. In IEEE International Symposium on Wearable Computers , pages 15–22, 2008

work page 2008
[12]

Cavallari*, S

T. Cavallari*, S. Golodetz*, N. A. Lord*, J. Valentin*, V . A. Prisacariu, L. D. Stefano, and P. H. S. Torr. Real-Time RGB- D Camera Pose Estimation in Novel Scenes using a Relocal- isation Cascade. TPAMI, Early Access, 2019

work page 2019
[13]

Cavallari, S

T. Cavallari, S. Golodetz*, N. A. Lord*, J. Valentin, L. D. Stefano, and P. H. S. Torr. On-the-Fly Adaptation of Regres- sion Forests for Online Camera Relocalisation. In CVPR, 2017

work page 2017
[14]

O. Chum, J. Matas, and J. Kittler. Locally Optimized RANSAC. In Joint Pattern Recognition Symposium, pages 236–243, 2003

work page 2003
[15]

Clark, S

R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video- Clip Relocalization. In CVPR, pages 6856–6864, 2017

work page 2017
[16]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, pages 248–255, 2009

work page 2009
[17]

L. Deng, Z. Chen, B. Chen, Y . Duan, and J. Zhou. Incremen- tal image set querying based localization. Neurocomputing, 2016

work page 2016
[18]

Duong, A

N.-D. Duong, A. Kacete, C. Sodalie, P.-Y . Richard, and J. Royan. xyzNet: Towards Machine Learning Camera Relo- calization by Using a Scene Coordinate Prediction Network. In ISMAR, 2018

work page 2018
[19]

Y . Feng, Y . Wu, and L. Fan. Real-time SLAM relocalization with online learning of binary feature indexing. Machine Vision and Applications, 28(8):953–963, 2017

work page 2017
[20]

M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. CACM, 24(6), 1981

work page 1981
[21]

Fulkerson and S

B. Fulkerson and S. Soatto. Really quick shift: Image seg- mentation on a GPU. In ECCV, pages 350–358, 2010

work page 2010
[22]

G ´alvez-L´opez and J

D. G ´alvez-L´opez and J. D. Tard ´os. Real-Time Loop De- tection with Bags of Binary Words. In IROS, pages 51–58, 2011

work page 2011
[23]

A. P. Gee and W. Mayol-Cuevas. 6D Relocalisation for RGBD Cameras Using Synthetic View Regression. In BMVC, 2012

work page 2012
[24]

Glocker, J

B. Glocker, J. Shotton, A. Criminisi, and S. Izadi. Real- Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding. TVCG, 21(5), 2015

work page 2015
[25]

Golodetz*, T

S. Golodetz*, T. Cavallari*, N. A. Lord*, V . A. Priscariu, D. W. Murray, and P. H. S. Torr. Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Op- timisation. TVCG, 24(11):2895–2905, 2018

work page 2018
[26]

SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes

S. Golodetz*, M. Sapienza*, J. P. C. Valentin, V . Vineet, M.- M. Cheng, A. Arnab, V . A. Prisacariu, O. K¨ahler, C. Y . Ren, D. W. Murray, S. Izadi, and P. H. S. Torr. SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes. Technical Report TVG-2015-1, Department of Engineering Science, University of Oxford, October 2015. Released as ar...

work page internal anchor Pith review Pith/arXiv arXiv 2015
[27]

Guzman-Rivera, P

A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi. Multi-Output Learn- ing for Camera Relocalization. In CVPR, pages 1114–1121, 2014

work page 2014
[28]

Hartley and A

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004

work page 2004
[29]

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016

work page 2016
[30]

Ioffe and C

S. Ioffe and C. Szegedy. Batch Normalization: Accelerat- ing Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015

work page 2015
[31]

W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallogra- phy, 32(5):922–923, 1976

work page 1976
[32]

Kacete, T

A. Kacete, T. Wentz, and J. Royan. Decision Forest For Ef- ﬁcient and Robust Camera Relocalization. In ISMAR, pages 20–24, 2017

work page 2017
[33]

K ¨ahler, V

O. K ¨ahler, V . A. Prisacariu, and D. W. Murray. Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure. In ECCV, pages 500–516, 2016

work page 2016
[34]

Kendall and R

A. Kendall and R. Cipolla. Modelling Uncertainty in Deep Learning for Camera Relocalization. In ICRA, 2016

work page 2016
[35]

Kendall and R

A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. In CVPR, pages 5974–5983, 2017

work page 2017
[36]

Kendall, M

A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A Convo- lutional Network for Real-Time 6-DOF Camera Relocaliza- tion. In ICCV, pages 2938–2946, 2015

work page 2015
[37]

D. P. Kingma* and J. L. Ba*. Adam: A Method for Stochas- tic Optimization. In ICLR, 2015

work page 2015
[38]

Laskar*, I

Z. Laskar*, I. Melekhov*, S. Kalia, and J. Kannala. Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. In ICCV-W, pages 929–938, 2017

work page 2017
[39]

Levenberg

K. Levenberg. A Method for the Solution of Certain Prob- lems in Least Squares. QAM, 2(2):164–168, 1944

work page 1944
[40]

Q. Li, J. Zhu, R. Cao, K. Sun, J. M. Garibaldi, Q. Li, B. Liu, and G. Qiu. Relative Geometry-Aware Siamese Neural Network for 6DOF Camera Relocalization. arXiv:1901.01049v2, 2019

work page arXiv 1901
[41]

Li and A

S. Li and A. Calway. RGBD Relocalisation Using Pairwise Geometry and Concise Key Point Sets. In ICRA, 2015

work page 2015
[42]

X. Li, J. Ylioinas, and J. Kannala. Full-Frame Scene Co- ordinate Regression for Image-Based Localization. In RSS, 2018

work page 2018
[43]

X. Li, J. Ylioinas, J. Verbeek, and J. Kannala. Scene Coor- dinate Regression with Angle-Based Reprojection Loss for Camera Relocalization. In ECCV, 2018

work page 2018
[44]

G. Lu, Y . Yan, A. Kolagunda, and C. Kambhamettu. A Fast 3D Indoor-Localization Approach Based on Video Queries. In MultiMedia Modeling, pages 218–230, 2016

work page 2016
[45]

D. W. Marquardt. An Algorithm for Least-Squares Estima- tion of Nonlinear Parameters. SIAP, 11(2), 1963

work page 1963
[46]

Massiceti, A

D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. S. Torr. Random Forests versus Neural Networks – What’s Best for Camera Localization? In ICRA, 2017

work page 2017
[47]

Melekhov, J

I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Image- based Localization using Hourglass Networks. In ICCV-W, 2017

work page 2017
[48]

L. Meng, J. Chen, F. Tung, J. J. Little, and C. W. de Silva. Exploiting Random RGB and Sparse Features for Camera Pose Estimation. In BMVC, 2016

work page 2016
[49]

L. Meng, J. Chen, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva. Backtracking Regression Forests for Ac- curate Camera Relocalization. In IROS, 2017

work page 2017
[50]

L. Meng, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva. Exploiting Points and Lines in Regression Forests for RGB- D Camera Relocalization. In IROS, 2018

work page 2018
[51]

Mur-Artal, J

R. Mur-Artal, J. M. M. Montiel, and J. D. Tard ´os. ORB- SLAM: A Versatile and Accurate Monocular SLAM System. RO, 31(5):1147–1163, October 2015

work page 2015
[52]

Mur-Artal and J

R. Mur-Artal and J. D. Tard ´os. Fast Relocalisation and Loop Closing in Keyframe-Based SLAM. In ICRA, 2014

work page 2014
[53]

R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In ISMAR, pages 127–136, 2011

work page 2011
[54]

Nießner, M

M. Nießner, M. Zollh ¨ofer, S. Izadi, and M. Stamminger. Real-time 3D Reconstruction at Scale using V oxel Hashing. TOG, 32(6), 2013

work page 2013
[55]

Paucher and M

R. Paucher and M. Turk. Location-based augmented reality on mobile phones. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, pages 9–16, 2010

work page 2010
[56]

V . A. Prisacariu, O. K ¨ahler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. S. Torr, and D. W. Murray. InﬁniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure. arXiv:1708.00783v1, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry

N. Radwan*, A. Valada*, and W. Burgard. VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry. arXiv:1804.08366v4, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[58]

N. L. Rodas, F. Barrera, and N. Padoy. Marker-less AR in the Hybrid Room using Equipment Detection for Camera Relo- calization. In MICCAI, pages 463–470, 2015

work page 2015
[59]

Sattler, B

T. Sattler, B. Leibe, and L. Kobbelt. Efﬁcient & Effective Prioritized Matching for Large-Scale Image-Based Localiza- tion. TPAMI, 9, 2017

work page 2017
[60]

Sattler, Q

T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taix ´e. Un- derstanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, 2019

work page 2019
[61]

Schmidt, R

T. Schmidt, R. Newcombe, and D. Fox. Self-supervised Vi- sual Descriptor Learning for Dense Correspondence. RA-L, 2(2):420–427, 2017

work page 2017
[62]

J. L. Sch ¨onberger, M. Pollefeys, A. Geiger, and T. Sattler. Semantic Visual Localization. In CVPR, 2018

work page 2018
[63]

Shotton, B

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, pages 2930–2937, 2013

work page 2013
[64]

Simonyan and A

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015

work page 2015
[65]

Taira, M

H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor Visual Local- ization with Dense Matching and View Synthesis. In CVPR, 2018

work page 2018
[66]

Available online (as of 10th May

TorchVision.Models. Available online (as of 10th May

work page
[67]

at https://pytorch.org/docs/stable/ torchvision/models.html

work page
[68]

Valada*, N

A. Valada*, N. Radwan*, and W. Burgard. Deep Auxiliary Learning for Visual Localization and Odometry. In ICRA, 2018

work page 2018
[69]

Valentin, A

J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin. Learning to Navigate the Energy Landscape. In 3DV, 2016

work page 2016
[70]

Valentin, M

J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. Torr. Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization. In CVPR, 2015

work page 2015
[71]

Walch, C

F. Walch, C. Hazirbas, L. Leal-Taix ´e, T. Sattler, S. Hilsen- beck, and D. Cremers. Image-based localization using LSTMs for structured feature correlation. In ICCV, pages 627–637, 2017

work page 2017
[72]

Williams, G

B. Williams, G. Klein, and I. Reid. Automatic Relocalization and Loop Closing for Real-Time Monocular SLAM.TPAMI, 33(9):1699–1712, September 2011

work page 2011
[73]

J. Wu, L. Ma, and X. Hu. Delving Deeper into Convolutional Neural Networks for Camera Relocalization. In ICRA, 2017. Chess Fire Ofﬁce Pumpkin Kitchen Stairs Raw 72.50% 41.50% 53.38% 44.40% 39.90% 1.20% 0.032m/1.495◦ 0.061m/2.724◦ 0.046m/1.804◦ 0.060m/1.865◦ 0.068m/2.255◦ 0.528m/6.487◦ + ICP 98.35% 76.65% 84.05% 74.10% 70.90% 26.10% 0.013m/1.034◦ 0.009m/1....

work page 2017

[1] [1]

Acharya, K

D. Acharya, K. Khoshelham, and S. Winter. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Pho- togrammetry and Remote Sensing, 150:245–258, 2019

work page 2019

[2] [2]

H. Bae, M. Walker, J. White, Y . Pan, Y . Sun, and M. Golparvar-Fard. Fast and scalable structure-from-motion based localization for high-precision mobile augmented real- ity systems. The Journal of Mobile User Experience, 5(1):1– 21, 2016

work page 2016

[3] [3]

Balntas, S

V . Balntas, S. Li, and V . Prisacariu. RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. InECCV, 2018

work page 2018

[4] [4]

P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. TPAMI, 14(2):239–256, February 1992

work page 1992

[5] [5]

Brachmann, A

E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC – Differentiable RANSAC for Camera Localization. In CVPR, 2017

work page 2017

[6] [6]

Brachmann, F

E. Brachmann, F. Michel, A. Krull, M. Y . Yang, S. Gumhold, and C. Rother. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, 2016

work page 2016

[7] [7]

Brachmann and C

E. Brachmann and C. Rother. Learning Less is More – 6D Camera Localization via 3D Surface Regression. In CVPR, 2018

work page 2018

[8] [8]

Brachmann and C

E. Brachmann and C. Rother. Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses. arXiv:1905.04132v1, 2019

work page arXiv 1905

[9] [9]

Brahmbhatt, J

S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-Aware Learning of Maps for Camera Localiza- tion. In CVPR, pages 2616–2625, 2018

work page 2018

[10] [10]

M. Bui, C. Baur, N. Navab, S. Ilic, and S. Albarqouni. Adver- sarial Joint Image and Pose Distribution Learning for Cam- era Pose Regression and Reﬁnement. arXiv:1903.06646v2, 2019

work page arXiv 1903

[11] [11]

Castle, G

R. Castle, G. Klein, and D. W. Murray. Video-rate Local- ization in Multiple Maps for Wearable Augmented Reality. In IEEE International Symposium on Wearable Computers , pages 15–22, 2008

work page 2008

[12] [12]

Cavallari*, S

T. Cavallari*, S. Golodetz*, N. A. Lord*, J. Valentin*, V . A. Prisacariu, L. D. Stefano, and P. H. S. Torr. Real-Time RGB- D Camera Pose Estimation in Novel Scenes using a Relocal- isation Cascade. TPAMI, Early Access, 2019

work page 2019

[13] [13]

Cavallari, S

T. Cavallari, S. Golodetz*, N. A. Lord*, J. Valentin, L. D. Stefano, and P. H. S. Torr. On-the-Fly Adaptation of Regres- sion Forests for Online Camera Relocalisation. In CVPR, 2017

work page 2017

[14] [14]

O. Chum, J. Matas, and J. Kittler. Locally Optimized RANSAC. In Joint Pattern Recognition Symposium, pages 236–243, 2003

work page 2003

[15] [15]

Clark, S

R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video- Clip Relocalization. In CVPR, pages 6856–6864, 2017

work page 2017

[16] [16]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, pages 248–255, 2009

work page 2009

[17] [17]

L. Deng, Z. Chen, B. Chen, Y . Duan, and J. Zhou. Incremen- tal image set querying based localization. Neurocomputing, 2016

work page 2016

[18] [18]

Duong, A

N.-D. Duong, A. Kacete, C. Sodalie, P.-Y . Richard, and J. Royan. xyzNet: Towards Machine Learning Camera Relo- calization by Using a Scene Coordinate Prediction Network. In ISMAR, 2018

work page 2018

[19] [19]

Y . Feng, Y . Wu, and L. Fan. Real-time SLAM relocalization with online learning of binary feature indexing. Machine Vision and Applications, 28(8):953–963, 2017

work page 2017

[20] [20]

M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. CACM, 24(6), 1981

work page 1981

[21] [21]

Fulkerson and S

B. Fulkerson and S. Soatto. Really quick shift: Image seg- mentation on a GPU. In ECCV, pages 350–358, 2010

work page 2010

[22] [22]

G ´alvez-L´opez and J

D. G ´alvez-L´opez and J. D. Tard ´os. Real-Time Loop De- tection with Bags of Binary Words. In IROS, pages 51–58, 2011

work page 2011

[23] [23]

A. P. Gee and W. Mayol-Cuevas. 6D Relocalisation for RGBD Cameras Using Synthetic View Regression. In BMVC, 2012

work page 2012

[24] [24]

Glocker, J

B. Glocker, J. Shotton, A. Criminisi, and S. Izadi. Real- Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding. TVCG, 21(5), 2015

work page 2015

[25] [25]

Golodetz*, T

S. Golodetz*, T. Cavallari*, N. A. Lord*, V . A. Priscariu, D. W. Murray, and P. H. S. Torr. Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Op- timisation. TVCG, 24(11):2895–2905, 2018

work page 2018

[26] [26]

SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes

S. Golodetz*, M. Sapienza*, J. P. C. Valentin, V . Vineet, M.- M. Cheng, A. Arnab, V . A. Prisacariu, O. K¨ahler, C. Y . Ren, D. W. Murray, S. Izadi, and P. H. S. Torr. SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes. Technical Report TVG-2015-1, Department of Engineering Science, University of Oxford, October 2015. Released as ar...

work page internal anchor Pith review Pith/arXiv arXiv 2015

[27] [27]

Guzman-Rivera, P

A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi. Multi-Output Learn- ing for Camera Relocalization. In CVPR, pages 1114–1121, 2014

work page 2014

[28] [28]

Hartley and A

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004

work page 2004

[29] [29]

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016

work page 2016

[30] [30]

Ioffe and C

S. Ioffe and C. Szegedy. Batch Normalization: Accelerat- ing Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015

work page 2015

[31] [31]

W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallogra- phy, 32(5):922–923, 1976

work page 1976

[32] [32]

Kacete, T

A. Kacete, T. Wentz, and J. Royan. Decision Forest For Ef- ﬁcient and Robust Camera Relocalization. In ISMAR, pages 20–24, 2017

work page 2017

[33] [33]

K ¨ahler, V

O. K ¨ahler, V . A. Prisacariu, and D. W. Murray. Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure. In ECCV, pages 500–516, 2016

work page 2016

[34] [34]

Kendall and R

A. Kendall and R. Cipolla. Modelling Uncertainty in Deep Learning for Camera Relocalization. In ICRA, 2016

work page 2016

[35] [35]

Kendall and R

A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. In CVPR, pages 5974–5983, 2017

work page 2017

[36] [36]

Kendall, M

A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A Convo- lutional Network for Real-Time 6-DOF Camera Relocaliza- tion. In ICCV, pages 2938–2946, 2015

work page 2015

[37] [37]

D. P. Kingma* and J. L. Ba*. Adam: A Method for Stochas- tic Optimization. In ICLR, 2015

work page 2015

[38] [38]

Laskar*, I

Z. Laskar*, I. Melekhov*, S. Kalia, and J. Kannala. Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. In ICCV-W, pages 929–938, 2017

work page 2017

[39] [39]

Levenberg

K. Levenberg. A Method for the Solution of Certain Prob- lems in Least Squares. QAM, 2(2):164–168, 1944

work page 1944

[40] [40]

Q. Li, J. Zhu, R. Cao, K. Sun, J. M. Garibaldi, Q. Li, B. Liu, and G. Qiu. Relative Geometry-Aware Siamese Neural Network for 6DOF Camera Relocalization. arXiv:1901.01049v2, 2019

work page arXiv 1901

[41] [41]

Li and A

S. Li and A. Calway. RGBD Relocalisation Using Pairwise Geometry and Concise Key Point Sets. In ICRA, 2015

work page 2015

[42] [42]

X. Li, J. Ylioinas, and J. Kannala. Full-Frame Scene Co- ordinate Regression for Image-Based Localization. In RSS, 2018

work page 2018

[43] [43]

X. Li, J. Ylioinas, J. Verbeek, and J. Kannala. Scene Coor- dinate Regression with Angle-Based Reprojection Loss for Camera Relocalization. In ECCV, 2018

work page 2018

[44] [44]

G. Lu, Y . Yan, A. Kolagunda, and C. Kambhamettu. A Fast 3D Indoor-Localization Approach Based on Video Queries. In MultiMedia Modeling, pages 218–230, 2016

work page 2016

[45] [45]

D. W. Marquardt. An Algorithm for Least-Squares Estima- tion of Nonlinear Parameters. SIAP, 11(2), 1963

work page 1963

[46] [46]

Massiceti, A

D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. S. Torr. Random Forests versus Neural Networks – What’s Best for Camera Localization? In ICRA, 2017

work page 2017

[47] [47]

Melekhov, J

I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Image- based Localization using Hourglass Networks. In ICCV-W, 2017

work page 2017

[48] [48]

L. Meng, J. Chen, F. Tung, J. J. Little, and C. W. de Silva. Exploiting Random RGB and Sparse Features for Camera Pose Estimation. In BMVC, 2016

work page 2016

[49] [49]

L. Meng, J. Chen, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva. Backtracking Regression Forests for Ac- curate Camera Relocalization. In IROS, 2017

work page 2017

[50] [50]

L. Meng, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva. Exploiting Points and Lines in Regression Forests for RGB- D Camera Relocalization. In IROS, 2018

work page 2018

[51] [51]

Mur-Artal, J

R. Mur-Artal, J. M. M. Montiel, and J. D. Tard ´os. ORB- SLAM: A Versatile and Accurate Monocular SLAM System. RO, 31(5):1147–1163, October 2015

work page 2015

[52] [52]

Mur-Artal and J

R. Mur-Artal and J. D. Tard ´os. Fast Relocalisation and Loop Closing in Keyframe-Based SLAM. In ICRA, 2014

work page 2014

[53] [53]

R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In ISMAR, pages 127–136, 2011

work page 2011

[54] [54]

Nießner, M

M. Nießner, M. Zollh ¨ofer, S. Izadi, and M. Stamminger. Real-time 3D Reconstruction at Scale using V oxel Hashing. TOG, 32(6), 2013

work page 2013

[55] [55]

Paucher and M

R. Paucher and M. Turk. Location-based augmented reality on mobile phones. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, pages 9–16, 2010

work page 2010

[56] [56]

V . A. Prisacariu, O. K ¨ahler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. S. Torr, and D. W. Murray. InﬁniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure. arXiv:1708.00783v1, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[57] [57]

VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry

N. Radwan*, A. Valada*, and W. Burgard. VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry. arXiv:1804.08366v4, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[58] [58]

N. L. Rodas, F. Barrera, and N. Padoy. Marker-less AR in the Hybrid Room using Equipment Detection for Camera Relo- calization. In MICCAI, pages 463–470, 2015

work page 2015

[59] [59]

Sattler, B

T. Sattler, B. Leibe, and L. Kobbelt. Efﬁcient & Effective Prioritized Matching for Large-Scale Image-Based Localiza- tion. TPAMI, 9, 2017

work page 2017

[60] [60]

Sattler, Q

T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taix ´e. Un- derstanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, 2019

work page 2019

[61] [61]

Schmidt, R

T. Schmidt, R. Newcombe, and D. Fox. Self-supervised Vi- sual Descriptor Learning for Dense Correspondence. RA-L, 2(2):420–427, 2017

work page 2017

[62] [62]

J. L. Sch ¨onberger, M. Pollefeys, A. Geiger, and T. Sattler. Semantic Visual Localization. In CVPR, 2018

work page 2018

[63] [63]

Shotton, B

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, pages 2930–2937, 2013

work page 2013

[64] [64]

Simonyan and A

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015

work page 2015

[65] [65]

Taira, M

H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor Visual Local- ization with Dense Matching and View Synthesis. In CVPR, 2018

work page 2018

[66] [66]

Available online (as of 10th May

TorchVision.Models. Available online (as of 10th May

work page

[67] [67]

at https://pytorch.org/docs/stable/ torchvision/models.html

work page

[68] [68]

Valada*, N

A. Valada*, N. Radwan*, and W. Burgard. Deep Auxiliary Learning for Visual Localization and Odometry. In ICRA, 2018

work page 2018

[69] [69]

Valentin, A

J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin. Learning to Navigate the Energy Landscape. In 3DV, 2016

work page 2016

[70] [70]

Valentin, M

J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. Torr. Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization. In CVPR, 2015

work page 2015

[71] [71]

Walch, C

F. Walch, C. Hazirbas, L. Leal-Taix ´e, T. Sattler, S. Hilsen- beck, and D. Cremers. Image-based localization using LSTMs for structured feature correlation. In ICCV, pages 627–637, 2017

work page 2017

[72] [72]

Williams, G

B. Williams, G. Klein, and I. Reid. Automatic Relocalization and Loop Closing for Real-Time Monocular SLAM.TPAMI, 33(9):1699–1712, September 2011

work page 2011

[73] [73]

J. Wu, L. Ma, and X. Hu. Delving Deeper into Convolutional Neural Networks for Camera Relocalization. In ICRA, 2017. Chess Fire Ofﬁce Pumpkin Kitchen Stairs Raw 72.50% 41.50% 53.38% 44.40% 39.90% 1.20% 0.032m/1.495◦ 0.061m/2.724◦ 0.046m/1.804◦ 0.060m/1.865◦ 0.068m/2.255◦ 0.528m/6.487◦ + ICP 98.35% 76.65% 84.05% 74.10% 70.90% 26.10% 0.013m/1.034◦ 0.009m/1....

work page 2017