pith. machine review for the scientific record.

arxiv: 2604.27448 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

LA-Pose: Latent Action Pretraining Meets Pose Estimation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 08:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera pose estimation · self-supervised pretraining · latent actions · inverse dynamics · driving videos · feed-forward inference · 3D annotations

The pith

Repurposing latent action features from self-supervised pretraining on unlabeled driving videos enables accurate camera pose estimation with far less labeled 3D data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to pretrain on large collections of unlabeled driving videos using inverse- and forward-dynamics models to extract latent action representations. These representations are then fed as features into a camera pose estimator that requires only a small set of high-quality 3D annotations for finetuning. The resulting system runs in a single feed-forward pass and delivers higher accuracy than recent supervised methods on standard driving benchmarks. A sympathetic reader would care because the method cuts the need for expensive 3D labels by orders of magnitude while keeping feed-forward inference efficiency. This points to a practical route for scaling pose estimation without proportional growth in annotation effort.
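To make the two-stage recipe concrete, here is a minimal PyTorch sketch of a Genie-style inverse/forward-dynamics pair trained on frame transitions. Everything below (the module names, the linear stand-in for a frame encoder, all dimensions) is an illustrative assumption, not the paper's architecture.

```python
# Hedged sketch of latent-action pretraining, assuming a Genie-style
# inverse/forward-dynamics pair. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Infer a latent action z_t from two consecutive frame features."""
    def __init__(self, feat_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, f_t, f_next):
        return self.net(torch.cat([f_t, f_next], dim=-1))

class ForwardDynamics(nn.Module):
    """Predict next-frame features from (f_t, z_t); supervises z_t."""
    def __init__(self, feat_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 512), nn.GELU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, f_t, z_t):
        return self.net(torch.cat([f_t, z_t], dim=-1))

encoder = nn.Linear(3 * 64 * 64, 256)  # stand-in for a ViT frame encoder
inv_dyn, fwd_dyn = InverseDynamics(), ForwardDynamics()
opt = torch.optim.AdamW(
    [*encoder.parameters(), *inv_dyn.parameters(), *fwd_dyn.parameters()],
    lr=1e-4,
)

frames = torch.randn(8, 2, 3 * 64 * 64)  # (batch, frame pair, pixels)
opt.zero_grad()
f_t, f_next = encoder(frames[:, 0]), encoder(frames[:, 1])
z_t = inv_dyn(f_t, f_next)  # latent action describing the transition
loss = nn.functional.mse_loss(fwd_dyn(f_t, z_t), f_next.detach())
loss.backward()
opt.step()
```

No pose labels appear anywhere in this stage; the only supervision signal is the next frame itself.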

Core claim

LA-Pose trains inverse- and forward-dynamics models on large-scale unlabeled driving videos to obtain latent action representations, then uses those representations as input features for a camera pose estimator that is finetuned on only a limited set of high-quality 3D annotations, producing accurate and generalizable feed-forward pose predictions that exceed the accuracy of recent fully supervised feed-forward methods on the Waymo and PandaSet benchmarks.

What carries the argument

Latent action representations learned by inverse- and forward-dynamics models on unlabeled videos, repurposed as input features to a finetuned pose estimator.
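A hedged sketch of that repurposing, assuming the pretrained dynamics model is frozen and a small head maps its latents to a 6-DoF pose (three translation plus three axis-angle rotation components — our parameterization choice, with a Huber loss as a placeholder):

```python
# Sketch of the post-training stage: frozen latent actions in, pose out.
import torch
import torch.nn as nn

pose_head = nn.Sequential(
    nn.Linear(32, 256), nn.GELU(),
    nn.Linear(256, 6),  # [tx, ty, tz, rx, ry, rz]
)
opt = torch.optim.AdamW(pose_head.parameters(), lr=3e-4)

# z: latent actions from the frozen pretrained model;
# gt_pose: the limited set of high-quality 3D annotations.
z = torch.randn(64, 32)
gt_pose = torch.randn(64, 6)

opt.zero_grad()
pred = pose_head(z)  # a single feed-forward pass at inference time
loss = nn.functional.huber_loss(pred, gt_pose)
loss.backward()
opt.step()
```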

If this is right

  • Camera pose can be estimated accurately and efficiently in a single feed-forward pass using the prelearned features.
  • The approach matches or exceeds state-of-the-art accuracy on driving benchmarks while using orders of magnitude less labeled 3D data.
  • On Waymo and PandaSet, pose accuracy rises by more than 10 percent relative to recent feed-forward methods.
  • This constitutes the first demonstration that inverse-dynamics self-supervised learning can be applied successfully to pose estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-dynamics pretraining could supply useful features for related geometric tasks such as depth estimation or visual odometry.
  • Because the pretraining uses only unlabeled video, the method could transfer to other video domains where 3D labels are scarce.
  • Separating large-scale unlabeled pretraining from small-scale labeled finetuning offers a template for scaling other perception models in robotics and autonomous systems.

Load-bearing premise

The latent action features learned from unlabeled driving videos remain sufficiently informative and transferable to camera pose estimation after finetuning on a limited set of 3D annotations.

What would settle it

If a pose estimator built on these latent action features shows no accuracy gain over a baseline trained from scratch on the same small set of 3D annotations, or fails to exceed recent feed-forward methods by more than 10 percent on the Waymo and PandaSet benchmarks, the central claim would be falsified.
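Stated as code, the two failure conditions reduce to a simple check; the relative reading of "more than 10 percent" is our assumption, not the paper's stated metric definition:

```python
def claim_holds(acc_latent, acc_scratch, acc_prior):
    """acc_latent: accuracy of the head built on pretrained latents;
    acc_scratch: the same head trained from scratch on the same labels;
    acc_prior: best recent feed-forward method on the benchmark."""
    gains_over_scratch = acc_latent > acc_scratch
    beats_prior_by_10pct = (acc_latent - acc_prior) / acc_prior > 0.10
    return gains_over_scratch and beats_prior_by_10pct
```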

Figures

Figures reproduced from arXiv: 2604.27448 by Matthew Brown, Prajwal Chidananda, Pujith Kachana, Samuel Li, Saurabh Nair, Yasutaka Furukawa, Zhengqing Wang.

Figure 1. Overview of LA-Pose. We introduce a two-stage framework that unifies large-scale latent action pretraining with camera pose estimation. From millions of unlabeled driving videos, an inverse–forward dynamics model learns latent actions that encode inter-frame motion in a fully self-supervised manner. When visualized in t-SNE space, these latent actions exhibit structured clusters that align closely with tru…

Figure 2. Our framework consists of two stages: latent action pretraining and camera pose post-training. In the pretraining stage (top), an…

Figure 3. Qualitative results of camera pose estimation. Comparison of predicted camera trajectories: Ours (green), Rig3R […

Figure 4. Distribution of pose estimation AUC@5 for LA-Pose…

Figure 6. Failure case under reverse motion. Performance degrades when the vehicle moves backward, a rare condition in the supervised training set. Despite this distribution gap, the pretrained backbone still produces partially consistent trajectories.

Figure 7. Qualitative results under low frame rate (1 fps) on Waymo. Each example shows camera poses projected onto the xz plane, with frustums drawn at frames 0, 5, 10, and 15. LA-Pose (green) maintains stable and temporally consistent motion across the sequence, whereas VGGT [31] (cyan) exhibits noticeable drift and discontinuities under sparse temporal sampling.

Figure 8. Qualitative results on OpenDV–YouTube. Each example shows scenes from diverse cities and viewpoints collected from online YouTube driving videos. LA-Pose produces stable and temporally consistent trajectories across a wide variety of conditions, including urban streets, highways, and curved mountain roads. The results qualitatively demonstrate strong generalization from our pre-trained backbone to uncalibr…
read the original abstract

This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LA-Pose, which pretrains latent action representations via inverse- and forward-dynamics models on large-scale unlabeled driving videos (in the style of Genie). These latents are repurposed as direct inputs to a feed-forward camera pose estimator that is finetuned on a limited set of high-quality 3D annotations. The central claim is that this yields over 10% higher pose accuracy than recent feed-forward methods on the Waymo and PandaSet benchmarks while using orders of magnitude less labeled data, positioning the work as the first demonstration of inverse-dynamics self-supervised learning for pose estimation.

Significance. If the experimental claims hold after rigorous validation, the result would be significant: it would show that dynamics-based self-supervised pretraining on unlabeled video can produce transferable features for metric 3D pose estimation, offering a scalable path to reduce dependence on costly 3D annotations in autonomous-driving settings.

major comments (2)
  1. [Abstract] Abstract and experimental section: the reported >10% accuracy lift on Waymo/PandaSet is presented without accompanying baselines, error bars, ablation tables, or statistical significance tests. Because the central claim rests on this quantitative improvement being attributable to the latent-action pretraining, the absence of these controls makes it impossible to rule out that gains arise from architecture choices or finetuning details alone.
  2. [Method] Method and experiments: the transferability assumption—that inverse/forward-dynamics latents encode 3D geometric structure (metric depth, scale, absolute orientation) rather than 2D appearance or relative-motion proxies—is load-bearing for the claim yet unsupported by feature visualizations, nearest-neighbor retrievals, or controlled ablations that isolate the pretraining contribution from the finetuning stage.
minor comments (1)
  1. [Abstract] Clarify the precise definition of 'pose accuracy' (e.g., translation/rotation error thresholds, median vs. mean) used for the 10% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested controls and evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental section: the reported >10% accuracy lift on Waymo/PandaSet is presented without accompanying baselines, error bars, ablation tables, or statistical significance tests. Because the central claim rests on this quantitative improvement being attributable to the latent-action pretraining, the absence of these controls makes it impossible to rule out that gains arise from architecture choices or finetuning details alone.

    Authors: We agree that the current presentation of the >10% improvement would be strengthened by additional controls. In the revised version we will add full baseline tables comparing against the cited feed-forward methods, error bars computed over multiple random seeds, ablation tables that remove the pretraining stage while keeping the pose estimator architecture and finetuning protocol identical, and paired statistical significance tests (e.g., Wilcoxon signed-rank) on the Waymo and PandaSet metrics (a minimal sketch of such a test appears after these responses). These additions will make explicit that the reported gains are attributable to the latent-action pretraining rather than other design choices. revision: yes

  2. Referee: [Method] Method and experiments: the transferability assumption—that inverse/forward-dynamics latents encode 3D geometric structure (metric depth, scale, absolute orientation) rather than 2D appearance or relative-motion proxies—is load-bearing for the claim yet unsupported by feature visualizations, nearest-neighbor retrievals, or controlled ablations that isolate the pretraining contribution from the finetuning stage.

    Authors: We acknowledge that direct evidence for the geometric content of the learned latents is currently indirect. We will add (i) t-SNE and PCA visualizations of the latent action features colored by ground-truth depth and orientation (a hedged sketch of such a plot appears after these responses), (ii) nearest-neighbor retrievals in latent space that retrieve frames with similar metric 3D structure, and (iii) controlled ablations that train the identical pose estimator from scratch versus from the pretrained latents, keeping all other hyperparameters fixed. These experiments will isolate the pretraining contribution and directly test the transferability assumption. revision: yes
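For the promised significance testing, a minimal sketch of a paired Wilcoxon signed-rank test over matched per-sequence scores, using scipy; the numbers are illustrative placeholders, not results from the paper:

```python
# Paired nonparametric test over matched per-sequence accuracies.
from scipy.stats import wilcoxon

la_pose  = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]  # per-sequence AUC@5
baseline = [0.80, 0.82, 0.85, 0.79, 0.83, 0.81]  # same sequences
stat, p = wilcoxon(la_pose, baseline)
print(f"W={stat:.1f}, p={p:.4f}")  # small p: gain unlikely to be noise
```

And a sketch of the promised latent-space visualization: t-SNE on the latent actions, colored by a ground-truth motion quantity (forward speed is our stand-in choice; the rebuttal proposes depth and orientation):

```python
# Embed latent actions in 2D and color by a known motion variable.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

latents = np.random.randn(1000, 32)  # placeholder pretrained latents
speed = np.random.rand(1000)         # placeholder ground-truth speed
xy = TSNE(n_components=2, perplexity=30).fit_transform(latents)
plt.scatter(xy[:, 0], xy[:, 1], c=speed, s=4, cmap="viridis")
plt.colorbar(label="ground-truth forward speed")
plt.title("Latent actions, t-SNE")
plt.show()
```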

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper's core approach pretrains latent action representations via inverse- and forward-dynamics models on unlabeled driving videos (separate from the target task), then finetunes a pose estimator on a distinct set of high-quality 3D annotations before evaluating on held-out benchmarks (Waymo, PandaSet). No equations, predictions, or first-principles derivations are presented that reduce the reported accuracy gains to fitted parameters or self-referential definitions by construction. The central empirical claim rests on external benchmark performance after distinct pretraining and finetuning stages, with no load-bearing self-citations or uniqueness theorems invoked to force the result. This is a standard self-contained empirical pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of latent action features learned from video dynamics to the pose task; no explicit free parameters, axioms, or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Latent action representations extracted from inverse- and forward-dynamics models on driving videos capture motion information useful for camera pose estimation.
    This transfer assumption is required for the finetuning stage to succeed with limited labels.

pith-pipeline@v0.9.0 · 5535 in / 1324 out tokens · 76666 ms · 2026-05-07T08:53:25.342549+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

  4. [4]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

  6. [6]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  7. [7]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  8. [8]

    Scaling 4D representations

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, et al. Scaling 4D representations. arXiv preprint arXiv:2412.15212, 2024.

  9. [9]

    Argoverse: 3d tracking and forecasting with rich maps

    Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.

  10. [10]

    Dynamo: In-domain dynamics pretraining for visuo-motor control

    Zichen Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. Dynamo: In-domain dynamics pretraining for visuo-motor control. Advances in Neural Information Processing Systems, 37:33933–33961, 2024.

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.

  12. [12]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Gao et al. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024.

  13. [13]

    Unsupervised learning of depth and ego-motion from video

    Zhou et al. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  14. [14]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024.

  15. [15]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025.

  16. [16]

    Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generalizable ego-vision multimodal …

  17. [17]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

  18. [18]

    Rayzer: A self-supervised large view synthesis model

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, and Georgios Pavlakos. Rayzer: A self-supervised large view synthesis model. 2025.

  19. [19]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  20. [20]

    Posenet: A convolutional network for real-time 6-dof camera relocalization, 2016

    Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization, 2016.

  21. [21]

    Rig3R: Rig-aware conditioning for learned 3D reconstruction

    Samuel Li, Pujith Kachana, Prajwal Chidananda, Saurabh Nair, Yasutaka Furukawa, and Matthew Brown. Rig3R: Rig-aware conditioning for learned 3D reconstruction. arXiv preprint arXiv:2506.02265, 2025.

  22. [22]

    True self-supervised novel view synthesis is transferable

    Thomas W. Mitchel, Hyunwoo Ryu, and Vincent Sitzmann. True self-supervised novel view synthesis is transferable.

  23. [23]

    Cosmos world foundation model platform for physical ai, 2025

    NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Ji…

  24. [24]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. GAIA-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025.

  25. [25]

    Learning to act without actions, 2024

    Dominik Schmidt and Minqi Jiang. Learning to act without actions, 2024.

  26. [26]

    Structure-from-motion revisited

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  27. [27]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

  28. [28]

    Latent action pretraining through world modeling

    Bahey Tharwat, Yara Nasser, Ali Abouzeid, and Ian Reid. Latent action pretraining through world modeling. arXiv preprint arXiv:2509.18428, 2025.

  29. [29]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022.

  30. [30]

    Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment, 2024

    Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment, 2024.

  31. [31]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  32. [32]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

  33. [33]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.

  34. [34]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  35. [35]

    π3: Scalable permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning, 2025.

  36. [36]

    Pandaset: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101. IEEE, 2021.

  37. [37]

    Genad: Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Genad: Generalized predictive model for autonomous driving. arXiv preprint arXiv:2403.09630, 2024.

  38. [38]

    Fast3r: Towards 3D reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  39. [39]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024.

  40. [40]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025.

  41. [41]

    Cameras as rays: Pose estimation via ray diffusion

    Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion, 2024.

  42. [42]

    Qualitative Results under Low Frame Rate. Figure 7 presents additional qualitative comparisons between LA-Pose and VGGT [31] under the low frame rate (1 fps) setting on the Waymo dataset. All visualizations follow the same protocol as in the main paper, where predicted camera trajectories are projected to the xz plane with camera frustums shown at fram…

  43. [43]

    The OpenDV–YouTube dataset [37] is a large-scale collection of unconstrained driving videos gathered from public YouTube channels

    Qualitative Results on OpenDV–YouTube. Figure 8 shows qualitative results of LA-Pose on the OpenDV–YouTube dataset [37]. The OpenDV–YouTube dataset [37] is a large-scale collection of unconstrained driving videos gathered from public YouTube channels. It forms the main component of OpenDV-2K, spanning over 1700 hours of front-view recordings captured acr…

  44. [44]

    We categorize trajectories into bins based on curvature and acceleration to examine model performance under different motion regimes

    Failure Mode Analysis. To further understand the limitations of our method, we analyze pose estimation performance across different trajectory curvatures and accelerations on the Waymo validation set. We categorize trajectories into bins based on curvature and acceleration to examine model performance under different motion regimes. Curvature is defi…