pith. machine review for the scientific record.

arxiv: 2604.27448 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

LA-Pose: Latent Action Pretraining Meets Pose Estimation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 08:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera pose estimation · self-supervised pretraining · latent actions · inverse dynamics · driving videos · feed-forward inference · 3D annotations

The pith

Repurposing latent action features from self-supervised pretraining on unlabeled driving videos enables accurate camera pose estimation with far less labeled 3D data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to pretrain on large collections of unlabeled driving videos using inverse- and forward-dynamics models to extract latent action representations. These representations are then fed as features into a camera pose estimator that requires only a small set of high-quality 3D annotations for finetuning. The resulting system runs in a single feed-forward pass and delivers higher accuracy than recent supervised methods on standard driving benchmarks. A sympathetic reader would care because the method cuts the need for expensive 3D labels by orders of magnitude while keeping feed-forward inference efficiency. This points to a practical route for scaling pose estimation without proportional growth in annotation effort.
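To make the two-stage recipe concrete, here is a minimal PyTorch sketch of a Genie-style inverse/forward-dynamics pair trained on frame transitions. Everything below (the module names, the linear stand-in for a frame encoder, all dimensions) is an illustrative assumption, not the paper's architecture.

```python
# Hedged sketch of latent-action pretraining, assuming a Genie-style
# inverse/forward-dynamics pair. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Infer a latent action z_t from two consecutive frame features."""
    def __init__(self, feat_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, f_t, f_next):
        return self.net(torch.cat([f_t, f_next], dim=-1))

class ForwardDynamics(nn.Module):
    """Predict next-frame features from (f_t, z_t); supervises z_t."""
    def __init__(self, feat_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 512), nn.GELU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, f_t, z_t):
        return self.net(torch.cat([f_t, z_t], dim=-1))

encoder = nn.Linear(3 * 64 * 64, 256)  # stand-in for a ViT frame encoder
inv_dyn, fwd_dyn = InverseDynamics(), ForwardDynamics()
opt = torch.optim.AdamW(
    [*encoder.parameters(), *inv_dyn.parameters(), *fwd_dyn.parameters()],
    lr=1e-4,
)

frames = torch.randn(8, 2, 3 * 64 * 64)  # (batch, frame pair, pixels)
opt.zero_grad()
f_t, f_next = encoder(frames[:, 0]), encoder(frames[:, 1])
z_t = inv_dyn(f_t, f_next)  # latent action describing the transition
loss = nn.functional.mse_loss(fwd_dyn(f_t, z_t), f_next.detach())
loss.backward()
opt.step()
```

No pose labels appear anywhere in this stage; the only supervision signal is the next frame itself.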

Core claim

LA-Pose trains inverse- and forward-dynamics models on large-scale unlabeled driving videos to obtain latent action representations, then uses those representations as input features for a camera pose estimator that is finetuned on only a limited set of high-quality 3D annotations, producing accurate and generalizable feed-forward pose predictions that exceed the accuracy of recent fully supervised feed-forward methods on the Waymo and PandaSet benchmarks.

What carries the argument

Latent action representations learned by inverse- and forward-dynamics models on unlabeled videos, repurposed as input features to a finetuned pose estimator.
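A hedged sketch of that repurposing, assuming the pretrained dynamics model is frozen and a small head maps its latents to a 6-DoF pose (three translation plus three axis-angle rotation components — our parameterization choice, with a Huber loss as a placeholder):

```python
# Sketch of the post-training stage: frozen latent actions in, pose out.
import torch
import torch.nn as nn

pose_head = nn.Sequential(
    nn.Linear(32, 256), nn.GELU(),
    nn.Linear(256, 6),  # [tx, ty, tz, rx, ry, rz]
)
opt = torch.optim.AdamW(pose_head.parameters(), lr=3e-4)

# z: latent actions from the frozen pretrained model;
# gt_pose: the limited set of high-quality 3D annotations.
z = torch.randn(64, 32)
gt_pose = torch.randn(64, 6)

opt.zero_grad()
pred = pose_head(z)  # a single feed-forward pass at inference time
loss = nn.functional.huber_loss(pred, gt_pose)
loss.backward()
opt.step()
```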

If this is right

  • Camera pose can be estimated accurately and efficiently in a single feed-forward pass using the prelearned features.
  • The approach matches or exceeds state-of-the-art accuracy on driving benchmarks while using orders of magnitude less labeled 3D data.
  • On Waymo and PandaSet, pose accuracy rises by more than 10 percent relative to recent feed-forward methods.
  • This constitutes the first demonstration that inverse-dynamics self-supervised learning can be applied successfully to pose estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-dynamics pretraining could supply useful features for related geometric tasks such as depth estimation or visual odometry.
  • Because the pretraining uses only unlabeled video, the method could transfer to other video domains where 3D labels are scarce.
  • Separating large-scale unlabeled pretraining from small-scale labeled finetuning offers a template for scaling other perception models in robotics and autonomous systems.

Load-bearing premise

The latent action features learned from unlabeled driving videos remain sufficiently informative and transferable to camera pose estimation after finetuning on a limited set of 3D annotations.

What would settle it

If a pose estimator built on these latent action features shows no accuracy gain over a baseline trained from scratch on the same small set of 3D annotations, or fails to exceed recent feed-forward methods by more than 10 percent on the Waymo and PandaSet benchmarks, the central claim would be falsified.
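Stated as code, the two failure conditions reduce to a simple check; the relative reading of "more than 10 percent" is our assumption, not the paper's stated metric definition:

```python
def claim_holds(acc_latent, acc_scratch, acc_prior):
    """acc_latent: accuracy of the head built on pretrained latents;
    acc_scratch: the same head trained from scratch on the same labels;
    acc_prior: best recent feed-forward method on the benchmark."""
    gains_over_scratch = acc_latent > acc_scratch
    beats_prior_by_10pct = (acc_latent - acc_prior) / acc_prior > 0.10
    return gains_over_scratch and beats_prior_by_10pct
```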

Figures

Figures reproduced from arXiv: 2604.27448 by Matthew Brown, Prajwal Chidananda, Pujith Kachana, Samuel Li, Saurabh Nair, Yasutaka Furukawa, Zhengqing Wang.

Figure 1. Overview of LA-Pose. We introduce a two-stage framework that unifies large-scale latent action pretraining with camera pose estimation. From millions of unlabeled driving videos, an inverse–forward dynamics model learns latent actions that encode inter-frame motion in a fully self-supervised manner. When visualized in t-SNE space, these latent actions exhibit structured clusters that align closely with tru…

Figure 2. Our framework consists of two stages: latent action pretraining and camera pose post-training. In the pretraining stage (top), an…

Figure 3. Qualitative results of camera pose estimation. Comparison of predicted camera trajectories: Ours (green), Rig3R […

Figure 4. Distribution of pose estimation AUC@5 for LA-Pose…

Figure 6. Failure case under reverse motion. Performance degrades when the vehicle moves backward, a rare condition in the supervised training set. Despite this distribution gap, the pretrained backbone still produces partially consistent trajectories.

Figure 7. Qualitative results under low frame rate (1 fps) on Waymo. Each example shows camera poses projected onto the xz plane, with frustums drawn at frames 0, 5, 10, and 15. LA-Pose (green) maintains stable and temporally consistent motion across the sequence, whereas VGGT [31] (cyan) exhibits noticeable drift and discontinuities under sparse temporal sampling.

Figure 8. Qualitative results on OpenDV–YouTube. Each example shows scenes from diverse cities and viewpoints collected from online YouTube driving videos. LA-Pose produces stable and temporally consistent trajectories across a wide variety of conditions, including urban streets, highways, and curved mountain roads. The results qualitatively demonstrate strong generalization from our pre-trained backbone to uncalibr…
read the original abstract

This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LA-Pose, which pretrains latent action representations via inverse- and forward-dynamics models on large-scale unlabeled driving videos (in the style of Genie). These latents are repurposed as direct inputs to a feed-forward camera pose estimator that is finetuned on a limited set of high-quality 3D annotations. The central claim is that this yields over 10% higher pose accuracy than recent feed-forward methods on the Waymo and PandaSet benchmarks while using orders of magnitude less labeled data, positioning the work as the first demonstration of inverse-dynamics self-supervised learning for pose estimation.

Significance. If the experimental claims hold after rigorous validation, the result would be significant: it would show that dynamics-based self-supervised pretraining on unlabeled video can produce transferable features for metric 3D pose estimation, offering a scalable path to reduce dependence on costly 3D annotations in autonomous-driving settings.

major comments (2)
  1. [Abstract] Abstract and experimental section: the reported >10% accuracy lift on Waymo/PandaSet is presented without accompanying baselines, error bars, ablation tables, or statistical significance tests. Because the central claim rests on this quantitative improvement being attributable to the latent-action pretraining, the absence of these controls makes it impossible to rule out that gains arise from architecture choices or finetuning details alone.
  2. [Method] Method and experiments: the transferability assumption—that inverse/forward-dynamics latents encode 3D geometric structure (metric depth, scale, absolute orientation) rather than 2D appearance or relative-motion proxies—is load-bearing for the claim yet unsupported by feature visualizations, nearest-neighbor retrievals, or controlled ablations that isolate the pretraining contribution from the finetuning stage.
minor comments (1)
  1. [Abstract] Clarify the precise definition of 'pose accuracy' (e.g., translation/rotation error thresholds, median vs. mean) used for the 10% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested controls and evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental section: the reported >10% accuracy lift on Waymo/PandaSet is presented without accompanying baselines, error bars, ablation tables, or statistical significance tests. Because the central claim rests on this quantitative improvement being attributable to the latent-action pretraining, the absence of these controls makes it impossible to rule out that gains arise from architecture choices or finetuning details alone.

    Authors: We agree that the current presentation of the >10% improvement would be strengthened by additional controls. In the revised version we will add full baseline tables comparing against the cited feed-forward methods, error bars computed over multiple random seeds, ablation tables that remove the pretraining stage while keeping the pose estimator architecture and finetuning protocol identical, and paired statistical significance tests (e.g., Wilcoxon signed-rank) on the Waymo and PandaSet metrics (a minimal sketch of such a test appears after these responses). These additions will make explicit that the reported gains are attributable to the latent-action pretraining rather than other design choices. revision: yes

  2. Referee: [Method] Method and experiments: the transferability assumption—that inverse/forward-dynamics latents encode 3D geometric structure (metric depth, scale, absolute orientation) rather than 2D appearance or relative-motion proxies—is load-bearing for the claim yet unsupported by feature visualizations, nearest-neighbor retrievals, or controlled ablations that isolate the pretraining contribution from the finetuning stage.

    Authors: We acknowledge that direct evidence for the geometric content of the learned latents is currently indirect. We will add (i) t-SNE and PCA visualizations of the latent action features colored by ground-truth depth and orientation (a hedged sketch of such a plot appears after these responses), (ii) nearest-neighbor retrievals in latent space that retrieve frames with similar metric 3D structure, and (iii) controlled ablations that train the identical pose estimator from scratch versus from the pretrained latents, keeping all other hyperparameters fixed. These experiments will isolate the pretraining contribution and directly test the transferability assumption. revision: yes
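For the promised significance testing, a minimal sketch of a paired Wilcoxon signed-rank test over matched per-sequence scores, using scipy; the numbers are illustrative placeholders, not results from the paper:

```python
# Paired nonparametric test over matched per-sequence accuracies.
from scipy.stats import wilcoxon

la_pose  = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]  # per-sequence AUC@5
baseline = [0.80, 0.82, 0.85, 0.79, 0.83, 0.81]  # same sequences
stat, p = wilcoxon(la_pose, baseline)
print(f"W={stat:.1f}, p={p:.4f}")  # small p: gain unlikely to be noise
```

And a sketch of the promised latent-space visualization: t-SNE on the latent actions, colored by a ground-truth motion quantity (forward speed is our stand-in choice; the rebuttal proposes depth and orientation):

```python
# Embed latent actions in 2D and color by a known motion variable.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

latents = np.random.randn(1000, 32)  # placeholder pretrained latents
speed = np.random.rand(1000)         # placeholder ground-truth speed
xy = TSNE(n_components=2, perplexity=30).fit_transform(latents)
plt.scatter(xy[:, 0], xy[:, 1], c=speed, s=4, cmap="viridis")
plt.colorbar(label="ground-truth forward speed")
plt.title("Latent actions, t-SNE")
plt.show()
```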

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper's core approach pretrains latent action representations via inverse- and forward-dynamics models on unlabeled driving videos (separate from the target task), then finetunes a pose estimator on a distinct set of high-quality 3D annotations before evaluating on held-out benchmarks (Waymo, PandaSet). No equations, predictions, or first-principles derivations are presented that reduce the reported accuracy gains to fitted parameters or self-referential definitions by construction. The central empirical claim rests on external benchmark performance after distinct pretraining and finetuning stages, with no load-bearing self-citations or uniqueness theorems invoked to force the result. This is a standard self-contained empirical pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of latent action features learned from video dynamics to the pose task; no explicit free parameters, axioms, or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Latent action representations extracted from inverse- and forward-dynamics models on driving videos capture motion information useful for camera pose estimation.
    This transfer assumption is required for the finetuning stage to succeed with limited labels.

pith-pipeline@v0.9.0 · 5535 in / 1324 out tokens · 76666 ms · 2026-05-07T08:53:25.342549+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

  4. [4]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

  6. [6]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  7. [7]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  8. [8]

    Scaling 4D representations

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, et al. Scaling 4D representations. arXiv preprint arXiv:2412.15212, 2024.

  9. [9]

    Argoverse: 3d tracking and forecasting with rich maps

    Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.

  10. [10]

    Dynamo: In-domain dynamics pretraining for visuo-motor control

    Zichen Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. Dynamo: In-domain dynamics pretraining for visuo-motor control. Advances in Neural Information Processing Systems, 37:33933–33961, 2024.

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.

  12. [12]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Gao et al. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024.

  13. [13]

    Unsupervised learning of depth and ego-motion from video

    Zhou et al. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  14. [14]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024.

  15. [15]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025.

  16. [16]

    Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generalizable ego-vision multimodal …

  17. [17]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

  18. [18]

    Rayzer: A self-supervised large view synthesis model

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, and Georgios Pavlakos. Rayzer: A self-supervised large view synthesis model. 2025.

  19. [19]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  20. [20]

    Posenet: A convolutional network for real-time 6-dof camera relocalization, 2016

    Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization, 2016.

  21. [21]

    Rig3R: Rig-aware conditioning for learned 3D reconstruction

    Samuel Li, Pujith Kachana, Prajwal Chidananda, Saurabh Nair, Yasutaka Furukawa, and Matthew Brown. Rig3R: Rig-aware conditioning for learned 3D reconstruction. arXiv preprint arXiv:2506.02265, 2025.

  22. [22]

    True self-supervised novel view synthesis is transferable

    Thomas W. Mitchel, Hyunwoo Ryu, and Vincent Sitzmann. True self-supervised novel view synthesis is transferable.

  23. [23]

    Cosmos world foundation model platform for physical ai, 2025

    NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Ji…

  24. [24]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. GAIA-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025.

  25. [25]

    Learning to act without actions, 2024

    Dominik Schmidt and Minqi Jiang. Learning to act without actions, 2024.

  26. [26]

    Structure-from-motion revisited

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  27. [27]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

  28. [28]

    Latent action pretraining through world modeling

    Bahey Tharwat, Yara Nasser, Ali Abouzeid, and Ian Reid. Latent action pretraining through world modeling. arXiv preprint arXiv:2509.18428, 2025.

  29. [29]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022.

  30. [30]

    Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment, 2024

    Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment, 2024.

  31. [31]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  32. [32]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

  33. [33]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.

  34. [34]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  35. [35]

    π3: Scalable permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning, 2025.

  36. [36]

    Pandaset: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101. IEEE, 2021.

  37. [37]

    Genad: Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Genad: Generalized predictive model for autonomous driving. arXiv preprint arXiv:2403.09630, 2024.

  38. [38]

    Fast3r: Towards 3D reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  39. [39]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024.

  40. [40]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025.

  41. [41]

    Cameras as rays: Pose estimation via ray diffusion

    Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion, 2024.

  42. [42]

    Qualitative Results under Low Frame Rate. Figure 7 presents additional qualitative comparisons between LA-Pose and VGGT [31] under the low frame rate (1 fps) setting on the Waymo dataset. All visualizations follow the same protocol as in the main paper, where predicted camera trajectories are projected to the xz plane with camera frustums shown at fram…

  43. [43]

    The OpenDV–YouTube dataset [37] is a large-scale collection of unconstrained driving videos gathered from public YouTube channels

    Qualitative Results on OpenDV–YouTube. Figure 8 shows qualitative results of LA-Pose on the OpenDV–YouTube dataset [37]. The OpenDV–YouTube dataset [37] is a large-scale collection of unconstrained driving videos gathered from public YouTube channels. It forms the main component of OpenDV-2K, spanning over 1700 hours of front-view recordings captured acr…

  44. [44]

    We categorize trajectories into bins based on curvature and acceleration to examine model performance under different motion regimes

    Failure Mode Analysis. To further understand the limitations of our method, we analyze pose estimation performance across different trajectory curvatures and accelerations on the Waymo validation set. We categorize trajectories into bins based on curvature and acceleration to examine model performance under different motion regimes. Curvature is defi…