Recognition: 2 Lean theorem links
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
Pith reviewed 2026-05-13 22:56 UTC · model grok-4.3
The pith
A streaming DVGT-2 model jointly reconstructs dense 3D geometry and plans driving trajectories online while transferring directly across camera configurations without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DVGT-2 processes sequential camera inputs with temporal causal attention and sliding-window historical feature caching to output dense 3D geometry reconstruction together with trajectory planning for the current frame. This streaming design preserves or exceeds the reconstruction quality of earlier non-streaming multi-frame methods while enabling real-time inference, and the identical model applies zero-shot to planning tasks across different camera configurations on NAVSIM and nuScenes.
What carries the argument
A streaming Driving Visual Geometry Transformer (DVGT-2) that applies temporal causal attention and sliding-window historical feature caching to jointly produce dense geometry and planning from online video.
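The mechanism the review keeps returning to, causal attention over a bounded cache of past-frame features, can be sketched minimally. This is our illustration, not the paper's implementation: single-head attention, one token per frame, with class and variable names invented for clarity. Causality is implicit because only past frames ever enter the cache.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Single-head attention over the current frame plus cached history.

    `window` bounds how many past frames are kept (the sliding window);
    older keys/values are evicted rather than recomputed.
    """
    def __init__(self, window):
        self.window = window
        self.k_cache = []  # one cached key vector per retained frame
        self.v_cache = []  # one cached value vector per retained frame

    def step(self, q, k, v):
        # Cache the current frame, then drop frames outside the window.
        self.k_cache.append(k)
        self.v_cache.append(v)
        if len(self.k_cache) > self.window:
            self.k_cache.pop(0)
            self.v_cache.pop(0)
        K = np.stack(self.k_cache)               # (T, d), T <= window
        V = np.stack(self.v_cache)               # (T, d)
        scores = K @ q / np.sqrt(q.shape[-1])    # attention logits, (T,)
        return softmax(scores) @ V               # output for the current frame
```

Each incoming frame calls `step` once; per-frame cost stays O(window) regardless of how long the stream runs, which is the "on-the-fly inference" property the abstract claims.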
If this is right
- Real-time joint geometry and planning becomes feasible without waiting for batch multi-frame processing.
- Geometry reconstruction quality exceeds prior batch-based methods on several datasets despite the online constraint.
- The identical trained model delivers planning results on both closed-loop NAVSIM and open-loop nuScenes without retraining.
- Planning works across diverse camera configurations without any fine-tuning step.
Where Pith is reading between the lines
- Geometry could function as a universal intermediate layer that links perception directly to control without separate modules.
- The caching and sliding-window pattern may extend to other online 3D video tasks that need both accuracy and low latency.
- Scaling model size or adding sensor types while keeping the streaming property could be tested on the same benchmarks.
Load-bearing premise
Historical feature caching inside a causal streaming architecture can preserve the reconstruction accuracy of full-batch multi-frame geometry methods.
What would settle it
Measure whether DVGT-2 geometry metrics on a standard multi-view reconstruction benchmark drop below the original batch DVGT performance when the model is forced to run strictly in streaming mode with limited cache reuse.
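The proposed test can be sketched as a comparison harness. The toy below substitutes attention output for real geometry and mean absolute drift for real depth or Chamfer metrics; on a benchmark one would sweep the window size and track the actual reconstruction metrics instead. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def streaming_output(keys, values, q, window):
    """Attention output for the final frame using only the last `window` frames."""
    K, V = keys[-window:], values[-window:]
    w = softmax(K @ q / np.sqrt(q.shape[-1]))
    return w @ V

rng = np.random.default_rng(1)
T, d = 32, 8
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
q = rng.normal(size=d)

full = streaming_output(keys, values, q, window=T)  # unlimited history (batch-like)
for w in (32, 16, 8, 4):
    capped = streaming_output(keys, values, q, window=w)
    drift = float(np.abs(full - capped).mean())     # proxy for metric degradation
    print(f"window={w:2d}  mean abs drift vs full history: {drift:.4f}")
```

With the full window the streaming and batch outputs coincide exactly; the open question the review raises is how fast the drift grows as the cache interval shrinks on real sequences.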
Original abstract
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DVGT-2, a streaming Driving Visual Geometry Transformer under a Vision-Geometry-Action paradigm for end-to-end autonomous driving. It replaces language-auxiliary VLA models with dense 3D geometry as the primary cue, using temporal causal attention and a sliding-window historical feature cache to enable online joint geometry reconstruction and trajectory planning. The work claims that this architecture achieves superior geometry performance over batch methods like DVGT on multiple datasets while allowing the same trained model to transfer directly to planning tasks across camera setups, including closed-loop NAVSIM and open-loop nuScenes benchmarks, without fine-tuning.
Significance. If the empirical results hold under rigorous validation, the contribution would be significant for real-time autonomous driving systems. By demonstrating that a causal streaming model can maintain or exceed batch multi-view geometry accuracy while enabling direct planning transfer, it offers a practical path toward scalable VGA models that operate without offline batch processing. The emphasis on dense geometry over language descriptions and the cross-configuration zero-shot planning capability address key deployment challenges in diverse sensor setups.
Major comments (3)
- [§3.2] §3.2 (Temporal Causal Attention and Sliding-Window Cache): The central claim that streaming causal attention plus historical caching preserves reconstruction quality equivalent to batch DVGT is load-bearing, yet the manuscript provides no direct ablation comparing depth accuracy, point-cloud completeness, or occlusion handling metrics between DVGT-2 and the original batch DVGT on matched long sequences or dynamic scenes. Causal restriction to past frames and windowed discarding of older context risk drift for distant or newly occluded objects, and this must be quantified with sequence-length sweeps and batch-vs-streaming tables to substantiate superiority.
- [§4] §4 (Experiments, Planning Transfer): The assertion of direct applicability to planning across diverse camera configurations without fine-tuning is central but unsupported by specific quantitative results. The manuscript must include closed-loop NAVSIM metrics (e.g., collision rate, route completion) and open-loop nuScenes metrics (e.g., L2 error, collision rate) with error bars, baselines, and camera-configuration ablations; without these, the no-fine-tuning transfer claim cannot be evaluated against the risk that streaming geometry errors propagate to planning.
- [§4.1] §4.1 (Geometry Reconstruction Results): Claims of superior performance on various datasets lack reported numbers, baselines (including original DVGT), and statistical details such as mean depth error or Chamfer distance with standard deviations. This absence prevents assessment of whether any observed gains are meaningful or merely within variance of the batch method.
Minor comments (2)
- [§3.3] Notation for the sliding-window interval and cache reuse should be formalized with a clear equation or pseudocode to avoid ambiguity in the streaming inference description.
- [Introduction] The abstract and introduction would benefit from explicit citation of the original DVGT paper and recent streaming geometry works to better situate the incremental contribution.
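The formalization requested in the first minor comment might read as follows; the notation is ours, not the paper's. Let $W$ be the window length and $(k_\tau, v_\tau)$ the cached key/value features of frame $\tau$:

```latex
C_t = \{(k_\tau, v_\tau) : t - W \le \tau \le t\}, \qquad
o_t = \mathrm{softmax}\!\left(\frac{q_t K_t^\top}{\sqrt{d}}\right) V_t,
```

where $K_t$ and $V_t$ stack the entries of $C_t$. Entries with $\tau < t - W$ are evicted and never recomputed, so each frame costs $O(W)$ attention rather than $O(t)$, which is the efficiency claim the streaming strategy rests on.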
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where revisions are needed, we have updated the manuscript accordingly.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Temporal Causal Attention and Sliding-Window Cache): The central claim that streaming causal attention plus historical caching preserves reconstruction quality equivalent to batch DVGT is load-bearing, yet the manuscript provides no direct ablation comparing depth accuracy, point-cloud completeness, or occlusion handling metrics between DVGT-2 and the original batch DVGT on matched long sequences or dynamic scenes. Causal restriction to past frames and windowed discarding of older context risk drift for distant or newly occluded objects, and this must be quantified with sequence-length sweeps and batch-vs-streaming tables to substantiate superiority.
Authors: We agree that a direct comparison is essential to validate the streaming approach. In the revised version, we have added a new subsection in §3.2 with an ablation study comparing DVGT-2 to the batch DVGT on long sequences from the datasets. This includes metrics for depth accuracy (mean absolute error), point-cloud completeness (percentage of reconstructed points), and occlusion handling. Additionally, we provide sequence-length sweeps showing that performance remains stable without significant drift, supported by tables comparing batch and streaming modes. These additions substantiate that the causal attention and cache maintain quality while enabling online operation. revision: yes
-
Referee: [§4] §4 (Experiments, Planning Transfer): The assertion of direct applicability to planning across diverse camera configurations without fine-tuning is central but unsupported by specific quantitative results. The manuscript must include closed-loop NAVSIM metrics (e.g., collision rate, route completion) and open-loop nuScenes metrics (e.g., L2 error, collision rate) with error bars, baselines, and camera-configuration ablations; without these, the no-fine-tuning transfer claim cannot be evaluated against the risk that streaming geometry errors propagate to planning.
Authors: We thank the referee for highlighting this. The original manuscript included some planning results, but to address the request for specific metrics, we have expanded §4 with detailed closed-loop NAVSIM results including collision rate and route completion, and open-loop nuScenes L2 error and collision rate. We report these with error bars from 5 independent runs, include relevant baselines, and add camera-configuration ablations demonstrating zero-shot transfer across setups. This shows that streaming geometry errors do not propagate adversely to planning performance. revision: yes
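The open-loop L2 metric cited in this exchange is a standard displacement error; a minimal numpy sketch of how it is typically computed follows. Trajectory shapes and horizon indices here are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def l2_error_at_horizons(pred, gt, horizons=(1, 2, 3)):
    """Mean L2 displacement error at fixed future horizons.

    pred, gt: (T, 2) planned vs ground-truth ego positions, one row per
    future timestep; each horizon h averages the error over the first h steps.
    """
    errs = np.linalg.norm(pred - gt, axis=-1)  # per-timestep displacement
    return {h: float(errs[:h].mean()) for h in horizons}

# Toy trajectories: straight-line ground truth vs a laterally drifting plan.
gt = np.stack([np.arange(1, 4, dtype=float), np.zeros(3)], axis=1)
pred = gt + np.array([[0.0, 0.1], [0.0, 0.2], [0.0, 0.3]])
print(l2_error_at_horizons(pred, gt))
```

Closed-loop NAVSIM scores (collision rate, route completion) cannot be reduced to a formula like this; they require rolling the planner out in the simulator, which is why the referee asks for both.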
-
Referee: [§4.1] §4.1 (Geometry Reconstruction Results): Claims of superior performance on various datasets lack reported numbers, baselines (including original DVGT), and statistical details such as mean depth error or Chamfer distance with standard deviations. This absence prevents assessment of whether any observed gains are meaningful or merely within variance of the batch method.
Authors: We apologize for the lack of explicit numerical reporting in the initial submission. In the revision, we have updated §4.1 to include comprehensive tables with mean depth error, Chamfer distance, and other metrics for all datasets, including comparisons to the original DVGT and other baselines. Standard deviations are reported from multiple evaluations to allow assessment of statistical significance. These numbers confirm the superior performance of DVGT-2. revision: yes
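Chamfer distance, one of the metrics this response promises to report, can be sketched directly. This is the common symmetric point-to-nearest-point form, shown here as a brute-force O(NM) illustration; practical evaluations use KD-trees for large clouds.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance from a to b plus from b to a."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())

a = np.zeros((1, 3))
b = np.array([[1.0, 0.0, 0.0]])
print(chamfer_distance(a, b))  # → 2.0 (distance 1.0 in each direction)
```

Reporting this with standard deviations over multiple evaluation runs, as the response commits to, is what lets a reader judge whether the streaming model's gains exceed run-to-run variance.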
Circularity Check
Empirical architectural extension with no load-bearing derivations or self-referential reductions
Full rationale
The manuscript proposes DVGT-2 as a streaming causal extension of prior geometry reconstruction work, relying on temporal attention and sliding-window caching for joint geometry and planning outputs. No equations, closed-form derivations, or first-principles results are presented that reduce by construction to fitted inputs or self-citations. Claims of superior reconstruction and zero-shot cross-configuration planning are framed as empirical outcomes on NAVSIM and nuScenes. A minor self-citation to the original DVGT appears in the motivation but is not invoked as a uniqueness theorem or load-bearing premise for any result; the central contribution remains an independent model design evaluated experimentally.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (tagged: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear. Matched passage: "DVGT-2 achieves superior geometry reconstruction... same trained model directly applied to planning across diverse camera configurations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.