Virtual KITTI 2
Pith reviewed 2026-05-13 15:55 UTC · model grok-4.3
The pith
Virtual KITTI 2 clones five KITTI tracking sequences and supplies each in multiple weather and camera variants with complete synthetic labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Virtual KITTI 2 consists of five sequence clones from the KITTI tracking benchmark, each available in variants with modified weather conditions such as fog and rain or modified camera configurations such as 15-degree rotations. For each sequence the dataset supplies RGB, depth, class segmentation, instance segmentation, flow, scene flow, camera parameters and vehicle locations. Experiments using state-of-the-art autonomous driving algorithms demonstrate the dataset's capabilities.
What carries the argument
The Virtual KITTI 2 dataset, generated by cloning real KITTI sequences and applying controlled modifications to weather and camera parameters while producing perfect ground-truth annotations for depth, segmentation, and flow.
Load-bearing premise
The synthetic images and their variants are realistic enough that models trained on them will generalize to real-world autonomous driving data.
What would settle it
A depth or segmentation model trained only on Virtual KITTI 2 performs substantially worse on the original real KITTI test sequences than the same model trained on real KITTI data.
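The test above can be made concrete with two standard monocular-depth error metrics. What follows is a minimal sketch, not the paper's evaluation protocol: the function names, the flat lists of per-pixel depths in metres, the toy values, and the 25% relative gap used to stand in for "substantially worse" are all assumptions chosen for illustration.

```python
import math


def _valid_pairs(pred, gt):
    """Keep only pixels whose ground-truth depth is positive (i.e. valid)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground truth")
    return pairs


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred, gt):
    """Root mean squared error in the same unit as the inputs."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def substantially_worse(err_synth, err_real, rel_gap=0.25):
    """True if the synthetic-trained error exceeds the real-trained error
    by more than rel_gap (a hypothetical threshold, not from the paper)."""
    return err_synth > err_real * (1.0 + rel_gap)


# Toy depths in metres; 0.0 marks a pixel without valid ground truth.
gt = [10.0, 20.0, 40.0, 0.0]
pred_real = [10.5, 19.0, 42.0, 5.0]    # model trained on real data (made up)
pred_synth = [12.0, 17.0, 48.0, 5.0]   # model trained on synthetic data (made up)

err_real = abs_rel(pred_real, gt)
err_synth = abs_rel(pred_synth, gt)
print(f"AbsRel real={err_real:.3f} synth={err_synth:.3f}")
print("premise fails:", substantially_worse(err_synth, err_real))
```

In practice both models would be scored on the same held-out real frames, and the threshold would be set in advance rather than picked after seeing the numbers.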
Original abstract
This paper introduces an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Virtual KITTI 2, an updated version of the Virtual KITTI dataset consisting of 5 sequence clones from the KITTI tracking benchmark. It includes variants with modified weather conditions (e.g., fog, rain) and camera configurations (e.g., 15-degree rotations). For each sequence, the dataset provides RGB, depth, class segmentation, instance segmentation, flow, scene flow, camera parameters, and vehicle locations. Experiments with state-of-the-art algorithms are run to showcase capabilities, and the dataset is available for download at the provided link.
Significance. If the described assets are complete and accessible, this dataset release is significant for autonomous driving research in computer vision. It supplies a controlled synthetic environment with multiple ground-truth modalities and systematic variants, enabling targeted evaluation of algorithm robustness to weather and viewpoint changes that complements real-world benchmarks like KITTI.
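A robustness evaluation of the kind described here could be summarised as the relative change in error between each modified variant and the unmodified clone. The sketch below is only an illustration under assumptions: the variant labels, the error values, and the helper name `degradation_vs_clone` are made up and do not come from the paper.

```python
def degradation_vs_clone(errors, baseline="clone"):
    """Relative change of each variant's error with respect to the baseline.

    `errors` maps a variant label to a lower-is-better error value.
    A result of 0.5 means the error is 50% higher than on the baseline.
    """
    base = errors[baseline]
    if base <= 0:
        raise ValueError("baseline error must be positive")
    return {
        name: (value - base) / base
        for name, value in errors.items()
        if name != baseline
    }


# Made-up error values for one model evaluated on four variants of a sequence.
toy_errors = {
    "clone": 0.10,          # unmodified clone of the real sequence
    "fog": 0.15,            # modified weather
    "rain": 0.12,           # modified weather
    "rotated_15deg": 0.11,  # modified camera configuration
}

report = degradation_vs_clone(toy_errors)
worst = max(report, key=report.get)

# Print variants from most to least degraded.
for name, delta in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: {delta:+.0%}")
print("largest degradation:", worst)
```

Because every variant shares the same underlying scene and ground truth, a difference in error can be attributed to the modified factor rather than to scene content.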
minor comments (2)
- A table summarizing the number of frames, sequences, and variants per modality would improve clarity and allow quick assessment of dataset scale.
- The abstract mentions experiments with state-of-the-art algorithms but does not name them or report key metrics; adding this information would strengthen the overview of the dataset's demonstrated utility.
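The summary table requested in the first comment could be generated directly from a file listing. The sketch below assumes a hypothetical `<sequence>/<variant>/<modality>/<frame file>` layout and toy paths. The actual directory structure of the release is not described in this review, so the path handling would need adjusting.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_frames(paths):
    """Count files per (sequence, variant, modality).

    Assumes the hypothetical layout <sequence>/<variant>/<modality>/<file>;
    paths with fewer than four components are skipped.
    """
    counts = Counter()
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) < 4:
            continue
        sequence, variant, modality = parts[0], parts[1], parts[2]
        counts[(sequence, variant, modality)] += 1
    return counts


def format_table(counts):
    """Render the counts as a fixed-width plain-text table."""
    rows = [f"{'sequence':<10}{'variant':<16}{'modality':<12}{'frames':>6}"]
    for (sequence, variant, modality), n in sorted(counts.items()):
        rows.append(f"{sequence:<10}{variant:<16}{modality:<12}{n:>6}")
    return "\n".join(rows)


# Toy listing; names are illustrative only.
toy_paths = [
    "Scene01/clone/rgb/00000.jpg",
    "Scene01/clone/rgb/00001.jpg",
    "Scene01/clone/depth/00000.png",
    "Scene01/fog/rgb/00000.jpg",
    "README.txt",  # ignored: does not match the assumed layout
]

counts = count_frames(toy_paths)
print(format_table(counts))
```

For a real download, `toy_paths` could be replaced by relative paths collected with `pathlib.Path.rglob`.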
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of Virtual KITTI 2 and for recommending minor revision. We appreciate the recognition that the dataset provides a valuable controlled synthetic environment with multiple ground-truth modalities to complement real-world benchmarks such as KITTI. No specific major comments were raised in the report.
Circularity Check
No significant circularity; dataset release with no derivation chain
full rationale
The paper is a dataset release describing Virtual KITTI 2, consisting of cloned KITTI sequences with weather and camera variants plus standard modalities (RGB, depth, segmentation, flow, etc.). The abstract and full text contain no equations, proofs, fitted parameters, predictions, or modeling steps. The central contribution is the existence and accessibility of the described assets, which is externally verifiable via the provided download link and does not rely on any internal derivation that could reduce to its inputs by construction. No self-citations, ansatzes, or uniqueness claims are load-bearing for any quantitative result. This is the expected outcome for a pure dataset paper.
Forward citations
Cited by 25 Pith papers
- TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
  TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
- Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
  A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- VDPP: Video Depth Post-Processing for Speed and Scalability
  VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
- Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
  Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- Image Generators are Generalist Vision Learners
  Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- Geometric Context Transformer for Streaming 3D Reconstruction
  LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
  SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
- LoMa: Local Feature Matching Revisited
  Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.
- SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
  Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.
- SAM 2: Segment Anything in Images and Videos
  SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
- Depth Anything V2
  Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
  ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- Syn4D: A Multiview Synthetic 4D Dataset
  Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
- Who Handles Orientation? Investigating Invariance in Feature Matching
  Learning rotation invariance in descriptors matches the performance of matcher-level invariance but allows earlier invariance, faster matchers, and no loss in upright performance when trained at scale.
- SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
  SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some s...
- Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
  GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.
- A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets
  Combining a diffusion model and an image-to-image translation model produces more photorealistic game-engine synthetic images than either alone while keeping semantic labels intact.