Virtual KITTI 2

Yohann Cabon, Naila Murray, Martin Humenberger

Pith reviewed 2026-05-13 15:55 UTC · model grok-4.3

classification cs.CV cs.RO eess.IV

keywords virtual kitti, synthetic dataset, autonomous driving, semantic segmentation, depth estimation, optical flow, scene flow, computer vision

The pith

Virtual KITTI 2 clones five KITTI tracking sequences and supplies each in multiple weather and camera variants with complete synthetic labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Virtual KITTI 2, an updated synthetic dataset built from five sequences cloned directly from the real KITTI tracking benchmark. Each sequence appears in several variants created by changing weather conditions such as fog or rain and by altering camera configurations such as 15-degree rotations. The dataset supplies RGB images together with depth maps, class segmentation, instance segmentation, optical flow, scene flow, camera parameters, and vehicle locations. Experiments with current state-of-the-art autonomous driving algorithms illustrate how the controlled variants can be used to train and test perception models.

Core claim

Virtual KITTI 2 consists of five sequence clones from the KITTI tracking benchmark, each available in variants with modified weather conditions such as fog and rain or modified camera configurations such as 15-degree rotations. For each sequence the dataset supplies RGB, depth, class segmentation, instance segmentation, flow, scene flow, camera parameters and vehicle locations. Experiments using state-of-the-art autonomous driving algorithms demonstrate the dataset's capabilities.

What carries the argument

The argument rests on the Virtual KITTI 2 dataset itself, which is generated by cloning real KITTI sequences and applying controlled modifications to weather and camera parameters, with ground-truth annotations for depth, segmentation, and flow rendered alongside the images.

Load-bearing premise

The synthetic images and their variants are realistic enough that models trained on them will generalize to real-world autonomous driving data.

What would settle it

The premise would be undermined if a depth or segmentation model trained only on Virtual KITTI 2 performed substantially worse on the original real KITTI test sequences than the same model trained on real KITTI data.
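One way to make this test concrete is to score both models with the same depth error metrics on the same real frames. The sketch below is illustrative only: it uses absolute relative error and RMSE, which are common depth metrics but are not named in the text above, and the function names and toy numbers are assumptions rather than anything taken from the paper.

```python
import math


def depth_errors(pred, gt):
    """Return (abs_rel, rmse) over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))
    return abs_rel, rmse


def relative_gap(synthetic_err, real_err):
    """Relative increase in error when training on synthetic instead of real data."""
    return (synthetic_err - real_err) / real_err


# Toy depths in metres (made up for illustration, not results from the paper).
gt = [10.0, 20.0, 40.0, 0.0]  # 0.0 marks a pixel without ground truth
pred_real_trained = [11.0, 19.0, 42.0, 5.0]
pred_synth_trained = [12.0, 17.0, 46.0, 5.0]

real_abs_rel, real_rmse = depth_errors(pred_real_trained, gt)
synth_abs_rel, synth_rmse = depth_errors(pred_synth_trained, gt)
gap = relative_gap(synth_abs_rel, real_abs_rel)
print(f"abs_rel real={real_abs_rel:.3f} synthetic={synth_abs_rel:.3f} gap={gap:.2f}")
```

What counts as "substantially worse" is a threshold the reader has to choose; the relative gap simply puts the two training regimes on a common scale.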

Original abstract

This paper introduces an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces Virtual KITTI 2, an updated version of the Virtual KITTI dataset consisting of 5 sequence clones from the KITTI tracking benchmark. It includes variants with modified weather conditions (e.g., fog, rain) and camera configurations (e.g., 15-degree rotations). For each sequence, the dataset provides RGB, depth, class segmentation, instance segmentation, flow, scene flow, camera parameters, and vehicle locations. Experiments with state-of-the-art algorithms are run to showcase capabilities, and the dataset is available for download at the provided link.

Significance. If the described assets are complete and accessible, this dataset release is significant for autonomous driving research in computer vision. It supplies a controlled synthetic environment with multiple ground-truth modalities and systematic variants, enabling targeted evaluation of algorithm robustness to weather and viewpoint changes that complements real-world benchmarks like KITTI.
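The kind of targeted robustness evaluation described here can be sketched as a per-variant comparison against the unmodified clone. The variant keys, the scores, and the function name below are hypothetical placeholders, not the dataset's actual naming or any reported result.

```python
def robustness_report(scores, baseline="clone"):
    """Map each variant to its metric change relative to the baseline variant."""
    base = scores[baseline]
    return {name: value - base for name, value in scores.items() if name != baseline}


# Hypothetical per-variant accuracy of one model (illustrative numbers only).
scores = {"clone": 0.80, "fog": 0.65, "rain": 0.70, "15-deg-rotation": 0.75}

report = robustness_report(scores)
worst_variant = min(report, key=report.get)  # variant with the largest drop
for name, delta in sorted(report.items(), key=lambda item: item[1]):
    print(f"{name:<16} {delta:+.2f}")
```

Because every variant shares the same underlying scene content, a drop relative to the clone can be attributed to the weather or camera change rather than to different scenes.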

minor comments (2)

A table summarizing the number of frames, sequences, and variants per modality would improve clarity and allow quick assessment of dataset scale.
The abstract mentions experiments with state-of-the-art algorithms but does not name them or report key metrics; adding this information would strengthen the overview of the dataset's demonstrated utility.
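The summary table requested in the first comment could be tallied from a frame index along these lines. This is a minimal sketch assuming each frame is described by a (sequence, variant, modality) triple; the record layout, the names, and the counts are invented for illustration and do not reflect the real dataset.

```python
from collections import defaultdict


def summarize(records):
    """Count frames per (modality, variant) and distinct sequences per modality."""
    frames = defaultdict(int)
    sequences = defaultdict(set)
    for sequence, variant, modality in records:
        frames[(modality, variant)] += 1
        sequences[modality].add(sequence)
    return dict(frames), {m: len(s) for m, s in sequences.items()}


# Each record stands for one frame: (sequence, variant, modality). Hypothetical data.
records = [
    ("seq_a", "clone", "rgb"), ("seq_a", "clone", "rgb"),
    ("seq_a", "fog", "rgb"), ("seq_b", "clone", "rgb"),
    ("seq_a", "clone", "depth"), ("seq_b", "clone", "depth"),
]

frames, sequences = summarize(records)
for (modality, variant), count in sorted(frames.items()):
    print(f"{modality:<6} {variant:<6} {count}")
```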

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of Virtual KITTI 2 and for recommending minor revision. We appreciate the recognition that the dataset provides a valuable controlled synthetic environment with multiple ground-truth modalities to complement real-world benchmarks such as KITTI. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; dataset release with no derivation chain

full rationale

The paper is a dataset release describing Virtual KITTI 2, consisting of cloned KITTI sequences with weather and camera variants plus standard modalities (RGB, depth, segmentation, flow, etc.). The abstract and full text contain no equations, proofs, fitted parameters, predictions, or modeling steps. The central contribution is the existence and accessibility of the described assets, which is externally verifiable via the provided download link and does not rely on any internal derivation that could reduce to its inputs by construction. No self-citations, ansatzes, or uniqueness claims are load-bearing for any quantitative result. This is the expected outcome for a pure dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper with no mathematical derivation, so the ledger contains no free parameters, axioms, or invented entities.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV 2026-05 unverdicted novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
cs.CV 2026-04 unverdicted novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

VDPP: Video Depth Post-Processing for Speed and Scalability
cs.CV 2026-04 unverdicted novelty 7.0

VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.

Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.

Image Generators are Generalist Vision Learners
cs.CV 2026-04 conditional novelty 6.0

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

Image Generators are Generalist Vision Learners
cs.CV 2026-04 unverdicted novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

Geometric Context Transformer for Streaming 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
cs.CV 2026-04 unverdicted novelty 6.0

SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

LoMa: Local Feature Matching Revisited
cs.CV 2026-04 unverdicted novelty 6.0

Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
cs.CV 2026-04 unverdicted novelty 6.0

Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.

SAM 2: Segment Anything in Images and Videos
cs.CV 2024-08 conditional novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...

Depth Anything V2
cs.CV 2024-06 unverdicted novelty 6.0

Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

The Midas Touch for Metric Depth
cs.CV 2026-05 unverdicted novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
cs.CV 2026-05 unverdicted novelty 5.0

ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

Syn4D: A Multiview Synthetic 4D Dataset
cs.CV 2026-05 unverdicted novelty 5.0

Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

Who Handles Orientation? Investigating Invariance in Feature Matching
cs.CV 2026-04 accept novelty 5.0

Learning rotation invariance in descriptors matches the performance of matcher-level invariance but allows earlier invariance, faster matchers, and no loss in upright performance when trained at scale.

SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
cs.CV 2026-04 unverdicted novelty 5.0

SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some s...

Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
cs.CV 2026-04 unverdicted novelty 5.0

GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.

A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets
cs.CV 2026-05 unverdicted novelty 4.0

Combining a diffusion model and an image-to-image translation model produces more photorealistic game-engine synthetic images than either alone while keeping semantic labels intact.
