StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

Feng Qiao; Nathan Jacobs; Yangzhi Cui

arxiv: 2605.23237 · v1 · pith:YP2RDL3Snew · submitted 2026-05-22 · 💻 cs.CV


Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3


keywords: stereo generation · multi-baseline benchmark · synthetic dataset · Unreal Engine · camera calibration · metric depth · view synthesis · baseline regimes


The pith

StereoGenBench supplies scene-paired multi-baseline stereo views complete with calibration, metric depth, and poses from one controlled synthetic source.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing stereo resources supply only subsets of the variables needed for stereo generation and view synthesis, leaving no single source with scene-paired, calibrated multi-baseline right views together with intrinsics, dense metric depth, and per-frame poses. The paper fills this gap by releasing StereoGenBench, an Unreal Engine dataset rendered from a rigid six-camera lateral array that produces up to 15 view pairs per scene. Adjacent baselines range from inter-pupillary to wide, focal length is sampled independently, and every view carries RGB, metric depth, intrinsics, per-pair baselines, and poses. Two evaluation splits cover the narrow and wide baseline regimes, a train-only split offers broader all-pairs coverage, and the generation code is released so the dataset can be extended.

Core claim

StereoGenBench is a synthetic dataset rendered with a rigid six-camera lateral array that yields up to 15 calibrated view pairs per scene across inter-pupillary to wide baselines, each view released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses, providing the first unified controlled source for measuring baseline-regime sensitivity and target-camera consistency in stereo generation.

What carries the argument

A rigid six-camera lateral array rendered in Unreal Engine that produces up to 15 calibrated view pairs per scene with controllable baselines sampled from inter-pupillary to wide regimes and independently sampled focal lengths.
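
The count of 15 follows from choosing 2 cameras out of 6. The sketch below enumerates those pairs and derives each pair's baseline from the adjacent gaps. It assumes the six cameras sit on a single line, so a pair's baseline is the sum of the gaps between its two cameras. The function name `pair_baselines` and the gap values are illustrative only and are not taken from the paper or its released code.

```python
from itertools import combinations


def pair_baselines(adjacent_baselines_m):
    """Return {(i, j): baseline_m} for every camera pair in a collinear array.

    Assumption: cameras are collinear, so the baseline between cameras i and j
    is the sum of the adjacent gaps that lie between them.
    """
    # Lateral camera positions, with camera 0 placed at the origin.
    positions = [0.0]
    for gap in adjacent_baselines_m:
        positions.append(positions[-1] + gap)
    return {
        (i, j): positions[j] - positions[i]
        for i, j in combinations(range(len(positions)), 2)
    }


# Hypothetical adjacent gaps in metres (five gaps for six cameras).
gaps = [0.065, 0.065, 0.10, 0.20, 0.40]
pairs = pair_baselines(gaps)
print(len(pairs))  # 15 pairs for six cameras
```

With real data, the per-pair baselines shipped with each scene should be preferred over anything recomputed this way.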

If this is right

Baseline-regime sensitivity of any stereo generation method becomes directly measurable while scene content stays fixed across pairs.
Target-camera consistency can be quantified across narrow and wide baseline regimes within the same scenes.
Training and evaluation data now exist with all geometric variables known and independently controllable.
New scenes can be added using the released generation code and compatible assets while preserving the same calibration and depth format.
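
One way to read the first two consequences is as a per-regime summary of a quality score computed on pairs from the same scenes. The sketch below groups hypothetical `(baseline, score)` records into "narrow" and "wide" buckets and averages each bucket. The 0.10 m threshold, the function name, and the scores are assumptions made for illustration; the paper's own split definitions should take precedence.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_regime(records, narrow_max_m=0.10):
    """Average a per-pair score within 'narrow' and 'wide' baseline regimes.

    `records` is an iterable of (baseline_m, score) tuples. The threshold
    `narrow_max_m` is a hypothetical cut-off, not a value from the paper.
    """
    buckets = defaultdict(list)
    for baseline_m, score in records:
        regime = "narrow" if baseline_m <= narrow_max_m else "wide"
        buckets[regime].append(score)
    return {regime: mean(scores) for regime, scores in buckets.items()}


# Made-up scores (for example PSNR in dB) for four pairs of one scene.
records = [(0.065, 30.0), (0.065, 28.0), (0.40, 22.0), (0.83, 20.0)]
summary = mean_score_by_regime(records)
print(summary)
```

Because scene content is shared across pairs, a gap between the two averages can be attributed to baseline rather than to scene differences.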

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The controlled baselines and depth ground truth could support experiments that isolate the effect of baseline length on disparity estimation accuracy.
The per-frame poses open the possibility of testing stereo generation under small camera motions that are not available in static-pair datasets.
Releasing the generation configuration may allow researchers to create custom splits that match the exact baseline distributions of particular real-world capture rigs.
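
The link between baseline and disparity that these extensions rely on can be sketched with the standard rectified-stereo relation d = f · B / Z. The example assumes a pinhole camera, parallel optical axes, focal length in pixels, and depth measured along the optical axis. Whether the released pairs meet these conditions exactly is not stated in the text above, and all numbers below are illustrative.

```python
def disparity_px(depth_m, focal_px, baseline_m):
    """Horizontal disparity in pixels for a rectified pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Illustrative values: one focal length and depth, three baselines.
focal_px = 1000.0
depth_m = 5.0
for baseline_m in (0.065, 0.26, 0.52):
    print(baseline_m, disparity_px(depth_m, focal_px, baseline_m))
```

Disparity grows linearly with baseline at fixed depth and focal length, which is why wide-baseline pairs stress occlusion handling and large-displacement synthesis more than inter-pupillary pairs do.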

Load-bearing premise

Synthetic renders from Unreal Engine are representative enough of real-world imagery that methods evaluated on them will behave similarly on actual camera captures.

What would settle it

Stereo generation models that achieve high scores on StereoGenBench splits show substantially lower performance when tested on matched real-world stereo sequences captured with equivalent baselines and intrinsics.

Figures

Figures reproduced from arXiv: 2605.23237 by Feng Qiao, Nathan Jacobs, Yangzhi Cui.

**Figure 2.** Generation pipeline. Scene construction, spawn validation, trajectory candidate ranking, …

**Figure 3.** Representative scenes from the released map roots, illustrating indoor, outdoor, urban, …

**Figure 4.** Realized baseline distributions in the released dataset. Panels (a), (b), (d), and (e) show …

**Figure 5.** Trajectory construction. Candidate viewpoints are sampled around the subject, invalid …

read the original abstract

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

StereoGenBench gives a practical synthetic multi-baseline dataset with full annotations, but its value for real stereo methods stays unproven.


StereoGenBench is a dataset release that supplies matched scenes rendered with a fixed six-camera lateral array, producing up to 15 calibrated pairs per scene across narrow to wide baselines, plus dense metric depth, intrinsics, and per-frame poses. The splits separate narrow-baseline and wide-baseline evaluation families and include a broader train-only set. Releasing the generation code, config, and Croissant metadata lets others extend it with new assets. That combination of scene-paired multi-baseline coverage and complete geometry labels in one source is the concrete addition over prior synthetic stereo collections. The reference results and evaluation code lower the barrier for running controlled sensitivity tests on baseline effects. Those parts are useful and cleanly executed.

The main limitation is the missing link to real imagery. The paper positions the benchmark as relevant for methods that will run on actual cameras, yet it contains no cross-domain checks, no ranking comparisons against real stereo pairs, and no tests of how the idealized lighting and zero noise change failure modes. The uniqueness statement rests on a brief “to our knowledge” claim without a comparison table of existing resources. Both gaps are straightforward to address in revision but currently leave the transfer assumption untested.

This work is aimed at researchers who run stereo generation or view-synthesis experiments and want precise control over baseline and depth variables. A reader already using synthetic data for ablation studies would find the splits and annotations immediately usable. It deserves peer review because the dataset construction is reproducible and the annotations are comprehensive; referees can push on the generalization question without dismissing the contribution outright.

Referee Report

2 major / 0 minor

Summary. The paper introduces StereoGenBench, a synthetic multi-camera benchmark generated via Unreal Engine with a rigid six-camera lateral array. Each scene yields up to 15 calibrated view pairs across narrow-to-wide baselines, with independent focal-length sampling; every view includes RGB, dense metric depth, intrinsics, per-pair baselines, and per-frame poses. The resource is positioned to enable controlled measurement of baseline-regime sensitivity for stereo generation, geometry estimation, and view synthesis. Splits are defined for narrow-baseline, wide-baseline, and train-only all-pairs regimes. The authors release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration, claiming that no prior resource supplies the full combination of scene-paired multi-baseline right-view ground truth, jointly recorded intrinsics, dense metric depth, and poses in a single controlled source.

Significance. If the transferability assumption holds, the benchmark would allow systematic, reproducible study of how stereo methods respond to controlled changes in baseline, intrinsics, and depth distribution—capabilities not jointly available in existing collections. Explicit release of generation code, evaluation scripts, and metadata constitutes a concrete strength that supports community extension and verification.

major comments (2)

[Abstract] Abstract: the uniqueness claim ('to our knowledge, ... do not provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source') is asserted without a comparative table or systematic literature review of existing stereo datasets; this directly underpins the motivation for releasing a new resource.
[Abstract] Abstract (paragraph on existing resources gap) and dataset-construction description: the central utility claim—that the benchmark is suitable for evaluating methods intended for real-world imagery—rests on the untested assumption that relative rankings and failure modes observed on idealized synthetic renders will match those on real camera data; no cross-domain experiments, realism ablations, or comparisons against paired real stereo sets are reported.
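
The ranking concern in the second comment could be checked with a rank correlation between per-method scores on the synthetic benchmark and on a real stereo set. The sketch below implements Spearman's rho in plain Python for tie-free score lists. The scores are invented for illustration, and nothing here reproduces an experiment from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation for equal-length, tie-free score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up scores for four hypothetical methods on each domain.
synthetic = [31.2, 29.8, 27.5, 25.1]
real = [28.0, 28.9, 24.3, 22.7]
rho = spearman_rho(synthetic, real)
print(rho)
```

A rho close to 1 would suggest that method rankings transfer across domains; a low or negative value would support the referee's worry. Tied scores need a tie-aware variant, which this sketch does not provide.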

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the motivation and scope of StereoGenBench. We address each major comment below and indicate the planned revisions.


Referee: [Abstract] Abstract: the uniqueness claim ('to our knowledge, ... do not provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source') is asserted without a comparative table or systematic literature review of existing stereo datasets; this directly underpins the motivation for releasing a new resource.

Authors: We agree that the uniqueness claim would be more robust with explicit evidence. In the revised version we will add a comparative table (and brief accompanying text) that systematically contrasts StereoGenBench against prior stereo resources on the dimensions of multi-baseline coverage, dense metric depth, joint intrinsics/poses, and scene-paired right-view ground truth. This will make the supporting literature review concrete rather than implicit.

Revision: yes

Referee: [Abstract] Abstract (paragraph on existing resources gap) and dataset-construction description: the central utility claim, that the benchmark is suitable for evaluating methods intended for real-world imagery, rests on the untested assumption that relative rankings and failure modes observed on idealized synthetic renders will match those on real camera data; no cross-domain experiments, realism ablations, or comparisons against paired real stereo sets are reported.

Authors: The manuscript frames StereoGenBench primarily as a controlled synthetic resource for isolating baseline-regime effects rather than as a direct proxy for real-camera evaluation. We acknowledge that the domain gap is untested and that no cross-domain validation is provided. We will add an explicit limitations paragraph in the discussion section stating this assumption and recommending that downstream users perform their own real-data validation when transfer is critical. No new experiments will be added, as the core contribution remains the synthetic benchmark itself.

Revision: partial

Circularity Check

0 steps flagged

No circularity; dataset release paper with no derivations or fitted claims


The paper introduces StereoGenBench, a synthetic multi-camera dataset generated via Unreal Engine. The abstract and described contribution contain no equations, parameters, predictions, or derivation steps. Claims about gaps in existing resources are external observations, not derived from the paper's own content or self-citations. No load-bearing steps reduce to inputs by construction, self-definition, or fitted renaming. This is a standard dataset/benchmark release with no internal circular reasoning, consistent with a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; contribution is dataset creation using standard rendering practices.

pith-pipeline@v0.9.0 · 5770 in / 1059 out tokens · 24641 ms · 2026-05-25T04:47:31.401412+00:00 · methodology



[1] [1]

InStereo2K: A large real dataset for stereo matching in indoor scenes.Science China Information Sciences, 63 (11):212101, 2020

Wei Bao, Wen Wang, Yuhua Xu, Yulan Guo, Siyu Hong, and Xiaohu Zhang. InStereo2K: A large real dataset for stereo matching in indoor scenes.Science China Information Sciences, 63 (11):212101, 2020. doi: 10.1007/s11432-019-2803-x

work page doi:10.1007/s11432-019-2803-x 2020

[2] [2]

StereoSpace: Depth-free synthesis of stereo geometry via end-to-end diffusion in a canonical space, 2025

Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, and Konrad Schindler. StereoSpace: Depth-free synthesis of stereo geometry via end-to-end diffusion in a canonical space, 2025

work page 2025

[3] [3]

Butler, Jonas Wulff, Garrett B

Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. InEuropean Conference on Computer Vision (ECCV), volume 7577 ofLecture Notes in Computer Science, pages 611–625. Springer, 2012

work page 2012

[4] [4]

Virtual KITTI 2, 2020

Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2, 2020

work page 2020

[5] [5]

The Cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016

work page 2016

[6] [6]

SVG: 3D stereoscopic video generation via denoising frame matrix

Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, and Yinda Zhang. SVG: 3D stereoscopic video generation via denoising frame matrix. In International Conference on Learning Representations (ICLR), 2025

work page 2025

[7] [7]

DeDoDe: Detect, don’t describe – describe, don’t detect for local feature matching

Johan Edstedt, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. DeDoDe: Detect, don’t describe – describe, don’t detect for local feature matching. InInternational Conference on 3D Vision (3DV), pages 148–157. IEEE, 2024

work page 2024

[8] [8]

Text2stereo: Repur- posing stable diffusion for stereo generation with consistency rewards

Aakash Garg, Libing Zeng, Andrii Tsarov, and Nima Khademi Kalantari. Text2stereo: Repur- posing stable diffusion for stereo generation with consistency rewards. InCVPR 2025 Workshop on Computer Vision for Mixed Reality (CV4MR), 2025

work page 2025

[9] [9]

Are we ready for autonomous driving? the KITTI vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012

work page 2012

[10] Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, and Noah Snavely. Eye2Eye: A simple approach for monocular-to-stereo video synthesis, 2025

[11] Xianda Guo, Chenming Zhang, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, and Long Chen. StereoCarla: A high-fidelity driving dataset for generalizable stereo, 2025

[12] Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, and Edward Li. Holopix50k: A large-scale in-the-wild stereo image dataset. In CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

[13] Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4D: Learning how things move in 3D from internet stereo videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10497–10509, 2025

[14] Qiao Jin, Xiaodong Chen, Wu Liu, Tao Mei, and Yongdong Zhang. T-SVG: Text-driven stereoscopic video generation, 2024

[15] Junpeng Jing, Ye Mao, Anlan Qiu, and Krystian Mikolajczyk. Match stereo videos via bidirectional alignment, 2024

[16] Laurent Jospin, Allen Antony, Lian Xu, Hamid Laga, Farid Boussaid, and Mohammed Bennamoun. Active-passive SimStereo – benchmarking the cross-generalization capabilities of deep learning-based stereo methods. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, pages 29235–29247, 2022

[17] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13229–13239, 2023

[18] Zihua Liu, Yizhou Li, Songyan Zhang, and Masatoshi Okutomi. DMS: Diffusion-based multi-baseline stereo generation for improving self-supervised depth estimation. In ICCV Workshop on Advances in Image Manipulation (AIM), 2025

[19] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016

[20] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

[21] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3061–3070, 2015

[22] Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, and Federico Tombari. Elastic3D: Controllable stereo video conversion with guided latent decoding, 2025

[23] Feng Qiao, Zhexiao Xiong, Eric Xing, and Nathan Jacobs. Towards open-world generation of stereo images and unsupervised matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[24] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition: 36th German Conference (GCPR), volume 8753 of Lecture Notes in Computer Science, pages 31–42. Springer, 2014. doi: 10.1007/978-3-319-11752-2_3

[25] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3260–3269, 2017

[26] Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, and Ying-Cong Chen. StereoPilot: Learning unified and efficient stereo conversion via generative priors, 2025

[27] Jian Shi, Zhenyu Li, and Peter Wonka. ImmersePro: End-to-end stereo video synthesis via implicit disparity learning, 2024

[28] Jian Shi, Qian Wang, Zhenyu Li, Ramzi Idoughi, and Peter Wonka. StereoCrafter-Zero: Zero-shot stereo video generation with noisy restart, 2024

[29] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3D photography using context-aware layered depth inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

[30] Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, and Federico Tombari. M2SVid: End-to-end inpainting and refinement for monocular-to-stereo video conversion. In International Conference on 3D Vision (3DV), 2026

[31] Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang, Pardis Taghavi, and Dante Lok. MODEST: Multi-optics depth-of-field stereo dataset, 2025

[32] Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, and Siavash Arjomand Bigdeli. StereoDiffusion: Training-free stereo image generation using latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 7416–7425, 2024

[33] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. IRS: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In IEEE International Conference on Multimedia and Expo (ICME), 2021

[34] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916, 2020

[35] Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, and Xin Yang. ZeroStereo: Zero-shot stereo matching from single images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[36] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861

[37] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, and Michael Firman. Learning stereo from single images. In European Conference on Computer Vision (ECCV), volume 12346 of Lecture Notes in Computer Science, pages 722–740. Springer, 2020. doi: 10.1007/978-3-030-58452-8_42

[38] Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. FoundationStereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5249–5260, 2025

[39] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In European Conference on Computer Vision (ECCV), volume 9908 of Lecture Notes in Computer Science, pages 842–857. Springer, 2016. doi: 10.1007/978-3-319-46493-0_51

[41] Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, and Yunchao Wei. StereoWorld: Geometry-aware monocular-to-stereo video generation, 2025

[42] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng, Jianping Shi, and Bolei Zhou. DrivingStereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 899–908, 2019

[43] Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, and Huchuan Lu. Mono2Stereo: A benchmark and empirical study for stereo conversion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018. doi: 10.1109/CVPR.2018.00068

[45] Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, and Ying Shan. StereoCrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3D from monocular videos, 2024

A Per-scene file schema and loading example

Each released scene is organized as a self-contained directory. The required core files ...

eval/MapSeenInTrain/IPD_Gaussian/AssetsvilleTown/scene_000000

The random-right control uses a right view from a different scene under the same evaluator aggregation protocol.

Control right view       PSNR↑   SSIM↑    LPIPS↓   E_Match↓   P-PSNR↑   SD↓
Rendered target I_R      ∞       1.0000   0.0000   0.00       32.63     0.0209
Copied left I_L          19.07   0.6530   0.1804   0.00       27.04     97.7503
Wrong-baseline target    19.08   0.6531   0.1806   29.10      27.42     0.5030
Random right view        9.5...