arxiv: 2605.23237 · v1 · pith:YP2RDL3Snew · submitted 2026-05-22 · 💻 cs.CV

StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo generation · multi-baseline benchmark · synthetic dataset · Unreal Engine · camera calibration · metric depth · view synthesis · baseline regimes

The pith

StereoGenBench supplies scene-paired multi-baseline stereo views complete with calibration, metric depth, and poses from one controlled synthetic source.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing stereo resources supply only subsets of the variables needed for stereo generation and view synthesis, leaving no single source with scene-paired, calibrated multi-baseline right views together with intrinsics, dense metric depth, and per-frame poses. The paper fills this gap by releasing StereoGenBench, an Unreal Engine dataset rendered from a rigid six-camera lateral array that produces up to 15 view pairs per scene. Adjacent baselines range from inter-pupillary to wide, focal length is sampled independently, and every view carries RGB, metric depth, intrinsics, per-pair baselines, and poses. Evaluation splits cover narrow and wide regimes while a train-only split offers broader coverage, with generation code released for extension.

Core claim

StereoGenBench is a synthetic dataset rendered with a rigid six-camera lateral array that yields up to 15 calibrated view pairs per scene across inter-pupillary to wide baselines, each view released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses, providing the first unified controlled source for measuring baseline-regime sensitivity and target-camera consistency in stereo generation.

What carries the argument

A rigid six-camera lateral array rendered in Unreal Engine that produces up to 15 calibrated view pairs per scene with controllable baselines sampled from inter-pupillary to wide regimes and independently sampled focal lengths.
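
The figure of 15 pairs follows from choosing 2 cameras out of 6. The sketch below enumerates those pairs and derives a baseline for each one. It assumes the six cameras are collinear, so that a pair's baseline is the sum of the adjacent spacings between its two cameras; the function name and the spacing values are hypothetical and are not taken from the released dataset, which ships its own per-pair baselines.

```python
from itertools import accumulate, combinations


def pair_baselines(adjacent_baselines_m):
    """Return {(i, j): baseline_m} for every camera pair with i < j.

    Assumption: cameras are collinear, so a pair's baseline equals the
    sum of the adjacent spacings that lie between the two cameras.
    """
    # Lateral offset of each camera measured from camera 0.
    offsets = [0.0, *accumulate(adjacent_baselines_m)]
    return {
        (i, j): offsets[j] - offsets[i]
        for i, j in combinations(range(len(offsets)), 2)
    }


# Hypothetical adjacent spacings in metres: five gaps between six cameras.
adjacent = [0.063, 0.063, 0.10, 0.20, 0.40]
pairs = pair_baselines(adjacent)

print(len(pairs))                # 15 pairs from 6 cameras
print(round(pairs[(0, 5)], 3))   # widest pair: 0.826
```

In practice the released per-pair baselines should be read from the dataset rather than recomputed; the sketch only shows why six cameras give at most 15 pairs.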

If this is right

  • Baseline-regime sensitivity of any stereo generation method becomes directly measurable while scene content stays fixed across pairs.
  • Target-camera consistency can be quantified across narrow and wide baseline regimes within the same scenes.
  • Training and evaluation data now exist with all geometric variables known and independently controllable.
  • New scenes can be added using the released generation code and compatible assets while preserving the same calibration and depth format.
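
One way to read the first two points is as a grouped comparison: score every pair, label it by baseline regime, and compare the regime means while the scenes stay the same. The sketch below is a minimal illustration under stated assumptions. The 0.10 m cut-off between "narrow" and "wide", the record layout, and the scores are all hypothetical; the paper's evaluation code defines its own splits and metrics.

```python
from collections import defaultdict
from statistics import mean


def regime_of(baseline_m, narrow_max_m=0.10):
    """Label a pair 'narrow' or 'wide'. The cut-off is a hypothetical value."""
    return "narrow" if baseline_m <= narrow_max_m else "wide"


def mean_score_by_regime(records):
    """Average a per-pair score within each baseline regime.

    records: iterable of (scene_id, baseline_m, score) tuples.
    """
    buckets = defaultdict(list)
    for _scene_id, baseline_m, score in records:
        buckets[regime_of(baseline_m)].append(score)
    return {regime: mean(scores) for regime, scores in buckets.items()}


# Made-up scores (higher is better) for two scenes, each seen at two baselines.
records = [
    ("scene_a", 0.063, 30.0),
    ("scene_a", 0.400, 24.0),
    ("scene_b", 0.063, 28.0),
    ("scene_b", 0.400, 22.0),
]
summary = mean_score_by_regime(records)
print(summary)                               # {'narrow': 29.0, 'wide': 23.0}
print(summary["narrow"] - summary["wide"])   # 6.0
```

Because both regimes draw on the same scenes, the gap between the two means can be attributed to baseline rather than to scene content.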

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The controlled baselines and depth ground truth could support experiments that isolate the effect of baseline length on disparity estimation accuracy.
  • The per-frame poses open the possibility of testing stereo generation under small camera motions that are not available in static-pair datasets.
  • Releasing the generation configuration may allow researchers to create custom splits that match the exact baseline distributions of particular real-world capture rigs.
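
For the first point, the link between baseline and disparity can be made concrete. Assuming a rectified pinhole pair with parallel optical axes (an assumption of this sketch, not a statement about how the dataset is stored), disparity in pixels is d = f * B / Z, where f is the focal length in pixels, B the baseline in metres, and Z the metric depth in metres. The function name and the numbers are hypothetical.

```python
def disparity_px(depth_m, focal_px, baseline_m):
    """Horizontal disparity in pixels for a rectified pinhole pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical values: one surface point 5 m away, seen at two baselines.
focal_px = 1000.0
depth_m = 5.0
for baseline_m in (0.063, 0.40):
    d = disparity_px(depth_m, focal_px, baseline_m)
    print(baseline_m, round(d, 3))   # 0.063 -> 12.6, 0.4 -> 80.0
```

With depth and focal length held fixed, disparity grows linearly with baseline, which is why the same scene rendered at several baselines isolates that variable.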

Load-bearing premise

Synthetic renders from Unreal Engine are representative enough of real-world imagery that methods evaluated on them will behave similarly on actual camera captures.

What would settle it

Stereo generation models that achieve high scores on StereoGenBench splits show substantially lower performance when tested on matched real-world stereo sequences captured with equivalent baselines and intrinsics.
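
A simple way to frame such a test is to compare how methods rank on the benchmark with how they rank on real captures. The sketch below computes Spearman's rank correlation in plain Python. It assumes no tied scores, and the method scores are invented for illustration; the paper reports no cross-domain experiment of this kind.

```python
def ranks(values):
    """Rank values from 1 (largest) downward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented scores for four hypothetical methods (higher is better).
synthetic_scores = [31.0, 28.5, 26.0, 24.0]
real_scores = [25.0, 26.0, 21.0, 19.5]

print(round(spearman_rho(synthetic_scores, real_scores), 3))   # 0.8
```

A correlation near 1 would suggest that benchmark rankings carry over to real data; a low or negative value would support the failure case described above.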

Figures

Figures reproduced from arXiv: 2605.23237 by Feng Qiao, Nathan Jacobs, Yangzhi Cui.

Figure 1: Six-camera rig outputs under the three baseline-sampling families. Each row shows six ...

Figure 2: Generation pipeline. Scene construction, spawn validation, trajectory candidate ranking, ...

Figure 3: Representative scenes from the released map roots, illustrating indoor, outdoor, urban, ...

Figure 4: Realized baseline distributions in the released dataset. Panels (a), (b), (d), and (e) show ...

Figure 5: Trajectory construction. Candidate viewpoints are sampled around the subject, invalid ...
Original abstract

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces StereoGenBench, a synthetic multi-camera benchmark generated via Unreal Engine with a rigid six-camera lateral array. Each scene yields up to 15 calibrated view pairs across narrow-to-wide baselines, with independent focal-length sampling; every view includes RGB, dense metric depth, intrinsics, per-pair baselines, and per-frame poses. The resource is positioned to enable controlled measurement of baseline-regime sensitivity for stereo generation, geometry estimation, and view synthesis. Splits are defined for narrow-baseline, wide-baseline, and train-only all-pairs regimes. The authors release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration, claiming that no prior resource supplies the full combination of scene-paired multi-baseline right-view ground truth, jointly recorded intrinsics, dense metric depth, and poses in a single controlled source.

Significance. If the transferability assumption holds, the benchmark would allow systematic, reproducible study of how stereo methods respond to controlled changes in baseline, intrinsics, and depth distribution—capabilities not jointly available in existing collections. Explicit release of generation code, evaluation scripts, and metadata constitutes a concrete strength that supports community extension and verification.

major comments (2)
  1. [Abstract] Abstract: the uniqueness claim ('to our knowledge, ... do not provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source') is asserted without a comparative table or systematic literature review of existing stereo datasets; this directly underpins the motivation for releasing a new resource.
  2. [Abstract] Abstract (paragraph on existing resources gap) and dataset-construction description: the central utility claim—that the benchmark is suitable for evaluating methods intended for real-world imagery—rests on the untested assumption that relative rankings and failure modes observed on idealized synthetic renders will match those on real camera data; no cross-domain experiments, realism ablations, or comparisons against paired real stereo sets are reported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the motivation and scope of StereoGenBench. We address each major comment below and indicate the planned revisions.

  1. Referee: [Abstract] Abstract: the uniqueness claim ('to our knowledge, ... do not provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source') is asserted without a comparative table or systematic literature review of existing stereo datasets; this directly underpins the motivation for releasing a new resource.

    Authors: We agree that the uniqueness claim would be more robust with explicit evidence. In the revised version we will add a comparative table (and brief accompanying text) that systematically contrasts StereoGenBench against prior stereo resources on the dimensions of multi-baseline coverage, dense metric depth, joint intrinsics/poses, and scene-paired right-view ground truth. This will make the supporting literature review concrete rather than implicit.

    Revision: yes

  2. Referee: [Abstract] Abstract (paragraph on existing resources gap) and dataset-construction description: the central utility claim—that the benchmark is suitable for evaluating methods intended for real-world imagery—rests on the untested assumption that relative rankings and failure modes observed on idealized synthetic renders will match those on real camera data; no cross-domain experiments, realism ablations, or comparisons against paired real stereo sets are reported.

    Authors: The manuscript frames StereoGenBench primarily as a controlled synthetic resource for isolating baseline-regime effects rather than as a direct proxy for real-camera evaluation. We acknowledge that the domain gap is untested and that no cross-domain validation is provided. We will add an explicit limitations paragraph in the discussion section stating this assumption and recommending that downstream users perform their own real-data validation when transfer is critical. No new experiments will be added, as the core contribution remains the synthetic benchmark itself.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; dataset release paper with no derivations or fitted claims

The paper introduces StereoGenBench, a synthetic multi-camera dataset generated via Unreal Engine. The abstract and described contribution contain no equations, parameters, predictions, or derivation steps. Claims about gaps in existing resources are external observations, not derived from the paper's own content or self-citations. No load-bearing steps reduce to inputs by construction, self-definition, or fitted renaming. This is a standard dataset/benchmark release with no internal circular reasoning, consistent with a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; contribution is dataset creation using standard rendering practices.

pith-pipeline@v0.9.0 · 5770 in / 1059 out tokens · 24641 ms · 2026-05-25T04:47:31.401412+00:00 · methodology

