pith. machine review for the scientific record.

arxiv: 2604.14302 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-view scene generation · freehand sketches · geometric consistency · diffusion models · camera-aware attention · correspondence supervision · structure-from-motion · scene synthesis

The pith

One freehand sketch generates geometrically consistent multi-view scenes in a single denoising pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that freehand sketches, which are abstract and spatially distorted, can still produce multiple geometrically consistent views of a scene. It does so by constructing a dataset of roughly nine thousand sketch-to-multi-view pairs and adding two components to a video diffusion model: adapters that inject camera-geometry priors and a loss that supervises sparse 3D point matches. A sympathetic reader would care because this removes the usual need for photographs, reference images, or per-scene optimization, turning a quick drawing (plus a caption) into usable multi-view output.

Core claim

Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. It addresses absent training data, geometric reasoning from distorted input, and cross-view consistency with a curated dataset of ~9k sketch-to-multiview samples, Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer, and a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. The approach significantly outperforms state-of-the-art two-stage baselines.
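The "single denoising process" is the load-bearing mechanism, so it is worth seeing in outline. The sketch below is a hedged illustration only, assuming a flow-matching-style sampler over stacked view latents; `dit`, `vae`, and the embedding arguments are hypothetical stand-ins, not the paper's interfaces.

    import torch

    @torch.no_grad()
    def generate_views(dit, vae, sketch_latent, caption_emb, cam_emb,
                       n_views=16, steps=50):
        """Hypothetical single-pass sampler; all model handles are stand-ins."""
        c, h, w = sketch_latent.shape[-3:]
        z = torch.randn(1, n_views, c, h, w)          # one latent per target view
        dt = 1.0 / steps
        for i in range(steps):                        # a single denoising trajectory
            t = torch.full((1,), 1.0 - i * dt)
            v = dit(z, t, sketch_latent, caption_emb, cam_emb)  # joint velocity
            z = z - v * dt                            # Euler step: all views move together
        return vae.decode(z)                          # N views at once, no second stage

The point of the stacking is that every denoising step sees all views, which is what lets attention enforce cross-view consistency without a refinement stage.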

What carries the argument

Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer, together with Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions.
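This page does not reproduce either component's equations, so the following is a speculative sketch of what a parallel camera-aware attention branch could look like: queries and keys shifted by a projected camera embedding, added residually alongside the frozen self-attention path, with a zero-initialized gate so the pretrained video prior is untouched at the start of fine-tuning. The module name, shapes, and gating are assumptions, not the authors' CA3.

    import torch
    import torch.nn as nn

    class CameraAwareAdapter(nn.Module):
        """Speculative parallel adapter, not the authors' CA3 implementation."""
        def __init__(self, dim, cam_dim, heads=8):
            super().__init__()
            self.cam_proj = nn.Linear(cam_dim, dim)   # camera geometry -> token space
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # zero init: adapter starts as a no-op

        def forward(self, x, cam_emb):
            # x: (B, n_views * tokens, dim); cam_emb: same shape, per-token camera code
            g = self.cam_proj(cam_emb)
            out, _ = self.attn(x + g, x + g, x, need_weights=False)  # geometry in Q and K
            return x + self.gate * out                # parallel residual branch

Putting the geometry into the queries and keys rather than the values is presumably what makes the correspondence supervision meaningful: a loss on the query-key projections can then shape where cross-view attention looks.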

If this is right

  • The method improves realism by over 60% in FID compared with two-stage baselines.
  • Geometric consistency rises by 23% in the Corr-Acc metric.
  • Inference runs up to 3.7 times faster by avoiding iterative refinement and per-scene optimization.
  • All views are produced simultaneously without reference images or extra stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-pass design could be tested as a route to time-consistent video generation from sketch sequences.
  • Similar camera-aware adapters might transfer to other low-information inputs such as rough diagrams in architecture or product design.
  • If the consistency generalizes, the technique could lower the barrier for casual users to create 3D scene assets without multi-view capture.

Load-bearing premise

The automated generation and filtering pipeline for the ~9k sketch-to-multiview dataset produces data that faithfully represents real freehand sketches and their geometric distortions.

What would settle it

Evaluating the model on an independent collection of real human freehand sketches and measuring whether the generated views achieve the reported Corr-Acc gains or permit accurate SfM reconstructions would settle the geometric-consistency claim.
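Corr-Acc itself is not defined on this page. One plausible reading, offered only as a sketch of how such a check could be scored, triangulates SfM matches from the generated views and counts a correspondence as correct when it reprojects near its matched keypoints under the requested cameras; the threshold and protocol below are assumptions.

    import numpy as np

    def corr_accuracy(X, P_a, P_b, kp_a, kp_b, tau=5.0):
        """Plausible Corr-Acc-style score (the paper's definition may differ).
        X: (M, 3) SfM-triangulated points from a generated view pair;
        P_a, P_b: (3, 4) requested camera projection matrices;
        kp_a, kp_b: (M, 2) matched keypoints in each view."""
        Xh = np.hstack([X, np.ones((len(X), 1))])     # homogeneous 3D points
        def reproj(P):
            x = (P @ Xh.T).T                          # (M, 3) projective image coords
            return x[:, :2] / x[:, 2:3]
        err_a = np.linalg.norm(reproj(P_a) - kp_a, axis=1)
        err_b = np.linalg.norm(reproj(P_b) - kp_b, axis=1)
        return float(((err_a <= tau) & (err_b <= tau)).mean())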

Figures

Figures reproduced from arXiv: 2604.14302 by Ahmed Bourouis, Andrea Maracani, Mete Ozay, Savas Ozkan, Yi-Zhe Song.

Figure 1
Figure 1: Sketch-to-Multi-View generation. Given a single freehand sketch and a text caption, our method generates geometrically consistent multi-view images spanning a full 360° azimuth orbit at four different elevations. Top three rows: in-domain examples from FS-COCO [9]. Bottom two rows: zero-shot generalisation to unseen sketch domains (TU-Berlin [11] and InstantStyle [43]). We define this task as Sketch-to-Mu…
Figure 2
Figure 2: Method overview. (a) A freehand sketch is encoded then denoised by a video DiT augmented with our CA3. All N views are generated in a single forward denoising process. (b) Each DiT block contains a self-attention with LoRA and a parallel CA3 to inject relative camera geometry. (c) During training, SfM correspondences from pseudo ground-truth views supervise the CA3 query–key projections via a sparse InfoNCE…
Figure 3
Figure 3: S2MV dataset generation pipeline. (a) Multi-seed image generation from sketch and caption. (b) Segmentation of sketch and generated images. (c) Best seed selection via mIoU, maximizing the mean Intersection-over-Union between the sketch segmentation masks Ms and the image segmentation masks. (d) Multi-view synthesis from the selected image. (e) Dataset curation yields 9,222 samples. More details in Sec. 3.2.
Figure 4
Figure 4: SfM correspondences across views. Each correspondence image shows two views with colored lines connecting matched keypoints identified by Structure-from-Motion (COLMAP). The view angles (azimuth, elevation) for each pair are indicated above the correspondence visualizations. These correspondences serve as supervision targets for our camera-aware attention adapters (Sec. 3.4).
Figure 5
Figure 5: Qualitative comparison of multi-view generation from sketches.
Figure 6
Figure 6: Attention correspondence visualisation. Query pixel (red dot) in the front view; heatmaps show attention at layer 20 over three target viewpoints. Left: with correspondence supervision. Right: without (ablation Tab. 2 row b). Per-view PSNR remains competitive (12.198), but FID more than doubles (42.54 vs. 18.49) and Corr-Acc degrades to 0.174, confirming that preserving view identity through the VAE bottleneck…
Figure 7
Figure 7: Generalisation to unseen sketch domains.
Figure 8
Figure 8: Additional qualitative comparison. Same format as …
Figure 9
Figure 9: Additional qualitative comparison. Same format as …
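Two of the captions above are concrete enough to sketch. Figure 2(c) describes a sparse InfoNCE over the adapter's query-key projections: SfM correspondences pick out matched token pairs across views, and each matched query must identify its matched key among all keys. Shapes, sampling, and temperature below are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def sparse_infonce(q, k, matches, temp=0.07):
        """Illustrative sparse InfoNCE on adapter features.
        q: (T_a, d) tokens of view A; k: (T_b, d) tokens of view B;
        matches: (M, 2) long tensor of SfM-matched token indices."""
        qn = F.normalize(q[matches[:, 0]], dim=-1)    # (M, d) matched queries
        kn = F.normalize(k, dim=-1)                   # (T_b, d) candidate keys
        logits = qn @ kn.T / temp                     # similarity to every key
        return F.cross_entropy(logits, matches[:, 1]) # positive = SfM-matched token

Figure 3(c) selects, among several generation seeds, the image whose segmentation best overlaps the sketch's, by maximizing mean IoU. A minimal version of that selection, assuming the masks are already paired by class upstream:

    def mean_iou(masks_a, masks_b, eps=1e-8):
        # masks_*: lists of boolean arrays, one per paired segment/class.
        ious = [(a & b).sum() / ((a | b).sum() + eps) for a, b in zip(masks_a, masks_b)]
        return sum(ious) / len(ious)

    def best_seed(sketch_masks, seed_masks):
        # Pick the generation seed whose segmentation best matches the sketch (Fig. 3c).
        return max(range(len(seed_masks)),
                   key=lambda s: mean_iou(sketch_masks, seed_masks[s]))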
read the original abstract

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7× inference speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a method for synthesizing geometrically consistent multi-view scenes from a single freehand sketch. Key contributions include a new dataset of ~9k sketch-to-multiview pairs generated via an automated pipeline, Parallel Camera-Aware Attention Adapters (CA3) to inject geometric biases into a video transformer, and a Sparse Correspondence Supervision Loss (CSL) based on SfM reconstructions. All views are synthesized in a single denoising process, and the paper claims substantial improvements over two-stage baselines: >60% better FID, 23% better Corr-Acc, and up to 3.7× faster inference.

Significance. If the central assumptions hold, particularly regarding the dataset's representation of real freehand distortions, this paper would make a notable contribution to the field of sketch-based multi-view generation by enabling efficient, consistent synthesis without iterative optimization or reference images. The emphasis on handling geometric distortions through inductive biases and supervision is valuable for practical applications in design and visualization.

major comments (2)
  1. The construction of the ~9k dataset via automated generation and filtering is load-bearing for the claims of handling 'geometrically impoverished' freehand sketches. The manuscript should include quantitative or qualitative analysis demonstrating that the pipeline's outputs exhibit the irregular distortions and conflicting perspective cues characteristic of actual human freehand drawings, rather than more consistent geometry from 3D model renders. Without this, the reported gains in geometric consistency (Corr-Acc) may not translate to real-world freehand inputs.
  2. The quantitative results reported in the abstract lack supporting details such as error bars, standard deviations, or ablation studies on the individual contributions of the CA3 adapters and CSL loss. Additionally, there is no discussion of potential biases in the automated dataset pipeline or verification that the training data matches the distribution of real freehand sketches, which is critical for validating the single-denoising process's effectiveness.
minor comments (1)
  1. The abstract could more explicitly name the state-of-the-art two-stage baselines being compared against to provide better context for the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to strengthen the presentation of our dataset validation and experimental analysis.

read point-by-point responses
  1. Referee: The construction of the ~9k dataset via automated generation and filtering is load-bearing for the claims of handling 'geometrically impoverished' freehand sketches. The manuscript should include quantitative or qualitative analysis demonstrating that the pipeline's outputs exhibit the irregular distortions and conflicting perspective cues characteristic of actual human freehand drawings, rather than more consistent geometry from 3D model renders. Without this, the reported gains in geometric consistency (Corr-Acc) may not translate to real-world freehand inputs.

    Authors: We agree that explicit validation of the distortion characteristics is important for supporting our claims. Our automated pipeline applies randomized stroke simplification, perspective jitter, and edge perturbation steps derived from 3D renders to emulate freehand variability; however, the current manuscript does not include a dedicated comparison. In the revision we will add a new subsection with both qualitative side-by-side examples (our generated sketches versus real freehand drawings from public sources) and quantitative metrics such as average stroke curvature variance and multi-view perspective inconsistency scores; a minimal sketch of such a curvature metric appears after these responses. This will directly demonstrate the presence of conflicting geometric cues. revision: yes

  2. Referee: The quantitative results reported in the abstract lack supporting details such as error bars, standard deviations, or ablation studies on the individual contributions of the CA3 adapters and CSL loss. Additionally, there is no discussion of potential biases in the automated dataset pipeline or verification that the training data matches the distribution of real freehand sketches, which is critical for validating the single-denoising process's effectiveness.

    Authors: We acknowledge that the reported metrics would benefit from statistical detail and component-wise analysis. The revised manuscript will include error bars and standard deviations computed over multiple random seeds for all FID and Corr-Acc numbers. We will also expand the experiments with dedicated ablations that isolate the contribution of the Parallel Camera-Aware Attention Adapters and the Sparse Correspondence Supervision Loss. In addition, we will insert a discussion of potential pipeline biases (e.g., residual 3D consistency) together with a distributional comparison (feature histograms and Fréchet distance; sketched below) between our generated sketches and real freehand sketch collections to confirm that the training distribution adequately covers the target domain. revision: yes
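Both validation metrics the rebuttal proposes are standard enough to sketch. For stroke curvature variance, one hypothetical reading approximates curvature by the turning angle along each stroke polyline and pools the variance over all strokes; the distributional check is the Fréchet distance between Gaussian fits of two feature sets, the same computation FID uses. The stroke representation and feature extractor are left open by the authors; every name below is an assumption.

    import numpy as np
    from scipy import linalg

    def stroke_curvature_variance(strokes):
        """Hypothetical 'stroke curvature variance': strokes is a list of
        (P, 2) polylines; curvature is approximated by turning angles."""
        turns = []
        for s in strokes:
            d = np.diff(s, axis=0)                    # segment vectors
            heading = np.arctan2(d[:, 1], d[:, 0])    # per-segment headings
            t = np.diff(heading)
            turns.append((t + np.pi) % (2 * np.pi) - np.pi)  # wrap to (-pi, pi]
        return float(np.var(np.concatenate(turns)))

    def frechet_distance(feats_a, feats_b):
        """Fréchet distance between Gaussian fits of two (N, d) feature sets,
        here between generated and real sketch features."""
        mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):                  # drop numerical imaginary part
            covmean = covmean.real
        diff = mu_a - mu_b
        return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))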

Circularity Check

0 steps flagged

No circularity; derivation relies on external dataset construction and SfM supervision

full rationale

The paper's central claims rest on a newly introduced ~9k sketch-to-multiview dataset built via an automated external pipeline and a Sparse Correspondence Supervision Loss explicitly derived from independent Structure-from-Motion reconstructions. The CA3 adapters and single-pass denoising architecture are presented as novel inductive biases injected into a video transformer, without any reduction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. Performance metrics (FID, Corr-Acc) are benchmarked against separate two-stage baselines rather than tautologically following from the inputs. No ansatzes, uniqueness theorems, or renamings of known results are invoked in a self-referential manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality of the automatically generated dataset and the effectiveness of the introduced CA3 adapters and CSL loss for enforcing geometric consistency from distorted 2D input.

axioms (1)
  • domain assumption: The automated generation and filtering pipeline produces a dataset of ~9k sketch-to-multiview samples that is representative of real freehand sketches.
    Invoked to address the absent training data challenge.

pith-pipeline@v0.9.0 · 5557 in / 1189 out tokens · 30037 ms · 2026-05-10T13:20:06.169520+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 25 canonical work pages · 8 internal anchors

  1. Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: AC3D: Analyzing and improving 3D camera control in video diffusion transformers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22875–22889 (2025)

  2. Bai, Y., Fang, S., Yu, C., Wang, F., Huang, Q.: GeoVideo: Introducing geometric regularization into video generation model. arXiv preprint arXiv:2512.03453 (2025)

  3. Bourouis, A., Bessmeltsev, M., Gryaditskaya, Y.: SketchingReality: From freehand scene sketches to photorealistic images. In: International Conference on Learning Representations (2026)

  4. Bourouis, A., Fan, J.E., Gryaditskaya, Y.: Open vocabulary semantic scene sketch understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4176–4186 (2024)

  5. Bourouis, A., Fan, J.E., Gryaditskaya, Y.: Data used in the paper "Open vocabulary scene sketch semantic understanding" (fscoco-seg) (2024)

  6. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  7. Chen, M., Yuan, W., Wang, Y., Sheng, Z., He, Y., Dong, Z., Bo, L., Guo, Y.: Sketch2NeRF: Multi-view sketch-guided text-to-3D generation. arXiv preprint arXiv:2401.14257 (2024)

  8. Chen, Z., Wang, Y., Wang, F., Wang, Z., Liu, H.: V3D: Video diffusion models are effective 3D generators. arXiv preprint arXiv:2403.06738 (2024)

  9. Chowdhury, P.N., Sain, A., Bhunia, A.K., Xiang, T., Gryaditskaya, Y., Song, Y.Z.: FS-COCO: Towards understanding of freehand sketches of common objects in context. In: European Conference on Computer Vision. pp. 253–270. Springer (2022)

  10. dx8152: Qwen-Edit-2509-Multiple-angles. Hugging Face model card (2025), https://huggingface.co/dx8152/Qwen-Edit-2509-Multiple-angles, accessed: 2026-01-28

  11. Eitz, M., Hays, J., Alexa, M.: How do humans sketch objects? ACM Transactions on Graphics (TOG) 31(4), 1–10 (2012)

  12. Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: CAT3D: Create anything in 3D with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024)

  13. He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  14. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  15. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  16. Huang, Z., Wen, H., Dong, J., Wang, Y., Li, Y., Chen, X., Cao, Y.P., Liang, D., Qiao, Y., Dai, B., et al.: EpiDiff: Enhancing multi-view synthesis via localized epipolar-constrained diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9784–9794 (2024)

  17. Kant, Y., Siarohin, A., Wu, Z., Vasilkovsky, M., Qian, G., Ren, J., Guler, R.A., Ghanem, B., Tulyakov, S., Gilitschenski, I.: SPAD: Spatially aware multi-view diffusers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10026–10038 (2024)

  18. Kim, S., Li, K., Deng, X., Shi, Y., Cho, M., Wang, P.: Enhancing 3D fidelity of text-to-3D using cross-view correspondences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10649–10658 (2024)

  19. Koley, S., Bhunia, A.K., Sekhri, D., Sain, A., Chowdhury, P.N., Xiang, T., Song, Y.Z.: It’s all about your sketch: Democratising sketch control in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7204–7214 (2024)

  20. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017)

  21. Kwon, M., Choi, J., Park, J., Jeon, S., Jang, J., Seo, J., Kwak, M., Kim, J.H., Kim, S.: CAMEO: Correspondence-attention alignment for multi-view diffusion models. arXiv preprint arXiv:2512.03045 (2025)

  22. Labs, B.F.: FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2 (2025)

  23. Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3D with MASt3R. In: European Conference on Computer Vision. pp. 71–91. Springer (2024)

  24. Li, C., Yang, Y., Shao, J., Zhou, H., Schwarz, K., Liao, Y.: ReRoPE: Repurposing RoPE for relative camera control. arXiv preprint arXiv:2602.08068 (2026)

  25. Li, P., Liu, Y., Long, X., Zhang, F., Lin, C., Li, M., Qi, X., Zhang, S., Xue, W., Luo, W., et al.: Era3D: High-resolution multiview diffusion using efficient row-wise attention. Advances in Neural Information Processing Systems 37, 55975–56000 (2024)

  26. Li, R., Yi, B., Liu, J., Gao, H., Ma, Y., Kanazawa, A.: Cameras as relative positional encoding. arXiv preprint arXiv:2507.10496 (2025)

  27. Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  28. Liu, F.L., Fu, H., Lai, Y.K., Gao, L.: SketchDream: Sketch-based text-to-3D generation and editing. ACM Transactions on Graphics (TOG) 43(4), 1–13 (2024)

  29. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3D object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)

  30. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: European Conference on Computer Vision. pp. 38–55. Springer (2024)

  31. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453 (2023)

  32. Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3D: Single image to 3D using cross-domain diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9970–9980 (2024)

  33. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  34. Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3D creation on pose-free videos at scale. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2016–2029 (2025)

  35. Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)

  36. Navard, P., Monsefi, A.K., Zhou, M., Chao, W.L., Yilmaz, A., Ramnath, R.: KnobGen: Controlling the sophistication of artwork in sketch-based diffusion models. arXiv preprint arXiv:2410.01595 (2024)

  37. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  38. Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.Y., Johnson, J., Gkioxari, G.: Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501 (2020)

  39. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4104–4113 (2016)

  40. Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)

  41. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512 (2023)

  42. Tang, L., Jia, M., Wang, Q., Phoo, C.P., Hariharan, B.: Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems 36, 1363–1389 (2023)

  43. Tang, M., Vinker, Y., Yan, C., Zhang, L., Agrawala, M.: Instance segmentation of scene sketches using natural image priors. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–10 (2025)

  44. Voleti, V., Yao, C.H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., Laforte, C., Rombach, R., Jampani, V.: SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. In: European Conference on Computer Vision. pp. 439–457. Springer (2024)

  45. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  46. Wang, H., Spinelli, M., Wang, Q., Bai, X., Qin, Z., Chen, A.: InstantStyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733 (2024)

  47. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  48. Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3D vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20697–20709 (2024)

  49. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

  50. Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: MotionCtrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  51. Xiang, J., Chen, X., Xu, S., Wang, R., Lv, Z., Deng, Y., Zhu, H., Dong, Y., Zhao, H., Yuan, N.J., et al.: Native and compact structured latents for 3D generation. arXiv preprint arXiv:2512.14692 (2025)

  52. Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Liu, G., Wang, X., Shan, Y., Wong, T.T.: DynamiCrafter: Animating open-domain images with video diffusion priors. In: European Conference on Computer Vision. pp. 399–417. Springer (2024)

  53. Xu, D., Nie, W., Liu, C., Liu, S., Kautz, J., Wang, Z., Vahdat, A.: CamCo: Camera-controllable 3D-consistent image-to-video generation. arXiv preprint arXiv:2406.02509 (2024)

  54. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

  55. Zhang, C., Li, B., Wei, M., Cao, Y.P., Gambardella, C.C., Phung, D., Cai, J.: Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237 (2025)

  56. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

  57. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)

  58. Zhang, S., Xu, H., Guo, S., Xie, Z., Bao, H., Xu, W., Zou, C.: SpatialCrafter: Unleashing the imagination of video diffusion models for scene reconstruction from limited observations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27794–27805 (2025)

  59. Zhao, Z., Lai, Z., Lin, Q., Zhao, Y., Liu, H., Yang, S., Feng, Y., Yang, M., Zhang, S., Yang, X., et al.: Hunyuan3D 2.0: Scaling diffusion models for high resolution textured 3D assets generation. arXiv preprint arXiv:2501.12202 (2025)

  60. Zhou, J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint arXiv:2503.14489 (2025)