arxiv: 2511.16766 · v3 · pith:TLKXPYY3new · submitted 2025-11-20 · 💻 cs.CV

SVG360: Editable Multiview Vector Graphics from a Single SVG

Pith reviewed 2026-05-21 18:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords SVG, multiview vector graphics, consistent vectorization, spatial memory, structure-aware reconstruction, editable assets, view-conditioned representation, path consolidation

The pith

SVG360 converts a single SVG into geometrically and visually consistent multiview vector assets using a view-consistent vectorization pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SVG360 as a way to take one input SVG and produce a set of SVGs for different viewpoints that remain coherent as a single editable object. Direct vectorization of each view separately creates mismatched regions, fragmented paths, and shifting colors that break the ability to treat the output as one asset. The approach first lifts the rasterized SVG into a view-conditioned representation, then propagates part identities across views with a spatial memory step drawn from video segmentation, and finally reconstructs each view with structure-aware vectorization that merges redundant paths while keeping boundaries intact. A reader would care because this moves vector graphics from fixed single-view illustrations toward assets that support animation, editing, and 360-degree viewing without losing editability.

Core claim

SVG360 shows that a three-stage pipeline yields multiview SVGs that are more consistent, carry fewer redundant paths, and preserve fine structures better than direct per-view vectorization, all without task-specific retraining. First, it lifts a rasterized single-view SVG into a view-conditioned object representation. Second, it propagates part identity across neighboring views through a spatial memory mechanism, enforcing consistent region decomposition, path correspondence, and color assignment. Third, it reconstructs each view via structure-aware vectorization that consolidates redundant paths and optimizes local geometry.

What carries the argument

A view-consistent vectorization pipeline that lifts the rasterized input to a view-conditioned object representation, propagates part identity via spatial memory, and reconstructs each view through structure-aware vectorization.

If this is right

  • Region decomposition and path correspondence stay aligned across different prescribed camera views.
  • Color assignments remain stable from view to view without extra supervision.
  • Redundant paths are consolidated during reconstruction while boundaries and semantic parts are retained.
  • Fine structures from the original SVG appear more intact in the multiview outputs than in independent vectorization.
  • The generated assets function as coherent editable objects for design, animation, and multi-view editing.
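
The color-stability consequence in the list above can be checked mechanically once the parts in each view carry identifiers. The following is a minimal sketch, not the paper's evaluation code: it assumes a hypothetical mapping from view azimuth to a `{part_id: hex_color}` dictionary, and the function name `unstable_parts` and the sample data are purely illustrative.

```python
def unstable_parts(view_colors):
    """Return the parts whose fill color differs between views.

    view_colors is a hypothetical structure (an assumption of this sketch):
    {azimuth_deg: {part_id: "#rrggbb"}}.
    Colors are compared as exact hex strings, ignoring letter case.
    """
    seen = {}
    for parts in view_colors.values():
        for part_id, color in parts.items():
            seen.setdefault(part_id, set()).add(color.lower())
    # Keep only parts that were assigned more than one distinct color.
    return {pid: colors for pid, colors in seen.items() if len(colors) > 1}


# Toy data for illustration only; not taken from the paper.
views = {
    0: {"wheel": "#222222", "body": "#FF0000"},
    10: {"wheel": "#222222", "body": "#ff0000"},
    20: {"wheel": "#242424", "body": "#ff0000"},
}
drift = unstable_parts(views)
print(drift)
```

Exact string comparison is the strictest possible test; a perceptual color-difference tolerance would be a reasonable alternative when small shifts are acceptable.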

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting-plus-memory pattern could support vector outputs that remain editable when integrated into 3D scene pipelines.
  • Spatial memory propagation might extend to maintaining consistency across time in vector-based animation sequences.
  • Design tools could incorporate this pipeline to automatically generate coherent multiview versions of user-created SVGs.
  • Scalability tests on SVGs with dense overlapping layers would clarify where the current region propagation begins to degrade.

Load-bearing premise

The spatial memory mechanism adapted from video segmentation can establish consistent region decomposition, path correspondence, and color assignment across views without any task-specific retraining or additional supervision.

What would settle it

Running the pipeline with the spatial memory propagation removed and observing whether the resulting multiview SVGs still maintain equivalent region consistency, path stability, and structure preservation across the prescribed camera views.

Figures

Figures reproduced from arXiv: 2511.16766 by Antonio Haas, Christian Franke, Grace Li Zhang, Mengnan Jiang, Michele Franco Adesso, Zhaolin Sun.

Figure 1: Our pipeline begins by converting the input SVG into a raster image, followed by rendering 3D consistent multi-view rasters using …
Figure 2: Qualitative comparison. The figure summarizes representative issues observed in Adobe Turntable: (a) geometric inconsistencies across adjacent views; (b) cluttered structures arising from overlapping thin components; (c) merging of parts with similar colors; (d) missing regions in certain viewpoints; (e) color drift and gradual loss of small details across views. Our method produces multi-view SVGs with s…
Figure 3: Segmentation comparison of Spatial-SAM2 and the …
Figure 4: Ablation study of four segmentation strategies: Spatial …
Original abstract

Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single-view, two-dimensional illustrations. This limits their use in applications that require object-level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per-view generation or vectorization produces view-dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view-consistent vectorization pipeline. It first lifts the rasterized input into a view-conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task-specific retraining. Finally, each view is reconstructed as an editable SVG through structure-aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object-level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per-view vectorization. By turning a single-view SVG into a coherent 360-degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents SVG360, a framework to convert a single input SVG into geometrically and visually consistent multiview SVG assets. It uses a three-stage view-consistent vectorization pipeline: (1) lifting the rasterized input to a view-conditioned object representation and rendering target views, (2) propagating part identity, path correspondence, and color using a spatial memory mechanism adapted from video segmentation without task-specific retraining, and (3) structure-aware vectorization per view to consolidate redundant paths while preserving boundaries and semantic parts. Experiments on object-level SVG assets are claimed to show improved multiview consistency, reduced path redundancy, and better preservation of fine structures versus direct per-view vectorization.

Significance. If the central claims hold, the work would be significant for enabling editable 360-degree vector assets from single-view SVGs, expanding vector graphics beyond static 2D illustrations toward applications in design, animation, and structured visual editing. The practical reuse of an off-the-shelf video segmentation memory module without retraining or supervision is a notable engineering strength that could facilitate adoption.

major comments (2)
  1. The second stage of the pipeline (propagation of part identity across views) relies on the assumption that a spatial memory mechanism adapted from video segmentation can establish consistent region decomposition, path correspondence, and color assignment for rendered views under prescribed cameras that may differ by tens of degrees. Video segmentation typically exploits small inter-frame 2D motion and temporal continuity; the manuscript does not appear to introduce explicit 3D geometric consistency, self-occlusion handling, or viewpoint-specific adaptations. This assumption is load-bearing for the multiview consistency claim and requires targeted ablation or quantitative validation on large viewpoint changes.
  2. Experiments section: while the abstract states that SVG360 improves multiview consistency and reduces path redundancy relative to direct per-view vectorization, the manuscript should report concrete quantitative metrics (e.g., path count reduction percentages, consistency scores with error bars, and dataset details including number of objects, viewpoint sampling, and baselines) to substantiate the claims. Without these, the magnitude of improvement remains difficult to assess.
minor comments (3)
  1. Abstract: the description of 'neighboring views' should be clarified in relation to the 360-degree goal; specify the camera sampling strategy and how propagation chains across non-adjacent views.
  2. The manuscript should cite the specific video segmentation method being adapted and discuss any modifications made to the memory mechanism.
  3. Figure captions and method diagrams would benefit from explicit labels indicating which stage (lifting, propagation, or vectorization) each component belongs to.
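
The question of how propagation chains across non-adjacent views can be made concrete with a small sketch. It assumes evenly spaced azimuths on a single turntable ring and a one-directional chain starting from the input view at 0 degrees; the 10-degree default is the figure quoted in the simulated rebuttal on this page, and the function names are hypothetical rather than taken from the paper.

```python
def view_azimuths(step_deg=10):
    """Evenly spaced camera azimuths covering a full turn (assumed sampling)."""
    if step_deg <= 0 or 360 % step_deg != 0:
        raise ValueError("step_deg must be a positive divisor of 360")
    return list(range(0, 360, step_deg))


def propagation_pairs(azimuths):
    """Adjacent (previous, next) view pairs along a one-directional chain."""
    return list(zip(azimuths, azimuths[1:]))


def hops_from_source(target_deg, step_deg=10):
    """Number of chained propagation steps from the 0-degree source view.

    Assumes target_deg lies on the sampling grid defined by step_deg.
    """
    return (target_deg % 360) // step_deg


azimuths = view_azimuths(10)
pairs = propagation_pairs(azimuths)
print(len(azimuths), pairs[0], pairs[-1], hops_from_source(180))
```

Under this one-directional ordering, the view opposite the source sits many propagation steps away, which is where accumulated drift would be expected to matter most. Whether the paper chains in one direction, in both, or closes the loop is not stated in the material above.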

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and outline the planned revisions to strengthen the manuscript.

  1. Referee: The second stage of the pipeline (propagation of part identity across views) relies on the assumption that a spatial memory mechanism adapted from video segmentation can establish consistent region decomposition, path correspondence, and color assignment for rendered views under prescribed cameras that may differ by tens of degrees. Video segmentation typically exploits small inter-frame 2D motion and temporal continuity; the manuscript does not appear to introduce explicit 3D geometric consistency, self-occlusion handling, or viewpoint-specific adaptations. This assumption is load-bearing for the multiview consistency claim and requires targeted ablation or quantitative validation on large viewpoint changes.

    Authors: We thank the referee for this observation. Our pipeline renders the target views sequentially using small angular increments (typically 10 degrees) between neighboring views, preserving local 2D continuity that the off-the-shelf spatial memory module can exploit without modification. Larger viewpoint differences are handled by chaining these local propagations around the object. We intentionally avoided introducing explicit 3D consistency or retraining to maintain the engineering simplicity highlighted in the significance assessment. In the revision we will add a dedicated paragraph in Section 3.2 clarifying the view sampling strategy and include an ablation that varies angular step size while reporting quantitative consistency scores for viewpoint separations up to 30 degrees.

    Revision: yes

  2. Referee: Experiments section: while the abstract states that SVG360 improves multiview consistency and reduces path redundancy relative to direct per-view vectorization, the manuscript should report concrete quantitative metrics (e.g., path count reduction percentages, consistency scores with error bars, and dataset details including number of objects, viewpoint sampling, and baselines) to substantiate the claims. Without these, the magnitude of improvement remains difficult to assess.

    Authors: We agree that explicit quantitative metrics are required to make the improvement claims precise. The current manuscript emphasizes qualitative comparisons and aggregate observations; we will expand the Experiments section with concrete numbers including average path-count reduction (reported as percentages with standard deviations), multiview part-consistency scores (e.g., mean IoU across views), full dataset statistics (number of objects, exact viewpoint sampling density), and direct numerical comparisons against the per-view baseline. Error bars and dataset details will be added to the relevant tables and figures.

    Revision: yes
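
The metrics named in the second response can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' evaluation protocol: path-count reduction is taken relative to the per-view baseline, IoU is computed between a predicted part mask and a reference mask for the same view, masks are represented as sets of pixel coordinates, and all numbers are toy values.

```python
from statistics import mean, stdev


def path_reduction_pct(baseline_paths, ours_paths):
    """Percentage of paths removed relative to the per-view baseline."""
    if baseline_paths <= 0:
        raise ValueError("baseline_paths must be positive")
    return 100.0 * (baseline_paths - ours_paths) / baseline_paths


def mask_iou(pred, ref):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = pred | ref
    if not union:
        return 1.0  # two empty masks are treated as identical
    return len(pred & ref) / len(union)


# Toy (baseline, ours) path counts per object; not results from the paper.
path_counts = [(120, 90), (200, 150), (80, 72)]
reductions = [path_reduction_pct(b, o) for b, o in path_counts]
mean_reduction = mean(reductions)
std_reduction = stdev(reductions)  # sample standard deviation

# Toy masks for one part in one view.
pred = {(0, 0), (0, 1), (1, 0)}
ref = {(0, 1), (1, 0), (1, 1)}
iou = mask_iou(pred, ref)
print(mean_reduction, round(std_reduction, 3), iou)
```

Averaging `mask_iou` over all parts and views would give one candidate for the "mean IoU across views" score; how reference masks are obtained for rendered views is left open here because the material above does not specify it.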

Circularity Check

0 steps flagged

No circularity: pipeline uses external video segmentation memory and descriptive stages without self-referential reductions

The paper presents a three-stage engineering pipeline (lift rasterized SVG to view-conditioned representation, propagate part identity via adapted spatial memory from video segmentation, then apply structure-aware vectorization) with no equations, fitted parameters, or first-principles derivations. The central consistency claims rest on the external adaptation of video segmentation memory rather than any internal definition, self-citation chain, or renaming that reduces outputs to inputs by construction. All load-bearing steps invoke prior independent mechanisms or standard vectorization techniques, making the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only: the approach assumes a liftable view-conditioned object representation exists from raster input and that video-segmentation-style memory transfers directly to static multiview SVG part consistency without retraining. No free parameters or invented entities are explicitly named.

axioms (2)
  • domain assumption: A view-conditioned object representation can be lifted from the rasterized single-view SVG input under prescribed cameras.
    Invoked in the first stage of the pipeline described in the abstract.
  • domain assumption: Spatial memory from video segmentation can propagate part identity, region decomposition, path correspondence, and color assignment across neighboring views without task-specific retraining.
    Central to the second stage; stated as establishing consistency without additional training.


