Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

Anqi Liu; Jiachen Lu; Junqi Yang; Lei He; Luying Wang; Weixin Huang; Wenda Wang

arxiv: 2605.20733 · v1 · pith:FCRL7Z2Vnew · submitted 2026-05-20 · 💻 cs.CV

Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang

Pith reviewed 2026-05-21 04:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords sketch-to-3D · minimal surfaces · vision-language guidance · topological encoding · geometric optimization · editable manifolds · structural loss · 3D surface generation

The pith

A hybrid vision-language framework generates editable 3D minimal surfaces from hand-drawn sketches using geometric optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Sketch2MinSurf to convert hand-drawn sketches into structured 3D geometries that maintain topological consistency and can be edited directly in design tools. It integrates vision-language guidance with minimal-surface theory through a spatial-topological encoding of nodes and edges plus a custom structural loss. This combination aims to overcome limitations of prior generative models that often produce surfaces requiring manual fixes or containing inconsistencies. If the approach holds, it would let users create smooth, usable 3D forms from rough drawings without post-processing for topology or artifacts.

Core claim

Sketch2MinSurf is a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of the approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. The framework also introduces the Sketch2MinSurf Structural Loss, a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. This produces manifolds that are directly editable and free from non-manifold artifacts.

What carries the argument

The spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons to enable stable topological control during generation.
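To make the idea of a node-and-edge tuple encoding concrete, here is a minimal sketch of one possible in-memory form. It is not the paper's format: the names `Node`, `Edge`, `SkeletonEncoding`, and the `virtual` flag are hypothetical, and the abstract does not specify how real and virtual edges are derived or serialized.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x, y, z) coordinates; the coordinate frame is an assumption.
Node = Tuple[float, float, float]


@dataclass(frozen=True)
class Edge:
    """Hypothetical skeleton edge between two node indices."""
    start: int
    end: int
    virtual: bool = False  # False = real edge, True = virtual edge


@dataclass
class SkeletonEncoding:
    """Hypothetical container: node coordinates plus real/virtual edges."""
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError on dangling indices, self-loops, or duplicate edges."""
        seen = set()
        for e in self.edges:
            for idx in (e.start, e.end):
                if not 0 <= idx < len(self.nodes):
                    raise ValueError(f"edge {e} references missing node {idx}")
            if e.start == e.end:
                raise ValueError(f"edge {e} is a self-loop")
            key = (min(e.start, e.end), max(e.start, e.end))
            if key in seen:
                raise ValueError(f"duplicate edge between nodes {key}")
            seen.add(key)

    def real_edges(self) -> List[Edge]:
        return [e for e in self.edges if not e.virtual]

    def virtual_edges(self) -> List[Edge]:
        return [e for e in self.edges if e.virtual]


# Toy example: four nodes, three real edges and one virtual edge.
example = SkeletonEncoding(
    nodes=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.5), (0.0, 1.0, 0.5)],
    edges=[Edge(0, 1), Edge(1, 2), Edge(2, 3), Edge(3, 0, virtual=True)],
)
example.validate()
```

Keeping edges as index pairs rather than coordinate pairs means a generator's output can be checked for structural validity before any surface is built, which is the kind of topological control the claim relies on.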

If this is right

Generated surfaces integrate directly into design workflows without topology repair steps.
Topological consistency is preserved across outputs, supporting reliable use in iterative modeling.
Minimal-surface integration yields smooth results applicable to artistic and structural design tasks.
The system enables creation of 3D forms for installations based on simple human sketches.
Outputs avoid non-manifold issues that commonly disrupt downstream 3D processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The node-edge tuple representation could transfer to other 2D-to-3D tasks requiring strict topology, such as diagram-to-model conversion.
Minimal-surface constraints might naturally produce forms with efficient material use when fabricated physically.
Vision-language integration opens possibilities for text-refined adjustments during surface generation.
Broader testing on varied sketch inputs could clarify generalization beyond the current evaluation set.

Load-bearing premise

The spatial-topological encoding enables stable topological control during generation and produces editable manifolds without non-manifold artifacts.

What would settle it

Loading the generated surfaces into a standard 3D modeling application and attempting direct edits to verify absence of artifacts or topology breaks, or evaluating the method on a new collection of sketches with known complex topologies to check consistency.
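Part of the artifact check can be automated before opening a modeling application. The following is a minimal sketch, assuming the generated surface is available as a list of triangle faces given by vertex indices (an assumption; the paper's export format is not described here). It flags edges shared by more than two faces, which is the standard edge-level non-manifold condition. The function names are illustrative, and the check does not cover vertex-level defects such as two fans touching at a single point.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Face = Tuple[int, int, int]   # triangle as three vertex indices
EdgeKey = Tuple[int, int]     # undirected edge, smaller index first


def edge_face_counts(faces: Iterable[Face]) -> Counter:
    """Count how many triangles use each undirected edge."""
    counts: Counter = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def non_manifold_edges(faces: Iterable[Face]) -> List[EdgeKey]:
    """Edges shared by more than two triangles."""
    return sorted(e for e, n in edge_face_counts(faces).items() if n > 2)


def boundary_edges(faces: Iterable[Face]) -> List[EdgeKey]:
    """Edges used by exactly one triangle (open boundary)."""
    return sorted(e for e, n in edge_face_counts(faces).items() if n == 1)


# A closed tetrahedron: every edge is shared by exactly two faces.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]

# Three triangles glued along the edge (0, 1): an edge-level non-manifold fan.
fan = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
```

Running `non_manifold_edges` over each generated mesh in a test set would give a simple count of failures, which is easier to compare across methods than a visual inspection.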

Figures

Figures reproduced from arXiv: 2605.20733 by Anqi Liu, Jiachen Lu, Junqi Yang, Lei He, Luying Wang, Weixin Huang, Wenda Wang.

**Figure 2.** Framework of Sketch2MinSurf. Adjacent text from Section 3.2.1 (Minimal surface representation), under 3.2 Topology-aware skeleton encoding: "We introduce a topology-aware skeleton encoding that decomposes minimal surfaces into two basic elements: saddle regions capturing negative Gaussian curvature, and cylindrical surfaces providing axial extension. This decomposition enables flexible manipulation through geometric operations while maintainin…"

**Figure 3.** Framework of the Sketch2MinSurf description method.

**Figure 4.** Framework of S2MS-Loss. The structural reward evaluates topological quality and …

**Figure 5.** Comparative reconstruction results across model generations. The v4.2 series achieves the …

**Figure 6.** Performance of our best model (v4.2 series) …

**Figure 8.** Framework of the minimal surface binary unit description method.

**Figure 9.** Schematic diagram of the minimal surface binary unit description method.

**Figure 10.** Detailed framework of the Sketch2MinSurf encoder.

**Figure 11.** Illustration of virtual edge skeleton and solid edge skeleton derivation.

**Figure 12.** Image input example (line sketch on left, grayscale shade rendering overlaid with line …

**Figure 13.** Overview of the training dataset. Each sample consists of a single-view rendering …

**Figure 14.** Camera-based coordinate representation method.

**Figure 15.** Complete end-to-end design-to-construction workflow using Sketch2MinSurf.

**Figure 16.** Additional photographs of the completed Sketch2MinSurf-based architectural installation …

Original abstract

Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sketch2MinSurf gives a practical sketch-to-editable-surface pipeline with topological control via node-edge encoding, but the outputs are not shown to be minimal surfaces in the variational sense.

Hi colleague, the main thing to know is that this paper builds a hybrid vision-language plus geometric pipeline that turns hand sketches into 3D surfaces users can edit directly, and it reports a topological similarity score of 0.844 on a 100-sketch test set while releasing code and data. The spatial-topological encoding with real and virtual edges plus the reward-modulated S2MS-Loss is the concrete piece that lets them keep topology stable and avoid non-manifold junk during generation. That part looks workable for design tools where downstream editing matters more than pure math elegance. They also show a university art installation, which at least demonstrates the intent-driven angle in practice.

What the work does cleanly is make the output meshes directly usable without extra cleanup steps, and the public release lowers the barrier for anyone wanting to try the encoding idea.

The soft spot is the minimal-surface claim itself. The title and abstract promise minimal surfaces, which by definition have zero mean curvature, yet the described method centers on reconstruction plus coherence constraints at the discrete graph level; there is no visible area-minimization or mean-curvature flow term that would actually drive the surface into the minimal class. Without that variational step the results may be clean, editable manifolds but not guaranteed minimal. The performance number is given without baseline implementation details or error bars in the abstract, though the full text presumably expands on the test-set construction.

This paper is for graphics and vision researchers who build sketch-based modeling tools and want something that produces editable output out of the box. A reader already working on topology-preserving generation could pick up the encoding trick and test it themselves. It deserves a serious referee because the approach is specific, the code is public, and the editability goal is well-motivated even if the geometric grounding needs tightening in revision.

Referee Report

2 major / 1 minor

Summary. The paper presents Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that converts hand-drawn sketches into smooth, editable 3D minimal surfaces. It introduces a spatial-topological encoding based on tuples of node coordinates and real/virtual edge skeletons for topological control, along with the reward-modulated S2MS-Loss for joint geometric reconstruction and coherence. On a test set of 100 sketches the method reports a topological similarity score of 0.844 that outperforms existing sketch-to-shape baselines; the outputs are claimed to be directly editable manifolds free of non-manifold artifacts, with a public art installation as demonstration.

Significance. If the outputs are verifiably minimal surfaces (zero mean curvature) that remain editable and artifact-free, the work would offer a useful bridge between vision-language models and classical differential geometry for sketch-driven design workflows. Public release of dataset and code strengthens reproducibility.

major comments (2)

[Abstract] Abstract: the central performance claim (topological similarity 0.844 on 100 sketches, outperforming baselines) supplies no information on baseline methods, error bars, test-set construction, or validation procedures, leaving the quantitative superiority unsupported by visible evidence.
[Method] Method overview / S2MS-Loss description: the manuscript asserts generation of minimal surfaces yet the S2MS-Loss is defined only as reward-modulated reconstruction plus coherence; no variational term minimizing surface area or driving mean curvature to zero is described, undermining the claim that the outputs lie in the minimal-surface class.
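For context on the second major comment, the variational characterization of a minimal surface can be stated as follows. This is standard differential geometry rather than anything taken from the manuscript; the symbols $S_t$, $V$, $n$, and $H$ are introduced here for illustration, and the sign of the first variation depends on the orientation convention chosen for $n$.

```latex
% Area functional of a smooth surface S
A(S) = \int_S \mathrm{d}A

% First variation along a family S_t with S_0 = S whose velocity field V
% vanishes on the boundary; n is the unit normal and
% H = (\kappa_1 + \kappa_2)/2 is the mean curvature
\left.\frac{\mathrm{d}}{\mathrm{d}t}\right|_{t=0} A(S_t)
  = -2 \int_S H \,\langle V, n \rangle \,\mathrm{d}A

% S is a critical point of A for every such variation if and only if
H \equiv 0 \quad \text{on } S
```

A loss with no term tied to $A$ or $H$ can still yield smooth surfaces, but nothing in it forces $H$ toward zero, which is the gap the comment points at.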

minor comments (1)

[Abstract] The anonymous repository link should be replaced with a permanent identifier or additional reproducibility details once the review process allows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment below, indicating the revisions planned for the next version of the manuscript.

Referee: [Abstract] Abstract: the central performance claim (topological similarity 0.844 on 100 sketches, outperforming baselines) supplies no information on baseline methods, error bars, test-set construction, or validation procedures, leaving the quantitative superiority unsupported by visible evidence.

Authors: We agree that the abstract, as a high-level summary, omits key experimental details. The manuscript provides this information in the Experiments section, specifying the baselines (GAN-based and diffusion-based sketch-to-shape methods), the construction of the 100-sketch test set drawn from diverse hand-drawn sources, and the topological similarity validation protocol. We will revise the abstract to briefly reference the outperforming baselines and evaluation on the 100-sketch test set. The reported score derives from a single deterministic run on the fixed test set; we will add a clarifying statement on this point. revision: yes
Referee: [Method] Method overview / S2MS-Loss description: the manuscript asserts generation of minimal surfaces yet the S2MS-Loss is defined only as reward-modulated reconstruction plus coherence; no variational term minimizing surface area or driving mean curvature to zero is described, undermining the claim that the outputs lie in the minimal-surface class.

Authors: The referee is correct that the S2MS-Loss is presented as a reward-modulated combination of reconstruction and coherence terms. The minimal-surface property is realized via the geometric optimization stage that applies minimal-surface theory to the spatial-topological encoding. We acknowledge that the manuscript does not explicitly describe the variational terms or mean-curvature minimization steps. We will revise the Method section to include a clear description of how the optimization enforces zero mean curvature, adding the relevant formulation or projection steps. revision: yes
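The zero-mean-curvature condition at issue in this exchange can be checked numerically. The sketch below is an illustration only and not the authors' optimization stage: it assumes the surface is locally a height field z = f(x, y), for which mean curvature vanishes exactly when (1 + f_y²) f_xx − 2 f_x f_y f_xy + (1 + f_x²) f_yy = 0, and it estimates that residual with central finite differences. The function names and the step size `h` are arbitrary choices; Scherk's surface serves as a known minimal example and a paraboloid as a non-minimal one.

```python
import math
from typing import Callable

HeightField = Callable[[float, float], float]


def minimal_surface_residual(f: HeightField, x: float, y: float, h: float = 1e-3) -> float:
    """Residual of the minimal surface equation for z = f(x, y) at (x, y).

    Derivatives are estimated with second-order central differences.
    A residual of zero corresponds to zero mean curvature at that point.
    """
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (
        f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)
    ) / (4 * h**2)
    return (1 + fy**2) * fxx - 2 * fx * fy * fxy + (1 + fx**2) * fyy


def scherk(x: float, y: float) -> float:
    """Scherk's surface, a classical minimal surface (valid for |x|, |y| < pi/2)."""
    return math.log(math.cos(x) / math.cos(y))


def paraboloid(x: float, y: float) -> float:
    """A smooth surface that is not minimal."""
    return x * x + y * y


scherk_residual = minimal_surface_residual(scherk, 0.3, -0.2)
paraboloid_residual = minimal_surface_residual(paraboloid, 0.3, -0.2)
```

For generated triangle meshes the analogous test would use a discrete mean-curvature estimate rather than a height-field formula; the point of the sketch is only that "minimal" is a measurable property, so a revised Method section could report it directly.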

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper introduces a spatial-topological encoding and S2MS-Loss as independent components within a hybrid vision-language and geometric optimization framework. The topological similarity score of 0.844 is presented as an empirical evaluation on a held-out test set of 100 sketches rather than a quantity derived tautologically from the model definition or loss formulation. No load-bearing step reduces the claim of producing minimal surfaces or editable manifolds to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The derivation remains self-contained with external content from the proposed encoding, loss, and reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described.

[1] [1]

Andersson, S.T

S. Andersson, S.T. Hyde, K. Larsson, and S. Lidin. Minimal surfaces and structures: from inorganic and metal crystals to cell membranes and biopolymers.Chemical Reviews, 88(1):221–242, 1988

work page 1988

[2] [2]

Asmar and H

K.E. Asmar and H. Sareen. Machinic interpolations: A gan pipeline for integrating lateral thinking in computational tools of architecture. InCongreso SIGraDi, pages 60–66, São Paulo, 2020. Blucher

work page 2020

[3] [3]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, et al. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, et al. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024

work page 2024

[6] [6]

C. Chen, M. Poppe, S. Poppe, C. Tschierske, and F. Liu. Liquid organic frameworks: A liquid crystalline 8-connected network with body-centered cubic symmetry.Angewandte Chemie, 132:21006–21011, 2020

work page 2020

[7] [7]

Cheng, H

A.-C. Cheng, H. Yin, Y . Fu, Q. Guo, R. Yang, J. Kautz, et al. Spatialrgpt: Grounded spatial reasoning in vision–language models.Advances in Neural Information Processing Systems (NeurIPS), 37:135062– 135093, 2024

work page 2024

[8] [8]

Cheng, H.-Y

Y .-C. Cheng, H.-Y . Lee, S. Tulyakov, A. Schwing, and L. Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4456–4465, 2023

work page 2023

[9] [9]

J. Douglas. Minimal surfaces of higher topological structure.Annals of Mathematics, 40:205–298, 1939

work page 1939

[10] [10]

Fogden and S.T

A. Fogden and S.T. Hyde. Continuous transformations of cubic minimal surfaces.European Physical Journal B: Condensed Matter and Complex Systems, 7(1):91–104, 1999

work page 1999

[11] [11]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020

[12] [12]

D. Han, M. Han, and Unsloth team. Unsloth, 2023

work page 2023

[13] [13]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020

work page 2020

[14] [14]

Kapfer, S.T

S.C. Kapfer, S.T. Hyde, K. Mecke, C.H. Arns, and G.E. Schröder-Turk. Minimal surface scaffold designs for tissue engineering.Biomaterials, 32(29):6875–6882, 2011

work page 2011

[15] [15]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4):139:1–139:1, 2023

work page 2023

[16] [16]

S. Kim, D. Kim, and S. Choi. Citycraft: 3d virtual city creation from a single image.The Visual Computer, 36:911–924, 2020

work page 2020

[17] [17]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, et al. Segment anything. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023. 10

work page 2023

[18] [18]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning (ICML), pages 19730–19742, 2023

work page 2023

[19] [19]

C.H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3d object reconstruc- tion. InAAAI Conference on Artificial Intelligence (AAAI), pages 7114–7121, 2018

work page 2018

[20] [20]

Y . Liu, D. Chi, S. Wu, Z. Zhang, Y . Hu, L. Zhang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025

work page arXiv 2025

[21] [21]

Luo and W

S. Luo and W. Hu. Diffusion probabilistic models for 3d point cloud generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2837–2845, 2021

work page 2021

[22] [22]

Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors

Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. InAdvances in Neural Information Processing Systems, pages 68803–68832, 2024

work page 2024

[23] [23]

Melas-Kyriazi, I

L. Melas-Kyriazi, I. Laina, C. Rupprecht, and A. Vedaldi. Realfusion: 360° reconstruction of any object from a single image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8446–8455, 2023

work page 2023

[24] [24]

Mildenhall, P.P

B. Mildenhall, P.P. Srinivasan, M. Tancik, J.T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

work page 2021

[25] [25]

Müller, Y

N. Müller, Y . Siddiqui, L. Porzi, S.R. Bulo, P. Kontschieder, and M. Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4328–4338, 2023

work page 2023

[26] [26]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-e: A system for generating 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022

[27]

T. Oka. Transformation between inverse bicontinuous cubic phases of a lipid from diamond to primitive. Langmuir, 31:3180–3185, 2015

[28]

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[29]

Z. Pan and H. Liu. Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse. arXiv preprint arXiv:2503.18470, 2025

[30]

L. Piegl and W. Tiller. The NURBS Book. Springer Science & Business Media, 2012

[31]

A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021

[32]

J.-F. Sadoc and J. Charvolin. Infinite periodic minimal surfaces and their crystallography in the hyperbolic plane. Foundations of Crystallography, 45:10–20, 1989

[33]

R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, et al. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023

[34]

F. Sun, W. Liu, S. Gu, D. Lim, G. Bhat, F. Tombari, et al. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29469–29478, 2025

[35]

S. Szymanowicz, C. Rupprecht, and A. Vedaldi. Viewset diffusion: (0-)image-conditioned 3d generative models from 2d data. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 8863–8873, 2023

[36]

A. Tono, H. Huang, A. Agrawal, and M. Fischer. Vitruvio: Conditional variational autoencoder to generate building meshes via single perspective sketches. Automation in Construction, 166:105498, 2024

[37]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[38]

Y. Wang. Computing minkowski sum of periodic surface models. Computer-Aided Design and Applications, 6:825–837, 2009

[39]

Y. Wei, G. Vosselman, and M.Y. Yang. Buildiff: 3d building shape generation using single-image conditional point cloud diffusion models. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2902–2911, 2023

[40]

Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3d: High-quality and efficient 3d mesh generation from a single image. In Advances in Neural Information Processing Systems, pages 125116–125141, 2024

[41]

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. In Advances in Neural Information Processing Systems, pages 121859–121881, 2024

[42]

L. Xiang, Q. Li, C. Li, Q. Yang, F. Xu, and Y. Mai. Block copolymer self-assembly directed synthesis of porous materials with ordered bicontinuous structures and their potential applications. Advanced Materials, 35:2207684, 2023

[43]

G. Yang, X. Huang, Z. Hao, M.Y. Liu, S. Belongie, and B. Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4541–4550, 2019

[44]

Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, et al. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024

A Extended Related Work

Overall, prior research has advanced image-to-3D reconstruction, multimodal spatial reasoning, and minimal-surface generation across three complementary directi...