{"total":25,"items":[{"citing_arxiv_id":"2605.31466","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching","primary_cat":"cs.CV","submitted_at":"2026-05-29T15:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29655","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23888","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:49:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on 
multi-view features, claiming 16% better results than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21572","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21472","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stream3D: Sequential Multi-View 3D Generation via Evidential Memory","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:55:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21121","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T12:50:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity 
output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18680","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:21:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18063","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:48:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16745","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-16T01:55:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling 
understanding and generation experts with shared global self-attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13129","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation","primary_cat":"cs.GR","submitted_at":"2026-05-13T07:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10922","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pixal3D: Pixel-Aligned 3D Generation from Images","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:55:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10887","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Count Anything at Any Granularity","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:32:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount 
model that combines text and visual exemplars for improved accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09606","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models","primary_cat":"cs.CR","submitted_at":"2026-05-10T15:35:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"core components: a geometry encoder and a flow-matching transformer. While conditioning the flow-matching transformer [8] on input images has become standard practice, the choice of geometry encoder, as discussed above, remains a source of performance variation. Building upon the VecSet encoder, several works, including Michelangelo [49], CLAY [47], Hunyuan3D-2 [17], and TripoSG [19], have demonstrated impressive results in high-quality 3D synthesis. Concurrently, another promising direction leverages the sparse voxel encoder, with representative models such as Trellis series [43,44] and Spar3D [20] achieving competitive performance. 
The high fidelity of these models is now enabling real-world deployment in industrial applications, such"},{"citing_arxiv_id":"2605.16355","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generative 3D Gaussians with Learned Density Control","primary_cat":"cs.GR","submitted_at":"2026-05-08T17:54:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07971","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DVD: Discrete Voxel Diffusion for 3D Generation and Editing","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:32:17+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Although numerous SLAT-based methods have been proposed, the majority generate voxels either through continuous diffusion models operating on VAE latents or by voxelizing alternative 3D representations. Direct3d-S2 [15] proposed Spatial Sparse Attention, which significantly improved the efficiency of training and inference on sparse volumetric data. TRELLIS.2 [16] proposes a native omni-voxel (O-Voxel) for the second stage, enhancing the generation quality. In contrast, our work revisits the first stage and directly models sparse voxel occupancy using discrete diffusion mechanisms. 
Discrete diffusion models. Discrete diffusion models (DDMs) [20, 22-24] connect a prior noise distribution and data distribution with continuous time Markov Chains (CTMCs) over discrete state"},{"citing_arxiv_id":"2605.07385","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Velocity-Space 3D Asset Editing","primary_cat":"cs.GR","submitted_at":"2026-05-08T07:42:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Precise local editing of 3D assets is essential for content creation, game development, and embodied simulation, yet it poses a fundamental dual requirement on any pretrained generator: the model must apply a strong, targeted modification to the region the user specifies while leaving every other region strictly unchanged. The dominant family of native 3D generators [7, 16-19] now builds on rectified-flow DiTs [4, 3], where generation is an ODE whose velocity field is learned by conditional flow matching. Editing with such a model therefore reduces to controlling its velocity field: the velocity update must carry a non-trivial edit signal on the target region and vanish on the rest. 
How to achieve this control on a frozen 3D rectified-flow DiT, without resorting to external masks or"},{"citing_arxiv_id":"2605.03105","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter","primary_cat":"cs.LG","submitted_at":"2026-05-04T19:45:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26509","ref_index":102,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"3D Generation for Embodied AI and Robotic Simulation: A Survey","primary_cat":"cs.RO","submitted_at":"2026-04-29T10:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and transfer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"42 GarmentDreamer [100] arXiv '24 /f⌢nt/cube ⚛ Diff+3DGS Text/3DGS Mesh ❸ /external-link-alt 43 Dress-1-to-3 [15] ToG (SIG. '25) ⊷ /cube /balance-scaleFEM+Diff Sim-Feedback Sew. Pat. 
❸ /external-link-alt 44 Image2Garment [101] arXiv '26 ⊷ /cube /balance-scaleFF Supervised Mesh + Params ❸ /external-link-alt 45 TRELLIS [34] CVPR '25 /f⌢nt ⊷/cube ⚛ Sp.DiT+VAE 3D GT Mesh/3DGS/RF ❹ /external-link-alt 46 TRELLIS.2 [102] arXiv '25 /f⌢nt ⊷/cube ⚛ Sp.DiT+VAE 3D GT Mesh/3DGS/RF ❹ /external-link-alt 47 EmbodiedGen [25] NeurIPS '25 /f⌢nt ⊷♂project-diagra¶/balance-scaleLLM/VLM+Diff Text/Image URDF/MJCF ❹ /external-link-alt 48 Seed3D [26] arXiv '25 ⊷ /cube Diff+VAE 3D GT Isaac Sim Ready ❹ /external-link-alt MPM, PBD) rather than discrete joints. Simulation- ready deformable assets need both rest-state geome-"},{"citing_arxiv_id":"2604.23629","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation","primary_cat":"cs.GR","submitted_at":"2026-04-26T09:44:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fast3R [143] 2025 Multi-view 3D recon. Amortized Scalable FF Shap-E [144] 2023 Text / image Implicit / mesh Latent diff. Fast gen. LGM 3DShape2VecSet [145] 2023 Latent Vector-set Vec-set VAE+diff. Unordered latents LGM XCube [58] 2024 Text/img/ctrl Sparse voxel Hierarchical latent Large-scale LGM TRELLIS [27] 2025 Text / image SLAT Rect. flow Mesh / 3DGS LGM TRELLIS.2 [34] 2025 Image O-Voxel Rect. flow 4B params, PBR LGM SparseFlex [146] 2025 Latent/cond. Sparse isosurface VAE + flow Arb. topology LGM TripoSG [9] 2025 Image Latent tokens VAE + rect. flow High-fidelity LGM MeshCraft [147] 2025 Image/latent Face tokens VAE + flow DiT Parallel gen. 
LGM Group labels :SDS=Score Distillation;MV=Multi-view Recon.;GAN;VAE=VAE/AE;Diff=Direct Diffusion;FF=Feed-Forward;LGM=Latent Gen."},{"citing_arxiv_id":"2605.13862","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation","primary_cat":"cs.GR","submitted_at":"2026-04-22T17:50:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18468","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, and Sanja Fidler. Kimodo: Scaling controllable human motion generation.arXiv, 2026. 15 [36] Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024. 15 [37] Tencent Hunyuan3D Team. Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details, 2025. 
URL https://arxiv.org/abs/2506.16504. 15 [38] Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025. 15 [39] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3:"},{"citing_arxiv_id":"2604.14302","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Geometrically Consistent Multi-View Scene Generation from Freehand Sketches","primary_cat":"cs.CV","submitted_at":"2026-04-15T18:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"diffusion or multi-view attention. A parallel line of work repurposes video diffusion models, treating the temporal axis as a viewpoint axis, for object-centric orbiting [8,44] or scene-level [12] generation. More recent DiT-based architectures push the scale further: Hunyuan3D2.0 [59] trains a 2B-parameter shape-generation DiT, TRELLIS [51] employs rectified flow transformers over structured 3D latents, and See3D [34] learns 3D priors from 16M video clips without pose annotations. How camera geometry is communicated to the generative model is a critical design axis. 
Absolute conditioning methods [13,29,50] inject camera parameters as global vectors, while per-pixel Plücker ray representations [13,53,60] offer"},{"citing_arxiv_id":"2604.11331","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-13T11:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tain explicit 3D memory based on historical frames, and condition subsequent generation on historical renderings to enforce coherence. Despite these efforts, all such methods fundamentally operate in a 2D latent space, which introduces substantial representation redundancy and fails to ensure spatial consistency at the 3D level. Another line of work [77,96] focuses on object-level 3D generation. Leveraging massive high-quality 3D asset data, these methods achieve 3D generation in explicit 3D latent space based on point clouds or voxels. However, for scene-level generation, such high-quality 3D data is unavailable. 
How to achieve 3D scene generation using only multi-view image data remains a challenge."},{"citing_arxiv_id":"2604.09231","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation","primary_cat":"cs.CV","submitted_at":"2026-04-10T11:40:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"with the original pre-trained model weights, achieving high-quality multi-view image generation. Building upon the generated multi-view images and 3D geometry, we further propose a multi-view guided native 3D texture synthesis pipeline. To alleviate the computational and memory overhead, we design and train a VAE based on sparse voxel representations [49, 50]. The VAE adopts a dual-branch architecture to jointly compress geometry and texture features, supporting efficient reconstruction of both 3D geometry structure and surface appearance. 
Based on joint latent representations, we further train a native 3D texture diffusion model conditioned on 3D geometry together with multi-view images, aiming to produce structurally co-"},{"citing_arxiv_id":"2604.05182","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows","primary_cat":"cs.CV","submitted_at":"2026-04-06T21:21:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"point-based formats such as 3D Gaussians [9,32,59,70]. More recently, a line of 3D generative work has modeled sparsity explicitly through structured latent representations-such as sparse voxel grids-to scale volumetric resolution substantially [43,65-67]. In these frameworks, sparse volumes are either encoded by sparse 3D VAEs into dense latent codes [8,43,66,67] or, as in Direct3D-S2 [65], downsampled to a lower-resolution sparse volume that can be generated with an NSA-style transformer operating over 3D token blocks. While these methods demonstrate efficiency in processing high-resolution volumes, their focus remains restricted to synthesizing intricate geometric details, often neglecting texture. Even when texture generation is supported [8,66], the resulting texture"}],"limit":50,"offset":0}