pith. machine review for the scientific record.

arxiv: 2604.14268 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords model, world, hy-world, generation, images, introduce, architecture, enabling

The pith

HY-World 2.0 converts text, images, and videos into navigable 3D Gaussian Splatting worlds through a chained four-stage pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents HY-World 2.0 as a multi-modal system that accepts text prompts, single images, multiple views, or video and outputs complete 3D scenes. Sparse inputs (text or a single image) are handled by first creating a panoramic overview, then planning movement paths, expanding the scene with consistent new views, and finally assembling a unified 3D representation. An accompanying rendering platform supports interactive exploration with lighting, collisions, and character elements. The authors report that the resulting open-source outputs reach state-of-the-art quality among similar systems and approach the level of closed-source alternatives on standard benchmarks. This capability would allow broader creation of detailed, explorable virtual environments without proprietary tools.

Core claim

HY-World 2.0 is a multi-modal world model that handles text, single-view images, multi-view images, and videos to produce 3D Gaussian Splatting representations. Generation from text or single views follows a four-stage process of panorama creation with HY-Pano 2.0, trajectory planning with WorldNav, expansion via the memory-consistent WorldStereo 2.0, and composition through the refined WorldMirror 2.0. Reconstruction from richer inputs uses the upgraded WorldMirror directly. The framework is completed by WorldLens, a rendering platform with flexible architecture, automatic lighting, collision detection, and training-rendering co-design. Experiments establish state-of-the-art results among open-source approaches, with quality comparable to the closed-source Marble model.

What carries the argument

The four-stage pipeline of panorama generation with HY-Pano 2.0, trajectory planning with WorldNav, world expansion with WorldStereo 2.0, and composition with WorldMirror 2.0 to build coherent 3D Gaussian Splatting scenes.
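
Structurally, the chain is strictly sequential: each stage consumes the previous stage's output, so a failure anywhere propagates downstream. A minimal orchestration sketch in Python, with every class and method name hypothetical (the released code may expose entirely different interfaces):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class WorldAssets:
    panorama: Any        # stage 1: HY-Pano 2.0 output
    trajectories: List   # stage 2: WorldNav camera paths
    keyframes: List      # stage 3: WorldStereo 2.0 expanded views
    gaussians: Any       # stage 4: WorldMirror 2.0 3DGS scene

def generate_world(prompt_or_image, pano, nav, stereo, mirror) -> WorldAssets:
    # Hypothetical interfaces for the four named components.
    panorama = pano.generate(prompt_or_image)           # 1) panorama generation
    trajectories = nav.plan(panorama)                   # 2) trajectory planning
    keyframes = stereo.expand(panorama, trajectories)   # 3) memory-consistent expansion
    gaussians = mirror.compose(keyframes)               # 4) 3DGS composition
    return WorldAssets(panorama, trajectories, keyframes, gaussians)
```

The point of the sketch is the sequencing: the 3DGS composition in stage four can only be as good as the panorama, trajectories, and expanded views it receives.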

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular pipeline could be adapted for dynamic world updates in real-time simulation environments.
  • Open release of weights and code may accelerate integration into robotics training pipelines that require navigable 3D spaces.
  • If the memory consistency in view expansion holds across longer sequences, it could reduce drift in extended virtual tours.

Load-bearing premise

The four-stage pipeline produces coherent, high-fidelity, artifact-free, and fully navigable 3D Gaussian Splatting scenes across the claimed input modalities.

What would settle it

Visible artifacts, geometric inconsistencies, or failed navigation in the generated 3D Gaussian Splatting scenes on the paper's reported benchmarks would disprove the performance claims.
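
One concrete way a reader could probe the "fully navigable" half of the claim: walk a planned camera path through the exported scene geometry and test clearance against the mesh. A minimal sketch using trimesh; the mesh source, waypoint sampling, and clearance threshold are all assumptions here, not the paper's protocol:

```python
import numpy as np
import trimesh

def check_navigability(mesh_path: str, waypoints: np.ndarray, clearance: float = 0.2) -> bool:
    """Flag a trajectory as non-navigable if any waypoint sits closer than
    `clearance` (in scene units) to the reconstructed surface."""
    mesh = trimesh.load(mesh_path, force="mesh")  # e.g., the coarse mesh shown in Figure 21
    # Distance from each waypoint to its nearest point on the mesh surface.
    _, distances, _ = trimesh.proximity.closest_point(mesh, waypoints)
    return bool(np.min(distances) > clearance)
```

A systematic failure of checks like this on the released scenes would be the kind of evidence described above.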

Figures

Figures reproduced from arXiv: 2604.14268 by Bo Yuan, Chao Zhang, Chenjie Cao, Chunchao Guo, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Junta Wu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Penghao Zhao, Peng He, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Team HY-World, Tengfei Wang, Wangchen Qin, Xiang Yuan, Xiaochuan Niu, Xuhui Zuo, Yang Liu, Yifei Tang, Yifu Sun, Yihang Lian, Yi Sun, Yisu Zhang, Yonghao Tan, Yuhong Liu, Yuning Gong, Yuyang Yin, Zhenwei Wang, Zhenyang Liu, Zhiyuan Min.

Figure 1
Figure 1: Versatile applications of HY-World 2.0. Our framework unifies world generation (synthesizing immersive, explorable 3D worlds from text or single-view images) and world reconstruction (recovering 3D representations from multi-view inputs). These capabilities empower diverse applications, including robotics simulation, game development, and environment mapping.
Figure 2
Figure 2: Architecture of HY-World 2.0. Our framework presents a four-stage process to transform multi-modal inputs into immersive 3D worlds: (1) initializing the world via Panorama Generation, (2) deriving exploration camera paths through Trajectory Planning, (3) expanding the world observations via memory-driven World Expansion, and (4) constructing the final 3DGS assets using World Composition.
Figure 3
Figure 3: Overview of the panorama generation architecture of HY-Pano 2.0. The left side shows the framework pipeline of panorama generation, while the right side illustrates the circle padding (latent space) and the pixel blending (pixel space) for seamless panorama generation.
Figure 4
Figure 4: The initial scene parsing for trajectory planning. We obtain panoramic point clouds, meshes, semantic masks, and NavMesh via several pioneering works [67, 10, 23, 50].
Figure 5
Figure 5: Illustration of five modes of trajectories planned in WorldNav. Some trajectories are omitted for a simplified visualization.
Figure 6
Figure 6: Three training stages of WorldStereo 2.0, progressively enabling camera control, memory-based consistency, and fast inference.
Figure 7
Figure 7: Overall pipeline of WorldStereo 2.0. (a) The main Video Diffusion Transformer (DiT) branch is enhanced by the retrieval-based improved Spatial-Stereo Memory (SSM++) for fine-grained consistency. (b) The camera control branch is guided by the panoramic point cloud, serving as Global-Geometric Memory (GGM) to confirm precise camera trajectory following and geometry-aware consistency. Here, we omit the VAE en…
Figure 8
Figure 8: Reconstruction and novel-view generation with different VAE variants. Keyframe-VAE preserves appearance consistency in reconstructions and substantially improves the fidelity of generated novel views, particularly under large viewpoint changes. Please zoom in for details.
Figure 9
Figure 9: Keyframe-VAE in WorldStereo 2.0 versus a standard Video-VAE [64]. Unlike (a) Video-VAE, which performs spatio-temporal compression, (b) Keyframe-VAE applies spatial-only compression to better preserve high-frequency details and reduce artifacts essentially caused by Video-VAE encoding (e.g., motion blur and geometric distortion). Specifically, Keyframe-VAE loops the causal padding-based image encoding over…
Figure 10
Figure 10: Data construction of WorldStereo 2.0 memory training. (a) The global point clouds built for the GGM training with one reference view and Tg = 2 target views. (b) Trajectory retrieval strategies for SSM++ training tailored to dataset characteristics: temporally misaligned retrieval for existing multi-view data (top), and multi-trajectory retrieval for synthetic data (bottom).
Figure 11
Figure 11: Illustration of the RoPE [59] modification in SSM++. Target frames are spatially concatenated with their corresponding retrieved reference views along the horizontal axis (resulting in width 2W). Crucially, each retrieved view inherits the temporal index of its paired target frame before being fed into the main DiT branch.
Figure 12
Figure 12: Model architecture of WorldMirror 2.0, a unified feed-forward model that takes multi-view images with optional geometric priors (camera poses, intrinsics, depth maps) as input and simultaneously predicts dense point clouds, depth maps, surface normals, camera parameters, and 3DGS through a shared Transformer backbone with task-specific DPT decoder heads.
Figure 13
Figure 13: Analysis of normalized position encoding. (a) Average cosine similarity of center-point RoPE encodings to other resolutions. Normalized RoPE maintains high cross-resolution consistency (> 0.95), while standard RoPE degrades significantly. (b) and (c) show the mean and standard deviation of encoding values across resolutions, respectively. Normalized RoPE exhibits near-constant statistics, whereas standard…
Figure 14
Figure 14: The pipeline of depth alignment. We align the WorldMirror depth estimated from generated keyframes with the panoramic point cloud. The point cloud is rendered as geometric guidance, identifying reliable regions to supervise the linear alignment. The outlier detection process eliminates outliers and corrects the alignment coefficients based on global statistics.
Figure 15
Figure 15: Comparison of reconstructed point clouds. (a) Reference panoramic point cloud estimated by MoGe2 [67]. (b-d) Results from state-of-the-art feedforward reconstruction methods: Mapanything [32], DepthAnything3 (DA3) [40], and WorldMirror 2.0. (e) WorldMirror 2.0 with the proposed depth alignment. Note that for all feedforward methods (b-d), the 10% of points with the lowest confidence are filtered out, and…
Figure 16
Figure 16: Visual results of panorama generation by HY-Pano 2.0. Our model supports both text and images of various resolutions as inputs.
Figure 17
Figure 17: Qualitative comparison on the text-to-panorama (T2P) task. Our method outperforms previous approaches in terms of layout coherence, fine-grained details, and overall visual aesthetics.
Figure 18
Figure 18: Qualitative comparison on the image-to-panorama (I2P) task. Our method outperforms previous approaches in extension plausibility, content richness, and overall quality.
Figure 19
Figure 19: Qualitative ablation results of trajectory planning. Relying solely on panoramic views (b) results in severe artifacts and incomplete geometry. By sequentially integrating views generated from (c) regular, (d) surrounding & reconstruction, (e) wandering, and (f) aerial trajectories, our method progressively eliminates blind spots, refines complex object structures, and enhances overall scene completeness.
Figure 20
Figure 20: Point clouds and 3DGS comparisons with video2world [21]. "Pcd." denotes point cloud. For each scene, both methods are evaluated within 300 images generated by WorldStereo 2.0.
Figure 21
Figure 21: Results of the overall world generation pipeline. Each scene is visualized across two rows. The top row displays, from left to right: the generated panorama, the aligned point clouds, a global overview of splattings, and the extracted coarse mesh. The bottom row showcases novel views rendered from various viewpoints.
Figure 22
Figure 22: Interactive exploration within the generated 3D worlds of HY-World 2.0. By controlling virtual agents, users can navigate complex geometric structures (e.g., stairs and indoor layouts) with real-time collision detection and physically plausible feedback, demonstrating the readiness of our results for interactive applications.
Figure 23
Figure 23: Qualitative comparison with Marble [72] using the same panoramic inputs. The input panoramas are displayed on the left. For each scene, we present novel view renderings from the generated 3DGS models of both methods. Compared to Marble, our approach achieves higher fidelity to the input conditions, sharper textures, and superior geometric consistency across diverse viewpoints. Please zoom in for details.
Figure 24
Figure 24: Qualitative comparison with Marble [72] using the same input image. (a) displays input perspective images. For each scene, we compare both (b) generated panoramas and (c) 3DGS renderings. Compared to Marble, our approach better adheres to the input views while achieving comparable quality and superior completeness in 3DGS. Please zoom in for details.
Figure 25
Figure 25: Visual comparison of WorldMirror 1.0 and 2.0. We compare predicted surface normals and reconstructed point clouds. WorldMirror 2.0 produces more accurate normals with finer structural details and more consistent multi-view point clouds.
Figure 26
Figure 26: Multi-resolution point cloud comparison of WorldMirror 1.0 and 2.0. We evaluate under dense (32 views, top) and sparse (8 views, bottom) settings at different inference resolutions. WorldMirror 1.0 degrades severely at high resolution, while WorldMirror 2.0 maintains consistent reconstruction quality across all resolutions.
Figure 27
Figure 27: Comparison with Pow3R and MapAnything under different prior conditions. All methods are evaluated at high resolution (756×1036). Results are averaged on 7-Scenes, NRGBD, and DTU datasets. Pow3R (pro) refers to the original Pow3R with Procrustes alignment. WorldMirror 2.0 demonstrates stronger geometric reasoning and 3D prior integration capabilities at high resolution.
read the original abstract

We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
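
Since every output above is a 3DGS scene, it helps to recall what rendering one involves: project each Gaussian to the image, then alpha-composite the overlapping splats front to back, as in the original 3DGS formulation [33]. A minimal per-pixel sketch in numpy, illustrative only; the paper's WorldLens renderer is far more elaborate:

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray, depths: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing over Gaussians covering one pixel.
    colors: (N, 3) RGB after color evaluation; alphas: (N,) projected opacities;
    depths: (N,) camera-space depths, used only for sorting."""
    pixel, transmittance = np.zeros(3), 1.0
    for i in np.argsort(depths):       # nearest splat first
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < 1e-4:       # early stop once the pixel is nearly opaque
            break
    return pixel
```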

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces HY-World 2.0, a multi-modal world model extending HY-World 1.0 that accepts text prompts, single-view images, multi-view images, or videos as input and outputs navigable 3D Gaussian Splatting scenes. It employs a four-stage pipeline—panorama generation via HY-Pano 2.0, trajectory planning via WorldNav, world expansion via WorldStereo 2.0 (with consistent memory), and composition via WorldMirror 2.0 (with refined architecture and learning)—along with the WorldLens rendering platform featuring engine-agnostic design, IBL lighting, collision detection, and training-rendering co-design. The work claims state-of-the-art performance among open-source methods, results comparable to the closed-source Marble model, and releases all weights, code, and technical details for reproducibility.

Significance. If the pipeline produces coherent, high-fidelity 3DGS outputs and the empirical claims hold, the paper offers a substantial open-source advance in multi-modal 3D world reconstruction and simulation. The explicit release of artifacts, combined with component-level innovations such as consistent memory in WorldStereo 2.0 and refined learning in WorldMirror 2.0, directly supports community verification and extension, strengthening its value for downstream applications in graphics, robotics, and immersive environments.

major comments (1)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim of SOTA performance among open-source approaches and comparability to Marble is load-bearing but unsupported by any named benchmarks, quantitative metrics, tables, or direct comparison details in the provided abstract. The experiments section must supply these specifics (e.g., exact datasets, PSNR/SSIM/LPIPS scores, and baseline tables) to allow assessment of the four-stage pipeline's effectiveness.
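
For readers who want to run that check themselves once benchmarks are released, the three metrics the referee names are standard full-reference image-quality scores. A sketch of computing them over paired rendered/ground-truth images with common libraries; the paper's exact evaluation protocol is not specified here:

```python
import numpy as np
import torch
import lpips                                     # pip install lpips
from skimage.metrics import structural_similarity

def evaluate_pair(pred: np.ndarray, gt: np.ndarray, lpips_net) -> dict:
    """pred, gt: (H, W, 3) float arrays in [0, 1]."""
    mse = np.mean((pred - gt) ** 2)
    psnr = 10.0 * np.log10(1.0 / mse)            # peak signal assumed at 1.0
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).float()[None] * 2 - 1
    lp = lpips_net(to_t(pred), to_t(gt)).item()  # perceptual distance, lower is better
    return {"PSNR": float(psnr), "SSIM": float(ssim), "LPIPS": lp}

lpips_net = lpips.LPIPS(net="alex")              # AlexNet backbone, as is conventional
```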

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on strengthening the empirical support for our performance claims. We agree that explicit details are needed and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of SOTA performance among open-source approaches and comparability to Marble is load-bearing but unsupported by any named benchmarks, quantitative metrics, tables, or direct comparison details in the provided abstract. The experiments section must supply these specifics (e.g., exact datasets, PSNR/SSIM/LPIPS scores, and baseline tables) to allow assessment of the four-stage pipeline's effectiveness.

    Authors: We agree that the abstract does not include specific quantitative details and that the Experiments section requires clearer presentation of the supporting evidence. In the revised manuscript we will (1) update the abstract to reference key results, (2) expand the Experiments section with named benchmarks and exact datasets, (3) add tables reporting PSNR, SSIM, and LPIPS scores, and (4) include direct numerical comparisons to open-source baselines as well as the closed-source Marble model. These additions will allow readers to assess the four-stage pipeline's effectiveness.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system paper with no derivation chain

full rationale

The manuscript describes a four-stage engineering pipeline (HY-Pano 2.0, WorldNav, WorldStereo 2.0, WorldMirror 2.0) for 3DGS world generation and reconstruction from multiple modalities. No equations, uniqueness theorems, fitted-parameter predictions, or ansatz derivations appear in the provided text. Claims rest on empirical benchmarks, released weights/code, and comparisons to external models such as Marble. Self-references to HY-World 1.0 and internal components are architectural descriptions, not load-bearing reductions of results to inputs. The work therefore stands to be judged against external evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 5 invented entities

The central claim rests on the effectiveness of the proposed pipeline components, which are trained on unspecified datasets with many implicit hyperparameters; the output representation assumes Gaussian Splatting suffices for navigable scenes.

free parameters (1)
  • Model hyperparameters and training configurations
    Implicit in the training of HY-Pano 2.0, WorldStereo 2.0, WorldMirror 2.0, and related components; standard for deep learning systems but unspecified in the abstract.
axioms (1)
  • domain assumption: 3D Gaussian Splatting provides a suitable representation for high-fidelity navigable 3D scenes
    Used as the output format for all generated and reconstructed worlds (a minimal sketch of the primitive appears after this ledger).
invented entities (5)
  • HY-Pano 2.0 no independent evidence
    purpose: Panorama generation from text or single-view images
    New component introduced for the first stage of the generation pipeline.
  • WorldNav no independent evidence
    purpose: Trajectory planning and 3D scene understanding
    New module for path planning within generated worlds.
  • WorldStereo 2.0 no independent evidence
    purpose: World expansion with consistent memory from keyframes
    Upgraded keyframe-based view generation model.
  • WorldMirror 2.0 no independent evidence
    purpose: Universal 3D prediction from multi-view images or videos
    Feed-forward model upgrade for reconstruction.
  • WorldLens no independent evidence
    purpose: High-performance 3DGS rendering with IBL lighting and collision detection
    New rendering platform for interactive exploration.
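
To make the ledger's domain assumption concrete, here is the state one 3DGS primitive carries, in a minimal sketch following the original parameterization [33]; the field names are ours, and the paper's own parameterization may differ:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    mean: np.ndarray       # (3,) world-space center
    scale: np.ndarray      # (3,) per-axis extent
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z)
    opacity: float         # scalar alpha
    sh_coeffs: np.ndarray  # (K, 3) spherical-harmonic color coefficients

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T: the anisotropic 3D covariance that gets
        projected to 2D during splatting."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```

The axiom, restated: whatever HY-World 2.0 generates must ultimately be expressible as a large collection of these primitives, which is what makes the navigability and fidelity claims testable.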

pith-pipeline@v0.9.0 · 5780 in / 1782 out tokens · 56051 ms · 2026-05-10T13:40:38.843047+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

Reference graph

Works this paper leans on

87 extracted references · 32 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  2. [2]

    Estimating and exploiting the aleatoric uncertainty in surface normal estimation

    Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13137–13146, 2021

  3. [3]

    Rethinking inductive biases for surface normal estimation

    Gwangbin Bae and Andrew J Davison. Rethinking inductive biases for surface normal estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9535–9545, 2024

  4. [4]

    Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation

    Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B Lindell, Zan Gojcic, Sanja Fidler, et al. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. arXiv preprint arXiv:2509.19296, 2025

  5. [5]

    Mip-NeRF 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  6. [6]

    Leftrefill: Filling right canvas based on left reference through generalized text-to-image diffusion model

    Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, and Yanwei Fu. Leftrefill: Filling right canvas based on left reference through generalized text-to-image diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7705–7715, 2024

  7. [7]

    Mvsformer++: Revealing the devil in transformer's details for multi-view stereo

    Chenjie Cao, Xinlin Ren, and Yanwei Fu. Mvsformer++: Revealing the devil in transformer's details for multi-view stereo. In International Conference on Learning Representations, 2024

  8. [8]

    Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

    Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. 2025

  9. [9]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  10. [10]

    Sam 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  11. [11]

    Oasis: A large-scale dataset for single image 3d in the wild

    Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, and Jia Deng. Oasis: A large-scale dataset for single image 3d in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 679–688, 2020

  12. [12]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  13. [13]

    A note on two problems in connexion with graphs

    Edsger W Dijkstra. A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: his life, work, and legacy, pages 287–290. 2022

  14. [14]

    Learning to model the world: A survey of world models in artificial intelligence

    Jiahua Dong, Qi Lyu, Baichen Liu, Xudong Wang, Wenqi Liang, Duzhen Zhang, Jiahang Tu, Hongliu Li, Hanbin Zhao, Henghui Ding, et al. Learning to model the world: A survey of world models in artificial intelligence. 2026

  15. [15]

    Worldscore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

  16. [16]

    Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans

    Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021

  17. [17]

    Dit360: High-fidelity panoramic image generation via hybrid training

    Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, and Lu Qi. Dit360: High-fidelity panoramic image generation via hybrid training. arXiv preprint arXiv:2510.11712, 2025

  18. [18]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024

  19. [19]

    Genie 3: A new frontier for world models

    Google DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/, 2025. Blog post, August 5, 2025

  20. [20]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018

  21. [21]

    World reconstruction from inconsistent views

    Lukas Höllein and Matthias Nießner. World reconstruction from inconsistent views. arXiv preprint arXiv:2603.16736, 2026

  22. [22]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  23. [23]

    Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels

    Team HunyuanWorld. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint, 2025

  24. [24]

    Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency

    Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025

  25. [25]

    Categorical Reparameterization with Gumbel-Softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016

  26. [26]

    Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1071–1081, 2025

  27. [27]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  28. [28]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716, 2025

  29. [29]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016

  30. [30]

    Cubediff: Repurposing diffusion-based image models for panorama generation

    Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, and Federico Tombari. Cubediff: Repurposing diffusion-based image models for panorama generation. In The Thirteenth International Conference on Learning Representations, 2025

  31. [31]

    3d common corruptions and data augmentation

    Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir. 3d common corruptions and data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18963–18974, 2022

  32. [32]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  33. [33]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023

  34. [34]

    Zim: Zero-shot image matting for anything

    Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, and Joonsang Yu. Zim: Zero-shot image matting for anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23828–23838, 2025

  35. [35]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017

  36. [36]

    Evaluation of cnn-based single-image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018

  37. [37]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  38. [38]

    Vmem: Consistent interactive video scene generation with surfel-indexed view memory

    Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. arXiv preprint arXiv:2506.18903, 2025

  39. [39]

    Flashworld: High-quality 3d scene generation within seconds

    Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds. arXiv preprint arXiv:2510.13678, 2025

  40. [40]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  41. [41]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  42. [42]

    A survey on cache methods in diffusion models: Toward efficient multi-modal generation

    Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, et al. A survey on cache methods in diffusion models: Toward efficient multi-modal generation. arXiv preprint arXiv:2510.19755, 2025

  43. [43]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

  44. [44]

    Worldmirror: Universal 3d world reconstruction with any-prior prompting

    Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any-prior prompting. arXiv preprint arXiv:2510.10726, 2025

  45. [45]

    Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks

    Yifei Liu, Zhihang Zhong, Yifan Zhan, Sheng Xu, and Xiao Sun. Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 681–690, 2025

  46. [46]

    Marching cubes: A high resolution 3d surface construction algorithm

    William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998

  47. [47]

    Genex: Generating an explorable world

    Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, et al. Genex: Generating an explorable world. arXiv preprint arXiv:2412.09624, 2024

  48. [48]

    Yume-1.5: A text-controlled interactive world generation model

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025

  49. [49]

    Cooperative computation of stereo disparity

    David Marr and Tomaso Poggio. Cooperative computation of stereo disparity: A cooperative algorithm is derived for extracting disparity information from stereo image pairs. Science, 194(4262):283–287, 1976

  50. [50]

    Recast navigation: State-of-the-art navmesh generation and navigation for games

    Mikko Mononen and the Recast Navigation Contributors. Recast navigation: State-of-the-art navmesh generation and navigation for games. Open-source software repository, 2009–2026. Accessed: [2026-02]

  51. [51]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  52. [52]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  53. [53]

    Accelerating 3D Deep Learning with PyTorch3D

    Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020

  54. [54]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  55. [55]

    Worldexplorer: Towards generating fully navigable 3d scenes

    Manuel-Andreas Schneider, Lukas Höllein, and Matthias Nießner. Worldexplorer: Towards generating fully navigable 3d scenes. arXiv preprint arXiv:2506.01799, 2025

  56. [56]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012

  57. [57]

    Dinov3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  58. [58]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  59. [59]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  60. [60]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  61. [61]

    Advancing open-source world models

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

  62. [62]

    Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories

    Tencent Hunyuan 3D Team. Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories, 2026

  63. [63]

    Hunyuanvideo 1.5 technical report

    Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025

  64. [64]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  65. [65]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025

  66. [66]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  67. [67]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025

  68. [68]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. 2020

  69. [69]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. Pi3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  70. [70]

    WorldCompass: Reinforcement learning for long-horizon world models

    Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. Worldcompass: Reinforcement learning for long-horizon world models. arXiv preprint arXiv:2602.09022, 2026

  71. [71]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

  72. [72]

    Marble

    World Labs. Marble. https://marble.worldlabs.ai/, 2025. Accessed: 2026-03-30

  73. [73]

    Q-align: Teaching lmms for visual scoring via discrete text-defined levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Chunyi Li, Liang Liao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtai Zhai, and Weisi Lin. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023. Equal Contribution by Wu, Haoning and Zhang, Zicheng. Project Lead by Wu,...

  74. [74]

    Rgbd objects in the wild: scaling real-world 3d object learning from rgb-d videos

    Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. Rgbd objects in the wild: scaling real-world 3d object learning from rgb-d videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22378–22389, 2024

  75. [75]

    Physgaussian: Physics-integrated 3d gaussians for generative dynamics

    Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024

  76. [76]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  77. [77]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

  78. [78]

    Layerpano3d: Layered 3d panorama for hyper-immersive scene generation

    Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Gordon Wetzstein, Ziwei Liu, and Dahua Lin. Layerpano3d: Layered 3d panorama for hyper-immersive scene generation. In Proceedings of the special interest group on computer graphics and interactive techniques conference conference papers, pages 1–10, 2025

  79. [79]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos. arXiv preprint arXiv:2601.00393, 2026

  80. [80]

    Matrix-3d: Omnidirectional explorable 3d world generation

    Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, et al. Matrix-3d: Omnidirectional explorable 3d world generation. arXiv preprint arXiv:2508.08086, 2025

Showing first 80 references.