arxiv: 2604.10573 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

Bo Zhou , Qiuxia Lai , Zeren Sun , Xiangbo Shu , Yazhou Yao , Wenguan Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D representation learningunposed multi-view imagesGaussian splattingself-supervised learninggeometric-semantic consistencyscene understandingspatial intelligencefeed-forward 3D models

0 comments

The pith

UniSplat learns unified 3D representations from unposed multi-view images by combining dual masking for geometry, coarse-to-fine splatting, and pose-conditioned recalibration for consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build 3D representations that integrate geometry, appearance, and semantics directly from multiple images whose camera positions are unknown. Prior self-supervised approaches often yield weak geometry, limited visual detail, or mismatches between geometric and semantic outputs. UniSplat counters these issues with a dual-masking scheme that compels the model to recover structure from partial views, a progressive Gaussian splatting process that refines the radiance field step by step, and a recalibration step that projects predicted 3D points and semantics back onto the input images using estimated poses to enforce alignment. A sympathetic reader would expect this to produce representations that remain stable even with sparse, unposed inputs and that transfer to varied scene-understanding and embodied-AI tasks.

Core claim

The authors state that the UniSplat framework, built from a dual-masking strategy that masks both encoder and decoder tokens with geometry-rich decoder targets, a coarse-to-fine Gaussian splatting strategy that progressively refines the radiance field, and a pose-conditioned recalibration mechanism that re-projects predicted 3D point and semantic maps into the image plane using estimated camera parameters and aligns them with RGB and semantic predictions, produces unified 3D representations that are robust to unposed and sparse-view inputs and generalize across diverse tasks.

What carries the argument

UniSplat feed-forward framework whose three components are the dual-masking strategy for geometry induction, the coarse-to-fine Gaussian splatting strategy for appearance-semantics consistency, and the pose-conditioned recalibration mechanism that re-projects and aligns multiple output heads.

If this is right

Unified 3D representations can be obtained in a single feed-forward pass without requiring known camera poses.
Geometry induction is strengthened even when input views are sparse and unposed.
Appearance-semantics inconsistencies are reduced by the progressive refinement of the radiance field.
Cross-task consistency between geometry and semantics is maintained through re-projection alignment.
The resulting representations support generalization to scene understanding and embodied AI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce reliance on separate structure-from-motion pipelines in practical 3D capture systems.
Extending the recalibration step to sequential video frames might allow learning from moving cameras without explicit tracking.
The same consistency mechanism could be tested on outdoor or large-scale scenes to check robustness beyond indoor benchmarks.

Load-bearing premise

The pose-conditioned recalibration mechanism successfully enforces geometric-semantic consistency by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters.

What would settle it

A controlled test on a multi-view dataset with known ground-truth poses where the re-projected semantic maps diverge from the independently predicted semantic maps or where geometry quality collapses under sparse unposed inputs would falsify the consistency and robustness claims.

Figures

Figures reproduced from arXiv: 2604.10573 by Bo Zhou, Qiuxia Lai, Wenguan Wang, Xiangbo Shu, Yazhou Yao, Zeren Sun.

**Figure 2.** Figure 2: Overview of the proposed UniSplat framework. UniSplat integrates a dual-masking strategy for geometry induction (§3.2), a coarse-to-fine Gaussian splatting strategy for appearance refinement (§3.3), and a pose-conditioned recalibration mechanism for geometry–semantic consistency (§3.4). masked autoencoding and cross-view completion promoted reconstruction and correspondence [1, 15, 22, 71, 90, 91]. These … view at source ↗

**Figure 3.** Figure 3: Pose-conditioned recalibration mechanism (§ [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of novel-view segmentation on ScanNet. (§4.2) Follwing [16], we map thousands of ScanNet labels into 8 common categories for visualization. The dense 2D GT labels are obtained by projecting sparse 3D annotations, so some regions are inevitably incomplete or missing. 4.4. Ablation Studies Ablation on Key Components [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of NVS on RE10K. (§4.2) Ref. Novel View Feature Depth [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Visualizations of feature and depth on ScanNet [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UniSplat outlines a three-component pipeline for unified 3D representations from unposed images, but supplies no experiments to show whether the recalibration step actually works or just permits degenerate fixes.

read the letter

The paper's core contribution is UniSplat, a feed-forward model that combines dual-masking in the encoder-decoder to force geometry induction, coarse-to-fine Gaussian splatting to refine appearance, and a pose-conditioned recalibration that reprojects 3D point and semantic maps using estimated cameras to align with RGB and semantic outputs. This targets inconsistencies in prior self-supervised 3D methods for sparse, unposed inputs, which is a practical bottleneck in scene understanding and embodied AI work. The specific pairing of dual-masking with geometry-rich decoder targets and the recalibration for cross-head consistency is not a direct copy of the cited priors, so the assembly counts as new for this setting. The focus on unposed inputs and the attempt to tie geometry to semantics through reprojection are reasonable design choices on paper. The recalibration idea in particular tries to create an internal consistency signal without external pose supervision, which could be useful if it holds. The main limitation is the complete lack of results. The abstract describes the components and their intended effects but shows no numbers, no ablations, no error breakdowns, and no baseline comparisons. Without those, the central claim that the pipeline yields robust, generalizable representations cannot be checked. The stress-test concern about degeneracy is relevant here: when the same model predicts both the 3D maps and the camera parameters, the alignment loss can be satisfied by adjusting the estimated poses to mask errors in the geometry or semantics rather than improving them. The dual-masking and coarse-to-fine steps do not obviously block that shortcut. If the full paper contains experiments that demonstrate the consistency mechanism improves underlying structure instead of allowing trade-offs, that would change the picture. This work is aimed at researchers building feed-forward 3D models for robotics or scene understanding who already work with splatting or multi-view consistency losses. A reader in that group could extract the component ideas for their own pipelines. It deserves peer review because the problem is real and the proposed fixes are concrete enough to evaluate once the empirical section is examined. I would not cite it until the results appear and address the joint-estimation issue.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces UniSplat, a feed-forward framework for learning unified 3D representations from unposed multi-view images. It consists of three components: a dual-masking strategy that masks encoder and decoder tokens to strengthen geometry induction, a coarse-to-fine Gaussian splatting approach to progressively refine the radiance field and reduce appearance-semantics inconsistencies, and a pose-conditioned recalibration mechanism that re-projects predicted 3D point and semantic maps into the image plane using estimated camera parameters to align with RGB and semantic predictions for cross-task consistency.

Significance. If the components successfully deliver robust unified representations that generalize across tasks without relying on posed inputs, the work could establish a perceptual foundation for spatial intelligence in scene understanding and embodied AI. The self-supervised unification of geometry, appearance, and semantics from sparse unposed views addresses a relevant gap, though the absence of supporting evidence limits assessment of its practical impact.

major comments (2)

[Pose-conditioned recalibration mechanism] The pose-conditioned recalibration mechanism (described in the abstract) re-projects 3D point and semantic maps using camera parameters that must be estimated by the model itself, since inputs are unposed. This setup risks degenerate solutions in which pose adjustments compensate for errors in the 3D predictions rather than enforcing genuine geometric-semantic consistency; the dual-masking and coarse-to-fine components do not break this coupling, leaving the central claim of robustness under unposed sparse-view inputs vulnerable.
[Abstract and throughout] The manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons to prior self-supervised methods. This absence is load-bearing for the claims of robustness, generalization across tasks, and resolution of weak geometry/inconsistencies, as stated in the abstract.

minor comments (1)

[Abstract] The abstract consists of a single extended paragraph; splitting it would improve readability without altering content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions where appropriate.

read point-by-point responses

Referee: [Pose-conditioned recalibration mechanism] The pose-conditioned recalibration mechanism (described in the abstract) re-projects 3D point and semantic maps using camera parameters that must be estimated by the model itself, since inputs are unposed. This setup risks degenerate solutions in which pose adjustments compensate for errors in the 3D predictions rather than enforcing genuine geometric-semantic consistency; the dual-masking and coarse-to-fine components do not break this coupling, leaving the central claim of robustness under unposed sparse-view inputs vulnerable.

Authors: We appreciate the referee's concern regarding the risk of degenerate solutions where estimated poses could compensate for inaccuracies in 3D predictions. However, the dual-masking strategy is designed to operate independently of pose estimation: by masking both encoder and decoder tokens and directing decoder masks toward geometry-rich regions, the model must infer structural information solely from incomplete visual cues in the input images. This creates a geometry-aware representation prior that does not depend on pose adjustments. The coarse-to-fine Gaussian splatting further mitigates coupling by initializing with coarse geometry and radiance predictions before progressive refinement, limiting the scope for pose to retroactively correct errors. The recalibration mechanism then uses the estimated poses only to re-project and align outputs for consistency losses, with the joint multi-task objective encouraging genuine cross-task alignment rather than compensation. We will add a dedicated discussion and potential failure-case analysis in the revised manuscript to better articulate these interactions. revision: partial
Referee: [Abstract and throughout] The manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons to prior self-supervised methods. This absence is load-bearing for the claims of robustness, generalization across tasks, and resolution of weak geometry/inconsistencies, as stated in the abstract.

Authors: We agree that the absence of quantitative results, ablation studies, error analysis, and comparisons to prior self-supervised methods is a significant limitation in the current manuscript. This weakens the ability to fully substantiate the claims regarding robustness under unposed inputs and cross-task generalization. We will incorporate these elements in the revised version, including benchmark evaluations for geometry, appearance, and semantic consistency, component-wise ablations, error breakdowns, and direct comparisons to relevant self-supervised baselines. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents UniSplat as a new feed-forward framework with three proposed components (dual-masking for geometry induction, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration via re-projection and alignment). These are architectural and loss-design choices, not derivations that reduce predictions or results to inputs by construction. No equations are exhibited that make any output equivalent to a fitted parameter or self-defined quantity. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The recalibration mechanism is described as an interrelation step using estimated parameters, but this is a proposed consistency objective rather than a tautological redefinition; any potential degeneracy from joint pose estimation is a methodological concern outside the circularity criteria. The overall claim of unified representations is presented as emerging from the combination of these independent components without reducing to prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the unproven effectiveness of the three introduced mechanisms; no free parameters or invented entities are explicitly named, but the recalibration assumes accurate pose estimation from the model itself.

axioms (2)

domain assumption Dual-masking strategy strengthens geometry induction by forcing inference from incomplete cues
Stated as the purpose of the first component in the abstract.
domain assumption Coarse-to-fine Gaussian splatting reduces appearance-semantics inconsistencies
Presented as the function of the second component.

pith-pipeline@v0.9.0 · 5571 in / 1359 out tokens · 39705 ms · 2026-05-10T15:40:36.054091+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

91 extracted references · 19 canonical work pages · 4 internal anchors

[1]

Beit: Bert pre-training of image transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. InICLR, 2022. 2, 3

2022
[2]

Mip-nerf: A multiscale representation for anti-aliasing neu- ral radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neu- ral radiance fields. InICCV, 2021. 1

2021
[3]

Nope-nerf: Optimising neural ra- diance field with no pose prior

Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural ra- diance field with no pose prior. InCVPR, 2023. 2, 3

2023
[4]

Cycle-consistent learning for joint layout-to-image generation and object detection

Xinhao Cai, Qiuxia Lai, Gensheng Pei, Xiangbo Shu, Yazhou Yao, and Wenguan Wang. Cycle-consistent learning for joint layout-to-image generation and object detection. In ICCV, 2025. 1

2025
[5]

Unbiased object detection beyond frequency with visually prompted image synthesis

Xinhao Cai, Liulei Li, Gensheng Pei, Tao Chen, Jinshan Pan, Yazhou Yao, and Wenguan Wang. Unbiased object detection beyond frequency with visually prompted image synthesis. InICLR, 2026. 1

2026
[6]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR,
[7]

Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo. InICCV, 2021. 1, 2

2021
[8]

Scanrefer: 3d object localization in rgb-d scans using natural language

Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InECCV, 2020. 15

2020
[9]

Polarnet: 3d point clouds for language- guided robotic manipulation.arXiv preprint arXiv:2309.15596, 2023

Shizhe Chen, Ricardo Garcia, Cordelia Schmid, and Ivan Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation.arXiv preprint arXiv:2309.15596,

work page arXiv
[10]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. 2

2020
[11]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InECCV, 2024. 1, 2

2024
[12]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InCVPR,
[13]

Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research,

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research,
[14]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 6, 13

2017
[15]

Embodiedmae: A unified 3d multi-modal representation for robot manipulation.arXiv preprint arXiv:2505.10105, 2025

Zibin Dong, Fei Ni, Yifu Yuan, Yinchuan Li, and Jianye Hao. Embodiedmae: A Unified 3d Multi-Modal Representation for Robot Manipulation.arXiv preprint arXiv:2505.10105,

work page arXiv
[16]

Large Spatial Model: End-to-end Unposed Images to Semantic 3d

Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, and Yue Wang. Large Spatial Model: End-to-end Unposed Images to Semantic 3d. InNeurIPS, 2024. 1, 2, 6, 7, 13

2024
[17]

Eva: Exploring the limits of masked visual representa- tion learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representa- tion learning at scale. InCVPR, 2023. 6

2023
[18]

Colmap-free 3d gaussian splat- ting

Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A Efros, and Xiaolong Wang. Colmap-free 3d gaussian splat- ting. InCVPR, 2024. 2, 3

2024
[19]

Rvt-2: Learning precise manipulation from few demonstrations.arXiv preprint arXiv:2406.08545,

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations.arXiv preprint arXiv:2406.08545, 2024. 13

work page arXiv 2024
[20]

Bootstrap your own latent-a new approach to self-supervised learning.NeurIPS, 2020

Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.NeurIPS, 2020. 2

2020
[21]

Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956, 2019. 6, 13

work page arXiv 1910
[22]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022. 2, 3, 6

2022
[23]

Bottom up top down detection transform- ers for language grounding in images and point clouds

Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, and Kate- rina Fragkiadaki. Bottom up top down detection transform- ers for language grounding in images and point clouds. In ECCV, 2022. 15

2022
[24]

Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 2020

Stephen James, Zicong Ma, David Rovick Arrojo, and An- drew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 2020. 6, 13

2020
[25]

Large scale multi-view stereopsis evalu- ation

Rasmus Jensen, Anders Dahl, George V ogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evalu- ation. InCVPR, 2014. 7 9

2014
[26]

Leap: Liberate sparse-view 3d modeling from camera poses

Hanwen Jiang, Zhenyu Jiang, Yue Zhao, and Qixing Huang. Leap: Liberate sparse-view 3d modeling from camera poses. ICLR, 2024. 1, 2

2024
[27]

Rayzer: A Self- supervised Large View Synthesis Model.arXiv preprint arXiv:2505.00702, 2025

Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qix- ing Huang, and Georgios Pavlakos. Rayzer: A Self- supervised Large View Synthesis Model.arXiv preprint arXiv:2505.00702, 2025. 2, 3, 4

work page arXiv 2025
[28]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d Gaussian Splatting from Unconstrained Views.arXiv preprint arXiv:2505.23716, 2025. 2, 5

work page arXiv 2025
[29]

Selfsplat: Pose-Free and 3d Prior-Free Generalizable 3d Gaussian Splatting

Gyeongjin Kang, Jisang Yoo, Jihyeon Park, Seungtae Nam, Hyeonsoo Im, Sangheon Shin, Sangpil Kim, and Eunbyung Park. Selfsplat: Pose-Free and 3d Prior-Free Generalizable 3d Gaussian Splatting. InCVPR, 2025. 2, 3

2025
[30]

3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023. 1, 2

2023
[31]

Decomposing nerf for editing via feature field dis- tillation.NeurIPS, 2022

Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- mann. Decomposing nerf for editing via feature field dis- tillation.NeurIPS, 2022. 6

2022
[32]

PhD thesis, University of Washington,

Vikash Kumar.Manipulators and Manipulation in high di- mensional spaces. PhD thesis, University of Washington,
[33]

Ground- ing Image Matching in 3d with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing Image Matching in 3d with MASt3R. InECCV, 2025. 2

2025
[34]

Weinberger, Serge J

Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven Semantic Seg- mentation. InICLR, 2022. 5, 6

2022
[35]

org/abs/2506.09565

Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hong- wen Zhang, and Yebin Liu. Semanticsplat: Feed-Forward 3d Scene Understanding with Language-Aware Gaussian Fields.arXiv preprint arXiv:2506.09565, 2025. 1, 2

work page arXiv 2025
[36]

Vicasplat: A Single Run is All You Need for 3d Gaussian Splatting and Camera Estimation from Unposed Video Frames.arXiv preprint arXiv:2503.10286,

Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, and Peidong Liu. Vicasplat: A Single Run is All You Need for 3d Gaussian Splatting and Camera Estimation from Unposed Video Frames.arXiv preprint arXiv:2503.10286,

work page arXiv
[37]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 8

2024
[38]

Infinite nature: Perpetual view generation of natural scenes from a single image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. InICCV, 2021. 7, 13, 15

2021
[39]

Libero: Benchmarking knowl- edge transfer for lifelong robot learning.NeurIPS, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.NeurIPS, 2023. 6, 13

2023
[40]

Gwm: Towards scalable gaussian world models for robotic manipulation

Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. In ICCV, 2025. 13

2025
[41]

Scaffold-GS: Structured 3d Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-GS: Structured 3d Gaussians for View-Adaptive Rendering. InCVPR, 2024. 2, 4

2024
[42]

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? InNeurIPS, 2023

Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Tingfan Wu, Jay Vakil, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier. Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? InNeu...

2023
[43]

What mat- ters in learning from offline human demonstrations for robot manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiri- any, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart´ın-Mart´ın. What mat- ters in learning from offline human demonstrations for robot manipulation. InCoRL, 2022. 13

2022
[44]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InCoRL, 2023. 13

2023
[45]

Prune and merge: Efficient token compression for vision transformer with spatial infor- mation preserved.IEEE TMM, 2025

Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xian- sheng Hua, and Hengtao Shen. Prune and merge: Efficient token compression for vision transformer with spatial infor- mation preserved.IEEE TMM, 2025. 1

2025
[46]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021. 1

2021
[47]

Instant neural graphics primitives with a multires- olution hash encoding.ACM TOG, 41(4):1–15, 2022

Thomas M ¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a multires- olution hash encoding.ACM TOG, 41(4):1–15, 2022. 1

2022
[48]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Man- dlekar, and Yuke Zhu. Robocasa: Large-scale simula- tion of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024. 13

work page internal anchor Pith review arXiv 2024
[49]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Ani- matable neural radiance fields for modeling dynamic human bodies

Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Ani- matable neural radiance fields for modeling dynamic human bodies. InICCV, 2021. 1

2021
[51]

Deep hough voting for 3d object detection in point clouds

Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. InICCV, 2019. 15

2019
[52]

Langsplat: 3d Language Gaussian Splat- ting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d Language Gaussian Splat- ting. InCVPR, 2024. 2

2024
[53]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICLR, 2021. 6 10

2021
[54]

Real-world robot learn- ing with masked visual pre-training

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learn- ing with masked visual pre-training. InCoRL, 2023. 6

2023
[55]

Fcaf3d: Fully convolutional anchor-free 3d object detection

Danila Rukhovich, Anna V orontsova, and Anton Konushin. Fcaf3d: Fully convolutional anchor-free 3d object detection. InECCV, 2022. 15

2022
[56]

Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection

Danila Rukhovich, Anna V orontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. InCVPR,
[57]

Spatialsplat: Efficient Se- mantic 3d from Sparse Unposed Images.arXiv preprint arXiv:2505.23044, 2025

Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, and Jianmin Ji. Spatialsplat: Efficient Se- mantic 3d from Sparse Unposed Images.arXiv preprint arXiv:2505.23044, 2025. 2

work page arXiv 2025
[58]

Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024

Brandon Smart, Chuanxia Zheng, Iro Laina, and V . Prisacariu. Splatt3r: Zero-shot Gaussian Splatting from Un- calibrated Image Pairs.arXiv preprint arXiv:2408.13912,

work page arXiv
[59]

Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from un- posed multi-view images.arXiv preprint arXiv:2508.03643,

Xiangyu Sun, Liu Liu, Seungtae Nam, Gyeongjin Kang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park, et al. Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from un- posed multi-view images.arXiv preprint arXiv:2508.03643,

work page arXiv
[60]

Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation. InECCV,
[61]

Uniforward: Unified 3d scene and semantic field re- construction via feed-forward gaussian splatting from only sparse-view images.arXiv preprint arXiv:2506.09378, 2025

Qijian Tian, Xin Tan, Jingyu Gong, Yuan Xie, and Lizhuang Ma. Uniforward: Unified 3d scene and semantic field re- construction via feed-forward gaussian splatting from only sparse-view images.arXiv preprint arXiv:2506.09378, 2025. 2

work page arXiv 2025
[62]

Scene as occupancy

Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. InICCV, 2023. 15

2023
[63]

dm control: Software and tasks for continuous control.Software Impacts, 2020

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm control: Software and tasks for continuous control.Software Impacts, 2020. 13

2020
[64]

Vggt: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotn ´y. Vggt: Visual Geometry Grounded Transformer. InCVPR, 2025. 2, 5

2025
[65]

Videomae V2: Scaling Video Masked Autoencoders with Dual Masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yi- nan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae V2: Scaling Video Masked Autoencoders with Dual Masking. In CVPR, 2023. 3

2023
[66]

Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction.ICLR, 2024

Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction.ICLR, 2024. 1, 2

2024
[67]

Dust3r: Geometric 3d Vi- sion Made Easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d Vi- sion Made Easy. InCVPR, 2024. 2

2024
[68]

Embodiedscan: A holistic multi- modal 3d perception suite towards embodied ai

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. Embodiedscan: A holistic multi- modal 3d perception suite towards embodied ai. InCVPR,
[69]

Visual knowledge in the big model era: Retrospect and prospect.Frontiers of Information Technology & Electronic Engineering, 2025

Wenguan Wang, Yi Yang, and Yunhe Pan. Visual knowledge in the big model era: Retrospect and prospect.Frontiers of Information Technology & Electronic Engineering, 2025. 1

2025
[70]

Surroundocc: Multi-camera 3d occu- pancy prediction for autonomous driving

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occu- pancy prediction for autonomous driving. InICCV, 2023. 15

2023
[71]

Croco: Self-Supervised Pre-training for 3d Vision Tasks by Cross-View Completion

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Br´egier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and J ´erˆome Re- vaud. Croco: Self-Supervised Pre-training for 3d Vision Tasks by Cross-View Completion. InNeurIPS, 2022. 2, 3, 8

2022
[72]

TriFin- ger: An Open-Source Robot for Learning Dexterity

Manuel W ¨uthrich, Felix Widmaier, Felix Grimminger, Joel Akpo, Shruti Joshi, Vaibhav Agrawal, Bilal Hammoud, Ma- jid Khadiv, Miroslav Bogdanovic, Vincent Berenz, et al. Trifinger: An open-source robot for learning dexterity.arXiv preprint arXiv:2008.03596, 2020. 13

work page arXiv 2008
[73]

Spatialformer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

Han Xiao, Wenzhao Zheng, Sicheng Zuo, Peng Gao, Jie Zhou, and Jiwen Lu. Spatialformer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding. In ECCV, 2024. 1

2024
[74]

Murf: multi-baseline radiance fields

Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, and Fisher Yu. Murf: multi-baseline radiance fields. InCVPR, 2024. 1, 2

2024
[75]

Depthsplat: Connecting Gaussian Splatting and Depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting Gaussian Splatting and Depth. In CVPR, 2024

2024
[76]

Grm: Large gaussian reconstruction model for ef- ficient 3d reconstruction and generation

Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wet- zstein. Grm: Large gaussian reconstruction model for ef- ficient 3d reconstruction and generation. InECCV, 2024. 2

2024
[77]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multi- modal Large Language Models See, Remember, and Recall Spaces. InCVPR, 2025. 1

2025
[78]

Changetitans: Towards re- mote sensing change detection with neural memory.IEEE TGRS, 2025

Zhenyu Yang, Gensheng Pei, Yazhou Yao, Tianfei Zhou, Lizhong Ding, and Fumin Shen. Changetitans: Towards re- mote sensing change detection with neural memory.IEEE TGRS, 2025. 1

2025
[79]

No Pose, No Prob- lem: Surprisingly Simple 3d Gaussian Splats from Sparse Unposed Images

Botao Ye, Sifei Liu, Songyou Peng, Haofei Xu, Xueting Li, Ming-Hsuan Yang, and Marc Pollefeys. No Pose, No Prob- lem: Surprisingly Simple 3d Gaussian Splats from Sparse Unposed Images. InICLR, 2025. 1, 2, 7, 13

2025
[80]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InICCV, 2023. 6, 8, 13

2023

Showing first 80 references.