Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Donghwan Shin; Honggyu An; Hyeonseo Yu; Hyuna Ko; Jaewoo Jung; Jisang Han; Kazumi Fukuda; Minkyeong Jeon; Mungyeom Kim; Seungryong Kim

arxiv: 2605.31595 · v1 · pith:RHYMM7DJnew · submitted 2026-05-29 · 💻 cs.CV

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

Pith reviewed 2026-06-28 23:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D reconstruction · dynamic scenes · Gaussian splatting · feed-forward · novel view synthesis · monocular video · motion modeling · point tracking

The pith

Timestamp-conditioned Gaussian query tokens aggregate temporal features to decode coherent 4D motion from monocular video in a feed-forward manner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents C4G as a framework that replaces per-frame pixel-wise Gaussian prediction with a compact set of learnable query tokens. Each token pulls features from the full video sequence and decodes a 3D Gaussian whose position shifts according to the target timestamp. This design removes duplicated Gaussians and view-dependent artifacts while supporting reconstruction without known camera poses. The same aggregation step is reused to lift features into a 4D field for tracking tasks. A diffusion-based renderer is added only to recover fine details after the core Gaussian field is formed.

Core claim

C4G uses a compact collection of timestamp-conditioned learnable Gaussian query tokens; each token aggregates matching features across the entire temporal context and decodes one 3D Gaussian whose 3D position is modulated by the query timestamp, producing globally coherent motion without per-scene optimization or duplicated primitives.
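
To make the core claim concrete, here is a minimal sketch of how one timestamp-conditioned query token might aggregate features and decode a moving Gaussian center. It is not the paper's implementation: the sinusoidal `time_embedding`, the single attention step, the `base_xyz` anchor, and the linear `motion_head` are all assumptions introduced for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def time_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestamp (assumed, not from the paper)."""
    out = []
    for i in range(dim):
        freq = 2.0 ** (i // 2)
        out.append(math.sin(freq * t) if i % 2 == 0 else math.cos(freq * t))
    return out


def decode_center(query, features, t, base_xyz, motion_head):
    """Return (center, attention_weights) for one query token at timestamp t.

    query       : learnable token, list of D floats
    features    : image features gathered from ALL frames, list of D-float lists
    base_xyz    : time-independent anchor position (hypothetical)
    motion_head : 3 x D linear map from the aggregated feature to an xyz offset
    """
    dim = len(query)
    # Condition the query on the target timestamp.
    q = [a + b for a, b in zip(query, time_embedding(t, dim))]
    # Attend over every feature in the temporal context at once.
    weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    # The offset depends on t only through the conditioned query.
    offset = [dot(row, pooled) for row in motion_head]
    return [b + o for b, o in zip(base_xyz, offset)], weights
```

Calling `decode_center` with the same token at two timestamps yields two positions for the same Gaussian, which is the sense in which one primitive carries motion instead of being re-predicted per frame.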

What carries the argument

timestamp-conditioned learnable Gaussian query tokens that aggregate full-sequence features and decode timestamp-modulated 3D Gaussians

If this is right

Novel-view synthesis is achieved with far fewer Gaussians than per-frame methods.
Reconstruction proceeds without any camera-pose input or per-scene optimization.
Motion remains coherent even across large temporal separations.
The same token aggregation produces a 4D feature field usable for point tracking.
A separate diffusion renderer can be attached to restore high-frequency detail after the core field is built.
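
The "far fewer Gaussians" point can be illustrated with simple arithmetic. The sketch below assumes a pixel-wise method emits one Gaussian per pixel per input frame; the clip size and the token budget of 2048 are hypothetical values, not figures reported for C4G.

```python
def per_pixel_gaussian_count(num_frames, height, width):
    """One Gaussian per pixel per input frame (the pixel-wise design)."""
    return num_frames * height * width


def reduction_factor(num_frames, height, width, num_tokens):
    """How many times fewer primitives a fixed token budget uses."""
    return per_pixel_gaussian_count(num_frames, height, width) / num_tokens


# Hypothetical clip and token budget, chosen only for illustration.
frames, height, width, tokens = 16, 256, 256, 2048
print(per_pixel_gaussian_count(frames, height, width))  # 1048576
print(reduction_factor(frames, height, width, tokens))  # 512.0
```

The pixel-wise count grows with both resolution and clip length, whereas a fixed token budget does not.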

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The token design could be tested on longer sequences to check whether global coherence scales without additional regularization.
Replacing the diffusion enhancement with a lighter decoder might reveal how much of the quality gain comes from the Gaussian field alone.
The 4D feature field might support downstream tasks such as action recognition or future-frame prediction if the tokens are kept frozen after training.

Load-bearing premise

The tokens can reliably collect corresponding features from every frame in the video to produce motion that stays consistent across large time gaps without duplication or viewpoint bias.

What would settle it

Apply the method to a monocular video containing sudden large object displacements or long occlusions and measure whether novel-view renderings at distant timestamps show duplicated surfaces or broken trajectories.
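
One way such a check could be automated is sketched below. These are heuristics proposed here, not metrics from the paper: `broken_steps` flags trajectory steps that are much longer than the median step, and `near_duplicate_pairs` counts Gaussian centers that nearly coincide at a single timestamp. The `factor` and `eps` thresholds are hypothetical and would need tuning per scene.

```python
import math
from statistics import median


def step_lengths(trajectory):
    """Euclidean distance between consecutive positions of one Gaussian."""
    return [math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])]


def broken_steps(trajectory, factor=5.0):
    """Indices of steps longer than `factor` times the median step length."""
    steps = step_lengths(trajectory)
    if not steps:
        return []
    typical = median(steps)
    if typical == 0.0:
        # A static point: any movement at all is suspicious.
        return [i for i, s in enumerate(steps) if s > 0.0]
    return [i for i, s in enumerate(steps) if s > factor * typical]


def near_duplicate_pairs(centers, eps):
    """Index pairs of Gaussian centers closer than `eps` at one timestamp."""
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < eps:
                pairs.append((i, j))
    return pairs
```

A flagged step is only a candidate failure. A genuinely fast object also produces long steps, so ground-truth tracks or a visual check are still needed to separate real motion from broken trajectories.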

Figures

Figures reproduced from arXiv: 2605.31595 by Donghwan Shin, Honggyu An, Hyeonseo Yu, Hyuna Ko, Jaewoo Jung, Jisang Han, Kazumi Fukuda, Minkyeong Jeon, Mungyeom Kim, Seungryong Kim, Sunghwan Hong, Takuya Narihira, Yuki Mitsufuji.

**Figure 1.** Failures of pixel-wise feed-forward 4D reconstruction [102, 59, 104]. (a) Duplicated Gaussians from nearby input views cause ghost artifacts at target timestamps. (b) View-dependent bias prevents leveraging temporally distant views, leaving occluded regions poorly reconstructed. We argue that both issues stem from the fundamental design choice shared by all existing feed-forward 4D methods: per-pixel Gauss…

**Figure 2.** Pixel-wise 4DGS vs. Ours. (a) Pixel-wise methods produce duplicated, view-dependent Gaussians that cause ghosting at interpolated timestamps. (b) Our approach aggregates global temporal context, yielding a compact, unified Gaussian set with temporally coherent motion.

**Figure 3.** Main architecture of C4G. (a) A pre-trained encoder E extracts timestamp-injected features, which are decoded into 3D Gaussians by learnable query tokens conditioned on a target timestamp t_b. (b) A VDM refinement module that takes the rendered video as input and refines it conditioned on the context views.

**Figure 4.** Analysis of attention patterns. Visualization of attention maps between the learnable query tokens and multi-frame image features. For the query token decoding a specific Gaussian (red dot), the two self-attention layers exhibit complementary behaviors: the first attends to geometrically corresponding regions across all frames, while the second concentrates on frames temporally close to the target timestam…

**Figure 5.** Qualitative results of novel view synthesis on dynamic datasets. We further provide qualitative comparisons between NeoVerse and C4G, showing both the rendered outputs of the feed-forward reconstruction model and the results after diffusion-based refinement. Our model exhibits fewer occlusion holes and ghost artifacts than NeoVerse, thereby mitigating hallucinations introduced by the diffusion-based enhanc…

**Figure 6.** Attention map visualization in dynamic regions.

**Figure 7.** Attention map visualization in VDM-based rendering enhancement module.

Original abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
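
The abstract does not say how the 4D field is queried for point tracking. As one plausible reading, the sketch below tracks a 3D point by attaching it to the nearest Gaussian at the source time and carrying it along that Gaussian's decoded motion. The `center_at(t)` callables and the nearest-center association are assumptions made here; a feature-based association would be an equally valid guess.

```python
import math


def track_point(query_xyz, t_src, t_dst, gaussians):
    """Move a 3D point from t_src to t_dst by following its nearest Gaussian.

    gaussians : list of callables center_at(t) -> (x, y, z), one per Gaussian
                (hypothetical interface standing in for the decoded field)
    """
    # Associate the query with the Gaussian closest to it at the source time.
    nearest = min(gaussians, key=lambda g: math.dist(g(t_src), query_xyz))
    src = nearest(t_src)
    dst = nearest(t_dst)
    # Keep the point's offset from the Gaussian center while the center moves.
    return tuple(q - s + d for q, s, d in zip(query_xyz, src, dst))
```

Because each Gaussian is a single primitive with a time-dependent position, its trajectory doubles as a track, which is what makes a compact set convenient for this use.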

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

C4G's timestamp-conditioned query tokens aim to fix duplication in feed-forward 4D Gaussians, but the abstract supplies no mechanism or evidence that the aggregation actually works.

The core idea is a small set of learnable Gaussian query tokens conditioned on timestamp. Each token pulls features from the full video, then decodes a 3D Gaussian whose position shifts with the target time. This is meant to produce globally coherent motion, fewer total Gaussians, and no need for camera poses or per-scene optimization. They also add a diffusion-based rendering step and reuse the same tokens to lift 4D features for tracking.

The shift away from per-pixel Gaussian prediction is the clearest departure from prior feed-forward work. Treating the tokens as a compact global aggregator rather than independent per-frame outputs is a reasonable response to the duplication and bias problems the abstract names. Extending the same representation to feature lifting is a straightforward practical move that could support downstream tasks like point tracking.

The soft spot is exactly what the stress-test note flags: the abstract describes the aggregation but shows no equations, correspondence mechanism, loss terms, or ablations that would let us check whether duplication and view bias are actually reduced. The abstract reports no numbers, and its claims about novel-view quality and robustness to large time gaps sit on top of unspecified training and evaluation details, so we cannot yet tell if the gains come from the architecture or from other factors. The full paper would need to supply those internals before the central claim can be assessed.

This is for people already working on feed-forward dynamic reconstruction and 3D Gaussian methods. A reader who wants to try compact temporal aggregation would get something concrete to test. It is worth sending to referees because the problem is real and the token design is distinct enough to merit checking the implementation and results.

Referee Report

2 major / 0 minor

Summary. The paper introduces C4G, a feed-forward 4D reconstruction method for dynamic scenes from monocular video. It replaces per-frame pixel-wise Gaussian prediction with a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp. A video diffusion model is added for rendering enhancement, and the same aggregation is extended to produce a 4D feature field supporting point tracking. The central claims are that this yields strong novel-view synthesis with far fewer Gaussians, requires no camera poses or per-scene optimization, improves motion modeling, and is robust to large temporal gaps.

Significance. If the architecture and empirical claims hold, the work would be significant for enabling efficient, pose-free feed-forward 4D reconstruction. The use of learnable query tokens to enforce global temporal coherence without duplication or view-dependent bias, together with the extension to a 4D feature field, addresses a recognized limitation of current Gaussian-based dynamic methods. The absence of per-scene optimization and the reported robustness to large time gaps would be practically valuable if substantiated.

major comments (2)

[Abstract] Abstract (framework paragraph): the claim that timestamp-conditioned learnable Gaussian query tokens 'aggregate corresponding features across the full temporal context' and thereby avoid duplicated Gaussians and view-dependent biases is presented without any equation, architecture diagram, loss term, or correspondence mechanism. This is the load-bearing assumption for the entire method; without it the performance claims cannot be evaluated.
[Abstract] Abstract: no training procedure, loss formulation, or evaluation protocol is supplied. The reported gains in novel-view synthesis, motion modeling, and robustness to temporal gaps therefore rest on unspecified implementation details, making it impossible to determine whether the architecture itself produces the stated improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for clearer presentation of key claims in the abstract. The detailed mechanisms, training, and evaluation are fully specified in the manuscript body (Sections 3–5), but we agree the abstract can be revised for better self-containment. We address each point below.

Referee: [Abstract] Abstract (framework paragraph): the claim that timestamp-conditioned learnable Gaussian query tokens 'aggregate corresponding features across the full temporal context' and thereby avoid duplicated Gaussians and view-dependent biases is presented without any equation, architecture diagram, loss term, or correspondence mechanism. This is the load-bearing assumption for the entire method; without it the performance claims cannot be evaluated.

Authors: The aggregation is realized via cross-attention between the compact learnable query tokens and multi-frame image features, with timestamp embeddings modulating both the queries and the decoded Gaussian positions; this is detailed with equations and a diagram in Section 3.2. No explicit correspondence loss is used; the temporal coherence emerges from end-to-end training on the reconstruction objective. We will revise the abstract to include a concise clause referencing the attention-based temporal aggregation. revision: yes

Referee: [Abstract] Abstract: no training procedure, loss formulation, or evaluation protocol is supplied. The reported gains in novel-view synthesis, motion modeling, and robustness to temporal gaps therefore rest on unspecified implementation details, making it impossible to determine whether the architecture itself produces the stated improvements.

Authors: Training uses an end-to-end objective combining L1, SSIM, and perceptual losses on rendered images plus a diffusion rendering loss (Section 4.2); evaluation follows standard novel-view metrics plus point-tracking accuracy on held-out frames (Section 5). The abstract omits these for brevity. We will add one sentence summarizing the training and evaluation protocol if space permits. revision: partial
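
The simulated rebuttal names L1, SSIM, and perceptual terms but gives no weights or formulas. A minimal sketch of how such terms are commonly combined follows, under several assumptions: images are flat lists of intensities in [0, 1], SSIM is computed once over the whole image rather than in sliding windows, the perceptual term is a caller-supplied `perceptual_fn` (hypothetical), the weights are illustrative, and the diffusion rendering loss is left out.

```python
def l1_loss(pred, target):
    """Mean absolute error between two flat images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """SSIM from whole-image statistics (a simplification of windowed SSIM)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) * (p - mu_p) for p in pred) / n
    var_t = sum((t - mu_t) * (t - mu_t) for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizers
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p * mu_p + mu_t * mu_t + c1) * (var_p + var_t + c2)
    return numerator / denominator


def reconstruction_loss(pred, target, perceptual_fn=None,
                        w_l1=1.0, w_ssim=0.2, w_perc=0.05):
    """Weighted sum of L1, (1 - SSIM), and an optional perceptual term.

    The weights are illustrative defaults, not values from the paper.
    """
    perc = perceptual_fn(pred, target) if perceptual_fn else 0.0
    return (w_l1 * l1_loss(pred, target)
            + w_ssim * (1.0 - global_ssim(pred, target))
            + w_perc * perc)
```

In practice the perceptual term would come from a pretrained network and SSIM would be windowed; both are stubbed here to keep the sketch self-contained.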

Circularity Check

0 steps flagged

No circularity: derivation self-contained with no reductions visible

The provided abstract and text describe a feed-forward framework using timestamp-conditioned learnable Gaussian query tokens for feature aggregation and Gaussian decoding, but contain no equations, no fitted parameters presented as predictions, and no self-citations invoked to justify core claims. The central description of aggregation enabling coherent motion is presented as an architectural choice without any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain that would reduce the result to its inputs by construction. This is the normal case of an independent method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5760 in / 1128 out tokens · 20281 ms · 2026-06-28T23:06:18.680857+00:00 · methodology

Reference graph

Works this paper leans on

116 extracted references · 35 canonical work pages · 15 internal anchors

[1] An, H., Jung, J., Kim, M., Hong, S., Kim, C., Fukuda, K., Jeon, M., Han, J., Narihira, T., Ko, H., et al.: C3g: Learning compact 3d representations with 2k gaussians. arXiv preprint arXiv:2512.04021 (2025)
[2] An, H., Kim, J.H., Park, S., Jung, J., Han, J., Hong, S., Kim, S.: Cross-view completion models are zero-shot correspondence estimators. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1103–1115 (2025)
[3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tan... Qwen3-VL Technical Report. arXiv (2025)
[4] Balasingam, A., Chandler, J., Li, C., Zhang, Z., Balakrishnan, H.: Drivetrack: A benchmark for long-range point tracking in real-world videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22488–22497 (2024)
[5] Bartle, A., Sheffer, A., Kim, V.G., Kaufman, D.M., Vining, N., Berthouzoz, F.: Physics-driven pattern adjustment for direct 3d garment editing. ACM Trans. Graph. 35(4), 50–1 (2016)
[6] Bozic, A., Zollhofer, M., Theobalt, C., Nießner, M.: Deepdeform: Learning non-rigid rgb-d reconstruction with semi-supervised data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7002–7012 (2020)
[7] Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image streams. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662). vol. 2, pp. 690–696. IEEE (2000)
[8] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
[9] Charatan, D., Li, S.L., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19457–19467 (2024)
[10] Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In: European conference on computer vision. pp. 370–386. Springer (2024)
[11] Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems 34, 9011–9023 (2021)
[12] Cho, S., Hong, S., Kim, S.: Cats++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7174–7194 (2022)
[13] Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024)
[14] Dai, Y., Li, H., He, M.: A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision 107(2), 101–122 (2014)
[15] Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: Fewer views and faster training for free. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12882–12891 (2022)
[16] Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., Yang, Y.: Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems 35, 13610–13626 (2022)
[17] Du, Y., Zhang, Y., Yu, H.X., Tenenbaum, J.B., Wu, J.: Neural radiance flow for 4d view synthesis and video processing. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14304–14314. IEEE Computer Society (2021)
[18] Fan, Z., Zhang, J., Cong, W., Wang, P., Li, R., Wen, K., Zhou, S., Kadambi, A., Wang, Z., Xu, D., et al.: Large spatial model: End-to-end unposed images to semantic 3d. Advances in neural information processing systems 37, 40212–40229 (2024)
[19] Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: Explicit radiance fields in space, time, and appearance. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12479–12488 (2023)
[20] Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5501–5510 (2022)
[21] Gao, H., Li, R., Tulsiani, S., Russell, B., Kanazawa, A.: Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems 35, 33768–33780 (2022)
[22] Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3749–3761 (2022)
[23] Han, J., An, H., Jung, J., Narihira, T., Seo, J., Fukuda, K., Kim, C., Hong, S., Mitsufuji, Y., Kim, S.: D^2ust3r: Enhancing 3d reconstruction with 4d pointmaps for dynamic scenes. arXiv e-prints pp. arXiv–2504 (2025)
[24] Han, J., Hong, S., Jung, J., Jang, W., An, H., Wang, Q., Kim, S., Feng, C.: Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012 (2025)
[25] Hong, S., Cho, S., Kim, S., Lin, S.: Integrative feature and cost aggregation with transformers for dense correspondence. arXiv preprint arXiv:2209.08742 (2022)
[26] Hong, S., Jung, J., Shin, H., Han, J., Yang, J., Luo, C., Kim, S.: Pf3plat: Pose-free feed-forward 3d gaussian splatting. arXiv preprint arXiv:2410.22128 (2024)
[27] Hong, S., Jung, J., Shin, H., Yang, J., Kim, S., Luo, C.: Unifying correspondence pose and nerf for generalized pose-free novel view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20196–20206 (2024)
[28] Hong, S., Kim, S.: Deep matching prior: Test-time optimization for dense correspondence. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9907–9917 (2021)
[29] Hong, S., Nam, J., Cho, S., Hong, S., Jeon, S., Min, D., Kim, S.: Neural matching fields: Implicit representation of matching fields for visual correspondence. Advances in Neural Information Processing Systems 35, 13512–13526 (2022)
[30] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
[31] Huang, R., Mikolajczyk, K.: No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse views. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27947–27957 (2025)
[32] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
[33] Hur, J., Herrmann, C., Peng, S., Henzler, P., Ma, Z., Zickler, T., Sun, D.: Ufo-4d: Unposed feedforward 4d reconstruction from two images. In: The Fourteenth International Conference on Learning Representations
[34] Innmann, M., Zollhöfer, M., Nießner, M., Theobalt, C., Stamminger, M.: Volumedeform: Real-time volumetric non-rigid reconstruction. In: European conference on computer vision. pp. 362–379. Springer (2016)
[35] Ji, S., Wu, G., Fang, J., Cen, J., Yi, T., Liu, W., Tian, Q., Wang, X.: Segment any 4d gaussians. arXiv preprint arXiv:2407.04504 (2024)
[36] Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG) 44(6), 1–16 (2025)
[37] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)
[38] Jung, J., Han, J., Kang, J., Kim, S., Kwak, M.S., Kim, S.: Self-evolving neural radiance fields. In: ICCV 2025 Workshop on Wild 3D: 3D Modeling, Reconstruction, and Generation in the Wild (2023)
[39] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139–1 (2023)
[40] Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P.H., Hong, S., Kim, S.: Seg4diff: Unveiling open-vocabulary semantic segmentation in text-to-image diffusion transformers. Advances in Neural Information Processing Systems 38, 71685–71724 (2026)
[41] Kim, M., Lim, J., Han, B.: 4d gaussian splatting in the wild with uncertainty-aware regularization. Advances in Neural Information Processing Systems 37, 129209–129226 (2024)
[42] Kim, M., Seo, S., Han, B.: Infonerf: Ray entropy minimization for few-shot neural volume rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12912–12921 (2022)
[43] Koo, J., Kim, I.H., Kim, M., Park, J., Park, S., Kim, J., Yi, J., Cho, S., Kim, S.: Mv-tap: Tracking any point in multi-view videos. arXiv preprint arXiv:2512.02006 (2025)
[44] Kopf, J., Rong, X., Huang, J.B.: Robust consistent video depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1611–1621 (2021)
[45] Kumar, S., Dai, Y., Li, H.: Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In: Proceedings of the IEEE international conference on computer vision. pp. 4649–4657 (2017)
[46] Kwak, M.S., Song, J., Kim, S.: Geconerf: Few-shot neural radiance fields via geometric consistency. arXiv preprint arXiv:2301.10941 (2023)
[47] Lai, Z., Insafutdinov, E., Sucar, E., Vedaldi, A.: Cowtracker: Tracking by warping instead of correlation. arXiv preprint arXiv:2602.04877 (2026)
[48] Lee, A.X., Devin, C.M., Zhou, Y., Lampe, T., Bousmalis, K., Springenberg, J.T., Byravan, A., Abdolmaleki, A., Gileadi, N., Khosid, D., et al.: Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In: 5th Annual Conference on Robot Learning (2021)
[49]

arXiv preprint arXiv:2510.14945 (2025)

Lee, J., Jung, J., Han, J., Narihira, T., Fukuda, K., Seo, J., Hong, S., Mitsufuji, Y ., Kim, S.: 3d scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945 (2025)

work page arXiv 2025
[50]

TORA: Topological Representation Alignment for 3D Shape Assembly

Lee, N., Chen, Z., Pollefeys, M., Hong, S.: Tora: Topological representation alignment for 3d shape assembly. arXiv preprint arXiv:2604.04050 (2026) 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lei, J., Weng, Y ., Harley, A.W., Guibas, L., Daniilidis, K.: Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6165–6177 (2025)

2025
[52]

In: European conference on computer vision

Leroy, V ., Cabon, Y ., Revaud, J.: Grounding image matching in 3d with mast3r. In: European conference on computer vision. pp. 71–91. Springer (2024)

2024
[53]

Language-driven Semantic Segmentation

Li, B., Weinberger, K.Q., Belongie, S., Koltun, V ., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[54]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, J., Zhang, J., Bai, X., Zheng, J., Ning, X., Zhou, J., Gu, L.: Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20775–20785 (2024)

2024
[55]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6498–6508 (2021)

2021
[56]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, Z., Tucker, R., Cole, F., Wang, Q., Jin, L., Ye, V ., Kanazawa, A., Holynski, A., Snavely, N.: Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10486–10496 (2025)

[57]

Li, Z., Wang, Q., Cole, F., Tucker, R., Snavely, N.: Dynibar: Neural dynamic image-based rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4273–4284 (2023)

[58]

Liang, H., Ren, J., Mirzaei, A., Torralba, A., Liu, Z., Gilitschenski, I., Fidler, S., Oztireli, C., Ling, H., Gojcic, Z., et al.: Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. arXiv preprint arXiv:2412.03526 (2024)

[59]

Lin, C., Lin, Y., Pan, P., Yu, Y., Yan, H., Fragkiadaki, K., Mu, Y.: Movies: Motion-aware 4d dynamic view synthesis in one second. arXiv preprint arXiv:2507.10065 (2025)

[60]

Lin, C.H., Lv, Z., Wu, S., Xu, Z., Nguyen-Phuoc, T., Tseng, H.Y., Straub, J., Khan, N., Xiao, L., Yang, M.H., et al.: Dgs-lrm: Real-time deformable 3d gaussian reconstruction from monocular videos. arXiv preprint arXiv:2506.09997 (2025)

[61]

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

[62]

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

[63]

Lu, J., Huang, T., Li, P., Dou, Z., Lin, C., Cui, Z., Dong, Z., Yeung, S.K., Wang, W., Liu, Y.: Align3r: Aligned monocular depth estimation for dynamic videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22820–22830 (2025)

[64]

Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In: 2024 International Conference on 3D Vision (3DV). pp. 800–809. IEEE (2024)

[65]

Ma, Z., Chen, X., Yu, S., Bi, S., Zhang, K., Ziwen, C., Xu, S., Yang, J., Xu, Z., Sunkavalli, K., et al.: 4d-lrm: Large space-time reconstruction model from and to any view at any time. arXiv preprint arXiv:2506.18890 (2025)

[66]

Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4981–4991 (2023)

[67]

Miao, S., Huang, J., Bai, D., Yan, X., Zhou, H., Wang, Y., Liu, B., Geiger, A., Liao, Y.: Evolsplat: Efficient volume-based gaussian splatting for urban view synthesis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 11286–11296 (2025)

[68]

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

[69]

Nair, S., Rajeswaran, A., Kumar, V., Finn, C., Gupta, A.: R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601 (2022)

[70]

Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 343–352 (2015)

[71]

Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S., Geiger, A., Radwan, N.: Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5480–5490 (2022)

[72]

Novotny, D., Ravi, N., Graham, B., Neverova, N., Vedaldi, A.: C3dpo: Canonical 3d pose networks for non-rigid structure from motion. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7688–7697 (2019)

[73]

Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20133–20143 (2023)

[74]

Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: Deformable neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5865–5874 (2021)

[75]

Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228 (2021)

[76]

Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)

[77]

Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10318–10327 (2021)

[78]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[79]

Ranftl, R., Vineet, V., Chen, Q., Koltun, V.: Dense monocular depth estimation in complex dynamic scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4058–4066 (2016)

[80]

Roessle, B., Barron, J.T., Mildenhall, B., Srinivasan, P.P., Nießner, M.: Dense depth priors for neural radiance fields from sparse input views. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12892–12901 (2022)

