PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

Ruilin Tang; Shengfeng He; Wenxi Liu; Yang Zhou; Yan Huang; Zhong Ye

arxiv: 2604.05638 · v1 · submitted 2026-04-07 · 💻 cs.CV

Pith reviewed 2026-05-10 20:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords PanopticQuery · 4D Gaussian Splatting · natural language queries · dynamic scenes · semantic grounding · multi-view consensus · 4D reconstruction · Panoptic-L4D

The pith

PanopticQuery answers natural language queries about 4D scenes at query time by lifting aggregated 2D semantics onto 4D Gaussian Splatting reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that complex natural language queries about dynamic scenes can be answered accurately once noisy 2D semantic labels are turned into globally consistent 4D groundings. It starts from high-fidelity dynamic reconstructions produced by 4D Gaussian Splatting and adds a consensus step that collects predictions from many views and time steps. Inconsistent labels are filtered, geometric constraints are enforced, and the surviving semantics are optimized into neural fields that support direct language queries. A new benchmark, Panoptic-L4D, supplies the test cases needed to measure progress on attributes, actions, spatial relations, and multi-object interactions. Experiments indicate the resulting system outperforms earlier approaches on these challenging query types.

Core claim

PanopticQuery transforms noisy, view-dependent 2D semantic predictions into globally consistent 4D interpretations by using a multi-view semantic consensus mechanism on top of 4D Gaussian Splatting reconstructions, enabling accurate handling of complex semantics like temporal actions and spatial relations in natural language queries.

What carries the argument

The multi-view semantic consensus mechanism, which aggregates 2D semantic predictions across views and time frames, filters inconsistent outputs, enforces geometric consistency, and uses neural field optimization to lift the semantics into structured 4D groundings.
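The aggregation-and-filtering idea can be illustrated with a minimal sketch, assuming a plain majority vote with an agreement threshold. The abstract does not specify the actual rule, so `consensus_labels`, `min_agreement`, and the toy labels below are hypothetical and stand in for whatever the paper implements.

```python
from collections import Counter


def consensus_labels(votes_per_point, min_agreement=0.6):
    """Fuse per-view 2D labels into one label per scene point.

    votes_per_point: list of lists; each inner list holds the labels that
    the views/frames observing that point assigned to it.
    Returns one label per point, or None when no label reaches the
    agreement threshold (hypothetical filtering rule, not the paper's).
    """
    fused = []
    for votes in votes_per_point:
        if not votes:
            fused.append(None)  # point never observed
            continue
        label, count = Counter(votes).most_common(1)[0]
        fused.append(label if count / len(votes) >= min_agreement else None)
    return fused


votes = [
    ["cup", "cup", "cup", "hand"],  # 3 of 4 views agree: kept
    ["hand", "cup"],                # 1 of 2 agree: below threshold, filtered
    [],                             # no observations
]
print(consensus_labels(votes))  # ['cup', None, None]
```

A threshold-based vote is only one possible design; a real system would likely weight views by visibility or confidence before fusing.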

If this is right

Language queries involving actions and spatial relations become answerable directly from the reconstructed 4D scene.
A new evaluation benchmark, Panoptic-L4D, provides standardized test cases for language-based 4D querying.
Consistency across time and viewpoints improves semantic grounding for complex multi-object scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Robotics systems that must interpret spoken instructions about moving objects could use the same query-time lifting step.
The method may scale to longer video sequences if the consensus filtering remains effective as temporal drift increases.
Future work could test whether the same aggregation principle improves purely geometric 4D tasks such as motion prediction.

Load-bearing premise

Aggregating noisy 2D semantic predictions across multiple views and time frames will produce globally consistent 4D interpretations without introducing new errors or losing fine-grained details.

What would settle it

Removing the consensus aggregation step on the Panoptic-L4D benchmark and showing that performance on queries about multi-object interactions or temporal actions falls below the reported state-of-the-art levels.
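A sketch of how such an ablation could be scored, assuming each query returns a set of selected element ids and that intersection-over-union is the metric. Both the metric choice and the names `iou` and `ablation_gap` are assumptions for illustration; the toy cases are made up and are not results from the paper.

```python
def iou(pred, truth):
    """Intersection-over-union between two collections of selected ids."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def ablation_gap(cases):
    """Compare mean IoU with and without the consensus step.

    cases: list of (with_consensus, without_consensus, ground_truth)
    triples, one per query. Returns (mean_with, mean_without, gap).
    """
    mean_with = sum(iou(w, g) for w, _, g in cases) / len(cases)
    mean_without = sum(iou(wo, g) for _, wo, g in cases) / len(cases)
    return mean_with, mean_without, mean_with - mean_without


# Toy, made-up selections purely to exercise the functions.
cases = [
    ({1, 2, 3}, {1, 2, 3, 4, 5}, {1, 2, 3}),
    ({7, 8}, {7}, {7, 8}),
]
mean_with, mean_without, gap = ablation_gap(cases)
print(f"with={mean_with:.2f} without={mean_without:.2f} gap={gap:.2f}")
```

Reporting the gap per query type (actions, spatial relations, interactions) rather than as one aggregate would show where the consensus step matters most.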

Figures

Figures reproduced from arXiv: 2604.05638 by Ruilin Tang, Shengfeng He, Wenxi Liu, Yang Zhou, Yan Huang, Zhong Ye.

**Figure 1.** While state-of-the-art embedding-based methods, such as 4D LangSplat, perform well on static attribute queries, they struggle with actions …

**Figure 2.** The Panoptic-L4D Construction Pipeline. Our two-phase pro …

**Figure 3.** Examples from the Panoptic-L4D Benchmark. Our dataset spans diverse environments, including outdoor scenes, room-scale interactions, …

**Figure 4.** Linguistic Diversity in Panoptic-L4D. Word clouds visualize the …

**Figure 5.** Overview of PanopticQuery. We render multi-view RGB/depth videos from an initial 4DGS and obtain prompt-conditioned masks with a …

**Figure 6.** Qualitative Comparison on Neu3D dataset.

**Figure 7.** Qualitative Comparison on Panoptic-L4D dataset.

**Figure 8.** Various query types’ results of our method on Panoptic-L4D dataset.

Original abstract

Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.
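One common way to enforce geometric consistency when lifting 2D masks is a depth-based visibility test: a scene point only receives a view's label if it projects inside the image and its depth matches the rendered depth there. The sketch below assumes a pinhole camera and points already expressed in camera coordinates. The function names, intrinsics, and tolerance are hypothetical, and the abstract does not confirm that the paper uses this particular check.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to (u, v, z)."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera
    return (fx * x / z + cx, fy * y / z + cy, z)


def is_visible(point, depth_map, fx, fy, cx, cy, tol=0.05):
    """True if the point lands in the image and agrees with rendered depth."""
    proj = project(point, fx, fy, cx, cy)
    if proj is None:
        return False
    u, v, z = proj
    col, row = int(round(u)), int(round(v))
    if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
        return False  # projects outside the image
    # An occluded point lies behind the rendered surface, so depths differ.
    return abs(depth_map[row][col] - z) <= tol


# Toy 5x5 depth map with a flat surface at depth 2.0 (illustrative values).
depth = [[2.0] * 5 for _ in range(5)]
K = dict(fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(is_visible((0.0, 0.0, 2.0), depth, **K))   # on the surface
print(is_visible((0.0, 0.0, 3.0), depth, **K))   # behind the surface
print(is_visible((10.0, 0.0, 2.0), depth, **K))  # outside the frame
```

Only views that pass this test would contribute a label for the point, which keeps occluded views from voting.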

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PanopticQuery layers multi-view consensus on 4D Gaussian Splatting to ground language queries in dynamic scenes and ships a new benchmark, but the SOTA claim sits on untested ground for correlated noise.

The paper's main move is to take 4D Gaussian Splatting for dynamic reconstruction and add a multi-view semantic consensus step that pools 2D predictions across frames and viewpoints, then lifts the result into consistent 4D groundings through neural field optimization. They also release Panoptic-L4D, a benchmark aimed at complex queries involving attributes, actions, spatial relations, and multi-object interactions. That combination is the actual new piece: a query-time pipeline rather than another reconstruction method alone.

It directly targets a real pain point in robotics and AR, where you need to answer natural language questions about moving scenes without view-dependent drift. Filtering inconsistencies and enforcing geometric consistency is a sensible engineering step that builds on existing 2D semantic tools without overcomplicating the base representation. Credit to the authors for framing the problem clearly and for recognizing that simple per-view predictions fall short on temporal actions and interactions.

The soft spot is the evaluation. The abstract claims state-of-the-art results, yet the benchmark is new, so there are no previously published results on it. This makes it hard to separate gains due to the consensus mechanism from effects of benchmark construction or data characteristics. The stress-test point about correlated errors from occlusions or dynamics is worth checking; if the aggregation just averages viewpoint biases rather than resolving them, the 4D consistency claim weakens, especially for fine-grained interactions. No equations or ablation details appear in the abstract, so the full paper needs to show whether the optimization step actually corrects those issues or introduces its own smoothing artifacts.

This paper is for computer-vision researchers working on 4D reconstruction and vision-language grounding. Anyone building query interfaces for dynamic environments will find the pipeline and benchmark useful as a starting point. It deserves a serious referee because the core idea is grounded in a clear problem and a reproducible direction, even if the current evidence is preliminary. Send it for review to get concrete feedback on the consensus robustness and benchmark comparisons.

Referee Report

1 major / 2 minor

Summary. The paper introduces PanopticQuery, a framework for unified query-time reasoning in 4D scenes. It builds on 4D Gaussian Splatting for dynamic reconstruction and proposes a multi-view semantic consensus mechanism that aggregates 2D semantic predictions across views and time frames. This mechanism is described as filtering inconsistent outputs, enforcing geometric consistency, and lifting 2D semantics into structured 4D groundings via neural field optimization. The work also presents the new Panoptic-L4D benchmark for language-based querying in dynamic scenes and claims state-of-the-art results on complex queries involving attributes, actions, spatial relationships, and multi-object interactions.

Significance. If the multi-view consensus and neural field optimization steps produce reliable globally consistent 4D interpretations, the framework could meaningfully advance language-driven reasoning over dynamic scenes, with potential applications in robotics and augmented reality. The introduction of the Panoptic-L4D benchmark is a clear positive contribution that enables standardized evaluation. However, the overall significance hinges on whether the aggregation step genuinely mitigates rather than propagates errors in temporally correlated 2D predictions, which remains the least secure aspect of the central claim.

major comments (1)

[Method (multi-view semantic consensus and neural field optimization)] The multi-view semantic consensus mechanism (described in the method overview and the paragraph beginning 'Our approach builds on 4D Gaussian Splatting...') claims to 'filter inconsistent outputs, enforce geometric consistency, and lift 2D semantics into structured 4D groundings.' No analysis or ablation is provided on whether this aggregation resolves correlated errors from viewpoint-dependent biases or occlusions, which are common in dynamic interactions. If such correlated noise is present, the process could propagate rather than eliminate errors for temporal actions and multi-object relations, directly undermining the SOTA claim on complex queries.
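The concern about correlated noise can be made concrete with a small calculation, assuming a simplified noise model that is not taken from the paper: each view mislabels a point with probability `p`, and with probability `rho` all views share a single draw (fully correlated), otherwise they err independently. The function `majority_error` and its parameters are hypothetical names for this illustration.

```python
from math import comb


def majority_error(n_views, p, rho=0.0):
    """Probability that a strict majority of views is wrong.

    n_views: number of views voting (use an odd number to avoid ties).
    p: per-view error rate.
    rho: probability that all views share one error draw; with
         probability 1 - rho the views err independently.
    """
    independent = sum(
        comb(n_views, k) * p**k * (1 - p) ** (n_views - k)
        for k in range(n_views // 2 + 1, n_views + 1)
    )
    # When views are fully correlated, voting cannot help: error stays p.
    return rho * p + (1 - rho) * independent


for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f} fused error={majority_error(9, 0.2, rho):.4f}")
```

Under this toy model, nine views with a 20% per-view error rate give a fused error near 2% when errors are independent, but the fused error stays at 20% when they are fully correlated. That gap is what an ablation on occluded or view-biased frames would need to probe.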

minor comments (2)

[Abstract] The abstract states that a video demonstration is available in the supplementary materials; including at least one qualitative figure or table summarizing query examples and failure cases in the main paper would improve readability.
[Benchmark and Experiments] The new benchmark Panoptic-L4D is introduced without a detailed comparison table showing how its query complexity or scene dynamics differ from prior 4D or language-grounding datasets; adding this would strengthen the evaluation section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of the Panoptic-L4D benchmark. We address the single major comment below and will incorporate the suggested analysis into the revised manuscript.

Referee: [Method (multi-view semantic consensus and neural field optimization)] The multi-view semantic consensus mechanism (described in the method overview and the paragraph beginning 'Our approach builds on 4D Gaussian Splatting...') claims to 'filter inconsistent outputs, enforce geometric consistency, and lift 2D semantics into structured 4D groundings.' No analysis or ablation is provided on whether this aggregation resolves correlated errors from viewpoint-dependent biases or occlusions, which are common in dynamic interactions. If such correlated noise is present, the process could propagate rather than eliminate errors for temporal actions and multi-object relations, directly undermining the SOTA claim on complex queries.

Authors: We agree that a dedicated analysis of error propagation versus mitigation for correlated viewpoint biases and occlusions is valuable and was not present in the original submission. The multi-view consensus is designed to aggregate predictions across views and time frames precisely to filter inconsistencies, with the subsequent neural field optimization enforcing global 4D consistency; the SOTA results on complex queries in Panoptic-L4D provide indirect support. However, to directly address the concern, the revised manuscript will include a new ablation subsection. This will report (i) quantitative performance with and without the consensus step on query subsets involving temporal actions and multi-object interactions, (ii) view-consistency metrics before and after aggregation, and (iii) qualitative visualizations of error filtering on occluded or biased frames. We expect these additions to clarify that the mechanism reduces rather than propagates correlated errors.

revision: yes
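The rebuttal does not define its view-consistency metric. One plausible form, sketched here as an assumption, is the mean pairwise label agreement over points that two views both observe. The name `view_consistency` and the toy camera labels are hypothetical.

```python
from itertools import combinations


def view_consistency(labels_by_view):
    """Mean pairwise label agreement across views.

    labels_by_view: dict mapping view id -> dict of point id -> label.
    Only points observed by both views in a pair are compared.
    Returns a value in [0, 1], or None if no pair shares a point.
    """
    agree = total = 0
    for a, b in combinations(labels_by_view.values(), 2):
        for pid in a.keys() & b.keys():
            total += 1
            agree += a[pid] == b[pid]
    return agree / total if total else None


views = {
    "cam0": {1: "cup", 2: "hand"},
    "cam1": {1: "cup", 2: "cup"},
    "cam2": {1: "cup"},
}
print(view_consistency(views))  # 0.75
```

Computing this before and after aggregation, as item (ii) proposes, would show whether the consensus step raises agreement. High agreement alone does not rule out errors that all views share.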

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents PanopticQuery as an engineering pipeline: 4D Gaussian Splatting for reconstruction plus a multi-view semantic consensus step that aggregates 2D predictions, filters inconsistencies, and lifts semantics via neural field optimization. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The SOTA claim rests on empirical results on the new Panoptic-L4D benchmark rather than any closed-form reduction to inputs. This is the normal case of a self-contained applied framework whose correctness is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; evaluation is limited to high-level description.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P . P . Srinivasan, M. Tancik, J. T. Barron, R. Ra- mamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021. 1

work page 2021

[2] [2]

Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields,

K. Park, U. Sinha, P . Hedman, J. T. Barron, S. Bouaziz, D. B. Gold- man, R. Martin-Brualla, and S. M. Seitz, “Hypernerf: A higher- dimensional representation for topologically varying neural radi- ance fields,”arXiv preprint arXiv:2106.13228, 2021. 1, 2, 3

work page arXiv 2021

[3] [3]

Nerfies: Deformable neural radiance fields,

K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5865–5874. 1, 2

work page 2021

[4] [4]

3d gaussian splatting for real-time radiance field rendering,

B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–14, 2023. 1, 2

work page 2023

[5] [5]

4d gaussian splatting for real-time dynamic scene rendering,

G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and X. Wang, “4d gaussian splatting for real-time dynamic scene rendering,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 20 310–20 320. 1, 2, 5

work page 2024

[6] [6]

De- formable 3d gaussians for high-fidelity monocular dynamic scene reconstruction,

Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, “De- formable 3d gaussians for high-fidelity monocular dynamic scene reconstruction,” inProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, 2024, pp. 20 331–20 341. 1, 2

work page 2024

[7] [7]

4d langsplat: 4d language gaussian splatting via multimodal large language models,

W. Li, R. Zhou, J. Zhou, Y. Song, J. Herter, M. Qin, G. Huang, and H. Pfister, “4d langsplat: 4d language gaussian splatting via multimodal large language models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22 001–22 011. 1, 3, 4, 7

work page 2025

[8] [8]

Gaussian grouping: Segment and edit anything in 3d scenes,

M. Ye, M. Danelljan, F. Yu, and L. Ke, “Gaussian grouping: Segment and edit anything in 3d scenes,” inEuropean conference on computer vision. Springer, 2024, pp. 162–179. 1, 6

work page 2024

[9] [9]

Segment any 4d gaussians,

S. Ji, G. Wu, J. Fang, J. Cen, T. Yi, W. Liu, Q. Tian, and X. Wang, “Segment any 4d gaussians,”arXiv preprint arXiv:2407.04504, 2024. 1, 6, 7

work page arXiv 2024

[10] [10]

D- nerf: Neural radiance fields for dynamic scenes,

A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer, “D- nerf: Neural radiance fields for dynamic scenes,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10 318–10 327. 2, 3

work page 2021

[11] T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe et al., “Neural 3D video synthesis from multi-view video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5521–5531.

[12] L. Song, A. Chen, Z. Li, Z. Chen, L. Chen, J. Yuan, Y. Xu, and A. Geiger, “NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2732–2742, 2023.

[13] W. Xian, J.-B. Huang, J. Kopf, and C. Kim, “Space-time neural irradiance fields for free-viewpoint video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9421–9431.

[14] A. Cao and J. Johnson, “HexPlane: A fast representation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 130–141.

[15] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-Planes: Explicit radiance fields in space, time, and appearance,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12479–12488.

[16] Z. Li, Z. Chen, Z. Li, and Y. Xu, “Spacetime Gaussian feature splatting for real-time dynamic view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8508–8520.

[17] R. Song, C. Liang, Y. Xia, W. Zimmer, H. Cao, H. Caesar, A. Festag, and A. Knoll, “CoDa-4DGS: Dynamic Gaussian splatting with context and deformation awareness for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 28031–28041.

[18] Q. Liu, Y. Liu, J. Wang, X. Lyu, P. Wang, W. Wang, and J. Hou, “MoDGS: Dynamic Gaussian splatting from casually-captured monocular videos with depth priors,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=2prShxdLkX

[19] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: Language embedded radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19729–19739.

[20] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “LangSplat: 3D language Gaussian splatting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20051–20060.

[21] W. Li, Y. Zhao, M. Qin, Y. Liu, Y. Cai, C. Gan, and H. Pfister, “LangSplatV2: High-dimensional 3D language Gaussian splatting with 450+ FPS,” arXiv preprint arXiv:2507.07136, 2025.

[22] S. He, G. Jie, C. Wang, Y. Zhou, S. Hu, G. Li, and H. Ding, “ReferSplat: Referring segmentation in 3D Gaussian splatting,” arXiv preprint arXiv:2508.08252, 2025.

[23] R. Li, S. Li, L. Kong, X. Yang, and J. Liang, “SeeGround: See and ground for zero-shot open-vocabulary 3D visual grounding,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3707–3717.

[24] C. Zhan, Y. Zhang, G. Wang, and H. Wang, “FreeQ-Graph: Free-form querying with semantic consistent scene graph for 3D scene understanding,” arXiv preprint arXiv:2506.13629, 2025.

[25] I. Labe, N. Issachar, I. Lang, and S. Benaim, “DGD: Dynamic 3D Gaussians distillation,” in European Conference on Computer Vision. Springer, 2024, pp. 361–378.

[26] G. Fiebelman, T. Cohen, A. Morgenstern, P. Hedman, and H. Averbuch-Elor, “4-LEGS: 4D language embedded Gaussian splatting,” in Computer Graphics Forum. Wiley Online Library, 2025, p. e70085.

[27] S. Zhou, H. Ren, Y. Weng, S. Zhang, Z. Wang, D. Xu, Z. Fan, S. You, Z. Wang, L. Guibas et al., “Feature4X: Bridging any monocular video to 4D agentic AI with versatile Gaussian feature fields,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14179–14190.

[28] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.

[29] OpenAI, “GPT-4V(ision) system card,” OpenAI, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263218031

[30] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, 2023.

[31] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.

[32] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025.

[33] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, ..., “Qwen3-VL technical report,” arXiv preprint, 2025.

[34] C. Cuttano, G. Trivigno, G. Rosi, C. Masone, and G. Averta, “SAMWISE: Infusing wisdom in SAM2 for text-driven video segmentation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3395–3405.

[35] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “SAM 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024.

[36] N. Sabater, G. Boisson, B. Vandame, P. Kerbiriou, F. Babon, M. Hog, R. Gendrot, T. Langlois, O. Bureller, A. Schubert et al., “Dataset and pipeline for multi-view light-field video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 30–40.

Ruilin Tang is currently pursuing the B.S. degree with the Scho...