pith. sign in

arxiv: 2606.28215 · v1 · pith:NNWP5NRNnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI· cs.GR

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Pith reviewed 2026-06-29 04:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.GR
keywords 4D reconstructionmonocular videomulti-object interactionshuman-in-the-loopvision-language modelsphysical plausibilitytemporal consistencydata generation
0
0 comments X

The pith

HAT-4D reconstructs 4D multi-object interactions from monocular videos by combining vision models with targeted human feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HAT-4D as a framework that lifts single-camera videos of multiple objects interacting into complete 4D models containing 3D shapes, motion over time, and physical contacts. Prior monocular methods often break down when objects overlap or depth is unclear, but this system pairs vision-language models with limited human corrections at several stages to correct those issues and produce usable assets. The resulting data engine also supplies a new benchmark called MVOIK-4D and shows that the outputs can raise the accuracy of other models when used for further training.

Core claim

HAT-4D is the first agentic framework that integrates VLMs with a multi-level human-in-the-loop feedback mechanism to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single monocular video, yielding physically plausible assets without multi-camera rigs and enabling the MVOIK-4D benchmark.

What carries the argument

The multi-level human-in-the-loop feedback mechanism integrated with VLMs that resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation.

If this is right

  • HAT-4D reaches state-of-the-art scores on most evaluation metrics while preserving competitive semantic alignment.
  • Datasets generated by HAT-4D raise baseline model performance when used for fine-tuning.
  • The framework produces the open-world MVOIK-4D benchmark together with a multi-dimensional protocol that tests physical plausibility and temporal consistency.
  • The method supplies scalable 4D interaction data for Embodied AI and VLA training without requiring multi-camera capture setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback loop could be tested on single-object or static-scene videos to measure how much human input is still needed once interaction complexity drops.
  • If the generated assets prove stable in physics simulators, they could directly seed large-scale robot training environments drawn from everyday video sources.
  • The benchmark protocol itself might serve as a template for evaluating other video-to-4D pipelines that claim physical realism.

Load-bearing premise

A small amount of human feedback suffices to resolve depth ambiguities and occlusions consistently across diverse in-the-wild videos.

What would settle it

Compare physical-plausibility and temporal-consistency scores on a held-out set of complex interaction videos when the system runs with zero human feedback versus the reported human-assisted version; a negligible gap would undermine the assumption.

Figures

Figures reproduced from arXiv: 2606.28215 by Boran Wen, Chenyang Yu, Haoyuan Wang, Jiaxin Li, Li Zhang, Mingyu Zhang, Ruiyang Liu, Su Linxiang, Xinrui Shi, Yichen Zhao, Yifan Ding, Yong-Lu Li, Yuxiang WU, Zhenkai Zhang.

Figure 1
Figure 1. Figure 1: Examples of object interaction sequences from the MVOIK-4D [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of HAT-4D. The framework extracts an interaction knowledge graph from a monocular video, reconstructs and composes the interacting objects, and generates subsequent 3D states through memory-guided 4D propagation. The evaluation agent updates the memory with successful results and feeds failed cases back for repair, while human users can refine intermediate outputs throughout the pipeline. through … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Interaction Knowledge Graph (IKG) formulation. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Construction of MVOIK-4D and annotation of MemoryOIK-4D. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison with baseline methods. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Impact of human intervention and view scaling on 4D reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Failure cases. Left: pressing and folding a thin plastic carton causes complex deformation, distorting novel-view geometry and over-smoothing details. Right: the spring toy’s rapid extension and recoil cause motion blur and geometric inconsistencies. views are rendered around it and used as input views. Another four captured viewpoints are reserved as validation views for evaluation. To study the effect of… view at source ↗
Figure 8
Figure 8. Figure 8: Overview of the HAT-4D framework. (a) Object interaction under￾standing: given a monocular video, an interaction understanding agent extracts an interaction knowledge graph describing objects and their interactions. (b) 3D ob￾ject generation and composition: conditioned on the object information and the t-th frame, the 3D Object Generation Agent reconstructs individual 3D objects, which are then composed a… view at source ↗
Figure 9
Figure 9. Figure 9: User interface for Gaussian-level and Region-level interactive refine [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: User interface for object-level interactive refinement. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Distribution of the top-20 scene types in MVOIK-4D measured by case count. The box open scenario appears most frequently. Although the interaction type is similar, the objects contained in the box vary across scenes, making these cases particularly useful for evaluating a model’s ability to maintain object memory when objects are temporarily occluded during box opening. Evaluation protocol. To assess the … view at source ↗
Figure 12
Figure 12. Figure 12: Distribution of the top-20 interaction types in MVOIK-4D by case count. Only the 20 most frequent interaction categories are shown. The dataset covers a wide range of object interaction behaviors, including manipulation (e.g., grip, move), state changes (e.g., open, cover), and physical interactions (e.g., cut, collide, squeeze), reflecting the diversity of real-world object interactions. 10.1 Overall Rec… view at source ↗
Figure 13
Figure 13. Figure 13: Distribution of the top-20 object categories in MVOIK-4D by case count. Only the 20 most frequent object types are shown. The dataset contains a diverse set of everyday objects, including containers (box, cup, lid), tools (knife, clip, clamp), and deformable or manipulable objects (apple, banana, modeling clay), reflecting the variety of objects involved in real-world interactions. For each viewpoint v, w… view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative visualization of temporal memory metric. The top timeline illus￾trates a sequence where a shoebox rotates 180°. We compare L4GM model with the Ground Truth (GT) at Frame 51 (front view) and Frame 103 (occluded/back view). The segmentation masks are color-coded: yellow indicates pixels matched with the current input frame, green represents successful retrieval from past memory (demon￾strating t… view at source ↗
Figure 15
Figure 15. Figure 15: More qualitative comparison between different baselines and [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: More multi-view data in MVOIK-4D dataset. [PITH_FULL_IMAGE:figures/full_fig_p034_16.png] view at source ↗
read the original abstract

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at https://lijiaxin0111.github.io/HAT4D/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces HAT-4D, the first agentic framework that integrates VLMs with a multi-level human-in-the-loop feedback mechanism to reconstruct 3D geometry, temporal dynamics, and physical interactions of multiple objects from monocular in-the-wild videos. It produces the MVOIK-4D benchmark with a new multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency, claims SOTA results on most metrics while maintaining semantic alignment, shows via ablations that small human feedback improves reconstruction, and demonstrates that the generated data improves baseline performance on fine-tuning.

Significance. If the central claims hold with proper quantitative support, HAT-4D would offer a practical, hardware-light pathway for generating large-scale 4D interaction data for Embodied AI and VLAs. The open benchmark, evaluation protocol, and code release would provide a reusable resource for standardizing assessment of monocular 4D multi-object methods.

major comments (2)
  1. [Abstract] Abstract: the claim that 'a small amount of human feedback improves interaction reconstruction' and enables scalable resolution of depth ambiguities/occlusions is load-bearing for both the SOTA reconstruction results and the downstream data-improvement claim, yet the abstract (and the reviewed portions) supply no quantitative metrics on feedback volume (interventions per video or total time), inter-video consistency, or failure modes when feedback is minimized.
  2. [Abstract] Abstract: SOTA performance is asserted on 'most evaluation metrics' with 'competitive semantic alignment,' but the abstract provides no details on baselines, error bars, dataset splits, or the precise definition/measurement of physical plausibility, preventing verification of the central empirical claims.
minor comments (1)
  1. [Abstract] The abstract states that data and code are available at a GitHub link; confirming that the released assets include the exact human-feedback logs and evaluation scripts would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each point below and will revise the abstract accordingly to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'a small amount of human feedback improves interaction reconstruction' and enables scalable resolution of depth ambiguities/occlusions is load-bearing for both the SOTA reconstruction results and the downstream data-improvement claim, yet the abstract (and the reviewed portions) supply no quantitative metrics on feedback volume (interventions per video or total time), inter-video consistency, or failure modes when feedback is minimized.

    Authors: We agree that the abstract would be strengthened by including quantitative details on human feedback volume and costs. The manuscript's ablation studies (Section 4.3) demonstrate the benefit of limited human input, and we will revise the abstract to report key figures such as average interventions per video and time per sequence drawn from those experiments. revision: yes

  2. Referee: [Abstract] Abstract: SOTA performance is asserted on 'most evaluation metrics' with 'competitive semantic alignment,' but the abstract provides no details on baselines, error bars, dataset splits, or the precise definition/measurement of physical plausibility, preventing verification of the central empirical claims.

    Authors: We acknowledge that the abstract is high-level and omits these specifics. The experiments section details the baselines, MVOIK-4D splits, metrics (including physical plausibility and temporal consistency), and results with variability measures. We will revise the abstract to name the primary baselines and briefly define the evaluation focus. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents HAT-4D as an agentic framework that integrates external VLMs with a multi-level human-in-the-loop mechanism to resolve depth ambiguities and occlusions in monocular 4D reconstruction. No equations, fitted parameters, or self-referential predictions appear in the abstract or description. Claims of SOTA performance and data utility rest on empirical results from external components rather than any reduction to self-defined inputs, self-citations, or ansatzes. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is therefore minimal and provisional.

axioms (1)
  • domain assumption VLMs combined with limited human feedback can reliably resolve depth ambiguities and occlusions in multi-object interaction videos
    This is the core mechanism claimed to enable the reconstruction.

pith-pipeline@v0.9.1-grok · 5852 in / 1191 out tokens · 30045 ms · 2026-06-29T04:14:10.503714+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

100 extracted references · 2 canonical work pages

  1. [1]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    arXiv preprint arXiv:2311.15127 (2023)

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  3. [3]

    Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

    Budd, S., Robinson, E.C., Kainz, B.: A survey on active learning and human-in- the-loop deep learning for medical image analysis. Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

  4. [4]

    Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment anything in 3d with radiance fields (2024),https://arxiv.org/abs/ 2304.12308

  5. [5]

    arXiv preprint arXiv:2601.14253 (2026)

    Chen, H., Chen, X., Zhang, Y., Xu, Z., Chen, A.: Motion 3-to-4: 3d motion recon- struction for 4d synthesis. arXiv preprint arXiv:2601.14253 (2026)

  6. [6]

    arXiv preprint arXiv:2503.09631 (2025)

    Chen, J., Zhang, B., Tang, X., Wonka, P.: V2m4: 4d mesh animation reconstruction from a single monocular video. arXiv preprint arXiv:2503.09631 (2025)

  7. [7]

    Chen, J., Gao, D., Lin, K.Q., Shou, M.Z.: Affordance grounding from demonstra- tion video to target image (2023),https://arxiv.org/abs/2303.14644

  8. [8]

    In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

    Chen, Y., Guo, S., Yang, T., Ding, L., Yu, X., Gu, J., Xue, T.: 4dslomo: 4d recon- struction for high speed scene with asynchronous capture. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)

  9. [9]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, Z., Li, J., Liang, J., Tan, L., Guo, Y., Lu, C., Li, Y.L.: Mˆ 3-vos: Multi-phase, multi-transition, and multi-scenery video object segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29193–29202 (2025)

  10. [10]

    Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms (2024),https://arxiv.org/abs/2406.07476

  11. [11]

    Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

    Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

  13. [13]

    Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning ap- proach for object affordance detection (2018),https://arxiv.org/abs/1709. 07326

  14. [14]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Geng, C., Zhang, Y., Wu, S., Wu, J.: Birth and death of a rose. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26102–26113 (2025)

  15. [15]

    Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d (2024), https://arxiv.org/abs/2311.04400

  16. [16]

    Hou, C., Ze, Y., Fu, Y., Gao, Z., Hu, S., Yu, Y., Zhang, S., Xu, H.: 4d visual pre-training for robot learning (2025),https://arxiv.org/abs/2508.17230

  17. [17]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models (2023),https:// arxiv.org/abs/2311.17982

  18. [18]

    Hunyuan3D, T., :, Zhang, B., Guo, C., Guo, D., Liu, H., Yan, H., Shi, H., Yu, J., Xu, J., Huang, J., Li, K., Wang, L., Linus, Wang, P., Lin, Q., Tang, R., Yang, X., Li, Y., Guan, Y., Zhao, Y., Yang, Y., Lai, Z., Liang, Z., Zhao, Z.: Hy3d-bench: Generation of 3d assets (2026),https://arxiv.org/abs/2602.03907

  19. [19]

    Ji, K., Shi, Y., Jin, Z., Chen, K., Xu, L., Ma, Y., Yu, J., Wang, J.: Towards immersive human-x interaction: A real-time framework for physically plausible motion synthesis (2025),https://arxiv.org/abs/2508.02106

  20. [20]

    arXiv preprint arXiv:2311.02848 (2023)

    Jiang,Y.,Zhang,L.,Gao,J.,Hu,W.,Yao,Y.:Consistent4d:Consistent360{\deg} dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848 (2023)

  21. [21]

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023),https://arxiv.org/abs/2304.02643

  22. [22]

    Krishna,R.,Zhu,Y.,Groth,O.,Johnson,J.,Hata,K.,Kravitz,J.,Chen,S.,Kalan- tidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual genome: Con- necting language and vision using crowdsourced dense image annotations (2016), https://arxiv.org/abs/1602.07332

  23. [23]

    Advances in Neural Information Processing Systems37, 62189–62222 (2024)

    Li, B., Zheng, C., Zhu, W., Mai, J., Zhang, B., Wonka, P., Ghanem, B.: Vivid-zoo: Multi-view video generation with diffusion model. Advances in Neural Information Processing Systems37, 62189–62222 (2024)

  24. [24]

    arXiv preprint arXiv:2503.20784 (2025)

    Li, J., Gao, H.a., Li, W., Chi, H., Liu, C., Du, C., Liu, Y., Gao, M., Zhang, G., Zhang, Z., et al.: Fb-4d: Spatial-temporal coherent dynamic 3d content generation with feature banks. arXiv preprint arXiv:2503.20784 (2025)

  25. [25]

    Li, R., Zhang, S., He, X.: Sgtr: End-to-end scene graph generation with transformer (2022),https://arxiv.org/abs/2112.12970

  26. [26]

    org/abs/2506.15483

    Li, S., Zhang, H., Chen, X., Wang, Y., Ban, Y.: Genhoi: Generalizing text-driven 4d human-object interaction synthesis for unseen objects (2025),https://arxiv. org/abs/2506.15483

  27. [27]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, Y.L., Xu, Y., Xu, X., Mao, X., Yao, Y., Liu, S., Lu, C.: Beyond object recog- nition: A new benchmark towards object concept learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20029–20040 (2023) 18 J. Li et al

  28. [28]

    Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot one image to 3d object (2023),https://arxiv.org/abs/2303. 11328

  29. [29]

    In: Pro- ceedings of the AAAI Conference on Artificial Intelligence

    Liu, X., Wen, B., Liu, X., Zhou, Z., Fan, H., Lu, C., Ma, L., Chen, Y., Li, Y.L.: Interacted object grounding in spatio-temporal human-object interactions. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5622–5630 (2025)

  30. [30]

    In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

    Lu, C.Y., Zhou, P., Xing, A., Pokhariya, C., Dey, A., Shah, I.N., Mavidipalli, R., Hu, D., Comport, A.I., Chen, K., et al.: Diva-360: The dynamic visual dataset for immersive neural fields. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 22466–22476 (2024)

  31. [31]

    arXiv preprint arXiv:2506.18890 (2025)

    Ma, Z., Chen, X., Yu, S., Bi, S., Zhang, K., Ziwen, C., Xu, S., Yang, J., Xu, Z., Sunkavalli, K., et al.: 4d-lrm: Large space-time reconstruction model from and to any view at any time. arXiv preprint arXiv:2506.18890 (2025)

  32. [32]

    In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

    Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion gener- ation for arbitrary objects. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14357–14368 (2025)

  33. [33]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  34. [34]

    Advances in Neural Information Processing Systems37, 56828–56858 (2024)

    Ren, J., Xie, C., Mirzaei, A., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H., et al.: L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems37, 56828–56858 (2024)

  35. [35]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

  36. [36]

    Settles, B.: Active learning literature survey (2009)

  37. [37]

    arXiv preprint arXiv:2508.10104 (2025)

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  38. [38]

    Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: Sam 3d: 3dfy anything in images (2025),https://arxiv.org/abs/2511.16624

  39. [39]

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation (2019)

  40. [40]

    In: European Conference on Computer Vision

    Voleti, V., Yao, C.H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., Laforte, C., Rombach, R., Jampani, V.: Sv3d: Novel multi-view synthesis and 3d genera- tion from a single image using latent video diffusion. In: European Conference on Computer Vision. pp. 439–457. Springer (2024)

  41. [41]

    Wang, B., Wu, V., Wu, B., Keutzer, K.: Latte: Accelerating lidar point cloud annotation via sensor fusion, one-click annotation, and tracking (2019),https: //arxiv.org/abs/1904.09085

  42. [42]

    IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019)

    Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., Doel, T., David, A.L., Deprest, J., Ourselin, S., Vercauteren, T.: Deepigeos: A deep interactive geodesic framework for medical image segmentation. IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019). https://doi.org/10.1109/tpami.2018.2840695,htt...

  43. [43]

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer (2025),https://arxiv.org/abs/2503. 11651

  44. [44]

    Wang, Y., Li, K., Li, X., Yu, J., He, Y., Wang, C., Chen, G., Pei, B., Yan, Z., Zheng, R., Xu, J., Wang, Z., Shi, Y., Jiang, T., Li, S., Zhang, H., Huang, Y., Qiao, Y., Wang, Y., Wang, L.: Internvideo2: Scaling foundation models for multimodal video understanding (2024),https://arxiv.org/abs/2403.15377

  45. [45]

    arXiv preprint arXiv:2601.11421 (2026)

    Wang, Z., Liu, C., Xiang, Y., Zhang, R., Hao, Q., Lu, H., Chen, H., Feng, Z., Zheng, K., Ye, D., et al.: The great march 100: 100 detail-oriented tasks for evaluating embodied ai agents. arXiv preprint arXiv:2601.11421 (2026)

  46. [46]

    Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024),https: //arxiv.org/abs/2310.08528

  47. [47]

    arXiv preprint arXiv:2407.17470 (2024)

    Xie, Y., Yao, C.H., Voleti, V., Jiang, H., Jampani, V.: Sv4d: Dynamic 3d con- tent generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470 (2024)

  48. [48]

    arXiv preprint arXiv:2503.16396 (2025)

    Yao, C.H., Xie, Y., Voleti, V., Jiang, H., Jampani, V.: Sv4d 2.0: Enhancing spatio- temporal consistency in multi-view video diffusion for high-quality 4d generation. arXiv preprint arXiv:2503.16396 (2025)

  49. [49]

    Yu, W., Zhu, S., Yang, T., Chen, C.: Consistency-based active learning for object detection (2022),https://arxiv.org/abs/2103.10374

  50. [50]

    Zeng, J., Bu, Q., Wang, B., Xia, W., Chen, L., Dong, H., Song, H., Wang, D., Hu, D., Luo, P., Cui, H., Zhao, B., Li, X., Qiao, Y., Li, H.: Learning manipulation by predicting interaction (2024),https://arxiv.org/abs/2406.00439

  51. [51]

    In: European Confer- ence on Computer Vision

    Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: European Confer- ence on Computer Vision. pp. 163–179. Springer (2024)

  52. [52]

    arXiv preprint arXiv:2507.23785 (2025)

    Zhang, B., Xu, S., Wang, C., Yang, J., Zhao, F., Chen, D., Guo, B.: Gaus- sian variation field diffusion for high-fidelity video-to-4d synthesis. arXiv preprint arXiv:2507.23785 (2025)

  53. [53]

    Zhang, C., Moing, G.L., Koppula, S., Rocco, I., Momeni, L., Xie, J., Sun, S., Sukthankar, R., Barral, J.K., Hadsell, R., Ghahramani, Z., Zisserman, A., Zhang, J., Sajjadi, M.S.M.: Efficiently reconstructing dynamic scenes one d4rt at a time (2025),https://arxiv.org/abs/2512.08924

  54. [54]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhang, M., Zhuo, L., Tan, T., Xie, G., Nie, X., Li, Y., Zhao, R., He, Z., Wang, Z., Cai, J., et al.: Ipr-1: Interactive physical reasoner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33415– 33425 (2026)

  55. [55]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

  56. [56]

    arXiv preprint arXiv:2508.10868 (2025) 20 J

    Zhang, Y., Zhang, L., Ma, R., Cao, N.: Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868 (2025) 20 J. Li et al. HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human–Agent Collaboration Supplementary Material We introduce: –More implementation details of the HAT-4D framework in Sec.7, ad...

  57. [57]

    Do NOT output numeric pose or distance

  58. [58]

    distance_hint must be one of: {near_zero, near, mid, far, unknown}

  59. [59]

    Focus on events that start, end, or change within this segment

  60. [60]

    left cup

    Keep object naming consistent with previous memory. Object naming rules: 4.1 Prefer distinguishing objects by appearance (color, texture, shape, size, damage, stain, marks). 4.2 Objects should represent real object instances. Avoid pure spatial region names. 36 J. Li et al. 4.3 Only when objects are visually indistinguishable, spatial qualifiers such as "...

  61. [61]

    Objects should be ordered from +X to -X (more negative means farther)

    Output x-axis depth ordering timeline. Objects should be ordered from +X to -X (more negative means farther)

  62. [62]

    segment_id

    Output key pairwise spatial relations: horizontal (left/right/overlap) vertical (above/below/overlap) optional depth (front/behind/same_depth). segment_id: {segment_id} object_text_hint: {object_text} relation_hint: {relation_hint} segment_frame_tokens: {segment_tokens} frame_mapping: {token_lines} previous_segment_memory: {memory_text} Return JSON with k...

  63. [63]

    Merge duplicated object identities

  64. [64]

    Resolve temporal conflicts across segments

  65. [65]

    left cup

    Keep only symbolic spatial relations. Do not output numeric pose or distance. Object merging rules: 3.1 Prefer instance names distinguished by 38 J. Li et al. appearance details (color, texture, shape, size, markings). 3.2 If objects are visually indistinguishable, spatial qualifiers such as "left cup" / "right cup" are allowed, but must remain consistent...

  66. [66]

    Output ordering from +X to -X (more negative means farther)

    Merge x-axis depth ordering timelines. Output ordering from +X to -X (more negative means farther)

  67. [67]

    objects": [string],

    Merge relative_position_timeline across the entire video. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} all_frame_tokens: {frame_tokens} segment_fragments_json: {segment_fragments_json} segment_memories_json: {segment_memories_json} Return JSON with keys: { "objects": [string], "object_meta": [ { "id": string, "name": string, "cat...

  68. [68]

    Use world coordinate convention: objects ordered from +X to -X (more negative means farther)

  69. [69]

    Provide time-sliced depth ordering covering the video timeline

  70. [70]

    x_depth_order_timeline

    If information is insufficient, output a single global ordering. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} frame_tokens: {frame_tokens} current_scene_graph_json: {current_scene_graph_json} segment_memories_json: {segment_memories_json} Return JSON: { "x_depth_order_timeline": [ { "frame_range": string, "start_frame_index": int...

  71. [71]

    Provide exactly {n_pos} positive points

  72. [72]

    Provide exactly {n_neg} negative points

  73. [73]

    Positive points must lie on the target object

  74. [74]

    Negative points should lie on non-target objects or background regions

  75. [75]

    First decompose the target object into visible semantic parts. 42 J. Li et al

  76. [76]

    Distribute positive points across different visible parts whenever possible

  77. [77]

    part_decomposition

    If failure analysis is provided, prioritize correcting the detected issues, such as: - missing_parts - extra_objects - wrong_positive_points - wrong_negative_points Return JSON: { "part_decomposition": [ { "part_name": string, "is_visible": bool, "importance": "high|medium|low" } ], "positive_points": [ { "x": float, "y": float, "part_name": string } ], "...

  78. [78]

    Original image with mask overlay

  79. [79]

    Binary mask image (white = mask)

  80. [80]

    Foreground cutout (mask region only)

Showing first 80 references.