HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Boran Wen; Chenyang Yu; Haoyuan Wang; Jiaxin Li; Li Zhang; Mingyu Zhang; Ruiyang Liu; Su Linxiang; Xinrui Shi; Yichen Zhao

arxiv: 2606.28215 · v1 · pith:NNWP5NRNnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI· cs.GR

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Jiaxin Li , Yuxiang Wu , Zhenkai Zhang , Xinrui Shi , Haoyuan Wang , Yichen Zhao , Su Linxiang , Chenyang Yu

show 6 more authors

Mingyu Zhang Yifan Ding Boran Wen Li Zhang Ruiyang Liu Yong-Lu Li

This is my paper

Pith reviewed 2026-06-29 04:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.GR

keywords 4D reconstructionmonocular videomulti-object interactionshuman-in-the-loopvision-language modelsphysical plausibilitytemporal consistencydata generation

0 comments

The pith

HAT-4D reconstructs 4D multi-object interactions from monocular videos by combining vision models with targeted human feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HAT-4D as a framework that lifts single-camera videos of multiple objects interacting into complete 4D models containing 3D shapes, motion over time, and physical contacts. Prior monocular methods often break down when objects overlap or depth is unclear, but this system pairs vision-language models with limited human corrections at several stages to correct those issues and produce usable assets. The resulting data engine also supplies a new benchmark called MVOIK-4D and shows that the outputs can raise the accuracy of other models when used for further training.

Core claim

HAT-4D is the first agentic framework that integrates VLMs with a multi-level human-in-the-loop feedback mechanism to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single monocular video, yielding physically plausible assets without multi-camera rigs and enabling the MVOIK-4D benchmark.

What carries the argument

The multi-level human-in-the-loop feedback mechanism integrated with VLMs that resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation.

If this is right

HAT-4D reaches state-of-the-art scores on most evaluation metrics while preserving competitive semantic alignment.
Datasets generated by HAT-4D raise baseline model performance when used for fine-tuning.
The framework produces the open-world MVOIK-4D benchmark together with a multi-dimensional protocol that tests physical plausibility and temporal consistency.
The method supplies scalable 4D interaction data for Embodied AI and VLA training without requiring multi-camera capture setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback loop could be tested on single-object or static-scene videos to measure how much human input is still needed once interaction complexity drops.
If the generated assets prove stable in physics simulators, they could directly seed large-scale robot training environments drawn from everyday video sources.
The benchmark protocol itself might serve as a template for evaluating other video-to-4D pipelines that claim physical realism.

Load-bearing premise

A small amount of human feedback suffices to resolve depth ambiguities and occlusions consistently across diverse in-the-wild videos.

What would settle it

Compare physical-plausibility and temporal-consistency scores on a held-out set of complex interaction videos when the system runs with zero human feedback versus the reported human-assisted version; a negligible gap would undermine the assumption.

Figures

Figures reproduced from arXiv: 2606.28215 by Boran Wen, Chenyang Yu, Haoyuan Wang, Jiaxin Li, Li Zhang, Mingyu Zhang, Ruiyang Liu, Su Linxiang, Xinrui Shi, Yichen Zhao, Yifan Ding, Yong-Lu Li, Yuxiang WU, Zhenkai Zhang.

**Figure 2.** Figure 2: Overview of HAT-4D. The framework extracts an interaction knowledge graph from a monocular video, reconstructs and composes the interacting objects, and generates subsequent 3D states through memory-guided 4D propagation. The evaluation agent updates the memory with successful results and feeds failed cases back for repair, while human users can refine intermediate outputs throughout the pipeline. through … view at source ↗

**Figure 3.** Figure 3: Overview of the Interaction Knowledge Graph (IKG) formulation. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Construction of MVOIK-4D and annotation of MemoryOIK-4D. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison with baseline methods. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of human intervention and view scaling on 4D reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Failure cases. Left: pressing and folding a thin plastic carton causes complex deformation, distorting novel-view geometry and over-smoothing details. Right: the spring toy’s rapid extension and recoil cause motion blur and geometric inconsistencies. views are rendered around it and used as input views. Another four captured viewpoints are reserved as validation views for evaluation. To study the effect of… view at source ↗

**Figure 8.** Figure 8: Overview of the HAT-4D framework. (a) Object interaction understanding: given a monocular video, an interaction understanding agent extracts an interaction knowledge graph describing objects and their interactions. (b) 3D object generation and composition: conditioned on the object information and the t-th frame, the 3D Object Generation Agent reconstructs individual 3D objects, which are then composed a… view at source ↗

**Figure 9.** Figure 9: User interface for Gaussian-level and Region-level interactive refine [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: User interface for object-level interactive refinement. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Distribution of the top-20 scene types in MVOIK-4D measured by case count. The box open scenario appears most frequently. Although the interaction type is similar, the objects contained in the box vary across scenes, making these cases particularly useful for evaluating a model’s ability to maintain object memory when objects are temporarily occluded during box opening. Evaluation protocol. To assess the … view at source ↗

**Figure 12.** Figure 12: Distribution of the top-20 interaction types in MVOIK-4D by case count. Only the 20 most frequent interaction categories are shown. The dataset covers a wide range of object interaction behaviors, including manipulation (e.g., grip, move), state changes (e.g., open, cover), and physical interactions (e.g., cut, collide, squeeze), reflecting the diversity of real-world object interactions. 10.1 Overall Rec… view at source ↗

**Figure 13.** Figure 13: Distribution of the top-20 object categories in MVOIK-4D by case count. Only the 20 most frequent object types are shown. The dataset contains a diverse set of everyday objects, including containers (box, cup, lid), tools (knife, clip, clamp), and deformable or manipulable objects (apple, banana, modeling clay), reflecting the variety of objects involved in real-world interactions. For each viewpoint v, w… view at source ↗

**Figure 14.** Figure 14: Qualitative visualization of temporal memory metric. The top timeline illustrates a sequence where a shoebox rotates 180°. We compare L4GM model with the Ground Truth (GT) at Frame 51 (front view) and Frame 103 (occluded/back view). The segmentation masks are color-coded: yellow indicates pixels matched with the current input frame, green represents successful retrieval from past memory (demonstrating t… view at source ↗

**Figure 15.** Figure 15: More qualitative comparison between different baselines and [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗

**Figure 16.** Figure 16: More multi-view data in MVOIK-4D dataset. [PITH_FULL_IMAGE:figures/full_fig_p034_16.png] view at source ↗

read the original abstract

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at https://lijiaxin0111.github.io/HAT4D/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HAT-4D adds a VLM-plus-multi-level-human loop for multi-object 4D from monocular video and releases a new benchmark, but the abstract gives no numbers on feedback volume or consistency.

read the letter

The main takeaway is that this work targets the data bottleneck for embodied AI by turning ordinary videos into 4D multi-object interaction assets. It does so with an agentic setup that feeds VLMs and then applies human corrections at several stages to fix depth and occlusion problems that break existing monocular methods. They also introduce the MVOIK-4D benchmark and a multi-dimensional eval protocol centered on physical plausibility and temporal consistency, and they report that the resulting data improves downstream fine-tuning.

What the paper actually contributes is the shift from single-object to interaction-focused reconstruction plus the explicit human-in-the-loop design. Releasing code and data is a concrete plus for anyone who wants to test the pipeline.

The soft spot is exactly the one flagged in the stress-test note. The abstract claims that only a small amount of human feedback is enough to make the outputs scalable and physically plausible, yet it supplies no counts of interventions per video, total annotator time, inter-video variance, or failure cases when feedback is reduced. The SOTA and ablation statements are mentioned without baseline details, error bars, or split information, so it is impossible to judge whether the gains are meaningful. If the full paper contains those numbers and shows the feedback cost stays low across diverse videos, the central claim strengthens; otherwise the practicality argument stays unverified.

This is aimed at groups working on 4D reconstruction pipelines or data engines for VLAs and robotics. A reader who needs interaction datasets would find the benchmark and released assets useful if the methods check out.

I would send it to peer review. The problem is real, the framing is new, and the release lowers the barrier for follow-up work, but referees will need to press on the missing quantitative support for the human component.

Referee Report

2 major / 1 minor

Summary. The paper introduces HAT-4D, the first agentic framework that integrates VLMs with a multi-level human-in-the-loop feedback mechanism to reconstruct 3D geometry, temporal dynamics, and physical interactions of multiple objects from monocular in-the-wild videos. It produces the MVOIK-4D benchmark with a new multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency, claims SOTA results on most metrics while maintaining semantic alignment, shows via ablations that small human feedback improves reconstruction, and demonstrates that the generated data improves baseline performance on fine-tuning.

Significance. If the central claims hold with proper quantitative support, HAT-4D would offer a practical, hardware-light pathway for generating large-scale 4D interaction data for Embodied AI and VLAs. The open benchmark, evaluation protocol, and code release would provide a reusable resource for standardizing assessment of monocular 4D multi-object methods.

major comments (2)

[Abstract] Abstract: the claim that 'a small amount of human feedback improves interaction reconstruction' and enables scalable resolution of depth ambiguities/occlusions is load-bearing for both the SOTA reconstruction results and the downstream data-improvement claim, yet the abstract (and the reviewed portions) supply no quantitative metrics on feedback volume (interventions per video or total time), inter-video consistency, or failure modes when feedback is minimized.
[Abstract] Abstract: SOTA performance is asserted on 'most evaluation metrics' with 'competitive semantic alignment,' but the abstract provides no details on baselines, error bars, dataset splits, or the precise definition/measurement of physical plausibility, preventing verification of the central empirical claims.

minor comments (1)

[Abstract] The abstract states that data and code are available at a GitHub link; confirming that the released assets include the exact human-feedback logs and evaluation scripts would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each point below and will revise the abstract accordingly to strengthen the presentation of our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'a small amount of human feedback improves interaction reconstruction' and enables scalable resolution of depth ambiguities/occlusions is load-bearing for both the SOTA reconstruction results and the downstream data-improvement claim, yet the abstract (and the reviewed portions) supply no quantitative metrics on feedback volume (interventions per video or total time), inter-video consistency, or failure modes when feedback is minimized.

Authors: We agree that the abstract would be strengthened by including quantitative details on human feedback volume and costs. The manuscript's ablation studies (Section 4.3) demonstrate the benefit of limited human input, and we will revise the abstract to report key figures such as average interventions per video and time per sequence drawn from those experiments. revision: yes
Referee: [Abstract] Abstract: SOTA performance is asserted on 'most evaluation metrics' with 'competitive semantic alignment,' but the abstract provides no details on baselines, error bars, dataset splits, or the precise definition/measurement of physical plausibility, preventing verification of the central empirical claims.

Authors: We acknowledge that the abstract is high-level and omits these specifics. The experiments section details the baselines, MVOIK-4D splits, metrics (including physical plausibility and temporal consistency), and results with variability measures. We will revise the abstract to name the primary baselines and briefly define the evaluation focus. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents HAT-4D as an agentic framework that integrates external VLMs with a multi-level human-in-the-loop mechanism to resolve depth ambiguities and occlusions in monocular 4D reconstruction. No equations, fitted parameters, or self-referential predictions appear in the abstract or description. Claims of SOTA performance and data utility rest on empirical results from external components rather than any reduction to self-defined inputs, self-citations, or ansatzes. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is therefore minimal and provisional.

axioms (1)

domain assumption VLMs combined with limited human feedback can reliably resolve depth ambiguities and occlusions in multi-object interaction videos
This is the core mechanism claimed to enable the reconstruction.

pith-pipeline@v0.9.1-grok · 5852 in / 1191 out tokens · 30045 ms · 2026-06-29T04:14:10.503714+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

100 extracted references · 2 canonical work pages

[1]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

Pith/arXiv arXiv 2025
[2]

arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023
[3]

Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

Budd, S., Robinson, E.C., Kainz, B.: A survey on active learning and human-in- the-loop deep learning for medical image analysis. Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

work page doi:10.1016/j.media.2021.102062 2021
[4]

Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment anything in 3d with radiance fields (2024),https://arxiv.org/abs/ 2304.12308

arXiv 2024
[5]

arXiv preprint arXiv:2601.14253 (2026)

Chen, H., Chen, X., Zhang, Y., Xu, Z., Chen, A.: Motion 3-to-4: 3d motion recon- struction for 4d synthesis. arXiv preprint arXiv:2601.14253 (2026)

arXiv 2026
[6]

arXiv preprint arXiv:2503.09631 (2025)

Chen, J., Zhang, B., Tang, X., Wonka, P.: V2m4: 4d mesh animation reconstruction from a single monocular video. arXiv preprint arXiv:2503.09631 (2025)

arXiv 2025
[7]

Chen, J., Gao, D., Lin, K.Q., Shou, M.Z.: Affordance grounding from demonstra- tion video to target image (2023),https://arxiv.org/abs/2303.14644

arXiv 2023
[8]

In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

Chen, Y., Guo, S., Yang, T., Ding, L., Yu, X., Gu, J., Xue, T.: 4dslomo: 4d recon- struction for high speed scene with asynchronous capture. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)

2025
[9]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, Z., Li, J., Liang, J., Tan, L., Guo, Y., Lu, C., Li, Y.L.: Mˆ 3-vos: Multi-phase, multi-transition, and multi-scenery video object segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29193–29202 (2025)

2025
[10]

Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms (2024),https://arxiv.org/abs/2406.07476

Pith/arXiv arXiv 2024
[11]

Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

2023
[12]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

2023
[13]

Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning ap- proach for object affordance detection (2018),https://arxiv.org/abs/1709. 07326

2018
[14]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Geng, C., Zhang, Y., Wu, S., Wu, J.: Birth and death of a rose. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26102–26113 (2025)

2025
[15]

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d (2024), https://arxiv.org/abs/2311.04400

Pith/arXiv arXiv 2024
[16]

Hou, C., Ze, Y., Fu, Y., Gao, Z., Hu, S., Yu, Y., Zhang, S., Xu, H.: 4d visual pre-training for robot learning (2025),https://arxiv.org/abs/2508.17230

arXiv 2025
[17]

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models (2023),https:// arxiv.org/abs/2311.17982

arXiv 2023
[18]

Hunyuan3D, T., :, Zhang, B., Guo, C., Guo, D., Liu, H., Yan, H., Shi, H., Yu, J., Xu, J., Huang, J., Li, K., Wang, L., Linus, Wang, P., Lin, Q., Tang, R., Yang, X., Li, Y., Guan, Y., Zhao, Y., Yang, Y., Lai, Z., Liang, Z., Zhao, Z.: Hy3d-bench: Generation of 3d assets (2026),https://arxiv.org/abs/2602.03907

arXiv 2026
[19]

Ji, K., Shi, Y., Jin, Z., Chen, K., Xu, L., Ma, Y., Yu, J., Wang, J.: Towards immersive human-x interaction: A real-time framework for physically plausible motion synthesis (2025),https://arxiv.org/abs/2508.02106

arXiv 2025
[20]

arXiv preprint arXiv:2311.02848 (2023)

Jiang,Y.,Zhang,L.,Gao,J.,Hu,W.,Yao,Y.:Consistent4d:Consistent360{\deg} dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848 (2023)

arXiv 2023
[21]

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023),https://arxiv.org/abs/2304.02643

Pith/arXiv arXiv 2023
[22]

Krishna,R.,Zhu,Y.,Groth,O.,Johnson,J.,Hata,K.,Kravitz,J.,Chen,S.,Kalan- tidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual genome: Con- necting language and vision using crowdsourced dense image annotations (2016), https://arxiv.org/abs/1602.07332

Pith/arXiv arXiv 2016
[23]

Advances in Neural Information Processing Systems37, 62189–62222 (2024)

Li, B., Zheng, C., Zhu, W., Mai, J., Zhang, B., Wonka, P., Ghanem, B.: Vivid-zoo: Multi-view video generation with diffusion model. Advances in Neural Information Processing Systems37, 62189–62222 (2024)

2024
[24]

arXiv preprint arXiv:2503.20784 (2025)

Li, J., Gao, H.a., Li, W., Chi, H., Liu, C., Du, C., Liu, Y., Gao, M., Zhang, G., Zhang, Z., et al.: Fb-4d: Spatial-temporal coherent dynamic 3d content generation with feature banks. arXiv preprint arXiv:2503.20784 (2025)

arXiv 2025
[25]

Li, R., Zhang, S., He, X.: Sgtr: End-to-end scene graph generation with transformer (2022),https://arxiv.org/abs/2112.12970

arXiv 2022
[26]

org/abs/2506.15483

Li, S., Zhang, H., Chen, X., Wang, Y., Ban, Y.: Genhoi: Generalizing text-driven 4d human-object interaction synthesis for unseen objects (2025),https://arxiv. org/abs/2506.15483

arXiv 2025
[27]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, Y.L., Xu, Y., Xu, X., Mao, X., Yao, Y., Liu, S., Lu, C.: Beyond object recog- nition: A new benchmark towards object concept learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20029–20040 (2023) 18 J. Li et al

2023
[28]

Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot one image to 3d object (2023),https://arxiv.org/abs/2303. 11328

2023
[29]

In: Pro- ceedings of the AAAI Conference on Artificial Intelligence

Liu, X., Wen, B., Liu, X., Zhou, Z., Fan, H., Lu, C., Ma, L., Chen, Y., Li, Y.L.: Interacted object grounding in spatio-temporal human-object interactions. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5622–5630 (2025)

2025
[30]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

Lu, C.Y., Zhou, P., Xing, A., Pokhariya, C., Dey, A., Shah, I.N., Mavidipalli, R., Hu, D., Comport, A.I., Chen, K., et al.: Diva-360: The dynamic visual dataset for immersive neural fields. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 22466–22476 (2024)

2024
[31]

arXiv preprint arXiv:2506.18890 (2025)

Ma, Z., Chen, X., Yu, S., Bi, S., Zhang, K., Ziwen, C., Xu, S., Yang, J., Xu, Z., Sunkavalli, K., et al.: 4d-lrm: Large space-time reconstruction model from and to any view at any time. arXiv preprint arXiv:2506.18890 (2025)

arXiv 2025
[32]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion gener- ation for arbitrary objects. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14357–14368 (2025)

2025
[33]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021
[34]

Advances in Neural Information Processing Systems37, 56828–56858 (2024)

Ren, J., Xie, C., Mirzaei, A., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H., et al.: L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems37, 56828–56858 (2024)

2024
[35]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

2021
[36]

Settles, B.: Active learning literature survey (2009)

2009
[37]

arXiv preprint arXiv:2508.10104 (2025)

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

Pith/arXiv arXiv 2025
[38]

Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: Sam 3d: 3dfy anything in images (2025),https://arxiv.org/abs/2511.16624

Pith/arXiv arXiv 2025
[39]

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation (2019)

2019
[40]

In: European Conference on Computer Vision

Voleti, V., Yao, C.H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., Laforte, C., Rombach, R., Jampani, V.: Sv3d: Novel multi-view synthesis and 3d genera- tion from a single image using latent video diffusion. In: European Conference on Computer Vision. pp. 439–457. Springer (2024)

2024
[41]

Wang, B., Wu, V., Wu, B., Keutzer, K.: Latte: Accelerating lidar point cloud annotation via sensor fusion, one-click annotation, and tracking (2019),https: //arxiv.org/abs/1904.09085

Pith/arXiv arXiv 2019
[42]

IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019)

Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., Doel, T., David, A.L., Deprest, J., Ourselin, S., Vercauteren, T.: Deepigeos: A deep interactive geodesic framework for medical image segmentation. IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019). https://doi.org/10.1109/tpami.2018.2840695,htt...

work page doi:10.1109/tpami.2018.2840695 2019
[43]

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer (2025),https://arxiv.org/abs/2503. 11651

2025
[44]

Wang, Y., Li, K., Li, X., Yu, J., He, Y., Wang, C., Chen, G., Pei, B., Yan, Z., Zheng, R., Xu, J., Wang, Z., Shi, Y., Jiang, T., Li, S., Zhang, H., Huang, Y., Qiao, Y., Wang, Y., Wang, L.: Internvideo2: Scaling foundation models for multimodal video understanding (2024),https://arxiv.org/abs/2403.15377

arXiv 2024
[45]

arXiv preprint arXiv:2601.11421 (2026)

Wang, Z., Liu, C., Xiang, Y., Zhang, R., Hao, Q., Lu, H., Chen, H., Feng, Z., Zheng, K., Ye, D., et al.: The great march 100: 100 detail-oriented tasks for evaluating embodied ai agents. arXiv preprint arXiv:2601.11421 (2026)

arXiv 2026
[46]

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024),https: //arxiv.org/abs/2310.08528

arXiv 2024
[47]

arXiv preprint arXiv:2407.17470 (2024)

Xie, Y., Yao, C.H., Voleti, V., Jiang, H., Jampani, V.: Sv4d: Dynamic 3d con- tent generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470 (2024)

arXiv 2024
[48]

arXiv preprint arXiv:2503.16396 (2025)

Yao, C.H., Xie, Y., Voleti, V., Jiang, H., Jampani, V.: Sv4d 2.0: Enhancing spatio- temporal consistency in multi-view video diffusion for high-quality 4d generation. arXiv preprint arXiv:2503.16396 (2025)

arXiv 2025
[49]

Yu, W., Zhu, S., Yang, T., Chen, C.: Consistency-based active learning for object detection (2022),https://arxiv.org/abs/2103.10374

arXiv 2022
[50]

Zeng, J., Bu, Q., Wang, B., Xia, W., Chen, L., Dong, H., Song, H., Wang, D., Hu, D., Luo, P., Cui, H., Zhao, B., Li, X., Qiao, Y., Li, H.: Learning manipulation by predicting interaction (2024),https://arxiv.org/abs/2406.00439

arXiv 2024
[51]

In: European Confer- ence on Computer Vision

Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: European Confer- ence on Computer Vision. pp. 163–179. Springer (2024)

2024
[52]

arXiv preprint arXiv:2507.23785 (2025)

Zhang, B., Xu, S., Wang, C., Yang, J., Zhao, F., Chen, D., Guo, B.: Gaus- sian variation field diffusion for high-fidelity video-to-4d synthesis. arXiv preprint arXiv:2507.23785 (2025)

arXiv 2025
[53]

Zhang, C., Moing, G.L., Koppula, S., Rocco, I., Momeni, L., Xie, J., Sun, S., Sukthankar, R., Barral, J.K., Hadsell, R., Ghahramani, Z., Zisserman, A., Zhang, J., Sajjadi, M.S.M.: Efficiently reconstructing dynamic scenes one d4rt at a time (2025),https://arxiv.org/abs/2512.08924

arXiv 2025
[54]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, M., Zhuo, L., Tan, T., Xie, G., Nie, X., Li, Y., Zhao, R., He, Z., Wang, Z., Cai, J., et al.: Ipr-1: Interactive physical reasoner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33415– 33425 (2026)

2026
[55]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

2018
[56]

arXiv preprint arXiv:2508.10868 (2025) 20 J

Zhang, Y., Zhang, L., Ma, R., Cao, N.: Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868 (2025) 20 J. Li et al. HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human–Agent Collaboration Supplementary Material We introduce: –More implementation details of the HAT-4D framework in Sec.7, ad...

arXiv 2025
[57]

Do NOT output numeric pose or distance
[58]

distance_hint must be one of: {near_zero, near, mid, far, unknown}
[59]

Focus on events that start, end, or change within this segment
[60]

left cup

Keep object naming consistent with previous memory. Object naming rules: 4.1 Prefer distinguishing objects by appearance (color, texture, shape, size, damage, stain, marks). 4.2 Objects should represent real object instances. Avoid pure spatial region names. 36 J. Li et al. 4.3 Only when objects are visually indistinguishable, spatial qualifiers such as "...
[61]

Objects should be ordered from +X to -X (more negative means farther)

Output x-axis depth ordering timeline. Objects should be ordered from +X to -X (more negative means farther)
[62]

segment_id

Output key pairwise spatial relations: horizontal (left/right/overlap) vertical (above/below/overlap) optional depth (front/behind/same_depth). segment_id: {segment_id} object_text_hint: {object_text} relation_hint: {relation_hint} segment_frame_tokens: {segment_tokens} frame_mapping: {token_lines} previous_segment_memory: {memory_text} Return JSON with k...
[63]

Merge duplicated object identities
[64]

Resolve temporal conflicts across segments
[65]

left cup

Keep only symbolic spatial relations. Do not output numeric pose or distance. Object merging rules: 3.1 Prefer instance names distinguished by 38 J. Li et al. appearance details (color, texture, shape, size, markings). 3.2 If objects are visually indistinguishable, spatial qualifiers such as "left cup" / "right cup" are allowed, but must remain consistent...
[66]

Output ordering from +X to -X (more negative means farther)

Merge x-axis depth ordering timelines. Output ordering from +X to -X (more negative means farther)
[67]

objects": [string],

Merge relative_position_timeline across the entire video. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} all_frame_tokens: {frame_tokens} segment_fragments_json: {segment_fragments_json} segment_memories_json: {segment_memories_json} Return JSON with keys: { "objects": [string], "object_meta": [ { "id": string, "name": string, "cat...
[68]

Use world coordinate convention: objects ordered from +X to -X (more negative means farther)
[69]

Provide time-sliced depth ordering covering the video timeline
[70]

x_depth_order_timeline

If information is insufficient, output a single global ordering. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} frame_tokens: {frame_tokens} current_scene_graph_json: {current_scene_graph_json} segment_memories_json: {segment_memories_json} Return JSON: { "x_depth_order_timeline": [ { "frame_range": string, "start_frame_index": int...
[71]

Provide exactly {n_pos} positive points
[72]

Provide exactly {n_neg} negative points
[73]

Positive points must lie on the target object
[74]

Negative points should lie on non-target objects or background regions
[75]

First decompose the target object into visible semantic parts. 42 J. Li et al
[76]

Distribute positive points across different visible parts whenever possible
[77]

part_decomposition

If failure analysis is provided, prioritize correcting the detected issues, such as: - missing_parts - extra_objects - wrong_positive_points - wrong_negative_points Return JSON: { "part_decomposition": [ { "part_name": string, "is_visible": bool, "importance": "high|medium|low" } ], "positive_points": [ { "x": float, "y": float, "part_name": string } ], "...
[78]

Original image with mask overlay
[79]

Binary mask image (white = mask)
[80]

Foreground cutout (mask region only)

Showing first 80 references.

[1] [1]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

Pith/arXiv arXiv 2025

[2] [2]

arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023

[3] [3]

Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

Budd, S., Robinson, E.C., Kainz, B.: A survey on active learning and human-in- the-loop deep learning for medical image analysis. Medical Image Analysis71, 102062 (Jul 2021).https://doi.org/10.1016/j.media.2021.102062,http:// dx.doi.org/10.1016/j.media.2021.102062

work page doi:10.1016/j.media.2021.102062 2021

[4] [4]

Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment anything in 3d with radiance fields (2024),https://arxiv.org/abs/ 2304.12308

arXiv 2024

[5] [5]

arXiv preprint arXiv:2601.14253 (2026)

Chen, H., Chen, X., Zhang, Y., Xu, Z., Chen, A.: Motion 3-to-4: 3d motion recon- struction for 4d synthesis. arXiv preprint arXiv:2601.14253 (2026)

arXiv 2026

[6] [6]

arXiv preprint arXiv:2503.09631 (2025)

Chen, J., Zhang, B., Tang, X., Wonka, P.: V2m4: 4d mesh animation reconstruction from a single monocular video. arXiv preprint arXiv:2503.09631 (2025)

arXiv 2025

[7] [7]

Chen, J., Gao, D., Lin, K.Q., Shou, M.Z.: Affordance grounding from demonstra- tion video to target image (2023),https://arxiv.org/abs/2303.14644

arXiv 2023

[8] [8]

In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

Chen, Y., Guo, S., Yang, T., Ding, L., Yu, X., Gu, J., Xue, T.: 4dslomo: 4d recon- struction for high speed scene with asynchronous capture. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)

2025

[9] [9]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, Z., Li, J., Liang, J., Tan, L., Guo, Y., Lu, C., Li, Y.L.: Mˆ 3-vos: Multi-phase, multi-transition, and multi-scenery video object segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29193–29202 (2025)

2025

[10] [10]

Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms (2024),https://arxiv.org/abs/2406.07476

Pith/arXiv arXiv 2024

[11] [11]

Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems36, 35799–35813 (2023) HAT-4D 17

2023

[12] [12]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

2023

[13] [13]

Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning ap- proach for object affordance detection (2018),https://arxiv.org/abs/1709. 07326

2018

[14] [14]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Geng, C., Zhang, Y., Wu, S., Wu, J.: Birth and death of a rose. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26102–26113 (2025)

2025

[15] [15]

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d (2024), https://arxiv.org/abs/2311.04400

Pith/arXiv arXiv 2024

[16] [16]

Hou, C., Ze, Y., Fu, Y., Gao, Z., Hu, S., Yu, Y., Zhang, S., Xu, H.: 4d visual pre-training for robot learning (2025),https://arxiv.org/abs/2508.17230

arXiv 2025

[17] [17]

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models (2023),https:// arxiv.org/abs/2311.17982

arXiv 2023

[18] [18]

Hunyuan3D, T., :, Zhang, B., Guo, C., Guo, D., Liu, H., Yan, H., Shi, H., Yu, J., Xu, J., Huang, J., Li, K., Wang, L., Linus, Wang, P., Lin, Q., Tang, R., Yang, X., Li, Y., Guan, Y., Zhao, Y., Yang, Y., Lai, Z., Liang, Z., Zhao, Z.: Hy3d-bench: Generation of 3d assets (2026),https://arxiv.org/abs/2602.03907

arXiv 2026

[19] [19]

Ji, K., Shi, Y., Jin, Z., Chen, K., Xu, L., Ma, Y., Yu, J., Wang, J.: Towards immersive human-x interaction: A real-time framework for physically plausible motion synthesis (2025),https://arxiv.org/abs/2508.02106

arXiv 2025

[20] [20]

arXiv preprint arXiv:2311.02848 (2023)

Jiang,Y.,Zhang,L.,Gao,J.,Hu,W.,Yao,Y.:Consistent4d:Consistent360{\deg} dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848 (2023)

arXiv 2023

[21] [21]

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023),https://arxiv.org/abs/2304.02643

Pith/arXiv arXiv 2023

[22] [22]

Krishna,R.,Zhu,Y.,Groth,O.,Johnson,J.,Hata,K.,Kravitz,J.,Chen,S.,Kalan- tidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual genome: Con- necting language and vision using crowdsourced dense image annotations (2016), https://arxiv.org/abs/1602.07332

Pith/arXiv arXiv 2016

[23] [23]

Advances in Neural Information Processing Systems37, 62189–62222 (2024)

Li, B., Zheng, C., Zhu, W., Mai, J., Zhang, B., Wonka, P., Ghanem, B.: Vivid-zoo: Multi-view video generation with diffusion model. Advances in Neural Information Processing Systems37, 62189–62222 (2024)

2024

[24] [24]

arXiv preprint arXiv:2503.20784 (2025)

Li, J., Gao, H.a., Li, W., Chi, H., Liu, C., Du, C., Liu, Y., Gao, M., Zhang, G., Zhang, Z., et al.: Fb-4d: Spatial-temporal coherent dynamic 3d content generation with feature banks. arXiv preprint arXiv:2503.20784 (2025)

arXiv 2025

[25] [25]

Li, R., Zhang, S., He, X.: Sgtr: End-to-end scene graph generation with transformer (2022),https://arxiv.org/abs/2112.12970

arXiv 2022

[26] [26]

org/abs/2506.15483

Li, S., Zhang, H., Chen, X., Wang, Y., Ban, Y.: Genhoi: Generalizing text-driven 4d human-object interaction synthesis for unseen objects (2025),https://arxiv. org/abs/2506.15483

arXiv 2025

[27] [27]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, Y.L., Xu, Y., Xu, X., Mao, X., Yao, Y., Liu, S., Lu, C.: Beyond object recog- nition: A new benchmark towards object concept learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20029–20040 (2023) 18 J. Li et al

2023

[28] [28]

Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot one image to 3d object (2023),https://arxiv.org/abs/2303. 11328

2023

[29] [29]

In: Pro- ceedings of the AAAI Conference on Artificial Intelligence

Liu, X., Wen, B., Liu, X., Zhou, Z., Fan, H., Lu, C., Ma, L., Chen, Y., Li, Y.L.: Interacted object grounding in spatio-temporal human-object interactions. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5622–5630 (2025)

2025

[30] [30]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

Lu, C.Y., Zhou, P., Xing, A., Pokhariya, C., Dey, A., Shah, I.N., Mavidipalli, R., Hu, D., Comport, A.I., Chen, K., et al.: Diva-360: The dynamic visual dataset for immersive neural fields. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 22466–22476 (2024)

2024

[31] [31]

arXiv preprint arXiv:2506.18890 (2025)

Ma, Z., Chen, X., Yu, S., Bi, S., Zhang, K., Ziwen, C., Xu, S., Yang, J., Xu, Z., Sunkavalli, K., et al.: 4d-lrm: Large space-time reconstruction model from and to any view at any time. arXiv preprint arXiv:2506.18890 (2025)

arXiv 2025

[32] [32]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion gener- ation for arbitrary objects. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14357–14368 (2025)

2025

[33] [33]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021

[34] [34]

Advances in Neural Information Processing Systems37, 56828–56858 (2024)

Ren, J., Xie, C., Mirzaei, A., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H., et al.: L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems37, 56828–56858 (2024)

2024

[35] [35]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

2021

[36] [36]

Settles, B.: Active learning literature survey (2009)

2009

[37] [37]

arXiv preprint arXiv:2508.10104 (2025)

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

Pith/arXiv arXiv 2025

[38] [38]

Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: Sam 3d: 3dfy anything in images (2025),https://arxiv.org/abs/2511.16624

Pith/arXiv arXiv 2025

[39] [39]

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation (2019)

2019

[40] [40]

In: European Conference on Computer Vision

Voleti, V., Yao, C.H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., Laforte, C., Rombach, R., Jampani, V.: Sv3d: Novel multi-view synthesis and 3d genera- tion from a single image using latent video diffusion. In: European Conference on Computer Vision. pp. 439–457. Springer (2024)

2024

[41] [41]

Wang, B., Wu, V., Wu, B., Keutzer, K.: Latte: Accelerating lidar point cloud annotation via sensor fusion, one-click annotation, and tracking (2019),https: //arxiv.org/abs/1904.09085

Pith/arXiv arXiv 2019

[42] [42]

IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019)

Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., Doel, T., David, A.L., Deprest, J., Ourselin, S., Vercauteren, T.: Deepigeos: A deep interactive geodesic framework for medical image segmentation. IEEE Transac- tions on Pattern Analysis and Machine Intelligence41(7), 1559–1572 (Jul 2019). https://doi.org/10.1109/tpami.2018.2840695,htt...

work page doi:10.1109/tpami.2018.2840695 2019

[43] [43]

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer (2025),https://arxiv.org/abs/2503. 11651

2025

[44] [44]

Wang, Y., Li, K., Li, X., Yu, J., He, Y., Wang, C., Chen, G., Pei, B., Yan, Z., Zheng, R., Xu, J., Wang, Z., Shi, Y., Jiang, T., Li, S., Zhang, H., Huang, Y., Qiao, Y., Wang, Y., Wang, L.: Internvideo2: Scaling foundation models for multimodal video understanding (2024),https://arxiv.org/abs/2403.15377

arXiv 2024

[45] [45]

arXiv preprint arXiv:2601.11421 (2026)

Wang, Z., Liu, C., Xiang, Y., Zhang, R., Hao, Q., Lu, H., Chen, H., Feng, Z., Zheng, K., Ye, D., et al.: The great march 100: 100 detail-oriented tasks for evaluating embodied ai agents. arXiv preprint arXiv:2601.11421 (2026)

arXiv 2026

[46] [46]

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024),https: //arxiv.org/abs/2310.08528

arXiv 2024

[47] [47]

arXiv preprint arXiv:2407.17470 (2024)

Xie, Y., Yao, C.H., Voleti, V., Jiang, H., Jampani, V.: Sv4d: Dynamic 3d con- tent generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470 (2024)

arXiv 2024

[48] [48]

arXiv preprint arXiv:2503.16396 (2025)

Yao, C.H., Xie, Y., Voleti, V., Jiang, H., Jampani, V.: Sv4d 2.0: Enhancing spatio- temporal consistency in multi-view video diffusion for high-quality 4d generation. arXiv preprint arXiv:2503.16396 (2025)

arXiv 2025

[49] [49]

Yu, W., Zhu, S., Yang, T., Chen, C.: Consistency-based active learning for object detection (2022),https://arxiv.org/abs/2103.10374

arXiv 2022

[50] [50]

Zeng, J., Bu, Q., Wang, B., Xia, W., Chen, L., Dong, H., Song, H., Wang, D., Hu, D., Luo, P., Cui, H., Zhao, B., Li, X., Qiao, Y., Li, H.: Learning manipulation by predicting interaction (2024),https://arxiv.org/abs/2406.00439

arXiv 2024

[51] [51]

In: European Confer- ence on Computer Vision

Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: European Confer- ence on Computer Vision. pp. 163–179. Springer (2024)

2024

[52] [52]

arXiv preprint arXiv:2507.23785 (2025)

Zhang, B., Xu, S., Wang, C., Yang, J., Zhao, F., Chen, D., Guo, B.: Gaus- sian variation field diffusion for high-fidelity video-to-4d synthesis. arXiv preprint arXiv:2507.23785 (2025)

arXiv 2025

[53] [53]

Zhang, C., Moing, G.L., Koppula, S., Rocco, I., Momeni, L., Xie, J., Sun, S., Sukthankar, R., Barral, J.K., Hadsell, R., Ghahramani, Z., Zisserman, A., Zhang, J., Sajjadi, M.S.M.: Efficiently reconstructing dynamic scenes one d4rt at a time (2025),https://arxiv.org/abs/2512.08924

arXiv 2025

[54] [54]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, M., Zhuo, L., Tan, T., Xie, G., Nie, X., Li, Y., Zhao, R., He, Z., Wang, Z., Cai, J., et al.: Ipr-1: Interactive physical reasoner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33415– 33425 (2026)

2026

[55] [55]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

2018

[56] [56]

arXiv preprint arXiv:2508.10868 (2025) 20 J

Zhang, Y., Zhang, L., Ma, R., Cao, N.: Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868 (2025) 20 J. Li et al. HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human–Agent Collaboration Supplementary Material We introduce: –More implementation details of the HAT-4D framework in Sec.7, ad...

arXiv 2025

[57] [57]

Do NOT output numeric pose or distance

[58] [58]

distance_hint must be one of: {near_zero, near, mid, far, unknown}

[59] [59]

Focus on events that start, end, or change within this segment

[60] [60]

left cup

Keep object naming consistent with previous memory. Object naming rules: 4.1 Prefer distinguishing objects by appearance (color, texture, shape, size, damage, stain, marks). 4.2 Objects should represent real object instances. Avoid pure spatial region names. 36 J. Li et al. 4.3 Only when objects are visually indistinguishable, spatial qualifiers such as "...

[61] [61]

Objects should be ordered from +X to -X (more negative means farther)

Output x-axis depth ordering timeline. Objects should be ordered from +X to -X (more negative means farther)

[62] [62]

segment_id

Output key pairwise spatial relations: horizontal (left/right/overlap) vertical (above/below/overlap) optional depth (front/behind/same_depth). segment_id: {segment_id} object_text_hint: {object_text} relation_hint: {relation_hint} segment_frame_tokens: {segment_tokens} frame_mapping: {token_lines} previous_segment_memory: {memory_text} Return JSON with k...

[63] [63]

Merge duplicated object identities

[64] [64]

Resolve temporal conflicts across segments

[65] [65]

left cup

Keep only symbolic spatial relations. Do not output numeric pose or distance. Object merging rules: 3.1 Prefer instance names distinguished by 38 J. Li et al. appearance details (color, texture, shape, size, markings). 3.2 If objects are visually indistinguishable, spatial qualifiers such as "left cup" / "right cup" are allowed, but must remain consistent...

[66] [66]

Output ordering from +X to -X (more negative means farther)

Merge x-axis depth ordering timelines. Output ordering from +X to -X (more negative means farther)

[67] [67]

objects": [string],

Merge relative_position_timeline across the entire video. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} all_frame_tokens: {frame_tokens} segment_fragments_json: {segment_fragments_json} segment_memories_json: {segment_memories_json} Return JSON with keys: { "objects": [string], "object_meta": [ { "id": string, "name": string, "cat...

[68] [68]

Use world coordinate convention: objects ordered from +X to -X (more negative means farther)

[69] [69]

Provide time-sliced depth ordering covering the video timeline

[70] [70]

x_depth_order_timeline

If information is insufficient, output a single global ordering. Inputs: object_text_hint: {object_text} relation_hint: {relation_hint} frame_tokens: {frame_tokens} current_scene_graph_json: {current_scene_graph_json} segment_memories_json: {segment_memories_json} Return JSON: { "x_depth_order_timeline": [ { "frame_range": string, "start_frame_index": int...

[71] [71]

Provide exactly {n_pos} positive points

[72] [72]

Provide exactly {n_neg} negative points

[73] [73]

Positive points must lie on the target object

[74] [74]

Negative points should lie on non-target objects or background regions

[75] [75]

First decompose the target object into visible semantic parts. 42 J. Li et al

[76] [76]

Distribute positive points across different visible parts whenever possible

[77] [77]

part_decomposition

If failure analysis is provided, prioritize correcting the detected issues, such as: - missing_parts - extra_objects - wrong_positive_points - wrong_negative_points Return JSON: { "part_decomposition": [ { "part_name": string, "is_visible": bool, "importance": "high|medium|low" } ], "positive_points": [ { "x": float, "y": float, "part_name": string } ], "...

[78] [78]

Original image with mask overlay

[79] [79]

Binary mask image (white = mask)

[80] [80]

Foreground cutout (mask region only)