NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives.arXiv preprint arXiv:2311.18259
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
citing papers explorer
-
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
-
ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
-
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
-
The TIME Machine: On The Power of Motion for Efficient Perception
TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
-
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
-
SAM 2: Segment Anything in Images and Videos
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
-
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.