Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
13 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
citing papers explorer
- No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
  NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
- ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
  ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
  Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
- Latent Visual Diffusion Reasoning with Monte Carlo Tree Search
  LVDR integrates keypoint-guided MCTS into a latent diffusion reasoning model to deliver competitive skill assessment accuracy alongside explicit visual reasoning trajectories on four sports and surgical datasets.
- Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
  Introduces Ego-MC-Bench benchmark and Ego-CoMist synthetic dataset showing that fine-tuning video LLMs on proactive mistake corrections improves performance especially for smaller models.
- Harnessing Streaming Video in the Wild
  Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
- The TIME Machine: On The Power of Motion for Efficient Perception
  TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
- HumanNet: Scaling Human-centric Video Learning to One Million Hours
  HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
- MolmoAct2: Action Reasoning Models for Real-world Deployment
  MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
- What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
  Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
- RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail
  Exocentric-only LoRA adaptation of Cosmos3-Nano on a new synchronized retail video dataset matches or exceeds combined ego+exo training on most held-out metrics.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.