H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.
SAM 3D Body: Robust Full-Body Human Mesh Recovery
14 Pith papers cite this work.
years: 2026 (14)
representative citing papers
LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming state-of-the-art methods on benchmarks.
DanceCrafter generates high-fidelity, text-controlled dance sequences using a new Choreographic Syntax framework and a large fine-grained motion dataset.
Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.
SUGAR turns diverse human videos into deployable humanoid loco-manipulation policies via automated prior extraction, physics refinement, and hierarchical distillation, showing scaling with data volume and zero-shot real-world transfer on six tasks.
HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.
SAMe grounds complaints to organs, builds a lightweight patient anatomy model from one body image, and outputs probe initialization poses, outperforming keypoint baselines in real-robot liver and kidney trials.
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.
A minimal Gaussian splatting avatar pipeline using the Momentum Human Rig achieves the highest reported PSNR on PeopleSnapshot and ZJU-MoCap without learned deformations.
citing papers explorer
- CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
- Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation