SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.
SAM 3D Body: Robust Full-Body Human Mesh Recovery
20 Pith papers cite this work.
years: 2026 (20)
representative citing papers
H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.
LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.
DanceCrafter generates high-fidelity, text-controlled dance sequences using a new Choreographic Syntax framework and a large fine-grained motion dataset.
Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.
PRIMA boosts 3D quadruped mesh recovery by injecting BioCLIP biological priors and using test-time adaptation with 2D constraints to build the Quadruped3D pseudo-3D dataset and reach SOTA on imbalanced animal benchmarks.
MAEM is a training-free framework that combines monocular 3D mesh recovery with a two-stage epipolar matching strategy using disjoint-set-union clustering and per-joint triangulation for multi-view multi-person 3D pose estimation in basketball.
SUGAR turns diverse human videos into deployable humanoid loco-manipulation policies via automated prior extraction, physics refinement, and hierarchical distillation, showing scaling with data volume and zero-shot real-world transfer on six tasks.
The method converts exocentric videos to an egocentric format via body-pose extraction and kinematics to improve egocentric world-model prediction and planning.
HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.
SAMe grounds complaints to organs, builds a lightweight patient anatomy model from one body image, and outputs probe initialization poses, outperforming keypoint baselines in real-robot liver and kidney trials.
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.
WNM introduces a 4D world narrative representation orchestrated by agents to drive video foundation models for high controllability.
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.
A minimal Gaussian splatting avatar pipeline using the Momentum Human Rig achieves the highest reported PSNR on PeopleSnapshot and ZJU-MoCap without learned deformations.
LUNA is an LBS-free neural animation model that maps 2D controls to 3D Gaussian deformations via a transformer motion regressor and hybrid supervision for realistic motion and zero-shot generalization.
SMART adapts SMPLest-X via stratified finetuning and RAFT tracking to achieve scores of 0.647 (validation) and 0.593 (test) on the FIFA 2026 skeletal tracking challenge, versus a baseline of 1.053.
citing papers explorer
- CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
- Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation