iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
citing papers explorer
-
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
-
Pareto-Guided Optimal Transport for Multi-Reward Alignment
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
-
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.