MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
hub
Scalable diffusion models with transformers
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segmentation, flow prediction, and motion forecasting.
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
MIMIC-D enables multi-modal multi-agent coordination via joint training of decentralized diffusion policies using only local information.
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
A cold diffusion model with direct and delta-normalized reverse processes, using UNet and transformer backbones, outperforms diffusion baselines for dereverberating acoustic and electronic drum stems on in-domain and out-of-domain tests.
The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
TAX-DPD combines a feed-forward dense GMM for global placement priors with disentangled point cloud diffusion for local geometry and pose to achieve precise robotic object placement.
NC-Diffusion matches quantization noise to the diffusion forward process, adds an adaptive frequency filter and zero-shot enhancement, and reports superior fidelity on benchmarks.
A primitive-based truncated diffusion model with keypoint attention encoding generates more efficient and diverse trajectories for mobile manipulators than vanilla diffusion in cluttered 3D simulations.
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
Flow-Opt combines a flow-matching DiT model with a custom differentiable safety filter and learned initialization to enable fast centralized trajectory optimization for tens of robots.
RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
citing papers explorer
-
MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
-
4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving
4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segmentation, flow prediction, and motion forecasting.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction
DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
-
TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics
TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.
-
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
-
MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies
MIMIC-D enables multi-modal multi-agent coordination via joint training of decentralized diffusion policies using only local information.
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
A Cold Diffusion Approach for Percussive Dereverberation
A cold diffusion model with direct and delta-normalized reverse processes, using UNet and transformer backbones, outperforms diffusion baselines for dereverberating acoustic and electronic drum stems on in-domain and out-of-domain tests.
-
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
Disentangled Point Diffusion for Precise Object Placement
TAX-DPD combines a feed-forward dense GMM for global placement priors with disentangled point cloud diffusion for local geometry and pose to achieve precise robotic object placement.
-
A Noise Constrained Diffusion (NC-Diffusion) Framework for High Fidelity Image Compression
NC-Diffusion matches quantization noise to the diffusion forward process, adds an adaptive frequency filter and zero-shot enhancement, and reports superior fidelity on benchmarks.
-
Primitive-based Truncated Diffusion for Efficient Trajectory Generation of Differential Drive Mobile Manipulators
A primitive-based truncated diffusion model with keypoint attention encoding generates more efficient and diverse trajectories for mobile manipulators than vanilla diffusion in cluttered 3D simulations.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization
Flow-Opt combines a flow-matching DiT model with a custom differentiable safety filter and learned initialization to enable fast centralized trajectory optimization for tens of robots.
-
RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.
-
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.