VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
hub
arXiv preprint arXiv:2505.23656 (2025)
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.
GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.
PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.
MaineCoon is presented as the first 22B-parameter real-time streaming audio-visual autoregressive model optimized for social-interactive applications, using novel training techniques and an agentic inference framework.
PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visual quality.
Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
citing papers explorer
-
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
-
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
-
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis
Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
-
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
-
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.
-
PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.
-
MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
MaineCoon is presented as the first 22B-parameter real-time streaming audio-visual autoregressive model optimized for social-interactive applications, using novel training techniques and an agentic inference framework.
-
Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment
PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visual quality.
-
Tempered Self-Similarity Alignment for Physically Plausible Video Generation
Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.
-
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.