Recognition: 2 Lean theorem links
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3
The pith
Sword makes world models reliable simulators for VLA policies by disentangling visual style from task dynamics and bootstrapping latents for consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sword introduces Structure-Guided Style Augmentation, which disentangles visual textures from task-relevant dynamics to improve generalization, and Dynamic Latent Bootstrapping, which keeps training and inference consistent at low memory cost. Together these yield world models that outperform the WoVR baseline on LIBERO in generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
What carries the argument
Structure-Guided Style Augmentation, which uses scene structure to separate style textures from dynamics during world-model training, paired with Dynamic Latent Bootstrapping for low-memory consistent rollouts.
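The review does not reproduce the augmentation procedure, but the general idea of structure-guided style perturbation can be sketched. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it assumes an image-gradient edge map as the structure signal and brightness/saturation/hue jitter as the style factors; `edge_map`, the jitter ranges, and the blend weight are all illustrative assumptions.

```python
# Hypothetical sketch of structure-guided style augmentation (not the
# paper's code): randomize style factors (color, illumination) while an
# edge map re-injects scene structure so dynamics cues survive.
import torch
import torchvision.transforms.functional as TF

def edge_map(frames: torch.Tensor) -> torch.Tensor:
    """Cheap structure proxy: absolute image gradients of the gray image, (B, 1, H, W)."""
    gray = frames.mean(dim=1, keepdim=True)
    edges = torch.zeros_like(gray)
    edges[..., :, 1:] += (gray[..., :, 1:] - gray[..., :, :-1]).abs()
    edges[..., 1:, :] += (gray[..., 1:, :] - gray[..., :-1, :]).abs()
    return edges

def style_augment(frames: torch.Tensor, strength: float = 0.3) -> torch.Tensor:
    """frames: (B, 3, H, W) in [0, 1]. Returns a style-perturbed batch."""
    b = 1.0 + strength * (2 * torch.rand(1).item() - 1)   # brightness factor
    s = 1.0 + strength * (2 * torch.rand(1).item() - 1)   # saturation factor
    h = 0.1 * strength * (2 * torch.rand(1).item() - 1)   # hue shift in [-0.5, 0.5]
    styled = TF.adjust_hue(TF.adjust_saturation(
        TF.adjust_brightness(frames, b), s), h)
    # Blend the original edge map back in so structure is not washed out.
    return (styled + 0.5 * edge_map(frames)).clamp(0.0, 1.0)
```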
If this is right
- World models become less prone to cascading hallucinations from minor changes in color or illumination during closed-loop simulation.
- Long-horizon state predictions retain higher fidelity, supporting extended policy rollouts in imagination.
- Reinforcement-learning post-training of VLA models achieves higher task success rates without additional real-world interaction (a schematic rollout loop is sketched after this list).
- Simulators generalize across visual style variations while preserving the dynamics needed for action prediction.
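To make the "post-training in imagination" picture concrete, here is a minimal, hypothetical sketch of a closed-loop rollout inside a learned world model. `world_model`, `policy`, and `reward_fn` are placeholder interfaces assumed for illustration, not Sword's actual API.

```python
# Schematic imagination rollout for RL post-training (illustrative only):
# the policy acts, the world model predicts the next state, and rewards
# accumulate without touching the real environment.
from typing import Callable, List, Tuple
import torch

def imagine_rollout(
    world_model,                                   # (state, action) -> next state
    policy: Callable[[torch.Tensor], torch.Tensor],
    reward_fn: Callable[[torch.Tensor], torch.Tensor],
    init_state: torch.Tensor,
    horizon: int = 64,
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[torch.Tensor]]:
    states, actions, rewards = [init_state], [], []
    state = init_state
    for _ in range(horizon):
        action = policy(state)              # VLA policy acting on the imagined state
        state = world_model(state, action)  # simulator step in imagination
        states.append(state)
        actions.append(action)
        rewards.append(reward_fn(state))    # e.g., a learned success predictor
    return states, actions, rewards
```

Trajectories like these would then feed a standard policy-gradient update; PPO [22], which the paper cites, is a natural choice for such loops.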
Where Pith is reading between the lines
- The same disentanglement principle could be applied to other robotics simulators to narrow the sim-to-real gap for non-VLA policies.
- Lower memory bootstrapping may allow larger-scale world-model training on consumer hardware for longer-horizon tasks.
- If style separation proves general, future work could combine it with language-conditioned dynamics for more controllable simulation.
Load-bearing premise
Structure-guided style augmentation separates visual textures from task dynamics without discarding information required for accurate long-horizon prediction or policy learning.
What would settle it
A controlled test on LIBERO with initial-state visual perturbations: the central claim fails if Sword-trained models show no reduction in error accumulation or no gain in VLA post-training success rate under those perturbations.
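Operationally, the test might look like the sketch below. The perturbation (a brightness shift), the error metric (per-step MSE), and the `world_model`/`policy` interfaces are illustrative assumptions, not the paper's protocol.

```python
# Sketch of the proposed falsification test: perturb the initial frame's
# style, roll out in imagination, and compare per-step prediction error
# against ground truth. Metric choice (MSE) is an assumption.
import torch

def perturb_initial_state(frame: torch.Tensor, brightness: float = 1.2) -> torch.Tensor:
    """Minor illumination change of the kind the abstract says triggers hallucinations."""
    return (frame * brightness).clamp(0.0, 1.0)

def error_accumulation(world_model, policy, frame0, ground_truth, horizon=32):
    """Per-step MSE between imagined frames and ground-truth frames (T, B, C, H, W)."""
    errors, frame = [], frame0
    for t in range(horizon):
        frame = world_model(frame, policy(frame))
        errors.append(torch.mean((frame - ground_truth[t]) ** 2).item())
    return errors  # flat curve = robust; exploding curve = cascading hallucination

# If the error curve under perturb_initial_state(frame0) tracks the clean
# curve, Sword's robustness claim survives; if it diverges, it does not.
```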
Original abstract
The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sword, a style-robust world model framework for use as simulators in Vision-Language-Action (VLA) policy post-training. It introduces two components: Structure-Guided Style Augmentation to disentangle visual textures from task-relevant dynamics for improved generalization, and Dynamic Latent Bootstrapping to reduce long-horizon error accumulation while maintaining training-inference consistency and low memory use. Extensive experiments on the LIBERO benchmark are claimed to show significant outperformance over the WoVR baseline across generalization, generation quality, robustness, fidelity, and downstream RL post-training success rates.
Significance. If the empirical claims hold with rigorous validation, the work could meaningfully advance reliable generative simulators for closed-loop VLA policy optimization, addressing persistent issues of style sensitivity and compounding errors that currently limit world-model-based 'imagination' training. The emphasis on structure-guided augmentation and bootstrapping offers a practical path toward more robust deployment in real-world robotic settings.
Major comments (2)
- Abstract and §4 (Experiments): the claim that Sword 'significantly outperforms' the WoVR baseline is presented without quantitative metrics, success rates, error bars, or ablation tables, and is only summarized at a high level. This claim is load-bearing for the central empirical contribution; key numbers (e.g., success-rate deltas on LIBERO tasks) with statistical significance are needed to assess the effect size.
- §3.1 (Structure-Guided Style Augmentation): the method is asserted to disentangle visual textures from task-relevant dynamics 'without discarding information needed for accurate long-horizon prediction', yet no information-theoretic bounds, reconstruction-error analysis, or controlled ablations on information loss are provided. This leaves the weakest assumption untested and directly undermines the generalization and fidelity claims.
Minor comments (2)
- Notation in §3.2 (Dynamic Latent Bootstrapping): the bootstrapping procedure is described only at a high level; clear algorithmic pseudocode or explicit update equations would improve reproducibility (a hedged sketch of one plausible reading follows this list).
- Figure captions and §4: several qualitative rollout visualizations are referenced but lack side-by-side quantitative metrics (e.g., FID or PSNR scores) aligned with the visual examples, reducing clarity of the generation-quality claims.
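Since the paper describes Dynamic Latent Bootstrapping only at a high level, one plausible reading consistent with its stated goals (training-inference consistency plus low memory) is to train the predictor on its own detached latent rollouts, cutting the backprop graph every few steps. The sketch below is an assumption, not the authors' confirmed algorithm; `encode`, `predict`, `loss_fn`, and the window size are illustrative placeholders.

```python
# One plausible reading of Dynamic Latent Bootstrapping (NOT the paper's
# confirmed algorithm): roll the model on its own latent predictions so
# training matches closed-loop inference, and backprop only through a
# bounded window so memory is O(window) rather than O(horizon).
import torch

def bootstrapped_training_step(encode, predict, loss_fn, opt, frames, actions, window=8):
    """frames: (T, B, C, H, W); actions: (T-1, B, A). Returns summed loss value."""
    latent = encode(frames[0]).detach()   # ground-truth latent only at the start
    total, loss = 0.0, 0.0
    T = frames.shape[0]
    for t in range(1, T):
        latent = predict(latent, actions[t - 1])               # model consumes its own latent
        loss = loss + loss_fn(latent, encode(frames[t]).detach())
        if t % window == 0 or t == T - 1:
            opt.zero_grad()
            loss.backward()               # backprop only through this window
            opt.step()
            total += float(loss)
            latent = latent.detach()      # cut the graph: bounded memory
            loss = 0.0
    return total
```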
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where appropriate.
Point-by-point responses
Referee: Abstract and §4 (Experiments): the claim that Sword 'significantly outperforms' the WoVR baseline is presented without quantitative metrics, success rates, error bars, or ablation tables, and is only summarized at a high level. This claim is load-bearing for the central empirical contribution; key numbers (e.g., success-rate deltas on LIBERO tasks) with statistical significance are needed to assess the effect size.
Authors: We agree that the abstract would be strengthened by including explicit quantitative results. In the revised manuscript, we have updated the abstract to report key metrics from the LIBERO experiments, including average success rate improvements (with deltas relative to WoVR), generation quality scores, and references to error bars and statistical significance. Section 4 has been expanded with full tables containing error bars, ablation results, and p-values to substantiate the effect sizes. revision: yes
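The statistical reporting asked for here is standard. A minimal sketch using a bootstrap confidence interval on the per-episode success-rate delta (illustrative; the revised paper's exact procedure is not specified in this review):

```python
# Sketch of reporting a success-rate delta with a bootstrap confidence
# interval, the kind of number the referee asks for. Purely illustrative.
import numpy as np

def success_rate_delta_ci(sword_successes, wovr_successes,
                          n_boot=10_000, alpha=0.05, seed=0):
    """Inputs: arrays of 0/1 episode outcomes. Returns (delta, ci_lo, ci_hi)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(sword_successes, dtype=float)
    b = np.asarray(wovr_successes, dtype=float)
    delta = a.mean() - b.mean()
    boots = [
        rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()  # resample with replacement
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return delta, lo, hi
```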
Referee: §3.1 (Structure-Guided Style Augmentation): the method is asserted to disentangle visual textures from task-relevant dynamics 'without discarding information needed for accurate long-horizon prediction', yet no information-theoretic bounds, reconstruction-error analysis, or controlled ablations on information loss are provided. This leaves the weakest assumption untested and directly undermines the generalization and fidelity claims.
Authors: The design of Structure-Guided Style Augmentation explicitly targets preservation of task-relevant dynamics by operating on structural features, and our experiments on LIBERO demonstrate that long-horizon prediction quality and downstream policy performance are not degraded (and in fact improve) relative to baselines. We acknowledge that dedicated information-theoretic bounds and isolated reconstruction ablations for information loss were not included in the original submission. To address this directly, we have added a controlled ablation study and reconstruction error analysis (including feature-level MSE and perceptual metrics) in the revised §3.1 and appendix. revision: yes
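The rebuttal's "feature-level MSE and perceptual metrics" can be made concrete. A minimal sketch using frozen VGG16 features as the feature space and LPIPS [35] as the perceptual metric; the specific backbone and layer cut are assumptions, not the paper's choices:

```python
# Sketch of a reconstruction-error analysis in feature space (assumed
# VGG16 features) plus a perceptual metric (LPIPS [35]). Illustrative only.
import torch
import torchvision.models as models
import lpips  # pip install lpips

_vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
_lpips = lpips.LPIPS(net="alex")

@torch.no_grad()
def recon_errors(pred: torch.Tensor, target: torch.Tensor):
    """pred, target: (B, 3, H, W) in [0, 1]. Returns (feature-level MSE, LPIPS)."""
    feat_mse = torch.mean((_vgg(pred) - _vgg(target)) ** 2).item()
    percep = _lpips(pred * 2 - 1, target * 2 - 1).mean().item()  # LPIPS expects [-1, 1]
    return feat_mse, percep
```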
Circularity Check
No significant circularity; empirical proposal with benchmark validation
Full rationale
The paper proposes two new components (Structure-Guided Style Augmentation and Dynamic Latent Bootstrapping) for improving world models in VLA post-training and validates them via experiments on the LIBERO benchmark. No equations, derivations, or first-principles claims are presented in the abstract or described structure that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The argument follows a standard propose-and-validate pattern whose validity depends on external experimental results rather than internal reduction to inputs. No load-bearing steps match any of the enumerated circularity patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Why unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "Structure-Guided Style Augmentation... Dynamic Latent Bootstrapping... outperforms WoVR on LIBERO... generalization, generation quality, robustness, fidelity"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (tag: unclear)
  Why unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "Dynamic Latent Bootstrapping... maintains consistency between training and inference"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, et al. WoVR: World models as reliable simulators for post-training VLA policies with RL. arXiv preprint arXiv:2602.13977, 2026.
- [2] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P. Foster, Pannag R. Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.
- [3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [4] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.
- [5] Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025.
- [6] Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025.
- [7] Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Chen Zhou, Yucheng Guo, et al. RL-VLA³: Reinforcement learning VLA accelerating via full asynchronism. arXiv preprint arXiv:2602.05765, 2026.
- [8] Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. RynnVLA-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025.
- [9] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.
- [10] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [11] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025.
- [12] NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, ... 2026.
- [13] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski, ... 2023.
- [14] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [15] Gemini Robotics Team. Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, October 2025.
- [16] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [17] Remi Cadene, Simon Aliberts, Francesco Capuano, Michel Aractingi, Adil Zouitine, Pepijn Kooijmans, Jade Choghari, Martino Russi, Caroline Pascal, Steven Palma, et al. LeRobot: An open-source library for end-to-end robot learning. arXiv preprint arXiv:2602.22818, 2026.
- [18] Chen Zhou, Haoran Sun, Hedan Yang, Jing Long, Junwu Xiong, Luqiao Wang, Mingxi Luo, Qiming Yang, Shuai Di, Song Wang, et al. Thousand-GPU large-scale training and optimization recipe for AI-native cloud embodied intelligence infrastructure. arXiv preprint arXiv:2603.11101, 2026.
- [19] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A vision-language-action model for affordable and efficient robotics, 2025.
- [20] Team Wan: Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [21] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [24] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.
- [25] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.
- [26] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
- [27] NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, T... Cosmos-Transfer1: Conditional world generation with adaptive multimodal control, 2025.
- [28] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.
- [29] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- [30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [32] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. MotionStream: Real-time video generation with interactive motion controls, 2026.
- [33] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation, 2025.
- [34] Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025.
- [35] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- [36] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- [37] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation, 2019.
- [38] Duolikun Danier, Fan Zhang, and David Bull. FloLPIPS: A bespoke video quality metric for frame interpolation. In 2022 Picture Coding Symposium (PCS), pages 283–287. IEEE, 2022.