StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation
Pith reviewed 2026-05-21 04:50 UTC · model grok-4.3
The pith
StreamGVE adapts pre-trained streaming video generators for high-quality editing in few steps without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StreamGVE shows that pre-trained streaming video generation models can be turned into effective editing tools without retraining. It runs dual-branch fast sampling that keeps few-step noise-to-data generation intact, while a self-attention bridge and cross-attention mechanisms inject source-video conditions. Source-oriented guidance and visual prompting are added to raise target quality and practicality.
What carries the argument
Dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting that preserves few-step sampling while injecting source-video conditions into pre-trained streaming generators.
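As a rough illustration of what a self-attention bridge could look like, the sketch below lets target-branch queries attend over the concatenation of source-branch and target-branch keys and values. This is an assumption modeled on mutual self-attention as used in tuning-free editing methods, not the paper's confirmed formulation; the function names (attention, bridged_self_attention) and the plain-list token representation are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def bridged_self_attention(tgt_q, tgt_k, tgt_v, src_k, src_v):
    """Hypothetical bridge: target queries see source AND target tokens.

    Assumption: the source branch exposes its keys/values at the same layer
    and denoising step, so they can simply be concatenated with the target's.
    """
    return attention(tgt_q, src_k + tgt_k, src_v + tgt_v)
```

With an all-zero query the attention weights are uniform, so the output is the plain average of the source and target values, which makes the mixing easy to verify by hand.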
If this is right
- Video editing tasks become practical at much lower computational cost than iterative baselines.
- The same adaptation works across different pre-trained streaming models without further changes.
- Source-oriented guidance and visual prompting measurably raise editing accuracy and user control.
- Few-step processing opens the door to near-real-time video editing workflows.
Where Pith is reading between the lines
- The dual-branch pattern could be tested on other generative domains such as audio or 3D content editing.
- Pairing the method with simple user interfaces might let non-experts perform complex edits quickly.
- The noise-to-data shift suggests similar lightweight adaptations could reduce training needs in related generation tasks.
Load-bearing premise
Pre-trained streaming generation models can be directly adapted for video editing without any training by the proposed dual-branch sampling, attention mechanisms, and source-oriented guidance while still satisfying both fast sampling and accurate conditioning requirements.
What would settle it
A controlled test in which StreamGVE produces visibly lower fidelity to the source video or lower visual quality than strong baselines when both are limited to the same small number of sampling steps would disprove the central claim.
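One way such a controlled test could be scored, assuming a per-video metric where higher is better and both methods are run on the same videos at the same step count, is a paired bootstrap confidence interval on the mean difference. The function name, the resample count, and the percentile interval are illustrative choices, not anything specified by the paper or by Pith.

```python
import random


def paired_bootstrap_ci(ours, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (ours - baseline), paired by video.

    `ours` and `baseline` are per-video scores for the same videos, in the
    same order, produced at the same number of sampling steps.
    """
    if len(ours) != len(baseline) or not ours:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

An interval lying entirely below zero at a matched step budget would count against the central claim; an interval straddling zero would leave the question unsettled.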
Original abstract
Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes StreamGVE, a training-free video editing framework that adapts pre-trained streaming video generation models to the editing task. It preserves few-step sampling by introducing dual-branch fast sampling, a self-attention bridge for intra-frame consistency, and cross-attention grounding/boosting for source-video conditioning. Additional components include source-oriented guidance to improve target quality and a visual prompting strategy for flexible editing. The central claim is that the method is effective, robust, and generalizable, consistently outperforming prior video editing approaches across diverse tasks while incurring minimal additional time cost.
Significance. If the empirical claims hold, the work offers a meaningful contribution to efficient, training-free video editing by shifting from data-to-data to noise-to-data paradigms compatible with modern streaming generators. The training-free adaptation, few-step efficiency, and cross-model generalizability are clear strengths that could reduce computational barriers in practical video manipulation pipelines.
major comments (2)
- [§3.3] §3.3, dual-branch sampling description: the claim that the self-attention bridge and cross-attention grounding together satisfy both fast sampling and accurate conditioning is central to the method, yet the interaction between the two branches is not shown to provably avoid drift or inconsistency in the few-step regime; a concrete ablation isolating the bridge's contribution would strengthen this.
- [§4.1] §4.1 and Table 1: the reported outperformance is asserted across tasks, but the quantitative tables lack error bars, multiple random seeds, or statistical tests; without these, it is difficult to confirm that gains are robust rather than sensitive to particular prompt or video selections.
minor comments (3)
- [§3.4] Notation for the source-oriented guidance term is introduced without an explicit equation; adding a compact formulation would improve clarity.
- [Figure 3] Figure 3 caption and axis labels could be expanded to indicate which baseline each column corresponds to and what metric is visualized.
- [§3.5] The visual prompting strategy is described at a high level; a short pseudocode snippet or parameter list would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We appreciate the suggestions for clarifying the method and strengthening the empirical claims. Below we respond point by point to the major comments and indicate the revisions made.
- Referee: [§3.3] §3.3, dual-branch sampling description: the claim that the self-attention bridge and cross-attention grounding together satisfy both fast sampling and accurate conditioning is central to the method, yet the interaction between the two branches is not shown to provably avoid drift or inconsistency in the few-step regime; a concrete ablation isolating the bridge's contribution would strengthen this.
  Authors: We agree that a formal proof of drift avoidance would be desirable but lies outside the scope of the current empirical framework. The dual-branch design uses the self-attention bridge to propagate source-frame features into the target branch at each denoising step while cross-attention grounding injects source conditioning; together they empirically maintain consistency under few-step sampling. To address the request, we have added a new ablation in the revised §3.3 and supplementary material that isolates the self-attention bridge by comparing variants with and without it, showing measurable reductions in temporal inconsistency metrics.
  Revision: yes
- Referee: [§4.1] §4.1 and Table 1: the reported outperformance is asserted across tasks, but the quantitative tables lack error bars, multiple random seeds, or statistical tests; without these, it is difficult to confirm that gains are robust rather than sensitive to particular prompt or video selections.
  Authors: We acknowledge that error bars and multi-seed statistics would improve confidence in the reported gains. Because each video-generation run is computationally expensive, the original experiments used a single fixed seed per configuration. In the revision we have added results from three independent seeds for the main quantitative comparisons, included standard-deviation error bars in the updated Table 1, and inserted a brief discussion of observed variability. Full statistical hypothesis testing across every task remains resource-limited, but the consistent ranking across diverse prompts and videos supports the robustness claim.
  Revision: partial
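The multi-seed reporting described in the response above can be sketched as follows. This is a minimal illustration, assuming one scalar score per seed for a given method and metric; the helper names are hypothetical and no numbers from the paper are used.

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


def format_cell(scores, digits=3):
    """Render a table cell as 'mean ± std' for a list of per-seed scores."""
    m, s = summarize_seeds(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"
```

With only three seeds the sample standard deviation is itself a noisy estimate, so cells formatted this way indicate spread rather than establish significance.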
Circularity Check
No significant circularity detected
The paper describes a training-free adaptation of pre-trained streaming video generation models for editing tasks. It introduces new components such as dual-branch fast sampling, a self-attention bridge, cross-attention grounding/boosting, source-oriented guidance, and a visual prompting strategy. These are presented as architectural additions rather than derivations from fitted parameters or self-referential definitions. No equations, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to inputs by construction appear in the provided text. The method relies on external pre-trained models and novel conditioning mechanisms, making the derivation chain self-contained without circular reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We reformulate video editing as source-conditioned noise-to-data streaming generation... dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Built on pre-trained streaming generation models... few-step sampling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.