In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Pith reviewed 2026-05-16 16:04 UTC · model grok-4.3
The pith
Large Diffusion Transformers can perform precise instructional image editing via in-context generation with only minimal fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ICEdit shows that an in-context editing paradigm, applied directly to large Diffusion Transformers, enables precise modifications from natural language instructions by embedding the source image and editing prompt together in the generation process. Combined with lightweight fine-tuning on a tiny dataset and inference-time selection of promising noise seeds via VLMs, the method achieves state-of-the-art editing performance without altering the underlying DiT architecture.
What carries the argument
The in-context editing paradigm that feeds the original image together with the editing instruction into the DiT's generation process as a single prompt.
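The mechanics of this paradigm can be sketched as sequence construction: source-image tokens, instruction tokens, and noisy target tokens are concatenated into a single sequence for one DiT forward pass. The sketch below is illustrative only; the token counts, dimensions, and segment-id scheme are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_in_context_sequence(src_tokens, text_tokens, noise_tokens):
    """Concatenate source-image tokens, instruction tokens, and noisy
    target tokens into one sequence for a single DiT forward pass.
    Segment ids let the model tell the three spans apart."""
    seq = np.concatenate([src_tokens, text_tokens, noise_tokens], axis=0)
    seg = np.concatenate([
        np.zeros(len(src_tokens), dtype=int),      # 0 = source image
        np.ones(len(text_tokens), dtype=int),      # 1 = editing instruction
        np.full(len(noise_tokens), 2, dtype=int),  # 2 = target noise
    ])
    return seq, seg

d = 64                               # token dimension (illustrative)
src = rng.normal(size=(256, d))      # 16x16 latent patches of the source image
text = rng.normal(size=(32, d))      # encoded editing instruction
tgt = rng.normal(size=(256, d))      # noise tokens to be denoised into the edit

seq, seg = build_in_context_sequence(src, text, tgt)
print(seq.shape)   # → (544, 64)
```

Because the edit target lives in the same sequence as its conditioning, no cross-attention adapters or architectural changes are needed; the DiT's existing self-attention carries the conditioning.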
If this is right
- Instructional editing becomes feasible for users who lack access to massive compute or datasets.
- Quality improves further by scaling the base DiT size rather than the fine-tuning data.
- Inference cost drops because only high-quality noise samples are processed to completion.
- The same in-context strategy can be applied to other generation tasks inside the same model family.
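The inference-cost point rests on the Early Filter idea: run a few cheap denoising steps from each candidate noise seed, score the previews with a VLM, and spend the full denoising budget only on the best seed(s). A minimal sketch, with stand-in functions replacing the real denoiser and VLM scorer (both are hypothetical stubs, not the paper's models):

```python
import random

def denoise_partial(seed, steps):
    """Stand-in for a few cheap denoising steps from a noise seed;
    returns a low-step preview (here just deterministic pseudo-data)."""
    rnd = random.Random(seed)
    return [rnd.random() for _ in range(steps)]

def vlm_score(preview):
    """Stand-in for a VLM judging how promising a preview is."""
    return sum(preview) / len(preview)

def early_filter(seeds, preview_steps=4, keep=1):
    """Score cheap previews for every candidate seed; keep the best."""
    scored = [(vlm_score(denoise_partial(s, preview_steps)), s) for s in seeds]
    scored.sort(reverse=True)
    return [s for _, s in scored[:keep]]

best = early_filter(seeds=range(8))
# Only the seeds in `best` are then denoised for the full schedule,
# so most candidates cost only `preview_steps` of compute.
```

The efficiency gain comes from the asymmetry between preview cost and full-schedule cost; if the VLM scorer itself is slow, a smaller VLM (as the interface point above suggests) restores the savings.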
Where Pith is reading between the lines
- Existing large generative models outside the DiT family could adopt similar in-context framing for editing tasks.
- Extending the approach to video or 3D data would test whether the same efficiency gains hold in higher-dimensional domains.
- Real-time editing interfaces become practical if the early filter step is accelerated with smaller VLMs.
Load-bearing premise
Large Diffusion Transformers already understand natural language editing instructions well enough to apply them accurately when the task is presented in an in-context format.
What would settle it
A controlled test on a held-out DiT model showing that the in-context method fails to outperform strong training-free baselines on diverse editing instructions would falsify the central claim.
original abstract
Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) An in-context editing paradigm without architectural modifications; (2) Minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% trainable parameters compared to previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Codes and demos can be found in https://river-zhang.github.io/ICEdit-gh-pages/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ICEdit, an instructional image editing approach that applies an in-context generation paradigm to large-scale Diffusion Transformers without any architectural changes, augments it with minimal parameter-efficient fine-tuning, and introduces Early Filter Inference-Time Scaling via VLMs to select high-quality noise samples. It claims state-of-the-art editing performance while using only 0.1% of the training data and 1% trainable parameters relative to prior methods.
Significance. If the reported gains can be isolated to the in-context paradigm and minimal PEFT rather than differences in the underlying DiT backbone, the work would demonstrate a practical route to high-precision editing with dramatically reduced data and compute, shifting the precision-efficiency tradeoff in the field.
major comments (1)
- [Section 4] Section 4 and the experimental tables: the SOTA and efficiency claims (0.1% data, 1% trainable parameters) rest on comparisons against prior editing methods, yet the text does not explicitly confirm that every baseline was re-implemented and evaluated on the identical pretrained DiT weights and model scale used for ICEdit. Without this verification, the performance delta cannot be attributed solely to the proposed paradigm.
minor comments (1)
- [Abstract] Abstract: the claim of state-of-the-art performance is stated without any quantitative metrics, specific baselines, or key numbers; inserting one or two headline results (e.g., from the main table) would make the abstract self-contained.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback. We address the major comment regarding experimental comparisons below and will revise the manuscript to improve clarity.
point-by-point responses
-
Referee: Section 4 and the experimental tables: the SOTA and efficiency claims (0.1% data, 1% trainable parameters) rest on comparisons against prior editing methods, yet the text does not explicitly confirm that every baseline was re-implemented and evaluated on the identical pretrained DiT weights and model scale used for ICEdit. Without this verification, the performance delta cannot be attributed solely to the proposed paradigm.
Authors: We agree that explicit confirmation is necessary to isolate the contribution of the in-context paradigm. All baselines were re-implemented and evaluated using the exact same pretrained DiT weights and model scale as ICEdit (the large-scale DiT backbone without architectural modifications). This ensures fair comparison under identical backbone conditions. However, we acknowledge that the manuscript text does not state this explicitly. We will revise Section 4 and the experimental setup description to include a clear statement confirming that every baseline uses the identical pretrained DiT weights and scale, allowing the performance gains to be attributed to the proposed in-context generation and minimal PEFT. Revision: yes
Circularity Check
No circularity: empirical performance claims rest on external comparisons without self-referential reductions
full rationale
The paper describes an in-context editing paradigm applied to existing large-scale Diffusion Transformers, using minimal PEFT and an inference-time filter, with no equations, derivations, or fitted parameters presented as predictions. The central claims (SOTA performance at 0.1% data and 1% trainable parameters) are supported by experimental tables comparing against prior methods; these are not reduced by construction to quantities defined within the paper itself. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. Baseline fairness concerns affect validity but do not create circularity under the specified criteria.
Axiom & Free-Parameter Ledger
free parameters (1)
- trainable parameters fraction
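The "trainable parameters fraction" can be made concrete with the standard LoRA arithmetic: a rank-r adapter on a frozen d_out×d_in weight trains r·(d_in + d_out) parameters against d_in·d_out frozen ones. The dimensions below are illustrative only, not the paper's actual configuration:

```python
def lora_fraction(d_in, d_out, rank):
    """Trainable fraction when a frozen d_out x d_in weight receives a
    rank-`rank` LoRA update W + B @ A (A: rank x d_in, B: d_out x rank)."""
    trainable = rank * (d_in + d_out)
    return trainable / (d_in * d_out)

# Hypothetical hidden size and rank, chosen to show how ~1% arises:
f = lora_fraction(d_in=3072, d_out=3072, rank=16)
print(f"{f:.3%}")   # → 1.042%
```

At this scale, modest ranks land in the ~1%-trainable regime the abstract cites, which is why the fraction is listed as a free parameter of the method.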
axioms (1)
- domain assumption: Large-scale Diffusion Transformers possess inherent comprehension and generation abilities that support in-context instructional editing without architectural changes.
Forward citations
Cited by 19 Pith papers
-
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset
A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
-
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation
InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image g...
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
LongCat-Image Technical Report
LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
Reference graph
Works this paper leans on
-
[1]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
-
[2]
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
-
[3]
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.
-
[4]
Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2025.
-
[5]
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
-
[6]
Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. Ace: All-round creator and editor following instructions via diffusion transformer. arXiv preprint arXiv:2410.00086, 2024.
-
[7]
Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025.
-
[8]
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing. arXiv preprint, 2025.
-
[9]
Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.
-
[10]
Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, and Qiang Liu. Insightedit: Towards better instruction following for image editing. arXiv preprint arXiv:2411.17323, 2024.
-
[11]
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. CoRR, abs/2208.01626, 2022.
-
[12]
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
-
[13]
Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024.
-
[14]
Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. Stable flow: Vital layers for training-free image editing, 2024.
-
[15]
Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.
-
[16]
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023.
-
[17]
Tianrui Zhu, Shiyi Zhang, Jiawei Shao, and Yansong Tang. Kv-edit: Training-free image editing for precise background preservation, 2025.
-
[18]
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. International Conference on Learning Representations (ICLR), 2024.
-
[19]
Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, October 2023.
-
[20]
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
-
[21]
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
-
[22]
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
-
[23]
Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. Accessed: 2024-11-12.
-
[24]
Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. arXiv preprint arXiv:2411.15466, 2024.
-
[25]
Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025.
-
[26]
Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024.
-
[27]
Tan Zhenxiong, Liu Songhua, Yang Xingyi, Xue Qiaochu, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
-
[28]
Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160, 2025.
-
[29]
Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025.
-
[30]
Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. arXiv preprint arXiv:2411.09502, 2024.
-
[31]
Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
-
[32]
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
-
[33]
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation, 2022.
-
[34]
Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199, 2024.
-
[35]
Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. arXiv preprint arXiv:2411.15738, 2024.
- [36]
-
[37]
Lianghua Huang, Wei Wang, Zhi-Fan Wu, Huanzhang Dou, Yupeng Shi, Yutong Feng, Chen Liang, Yu Liu, and Jingren Zhou. Group diffusion transformers are unsupervised multitask learners. arXiv preprint arXiv:2410.15027, 2024.
-
[38]
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
-
[39]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
-
[40]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
-
[41]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
-
[42]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
-
[43]
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
-
[44]
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024.
-
[45]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
-
[46]
Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022.
-
[47]
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 7514–7528. Association for Computational Linguistics, 2021.
-
[48]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
-
[49]
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
-
[50]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[51]
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023.
-
[52]
Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.
-
[53]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
-
[54]
Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi-modal attention for speech emotion recognition. arXiv preprint arXiv:2009.04107, 2020.
-
[55]
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
-
[56]
Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. In Forty-first International Conference on Machine Learning, 2024.
-
[57]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022.
-
[58]
Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. arXiv preprint arXiv:2412.12087, 2024.
-
[59]
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025.
-
[60]
Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.