Recognition: unknown
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Pith reviewed 2026-05-10 11:45 UTC · model grok-4.3
The pith
Masked logit nudging guides visual autoregressive models to edit images according to a text prompt while leaving unrelated areas unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked logit nudging converts fixed source image encodings into logits via the VAR encoder and nudges the autoregressive model's predicted logits toward these targets only within spatial masks derived from cross-attention differences between source and edited prompts, along with a quantization error correction refinement, leading to state-of-the-art prompt-guided editing performance.
What carries the argument
Masked logit nudging, which aligns model predictions with source token maps inside attention-derived edit masks to follow the target prompt.
If this is right
- Delivers the best image editing performance on the PIE benchmark at both 512px and 1024px resolutions.
- Outperforms prior methods on image reconstruction tasks for COCO at 512px and OpenImages at 1024px.
- Achieves comparable or better results than diffusion models while being substantially faster.
- Outperforms other visual autoregressive approaches in editing and reconstruction quality.
Where Pith is reading between the lines
- Attention map differences may provide a reliable, prompt-based way to localize edits without needing explicit masks.
- The approach could extend to video or 3D autoregressive models for temporal or volumetric editing.
- Speed advantages might enable interactive editing applications where diffusion methods are too slow.
- Refinement for quantization errors could improve general reconstruction in autoregressive image models beyond editing.
Load-bearing premise
The cross-attention difference masking scheme accurately identifies only the image regions that need to change without affecting unrelated areas or missing necessary ones.
What would settle it
Running the method on the PIE benchmark and finding that it does not achieve the highest editing scores at 512px or 1024px resolutions would falsify the performance claim.
Figures
read the original abstract
We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Masked Logit Nudging for prompt-guided image editing in visual autoregressive (VAR) models. It converts source image encodings to logits to nudge predictions under a target prompt along a semantic trajectory, restricts modifications to spatial masks derived from cross-attention map differences between source and edited prompts, and adds a refinement step to correct quantization errors. The central claims are that the method achieves the best editing performance on the PIE benchmark at 512 px and 1024 px, delivers faithful reconstructions, outperforms prior methods on COCO (512 px) and OpenImages (1024 px), surpasses other VAR approaches, and matches or exceeds diffusion models while being substantially faster. Code is released at https://github.com/AmirMaEl/MLN.
Significance. If the performance claims hold after proper quantification and validation, the work would be significant as a faster, autoregressive alternative to diffusion-based editing that preserves unrelated regions via targeted logit nudging. The public code release is a clear strength that aids reproducibility and extension.
major comments (2)
- [Method (masking scheme)] The masking scheme (described in the method section) that computes spatial masks from cross-attention differences between source and target prompts lacks any equation for the difference metric, any procedure for threshold selection, and any ablation or region-specific metrics to confirm it isolates only intended edit regions without leakage or omission. In an autoregressive VAR model, where each token conditions on all prior tokens, this omission is load-bearing for the headline claims of superior PIE performance and parity with diffusion models, as mask inaccuracies would propagate errors through the generation sequence.
- [Experiments and Results] The abstract and results claims assert best-in-class performance on PIE at 512 px / 1024 px, faithful reconstruction, and outperformance on COCO / OpenImages, yet the manuscript provides no quantitative tables, baseline implementations, statistical significance tests, or ablation studies on the masking or nudging components. This directly limits evaluation of the central performance assertions.
minor comments (2)
- The abstract refers to 'a dedicated masking scheme' and 'a refinement to correct quantization errors' without cross-references to the specific subsections or equations where these are formalized.
- Consider expanding the related-work discussion to include prior uses of cross-attention differences for localization in editing or segmentation tasks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We appreciate the acknowledgment of the potential significance of our approach as a faster autoregressive alternative to diffusion-based editing. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and evidence.
read point-by-point responses
-
Referee: [Method (masking scheme)] The masking scheme (described in the method section) that computes spatial masks from cross-attention differences between source and target prompts lacks any equation for the difference metric, any procedure for threshold selection, and any ablation or region-specific metrics to confirm it isolates only intended edit regions without leakage or omission. In an autoregressive VAR model, where each token conditions on all prior tokens, this omission is load-bearing for the headline claims of superior PIE performance and parity with diffusion models, as mask inaccuracies would propagate errors through the generation sequence.
Authors: We agree that the masking scheme requires a more rigorous and explicit formalization to support the performance claims, especially in light of the autoregressive token dependencies. In the revised manuscript, we will add the exact equation defining the cross-attention difference metric used to derive the spatial masks, along with the full procedure for threshold selection (including any percentile-based or validation-driven criteria). We will also incorporate ablation studies and region-specific quantitative metrics (such as edit-region fidelity and background preservation scores) to verify that modifications are isolated without leakage or omission. These changes will directly substantiate the load-bearing role of the masking in achieving the reported results. revision: yes
-
Referee: [Experiments and Results] The abstract and results claims assert best-in-class performance on PIE at 512 px / 1024 px, faithful reconstruction, and outperformance on COCO / OpenImages, yet the manuscript provides no quantitative tables, baseline implementations, statistical significance tests, or ablation studies on the masking or nudging components. This directly limits evaluation of the central performance assertions.
Authors: We acknowledge that the current manuscript version presents performance claims primarily through summary statements and qualitative results without dedicated quantitative tables, explicit baseline details, statistical tests, or component ablations, which hinders full assessment. In the revision, we will add comprehensive tables with standard metrics (e.g., LPIPS, CLIP score, SSIM) comparing against all relevant baselines on PIE at both 512 px and 1024 px, as well as on COCO (512 px) and OpenImages (1024 px). We will document baseline implementations, include statistical significance testing where feasible, and provide targeted ablations on the masking scheme and logit nudging to quantify their individual contributions. These additions will provide the necessary evidence for the central assertions. revision: yes
Circularity Check
No load-bearing circularity; method extends VAR token maps and cross-attention independently
full rationale
The paper introduces Masked Logit Nudging as a guidance mechanism that converts source encodings to logits and applies nudging within masks derived from cross-attention differences between prompts. These steps operate on the existing autoregressive token prediction structure of VAR models without redefining any core quantity in terms of the target performance metric. Benchmark results on PIE, COCO, and OpenImages are reported as empirical outcomes rather than quantities fitted inside the derivation. No self-citation chain or ansatz is invoked to force the central claims; the masking and refinement steps remain externally verifiable against the model's attention maps and quantization process. This yields only a minor self-reference score consistent with normal method papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Visual autoregressive models generate coherent images from discrete token sequences and produce usable cross-attention signals between text and image tokens.
Reference graph
Works this paper leans on
-
[1]
Detecting and mitigating memorization in diffusion models through anisotropy of the log-probability
Rohan Asthana and Vasileios Belagiannis. Detecting and mitigating memorization in diffusion models through anisotropy of the log-probability. InThe Fourteenth Inter- national Conference on Learning Representations, 2026. 3
2026
-
[2]
Ledits++: Limitless image editing using text-to-image models
Manuel Brack, Felix Friedrich, Katharia Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolin´ario Passos. Ledits++: Limitless image editing using text-to-image models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8861–8870, 2024. 2, 3, 5, 7, 8, 9, 11, 12, 13
2024
-
[3]
In- structpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18392–18402, 2023. 3
2023
-
[4]
Extracting training data from diffusion models
Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagiel- ski, Vikash Sehwag, Florian Tram `er, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, Anaheim, CA,
-
[5]
USENIX Association. 3
-
[6]
Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, and Dimitris Metaxas. Discrete noise inversion for next-scale autoregressive text-based image editing.arXiv preprint arXiv:2509.01984, 2025. 3, 7
-
[7]
Turboedit: Text-based image editing using few-step diffusion models
Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, and Daniel Cohen-Or. Turboedit: Text-based image editing using few-step diffusion models. InSIGGRAPH Asia 2024 Con- ference Papers, pages 1–12, 2024. 2, 5, 7, 8, 6, 9, 11, 12
2024
-
[8]
Visual autoregressive modelling for monocular depth estimation
Amir El-Ghoussani, Andr ´e Kaup, Nassir Navab, Gustavo Carneiro, and Vasileios Belagiannis. Visual autoregressive modelling for monocular depth estimation. InProceedings of the 21st International Conference on Computer Vision The- ory and Applications - Volume 3: VISAPP, pages 44–54. IN- STICC, SciTePress, 2026. 3
2026
-
[9]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 3
2021
-
[10]
Depthart: monocular depth estimation as autoregressive refinement task
Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, and Anton Konushin. Depthart: monocular depth estimation as autoregressive refinement task. InProceedings of the Thirty-Fourth International Joint Conference on Arti- ficial Intelligence, pages 1017–1025, 2025. 3
2025
-
[11]
An image is worth one word: Personalizing text-to-image gener- ation using textual inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image gener- ation using textual inversion. InThe Eleventh International Conference on Learning Representations. 3
-
[12]
Renoise: Real image inversion through iterative noising, 2024
Daniel Garibi, Or Patashnik, Andrey V oynov, Hadar Averbuch-Elor, and Daniel Cohen-Or. Renoise: Real image inversion through iterative noising, 2024. 7, 8, 9
2024
-
[13]
Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025. 2, 3, 7, 8, 9, 13
2025
-
[14]
Xiaoxiao He, Ligong Han, Quan Dao, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei- Xu, Chaowei Tan, et al. Dice: Discrete inversion enabling controllable editing for multinomial diffusion and masked generative models.arXiv preprint arXiv:2410.08207, 2024. 7
-
[15]
Prompt-to-prompt image editing with cross-attention control
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. InICLR, 2023. 2, 3, 1
2023
-
[16]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 5
work page internal anchor Pith review arXiv 2022
-
[17]
Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 2, 7
2020
-
[18]
The curious case of neural text degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InInter- national Conference on Learning Representations. 4
-
[19]
Revisiting gradient-based uncertainty for monocular depth estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Julia Hornauer, Amir El-Ghoussani, and Vasileios Belagian- nis. Revisiting gradient-based uncertainty for monocular depth estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3
2025
-
[20]
An edit friendly ddpm noise space: Inversion and manipulations
Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly ddpm noise space: Inversion and manipulations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12469– 12478, 2024. 2, 5, 7, 8, 12
2024
-
[21]
Categorical repa- rameterization with gumbel-softmax
Eric Jang, Shixiang Gu, and Ben Poole. Categorical repa- rameterization with gumbel-softmax. InInternational Con- ference on Learning Representations, 2017. 4
2017
-
[22]
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based edit- ing with 3 lines of code.arXiv preprint arXiv:2310.01506,
-
[23]
2, 3, 5, 7, 8, 9, 12
-
[24]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.Interna- tional journal of computer vision, 128(7):1956–1981, 2020. 8
1956
-
[25]
Flux.https://github.com/ black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 7, 9
2024
-
[26]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 8
2014
-
[27]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, and Yi Jin. Star: Scale-wise text-conditioned autoregressive image generation.arXiv preprint arXiv:2406.10797, 2024. 3
-
[29]
Null-text inversion for editing real im- ages using guided diffusion models
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real im- ages using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023. 2, 3
2023
-
[30]
Swiftedit: Lightning fast text- guided image editing via one-step diffusion
Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, and Cuong Pham. Swiftedit: Lightning fast text- guided image editing via one-step diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 21492–21501, 2025. 9, 10
2025
-
[31]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth Interna- tional Conference on Learning Representations. 7, 8, 9
-
[32]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 4, 8
2021
-
[33]
Dominic Rampas, Pablo Pernias, and Marc Aubreville. A novel sampling scheme for text-and image-conditional im- age synthesis in quantized latent spaces.arXiv preprint arXiv:2211.07292, 2022. 7
-
[34]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 7, 8, 9
2022
-
[35]
Semantic im- age inversion and editing using rectified stochastic differen- tial equations
Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Carama- nis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic im- age inversion and editing using rectified stochastic differen- tial equations. InThe Thirteenth International Conference on Learning Representations. 3, 8, 7, 9, 11
-
[36]
Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, and Rami Ben-Ari. Lightning-fast image inversion and editing for text-to-image diffusion models.arXiv preprint arXiv:2312.12540, 2023. 2
-
[37]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InComputer Vision – ECCV 2024. Springer, 2024. 7
2024
-
[38]
Denois- ing diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations. 8
-
[39]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024. 2
work page internal anchor Pith review arXiv 2024
-
[40]
Hart: Efficient visual generation with hybrid autoregressive transformer
Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Jun- song Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, and Song Han. Hart: Efficient visual generation with hybrid autoregressive transformer. InThe Thirteenth International Conference on Learning Representations. 3, 7
-
[41]
Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural in- formation processing systems, 37:84839–84865, 2024
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Li- wei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural in- formation processing systems, 37:84839–84865, 2024. 2, 3, 4
2024
-
[42]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aure- lien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. Llama: Open and efficient foundation lan- guage models.ArXiv, abs/2302.13971, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Plug-and-play diffusion features for text-driven image-to-image translation
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023. 3
1921
-
[44]
Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 3
2017
-
[45]
Switti: De- signing scale-wise transformers for text-to-image synthesis
Anton V oronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, and Dmitry Baranchuk. Switti: De- signing scale-wise transformers for text-to-image synthesis. arXiv preprint arXiv:2412.01819, 2024. 3, 4, 7, 8, 1
-
[46]
arXiv preprint arXiv:2411.04746 , year=
Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Tam- ing rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024. 3
-
[47]
Training- free text-guided image editing with visual autoregressive model
Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, and Jian Wang. Training-free text-guided image editing with visual autoregressive model.arXiv preprint arXiv:2503.23897, 2025. 2, 3, 7
-
[48]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, Jos ´e Lezama, Nitesh B Gundavarapu, Luca Ver- sari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation.arXiv preprint arXiv:2310.05737, 2023. 3
work page internal anchor Pith review arXiv 2023
-
[49]
Arbitrary-steps image super-resolution via diffusion inver- sion
Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inver- sion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23153–23163, 2025. 7, 8
2025
-
[50]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 7
2018
-
[51]
Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a general- ist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023. 8
-
[52]
Image and video tokenization with binary spherical quantization
Yue Zhao, Yuanjun Xiong, and Philipp Kraehenbuehl. Image and video tokenization with binary spherical quantization. In The Thirteenth International Conference on Learning Repre- sentations. 2 Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models Supplementary Material Supplementary Material This supplementary document provi...
-
[53]
Detailed analysis of the cross-attention–driven edit masks, including quantitative mask–GT compar- isons, threshold sensitivity, and layer/head ablations (Sec. 6.1.2)
-
[54]
Additional comparison and ablations of nudging sched- ule (Sec. 6.2)
-
[55]
Further MLN ablations and hyperparameters (Sec. 6.3)
-
[56]
Extended analysis of quantization errors and the pro- posed quantization refinement procedure (Secs. 6.4)
-
[57]
Details and qualitative samples of the reconstruction ex- periments (Sec. 6.5)
-
[58]
Details and additional qualitative samples of the editing experiments (Sec. 6.6)
-
[59]
Adapted upscaled PIE-benchmark at 1024px (Sec. 6.7)
-
[60]
Recaptioning for reconstruction experiments at 1024px (Sec. 6.8)
-
[61]
Additional qualitative editing samples (Sec. 6.9)
-
[62]
More ablations (Sec. 6.10)
-
[63]
Failure Analysis (Sec. 6.11). 6.1. Cross-attention mask analysis Our masking mechanism follows the attention-based edit- ing philosophy of DDIM inversion and P2P [14], but ap- plies it directly to the cross-attention activations of the V AR transformer, which uses the same multi-head attention structure as GPT-style models. To extract these activations, w...
-
[64]
logit nudging without a mask – no QR
-
[65]
masked regeneration – no QR
-
[66]
[tulip→lionlion]
MLN – with QR. We measure mask IoU against the PIE ground-truth region and report background fidelity. Table 7.Mask–GT agreement and background fidelity (PIE- 512).MLN achieves the strongest localization and background preservation. Method Mask IoU (%)↑PSNR (bg)↑LPIPS (bg)↓CLIP (edit)↑ Logit nudging – 25.8 85.2 24.4 Masked regeneration 57 26.5 79.7 22.2 M...
-
[67]
SWITTI without QR, and
-
[68]
A photo of a [ cat→ dogdog] sitting on a chair
SWITTI with QR. This visualization highlights how QR specifically reduces blocky artifacts and restores sharpness in high-frequency re- gions without introducing over-smoothing (see fig. 14). 6.6. Details and qualitative samples of editing ex- periments In our PIE-Bench editing experiments, we evaluate our method against recent diffusion-based and flow-ba...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.