
arxiv: 2606.07171 · v1 · pith:RXOEIORSnew · submitted 2026-06-05 · 💻 cs.CV

When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

Pith reviewed 2026-06-27 22:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords surrogate privacy · MLLM image editing · edit recovery · privacy preservation · editability assessment · multimodal benchmark · source integrity · InstructPix2Pix

The pith

Surrogate privacy methods for MLLM image editing produce edited surrogates instead of recovered source images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard approaches to protecting privacy in multimodal large language model image editing replace sensitive regions with surrogates before cloud processing, but this leaves the final output as an edited surrogate rather than the desired edited version of the private source. Prior work has not addressed the recovery step in either benchmark design or evaluation. To close this gap the authors create SPPE, a recovery-oriented benchmark spanning 36 fine-grained privacy categories and 65 editing instructions, and define two tasks: predicting whether a surrogate will produce an edit consistent with the original image, and transferring the edit effect back from the edited surrogate to the private source while keeping source integrity. They supply ERMA for the first task and C2E-S2SER for the second, reporting measurable gains over baselines on both tasks.

Core claim

The central claim is that surrogate-based privacy protection in MLLM editing has neglected local recovery, and that a dedicated benchmark, SPPE, together with two methods can remedy this. ERMA predicts surrogate editability via instruction-aware multimodal relation modeling. C2E-S2SER performs cycle-consistent recovery by treating the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. In experiments on SPPE and InstructPix2Pix, ERMA improves SRCC by 13.9% and PLCC by 12.3% over the best-performing baselines, while C2E-S2SER beats SOER on all eight source-integrity and edit-consistency metrics on SPPE.
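
For readers unfamiliar with the two correlation scores quoted above, the following is a minimal sketch of how SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) are conventionally computed between predicted and reference scores. The data values are hypothetical and the helper names are illustrative; nothing here is taken from the paper's evaluation code.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: predicted editability scores vs. reference scores.
predicted = [0.10, 0.35, 0.40, 0.80, 0.95]
reference = [0.05, 0.30, 0.55, 0.60, 0.99]

srcc = spearman(predicted, reference)  # depends only on the ordering
plcc = pearson(predicted, reference)   # also sensitive to linearity
```

SRCC rewards getting the ordering of surrogates right, while PLCC also penalizes non-linear score scales, which is why the two are usually reported together.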

What carries the argument

SPPE benchmark defining editability assessment and surrogate-to-source edit recovery tasks; ERMA for instruction-aware multimodal relation modeling; C2E-S2SER for cycle-consistent recovery that anchors on the source image while using the surrogate pair as edit evidence.

If this is right

  • Editability can be estimated before any cloud interaction, avoiding unnecessary transmission of private images that cannot be edited consistently.
  • Edited surrogates can be mapped back to private sources such that the edit effect is retained and source content is not altered beyond the intended change.
  • The two-task split allows separate optimization of prediction accuracy and recovery fidelity rather than treating privacy protection as a single end-to-end process.
  • Consistent gains on SRCC, PLCC and the eight integrity-consistency metrics indicate that surrogate pairs carry transferable edit information when modeled explicitly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If recovery is omitted, users may receive final images whose visual content deviates from the cloud result in ways that affect usability or intent.
  • The pre-cloud editability check could be inserted into existing MLLM pipelines to decide automatically whether surrogate substitution is safe for a given instruction.
  • The cycle-consistent anchoring approach may generalize to other privacy mechanisms that replace or obscure parts of an image before remote processing.
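
The pre-cloud editability check mentioned above could be sketched as a simple gate in front of the upload step. This is an illustration under stated assumptions: `EditRequest`, `should_send_to_cloud`, the 0.5 threshold, and the toy scorer are all hypothetical names and values, and the scorer stands in for a real predictor such as ERMA, whose interface the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EditRequest:
    surrogate_id: str   # identifier of the privacy-safe surrogate image
    instruction: str    # the user's editing instruction


def should_send_to_cloud(
    request: EditRequest,
    predict_editability: Callable[[EditRequest], float],
    threshold: float = 0.5,
) -> bool:
    """Return True only if the predicted editability clears the threshold."""
    score = predict_editability(request)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"editability score out of range: {score}")
    return score >= threshold


def toy_predictor(request: EditRequest) -> float:
    # Stand-in scorer for illustration only; a real system would call a
    # learned editability model here.
    return 0.9 if "background" in request.instruction else 0.2


ok = should_send_to_cloud(EditRequest("s1", "blur the background"), toy_predictor)
blocked = should_send_to_cloud(EditRequest("s2", "change the expression"), toy_predictor)
```

Passing the predictor in as a callable keeps the gate independent of any particular model, so the threshold can be tuned separately from the scorer.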

Load-bearing premise

The edited surrogate pair supplies reliable visual evidence of the intended edit that can be transferred back to the private source image while preserving both source integrity and edit consistency.
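
To make the premise concrete, here is a toy additive model of edit transfer with a cycle check. It is not C2E-S2SER, whose formulation is not given in the abstract: images are flat lists of intensities in [0, 1], the "edit evidence" is just the per-pixel difference of the surrogate pair, and the function names are hypothetical.

```python
def transfer_edit(source, surrogate, edited_surrogate):
    """Apply the surrogate pair's per-pixel change to the source, clipped to [0, 1]."""
    return [
        min(1.0, max(0.0, s + (e - g)))
        for s, g, e in zip(source, surrogate, edited_surrogate)
    ]


def cycle_error(source, surrogate, edited_surrogate):
    """Mean absolute gap between the source and its forward-then-backward reconstruction."""
    recovered = transfer_edit(source, surrogate, edited_surrogate)
    # Backward pass: undo the edit by swapping the roles of the surrogate pair.
    reconstructed = transfer_edit(recovered, edited_surrogate, surrogate)
    return sum(abs(a - b) for a, b in zip(source, reconstructed)) / len(source)


surrogate = [0.3, 0.5, 0.4]
edited_surrogate = [0.5, 0.5, 0.6]

# The edit fits within the valid range, so the cycle closes.
clean = cycle_error([0.2, 0.5, 0.6], surrogate, edited_surrogate)

# The last pixel saturates at 1.0, so information is lost and the cycle does not close.
lossy = cycle_error([0.2, 0.5, 0.9], surrogate, edited_surrogate)
```

Even in this toy setting the cycle only closes when the transferred edit is compatible with the source content, which is the kind of failure the premise has to rule out.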

What would settle it

A controlled test in which C2E-S2SER, applied to a new set of source-surrogate pairs, fails to exceed SOER on at least one of the eight source-integrity or edit-consistency metrics.
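
Such a test comes down to counting per-metric wins while respecting each metric's direction. The sketch below uses placeholder metric names and made-up values, since the abstract does not name the eight metrics; `metrics_won` and `beats_on_all` are hypothetical helpers.

```python
def metrics_won(candidate, baseline, higher_is_better):
    """Return the names of metrics on which the candidate strictly beats the baseline."""
    won = []
    for name, higher in higher_is_better.items():
        c, b = candidate[name], baseline[name]
        if (c > b) if higher else (c < b):
            won.append(name)
    return won


def beats_on_all(candidate, baseline, higher_is_better):
    """True only if the candidate wins on every listed metric."""
    return len(metrics_won(candidate, baseline, higher_is_better)) == len(higher_is_better)


# Placeholder metric names and made-up values, for illustration only.
higher_is_better = {"similarity_a": True, "similarity_b": True, "distance_c": False}
candidate = {"similarity_a": 0.82, "similarity_b": 0.91, "distance_c": 0.12}
baseline = {"similarity_a": 0.79, "similarity_b": 0.88, "distance_c": 0.15}

all_wins = beats_on_all(candidate, baseline, higher_is_better)
```

Reporting the list of won metrics rather than a single boolean distinguishes a narrow miss on one metric from a failure across the board.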

Original abstract

Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, neglecting the local recovery in both design and evaluation scope. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the SPPE benchmark for surrogate-based privacy-preserving editing in MLLMs, spanning 36 fine-grained privacy categories and 65 editing instructions. It defines two tasks—editability assessment (via ERMA, using instruction-aware multimodal relation modeling) and surrogate-to-source edit recovery (via C2E-S2SER, using cycle-consistent recovery with the surrogate pair as visual evidence and the source as anchor)—and reports that ERMA improves SRCC by 13.9% and PLCC by 12.3% over baselines while C2E-S2SER outperforms SOER on all 8 source integrity and edit consistency metrics on SPPE (and shows gains on InstructPix2Pix).

Significance. If the transfer step in C2E-S2SER is validated, the work fills a genuine gap by shifting focus from surrogate substitution alone to post-editing recovery of the private source, which is load-bearing for practical privacy pipelines in instruction-driven MLLM editing. The benchmark itself is a clear contribution as the first recovery-oriented resource in this setting.

major comments (1)
  1. [Abstract] Surrogate-to-source edit recovery task description: The headline claim that C2E-S2SER outperforms SOER across all 8 metrics rests on the unverified assumption that edit effects encoded in the surrogate editing pair (after region substitution) transfer reliably to the private source without semantic drift or integrity loss across the 36 categories. The abstract references no loss formulations, cycle-consistency equations, or ablations isolating this transfer step, so the central recovery claim depends on an assumption whose validity is not demonstrated.
minor comments (1)
  1. [Abstract] Experimental design details (baseline implementations, statistical significance testing, data selection criteria, and potential confounds) are absent, which prevents immediate assessment of the reported metric gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comment regarding the abstract and the validation of the transfer step in C2E-S2SER.

Point-by-point responses
  1. Referee: [Abstract] Surrogate-to-source edit recovery task description: The headline claim that C2E-S2SER outperforms SOER across all 8 metrics rests on the unverified assumption that edit effects encoded in the surrogate editing pair (after region substitution) transfer reliably to the private source without semantic drift or integrity loss across the 36 categories. The abstract references no loss formulations, cycle-consistency equations, or ablations isolating this transfer step, so the central recovery claim depends on an assumption whose validity is not demonstrated.

    Authors: The abstract is a concise summary and therefore omits detailed equations. The cycle-consistency loss formulations and equations for C2E-S2SER are defined in Section 3.2, with the surrogate pair serving as visual evidence and the source as anchor. Ablations isolating the transfer step, including checks for semantic drift and integrity across all 36 categories, appear in Section 4.3 and the supplement. The consistent gains on all 8 metrics over SOER on SPPE (and InstructPix2Pix) supply empirical support for reliable transfer. We will revise the abstract to explicitly note the cycle-consistent formulation.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces the SPPE benchmark and two methods (ERMA for editability assessment via instruction-aware multimodal relation modeling, and C2E-S2SER for cycle-consistent surrogate-to-source recovery) without any equations, mathematical derivations, fitted parameters presented as predictions, or self-referential definitions. Claims of improvement (e.g., 13.9% SRCC on editability, outperformance on 8 metrics for recovery) are empirical evaluations against baselines on SPPE and InstructPix2Pix. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are described. The chain consists of standard multimodal modeling and benchmark evaluation and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5823 in / 1184 out tokens · 30904 ms · 2026-06-27T22:07:13.328687+00:00 · methodology

