arxiv: 2604.06748 · v1 · submitted 2026-04-08 · 💻 cs.CV

From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual in-context learning · interactive segmentation · user guidance · DeLVM · scribbles and clicks · controllable vision · object removal

The pith

Encoding user cues like scribbles and clicks into example pairs turns static visual in-context learners into interactive systems that follow new guidance without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual in-context learning models adapt to new tasks from example input-output pairs, but they stay static: they cannot accept user signals such as clicks or boxes to steer results. The paper reports that drawing those signals into the demonstration pairs lets the model treat them as general instructions, so it responds to unseen interactions on the fly. Standard models ignore the added cues, while the adapted version improves interactive segmentation by 7.95% IoU, directed super-resolution by 2.46 PSNR, and interactive object removal by 3.14% LPIPS (a reduction, since lower LPIPS is better). This preserves the no-fine-tuning approach of in-context learning while giving users direct control over predictions in segmentation, editing, and pose tasks.
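For readers less familiar with the reported metrics, here is a minimal NumPy sketch of IoU and PSNR. The function names `iou` and `psnr`, the convention that two empty masks score 1.0, and the 8-bit peak value of 255 are assumptions for illustration; the paper's exact evaluation protocol is not given on this page. LPIPS relies on a learned network and is not sketched here.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks (higher is better)."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

def psnr(image_a, image_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images (higher is better)."""
    a = np.asarray(image_a, dtype=np.float64)
    b = np.asarray(image_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Both functions compare a single prediction with a single reference; a benchmark score would average them over a test set.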

Core claim

By encoding interactions directly into the example input-output pairs, static visual in-context learners such as DeLVM become controllable systems that accept natural visual cues like scribbles, clicks, or bounding boxes; the model then generalizes to unseen interactions without task-specific fine-tuning and lets users dynamically steer predictions for personalized outcomes.

What carries the argument

Encoding user visual cues (scribbles, clicks, bounding boxes) directly into the demonstration input-output pairs so the in-context learner treats them as reusable guidance signals.
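A minimal sketch of what such an encoding could look like, assuming cues are alpha-blended straight into the RGB input images (the Figure 2 caption speaks of "blended interactions"). The helper names `blend_box`, `blend_click`, and `build_prompt`, the colors, line width, and radius are all hypothetical choices, not the authors' implementation. The page notes that the model operates in token space; that tokenization step is omitted here.

```python
import numpy as np

def blend_box(image, box, color=(255, 0, 0), width=2, alpha=1.0):
    """Return a copy of an (H, W, 3) uint8 image with a box outline blended in."""
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x0 + width] = True   # left edge
    mask[y0:y1, x1 - width:x1] = True   # right edge
    mask[y0:y0 + width, x0:x1] = True   # top edge
    mask[y1 - width:y1, x0:x1] = True   # bottom edge
    out = image.astype(np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.round().astype(np.uint8)

def blend_click(image, center, radius=4, color=(0, 255, 0), alpha=1.0):
    """Return a copy of the image with a filled disc at center = (x, y)."""
    cx, cy = center
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    out = image.astype(np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.round().astype(np.uint8)

def build_prompt(context, query_image, query_box):
    """Interleave cue-blended inputs with their targets, then append the cued query.

    `context` is a list of (image, box, target) triples.
    """
    sequence = []
    for image, box, target in context:
        sequence.append(blend_box(image, box))
        sequence.append(target)
    sequence.append(blend_box(query_image, query_box))
    return sequence
```

With `alpha=1.0` the cue overwrites the pixels beneath it; a smaller `alpha` keeps part of the underlying image visible at the cost of a fainter cue.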

Load-bearing premise

The model will interpret the drawn user cues in the example pairs as generalizable instructions instead of noise or task-specific artifacts.

What would settle it

The claim would be undermined if the adapted model, run on new test cases with user cues present in the examples, showed no improvement over the static baseline and did not follow the provided cues.
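One way to make the "no improvement over the static baseline" half of this check concrete is a paired comparison of per-sample scores. The sketch below uses a percentile bootstrap over paired differences; the names `paired_gain` and `claim_is_challenged`, the 95% interval, and the number of resamples are illustrative assumptions and not the paper's protocol. It assumes a higher-is-better score such as IoU.

```python
import numpy as np

def paired_gain(adapted, baseline, n_boot=2000, seed=0):
    """Mean per-sample gain of adapted over baseline with a 95% bootstrap interval."""
    adapted = np.asarray(adapted, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    if adapted.ndim != 1 or adapted.shape != baseline.shape:
        raise ValueError("expected two 1-D score arrays of equal length")
    diff = adapted - baseline
    rng = np.random.default_rng(seed)
    # Resample test cases with replacement and recompute the mean gain each time.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return float(diff.mean()), float(lo), float(hi)

def claim_is_challenged(adapted, baseline):
    """True when the interval for the gain does not lie strictly above zero."""
    _, lo, _ = paired_gain(adapted, baseline)
    return lo <= 0.0
```

A full check would also measure whether predictions change when the cue changes, which needs access to the model itself and is not sketched here.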

Figures

Figures reproduced from arXiv: 2604.06748 by Carlos Schmidt, Simon Reiß.

Figure 1: Test-time adaptation to new interactions enables following user preferred interaction signals
Figure 2: Top: Example context sets, queries and outputs for interactive tasks with blended interactions (e.g., bounding boxes) in input images. Bottom: Processing in i-DeLVM. Training with the context set, query and output is conducted in token space.
Figure 3: Reconstruction quality of bounding boxes and clicks of different line widths and side …
Figure 4: Qualitative results of our i-DeLVM and baseline models for interactive segmentation, interactive object removal and interactive human pose estimation for diverse interaction types (best viewed zooming in, context sets omitted for brevity).
Figure 5: Qualitative results of our i-DeLVM and baseline models for interactive super-resolution of a chair and desk. In …
Figure 6: Distribution of top 10 tokens for white/black segmentation maps compared to colored …
Figure 7: Distribution of top 10 tokens for pose maps on the MPII dataset.
Figure 8: Distribution of top 10 tokens for interactive super-resolution on the ADE20K dataset.
Figure 9: Distribution of top 10 tokens for interactive object removal on the RORD images.
Figure 10: Example pairs from the different datasets for each task with different interaction cues
Figure 11: Object removal success and failure cases. For each pair, the images to the left are the …
Figure 12: Throughout the training steps, a higher masking ratio leads to higher token accuracy on …
Figure 13: Qualitative interactive segmentation results from the main paper in higher resolution for …
Figure 14: Qualitative interactive object removal results from the main paper in higher resolution for …
Figure 15: Qualitative interactive pose estimation results from the main paper in higher resolution for …
Figure 16: Qualitative interactive super-resolution results from the main paper in higher resolution
Original abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that static visual in-context learning models (particularly DeLVM) can be transformed into interactive, user-driven systems by directly encoding user guidance signals such as scribbles, clicks, or bounding boxes into the example input-output pairs. This preserves the core in-context learning philosophy, allowing prompt-based adaptation to unseen interactions without fine-tuning. Experiments reportedly show that unmodified SOTA models ignore such cues, while the proposed Interactive DeLVM yields gains of +7.95% IoU on interactive segmentation, +2.46 PSNR on directed super-resolution, and -3.14% LPIPS on interactive object removal.

Significance. If the central claim holds, the work would meaningfully extend visual in-context learning toward practical, controllable applications by enabling users to steer predictions with personalized visual cues. The simplicity of the encoding approach and the reported quantitative improvements are potential strengths, as they suggest a lightweight way to add interactivity while avoiding task-specific retraining.

major comments (2)
  1. [Abstract] Abstract: the central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.
  2. [Method and Experiments] Method and Experiments sections: unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.
minor comments (2)
  1. [Abstract] Abstract: the quantitative claims would be clearer if the number of in-context examples, the exact DeLVM variant, and the evaluation protocol (e.g., how user cues are simulated at test time) were briefly stated.
  2. [Figures] Figures: ensure that example visualizations of encoded input-output pairs clearly distinguish the added cues from the original image content so readers can assess the encoding strategy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating revisions made to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.

    Authors: We agree that the abstract could provide more specifics to support verification of the central claim. In the revised version, we have expanded the abstract to briefly describe the encoding approach, note the typical number of in-context examples (2-4), reference the baselines (unmodified SOTA models), and highlight the quantitative gains. Detailed evidence that the improvements stem from transferable cue semantics—rather than overfitting—is presented through direct comparisons in the Experiments section, where unmodified models ignore the cues entirely while our method leverages them effectively. Error analysis is included in the supplementary material and main experiments. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.

    Authors: The referee correctly notes the absence of explicit robustness ablations in the original submission. Our experiments already cover multiple interaction types (scribbles, clicks, bounding boxes) and demonstrate that unmodified models ignore cues while ours succeeds, supporting that the signal overrides the tendency to treat marks as extraneous. We have added new ablations in the revised manuscript testing variations in cue density and style, including some differing from primary examples. These results indicate the model interprets cues as guidance. While exhaustive coverage of every possible new configuration is not feasible, the in-context learning paradigm is designed for generalization to unseen prompts, and our findings are consistent with this. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical encoding with no self-referential derivation

Full rationale

The paper proposes encoding user interactions (scribbles/clicks/boxes) directly into static in-context example pairs to adapt models like DeLVM for interactive use. This is a straightforward augmentation technique with no equations, fitted parameters, or load-bearing self-citations in the provided text. The central claim rests on empirical experiments showing baseline models ignore cues while the augmented pairs enable control, without any reduction of predictions to inputs by construction or imported uniqueness theorems. The assumption that models will generalize from the encoded cues is an empirical hypothesis, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are introduced in the abstract; the method is presented as a simple encoding change on top of existing models.

pith-pipeline@v0.9.0 · 5609 in / 1039 out tokens · 28754 ms · 2026-05-10T18:46:04.895764+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

  1. M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.

  2. Y. Bai, X. Geng, K. Mangalam, A. Bar, A. L. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.

  3. A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005–25017, 2022.

  4. D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. icoseg: Interactive co-segmentation with intelligent scribble guidance. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3169–3176. IEEE, 2010.

  5. L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

  6. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

  7. T. Chen, Y. Liu, Z. Wang, J. Yuan, Q. You, H. Yang, and M. Zhou. Improving in-context learning in diffusion models with visual context-modulated prompts. arXiv preprint arXiv:2312.01408, 2023.

  8. S. Cheng, H. Sun, T. Xie, H. Zhao, Y. Chen, B. Xu, and X. Li. Spt: Sequence prompt transformer for interactive image segmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

  9. S. Czolbe and A. V. Dalca. Neuralizer: General neuroimage analysis without re-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6217–6230, 2023.

  10. P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.

  11. M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.

  12. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  13. J. Guo, Z. Hao, C. Wang, Y. Tang, H. Wu, H. Hu, K. Han, and C. Xu. Data-efficient large vision models through sequential autoregression. arXiv preprint arXiv:2402.04841, 2024.

  14. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  15. W.-D. Jang and C.-S. Kim. Interactive image segmentation via backpropagating refinement scheme. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5297–5306, 2019.

  16. L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36:29914–29934, 2023.

  17. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

  18. T. Kontogianni, M. Gygli, J. Uijlings, and V. Ferrari. Continuous adaptation for interactive object segmentation by learning from corrections. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 579–596. Springer, 2020.

  19. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014.

  20. J. Liu, H. Wang, W. Yin, J.-J. Sonke, and E. Gavves. Click prompt learning with optimal transport for interactive segmentation. In European Conference on Computer Vision, pages 93–110. Springer, 2024.

  21. Q. Liu, M. Zheng, B. Planche, S. Karanam, T. Chen, M. Niethammer, and Z. Wu. Pseudoclick: Interactive image segmentation with click imitation. In European Conference on Computer Vision, pages 728–745. Springer, 2022.

  22. Q. Liu, J. Cho, M. Bansal, and M. Niethammer. Rethinking interactive image segmentation with low latency high quality and diverse prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3773–3782, 2024.

  23. S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. arXiv preprint arXiv:1805.04398, 2018.

  24. Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen. Deep interactive segmentation of medical images: A systematic review and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 2024.

  25. Z. Marinov, M. Kim, J. Kleesiek, and R. Stiefelhagen. Rethinking annotator simulation: Realistic evaluation of whole-body pet lesion interactive segmentation methods. arXiv preprint arXiv:2404.01816, 2024.

  26. A. Negrini and S. Reiß. Conquering the retina: Bringing visual in-context learning to oct, 2025. URL https://arxiv.org/abs/2506.15200

  27. F. Nielsen and R. Nock. Clickremoval: Interactive pinpoint image object removal. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 315–318, 2005.

  28. V. Raunak, S. Dalmia, V. Gupta, and F. Metze. On long-tailed phenomena in neural machine translation. arXiv preprint arXiv:2010.04924, 2020.

  29. N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714

  30. S. Reiß, C. Seibold, A. Freytag, E. Rodner, and R. Stiefelhagen. Every annotation counts: Multi-label deep supervision for medical image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9532–9542, 2021.

  31. S. Reiß, Z. Marinov, A. Jaus, C. Seibold, M. S. Sarfraz, E. Rodner, and R. Stiefelhagen. Is visual in-context learning for compositional medical tasks within reach? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2642–2652, October 2025.

  32. S. Ren, X. Huang, X. Li, J. Xiao, J. Mei, Z. Wang, A. Yuille, and Y. Zhou. Medical vision generalist: Unifying medical imaging tasks in context. arXiv preprint arXiv:2406.05565, 2024.

  33. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  34. C. Rother, V. Kolmogorov, and A. Blake. "grabcut" interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG), 23(3):309–314, 2004.

  35. M.-C. Sagong, Y.-J. Yeo, S.-W. Jung, and S.-J. Ko. Rord: A real-world object removal dataset. In BMVC, page 542, 2022.

  36. A. Shtedritski, C. Rupprecht, and A. Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023.

  37. K. Sofiiuk, I. Petrov, O. Barinova, and A. Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.

  38. K. Sofiiuk, I. A. Petrov, and A. Konushin. Reviving iterative training with mask guidance for interactive segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3141–3145. IEEE, 2022.

  39. G. Song, H. Myeong, and K. M. Lee. Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1760–1768, 2018.

  40. A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.

  41. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

  42. X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.

  43. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

  44. Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, M. Zhou, et al. In-context learning unlocked for diffusion models. Advances in Neural Information Processing Systems, 36:8542–8562, 2023.

  45. N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373–381, 2016.

  46. J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.

  47. L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023.

  48. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  49. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

  50. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.

A Datasets

We operate on the datasets PASCAL VOC (11) for interactive segmentation, the Real-world Object Removal dataset (35), the MPII dataset ...