From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Carlos Schmidt; Simon Reiß

arxiv: 2604.06748 · v1 · submitted 2026-04-08 · 💻 cs.CV

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual in-context learning · interactive segmentation · user guidance · DeLVM · scribbles and clicks · controllable vision · object removal

The pith

Encoding user cues like scribbles and clicks into example pairs turns static visual in-context learners into interactive systems that follow new guidance without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual in-context learning models adapt to new tasks from example input-output pairs but stay static and cannot accept user signals such as clicks or boxes to steer results. The paper shows that drawing those signals into the demonstration pairs lets the model treat them as general instructions, so it responds to unseen interactions on the fly. Standard models ignore the added cues, while the adapted version improves interactive segmentation by 7.95% IoU and directed super-resolution by 2.46 PSNR, and lowers LPIPS for interactive object removal by 3.14% (lower LPIPS is better). This preserves the core no-fine-tuning approach of in-context learning while giving users direct control over predictions in segmentation, editing, and pose tasks.
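For readers less familiar with the metrics quoted above, the following is a minimal sketch of IoU and PSNR on small nested-list images. It assumes binary masks for IoU and grayscale intensities in [0, 255] for PSNR; the function names and the empty-mask convention are illustrative choices, not the paper's evaluation code. LPIPS relies on a learned network and is not sketched here.

```python
import math
from typing import Sequence


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def psnr(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two grayscale images."""
    squared_error = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            squared_error += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better for both quantities, which is why the LPIPS change is reported with a negative sign while the IoU and PSNR changes are positive.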

Core claim

By encoding interactions directly into the example input-output pairs, static visual in-context learners such as DeLVM become controllable systems that accept natural visual cues like scribbles, clicks, or bounding boxes; the model then generalizes to unseen interactions without task-specific fine-tuning and lets users dynamically steer predictions for personalized outcomes.
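As a rough illustration of how cue-bearing example pairs could reach the model, the sketch below flattens tokenized input-output pairs and a query into one sequence. The Figure 2 caption only states that the context set, query, and output are handled in token space; the ordering used here, the function name `build_sequence`, and the toy token ids are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Tokens = List[int]


def build_sequence(context: Sequence[Tuple[Tokens, Tokens]],
                   query: Tokens) -> Tuple[Tokens, int]:
    """Concatenate (input, output) token pairs, then the query tokens.

    Returns the flat sequence and the index at which a model would start
    generating the output for the query.
    """
    sequence: Tokens = []
    for cue_input, target in context:
        sequence.extend(cue_input)  # input image tokens, cue already drawn in
        sequence.extend(target)     # desired output tokens for that input
    sequence.extend(query)          # query image tokens, with the user's cue
    return sequence, len(sequence)


# Hypothetical token ids; a real image tokenizer emits many tokens per image.
context = [([1, 2], [10, 11]), ([3, 4], [12, 13])]
sequence, start = build_sequence(context, [5, 6])
```

The point of the sketch is that nothing beyond the prompt changes: the cue travels inside the input tokens of each pair, so swapping in a new interaction type only means swapping the example pairs.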

What carries the argument

Encoding user visual cues (scribbles, clicks, bounding boxes) directly into the demonstration input-output pairs so the in-context learner treats them as reusable guidance signals.
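A minimal sketch of what drawing a cue into an input image could look like, assuming images are nested lists of RGB tuples and the cue is an axis-aligned box outline. The function name `blend_box`, the red color, and the default line width are hypothetical choices, not details taken from the paper.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def blend_box(image: Image, top: int, left: int, bottom: int, right: int,
              color: Pixel = (255, 0, 0), width: int = 1) -> Image:
    """Return a copy of `image` with a box outline drawn over it.

    Coordinates are inclusive pixel indices. Pixels under the outline are
    overwritten, so the original content at those positions is lost.
    """
    out = [row[:] for row in image]  # copy each row; pixel tuples are immutable
    height, img_width = len(out), len(out[0])
    for y in range(max(top, 0), min(bottom, height - 1) + 1):
        for x in range(max(left, 0), min(right, img_width - 1) + 1):
            on_border = (y - top < width or bottom - y < width or
                         x - left < width or right - x < width)
            if on_border:
                out[y][x] = color
    return out


# Hypothetical 6x6 gray image with a box around rows 1..4, columns 1..4.
gray = [[(128, 128, 128)] * 6 for _ in range(6)]
cued = blend_box(gray, 1, 1, 4, 4)
```

Because the outline replaces pixels rather than adding a separate channel, thicker lines make the cue more visible at the cost of hiding more of the image, which is the trade-off the Figure 2 and Figure 3 captions point to.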

Load-bearing premise

The model will interpret the drawn user cues in the example pairs as generalizable instructions instead of noise or task-specific artifacts.

What would settle it

The claim would be refuted if the adapted model, run on new test cases with user cues present in the examples, showed no improvement over the static baseline and did not follow the provided cues.
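One way to operationalize this check is a paired comparison of per-sample scores from the static baseline and the adapted model on the same cue-bearing prompts. The sketch below is an assumption about how such a comparison might be set up; the function name and the example scores are invented for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Sequence, Tuple


def paired_gain(baseline: Sequence[float],
                adapted: Sequence[float]) -> Tuple[float, float]:
    """Return (mean per-sample gain, fraction of samples where adapted wins)."""
    if len(baseline) != len(adapted) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(adapted, baseline)]
    return mean(diffs), sum(d > 0 for d in diffs) / len(diffs)


# Hypothetical per-image IoU scores, invented for illustration only.
gain, win_rate = paired_gain([0.50, 0.60, 0.70, 0.40],
                             [0.55, 0.60, 0.80, 0.50])
```

A mean gain near zero together with a win rate near one half would be the pattern that undermines the claim; a full analysis would also need a significance test and a direct measure of whether outputs track the cue location.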

Figures

Figures reproduced from arXiv: 2604.06748 by Carlos Schmidt, Simon Reiß.

**Figure 2.** Top: Example context sets, queries and outputs for interactive tasks with blended interactions (e.g., bounding boxes) in input images. Bottom: Processing in i-DeLVM. Training with the context set, query and output is conducted in token space.

From Section 3.3, Visual Interactions and Analysis: While generic, efficient and conceptually simple, blending images and interactions also overwrites visual information from cin (…

**Figure 3.** Reconstruction quality of bounding boxes and clicks of different line widths and side …

**Figure 4.** Qualitative results of our i-DeLVM and baseline models for interactive segmentation, interactive object removal and interactive human pose estimation for diverse interaction types (best viewed zooming in, context sets omitted for brevity).

From the accompanying text: … context set, and follow the segmentation setting from their training, i.e., segmenting all objects, disregarding the context. i-DeLVMUnseen produces a segmentation direc…

**Figure 5.** Qualitative results of our i-DeLVM and baseline models for interactive super-resolution of a chair and desk. In …

**Figure 6.** Distribution of top 10 tokens for white/black segmentation maps compared to colored …

**Figure 7.** Distribution of top 10 tokens for pose maps on the MPII dataset.

**Figure 8.** Distribution of top 10 tokens for interactive super-resolution on the ADE20K dataset.

**Figure 9.** Distribution of top 10 tokens for interactive object removal on the RORD images.

**Figure 10.** Example pairs from the different datasets for each task with different interaction cues …

**Figure 11.** Object removal success and failure cases. For each pair, the images to the left are the …

**Figure 12.** Throughout the training steps, a higher masking ratio leads to higher token accuracy on …

**Figure 13.** Qualitative interactive segmentation results from the main paper in higher resolution for …

**Figure 14.** Qualitative interactive object removal results from the main paper in higher resolution for …

**Figure 15.** Qualitative interactive pose estimation results from the main paper in higher resolution for …

**Figure 16.** Qualitative interactive super-resolution results from the main paper in higher resolution …
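Figures 6 to 9 report distributions of the ten most frequent tokens per task. A statistic of that kind can be sketched as follows, assuming each image has already been mapped to a flat list of integer token ids; the function name `top_k_token_share` and the toy data are hypothetical, and the paper's exact procedure may differ.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def top_k_token_share(token_maps: Iterable[Sequence[int]],
                      k: int = 10) -> List[Tuple[int, float]]:
    """Return the k most frequent token ids with their share of all tokens."""
    counts = Counter(tok for tokens in token_maps for tok in tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(tok, n / total) for tok, n in counts.most_common(k)]


# Two hypothetical tokenized output maps with four tokens each.
maps = [[5, 5, 5, 7], [5, 7, 9, 9]]
shares = top_k_token_share(maps, k=2)
```

A distribution dominated by a handful of tokens, as one would expect for near-binary segmentation maps, means a model can reach high token accuracy by predicting the majority token, which is worth keeping in mind when reading token-level accuracy curves such as the one in Figure 12.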

Original abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Encoding user cues into in-context example pairs turns static visual models interactive without retraining, but the gains rest on the untested assumption that the base model will treat added marks as general rules rather than noise.

The paper's main contribution is a straightforward encoding trick: the authors draw user scribbles, clicks, or boxes straight into the input-output example pairs fed to models like DeLVM. This keeps the original in-context setup intact while letting users steer outputs on tasks such as segmentation, directed super-resolution, and object removal. They first show that unmodified SOTA models ignore these cues entirely, then report gains from their version, including nearly 8 points IoU on interactive segmentation plus smaller lifts on the other two tasks. The simplicity is the real strength here: no fine-tuning, no architecture changes, just better prompt construction that matches how these models already work. It directly addresses a practical gap for anyone who wants controllable editing or annotation tools built on existing in-context learners.

The soft spots sit in the mechanism and the evidence. The base models were trained only on clean static pairs, so nothing in their pretraining ensures they will abstract the added marks into transferable guidance instead of treating them as extra visual content or artifacts. The abstract itself flags that unmodified models fail on cues, which makes the success depend on the encoding being robust enough to override that tendency. The reported numbers are positive but lack detail on variance, exact baselines, or how performance holds when cues vary in style or density.

This is the kind of work that matters for people building user-facing vision systems who already use in-context models. It deserves peer review so the experimental setup and generalization tests can be examined in full.

Referee Report

2 major / 2 minor

Summary. The paper claims that static visual in-context learning models (particularly DeLVM) can be transformed into interactive, user-driven systems by directly encoding user guidance signals such as scribbles, clicks, or bounding boxes into the example input-output pairs. This preserves the core in-context learning philosophy, allowing prompt-based adaptation to unseen interactions without fine-tuning. Experiments reportedly show that unmodified SOTA models ignore such cues, while the proposed Interactive DeLVM yields gains of +7.95% IoU on interactive segmentation, +2.46 PSNR on directed super-resolution, and -3.14% LPIPS on interactive object removal.

Significance. If the central claim holds, the work would meaningfully extend visual in-context learning toward practical, controllable applications by enabling users to steer predictions with personalized visual cues. The simplicity of the encoding approach and the reported quantitative improvements are potential strengths, as they suggest a lightweight way to add interactivity while avoiding task-specific retraining.

major comments (2)

[Abstract] The central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.
[Method and Experiments] Unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.

minor comments (2)

[Abstract] The quantitative claims would be clearer if the number of in-context examples, the exact DeLVM variant, and the evaluation protocol (e.g., how user cues are simulated at test time) were briefly stated.
[Figures] Ensure that example visualizations of encoded input-output pairs clearly distinguish the added cues from the original image content so readers can assess the encoding strategy.
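On the question of how user cues are simulated at test time, a common practice in interactive segmentation is to derive synthetic clicks from the ground-truth mask. The sketch below shows one such scheme, a single click at the foreground pixel nearest the mask centroid. It is an assumption for illustration only; the source does not say which simulation protocol the paper uses, and `simulate_click` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def simulate_click(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int]]:
    """Pick the foreground pixel closest to the mask centroid as (row, col)."""
    foreground = [(y, x) for y, row in enumerate(mask)
                  for x, value in enumerate(row) if value]
    if not foreground:
        return None  # nothing to click on
    cy = sum(y for y, _ in foreground) / len(foreground)
    cx = sum(x for _, x in foreground) / len(foreground)
    # Snap to a real foreground pixel so the click never lands on background,
    # which the raw centroid can do for concave or ring-shaped objects.
    return min(foreground, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)


plus = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]
click = simulate_click(plus)
```

Stating the simulation rule matters because deterministic centroid clicks are easier for a model than the off-center or noisy clicks real users produce, so results under one protocol do not automatically transfer to another.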

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating revisions made to strengthen the paper.

Referee: [Abstract] The central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.

Authors: We agree that the abstract could provide more specifics to support verification of the central claim. In the revised version, we have expanded the abstract to briefly describe the encoding approach, note the typical number of in-context examples (2-4), reference the baselines (unmodified SOTA models), and highlight the quantitative gains. Detailed evidence that the improvements stem from transferable cue semantics, rather than overfitting, is presented through direct comparisons in the Experiments section, where unmodified models ignore the cues entirely while our method leverages them effectively. Error analysis is included in the supplementary material and main experiments.

Revision: yes

Referee: [Method and Experiments] Unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.

Authors: The referee correctly notes the absence of explicit robustness ablations in the original submission. Our experiments already cover multiple interaction types (scribbles, clicks, bounding boxes) and demonstrate that unmodified models ignore cues while ours succeeds, supporting that the signal overrides the tendency to treat marks as extraneous. We have added new ablations in the revised manuscript testing variations in cue density and style, including some differing from primary examples. These results indicate the model interprets cues as guidance. While exhaustive coverage of every possible new configuration is not feasible, the in-context learning paradigm is designed for generalization to unseen prompts, and our findings are consistent with this.

Revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical encoding with no self-referential derivation

The paper proposes encoding user interactions (scribbles/clicks/boxes) directly into static in-context example pairs to adapt models like DeLVM for interactive use. This is a straightforward augmentation technique with no equations, fitted parameters, or load-bearing self-citations in the provided text. The central claim rests on empirical experiments showing baseline models ignore cues while the augmented pairs enable control, without any reduction of predictions to inputs by construction or imported uniqueness theorems. The assumption that models will generalize from the encoded cues is an empirical hypothesis, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are introduced in the abstract; the method is presented as a simple encoding change on top of existing models.

pith-pipeline@v0.9.0 · 5609 in / 1039 out tokens · 28754 ms · 2026-05-10T18:46:04.895764+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.

[2] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. L. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.

[3] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005–25017, 2022.

[4] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. icoseg: Interactive co-segmentation with intelligent scribble guidance. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3169–3176. IEEE, 2010.

[5] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[7] T. Chen, Y. Liu, Z. Wang, J. Yuan, Q. You, H. Yang, and M. Zhou. Improving in-context learning in diffusion models with visual context-modulated prompts. arXiv preprint arXiv:2312.01408, 2023.

[8] S. Cheng, H. Sun, T. Xie, H. Zhao, Y. Chen, B. Xu, and X. Li. Spt: Sequence prompt transformer for interactive image segmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

[9] S. Czolbe and A. V. Dalca. Neuralizer: General neuroimage analysis without re-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6217–6230, 2023.

[10] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.

[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.

[12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[13] J. Guo, Z. Hao, C. Wang, Y. Tang, H. Wu, H. Hu, K. Han, and C. Xu. Data-efficient large vision models through sequential autoregression. arXiv preprint arXiv:2402.04841, 2024.

[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.

[15] W.-D. Jang and C.-S. Kim. Interactive image segmentation via backpropagating refinement scheme. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5297–5306, 2019.

[16] L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36:29914–29934, 2023.

[17] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[18] T. Kontogianni, M. Gygli, J. Uijlings, and V. Ferrari. Continuous adaptation for interactive object segmentation by learning from corrections. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 579–596. Springer, 2020.

[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014.

[20] J. Liu, H. Wang, W. Yin, J.-J. Sonke, and E. Gavves. Click prompt learning with optimal transport for interactive segmentation. In European Conference on Computer Vision, pages 93–110. Springer, 2024.

[21] Q. Liu, M. Zheng, B. Planche, S. Karanam, T. Chen, M. Niethammer, and Z. Wu. Pseudoclick: Interactive image segmentation with click imitation. In European Conference on Computer Vision, pages 728–745. Springer, 2022.

[22] Q. Liu, J. Cho, M. Bansal, and M. Niethammer. Rethinking interactive image segmentation with low latency high quality and diverse prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3773–3782, 2024.

[23] S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. arXiv preprint arXiv:1805.04398, 2018.

[24] Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen. Deep interactive segmentation of medical images: A systematic review and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 2024.

[25] Z. Marinov, M. Kim, J. Kleesiek, and R. Stiefelhagen. Rethinking annotator simulation: Realistic evaluation of whole-body pet lesion interactive segmentation methods. arXiv preprint arXiv:2404.01816, 2024.

[26] A. Negrini and S. Reiß. Conquering the retina: Bringing visual in-context learning to oct, 2025. URL https://arxiv.org/abs/2506.15200.

[27] F. Nielsen and R. Nock. Clickremoval: Interactive pinpoint image object removal. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 315–318, 2005.

[28] V. Raunak, S. Dalmia, V. Gupta, and F. Metze. On long-tailed phenomena in neural machine translation. arXiv preprint arXiv:2010.04924, 2020.

[29] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714.

[30] S. Reiß, C. Seibold, A. Freytag, E. Rodner, and R. Stiefelhagen. Every annotation counts: Multi-label deep supervision for medical image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9532–9542, 2021.

[31] S. Reiß, Z. Marinov, A. Jaus, C. Seibold, M. S. Sarfraz, E. Rodner, and R. Stiefelhagen. Is visual in-context learning for compositional medical tasks within reach? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2642–2652, October 2025.

[32] S. Ren, X. Huang, X. Li, J. Xiao, J. Mei, Z. Wang, A. Yuille, and Y. Zhou. Medical vision generalist: Unifying medical imaging tasks in context. arXiv preprint arXiv:2406.05565, 2024.

[33] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

[34] C. Rother, V. Kolmogorov, and A. Blake. "grabcut" interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG), 23(3):309–314, 2004.

[35] M.-C. Sagong, Y.-J. Yeo, S.-W. Jung, and S.-J. Ko. Rord: A real-world object removal dataset. In BMVC, page 542, 2022.

[36] A. Shtedritski, C. Rupprecht, and A. Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023.

[37] K. Sofiiuk, I. Petrov, O. Barinova, and A. Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.

[38] K. Sofiiuk, I. A. Petrov, and A. Konushin. Reviving iterative training with mask guidance for interactive segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3141–3145. IEEE, 2022.

[39] G. Song, H. Myeong, and K. M. Lee. Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1760–1768, 2018.

[40] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.

[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[42] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.

[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[44] Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, M. Zhou, et al. In-context learning unlocked for diffusion models. Advances in Neural Information Processing Systems, 36:8542–8562, 2023.

[45] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373–381, 2016.

[46] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.

[47] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023.

[48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[49] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[50] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.

A Datasets

We operate on the datasets PASCAL VOC (11) for interactive segmentation, the Real-world Object Removal dataset (35), the MPII dataset ...

[1] [1]

Andriluka, L

M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. InProceedings of the IEEE Conference on computer Vision and Pattern Recognition, pages 3686–3693, 2014

work page 2014

[2] [2]

Y . Bai, X. Geng, K. Mangalam, A. Bar, A. L. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024

work page 2024

[3] [3]

A. Bar, Y . Gandelsman, T. Darrell, A. Globerson, and A. Efros. Visual prompting via image inpainting.Advances in Neural Information Processing Systems, 35:25005–25017, 2022

work page 2022

[4] [4]

Batra, A

D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. icoseg: Interactive co-segmentation with intelligent scribble guidance. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3169–3176. IEEE, 2010. 12

work page 2010

[5] [5]

Bridgeman, M

L. Bridgeman, M. V olino, J.-Y . Guillemaut, and A. Hilton. Multi-person 3d pose estimation and tracking in sports. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019

work page 2019

[6] [6]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901

[7] [7]

T. Chen, Y . Liu, Z. Wang, J. Yuan, Q. You, H. Yang, and M. Zhou. Improving in-context learning in diffusion models with visual context-modulated prompts.arXiv preprint arXiv:2312.01408, 2023

work page arXiv 2023

[8] [8]

Cheng, H

S. Cheng, H. Sun, T. Xie, H. Zhao, Y . Chen, B. Xu, and X. Li. Spt: Sequence prompt transformer for interactive image segmentation. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[9] [9]

Czolbe and A

S. Czolbe and A. V . Dalca. Neuralizer: General neuroimage analysis without re-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6217–6230, 2023

work page 2023

[10] [10]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021

[11] [11]

Everingham, L

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010

work page 2010

[12] [12]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13]

J. Guo, Z. Hao, C. Wang, Y. Tang, H. Wu, H. Hu, K. Han, and C. Xu. Data-efficient large vision models through sequential autoregression. arXiv preprint arXiv:2402.04841, 2024

[14]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

[15]

W.-D. Jang and C.-S. Kim. Interactive image segmentation via backpropagating refinement scheme. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5297–5306, 2019

[16]

L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36:29914–29934, 2023

[17]

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

[18]

T. Kontogianni, M. Gygli, J. Uijlings, and V. Ferrari. Continuous adaptation for interactive object segmentation by learning from corrections. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 579–596. Springer, 2020

[19]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

[20]

J. Liu, H. Wang, W. Yin, J.-J. Sonke, and E. Gavves. Click prompt learning with optimal transport for interactive segmentation. In European Conference on Computer Vision, pages 93–110. Springer, 2024

[21]

Q. Liu, M. Zheng, B. Planche, S. Karanam, T. Chen, M. Niethammer, and Z. Wu. Pseudoclick: Interactive image segmentation with click imitation. In European Conference on Computer Vision, pages 728–745. Springer, 2022

[22]

Q. Liu, J. Cho, M. Bansal, and M. Niethammer. Rethinking interactive image segmentation with low latency high quality and diverse prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3773–3782, 2024

[23]

S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. arXiv preprint arXiv:1805.04398, 2018

[24]

Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen. Deep interactive segmentation of medical images: A systematic review and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 2024

[25]

Z. Marinov, M. Kim, J. Kleesiek, and R. Stiefelhagen. Rethinking annotator simulation: Realistic evaluation of whole-body pet lesion interactive segmentation methods. arXiv preprint arXiv:2404.01816, 2024

[26]

A. Negrini and S. Reiß. Conquering the retina: Bringing visual in-context learning to oct, 2025. URL https://arxiv.org/abs/2506.15200

[27]

F. Nielsen and R. Nock. Clickremoval: Interactive pinpoint image object removal. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 315–318, 2005

[28]

V. Raunak, S. Dalmia, V. Gupta, and F. Metze. On long-tailed phenomena in neural machine translation. arXiv preprint arXiv:2010.04924, 2020

[29]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714

[30]

S. Reiß, C. Seibold, A. Freytag, E. Rodner, and R. Stiefelhagen. Every annotation counts: Multi-label deep supervision for medical image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9532–9542, 2021

[31]

S. Reiß, Z. Marinov, A. Jaus, C. Seibold, M. S. Sarfraz, E. Rodner, and R. Stiefelhagen. Is visual in-context learning for compositional medical tasks within reach? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2642–2652, October 2025

[32]

S. Ren, X. Huang, X. Li, J. Xiao, J. Mei, Z. Wang, A. Yuille, and Y. Zhou. Medical vision generalist: Unifying medical imaging tasks in context. arXiv preprint arXiv:2406.05565, 2024

[33]

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[34]

C. Rother, V. Kolmogorov, and A. Blake. "grabcut" interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG), 23(3):309–314, 2004

[35]

M.-C. Sagong, Y.-J. Yeo, S.-W. Jung, and S.-J. Ko. Rord: A real-world object removal dataset. In BMVC, page 542, 2022

[36]

A. Shtedritski, C. Rupprecht, and A. Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023

[37]

K. Sofiiuk, I. Petrov, O. Barinova, and A. Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020

[38]

K. Sofiiuk, I. A. Petrov, and A. Konushin. Reviving iterative training with mask guidance for interactive segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3141–3145. IEEE, 2022

[39]

G. Song, H. Myeong, and K. M. Lee. Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1760–1768, 2018

[40]

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

[41]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[42]

X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023

[43]

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

[44]

Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, M. Zhou, et al. In-context learning unlocked for diffusion models. Advances in Neural Information Processing Systems, 36:8542–8562, 2023

[45]

N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373–381, 2016

[46]

J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018

[47]

L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

[48]

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

[49]

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

[50]

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

A Datasets

We operate on the datasets PASCAL VOC (11) for interactive segmentation, the Real-world Object Removal dataset (35), the MPII dataset ...