From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks
Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3
The pith
Encoding user cues like scribbles and clicks into example pairs turns static visual in-context learners into interactive systems that follow new guidance without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By encoding interactions directly into the example input-output pairs, static visual in-context learners such as DeLVM become controllable systems that accept natural visual cues like scribbles, clicks, or bounding boxes; the model then generalizes to unseen interactions without task-specific fine-tuning and lets users dynamically steer predictions for personalized outcomes.
What carries the argument
Encoding user visual cues (scribbles, clicks, bounding boxes) directly into the demonstration input-output pairs so the in-context learner treats them as reusable guidance signals.
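To make the encoding idea concrete, here is a minimal sketch of how a click or bounding box could be painted into an image before it is placed in the demonstration sequence. Everything in it is an illustrative assumption rather than the paper's implementation: the function names `draw_click`, `draw_box`, and `build_prompt`, the red cue colour, and the flat list used as a prompt are all hypothetical.

```python
import numpy as np

CUE_COLOR = (255, 0, 0)  # assumption: cues are drawn in pure red


def draw_click(image, row, col, radius=3, color=CUE_COLOR):
    """Return a copy of `image` (H, W, 3 uint8) with a filled disc at (row, col)."""
    out = image.copy()
    rows, cols = np.ogrid[: out.shape[0], : out.shape[1]]
    disc = (rows - row) ** 2 + (cols - col) ** 2 <= radius ** 2
    out[disc] = color
    return out


def draw_box(image, top, left, bottom, right, thickness=1, color=CUE_COLOR):
    """Return a copy of `image` with a rectangular outline (corners inclusive)."""
    out = image.copy()
    t = thickness
    out[top : top + t, left : right + 1] = color                # top edge
    out[bottom - t + 1 : bottom + 1, left : right + 1] = color  # bottom edge
    out[top : bottom + 1, left : left + t] = color              # left edge
    out[top : bottom + 1, right - t + 1 : right + 1] = color    # right edge
    return out


def build_prompt(examples, query_with_cue):
    """Flatten [(input_with_cue, target), ...] plus the cued query into one sequence.

    Hypothetical prompt layout: the model would be asked to continue the
    sequence with the output for the final (query) image.
    """
    sequence = []
    for inp, target in examples:
        sequence.extend([inp, target])
    sequence.append(query_with_cue)
    return sequence
```

The point of the sketch is that the cue lives in pixel space, so the in-context learner needs no new input channel; whether a given model actually reads the mark as an instruction is the empirical question the review raises.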
Load-bearing premise
The model will interpret the drawn user cues in the example pairs as generalizable instructions instead of noise or task-specific artifacts.
What would settle it
Run the adapted model on new test cases whose examples contain user cues. If it shows no improvement over the static baseline and does not follow the provided cues, the claim fails.
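One way to operationalize this test for click cues is sketched below. It assumes predictions are binary masks and cues are single `(row, col)` clicks; the helper names `cue_following_rate` and `mean_gain` are hypothetical and do not come from the paper.

```python
import numpy as np


def cue_following_rate(pred_masks, clicks):
    """Fraction of predictions whose mask contains the clicked pixel."""
    hits = [bool(mask[r, c]) for mask, (r, c) in zip(pred_masks, clicks)]
    return sum(hits) / len(hits) if hits else 0.0


def mean_gain(adapted_scores, baseline_scores):
    """Mean per-sample score difference (adapted minus static baseline)."""
    adapted = np.asarray(adapted_scores, dtype=np.float64)
    baseline = np.asarray(baseline_scores, dtype=np.float64)
    return float(np.mean(adapted - baseline))
```

Under these assumptions, a following rate near chance together with a mean gain near zero would count against the claim; the thresholds for "near" are a judgment call that the sketch does not settle.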
Original abstract
Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that static visual in-context learning models (particularly DeLVM) can be transformed into interactive, user-driven systems by directly encoding user guidance signals such as scribbles, clicks, or bounding boxes into the example input-output pairs. This preserves the core in-context learning philosophy, allowing prompt-based adaptation to unseen interactions without fine-tuning. Experiments reportedly show that unmodified SOTA models ignore such cues, while the proposed Interactive DeLVM yields gains of +7.95% IoU on interactive segmentation, +2.46 PSNR on directed super-resolution, and -3.14% LPIPS on interactive object removal.
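For readers unfamiliar with the reported metrics, the following sketch shows the standard definitions of IoU for binary masks and PSNR for images. It is a generic illustration, not the paper's evaluation code: the function names, the 8-bit `max_value` default, and the convention that two empty masks score 1.0 are assumptions. LPIPS is omitted because it depends on a pretrained network.

```python
import numpy as np


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Note that the text quoted here does not say whether the IoU and LPIPS percentages are absolute points or relative changes, so the sketch only fixes the underlying quantities, not how the deltas were computed.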
Significance. If the central claim holds, the work would meaningfully extend visual in-context learning toward practical, controllable applications by enabling users to steer predictions with personalized visual cues. The simplicity of the encoding approach and the reported quantitative improvements are potential strengths, as they suggest a lightweight way to add interactivity while avoiding task-specific retraining.
major comments (2)
- [Abstract] Abstract: the central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.
- [Method and Experiments] Method and Experiments sections: unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.
minor comments (2)
- [Abstract] Abstract: the quantitative claims would be clearer if the number of in-context examples, the exact DeLVM variant, and the evaluation protocol (e.g., how user cues are simulated at test time) were briefly stated.
- [Figures] Figures: ensure that example visualizations of encoded input-output pairs clearly distinguish the added cues from the original image content so readers can assess the encoding strategy.
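The request for an explicit cue-simulation protocol, and for ablations over cue density, can be illustrated with a small sketch. Interactive segmentation work commonly simulates clicks by sampling pixels from the ground-truth mask; the version below does exactly that with a fixed seed. The function name `simulate_clicks` and its parameters are hypothetical, and nothing here is claimed to match the protocol the paper used.

```python
import numpy as np


def simulate_clicks(gt_mask, n_clicks, seed=0):
    """Sample up to `n_clicks` distinct (row, col) positions inside `gt_mask`.

    Returns an empty list when the mask has no foreground pixels.
    """
    mask = np.asarray(gt_mask, dtype=bool)
    candidates = np.argwhere(mask)  # (K, 2) array of foreground coordinates
    if len(candidates) == 0:
        return []
    rng = np.random.default_rng(seed)
    n = min(n_clicks, len(candidates))
    chosen = rng.choice(len(candidates), size=n, replace=False)
    return [tuple(int(v) for v in candidates[i]) for i in chosen]
```

Sweeping `n_clicks` (for example 1, 3, 5, 10) with the seed held fixed would give the kind of cue-density ablation the major comments ask for, under the assumption that uniform sampling is an acceptable stand-in for real user behaviour.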
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating revisions made to strengthen the paper.
- Referee: [Abstract] Abstract: the central claim that encoding cues into example pairs induces the base model to abstract user intent as generalizable guidance (rather than noise or low-level patterns) is load-bearing for the contribution, yet the abstract supplies no method details, baselines, number of examples, or error analysis. This prevents verification that the reported gains arise from transferable cue semantics instead of overfitting to the specific interaction styles shown in the prompts.
  Authors: We agree that the abstract could provide more specifics to support verification of the central claim. In the revised version, we have expanded the abstract to briefly describe the encoding approach, note the typical number of in-context examples (2-4), reference the baselines (unmodified SOTA models), and highlight the quantitative gains. Detailed evidence that the improvements stem from transferable cue semantics, rather than overfitting, is presented through direct comparisons in the Experiments section, where unmodified models ignore the cues entirely while our method leverages them effectively. Error analysis is included in the supplementary material and main experiments.
  Revision: yes
- Referee: [Method and Experiments] Method and Experiments sections: unmodified SOTA models are stated to ignore cues entirely, so the success of the encoding depends on the in-context signal being strong enough to override this tendency. No ablations appear to test robustness across cue densities, styles, or interaction types differing from the training examples; without such tests the assumption that the model will parse added marks as instructions (rather than extraneous content) remains unverified and could fail for new cue configurations.
  Authors: The referee correctly notes the absence of explicit robustness ablations in the original submission. Our experiments already cover multiple interaction types (scribbles, clicks, bounding boxes) and demonstrate that unmodified models ignore cues while ours succeeds, supporting that the signal overrides the tendency to treat marks as extraneous. We have added new ablations in the revised manuscript testing variations in cue density and style, including some differing from primary examples. These results indicate the model interprets cues as guidance. While exhaustive coverage of every possible new configuration is not feasible, the in-context learning paradigm is designed for generalization to unseen prompts, and our findings are consistent with this.
  Revision: partial
Circularity Check
No circularity: direct empirical encoding with no self-referential derivation
The paper proposes encoding user interactions (scribbles/clicks/boxes) directly into static in-context example pairs to adapt models like DeLVM for interactive use. This is a straightforward augmentation technique with no equations, fitted parameters, or load-bearing self-citations in the provided text. The central claim rests on empirical experiments showing baseline models ignore cues while the augmented pairs enable control, without any reduction of predictions to inputs by construction or imported uniqueness theorems. The assumption that models will generalize from the encoded cues is an empirical hypothesis, not a circular derivation.
Reference graph
Works this paper leans on
- [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
- [2] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. L. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.
- [3] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005–25017, 2022.
- [4]
- [5] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
- [6]
- [7]
- [8]
- [9] S. Czolbe and A. V. Dalca. Neuralizer: General neuroimage analysis without re-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6217–6230, 2023.
- [10]
- [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- [12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [13]
- [14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
- [15] W.-D. Jang and C.-S. Kim. Interactive image segmentation via backpropagating refinement scheme. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5297–5306, 2019.
- [16] L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36:29914–29934, 2023.
- [17] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
- [18] T. Kontogianni, M. Gygli, J. Uijlings, and V. Ferrari. Continuous adaptation for interactive object segmentation by learning from corrections. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 579–596. Springer, 2020.
- [19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014.
- [20] J. Liu, H. Wang, W. Yin, J.-J. Sonke, and E. Gavves. Click prompt learning with optimal transport for interactive segmentation. In European Conference on Computer Vision, pages 93–110. Springer, 2024.
- [21] Q. Liu, M. Zheng, B. Planche, S. Karanam, T. Chen, M. Niethammer, and Z. Wu. Pseudoclick: Interactive image segmentation with click imitation. In European Conference on Computer Vision, pages 728–745. Springer, 2022.
- [22] Q. Liu, J. Cho, M. Bansal, and M. Niethammer. Rethinking interactive image segmentation with low latency high quality and diverse prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3773–3782, 2024.
- [23] S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. arXiv preprint arXiv:1805.04398, 2018.
- [24] Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen. Deep interactive segmentation of medical images: A systematic review and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 2024.
- [25] Z. Marinov, M. Kim, J. Kleesiek, and R. Stiefelhagen. Rethinking annotator simulation: Realistic evaluation of whole-body pet lesion interactive segmentation methods. arXiv preprint arXiv:2404.01816, 2024.
- [26] A. Negrini and S. Reiß. Conquering the retina: Bringing visual in-context learning to oct, 2025. URL https://arxiv.org/abs/2506.15200
- [27] F. Nielsen and R. Nock. Clickremoval: Interactive pinpoint image object removal. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 315–318, 2005.
- [28]
- [29] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714
- [30] S. Reiß, C. Seibold, A. Freytag, E. Rodner, and R. Stiefelhagen. Every annotation counts: Multi-label deep supervision for medical image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9532–9542, 2021.
- [31] S. Reiß, Z. Marinov, A. Jaus, C. Seibold, M. S. Sarfraz, E. Rodner, and R. Stiefelhagen. Is visual in-context learning for compositional medical tasks within reach? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2642–2652, October 2025.
- [32]
- [33] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [34]
- [35] M.-C. Sagong, Y.-J. Yeo, S.-W. Jung, and S.-J. Ko. Rord: A real-world object removal dataset. In BMVC, page 542, 2022.
- [36] A. Shtedritski, C. Rupprecht, and A. Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997, 2023.
- [37] K. Sofiiuk, I. Petrov, O. Barinova, and A. Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.
- [38] K. Sofiiuk, I. A. Petrov, and A. Konushin. Reviving iterative training with mask guidance for interactive segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3141–3145. IEEE, 2022.
- [39] G. Song, H. Myeong, and K. M. Lee. Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1760–1768, 2018.
- [40] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
- [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [42] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
- [43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- [44] Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, M. Zhou, et al. In-context learning unlocked for diffusion models. Advances in Neural Information Processing Systems, 36:8542–8562, 2023.
- [45] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373–381, 2016.
- [46] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.
- [47]
- [48]
- [49]
- [50] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
A Datasets
We operate on the datasets PASCAL VOC (11) for interactive segmentation, the Real-world Object Removal dataset (35), the MPII dataset ...