Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models

Byung-Woo Hong; Kensuke Nakamura

arxiv: 2607.00357 · v1 · pith:IGO5LIS2new · submitted 2026-07-01 · 💻 cs.CV

Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models

Kensuke Nakamura , Byung-Woo Hong This is my paper

Pith reviewed 2026-07-02 15:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords personalized object localizationpersonalized object identificationvision-language modelsin-context inferencefew-shot object detectioninstance-level detection

0 comments

The pith

IPLoc-ID adds identification to personalized localization so vision-language models can reject images without the reference object.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces personalized object identification and localization (POIL) as an extension of prior personalized object localization work. POIL requires a model to both localize a specific object instance from a few reference images and reject query images that lack that exact instance. IPLoc-ID solves this by using a vision-language model to first predict a candidate bounding box and then verify its match to the reference via a self-posed query inside one autoregressive generation. Experiments on new datasets show the method reduces false positives on negative images while preserving localization accuracy comparable to the localization-only baseline.

Core claim

POIL is solved by first predicting a candidate bounding box and then determining whether it corresponds to the reference object instance through a self-posed query that connects the two steps within a single autoregressive generation of a vision-language model.

What carries the argument

The self-posed query, which links bounding-box prediction and instance verification inside one autoregressive generation.

If this is right

POIL becomes solvable with existing vision-language models without separate training stages for identification.
False-positive detections drop on images that do not contain the reference instance compared with localization-only methods.
Localization performance on images that do contain the instance stays close to the performance of the earlier IPLoc method.
The identification step integrates naturally into few-shot object detection pipelines that previously lacked instance-level rejection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-pass verification trick could be tested on other VLM tasks that require confirming whether a detected region matches a reference.
Allowing the model to consider several candidate boxes before the verification step might further reduce errors on hard negative images.
Applications such as robotic search or photo library search would gain the ability to skip irrelevant scenes without extra post-processing.

Load-bearing premise

A vision-language model can reliably decide from the self-posed query whether a candidate bounding box matches the reference object instance.

What would settle it

On a collection of negative query images known to lack the reference object, IPLoc-ID still produces many bounding-box outputs instead of correctly indicating absence.

Figures

Figures reproduced from arXiv: 2607.00357 by Byung-Woo Hong, Kensuke Nakamura.

**Figure 1.** Figure 1: [In-context inference for personalized object identification and localization task] (Left) Examples of reference data, (Right) positive and negative query images, and inference results using Florence-2, No-Time-To-Train (NT3), Qwen2-VL with prompting, IPLoc, and the proposed IPLoc-ID, respectively: Red boxes indicate reference annotations, green boxes indicate correct detections, and magenta boxes indicate… view at source ↗

**Figure 2.** Figure 2: [The proposed IPLoc-ID framework] (Top) We introduce personalized object identification and localization (POIL) and construct datasets by augmenting video object tracking data with negative query images. (Bottom) IPLoc-ID extends the sequence-generation formulation of IPLoc by generating a BBOX candidate, a self-posed query, and an identification answer in an autoregressive process, enabling the model to r… view at source ↗

**Figure 3.** Figure 3: [Training curves] The mIoU (solid line) and F1-score (dotted line) curves for the LaSOT test set during training based on different backbones trained using (blue) only BBOX loss (IPLoc), (green) two-stage training, and (magenta) the proposed unified loss. 4.2. Backbone Model Selection The proposed method assumes a transformer-based VLM with autoregressive text generation. We empirically select the backbone… view at source ↗

read the original abstract

Personalized object localization (POL) localizes an object instance in a query image based on a few reference images with bounding-box annotations and a target object label. The pioneering method, IPLoc, solves this task through in-context inference with vision-language models (VLMs). However, it assumes that the query image always contains the target object. This assumption severely limits its applicability to real-world scenarios with many irrelevant images. To address this issue, we formulate a new task, personalized object identification and localization (POIL), by positioning POL within the broader few-shot object detection framework. POIL aims to localize the target object instance while rejecting query images that do not contain the reference object instance. We also present POIL datasets constructed from public sources. We further propose an in-context algorithm named IPLoc-ID for solving POIL with VLMs. IPLoc-ID first predicts a candidate bounding box and then determines whether it corresponds to the reference object instance. We introduce a self-posed query to connect these two steps within a single autoregressive generation framework. Through ablation studies and comprehensive experiments, we show that IPLoc-ID substantially suppresses false-positive detections on negative query images while maintaining localization performance comparable to IPLoc. Overall, IPLoc-ID effectively addresses the practical instance-level POIL task, which cannot be sufficiently solved by conventional object detection, few-shot object detection, or the localization-only IPLoc method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds rejection of negative images to personalized few-shot localization by extending IPLoc with a self-posed query inside one VLM generation, but the reliability of that identification step remains the weakest part of the claim.

read the letter

The main takeaway is that this work defines a POIL task to handle query images that may not contain the reference object, then gives an IPLoc-ID algorithm that first proposes a box and then uses a self-posed query to decide whether the box matches the few-shot references.

What is new is the explicit rejection capability and the single-pass connection via the self-posed query. Prior IPLoc assumed the object was always present, which the authors correctly flag as a practical limit. The paper also builds POIL datasets from existing public sources and runs ablations plus comparisons against standard few-shot detectors.

The approach is straightforward and the motivation is sound. Adding identification without extra models or multi-turn interaction is a reasonable engineering move for VLM-based pipelines.

The soft spot is exactly where the stress-test note points: the identification decision happens inside the same autoregressive sequence after the model has emitted box coordinates. Current VLMs are known to be brittle at treating their own prior tokens as reliable visual evidence and at fine-grained instance discrimination under few-shot conditioning. If that step fails systematically on negative queries, the reported false-positive suppression does not hold. The abstract claims ablation studies support the gains, but without the actual tables, prompt templates, or failure-case analysis it is difficult to judge how often the self-posed query succeeds versus how often it simply inherits the base VLM's limitations.

The work is aimed at researchers already using VLMs for few-shot localization who need to operate on mixed positive and negative images. It is incremental rather than foundational, but the task formulation and the concrete algorithm are clear enough that a serious referee could evaluate the empirical claims and the prompt design in detail. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper formulates the personalized object identification and localization (POIL) task, extending personalized object localization (POL) to reject query images without the target instance. It proposes IPLoc-ID, an in-context inference algorithm for vision-language models (VLMs) that generates a candidate bounding box and then uses a self-posed query to determine if the box corresponds to the reference object in a single autoregressive generation. Ablation studies and experiments on POIL datasets constructed from public sources show that IPLoc-ID reduces false-positive detections on negative query images while maintaining localization performance comparable to the prior IPLoc method.

Significance. If the results hold, this addresses a key limitation in applying POL to real-world scenarios with irrelevant images, making it more practical within the few-shot object detection framework. The work gives credit to reproducibility by constructing POIL datasets from public sources and building on existing VLMs without fine-tuning. The empirical nature avoids parameter fitting, focusing on algorithmic use of in-context learning.

major comments (2)

[Section 3 (Proposed Method)] The headline result of false-positive suppression on negative queries depends on the VLM correctly interpreting the self-posed query to reject non-matching candidate boxes within one autoregressive sequence. The manuscript does not provide a dedicated analysis or quantitative breakdown of identification errors in this step, which is load-bearing for the claim that the method 'substantially suppresses false-positive detections' (abstract; Section 3, algorithm description).
[Experiments and Ablations] The abstract references ablation studies demonstrating the benefits, but without specific tables or figures showing metrics on negative vs positive queries (e.g., false positive rate, localization IoU or mAP), the support for 'comparable localization performance' and the overall POIL effectiveness cannot be verified in detail.

minor comments (2)

[Abstract] The description of the self-posed query could be clarified earlier to help readers understand how it connects the localization and identification steps.
[References] Consider adding citations to recent studies on VLM limitations in multi-step visual reasoning to contextualize the approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point-by-point below, with plans to revise the paper to strengthen the presentation of results.

read point-by-point responses

Referee: [Section 3 (Proposed Method)] The headline result of false-positive suppression on negative queries depends on the VLM correctly interpreting the self-posed query to reject non-matching candidate boxes within one autoregressive sequence. The manuscript does not provide a dedicated analysis or quantitative breakdown of identification errors in this step, which is load-bearing for the claim that the method 'substantially suppresses false-positive detections' (abstract; Section 3, algorithm description).

Authors: We agree that a dedicated quantitative breakdown of identification errors would provide stronger support for the false-positive suppression claim. In the revised manuscript, we will add an analysis subsection (likely in Section 4) that reports identification accuracy metrics on the self-posed query step, including rejection rates on negative queries, error types (e.g., false acceptance of non-matching boxes), and comparison against the localization-only baseline. revision: yes
Referee: [Experiments and Ablations] The abstract references ablation studies demonstrating the benefits, but without specific tables or figures showing metrics on negative vs positive queries (e.g., false positive rate, localization IoU or mAP), the support for 'comparable localization performance' and the overall POIL effectiveness cannot be verified in detail.

Authors: The ablation studies and main experiments in Section 4 are conducted on the constructed POIL datasets that explicitly include both positive and negative queries, with results showing reduced false positives while preserving localization performance comparable to IPLoc. To improve verifiability, we will add or expand tables/figures in the revised version that explicitly break out metrics such as false positive rate, precision, and IoU/mAP separately for negative versus positive queries. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical algorithm validated on public data

full rationale

The paper introduces the POIL task and IPLoc-ID algorithm as an empirical extension of prior VLM in-context methods. Performance claims rest on ablation studies and experiments using constructed datasets from public sources, not on any equations, fitted parameters, or self-citations that reduce the reported false-positive suppression or localization metrics to quantities defined inside the paper. The self-posed query mechanism is an algorithmic design choice whose effectiveness is measured externally rather than assumed by construction. No load-bearing self-citation chain or self-definitional reduction exists.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical behavior of off-the-shelf vision-language models when prompted with the self-posed query; no new mathematical axioms or fitted parameters are introduced by the authors.

axioms (1)

domain assumption Vision-language models possess sufficient in-context reasoning capability to perform both localization and instance identification from a small number of reference examples.
The entire IPLoc-ID pipeline depends on this capability of existing VLMs; the paper does not derive or prove it.

pith-pipeline@v0.9.1-grok · 5780 in / 1303 out tokens · 33657 ms · 2026-07-02T15:18:12.106051+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 9 canonical work pages · 5 internal anchors

[1]

Minderer, A

M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al., Sim- ple open-vocabulary object detection, in: European conference on computer vision, Springer, 2022, pp. 728–755

2022
[2]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al., Grounding dino: Marrying dino with grounded pre-training for open-set object detection, in: European conference on computer vision, Springer, 2024, pp. 38–55

2024
[3]

Köhler, M

M. Köhler, M. Eisenbach, H.-M. Gross, Few-shot object detection: A comprehensive survey, IEEE transactions on neural networks and learning systems 35 (9) (2023) 11958–11978

2023
[4]

Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, X. You, Few-shot object detection: Research advances and challenges, Information Fusion 107 (2024) 102307

2024
[5]

X. Wang, T. Huang, J. Gonzalez, T. Darrell, F. Yu, Frustratingly simple few-shot object detection, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 9919–9928. URLhttps://proceedings.mlr.press/v119/wang20j.html

2020
[6]

Doveh, N

S. Doveh, N. Shabtay, E. Schwartz, H. Kuehne, R. Giryes, R. Feris, L. Kar- linsky, J. Glass, A. Arbelle, S. Ullman, et al., Teaching vlms to localize specific objects from in-context examples, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9572–9582

2025
[7]

B. Sun, B. Li, S. Cai, Y. Yuan, C. Zhang, Fsce: Few-shot object detection via contrastive proposal encoding, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7352–7362. 22

2021
[8]

L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, C. Zhang, Defrcn: Decoupled faster r-cnn for few-shot object detection, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8681–8690

2021
[9]

X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, L. Lin, Meta r-cnn: Towards general solver for instance-level low-shot learning, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9577– 9586

2019
[10]

G. Han, J. Ma, S. Huang, L. Chen, S.-F. Chang, Few-shot object detection with fully cross-transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5321–5330

2022
[11]

Zhang, Y

X. Zhang, Y. Liu, Y. Wang, A. Boularias, Detect everything with few examples, in: Proceedings of The 8th Conference on Robot Learning, Vol. 270 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 3986– 4004

2024
[12]

X. Yu, Y. Sha, L. Liu, X. Shen, D. Yang, A closer look at cross-domain few-shot object detection: Fine-tuning matters and parallel decoder helps, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[13]

C.-B. Feng, Y. Sha, L. Liu, Y. Yu, C. M. Vong, X. Yu, X. Shen, Few-shot object detection with vision foundation models and graph diffusion, in: The Fourteenth International Conference on Learning Representations, 2026

2026
[14]

Espinosa, C

M. Espinosa, C. Yang, L. Ericsson, S. McDonagh, E. J. Crowley, No time to train! training-free reference-based instance segmentation, arXiv preprint arXiv:2507.02798 (2025)

work page arXiv 2025
[15]

Psomas, G

B. Psomas, G. Retsinas, N. Efthymiadis, P. Filntisis, Y. Avrithis, P. Maragos, O. Chum, G. Tolias, Instance-level composed image retrieval, in: The Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[16]

X. Hao, K. Zhu, H. Guo, H. Guo, N. Jiang, Q. Lu, M. Tang, J. Wang, Referring expression instance retrieval and a strong end-to-end baseline, in: Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 4464–4473

2025
[17]

Y. Ren, B. Li, C. Zhang, Y. Zhang, B. Yin, Few-shot object localization, arXiv preprint arXiv:2403.12466 (2024)

work page arXiv 2024
[18]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visualmodelsfromnaturallanguagesupervision, in: Internationalconference on machine learning, PmLR, 2021, pp. 8748–8763. 23

2021
[19]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gor- don, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 2818–2829

2023
[20]

J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation, in: Inter- national conference on machine learning, PMLR, 2022, pp. 12888–12900

2022
[21]

J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International conference on machine learning, PMLR, 2023, pp. 19730– 19742

2023
[22]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al., Sam 2: Segment anything in images and videos, in: International Conference on Learning Representations, Vol. 2025, 2025, pp. 28085–28128

2025
[23]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2023) 34892–34916

2023
[25]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al., Gemma 3 technical report, arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al., Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al., Qwen3-vl technical report, arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Zhang, H

H. Zhang, H. Li, F. Li, T. Ren, X. Zou, S. Liu, S. Huang, J. Gao, Leizhang, C. Li, et al., Llava-grounding: Grounded visual chat with large multimodal models, in: European Conference on Computer Vision, Springer, 2024, pp. 19–35

2024
[29]

Y. Yao, Q. Yang, H. Zhong, J. Wei, Y. Men, S. Bai, M. Cui, Z. Yang, Qwen3- vl-seg: Unlocking open-world referring segmentation with vision-language grounding, arXiv preprint arXiv:2605.07141 (2026). 24

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Press, M

O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, M. Lewis, Measuring and narrowing the compositionality gap in language models, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 5687–5711

2023
[31]

J. Qi, Z. Xu, Y. Shen, M. Liu, D. Jin, Q. Wang, L. Huang, The art of socratic questioning: Recursive thinking with large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4177–4199

2023
[32]

G. Sun, C. Qin, J. Wang, Z. Chen, R. Xu, Z. Tao, Sq-llava: Self-questioning for large vision-language assistant, in: European Conference on Computer Vision, Springer, 2024, pp. 156–172

2024
[33]

Prasad, E

A. Prasad, E. Stengel-Eskin, M. Bansal, Rephrase, augment, reason: Vi- sual grounding of questions for vision-language models, in: International Conference on Learning Representations, 2024

2024
[34]

S. Min, M. Lewis, L. Zettlemoyer, H. Hajishirzi, Metaicl: Learning to learn in context, in: Proceedings of the 2022 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2791–2809

2022
[35]

Monajatipoor, L

M. Monajatipoor, L. H. Li, M. Rouhsedaghat, L. Yang, K.-W. Chang, Metavl: Transferring in-context learning ability from language models to vision-language models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 495–508

2023
[36]

K. P. Yu, Z. Zhang, F. Hu, S. Storks, J. Chai, Eliciting in-context learning in vision-language models for videos through curated data distributional properties, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 20416–20431

2024
[37]

Sheng, D

D. Sheng, D. Chen, Z. Tan, Q. Liu, Q. Chu, J. Bao, T. Gong, B. Liu, S. Xu, N. Yu, Towards more unified in-context visual understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13362–13372

2024
[38]

Everingham, L

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338

2010
[39]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755

2014
[40]

D. M. W. Powers, Evaluation: From precision, recall and f-measure to roc, informedness, markedness and correlation, Journal of Machine Learning Technologies 2 (1) (2011) 37–63. 25

2011
[41]

E.J.Hu, Y.Shen, P.Wallis, Z.Allen-Zhu, Y.Li, S.Wang, L.Wang, W.Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022

2022
[42]

H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, H. Ling, Lasot: A high-quality benchmark for large-scale single object tracking, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5374–5383

2019
[43]

Samuel, R

D. Samuel, R. Ben-Ari, M. Levy, N. Darshan, G. Chechik, Where’s waldo: Diffusion features for personalized segmentation and retrieval, Advances in Neural Information Processing Systems 37 (2024) 128160–128181

2024
[44]

Huang, X

L. Huang, X. Zhao, K. Huang, Got-10k: A large high-diversity benchmark for generic object tracking in the wild, IEEE transactions on pattern analysis and machine intelligence 43 (5) (2019) 1562–1577

2019
[45]

L. Peng, J. Gao, X. Liu, W. Li, S. Dong, Z. Zhang, H. Fan, L. Zhang, Vast- track: Vast category visual object tracking, Advances in Neural Information Processing Systems 37 (2024) 130797–130818

2024
[46]

Riquelme, J

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Su- sano Pinto, D. Keysers, N. Houlsby, Scaling vision with sparse mixture of experts, Advances in Neural Information Processing Systems 34 (2021) 8583–8595

2021
[47]

McCloskey, N

M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, Psychology of Learning and Motivation 24 (1989) 109–165. doi:10.1016/S0079-7421(08)60536-8

work page doi:10.1016/s0079-7421(08)60536-8 1989
[48]

Suhas Kotha and Percy Liang

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceed- ings of the National Academy of Sciences 114 (13) (2017) 3521–3526. doi:10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017
[49]

B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, L. Yuan, Florence-2: Advancing a unified representation for a variety of vision tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829. 26 Appendix A. Additional Experimental Results Appendix A.1. Pretest on instruction prompts The prop...

2024

[1] [1]

Minderer, A

M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al., Sim- ple open-vocabulary object detection, in: European conference on computer vision, Springer, 2022, pp. 728–755

2022

[2] [2]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al., Grounding dino: Marrying dino with grounded pre-training for open-set object detection, in: European conference on computer vision, Springer, 2024, pp. 38–55

2024

[3] [3]

Köhler, M

M. Köhler, M. Eisenbach, H.-M. Gross, Few-shot object detection: A comprehensive survey, IEEE transactions on neural networks and learning systems 35 (9) (2023) 11958–11978

2023

[4] [4]

Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, X. You, Few-shot object detection: Research advances and challenges, Information Fusion 107 (2024) 102307

2024

[5] [5]

X. Wang, T. Huang, J. Gonzalez, T. Darrell, F. Yu, Frustratingly simple few-shot object detection, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 9919–9928. URLhttps://proceedings.mlr.press/v119/wang20j.html

2020

[6] [6]

Doveh, N

S. Doveh, N. Shabtay, E. Schwartz, H. Kuehne, R. Giryes, R. Feris, L. Kar- linsky, J. Glass, A. Arbelle, S. Ullman, et al., Teaching vlms to localize specific objects from in-context examples, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9572–9582

2025

[7] [7]

B. Sun, B. Li, S. Cai, Y. Yuan, C. Zhang, Fsce: Few-shot object detection via contrastive proposal encoding, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7352–7362. 22

2021

[8] [8]

L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, C. Zhang, Defrcn: Decoupled faster r-cnn for few-shot object detection, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8681–8690

2021

[9] [9]

X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, L. Lin, Meta r-cnn: Towards general solver for instance-level low-shot learning, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9577– 9586

2019

[10] [10]

G. Han, J. Ma, S. Huang, L. Chen, S.-F. Chang, Few-shot object detection with fully cross-transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5321–5330

2022

[11] [11]

Zhang, Y

X. Zhang, Y. Liu, Y. Wang, A. Boularias, Detect everything with few examples, in: Proceedings of The 8th Conference on Robot Learning, Vol. 270 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 3986– 4004

2024

[12] [12]

X. Yu, Y. Sha, L. Liu, X. Shen, D. Yang, A closer look at cross-domain few-shot object detection: Fine-tuning matters and parallel decoder helps, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[13] [13]

C.-B. Feng, Y. Sha, L. Liu, Y. Yu, C. M. Vong, X. Yu, X. Shen, Few-shot object detection with vision foundation models and graph diffusion, in: The Fourteenth International Conference on Learning Representations, 2026

2026

[14] [14]

Espinosa, C

M. Espinosa, C. Yang, L. Ericsson, S. McDonagh, E. J. Crowley, No time to train! training-free reference-based instance segmentation, arXiv preprint arXiv:2507.02798 (2025)

work page arXiv 2025

[15] [15]

Psomas, G

B. Psomas, G. Retsinas, N. Efthymiadis, P. Filntisis, Y. Avrithis, P. Maragos, O. Chum, G. Tolias, Instance-level composed image retrieval, in: The Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[16] [16]

X. Hao, K. Zhu, H. Guo, H. Guo, N. Jiang, Q. Lu, M. Tang, J. Wang, Referring expression instance retrieval and a strong end-to-end baseline, in: Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 4464–4473

2025

[17] [17]

Y. Ren, B. Li, C. Zhang, Y. Zhang, B. Yin, Few-shot object localization, arXiv preprint arXiv:2403.12466 (2024)

work page arXiv 2024

[18] [18]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visualmodelsfromnaturallanguagesupervision, in: Internationalconference on machine learning, PmLR, 2021, pp. 8748–8763. 23

2021

[19] [19]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gor- don, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 2818–2829

2023

[20] [20]

J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation, in: Inter- national conference on machine learning, PMLR, 2022, pp. 12888–12900

2022

[21] [21]

J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International conference on machine learning, PMLR, 2023, pp. 19730– 19742

2023

[22] [22]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al., Sam 2: Segment anything in images and videos, in: International Conference on Learning Representations, Vol. 2025, 2025, pp. 28085–28128

2025

[23] [23]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2023) 34892–34916

2023

[25] [25]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al., Gemma 3 technical report, arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al., Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al., Qwen3-vl technical report, arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Zhang, H

H. Zhang, H. Li, F. Li, T. Ren, X. Zou, S. Liu, S. Huang, J. Gao, Leizhang, C. Li, et al., Llava-grounding: Grounded visual chat with large multimodal models, in: European Conference on Computer Vision, Springer, 2024, pp. 19–35

2024

[29] [29]

Y. Yao, Q. Yang, H. Zhong, J. Wei, Y. Men, S. Bai, M. Cui, Z. Yang, Qwen3- vl-seg: Unlocking open-world referring segmentation with vision-language grounding, arXiv preprint arXiv:2605.07141 (2026). 24

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [30]

Press, M

O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, M. Lewis, Measuring and narrowing the compositionality gap in language models, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 5687–5711

2023

[31] [31]

J. Qi, Z. Xu, Y. Shen, M. Liu, D. Jin, Q. Wang, L. Huang, The art of socratic questioning: Recursive thinking with large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4177–4199

2023

[32] [32]

G. Sun, C. Qin, J. Wang, Z. Chen, R. Xu, Z. Tao, Sq-llava: Self-questioning for large vision-language assistant, in: European Conference on Computer Vision, Springer, 2024, pp. 156–172

2024

[33] [33]

Prasad, E

A. Prasad, E. Stengel-Eskin, M. Bansal, Rephrase, augment, reason: Vi- sual grounding of questions for vision-language models, in: International Conference on Learning Representations, 2024

2024

[34] [34]

S. Min, M. Lewis, L. Zettlemoyer, H. Hajishirzi, Metaicl: Learning to learn in context, in: Proceedings of the 2022 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2791–2809

2022

[35] [35]

Monajatipoor, L

M. Monajatipoor, L. H. Li, M. Rouhsedaghat, L. Yang, K.-W. Chang, Metavl: Transferring in-context learning ability from language models to vision-language models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 495–508

2023

[36] [36]

K. P. Yu, Z. Zhang, F. Hu, S. Storks, J. Chai, Eliciting in-context learning in vision-language models for videos through curated data distributional properties, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 20416–20431

2024

[37] [37]

Sheng, D

D. Sheng, D. Chen, Z. Tan, Q. Liu, Q. Chu, J. Bao, T. Gong, B. Liu, S. Xu, N. Yu, Towards more unified in-context visual understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13362–13372

2024

[38] [38]

Everingham, L

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338

2010

[39] [39]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755

2014

[40] [40]

D. M. W. Powers, Evaluation: From precision, recall and f-measure to roc, informedness, markedness and correlation, Journal of Machine Learning Technologies 2 (1) (2011) 37–63. 25

2011

[41] [41]

E.J.Hu, Y.Shen, P.Wallis, Z.Allen-Zhu, Y.Li, S.Wang, L.Wang, W.Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022

2022

[42] [42]

H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, H. Ling, Lasot: A high-quality benchmark for large-scale single object tracking, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5374–5383

2019

[43] [43]

Samuel, R

D. Samuel, R. Ben-Ari, M. Levy, N. Darshan, G. Chechik, Where’s waldo: Diffusion features for personalized segmentation and retrieval, Advances in Neural Information Processing Systems 37 (2024) 128160–128181

2024

[44] [44]

Huang, X

L. Huang, X. Zhao, K. Huang, Got-10k: A large high-diversity benchmark for generic object tracking in the wild, IEEE transactions on pattern analysis and machine intelligence 43 (5) (2019) 1562–1577

2019

[45] [45]

L. Peng, J. Gao, X. Liu, W. Li, S. Dong, Z. Zhang, H. Fan, L. Zhang, Vast- track: Vast category visual object tracking, Advances in Neural Information Processing Systems 37 (2024) 130797–130818

2024

[46] [46]

Riquelme, J

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Su- sano Pinto, D. Keysers, N. Houlsby, Scaling vision with sparse mixture of experts, Advances in Neural Information Processing Systems 34 (2021) 8583–8595

2021

[47] [47]

McCloskey, N

M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, Psychology of Learning and Motivation 24 (1989) 109–165. doi:10.1016/S0079-7421(08)60536-8

work page doi:10.1016/s0079-7421(08)60536-8 1989

[48] [48]

Suhas Kotha and Percy Liang

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceed- ings of the National Academy of Sciences 114 (13) (2017) 3521–3526. doi:10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017

[49] [49]

B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, L. Yuan, Florence-2: Advancing a unified representation for a variety of vision tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829. 26 Appendix A. Additional Experimental Results Appendix A.1. Pretest on instruction prompts The prop...

2024