Recognition: no theorem link
From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes
Pith reviewed 2026-05-15 16:16 UTC · model grok-4.3
The pith
L2G-Det detects and segments novel object instances in cluttered scenes by matching dense patches from templates to prompt an augmented SAM model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
L2G-Det bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks.
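The abstract gives no implementation details for the matching step. As a rough illustration only (the function, the patch-grid layout, and all parameters here are assumptions, not the authors' code), dense patch-level matching can be sketched as cosine similarity between template and query patch embeddings, with the best-matching query patches becoming candidate prompt points:

```python
import numpy as np

def candidate_points(template_feats, query_feats, grid_w, top_k=5):
    """Match template patch features against query patch features and
    return (row, col) grid coordinates of the top-k query patches by
    their best template similarity.

    template_feats: (T, D) array of template patch embeddings.
    query_feats:    (Q, D) array of query patch embeddings, laid out as
                    a grid with `grid_w` columns.
    """
    # L2-normalize so that dot products are cosine similarities.
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ t.T                       # (Q, T) cosine similarities
    best = sim.max(axis=1)              # best template match per query patch
    idx = np.argsort(best)[::-1][:top_k]
    points = [(i // grid_w, i % grid_w) for i in idx]
    return points, best[idx]
```

In a real pipeline the embeddings would come from a frozen backbone (the paper's reference graph suggests DINO-style features, but the abstract does not say), and the returned grid coordinates would be scaled to pixel coordinates before being used as prompts.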
What carries the argument
Dense patch-level matching between template and query image patches, producing candidate points that a candidate selection module refines before they prompt an augmented SAM with instance-specific tokens.
If this is right
- Better handling of occlusion and background clutter than proposal-first methods in open-world robotic settings.
- Detection and segmentation of novel instances using only a small set of template images without retraining.
- More complete instance masks by using filtered local matches to guide SAM prompting.
Where Pith is reading between the lines
- The approach could support real-time robotic grasping by supplying precise masks directly from templates.
- Extending the matching step across video frames might enable consistent tracking of the same instance.
- The candidate selection logic could apply to other dense-matching tasks such as visual localization.
Load-bearing premise
Dense patch matching will yield accurate enough candidate points in cluttered scenes and the selection module will keep true matches while dropping false ones so the augmented SAM can build full masks.
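The paper does not specify how the selection module works, so this premise cannot be checked against a concrete criterion. A minimal sketch of what such a step could look like (the threshold, suppression radius, and greedy NMS-style pass are hypothetical choices, not taken from the paper) is: drop low-similarity candidates, then suppress near-duplicate points in score order:

```python
import numpy as np

def select_candidates(points, scores, sim_thresh=0.6, min_dist=2.0):
    """Hypothetical candidate selection: discard points whose match
    score falls below `sim_thresh`, then greedily keep high-scoring
    points while suppressing any neighbor closer than `min_dist`
    (an NMS-style pass over point prompts)."""
    pts = np.asarray(points, dtype=float)
    sc = np.asarray(scores, dtype=float)
    keep_mask = sc >= sim_thresh          # score filter
    pts, sc = pts[keep_mask], sc[keep_mask]
    kept = []
    for i in np.argsort(sc)[::-1]:        # highest score first
        if all(np.linalg.norm(pts[i] - pts[j]) >= min_dist for j in kept):
            kept.append(i)
    return [tuple(pts[i]) for i in kept], sc[list(kept)].tolist()
```

Whether a criterion of this kind preserves true matches under heavy occlusion is exactly the load-bearing question: a fixed threshold trades false positives against dropped fragments of the target.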
What would settle it
A cluttered test scene with heavy occlusion where the final masks are fragmented or miss the target object because the candidate points after selection are too noisy or incomplete.
Original abstract
Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes L2G-Det, a local-to-global template-guided instance detection and segmentation framework for open-world robotic perception. Given a small set of template images, the method performs dense patch-level matching between templates and the query scene to generate candidate points, refines these via a candidate selection module to suppress false positives, and feeds the filtered points as prompts to an augmented Segment Anything Model (SAM) equipped with instance-specific object tokens to reconstruct complete instance masks. The central claim is improved robustness to occlusion and clutter relative to proposal-based baselines.
Significance. If the empirical results hold, the work offers a coherent proposal-free pipeline that could advance open-world instance segmentation in robotics by replacing brittle explicit proposals with dense matching-derived points and SAM prompting. The local-to-global design directly targets failure modes under clutter and occlusion, and the integration of matching with an augmented SAM is a logically consistent extension of existing architectures.
Major comments (2)
- [Abstract] The statement that 'Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings' supplies no quantitative metrics, tables, ablation studies, or details on the candidate selection module, leaving the central empirical claim without visible support in the manuscript.
- [Method] Method description (local-to-global pipeline): The candidate selection module is described only at a high level as suppressing false positives; no architecture, loss, training procedure, or decision criterion is provided, which is load-bearing for the claim that filtered points enable reliable SAM mask reconstruction.
Minor comments (1)
- [Title/Abstract] The acronym L2G-Det is used in the title and abstract without an explicit expansion on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive evaluation of the significance of L2G-Det. We will revise the manuscript to strengthen the empirical support in the abstract and to provide a more detailed description of the candidate selection module.
Point-by-point responses
Referee: [Abstract] The statement that 'Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings' supplies no quantitative metrics, tables, ablation studies, or details on the candidate selection module, leaving the central empirical claim without visible support in the manuscript.
Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will add specific metrics (e.g., mAP and mask IoU gains over proposal-based baselines on the evaluated open-world benchmarks) together with explicit references to the corresponding tables and ablations. revision: yes
Referee: [Method] Method description (local-to-global pipeline): The candidate selection module is described only at a high level as suppressing false positives; no architecture, loss, training procedure, or decision criterion is provided, which is load-bearing for the claim that filtered points enable reliable SAM mask reconstruction.
Authors: We acknowledge that the candidate selection module requires a more complete technical description. We will expand the method section to include the module's full architecture, the loss function, training procedure, and the precise decision criteria used to filter false-positive points, thereby clarifying how the refined prompts improve SAM mask quality. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The described pipeline relies on external components (dense patch matching between templates and query image, a candidate selection module, and an augmented Segment Anything Model) whose independence is not contradicted by the abstract. No equations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present. The central claim of generating instance masks from local matches is constructed as a sequence of distinct modules rather than reducing to its own inputs by definition. The approach remains self-contained against external benchmarks such as SAM.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Dense patch-level matching between a small set of templates and the query image produces usable candidate points even under occlusion and clutter.