{"work":{"id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","openalex_id":null,"doi":null,"arxiv_id":"2303.05499","raw_key":null,"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","authors":null,"authors_text":"Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang","year":2023,"venue":"cs.CV","abstract":"In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at \\url{https://github.com/IDEA-Research/GroundingDINO}.","external_url":"https://arxiv.org/abs/2303.05499","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:45:23.501617+00:00","pith_arxiv_id":"2303.05499","created_at":"2026-05-09T06:10:36.790084+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","render_title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"},"hub":{"state":{"work_id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":85,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2023-04-17T17:59:25+00:00","last_pith_cited_at":"2026-05-21T19:51:20+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-10T20:06:48.171539+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":12},{"context_role":"method","n":11},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"use_method","n":10},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:59:41.591580+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"SAM 2: Segment Anything in Images and Videos","work_id":"acc13f66-d814-44f9-9688-375688bf2d4a","shared_citers":10},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":7},{"title":"Segment Anything","work_id":"2bbf46ca-720a-45a1-8e9c-10c33fbeada0","shared_citers":7},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":6},{"title":"SAM 3: Segment Anything with Concepts","work_id":"4a72a006-2592-4554-aad0-a9c41a9f952d","shared_citers":6},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":5},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":3},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":3},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":3},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":3},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":3},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":3},{"title":"Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks","work_id":"42c46ece-c4e8-4d5e-abae-f8d5b4208995","shared_citers":3},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":3},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":3},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":3},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":3},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":3},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":3},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":3},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":3},{"title":"SAM 3D: 3Dfy Anything in Images","work_id":"dc22e9ff-fcf5-4069-8ace-35ae3a0bfd7c","shared_citers":3},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":2,"year":2024},{"n":2,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:59:19.546840+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:59:59.570083+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","claims":[{"claim_text":"In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selec","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:59:50.264539+00:00"}},"summary":{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","claims":[{"claim_text":"In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selec","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"SAM 2: Segment Anything in Images and Videos","work_id":"acc13f66-d814-44f9-9688-375688bf2d4a","shared_citers":10},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":7},{"title":"Segment Anything","work_id":"2bbf46ca-720a-45a1-8e9c-10c33fbeada0","shared_citers":7},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":6},{"title":"SAM 3: Segment Anything with Concepts","work_id":"4a72a006-2592-4554-aad0-a9c41a9f952d","shared_citers":6},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":5},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":3},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":3},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":3},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":3},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":3},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":3},{"title":"Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks","work_id":"42c46ece-c4e8-4d5e-abae-f8d5b4208995","shared_citers":3},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":3},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":3},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":3},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":3},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":3},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":3},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":3},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":3},{"title":"SAM 3D: 3Dfy Anything in Images","work_id":"dc22e9ff-fcf5-4069-8ace-35ae3a0bfd7c","shared_citers":3},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":2,"year":2024},{"n":2,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"authors":[]}}