{"total":22,"items":[{"citing_arxiv_id":"2605.23287","ref_index":28,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images","primary_cat":"cs.CV","submitted_at":"2026-05-22T06:59:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LangFlash introduces a feed-forward model for 3D language Gaussian splatting from sparse unposed images, claiming superior novel view synthesis and semantic consistency via enriched training data and sparse semantic encoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10190","ref_index":29,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T08:40:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"ventional detectors [1, 34] and more recent Transformer-based models [2, 49] have achieved remarkable progress by learning powerful visual representations and global context modeling. However, their label spaces are typically closed, constrained to a fixed set of categories defined during training. Figure 1. Qualitative comparison on LVIS between the baseline detector (Grounding DINO-T [29]) and the proposed DetRefiner. 
DetRefiner significantly improves detection performance (AP r: 19.9→30.0 / AP all: 27.4→34.1), successfully detecting more objects and producing better calibrated scores (e.g., fork, knife, painting, polo shirt). For visualization, a box score threshold of 0.3 and an IoU threshold of 0.3 are applied for class-wise NMS."},{"citing_arxiv_id":"2605.10130","ref_index":26,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"open-world understanding. As a result, current thermal detectors remain closed-set, struggling to generalize to unseen object classes or adapt to real-world scenarios where new object types frequently appear. Open-vocabulary object detection (OVD) has recently emerged as a powerful paradigm for scalable recognition, driven by models such as GLIP [22], Grounding DINO [26], OWLv2 [30], and LLMDet [6] that couple visual and linguistic supervision. These models leverage large-scale RGB datasets with paired captions to learn rich, text-conditioned visual representations, enabling detection of unseen categories through natural-language prompts. 
However, their success is largely confined to the visible spectrum, where semantics and texture cues are consistent with"},{"citing_arxiv_id":"2604.23718","ref_index":24,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection","primary_cat":"cs.CV","submitted_at":"2026-04-26T14:02:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21453","ref_index":27,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Instance-level Visual Active Tracking with Occlusion-Aware Planning","primary_cat":"cs.CV","submitted_at":"2026-04-23T09:11:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18476","ref_index":30,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:28:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich 
features for long-tailed classes in camera-only 3D detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11402","ref_index":27,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Scene Change Detection with Vision-Language Representation Learning","primary_cat":"cs.CV","submitted_at":"2026-04-13T12:43:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LangSCD fuses VLM-generated text descriptions with visual features and adds geometric-semantic matching to improve scene change detection, while releasing the NYC-CD dataset of 8122 New York City image pairs with multiclass annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10527","ref_index":48,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"STORM: End-to-End Referring Multi-Object Tracking in Videos","primary_cat":"cs.CV","submitted_at":"2026-04-12T08:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08916","ref_index":34,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-10T03:26:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D 
masks that refine into accurate zero-shot 3D instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08337","ref_index":36,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-09T15:10:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05780","ref_index":26,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion","primary_cat":"cs.CV","submitted_at":"2026-04-07T12:17:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSCBench-KITTI-360 for monocular 3D scene completion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04444","ref_index":17,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-06T05:41:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HSA-DINO 
improves open-vocabulary object detection on domain-shifted tasks via hierarchical semantic prompts and dynamic routing while preserving pre-trained generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02905","ref_index":19,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting","primary_cat":"cs.CV","submitted_at":"2026-04-03T09:24:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniSpector organizes visual prompt space with spatial-spectral and contrastive encoders to support open-set defect localization, beating baselines by at least 19.7% AP50b and 15.8% AP50m on the new Inspect Anything benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"intra-class variability and low inter-class separability due to variations in texture, illumination, or geometry. Despite strong supervised performance, all of these methods assume a fixed defect taxonomy and operate in a closed-set regime, making them unsuitable for real-world industrial environments where novel defect types continue to emerge. 2.2. Visual/Text-prompted Open-Set Detection GroundingDINO [19] and YOLO-World [5] introduce novel object detection by associating language descriptions with visual features, without relying on pre-defined categories. However, in domains involving specialized or technical objects, verbal descriptions can be challenging. 
DINOv [18] explores the use of visual prompts as in-context examples for general vision tasks, while T-Rex2 [12] and"},{"citing_arxiv_id":"2604.00503","ref_index":20,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training","primary_cat":"cs.CV","submitted_at":"2026-04-01T05:36:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"large-scale vision-language model CLIP [24]. RegionCLIP [37] and DetCLIP [31] align language with image regions and leverage image-text pairs with pseudo boxes to enrich region-level knowledge for more generalized object detection. GLIP [17] further unifies object detection and phrase grounding tasks, transforming detection into text-guided region localization. Grounding DINO [20] further proposes a feature enhancer and a cross-modality decoder to achieve denser fusion. GLEE [29], using sentence-level text encoding, achieves object perception tasks without task-specific fine-tuning. Moreover, text-prompted and long-tail perception paradigms have also been extended to segmentation and tracking [2-4, 16, 23]. 
Object detection with visual prompts."},{"citing_arxiv_id":"2602.22683","ref_index":31,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses","primary_cat":"cs.CV","submitted_at":"2026-02-26T06:55:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.05991","ref_index":14,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"3D Instruction Ambiguity Detection","primary_cat":"cs.AI","submitted_at":"2026-01-09T18:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Defines 3D Instruction Ambiguity Detection as a new task, releases the Ambi3D benchmark, shows state-of-the-art 3D LLMs struggle with it, and proposes the AmbiVer framework that gathers multi-view visual evidence to guide VLMs in judging ambiguity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20538","ref_index":29,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment","primary_cat":"cs.CV","submitted_at":"2025-12-23T17:29:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AlignPose performs generalizable 6D pose estimation by multi-view feature-metric refinement that minimizes feature discrepancy between on-the-fly rendered object features and 
observed images across calibrated views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12598","ref_index":23,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Setting the Stage: Text-Driven Scene-Consistent Image Generation","primary_cat":"cs.CV","submitted_at":"2025-12-14T08:35:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03666","ref_index":21,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos","primary_cat":"cs.CV","submitted_at":"2025-12-03T10:54:44+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01236","ref_index":19,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards","primary_cat":"cs.CV","submitted_at":"2025-12-01T03:25:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency 
and prompt adherence for multi-subject personalized image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21064","ref_index":25,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection","primary_cat":"cs.AI","submitted_at":"2025-11-26T05:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OVOD-Agent models visual reasoning as a weakly Markovian decision process with bandit-driven exploration to create a self-evolving open-vocabulary detector that improves on rare categories in COCO and LVIS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.12554","ref_index":23,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis","primary_cat":"cs.CV","submitted_at":"2025-11-16T11:16:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EmoVerse is a large open-source dataset enabling interpretable visual emotion analysis via B-A-S triplets, region grounding, and unified CES/DES representations created through an MLLM-driven pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}