{"work":{"id":"69dffacb-bfe8-442d-be86-48624c60426f","openalex_id":null,"doi":null,"arxiv_id":"2502.13923","raw_key":null,"title":"Qwen2.5-VL Technical Report","authors":null,"authors_text":"Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al","year":2025,"venue":"cs.CV","abstract":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.","external_url":"https://arxiv.org/abs/2502.13923","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-07-04T14:39:58.528783+00:00","pith_arxiv_id":"2502.13923","created_at":"2026-05-09T02:27:20.922186+00:00","updated_at":"2026-07-04T14:39:58.528783+00:00","title_quality_ok":false,"display_title":"Qwen2.5-VL Technical Report","render_title":"Qwen2.5-VL Technical Report"},"hub":{"state":{"work_id":"69dffacb-bfe8-442d-be86-48624c60426f","tier":"mega_hub","tier_reason":"1,000+ Pith inbound or 100,000+ external citations","pith_inbound_count":1264,"external_cited_by_count":null,"distinct_field_count":22,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-07-02T17:59:59+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"needed","recognition_status":"needed","updated_at":"2026-07-04T15:06:47.408072+00:00","tier_text":"mega_hub"},"tier":"mega_hub","role_counts":[{"context_role":"background","n":153},{"context_role":"baseline","n":57},{"context_role":"method","n":57},{"context_role":"dataset","n":5},{"context_role":"other","n":3}],"polarity_counts":[{"context_polarity":"background","n":147},{"context_polarity":"use_method","n":59},{"context_polarity":"baseline","n":56},{"context_polarity":"unclear","n":6},{"context_polarity":"use_dataset","n":5},{"context_polarity":"support","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:28.003986+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai"},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen"},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu"},{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang"},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge"},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song"}]},"error":null,"updated_at":"2026-05-13T18:13:28.001713+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T18:13:27.728863+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"error":null,"updated_at":"2026-05-13T17:25:55.668312+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T18:13:27.104671+00:00"},"reader_index":{"job_type":"reader_index","status":"succeeded","result":{"note":"annotated reader requires full-text/OA fetch; shell is wired for mega hubs","status":"reader queued"},"error":null,"updated_at":"2026-06-30T11:30:01.262604+00:00"},"recognition_alignment":{"job_type":"recognition_alignment","status":"succeeded","result":{"modules":["IndisputableMonolith.Gravity.PropagationSpeed","IndisputableMonolith.Foundation.PreTemporalForcingOrder","IndisputableMonolith.Physics.LightConeCausalityFromRS","IndisputableMonolith.Cosmology.EtaBPrefactorDerivation","IndisputableMonolith.Physics.MaxwellEquationsFromRS","IndisputableMonolith.Gravity.BlackHoleEntropyFromLedger","IndisputableMonolith.Thermodynamics.FermiDirac","IndisputableMonolith.Gravity.BlackHoleHorizonStates"],"query_chars":1895},"error":null,"updated_at":"2026-06-30T11:30:01.245724+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:27.732366+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T17:25:52.717336+00:00"}},"summary":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"authors":[{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang","source":"manual","import_confidence":0.72},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen","source":"manual","import_confidence":0.72},{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai","source":"manual","import_confidence":0.72},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song","source":"manual","import_confidence":0.72},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge","source":"manual","import_confidence":0.72},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu","source":"manual","import_confidence":0.72}]},"citers":{"total":1264,"items":[{"citing_arxiv_id":"2607.02517","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory","primary_cat":"cs.CV","submitted_at":"2026-07-02T17:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A video world model framework that uses LLM-orchestrated 3D trajectories as control signals for generation to achieve persistent dynamic object memory and viewpoint freedom.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02490","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-07-02T17:53:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VRRL trains LVLMs for visually grounded self-reflection via prefix masking and buffered roll-ins, yielding higher out-of-distribution accuracy on grounding and navigation tasks than standard RL baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02484","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning","primary_cat":"cs.CV","submitted_at":"2026-07-02T17:50:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EADP filters textual noise via statistical entropy then casts token selection as submodular maximization with spatial prior to preserve fine-grained cues in VLMs under strict budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02025","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating Vision-Language Models as a Zero-Shot Learning Alternative to You Only Look Once and Optical Character Recognition for Nigerian License Plate Recognition","primary_cat":"cs.CV","submitted_at":"2026-07-02T10:55:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Five VLMs are benchmarked on 88 Nigerian license plate images; Gemini and Qwen achieve lower character error rates than GPT-4o, Claude, and Llama in a zero-shot setting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01938","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation","primary_cat":"cs.RO","submitted_at":"2026-07-02T09:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysMani couples a physics-principled 3D Gaussian world model with a future-aware policy to achieve higher success rates on dynamic manipulation tasks in simulation and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01897","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rank-Then-Act: Reward-Free Control from Frame-Order Progress","primary_cat":"cs.LG","submitted_at":"2026-07-02T08:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTA trains a VLM as a progress ordinal scorer via GRPO on shuffled expert frames and uses Spearman rank correlation with temporal indices as a bounded RL reward, matching or exceeding prior video reward methods on discrete and continuous control benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01784","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpaceEra++: A Unified Framework Towards 3D Spatial Reasoning in Video","primary_cat":"cs.CV","submitted_at":"2026-07-02T06:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"SpaceEra++ adds ScenePick frame sampling and SpaceAlign pairwise constraints to the prior SpaceEra system, claiming consistent benchmark gains for 3D video spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01748","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RTE-FM-Dehazer: Radiative Transfer Equation Inspired Flow Matching for Real-World Image Dehazing","primary_cat":"cs.CV","submitted_at":"2026-07-02T06:06:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTE-FM-Dehazer trains a flow-matching model with an RTE-derived diffusion-absorption regularizer on a new 50k real-haze dataset and reports leading results on five real-world dehazing benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01737","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA","primary_cat":"cs.CV","submitted_at":"2026-07-02T05:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01707","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression","primary_cat":"cs.CV","submitted_at":"2026-07-02T04:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01658","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Teaching Vision-Language-Action Models What to See and Where to Look","primary_cat":"cs.CV","submitted_at":"2026-07-02T03:34:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DriveTeach-VLA adds Driving-aware Vision Distillation pretraining and 2D Trajectory-Guided Prompts to VLA models, then reports state-of-the-art results on NAVSIM and nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01191","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-07-01T17:24:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01117","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-07-01T16:04:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01086","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-07-01T15:40:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01050","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoSearcher: Anchor-Guided Progressive Reasoning for Remote Sensing Visual Grounding with Process Supervision","primary_cat":"cs.CV","submitted_at":"2026-07-01T15:12:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02089","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ESC: Emotional Self-Correction for Reliable Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-07-01T14:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ESC uses emotional cues triggered by an external verifier to enable training-free self-correction in VLMs, improving reliability on safety, hallucination, and reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00983","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-07-01T14:19:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"QCA selects compact, query-relevant keyframes from long videos via segment-wise budget allocation and diversity-aware addition, achieving higher accuracy than GPT-4o on LongVideoBench with half the frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00867","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection","primary_cat":"cs.CV","submitted_at":"2026-07-01T12:32:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EFlow separates temporal grounding from logical reasoning via two CoT stages and adds confidence-aware reflection, trained via SFT and RL on custom trajectory data, yielding gains on five video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00861","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control","primary_cat":"cs.CV","submitted_at":"2026-07-01T12:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TrajLoc enforces per-object trajectory constraints in I2V generation via attention-layer Gaussian heatmap substitution, yielding +4.3 dB PSNR and 51% lower endpoint error on datasets with up to 20 objects across two backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00858","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment","primary_cat":"cs.CV","submitted_at":"2026-07-01T12:23:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoVA introduces modular asymmetric dual projections to handle temporal misalignment and semantic asymmetry in long video-text alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00816","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards High-Resolution Visual Perception via Hierarchical Entity Exploration","primary_cat":"cs.CV","submitted_at":"2026-07-01T11:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HEE is a training-free, model-agnostic method for high-resolution visual perception in MLLMs using hierarchical entity exploration with dual scoring, detection, clustering, and backtracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00622","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Watch: Active Video Anomaly Understanding via Interleaved Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-07-01T08:41:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Anom-π framework for active video anomaly understanding via interleaved policy optimization and iDPO under weak supervision, claiming a 2B model outperforms larger SOTA VAU models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00465","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-07-01T05:34:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00410","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MindAU: EEG-Conditioned Facial Action Unit Editing via Dual-Stream Manifold Alignment","primary_cat":"cs.CV","submitted_at":"2026-07-01T04:12:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MindAU is a dual-stream manifold alignment system that conditions a multimodal diffusion editor on EEG signals to perform fine-grained, identity-preserving facial action unit edits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00302","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-07-01T01:02:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Splash partitions MLLM parameters into dormant and critical subspaces via significance quantification, updating only the dormant subspace for tactile alignment while preserving general capabilities and achieving SOTA on visuo-tactile benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00283","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What's Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models","primary_cat":"cs.RO","submitted_at":"2026-07-01T00:14:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces PKL to rank planning-critical occluded agents, creates a VLM-annotated benchmark on nuScenes, and shows fine-tuning on this data improves performance ~30% over random selection with smaller models outperforming larger zero-shot counterparts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00159","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting","primary_cat":"cs.CL","submitted_at":"2026-06-30T20:35:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audit of KB-VQA benchmarks reveals systematic violations of answer derivability, question clarity, and visual disambiguation assumptions, with new repair and multi-entity augmentation protocols producing different model performance trends.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00125","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners","primary_cat":"cs.CV","submitted_at":"2026-06-30T20:00:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeCoDe decomposes few-shot classification into binary pairwise image comparisons whose affirmative logits serve as similarity scores, enabling strong performance from unmodified MLLMs on twelve datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00115","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking","primary_cat":"cs.CV","submitted_at":"2026-06-30T19:51:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31982","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ERA: Entropy-Guided Visual Token Pruning with Rectified Attention for Efficient MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-30T17:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ERA proposes entropy-guided token pruning with bias-aware recycling and logit rectification to compress visual inputs in MLLMs while mitigating attention collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31933","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs","primary_cat":"cs.CV","submitted_at":"2026-06-30T16:38:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31924","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InstanceControl: Controllable Complex Image Generation without Instance Labeling","primary_cat":"cs.CV","submitted_at":"2026-06-30T16:33:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InstanceControl uses VLMs to auto-generate instance masks from text and visual conditions, with adaptive refinement, to enable controllable multi-object image generation without manual labeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31903","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attend, Transform, or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference","primary_cat":"cs.CV","submitted_at":"2026-06-30T16:08:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper proposes an operator-level visual-token skipping framework for MLLMs that reduces TFLOPs by 33.7% on Qwen3-VL while retaining 99.5% performance across VQA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31711","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist","primary_cat":"cs.AI","submitted_at":"2026-06-30T14:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31599","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-06-30T12:47:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31495","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Surprise as a Signal for Plasticity and Metacognition","primary_cat":"cs.AI","submitted_at":"2026-06-30T11:14:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Prediction-error surprise computed over frozen encoder latents gates episodic memory plasticity in continual ImageNet streams and modulates VLM assertiveness, hedging, and single-shot learning, with reported retention and AUROC gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31451","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation","primary_cat":"cs.RO","submitted_at":"2026-06-30T10:25:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31407","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?","primary_cat":"cs.CV","submitted_at":"2026-06-30T09:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31326","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Video Understanding and Generation in a Unified Framework","primary_cat":"cs.CV","submitted_at":"2026-06-30T08:29:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31257","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-30T07:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31187","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Deny: Action Denial in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-30T06:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31169","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Single Character: Evaluating MLLMs for Sentence-Level Oracle Bone Inscription Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-30T05:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces the S-OBI benchmark for sentence-level oracle bone inscription understanding and reports that current MLLMs remain dependent on character-level recognition due to propagating visual errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31157","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Foundation Model Collaboration: Enhancing Specialized Models through Proxy Task Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-30T05:35:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAT decomposes structured prediction into specialist hypothesis generation and foundation-model proxy reasoning, yielding consistent gains over baselines on detection, trajectory, and segmentation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31148","ref_index":81,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding","primary_cat":"cs.CV","submitted_at":"2026-06-30T05:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PruneGround prunes 3D scenes via language-guided VLM, reformulates descriptions with multi-view reasoning, and adapts a spatial LLM to achieve SOTA 3D visual grounding on ScanRefer and most Nr3D/Sr3D settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31054","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-30T02:46:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ADAPT reduces MLLM hallucinations 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30577","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms","primary_cat":"cs.CV","submitted_at":"2026-06-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Presents APRIL-MedSeg, a modular YAML-configurable toolbox for 2D medical image segmentation integrating semi-supervised, domain adaptation, distillation, weakly supervised, text-guided, and foundation model paradigms with unified dataset and deployment interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30378","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-29T14:38:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30288","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30251","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TACO: Tool-Augmented Credit Optimization for Agentic Tool Use","primary_cat":"cs.MA","submitted_at":"2026-06-29T13:01:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TACO combines Differential Answer-Probe Reward (DAPR) and Outcome-Gated Advantage Routing (OGAR) to assign credit to tool calls in agentic visual reasoning, producing accuracy gains on multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30220","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA","primary_cat":"cs.CV","submitted_at":"2026-06-29T12:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30217","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-29T12:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30084","ref_index":127,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding","primary_cat":"cs.CV","submitted_at":"2026-06-29T10:20:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30054","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-29T09:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29915","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:51:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"De-compositional Evidence Grounding decomposes visual reasoning into atomic sub-questions each tied to a specific image region to improve VLM performance and interpretability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29905","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StrucTab: A Structured Optimization Framework for Table Parsing","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29879","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LWDrive uses future-frame supervision on VLMs to create world-model features that a multi-layer Foresight Cascade Planner refines into final trajectories, reporting 92.0 on NAVSIM and 89.6 on NAVSIM-v2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29808","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework","primary_cat":"cs.HC","submitted_at":"2026-06-29T05:40:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29805","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation","primary_cat":"cs.CV","submitted_at":"2026-06-29T05:33:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29705","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots","primary_cat":"cs.AI","submitted_at":"2026-06-29T02:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29579","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-28T19:48:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScAle learns scalar coefficients to modulate last-token attention and MLP activations in frozen VLMs, achieving up to 134.1% relative accuracy gains on spatial benchmarks with only 1K parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29462","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIRROR: Aligning Semantic Relations from Language to Image via Gromov--Wasserstein","primary_cat":"cs.CV","submitted_at":"2026-06-28T15:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MIRROR derives a closed-form Semi-Inverse Gromov-Wasserstein loss to align language-derived relational priors with visual representations inside decoder-only Transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29376","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAD-GS: Learning Reliable 3D Semantic Gaussian Fields via Dynamic Geo-Semantic Anchoring","primary_cat":"cs.CV","submitted_at":"2026-06-28T12:52:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAD-GS proposes dynamic geo-semantic anchoring via SAD and GSFL to learn reliable 3D semantic Gaussian fields, reporting best performance on LERF-OVS, 3D-OVS, and Mip-NeRF360 for open-vocabulary localization and segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29357","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking","primary_cat":"cs.CV","submitted_at":"2026-06-28T12:12:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29350","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs","primary_cat":"cs.CV","submitted_at":"2026-06-28T11:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ST-Merge is a plug-and-play spatio-temporal token merging method that delivers 2x speedup on VLMs and 8.3x on a VLA at high resolution with minimal accuracy loss via 3D coordinate matching and positional correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29308","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MirrorPPR: Exemplar-Based Portrait Photo Retouching","primary_cat":"cs.CV","submitted_at":"2026-06-28T10:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29213","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study","primary_cat":"cs.CL","submitted_at":"2026-06-28T05:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benchmark of ten OCR systems on Devanagari shows synthetic text fails to separate models while real scans cause most to collapse, with specialized VLMs fragile and English performance not predictive of Indic results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29069","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Low-cost concept-based localized explanations: How far can we get with training-free approaches?","primary_cat":"cs.AI","submitted_at":"2026-06-27T20:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-scale MLLMs reach 62-88% object-level exact-match accuracy in zero-shot localized concept naming via closed-set prompting and an embedding-based Open-CoNa strategy across datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28864","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On Test-Time Scaling for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-27T11:12:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Small well-performing LVLMs gain the largest benefits from test-time scaling (up to ~30% improvement), often matching or exceeding larger models, while visual tokens contribute mainly early in the reasoning chain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28862","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector","primary_cat":"cs.CV","submitted_at":"2026-06-27T11:10:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28696","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models","primary_cat":"cs.AI","submitted_at":"2026-06-27T02:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COMPASS is a unified multimodal framework using a shared expert token τ_c to ground composition-intent for both perception and generation, backed by the new Comp-11 dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28551","ref_index":17,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DataComp-VLM: Improved Open Datasets for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-26T19:11:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28266","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning","primary_cat":"cs.CV","submitted_at":"2026-06-26T16:57:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RSICCLLM introduces a post-training framework with RSICI dataset, difference-aware supervised fine-tuning, and dual-negative preference optimization that claims to outperform much larger models on remote sensing image change captioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28149","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Robust In-Context Segmentation via Concept Guidance","primary_cat":"cs.CV","submitted_at":"2026-06-26T14:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CG-ICS improves ICS robustness by using MLLM-proposed textual concepts scored via SAM3 and tree search plus visual exemplars to activate a frozen SAM3, claiming SOTA accuracy and lower variance across references.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28133","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots","primary_cat":"cs.RO","submitted_at":"2026-06-26T14:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27988","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Visual Diffusion Reasoning with Monte Carlo Tree Search","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:35:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LVDR integrates keypoint-guided MCTS into a latent diffusion reasoning model to deliver competitive skill assessment accuracy alongside explicit visual reasoning trajectories on four sports and surgical datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27974","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProMSA:Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27880","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation","primary_cat":"cs.CV","submitted_at":"2026-06-26T09:23:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OrthoTryOn uses Orthogonal Subspace Projection on shared LoRA and Fisher-guided Negative Guidance to enable conflict-free unified fashion generation, outperforming task-specific models on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27872","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27828","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-26T08:12:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27330","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning","primary_cat":"cs.CL","submitted_at":"2026-06-25T17:44:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEEU enables a 7B MLLM to reach 30.6% accuracy on GUI task planning by autonomous exploration and hindsight experience synthesis, outperforming a 32B model through stronger high-level OOD generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27268","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-25T16:50:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27161","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference","primary_cat":"cs.AI","submitted_at":"2026-06-25T15:29:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TOPS formulates visual token pruning as constructing Token Optimal Preservation Sets using three information-theoretic principles and demonstrates superior performance on MLLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27153","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DMuon: Efficient Distributed Muon Training with Near-Adam Overhead","primary_cat":"cs.DC","submitted_at":"2026-06-25T15:23:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DMuon delivers 1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups for Muon on embodied foundation models and LLMs while matching AdamW per-step latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27147","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks","primary_cat":"cs.CV","submitted_at":"2026-06-25T15:18:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Iterative self-improving codebooks enhance safety in autoregressive multimodal models by self-identifying unsafe generations and updating the codebook to eliminate harmful visual token mappings without external feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26947","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Multi-Reference Image Generation with Dynamic Reward Optimization","primary_cat":"cs.CV","submitted_at":"2026-06-25T12:21:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces OmniRef-Bench benchmark and DyRef two-stage framework using Difficulty-aware Advantage Reweighting and Discriminative Reward Scaling to improve open-source models on complex multi-reference image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26923","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GAVEL: Grounded Caption Error Verification and Localization","primary_cat":"cs.CL","submitted_at":"2026-06-25T12:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26904","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Confidence-Aware Tool Orchestration for Robust Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-25T11:37:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Robust-TO integrates per-frame reliability scores into tool orchestration and a confidence-cost GRPO reward to improve video reasoning robustness under corruption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26794","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP","primary_cat":"cs.CV","submitted_at":"2026-06-25T09:27:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ReasonCLIP-58M applies continual pretraining with visually grounded reasoning captions on 58M examples to improve CLIP-style models on commonsense and compositional reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26602","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues","primary_cat":"cs.CV","submitted_at":"2026-06-25T05:02:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiCoBench is a new high-resolution multi-image benchmark exposing large gaps between top MLLMs and human performance (98.3%) on differential and commonality visual cue perception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26535","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP","primary_cat":"cs.CV","submitted_at":"2026-06-25T02:18:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRISP diagnoses a systematic perception-reasoning disconnect in VLMs, showing proprietary models have latent reasoning but poor metric estimation while open-source models lack compositional reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26025","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"In-Context World Modeling for Robotic Control","primary_cat":"cs.RO","submitted_at":"2026-06-24T16:53:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ICWM frames system identification as in-context adaptation so VLA policies can infer dynamics from self-generated interactions and handle novel configurations without parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25842","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs","primary_cat":"cs.CV","submitted_at":"2026-06-24T13:55:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Egocentric Scene Graphs convert long videos into short structured text so MLLMs can answer questions about entire sequences, achieving SOTA on HD-EPIC VQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25041","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-06-23T18:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24649","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Collaborative Cognition for Zero-Shot 3D Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-23T14:44:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A collaborative Planning-Perception agent framework using MLLMs constructs a holistic cognitive map through iterative viewpoint supplementation and achieves reported SOTA gains on six 3D benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24253","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TuringViT: Making SOTA Vision Transformers Accessible to All","primary_cat":"cs.CV","submitted_at":"2026-06-23T07:42:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23997","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChartWalker: Benchmarking the Cross-Chart RAG Task with Hierarchical Knowledge Graphs","primary_cat":"cs.IR","submitted_at":"2026-06-22T23:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ChartWalker provides a hierarchical knowledge graph construction method and structure-aware sampling to generate cross-chart RAG benchmarks, releasing ChartWalker-Bench that exposes performance gaps across RAG paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23881","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification","primary_cat":"cs.CL","submitted_at":"2026-06-22T19:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decoupled training-free IBA framework for KB-VQA selects entities via MLLM candidate choice then ranks evidence with off-the-shelf re-rankers, outperforming coupled fine-tuned baselines on Encyclopedic-VQA and InfoSeek.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23835","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-06-22T18:16:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ABACUS adapts a 3B unified foundation model using density-aware zooming, boundary-aware GRPO, and cycle-consistent self-critique to achieve SOTA on seven counting and generation benchmarks without task-specific training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23682","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping","primary_cat":"cs.CV","submitted_at":"2026-06-22T17:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse Context achieves 2-4x faster inference in reference-conditioned diffusion models by fine-tuning with random token dropping and applying task-aware selection at inference time, without loss of visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23581","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse","primary_cat":"cs.DC","submitted_at":"2026-06-22T16:47:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":100,"offset":0}}