{"total":485,"items":[{"citing_arxiv_id":"2605.22819","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cambrian-P: Pose-Grounded Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. InECCV, 2024. [95] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In3DV, 2025. [96] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025. [97] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. [98] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Contin-"},{"citing_arxiv_id":"2605.22816","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-21T17:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22678","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Swift Sampling: Selecting Temporal Surprises via Taylor Series","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:20:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22208","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T09:14:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22109","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21973","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:03:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"F2G improves video temporal grounding accuracy by decoupling event identification from boundary measurement using predictive temporal perception to create citable evidence segments for LLM reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21622","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-20T18:32:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21479","ref_index":102,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:58:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21059","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal LLMs under Pairwise Modalities","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20965","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:50:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ILVAD is a plug-and-play method that builds a saliency map from inter-layer attention discrepancies on early tokens to enhance visual evidence focus and ground generated text, reducing hallucinations in LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20950","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:37:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20892","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:31:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FruitEnsemble uses a weighted ensemble of backbones for top-3 candidates followed by MLLM arbitration on low-confidence samples to reach 70.49% accuracy on a new 306-class fruit dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20733","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches","primary_cat":"cs.CV","submitted_at":"2026-05-20T05:37:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Hybrid vision-language and geometric optimization framework generates editable minimal surfaces from sketches, reporting 0.844 topological similarity on 100 test sketches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20525","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T21:54:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20469","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation","primary_cat":"cs.CV","submitted_at":"2026-05-19T20:30:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20369","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DEL: Digit Entropy Loss for Numerical Learning of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:18:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19866","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:58:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19852","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19322","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:02:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19307","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:37:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20273","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modality-Decoupled Online Recursive Editing","primary_cat":"cs.LG","submitted_at":"2026-05-19T03:11:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19260","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees","primary_cat":"cs.AI","submitted_at":"2026-05-19T02:13:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19004","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-18T18:26:51+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoTraj is a new open multimodal dataset of 75 long-horizon egocentric human navigation sequences in urban environments with head pose, gaze, and scene data, plus benchmarks of trajectory prediction methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":114,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"[114] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. [115] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. [116] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze"},{"citing_arxiv_id":"2605.18621","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:31:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18547","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation","primary_cat":"cs.AI","submitted_at":"2026-05-18T15:27:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18359","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAVE: Re-Allocating Visual Attention in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T13:12:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18209","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-18T10:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatioRoute introduces dynamic prompt routing that improves zero-shot spatial VQA accuracy by up to 5% on the SQA3D benchmark across VLMs without 3D inputs or fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18172","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs","primary_cat":"cs.AI","submitted_at":"2026-05-18T10:15:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18160","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T10:04:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18115","ref_index":87,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens","primary_cat":"cs.CV","submitted_at":"2026-05-18T09:24:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17954","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A More Word-like Image Tokenization for MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-18T07:09:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17946","ref_index":28,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain","primary_cat":"cs.AI","submitted_at":"2026-05-18T07:03:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20246","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents","primary_cat":"cs.LG","submitted_at":"2026-05-18T04:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GROW decomposes trajectories into state-action samples to enable GRPO for multi-turn VLM agents and reports state-of-the-art results on more than 800 Minecraft tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17489","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Employing Vision-Language Models for Face Image Quality Assessment","primary_cat":"cs.CV","submitted_at":"2026-05-17T14:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vision-language models enable zero-shot face image quality assessment whose biometric utility depends on model architecture rather than size, with outputs that align with traditional methods but vary by prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17366","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation","primary_cat":"cs.IR","submitted_at":"2026-05-17T10:20:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17336","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms","primary_cat":"cs.RO","submitted_at":"2026-05-17T09:09:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17310","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17283","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","primary_cat":"cs.CL","submitted_at":"2026-05-17T06:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17228","ref_index":76,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making","primary_cat":"cs.CL","submitted_at":"2026-05-17T02:28:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17128","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"New Wide-Net-Casting Jailbreak Attacks Risk Large Models","primary_cat":"cs.CR","submitted_at":"2026-05-16T19:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16953","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study","primary_cat":"cs.AI","submitted_at":"2026-05-16T12:08:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16877","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions","primary_cat":"cs.CV","submitted_at":"2026-05-16T08:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FaithTrace uses the directional derivative of class logits along text-induced directions in feature space as an influence score to produce and evaluate more faithful zero-shot textual explanations for image classifiers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16732","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-16T00:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16090","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation","primary_cat":"cs.CR","submitted_at":"2026-05-15T15:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15951","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15864","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination","primary_cat":"cs.CV","submitted_at":"2026-05-15T11:31:14+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15824","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization","primary_cat":"cs.CV","submitted_at":"2026-05-15T10:25:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-free KV cache rescheduling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15755","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:16:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes AGSR and the FAB-G supervised multi-agent framework that predicts attribute salience from human annotations to constrain MLLM emotion reasoning, yielding gains on EmoArt and cross-dataset tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15714","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation","primary_cat":"cs.SE","submitted_at":"2026-05-15T08:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Early-stage QA in annotation pipelines is more cost-effective than late-stage validation, supported by a proposed taxonomy of trigger points and a parametric error-propagation model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}