{"total":17,"items":[{"citing_arxiv_id":"2605.22273","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability","primary_cat":"cs.CV","submitted_at":"2026-05-21T10:15:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CFGPatch combines curved fractal geometry with modality-specific spiral textures to create adversarial patches that fool VIS-IR VLMs and transfer across classification, captioning, and VQA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17310","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13375","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T11:32:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09429","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-10T09:07:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is a robust alternative to one-shot scalar pruning. 1 Introduction Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5] achieve strong multimodal reasoning by processing visual inputs and text instructions in a unified sequence [6, 7]. To preserve fine-grained visual evidence, recent models encode images with dense patch representations [8, 9, 10] or dynamic tiling strategies [11, 5, 12], producing sequences from hundreds to thousands of visual tokens. Although these tokens improve perceptual fidelity, many correspond to background regions or details irrelevant to a given instruction [13, 14]. 
The resulting redundancy substantially increases inference cost, especially in decoder self-attention, whose complexity grows quadratically with sequence length [15, 16, 17, 18]."},{"citing_arxiv_id":"2605.08985","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","primary_cat":"cs.CV","submitted_at":"2026-05-09T15:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with visual tokens withdrawal for rapid inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5334-5342, 2025. [23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296-26306, 2024. [24] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. [25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892-34916, 2023. 
[26] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi"},{"citing_arxiv_id":"2605.08560","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ZAYA1-VL-8B Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-08T23:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"These vision tokens are then fed as vision embeddings to the LLM without other architectural changes and the LLM itself is finetuned to understand the meaning of these new tokens. Beyond the choice of connector, several complementary innovations on the vision encoder side, have become central to modern VLM design. Dynamic resolution strategies, popularized by the AnyRes technique in LLaVA-NeXT [24], allow VLMs to process images at their native aspect ratio and resolution by partitioning them into fixed-size tiles, a constraint imposed by the vision encoder's absolute position embeddings and fixed context window. 
While this substantially improves performance on detail-sensitive tasks such as OCR and fine-grained VQA, it introduces redundancy at tile boundaries"},{"citing_arxiv_id":"2604.08545","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Multimodal Large Language Models. Multimodal large language models (MLLMs) [3, 16, 36, 2, 45] have achieved strong performance on a wide range of vision-language tasks by integrating visual encoders with large language models [1, 17]. Early MLLMs mainly focus on direct answer generation for tasks such as visual question answering and image understanding [16, 14, 35]. Inspired by the success of chain-of-thought in LLMs, recent MLLMs introduce explicit intermediate reasoning to handle more complex multimodal problems [11, 40]. These models generate step-by-step textual rationales before producing final answers, leading to improvements on complex multimodal reasoning tasks [30, 44, 47, 49]. 
More recently, several"},{"citing_arxiv_id":"2510.21122","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","primary_cat":"cs.CV","submitted_at":"2025-10-24T03:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12382","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Exploring the Secondary Risks of Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-06-14T07:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.04973","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts","primary_cat":"cs.AI","submitted_at":"2024-07-06T06:48:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine 
capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.01284","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"from Wikipedia [37] and textbooks, thereby providing essential knowledge for LMMs' reasoning. Figure 1 illustrates our overview experimental results. Not surprisingly, GPT-4o [38] achieves the best overall performance across different visual mathematics categories. Closed-source LLMs (GPT-4V, Gemini 1.5 Pro) and LMMs with larger parameter scales (LLaVA-NeXT-110B [39]) generally exhibit superior visual mathematical reasoning capabilities. However, most LMMs perform significantly worse on multi-step problems compared to one-step problems, suggesting that the number of knowledge concepts is positively correlated with the question's difficulty and negatively correlated with LMM performance. 
In specialized disciplines, most LMMs excel in calculation but"},{"citing_arxiv_id":"2406.16852","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Long Context Transfer from Language to Vision","primary_cat":"cs.CV","submitted_at":"2024-06-24T17:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ring attention with blockwise transformers for near-infinite context, 2023. [46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. [47] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. [48] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. [49] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners, 2024. [50] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 
Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023."},{"citing_arxiv_id":"2406.09411","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-13T17:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.19088","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions","primary_cat":"cs.CL","submitted_at":"2024-05-29T13:51:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.16994","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","primary_cat":"cs.CV","submitted_at":"2024-04-25T19:29:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra 
parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.07104","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SGLang: Efficient Execution of Structured Language Model Programs","primary_cat":"cs.AI","submitted_at":"2023-12-12T09:34:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"optimize multi-call programs for API-only models. Using SGLang, we implemented various LLM applications, including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, multi-turn chat, and multi-modality processing. We tested the performance on models including Llama-7B/70B [49], Mistral-8x7B [17], LLaVA-v1.5-7B (image) [28], and LLaVA-NeXT-34B (video) [62] on NVIDIA A10G and A100 GPUs. Experimental results show that SGLang achieves up to 6.4× higher throughput across a wide range of workloads, models, and hardware setups, compared to existing programming and inference systems, including Guidance [13], vLLM [23], and LMQL [4]. 
2 dimensions = [\"Clarity\", \"Originality\", \"Evidence\"] @function def multi_dimensional_judge(s, path, essay): s += system(\"Evaluate an essay about an image."},{"citing_arxiv_id":"2305.07895","ref_index":71,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2023-05-13T11:28:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"InternVL2-Llama3-76B [64] Shanghai AI Lab 842 PaliGemma-3B-mix-448 [65] Google 614 CongRong [66] CloudWalk 827 Cambrian-13B [63] NYU 610 GLM-4v [67] Zhipu AI 814 MiniCPM-V-2 [68] OpenBMB 605 MiniMonkey-2B [69] HUST 802 Cambrian-34B [63] NYU 591 InternVL2-8B [64] Shanghai AI Lab 794 CogVLM-17B-Chat [67] Zhipu AI 590 Claude3.5-Sonnet [70] Anthropic 788 LLaVA-Next-Yi-34B [71] UW-Madison 574 GPT-4o-mini-20240718 [72] OpenAI 785 TextMonkey [30] HUST 561 InternVL2-4B [64] Shanghai AI Lab 784 Monkey-Chat [29] HUST 534 InternVL2-2B [64] Shanghai AI Lab 781 InternLM-XComposer2 [73] Shanghai AI Lab 532 GLM-4v-9B [67] Zhipu AI 776 LLaVA-Next-Vicuna-7B [71] UW-Madison 532 CogVLM2-19B-Chat [67] Zhipu AI 757 LLaVA-Next-Mistral-7B [71] UW-Madison 531"}],"limit":50,"offset":0}