Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
hub Mixed citations
Eyes wide shut? exploring the visual shortcomings of multimodal llms
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
citing papers explorer
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
-
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
-
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
- The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm