DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding, 2024

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You · 2024

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

cs.CV · 2026-02-19 · unverdicted · novelty 7.0

Introduces the Visual Information Gain (VIG) metric, which measures visual contribution via perplexity reduction, and applies it to selectively train LVLMs on high-VIG samples and tokens, improving grounding with reduced supervision.

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.

citing papers explorer

Showing 4 of 4 citing papers.

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

cs.RO · 2026-05-07 · unverdicted · none · ref 39

RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

cs.CV · 2026-02-19 · unverdicted · none · ref 17

Introduces the Visual Information Gain (VIG) metric, which measures visual contribution via perplexity reduction, and applies it to selectively train LVLMs on high-VIG samples and tokens, improving grounding with reduced supervision.

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

cs.CV · 2026-05-18 · unverdicted · none · ref 37

SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

cs.CV · 2026-05-17 · unverdicted · none · ref 40

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
