RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Introduces VIG metric to measure visual contribution via perplexity reduction and applies it for selective training of LVLMs on high-VIG samples and tokens to improve grounding with reduced supervision.
SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
citing papers explorer
-
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
-
Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain
Introduces VIG metric to measure visual contribution via perplexity reduction and applies it for selective training of LVLMs on high-VIG samples and tokens to improve grounding with reduced supervision.
-
SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
-
Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.