Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models

Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji · 2024 · arXiv 2403.03003

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · baseline: 1

citation-polarity summary

background: 3 · baseline: 1

representative citing papers

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-language models.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

cs.CV · 2026-03-24 · unverdicted · novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

cs.CV · 2026-04-14 · unverdicted · novelty 5.0

A CVAE-based Variational Information Flow module is proposed to counteract visual attenuation in MLLMs and improve fine-grained perception on VQA and grounding tasks.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

citing papers explorer

Showing 7 of 7 citing papers.

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation · cs.CV · 2026-04-17 · unverdicted · none · ref 29
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing · cs.CV · 2026-04-15 · unverdicted · none · ref 14
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models · cs.CV · 2026-04-08 · unverdicted · none · ref 43
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling · cs.CV · 2026-03-24 · unverdicted · none · ref 38
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency · cs.CV · 2025-08-25 · unverdicted · none · ref 81
From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception · cs.CV · 2026-04-14 · unverdicted · none · ref 4
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites · cs.CV · 2024-04-25 · unverdicted · none · ref 76
