hub Tool reference

Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, Song-Chun Zhu · 2021 · arXiv 2110.13214

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

22 Pith papers citing it

Method reference 80% of classified citations

read on arXiv browse 22 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 8 background 2

citation-polarity summary

use dataset 8 background 2

representative citing papers

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Octopus introduces history-free gradient orthogonalization in a two-stage finetuning framework to achieve state-of-the-art continual learning results for multimodal LLMs on the UCIT benchmark.

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

cs.CV · 2025-05-23 · unverdicted · novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

cs.CL · 2025-03-10 · unverdicted · novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

cs.CL · 2024-11-15 · conditional · novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.

MAny: Merge Anything for Multimodal Continual Instruction Tuning

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

cs.CV · 2024-08-09 · unverdicted · novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

cs.CV · 2024-08-03 · conditional · novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

cs.CV · 2023-10-14 · unverdicted · novelty 5.0

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

cs.LG · 2023-08-23 · unverdicted · novelty 5.0

MM-LIMA uses proposed quality metrics and a trainable selector to pick 200 high-quality multimodal instruction examples and outperforms MiniGPT-4 on evaluations.

DeepSeek-VL: Towards Real-World Vision-Language Understanding

cs.AI · 2024-03-08 · unverdicted · novelty 4.0

DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

cs.CV · 2024-02-06 · unverdicted · novelty 4.0

MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

cs.CV · 2023-09-26 · conditional · novelty 4.0

InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

citing papers explorer

Showing 22 of 22 citing papers.

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning cs.CV · 2026-05-11 · unverdicted · none · ref 28
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 47
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 56
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models cs.LG · 2026-05-14 · unverdicted · none · ref 41
Octopus introduces history-free gradient orthogonalization in a two-stage finetuning framework to achieve state-of-the-art continual learning results for multimodal LLMs on the UCIT benchmark.
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM cs.CV · 2025-05-23 · unverdicted · none · ref 45
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 83
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL cs.CL · 2025-03-10 · unverdicted · none · ref 44
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 167
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 60
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering cs.CV · 2026-04-18 · unverdicted · none · ref 25
CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
MAny: Merge Anything for Multimodal Continual Instruction Tuning cs.LG · 2026-04-15 · unverdicted · none · ref 10
MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 62
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 237
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone cs.CV · 2024-08-03 · conditional · none · ref 69
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 98
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 100
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning cs.CV · 2023-10-14 · unverdicted · none · ref 28
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets cs.LG · 2023-08-23 · unverdicted · none · ref 29
MM-LIMA uses proposed quality metrics and a trainable selector to pick 200 high-quality multimodal instruction examples and outperforms MiniGPT-4 on evaluations.
DeepSeek-VL: Towards Real-World Vision-Language Understanding cs.AI · 2024-03-08 · unverdicted · none · ref 20
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model cs.CV · 2024-02-06 · unverdicted · none · ref 50
MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition cs.CV · 2023-09-26 · conditional · none · ref 56
InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 116
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer