Canonical reference

Title resolution pending

Minesh Mathew, Dimosthenis Karatzas · 2020 · arXiv 2007.00398

Canonical reference. 80% of citing Pith papers cite this work as background.

17 Pith papers citing it


Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background: 4 · dataset: 1

citation-polarity summary

background: 4 · use dataset: 1

representative citing papers

FCMBench-Video: Benchmarking Document Video Intelligence

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

cs.CV · 2025-11-19 · unverdicted · novelty 7.0

FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

cs.CV · 2025-09-09 · conditional · novelty 7.0

Visual-TableQA is a new open-domain benchmark of rendered table images and complex QA pairs created via multi-LLM collaborative generation, with fine-tuned models showing robust generalization to external tests.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

cs.CV · 2026-05-20 · conditional · novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.

Visual Reasoning through Tool-supervised Reinforcement Learning

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

MM-AQA shows frontier VLMs rarely abstain on unanswerable multimodal questions, multi-agent setups improve abstention at an accuracy cost, and effective abstention needs training rather than prompting or extra agents.

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster decoding.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

FileGram: Grounding Agent Personalization in File-System Behavioral Traces

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.

Routing-Based Continual Learning for Multimodal Large Language Models

cs.LG · 2025-11-03 · unverdicted · novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.

Long Context Transfer from Language to Vision

cs.CV · 2024-06-24 · unverdicted · novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

PaliGemma 2: A Family of Versatile VLMs for Transfer

cs.CV · 2024-12-04 · unverdicted · novelty 4.0

PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and resolutions.

PaliGemma: A versatile 3B VLM for transfer

cs.CV · 2024-07-10 · unverdicted · novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

citing papers explorer

Showing 17 of 17 citing papers.

FCMBench-Video: Benchmarking Document Video Intelligence · cs.CV · 2026-04-28 · unverdicted · none · ref 5
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference · cs.DC · 2026-04-21 · unverdicted · none · ref 28
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment · cs.CV · 2026-04-09 · unverdicted · none · ref 50
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR · cs.CV · 2025-11-19 · unverdicted · none · ref 20
Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images · cs.CV · 2025-09-09 · conditional · none · ref 26
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding · cs.CV · 2025-04-14 · unverdicted · none · ref 53
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models · cs.CV · 2026-05-20 · conditional · none · ref 55
Visual Reasoning through Tool-supervised Reinforcement Learning · cs.CV · 2026-04-21 · unverdicted · none · ref 14
Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems · cs.CL · 2026-04-16 · unverdicted · none · ref 1
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction · cs.LG · 2026-04-14 · unverdicted · none · ref 45
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models · cs.CV · 2026-04-09 · unverdicted · none · ref 24
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models · cs.CV · 2026-04-08 · unverdicted · none · ref 29
FileGram: Grounding Agent Personalization in File-System Behavioral Traces · cs.CV · 2026-04-06 · unverdicted · none · ref 17
Routing-Based Continual Learning for Multimodal Large Language Models · cs.LG · 2025-11-03 · unverdicted · none · ref 44
Long Context Transfer from Language to Vision · cs.CV · 2024-06-24 · unverdicted · none · ref 54
PaliGemma 2: A Family of Versatile VLMs for Transfer · cs.CV · 2024-12-04 · unverdicted · none · ref 64
PaliGemma: A versatile 3B VLM for transfer · cs.CV · 2024-07-10 · unverdicted · none · ref 94
