CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
hub
Chartqa: A benchmark for question answering about charts with visual and logical reasoning
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 12representative citing papers
RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
MMTR-Bench shows that current MLLMs face significant difficulty reconstructing masked text from visual context, especially at sentence and paragraph lengths.
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
citing papers explorer
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
-
Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.
-
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
-
Process Rewards with Learned Reliability
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
-
Can MLLMs "Read" What is Missing?
MMTR-Bench shows that current MLLMs face significant difficulty reconstructing masked text from visual context, especially at sentence and paragraph lengths.
-
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
- RISE: Reliable Improvement in Self-Evolving Vision-Language Models