IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5625–5644 (2024)
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
verdicts: UNVERDICTED (5)
roles: background (1)
polarities: support (1)
representative citing papers
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
Multi-agent VLM frameworks outperform single VLMs for automated coding of on-screen collaborative learning behaviors using the ICAP framework.
A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
DBMF integrates scores from text-image and vision branches to improve out-of-distribution detection on endoscopic datasets by up to 24.84% over prior methods.
citing papers explorer
- DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
  DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
  V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
  Multi-agent VLM frameworks outperform single VLMs for automated coding of on-screen collaborative learning behaviors using the ICAP framework.
- Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation
  A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
- DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection
  DBMF integrates scores from text-image and vision branches to improve out-of-distribution detection on endoscopic datasets by up to 24.84% over prior methods.