WebSRC: A Dataset for Web-Based Structural Reading Comprehension
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background 1
Citation-polarity summary: background 1
Fields: cs.CV 2
Representative citing papers:
ViTexQA is a dataset forcing multi-frame text fusion for all questions, with FrameThinker achieving 6.3% ROUGE-L gain over baselines via CoT SFT and temporally-grounded RL.
Citing papers explorer:
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
  OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering
  ViTexQA is a dataset forcing multi-frame text fusion for all questions, with FrameThinker achieving 6.3% ROUGE-L gain over baselines via CoT SFT and temporally-grounded RL.