ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
9 Pith papers cite this work.
citing papers explorer
- PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
CrystalXRD-Bench is a new 250-sample benchmark for VLMs on XRD peak indexing, where the best model (GPT-5.4) reaches Jaccard 0.5888 and 37.6% exact match while most stay below 0.50, showing the task remains unsolved.
- Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
- ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
- PlotPick: AI-powered batch extraction of numerical data from scientific figures
PlotPick shows that general vision-language models outperform the dedicated DePlot model on chart-to-table benchmarks, with the largest gains on box plots and other chart types absent from specialized training data.
- Large language model-enabled automated data extraction for concrete materials informatics
An LLM-powered agent pipeline extracts ~9,000 structured concrete materials records from 278 publications with F1 scores up to 0.97, creating the largest open blended cement concrete database and demonstrating that larger, richer datasets improve ML prediction and generalization.
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
A survey that proposes a taxonomy dividing document parsing into pipeline-based systems and VLM-driven unified models, and reviews components, metrics, benchmarks, and challenges.