DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Pith reviewed 2026-05-11 10:04 UTC · model grok-4.3
The pith
DeepSeek-VL2 matches or exceeds prior vision-language models on multimodal tasks while using fewer activated parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-VL2 incorporates dynamic tiling for vision encoding to handle high-resolution images with different aspect ratios and uses DeepSeekMoE models with Multi-head Latent Attention to compress key-value caches, enabling efficient inference. Trained on an improved vision-language dataset, the three variants achieve competitive or state-of-the-art performance across multimodal tasks with similar or fewer activated parameters than existing open-source dense and MoE models.
What carries the argument
Dynamic tiling vision encoding, which lets the model process high-resolution images with varying aspect ratios, paired with Multi-head Latent Attention inside a Mixture-of-Experts language model, which compresses the key-value cache to reduce inference memory and latency.
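To make the tiling half of that mechanism concrete, the sketch below shows how a dynamic tiling scheme of this kind can pick a grid for an image of arbitrary aspect ratio and cut it into fixed-size tiles plus a global thumbnail. The tile size, tile budget, and grid-scoring rule are illustrative assumptions, not the paper's exact procedure.

# Hypothetical sketch of dynamic tiling: split a variable-aspect-ratio image
# into fixed-size local tiles plus a downscaled global view. Tile size,
# tile budget, and the grid-scoring rule are assumptions, not the paper's values.
from PIL import Image

TILE = 384        # assumed square tile edge in pixels
MAX_TILES = 9     # assumed budget on local tiles

def choose_grid(width, height, max_tiles=MAX_TILES):
    """Pick the (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def tile_image(img, tile=TILE):
    """Return a global thumbnail plus the list of local tiles."""
    cols, rows = choose_grid(*img.size)
    resized = img.resize((cols * tile, rows * tile))
    local = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    global_view = img.resize((tile, tile))
    return global_view, local

Each tile and the thumbnail would then be encoded separately by the vision tower, so the number of visual tokens scales with image resolution instead of being fixed; the candidate grids and resolutions actually used in DeepSeek-VL2 may differ from the assumptions above.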
Load-bearing premise
The performance improvements come primarily from the dynamic tiling strategy and Multi-head Latent Attention rather than from the improved training dataset or other tuning details.
What would settle it
Train an otherwise identical model without dynamic tiling or without Multi-head Latent Attention and check whether its scores on the reported benchmarks fall below the competitive range achieved by the full DeepSeek-VL2 variants.
original abstract
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DeepSeek-VL2, a family of Mixture-of-Experts vision-language models (DeepSeek-VL2-Tiny, Small, and the base model with 1.0B/2.8B/4.5B activated parameters). It describes two primary architectural upgrades over DeepSeek-VL: a dynamic tiling strategy for vision encoding that accommodates high-resolution images with varying aspect ratios, and the use of DeepSeekMoE equipped with Multi-head Latent Attention to compress KV caches for efficient inference. The models are trained on an improved vision-language dataset and are claimed to deliver competitive or state-of-the-art results on visual question answering, OCR, document/table/chart understanding, and visual grounding tasks while using similar or fewer activated parameters than prior open-source dense and MoE models.
Significance. If the performance numbers hold under scrutiny, the work would illustrate how targeted changes in vision tokenization and attention mechanisms can support strong multimodal capabilities at modest activated-parameter budgets, which is relevant for practical deployment. The public release of code and checkpoints is a clear positive for reproducibility.
major comments (2)
- [Abstract and §1 (Introduction)] The text states that the models 'significantly improves upon its predecessor... through two key major upgrades' and achieve their results 'thanks to' dynamic tiling and Multi-head Latent Attention. However, the experimental section provides no ablation that fixes the training dataset, data mixture, and optimization schedule while removing or replacing only the dynamic tiling (reverting to fixed-resolution encoding) or only the Multi-head Latent Attention (reverting to standard attention within the MoE layers). Without such controls, the causal contribution of the two architectural changes to the reported efficiency-performance trade-off cannot be isolated from possible gains due to the 'improved vision-language dataset' or unstated hyperparameter differences.
- [Experimental results (tables comparing against other models)] The benchmark tables report point estimates for the three variants but do not include standard deviations across multiple runs, confidence intervals, or statistical tests. This makes it difficult to determine whether the claimed 'competitive or state-of-the-art' margins are robust, especially for the smaller 1.0B and 2.8B variants where variance is typically higher.
minor comments (2)
- [Model architecture description] The manuscript would benefit from an explicit table or paragraph comparing total (non-activated) parameter counts alongside the activated counts for both DeepSeek-VL2 variants and the baseline models; this would clarify the sparsity level achieved by the MoE design.
- [Figures] Figure captions for the dynamic tiling illustration and the attention mechanism diagram could be expanded to include the exact mathematical formulation or pseudocode for the tiling selection and latent vector compression steps.
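On the latent-vector compression step mentioned in the comment above, a minimal single-head sketch of the general idea follows: project each token's hidden state down to a small latent, cache only that latent, and up-project it back to keys and values at attention time. The dimensions and weight names are made-up assumptions, and the paper's Multi-head Latent Attention is more elaborate (multiple heads, decoupled positional keys, joint training with the MoE language model).

# Illustrative low-rank KV caching: store one small latent per past token and
# reconstruct keys/values from it on the fly. All sizes and weights are assumptions.
import numpy as np

d_model, d_latent, d_head = 1024, 128, 64
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(d_model, d_latent))  # hidden -> cached latent
W_uk = rng.normal(scale=0.02, size=(d_latent, d_head))     # latent -> key
W_uv = rng.normal(scale=0.02, size=(d_latent, d_head))     # latent -> value
W_q = rng.normal(scale=0.02, size=(d_model, d_head))       # hidden -> query

latent_cache = []  # d_latent floats per token, shared across heads in a real model

def decode_step(hidden_state):
    """Cache this token's compressed latent, then attend over all cached latents."""
    latent_cache.append(hidden_state @ W_down)
    latents = np.stack(latent_cache)        # (seq, d_latent)
    K = latents @ W_uk                      # reconstructed keys   (seq, d_head)
    V = latents @ W_uv                      # reconstructed values (seq, d_head)
    q = hidden_state @ W_q                  # (d_head,)
    weights = np.exp(q @ K.T / np.sqrt(d_head))
    weights /= weights.sum()
    return weights @ V                      # attention output (d_head,)

out = decode_step(rng.normal(size=d_model))
print(out.shape)  # (64,)

The memory saving comes from caching the shared latent rather than per-head keys and values: with many heads, d_latent floats per token replace roughly 2 * n_heads * d_head floats.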
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address the two major comments point by point below, indicating the revisions we intend to make to strengthen the presentation of our contributions.
point-by-point responses
-
Referee: [Abstract and §1 (Introduction)] The text states that the models 'significantly improves upon its predecessor... through two key major upgrades' and achieve their results 'thanks to' dynamic tiling and Multi-head Latent Attention. However, the experimental section provides no ablation that fixes the training dataset, data mixture, and optimization schedule while removing or replacing only the dynamic tiling (reverting to fixed-resolution encoding) or only the Multi-head Latent Attention (reverting to standard attention within the MoE layers). Without such controls, the causal contribution of the two architectural changes to the reported efficiency-performance trade-off cannot be isolated from possible gains due to the 'improved vision-language dataset' or unstated hyperparameter differences.
Authors: We appreciate the referee's point that the current experiments do not isolate the individual effects of dynamic tiling and Multi-head Latent Attention through controlled ablations with fixed data and training. The manuscript presents these two upgrades as the primary architectural changes enabling improved handling of high-resolution images and efficient inference, in combination with the enhanced vision-language dataset. While internal development confirmed their importance, we did not run the specific ablations described. We will revise the abstract and Section 1 to describe the performance as resulting from the combination of the architectural upgrades and the improved dataset, avoiding language that implies isolated causality. We will also add a short discussion paragraph on the design motivations for each upgrade, drawing on their individual properties and comparisons to prior approaches. revision: partial
-
Referee: [Experimental results (tables comparing against other models)] The benchmark tables report point estimates for the three variants but do not include standard deviations across multiple runs, confidence intervals, or statistical tests. This makes it difficult to determine whether the claimed 'competitive or state-of-the-art' margins are robust, especially for the smaller 1.0B and 2.8B variants where variance is typically higher.
Authors: We agree that including variability measures would allow readers to better assess the robustness of the reported results. Training each model variant requires substantial compute, making multiple independent runs impractical in our setting. We will revise the experimental section to explicitly note that all results are from single training runs and add a limitation statement in the discussion or conclusion. We will also qualify the 'competitive or state-of-the-art' claims in the text where the margins are modest, consistent with reporting practices in other large-scale multimodal model papers. revision: partial
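One inexpensive way to act on this point without retraining, noted here as an editorial suggestion rather than anything the authors propose, is to report a bootstrap confidence interval over evaluation items from the single run. This captures evaluation-set noise only, not run-to-run training variance, and the function and data names below are hypothetical.

# Bootstrap a confidence interval for a benchmark score from one trained model.
# Measures uncertainty from the finite evaluation set, not training-run variance.
import numpy as np

def bootstrap_ci(item_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean score with a percentile bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(item_scores, dtype=float)
    means = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(scores, size=scores.size, replace=True)
        means[b] = resample.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Example with simulated per-question correctness on a 1,000-item benchmark.
fake_scores = np.random.default_rng(1).integers(0, 2, size=1000)
mean, lo, hi = bootstrap_ci(fake_scores)
print(f"accuracy {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")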
Circularity Check
No circularity: empirical results on external benchmarks
full rationale
The paper presents an empirical vision-language model with two described upgrades (dynamic tiling vision encoding and DeepSeekMoE with Multi-head Latent Attention) plus training on an improved dataset, followed by standard benchmark evaluations. No derivation chain, first-principles prediction, or fitted parameter is claimed; performance numbers are reported outcomes of training and testing rather than quantities defined in terms of themselves. Self-citations to prior DeepSeek MoE work exist but are not load-bearing for any tautological reduction, as the central claims rest on external benchmark scores rather than internal redefinitions or unverified self-references.
Axiom & Free-Parameter Ledger
free parameters (1)
- activated parameter counts
axioms (1)
- domain assumption: Mixture-of-Experts routing improves inference efficiency without harming quality when trained properly
Forward citations
Cited by 42 Pith papers
-
CHASM: Unveiling Covert Advertisements on Chinese Social Media
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...
-
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
-
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
Can Multimodal Large Language Models Truly Understand Small Objects?
Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
-
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
-
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
HyperGVL is the first benchmark for LVLMs on hypergraph tasks from basic counting to NP-hard reasoning, with 12 models tested and a router proposed to adapt representations.
-
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
-
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
-
ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction
ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...
-
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...
-
MLLM-as-a-Judge Exhibits Model Preference Bias
MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
-
EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds"
EchoAgent is a new agentic AI system that integrates visual observation, quantitative measurement, and expert knowledge reasoning to achieve reliable echocardiography interpretation with up to 80% accuracy on CAMUS an...
-
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.
-
UniMesh: Unifying 3D Mesh Understanding and Generation
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
-
Bias-constrained multimodal intelligence for equitable and reliable clinical AI
BiasCareVL is a bias-aware vision-language framework trained on 3.44 million medical samples that outperforms prior methods on clinical tasks like diagnosis and segmentation while aiming for equitable performance unde...
-
AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis
AstroVLM deploys expert multi-agent collaboration with VLMs to outperform baselines on real-world astronomical imaging quality diagnosis.
-
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
-
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
-
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
-
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Geometric Reward Credit Assignment disentangles rewards to geometric tokens and adds reprojection consistency to boost 3D keypoint accuracy from 0.64 to 0.93 and bounding box IoU to 0.686 on a ShapeNetCore benchmark w...
-
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
agentsea. Wave-ui 25k. https://huggingface.co/datasets/agentsea/wave-ui-25k, 2024
work page 2024
-
[3]
P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024
work page internal anchor Pith review arXiv 2024
-
[4]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019
work page Pith review arXiv 1905
-
[5]
Anthropic. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet, 2024
work page 2024
- [6]
-
[7]
L. Blecher. Latex-ocr — a tool to convert images of latex equations into latex code. https://github.com/lukas-blecher/LaTeX-OCR, 2023. Accessed: 2023-10-17
work page 2023
-
[8]
O. B. Bohan and H. Face. Megalith 10m dataset. https://huggingface.co/datasets/madebyollin/megalith-10m, 2024
work page 2024
-
[9]
M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y. Chai, D. Park, and Y. J. Lee. Vip-llava: Making large multimodal models understand arbitrary visual prompts. In CVPR, pages 12914–12923. IEEE, 2024
work page 2024
- [10]
-
[11]
K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023
work page internal anchor Pith review arXiv 2023
-
[12]
L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving large multi-modal models with better captions. ECCV, 2023
work page 2023
-
[13]
L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024
work page internal anchor Pith review arXiv 2024
-
[14]
W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations
-
[15]
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023
work page internal anchor Pith review arXiv 2023
-
[16]
Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, et al. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy, 2024
work page 2024
-
[17]
A. Cherian, K.-C. Peng, S. Lohit, K. Smith, and J. B. Tenenbaum. Are deep neural networks smarter than second graders? arXiv preprint arXiv:2212.09993, 2022
-
[18]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm,
-
[20]
URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
work page 2023
-
[21]
D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024
work page internal anchor Pith review arXiv 2024
-
[22]
W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker, T. Rintamaki, M. Shoeybi, B. Catanzaro, and W. Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint, 2024
work page 2024
-
[23]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024
work page internal anchor Pith review arXiv 2024
-
[24]
X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[25]
M. Diem, S. Fiel, F. Kleber, R. Sablatnig, J. M. Saavedra, D. Contreras, J. M. Barrios, and L. S. Oliveira. Icfhr 2014 competition on handwritten digit string recognition in challenging datasets (hdsrc 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 779–784. IEEE, 2014
work page 2014
-
[26]
B. Egan, A. Redden, XWAVE, and SilentAntagonist. Dalle3 1 Million+ High Quality Captions, May 2024. URL https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
work page 2024
-
[27]
C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024. URL https://arxiv.org/abs/2306.13394
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [28]
-
[29]
J. Gu, X. Meng, G. Lu, L. Hou, N. Minzhe, X. Liang, L. Yao, R. Huang, W. Zhang, X. Jiang, C. Xu, and H. Xu. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. In NeurIPS, 2022
work page 2022
- [30]
-
[31]
HAI-LLM: Efficient and lightweight training tool for large models, 2023
High-flyer. HAI-LLM: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[32]
D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019
work page 2019
- [33]
-
[34]
S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014
work page 2014
-
[35]
A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–
work page 2016
-
[37]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023
work page 2023
-
[38]
Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023
work page 2023
-
[39]
M. Koupaee and W. Y. Wang. Wikihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018
-
[40]
A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020
work page 2020
-
[41]
LAION. Laion-aesthetics, 2023. URL https://laion.ai/blog/laion-aesthetics. Accessed: 2023-10-27
work page 2023
-
[42]
H. Laurençon, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. M. Rush, D. Kiela, M. Cord, and V. Sanh. OBELICS: an open web-scale filtered dataset of interleaved image-text documents. In NeurIPS, 2023
work page 2023
-
[43]
H. Laurençon, A. Marafioti, V. Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions, 2024
work page 2024
-
[44]
H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models?, 2024
work page 2024
-
[45]
H. Laurençon, L. Tronchon, and V. Sanh. Unlocking the conversion of web screenshots into html code with the websight dataset, 2024
work page 2024
-
[46]
B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [47]
-
[48]
F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In ACL, 2024
work page 2024
- [50]
- [51]
- [52]
-
[53]
F. Lin, J. Yuan, S. Wu, F. Wang, and Z. Wang. Uninext: Exploring a unified architecture for vision recognition. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3200–3208, 2023
work page 2023
-
[54]
A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023
work page 2023
-
[56]
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/
work page 2024
-
[57]
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2025
work page 2025
- [58]
-
[59]
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025
work page 2025
-
[60]
H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024
work page internal anchor Pith review arXiv 2024
-
[61]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations
- [62]
-
[63]
C. Ma, Y. Jiang, J. Wu, Z. Yuan, and X. Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer, 2025
work page 2025
- [64]
-
[65]
J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016
work page 2016
-
[66]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022
- [67]
- [68]
- [69]
-
[70]
OpenAI. Gpt-4v(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023
work page 2023
-
[71]
B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023
work page internal anchor Pith review arXiv 2023
-
[72]
Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023
work page internal anchor Pith review arXiv 2023
-
[73]
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015
work page 2015
-
[74]
Large-scale classification of fine-art paintings: Learning the right metric on the right feature
B. Saleh and A. Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015
-
[75]
S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar. Kvqa: Knowledge-aware visual question answering. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8876–8884, 2019
work page 2019
-
[76]
S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019
work page 2019
- [77]
- [78]
- [79]
-
[80]
K. Srinivasan, K. Raman, J. Chen, M. Bendersky, and M. Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In SIGIR, pages 2443–2449, 2021
work page 2021
-
[81]
K. Sun, J. Pan, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, J. Dai, Y. Qiao, L. Wang, and H. Li. Journeydb: A benchmark for generative image understanding. In NeurIPS, 2023
work page 2023
discussion (0)