LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Pith reviewed 2026-05-12 10:48 UTC · model grok-4.3
The pith
LLaVA-OneVision-1.5 builds competitive multimodal models from scratch using an open end-to-end framework on 85M curated pretraining examples and 22M instructions for under $16,000.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks through an open, efficient, end-to-end training framework. The framework combines large-scale curated datasets (85M concept-balanced pretraining examples and 22M instruction examples), offline parallel data packing that keeps total cost within a $16,000 budget, and RL-based post-training that unlocks robust chain-of-thought reasoning. Under this recipe, the 8B model outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks and the 4B model surpasses Qwen2.5-VL-3B on all 27.
What carries the argument
The complete open end-to-end training framework that integrates concept-balanced pretraining data, instruction data, offline parallel data packing for cost efficiency, and a lightweight RL post-training stage to improve multimodal reasoning.
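To make the cost lever concrete, here is a minimal sketch of offline data packing under plausible assumptions: tokenized multimodal samples are grouped, once, into fixed-length sequences so that training batches carry little padding. The greedy first-fit-decreasing heuristic, the 8192-token budget, and the sample format are illustrative choices, not the paper's released implementation.

```python
from typing import Dict, List

def pack_samples(samples: List[Dict], max_len: int = 8192) -> List[List[Dict]]:
    """Greedy first-fit-decreasing packing of tokenized samples into bins.

    Each sample is assumed to carry a precomputed token count under 'n_tokens'
    (text tokens plus image tokens). Packing runs offline, before training, so
    the training loop only streams fixed-length packed sequences.
    """
    bins: List[List[Dict]] = []
    free: List[int] = []  # remaining capacity of each bin

    # Longest-first ordering reduces wasted padding at the tail of each bin.
    for sample in sorted(samples, key=lambda s: s["n_tokens"], reverse=True):
        n = sample["n_tokens"]
        for i, capacity in enumerate(free):
            if n <= capacity:
                bins[i].append(sample)
                free[i] -= n
                break
        else:  # no existing bin fits; open a new one
            bins.append([sample])
            free.append(max_len - n)
    return bins

# Three short samples share one packed sequence instead of three padded batches.
packed = pack_samples(
    [{"id": 0, "n_tokens": 6000}, {"id": 1, "n_tokens": 1500}, {"id": 2, "n_tokens": 500}]
)
print([[s["id"] for s in group] for group in packed])  # [[0, 1, 2]]
```

In practice the packed sequences also need per-sample attention masks or position-id resets so that samples sharing a sequence cannot attend to each other; that bookkeeping is omitted here.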
If this is right
- High-quality curated datasets can deliver strong multimodal performance even when total training spend is limited to $16,000.
- A lightweight RL post-training stage can elicit better chain-of-thought reasoning on complex multimodal tasks without large additional compute.
- Compact models at the 4B scale can match or exceed the benchmark results of strong open baselines such as Qwen2.5-VL when trained with this framework.
- Fully open data and code release lowers the barrier for reproducible multimodal research.
Where Pith is reading between the lines
- If other groups replicate the data curation steps, similar performance levels may become accessible to teams with modest budgets.
- The results point to data quality and balancing as potentially more decisive than raw data volume in multimodal pretraining.
- The framework could be tested on additional vision-language tasks or extended to new modalities to check whether the efficiency gains generalize.
Load-bearing premise
The 85M concept-balanced pretraining dataset and 22M instruction dataset are of sufficiently higher quality than prior data sources to produce the reported gains, and the benchmark comparisons are free of selection effects or evaluation differences.
What would settle it
An independent reproduction that trains the same model sizes on the released datasets and framework but fails to match the claimed outperformance margins over Qwen2.5-VL-7B and Qwen2.5-VL-3B on the 27 benchmarks.
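As a concrete picture of what such a reproduction would check, the sketch below evaluates two models under one shared decoding configuration and reports per-benchmark margins and the win count. The benchmark names, scores, and configuration fields are invented placeholders, not the paper's numbers.

```python
# Shared decoding settings so score deltas reflect the models, not the harness.
shared_decoding = {"temperature": 0.0, "top_p": 1.0, "do_sample": False}

# Hypothetical per-benchmark accuracies from a reproduction run (placeholders).
scores_repro    = {"bench_A": 71.2, "bench_B": 64.5, "bench_C": 48.9}
scores_baseline = {"bench_A": 70.4, "bench_B": 65.1, "bench_C": 47.0}

wins, margins = 0, {}
for bench, score in scores_repro.items():
    delta = round(score - scores_baseline[bench], 2)
    margins[bench] = delta
    wins += delta > 0

print(f"decoding: {shared_decoding}")
print(f"wins: {wins}/{len(scores_repro)}, margins: {margins}")
# A full reproduction would run this over all 27 benchmarks and compare the
# resulting win count against the reported 18/27 (8B) and 27/27 (4B).
```

If independent runs under a documented harness reproduce both the win counts and the margins, the evaluation-difference objection largely dissolves; if not, the claims narrow to the specific settings the authors used.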
Original abstract
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLaVA-OneVision-1.5, a family of open large multimodal models (LMMs) trained entirely from scratch. It describes construction of an 85M concept-balanced pretraining dataset (LLaVA-OneVision-1.5-Mid-Traning) and a 22M instruction dataset (LLaVA-OneVision-1.5-Instruct), an efficient end-to-end training framework using offline parallel data packing that completes within a $16,000 budget, state-of-the-art results where the 8B variant outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks and the 4B variant surpasses Qwen2.5-VL-3B on all 27, and a lightweight RL post-training stage to elicit chain-of-thought reasoning on complex multimodal tasks.
Significance. If the performance claims hold under controlled evaluation, the work would provide a fully open, low-cost, and reproducible pipeline for training competitive vision-language models. This could meaningfully advance democratization of multimodal research by releasing curated datasets, training code, and an RL stage that improves reasoning, while demonstrating that high performance is achievable without massive compute.
major comments (3)
- [Abstract and Experimental Results] The central performance claims (8B model beats Qwen2.5-VL-7B on 18/27 benchmarks; 4B beats Qwen2.5-VL-3B on all 27) are presented without reported error bars, details on benchmark subset selection, prompt templates, decoding parameters, or confirmation that comparisons were run under identical conditions. This leaves open the possibility that observed deltas arise from evaluation differences rather than the claimed framework or data.
- [Dataset Curation and Training Framework] Performance gains are attributed to the 85M concept-balanced pretraining set and 22M instruction set, yet no ablation studies are described that hold architecture, training recipe, and compute fixed while swapping in prior open datasets (e.g., LLaVA-1.5 or ShareGPT4V mixtures). Without such controlled comparisons, the claim that these specific curated corpora are materially higher-quality and responsible for the results cannot be verified.
- [RL-based Post-training] The manuscript states that a lightweight RL stage significantly boosts performance on complex reasoning tasks, but provides no quantitative before/after results on the 27 benchmarks, no details on the reward model or RL algorithm, and no comparison to standard supervised fine-tuning baselines. This makes it impossible to assess the incremental contribution of the RL component.
minor comments (2)
- [Abstract] 'LLaVA-OneVision-1.5-Mid-Traning' appears to be a typographical error for 'Mid-Training'.
- [Experimental Results] The manuscript would benefit from an explicit table listing all 27 benchmarks, the exact scores for LLaVA-OneVision-1.5 variants and the Qwen2.5-VL baselines, and any data-exclusion rules applied during evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications based on our open framework and outlining revisions to improve the manuscript's rigor and reproducibility.
Point-by-point responses
-
Referee: [Abstract and Experimental Results] The central performance claims (8B model beats Qwen2.5-VL-7B on 18/27 benchmarks; 4B beats Qwen2.5-VL-3B on all 27) are presented without reported error bars, details on benchmark subset selection, prompt templates, decoding parameters, or confirmation that comparisons were run under identical conditions. This leaves open the possibility that observed deltas arise from evaluation differences rather than the claimed framework or data.
Authors: We thank the referee for highlighting this. All evaluations were conducted under identical conditions using our publicly released evaluation code and the same harness for baselines. We will revise the Experimental Results section and add a dedicated appendix detailing benchmark subsets, exact prompt templates, decoding parameters (e.g., temperature=0, top_p=1.0, greedy decoding), and confirmation of controlled settings. Error bars are not reported because single-run results are standard for large-scale training; we will note this limitation and include inference-seed variance for representative benchmarks in the revision. revision: yes
-
Referee: [Dataset Curation and Training Framework] Performance gains are attributed to the 85M concept-balanced pretraining set and 22M instruction set, yet no ablation studies are described that hold architecture, training recipe, and compute fixed while swapping in prior open datasets (e.g., LLaVA-1.5 or ShareGPT4V mixtures). Without such controlled comparisons, the claim that these specific curated corpora are materially higher-quality and responsible for the results cannot be verified.
Authors: We agree that explicit ablations would strengthen attribution. Full-scale ablations holding architecture, recipe, and compute fixed are not feasible within our $16,000 budget and timeline. We will revise the Dataset Curation section to elaborate on the concept-balancing procedure and provide qualitative/quantitative comparisons to prior mixtures. The datasets are fully open-sourced, enabling the community to run such controlled ablations independently. A limited small-scale ablation on data subsets will be included if space allows. revision: partial
-
Referee: [RL-based Post-training] The manuscript states that a lightweight RL stage significantly boosts performance on complex reasoning tasks, but provides no quantitative before/after results on the 27 benchmarks, no details on the reward model or RL algorithm, and no comparison to standard supervised fine-tuning baselines. This makes it impossible to assess the incremental contribution of the RL component.
Authors: We acknowledge the need for more quantitative evidence. In the revision, we will add a table with before/after performance on all 27 benchmarks, details on the reward model (trained via preference data), the RL algorithm employed, and direct comparisons against an SFT-only baseline. This will quantify the incremental gains from the lightweight RL stage while keeping the overall compute low. revision: yes
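The summary does not name the RL algorithm or reward model, so as an illustration of what a "lightweight RL stage" can look like, here is a minimal sketch of group-relative advantage estimation in the spirit of the cited DeepSeekMath/GRPO line of work, paired with a toy verifiable reward. The reward rule, function names, and sample responses are assumptions for illustration, not the authors' method.

```python
from statistics import mean, pstdev
from typing import List

def exact_match_reward(response: str, reference: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer token matches, else 0.0."""
    return float(response.strip().split()[-1] == reference.strip())

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a group of responses sampled for one prompt.

    A GRPO-style group baseline: the advantage is the reward minus the group
    mean, scaled by the group standard deviation, with no learned value network.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled chain-of-thought completions (placeholders), one reference answer.
responses = ["... therefore 42", "... so the answer is 41", "... giving 42", "... hence 40"]
rewards = [exact_match_reward(r, "42") for r in responses]
print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # correct samples get positive advantage
```

In full training these advantages would weight a clipped policy-gradient update on the sampled tokens; the sketch stops at the advantage computation because the actual reward model and algorithm are exactly what the referee asks the revision to specify.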
Circularity Check
No circularity: purely empirical claims with no derivations or equations present.
full rationale
The manuscript describes dataset construction (85M pretraining + 22M instruction), an efficient training framework, RL post-training, and benchmark results without any equations, mathematical derivations, or claimed first-principles reductions. Performance statements (e.g., outperforming Qwen2.5-VL variants on 18/27 or 27/27 benchmarks) are direct empirical comparisons, not outputs derived from fitted parameters or self-referential definitions. No load-bearing self-citations reduce to unverified prior claims within a derivation chain, as no such chain exists. The central claims rest on data curation and training details rather than any circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Curated large-scale datasets of the stated sizes and balance produce higher-quality multimodal models than prior alternatives.
- domain assumption: Standard supervised and RL training procedures on these data yield the reported benchmark gains.
Forward citations
Cited by 39 Pith papers
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
-
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment
GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
DailyClue is a new benchmark that requires MLLMs to actively seek visual clues in authentic daily scenarios across four domains and 16 subtasks before performing reasoning.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
-
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
-
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
-
Token Warping Helps MLLMs Look from Nearby Viewpoints
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
-
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
-
SCP: Spatial Causal Prediction in Video
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.
-
Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration
A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accura...
-
LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs
LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.
-
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection
The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.
-
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Activation steering reveals localized encoding for entities versus distributed encoding for abstract concepts in MLLMs, identifying depth as key for the latter and a perception-reasoning disconnect.
-
SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
-
Boosting Visual Instruction Tuning with Self-Supervised Guidance
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
-
Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
-
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
-
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
PersonaVLM: Long-Term Personalized Multimodal LLMs
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
-
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
-
Steering the Verifiability of Multimodal AI Hallucinations
Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.
-
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025.
-
[2]
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, and S.-H. Gary Chan. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv:2406.16866, 2024a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the...
-
[3]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023.
-
[4]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv:2505.07062, 2025.
-
[5]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024b. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Yuliang Liu, Zhang ...
-
[6]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
-
[7]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053, 2019.
-
[8]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. ...
-
[9]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. In NeurIPS, 2025a. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with Math-Vision dataset. In NeurIPS, 2024a. P...
-
[10]
Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence
Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, Jianheng Zhuo, Qi Yang, Yongfeng Huang, Xiayu Li, Yanghan Wu, Junyu Lu, Xinyu Zhu, Weifeng Chen, Ting Han, Kunhao Pan, Rui Wang, Hao Wang, Xiaojun Wu, Zhongshen Zeng, and Chongpei Chen. Fengshenbang 1.0: Being the foundation of chinese...
-
[11]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Reality check on the evaluation of large multimodal models. In ACL, 2025a. Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang ...
-
[12]
Appendix A: LLaVA-OV-1.5 vs. Qwen2.5-VL with Same LLM. To enable a fair comparison with Qwen2.5-VL, we train LLaVA-OneVision-1.5-3B based on Qwen2.5-3B-Instruct. As shown in Fig. 9, LLaVA-OneVision-1.5-3B also demonstrates superior performance, achieving better results on 17 out of 27 downstream benchmarks. Figure 9: Comparison between LLaVA-OV-1.5-3B and Qwen2.5-VL-3B model ...