{"work":{"id":"41c2802e-aff9-482f-b506-10955ff0838d","openalex_id":null,"doi":null,"arxiv_id":"2509.23661","raw_key":null,"title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training","authors":null,"authors_text":"Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng","year":2025,"venue":"cs.CV","abstract":"We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.","external_url":"https://arxiv.org/abs/2509.23661","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:25:23.731972+00:00","pith_arxiv_id":"2509.23661","created_at":"2026-05-10T11:30:18.484621+00:00","updated_at":"2026-05-25T06:25:23.731972+00:00","title_quality_ok":true,"display_title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training","render_title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training"},"hub":{"state":{"work_id":"41c2802e-aff9-482f-b506-10955ff0838d","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":45,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2025-05-21T12:18:15+00:00","last_pith_cited_at":"2026-05-21T18:00:22+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T11:07:32.791218+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":11},{"context_role":"baseline","n":2},{"context_role":"dataset","n":1},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":10},{"context_polarity":"baseline","n":2},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}