{"work":{"id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","openalex_id":null,"doi":null,"arxiv_id":"2505.07062","raw_key":null,"title":"Seed1.5-VL Technical Report","authors":null,"authors_text":"Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen","year":2025,"venue":"cs.CV","abstract":"We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)","external_url":"https://arxiv.org/abs/2505.07062","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:05:26.667025+00:00","pith_arxiv_id":"2505.07062","created_at":"2026-05-09T06:15:38.839991+00:00","updated_at":"2026-05-25T06:05:26.667025+00:00","title_quality_ok":false,"display_title":"Seed1.5-VL Technical Report","render_title":"Seed1.5-VL Technical Report"},"hub":{"state":{"work_id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":74,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-05-21T14:40:03+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T22:37:59.093857+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":15},{"context_role":"baseline","n":12},{"context_role":"method","n":6},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":14},{"context_polarity":"baseline","n":12},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:31:30.718510+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":21},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":20},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":16},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":13},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"Kimi-VL Technical Report","work_id":"c876520f-8a20-44f3-b92a-bf7d35bd430f","shared_citers":11},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":10},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":9},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":9},{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","work_id":"0bbcf263-a46d-4525-a438-11fce3316568","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":8},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":7},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":7},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":7},{"title":"arXiv preprint arXiv:2410.05243 , year=","work_id":"9def1724-6fd2-4d5b-8339-4c1ee76e62f8","shared_citers":6},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":6},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":6},{"title":"java21\" shown on the file path of the file manager. Text 1 between text Click once at the position before","work_id":"5345a78f-68f1-4e8c-b240-30b0bb230de3","shared_citers":6},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":5},{"title":"Kimi K2: Open Agentic Intelligence","work_id":"7f18284c-12d3-4137-bea1-1da97e8cf3c1","shared_citers":5},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":5}],"time_series":[{"n":4,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:31:28.881464+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:31:35.385052+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Seed1.5-VL Technical Report","claims":[{"claim_text":"We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal sy","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Seed1.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:31:39.656141+00:00"}},"summary":{"title":"Seed1.5-VL Technical Report","claims":[{"claim_text":"We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal sy","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Seed1.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":21},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":20},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":16},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":13},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"Kimi-VL Technical Report","work_id":"c876520f-8a20-44f3-b92a-bf7d35bd430f","shared_citers":11},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":10},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":9},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":9},{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","work_id":"0bbcf263-a46d-4525-a438-11fce3316568","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":8},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":7},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":7},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":7},{"title":"arXiv preprint arXiv:2410.05243 , year=","work_id":"9def1724-6fd2-4d5b-8339-4c1ee76e62f8","shared_citers":6},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":6},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":6},{"title":"java21\" shown on the file path of the file manager. Text 1 between text Click once at the position before","work_id":"5345a78f-68f1-4e8c-b240-30b0bb230de3","shared_citers":6},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":5},{"title":"Kimi K2: Open Agentic Intelligence","work_id":"7f18284c-12d3-4137-bea1-1da97e8cf3c1","shared_citers":5},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":5}],"time_series":[{"n":4,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"authors":[]}}