{"work":{"id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","openalex_id":null,"doi":null,"arxiv_id":"2508.18265","raw_key":null,"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","authors":null,"authors_text":"Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei","year":2025,"venue":"cs.CV","abstract":"We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\\% gain in overall reasoning performance and a 4.05$\\times$ inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.","external_url":"https://arxiv.org/abs/2508.18265","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:20:24.176236+00:00","pith_arxiv_id":"2508.18265","created_at":"2026-05-09T06:00:35.781737+00:00","updated_at":"2026-05-25T06:20:24.176236+00:00","title_quality_ok":true,"display_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","render_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency"},"hub":{"state":{"work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":277,"external_cited_by_count":null,"distinct_field_count":11,"first_pith_cited_at":"2025-05-22T17:59:03+00:00","last_pith_cited_at":"2026-05-22T17:58:36+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T21:07:55.025762+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":50},{"context_role":"baseline","n":41},{"context_role":"method","n":12},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":50},{"context_polarity":"baseline","n":41},{"context_polarity":"use_method","n":11},{"context_polarity":"unclear","n":2},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","claims":[{"claim_text":"We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynami","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:33.641926+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"58cc9d30-7cb9-4659-b9bb-e69cbcad0c55","orcid":null,"display_name":"Weiyun Wang"},{"id":"a8152194-e272-4c2e-af7f-6f194f98f21a","orcid":null,"display_name":"Zhangwei Gao"},{"id":"3428f7fc-73e9-4007-86dd-5a712c2f47a8","orcid":null,"display_name":"Lixin Gu"},{"id":"b71dd099-1687-41b9-860b-319a80048fab","orcid":null,"display_name":"Hengjun Pu"},{"id":"ac7f378f-83fa-43db-8f8d-bebafeca8cdf","orcid":null,"display_name":"Long Cui"},{"id":"6f2d4882-f82e-47bf-9c26-1450a2bc0fed","orcid":null,"display_name":"Xingguang Wei"}]},"error":null,"updated_at":"2026-05-13T21:23:34.248819+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T21:13:32.146218+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":90},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":64},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":39},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":37},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":31},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":29},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":26},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":25},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":21},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":20},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":18},{"title":"Kimi-VL Technical Report","work_id":"c876520f-8a20-44f3-b92a-bf7d35bd430f","shared_citers":17},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":16},{"title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","work_id":"366607ba-e4ea-4726-98c3-63356e32351c","shared_citers":15},{"title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training","work_id":"41c2802e-aff9-482f-b506-10955ff0838d","shared_citers":12},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":12},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":12},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":12},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":12},{"title":"Seed1.5-VL Technical Report","work_id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","shared_citers":12},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":11},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":11}],"time_series":[{"n":159,"year":2026}]},"error":null,"updated_at":"2026-05-13T21:23:34.375522+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T21:13:36.959260+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","claims":[{"claim_text":"We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynami","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:36.799540+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","claims":[{"claim_text":"We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynami","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:34.250738+00:00"}},"summary":{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","claims":[{"claim_text":"We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynami","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":90},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":64},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":39},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":37},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":31},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":29},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":26},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":25},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":21},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":20},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":18},{"title":"Kimi-VL Technical Report","work_id":"c876520f-8a20-44f3-b92a-bf7d35bd430f","shared_citers":17},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":16},{"title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","work_id":"366607ba-e4ea-4726-98c3-63356e32351c","shared_citers":15},{"title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training","work_id":"41c2802e-aff9-482f-b506-10955ff0838d","shared_citers":12},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":12},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":12},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":12},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":12},{"title":"Seed1.5-VL Technical Report","work_id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","shared_citers":12},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":11},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":11}],"time_series":[{"n":159,"year":2026}]},"authors":[{"id":"b71dd099-1687-41b9-860b-319a80048fab","orcid":null,"display_name":"Hengjun Pu","source":"manual","import_confidence":0.72},{"id":"3428f7fc-73e9-4007-86dd-5a712c2f47a8","orcid":null,"display_name":"Lixin Gu","source":"manual","import_confidence":0.72},{"id":"ac7f378f-83fa-43db-8f8d-bebafeca8cdf","orcid":null,"display_name":"Long Cui","source":"manual","import_confidence":0.72},{"id":"58cc9d30-7cb9-4659-b9bb-e69cbcad0c55","orcid":null,"display_name":"Weiyun Wang","source":"manual","import_confidence":0.72},{"id":"6f2d4882-f82e-47bf-9c26-1450a2bc0fed","orcid":null,"display_name":"Xingguang Wei","source":"manual","import_confidence":0.72},{"id":"a8152194-e272-4c2e-af7f-6f194f98f21a","orcid":null,"display_name":"Zhangwei Gao","source":"manual","import_confidence":0.72}]}}