{"work":{"id":"69dffacb-bfe8-442d-be86-48624c60426f","openalex_id":null,"doi":null,"arxiv_id":"2502.13923","raw_key":null,"title":"Qwen2.5-VL Technical Report","authors":null,"authors_text":"Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al","year":2025,"venue":"cs.CV","abstract":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.","external_url":"https://arxiv.org/abs/2502.13923","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:35:32.689639+00:00","pith_arxiv_id":"2502.13923","created_at":"2026-05-09T02:27:20.922186+00:00","updated_at":"2026-05-25T08:35:32.689639+00:00","title_quality_ok":false,"display_title":"Qwen2.5-VL Technical Report","render_title":"Qwen2.5-VL Technical Report"},"hub":{"state":{"work_id":"69dffacb-bfe8-442d-be86-48624c60426f","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":895,"external_cited_by_count":null,"distinct_field_count":19,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-05-22T17:58:36+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T05:01:58.395625+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":150},{"context_role":"method","n":57},{"context_role":"baseline","n":56},{"context_role":"dataset","n":5},{"context_role":"other","n":3}],"polarity_counts":[{"context_polarity":"background","n":144},{"context_polarity":"use_method","n":59},{"context_polarity":"baseline","n":55},{"context_polarity":"unclear","n":6},{"context_polarity":"use_dataset","n":5},{"context_polarity":"support","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:28.003986+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai"},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen"},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu"},{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang"},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge"},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song"}]},"error":null,"updated_at":"2026-05-13T18:13:28.001713+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T18:13:27.728863+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"error":null,"updated_at":"2026-05-13T17:25:55.668312+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T18:13:27.104671+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:13:27.732366+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T17:25:52.717336+00:00"}},"summary":{"title":"Qwen2.5-VL Technical Report","claims":[{"claim_text":"We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as wel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5-VL Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":112},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":100},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":82},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":80},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":79},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":78},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":70},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":64},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":61},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":61},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":57},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":42},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":32},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":32},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":28},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":28},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":27},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":27},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":26},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":26}],"time_series":[{"n":25,"year":2025},{"n":387,"year":2026}]},"authors":[{"id":"3d40df98-a97a-46a5-903b-b388be7785ab","orcid":null,"display_name":"Jialin Wang","source":"manual","import_confidence":0.72},{"id":"dd80061e-c973-48f3-af57-2764aa5433c7","orcid":null,"display_name":"Keqin Chen","source":"manual","import_confidence":0.72},{"id":"2bc8cfa9-c2ce-48ef-87d4-e0de52ae9a92","orcid":null,"display_name":"Shuai Bai","source":"manual","import_confidence":0.72},{"id":"5788a139-4b79-4854-8880-11a56d0c51b4","orcid":null,"display_name":"Sibo Song","source":"manual","import_confidence":0.72},{"id":"1f4a16b8-5401-4f0d-9de5-c7d725338b8e","orcid":null,"display_name":"Wenbin Ge","source":"manual","import_confidence":0.72},{"id":"540ccffd-e57b-4849-b7e9-9cd5c0c17482","orcid":null,"display_name":"Xuejing Liu","source":"manual","import_confidence":0.72}]}}