{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:NGQGVISHEUGLHCFXJU4WYWMCX7","short_pith_number":"pith:NGQGVISH","schema_version":"1.0","canonical_sha256":"69a06aa247250cb388b74d396c5982bff1897a4b4e5fcd96726618264aa54fdd","source":{"kind":"arxiv","id":"2508.11737","version":1},"attestation_state":"computed","paper":{"title":"Ovis2.5 Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Chengkun Hou, Gui Hu, Guodong Zheng, Haijun Li, Hailong Sun, Hui Sun, Huping Ding, Jiahe Li, Jiamang Wang, Jianshan Zhao, Jinlong Huang, Junke Tang, Junpeng Jiang, Kaifu Zhang, Lunhao Duan, Qing-Guo Chen, Sensen Gao, Shanshan Zhao, Shengze Shi, Shiyin Lu, Sijia Chen, Siran Yang, Tianli Zhou, Wanying Chen, Weihong Zhang, Weihua Luo, Wenjie Zhang, Wen Li, Yang Li, Yanqing Ma, Yibo Wang, Yi-Feng Wu, Yiliang Gu, Yinglun Li, Yuhui Chen, Yuping He, Yuwei Hu, Yu Xia, Yuxuan Han, Zhao Xu, Zhichao Wei, Zhixing Du","submitted_at":"2025-08-15T17:01:08Z","abstract_excerpt":"We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an option"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2508.11737","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-08-15T17:01:08Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG"],"title_canon_sha256":"fc99a4c0a3021fc73f0bfa752295498c9fc389733691ba767bc28f3dddb295dc","abstract_canon_sha256":"a1d9b04e4e2d7624437f702c0157551661eccfa59b8bfbd4cd315543c7e0a673"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.261104Z","signature_b64":"BH44xFBRr8BJV1yY7M5Y5yUMAaJBeWip+JwTVEIUfN+AMmgD6D/xCtqMqvwqmOK6qxIExlWZrsQfHe+Unr9IBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"69a06aa247250cb388b74d396c5982bff1897a4b4e5fcd96726618264aa54fdd","last_reissued_at":"2026-05-17T23:38:50.260604Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.260604Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Ovis2.5 Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Chengkun Hou, Gui Hu, Guodong Zheng, Haijun Li, Hailong Sun, Hui Sun, Huping Ding, Jiahe Li, Jiamang Wang, Jianshan Zhao, Jinlong Huang, Junke Tang, Junpeng Jiang, Kaifu Zhang, Lunhao Duan, Qing-Guo Chen, Sensen Gao, Shanshan Zhao, Shengze Shi, Shiyin Lu, Sijia Chen, Siran Yang, Tianli Zhou, Wanying Chen, Weihong Zhang, Weihua Luo, Wenjie Zhang, Wen Li, Yang Li, Yanqing Ma, Yibo Wang, Yi-Feng Wu, Yiliang Gu, Yinglun Li, Yuhui Chen, Yuping He, Yuwei Hu, Yu Xia, Yuxuan Han, Zhao Xu, Zhichao Wei, Zhixing Du","submitted_at":"2025-08-15T17:01:08Z","abstract_excerpt":"We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an option"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Ovis2.5-9B averages 78.3 on the OpenCompass multimodal leaderboard, marking a substantial improvement over Ovis2-8B and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9 and establishes SOTA for its size.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the benchmark gains are primarily attributable to the native-resolution vision transformer and reflection mechanism rather than differences in training data volume, quality, or undisclosed hyperparameter tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Ovis2.5 introduces native-resolution visual processing and reflective chain-of-thought to reach SOTA open-source multimodal performance at 9B and 2B scales on benchmarks including STEM and chart analysis.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"029ca85268d2631dd753927534197d93120c6c21e5a65dbd302549e645fd12b6"},"source":{"id":"2508.11737","kind":"arxiv","version":1},"verdict":{"id":"1e22e58d-8d9e-4e92-adf4-b846d1e2e62f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:26:58.365351Z","strongest_claim":"Ovis2.5-9B averages 78.3 on the OpenCompass multimodal leaderboard, marking a substantial improvement over Ovis2-8B and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9 and establishes SOTA for its size.","one_line_summary":"Ovis2.5 introduces native-resolution visual processing and reflective chain-of-thought to reach SOTA open-source multimodal performance at 9B and 2B scales on benchmarks including STEM and chart analysis.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the benchmark gains are primarily attributable to the native-resolution vision transformer and reflection mechanism rather than differences in training data volume, quality, or undisclosed hyperparameter tuning.","pith_extraction_headline":"Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":3,"snapshot_sha256":"45b54bf26d5059f82d62f62629fcdcc84b42ec1e97721d442abfd3fa458cc41b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2508.11737","created_at":"2026-05-17T23:38:50.260687+00:00"},{"alias_kind":"arxiv_version","alias_value":"2508.11737v1","created_at":"2026-05-17T23:38:50.260687+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2508.11737","created_at":"2026-05-17T23:38:50.260687+00:00"},{"alias_kind":"pith_short_12","alias_value":"NGQGVISHEUGL","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"NGQGVISHEUGLHCFX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"NGQGVISH","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2605.23898","citing_title":"SPACENUM: Revisiting Spatial Numerical Understanding in VLMs","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18600","citing_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04802","citing_title":"VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17823","citing_title":"Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19522","citing_title":"iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07064","citing_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08392","citing_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18600","citing_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18154","citing_title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16331","citing_title":"BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14465","citing_title":"From Table to Cell: Attention for Better Reasoning with TABALIGN","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13667","citing_title":"SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13277","citing_title":"Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03318","citing_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03765","citing_title":"ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11960","citing_title":"Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28076","citing_title":"TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10576","citing_title":"SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10286","citing_title":"AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks","ref_index":114,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25186","citing_title":"FCMBench-Video: Benchmarking Document Video Intelligence","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11627","citing_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07477","citing_title":"ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07457","citing_title":"EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06777","citing_title":"Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06912","citing_title":"Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models","ref_index":52,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7","json":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7.json","graph_json":"https://pith.science/api/pith-number/NGQGVISHEUGLHCFXJU4WYWMCX7/graph.json","events_json":"https://pith.science/api/pith-number/NGQGVISHEUGLHCFXJU4WYWMCX7/events.json","paper":"https://pith.science/paper/NGQGVISH"},"agent_actions":{"view_html":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7","download_json":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7.json","view_paper":"https://pith.science/paper/NGQGVISH","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2508.11737&json=true","fetch_graph":"https://pith.science/api/pith-number/NGQGVISHEUGLHCFXJU4WYWMCX7/graph.json","fetch_events":"https://pith.science/api/pith-number/NGQGVISHEUGLHCFXJU4WYWMCX7/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7/action/timestamp_anchor","attest_storage":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7/action/storage_attestation","attest_author":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7/action/author_attestation","sign_citation":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7/action/citation_signature","submit_replication":"https://pith.science/pith/NGQGVISHEUGLHCFXJU4WYWMCX7/action/replication_record"}},"created_at":"2026-05-17T23:38:50.260687+00:00","updated_at":"2026-05-17T23:38:50.260687+00:00"}