Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Recognition: 2 theorem links · Lean theorems
Pith reviewed 2026-05-13 05:43 UTC · model grok-4.3
The pith
Chronicles-OCR introduces a benchmark of 2,800 images that tests VLLMs' visual perception of Chinese characters across their full evolutionary trajectory, the Seven Chinese Scripts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chronicles-OCR is a benchmark of 2,800 strictly balanced images spanning the Seven Chinese Scripts, annotated under a Stage-Adaptive Annotation Paradigm. It defines four tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, it aims to expose VLLM limitations in cross-temporal settings.
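To make the task structure concrete, here is a minimal sketch of how such a four-task evaluation harness could be laid out. The task names come from the paper; the prompt wording, the exact-match scorer, and the `evaluate` helper are illustrative assumptions, not the released protocol.

```python
# Sketch of a four-task evaluation harness for Chronicles-OCR.
# Task names come from the paper; the prompts and the exact-match
# scorer are illustrative assumptions, not the released protocol.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str                          # instruction shown to the VLLM
    score: Callable[[str, str], float]   # (prediction, gold) -> [0, 1]

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

TASKS = [
    Task("cross-period character spotting",
         "List every character instance visible in this image.", exact_match),
    Task("archaic character recognition (visual referring)",
         "Identify the character inside the marked region.", exact_match),
    Task("ancient text parsing",
         "Transcribe the full text in reading order.", exact_match),
    Task("script classification",
         "Which of the Seven Chinese Scripts is shown?", exact_match),
]

def evaluate(model: Callable[[str, bytes], str], samples: list) -> dict:
    """Average per-task score; `samples` holds (task_name, image, gold) triples."""
    by_name = {t.name: t for t in TASKS}
    totals: dict = {t.name: [] for t in TASKS}
    for task_name, image, gold in samples:
        task = by_name[task_name]
        totals[task_name].append(task.score(model(task.prompt, image), gold))
    return {k: sum(v) / len(v) for k, v in totals.items() if v}
```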
What carries the argument
The Stage-Adaptive Annotation Paradigm, which adjusts labeling rules to accommodate large morphological and topological shifts in character forms across historical stages while maintaining evaluation consistency.
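One way to picture the paradigm in code: labeling rules vary per historical stage, while every record shares a fixed output schema so results stay comparable across stages. This is a hedged sketch; the script names follow the Seven Scripts, but the rule fields and their assignments are hypothetical, not the paper's released schema.

```python
# Hedged sketch of stage-adaptive annotation: labeling rules differ per
# historical stage, but every record shares one output schema, so the
# four tasks remain comparable across stages. Field names and the rule
# assignments below are hypothetical, not the paper's released schema.
from dataclasses import dataclass

SEVEN_SCRIPTS = ["oracle bone", "bronze", "seal", "clerical",
                 "regular", "semi-cursive", "cursive"]

@dataclass
class StageRules:
    script: str
    allow_mirrored_forms: bool    # archaic graphs are often written flipped
    merge_variant_graphs: bool    # treat topological variants as one label

# One plausible assignment: looser identity rules for archaic stages.
RULES = {
    s: StageRules(
        script=s,
        allow_mirrored_forms=s in ("oracle bone", "bronze"),
        merge_variant_graphs=s in ("oracle bone", "bronze", "seal"),
    )
    for s in SEVEN_SCRIPTS
}

@dataclass
class Annotation:
    image_id: str
    script: str                           # one of SEVEN_SCRIPTS
    char_label: str                       # modern descendant, shared label space
    bbox: tuple = (0.0, 0.0, 1.0, 1.0)    # normalized region for visual referring
```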
If this is right
- VLLMs can be evaluated for robustness to script evolution without semantic shortcuts.
- Failure modes in historical text perception become identifiable at specific evolutionary stages.
- Digital humanities projects gain a standardized metric for AI support on ancient Chinese materials.
- Model development can target evolution-aware perception rather than static modern forms.
Where Pith is reading between the lines
- The same isolation of visual form from meaning could be applied to other long-evolving scripts to compare model robustness across writing systems.
- Models trained or fine-tuned on this benchmark data might generalize better to degraded or variant modern text.
- The four tasks could serve as a template for automated analysis pipelines in museum digitization of inscribed artifacts.
Load-bearing premise
The 2,800 images are strictly balanced and representative of the complete evolutionary trajectory, and the annotation paradigm handles all variations without introducing selection bias.
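If strict balance holds, the 2,800 images should split into 400 per script. A chi-square goodness-of-fit test is a quick way to audit that premise once per-script counts are released; the counts below are placeholders, not the paper's numbers.

```python
# Audit sketch: test whether per-script counts match a strictly balanced
# design (2,800 / 7 = 400 each). The counts below are placeholders to be
# replaced with the released per-script numbers.
from scipy.stats import chisquare

counts = [400, 400, 400, 400, 400, 400, 400]
assert sum(counts) == 2800

stat, p = chisquare(counts)  # null hypothesis: uniform across the Seven Scripts
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
if p < 0.05:
    print("per-script counts deviate significantly from strict balance")
```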
What would settle it
Either evidence that current VLLMs achieve high accuracy on all four tasks of the released dataset, or an independent audit showing that the image selection favors particular script stages or media types.
Original abstract
Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chronicles-OCR, the first benchmark for evaluating VLLMs on cross-temporal visual perception of Chinese characters across the complete evolutionary trajectory of the Seven Scripts. It presents a curated dataset of 2,800 images spanning tortoise shells to paper-based media, developed with domain experts, and proposes a Stage-Adaptive Annotation Paradigm to handle morphological variations. The benchmark defines four tasks—cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification—explicitly isolating visual perception from semantic reasoning to expose current VLLM limitations.
Significance. If the dataset curation achieves true balance without selection bias and the annotation paradigm is validated, Chronicles-OCR would provide a valuable, publicly available resource for digital humanities and VLLM robustness research. It addresses a clear gap in existing ancient-text datasets, which are limited to isolated periods, and offers quantitative tasks that could drive development of evolution-aware perception models.
major comments (2)
- [§3] §3 (Dataset Curation): The claim of a 'strictly balanced' collection of 2,800 images across the Seven Scripts and diverse physical media is not supported by any sampling protocol, per-script counts, media-type stratification, or quantitative uniformity metrics (e.g., distribution statistics or inter-expert agreement scores). Without these, it is impossible to verify that task difficulties are even and that visual perception is isolated from curation artifacts.
- [§4] §4 (Stage-Adaptive Annotation Paradigm): The paradigm is described at a high level but lacks concrete methodological details on how it accommodates drastic morphological and topological changes without introducing selection bias, such as explicit exclusion criteria, quantitative bias checks, or validation against the full evolutionary trajectory. This directly affects the reliability of the four tasks and the central claim of an 'authoritative platform'.
minor comments (2)
- [Data Availability] The GitHub link is provided in the abstract but should be repeated with a permanent identifier or DOI in the main text and data-availability statement for reproducibility.
- [Figures] Figure captions for the example images and task illustrations could be expanded to explicitly note the script period and media type for each sample to aid reader interpretation.
Simulated Author's Rebuttal
We sincerely thank the referee for the thorough and constructive comments. We have carefully addressed each major point below and revised the manuscript to incorporate additional details and evidence where the original submission was insufficient.
Point-by-point responses
- Referee: [§3] §3 (Dataset Curation): The claim of a 'strictly balanced' collection of 2,800 images across the Seven Scripts and diverse physical media is not supported by any sampling protocol, per-script counts, media-type stratification, or quantitative uniformity metrics (e.g., distribution statistics or inter-expert agreement scores). Without these, it is impossible to verify that task difficulties are even and that visual perception is isolated from curation artifacts.
  Authors: We agree that the original manuscript did not provide sufficient quantitative support for the 'strictly balanced' claim. In the revised Section 3, we now include the full sampling protocol developed with domain experts, per-script image counts, media-type stratification tables, distribution statistics, and inter-expert agreement metrics (including Cohen's kappa; a minimal kappa sketch follows these responses). These additions enable verification that task difficulties are even across the Seven Scripts and that visual perception is isolated from curation artifacts. Revision: yes.
- Referee: [§4] §4 (Stage-Adaptive Annotation Paradigm): The paradigm is described at a high level but lacks concrete methodological details on how it accommodates drastic morphological and topological changes without introducing selection bias, such as explicit exclusion criteria, quantitative bias checks, or validation against the full evolutionary trajectory. This directly affects the reliability of the four tasks and the central claim of an 'authoritative platform'.
  Authors: We acknowledge that the original description of the Stage-Adaptive Annotation Paradigm was pitched at too high a level. In the revised Section 4, we have added concrete methodological details, including explicit exclusion criteria for morphological variants, quantitative bias checks (pre- and post-annotation distribution comparisons), and validation steps against the complete evolutionary trajectory. These changes improve transparency and support the reliability of the four tasks. Revision: yes.
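The §3 response above promises inter-expert agreement via Cohen's kappa. As a minimal sketch of that check, assuming a dual-annotated subset (the labels here are invented):

```python
# Minimal Cohen's kappa between two annotators' script labels.
# The labels are invented; with the released data this would run over
# whatever dual-annotated subset the revised Section 3 reports.
from sklearn.metrics import cohen_kappa_score

expert_a = ["oracle", "bronze", "seal", "seal", "clerical", "regular"]
expert_b = ["oracle", "bronze", "seal", "cursive", "clerical", "regular"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```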
Circularity Check
No significant circularity in benchmark definition
Full rationale
The paper presents a new dataset and benchmark for cross-temporal Chinese character perception, consisting of data curation (2,800 images across Seven Scripts), a Stage-Adaptive Annotation Paradigm, and four task definitions. No mathematical derivations, equations, fitted parameters, or predictive models appear in the provided text. The central claims rest on explicit construction choices (expert curation, task isolation of visual perception) rather than any reduction to self-referential inputs, self-citations, or renamed prior results. The contribution is self-contained as a definitional benchmark without internal circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Seven Scripts represent the complete evolutionary trajectory of Chinese characters, spanning thousands of years.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Yuxuan Cai, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024.
- [4] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025.
- [5] Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, and Lianwen Jin. OCR-Reasoning benchmark: Unveiling the true capabilities of MLLMs in complex text-rich image reasoning. arXiv preprint arXiv:2505.17163, 2025.
- [6] Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025.
- [7] Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, et al. HunyuanOCR technical report. arXiv preprint arXiv:2511.19575, 2025.
- [8] Yang Liu, Jiahuan Cao, Hiuyi Cheng, Yongxin Shi, Kai Ding, and Lianwen Jin. MCS-Bench: A comprehensive benchmark for evaluating multimodal large language models in Chinese classical studies. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [9] Haisu Guan, Jinpeng Wan, Yuliang Liu, Pengjie Wang, Kaile Zhang, Zhebin Kuang, Xinyu Wang, Xiang Bai, and Lianwen Jin. An open dataset for the evolution of oracle bone characters: EVOBC. arXiv preprint arXiv:2401.12467, 2024.
- [10] Qingju Jiao, Jingwen Wu, Qi Liu, Han Zhang, Zhan Zhang, Bang Li, Jing Xiong, Guoying Liu, and Yongge Liu. A graph-based evolutionary dataset for oracle bone characters from inscriptions to modern Chinese scripts. npj Heritage Science, 2025.
- [11] Mengru Wang, Yu Cai, Li Gao, Ruichen Feng, Qingju Jiao, Xiaolin Ma, and Yu Jia. Study on the evolution of Chinese characters based on few-shot learning: From oracle bone inscriptions to regular script. PLOS ONE, 2022.
- [12] Zijian Chen, Wenjun Zhang, Guangtao Zhai, et al. OBI-Bench: Can LMMs aid in study of ancient script on oracle bones? In International Conference on Learning Representations, 2025.
- [13] Zijian Chen, Wenjie Hua, Jinhao Li, Yucheng Zhu, Xiaona Zhi, Zhiji Liu, Tingzhu Chen, Wenjun Zhang, and Guangtao Zhai. Oracle bone inscriptions information processing: A comprehensive survey. npj Heritage Science, 2026. doi: 10.1038/s40494-026-02511-w.
- [14] Shuangping Huang, Haobin Wang, Yongge Liu, Xiaosong Shi, and Lianwen Jin. OBC306: A large-scale oracle bone character recognition dataset. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019.
- [15] Jing Li, Xueke Chi, Qiufeng Wang, Dahan Wang, Kaizhu Huang, Yongge Liu, and Cheng-Lin Liu. A comprehensive survey of oracle character recognition: Challenges, benchmarks, and beyond. arXiv preprint arXiv:2411.11354, 2024.
- [16] James P Philips and Nasseh Tabrizi. Historical document processing: A survey of techniques, tools, and trends. arXiv preprint arXiv:2002.06300, 2020.
- [17] Jiahuan Cao, Yang Liu, Peirong Zhang, Yongxin Shi, Kai Ding, and Lianwen Jin. TongGu-VL: Advancing visual-language understanding in Chinese classical studies through parameter sensitivity-guided instruction tuning. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
- [18] Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, et al. OracleAgent: A multimodal reasoning agent for oracle bone script research. arXiv preprint arXiv:2510.26114, 2025.
- [19] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Jiapeng Wang, Yifan Zhang, Zhuoma GongQue, Chong Sun, Yida Xu, Yadong Xue, et al. V-Oracle: Making progressive reasoning in deciphering oracle bones for you and me. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [20] Xinyu Yao, Mengdi Wang, Bo Chen, and Xiaobing Zhao. WenyanGPT: A large language model for classical Chinese tasks. arXiv preprint arXiv:2504.20609, 2025.
- [21] Bang Li, Qianwen Dai, Feng Gao, Weiye Zhu, Qiang Li, and Yongge Liu. HWOBC: A handwriting oracle bone character recognition database. Journal of Physics: Conference Series, 2020. doi: 10.1088/1742-6596/1651/1/012050.
- [22] Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, Seven Shu, et al. OracleFusion: Assisting the decipherment of Oracle Bone Script with structurally constrained semantic typography. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
- [23] Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Jinpeng Wan, Haisu Guan, Zhebin Kuang, Lianwen Jin, Xiang Bai, et al. An open dataset for oracle bone script recognition and decipherment. arXiv preprint arXiv:2401.15365, 2024.
- [24] Yue Xu, Fei Yin, Da-Han Wang, Xu-Yao Zhang, Zhaoxiang Zhang, and Cheng-Lin Liu. CASIA-AHCDB: A large-scale Chinese ancient handwritten characters database. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019.
- [25] Rowan K Flad. Divination and power: A multiregional view of the development of oracle bone divination in early China. Current Anthropology, 2008.
- [26] David N Keightley. Graphs, words, and meanings: Three reference works for Shang oracle-bone studies, with an excursus on the religious role of the day or sun. 1997.
- [27] Rui Guo. A research on an intelligent recognition tool for bronze inscriptions of the Shang and Zhou dynasties. Journal of Chinese Writing Systems, 2020.
- [28] Wenjie Hua, Hoang H Nguyen, and Gangyan Ge. BIRD: Bronze inscription restoration and dating. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [29] Rixin Zhou, Peiqiang Qiu, Qian Zhang, Chuntao Li, and Xi Yang. LadderMoE: Ladder-side mixture of experts adapters for bronze inscription recognition. arXiv preprint arXiv:2510.01651, 2025.
- [30] Veronica Fu. Bridging cultural divides: Metadata and the seal collection in a western context. In Understanding and Utilizing Informal Archives. IGI Global Scientific Publishing, 2026.
- [31] Yun Ou, Zhen-Jie Zhou, Di-Wen Kang, Pan Zhou, and Xue-Wei Liu. Qin seal script character recognition with fuzzy and incomplete information. Baghdad Science Journal, 2024.
- [32] Wenhui Zhou, Jinyu Liu, Jiefeng Li, Jiyi Li, Lili Lin, Fumiyo Fukumoto, and Guojun Dai. Style-independent radical sequence learning for zero-shot recognition of small seal script. Journal of the Franklin Institute, 2023.
- [33] Liu Guoqing, Hao Changning, Yan Jingbo, Dong Jing, Zhao Zuolong, and Hao Lujia. Stroke extraction algorithm of clerical script in Han dynasty based on contour: Take "Stele of Cao Quan" as an example. Mobile Information Systems, 2022.
- [34] Yu Lei, Tianzhao Zhou, and Yuankui Ma. Research on efficient calligraphy image classification based on attention enhancement. Mathematics, 2025.
- [35] Juan Wu. Han dynasty portrait image feature extraction and cloud computing-supported symbolic interpretation: A new approach to cultural heritage digitalization. Scalable Computing: Practice and Experience, 2024.
- [36] Xuanwei Peng. Stroke systems in Chinese characters: A systemic functional perspective on simplified regular script. Semiotica, 2017.
- [37] Hailin Yang, Lianwen Jin, Weiguo Huang, Zhaoyang Yang, Songxuan Lai, and Jifeng Sun. Dense and tight detection of Chinese characters in historical documents: Datasets and a recognition guided detector. IEEE Access, 2018.
- [38] Wei Zhang. The advantages and disadvantages of regular script in the study of calligraphy. In 2nd International Conference on Language, Art and Cultural Exchange (ICLACE 2021), 2021.
- [39] Xuanhong Wang, Cong Li, Zengguo Sun, and Luying Hui. RS-GAN: Unsupervised running script font generation via disentangled representation learning and contextual transformer. Pattern Analysis and Applications, 2025.
- [40] Xiao Qin, Jianhui Jiang, Wei Fan, and Changan Yuan. Chinese cursive character detection method. The Journal of Engineering, 2020.
- [41] Jung Liang, Wen-Hung Liao, and Yi-Chieh Wu. Toward automatic recognition of cursive Chinese calligraphy: An open dataset for cursive Chinese calligraphy text. In 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), 2020.
- [42] Yao Wu, Jie Jiang, and Yi Li. A method of Chinese characters changing from regular script to semi-cursive script described by track and point set. In 2018 International Joint Conference on Information, Media and Engineering (ICIME), 2018.
- [43] Jia Chen and Kwong Lum. Unconstrained freehand cursive script: A revolution in Chinese calligraphic art. International Journal of Politics, Culture, and Society, 1995.
- [44] Adele Schlombs. Huai-su and the Beginnings of Wild Cursive Script in Chinese Calligraphy. Franz Steiner Verlag, 1998.
- [45] Zhu Lei Gang, Loy Chee Luen, and Lee Keok Cheong. The aesthetic structure of cursive script. International Journal of Academic Research in Business and Social Sciences, 2023.
- [46] Mengru Wang, Yu Cai, Li Gao, Ruichen Feng, Qingju Jiao, Xiaolin Ma, and Yu Jia. Study on the evolution of Chinese characters based on few-shot learning: From oracle bone inscriptions to regular script. PLOS ONE, 2022. doi: 10.1371/journal.pone.0272974.
- [47] Gengluo Li, Pengyuan Lyu, Chengquan Zhang, Huawen Shen, Liang Wu, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, and Yu Zhou. Towards real-world document parsing via realistic scene synthesis and document-aware training, 2026.
- [48] Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, and Yu Zhou. MMTIT-Bench: A multilingual and multi-scenario benchmark with cognition-perception-reasoning guided text-image machine translation, 2026.
- [49] Shangpin Peng, Senqiao Yang, Li Jiang, and Zhuotao Tian. Mitigating object hallucinations via sentence-level early intervention. In Proceedings of the IEEE International Conference on Computer Vision, 2025.
- [50] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.
- [51] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [52] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [53] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [54] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- [55] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 2023.
- [56] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [57] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- [58] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- [59] Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Uni-DPO: A unified paradigm for dynamic preference optimization of LLMs. arXiv preprint arXiv:2506.10054, 2025.
- [60]
- [61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- [62] Wenjin Hou, Shangpin Peng, Weinong Wang, Zheng Ruan, Yue Zhang, Zhenglin Zhou, Mingqi Gao, Yifei Chen, Kaiqi Wang, Hongming Yang, et al. Uni-OPD: Unifying on-policy distillation with a dual-perspective recipe. arXiv preprint arXiv:2605.03677, 2026.
- [63] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
- [64] Yongxin Shi, Chongyu Liu, Dezhi Peng, Cheng Jian, Jiarong Huang, and Lianwen Jin. M5HisDoc: A large-scale multi-style Chinese historical document analysis benchmark. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [65] Zhikai Hu, Yiu-ming Cheung, Yonggang Zhang, Peiying Zhang, and Pui-ling Tang. Component-level oracle bone inscription retrieval. In Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024.
- [66] Bang Li, Donghao Luo, Yujie Liang, Jing Yang, Zengmao Ding, Xu Peng, Boyuan Jiang, Shengwei Han, Dan Sui, Peichao Qin, et al. Oracle bone inscriptions multi-modal dataset. arXiv preprint arXiv:2407.03900, 2024.
- [67] Yongxin Shi, Dezhi Peng, Yuyi Zhang, Jiahuan Cao, and Lianwen Jin. A large-scale dataset for Chinese historical document recognition and analysis. Scientific Data, 2025.
- [68] Zijian Chen, Wenjie Hua, Jinhao Li, Lirong Deng, Fan Du, Tingzhu Chen, and Guangtao Zhai. PictOBI-20k: Unveiling large multimodal models in visual decipherment for pictographic oracle bone characters. In ICASSP 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.
- [69] Rui Song, Lida Shi, Ruihua Qi, Yingji Li, and Hao Xu. Enhancing multimodal large language models for ancient Chinese character evolution analysis via glyph-driven fine-tuning. arXiv preprint arXiv:2604.11299, 2026.
- [70] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
- [71] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [72] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [73] Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...
- [74] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
- [75] Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, S... Ovis2.5 technical report, 2025.
- [76] V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, J...
- [77] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [78] ByteDance Seed. Seed1.8 Model Card: Towards generalized real-world agency, 2025. URL https://github.com/ByteDance-Seed/Seed-1.8/blob/main/Seed-1.8-Modelcard.pdf.
- [79] ByteDance Seed Team. Seed2.0 Model Card: Towards intelligence frontier for real-world complexity, February 2026. URL https://github.com/ByteDance-Seed/Seed2.0.
- [80] Xiaomi Corporation. Xiaomi MiMo-V2-Omni: See, hear, act in the agentic era. https://mimo.xiaomi.com/mimo-v2-omni, 2026.