pith. machine review for the scientific record.

arxiv: 2604.11299 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords ancient Chinese characters · glyph evolution · multimodal large language models · fine-tuning · character recognition · evolutionary reasoning · benchmark · text evolution

The pith

A glyph-driven fine-tuning framework improves multimodal models' analysis of ancient Chinese character evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark with 11 tasks and over 130,000 examples to measure how well multimodal large language models handle the historical development of Chinese writing. Existing models show clear limits in comparing glyphs and reasoning about evolutionary changes. The authors introduce a fine-tuning method that trains models to track consistent patterns in how character shapes transform across time. This produces measurable gains on every task, including for models with only 2 billion parameters. The benchmark and trained models are released to support further work on AI-assisted study of script history.

Core claim

We propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks.

What carries the argument

The glyph-driven fine-tuning framework (GEVO) that trains models to focus on evolutionary consistency in glyph transformations.

If this is right

  • Models show gains on character recognition and evolutionary reasoning tasks.
  • Improvements hold for models as small as 2 billion parameters.
  • The method addresses observed weaknesses in glyph-level comparison.
  • The benchmark and trained models are released publicly to enable further research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same glyph-consistency approach could be tested on evolution in other writing systems or visual scripts.
  • Public release of the benchmark allows direct comparison of future methods on the same tasks.
  • If glyph focus drives the gains, similar visual-consistency training might aid other multimodal tasks involving shape changes over time.

Load-bearing premise

The 11-task benchmark measures genuine understanding of character evolution rather than superficial pattern matching, and the observed gains arise specifically from the glyph-driven component.

What would settle it

Apply ordinary fine-tuning to the same data without the glyph-driven component and check whether the performance gains on the 11 tasks largely disappear.
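One concrete shape for that check, as a minimal sketch: the task IDs follow the paper's T1.1–T3.3 numbering, but every accuracy value below is an invented placeholder, not a reported result.

```python
# Readout for the proposed ablation: compare GEVO against plain SFT
# trained on the identical data. All numbers are placeholders.

TASKS = ["T1.1", "T1.2", "T1.3", "T1.4", "T2.1", "T2.2",
         "T2.3", "T2.4", "T3.1", "T3.2", "T3.3"]

def gevo_specific_gain(plain_sft, gevo):
    """Per-task accuracy delta of GEVO over plain SFT on identical data.
    If the glyph-driven component carries the result, these deltas stay
    clearly positive; if they hover near zero, the data alone explains
    the gains."""
    return {t: gevo[t] - plain_sft[t] for t in TASKS}

plain_sft = {t: 55.0 for t in TASKS}   # hypothetical accuracies (%)
gevo      = {t: 62.0 for t in TASKS}
for task, delta in gevo_specific_gain(plain_sft, gevo).items():
    print(f"{task}: GEVO - plain SFT = {delta:+.1f} pts")
```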

Figures

Figures reproduced from arXiv: 2604.11299 by Hao Xu, Lida Shi, Ruihua Qi, Rui Song, Yingji Li.

Figure 1. The text evolution process of the character …
Figure 2. The 11 basic tasks that constitute the evaluation of MLLMs, along with corresponding examples.
Figure 4. Benchmark quantity statistics.
Figure 6. Character accuracy on different script styles.
Figure 7. Comparison of results for Qwen3-VL-2B-Instruct after SFT (Qwen3-VL-2B-SFT) across all tasks.
Figure 8. The two-stage framework of GDEVA training.
Figure 9. The confusion matrices of script style predictions.
Figure 10. Accuracy of characters across different script styles.
Figure 11. Visualization of similar image representations among “日” (sun) and “口” (mouth) based on Qwen3-VL-2B-Instruct and GEVO in two-dimensional space. Boxes in different colors distinguish images of script styles corresponding to different modern Chinese characters.
Figure 12. Detailed instructions for Task 1.
Figure 13. Detailed instructions for Task 2.
Figure 14. Detailed instructions for Task 3.
Figure 16. The loss variation of the model in the second …
Figure 17. The loss variation of the model in the third …
Figure 18. Visualization of similar image representations among “力” (force) and “刀” (knife) based on Qwen3-VL-2B-Instruct and GEVO in two-dimensional space.
Figure 19. Visualization of similar image representations …
read the original abstract

In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks, such as character recognition and evolutionary reasoning, remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models (https://github.com/songruiecho/GEVO).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript constructs a benchmark of 11 tasks and over 130,000 instances to evaluate MLLMs on ancient Chinese character evolution analysis. It identifies limitations in existing models for glyph-level comparison and evolutionary reasoning, then proposes the GEVO glyph-driven fine-tuning framework to encourage capture of evolutionary consistency in glyph transformations. Experiments report that even 2B-scale models achieve consistent performance gains across tasks, with public release of the benchmark and trained models.

Significance. If the attribution of gains to the glyph-driven mechanism holds, the work supplies a targeted benchmark and fine-tuning approach for a specialized domain in historical linguistics and cultural heritage. The public release of the benchmark and models is a clear strength that enables reproducibility and further research.

major comments (1)
  1. [Experimental results] Experimental results section: the central claim that GEVO 'explicitly encourages models to capture evolutionary consistency in glyph transformations' and thereby produces the observed improvements requires an ablation that isolates the evolutionary-consistency objective from standard supervised fine-tuning on the identical 130k instances. Only comparisons to untuned base MLLMs are described; without a matched control (identical data, optimization, and architecture but omitting the glyph-driven loss), the causal contribution of the proposed framework remains unestablished and is load-bearing for the paper's main contribution.
minor comments (2)
  1. [Abstract] Abstract: the statement of 'consistent and comprehensive performance improvements' lacks any quantitative summary, baseline names, or mention of statistical testing (one standard option, a paired bootstrap over per-item correctness, is sketched after this list), which reduces informativeness for readers.
  2. [Benchmark construction] Benchmark description: task definitions, data collection protocol, and split statistics are referenced but not detailed enough for independent replication.
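For the referee's statistical-testing point, here is a minimal sketch of that paired bootstrap over per-item correctness; the correctness vectors are hypothetical inputs, not data from the paper:

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap over per-item 0/1 correctness for two systems scored
    on the same benchmark items. Returns the fraction of resamples in which
    system B fails to beat system A (small values support 'B > A')."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    b_not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        if sum(correct_b[i] for i in idx) <= sum(correct_a[i] for i in idx):
            b_not_better += 1
    return b_not_better / n_resamples

# Hypothetical usage, one call per benchmark task:
# p = paired_bootstrap(base_model_correct, gevo_correct)
```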

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of the benchmark, the 11 tasks, and the public release of data and models. We address the single major comment below and will revise the manuscript to incorporate the requested analysis.

read point-by-point responses
  1. Referee: Experimental results section: the central claim that GEVO 'explicitly encourages models to capture evolutionary consistency in glyph transformations' and thereby produces the observed improvements requires an ablation that isolates the evolutionary-consistency objective from standard supervised fine-tuning on the identical 130k instances. Only comparisons to untuned base MLLMs are described; without a matched control (identical data, optimization, and architecture but omitting the glyph-driven loss), the causal contribution of the proposed framework remains unestablished and is load-bearing for the paper's main contribution.

    Authors: We agree that the current experimental design compares GEVO only against untuned base models and does not yet isolate the glyph-driven objective from standard supervised fine-tuning on the same 130k instances. This leaves the specific causal contribution of the evolutionary-consistency term partially unestablished. In the revised manuscript we will add the requested ablation: we will fine-tune the same MLLM architectures on the identical 130,000 instances using the same optimization schedule and data order, but replace the glyph-driven loss with standard next-token prediction loss. The resulting performance deltas will be reported alongside the existing GEVO results for all 11 tasks. This addition will directly quantify the incremental benefit attributable to the glyph-driven component while preserving the overall experimental protocol. revision: yes
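What such a matched control could look like, as a minimal PyTorch sketch rather than the authors' code: the next-token term is standard cross-entropy, while the consistency term and the glyph_embs input are invented stand-ins, since GEVO's actual auxiliary loss is not given in the text above.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, labels, glyph_embs=None, use_glyph_loss=False, alpha=0.1):
    """Switchable objective for the matched-control ablation.

    logits:     (B, T, V) language-model logits.
    labels:     (B, T) token ids, with -100 at masked positions.
    glyph_embs: hypothetical (B, S, D) embeddings of one character's glyphs
                at S consecutive historical stages (illustrative input).
    """
    # Control condition: standard next-token prediction loss.
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    if not use_glyph_loss or glyph_embs is None:
        return ce
    # Illustrative "evolutionary consistency" term: pull embeddings of
    # adjacent script stages toward each other. This stands in for
    # whatever auxiliary loss GEVO actually uses.
    consistency = 1.0 - F.cosine_similarity(
        glyph_embs[:, :-1], glyph_embs[:, 1:], dim=-1
    )
    return ce + alpha * consistency.mean()
```

With use_glyph_loss=False this is exactly the control the rebuttal promises; with it enabled, both arms still share data, data order, and optimizer, so any delta isolates the auxiliary term.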

Circularity Check

0 steps flagged

No circularity: empirical framework with independent benchmark and experimental claims

full rationale

The paper is an empirical ML study that constructs a public 11-task benchmark of 130k instances, evaluates base MLLMs, and proposes a glyph-driven fine-tuning method (GEVO) whose improvements are asserted via experimental results. No equations, derivations, or first-principles predictions appear in the provided text that reduce any claimed outcome to fitted inputs or self-definitions by construction. The framework is described as encouraging evolutionary consistency without the consistency metric or performance gains being defined in terms of each other. Public release of benchmark and models further supports independence. This matches the default non-circular case for empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the representativeness of the new benchmark tasks for evolutionary analysis and the assumption that glyph-focused training captures genuine historical consistency; no new physical entities or unstated mathematical axioms beyond standard ML evaluation practices.

axioms (2)
  • domain assumption The constructed benchmark tasks are representative of real-world ancient Chinese character evolution analysis needs.
    Implicit in the motivation and evaluation sections of the abstract.
  • domain assumption Performance improvements after fine-tuning reflect enhanced understanding rather than overfitting to the benchmark.
    Required for interpreting the reported gains across tasks.

pith-pipeline@v0.9.0 · 5527 in / 1294 out tokens · 53077 ms · 2026-05-10T15:22:43.110450+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work page · cited by 1 Pith paper

  1. [1]

    In Proceedings of the Asian Conference on Computer Vision, pages 1622–1639

    FFD Augmentor: Towards few-shot oracle character recognition from scratch. In Proceedings of the Asian Conference on Computer Vision, pages 1622–1639. Zhihan Zhou, Daqian Shi, Rui Song, Lida Shi, Xiaolei Diao, and Hao Xu. 2025. AncientBench: Towards comprehensive evaluation on excavated and transmitted Chinese corpora. arXiv preprint arXiv:2512.17756. …

  2. [2]

    Bronze Inscription → Regular Script → Oracle Bone Script

    A more detailed breakdown of the task composition is provided in Figure 3. Each task uses accuracy as the evaluation metric, meaning that if the correct answer appears in the generated result, it is considered correct. If the generated result contains multiple candidate answers, it is assumed that the model did not understand the instruction and is th…
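A minimal sketch of that containment-accuracy rule as we read it: the quoted sentence is cut off, so treating multi-candidate outputs as incorrect is our interpretation, and the per-task candidate list is a hypothetical input.

```python
def score_item(generation: str, gold: str, candidates: list) -> bool:
    """Correct iff the gold answer appears in the generated result; outputs
    that mention more than one candidate answer are scored as failures to
    follow the instruction (interpretation of the truncated sentence)."""
    mentioned = [c for c in candidates if c in generation]
    return gold in generation and len(mentioned) <= 1

def task_accuracy(generations, golds, candidates):
    hits = sum(score_item(g, a, candidates) for g, a in zip(generations, golds))
    return 100.0 * hits / len(golds)

# Hypothetical example with three script-style options:
# task_accuracy(["甲骨文"], ["甲骨文"], ["甲骨文", "金文", "楷书"])  -> 100.0
```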