arxiv: 2607.00664 · v1 · pith:6WBPSGHXnew · submitted 2026-07-01 · 💻 cs.CL

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

Pith reviewed 2026-07-02 13:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: kanji reading · Japanese LLMs · benchmark · phonological understanding · LLM evaluation · multiple readings · generation tasks · YOMI-Bench

The pith

Even Japanese-specific LLMs perform poorly on kanji reading tasks according to a new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Kanji characters in Japanese often have multiple possible readings, so surface text alone does not determine the correct pronunciation. The paper introduces YOMI-Bench, a set of four tasks built to test LLMs on kanji reading and phonological understanding. Evaluations across one multilingual open model, four Japanese-specific open models, and five commercial models show low performance overall. Japanese-specific models still struggle, and commercial models also perform poorly on generation tasks that require taking kanji readings into account. A reader would care because this pinpoints a concrete limitation in applying current LLMs to Japanese text production and analysis.
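To make the ambiguity concrete, here is a minimal sketch in Python. It uses only the romanized example from the paper's Figure 1 caption (the kanji 覚 with readings "kaku", "obo", and "sa"). The tiny lexicon and the function names `candidate_readings` and `reading_in_word` are assumptions made for illustration; they are not part of YOMI-Bench and do not reflect how the benchmark is built.

```python
from typing import List, Optional

# Hand-written toy lexicon (an assumption for illustration, not YOMI-Bench data).
# The words and romanized readings are the examples quoted in Figure 1.
WORD_READINGS = {
    "覚醒": "kakusei",   # awakening
    "覚える": "oboeru",  # memorize
    "覚める": "sameru",  # wake up
}

# All readings of the single character, as listed in Figure 1.
KANJI_READINGS = {"覚": ["kaku", "obo", "sa"]}


def candidate_readings(kanji: str) -> List[str]:
    """Return every known reading of one kanji; in isolation it is ambiguous."""
    return KANJI_READINGS.get(kanji, [])


def reading_in_word(kanji: str, word: str) -> Optional[str]:
    """Return the reading of `kanji` used in `word`, if the toy lexicon knows it.

    Only handles words that start with the kanji, which is enough for
    the three example words above.
    """
    full_reading = WORD_READINGS.get(word)
    if full_reading is None or not word.startswith(kanji):
        return None
    for reading in candidate_readings(kanji):
        if full_reading.startswith(reading):
            return reading
    return None


if __name__ == "__main__":
    print(candidate_readings("覚"))
    for w in WORD_READINGS:
        print(w, reading_in_word("覚", w))
```

The character alone yields three candidates; only the surrounding word selects one. A lookup table like this does not scale to open text, which is the gap the benchmark is probing.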

Core claim

The paper proposes YOMI-Bench consisting of four tasks to evaluate kanji reading performance in Japanese. Through systematic testing, it establishes that even Japanese-specific open LLMs exhibit low performance, while commercial LLMs also perform poorly on generation tasks that require consideration of kanji readings.

What carries the argument

YOMI-Bench, a benchmark of four tasks designed to measure LLMs' ability to handle multiple possible readings for individual kanji characters.

If this is right

  • Even Japanese-specific open LLMs exhibit low performance on the benchmark tasks.
  • Commercial LLMs perform poorly on generation tasks that require consideration of kanji readings.
  • The one multilingual open LLM evaluated also shows low performance, consistent with the difficulty posed by kanji having multiple readings.
  • The benchmark reveals that inferring correct readings from surface-level text alone remains difficult for current models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data that explicitly covers phonological variations across kanji readings could raise scores on these tasks.
  • The same task design could be adapted to test reading disambiguation in other languages with character-based scripts.
  • Low benchmark results suggest that Japanese text generation systems may produce incorrect readings until this gap is closed.
  • Model developers could add explicit disambiguation layers focused on Japanese phonology to improve reliability.

Load-bearing premise

The four tasks in YOMI-Bench are valid and unbiased measures of kanji reading ability that reflect real-world difficulties rather than artifacts of task design.

What would settle it

A model achieving consistently high accuracy, such as above 80 percent, across all four tasks while retaining general language capabilities would falsify the reported low performance.
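The criterion above can be written down as a simple check. This is a sketch only: the 0.80 threshold mirrors the "above 80 percent" example in the text, while the function name `meets_settling_criterion` and the task labels in the usage lines are hypothetical placeholders, since this page does not name the four tasks. The check also leaves out the "retaining general language capabilities" condition, which would need a separate evaluation.

```python
from typing import Dict


def meets_settling_criterion(
    task_accuracy: Dict[str, float],
    threshold: float = 0.80,  # mirrors the "above 80 percent" example
    n_tasks: int = 4,         # YOMI-Bench is described as having four tasks
) -> bool:
    """Return True only if every one of the n_tasks accuracies exceeds threshold."""
    if len(task_accuracy) != n_tasks:
        raise ValueError(f"expected {n_tasks} task scores, got {len(task_accuracy)}")
    return all(acc > threshold for acc in task_accuracy.values())


if __name__ == "__main__":
    # Hypothetical scores with placeholder task labels, not results from the paper.
    strong = {"task_1": 0.91, "task_2": 0.85, "task_3": 0.88, "task_4": 0.83}
    uneven = {"task_1": 0.91, "task_2": 0.85, "task_3": 0.88, "task_4": 0.42}
    print(meets_settling_criterion(strong))
    print(meets_settling_criterion(uneven))
```

Requiring every task to clear the bar, rather than the average, matches the wording "across all four tasks": one weak generation task is enough to fail the check.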

Figures

Figures reproduced from arXiv: 2607.00664 by Hiroya Takamura, Hitomi Yanaka, Ryota Mibayashi.

Figure 1
Figure 1: The overview of YOMI-Bench. Accompanying text from the paper: the kanji character “覚” has three possible readings: “kaku,” “obo,” and “sa.” These readings vary depending on the word in which the character appears, such as “覚醒 (kakusei/Awakening),” “覚える (oboeru/Memorize),” and “覚める (sameru/Wake up).” Thus, in order to correctly predict the readings of kanji characters appearing within individual words, models should not only possess the k…
Original abstract

We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it difficult to infer the correct reading from surface-level text alone. Due to these linguistic characteristics, it is empirically known that LLMs exhibit low performance in kanji reading for Japanese. The proposed YOMI-Bench consists of four tasks specifically designed to evaluate kanji reading performance in Japanese. In our evaluation using YOMI-Bench, we assessed one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs. As a result, we found that even Japanese-specific models show low performance, and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes YOMI-Bench, a benchmark consisting of four tasks to evaluate the kanji reading and phonological understanding of LLMs for Japanese. It evaluates one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs, concluding that even Japanese-specific models exhibit low performance and that commercial models perform poorly on generation tasks requiring consideration of kanji readings.

Significance. If the benchmark tasks are shown to be valid and unbiased measures of kanji reading ability, this work could highlight important limitations in current LLMs for handling Japanese language specifics, which is relevant for improving multilingual and language-specific models. The evaluation across different model types provides a broad view of the issue.

major comments (1)
  1. [Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.
minor comments (1)
  1. The paper could benefit from including example instances from the four tasks to illustrate the evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address the major comment point-by-point below.

  1. Referee: [Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.

    Authors: We agree that the current abstract is high-level and omits key details on task construction, data sources, and metrics, which are instead described in Sections 3 (Benchmark Construction) and 4 (Experiments). This is a valid observation. To address it, we will revise the abstract to concisely summarize the four tasks, their data sources (e.g., existing Japanese corpora and manually curated examples), the primary metrics (accuracy and F1 for reading prediction/generation), and note that evaluations are deterministic with no statistical significance testing applied due to the fixed benchmark design. This change will allow readers to better assess the claims directly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes YOMI-Bench as a new benchmark with four tasks for evaluating kanji reading in LLMs and reports direct evaluation results on open and commercial models. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are present. The claims rest on empirical performance numbers from the benchmark application itself, with no self-referential reductions or load-bearing self-citations that collapse the argument. This is a standard benchmark paper whose central content is self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark proposal paper with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5681 in / 942 out tokens · 22224 ms · 2026-07-02T13:01:59.164136+00:00 · methodology
