On the Definition of Japanese Word
Pith reviewed 2026-05-25 17:58 UTC · model grok-4.3
The pith
Short Unit Words used in UD Japanese treebanks do not qualify as syntactic words under the annotation guidelines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The annotation guidelines for Universal Dependencies require syntactic words as basic units, but Short Unit Words in Japanese UD treebanks are not syntactic words as specified by those guidelines.
What carries the argument
The UD guidelines' definition of syntactic words, applied to evaluate whether Short Unit Words qualify in Japanese.
If this is right
- Dependency parsing models trained on current Japanese UD data would use units that do not align with the intended syntactic words.
- Annotation consistency across languages in UD could be compromised if Japanese uses non-qualifying units.
- Future revisions might need to adopt different word units to comply with the guidelines.
- Non-mainstream linguistic definitions of Japanese words could be considered for annotation despite their unfamiliarity.
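The consequence for parsers and future revisions can be made concrete with a small sketch. It assumes a hand-written toy tree with UD-style attachments for 太郎がパンを食べた ("Taro ate bread"), which is not taken from any treebank, and a hypothetical helper `collapse` that merges fine-grained tokens into coarser, bunsetsu-like units and remaps the heads. The grouping is chosen for illustration only; it is not the paper's proposed word definition.

```python
from typing import List, Tuple

Token = Tuple[int, str, int]  # (id, form, head id); head 0 marks the root


def collapse(tokens: List[Token], groups: List[List[int]]) -> List[Token]:
    """Merge each group of token ids into one unit and remap heads to unit ids."""
    unit_of = {tid: u for u, g in enumerate(groups, start=1) for tid in g}
    form = {tid: f for tid, f, _ in tokens}
    head = {tid: h for tid, _, h in tokens}
    merged = []
    for u, g in enumerate(groups, start=1):
        # Tokens whose head lies outside the group (or is the root).
        external = [t for t in g if head[t] == 0 or unit_of[head[t]] != u]
        if len(external) != 1:
            raise ValueError(f"group {g} is not a single connected subtree")
        h = head[external[0]]
        merged.append((u, "".join(form[t] for t in g), 0 if h == 0 else unit_of[h]))
    return merged


# Toy fine-grained analysis (assumed for illustration, not treebank data).
tokens = [
    (1, "太郎", 5), (2, "が", 1),
    (3, "パン", 5), (4, "を", 3),
    (5, "食べ", 0), (6, "た", 5),
]
groups = [[1, 2], [3, 4], [5, 6]]  # bunsetsu-like grouping (assumed)

merged = collapse(tokens, groups)
print(merged)  # [(1, '太郎が', 3), (2, 'パンを', 3), (3, '食べた', 0)]
```

The same tree shrinks from six nodes to three once the unit changes, so a parser trained on one unit inventory does not transfer directly to the other.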
Where Pith is reading between the lines
- Other languages with ambiguous word boundaries might face similar challenges in applying UD guidelines.
- Adopting linguistic word definitions could improve cross-lingual comparability in dependency annotations.
- Testing the application of word definitions on sample sentences could reveal practical annotation issues.
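A test on sample sentences could start as simply as comparing where two candidate segmentations place their boundaries. The sketch below assumes two hand-written segmentations of 食べさせられた ("was made to eat"): a short-unit-style split and a single inflected form. Both are illustrative assumptions rather than treebank output, and the helper name `boundaries` is hypothetical.

```python
from typing import List, Set


def boundaries(tokens: List[str]) -> Set[int]:
    """Character offsets of token-internal boundaries (the final end is excluded)."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets


fine = ["食べ", "させ", "られ", "た"]  # short-unit-style split (assumed)
coarse = ["食べさせられた"]            # whole inflected form as one unit (assumed)

# Both segmentations must cover the same surface string.
assert "".join(fine) == "".join(coarse)

# Boundaries that exist only in the finer segmentation.
extra = sorted(boundaries(fine) - boundaries(coarse))
print(extra)  # [2, 4, 6]
```

Each offset in `extra` marks a point where an annotator would have to decide, using the chosen word criteria, whether a boundary is justified.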
Load-bearing premise
The UD guidelines provide a sufficiently clear, language-independent definition of syntactic words that can be applied to Japanese.
What would settle it
A direct comparison showing that Short Unit Words satisfy the UD syntactic word criteria in specific Japanese sentences would falsify the claim.
Original abstract
The annotation guidelines for Universal Dependencies (UD) stipulate that the basic units of dependency annotation are syntactic words, but it is not clear what are syntactic words in Japanese. Departing from the long tradition of using phrasal units called bunsetsu for dependency parsing, the current UD Japanese treebanks adopt the Short Unit Words. However, we argue that they are not syntactic word as specified by the annotation guidelines. Although we find non-mainstream attempts to linguistically define Japanese words, such definitions have never been applied to corpus annotation. We discuss the costs and benefits of adopting the rather unfamiliar criteria.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that Short Unit Words (SUW) adopted in current UD Japanese treebanks do not qualify as syntactic words under the UD annotation guidelines, which prioritize syntactic criteria over orthographic or traditional phrasal units such as bunsetsu. It reviews non-mainstream linguistic attempts to define Japanese words, notes that such definitions have not been used in corpus annotation, and discusses costs and benefits of different criteria.
Significance. If substantiated, the result would identify an inconsistency between UD guidelines and Japanese treebank practice, with implications for cross-linguistic comparability of syntactic annotations. The discussion of alternative word definitions could inform guideline revisions for agglutinative languages, but the paper supplies no new data, treebank comparisons, or explicit criterion applications to support its central claim.
major comments (2)
- [Abstract] Abstract and introduction: the claim that SUW 'are not syntactic word as specified by the annotation guidelines' is asserted without quoting or applying any specific UD guideline criterion (e.g., the syntactic-word definition in the UD guidelines) to Japanese examples, leaving the mismatch interpretive rather than demonstrated.
- [UD guidelines discussion] Discussion of UD guidelines: the argument presupposes that the guidelines contain an operational, language-independent definition of syntactic words sufficient to exclude SUW, yet provides no direct test of this assumption against Japanese morphological structure or bunsetsu units.
minor comments (2)
- [Abstract] Abstract: grammatical agreement error ('syntactic word' should read 'syntactic words').
- [Abstract] Abstract: the phrase 'we find non-mainstream attempts' is imprecise; name the specific linguistic works referenced.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment point by point below and indicate where revisions will be made to strengthen the explicit demonstration of our claims.
- Referee: [Abstract] Abstract and introduction: the claim that SUW 'are not syntactic word as specified by the annotation guidelines' is asserted without quoting or applying any specific UD guideline criterion (e.g., the syntactic-word definition in the UD guidelines) to Japanese examples, leaving the mismatch interpretive rather than demonstrated.
  Authors: We agree that the abstract and introduction assert the central claim without direct quotation or application of specific UD criteria. Although the full manuscript references the guidelines' syntactic priorities, we will revise these sections to include explicit quotations from the UD syntactic word definition and apply the criteria to concrete Japanese examples involving Short Unit Words, thereby demonstrating the mismatch rather than leaving it interpretive.
  Revision: yes
- Referee: [UD guidelines discussion] Discussion of UD guidelines: the argument presupposes that the guidelines contain an operational, language-independent definition of syntactic words sufficient to exclude SUW, yet provides no direct test of this assumption against Japanese morphological structure or bunsetsu units.
  Authors: The manuscript contrasts the UD guidelines' syntactic criteria with traditional Japanese units such as bunsetsu and reviews alternative linguistic definitions. We acknowledge the value of a more direct test. In revision, we will add explicit applications of the UD syntactic word criteria to Japanese morphological structures and bunsetsu units, providing the requested direct comparison.
  Revision: yes
Circularity Check
No significant circularity; argument applies external UD guidelines to Japanese units
The paper's central claim compares Short Unit Words against the syntactic word definition supplied by the external Universal Dependencies annotation guidelines and prior linguistic literature. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear; the derivation consists of an interpretive mismatch between an independent external standard and the chosen annotation units. This is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: UD annotation guidelines stipulate that the basic units are syntactic words, whose definition is language-independent enough to apply to Japanese.