arxiv: 1906.09719 · v1 · pith:4ACZ4CXInew · submitted 2019-06-24 · 💻 cs.CL

On the Definition of Japanese Word

Pith reviewed 2026-05-25 17:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords Japanese word definition · syntactic words · Universal Dependencies · dependency annotation · Short Unit Words · bunsetsu

The pith

Short Unit Words used in UD Japanese treebanks do not qualify as syntactic words under the annotation guidelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how syntactic words should be defined for Japanese in Universal Dependencies (UD) annotation. It concludes that the Short Unit Words adopted in existing Japanese UD treebanks do not match the syntactic word concept outlined in the guidelines. Adopting Short Unit Words was itself a departure from the long tradition of using phrasal units called bunsetsu in Japanese dependency parsing. The author notes that non-mainstream linguistic definitions of the Japanese word exist but have never been applied to corpus annotation, and discusses the costs and benefits of adopting these unfamiliar criteria.

Core claim

The annotation guidelines for Universal Dependencies require syntactic words as basic units, but Short Unit Words in Japanese UD treebanks are not syntactic words as specified by those guidelines.

What carries the argument

The UD guidelines' definition of syntactic words, applied to evaluate whether Short Unit Words qualify in Japanese.

If this is right

  • Dependency parsing models trained on current Japanese UD data would use units that do not align with the intended syntactic words.
  • Annotation consistency across languages in UD could be compromised if Japanese uses non-qualifying units.
  • Future revisions might need to adopt different word units to comply with the guidelines.
  • Non-mainstream linguistic definitions of Japanese words could be considered for annotation despite their unfamiliarity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other languages with ambiguous word boundaries might face similar challenges in applying UD guidelines.
  • Adopting linguistic word definitions could improve cross-lingual comparability in dependency annotations.
  • Testing the application of word definitions on sample sentences could reveal practical annotation issues.
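
The last extension above can be made concrete with a small sketch. It shows one way to concatenate short units into larger candidate words so that the two segmentations can be compared on a sample sentence. Everything here is an assumption made for illustration: the `merge_suws` helper, the four-way split of the toy verb form, and the hand-assigned attachment flags are hypothetical and do not reproduce the paper's criteria or any treebank annotation.

```python
from typing import List, Tuple

# Each short unit is (surface form, attaches_to_previous).
# The boolean is a hand-assigned, hypothetical judgment for illustration,
# not a UD or BCCWJ annotation.
ShortUnit = Tuple[str, bool]


def merge_suws(units: List[ShortUnit]) -> List[str]:
    """Concatenate every unit flagged as bound onto the preceding unit."""
    words: List[str] = []
    for surface, attaches in units:
        if attaches and words:
            # Bound unit: glue it to the word built so far.
            words[-1] += surface
        else:
            # Free unit (or nothing to attach to): start a new word.
            words.append(surface)
    return words


# Toy input: 食べさせられた ("was made to eat") split into four short units.
example = [("食べ", False), ("させ", True), ("られ", True), ("た", True)]
merged = merge_suws(example)
print(merged)  # ['食べさせられた']
```

Keeping the attachment decision as an explicit flag separates the question the paper raises (which units are bound) from the mechanical step of concatenation, so different criteria can be swapped in without changing the merging code.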

Load-bearing premise

The UD guidelines provide a sufficiently clear, language-independent definition of syntactic words that can be applied to Japanese.

What would settle it

A direct comparison showing that Short Unit Words satisfy the UD syntactic word criteria in specific Japanese sentences would falsify the claim.

Figures

Figures reproduced from arXiv: 1906.09719 by Yugo Murawaki.

Figure 1: Japanese syntactic words as concatenations of SUWs (Short Unit Words). A glossed example is followed […]

Original abstract

The annotation guidelines for Universal Dependencies (UD) stipulate that the basic units of dependency annotation are syntactic words, but it is not clear what are syntactic words in Japanese. Departing from the long tradition of using phrasal units called bunsetsu for dependency parsing, the current UD Japanese treebanks adopt the Short Unit Words. However, we argue that they are not syntactic word as specified by the annotation guidelines. Although we find non-mainstream attempts to linguistically define Japanese words, such definitions have never been applied to corpus annotation. We discuss the costs and benefits of adopting the rather unfamiliar criteria.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that Short Unit Words (SUW) adopted in current UD Japanese treebanks do not qualify as syntactic words under the UD annotation guidelines, which prioritize syntactic criteria over orthographic or traditional phrasal units such as bunsetsu. It reviews non-mainstream linguistic attempts to define Japanese words, notes that such definitions have not been used in corpus annotation, and discusses costs and benefits of different criteria.

Significance. If substantiated, the result would identify an inconsistency between UD guidelines and Japanese treebank practice, with implications for cross-linguistic comparability of syntactic annotations. The discussion of alternative word definitions could inform guideline revisions for agglutinative languages, but the paper supplies no new data, treebank comparisons, or explicit criterion applications to support its central claim.

major comments (2)
  1. [Abstract] Abstract and introduction: the claim that SUW 'are not syntactic word as specified by the annotation guidelines' is asserted without quoting or applying any specific UD guideline criterion (e.g., the syntactic-word definition in the UD guidelines) to Japanese examples, leaving the mismatch interpretive rather than demonstrated.
  2. [UD guidelines discussion] Discussion of UD guidelines: the argument presupposes that the guidelines contain an operational, language-independent definition of syntactic words sufficient to exclude SUW, yet provides no direct test of this assumption against Japanese morphological structure or bunsetsu units.
minor comments (2)
  1. [Abstract] Abstract: grammatical agreement error ('syntactic word' should read 'syntactic words').
  2. [Abstract] Abstract: the phrase 'we find non-mainstream attempts' is imprecise; name the specific linguistic works referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below and indicate where revisions will be made to strengthen the explicit demonstration of our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the claim that SUW 'are not syntactic word as specified by the annotation guidelines' is asserted without quoting or applying any specific UD guideline criterion (e.g., the syntactic-word definition in the UD guidelines) to Japanese examples, leaving the mismatch interpretive rather than demonstrated.

    Authors: We agree that the abstract and introduction assert the central claim without direct quotation or application of specific UD criteria. Although the full manuscript references the guidelines' syntactic priorities, we will revise these sections to include explicit quotations from the UD syntactic word definition and apply the criteria to concrete Japanese examples involving Short Unit Words, thereby demonstrating the mismatch rather than leaving it interpretive.
    Revision: yes

  2. Referee: [UD guidelines discussion] Discussion of UD guidelines: the argument presupposes that the guidelines contain an operational, language-independent definition of syntactic words sufficient to exclude SUW, yet provides no direct test of this assumption against Japanese morphological structure or bunsetsu units.

    Authors: The manuscript contrasts the UD guidelines' syntactic criteria with traditional Japanese units such as bunsetsu and reviews alternative linguistic definitions. We acknowledge the value of a more direct test. In revision, we will add explicit applications of the UD syntactic word criteria to Japanese morphological structures and bunsetsu units, providing the requested direct comparison.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; argument applies external UD guidelines to Japanese units

The paper's central claim compares Short Unit Words against the syntactic word definition supplied by the external Universal Dependencies annotation guidelines and prior linguistic literature. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear; the argument consists of an interpretive mismatch between an independent external standard and the chosen annotation units. Because the argument is anchored in external standards rather than in its own outputs, it receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that UD guidelines define syntactic words in a way that can be applied to Japanese and that Short Unit Words have been chosen without satisfying that definition. No free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: UD annotation guidelines stipulate that the basic units are syntactic words whose definition is language-independent enough to apply to Japanese.
    Directly stated in the opening sentence of the abstract.
