
arxiv: 2512.22487 · v2 · pith:57QBHAIQnew · submitted 2025-12-27 · 💻 cs.CL

Constituency Structure over Eojeol in Korean Treebanks

classification: cs.CL
keywords: constituency, korean, eojeol-based, treebanks, layer, structure, terminal, alignment

The design of Korean constituency treebanks raises a central representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals can obscure the distinction between word-internal morphology and phrase-level syntactic structure, and can create mismatches with eojeol-based dependency resources. This paper argues for an eojeol-based constituency representation, with morphological segmentation and fine-grained POS information encoded in a separate, non-constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong, Penn Korean, and KAIST treebanks can be compared over a shared eojeol-based constituency backbone. Building on this result, we outline an eojeol-based annotation scheme that preserves interpretable constituency, supports cross-treebank comparison and constituency-dependency alignment, and provides a surface-form terminal layer for future end-to-end Korean constituency parsing.
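To make the proposed separation concrete, the following is a minimal Python sketch of a two-layer representation, not the paper's actual annotation scheme. The class names (`Eojeol`, `Node`), the helper functions, the example sentence, and the phrase labels are all illustrative assumptions; the morpheme tags follow Sejong-style POS conventions. The point is only that the constituency tree bottoms out in whole eojeol, while morphological segmentation is attached to each eojeol as side information that never becomes a tree node.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Eojeol:
    """A surface-form terminal (hypothetical class, for illustration only)."""
    surface: str
    # Separate, non-constituent layer: (morpheme, POS tag) pairs.
    morphemes: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Node:
    """A phrase-level constituent whose leaves are whole eojeol."""
    label: str
    children: List[Union["Node", Eojeol]]


def terminals(node: Node) -> List[Eojeol]:
    """Collect the eojeol terminals of a tree in left-to-right order."""
    out: List[Eojeol] = []
    for child in node.children:
        if isinstance(child, Eojeol):
            out.append(child)
        else:
            out.extend(terminals(child))
    return out


def bracket(node: Union[Node, Eojeol]) -> str:
    """Render the constituency layer only; morphemes are not shown."""
    if isinstance(node, Eojeol):
        return node.surface
    return "(" + node.label + " " + " ".join(bracket(c) for c in node.children) + ")"


# Illustrative sentence "나는 학교에 갔다" ("I went to school"); labels are assumed.
tree = Node("S", [
    Node("NP", [Eojeol("나는", [("나", "NP"), ("는", "JX")])]),
    Node("VP", [
        Node("NP", [Eojeol("학교에", [("학교", "NNG"), ("에", "JKB")])]),
        Node("VP", [Eojeol("갔다", [("가", "VV"), ("았", "EP"), ("다", "EF")])]),
    ]),
])

print(bracket(tree))
print([e.surface for e in terminals(tree)])
```

Because the terminal sequence equals the whitespace-delimited eojeol sequence, the same terminals can in principle be shared with an eojeol-based dependency resource, which is the alignment property the abstract emphasizes.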

