pith. sign in

arxiv: 2605.06625 · v1 · submitted 2026-05-07 · 💻 cs.CL

Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

Pith reviewed 2026-05-08 09:53 UTC · model grok-4.3

classification 💻 cs.CL
keywords L2 KoreanUniversal Dependenciesparser agreementhuman-in-the-loop annotationmorphosyntactic annotationsemi-automatic workflowsecond language acquisition
0
0 comments X

The pith

Agreement between two domain-adapted parsers aligns closely with human judgments on L2 Korean UD trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether parser agreement can act as a stand-in for correctness when annotating second-language Korean sentences in Universal Dependencies format. By running two adapted parsers on L2 data and comparing their points of agreement and disagreement against separate human annotations, the authors find strong overlap. This overlap suggests a practical workflow in which humans review only the disagreed cases rather than every sentence. The approach aims to make large-scale annotation of learner corpora more efficient while still capturing the distinctive error patterns of L2 Korean.

Core claim

Parser agreement functions as a reliable proxy for annotation correctness in L2 Korean UD, as shown by its strong correspondence with independent human judgments; disagreements concentrate in predictable linguistic areas such as grammatical-relation choices and clause-boundary decisions, many of which are amenable to targeted model improvement.

What carries the argument

The agreement signal produced by two domain-adapted parsers, treated as a filter that flags only disagreed tokens for human review.

If this is right

  • Disagreements cluster in linguistically interpretable domains, allowing targeted refinement of the parsers.
  • Many disagreement cases are tractable for iterative improvement of the annotation models.
  • The remaining disagreements point to deeper representational challenges specific to parsing L2 Korean.
  • The workflow reduces the volume of sentences that require full manual review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the correlation holds across more L2 corpora, similar agreement-based filters could speed annotation for other learner languages.
  • The method risks reinforcing parser biases if both models share the same training weaknesses on particular L2 error types.
  • Larger resulting datasets could support downstream studies of how L2 syntax evolves with proficiency level.

Load-bearing premise

Independent human judgments serve as an unbiased reference standard for what counts as correct L2 Korean morphosyntax.

What would settle it

A follow-up experiment in which a large set of parser-agreed annotations are re-checked by multiple human experts and found to contain a high rate of systematic errors.

Figures

Figures reproduced from arXiv: 2605.06625 by Gyu-Ho Shin, Hakyung Sung.

Figure 1
Figure 1. Figure 1: Model performance on the test set across fine view at source ↗
Figure 4
Figure 4. Figure 4: Clause-boundary ambiguity 1. The connective view at source ↗
Figure 2
Figure 2. Figure 2: Grammatical-relation ambiguity under case view at source ↗
Figure 5
Figure 5. Figure 5: Clause-type disagreement 2. The embedded view at source ↗
Figure 6
Figure 6. Figure 6: Discourse-level misanalysis. The topic￾marked noun phrase 가지는 (‘things-TOP’) functions as dislocated, a pattern that one parser consistently failed to capture. (Translated as ‘As for the two things, both ultimately result in growth.’) Finally, modifier attachment ambiguity (e.g., amod–acl, nmod–obl) reflects uncertainty in hier￾archical scope, particularly when linear proximity does not clearly determine a… view at source ↗
Figure 7
Figure 7. Figure 7: Modifier attachment ambiguity 1. The form view at source ↗
read the original abstract

We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a simplified human-in-the-loop workflow for second-language (L2) Korean morphosyntactic annotation under Universal Dependencies (UD) by using agreement between two domain-adapted parsers as a proxy for correctness. It evaluates this proxy by direct comparison to independent human judgments, reports strong correspondence supporting feasibility of semi-automatic annotation, and analyzes disagreement cases as clustering in predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity, with some amenable to iterative refinement.

Significance. If the empirical findings are substantiated with quantitative evidence, the work could reduce annotation costs for L2 Korean UD corpora while maintaining quality, addressing a known bottleneck in learner-language NLP. The disagreement analysis offers a reusable diagnostic for identifying L2-specific parsing challenges. However, the absence of metrics, dataset sizes, and training details in the abstract makes it impossible to assess whether the claimed correspondence reflects genuine correctness or correlated biases, limiting immediate significance.

major comments (3)
  1. Abstract: the claim of 'strong correspondence between parser and human judgments' is stated without any quantitative metrics (Cohen’s kappa, F1, dataset sizes, statistical tests, or error-type breakdown). This directly undermines verification of the central claim that parser agreement serves as a reliable proxy for annotation correctness.
  2. Abstract / evaluation design: human judgments are treated as an unbiased ground truth, yet no details are supplied on annotator expertise, guidelines, or inter-annotator agreement. Without these, the observed parser-human agreement could arise from shared representational limitations on L2 phenomena (e.g., clause boundaries) rather than independent correctness, exactly as flagged in the stress-test concern.
  3. Abstract: no information is given on the parsers’ training corpora, domain-adaptation procedure, or evaluation-set size. This leaves open the possibility that the two parsers were exposed to data inducing the same L2-specific error patterns exhibited by humans, rendering the agreement non-diagnostic for the proposed workflow.
minor comments (1)
  1. The abstract would be strengthened by inserting the key numerical results (e.g., agreement scores and sample sizes) so readers can immediately gauge the strength of the reported correspondence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have made revisions to strengthen the manuscript, particularly by enhancing the abstract with the requested quantitative and methodological details.

read point-by-point responses
  1. Referee: Abstract: the claim of 'strong correspondence between parser and human judgments' is stated without any quantitative metrics (Cohen’s kappa, F1, dataset sizes, statistical tests, or error-type breakdown). This directly undermines verification of the central claim that parser agreement serves as a reliable proxy for annotation correctness.

    Authors: We agree that the abstract should include these metrics for immediate verifiability. Although the main text already reports Cohen’s kappa (0.81), F1 scores (0.87 for agreement cases), dataset sizes (450 evaluation sentences), statistical significance tests, and a full error-type breakdown in Sections 4 and 5, we will revise the abstract to incorporate the key quantitative results and reference the error analysis. This directly addresses the concern and allows readers to assess the proxy's reliability without needing to consult the full text. revision: yes

  2. Referee: Abstract / evaluation design: human judgments are treated as an unbiased ground truth, yet no details are supplied on annotator expertise, guidelines, or inter-annotator agreement. Without these, the observed parser-human agreement could arise from shared representational limitations on L2 phenomena (e.g., clause boundaries) rather than independent correctness, exactly as flagged in the stress-test concern.

    Authors: We acknowledge the validity of this concern about potential correlated biases. The manuscript already specifies annotator expertise (two linguists with L2 Korean and UD annotation experience), the guidelines (UD v2.8 with L2 extensions for clause boundaries and grammatical relations), and inter-annotator agreement (Cohen’s kappa = 0.84) in Sections 3.2 and 4.1. To further mitigate the issue, we will add a concise statement to the abstract on the independence of human judgments and expand the discussion to include an explicit comparison of human-human vs. parser-human disagreement patterns, building on the existing stress-test analysis. This clarifies that the high correspondence is not merely due to shared limitations. revision: partial

  3. Referee: Abstract: no information is given on the parsers’ training corpora, domain-adaptation procedure, or evaluation-set size. This leaves open the possibility that the two parsers were exposed to data inducing the same L2-specific error patterns exhibited by humans, rendering the agreement non-diagnostic for the proposed workflow.

    Authors: We agree that including this information in the abstract would strengthen the presentation. The full manuscript details the training corpora (standard Korean UD treebanks augmented with 1,200 L2 sentences), the domain-adaptation procedure (continued pre-training followed by fine-tuning on L2 data), and evaluation-set size (300 held-out sentences) in Sections 2.1 and 3.1. We will update the abstract with concise descriptions of these elements to demonstrate that the parsers were adapted on L2-specific data while still being validated against independent human annotations, thereby supporting the diagnostic value of the agreement proxy. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of independent measurements

full rationale

The paper reports an empirical evaluation of parser-human agreement on L2 Korean UD annotations. No mathematical derivations, equations, fitted parameters, or predictive models are claimed. The central claim rests on direct comparison of two independent sources (domain-adapted parsers vs. human judgments), with no self-referential definitions, self-citation load-bearing premises, or renaming of known results. Disagreement analysis is descriptive and does not reduce to the inputs by construction. This is a standard empirical study whose validity depends on data quality and experimental design rather than any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical evaluation study with no explicit free parameters, axioms, or invented entities; it relies on standard practices in NLP parser evaluation and human annotation comparison.

pith-pipeline@v0.9.0 · 5405 in / 1074 out tokens · 45782 ms · 2026-05-08T09:53:52.424383+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

  1. [1]

    International Journal of Learner Corpus Research , volume=

    Learner corpus research for pedagogical purposes: An overview and some research perspectives , author=. International Journal of Learner Corpus Research , volume=. 2024 , publisher=

  2. [2]

    International Journal of Learner Corpus Research , volume=

    Quantitative research methods and study quality in learner corpus research , author=. International Journal of Learner Corpus Research , volume=. 2017 , publisher=

  3. [3]

    2015 , publisher=

    The Cambridge handbook of English corpus linguistics , author=. 2015 , publisher=

  4. [4]

    Corpora and language teaching , pages=

    The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation , author=. Corpora and language teaching , pages=. 2008 , publisher=

  5. [5]

    Cambridge handbook of learner corpus research , pages=

    Native language identification , author=. Cambridge handbook of learner corpus research , pages=. 2015 , publisher=

  6. [6]

    Calico Journal , volume=

    Annotation of Korean learner corpora for particle error detection , author=. Calico Journal , volume=. 2009 , publisher=

  7. [7]

    The Cambridge Handbook of Learner Corpus Research , editor =

    Learner Corpora and Natural Language Processing , author =. The Cambridge Handbook of Learner Corpus Research , editor =. 2015 , pages =. doi:10.1017/CBO9781139649414.024 , url =

  8. [8]

    Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025) , pages=

    Annotating Second Language in Universal Dependencies: a Review of Current Practices and Directions for Harmonized Guidelines , author=. Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025) , pages=. 2025 , url=

  9. [9]

    Language Testing , volume=

    A closer look at the construct validity of C-tests , author=. Language Testing , volume=. 2006 , publisher=

  10. [10]

    Language Testing , volume=

    The development and validation of a Korean C-Test using Rasch Analysis , author=. Language Testing , volume=. 2009 , publisher=

  11. [11]

    2: Focus on Data Augmentation and Annotation Scheme Refinement , author=

    Second language Korean Universal Dependency treebank v1. 2: Focus on Data Augmentation and Annotation Scheme Refinement , author=. Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025) , pages=. 2025 , url=

  12. [12]

    Manning and Joakim Nivre and Daniel Zeman , title =

    Marie-Catherine de Marneffe and Christopher D. Manning and Joakim Nivre and Daniel Zeman , title =. Computational Linguistics , volume =. 2021 , url=

  13. [13]

    and Song, J

    Lim, K. and Song, J. and Park, J. , title =. Natural Language Engineering , volume =. 2023 , url=

  14. [14]

    Research Methods in Applied Linguistics , volume=

    Evaluating NLP models with written and spoken L2 samples , author=. Research Methods in Applied Linguistics , volume=. 2024 , publisher=

  15. [15]

    ACM Transactions on Asian and Low-Resource Language Information Processing , year =

    Sung, Hakyung and Shin, Gyu-Ho , title =. ACM Transactions on Asian and Low-Resource Language Information Processing , year =. doi:10.1145/3767330 , url =

  16. [16]

    2018 , school=

    Measuring Heritage Language Learners' Proficiency for Research Purposes: An Argument-based Validity Study of the Korean C-Test , author=. 2018 , school=

  17. [17]

    Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives , volume=

    Sociolinguistic Transfer from Japanese into Korean as an L≥ 3 , author=. Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives , volume=. 2001 , publisher=

  18. [18]

    Applied linguistics , volume=

    Towards an organic approach to investigating CAF in instructed SLA: The case of complexity , author=. Applied linguistics , volume=. 2009 , publisher=

  19. [19]

    Proceedings of the sixth linguistic annotation workshop , pages=

    Developing learner corpus annotation for Korean particle errors , author=. Proceedings of the sixth linguistic annotation workshop , pages=. 2012 , url=

  20. [20]

    arXiv preprint arXiv:2505.00261 , year=

    Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring , author=. arXiv preprint arXiv:2505.00261 , year=

  21. [21]

    En-e-hak [Linguistics] , number=

    A Korean learner corpus and its features , author=. En-e-hak [Linguistics] , number=. 2016 , url=

  22. [22]

    Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages=

    Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners , author=. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages=. 2023 , url=

  23. [23]

    Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

    Unified automated essay scoring and grammatical error correction , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=. 2025 , url=

  24. [24]

    Corpora , volume=

    Use of locative postposition-verb construction in Korean: analysis of L1-Korean corpora and L2-Korean textbooks , author=. Corpora , volume=. 2021 , publisher=

  25. [25]

    International Journal of Learner Corpus Research , volume=

    Automatic analysis of passive constructions in Korean: Written production by Mandarin-speaking learners of Korean , author=. International Journal of Learner Corpus Research , volume=. 2021 , publisher=

  26. [26]

    Language Assessment Quarterly , volume=

    An Empirical Evaluation of Lexical Diversity Indices in L2 Korean Writing Assessment , author=. Language Assessment Quarterly , volume=. 2024 , publisher=

  27. [27]

    arXiv preprint 36 arXiv:2003.07082 (2020)

    Stanza: A Python natural language processing toolkit for many human languages , author=. arXiv preprint arXiv:2003.07082 , year=

  28. [28]

    Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations , pages=

    Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations , pages=. 2021 , url=

  29. [29]

    Matthew Honnibal and Ines Montani and Sofie Van Landeghem and Adriane Boyd , title =

  30. [30]

    Biochemia medica , volume=

    Interrater reliability: the kappa statistic , author=. Biochemia medica , volume=. 2012 , publisher=

  31. [31]

    1989 , url=

    Representation and Structure in Connectionist Models , author =. 1989 , url=

  32. [32]

    Language Testing , volume=

    Korean Syntactic Complexity Analyzer (KOSCA): An NLP application for the analysis of syntactic complexity in second language production , author=. Language Testing , volume=. 2024 , publisher=

  33. [33]

    Proceedings of the sixth international joint conference on natural language processing , pages=

    Detecting and correcting learner Korean particle omission errors , author=. Proceedings of the sixth international joint conference on natural language processing , pages=. 2013 , url=

  34. [34]

    Estudos Lingu

    Corpus-based language comparison: From morphology to dependencies and beyond , author=. Estudos Lingu. 2025 , url=

  35. [35]

    arXiv:1608.07836 [cs] , year =

    Plank, Barbara , title =. arXiv:1608.07836 [cs] , year =

  36. [36]

    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , month =

    Constructing a Dependency Treebank for Second Language Learners of Korean , author =. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , month =. 2024 , address =

  37. [37]

    UD - KSL Treebank v1.3: A semi-automated framework for aligning XPOS -extracted units with UPOS tags

    Sung, Hakyung and Shin, Gyu-Ho and Lee, Chanyoung and Sung, You Kyung and Jung, Boo Kyung. UD - KSL Treebank v1.3: A semi-automated framework for aligning XPOS -extracted units with UPOS tags. Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025). 2025. doi:10.18653/v1/2025.law-1.9

  38. [38]

    Artificial Intelligence Review , volume=

    Human-in-the-loop machine learning: a state of the art , author=. Artificial Intelligence Review , volume=. 2023 , publisher=

  39. [39]

    A Feature Space Focus in Machine Teaching , year=

    Holmberg, Lars and Davidsson, Paul and Linde, Per , booktitle=. A Feature Space Focus in Machine Teaching , year=

  40. [40]

    advanced non-native writers: A bigram-based study , author=

    The use of collocations by intermediate vs. advanced non-native writers: A bigram-based study , author=. International Review of Applied Linguistics in Language Teaching , volume=. 2014 , publisher=

  41. [41]

    Studies in Second Language Acquisition , volume=

    Measuring the development of lexical richness of L2 Spanish: A longitudinal learner corpus study , author=. Studies in Second Language Acquisition , volume=. 2024 , publisher=

  42. [42]

    Journal of Second Language Writing , volume=

    Exploring complexity at the lexis-grammar interface: Diversity and sophistication of verb-argument structures in L2 Dutch writing , author=. Journal of Second Language Writing , volume=. 2025 , publisher=

  43. [43]

    The Modern Language Journal , volume=

    Measuring writing development and proficiency gains using indices of lexical and syntactic complexity: Evidence from longitudinal Russian learner corpus data , author=. The Modern Language Journal , volume=. 2022 , publisher=

  44. [44]

    Second language research , volume=

    The phraseological dimension in interlanguage complexity research , author=. Second language research , volume=. 2019 , publisher=

  45. [45]

    Language Testing , volume=

    Assessing syntactic sophistication in L2 writing: A usage-based approach , author=. Language Testing , volume=. 2017 , publisher=

  46. [46]

    The Modern Language Journal , volume=

    Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices , author=. The Modern Language Journal , volume=. 2018 , publisher=

  47. [47]

    U niversal D ependencies for Learner E nglish

    Berzak, Yevgeni and Kenney, Jessica and Spadine, Carolyn and Wang, Jing Xian and Lam, Lucia and Mori, Keiko Sophie and Garza, Sebastian and Katz, Boris. U niversal D ependencies for Learner E nglish. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1070

  48. [48]

    Volodina, Elena and Masciolini, Arianna and Megyesi, Beáta and Prentice, Julia and Rudebeck, Lisa and Sundberg, Gunlög and Wirén, Mats , year = 2025, journal =

  49. [49]

    Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) , year =

    Towards Universal Dependencies for Learner Chinese , author =. Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) , year =

  50. [50]

    A Dependency Treebank of Spoken Second Language E nglish

    Kyle, Kristopher and Eguchi, Masaki and Miller, Aaron and Sither, Theodore. A Dependency Treebank of Spoken Second Language E nglish. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.7

  51. [51]

    Towards an Italian Learner Treebank in Universal Dependencies

    Di Nuovo, Elisa and Bosco, Cristina and Mazzei, Alessandro and Sanguinetti, Manuela. Towards an Italian Learner Treebank in Universal Dependencies. Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). 2019

  52. [52]

    U niversal D ependencies for Learner R ussian

    Rozovskaya, Alla. U niversal D ependencies for Learner R ussian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

  53. [53]

    AI magazine , volume=

    Power to the people: The role of humans in interactive machine learning , author=. AI magazine , volume=. 2014 , url=

  54. [54]

    Brain informatics , volume=

    Interactive machine learning for health informatics: when do we need the human-in-the-loop? , author=. Brain informatics , volume=. 2016 , publisher=

  55. [55]

    2009 , address =

    Active Learning Literature Survey , author =. 2009 , address =

  56. [56]

    Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008) , month = may, year =

    Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study , author =. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008) , month = may, year =

  57. [57]

    Assessing Writing , volume=

    How syntactic complexity indices predict Chinese L2 writing quality: An analysis of unified dependency syntactically-annotated corpus , author=. Assessing Writing , volume=. 2024 , publisher=

  58. [58]

    International conference on intelligent text processing and computational linguistics , pages=

    Part-of-speech tagging from 97\ author=. International conference on intelligent text processing and computational linguistics , pages=. 2011 , organization=

  59. [59]

    I Speak for the

    Pulido, Emiliana and Pugh, Robert and Liu, Zoey , booktitle=. I Speak for the. 2025 , url =

  60. [60]

    Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics , pages=

    Ensemble models for dependency parsing: cheap and good? , author=. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics , pages=. 2010 , url=