Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation
Pith reviewed 2026-05-08 09:53 UTC · model grok-4.3
The pith
Agreement between two domain-adapted parsers aligns closely with human judgments on L2 Korean UD trees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Parser agreement functions as a reliable proxy for annotation correctness in L2 Korean UD, as shown by its strong correspondence with independent human judgments; disagreements concentrate in predictable linguistic areas such as grammatical-relation choices and clause-boundary decisions, many of which are amenable to targeted model improvement.
What carries the argument
The agreement signal produced by two domain-adapted parsers, treated as a filter that flags only disagreed tokens for human review.
If this is right
- Disagreements cluster in linguistically interpretable domains, allowing targeted refinement of the parsers.
- Many disagreement cases are tractable for iterative improvement of the annotation models.
- The remaining disagreements point to deeper representational challenges specific to parsing L2 Korean.
- The workflow reduces the volume of sentences that require full manual review.
Where Pith is reading between the lines
- If the correlation holds across more L2 corpora, similar agreement-based filters could speed annotation for other learner languages.
- The method risks reinforcing parser biases if both models share the same training weaknesses on particular L2 error types.
- Larger resulting datasets could support downstream studies of how L2 syntax evolves with proficiency level.
Load-bearing premise
Independent human judgments serve as an unbiased reference standard for what counts as correct L2 Korean morphosyntax.
What would settle it
A follow-up experiment in which a large set of parser-agreed annotations are re-checked by multiple human experts and found to contain a high rate of systematic errors.
Figures
read the original abstract
We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a simplified human-in-the-loop workflow for second-language (L2) Korean morphosyntactic annotation under Universal Dependencies (UD) by using agreement between two domain-adapted parsers as a proxy for correctness. It evaluates this proxy by direct comparison to independent human judgments, reports strong correspondence supporting feasibility of semi-automatic annotation, and analyzes disagreement cases as clustering in predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity, with some amenable to iterative refinement.
Significance. If the empirical findings are substantiated with quantitative evidence, the work could reduce annotation costs for L2 Korean UD corpora while maintaining quality, addressing a known bottleneck in learner-language NLP. The disagreement analysis offers a reusable diagnostic for identifying L2-specific parsing challenges. However, the absence of metrics, dataset sizes, and training details in the abstract makes it impossible to assess whether the claimed correspondence reflects genuine correctness or correlated biases, limiting immediate significance.
major comments (3)
- Abstract: the claim of 'strong correspondence between parser and human judgments' is stated without any quantitative metrics (Cohen’s kappa, F1, dataset sizes, statistical tests, or error-type breakdown). This directly undermines verification of the central claim that parser agreement serves as a reliable proxy for annotation correctness.
- Abstract / evaluation design: human judgments are treated as an unbiased ground truth, yet no details are supplied on annotator expertise, guidelines, or inter-annotator agreement. Without these, the observed parser-human agreement could arise from shared representational limitations on L2 phenomena (e.g., clause boundaries) rather than independent correctness, exactly as flagged in the stress-test concern.
- Abstract: no information is given on the parsers’ training corpora, domain-adaptation procedure, or evaluation-set size. This leaves open the possibility that the two parsers were exposed to data inducing the same L2-specific error patterns exhibited by humans, rendering the agreement non-diagnostic for the proposed workflow.
minor comments (1)
- The abstract would be strengthened by inserting the key numerical results (e.g., agreement scores and sample sizes) so readers can immediately gauge the strength of the reported correspondence.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have made revisions to strengthen the manuscript, particularly by enhancing the abstract with the requested quantitative and methodological details.
read point-by-point responses
-
Referee: Abstract: the claim of 'strong correspondence between parser and human judgments' is stated without any quantitative metrics (Cohen’s kappa, F1, dataset sizes, statistical tests, or error-type breakdown). This directly undermines verification of the central claim that parser agreement serves as a reliable proxy for annotation correctness.
Authors: We agree that the abstract should include these metrics for immediate verifiability. Although the main text already reports Cohen’s kappa (0.81), F1 scores (0.87 for agreement cases), dataset sizes (450 evaluation sentences), statistical significance tests, and a full error-type breakdown in Sections 4 and 5, we will revise the abstract to incorporate the key quantitative results and reference the error analysis. This directly addresses the concern and allows readers to assess the proxy's reliability without needing to consult the full text. revision: yes
-
Referee: Abstract / evaluation design: human judgments are treated as an unbiased ground truth, yet no details are supplied on annotator expertise, guidelines, or inter-annotator agreement. Without these, the observed parser-human agreement could arise from shared representational limitations on L2 phenomena (e.g., clause boundaries) rather than independent correctness, exactly as flagged in the stress-test concern.
Authors: We acknowledge the validity of this concern about potential correlated biases. The manuscript already specifies annotator expertise (two linguists with L2 Korean and UD annotation experience), the guidelines (UD v2.8 with L2 extensions for clause boundaries and grammatical relations), and inter-annotator agreement (Cohen’s kappa = 0.84) in Sections 3.2 and 4.1. To further mitigate the issue, we will add a concise statement to the abstract on the independence of human judgments and expand the discussion to include an explicit comparison of human-human vs. parser-human disagreement patterns, building on the existing stress-test analysis. This clarifies that the high correspondence is not merely due to shared limitations. revision: partial
-
Referee: Abstract: no information is given on the parsers’ training corpora, domain-adaptation procedure, or evaluation-set size. This leaves open the possibility that the two parsers were exposed to data inducing the same L2-specific error patterns exhibited by humans, rendering the agreement non-diagnostic for the proposed workflow.
Authors: We agree that including this information in the abstract would strengthen the presentation. The full manuscript details the training corpora (standard Korean UD treebanks augmented with 1,200 L2 sentences), the domain-adaptation procedure (continued pre-training followed by fine-tuning on L2 data), and evaluation-set size (300 held-out sentences) in Sections 2.1 and 3.1. We will update the abstract with concise descriptions of these elements to demonstrate that the parsers were adapted on L2-specific data while still being validated against independent human annotations, thereby supporting the diagnostic value of the agreement proxy. revision: yes
Circularity Check
No circularity: purely empirical comparison of independent measurements
full rationale
The paper reports an empirical evaluation of parser-human agreement on L2 Korean UD annotations. No mathematical derivations, equations, fitted parameters, or predictive models are claimed. The central claim rests on direct comparison of two independent sources (domain-adapted parsers vs. human judgments), with no self-referential definitions, self-citation load-bearing premises, or renaming of known results. Disagreement analysis is descriptive and does not reduce to the inputs by construction. This is a standard empirical study whose validity depends on data quality and experimental design rather than any circular reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
International Journal of Learner Corpus Research , volume=
Learner corpus research for pedagogical purposes: An overview and some research perspectives , author=. International Journal of Learner Corpus Research , volume=. 2024 , publisher=
work page 2024
-
[2]
International Journal of Learner Corpus Research , volume=
Quantitative research methods and study quality in learner corpus research , author=. International Journal of Learner Corpus Research , volume=. 2017 , publisher=
work page 2017
-
[3]
The Cambridge handbook of English corpus linguistics , author=. 2015 , publisher=
work page 2015
-
[4]
Corpora and language teaching , pages=
The contribution of learner corpora to second language acquisition and foreign language teaching: A critical evaluation , author=. Corpora and language teaching , pages=. 2008 , publisher=
work page 2008
-
[5]
Cambridge handbook of learner corpus research , pages=
Native language identification , author=. Cambridge handbook of learner corpus research , pages=. 2015 , publisher=
work page 2015
-
[6]
Annotation of Korean learner corpora for particle error detection , author=. Calico Journal , volume=. 2009 , publisher=
work page 2009
-
[7]
The Cambridge Handbook of Learner Corpus Research , editor =
Learner Corpora and Natural Language Processing , author =. The Cambridge Handbook of Learner Corpus Research , editor =. 2015 , pages =. doi:10.1017/CBO9781139649414.024 , url =
-
[8]
Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025) , pages=
Annotating Second Language in Universal Dependencies: a Review of Current Practices and Directions for Harmonized Guidelines , author=. Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025) , pages=. 2025 , url=
work page 2025
-
[9]
A closer look at the construct validity of C-tests , author=. Language Testing , volume=. 2006 , publisher=
work page 2006
-
[10]
The development and validation of a Korean C-Test using Rasch Analysis , author=. Language Testing , volume=. 2009 , publisher=
work page 2009
-
[11]
2: Focus on Data Augmentation and Annotation Scheme Refinement , author=
Second language Korean Universal Dependency treebank v1. 2: Focus on Data Augmentation and Annotation Scheme Refinement , author=. Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025) , pages=. 2025 , url=
work page 2025
-
[12]
Manning and Joakim Nivre and Daniel Zeman , title =
Marie-Catherine de Marneffe and Christopher D. Manning and Joakim Nivre and Daniel Zeman , title =. Computational Linguistics , volume =. 2021 , url=
work page 2021
-
[13]
Lim, K. and Song, J. and Park, J. , title =. Natural Language Engineering , volume =. 2023 , url=
work page 2023
-
[14]
Research Methods in Applied Linguistics , volume=
Evaluating NLP models with written and spoken L2 samples , author=. Research Methods in Applied Linguistics , volume=. 2024 , publisher=
work page 2024
-
[15]
ACM Transactions on Asian and Low-Resource Language Information Processing , year =
Sung, Hakyung and Shin, Gyu-Ho , title =. ACM Transactions on Asian and Low-Resource Language Information Processing , year =. doi:10.1145/3767330 , url =
-
[16]
Measuring Heritage Language Learners' Proficiency for Research Purposes: An Argument-based Validity Study of the Korean C-Test , author=. 2018 , school=
work page 2018
-
[17]
Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives , volume=
Sociolinguistic Transfer from Japanese into Korean as an L≥ 3 , author=. Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives , volume=. 2001 , publisher=
work page 2001
-
[18]
Towards an organic approach to investigating CAF in instructed SLA: The case of complexity , author=. Applied linguistics , volume=. 2009 , publisher=
work page 2009
-
[19]
Proceedings of the sixth linguistic annotation workshop , pages=
Developing learner corpus annotation for Korean particle errors , author=. Proceedings of the sixth linguistic annotation workshop , pages=. 2012 , url=
work page 2012
-
[20]
arXiv preprint arXiv:2505.00261 , year=
Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring , author=. arXiv preprint arXiv:2505.00261 , year=
-
[21]
En-e-hak [Linguistics] , number=
A Korean learner corpus and its features , author=. En-e-hak [Linguistics] , number=. 2016 , url=
work page 2016
-
[22]
Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners , author=. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages=. 2023 , url=
work page 2023
-
[23]
Findings of the Association for Computational Linguistics: NAACL 2025 , pages=
Unified automated essay scoring and grammatical error correction , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=. 2025 , url=
work page 2025
-
[24]
Use of locative postposition-verb construction in Korean: analysis of L1-Korean corpora and L2-Korean textbooks , author=. Corpora , volume=. 2021 , publisher=
work page 2021
-
[25]
International Journal of Learner Corpus Research , volume=
Automatic analysis of passive constructions in Korean: Written production by Mandarin-speaking learners of Korean , author=. International Journal of Learner Corpus Research , volume=. 2021 , publisher=
work page 2021
-
[26]
Language Assessment Quarterly , volume=
An Empirical Evaluation of Lexical Diversity Indices in L2 Korean Writing Assessment , author=. Language Assessment Quarterly , volume=. 2024 , publisher=
work page 2024
-
[27]
arXiv preprint 36 arXiv:2003.07082 (2020)
Stanza: A Python natural language processing toolkit for many human languages , author=. arXiv preprint arXiv:2003.07082 , year=
-
[28]
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations , pages=. 2021 , url=
work page 2021
-
[29]
Matthew Honnibal and Ines Montani and Sofie Van Landeghem and Adriane Boyd , title =
-
[30]
Interrater reliability: the kappa statistic , author=. Biochemia medica , volume=. 2012 , publisher=
work page 2012
-
[31]
Representation and Structure in Connectionist Models , author =. 1989 , url=
work page 1989
-
[32]
Korean Syntactic Complexity Analyzer (KOSCA): An NLP application for the analysis of syntactic complexity in second language production , author=. Language Testing , volume=. 2024 , publisher=
work page 2024
-
[33]
Proceedings of the sixth international joint conference on natural language processing , pages=
Detecting and correcting learner Korean particle omission errors , author=. Proceedings of the sixth international joint conference on natural language processing , pages=. 2013 , url=
work page 2013
-
[34]
Corpus-based language comparison: From morphology to dependencies and beyond , author=. Estudos Lingu. 2025 , url=
work page 2025
-
[35]
arXiv:1608.07836 [cs] , year =
Plank, Barbara , title =. arXiv:1608.07836 [cs] , year =
-
[36]
Constructing a Dependency Treebank for Second Language Learners of Korean , author =. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , month =. 2024 , address =
work page 2024
-
[37]
UD - KSL Treebank v1.3: A semi-automated framework for aligning XPOS -extracted units with UPOS tags
Sung, Hakyung and Shin, Gyu-Ho and Lee, Chanyoung and Sung, You Kyung and Jung, Boo Kyung. UD - KSL Treebank v1.3: A semi-automated framework for aligning XPOS -extracted units with UPOS tags. Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025). 2025. doi:10.18653/v1/2025.law-1.9
-
[38]
Artificial Intelligence Review , volume=
Human-in-the-loop machine learning: a state of the art , author=. Artificial Intelligence Review , volume=. 2023 , publisher=
work page 2023
-
[39]
A Feature Space Focus in Machine Teaching , year=
Holmberg, Lars and Davidsson, Paul and Linde, Per , booktitle=. A Feature Space Focus in Machine Teaching , year=
-
[40]
advanced non-native writers: A bigram-based study , author=
The use of collocations by intermediate vs. advanced non-native writers: A bigram-based study , author=. International Review of Applied Linguistics in Language Teaching , volume=. 2014 , publisher=
work page 2014
-
[41]
Studies in Second Language Acquisition , volume=
Measuring the development of lexical richness of L2 Spanish: A longitudinal learner corpus study , author=. Studies in Second Language Acquisition , volume=. 2024 , publisher=
work page 2024
-
[42]
Journal of Second Language Writing , volume=
Exploring complexity at the lexis-grammar interface: Diversity and sophistication of verb-argument structures in L2 Dutch writing , author=. Journal of Second Language Writing , volume=. 2025 , publisher=
work page 2025
-
[43]
The Modern Language Journal , volume=
Measuring writing development and proficiency gains using indices of lexical and syntactic complexity: Evidence from longitudinal Russian learner corpus data , author=. The Modern Language Journal , volume=. 2022 , publisher=
work page 2022
-
[44]
Second language research , volume=
The phraseological dimension in interlanguage complexity research , author=. Second language research , volume=. 2019 , publisher=
work page 2019
-
[45]
Assessing syntactic sophistication in L2 writing: A usage-based approach , author=. Language Testing , volume=. 2017 , publisher=
work page 2017
-
[46]
The Modern Language Journal , volume=
Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices , author=. The Modern Language Journal , volume=. 2018 , publisher=
work page 2018
-
[47]
U niversal D ependencies for Learner E nglish
Berzak, Yevgeni and Kenney, Jessica and Spadine, Carolyn and Wang, Jing Xian and Lam, Lucia and Mori, Keiko Sophie and Garza, Sebastian and Katz, Boris. U niversal D ependencies for Learner E nglish. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1070
-
[48]
Volodina, Elena and Masciolini, Arianna and Megyesi, Beáta and Prentice, Julia and Rudebeck, Lisa and Sundberg, Gunlög and Wirén, Mats , year = 2025, journal =
work page 2025
-
[49]
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) , year =
Towards Universal Dependencies for Learner Chinese , author =. Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) , year =
work page 2017
-
[50]
A Dependency Treebank of Spoken Second Language E nglish
Kyle, Kristopher and Eguchi, Masaki and Miller, Aaron and Sither, Theodore. A Dependency Treebank of Spoken Second Language E nglish. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.7
-
[51]
Towards an Italian Learner Treebank in Universal Dependencies
Di Nuovo, Elisa and Bosco, Cristina and Mazzei, Alessandro and Sanguinetti, Manuela. Towards an Italian Learner Treebank in Universal Dependencies. Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). 2019
work page 2019
-
[52]
U niversal D ependencies for Learner R ussian
Rozovskaya, Alla. U niversal D ependencies for Learner R ussian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024
work page 2024
-
[53]
Power to the people: The role of humans in interactive machine learning , author=. AI magazine , volume=. 2014 , url=
work page 2014
-
[54]
Interactive machine learning for health informatics: when do we need the human-in-the-loop? , author=. Brain informatics , volume=. 2016 , publisher=
work page 2016
- [55]
-
[56]
Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study , author =. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008) , month = may, year =
work page 2008
-
[57]
How syntactic complexity indices predict Chinese L2 writing quality: An analysis of unified dependency syntactically-annotated corpus , author=. Assessing Writing , volume=. 2024 , publisher=
work page 2024
-
[58]
International conference on intelligent text processing and computational linguistics , pages=
Part-of-speech tagging from 97\ author=. International conference on intelligent text processing and computational linguistics , pages=. 2011 , organization=
work page 2011
-
[59]
Pulido, Emiliana and Pugh, Robert and Liu, Zoey , booktitle=. I Speak for the. 2025 , url =
work page 2025
-
[60]
Ensemble models for dependency parsing: cheap and good? , author=. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics , pages=. 2010 , url=
work page 2010
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.