THIVLVC: Retrieval Augmented Dependency Parsing for Latin
Pith reviewed 2026-05-10 20:10 UTC · model grok-4.3
The pith
A retrieval-augmented system refines UDPipe parses for Latin by pulling similar treebank examples to guide an LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
THIVLVC retrieves structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompts an LLM to refine the UDPipe baseline parse according to UD guidelines. The paper reports a 17-point CLAS gain on Seneca's poetry and a 1.5-point gain on Thomas Aquinas's prose. In a double-blind error analysis of 300 divergences from the gold standard, 53.3 percent of unanimous annotator decisions favor THIVLVC.
What carries the argument
The retrieval step that selects structurally similar treebank entries by sentence length and POS n-gram similarity to supply in-context examples for the LLM to correct the UDPipe baseline according to UD guidelines.
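A minimal Python sketch of this retrieval score, using the weights the paper reports in its evaluation protocol (0.33 for length, 0.33 for POS bigrams, 0.34 for POS trigrams). The function names, the handling of empty n-gram sets, and the reading of the trigram term as the trigram analogue of the bigram Jaccard overlap are assumptions made for illustration, not the authors' code.

```python
from typing import List, Sequence, Set, Tuple


def pos_ngrams(tags: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of POS n-grams in a tag sequence (empty if the sentence is shorter than n)."""
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets."""
    union = a | b
    # Assumption: two empty sets count as zero overlap, not perfect overlap.
    return len(a & b) / len(union) if union else 0.0


def length_similarity(q: Sequence[str], s: Sequence[str]) -> float:
    """1 - | |q| - |s| | / max(|q|, |s|), with lengths counted in tokens."""
    longest = max(len(q), len(s))
    return 1.0 - abs(len(q) - len(s)) / longest if longest else 1.0


def sim_struct(q: Sequence[str], s: Sequence[str],
               weights: Tuple[float, float, float] = (0.33, 0.33, 0.34)) -> float:
    """Weighted sum of length similarity and POS bigram/trigram overlap."""
    w_len, w_bi, w_tri = weights
    return (w_len * length_similarity(q, s)
            + w_bi * jaccard(pos_ngrams(q, 2), pos_ngrams(s, 2))
            + w_tri * jaccard(pos_ngrams(q, 3), pos_ngrams(s, 3)))


def retrieve(query: Sequence[str], knowledge_base: List[Sequence[str]],
             k: int = 5) -> List[Sequence[str]]:
    """Return the k candidates with the highest structural similarity to the query."""
    return sorted(knowledge_base, key=lambda s: sim_struct(query, s), reverse=True)[:k]
```

Each sentence is represented only by its POS tag sequence, so the score never sees heads or relations. That is why the referee asks whether high scores actually track similar trees.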
If this is right
- The system raises CLAS by 17 points on Seneca's poetic Latin over the UDPipe baseline.
- It raises CLAS by 1.5 points on Thomas Aquinas's prose Latin over the same baseline.
- In 300 divergences examined by double-blind annotators, 53.3 percent of unanimous decisions favor THIVLVC output over the gold standard.
- The results indicate annotation inconsistencies both within individual treebanks and across different Latin sources.
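For readers unfamiliar with the metric behind these gains, here is a simplified sketch of CLAS as it was defined for the CoNLL 2018 shared task: labeled attachment F1 restricted to content-word relations, i.e. ignoring `aux`, `case`, `cc`, `clf`, `cop`, `det`, `mark`, and `punct`. It assumes gold tokenization, so gold and predicted tokens align one to one. Whether EvaLatin 2026 scores relation subtypes is not stated on this page, so `ignore_subtypes` is a hypothetical switch; the official evaluation script remains the reference.

```python
from typing import List, Tuple

# Functional relations excluded from CLAS (universal labels, without subtypes).
FUNCTIONAL_RELATIONS = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}

Arc = Tuple[int, str]  # (head index, dependency relation)


def _label(rel: str, ignore_subtypes: bool) -> str:
    """Relation label used for matching, optionally stripped of its subtype."""
    return rel.split(":")[0] if ignore_subtypes else rel


def _is_content(rel: str) -> bool:
    """True if the relation's universal part is not a functional relation."""
    return rel.split(":")[0] not in FUNCTIONAL_RELATIONS


def clas_f1(gold: List[Arc], pred: List[Arc], ignore_subtypes: bool = True) -> float:
    """CLAS as an F1 score over content-word arcs, assuming aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same number of tokens")
    gold_content = sum(_is_content(rel) for _, rel in gold)
    pred_content = sum(_is_content(rel) for _, rel in pred)
    correct = sum(
        1
        for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred)
        if _is_content(g_rel)
        and g_head == p_head
        and _label(g_rel, ignore_subtypes) == _label(p_rel, ignore_subtypes)
    )
    if correct == 0:
        return 0.0
    precision = correct / pred_content
    recall = correct / gold_content
    return 2 * precision * recall / (precision + recall)
```

With `ignore_subtypes=False`, predicting `advmod:lmod` where the gold has `advmod` counts as an error, which is the kind of divergence the paper's error analysis discusses.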
Where Pith is reading between the lines
- The same retrieval-plus-LLM pattern could be tested on other low-resource historical languages that already possess small treebanks.
- Replacing the current length-and-n-gram matcher with syntax-aware retrieval might widen the gap between poetry and prose performance.
- The error-analysis finding of annotation inconsistency suggests that re-annotating a subset of the CIRCSE treebank with stricter genre controls could raise future baseline scores.
Load-bearing premise
That sentences found by length and POS n-gram similarity are close enough in structure to let the LLM correct baseline errors without introducing fresh mistakes.
What would settle it
Running the system on a fresh set of Latin sentences where the retrieved examples produce no CLAS gain or where unanimous human judges consistently prefer the original UDPipe parses.
Original abstract
We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes THIVLVC, a two-stage system for the EvaLatin 2026 Latin dependency parsing task. Given an input sentence, it retrieves entries from the CIRCSE treebank via sentence length and POS n-gram similarity, then prompts an LLM to refine the UDPipe baseline parse using the retrieved examples plus UD guidelines. Two configurations are evaluated (with and without retrieval). On Seneca poetry the system reports a +17 CLAS gain over UDPipe; on Thomas Aquinas prose the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences finds that, among cases with unanimous annotator decisions, 53.3 % favor THIVLVC.
Significance. If the gains are shown to be robust, statistically significant, and attributable to the retrieval step rather than LLM prompting alone, the work would supply a practical RAG-based recipe for improving parsing on free-word-order historical texts and would usefully flag annotation inconsistencies across treebanks. The explicit no-retrieval control is a strength.
major comments (3)
- [Retrieval approach] Retrieval approach: sentence length plus POS n-gram similarity is presented as the mechanism that supplies structurally useful in-context examples, yet no quantitative check (e.g., average UAS or labeled-attachment similarity between query and retrieved trees) is reported. In a free-word-order language this proxy can match local POS sequences while differing globally in attachment scope, leaving open the possibility that the RAG component contributes little beyond the LLM's prior knowledge.
- [Experimental results] Results: the headline CLAS deltas (+17 on poetry, +1.5 on prose) are given without statistical significance tests, standard errors, or variance estimates across multiple LLM generations. Because LLM outputs are stochastic, the absence of these controls makes it impossible to judge whether the observed improvements exceed what would be expected from prompt variability alone.
- [Error analysis] Error analysis: the sampling frame for the 300 divergences, the inter-annotator agreement rate before unanimity filtering, and the precise decision criteria used by the three annotators are not stated. Without these details the 53.3 % preference statistic cannot be interpreted as reliable evidence that THIVLVC outperforms the gold standard.
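The controls requested in the Results comment could be sketched as follows: summarize CLAS over repeated LLM runs, then test paired score differences against zero. This sketch uses an exact sign-flip permutation test rather than a t-test so that it needs only the standard library. The function names are hypothetical and the numbers in the usage lines are placeholders, not results from the paper.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Mean, sample standard deviation, and standard error of CLAS over repeated runs."""
    sd = stdev(scores)
    return mean(scores), sd, sd / sqrt(len(scores))


def sign_flip_p_value(deltas: Sequence[float]) -> float:
    """Exact two-sided p-value for paired differences under random sign flips.

    Enumerates all 2**n sign assignments, so n should stay small (roughly n <= 20).
    """
    n = len(deltas)
    observed = abs(sum(deltas)) / n
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, deltas))) / n
        # Small tolerance so ties with the observed value are counted.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / 2 ** n


# Placeholder values for illustration only.
mean_clas, sd_clas, se_clas = summarize_runs([61.0, 62.0, 63.0])
p_value = sign_flip_p_value([1.0, 2.0, 3.0, 4.0])
```

The differences passed to `sign_flip_p_value` can be paired per run or per sentence. For larger samples, a Monte Carlo version that draws random sign flips would replace the full enumeration.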
minor comments (2)
- [Methods] Full prompt templates (including the exact UD guideline excerpts supplied to the LLM) are not reproduced; their inclusion would greatly aid reproducibility.
- [Baselines] The manuscript should explicitly compare the RAG configuration against a pure few-shot LLM prompt that uses the same number of examples but drawn randomly rather than by the length/POS-n-gram heuristic.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.
-
Referee: Retrieval approach: sentence length plus POS n-gram similarity is presented as the mechanism that supplies structurally useful in-context examples, yet no quantitative check (e.g., average UAS or labeled-attachment similarity between query and retrieved trees) is reported. In a free-word-order language this proxy can match local POS sequences while differing globally in attachment scope, leaving open the possibility that the RAG component contributes little beyond the LLM's prior knowledge.
Authors: We selected sentence length and POS n-gram overlap as an efficient, language-agnostic proxy for retrieving locally coherent examples in a low-resource setting. We acknowledge that this does not guarantee global structural similarity and that a quantitative validation would better isolate the contribution of retrieval. In the revised version we will add an analysis reporting average UAS/LAS between gold parses of retrieved sentences and query sentences, plus a breakdown of attachment-scope differences, to demonstrate that the chosen metric supplies useful in-context signal beyond the LLM's prior knowledge. revision: yes
-
Referee: Results: the headline CLAS deltas (+17 on poetry, +1.5 on prose) are given without statistical significance tests, standard errors, or variance estimates across multiple LLM generations. Because LLM outputs are stochastic, the absence of these controls makes it impossible to judge whether the observed improvements exceed what would be expected from prompt variability alone.
Authors: We agree that stochasticity in LLM decoding requires explicit controls. The revised manuscript will include results from multiple generations with fixed seeds, reporting mean CLAS scores, standard deviations, and paired statistical tests (e.g., t-tests) against the UDPipe baseline. This will establish that the reported gains are robust and exceed prompt-induced variance. revision: yes
-
Referee: Error analysis: the sampling frame for the 300 divergences, the inter-annotator agreement rate before unanimity filtering, and the precise decision criteria used by the three annotators are not stated. Without these details the 53.3 % preference statistic cannot be interpreted as reliable evidence that THIVLVC outperforms the gold standard.
Authors: We will expand the error-analysis section to specify the sampling procedure (random selection among sentences where THIVLVC and gold standard diverged), the pre-filtering inter-annotator agreement (Fleiss' kappa), the exact decision rubric (UD guideline adherence and contextual consistency), and the number of unanimous cases. These additions will allow readers to evaluate the 53.3 % figure appropriately. revision: yes
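The pre-filtering agreement statistic named in the last response, Fleiss' kappa, can be computed as in this sketch. Each row holds, for one divergence, how many annotators chose each category. The category layout in the example is hypothetical, since the page does not list the annotators' actual options.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of raters (>= 2)")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical columns: [prefers Gold, prefers THIVLVC], three annotators per item.
example = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(example)
```

Reporting this value on all 300 divergences, before keeping only unanimous cases, would show how much of the sample the unanimity filter discards.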
Circularity Check
No significant circularity; empirical evaluation against external benchmarks
The paper presents a retrieval-augmented parsing pipeline evaluated directly on held-out gold-standard treebanks (CIRCSE, EvaLatin 2026 data) and compared to an external baseline (UDPipe). No equations, fitted parameters, or self-citations appear in the provided text. Claims of improvement (+17 CLAS on poetry, +1.5 on prose) and the 53.3% annotator preference are measured against independent gold annotations and double-blind human judgments, with no reduction of outputs to inputs by construction. The derivation chain consists of standard RAG prompting steps whose success is falsifiable on external data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Retrieved examples from CIRCSE will be structurally similar enough to guide correct refinements.
- domain assumption: UD annotation guidelines provide a reliable, consistent reference for the LLM.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction. The EvaLatin 2026 Dependency Task (Iurescia et al., 2026) invites participants to parse Latin texts in two genres: Classical poetry (Seneca) and philosophical prose (Thomas Aquinas). Previous systems have relied on supervised neural models trained on existing treebanks. Such models learn whatever patterns the training data contains, i...
-
[2]
Description of the System. Our system is a two-stage pipeline: (1) retrieval of structurally similar sentences from CIRCSE, and (2) generation, where an LLM refines a baseline parse using the retrieved examples and UD guidelines. Information Retrieval. Given an input sentence, we retrieve the k = 5 most similar sentences from the training set of the CIRC...
-
[3]
Evaluation Protocol. Retrieval strategies. We compared three retrieval strategies. In each case, q denotes the query sentence and s a candidate from the knowledge base.
-
[4]
TF-IDF (Baseline). Cosine similarity, where $v_q$ and $v_s$ are the TF-IDF vectors of word forms for q and s: $\mathrm{sim}_{\mathrm{tfidf}}(q, s) = \frac{v_q \cdot v_s}{\lVert v_q \rVert \, \lVert v_s \rVert}$
-
[5]
Structural (Length + POS n-grams). A weighted combination of sentence length similarity and POS n-gram Jaccard overlap: $\mathrm{sim}_{\mathrm{struct}}(q, s) = 0.33\, f_{\mathrm{len}} + 0.33\, f_{\mathrm{bi}} + 0.34\, f_{\mathrm{tri}}$ (1), where $f_{\mathrm{len}} = 1 - \frac{\bigl|\, |q| - |s| \,\bigr|}{\max(|q|, |s|)}$ is the normalized length similarity ($|q|$ and $|s|$ denote sentence lengths in tokens), $f_{\mathrm{bi}} = J(\mathrm{bigrams}(\mathrm{POS}_q), \mathrm{bigrams}(\mathrm{POS}_s))$ is the Jac- ...
-
[6]
Morphological. Cosine similarity over TF-IDF vectors of concatenated POS and morphological features (POS|FEATS per token). Retrieval metrics. Let $Q = \{q_1, \dots, q_M\}$ be the test set. For each query $q_i$, we retrieve k = 5 examples $s_{i,1}, \dots, s_{i,k}$ from the knowledge base. We evaluate retrieval quality with two metrics: Length Difference: the average absolute...
-
[7]
Results and Analysis. IR results. Table 1 shows that the structural strategy strongly outperforms TF-IDF and morphological retrieval on length difference (< 1.2 tokens on average vs. > 11), while maintaining competitive POS overlap. This confirms that sentence length and POS n-grams are sufficient features for retrieving structurally similar examples. ...
-
[8]
Error Analysis. Not all divergences from the gold standard are errors. To better understand our system’s behaviour, we conducted a qualitative analysis of cases where predictions differed from reference annotations. Annotation protocol. We designed a double-blind annotation comparing Gold and THIVLVC outputs (see the interface in Figure 2). Annotators w...
-
[9]
Hence they do not divide the year itself into the same number of seasons: winter, spring [...]
Taxonomy of Disagreement. Contradictions between CIRCSE and EvaLatin. According to Table 6, errors of the type advmod:lmod instead of advmod account for 7 out of 37 (18%) of the main THIVLVC errors. The adverb unde illustrates a case of legitimate annotation divergence between the CIRCSE corpus and the EvaLatin 2026 corpus. In CIRCSE, unde is annotated advm...
-
[10]
Limitations. Our approach has several limitations. First, the selection of the LLM was based on informal manual comparison rather than a systematic ablation study. We tested gemini-3-flash, claude-4.5-sonnet, and qwen3-72B on a small set of sentences and selected gemini-3-flash on the basis of output quality and cost, but we did not conduct a controlle...
-
[11]
Conclusion. We presented THIVLVC, a retrieval-augmented LLM system for Latin dependency parsing that combines a structural retriever, UD annotation guidelines, and a baseline parse to refine syntactic analysis. The system achieves substantial improvements over UDPipe on poetry and competitive results on prose. Our error analysis highlights a finding t...
-
[12]
Acknowledgements. We thank the EvaLatin 2026 organizers for making the shared task data available, and the CIRCSE Research Centre for the treebank used as knowledge base. This work was carried out at LISN (CNRS, Université Paris-Saclay) and HISOMA (ENS of Lyon).
-
[13]
Bibliographical References. Flavio Massimiliano Cecchini, Marco Passarotti, Paolo Ruffolo, Marinella Testori, Lia Draetta, Martina Fieromonte, Annarita Liano, Costanza Marini, and Giovanni Piantanida. 2018. Enhancing the Latin morphological analyser LEMLAT with a medieval Latin glossary. In Proceedings of the Fifth Italian Conference on Computational Ling...
-
[14]
Compare the baseline parse with the examples and guidelines
-
[15]
Identify any improvements needed in HEAD/DEPREL
-
[16]
Output your refined version
-
[17]
If uncertain, add a comment line # needs_council = true
Output ONLY the CoNLL-U block for the Input Sentence (not the baseline).
B. Annotation Interface
Figure 2: Annotation interface: overview. The Latin sentence is displayed at the top; two anonymized parse options (A and B) are shown side by side with their ID, word form, head, and relation columns.
Figure 3: Annot...
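The prompt excerpt above asks the LLM for a bare CoNLL-U block, optionally flagged with `# needs_council = true`. A downstream step could read that output roughly as in this sketch. The names `Token` and `parse_refined_block` are hypothetical, and the code assumes the model emits well-formed tab-separated CoNLL-U with ten columns, which real LLM output may not always satisfy.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    id: int
    form: str
    head: int
    deprel: str


def parse_refined_block(block: str) -> Tuple[List[Token], bool]:
    """Extract (ID, FORM, HEAD, DEPREL) rows and the needs_council flag."""
    tokens: List[Token] = []
    needs_council = False
    for line in block.strip().splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Comment lines look like "# key = value".
            key, _, value = line[1:].partition("=")
            if key.strip() == "needs_council" and value.strip().lower() == "true":
                needs_council = True
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            continue
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens, needs_council
```

A caller could route sentences with `needs_council` set to a second pass or to manual review. The excerpt does not say how the authors handle that flag.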