THIVLVC: Retrieval Augmented Dependency Parsing for Latin

Luc Pommeret (STL), Thibault Wagret (ENS de Lyon, HiSoMA), Jules Deret

arxiv: 2604.05564 · v1 · submitted 2026-04-07 · 💻 cs.CL

Pith reviewed 2026-05-10 20:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords: dependency parsing · Latin · retrieval augmented generation · UDPipe · large language models · universal dependencies · EvaLatin · treebanks

The pith

A retrieval-augmented system refines UDPipe parses for Latin by pulling similar treebank examples to guide an LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents THIVLVC as a two-stage system for the EvaLatin 2026 dependency parsing task on Latin texts. It first retrieves structurally similar sentences from the CIRCSE treebank by matching sentence length and POS n-gram patterns, then prompts a large language model to adjust the UDPipe baseline parse using those examples plus Universal Dependencies guidelines. On Seneca poetry the approach raises CLAS by 17 points over the baseline; on Thomas Aquinas prose the gain is 1.5 points. A double-blind review of 300 differing parses finds that when annotators reach unanimous agreement, 53.3 percent favor the THIVLVC output. The work matters because Latin syntax varies sharply by genre and period, and small improvements in automatic parsing can scale to larger digital editions of classical authors.

Core claim

THIVLVC retrieves structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompts an LLM to refine the UDPipe baseline parse according to UD guidelines, producing a 17-point CLAS gain on Seneca poetry and a 1.5-point gain on Thomas Aquinas prose, with a double-blind error analysis of 300 divergences showing 53.3 percent preference for THIVLVC among unanimous annotator decisions.

What carries the argument

The retrieval step that selects structurally similar treebank entries by sentence length and POS n-gram similarity to supply in-context examples for the LLM to correct the UDPipe baseline according to UD guidelines.

If this is right

The system raises CLAS by 17 points on Seneca's poetic Latin over the UDPipe baseline.
It raises CLAS by 1.5 points on Thomas Aquinas's prose Latin over the same baseline.
In 300 divergences examined by double-blind annotators, 53.3 percent of unanimous decisions favor THIVLVC output over the gold standard.
The results indicate annotation inconsistencies both within individual treebanks and across different Latin sources.
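
The CLAS figures above count only content words. As a minimal sketch of how such a score can be computed, the block below assumes gold tokenization (system and gold share the same tokens) and the functional-relation set of the CoNLL 2018 shared-task convention. The function names and the `strip_subtypes` switch are illustrative; whether the EvaLatin 2026 scorer compares relation subtypes is an assumption the source does not settle.

```python
# Relations treated as functional and therefore excluded from CLAS
# (CoNLL 2018 convention; punctuation is excluded as well).
FUNCTIONAL = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}

def universal(deprel):
    # "advmod:lmod" -> "advmod"
    return deprel.split(":")[0]

def is_content(deprel):
    return universal(deprel) not in FUNCTIONAL

def clas(gold, pred, strip_subtypes=True):
    """gold, pred: lists of (head, deprel), one pair per token, same tokenization.
    Returns labeled F1 restricted to content words."""
    label = universal if strip_subtypes else (lambda d: d)
    gold_n = sum(is_content(d) for _, d in gold)
    pred_n = sum(is_content(d) for _, d in pred)
    correct = sum(
        1
        for (gh, gd), (ph, pd) in zip(gold, pred)
        if is_content(gd) and gh == ph and label(gd) == label(pd)
    )
    if correct == 0:
        return 0.0
    precision, recall = correct / pred_n, correct / gold_n
    return 2 * precision * recall / (precision + recall)

# Toy four-token sentence (hypothetical annotations, not from the paper).
gold = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obl")]
pred = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obj")]
print(round(clas(gold, pred), 3))
```

The `case` token is ignored on both sides, so one wrong label among three content words yields two correct out of three.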

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-LLM pattern could be tested on other low-resource historical languages that already possess small treebanks.
Replacing the current length-and-n-gram matcher with syntax-aware retrieval might widen the gap between poetry and prose performance.
The error-analysis finding of annotation inconsistency suggests that re-annotating a subset of the CIRCSE treebank with stricter genre controls could raise future baseline scores.

Load-bearing premise

That sentences found by length and POS n-gram similarity are close enough in structure to let the LLM correct baseline errors without introducing fresh mistakes.

What would settle it

Running the system on a fresh set of Latin sentences where the retrieved examples produce no CLAS gain or where unanimous human judges consistently prefer the original UDPipe parses.

Figures

Figures reproduced from arXiv: 2604.05564 by Luc Pommeret (STL), Thibault Wagret (ENS de Lyon, HiSoMA), Jules Deret.

**Figure 1.** Overview of the THIVLVC pipeline. The input sentence is processed in parallel by the structural retriever (which selects similar examples from CIRCSE) and by UDPipe (which produces a baseline dependency parse). Both outputs, together with the UD guidelines, are passed to the LLM for refinement.

**Figure 2.** Annotation interface: overview. The Latin sentence is displayed at the top; two anonymized parse options (A and B) are shown side by side with their ID, word form, head, and relation columns.

**Figure 3.** Annotation interface: verdict buttons. Di…

Original abstract

We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

THIVLVC gets a solid +17 CLAS lift on Latin poetry via retrieval-augmented LLM correction of UDPipe, but the retrieval proxy's reliability is the main open question.

The paper's core result is that adding retrieval from the CIRCSE treebank to an LLM prompt lets them improve on the UDPipe baseline for Latin dependency parsing, with the largest gain on poetry. On Seneca they report +17 CLAS; on Aquinas prose the gain is only +1.5. They also ran a double-blind error analysis on 300 divergences and found that when annotators reached unanimous agreement, 53.3% preferred their system's output over gold. That is a concrete empirical finding worth noting, and the error analysis itself flags inconsistencies across treebanks, which is useful context for anyone working with historical Latin data.

Referee Report

3 major / 2 minor

Summary. The manuscript describes THIVLVC, a two-stage system for the EvaLatin 2026 Latin dependency parsing task. Given an input sentence, it retrieves entries from the CIRCSE treebank via sentence length and POS n-gram similarity, then prompts an LLM to refine the UDPipe baseline parse using the retrieved examples plus UD guidelines. Two configurations are evaluated (with and without retrieval). On Seneca poetry the system reports a +17 CLAS gain over UDPipe; on Thomas Aquinas prose the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences finds that, among cases with unanimous annotator decisions, 53.3 % favor THIVLVC.

Significance. If the gains are shown to be robust, statistically significant, and attributable to the retrieval step rather than LLM prompting alone, the work would supply a practical RAG-based recipe for improving parsing on free-word-order historical texts and would usefully flag annotation inconsistencies across treebanks. The explicit no-retrieval control is a strength.

major comments (3)

[Retrieval approach] Retrieval approach: sentence length plus POS n-gram similarity is presented as the mechanism that supplies structurally useful in-context examples, yet no quantitative check (e.g., average UAS or labeled-attachment similarity between query and retrieved trees) is reported. In a free-word-order language this proxy can match local POS sequences while differing globally in attachment scope, leaving open the possibility that the RAG component contributes little beyond the LLM's prior knowledge.
[Experimental results] Results: the headline CLAS deltas (+17 on poetry, +1.5 on prose) are given without statistical significance tests, standard errors, or variance estimates across multiple LLM generations. Because LLM outputs are stochastic, the absence of these controls makes it impossible to judge whether the observed improvements exceed what would be expected from prompt variability alone.
[Error analysis] Error analysis: the sampling frame for the 300 divergences, the inter-annotator agreement rate before unanimity filtering, and the precise decision criteria used by the three annotators are not stated. Without these details the 53.3 % preference statistic cannot be interpreted as reliable evidence that THIVLVC outperforms the gold standard.
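
The second major comment asks for variance estimates across LLM generations. Below is a minimal sketch of such a check, assuming the UDPipe baseline is deterministic, so a paired comparison reduces to a one-sample t statistic on the per-run differences. The function name and all scores are hypothetical placeholders, not results from the paper.

```python
import math
import statistics

def summarize_runs(system_scores, baseline_score):
    """Mean, sample standard deviation, and one-sample t statistic of
    (system - baseline) over repeated LLM runs on the same test set."""
    diffs = [s - baseline_score for s in system_scores]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD; requires at least two runs
    # With zero variance the t statistic is undefined, so report NaN.
    t = mean_diff / (sd / math.sqrt(len(diffs))) if sd > 0 else float("nan")
    return {
        "mean": statistics.mean(system_scores),
        "sd": sd,
        "mean_diff": mean_diff,
        "t": t,
        "df": len(diffs) - 1,
    }

# Hypothetical CLAS values from five generations (illustration only).
runs = [61.0, 62.0, 60.0, 63.0, 59.0]
print(summarize_runs(runs, baseline_score=45.0))
```

The resulting `t` would then be compared against a t distribution with `df` degrees of freedom; a bootstrap over sentences would be a reasonable alternative when only one generation per sentence is available.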

minor comments (2)

[Methods] Full prompt templates (including the exact UD guideline excerpts supplied to the LLM) are not reproduced; their inclusion would greatly aid reproducibility.
[Baselines] The manuscript should explicitly compare the RAG configuration against a pure few-shot LLM prompt that uses the same number of examples but drawn randomly rather than by the length/POS-n-gram heuristic.
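
The control requested in the last minor comment can be sketched as a seeded random draw of the same number of examples. The function name and the toy knowledge base are hypothetical; k = 5 follows the value quoted from the paper elsewhere on this page.

```python
import random

def random_examples(knowledge_base, k=5, seed=0):
    """Draw k distinct in-context examples uniformly at random.
    A fixed seed keeps the control reproducible across runs."""
    rng = random.Random(seed)
    return rng.sample(knowledge_base, k)

# Toy knowledge base of sentence identifiers (illustration only).
kb = [f"sent-{i}" for i in range(20)]
print(random_examples(kb, k=5, seed=1))
```

Running the same prompt with these examples instead of the retrieved ones would show how much of the gain comes from the selection heuristic rather than from few-shot prompting as such.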

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: Retrieval approach: sentence length plus POS n-gram similarity is presented as the mechanism that supplies structurally useful in-context examples, yet no quantitative check (e.g., average UAS or labeled-attachment similarity between query and retrieved trees) is reported. In a free-word-order language this proxy can match local POS sequences while differing globally in attachment scope, leaving open the possibility that the RAG component contributes little beyond the LLM's prior knowledge.

Authors: We selected sentence length and POS n-gram overlap as an efficient, language-agnostic proxy for retrieving locally coherent examples in a low-resource setting. We acknowledge that this does not guarantee global structural similarity and that a quantitative validation would better isolate the contribution of retrieval. In the revised version we will add an analysis reporting average UAS/LAS between gold parses of retrieved sentences and query sentences, plus a breakdown of attachment-scope differences, to demonstrate that the chosen metric supplies useful in-context signal beyond the LLM's prior knowledge. revision: yes
Referee: Results: the headline CLAS deltas (+17 on poetry, +1.5 on prose) are given without statistical significance tests, standard errors, or variance estimates across multiple LLM generations. Because LLM outputs are stochastic, the absence of these controls makes it impossible to judge whether the observed improvements exceed what would be expected from prompt variability alone.

Authors: We agree that stochasticity in LLM decoding requires explicit controls. The revised manuscript will include results from multiple generations with fixed seeds, reporting mean CLAS scores, standard deviations, and paired statistical tests (e.g., t-tests) against the UDPipe baseline. This will establish that the reported gains are robust and exceed prompt-induced variance. revision: yes
Referee: Error analysis: the sampling frame for the 300 divergences, the inter-annotator agreement rate before unanimity filtering, and the precise decision criteria used by the three annotators are not stated. Without these details the 53.3 % preference statistic cannot be interpreted as reliable evidence that THIVLVC outperforms the gold standard.

Authors: We will expand the error-analysis section to specify the sampling procedure (random selection among sentences where THIVLVC and gold standard diverged), the pre-filtering inter-annotator agreement (Fleiss' kappa), the exact decision rubric (UD guideline adherence and contextual consistency), and the number of unanimous cases. These additions will allow readers to evaluate the 53.3 % figure appropriately. revision: yes
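
The pre-filtering agreement statistic named in the last response, Fleiss' kappa, can be computed as in the sketch below. The input layout (one row of per-category counts per item) and the category order are assumptions for illustration; the source does not list the verdict categories.

```python
def fleiss_kappa(ratings):
    """ratings: one row per item; row[j] is the number of annotators who
    chose category j. Every row must sum to the same number of raters.
    Undefined (division by zero) if all ratings fall in one category."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items       # observed agreement
    p_exp = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Three annotators, three hypothetical verdict categories, four items.
toy = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [1, 1, 1]]
print(round(fleiss_kappa(toy), 3))
```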

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation against external benchmarks

The paper presents a retrieval-augmented parsing pipeline evaluated directly on held-out gold-standard treebanks (CIRCSE, EvaLatin 2026 data) and compared to an external baseline (UDPipe). No equations, fitted parameters, or self-citations appear in the provided text. Claims of improvement (+17 CLAS on poetry, +1.5 on prose) and the 53.3% annotator preference are measured against independent gold annotations and double-blind human judgments, with no reduction of outputs to inputs by construction. The derivation chain consists of standard RAG prompting steps whose success is falsifiable on external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the quality of the CIRCSE treebank and consistency of UD guidelines but introduces no new fitted parameters or invented entities beyond standard NLP components.

axioms (2)

domain assumption: Retrieved examples from CIRCSE will be structurally similar enough to guide correct refinements.
Core of the retrieval stage described in the abstract.
domain assumption: UD annotation guidelines provide a reliable, consistent reference for the LLM.
Explicitly used in the prompting step.

pith-pipeline@v0.9.0 · 5453 in / 1269 out tokens · 57774 ms · 2026-05-10T20:10:01.412276+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Previous systems have relied on supervised neural models trained on existing treebanks

Introduction. The EvaLatin 2026 Dependency Task (Iurescia et al., 2026) invites participants to parse Latin texts for two genres: Classical poetry with Seneca, and philosophical prose of Thomas Aquinas. Previous systems have relied on supervised neural models trained on existing treebanks. Such models learn whatever patterns the training data contains, i...
[2]

THIVLVC: Retrieval Augmented Dependency Parsing for Latin

Description of the System. Our system is a two-stage pipeline: (1) retrieval of structurally similar sentences from CIRCSE, and (2) generation, where an LLM refines a baseline parse using the retrieved examples and UD guidelines. Information Retrieval. Given an input sentence, we retrieve the k = 5 most similar sentences from the training set of the CIRC...
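
The retrieval step quoted above amounts to a top-k selection under some similarity function. A minimal sketch follows, with the scoring function passed in as a parameter. The names `top_k` and `score` and the toy length-based scorer are illustrative and not the authors' code.

```python
import heapq

def top_k(query, knowledge_base, score, k=5):
    """Return the k candidates with the highest score(query, candidate),
    best first."""
    return heapq.nlargest(k, knowledge_base, key=lambda s: score(query, s))

# Toy scorer (an assumption for illustration): prefer sentences whose
# token count is closest to the query's.
def length_score(q, s):
    return -abs(len(q) - len(s))

kb = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d", "e"]]
query = ["x", "y", "z"]
print(top_k(query, kb, length_score, k=2))
```

`heapq.nlargest` avoids sorting the whole knowledge base, which matters little at treebank scale but keeps the intent explicit.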
[3]

We compared three retrieval strategies

Evaluation Protocol. Retrieval strategies. We compared three retrieval strategies. In each case, $q$ denotes the query sentence and $s$ a candidate from the knowledge base.
[4]

Cosine similarity where $v_q$ and $v_s$ are the TF-IDF vectors of word forms for $q$ and $s$

TF-IDF (Baseline). Cosine similarity where $v_q$ and $v_s$ are the TF-IDF vectors of word forms for $q$ and $s$:

$$\mathrm{sim}_{\mathrm{tfidf}}(q, s) = \frac{v_q \cdot v_s}{\lVert v_q \rVert \, \lVert v_s \rVert}$$
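
The TF-IDF baseline can be illustrated with a small standard-library sketch. The smoothed IDF weighting used here is an assumption, since the excerpt does not say which TF-IDF variant the authors used; the function names and toy sentences are likewise illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """sentences: list of token lists (word forms).
    Returns one sparse TF-IDF dict per sentence."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    # Smoothed IDF (assumed variant): forms found everywhere keep weight 1.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}
    return [
        {tok: count * idf[tok] for tok, count in Counter(sent).items()}
        for sent in sentences
    ]

def cosine(vq, vs):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vs.get(tok, 0.0) for tok, w in vq.items())
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    norm_s = math.sqrt(sum(w * w for w in vs.values()))
    return dot / (norm_q * norm_s) if norm_q and norm_s else 0.0

corpus = [["arma", "virumque", "cano"], ["arma", "cano"], ["gallia", "est"]]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because the vectors are built from word forms, two sentences with identical syntax but no shared vocabulary score zero, which is one reason a lexical baseline is a weak proxy for structure.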
[5]

Structural (Length + POS n-grams). A weighted combination of sentence length similarity and POS n-gram Jaccard overlap:

$$\mathrm{sim}_{\mathrm{struct}}(q, s) = 0.33\, f_{\mathrm{len}} + 0.33\, f_{\mathrm{bi}} + 0.34\, f_{\mathrm{tri}} \tag{1}$$

where $f_{\mathrm{len}} = 1 - \frac{\bigl|\,|q| - |s|\,\bigr|}{\max(|q|, |s|)}$ is the normalized length similarity ($|q|$ and $|s|$ denote sentence lengths in tokens), $f_{\mathrm{bi}} = J(\mathrm{bigrams}(\mathrm{POS}_q), \mathrm{bigrams}(\mathrm{POS}_s))$ is the Jac- ...
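
Equation (1) can be sketched directly in code. The excerpt is cut off before the trigram term is defined, so the block assumes it is the trigram analogue of the bigram Jaccard term. The handling of sentences too short to have any n-grams (score 0 for that term) is also an assumption, and all function names are illustrative.

```python
def ngrams(tags, n):
    """Set of POS n-grams occurring in a tag sequence."""
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def jaccard(a, b):
    union = a | b
    # Assumed convention: two empty sets have overlap 0.
    return len(a & b) / len(union) if union else 0.0

def sim_struct(pos_q, pos_s):
    """pos_q, pos_s: UPOS tag sequences, one tag per token."""
    longest = max(len(pos_q), len(pos_s))
    f_len = 1 - abs(len(pos_q) - len(pos_s)) / longest if longest else 0.0
    f_bi = jaccard(ngrams(pos_q, 2), ngrams(pos_s, 2))
    f_tri = jaccard(ngrams(pos_q, 3), ngrams(pos_s, 3))  # assumed definition
    return 0.33 * f_len + 0.33 * f_bi + 0.34 * f_tri

q = ["NOUN", "VERB", "ADJ", "NOUN"]
s = ["NOUN", "VERB"]
print(sim_struct(q, s))
```

For this pair the length term is 0.5, one of three distinct bigrams is shared, and no trigram is shared, so the score is 0.33 × 0.5 + 0.33 × (1/3) + 0.34 × 0 ≈ 0.275. The score sees only local tag sequences, which is exactly the limitation the referee raises about attachment scope.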
[6]

Cosine similarity over TF-IDF vectors of concatenated POS and morphological features (POS|FEATS per token)

Morphological. Cosine similarity over TF-IDF vectors of concatenated POS and morphological features (POS|FEATS per token). Retrieval metrics. Let $Q = \{q_1, \dots, q_M\}$ be the test set. For each query $q_i$, we retrieve $k = 5$ examples $s_{i,1}, \dots, s_{i,k}$ from the knowledge base. We evaluate retrieval quality with two metrics: Length Difference: the average absolute...
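
The excerpt breaks off before the Length Difference metric is fully defined, so the following is only one plausible reading: the mean absolute difference in token count between each query and each of its retrieved examples. The function name and data layout are hypothetical.

```python
def mean_length_difference(queries, retrieved):
    """queries: list of token lists.
    retrieved[i]: the token lists returned for queries[i].
    Assumed definition: average of | |q_i| - |s_ij| | over all (i, j)."""
    diffs = [
        abs(len(q) - len(s))
        for q, hits in zip(queries, retrieved)
        for s in hits
    ]
    return sum(diffs) / len(diffs)

# Toy data: two queries with two retrieved sentences each.
queries = [["a", "b", "c"], ["a", "b"]]
retrieved = [
    [["x", "y", "z"], ["x", "y"]],
    [["x", "y", "z", "w"], ["x", "y"]],
]
print(mean_length_difference(queries, retrieved))
```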
[7]

Results and Analysis. IR results. Table 1 shows that the structural strategy strongly outperforms TF-IDF and morphological retrieval on length difference (< 1.2 tokens on average vs. > 11), while maintaining competitive POS overlap. This confirms that sentence length and POS n-grams are sufficient features for retrieving structurally similar examples. ...
[8]

Gold is better

Error Analysis. Not all divergences from the gold standard are errors. To better understand our system’s behaviour, we conducted a qualitative analysis of cases where predictions differed from reference annotations. Annotation protocol. We designed a double-blind annotation comparing Gold and THIVLVC outputs (see the interface in Figure 2). Annotators w...
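
The headline statistic from this analysis is a share computed over unanimous decisions only. A minimal sketch of that tally is given below, assuming each divergence carries one verdict per annotator. The verdict labels and the function name are hypothetical, since the excerpt is truncated before the options are listed.

```python
from collections import Counter

def unanimous_share(verdicts, label):
    """verdicts: one tuple of annotator verdicts per divergence.
    Returns (share of unanimous items whose verdict is `label`,
             number of unanimous items)."""
    unanimous = [v[0] for v in verdicts if len(set(v)) == 1]
    if not unanimous:
        return 0.0, 0
    return Counter(unanimous)[label] / len(unanimous), len(unanimous)

# Toy verdicts from three annotators (illustration only).
verdicts = [
    ("system", "system", "system"),
    ("gold", "gold", "gold"),
    ("system", "gold", "system"),   # not unanimous, filtered out
    ("system", "system", "system"),
]
print(unanimous_share(verdicts, "system"))
```

Reporting the second return value alongside the percentage makes clear how many of the 300 divergences survive the unanimity filter, which is one of the details the referee report asks for.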
[9]

Hence they do not divide the year itself into the same number of seasons: winter, spring [...]

Taxonomy of Disagreement. Contradictions between CIRCSE and EvaLatin. According to Table 6, errors of the type advmod:lmod instead of advmod account for 7 out of 37 (18%) of the main THIVLVC errors. The adverb unde illustrates a case of legitimate annotation divergence between the CIRCSE corpus and the EvaLatin 2026 corpus. In CIRCSE, unde is annotated advm...
[10]

First, the selection of the LLM was based on informal manual comparison rather than a systematic ablation study

Limitations. Our approach has several limitations. First, the selection of the LLM was based on informal manual comparison rather than a systematic ablation study. We tested gemini-3-flash, claude-4.5-sonnet, and qwen3-72B on a small set of sentences and selected gemini-3-flash on the basis of output quality and cost, but we did not conduct a controlle...
[11]

The system achieves substantial improvements over UDPipe on poetry and competitive results on prose

Conclusion. We presented THIVLVC, a retrieval-augmented LLM system for Latin dependency parsing that combines a structural retriever, UD annotation guidelines, and a baseline parse to refine syntactic analysis. The system achieves substantial improvements over UDPipe on poetry and competitive results on prose. Our error analysis highlights a finding t...
[12]

This work was carried out at LISN (CNRS, Université Paris-Saclay) and HISOMA (ENS of Lyon)

Acknowledgements. We thank the EvaLatin 2026 organizers for making the shared task data available, and the CIRCSE Research Centre for the treebank used as knowledge base. This work was carried out at LISN (CNRS, Université Paris-Saclay) and HISOMA (ENS of Lyon).
[13]

Bibliographical References. Flavio Massimiliano Cecchini, Marco Passarotti, Paolo Ruffolo, Marinella Testori, Lia Draetta, Martina Fieromonte, Annarita Liano, Costanza Marini, and Giovanni Piantanida. 2018. Enhancing the Latin morphological analyser LEMLAT with a medieval Latin glossary. In Proceedings of the Fifth Italian Conference on Computational Ling...
[14]

Compare the baseline parse with the examples and guidelines

[15]

Identify any improvements needed in HEAD/DEPREL

[16]

Output your refined version

[17]

If uncertain, add a comment line # needs_council = true
Output ONLY the CoNLL-U block for the Input Sentence (not the baseline).

B. Annotation Interface. Figure 2: Annotation interface: overview. The Latin sentence is displayed at the top; two anonymized parse options (A and B) are shown side by side with their ID, word form, head, and relation columns. Figure 3: Annot...
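
The prompt fragments above ask the model for a bare CoNLL-U block plus an optional `# needs_council = true` comment. Below is a minimal sketch of how such a reply could be read back, assuming the standard ten tab-separated CoNLL-U columns. The function name, the returned fields, and the sample reply are illustrative; how the authors actually post-process the output, or what they do with flagged sentences, is not described in the excerpts.

```python
def parse_llm_output(text):
    """Split an LLM reply into (needs_council flag, list of token rows)."""
    needs_council = False
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment line; look for the uncertainty flag from the prompt.
            key, _, value = line[1:].partition("=")
            if key.strip() == "needs_council" and value.strip().lower() == "true":
                needs_council = True
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 CoNLL-U columns, got {len(cols)}: {line!r}")
        rows.append({
            "id": cols[0], "form": cols[1], "upos": cols[3],
            "head": cols[6], "deprel": cols[7],
        })
    return needs_council, rows

# Hypothetical reply for a two-token sentence.
reply = (
    "# needs_council = true\n"
    "1\tArma\tarma\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "2\tcano\tcano\tVERB\t_\t_\t0\troot\t_\t_\n"
)
flag, tokens = parse_llm_output(reply)
print(flag, [(t["form"], t["head"], t["deprel"]) for t in tokens])
```

Rejecting rows with the wrong column count is a cheap guard against replies that drift from the requested format.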
