LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs
Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3
The pith
Graph-based parsers outperform LLMs on supervised relation extraction as documents contain more relations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs still lag behind a much smaller graph-based parser on relation extraction when the underlying linguistic graph is complex, with the graph-based parser's advantage increasing as the number of relations in the input documents grows across the six evaluated datasets.
What carries the argument
Supervised evaluation of LLMs versus a graph-based parser on datasets ordered by increasing numbers of relations per document, using relation count as the measure of sentence-graph complexity.
If this is right
- The much lighter graph-based parser becomes the better practical choice whenever input texts contain complex linguistic graphs.
- LLMs require further adaptation or architectural changes to close the gap on high-relation inputs under supervised conditions.
- Knowledge-graph pipelines benefit from routing dense-relation documents to graph parsers rather than relying solely on LLMs.
- Supervised fine-tuning of compact specialized models can exceed the performance of larger general models on structurally demanding extraction tasks.
Where Pith is reading between the lines
- If relation count continues to predict the gap, simply scaling LLMs may not resolve extraction difficulties on dense texts without changes to how they represent graph structure.
- Hybrid systems that detect high-complexity inputs and hand them to graph parsers could combine the strengths of both approaches.
- The same complexity trend may appear in other graph-oriented tasks such as event extraction or coreference resolution, warranting parallel tests.
Load-bearing premise
That the number of relations per document is a sufficient and unbiased proxy for linguistic graph complexity across the datasets and that the supervised training and decoding setups allow a direct, fair comparison between the two approaches.
What would settle it
An experiment on a new collection of documents with high relation counts where the LLMs achieve equal or higher F1 scores than the graph-based parser would falsify the claim of increasing outperformance.
Figures
Original abstract
Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four LLMs against a graph-based parser for supervised relation extraction across six datasets with varying sentence-graph sizes. It reports that the graph parser's advantage over the LLMs widens as the number of relations per document increases, concluding that LLMs underperform on complex linguistic graphs and that the lighter parser is preferable in such cases.
Significance. If the performance trend is shown to be robust to controls for document length, entity density, and relation-type distribution, the result would be significant for NLP and knowledge-graph construction. It supplies concrete empirical evidence that supervised graph parsers can outperform LLMs on dense relational structures, challenging the default preference for LLMs in relation extraction and offering a practical recommendation for high-complexity inputs. The multi-dataset design is a positive feature.
major comments (2)
- [§4 (Results and Analysis)] The central claim that the graph parser increasingly outperforms LLMs with rising relation count is presented without reported metrics (e.g., exact F1 definitions), statistical significance tests, or error bars across the six datasets. This absence makes it impossible to judge whether the widening gap is reliable or driven by a few outliers.
- [§3 (Datasets and Experimental Setup)] The manuscript treats the number of relations per document as a sufficient proxy for linguistic-graph complexity, yet provides no stratification, regression, or partial-correlation analysis against potential confounders such as average document length, token count, or entity density. Without these controls the interpretation that LLMs specifically fail on complex graphs (rather than on longer contexts) remains under-supported.
minor comments (2)
- [Abstract] The abstract names neither the four LLMs nor the six datasets; adding these identifiers would improve immediate readability.
- [Figures and Tables] Figure captions and table headers should explicitly state the evaluation metric (e.g., micro-F1 or macro-F1) and whether results are averaged over multiple runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns and strengthen the empirical support for our claims.
Point-by-point responses
-
Referee: [§4 (Results and Analysis)] The central claim that the graph parser increasingly outperforms LLMs with rising relation count is presented without reported metrics (e.g., exact F1 definitions), statistical significance tests, or error bars across the six datasets. This absence makes it impossible to judge whether the widening gap is reliable or driven by a few outliers.
Authors: We agree that the results section would be strengthened by explicit reporting of these elements. In the revised manuscript we now state the precise F1 definition employed (micro-averaged F1 over all relation instances), include statistical significance tests (paired Wilcoxon signed-rank tests with p-values reported for each relation-count bin comparing the graph parser against each LLM), and add error bars to all plots showing standard error across the six datasets. These additions confirm that the widening performance gap is statistically reliable and not attributable to outliers in any single dataset. revision: yes
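The kind of paired per-bin test the authors describe can be sketched in a few lines. The sketch below substitutes an exact sign-flip permutation test for the Wilcoxon signed-rank test (the two agree in spirit for small samples), and the F1 values are illustrative placeholders, not numbers from the paper:

```python
from itertools import product

# Hypothetical per-dataset F1 scores within one relation-count bin
# (six datasets, graph parser vs. one LLM); values are invented.
parser_f1 = [0.81, 0.76, 0.79, 0.83, 0.74, 0.80]
llm_f1 = [0.78, 0.71, 0.77, 0.75, 0.70, 0.73]

diffs = [p - l for p, l in zip(parser_f1, llm_f1)]

# Exact one-sided sign-flip test: under the null, each paired difference
# is equally likely to be positive or negative, so enumerate all 2^n sign
# patterns and count those whose mean is at least the observed mean.
observed = sum(diffs) / len(diffs)
patterns = list(product([1, -1], repeat=len(diffs)))
count = sum(
    1 for signs in patterns
    if sum(s * d for s, d in zip(signs, diffs)) / len(diffs) >= observed - 1e-12
)
p_value = count / len(patterns)
print(f"observed mean gap = {observed:.3f}, exact p = {p_value:.4f}")
```

With all six differences positive, only the all-positive sign pattern reaches the observed mean, giving p = 1/64 ≈ 0.016, which matches the intuition that a consistent gap across six datasets is unlikely under the null.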
-
Referee: [§3 (Datasets and Experimental Setup)] The manuscript treats the number of relations per document as a sufficient proxy for linguistic-graph complexity, yet provides no stratification, regression, or partial-correlation analysis against potential confounders such as average document length, token count, or entity density. Without these controls the interpretation that LLMs specifically fail on complex graphs (rather than on longer contexts) remains under-supported.
Authors: We acknowledge that additional controls would better isolate graph complexity from length-related effects. In the revision we have added a multiple linear regression analysis in which the performance gap (graph parser minus LLM F1) is the dependent variable and we control for document length in tokens, entity density, and relation-type entropy. The coefficient for number of relations remains positive and significant after these controls. We also include supplementary stratified plots by document-length quartiles showing the same widening trend within each length stratum. revision: yes
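The control analysis the authors describe can be sketched with ordinary least squares on synthetic data; every variable name, coefficient, and distribution below is invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60  # hypothetical document-level observations pooled across datasets

# Synthetic covariates standing in for the named controls. Document
# length is deliberately correlated with relation count, since that
# confound is exactly what the referee worries about.
relations = rng.poisson(8, n).astype(float)
doc_len = 15 * relations + rng.normal(100, 30, n)
entity_density = rng.normal(0.3, 0.05, n)

# Simulated performance gap (parser F1 minus LLM F1) that truly grows
# with relation count, plus a small length effect and noise.
gap = 0.01 * relations + 0.0001 * doc_len + rng.normal(0, 0.02, n)

# OLS with an intercept: gap ~ relations + doc_len + entity_density.
# If the coefficient on `relations` stays positive after the controls,
# the complexity effect is not just a length effect.
X = np.column_stack([np.ones(n), relations, doc_len, entity_density])
beta, *_ = np.linalg.lstsq(X, gap, rcond=None)
print("coef on relations after controls:", round(beta[1], 4))
```

In a real revision one would report standard errors and p-values alongside the coefficient (e.g., via a statistics package), but the structure of the check is as above.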
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential constructions
full rationale
The paper conducts an empirical evaluation of four LLMs against a graph-based parser on six relation extraction datasets, reporting performance trends as the number of relations increases. No equations, fitted parameters, uniqueness theorems, or ansatzes are present in the abstract or described methodology. The central claim is supported by observed experimental results rather than any reduction of outputs to inputs by definition, self-citation chains, or renaming of known patterns. This is a standard supervised benchmarking study whose validity depends on experimental controls, not on any internal logical circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard metrics such as precision, recall, and F1 score appropriately measure relation extraction performance.
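As a concrete reading of that axiom, micro-averaged F1 over predicted relation triples pools counts across all documents before computing precision and recall. A minimal sketch, with invented entity and relation names:

```python
# Gold and predicted (head, relation, tail) triples for two hypothetical
# documents; all names are invented examples.
gold = [
    {("Marie", "born_in", "Warsaw"), ("Marie", "field", "physics")},
    {("Acme", "founded_by", "Ann")},
]
pred = [
    {("Marie", "born_in", "Warsaw")},
    {("Acme", "founded_by", "Ann"), ("Ann", "ceo_of", "Acme")},
]

# Micro-averaging: sum true positives, false positives, and false
# negatives over all documents, then compute precision/recall once.
tp = sum(len(g & p) for g, p in zip(gold, pred))
fp = sum(len(p - g) for g, p in zip(gold, pred))
fn = sum(len(g - p) for g, p in zip(gold, pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} micro-F1={f1:.2f}")
```

Here tp=2, fp=1, fn=1, so precision, recall, and micro-F1 all equal 2/3. Macro-averaging would instead compute F1 per document (or per relation type) and average the results, which weights rare types more heavily.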