GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models

Jiacheng Lin; Jiawei Han; Jimeng Sun; Pengcheng Jiang; Zifeng Wang

arxiv: 2402.10744 · v1 · pith:UVWMBUILnew · submitted 2024-02-16 · 💻 cs.CL · cs.AI

Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han

keywords genres extraction methods relation relations evaluation llms generative

The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional RE metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multi-dimensional assessment in terms of the topic similarity, uniqueness, granularity, factualness, and completeness of the GRE results. With GenRES, we empirically identified that (1) precision/recall fail to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality. Finally, we conducted a comprehensive evaluation of fourteen leading LLMs using GenRES across document-, bag-, and sentence-level RE datasets to set the benchmark for future research in GRE.
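
The exact-matching limitation described in the abstract can be illustrated with a minimal sketch. This is not the GenRES implementation; the function name `exact_match_prf` and the example triples are hypothetical, chosen only to show how a semantically reasonable extraction can score zero under exact-match precision and recall.

```python
# Illustrative sketch only: exact-match precision/recall over relation triples.
# The function name and the triples below are hypothetical, not from the paper.

def exact_match_prf(predicted, reference):
    """Score predicted (subject, relation, object) triples by exact match."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)  # only identical triples count
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical human-annotated reference relation.
reference = [("Marie Curie", "award received", "Nobel Prize in Physics")]

# Hypothetical generative output: same fact with a different relation phrase,
# plus an extra relation that the reference does not list.
predicted = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "field of work", "radioactivity"),
]

scores = exact_match_prf(predicted, reference)
print(scores)  # (0.0, 0.0, 0.0)
```

Both predicted triples are penalized here: the first because its relation phrase differs from the reference wording, the second because the reference set does not contain it at all. These are the two failure modes the abstract attributes to exact matching against possibly incomplete references.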

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Task Decomposition for Efficient Annotation
cs.CL 2026-06 unverdicted novelty 4.0

Decomposing annotation tasks using centers from centering theory reduces aggregate inferential load via a degrees-of-freedom model and enables better sub-task allocation.

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
cs.SE 2026-04 unverdicted novelty 4.0

LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.