arxiv: 2503.06778 · v3 · submitted 2025-03-09 · 💻 cs.CL · cs.AI

Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Pith reviewed 2026-05-22 23:57 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords event annotation · large language models · human-AI collaboration · annotation efficiency · event set curation · variable annotation · information extraction

The pith

LLMs assist human experts in event annotation workflows but cannot serve as reliable independent annotators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates LLMs inside a full event annotation pipeline that filters irrelevant documents, merges reports of the same event, and labels event variables. Fully automated LLM output beats older TF-IDF baselines yet still diverges from expert judgments more than experts diverge from one another. When the same models are restricted to suggesting variables inside the curation step, human annotators finish faster and with less mental effort while producing labels that match expert standards more closely than pure automation does.
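The merging stage and the TF-IDF baseline mentioned above can be pictured with a minimal sketch. It does not reproduce the paper's baseline configuration: the whitespace tokenizer, the smoothed IDF formula, the 0.5 similarity threshold, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_reports(docs, threshold=0.5):
    """Group documents whose pairwise similarity reaches the (assumed) threshold."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy headlines, invented for this example.
docs = [
    "bomb explodes in market kills five",
    "market bomb explodes five dead",
    "election results announced in capital",
]
print(group_reports(docs))  # [[0, 1], [2]]
```

Lexical overlap of this kind is what an LM-based similarity index is meant to improve on: two reports of the same event that share few words would stay in separate groups here.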

Core claim

Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When using LLMs to extract event variables to assist expert annotators, they agree more with the extracted variables than fully automated LLMs for annotation.

What carries the argument

The three-stage event annotation workflow (document filtering, event merging, variable annotation) with LLMs applied either as full automation or as variable-extraction assistants during curation.

If this is right

  • Fully automated LLM annotations remain less consistent with expert judgments than annotations produced with hybrid assistance.
  • LLM assistance during event set curation lowers time and cognitive load for human annotators.
  • Hybrid outputs align more closely with expert standards than pure automation.
  • The same assistance pattern may carry over to other event annotation tasks, such as identifying market changes, monitoring breaking news, or tracking sociological trends.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hybrid patterns could be tested in other annotation domains such as entity or sentiment labeling.
  • Future experiments could measure whether model scale or task-specific fine-tuning narrows the remaining gap to expert performance.
  • Annotation platforms may benefit from interfaces that surface model suggestions without granting full automation.

Load-bearing premise

The study assumes that agreement rates with human experts constitute the correct gold-standard measure of annotation quality and that the chosen events and documents are representative of broader annotation tasks.

What would settle it

A new collection of events and documents in which fully automated LLM annotations reach the same agreement level with experts that experts reach with one another would falsify the claim that LLMs are not reliable independent annotators.
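The test described above amounts to comparing two agreement rates. The sketch below shows one way to set up that comparison. It assumes exact-match agreement, whereas the figure captions indicate the paper also uses softer equivalence metrics (PEDANTS, BERT-based). The data layout, the function names, and the toy labels are hypothetical.

```python
def agreement(labels_a, labels_b):
    """Fraction of shared (event, variable) items on which two annotators match exactly."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("annotators have no items in common")
    return sum(labels_a[k] == labels_b[k] for k in shared) / len(shared)


def reaches_expert_level(human_human, human_llm, tolerance=0.0):
    """True if LLM-expert agreement is at least expert-expert agreement (minus a tolerance)."""
    return human_llm >= human_human - tolerance


# Toy annotations keyed by (event id, variable name); values are invented.
expert_a = {("e1", "country"): "Iraq", ("e1", "kills"): 5,
            ("e2", "country"): "Nigeria", ("e2", "kills"): 0}
expert_b = {("e1", "country"): "Iraq", ("e1", "kills"): 5,
            ("e2", "country"): "Nigeria", ("e2", "kills"): 1}
llm = {("e1", "country"): "Iraq", ("e1", "kills"): 4,
       ("e2", "country"): "Nigeria", ("e2", "kills"): 2}

human_human = agreement(expert_a, expert_b)  # 0.75
human_llm = agreement(expert_a, llm)         # 0.5
print(reaches_expert_level(human_human, human_llm))  # False
```

In practice the comparison would also need a significance test over many events; a single pair of rates like this only illustrates the bookkeeping.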

Figures

Figures reproduced from arXiv: 2503.06778 by Benjamin Evans, Carlos Rafael Colon, Feng Gu, Ishani Mondal, Jordan Lee Boyd-Graber, Zongxia Li.

Figure 1
Our workflow for annotating events data begins with preprocessing incoming media news. A Support Vector Machine identifies highly relevant documents for manual review. During Event Set Curation, human annotators create unique event sets. Finally, annotators code the domain-specific variables. We apply LM-based similarity indices and use LM-extracted variables to aid manual processing.
Figure 2
The agreement between the manual and the automated settings is comparable. On average, annotators and LM agree 50% using NM. PEDANTS and BERT show higher agreements. The difference between human-human and human-LM agreement is not statistically significant, suggesting that the LM-extracted variables provide approximately human-level utility.
Figure 3
Agreement by event set type and setting. Annotators show higher agreement in the hybrid setting, where extracted variables are available. This indicates that these variables help code the events. Furthermore, the extracted variables prove particularly beneficial in LM-generated incident sets, which often contain misinformation.
Figure 4
Agreement grouped by variable type. Human annotators agree more with extracted variables with higher degree of specificity. Country has over 90% agreement. Generic attack type and weapon type also high agreement. In comparison, low specificity variables like location demonstrate low agreement with human judgment.
read the original abstract

Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When using LLMs to extract event variables to assist expert annotators, they agree more with the extracted variables than fully automated LLMs for annotation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates LLMs in a holistic event annotation workflow (filtering irrelevant documents, merging same-event documents, and annotating variables). It reports that LLM-based automated annotations outperform TF-IDF baselines and Event Set Curation but remain less reliable than human experts as independent annotators; however, LLMs reduce time and mental effort when assisting experts in curation, and experts show higher agreement with LLM-extracted variables than with fully automated LLM annotations.

Significance. If the empirical comparisons hold, the work demonstrates that LLMs are more effective as annotation assistants than replacements in multi-stage event annotation tasks relevant to market monitoring and sociology. The inclusion of multiple baselines and the focus on workflow-level evaluation (rather than isolated extraction) strengthens the practical implications for hybrid human-LLM systems.

major comments (2)
  1. [Abstract/Methods] The central claim that LLMs are not reliable independent annotators rests on lower agreement rates with human experts serving as the gold standard. The manuscript provides no inter-annotator agreement statistics among the expert coders themselves (abstract and methods), leaving the performance gap uninterpretable if human consistency is modest.
  2. [Methods] No sampling frame, selection criteria, or justification is given for the chosen events and documents. This directly affects the generalizability of the finding that LLMs underperform humans while assisting them, as representativeness cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important issues of interpretability and generalizability that we address below. We indicate revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract/Methods] The central claim that LLMs are not reliable independent annotators rests on lower agreement rates with human experts serving as the gold standard. The manuscript provides no inter-annotator agreement statistics among the expert coders themselves (abstract and methods), leaving the performance gap uninterpretable if human consistency is modest.

    Authors: We agree that reporting inter-annotator agreement (IAA) would aid interpretation. Our study used a single highly trained expert per document to establish the gold standard, consistent with practices in specialized event annotation tasks requiring domain knowledge. No multiple independent coders were employed for the same items, so IAA statistics are unavailable. We will revise the methods section to explicitly describe the annotation protocol, note this as a limitation, and clarify that the performance gap is measured against expert-established gold standards rather than claiming absolute superiority. revision: yes

  2. Referee: [Methods] No sampling frame, selection criteria, or justification is given for the chosen events and documents. This directly affects the generalizability of the finding that LLMs underperform humans while assisting them, as representativeness cannot be assessed.

    Authors: We acknowledge the need for explicit justification to support generalizability claims. The events and documents were drawn from a corpus focused on market monitoring and sociological trends (e.g., financial and social events), selected to reflect realistic multi-stage annotation workflows. We will revise the methods section to include a detailed sampling frame, selection criteria, and rationale for the chosen events and documents. revision: yes
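For readers who want to see what the requested inter-annotator agreement statistic would involve, here is a minimal sketch of Cohen's kappa for two coders labeling the same items. It assumes a second expert is available for the same items, which the simulated rebuttal says was not the case. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for a categorical variable such as Generic Attack Type.
coder_1 = ["Bombing", "Bombing", "Armed Assault", "Armed Assault"]
coder_2 = ["Bombing", "Armed Assault", "Armed Assault", "Armed Assault"]
print(cohens_kappa(coder_1, coder_2))  # 0.5
```

Kappa suits categorical variables; free-text variables such as Location would need a softer equivalence measure before any chance correction makes sense.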

Circularity Check

0 steps flagged

No significant circularity in empirical comparison study

full rationale

This is a straightforward empirical evaluation paper that compares LLM annotation performance against human experts and baselines via agreement rates, time, and effort metrics. No derivations, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps exist in the reported workflow. All claims are grounded in direct experimental measurements that can be independently replicated or falsified from the described tasks and data without reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper containing no mathematical derivations, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5671 in / 1066 out tokens · 36514 ms · 2026-05-22T23:57:38.405526+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models

    cs.CL 2025-09 unverdicted novelty 5.0

    SMARTER boosts LLM toxicity detection and explanation performance by up to 13% macro-F1 on three hate-speech benchmarks through self-generated synthetic data and minimal-supervision preference optimization.
