What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling

Jingyuan Liu

arxiv: 2604.16413 · v1 · submitted 2026-04-02 · 💻 cs.CY · cs.AI

What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling

Jingyuan Liu This is my paper

Pith reviewed 2026-05-13 21:38 UTC · model grok-4.3

classification 💻 cs.CY cs.AI

keywords LLM annotationinter-prompt reliabilityprompt variationmeasurement uncertaintystochastic variationmajority votingreproducibilitycomputational social science

0 comments

The pith

LLM annotations for social science show high stochastic variation across equivalent prompts in interpretive tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Inter-Prompt Reliability to check how stable LLM outputs remain when the same underlying instruction is rephrased. It tests this on an interpretive labeling task and a knowledge-based task, finding much more output fluctuation in the former. Majority voting across several prompt versions cuts that fluctuation and raises agreement rates. The core concern is that single-prompt LLM labeling treats prompt wording as neutral when it actually functions as an unmeasured source of error in the annotation process.

Core claim

LLM prompt acts as an instrumental measurement while its wording exhibits methodological uncertainty: annotation outputs display substantial stochastic variation on interpretative tasks yet appear more stable on knowledge-anchored tasks, and majority voting across multiple prompts measurably improves reproducibility while reducing variance.

What carries the argument

Inter-Prompt Reliability (IPR), quantified by Pairwise Agreement Rate and its distribution across semantically equivalent but linguistically varied prompts.

If this is right

Single-prompt LLM annotation results should no longer be treated as fixed; distributional stability across prompts becomes the relevant metric.
Interpretative labeling tasks require explicit checks for prompt sensitivity while knowledge-anchored tasks can tolerate narrower checks.
Majority voting over several prompt phrasings becomes a practical way to reduce variance without changing the underlying model.
Future CSS studies must report prompt-induced variance alongside conventional agreement scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt engineering efforts should optimize for cross-wording consistency rather than peak performance on one phrasing.
The same measurement-uncertainty pattern may appear when LLMs are used for labeling in domains outside social science.
Comparing IPR across different model families could identify which architectures are less sensitive to prompt rewording.

Load-bearing premise

The linguistically varied prompts remain semantically equivalent and the two chosen tasks represent the main kinds of labeling used in social science.

What would settle it

Repeating the TREC and Politifact experiments with fresh sets of prompts that preserve meaning but change wording and finding no measurable change in output distribution or in the improvement from majority voting.

Figures

Figures reproduced from arXiv: 2604.16413 by Jingyuan Liu.

**Figure 1.** Figure 1: PAR Heatmap of TREC (Left: GPT-4o mini, Right: LLaMa3.1:8b) Importantly, certain low-accuracy prompts (e.g., analytical_6) showed high PAR (up to 71%) with other prompts. This further indicates that PAR (or IPR) and accuracy capture distinct aspects of model behavior. While PAR reflects consistency across prompts, indicating stability of prompt performance, accuracy reflects correctness with respect to gro… view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts. Drawing on Inter-Rater Reliability, IPR is measured by Pairwise Agreement Rate (PAR) and its distribution to capture both consistency and stochasticity in model behavior. We evaluate this framework on two tasks with distinct properties: TREC (interpretative) and Politifact (knowledge-anchored). Results show that LLM annotation exhibits substantial stochastic variation in interpretative tasks, while appearing more stable in knowledge-based tasks. We further show that majority voting across prompts significantly improves reproducibility and reduces variance. These findings suggest that LLM prompt acts as an instrumental measurement while its wording exhibits methodological uncertainty. For future LLM-based CSS studies, we suggest that researchers move beyond single-prompt evaluation toward distributional stability and prompt aggregation within our IPR framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable IPR framework and PAR metric to quantify how much LLM annotations shift with prompt rewording, with the main finding that interpretive tasks vary more than knowledge ones and majority voting stabilizes them.

read the letter

The core contribution here is a straightforward adaptation of inter-rater ideas to prompt variation. They define Inter-Prompt Reliability via Pairwise Agreement Rate across linguistically different but intended-to-be-equivalent prompts, then apply it to TREC (interpretive) and Politifact (knowledge-anchored). The results indicate higher stochasticity on the interpretive side and a clear lift from majority voting across prompts. That matches what people in computational social science are already noticing in practice, so the framing is useful even if the numbers are preliminary.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLM annotation exhibits substantial stochastic variation in interpretative tasks (TREC) but appears more stable in knowledge-anchored tasks (Politifact). It introduces the Inter-Prompt Reliability (IPR) framework, drawing on inter-rater reliability concepts and measured via Pairwise Agreement Rate (PAR) and its distribution, to evaluate stability across linguistically varied but semantically equivalent prompts. The authors further show that majority voting across prompts improves reproducibility and reduces variance, concluding that prompt wording acts as an instrumental measurement with methodological uncertainty and recommending distributional stability and prompt aggregation for future LLM-based CSS studies.

Significance. If the empirical claims hold, the work is significant for computational social science by identifying prompt variation as a source of unreliability in LLM annotation and providing the IPR framework as a concrete evaluation tool. The distinction between interpretative and knowledge-anchored tasks adds useful nuance, and the majority-voting result offers an immediately actionable mitigation strategy that could improve reproducibility in labeling pipelines.

major comments (2)

The central claim attributes PAR differences between tasks to stochasticity and task type, but this requires that the linguistically varied prompts remain semantically equivalent; no human equivalence ratings, embedding-based similarity thresholds, or other controls are reported in the methods or experimental setup to rule out semantic drift.
Results section: the abstract states results on variation and majority voting but provides no details on sample sizes, exact prompt generation method, statistical tests, or error bars, so the data support for the claims of substantial stochastic variation and variance reduction cannot be verified.

minor comments (1)

The definition and formula for Pairwise Agreement Rate (PAR) should be stated explicitly with an equation to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions that will be incorporated to improve clarity and rigor.

read point-by-point responses

Referee: The central claim attributes PAR differences between tasks to stochasticity and task type, but this requires that the linguistically varied prompts remain semantically equivalent; no human equivalence ratings, embedding-based similarity thresholds, or other controls are reported in the methods or experimental setup to rule out semantic drift.

Authors: We agree that explicit verification of semantic equivalence is necessary to support the attribution of PAR differences to task type rather than unintended semantic drift. In the original work, prompts were constructed via systematic paraphrasing that preserved core meaning and task instructions, but we did not report quantitative controls. We will revise the Methods section to describe the prompt-generation procedure in detail and add embedding-based validation: average cosine similarity (using Sentence-BERT embeddings) across prompt variants will be reported, with a minimum threshold of 0.80 for retention. A small-scale human equivalence rating study (three annotators, 50 prompt pairs) will also be included to confirm that variants are judged equivalent at rates above 85%. These additions will appear in the revised Methods and Experimental Setup. revision: yes
Referee: Results section: the abstract states results on variation and majority voting but provides no details on sample sizes, exact prompt generation method, statistical tests, or error bars, so the data support for the claims of substantial stochastic variation and variance reduction cannot be verified.

Authors: We acknowledge that the current Results section is insufficiently detailed for independent verification. The experiments used 1,000 TREC instances and 500 Politifact instances, each evaluated with five prompt variants. Prompt generation combined rule-based paraphrasing with manual review; variance reduction was assessed via bootstrap resampling (1,000 iterations) and Wilcoxon signed-rank tests. We will expand the Results section to report exact sample sizes, the full prompt-generation protocol, all statistical test statistics and p-values, and error bars on every figure and table. These changes will make the empirical support fully transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework and metrics defined independently of results

full rationale

The paper introduces Inter-Prompt Reliability (IPR) by explicit analogy to the established concept of Inter-Rater Reliability, defines Pairwise Agreement Rate (PAR) as its measurement, and reports empirical distributions on two fixed tasks (TREC and Politifact). No equations, fitted parameters, or self-citations are shown to reduce the stability findings to definitional equivalence or input data by construction. The derivation chain remains self-contained against external benchmarks of reliability measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that prompts can be made semantically equivalent while varying in wording; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Prompts can be varied linguistically while remaining semantically equivalent.
Invoked in the definition of IPR and the construction of test prompts.

pith-pipeline@v0.9.0 · 5467 in / 1142 out tokens · 41745 ms · 2026-05-13T21:38:11.877627+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts

What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling Jingyuan Liu Boston University Abstract Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduce...

work page 2023
[2]

As emphasized by Krippendorff (2018), the scientific value of a label lies not in its existence, but in its ability to facilitate stable social inference

Unlike in natural language processing (NLP), where labels are often treated as end outputs of predictive optimization, in social science, annotations serve as an instrumental proxy to identify a construct or concept within the text (Grimmer & Stewart, 2013; Kleinheksel et al., 2020). As emphasized by Krippendorff (2018), the scientific value of a label li...

work page 2013
[3]

However, it is treated more as a model performance issue, leading to development of new evaluation metrics like sensitivity and consistency (Federico et al., 2025)

and example-strategy. However, it is treated more as a model performance issue, leading to development of new evaluation metrics like sensitivity and consistency (Federico et al., 2025). However, CSS studies of LLM-based annotations remains relying heavily on singe-prompt, simple-run (Gilardi et al., 2023; Mellon et al., 2024; Hoes et al., 2023; Castro-Go...

work page 2025
[4]

Such practice may unintentionally lead to a ‘prompt cherry-picking’ result, adding prompt wording as a new uncontrolled variable

design. Such practice may unintentionally lead to a ‘prompt cherry-picking’ result, adding prompt wording as a new uncontrolled variable. This ignorance of prompt itself as a measurement instrument could lead to serious distorted view of LLM capabilities and compromise replicability of findings. Thus, our focus is thus not just accuracy of LLM annotation,...

work page 2018
[5]

since the emergence of CSS (Lazer et al. 2020). However, annotation has long been a major obstacle that has prevented them from being used more widely (Rao, 2023). Researchers have to conduct original annotations to ensure that the labels match their categories (Benoit et al., 2016). Mostly, these works have been done with expert coders or crowd workers o...

work page 2020
[6]

(Sclar et al.,

and sequences of examples(Zhao et al., 2021).Even prompt formatting, including casing, space, separator could lead to great performance spread of accuracy from 0.036 to 0.804. (Sclar et al.,

work page 2021
[7]

with discrete labels, we define this agreement as: 𝑃𝐴𝑅!,

Besides, model parameters including temperature also have a huge influence(Holtzman et al., 2020), while this does not fall within the scope of this paper because we treat prompt itself as the single variant here. In the field of NLP, the stochasticity is normally discussed as the robustness of models themselves, being a part of evaluation metrics of mode...

work page 2020
[8]

−𝜇()*7+!,

Standard Deviation of PAR: Measures the dispersion of agreement across prompts, capturing the extent of prompt wording-induced stochasticity. 𝜎()* =916(+7−1(6𝑃𝐴𝑅!,"−𝜇()*7+!," This method allows us to distinguish between two aspects of reliability: the mean agreement which reflects overall consistency, and the dispersion of agreement (SD), which captures t...

work page 2001
[9]

The TREC dataset requires the model to categorize an open-ended question into six discrete class (e.g

and Politifact (Misra, et al., 2022). The TREC dataset requires the model to categorize an open-ended question into six discrete class (e.g. Abbreviation, Entity, Human). Since this task relies primarily on syntactic and semantic analysis with relatively low dependence on external knowledge, we define it as a ‘Soft Task’. Furthermore, the categories are m...

work page 2022
[10]

As shown, for GPT-4o mini, the mean PAR rise from 0.71 to nearly 0.9 while standard deviation dropped by 0.8 from k=1 to k=5

in the TREC tasks. As shown, for GPT-4o mini, the mean PAR rise from 0.71 to nearly 0.9 while standard deviation dropped by 0.8 from k=1 to k=5. Beyond overall improvement, we also discover that the variance reduction is particularly significant at a lower value of k, where SD dropped sharply from k=1 to k=3. This indicates even a small ensemble of prompt...

work page 2025
[11]

Busby, Nancy Fulda, Joshua R

Better Zero-Shot Reasoning with Role-Play Prompting. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 4099–4113, Mexico City, Mexico. Association for Computational Linguistics. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C,...

work page doi:10.1017/pan.2023.2 2024
[12]

In Proceedings of the first international conference on Human language technology research (HLT '01)

Toward semantics-based answer pinpointing. In Proceedings of the first international conference on Human language technology research (HLT '01). Association for Computational Linguistics, USA, 1–7. https://doi.org/10.3115/1072133.1072221 Federico Errica, Davide Sanvito, Giuseppe Siracusano, and Roberto Bifulco

work page doi:10.3115/1072133.1072221
[13]

What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 1543–1558, Albuquerque, New Mexico. Association for Computational Linguistics. F...

work page doi:10.1073/pnas.2305016120 2025
[14]

In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)

Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25). Association for Computing Machinery, New York, NY , USA, Article 481, 1–17. https://doi.org/10.1145/3706598.3713120 Holtzman, Ari & Buys, Jan & Forbes, Maxwell & Choi, Yejin. (2019). The ...

work page doi:10.1145/3706598.3713120 2025
[15]

In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23)

A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23). The Hillside Group, USA, Article 5, 1–31. K. Benoit, D. Conway, B. E. Lauderdale, M. Laver, S. Mikhaylov, Crowd-sourced text analysis: Reproducible and agile production of political data. Am. Polit. Sci. R...

work page 2016
[16]

From voices to validity: Leveraging Large Language Models (LLMS) for Textual Analysis of Policy Stakeholder Interviews

https://doi.org/10.5688/ajpe7113 Krippendorff, K. (2019). Content analysis. SAGE Publications, Inc., https://doi.org/10.4135/9781071878781 Liu, A. and M. Sun. 2023.“From voices to validity: Leveraging Large Language Models (LLMS) for Textual Analysis of Policy Stakeholder Interviews.” arXiv preprint arXiv:2312.01202. McHugh M. L. (2012). Interrater reliab...

work page doi:10.5688/ajpe7113 2019
[17]

In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20)

Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20). Curran Associates Inc., Red Hook, NY , USA, Article 159, 1877–1901. Vikan M, Aryan R, Kannelønning MS, Riegler MA, Danielsen SO. Reflecting on LLM Support in Reflexive Thematic Analysis: An Exploratory Study. Qual...

work page doi:10.1177/10497323251365211 1901
[18]

In Proceedings of the 19th international conference on Computational linguistics - Vo l u m e 1 ( C O L I N G ' 0 2 )

Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics - Vo l u m e 1 ( C O L I N G ' 0 2 ) . A s s o c i a t i o n f o r C o m p u t a t i o n a l L i n g u i s t i c s , U S A , 1–7. https://doi.org/10.3115/1072228.1072378 Zhao, Tony & Wallace, Eric & Feng, Shi & Klein, Dan & Singh, Sameer. (2021)...

work page doi:10.3115/1072228.1072378 2021

[1] [1]

This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts

What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling Jingyuan Liu Boston University Abstract Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduce...

work page 2023

[2] [2]

As emphasized by Krippendorff (2018), the scientific value of a label lies not in its existence, but in its ability to facilitate stable social inference

Unlike in natural language processing (NLP), where labels are often treated as end outputs of predictive optimization, in social science, annotations serve as an instrumental proxy to identify a construct or concept within the text (Grimmer & Stewart, 2013; Kleinheksel et al., 2020). As emphasized by Krippendorff (2018), the scientific value of a label li...

work page 2013

[3] [3]

However, it is treated more as a model performance issue, leading to development of new evaluation metrics like sensitivity and consistency (Federico et al., 2025)

and example-strategy. However, it is treated more as a model performance issue, leading to development of new evaluation metrics like sensitivity and consistency (Federico et al., 2025). However, CSS studies of LLM-based annotations remains relying heavily on singe-prompt, simple-run (Gilardi et al., 2023; Mellon et al., 2024; Hoes et al., 2023; Castro-Go...

work page 2025

[4] [4]

Such practice may unintentionally lead to a ‘prompt cherry-picking’ result, adding prompt wording as a new uncontrolled variable

design. Such practice may unintentionally lead to a ‘prompt cherry-picking’ result, adding prompt wording as a new uncontrolled variable. This ignorance of prompt itself as a measurement instrument could lead to serious distorted view of LLM capabilities and compromise replicability of findings. Thus, our focus is thus not just accuracy of LLM annotation,...

work page 2018

[5] [5]

since the emergence of CSS (Lazer et al. 2020). However, annotation has long been a major obstacle that has prevented them from being used more widely (Rao, 2023). Researchers have to conduct original annotations to ensure that the labels match their categories (Benoit et al., 2016). Mostly, these works have been done with expert coders or crowd workers o...

work page 2020

[6] [6]

(Sclar et al.,

and sequences of examples(Zhao et al., 2021).Even prompt formatting, including casing, space, separator could lead to great performance spread of accuracy from 0.036 to 0.804. (Sclar et al.,

work page 2021

[7] [7]

with discrete labels, we define this agreement as: 𝑃𝐴𝑅!,

Besides, model parameters including temperature also have a huge influence(Holtzman et al., 2020), while this does not fall within the scope of this paper because we treat prompt itself as the single variant here. In the field of NLP, the stochasticity is normally discussed as the robustness of models themselves, being a part of evaluation metrics of mode...

work page 2020

[8] [8]

−𝜇()*7+!,

Standard Deviation of PAR: Measures the dispersion of agreement across prompts, capturing the extent of prompt wording-induced stochasticity. 𝜎()* =916(+7−1(6𝑃𝐴𝑅!,"−𝜇()*7+!," This method allows us to distinguish between two aspects of reliability: the mean agreement which reflects overall consistency, and the dispersion of agreement (SD), which captures t...

work page 2001

[9] [9]

The TREC dataset requires the model to categorize an open-ended question into six discrete class (e.g

and Politifact (Misra, et al., 2022). The TREC dataset requires the model to categorize an open-ended question into six discrete class (e.g. Abbreviation, Entity, Human). Since this task relies primarily on syntactic and semantic analysis with relatively low dependence on external knowledge, we define it as a ‘Soft Task’. Furthermore, the categories are m...

work page 2022

[10] [10]

As shown, for GPT-4o mini, the mean PAR rise from 0.71 to nearly 0.9 while standard deviation dropped by 0.8 from k=1 to k=5

in the TREC tasks. As shown, for GPT-4o mini, the mean PAR rise from 0.71 to nearly 0.9 while standard deviation dropped by 0.8 from k=1 to k=5. Beyond overall improvement, we also discover that the variance reduction is particularly significant at a lower value of k, where SD dropped sharply from k=1 to k=3. This indicates even a small ensemble of prompt...

work page 2025

[11] [11]

Busby, Nancy Fulda, Joshua R

Better Zero-Shot Reasoning with Role-Play Prompting. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 4099–4113, Mexico City, Mexico. Association for Computational Linguistics. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C,...

work page doi:10.1017/pan.2023.2 2024

[12] [12]

In Proceedings of the first international conference on Human language technology research (HLT '01)

Toward semantics-based answer pinpointing. In Proceedings of the first international conference on Human language technology research (HLT '01). Association for Computational Linguistics, USA, 1–7. https://doi.org/10.3115/1072133.1072221 Federico Errica, Davide Sanvito, Giuseppe Siracusano, and Roberto Bifulco

work page doi:10.3115/1072133.1072221

[13] [13]

What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 1543–1558, Albuquerque, New Mexico. Association for Computational Linguistics. F...

work page doi:10.1073/pnas.2305016120 2025

[14] [14]

In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)

Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25). Association for Computing Machinery, New York, NY , USA, Article 481, 1–17. https://doi.org/10.1145/3706598.3713120 Holtzman, Ari & Buys, Jan & Forbes, Maxwell & Choi, Yejin. (2019). The ...

work page doi:10.1145/3706598.3713120 2025

[15] [15]

In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23)

A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP '23). The Hillside Group, USA, Article 5, 1–31. K. Benoit, D. Conway, B. E. Lauderdale, M. Laver, S. Mikhaylov, Crowd-sourced text analysis: Reproducible and agile production of political data. Am. Polit. Sci. R...

work page 2016

[16] [16]

From voices to validity: Leveraging Large Language Models (LLMS) for Textual Analysis of Policy Stakeholder Interviews

https://doi.org/10.5688/ajpe7113 Krippendorff, K. (2019). Content analysis. SAGE Publications, Inc., https://doi.org/10.4135/9781071878781 Liu, A. and M. Sun. 2023.“From voices to validity: Leveraging Large Language Models (LLMS) for Textual Analysis of Policy Stakeholder Interviews.” arXiv preprint arXiv:2312.01202. McHugh M. L. (2012). Interrater reliab...

work page doi:10.5688/ajpe7113 2019

[17] [17]

In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20)

Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20). Curran Associates Inc., Red Hook, NY , USA, Article 159, 1877–1901. Vikan M, Aryan R, Kannelønning MS, Riegler MA, Danielsen SO. Reflecting on LLM Support in Reflexive Thematic Analysis: An Exploratory Study. Qual...

work page doi:10.1177/10497323251365211 1901

[18] [18]

In Proceedings of the 19th international conference on Computational linguistics - Vo l u m e 1 ( C O L I N G ' 0 2 )

Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics - Vo l u m e 1 ( C O L I N G ' 0 2 ) . A s s o c i a t i o n f o r C o m p u t a t i o n a l L i n g u i s t i c s , U S A , 1–7. https://doi.org/10.3115/1072228.1072378 Zhao, Tony & Wallace, Eric & Feng, Shi & Klein, Dan & Singh, Sameer. (2021)...

work page doi:10.3115/1072228.1072378 2021