An ordinal measure of interrater absolute agreement

Daniela Marella; Giuseppe Bove; Pier Luigi Conti

arxiv: 1907.09756 · v1 · pith:ZJA5436Fnew · submitted 2019-07-23 · 📊 stat.ME

Pith reviewed 2026-05-24 17:30 UTC · model grok-4.3

classification 📊 stat.ME

keywords interrater agreement · ordinal scales · dispersion index · absolute agreement · variance restriction · bootstrap confidence intervals · unbiased estimator

The pith

A measure of interrater absolute agreement for ordinal scales is constructed from Leti's dispersion index to avoid variance restriction problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to quantify how much different raters agree exactly on ordinal ratings. Traditional measures can suffer when raters show little variation in their scores, making agreement appear artificially low or high. By building on an existing index of dispersion for ordinal data, the new measure sidesteps this issue. An unbiased estimator is derived, and methods for confidence intervals are developed using both theory and resampling. The approach is tested on simulated and real data to show it works in practice.

Core claim

The authors introduce an interrater absolute agreement measure for ordinal variables that capitalizes on Leti's dispersion index. This construction avoids the restriction of variance issue that can affect traditional agreement measures. They provide an unbiased estimator, study its sampling properties, and develop asymptotic and bootstrap confidence intervals, demonstrating accuracy through simulations and a real application.

What carries the argument

Leti's dispersion index for ordinal variables, adapted as the foundation for an absolute agreement measure between multiple raters.
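
To make the construction concrete, here is a minimal sketch of Leti's index and of an agreement score built from it. It assumes the common textbook form of the index, D = 2 Σ F_k (1 − F_k) over the first K − 1 cumulative relative frequencies, with maximum (K − 1)/2, and takes agreement as one minus the normalized index. The function names `leti_index` and `agreement` are illustrative, and the paper's exact normalization and estimator may differ.

```python
import numpy as np

def leti_index(ratings, n_categories):
    """Leti's dispersion index for ratings coded 1..n_categories (assumed form)."""
    ratings = np.asarray(ratings)
    # Count how many raters chose each category.
    counts = np.bincount(ratings - 1, minlength=n_categories)
    # Cumulative relative frequencies F_1..F_{K-1} (F_K is always 1 and adds nothing).
    cum_freq = np.cumsum(counts)[:-1] / ratings.size
    return 2.0 * np.sum(cum_freq * (1.0 - cum_freq))

def agreement(ratings, n_categories):
    """One minus the normalized dispersion; 1 means every rater chose the same category."""
    d_max = (n_categories - 1) / 2.0  # reached when raters split evenly between the two extremes
    return 1.0 - leti_index(ratings, n_categories) / d_max

print(agreement([3, 3, 3, 3], 5))  # all raters agree
print(agreement([1, 1, 5, 5], 5))  # raters split between the extremes
```

Because the score depends only on how one set of ratings is spread over the scale, it does not involve a between-target variance term, which is where variance restriction enters reliability-type coefficients.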

If this is right

The new measure provides a direct quantification of absolute agreement without being distorted by low variance in ratings.
An unbiased estimator allows reliable point estimation of the agreement level.
Both asymptotic theory and bootstrap methods yield valid confidence intervals for the measure.
Simulations confirm the procedure's accuracy for assessing agreement in ordinal data.
Application to real data illustrates practical utility in fields using ordinal scales.
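
The bootstrap interval mentioned above can be sketched as a plain percentile bootstrap over raters. This is an illustration under assumptions, not the paper's procedure: it resamples raters independently with replacement, uses the plug-in agreement score (one minus Leti's index divided by its maximum), and the names `agreement` and `percentile_ci` are hypothetical. The paper may use a different resampling scheme.

```python
import numpy as np

def agreement(ratings, n_categories):
    """Plug-in agreement: 1 - D / D_max, with D = 2 * sum F_k (1 - F_k) (assumed form)."""
    ratings = np.asarray(ratings)
    counts = np.bincount(ratings - 1, minlength=n_categories)
    cum_freq = np.cumsum(counts)[:-1] / ratings.size
    dispersion = 2.0 * np.sum(cum_freq * (1.0 - cum_freq))
    return 1.0 - dispersion / ((n_categories - 1) / 2.0)

def percentile_ci(ratings, n_categories, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample raters, recompute, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(ratings, size=ratings.size, replace=True)
        stats[b] = agreement(resample, n_categories)
    alpha = 1.0 - level
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lower), float(upper)

ratings = [1, 2, 2, 3, 3, 3, 4, 5]  # made-up ratings from eight raters on a 5-point scale
print(agreement(ratings, 5), percentile_ci(ratings, 5))
```

With few raters the bootstrap distribution is coarse and can pile up at the boundary value 1, so intervals from a sketch like this should be read with caution.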

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the measure performs well, it could replace or supplement kappa-like statistics in settings where raters tend to use similar score ranges.
Extensions might include incorporating weights for different disagreement levels or handling missing ratings.
The approach could be adapted to other types of categorical data beyond ordinal.
Further work might compare its power to detect disagreement against existing methods in large samples.

Load-bearing premise

Leti's dispersion index provides a suitable basis for measuring absolute agreement between raters on ordinal scales without introducing new biases.

What would settle it

A simulation where the new measure still shows variance restriction effects similar to traditional measures, or where its estimator is biased in finite samples.
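
A finite-sample bias check of this kind could look like the sketch below. It assumes raters are drawn independently from a fixed category distribution and compares the plug-in dispersion with a version rescaled by n/(n − 1). That rescaling is the standard correction under independent sampling and is used here as an assumption about the form of the paper's unbiased estimator. The function name `simulate_bias` and the example probabilities are made up.

```python
import numpy as np

def simulate_bias(probs, n_raters, n_reps=20000, seed=0):
    """Monte Carlo bias of the plug-in and n/(n-1)-corrected Leti dispersion."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    # Population value of D = 2 * sum F_k (1 - F_k) from the true cumulative probabilities.
    true_cum = np.cumsum(probs)[:-1]
    true_d = 2.0 * np.sum(true_cum * (1.0 - true_cum))
    # Each row holds the category counts for one simulated panel of raters.
    counts = rng.multinomial(n_raters, probs, size=n_reps)
    cum = np.cumsum(counts, axis=1)[:, :-1] / n_raters
    plug_in = 2.0 * np.sum(cum * (1.0 - cum), axis=1)
    corrected = plug_in * n_raters / (n_raters - 1)
    return plug_in.mean() - true_d, corrected.mean() - true_d

bias_plug_in, bias_corrected = simulate_bias([0.2, 0.3, 0.5], n_raters=5)
print(bias_plug_in, bias_corrected)
```

Under these assumptions the plug-in version should underestimate dispersion by roughly a factor of 1/n, while the corrected version should show a bias close to zero up to Monte Carlo error.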

Figures

Figures reproduced from arXiv: 1907.09756 by Daniela Marella, Giuseppe Bove, Pier Luigi Conti.

**Figure 1.** Kernel density estimate of the d index from the 1000 original samples. The percentile method performs well, with coverage probability larger than 91%. The worst methods are the Pivot and T-int methods. The lower and upper error rates, ...

Original abstract

A measure of interrater absolute agreement for ordinal scales is proposed, capitalizing on the dispersion index for ordinal variables proposed by Giuseppe Leti. The procedure avoids the problem of restriction of variance that sometimes affects traditional measures of interrater agreement in different fields of application. An unbiased estimator of the proposed measure is introduced and its sampling properties are investigated. To construct confidence intervals for interrater absolute agreement, both asymptotic results and bootstrapping methods are used and their performance is evaluated. Simulated data are employed to demonstrate the accuracy and practical utility of the new procedure for assessing agreement. Finally, an application to a real case is provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adapts Leti's ordinal dispersion index into an absolute agreement measure for ordinal ratings, supplies an unbiased estimator plus delta-method and bootstrap CIs, and checks performance with simulations and one real example.

The core contribution is a direct translation of Leti's dispersion index into an interrater agreement quantity that is defined on the joint rating distribution rather than on a chance-corrected coefficient. This construction is meant to sidestep the variance-restriction problem that affects many traditional measures when raters tend to use only part of the scale. The authors also give an unbiased estimator, derive its sampling properties, and build confidence intervals both asymptotically and by bootstrap. They then evaluate coverage and bias in simulations that vary the number of raters, the number of categories, and the true agreement level, and close with one applied example.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a measure of interrater absolute agreement for ordinal scales constructed by adapting Giuseppe Leti's dispersion index for ordinal variables (typically via 1 minus a normalized dispersion). It derives an unbiased estimator, establishes sampling properties, constructs confidence intervals via delta-method asymptotics and bootstrap, evaluates bias/variance/coverage through simulations across numbers of raters, categories, and agreement levels, and illustrates the procedure on a real dataset.
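
For reference, the "1 minus a normalized dispersion" construction can be written out as follows. This is a sketch in generic notation: the symbols K, F_k, D, d, a, and n are chosen here for illustration and may not match the paper's, and the unbiasedness statement assumes raters are sampled independently.

```latex
% K ordered categories; F_k = cumulative relative frequency up to category k.
\begin{align*}
D &= 2 \sum_{k=1}^{K-1} F_k \,(1 - F_k), &
D_{\max} &= \frac{K-1}{2}, \\
d &= \frac{D}{D_{\max}}, &
a &= 1 - d .
\end{align*}
% With n independently sampled raters and sample cumulative frequencies \hat F_k:
\begin{align*}
\mathbb{E}\bigl[\hat F_k (1 - \hat F_k)\bigr]
  &= F_k - \Bigl(F_k^2 + \frac{F_k (1 - F_k)}{n}\Bigr)
   = \frac{n-1}{n}\, F_k (1 - F_k), \\
\mathbb{E}\Bigl[\frac{n}{n-1}\, \hat D\Bigr] &= D .
\end{align*}
```

The maximum D_max is attained when half of the ratings fall in the lowest category and half in the highest, so a = 0 corresponds to that maximal split and a = 1 to unanimous ratings.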

Significance. If the construction and simulation results hold, the index supplies a direct, non-chance-corrected alternative to measures such as weighted kappa that can suffer from marginal variance restriction. The explicit unbiased estimator, dual CI methods, and simulation coverage checks constitute concrete strengths that would make the contribution useful in applied settings where ordinal ratings are common.

minor comments (3)

[Abstract] The abstract states that the procedure 'allows to avoid the problem of restriction of variance' but does not indicate the precise mechanism (joint-distribution definition versus marginal normalization); a single clarifying sentence would improve readability.
[Simulations] In the simulation section, coverage results are presented for selected combinations of raters and categories; adding a brief table or figure summarizing coverage across the full grid (including low-agreement and high-category cases) would make the performance claims easier to assess.
[Application] The real-data example would be strengthened by a side-by-side numerical comparison with at least one conventional index (e.g., quadratic-weighted kappa) on the same ratings to illustrate the practical difference.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation of minor revision. The report does not list any specific major comments requiring a point-by-point response.

Circularity Check

0 steps flagged

No significant circularity

The paper defines its interrater agreement index by direct adaptation of Leti's external dispersion measure for ordinal data (1 minus a normalized form of the index), then applies standard unbiased estimation, delta-method asymptotics, and bootstrap for inference. No equation reduces the proposed quantity to a fitted parameter or self-citation by construction; the core definition operates on the joint rating distribution independently of variance-restriction issues in chance-corrected coefficients. All load-bearing steps (estimator derivation, sampling properties, CI construction) rest on external statistical machinery and the cited Leti index rather than internal self-reference or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central proposal depends on the validity of Leti's dispersion index as a dispersion measure for ordinal data and on standard regularity conditions for asymptotic and bootstrap inference; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: Leti's dispersion index is a valid and appropriate measure of dispersion for ordinal variables. The new agreement measure is explicitly constructed by capitalizing on this index.
standard math: Standard asymptotic and bootstrap theory applies to the proposed estimator. Used to construct confidence intervals whose performance is evaluated by simulation.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Booth, J. G., Butler, R. W., and Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical Association, 89(428), 1282–1289.
[2] Bove, G., Nuzzo, E., Serafini, A. (2018). Measurement of interrater agreement for the assessment of language proficiency. In: S. Capecchi, Di Iorio F., Simone R. ASMOD 2018: Proceedings of the Advanced Statistical Modelling for Ordinal Data Conference. Università Federico II di Napoli, 24–26 October 2018. Napoli: FedOAPress, 61–68.
[3] Grilli, L., Rampichini, C. (2002). Scomposizione della dispersione per variabili statistiche ordinali [Dispersion decomposition for ordinal variables]. Statistica, 62, 111–116.
[4] Gross, S. (1980). Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 181–184.
[5] James, L. J., Demaree, R. G., Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
[6] James, L. J., Demaree, R. G., Wolf, G. (1993). rwg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.
[7] Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1), 1–26.
[8] Kuiken, F., Vedder, I. (2017). Functional adequacy in L2 writing. Towards a new rating scale. Language Testing, 34, 321–336.
[9] LeBreton, J. M., Burgess, J. R. D., Kaiser, R. B., Atchley, E. K., James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6, 80–128.
[10] LeBreton, J. M., Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815–852.
[11] Leti, G. (1983). Statistica descrittiva. Il Mulino, Bologna.
[12] Lomnicki, Z. A. (1952). The Standard Error of Gini's Mean Difference. The Annals of Mathematical Statistics, 23, 14, 635–637.
[13] Mashreghi, Z., Haziza, D., Léger, C. (2016). A survey of bootstrap methods in finite population sampling. Statistics Surveys, 10, 1–52.
[14] McGraw, K. O., Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
[15] Nuzzo, E., Bove, G. (2018). Assessing functional adequacy across tasks: A comparison of learners and native speakers' written texts (submitted for publication).
[16] Piccarreta, R. (2001). A new measure of nominal-ordinal association. Journal of Applied Statistics, 28, 1, 107–120.
[17] Sheather, S. J. and Jones, M. C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society Series B, 53, 683–690.
[18] Shrout, P. E., and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing reliability. Psychological Bulletin, 86, 420–428.
[19] von Eye, A., Mun, E. Y. (2005). Analyzing rater agreement. Manifest variable methods. Lawrence Erlbaum Associates, Mahwah, New Jersey.
