
arxiv: 1907.09756 · v1 · pith:ZJA5436Fnew · submitted 2019-07-23 · 📊 stat.ME

An ordinal measure of interrater absolute agreement

Pith reviewed 2026-05-24 17:30 UTC · model grok-4.3

classification 📊 stat.ME
keywords interrater agreement · ordinal scales · dispersion index · absolute agreement · variance restriction · bootstrap confidence intervals · unbiased estimator

The pith

A measure of interrater absolute agreement for ordinal scales is constructed from Leti's dispersion index to avoid variance restriction problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to quantify how much different raters agree exactly on ordinal ratings. Traditional measures can suffer when raters show little variation in their scores, making agreement appear artificially low or high. By building on an existing index of dispersion for ordinal data, the new measure sidesteps this issue. An unbiased estimator is derived, and methods for confidence intervals are developed using both theory and resampling. The approach is tested on simulated and real data to show it works in practice.

Core claim

The authors introduce an interrater absolute agreement measure for ordinal variables that capitalizes on Leti's dispersion index. This construction avoids the restriction of variance issue that can affect traditional agreement measures. They provide an unbiased estimator, study its sampling properties, and develop asymptotic and bootstrap confidence intervals, demonstrating accuracy through simulations and a real application.

What carries the argument

Leti's dispersion index for ordinal variables, adapted as the foundation for an absolute agreement measure between multiple raters.
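
As a rough illustration of how an agreement score can be built on a dispersion index, the sketch below uses the commonly cited form of Leti's index, D = 2 Σ F_k (1 − F_k) summed over k = 1, ..., K − 1, where F_k is the cumulative relative frequency of category k, and rescales it by its maximum (K − 1)/2. The function names, the coding of categories as integers 1..K, and the "1 minus normalized dispersion" form are assumptions made for illustration; the paper's exact definition may differ.

```python
from collections import Counter


def leti_dispersion(ratings, n_categories):
    """Leti-style dispersion D = 2 * sum_{k=1}^{K-1} F_k * (1 - F_k).

    `ratings` holds integer category codes in 1..n_categories
    (an assumed coding, not taken from the paper).
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    if any(r < 1 or r > n_categories for r in ratings):
        raise ValueError("ratings must lie in 1..n_categories")
    n = len(ratings)
    counts = Counter(ratings)
    cumulative = 0
    total = 0.0
    for k in range(1, n_categories):  # categories 1..K-1; F_K is always 1
        cumulative += counts.get(k, 0)
        f_k = cumulative / n
        total += f_k * (1 - f_k)
    return 2 * total


def normalized_agreement(ratings, n_categories):
    """Assumed agreement score: 1 - D / D_max, with D_max = (K - 1) / 2."""
    d_max = (n_categories - 1) / 2
    return 1 - leti_dispersion(ratings, n_categories) / d_max


if __name__ == "__main__":
    print(normalized_agreement([3, 3, 3, 3], 5))  # unanimous raters
    print(normalized_agreement([1, 1, 5, 5], 5))  # raters split between the extremes
```

With this form, unanimous ratings give a score of 1 and an even split between the two extreme categories gives 0, whatever the spread of scores across the rated targets.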

If this is right

  • The new measure provides a direct quantification of absolute agreement without being distorted by low variance in ratings.
  • An unbiased estimator allows reliable point estimation of the agreement level.
  • Both asymptotic theory and bootstrap methods yield valid confidence intervals for the measure.
  • Simulations confirm the procedure's accuracy for assessing agreement in ordinal data.
  • Application to real data illustrates practical utility in fields using ordinal scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the measure performs well, it could replace or supplement kappa-like statistics in settings where raters tend to use similar score ranges.
  • Extensions might include incorporating weights for different disagreement levels or handling missing ratings.
  • The approach could be adapted to other types of categorical data beyond ordinal.
  • Further work might compare its power to detect disagreement against existing methods in large samples.

Load-bearing premise

Leti's dispersion index provides a suitable basis for measuring absolute agreement between raters on ordinal scales without introducing new biases.

What would settle it

A simulation where the new measure still shows variance restriction effects similar to traditional measures, or where its estimator is biased in finite samples.
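
A finite-sample bias check of this kind could be set up along the following lines. This is a minimal sketch under stated assumptions: ratings are drawn i.i.d. from a fixed category distribution, the dispersion index takes the form D = 2 Σ F_k (1 − F_k), and the candidate unbiased estimator is the plug-in value multiplied by n/(n − 1). The function names and the example distribution are hypothetical, and the paper's own simulation design may differ.

```python
import random


def leti_dispersion(ratings, n_categories):
    """Plug-in dispersion 2 * sum_{k<K} F_k * (1 - F_k) from a sample of ratings."""
    n = len(ratings)
    total = 0.0
    for k in range(1, n_categories):
        f_k = sum(1 for r in ratings if r <= k) / n
        total += f_k * (1 - f_k)
    return 2 * total


def simulate_bias(probs, n_raters, n_reps=20000, seed=0):
    """Return (true D, mean plug-in estimate, mean corrected estimate)."""
    rng = random.Random(seed)
    n_categories = len(probs)
    categories = list(range(1, n_categories + 1))

    # Population value of D from the assumed category probabilities.
    cumulative, true_d = 0.0, 0.0
    for p in probs[:-1]:
        cumulative += p
        true_d += 2 * cumulative * (1 - cumulative)

    plug_in_sum = 0.0
    for _ in range(n_reps):
        sample = rng.choices(categories, weights=probs, k=n_raters)
        plug_in_sum += leti_dispersion(sample, n_categories)
    plug_in_mean = plug_in_sum / n_reps
    # Assumed correction factor n / (n - 1), valid under i.i.d. sampling.
    corrected_mean = plug_in_mean * n_raters / (n_raters - 1)
    return true_d, plug_in_mean, corrected_mean


if __name__ == "__main__":
    true_d, plug_in, corrected = simulate_bias([0.1, 0.2, 0.4, 0.2, 0.1], n_raters=5)
    print(f"true D = {true_d:.3f}, plug-in = {plug_in:.3f}, corrected = {corrected:.3f}")
```

If the corrected mean stayed clearly away from the population value as the number of replications grew, that would count as evidence against unbiasedness under this sampling scheme.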

Figures

Figures reproduced from arXiv: 1907.09756 by Daniela Marella, Giuseppe Bove, Pier Luigi Conti.

Figure 1: Kernel density estimate of the d index from the 1000 original samples. The percentile method performs well, with coverage probability larger than 91%. The worst methods are the Pivot and T-int methods.

Original abstract

A measure of interrater absolute agreement for ordinal scales is proposed capitalizing on the dispersion index for ordinal variables proposed by Giuseppe Leti. The procedure allows to avoid the problem of restriction of variance that sometimes affect traditional measures of interrater agreement in different fields of application. An unbiased estimator of the proposed measure is introduced and its sampling properties are investigated. In order to construct confidence intervals for interrater absolute agreement both asymptotic results and bootstrapping methods are used and their performance is evaluated. Simulated data are employed to demonstrate the accuracy and practical utility of the new procedure for assessing agreement. Finally, an application to a real case is provided.
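
To make the bootstrap step concrete, here is a minimal sketch of a percentile interval for a dispersion-based agreement score. It assumes the score is 1 − D/D_max with D = 2 Σ F_k (1 − F_k), uses the plug-in value rather than a bias-corrected one, and resamples the raters' ratings with replacement; the function names, defaults, and resampling scheme are illustrative choices, not the authors' procedure.

```python
import random


def agreement(ratings, n_categories):
    """Assumed score 1 - D / D_max, with D = 2 * sum_{k<K} F_k * (1 - F_k)."""
    n = len(ratings)
    total = 0.0
    for k in range(1, n_categories):
        f_k = sum(1 for r in ratings if r <= k) / n
        total += f_k * (1 - f_k)
    d_max = (n_categories - 1) / 2
    return 1 - (2 * total) / d_max


def percentile_ci(ratings, n_categories, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample ratings, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(ratings)
    stats = sorted(
        agreement(rng.choices(ratings, k=n), n_categories) for _ in range(n_boot)
    )
    lower_idx = int((alpha / 2) * n_boot)
    upper_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lower_idx], stats[upper_idx]


if __name__ == "__main__":
    ratings = [3, 3, 4, 3, 2, 3, 4, 3, 3, 5]  # ten hypothetical raters, 5-point scale
    print("point estimate:", agreement(ratings, 5))
    print("95% percentile interval:", percentile_ci(ratings, 5))
```

Other interval types, such as pivot or studentized intervals, would reuse the same resampling loop and differ only in how the bootstrap replicates are turned into endpoints.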

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a measure of interrater absolute agreement for ordinal scales constructed by adapting Giuseppe Leti's dispersion index for ordinal variables (typically via 1 minus a normalized dispersion). It derives an unbiased estimator, establishes sampling properties, constructs confidence intervals via delta-method asymptotics and bootstrap, evaluates bias/variance/coverage through simulations across numbers of raters, categories, and agreement levels, and illustrates the procedure on a real dataset.

Significance. If the construction and simulation results hold, the index supplies a direct, non-chance-corrected alternative to measures such as weighted kappa that can suffer from marginal variance restriction. The explicit unbiased estimator, dual CI methods, and simulation coverage checks constitute concrete strengths that would make the contribution useful in applied settings where ordinal ratings are common.

minor comments (3)
  1. [Abstract] The abstract states that the procedure 'allows to avoid the problem of restriction of variance' but does not indicate the precise mechanism (joint-distribution definition versus marginal normalization); a single clarifying sentence would improve readability.
  2. [Simulations] In the simulation section, coverage results are presented for selected combinations of raters and categories; adding a brief table or figure summarizing coverage across the full grid (including low-agreement and high-category cases) would make the performance claims easier to assess.
  3. [Application] The real-data example would be strengthened by a side-by-side numerical comparison with at least one conventional index (e.g., quadratic-weighted kappa) on the same ratings to illustrate the practical difference.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation of minor revision. The report does not list any specific major comments requiring a point-by-point response.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines its interrater agreement index by direct adaptation of Leti's external dispersion measure for ordinal data (1 minus a normalized form of the index), then applies standard unbiased estimation, delta-method asymptotics, and bootstrap for inference. No equation reduces the proposed quantity to a fitted parameter or self-citation by construction; the core definition operates on the joint rating distribution independently of variance-restriction issues in chance-corrected coefficients. All load-bearing steps (estimator derivation, sampling properties, CI construction) rest on external statistical machinery and the cited Leti index rather than internal self-reference or renaming.
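
For orientation, the "standard unbiased estimation" step can be sketched as follows, assuming the index takes the form D = 2 Σ F_k (1 − F_k) and that the n ratings are i.i.d., so that each empirical cumulative frequency is a mean of n Bernoulli indicators with success probability F_k. This is a textbook-style derivation under those assumptions, not a reproduction of the paper's proof, and the paper's sampling design may differ.

```latex
\[
D = 2\sum_{k=1}^{K-1} F_k\,(1 - F_k), \qquad
\hat{D} = 2\sum_{k=1}^{K-1} \hat{F}_k\,(1 - \hat{F}_k),
\]
\[
\mathbb{E}\bigl[\hat{F}_k(1-\hat{F}_k)\bigr]
  = \mathbb{E}[\hat{F}_k] - \operatorname{Var}(\hat{F}_k) - \bigl(\mathbb{E}[\hat{F}_k]\bigr)^2
  = F_k - \frac{F_k(1-F_k)}{n} - F_k^2
  = \frac{n-1}{n}\,F_k\,(1-F_k),
\]
\[
\mathbb{E}\!\left[\frac{n}{n-1}\,\hat{D}\right] = D .
\]
```

Under these assumptions the correction is a single multiplicative factor that does not depend on the unknown category distribution.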

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central proposal depends on the validity of Leti's dispersion index as a dispersion measure for ordinal data and on standard regularity conditions for asymptotic and bootstrap inference; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Leti's dispersion index is a valid and appropriate measure of dispersion for ordinal variables
    The new agreement measure is explicitly constructed by capitalizing on this index.
  • standard math Standard asymptotic and bootstrap theory applies to the proposed estimator
    Used to construct confidence intervals whose performance is evaluated by simulation.


