Latent Distribution Assumption for Unbiased and Consistent Consensus Modelling

Gleb Gusev; Pavel Serdyukov; Valentina Fedorova

arXiv: 1906.08776 · v1 · pith:KCPAH3QNnew · submitted 2019-06-20 · cs.HC · cs.LG · stat.ML

Pith reviewed 2026-05-25 19:33 UTC · model grok-4.3

classification: cs.HC · cs.LG · stat.ML

keywords: noisy label aggregation · crowdsourcing · latent distribution · consensus modeling · unbiased estimation · label ambiguity

The pith

Modeling each object with a distribution of possible labels instead of one fixed true label produces unbiased and consistent consensus from noisy annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to aggregate noisy labels from multiple annotators. Standard generative models rest on the premise that every object possesses exactly one hidden correct label. The authors replace that premise with a latent distribution assumption: each object is associated with its own probability distribution over labels, from which a subjective label is drawn anew on every observation. They argue that this change removes bias in the estimated consensus when tasks contain genuine ambiguity. Experiments on difficult tasks indicate that the distribution-based models recover more accurate aggregates than single-label baselines.

Core claim

Under the latent distribution assumption, each object has a fixed but unknown distribution over labels. Each time the object is observed, a latent subjective label is drawn from that distribution, and the annotator's noisy label is a possibly corrupted report of that draw. The paper claims that parameter estimation under this model yields unbiased and consistent estimates of the consensus distribution, whereas models that enforce a single true label per object remain biased when ambiguity is present.
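
The generative story can be sketched in a few lines of Python. This is a minimal simulation for the binary case under stated assumptions, not the paper's model: the single shared annotator accuracy `a`, the function names, and the simple moment estimator are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_noisy_labels(q, a, n, rng):
    """Draw n noisy binary labels for one object (illustrative assumptions).

    q: probability that a fresh subjective label equals 1 (per-object parameter)
    a: probability that an annotator reports the subjective label unchanged
    """
    subjective = rng.random(n) < q   # a new latent label is drawn for every observation
    kept = rng.random(n) < a         # the annotator reports it faithfully with probability a
    return np.where(kept, subjective, ~subjective).astype(int)

def estimate_q(labels, a):
    """Moment estimator: invert E[y] = a*q + (1 - a)*(1 - q). Requires a != 0.5."""
    return (labels.mean() - (1.0 - a)) / (2.0 * a - 1.0)

rng = np.random.default_rng(0)
q_true, accuracy = 0.3, 0.8

# Average the estimator over many replications of a small label set (n = 20).
estimates = [
    estimate_q(simulate_noisy_labels(q_true, accuracy, 20, rng), accuracy)
    for _ in range(20000)
]
mean_estimate = float(np.mean(estimates))
print(round(mean_estimate, 2))
```

The estimator is deliberately left unclipped: individual estimates can fall outside [0, 1], but clipping them would reintroduce bias. The sketch also assumes `a` is known, which a real aggregation model would have to estimate jointly.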

What carries the argument

The latent distribution assumption, which replaces the single-true-label premise with an object-specific probability distribution over labels that is sampled independently for each observation.

If this is right

Consensus estimates remain consistent even when annotators disagree because no single label is forced to be correct.
The model can output a full distribution over possible labels for each object rather than a point estimate.
Parameter learning stays tractable because the per-object distributions are estimated jointly with annotator accuracies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same assumption could be applied to active learning settings where the system chooses which objects to label next based on distribution entropy.
If the distribution is estimated per object, downstream classifiers trained on the aggregated labels may inherit calibrated uncertainty estimates.
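
As a rough illustration of the first extension, per-object label distributions could be ranked by Shannon entropy to decide which objects to send for more labels. The object ids, the `estimated` dictionary, and both function names are hypothetical; nothing here comes from the paper.

```python
import math

def label_entropy(dist):
    """Shannon entropy (in nats) of a label distribution given as probabilities."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def rank_by_entropy(estimated):
    """Return object ids ordered from most to least ambiguous."""
    return sorted(estimated, key=lambda obj: label_entropy(estimated[obj]), reverse=True)

# Hypothetical estimated distributions over two labels for three objects.
estimated = {"obj_a": [0.5, 0.5], "obj_b": [0.9, 0.1], "obj_c": [0.7, 0.3]}
ranking = rank_by_entropy(estimated)
print(ranking)
```

Entropy is only one possible acquisition score; it ignores how uncertain the estimate of the distribution itself is.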

Load-bearing premise

That the noisy labels observed for an object are independent draws from a single fixed distribution belonging to that object.

What would settle it

A controlled experiment in which the true label is known to be unique for every object and single-label models recover the consensus more accurately than distribution-based models on the same data.

Figures

Figures reproduced from arXiv: 1906.08776 by Gleb Gusev, Pavel Serdyukov, Valentina Fedorova.

**Figure 2.** Performance of the LA GLAD (red line with squares) and DA GLAD (green line with dots)

**Figure 3.** Calibration plots for two approaches to consensus

**Figure 4.** The expected values for q̂_LA given by (3) as a function of q for different numbers of noisy labels n, different values of a, and the uniform prior r = 0.5. The left plot is for n = 5, the middle one is for n = 10, and the right one is for n = 20. Different values of a are shown by colours.
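
The kind of quantity plotted in Figure 4 can be approximated with a short exact computation. Equation (3) is not reproduced on this page, so the sketch below is only one plausible reading and rests on loudly stated assumptions: the single-label estimate is taken to be the posterior P(z = 1 | labels) under prior r and a symmetric annotator accuracy a, while the labels themselves are generated by fresh draws from Bernoulli(q). All function names are hypothetical.

```python
from math import comb

def la_posterior(k, n, a, r=0.5):
    """P(z = 1 | k ones among n labels) under a single-true-label model
    with prior r and symmetric annotator accuracy a (illustrative assumptions)."""
    like_one = r * a**k * (1 - a)**(n - k)
    like_zero = (1 - r) * (1 - a)**k * a**(n - k)
    return like_one / (like_one + like_zero)

def expected_la_estimate(q, n, a, r=0.5):
    """Exact expectation of that posterior when each label comes from a fresh
    Bernoulli(q) draw that is flipped with probability 1 - a."""
    p = a * q + (1 - a) * (1 - q)  # marginal probability of observing a 1
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k) * la_posterior(k, n, a, r)
        for k in range(n + 1)
    )

# Tabulate the expectation for the three label counts used in the figure.
for n in (5, 10, 20):
    print(n, [round(expected_la_estimate(q, n, a=0.8), 3) for q in (0.1, 0.3, 0.5)])
```

Under these assumptions the expectation equals q at q = 0.5 by symmetry, and away from that point it generally differs from q, which is the sort of bias the figure is meant to display.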

Original abstract

We study the problem of aggregating noisy labels. Usually, it is solved by proposing a stochastic model for the process of generating noisy labels and then estimating the model parameters using the observed noisy labels. A traditional assumption underlying previously introduced generative models is that each object has one latent true label. In contrast, we introduce a novel latent distribution assumption, implying that a unique true label for an object might not exist, but rather each object might have a specific distribution generating a latent subjective label each time the object is observed. Our experiments showed that the novel assumption is more suitable for difficult tasks, when there is an ambiguity in choosing a "true" label for certain objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper replaces the single latent true label with a per-object distribution over labels for noisy aggregation, but the abstract alone gives no model, estimation steps, or data to support the unbiased and consistent claims.

The key takeaway is that this work replaces the single latent true label with a per-object distribution over labels for aggregating noisy annotations, but the abstract supplies no model, estimation method, or experimental details to back the unbiased and consistent claims.

What stands out as new is the explicit contrast with prior generative models that assume one hidden truth per item. By allowing a distribution that generates subjective labels on each observation, the approach tries to handle inherent ambiguity without pretending a unique correct answer exists. The abstract reports that experiments found this better suited to difficult tasks, which aligns with intuition about subjective data. That modeling shift is the main contribution here, and it does target a practical issue in crowdsourcing and HCI where labels can be genuinely contested.

The weaknesses are clear from the abstract alone. No equations appear for the generative story or the likelihood. No estimation procedure is described, whether EM or otherwise. Datasets, baselines, and metrics are missing, so the experimental support can't be checked. The title's promise of unbiased and consistent estimates therefore rests on unshown work.

The stress-test point about identifiability looks relevant: each object now carries its own distribution, which multiplies the parameters. Nothing indicates that the noisy labels provide enough signal to recover those distributions without bias or that the estimator converges. If the model is not identifiable, the consistency claim fails even in the large-sample limit.

This kind of paper would appeal to researchers focused on label aggregation methods for ambiguous or subjective classification problems. Someone building systems for real-world annotation tasks might want to see if the distribution assumption improves results in their domain. But the lack of technical content means it won't give them a usable method or a result they can rely on.

I wouldn't bring this to a reading group. It does not look ready for peer review; the authors would need to supply the actual model, the fitting algorithm, the data, and the validation before a referee could assess whether the claims hold.

Referee Report

3 major / 0 minor

Summary. The paper proposes replacing the standard single-latent-true-label assumption in noisy label aggregation with a 'latent distribution assumption,' under which each object is associated with a distribution over labels that generates a subjective label on each observation. The authors claim this yields unbiased and consistent consensus estimates and is more suitable for ambiguous or difficult tasks, as supported by experiments.

Significance. If a well-specified, identifiable model and supporting derivations were provided, the approach could offer a more flexible generative framework for crowdsourced labeling in subjective domains, addressing limitations of single-label models in ambiguous settings.

major comments (3)

[Abstract] The title and abstract assert that the novel assumption produces 'unbiased and consistent' consensus estimates, yet no model equations, likelihood function, parameter estimation procedure, or derivation of unbiasedness/consistency is supplied, rendering the central claims unverifiable.
[Abstract] The claim that 'our experiments showed that the novel assumption is more suitable for difficult tasks' is unsupported because no datasets, baselines, quantitative results, or statistical validation are described, preventing assessment of the experimental evidence.
[Abstract] Introducing a full per-object label distribution increases the parameter count relative to single-label models, but the manuscript supplies no argument or identifiability analysis showing that object-specific distributions can be recovered separately from annotator error rates, which directly undermines the unbiased/consistency title claim.
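
The identifiability concern can be made concrete with a one-object, binary worked example. The symbols are illustrative assumptions, not the paper's notation: $q$ is the object's probability of a subjective label 1, $a$ is a symmetric annotator accuracy, and $y$ is one observed noisy label.

```latex
\begin{align*}
P(y = 1) &= a\,q + (1 - a)(1 - q) = (1 - a) + (2a - 1)\,q, \\
q &= \frac{P(y = 1) - (1 - a)}{2a - 1} \qquad \left(a \neq \tfrac{1}{2}\right).
\end{align*}
```

If $a$ is known, $q$ can be read off the label frequency. If $a$ is unknown, one object's labels alone do not pin down the pair: $(a, q) = (0.8, 0.3)$ and $(a, q) = (0.9, 0.35)$ both give $P(y = 1) = 0.38$. Any identifiability argument would therefore have to draw on extra structure, such as annotators shared across objects or priors, which the abstract does not describe.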

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and specific comments on the abstract. We address each point below and agree that revisions to the abstract and manuscript are warranted to better support the central claims.

Referee: [Abstract] The title and abstract assert that the novel assumption produces 'unbiased and consistent' consensus estimates, yet no model equations, likelihood function, parameter estimation procedure, or derivation of unbiasedness/consistency is supplied, rendering the central claims unverifiable.

Authors: We agree that the provided abstract is high-level and does not contain the model equations or derivations, which limits immediate verifiability of the unbiasedness and consistency claims. The manuscript body defines the generative model under the latent distribution assumption, but we will revise the abstract to include a concise description of the model, likelihood, and estimation approach, and ensure the derivations are clearly referenced or expanded if needed.

Revision: yes

Referee: [Abstract] The claim that 'our experiments showed that the novel assumption is more suitable for difficult tasks' is unsupported because no datasets, baselines, quantitative results, or statistical validation are described, preventing assessment of the experimental evidence.

Authors: We agree the abstract does not detail the experimental evidence. The manuscript reports experiments on ambiguous labeling tasks, but to address this we will update the abstract to reference the datasets, baselines compared, and key quantitative findings supporting suitability for difficult tasks.

Revision: yes

Referee: [Abstract] Introducing a full per-object label distribution increases the parameter count relative to single-label models, but the manuscript supplies no argument or identifiability analysis showing that object-specific distributions can be recovered separately from annotator error rates, which directly undermines the unbiased/consistency title claim.

Authors: This is a substantive point. The per-object distributions do increase the parameter space to model ambiguity. We will add an explicit identifiability analysis section demonstrating recovery of the object distributions separately from annotator parameters, thereby supporting the unbiased and consistent estimation results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: novel assumption introduced independently of prior single-label models

The paper's central contribution is the explicit introduction of a new 'latent distribution assumption' that replaces the traditional single true label per object with a per-object distribution over latent labels. The abstract and title frame this as a modeling choice whose suitability is checked experimentally on ambiguous tasks. No equations, parameter-fitting steps, or self-citations are shown that would reduce the unbiased/consistency claims to the inputs by construction. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on abstract; the central modeling change rests on one domain assumption with no free parameters or invented entities specified.

axioms (1)

Domain assumption: each object has a specific distribution generating a latent subjective label on each observation, rather than a single true label.
This is the core novel assumption introduced to replace the traditional single-label generative model.
