Measuring the Expertise of Workers for Crowdsourcing Applications

Arnaud Martin (DRUID); Hosna Ouni (DRUID); Jean-Christophe Dubois (DRUID); Laetitia Gros; Mouloud Kharoune (DRUID); Yolande Le Gall (DRUID); Zoltán Miklós (DRUID)

arxiv: 1907.10588 · v1 · pith:NKS4NGSInew · submitted 2019-06-24 · 💻 cs.HC · cs.SI

Pith reviewed 2026-05-25 17:38 UTC · model grok-4.3

classification 💻 cs.HC cs.SI

keywords crowdsourcing · expertise measurement · belief functions · worker quality · Fagin distance · audio quality assessment · quality evaluation

The pith

A new expertise measure for crowdsourcing workers uses four factors from belief functions theory when an objective dataset exists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to measure the expertise of crowd workers on platforms by assuming access to a dataset that provides objective comparisons between items. Four factors are defined using the theory of belief functions to capture aspects of worker performance. This measure is compared to the Fagin distance on data from a real experiment involving audio recording quality assessments. The two approaches are then fused together. A sympathetic reader would care because more accurate expertise estimates could help platforms better assign tasks and improve the reliability of results obtained from the crowd.

Core claim

We propose an innovative measure of expertise assuming that we possess a dataset with an objective comparison of the items concerned. Our method is based on the definition of four factors with the theory of belief functions. We compare our method to the Fagin distance on a dataset from a real experiment, where users have to assess the quality of some audio recordings. Then, we propose to fuse both the Fagin distance and our expertise measure.
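As a rough illustration of the baseline, the sketch below assumes that "the Fagin distance" refers to the Kendall distance with a penalty parameter p for rankings with ties, as defined by Fagin et al. (2004). The function names, the dictionary-based ranking format, and the example data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_with_ties(rank_a, rank_b, p=0.5):
    """Kendall distance with penalty p between two rankings that may contain ties.

    Each ranking maps an item to its rank position; equal positions mean a tie.
    (Assumed input format, chosen for illustration only.)
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    total = 0.0
    for i, j in combinations(sorted(rank_a), 2):
        da = rank_a[i] - rank_a[j]
        db = rank_b[i] - rank_b[j]
        if da == 0 and db == 0:
            continue  # tied in both rankings: no penalty
        if da == 0 or db == 0:
            total += p  # tied in exactly one ranking: penalty p
        elif (da > 0) != (db > 0):
            total += 1.0  # strictly opposite order: full penalty
    return total


def normalized_distance(rank_a, rank_b, p=0.5):
    """Divide by the number of item pairs so the result lies in [0, 1] for p <= 1."""
    n = len(rank_a)
    pairs = n * (n - 1) // 2
    return kendall_with_ties(rank_a, rank_b, p) / pairs if pairs else 0.0


# Hypothetical data: an objective ordering of four recordings and one worker's ordering.
reference = {"a": 1, "b": 2, "c": 3, "d": 4}
worker = {"a": 1, "b": 1, "c": 4, "d": 3}
print(kendall_with_ties(reference, worker))    # 1.5
print(normalized_distance(reference, worker))  # 0.25
```

In this example the worker ties one pair that the reference separates (penalty 0.5) and inverts one pair (penalty 1), giving 1.5 over six pairs.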

What carries the argument

Four factors defined with the theory of belief functions to quantify worker expertise from objective item comparisons.
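For readers new to the framework, the standard objects of belief function theory can be recalled as follows. The notation (a frame of discernment \Omega, a mass function m, and the derived functions bel and pl) is the usual textbook one and is not necessarily the notation used in the paper; the four factors themselves are not reproduced on this page.

```latex
m : 2^{\Omega} \to [0,1], \qquad \sum_{A \subseteq \Omega} m(A) = 1,
\qquad
\mathrm{bel}(A) = \sum_{\emptyset \neq B \subseteq A} m(B),
\qquad
\mathrm{pl}(A) = \sum_{B \cap A \neq \emptyset} m(B).
```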

If this is right

The expertise measure applies to crowdsourcing tasks like audio quality assessment where objective item comparisons exist.
Direct comparison to the Fagin distance on real experimental data reveals relative performance.
Fusing the belief function measure with the Fagin distance yields a combined estimator for worker expertise.
Improved expertise assessment supports better task assignment and quality control in crowdsourcing platforms.
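The fusion step listed above can be pictured with a minimal sketch. This page does not say how the paper performs the fusion, so everything below is an assumption: each score in [0, 1] is turned into a discounted mass function on the hypothetical frame {expert, non_expert}, and the two masses are combined with Dempster's rule. The `reliability` values and all names are illustrative.

```python
from itertools import product

# Hypothetical two-element frame of discernment (not taken from the paper).
OMEGA = frozenset({"expert", "non_expert"})


def score_to_mass(score, reliability):
    """Turn a score in [0, 1] into a mass function, leaving 1 - reliability on OMEGA."""
    if not (0.0 <= score <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("score and reliability must lie in [0, 1]")
    return {
        frozenset({"expert"}): reliability * score,
        frozenset({"non_expert"}): reliability * (1.0 - score),
        OMEGA: 1.0 - reliability,
    }


def dempster_combine(m1, m2):
    """Dempster's rule: conjunctive combination followed by conflict normalization."""
    combined = {}
    conflict = 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical inputs: a belief-function expertise score, and a similarity
# obtained as 1 minus a normalized ranking distance.
m_expertise = score_to_mass(0.80, reliability=0.9)
m_distance = score_to_mass(1.0 - 0.25, reliability=0.7)
fused = dempster_combine(m_expertise, m_distance)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 3))
```

Keeping some mass on the whole frame is one way to express that neither source is fully trusted; other combination rules would be equally plausible here.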

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms might weight individual worker contributions more heavily when objective comparison data is available for calibration.
The fusion step implies that combining distance-based and belief-function approaches could increase robustness across varied task types.
If the four factors prove stable, the method could extend to other crowdsourced domains such as image labeling or text verification without full ground truth.

Load-bearing premise

A dataset with an objective comparison of the items concerned is available to define and apply the four factors.

What would settle it

If the four-factor belief function measure applied to the audio recording dataset shows no improvement in alignment with known worker performance over the Fagin distance, the claim of an innovative measure would be challenged.

Original abstract

Crowdsourcing platforms enable companies to propose tasks to a large crowd of users. The workers receive a compensation for their work according to the seriousness of the tasks they managed to accomplish. The evaluation of the quality of responses obtained from the crowd remains one of the most important problems in this context. Several methods have been proposed to estimate the expertise level of crowd workers. We propose an innovative measure of expertise assuming that we possess a dataset with an objective comparison of the items concerned. Our method is based on the definition of four factors with the theory of belief functions. We compare our method to the Fagin distance on a dataset from a real experiment, where users have to assess the quality of some audio recordings. Then, we propose to fuse both the Fagin distance and our expertise measure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The expertise measure needs an objective item-comparison dataset that crowdsourcing settings usually lack, so the four-factor belief-function construction cannot be used on the data it targets.

The paper defines four factors inside the belief-function framework for worker expertise and fuses the output with Fagin distance. It then runs the combined measure on a real audio-quality assessment dataset. That experiment and the fusion step are the concrete parts that can be checked against actual responses. The comparison to Fagin distance is also reported directly on the same data. Those elements give the work a modest empirical anchor. The central construction, however, requires a separate dataset that supplies objective comparisons between items so the four factors can be computed. The abstract states this prerequisite plainly, and the stress-test note correctly flags that such ground-truth comparisons are precisely what crowdsourcing tasks normally lack. No procedure is described for deriving or approximating those comparisons from the crowd responses alone. Without that step the method cannot be instantiated on the data it is meant to handle. The four factors themselves are not derived or validated in the abstract, and the soundness scores reflect the absence of error analysis or reproducibility details. For readers already working on belief functions applied to ranking or quality control, the fusion experiment may be worth skimming. For anyone needing a method that works from ordinary crowdsourced labels, the dependency on unavailable objective data is a load-bearing limitation. I would not send this to peer review without a clear account of how the required comparisons are obtained or replaced.

Referee Report

2 major / 1 minor

Summary. The paper proposes an expertise measure for crowdsourcing workers based on four factors defined in the theory of belief functions. The construction assumes access to a dataset supplying objective comparisons between items; the measure is compared to the Fagin distance on a real audio-recording quality-assessment experiment and the two are proposed to be fused.

Significance. If the prerequisite objective-comparison dataset can be obtained or approximated from crowdsourced responses alone, the belief-function construction would supply a new formal route to expertise estimation and the reported comparison on real data would provide a concrete empirical anchor.

major comments (2)

[Abstract / Method] Abstract and method statement: the four-factor construction is defined only in terms of an objective item-comparison dataset, yet no procedure is supplied for constructing or approximating that dataset from the crowdsourced responses that constitute the motivating setting.
[Abstract] Abstract: no derivation details, validation metrics, or error analysis are supplied for the four factors, so the central claim cannot be checked against the paper's own equations or data.

minor comments (1)

[Experiment] The description of the audio-recording experiment lacks detail on how the objective comparisons were obtained for that specific case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract / Method] Abstract and method statement: the four-factor construction is defined only in terms of an objective item-comparison dataset, yet no procedure is supplied for constructing or approximating that dataset from the crowdsourced responses that constitute the motivating setting.

Authors: The manuscript explicitly frames the expertise measure under the assumption of access to an objective item-comparison dataset, as stated in the abstract and method. This assumption enables the direct application of belief function theory to define the four factors. We agree that no explicit procedure is provided for deriving or approximating such a dataset from crowdsourced responses alone. In revision we will add a dedicated discussion subsection outlining possible approximation strategies, including the use of majority-vote consensus as a proxy, iterative refinement with partial ground truth, or hybrid approaches that combine limited objective data with worker responses.

revision: yes

Referee: [Abstract] Abstract: no derivation details, validation metrics, or error analysis are supplied for the four factors, so the central claim cannot be checked against the paper's own equations or data.

Authors: The abstract is intentionally concise. The full manuscript supplies the mathematical definitions of the four factors within the belief-function framework (Section on method), together with the empirical comparison against Fagin distance on the audio-recording quality-assessment dataset. This comparison constitutes the primary validation. To improve accessibility we will expand the abstract with a short clause referencing the four factors and the real-data evaluation, and we will ensure the method section cross-references the defining equations and the experimental protocol.

revision: partial
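The majority-vote proxy mentioned in the first response can be sketched as follows. This is only an illustration of the general idea, not a procedure from the paper: the input format (per-worker item scores), the function name, and the data are hypothetical.

```python
from itertools import combinations


def majority_pairwise(ratings):
    """Build a consensus comparison for every item pair by majority vote.

    ratings: {worker: {item: score}} (assumed format).
    Returns {(i, j): 1 if i is preferred, -1 if j is preferred, 0 if undecided}.
    """
    items = sorted({item for scores in ratings.values() for item in scores})
    consensus = {}
    for i, j in combinations(items, 2):
        votes = 0
        for scores in ratings.values():
            if i in scores and j in scores:
                # +1 if this worker rates i above j, -1 if below, 0 if equal
                votes += (scores[i] > scores[j]) - (scores[i] < scores[j])
        consensus[(i, j)] = (votes > 0) - (votes < 0)
    return consensus


# Hypothetical ratings of three recordings by three workers.
ratings = {
    "w1": {"a": 5, "b": 3, "c": 1},
    "w2": {"a": 4, "b": 4, "c": 2},
    "w3": {"a": 2, "b": 5, "c": 1},
}
print(majority_pairwise(ratings))
# {('a', 'b'): 0, ('a', 'c'): 1, ('b', 'c'): 1}
```

A proxy of this kind is built from the same responses that the expertise measure is meant to evaluate, so it would no longer be an external reference in the sense discussed under the circularity check.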

Circularity Check

0 steps flagged

No circularity: expertise factors defined from external objective-comparison dataset

The paper explicitly conditions its four-factor belief-function measure on the availability of an external dataset supplying objective item comparisons. This input is treated as given rather than derived or fitted inside the paper, and the subsequent comparison to Fagin distance is performed on a separate real-experiment dataset. No equation or step reduces the claimed expertise output to a parameter fitted from the same responses the method is meant to evaluate, nor does any load-bearing claim rest on a self-citation chain. The derivation therefore remains self-contained against the stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the availability of an objective comparison dataset and on the standard axioms of belief function theory; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption: Existence of a dataset with objective comparison of the items concerned
Explicitly stated as a prerequisite for the proposed measure.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Ben Rjab, A., Kharoune, M., Miklos, Z., and Martin, A. (2016). Characterization of experts in crowdsourcing platforms. In The 4th International Conference on Belief Functions, volume 9861, pages 97 –

[2] Dawid, P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. 28:20–28.

[3] Dempster, A. P. (1967). Upper and lower probabilities induced by a multivalued mapping. The Annals of Mathematical Statistics, pages 325–339.

[4] Essaid, A., Martin, A., Smits, G., and Ben Yaghlane, B. (2014). A distance-based decision in the credal level. In Artificial Intelligence and Symbolic Computation - 12th International Conference, AISC 2014, Seville, Spain, December 11-13,

[5] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. (2004). Comparing and aggregating rankings with ties. In twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 47–58.

[6] Howe, J. (2006). The rise of crowdsourcing. Wired magazine, 14(6):1–4.

[7] Ipeirotis, P. G., Provost, F., and Wang, J. (2010). Machine-learning for spammer detection in crowd-sourcing. In HCOMP ’10 Proceedings of the ACM SIGKDD Workshop on Human Computation.

[8] ITU (1996). Modulated noise reference unit (MNRU). Technical Report ITU-T P.810, International Telecommunication Union.

[9] Jouili, S. (2011). Indexation de masses de documents graphiques : approches structurelles. PhD thesis, Université Nancy II.

[10] Jousselme, A.-L., Grenier, D., and Bossé, É. (2001). A new distance between two bodies of evidence. Information Fusion, 2(2):91–101.

[11] Kendall, M. (1945). The treatment of ties in ranking problems. Biometrika, pages 239–251.

[12] Le, J., Edmonds, A., Hester, V., and Biewald, L. (2010). Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In Workshop on Crowdsourcing for Search Evaluation, pages 17–20.

[13] Raykar, V. C. and Yu, S. (2012). Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 13:491–518.

[14] Raykar, V. C., Yu, S., Zhao, L. H., Hermosillo Valadez, G., Florin, C., Bogoni, L., and Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11:1297–

[15] Shafer, G. (1976). A mathematical theory of evidence, volume

[16] Smets, P. (1990). The combination of evidence in the transferable belief model. 12:447 –

[17] Smyth, P., Fayyad, U., Burl, M., Perona, P., and Baldi, P. (1995). Inferring ground truth from subjective labelling of Venus images. Advances in Neural Information Processing Systems, 7:1085–1092
