An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Maja Pavlovic; Massimo Poesio; Silviu Paun

arXiv: 2605.18648 · v1 · pith:4MHKGLHTnew · submitted 2026-05-18 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 11:51 UTC · model grok-4.3

keywords: soft labels · human uncertainty · model calibration · soft-label learning · dataset cartography · MNIST · human-AI alignment · regularization

The pith

Human soft-labels improve model calibration on hard samples and stabilize training more than they boost raw accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the contribution of human soft labels by re-annotating subsets of MNIST and a synthetic dataset while holding label modes fixed. This controlled comparison shows that human soft labels yield accuracy gains but deliver their primary benefit by regularizing models toward better calibration on difficult inputs and more consistent convergence across repeated runs. Dataset cartography further reveals that models trained on human soft labels reproduce human uncertainty patterns, whereas synthetic-label models do not. The work supplies a diagnostic testbed for measuring how well learned uncertainty aligns with human judgment.

Core claim

After decoupling soft-label supervision from implicit label-mode corrections, human soft labels still produce modest accuracy improvements yet act chiefly as a regularizer that sharpens calibration on ambiguous examples and reduces variance in training trajectories.
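To make the regularizer reading concrete, the sketch below contrasts a one-hot target with a soft target under the same cross-entropy loss. It is an illustration under stated assumptions only: the function names, the three-class logits, and the target values are hypothetical, and the paper's exact loss is not given on this page.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits).

    `target` may be one-hot (hard label) or any probability vector (soft label).
    Classes with zero target mass contribute nothing to the sum.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical three-class example (values chosen for illustration only).
logits = [2.0, 0.5, -1.0]
hard_target = [1.0, 0.0, 0.0]   # one-hot label
soft_target = [0.7, 0.3, 0.0]   # assumed human soft label with the same mode

hard_loss = soft_cross_entropy(logits, hard_target)
soft_loss = soft_cross_entropy(logits, soft_target)
```

Under a one-hot target the loss keeps shrinking as the model becomes more confident in the labelled class. Under a soft target the loss is smallest when the predicted distribution equals the target, so the model is penalized for overconfidence on that example. Both targets here share the same mode, which mirrors the fixed-mode control described above.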

What carries the argument

Re-annotation of fixed subsets to extract human uncertainty signals while controlling for label-mode shifts.
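One way to picture this control is to turn annotator responses into a distribution and then check that its mode still equals the reference label. The sketch assumes each annotator casts a single vote per image; the paper may elicit labels differently, and `soft_label_from_votes` and `mode_matches` are hypothetical helper names rather than anything taken from the authors' code.

```python
from collections import Counter

def soft_label_from_votes(votes, num_classes):
    """Normalize per-annotator class votes into a probability vector."""
    counts = Counter(votes)
    n = len(votes)
    return [counts.get(c, 0) / n for c in range(num_classes)]

def mode_matches(soft_label, reference_label):
    """True only when reference_label is the unique argmax of soft_label."""
    top = max(soft_label)
    winners = [c for c, p in enumerate(soft_label) if p == top]
    return winners == [reference_label]

# Hypothetical votes from five annotators on one digit image.
votes = [7, 7, 1, 7, 7]
soft = soft_label_from_votes(votes, num_classes=10)

keeps_mode = mode_matches(soft, reference_label=7)   # mode unchanged
shifts_mode = mode_matches(soft, reference_label=1)  # would be a mode shift
```

Treating ties as a mismatch is a design choice made here for simplicity: a tied distribution has no single mode, so it cannot be said to preserve the reference label.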

If this is right

Human soft labels produce models whose uncertainty maps more closely onto human uncertainty distributions.
Calibration gains concentrate on the most ambiguous inputs rather than across the entire test set.
Training variance across random initializations decreases when human soft labels are used.
Dataset cartography becomes a practical tool for verifying human-model uncertainty alignment.
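For readers unfamiliar with dataset cartography, the sketch below computes the three per-example statistics from the original cartography method: confidence (mean probability assigned to the reference label across epochs), variability (standard deviation of that probability), and correctness (fraction of epochs predicted correctly). How the paper adapts these statistics to soft labels is not stated on this page, so the reference-label formulation and the per-epoch numbers are assumptions for illustration.

```python
import statistics

def cartography_stats(ref_probs_per_epoch, correct_per_epoch):
    """Return (confidence, variability, correctness) for one training example.

    ref_probs_per_epoch: probability the model gave the reference label at each epoch.
    correct_per_epoch:   whether the argmax prediction matched the reference label.
    """
    confidence = statistics.fmean(ref_probs_per_epoch)
    variability = statistics.pstdev(ref_probs_per_epoch)
    correctness = sum(correct_per_epoch) / len(correct_per_epoch)
    return confidence, variability, correctness

# Hypothetical training trajectories over five epochs (made-up values).
easy = cartography_stats([0.90, 0.95, 0.98, 0.99, 0.99], [True] * 5)
ambiguous = cartography_stats(
    [0.20, 0.70, 0.35, 0.80, 0.45],
    [False, True, False, True, False],
)
```

In a cartography map, examples like `easy` sit in the high-confidence, low-variability region, while examples like `ambiguous` show high variability. Comparing where human-uncertain images land under each training condition is the kind of alignment check the list above refers to.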

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications that prize reliable uncertainty estimates, such as medical triage or autonomous driving, may benefit more from human soft labels than accuracy-focused tasks.
The same controlled re-annotation protocol could be applied to larger image or language datasets to test whether the regularization effect scales.
Hybrid training that mixes a small amount of human soft labels with abundant synthetic data might achieve similar calibration benefits at lower cost.

Load-bearing premise

Re-annotating subsets with human soft labels isolates uncertainty information without introducing selection biases or annotation artifacts absent from the synthetic control.

What would settle it

A replication in which models trained on the human soft-label subsets fail to show lower expected calibration error or higher stability across random seeds on the same held-out difficult samples.
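Expected calibration error, the quantity such a replication would compare, can be sketched as follows. This is the common equal-width binning estimator, not necessarily the variant used in the paper; the bin count, the left-closed bin edges, and the sample values are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is clamped into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions on a small held-out subset (made-up values).
ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.55, 0.55],
    correct=[True, True, True, False],
)
```

To test the claim as stated, the same function would be applied only to the held-out difficult samples, once per random seed and per training condition. How "difficult" is defined operationally is left open on this page.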

Figures

Figures reproduced from arXiv: 2605.18648 by Maja Pavlovic, Massimo Poesio, Silviu Paun.

**Figure 2.** Dataset Cartography examples with human label variation (HLV) from both Mukhoti (a,c,e) …

**Figure 3.** Cartography Map for MNIST using a simple feed-forward neural network for 5 epochs

**Figure 4.** Cartography Map for ARDIS using a simple feed-forward neural network for 5 epochs

**Figure 5.** Cartography Map for Mukhoti using a simple feed-forward neural network for 20 epochs

**Figure 6.** Cartography Map for Weiss using a simple feed-forward neural network for 30 epochs

**Figure 7.** Entropy comparison: Weiss shows higher label entropy, while Mukhoti has a broader, more varied distribution

**Figure 9.** Task instruction screen

**Figure 11.** Distribution of image-level uncertainty. Comparison of uncertainty metrics

**Figure 12.** Distribution of $u^{\text{mean}}_n$ image-level uncertainty. Histogram for MNIST (top) and Mukhoti (bottom). MNIST shows a sharp concentration near zero, reflecting high annotator consensus, whereas Mukhoti shows a significantly heavier tail and broader uncertainty distribution

**Figure 14.** The JSD between successive epochs, JSD(P_e || P_{e−1}), serves as a proxy for how much the model’s beliefs are shifting - LeNet over 6 seeds

**Figure 15.** MNIST - LeNet - Late stage training dynamics averaged across 6 random seeds, with …

**Figure 16.** Mukhoti - DeeperFFN - Late stage training dynamics averaged across 6 random seeds, …

**Figure 17.** MNIST - DeeperFFN - Late stage training dynamics averaged across 6 random seeds, …

**Figure 18.** Mukhoti - SimpleFFN - Late stage training dynamics averaged across 6 random seeds, …

**Figure 19.** MNIST - SimpleFFN - Late stage training dynamics averaged across 6 random seeds, …

**Figure 20.** Samples from MNIST and Mukhoti (ambig. MNIST) for the soft-label digits
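The Figure 14 caption describes the Jensen-Shannon divergence between a model's predictions at successive epochs as a proxy for belief shift. A minimal sketch of that quantity for a single example follows. The base-2 logarithm, the per-example formulation, and the toy trajectory are assumptions; the caption does not say how the paper aggregates over samples or seeds.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence, symmetric and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def epoch_shifts(preds_per_epoch):
    """JSD between each epoch's predictive distribution and the previous one."""
    return [
        js_divergence(cur, prev)
        for prev, cur in zip(preds_per_epoch, preds_per_epoch[1:])
    ]

# Hypothetical two-class predictions for one image over three epochs.
trajectory = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
shifts = epoch_shifts(trajectory)
```

A sequence of shifts that decays toward zero indicates predictions that have settled, which is one way the "stable convergence" claim could be inspected per example.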

Original abstract

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Human soft labels mainly act as a regularizer for calibration and stability on MNIST rather than just fixing labels, with a clean decoupling but narrow scope and unaddressed selection risks.

The key takeaway is that human soft labels deliver modest accuracy gains on MNIST but shine more as a regularizer that improves calibration on hard samples and stabilizes training across runs, while dataset cartography shows that models trained on them mirror human uncertainty patterns better than models trained on synthetic labels. The paper separates these effects from simple mode-shift correction, which prior work often mixed together. The authors do this by building a synthetic variant of the data and re-annotating chosen subsets with human soft labels to isolate uncertainty information. This controlled audit is the main incremental step, and the regularization angle plus the cartography check are useful for deciding when expensive human labels are worth collecting.

The setup is thoughtful and the distinction between accuracy and calibration effects lands clearly. The work stays tightly scoped to MNIST image classification, so broader claims about human-aligned AI remain tentative. The re-annotation step is the soft spot: if subset selection or the human process itself shifts sample hardness, noise, or label distributions relative to the synthetic baseline, then the observed calibration and stability gains could trace to those differences instead of soft-label regularization. The stress-test concern about artifacts holds up from the abstract and methods description; stronger evidence that the conditions match on difficulty metrics would tighten the isolation. No equations or derivations appear, so everything rests on the experimental comparisons.

This is for people working on calibration, human labels, and uncertainty alignment in classification tasks. It deserves a serious referee because the question is practical and the decoupling design is a clear improvement over conflated studies, even if revisions will need more quantitative detail and checks on the controls.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a controlled audit of soft-label learning on MNIST and a synthetic variant. Subsets are re-annotated with human soft labels to decouple uncertainty information from label mode shifts. The central claims are that human soft-labels yield accuracy gains but primarily function as a regularizer that improves calibration on difficult samples and promotes stable convergence across training runs; dataset cartography further shows that models trained on human soft labels align with human uncertainty patterns while synthetic-label models do not.

Significance. If the controlled isolation of uncertainty effects holds, the work supplies a diagnostic testbed for human-AI uncertainty alignment and clarifies the regularizing role of human soft labels beyond accuracy or mode correction. The attempt to separate these factors is a methodological strength relative to prior conflated studies.

major comments (2)

[Experimental Design / Re-annotation Protocol] The re-annotation procedure on chosen subsets is load-bearing for the claim that observed calibration and stability gains arise from human uncertainty regularization rather than selection or annotation artifacts. The manuscript must demonstrate that subset selection (e.g., by initial model uncertainty or image difficulty) and the human annotation process produce label distributions and sample hardness statistics that match the synthetic control on all dimensions except uncertainty; without explicit comparisons or ablation on selection criteria, the decoupling remains unverified.
[Results and Analysis] Quantitative support for the key findings is insufficient. The abstract and results sections report no specific metrics (e.g., ECE, NLL, or variance across runs), error bars, or statistical tests comparing human vs. synthetic conditions on difficult samples; without these, the assertion that human soft labels act as a superior regularizer cannot be evaluated.

minor comments (2)

[Dataset and Setup] Clarify the construction of the 'synthetic variant' of MNIST and the precise criteria used to select subsets for re-annotation.
[Throughout] Ensure consistent terminology ('soft-labels' vs. 'soft labels') and define 'difficult samples' operationally when discussing calibration gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Experimental Design / Re-annotation Protocol] The re-annotation procedure on chosen subsets is load-bearing for the claim that observed calibration and stability gains arise from human uncertainty regularization rather than selection or annotation artifacts. The manuscript must demonstrate that subset selection (e.g., by initial model uncertainty or image difficulty) and the human annotation process produce label distributions and sample hardness statistics that match the synthetic control on all dimensions except uncertainty; without explicit comparisons or ablation on selection criteria, the decoupling remains unverified.

Authors: We agree that explicit verification of matching statistics between the human re-annotated subsets and synthetic controls is necessary to substantiate the decoupling of uncertainty effects from selection or mode-shift artifacts. The current manuscript describes the protocol and selection process but does not include direct comparative statistics or ablations. In the revised manuscript we will add these elements, including side-by-side comparisons of label entropy, mode agreement, and hardness metrics (e.g., initial model confidence) across conditions, plus an ablation varying selection criteria. These additions will be placed in a dedicated subsection with supporting tables. revision: yes
Referee: [Results and Analysis] Quantitative support for the key findings is insufficient. The abstract and results sections report no specific metrics (e.g., ECE, NLL, or variance across runs), error bars, or statistical tests comparing human vs. synthetic conditions on difficult samples; without these, the assertion that human soft labels act as a superior regularizer cannot be evaluated.

Authors: We acknowledge that the presentation of quantitative results can be strengthened. While comparative trends are shown, the manuscript does not report specific numerical values for ECE, NLL, run variance, error bars, or formal statistical tests on the difficult-sample subset. In the revision we will add these details in the results section and abstract, including tables with mean values plus standard deviations across runs, error bars on figures, and statistical comparisons (e.g., paired tests) restricted to hard samples. This will allow direct evaluation of the regularizing effect. revision: yes
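The reporting promised in these responses (means with standard deviations across runs, plus a paired comparison) could take roughly the following shape. This is a sketch only: the per-seed values below are made-up placeholders that exist to exercise the functions and are NOT results from the paper, and `summarize_runs` and `paired_t_statistic` are hypothetical names.

```python
import math
import statistics

def summarize_runs(values):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.fmean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n)), d = a - b.

    The pairs are assumed to share a random seed. Turning the statistic into a
    p-value needs a t distribution with n - 1 degrees of freedom, omitted here.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# PLACEHOLDER per-seed ECE values for two training conditions (six seeds each).
# Made up for illustration; they do not come from the paper.
ece_condition_a = [0.040, 0.050, 0.045, 0.055, 0.050, 0.060]
ece_condition_b = [0.030, 0.035, 0.030, 0.040, 0.040, 0.035]

mean_a, std_a = summarize_runs(ece_condition_a)
mean_b, std_b = summarize_runs(ece_condition_b)
t_stat = paired_t_statistic(ece_condition_a, ece_condition_b)
```

Pairing by seed is a design choice assumed here: it removes seed-to-seed variation that both conditions share, which matters when only six runs are available.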

Circularity Check

0 steps flagged

Empirical study with no derivations or self-referential reductions

The paper conducts a controlled empirical audit on MNIST and synthetic variants by re-annotating subsets with human soft labels versus synthetic controls. All central claims—accuracy gains, calibration improvements on difficult samples, training stability, and alignment via dataset cartography—are supported solely by experimental comparisons and observed performance metrics. No equations, parameter fits, uniqueness theorems, or derivation chains appear in the provided text; the analysis does not reduce any result to its inputs by construction. This is a standard self-contained empirical study against external benchmarks such as model calibration error and convergence variance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard supervised learning assumptions and the validity of the synthetic control dataset.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Gorilla in our midst: An online behavioral experiment builder.Behavior research methods, 52(1):388–407, 2020

Alexander L Anwyl-Irvine, Jessica Massonnié, Adam Flitton, Natasha Kirkham, and Jo K Evershed. Gorilla in our midst: An online behavioral experiment builder.Behavior research methods, 52(1):388–407, 2020

work page 2020
[3]

Dices dataset: Diversity in conversational ai evaluation for safety.Advances in Neural Information Processing Systems, 36:53330–53342, 2023

Lora Aroyo, Alex Taylor, Mark Diaz, Christopher Homan, Alicia Parrish, Gregory Serapio- García, Vinodkumar Prabhakaran, and Ding Wang. Dices dataset: Diversity in conversational ai evaluation for safety.Advances in Neural Information Processing Systems, 36:53330–53342, 2023

work page 2023
[4]

Fill in the gaps: Model calibration and generalization with synthetic data

Yang Ba, Michelle V Mancenido, and Rong Pan. Fill in the gaps: Model calibration and generalization with synthetic data. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17211–17225, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.955. URL https:...

work page doi:10.18653/v1/2024.emnlp-main.955 2024
[5]

Stop measuring calibration when humans disagree

Joris Baan, Wilker Aziz, Barbara Plank, and Raquel Fernandez. Stop measuring calibration when humans disagree. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1892–1915, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.124. URL https:...

work page doi:10.18653/v1/2022.emnlp-main.124 2022
[7]

Tailoring mixup to data for calibration

Quentin Bouniot, Pavlo Mozharovskyi, and Florence d’Alché Buc. Tailoring mixup to data for calibration. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=3ygfMPLv0P

work page 2025
[8]

Reassessing how to compare and improve the calibration of machine learning models.arXiv preprint arXiv:2406.04068, 2024

Muthu Chidambaram and Rong Ge. Reassessing how to compare and improve the calibration of machine learning models.arXiv preprint arXiv:2406.04068, 2024

work page arXiv 2024
[9]

Collins, Umang Bhatt, and Adrian Weller

Katherine M. Collins, Umang Bhatt, and Adrian Weller. Eliciting and Learning with Soft Labels from Every Annotator.Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 10:40–52, October 2022. ISSN 2769-1349. doi: 10.1609/hcomp.v10i1.21986. URLhttps://ojs.aaai.org/index.php/HCOMP/article/view/21986

work page doi:10.1609/hcomp.v10i1.21986 2022
[10]

Collins, Umang Bhatt, Weiyang Liu, Vihari Piratla, Ilia Sucholutsky, Bradley Love, and Adrian Weller

Katherine M. Collins, Umang Bhatt, Weiyang Liu, Vihari Piratla, Ilia Sucholutsky, Bradley Love, and Adrian Weller. Human-in-the-loop mixup. InProceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI ’23. JMLR.org, 2023. 10

work page 2023
[11]

Human Uncertainty in Concept-Based AI Systems

Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human Uncertainty in Concept-Based AI Systems. InProceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889, Montreal Canada, August 2023. ACM. ISBN 97984007023...

work page doi:10.1145/3600211.3604692 2023
[12]

Position: insights from survey methodol- ogy can improve training data

Stephanie Eckman, Barbara Plank, and Frauke Kreuter. Position: insights from survey methodol- ogy can improve training data. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

work page 2024
[13]

Mining the uncertainty patterns of humans and models in the annotation of moral foundations and human values

Neele Falk and Gabriella Lapesa. Mining the uncertainty patterns of humans and models in the annotation of moral foundations and human values. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22898–22921, 2025

work page 2025
[14]

Metacognition and confidence: A review and synthesis.Annual Review of Psychology, 75(1):241–268, 2024

Stephen M Fleming. Metacognition and confidence: A review and synthesis.Annual Review of Psychology, 75(1):241–268, 2024

work page 2024
[15]

Eye-tracking data: New insights on response order effects and other cognitive shortcuts in survey responding

Mirta Galesic, Roger Tourangeau, Mick P Couper, and Frederick G Conrad. Eye-tracking data: New insights on response order effects and other cognitive shortcuts in survey responding. Public opinion quarterly, 72(5):892–913, 2008

work page 2008
[16]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017. URLhttps://proceedings.mlr.press/v70/guo17a.html

work page 2017
[17]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Facing uncertainty in the game of bridge: A calibration study.Organizational Behavior and Human Decision Processes, 39(1):98–114, 1987

Gideon Keren. Facing uncertainty in the game of bridge: A calibration study.Organizational Behavior and Human Decision Processes, 39(1):98–114, 1987

work page 1987
[19]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[20]

Overconfidence: It depends on how, what, and whom you ask.Organizational behavior and human decision processes, 79(3):216–247, 1999

Joshua Klayman, Jack B Soll, Claudia Gonzalez-Vallejo, and Sema Barlas. Overconfidence: It depends on how, what, and whom you ask.Organizational behavior and human decision processes, 79(3):216–247, 1999

work page 1999
[21]

Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.Advances in neural information processing systems, 32, 2019

Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.Advances in neural information processing systems, 32, 2019

work page 2019
[22]

Trainable calibration measures for neural networks from kernel mean embeddings

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. InInternational Conference on Machine Learning, pages 2805–2814. PMLR, 2018

work page 2018
[23]

Ardis: a swedish historical handwritten digit dataset.Neural Computing and Applications, 32(21): 16505–16518, 2020

Huseyin Kusetogullari, Amir Yavariabdi, Abbas Cheddad, Håkan Grahn, and Johan Hall. Ardis: a swedish historical handwritten digit dataset.Neural Computing and Applications, 32(21): 16505–16518, 2020

work page 2020
[24]

The mnist database of handwritten digits.http://yann

Yann LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998

work page 1998
[25]

, author Wu, Z

Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah Smith, and Yejin Choi. We’re afraid language models aren’t modeling ambiguity. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 790–807, Singapore, December 2023. Association for Computational Lin- guist...

work page doi:10.18653/v1/2023.emnlp-main.51 2023
[26]

Human and system perspectives on the expression of irony: An analysis of likelihood labels and rationales

Aaron Maladry, Alessandra Teresa Cignarella, Els Lefever, Cynthia van Hee, and Veronique Hoste. Human and system perspectives on the expression of irony: An analysis of likelihood labels and rationales. InProceedings of the 2024 Joint International Conference on Com- putational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8372–...

work page 2024
[27]

A scalable framework for evaluating health language models.npj Digital Medicine, 2026

Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models.npj Digital Medicine, 2026

work page 2026
[28]

Aggregation under uncertainty.IEEE Transactions on Fuzzy Systems, 26(4):2475–2478, 2018

Radko Mesiar, Surajit Borkotokey, LeSheng Jin, and Martin Kalina. Aggregation under uncertainty.IEEE Transactions on Fuzzy Systems, 26(4):2475–2478, 2018. doi: 10.1109/ TFUZZ.2017.2756828

work page arXiv 2018
[29]

Torr, and Yarin Gal

Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H.S. Torr, and Yarin Gal. Deep deterministic uncertainty: A new simple baseline. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24384–24394, June 2023

work page 2023
[30]

When does label smoothing help? Advances in neural information processing systems, 32, 2019

Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019

work page 2019
[31]

Obtaining well calibrated probabilities using bayesian binning

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. InProceedings of the AAAI conference on artificial intelligence, volume 29, 2015

work page 2015
[32]

Predicting good probabilities with supervised learning

Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. InProceedings of the 22nd international conference on Machine learning, pages 625–632, 2005

work page 2005
[33]

Prolific

Stefan Palan and Christian Schitter. Prolific. — a subject pool for online experiments.Journal of behavioral and experimental finance, 17:22–27, 2018

work page 2018
[34]

Don’t blame the annotator: Bias already starts in the annotation instructions

Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. Don’t blame the annotator: Bias already starts in the annotation instructions. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1779–1789, 2023

work page 2023
[35]

Is a picture of a bird a bird? a mixed-methods approach to understanding diverse human perspectives and ambiguity in machine vision models

Alicia Parrish, Susan Hao, Sarah Laszlo, and Lora Aroyo. Is a picture of a bird a bird? a mixed-methods approach to understanding diverse human perspectives and ambiguity in machine vision models. InProceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024, pages 1–18, Torino, Italia, May 2024. ELRA and ICCL. U...

work page 2024
[36]

Understanding model calibration - a gentle introduction and vi- sual exploration of calibration and the expected calibration error (ece)

Maja Pavlovic. Understanding model calibration - a gentle introduction and vi- sual exploration of calibration and the expected calibration error (ece). In ICLR Blogposts 2025, 2025. URL https://d2jud02ci9yv69.cloudfront.net/ 2025-04-28-calibration-45/blog/calibration/

work page 2025
[37]

Peterson, Ruairidh M

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

work page 2019
[38]

The “problem” of human label variation: On ground truth in data, modeling and evaluation

Barbara Plank. The “problem” of human label variation: On ground truth in data, modeling and evaluation. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.731. URL https://ac...

work page doi:10.18653/v1/2022.emnlp-main.731 2022
[39]

Ambiguous images with human judgments for robust visual event classification

Kate Sanders, Reno Kriz, Anqi Liu, and Benjamin Van Durme. Ambiguous images with human judgments for robust visual event classification. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https:// openreview.net/forum?id=6Hl7XoPNAVX. 12

work page 2022
[40]

Ambiguous annotations: When is a pedestrian not a pedestrian? InFirst Vision and Language for Autonomous Driving and Robotics Workshop, 2024

Luisa Schwirten, Jannes Scholz, Daniel Kondermann, and Janis Keuper. Ambiguous annotations: When is a pedestrian not a pedestrian? InFirst Vision and Language for Autonomous Driving and Robotics Workshop, 2024. URLhttps://openreview.net/forum?id=aPzFAopRks

work page 2024
[41]

Smith, and Yejin Choi

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, Online, November 2020. Association for Computati...

work page doi:10.18653/v1/2020.emnlp-main.746 2020
[42]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[43]

On mixup training: Improved calibration and predictive uncertainty for deep neural networks.Advances in neural information processing systems, 32, 2019

Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks.Advances in neural information processing systems, 32, 2019

work page 2019
[44]

Learning from disagreement: A survey.Journal of Artificial Intelligence Research, 72: 1385–1470, 2021

Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. Learning from disagreement: A survey.Journal of Artificial Intelligence Research, 72: 1385–1470, 2021

work page 2021
[45]

Generating and detecting true ambiguity: a forgotten danger in dnn supervision testing.Empirical Software Engineering, 28 (6):146, 2023

Michael Weiss, André García Gómez, and Paolo Tonella. Generating and detecting true ambiguity: a forgotten danger in dnn supervision testing.Empirical Software Engineering, 28 (6):146, 2023

work page 2023
[46] Guoxuan Xia, Olivier Laurent, Gianni Franchi, and Christos-Savvas Bouganis. Towards understanding why label smoothing degrades selective classification and how to fix it. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6oWFn6fY4A

[47] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[48] Mike Zhang and Barbara Plank. Cartography active learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 395–406, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.36. URL https://aclanthology.org/2021.findings-emnlp.36/

[49] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Rethinking confidence calibration for failure prediction. In European conference on computer vision, pages 518–536. Springer, 2022.

A Data Collection Process

A.1 Corpus size

To establish a lower bound for the number of training instances per class required to achieve reliable performance, we conducted a...

[50]

Discussions

remain subject to the terms of the MIT License.

G.6 Maintenance

The soft-digits dataset is hosted on the Hugging Face Hub, which serves as the primary platform for its distribution and long-term maintenance. The corresponding author is responsible for managing the repository, ensuring the data remains accessible, and performing any necessary updates or ve...

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Alexander L Anwyl-Irvine, Jessica Massonnié, Adam Flitton, Natasha Kirkham, and Jo K Evershed. Gorilla in our midst: An online behavioral experiment builder. Behavior research methods, 52(1):388–407, 2020.

[3] Lora Aroyo, Alex Taylor, Mark Diaz, Christopher Homan, Alicia Parrish, Gregory Serapio-García, Vinodkumar Prabhakaran, and Ding Wang. Dices dataset: Diversity in conversational ai evaluation for safety. Advances in Neural Information Processing Systems, 36:53330–53342, 2023.

[4] Yang Ba, Michelle V Mancenido, and Rong Pan. Fill in the gaps: Model calibration and generalization with synthetic data. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17211–17225, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.955. URL https:...

[5] Joris Baan, Wilker Aziz, Barbara Plank, and Raquel Fernandez. Stop measuring calibration when humans disagree. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1892–1915, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.124. URL https:...

[7] Quentin Bouniot, Pavlo Mozharovskyi, and Florence d’Alché Buc. Tailoring mixup to data for calibration. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=3ygfMPLv0P

[8] Muthu Chidambaram and Rong Ge. Reassessing how to compare and improve the calibration of machine learning models. arXiv preprint arXiv:2406.04068, 2024.

[9] Katherine M. Collins, Umang Bhatt, and Adrian Weller. Eliciting and Learning with Soft Labels from Every Annotator. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 10:40–52, October 2022. ISSN 2769-1349. doi: 10.1609/hcomp.v10i1.21986. URL https://ojs.aaai.org/index.php/HCOMP/article/view/21986

[10] Katherine M. Collins, Umang Bhatt, Weiyang Liu, Vihari Piratla, Ilia Sucholutsky, Bradley Love, and Adrian Weller. Human-in-the-loop mixup. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI ’23. JMLR.org, 2023.

[11] Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human Uncertainty in Concept-Based AI Systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889, Montreal Canada, August 2023. ACM. ISBN 97984007023...

[12] Stephanie Eckman, Barbara Plank, and Frauke Kreuter. Position: insights from survey methodology can improve training data. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.

[13] Neele Falk and Gabriella Lapesa. Mining the uncertainty patterns of humans and models in the annotation of moral foundations and human values. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22898–22921, 2025.

[14] Stephen M Fleming. Metacognition and confidence: A review and synthesis. Annual Review of Psychology, 75(1):241–268, 2024.

[15] Mirta Galesic, Roger Tourangeau, Mick P Couper, and Frederick G Conrad. Eye-tracking data: New insights on response order effects and other cognitive shortcuts in survey responding. Public opinion quarterly, 72(5):892–913, 2008.

[16] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/guo17a.html

[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[18] Gideon Keren. Facing uncertainty in the game of bridge: A calibration study. Organizational Behavior and Human Decision Processes, 39(1):98–114, 1987.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Joshua Klayman, Jack B Soll, Claudia Gonzalez-Vallejo, and Sema Barlas. Overconfidence: It depends on how, what, and whom you ask. Organizational behavior and human decision processes, 79(3):216–247, 1999.

[21] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in neural information processing systems, 32, 2019.

[22] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2805–2814. PMLR, 2018.

[23] Huseyin Kusetogullari, Amir Yavariabdi, Abbas Cheddad, Håkan Grahn, and Johan Hall. Ardis: a swedish historical handwritten digit dataset. Neural Computing and Applications, 32(21):16505–16518, 2020.

[24] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[25] Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah Smith, and Yejin Choi. We’re afraid language models aren’t modeling ambiguity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 790–807, Singapore, December 2023. Association for Computational Linguist...

[26] Aaron Maladry, Alessandra Teresa Cignarella, Els Lefever, Cynthia van Hee, and Veronique Hoste. Human and system perspectives on the expression of irony: An analysis of likelihood labels and rationales. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8372–...

[27] Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models. npj Digital Medicine, 2026.

[28] Radko Mesiar, Surajit Borkotokey, LeSheng Jin, and Martin Kalina. Aggregation under uncertainty. IEEE Transactions on Fuzzy Systems, 26(4):2475–2478, 2018. doi: 10.1109/TFUZZ.2017.2756828

[29] Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H.S. Torr, and Yarin Gal. Deep deterministic uncertainty: A new simple baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24384–24394, June 2023.

[30] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.

[31] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015.

[32] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632, 2005.

[33] Stefan Palan and Christian Schitter. Prolific. — a subject pool for online experiments. Journal of behavioral and experimental finance, 17:22–27, 2018.

[34] Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. Don’t blame the annotator: Bias already starts in the annotation instructions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1779–1789, 2023.

[35] Alicia Parrish, Susan Hao, Sarah Laszlo, and Lora Aroyo. Is a picture of a bird a bird? a mixed-methods approach to understanding diverse human perspectives and ambiguity in machine vision models. In Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024, pages 1–18, Torino, Italia, May 2024. ELRA and ICCL. U...

[36] Maja Pavlovic. Understanding model calibration - a gentle introduction and visual exploration of calibration and the expected calibration error (ece). In ICLR Blogposts 2025, 2025. URL https://d2jud02ci9yv69.cloudfront.net/2025-04-28-calibration-45/blog/calibration/

[37] Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[38] Barbara Plank. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.731. URL https://ac...

[39] Kate Sanders, Reno Kriz, Anqi Liu, and Benjamin Van Durme. Ambiguous images with human judgments for robust visual event classification. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=6Hl7XoPNAVX

[40] Luisa Schwirten, Jannes Scholz, Daniel Kondermann, and Janis Keuper. Ambiguous annotations: When is a pedestrian not a pedestrian? In First Vision and Language for Autonomous Driving and Robotics Workshop, 2024. URL https://openreview.net/forum?id=aPzFAopRks
