A schema-level diagnostic uses multi-annotator criterion judgments to separate unstable criteria from systematic category overlaps in subjective NLP annotation prior to gold-label creation.
Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.
10 Pith papers cite this work.
representative citing papers
FrontierMath is a new benchmark of hundreds of original, hard math problems, of which current AI models solve less than 2%.
The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with a 2.2% label error rate, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.
MMGist filters 23,250 items from 18 benchmarks down to 7,262 using a three-stage pipeline, preserving model rankings (Spearman ρ=0.98) while cutting items by 69% and raising discrimination by 78%.
An automated AEB annotation framework uses data augmentation and noise suppression to achieve an 80% recall improvement and a 50% workload reduction for rare delayed/false triggers under class imbalance and asymmetric label noise.
Representational alignment varies monotonically with SNR and non-monotonically with sample size (minimized near the interpolation threshold) across linear and nonlinear networks, and is decoupled from generalization error.
STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
A validation-free metric combines neighbor consistency and effective rank to estimate face recognition dataset quality for downstream model performance.
A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.
citing papers explorer
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
  FrontierMath is a new benchmark of hundreds of original, hard math problems, of which current AI models solve less than 2%.
- Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications
  The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with a 2.2% label error rate, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.