Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021a.
6 Pith papers cite this work.
citing papers explorer
- Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks
A schema-level diagnostic uses multi-annotator criterion judgments to separate unstable criteria from systematic category overlaps in subjective NLP annotation prior to gold-label creation.
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
FrontierMath is a new benchmark of hundreds of original, hard math problems, of which current AI models solve less than 2%.
- Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications
The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.
- Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
- Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.
- DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning