Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021a

Curtis Northcutt, Lu Jiang, Isaac Chuang · 2017 · arXiv 2103.14749

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

A schema-level diagnostic uses multi-annotator criterion judgments to separate unstable criteria from systematic category overlaps in subjective NLP annotation prior to gold-label creation.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

cs.AI · 2024-11-07 · unverdicted · novelty 7.0

FrontierMath is a new benchmark of hundreds of original, hard math problems, of which current AI models solve less than 2%.

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

cs.CV · 2024-05-01 · unverdicted · novelty 7.0

The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with a 2.2% label error rate, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

cs.IR · 2026-04-07 · unverdicted · novelty 6.0

STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

cs.CL · 2026-04-01 · unverdicted · novelty 4.0

A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04

citing papers explorer

Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks
cs.CL · 2026-04-18 · unverdicted · none · ref 2

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
cs.AI · 2024-11-07 · unverdicted · none · ref 17

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications
cs.CV · 2024-05-01 · unverdicted · none · ref 15

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
cs.IR · 2026-04-07 · unverdicted · none · ref 36

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
cs.CL · 2026-04-01 · unverdicted · none · ref 21

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
cs.AI · 2025-11-04 · unreviewed · ref 17
