Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021a

Pervasive label errors in test sets destabilize machine learning benchmarks · 2023 · arXiv 2103.14749

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

A schema-level diagnostic uses multi-annotator criterion judgments to separate unstable criteria from systematic category overlaps in subjective NLP annotation prior to gold-label creation.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

cs.AI · 2024-11-07 · unverdicted · novelty 7.0

FrontierMath is a new benchmark of hundreds of original hard math problems, of which current AI models solve less than 2%.

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

cs.CV · 2024-05-01 · unverdicted · novelty 7.0

The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.

MMGist: A Comprehensive Multimodal Benchmark for 2027

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

MMGist filters 23,250 items from 18 benchmarks down to 7,262 using a three-stage pipeline, preserving model rankings (Spearman ρ=0.98) while cutting the item count by 69% and raising discrimination by 78%.

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

An automated AEB annotation framework uses data augmentation and noise suppression to achieve an 80% recall improvement and a 50% workload reduction for rare delayed/false triggers under class imbalance and asymmetric label noise.

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

stat.ML · 2026-05-26 · unverdicted · novelty 6.0

Representational alignment varies monotonically with SNR and non-monotonically with sample size (minimized near interpolation threshold) across linear and nonlinear networks, and is decoupled from generalization error.

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

cs.IR · 2026-04-07 · unverdicted · novelty 6.0

STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

cs.CV · 2026-05-28 · unverdicted · novelty 4.0

A validation-free metric combines neighbor-consistency and effective rank to estimate face recognition dataset quality for downstream model performance.

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

cs.CL · 2026-04-01 · unverdicted · novelty 4.0

A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04

citing papers explorer

Showing 2 of 2 citing papers after filters.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

cs.AI · 2024-11-07 · unverdicted · none · ref 17

FrontierMath is a new benchmark of hundreds of original hard math problems, of which current AI models solve less than 2%.

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

cs.CV · 2024-05-01 · unverdicted · none · ref 15

The Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy by up to 6.6% over the prior VWW benchmark across architectures and subsets.
