Unmasking and improving data credibility: A study with datasets for training harmless language models

Zhaowei Zhu, Jialu Wang, Hao Cheng, Yang Liu · 2023 · arXiv 2311.11202

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

cs.AI · 2024-08-21 · unverdicted · novelty 5.0

The ADC method automates the creation of large image classification datasets using LLMs and search engines, achieving 79% human agreement and reducing label noise on a 1 million image clothing dataset, while also releasing benchmarks for noise and bias issues.

citing papers explorer

Showing 2 of 2 citing papers.

Evian: Towards Explainable Visual Instruction-tuning Data Auditing cs.CV · 2026-04-22 · unverdicted · none · ref 9
Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond cs.AI · 2024-08-21 · unverdicted · none · ref 25
