Analysis of the DataComp dataset finds at least 122 million samples with copyright notices, 60% of samples from top domains on sites whose terms prohibit scraping, and 9-13% of samples containing watermarks that standard detection tools miss.
The pre-trained MobileViTv2 (Mehta and Rastegari 2022) is loaded via the Hugging Face checkpoint apple/mobilevitv2-1.0-imagenet1k-256.
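A minimal sketch of how such a checkpoint might be loaded, assuming the Hugging Face `transformers` library with PyTorch installed. The excerpt does not say which model class or preprocessing was used, so the image-classification head, the `AutoImageProcessor`, and the helper name `load_mobilevitv2` are assumptions made for illustration.

```python
# Checkpoint identifier exactly as quoted in the citing text.
CHECKPOINT = "apple/mobilevitv2-1.0-imagenet1k-256"


def load_mobilevitv2(checkpoint: str = CHECKPOINT):
    """Return (processor, model) for a MobileViTv2 checkpoint.

    Hypothetical helper: the source only names the checkpoint, so the
    choice of the classification head is an assumption.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoImageProcessor, MobileViTV2ForImageClassification

    # Downloads the preprocessing config and weights from the Hub on first use.
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = MobileViTV2ForImageClassification.from_pretrained(checkpoint)
    model.eval()  # inference mode: disables dropout, fixes batch-norm statistics
    return processor, model
```

With a PIL image in hand, `processor(images=image, return_tensors="pt")` produces the model inputs and `model(**inputs).logits` gives the ImageNet-1k class scores. The first call needs network access to fetch the weights.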
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CY (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
How Do Data Owners Say No? A Case Study of Data Consent Mechanisms in Web-Scraped Vision-Language AI Training Datasets