Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

Andrea Mattia Garavagno; Colby Banbury; Emil Njor; Manjunath Kudlur; Mark Mazumder; Matthew Stewart; Nat Jeffries; Pete Warden; Vijay Janapa Reddi; Xenofon Fafoutis

arxiv: 2405.00892 · v6 · submitted 2024-05-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-24 01:04 UTC · model grok-4.3

keywords: TinyML, person detection, dataset curation, label noise, computer vision benchmark, microcontroller vision, data-centric ML, binary classification

The pith

An automated pipeline builds a 6-million-image person detection dataset that raises TinyML model accuracy by up to 6.6 percent over the prior benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that general-purpose dataset practices fit TinyML poorly because small models are disproportionately sensitive to label noise and most bespoke binary tasks lack tailored benchmarks. It presents the Wake Vision pipeline, which fuses labels from multiple sources, applies confidence, area, and depiction filters, corrects evaluation labels, and creates fine-grained test subsets. Applied to person detection, the pipeline yields a dataset of almost six million images, with close to one hundred times more person images than the previous standard, and a manually relabeled validation and test set whose label error rate is 2.2 percent, against an estimated 7.8 percent for Visual Wake Words. Models trained on the new data reach higher accuracy across four architectures and match or exceed the prior benchmark on most demographic and environmental subsets as well as on out-of-distribution imagery. A reader would care because TinyML devices must run reliable vision tasks on microcontrollers, where data quality directly limits what can be deployed.

Core claim

The Wake Vision pipeline produces a person-detection dataset of almost six million images whose models achieve up to 6.6 percent higher test accuracy than those trained on Visual Wake Words, match or exceed prior performance on thirteen of sixteen fine-grained subsets, and preserve the advantage on three out-of-distribution test sets.

What carries the argument

The Wake Vision pipeline, which fuses image-level and bounding-box labels, filters by confidence, area, and depiction, corrects evaluation labels, and auto-generates fine-grained benchmark subsets.
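
A minimal sketch of how fusion and filtering of this kind could fit together. Everything named here is an assumption for illustration: the record fields, the two threshold values, and the rule that bounding-box annotations take priority over image-level labels. The page does not report the actual cutoffs or the conflict-resolution rule, so this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder values for illustration only; the real cutoffs are not given here.
MIN_CONFIDENCE = 0.5  # hypothetical minimum confidence for image-level labels
MIN_BOX_AREA = 0.05   # hypothetical minimum box area, as a fraction of the image


@dataclass
class Box:
    area_fraction: float        # box area divided by image area, in [0, 1]
    is_depiction: bool = False  # e.g. a painting or photograph of a person


@dataclass
class ImageRecord:
    image_level_person: Optional[bool] = None  # None means no image-level label
    image_level_confidence: float = 0.0
    person_boxes: List[Box] = field(default_factory=list)
    has_box_annotations: bool = False  # True if the image was box-annotated at all


def fuse_label(rec: ImageRecord) -> Optional[int]:
    """Return 1 (person), 0 (no person), or None (exclude the image)."""
    if rec.has_box_annotations:
        # Assumed priority rule: box annotations override image-level labels.
        usable = [b for b in rec.person_boxes
                  if b.area_fraction >= MIN_BOX_AREA and not b.is_depiction]
        if usable:
            return 1
        if rec.person_boxes:
            # Only tiny or depicted persons remain: ambiguous, so drop the image.
            return None
        return 0
    # Fall back to the image-level label, ignoring low-confidence ones.
    if rec.image_level_person is None or rec.image_level_confidence < MIN_CONFIDENCE:
        return None
    return 1 if rec.image_level_person else 0
```

Returning `None` for ambiguous images, rather than forcing a label, is one way to trade dataset size for label quality; whether the real pipeline excludes or relabels such images is not stated here.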

If this is right

Models reach up to 6.6 percent higher accuracy on the held-out test set.
Performance matches or exceeds the prior benchmark on thirteen of sixteen fine-grained subsets covering perceived age, perceived gender, distance, lighting, and depictions.
The accuracy advantage persists when tested on three separate out-of-distribution datasets.
A manually verified validation and test set at 2.2 percent label error replaces the prior 7.8 percent error set.
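
The "thirteen of sixteen" style of claim can be checked with a simple per-subset tally. The sketch below assumes per-subset accuracies are already available as dictionaries; the function name, subset names, and numbers are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def subsets_matched_or_exceeded(new_acc: Dict[str, float],
                                baseline_acc: Dict[str, float],
                                tolerance: float = 0.0) -> List[str]:
    """Names of subsets where the new model is at least as accurate as the baseline."""
    mismatched = set(new_acc) ^ set(baseline_acc)
    if mismatched:
        raise ValueError(f"subset names differ: {sorted(mismatched)}")
    return sorted(name for name, acc in new_acc.items()
                  if acc + tolerance >= baseline_acc[name])


# Made-up accuracies for three hypothetical subsets (not results from the paper).
new = {"dark": 0.85, "far": 0.78, "depiction": 0.80}
base = {"dark": 0.81, "far": 0.78, "depiction": 0.83}
kept = subsets_matched_or_exceeded(new, base)
print(f"{len(kept)} of {len(new)} subsets match or exceed: {kept}")
```

The `tolerance` argument is there because "match" is only meaningful up to run-to-run variation; how the paper defines a match is not stated on this page.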

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same curation steps could be applied to other binary TinyML tasks such as keyword spotting or anomaly detection to reduce reliance on hand-labeled data.
If the pipeline scales, practitioners might shift effort from architecture search toward repeated dataset refresh cycles.
Lower label error on evaluation sets could become a required reporting item for future TinyML benchmarks.

Load-bearing premise

The automated fusion and filtering steps improve label quality without creating new selection biases that would account for the observed accuracy gains.

What would settle it

Retrain the same four architectures on a version of the Wake Vision training set whose labels have been corrupted to the original 7.8 percent error rate and check whether test accuracy falls back to Visual Wake Words levels.
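
One way to run such a corruption experiment is to flip each training label independently with a fixed probability. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes binary labels, symmetric noise, and a known starting error rate d. If each label is flipped with probability p, the new error rate is e = d(1 - p) + (1 - d)p, which gives p = (e - d) / (1 - 2d). The training-set error rate is not reported on this page, so any value passed as `current_error` is itself an assumption.

```python
import random
from typing import List, Sequence


def flip_probability(current_error: float, target_error: float) -> float:
    """Flip probability p that moves error rate d to e, from e = d + p * (1 - 2d)."""
    if not 0.0 <= current_error < 0.5:
        raise ValueError("current_error must be in [0, 0.5)")
    if not current_error <= target_error <= 0.5:
        raise ValueError("target_error must be in [current_error, 0.5]")
    return (target_error - current_error) / (1.0 - 2.0 * current_error)


def corrupt_labels(labels: Sequence[int], p: float, seed: int = 0) -> List[int]:
    """Flip each binary label independently with probability p."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < p else y for y in labels]


# Example: assume a 2.2% starting error rate and aim for 7.8%.
p = flip_probability(0.022, 0.078)
noisy = corrupt_labels([0, 1, 1, 0, 1], p, seed=42)
```

Flipping once with a fixed seed, before training starts, keeps the corrupted labels identical across epochs, which is closer to how real label errors behave than re-sampling the noise on every pass.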

Figures

Figures reproduced from arXiv: 2405.00892 by Andrea Mattia Garavagno, Colby Banbury, Emil Njor, Manjunath Kudlur, Mark Mazumder, Matthew Stewart, Nat Jeffries, Pete Warden, Vijay Janapa Reddi, Xenofon Fafoutis.

**Figure 2.** The Wake Vision dataset generation pipeline. Open Images image-level and …

**Figure 3.** A flowchart of the bounding-box filtering process of an image for person detection.

**Figure 4.** Examples of challenging outlier images.

… dimensions, including lighting conditions, subject distance, and demographic attributes, enabling developers to identify potential biases or limitations during the design phase rather than after deployment. The suite comprises five fine-grained benchmark sets, three of which (Distance, Lighting, and Depictions) are applicable to any dataset generated by the Wake Visi…

**Figure 5.** Images from each fine-grained benchmark dataset.

**Figure 6.** The Wake Vision Challenge submissions advanced our initial Pareto frontiers.

**Figure 7.** Positive class examples from the test splits of our Wake Vision generated datasets

**Figure 8.** Four positive and four negative examples drawn randomly from each of our three …

**Figure 9.** Pareto frontiers for Wake Vision vs VWW on Out-of-Distribution datasets

**Figure 10.** Impact of dataset error rate and size on models of varying capacity. The …

**Figure 11.** Cross-evaluation results on the Visual Wake Words test set

**Figure 12.** Effects of scaling the image size vs. the model width on Wake Vision test …

**Figure 13.** This further demonstrates the importance of fine-grained analysis, as some real …

**Figure 13.** Impact of grayscale input images on the lighting benchmarks: dark, normal light, …

**Figure 14.** A screenshot of the labeling menu used to manually label the validation and test …

Original abstract

Tiny machine learning (TinyML) co-locates models with sensors on microcontrollers, where small models (which are disproportionately sensitive to label noise) and bespoke binary tasks (which lack standard benchmarks) make general-purpose dataset practices a poor fit. Visual Wake Words (VWW), the prior standard TinyML person detection benchmark, contains roughly 123K images and has an estimated label error rate of 7.8%, which limits its usefulness for production-grade systems. Manual labeling, however, is prohibitively expensive for the scale and diversity of TinyML use cases. We address this gap with the Wake Vision pipeline, an automated method for generating and curating large-scale binary classification datasets for TinyML. We use data-centric TinyML for the dataset construction, curation, and lifecycle methods that produce the large, well-curated datasets these systems require. The pipeline combines label fusion across image-level and bounding-box sources, confidence-, area-, and depiction-aware filtering, label correction on the evaluation splits, and automatic generation of fine-grained benchmark subsets. Applying it to person detection, we release Wake Vision, a dataset of almost 6M images (close to 100x more person images than VWW) with a manually relabeled validation and test set at a 2.2% label error rate. Models trained on Wake Vision improve test accuracy by up to 6.6% over VWW across MobileNetV2, MCUNet, MicroNets, and ColabNAS architectures, and match or exceed VWW-trained models on 13 of 16 fine-grained subsets covering perceived gender, perceived age, distance, lighting, and depictions. The advantage holds under distribution shift on three out-of-distribution datasets covering driving and overhead-surveillance imagery. All artifacts are released under CC-BY 4.0 through TensorFlow Datasets and Hugging Face.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Wake Vision scales the TinyML person detection benchmark significantly and reports accuracy gains, but without size-controlled ablations the role of the curation pipeline remains unclear.

The key point is that Wake Vision provides a much larger and cleaner dataset for TinyML person detection than the existing VWW benchmark, with accuracy improvements of up to 6.6% across several models. What remains unclear is how much of that gain comes from the curation steps and how much from the sheer increase in data volume.

The new contribution is the Wake Vision pipeline, which combines label fusion from multiple sources, applies filters for confidence, area, and depiction, corrects labels manually on the validation and test sets, and creates fine-grained subsets automatically. This produces a dataset of nearly 6 million images with a verified 2.2% label error rate on the validation and test sets. Models like MobileNetV2, MCUNet, MicroNets, and ColabNAS trained on it match or exceed those trained on VWW on 13 of 16 subsets covering gender, age, distance, lighting, and depictions, and the gains carry over to out-of-distribution data from driving and surveillance scenarios. Making the dataset available via TensorFlow Datasets and Hugging Face under CC-BY 4.0 is a practical move that lowers the barrier for others.

What the paper does well is address a real pain point in TinyML, where small models suffer from label noise and standard datasets don't fit bespoke binary tasks. The scale increase and the focus on fine-grained and OOD testing are solid additions.

The soft spots are around isolating the effects of the pipeline. The training set is roughly 50 times larger than VWW's, and without ablations that keep the number of images constant while changing the filtering or fusion, the accuracy lift could be explained by data quantity alone. Any filtering that keeps easier examples could also create distribution shifts that favor certain models. The abstract lacks details on exact thresholds and how label conflicts are handled, which makes it harder to assess reproducibility. The stress-test concern about post-hoc curation is fair given the information available.

This paper is for TinyML researchers and practitioners who build vision models for microcontrollers and need better benchmarks than VWW. A reader focused on dataset curation or robustness testing would get value from the subsets and OOD results. It deserves a serious referee because the dataset release is a tangible resource and the empirical comparisons are worth verifying, even if the attribution to the curation method requires more evidence. I recommend sending it to peer review.
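
The size-controlled ablation the letter asks for needs a training subset of the new dataset that matches VWW in size. The sketch below shows one hedged way to draw such a subset while preserving the class balance; the function name and the choice of stratified sampling are assumptions for illustration, not something the paper describes.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_subsample(labels: Sequence[int], n_total: int, seed: int = 0) -> List[int]:
    """Pick n_total indices while keeping each class's share of the full set."""
    if not 0 < n_total <= len(labels):
        raise ValueError("n_total must be between 1 and len(labels)")
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    chosen: List[int] = []
    for _, idxs in sorted(by_class.items()):
        k = round(n_total * len(idxs) / len(labels))
        chosen.extend(rng.sample(idxs, min(k, len(idxs))))

    # Rounding can leave the total slightly off; trim or top up at random.
    if len(chosen) > n_total:
        chosen = rng.sample(chosen, n_total)
    elif len(chosen) < n_total:
        taken = set(chosen)
        rest = [i for i in range(len(labels)) if i not in taken]
        chosen.extend(rng.sample(rest, n_total - len(chosen)))
    return sorted(chosen)


# Toy example: 1,000 labels at 30% positive, subsampled to 100.
toy_labels = [1] * 300 + [0] * 700
subset = stratified_subsample(toy_labels, 100, seed=7)
```

For the real comparison, `n_total` would be set to roughly 123,000, the approximate VWW size quoted in the abstract, and the draw repeated over several seeds so that the variance from the subsampling itself can be reported.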

Referee Report

3 major / 2 minor

Summary. The paper presents Wake Vision, an automated pipeline for generating and curating large-scale binary classification datasets for TinyML vision tasks (focused on person detection). It releases a ~6M-image dataset (vs. VWW's 123K) with label fusion across sources, confidence/area/depiction filtering, and manual correction on eval splits (reducing label error to 2.2%), claiming up to 6.6% test accuracy gains over VWW across MobileNetV2, MCUNet, MicroNets, and ColabNAS, plus advantages on 13/16 fine-grained subsets and three OOD sets.

Significance. If the reported gains can be attributed to the curation pipeline rather than scale alone, the work supplies a useful open dataset, benchmark suite, and methodology for TinyML, where small models are sensitive to label noise and bespoke tasks lack standards. The CC-BY 4.0 release via TensorFlow Datasets and Hugging Face, plus manual relabeling of validation/test sets, supports reproducibility.

major comments (3)

[Abstract, results] Abstract and results sections: The central claim attributes up to 6.6% accuracy lift (and gains on 13/16 subsets plus OOD) to the label-fusion + filtering + correction pipeline, but no ablation experiments hold dataset size fixed while varying the curation steps. With Wake Vision at ~6M images vs. VWW's 123K, the contribution of scale versus quality cannot be isolated; this is load-bearing for the attribution.
[Pipeline description] Methods description of the pipeline: Exact numerical thresholds for confidence, area, and depiction filtering are not reported, nor is the precise rule for resolving conflicts during label fusion across image-level and bounding-box sources. This prevents assessment of whether filtering introduces new selection biases that could explain the accuracy differences.
[Evaluation] Evaluation protocol: Manual label correction and error-rate measurement (2.2%) apply only to validation and test splits; training-set label quality is not similarly verified at scale. Given that small models are disproportionately sensitive to label noise, this leaves open whether training data improvements are real or confounded by distribution shifts from filtering.
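
On the third comment: a full manual check of the training set is impractical, but its label error rate can be estimated from a random audit sample. The sketch below computes a Wilson score interval for a binomial proportion; the audit numbers in the example are hypothetical, and nothing here reflects how the paper measured its 2.2% figure.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error rate observed as `errors` out of `n`."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p_hat = errors / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 22 mislabeled images found in a random sample of 1,000.
low, high = wilson_interval(22, 1000)
print(f"estimated error rate 2.2%, 95% interval {low:.3%} to {high:.3%}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the observed rate is close to zero, which is the regime a well-curated dataset should be in.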

minor comments (2)

[Benchmark subsets] The fine-grained subset definitions (perceived gender, age, distance, lighting, depictions) would benefit from explicit criteria or examples in a table to aid interpretation of the 13/16 match/exceed results.
[Results] Figure or table presenting per-architecture accuracy numbers should include error bars or p-values to support the 'up to 6.6%' and 'match or exceed' statements.
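
As an illustration of the second minor comment, the sketch below puts an interval around the accuracy difference between two models evaluated on the same test images, using a paired percentile bootstrap. This is one possible technique chosen for the example, not the method used in the paper; the function name and inputs (per-image 0/1 correctness lists) are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(correct_a: Sequence[int],
                          correct_b: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Accuracy(A) - accuracy(B) with a percentile bootstrap interval.

    Both inputs hold 1 (correct) or 0 (wrong) for the same test images, in the same order.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test images with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2.0) * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1.0 - alpha / 2.0) * n_resamples))]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, low, high


# Toy data: model A is right on 90 of 100 images, model B on 80 of the same 100.
a = [1] * 90 + [0] * 10
b = [1] * 80 + [0] * 20
point, low, high = paired_bootstrap_diff(a, b)
```

An interval that excludes zero supports an "exceeds" statement, while one that straddles zero is closer to "matches". Variation across training seeds is a separate source of uncertainty that this sketch does not capture.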

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We respond point-by-point to the major comments below.

Referee: [Abstract, results] Abstract and results sections: The central claim attributes up to 6.6% accuracy lift (and gains on 13/16 subsets plus OOD) to the label-fusion + filtering + correction pipeline, but no ablation experiments hold dataset size fixed while varying the curation steps. With Wake Vision at ~6M images vs. VWW's 123K, the contribution of scale versus quality cannot be isolated; this is load-bearing for the attribution.

Authors: We agree that an ablation holding size fixed would help isolate the effects. However, the core contribution is the automated pipeline that simultaneously enables both large scale and improved quality for TinyML, where manual curation at 6M images is infeasible. VWW is the established benchmark, so the comparison is to that standard rather than a hypothetical large uncurated set. We will revise the abstract and results to clarify that reported gains reflect the full pipeline (scale plus curation) and add a discussion paragraph on the interplay between the two factors.

Revision: partial

Referee: [Pipeline description] Methods description of the pipeline: Exact numerical thresholds for confidence, area, and depiction filtering are not reported, nor is the precise rule for resolving conflicts during label fusion across image-level and bounding-box sources. This prevents assessment of whether filtering introduces new selection biases that could explain the accuracy differences.

Authors: We thank the referee for highlighting this reproducibility issue. The exact thresholds and fusion rules were omitted to keep the methods concise but will be restored. The revised manuscript will report the specific confidence thresholds, area cutoffs, depiction criteria, and the conflict-resolution logic (e.g., priority ordering between sources).

Revision: yes

Referee: [Evaluation] Evaluation protocol: Manual label correction and error-rate measurement (2.2%) apply only to validation and test splits; training-set label quality is not similarly verified at scale. Given that small models are disproportionately sensitive to label noise, this leaves open whether training data improvements are real or confounded by distribution shifts from filtering.

Authors: We acknowledge the limitation. Manual verification of the full ~6M-image training set is not practical, which is exactly why an automated pipeline is required for TinyML dataset construction. The same filtering steps are applied to training data, and gains on fine-grained subsets and three OOD sets suggest benefits beyond distribution shift alone. We will add an explicit Limitations section discussing training-set label quality and avenues for future estimation of its error rate.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical dataset release with external benchmarks.

full rationale

The paper describes an automated pipeline for constructing the Wake Vision dataset from existing sources, applies filtering and correction steps, releases the data, and reports accuracy improvements from training standard architectures (MobileNetV2, MCUNet, etc.) on it versus the external VWW benchmark. No equations, parameters fitted to subsets then re-predicted, or self-referential derivations appear. Claims rest on external model training runs and comparisons to prior datasets, not on definitions that reduce outputs to inputs by construction. Minor self-citations, if present, are not load-bearing for the central empirical results.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The pipeline relies on external image sources and manual relabeling whose selection criteria are not quantified in the abstract. The only free parameters are the filtering thresholds listed below; no axioms or invented entities are stated.

free parameters (1)

filtering thresholds
Confidence, area, and depiction cutoffs used in curation are chosen but not reported in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The Wake Vision pipeline combines label fusion across image-level and bounding-box sources, confidence-, area-, and depiction-aware filtering..."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Models trained on Wake Vision improve test accuracy by up to 6.6% over VWW..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Mlperf tiny benchmark.arXiv preprint arXiv:2106.07597, 2021a

Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, et al. Mlperf tiny benchmark.arXiv preprint arXiv:2106.07597, 2021a. Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi, Matthew Mattina, and Paul Wh...

work page arXiv 2003

[2] [2]

Sara Beery, Grant Van Horn, and Pietro Perona

Ac- cessed: 2025-05-06. Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. InPro- ceedings of the European conference on computer vision (ECCV), pages 456–473,

work page 2025

[3] [3]

Are we done with imagenet?arXiv preprint arXiv:2006.07159,

Lucas Beyer, Olivier J H´ enaff, Alexander Kolesnikov, Xiaohua Zhai, and A¨ aron van den Oord. Are we done with imagenet?arXiv preprint arXiv:2006.07159,

work page arXiv 2006

[4] [4]

Visual Wake Words Dataset

Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, and Rocky Rhodes. Visual wake words dataset.arXiv preprint arXiv:1906.05721,

work page internal anchor Pith review Pith/arXiv arXiv 1906

[5] [5]

Challenge edge: Wake vision.https://edgeai.modelnova.ai/ challenges/details/challenge-edge:-wake-vision, 2025a

EDGE AI FOUNDATION. Challenge edge: Wake vision.https://edgeai.modelnova.ai/ challenges/details/challenge-edge:-wake-vision, 2025a. Accessed: 2025-05-06. EDGE AI FOUNDATION. Challenge edge: Wake vision.https://edgeai.modelnova.ai/ challenges/details/edge-ai-challenge:-wake-vision-2, 2025b. Accessed: 2025- 05-07. EDGE AI FOUNDATION. Edge ai foundation.http...

[6] [6]

Andrea Mattia Garavagno, Daniele Leonardis, and Antonio Frisoli

Accessed: 2024-03-06. Andrea Mattia Garavagno, Daniele Leonardis, and Antonio Frisoli. Colabnas: Obtaining lightweight task-specific convolutional neural networks following occam’s razor. Future Generation Computer Systems, 152:152–159,

[7] [7]

What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248,

Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248,

[8] [8]

Characterising bias in compressed models. arXiv preprint arXiv:2010.03058,

Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising bias in compressed models. arXiv preprint arXiv:2010.03058,

[9] [9]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

[10] [10]

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846,

[11] [11]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer,

[12] [12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

[13] [13]

Luís C. R. Martins. Surveillance images for person detection. Kaggle dataset, 2025. https://www.kaggle.com/datasets/luiscrmartins/surveillance-images-for-person-detection. Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, et al. Dataperf: Benchmarks ...

[14] [14]

Data aware neural architecture search. arXiv preprint arXiv:2304.01821,

Emil Njor, Jan Madsen, and Xenofon Fafoutis. Data aware neural architecture search. arXiv preprint arXiv:2304.01821,

[15] [15]

Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021a

Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021a. Curtis G Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749, 2021b. Mateusz Piecho...

[16] [16]

Great tinyml needs high-quality data — plumerai blog. https://blog

Plumerai. Great tinyml needs high-quality data — plumerai blog. https://blog.plumerai.com/2021/08/tinyml-data/, August

[17] [17]

Vijay Janapa Reddi

(Accessed on 11/13/2024). Vijay Janapa Reddi. Mlsysbook.ai: Principles and practices of machine learning systems engineering. In 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 41–42. IEEE,

[18] [18]

Energy and Policy Considerations for Deep Learning in NLP

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243,

[19] [19]

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209,

[20] [20]

Human detection dataset

Constantin Werner. Human detection dataset. Kaggle dataset, 2025. https://www.kaggle.com/datasets/constantinwerner/human-detection-dataset. Christian Wojek, Stefan Walk, and Bernt Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE conference on computer vision and pattern recognition, pages 794–801. IEEE,

[21] [21]

Research on different illumination image classification method

WenLi Zhang, HongLu Li, and ZhuoZheng Wang. Research on different illumination image classification method. In 2017 2nd International Conference on Automation, Mechanical Control and Computational Engineering (AMCCE 2017), pages 574–581. Atlantis Press,

[22] [22]

k 2 c 3 18.5 7.66 250,256 WV 70.6±0.96 69.3±0.97 VWW 65.6±0.66 70.7±0.08
k 4 c 5 22 18.49 688,790 WV 75.7±0.18 74±0.23 VWW 69.9±0.26 75.5±0.64
k 8 c 5 32.5 44.56 2,135,476 WV 77.3±0.37 75±0.15 VWW 73±0.91 77.3±0.57

p = (e−d)/(2e−1). A current flaw of this method is that the injected label errors are not consistent between epochs, which would likely be less destr...

[23] [24]

By default, we only respect labels that have a minimum confidence of

Purely machine-generated labels have a fractional confidence score that is generally >= 5 (Kuznetsova et al., 2020; Krasin et al., 2017). By default, we only respect labels that have a minimum confidence of

[24] [25]

Person Body Part Labels. Body parts are more challenging to relabel, as it is dependent on the use case whether a body part should be considered a person

Labels below this threshold are ignored. Person Body Part Labels. Body parts are more challenging to relabel, as it is dependent on the use case whether a body part should be considered a person. For example, a camera that detects whether a person is inside a room to decide if the light should be switched on would want to consider body parts as a person, as...

[25] [26]

stm32tflm

Table 9: Number of images downloaded from Open Images v7. The download occurred between the 28th of November and the 5th of December 2023.

             Train      Validation  Test
Downloaded   7,936,979  36,406      109,305
Errors       1,055,669  5,214       16,131

Appendix I. Person Label Classes

In our default configuration, we consider the following O...

work page doi:10.7910/dvn/1hopxc 2023

[26] [27]

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description

for more information. What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description. The dataset is composed of images, labels, and metadata. Is there a label or target associated with each instance? If so, please provide a description. Yes, every image has a binary label to ...

[27] [28]

dataset and its derivatives. The authors of Open Images tried to identify images that are licensed under a Creative Commons Attribution license but make no representations or warranties regarding the license status of each image; users should verify the license for each image themselves. COLLECTION How was the data associated with each instance acquir...

[28] [29]

details the image acquisition process. What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? The images were sourced from Flickr in an automated pipeline. The original labeling pipeline was a combination of a...

[29] [30]

The primary source of required resources was model training to evaluate the dataset and the storage and bandwidth required to process and upload the dataset to hosting locations

for approaches in this area.) We used cloud TPU credits provided by the Google Cloud TRC program. The primary source of required resources was model training to evaluate the dataset and the storage and bandwidth required to process and upload the dataset to hosting locations. The total size of the dataset is approximately 2 TB. To the auth...

[30] [31]

which originally sourced its images from Flickr (Flickr, 2024). Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. 43 Banbury, N...
