Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications
Pith reviewed 2026-05-24 01:04 UTC · model grok-4.3
The pith
An automated pipeline builds a 6-million-image person detection dataset that raises TinyML model accuracy by up to 6.6 percent over the prior benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Wake Vision pipeline produces a person-detection dataset of almost six million images. Models trained on it achieve up to 6.6 percent higher test accuracy than those trained on Visual Wake Words, match or exceed VWW-trained models on thirteen of sixteen fine-grained subsets, and preserve the advantage on three out-of-distribution test sets.
What carries the argument
The Wake Vision pipeline, which fuses image-level and bounding-box labels, filters by confidence, area, and depiction, corrects evaluation labels, and auto-generates fine-grained benchmark subsets.
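To make the fusion and filtering steps concrete, here is a minimal sketch of how one image's binary label could be derived. It is an illustration under stated assumptions, not the paper's implementation: the thresholds `MIN_CONFIDENCE` and `MIN_BOX_AREA`, the record types, and the rule that bounding-box labels take priority over image-level labels are all hypothetical, since the reviewed text does not report them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical placeholder thresholds; the reviewed text does not report real values.
MIN_CONFIDENCE = 0.7   # assumed minimum confidence for trusting a label
MIN_BOX_AREA = 0.05    # assumed minimum fraction of the image a person box must cover


@dataclass
class Box:
    confidence: float     # labeler confidence in [0, 1]
    area_fraction: float  # box area divided by image area
    is_depiction: bool    # True for drawings, statues, photos of photos, etc.


@dataclass
class ImageRecord:
    image_level_person: Optional[bool]  # None when no image-level label exists
    image_level_confidence: float = 0.0
    person_boxes: List[Box] = field(default_factory=list)


def fuse_label(rec: ImageRecord) -> Optional[int]:
    """Return 1 (person), 0 (no person), or None (exclude the image)."""
    confident = [b for b in rec.person_boxes if b.confidence >= MIN_CONFIDENCE]
    usable = [
        b for b in confident
        if b.area_fraction >= MIN_BOX_AREA and not b.is_depiction
    ]
    if usable:
        # Assumed priority rule: a confident, large, non-depicted box wins.
        return 1
    if confident:
        # Only small or depicted people were found: treat as ambiguous and drop.
        return None
    if rec.image_level_person is not None and rec.image_level_confidence >= MIN_CONFIDENCE:
        # Fall back to the image-level label when no confident box exists.
        return 1 if rec.image_level_person else 0
    return None
```

Returning `None` for ambiguous images is one possible design; it trades dataset size for label quality, which is exactly the kind of choice that could introduce the selection bias discussed under the load-bearing premise.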
If this is right
- Models reach up to 6.6 percent higher accuracy on the held-out test set.
- Performance matches or exceeds the prior benchmark on thirteen of sixteen fine-grained subsets covering perceived age, perceived gender, distance, lighting, and depiction.
- The accuracy advantage persists when tested on three separate out-of-distribution datasets.
- A manually relabeled validation and test set with a 2.2 percent label error rate replaces a benchmark whose estimated label error rate is 7.8 percent.
Where Pith is reading between the lines
- The same curation steps could be applied to other binary TinyML tasks such as keyword spotting or anomaly detection to reduce reliance on hand-labeled data.
- If the pipeline scales, practitioners might shift effort from architecture search toward repeated dataset refresh cycles.
- Lower label error on evaluation sets could become a required reporting item for future TinyML benchmarks.
Load-bearing premise
The automated fusion and filtering steps improve label quality without creating new selection biases that would account for the observed accuracy gains.
What would settle it
Retrain the same four architectures on a version of the Wake Vision training set whose labels have been corrupted to the original 7.8 percent error rate and check whether test accuracy falls back to Visual Wake Words levels.
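The label-corruption step of that experiment can be sketched as follows. This is a hedged illustration: it assumes symmetric noise (positives and negatives flipped uniformly at random) and flips a fixed fraction of labels on top of whatever errors the training set already contains, which the reviewed text says have not been measured at scale. The function name `corrupt_labels` is hypothetical.

```python
import random
from typing import List


def corrupt_labels(labels: List[int], error_rate: float, seed: int = 0) -> List[int]:
    """Flip a fixed fraction of binary labels, chosen uniformly at random."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(error_rate * len(labels))
    # Sample indices without replacement so exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Usage: inject 7.8% symmetric noise into a toy label list.
clean = [0, 1] * 500
noisy = corrupt_labels(clean, error_rate=0.078, seed=1)
```

Fixing the seed keeps the same labels corrupted across epochs and across architectures, so differences in accuracy are not confounded by different noise draws.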
Original abstract
Tiny machine learning (TinyML) co-locates models with sensors on microcontrollers, where small models (which are disproportionately sensitive to label noise) and bespoke binary tasks (which lack standard benchmarks) make general-purpose dataset practices a poor fit. Visual Wake Words (VWW), the prior standard TinyML person detection benchmark, contains roughly 123K images and has an estimated label error rate of 7.8%, which limits its usefulness for production-grade systems. Manual labeling, however, is prohibitively expensive for the scale and diversity of TinyML use cases. We address this gap with the Wake Vision pipeline, an automated method for generating and curating large-scale binary classification datasets for TinyML. We use data-centric TinyML for the dataset construction, curation, and lifecycle methods that produce the large, well-curated datasets these systems require. The pipeline combines label fusion across image-level and bounding-box sources, confidence-, area-, and depiction-aware filtering, label correction on the evaluation splits, and automatic generation of fine-grained benchmark subsets. Applying it to person detection, we release Wake Vision, a dataset of almost 6M images (close to 100x more person images than VWW) with a manually relabeled validation and test set at a 2.2% label error rate. Models trained on Wake Vision improve test accuracy by up to 6.6% over VWW across MobileNetV2, MCUNet, MicroNets, and ColabNAS architectures, and match or exceed VWW-trained models on 13 of 16 fine-grained subsets covering perceived gender, perceived age, distance, lighting, and depictions. The advantage holds under distribution shift on three out-of-distribution datasets covering driving and overhead-surveillance imagery. All artifacts are released under CC-BY 4.0 through TensorFlow Datasets and Hugging Face.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Wake Vision, an automated pipeline for generating and curating large-scale binary classification datasets for TinyML vision tasks (focused on person detection). It releases a ~6M-image dataset (vs. VWW's 123K) with label fusion across sources, confidence/area/depiction filtering, and manual correction on eval splits (reducing label error to 2.2%), claiming up to 6.6% test accuracy gains over VWW across MobileNetV2, MCUNet, MicroNets, and ColabNAS, plus advantages on 13/16 fine-grained subsets and three OOD sets.
Significance. If the reported gains can be attributed to the curation pipeline rather than scale alone, the work supplies a useful open dataset, benchmark suite, and methodology for TinyML, where small models are sensitive to label noise and bespoke tasks lack standards. The CC-BY 4.0 release via TensorFlow Datasets and Hugging Face, plus manual relabeling of validation/test sets, supports reproducibility.
major comments (3)
- [Abstract, results] Abstract and results sections: The central claim attributes up to 6.6% accuracy lift (and gains on 13/16 subsets plus OOD) to the label-fusion + filtering + correction pipeline, but no ablation experiments hold dataset size fixed while varying the curation steps. With Wake Vision at ~6M images vs. VWW's 123K, the contribution of scale versus quality cannot be isolated; this is load-bearing for the attribution.
- [Pipeline description] Methods description of the pipeline: Exact numerical thresholds for confidence, area, and depiction filtering are not reported, nor is the precise rule for resolving conflicts during label fusion across image-level and bounding-box sources. This prevents assessment of whether filtering introduces new selection biases that could explain the accuracy differences.
- [Evaluation] Evaluation protocol: Manual label correction and error-rate measurement (2.2%) apply only to validation and test splits; training-set label quality is not similarly verified at scale. Given that small models are disproportionately sensitive to label noise, this leaves open whether training data improvements are real or confounded by distribution shifts from filtering.
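The size-matched ablation requested in the first major comment could start from a subsample like the one below. This is a sketch, not a procedure from the paper: it assumes samples are `(image_id, label)` pairs and draws a class-stratified random subset of roughly the target size (for example, about 123K to match VWW), so that curation steps can be varied while scale is held fixed. The helper name `size_matched_subset` is hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, int]  # (image_id, binary label); an assumed representation


def size_matched_subset(
    samples: Sequence[Sample], target_size: int, seed: int = 0
) -> List[Sample]:
    """Draw a class-stratified random subset of roughly `target_size` samples."""
    rng = random.Random(seed)
    by_label: Dict[int, List[Sample]] = {}
    for item in samples:
        by_label.setdefault(item[1], []).append(item)
    subset: List[Sample] = []
    for _, items in sorted(by_label.items()):
        # Keep each class at the same proportion it has in the full dataset.
        k = round(target_size * len(items) / len(samples))
        subset.extend(rng.sample(items, min(k, len(items))))
    rng.shuffle(subset)
    return subset
```

Training on several such subsets, with and without each filtering step, would separate the effect of curation from the effect of scale; the subset size can differ from `target_size` by a few samples because of per-class rounding.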
minor comments (2)
- [Benchmark subsets] The fine-grained subset definitions (perceived gender, age, distance, lighting, depictions) would benefit from explicit criteria or examples in a table to aid interpretation of the 13/16 match/exceed results.
- [Results] Figure or table presenting per-architecture accuracy numbers should include error bars or p-values to support the 'up to 6.6%' and 'match or exceed' statements.
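The error bars asked for in the last minor comment could be computed along these lines. The sketch assumes several training runs per dataset with different random seeds and uses only the Python standard library; it reports the mean and sample standard deviation, plus Welch's t statistic with its Welch-Satterthwaite degrees of freedom. The accuracy values in the usage lines are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(accs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(accs), statistics.stdev(accs)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    # Raises ZeroDivisionError if both samples have zero variance.
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Usage with placeholder accuracies (hypothetical, for illustration only).
runs_a = [0.80, 0.82, 0.81]
runs_b = [0.76, 0.75, 0.77]
mean_a, std_a = summarize(runs_a)
t_stat, dof = welch_t(runs_a, runs_b)
```

Turning the statistic into a p-value requires the t distribution's CDF, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the p-value directly.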
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We respond point-by-point to the major comments below.
-
Referee: [Abstract, results] Abstract and results sections: The central claim attributes up to 6.6% accuracy lift (and gains on 13/16 subsets plus OOD) to the label-fusion + filtering + correction pipeline, but no ablation experiments hold dataset size fixed while varying the curation steps. With Wake Vision at ~6M images vs. VWW's 123K, the contribution of scale versus quality cannot be isolated; this is load-bearing for the attribution.
Authors: We agree that an ablation holding size fixed would help isolate the effects. However, the core contribution is the automated pipeline that simultaneously enables both large scale and improved quality for TinyML, where manual curation at 6M images is infeasible. VWW is the established benchmark, so the comparison is to that standard rather than a hypothetical large uncurated set. We will revise the abstract and results to clarify that reported gains reflect the full pipeline (scale plus curation) and add a discussion paragraph on the interplay between the two factors.
Revision: partial
-
Referee: [Pipeline description] Methods description of the pipeline: Exact numerical thresholds for confidence, area, and depiction filtering are not reported, nor is the precise rule for resolving conflicts during label fusion across image-level and bounding-box sources. This prevents assessment of whether filtering introduces new selection biases that could explain the accuracy differences.
Authors: We thank the referee for highlighting this reproducibility issue. The exact thresholds and fusion rules were omitted to keep the methods concise but will be restored. The revised manuscript will report the specific confidence thresholds, area cutoffs, depiction criteria, and the conflict-resolution logic (e.g., priority ordering between sources).
Revision: yes
-
Referee: [Evaluation] Evaluation protocol: Manual label correction and error-rate measurement (2.2%) apply only to validation and test splits; training-set label quality is not similarly verified at scale. Given that small models are disproportionately sensitive to label noise, this leaves open whether training data improvements are real or confounded by distribution shifts from filtering.
Authors: We acknowledge the limitation. Manual verification of the full ~6M-image training set is not practical, which is exactly why an automated pipeline is required for TinyML dataset construction. The same filtering steps are applied to training data, and gains on fine-grained subsets and three OOD sets suggest benefits beyond distribution shift alone. We will add an explicit Limitations section discussing training-set label quality and avenues for future estimation of its error rate.
Revision: partial
Circularity Check
No significant circularity; empirical dataset release with external benchmarks.
The paper describes an automated pipeline for constructing the Wake Vision dataset from existing sources, applies filtering and correction steps, releases the data, and reports accuracy improvements from training standard architectures (MobileNetV2, MCUNet, etc.) on it versus the external VWW benchmark. No equations, parameters fitted to subsets then re-predicted, or self-referential derivations appear. Claims rest on external model training runs and comparisons to prior datasets, not on definitions that reduce outputs to inputs by construction. Minor self-citations, if present, are not load-bearing for the central empirical results.
Axiom & Free-Parameter Ledger
free parameters (1)
- filtering thresholds
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: The Wake Vision pipeline combines label fusion across image-level and bounding-box sources, confidence-, area-, and depiction-aware filtering...
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Models trained on Wake Vision improve test accuracy by up to 6.6% over VWW...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.