BioTrain: Sub-MB, Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3
The pith
BioTrain enables full-network fine-tuning of biosignal AI models on milliwatt-scale edge devices with sub-megabyte memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BioTrain is a framework that supports full-network fine-tuning of state-of-the-art biosignal models under milliwatt-scale power and sub-megabyte memory constraints. An efficient memory allocator and network topology optimization allow large batch sizes during on-chip backpropagation. On the GAP9 MCU this yields 17 samples per second for EEG models and 85 samples per second for EOG models while staying below 50 mW, together with an 8.1x memory reduction from 5.4 MB to 0.67 MB and accuracy improvements of up to 35 percent over non-adapted baselines.
What carries the argument
Efficient memory allocator combined with network topology optimization that permits large batch sizes and cuts peak memory for full backpropagation on constrained MCUs.
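Neither the pith nor the abstract spells out how the allocator achieves this. As a purely hypothetical sketch of the general class of technique, the Python snippet below plans scratch buffers by tensor lifetime and lets a new tensor reuse a buffer once its previous occupant is no longer needed, which is one standard way to cut peak memory for on-chip backpropagation; the tensor names, sizes, and lifetimes are invented for illustration and are not taken from the paper.

```python
# Toy lifetime-based buffer-reuse planner. This is NOT BioTrain's allocator;
# it only illustrates how reusing freed buffers lowers the planned peak
# memory of a forward+backward schedule.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes
    first_use: int   # step index where the tensor is produced
    last_use: int    # last step index that still reads it

def naive_peak(tensors):
    """Every tensor keeps its own buffer for the whole pass (no reuse)."""
    return sum(t.size for t in tensors)

def reuse_peak(tensors):
    """Greedy reuse: a new tensor may take over a buffer that is large
    enough and whose previous occupant's lifetime has already ended."""
    buffers = []  # each entry: [size_bytes, busy_until_step]
    for t in sorted(tensors, key=lambda t: t.first_use):
        free = [b for b in buffers if b[0] >= t.size and b[1] < t.first_use]
        if free:
            buf = min(free, key=lambda b: b[0])  # tightest fit
            buf[1] = t.last_use
        else:
            buffers.append([t.size, t.last_use])
    return sum(size for size, _ in buffers)

if __name__ == "__main__":
    # Invented activation/gradient scratch tensors for a tiny network.
    tensors = [
        Tensor("act1",  256_000, first_use=0, last_use=3),
        Tensor("act2",  128_000, first_use=1, last_use=4),
        Tensor("grad2",  64_000, first_use=3, last_use=4),
        Tensor("grad1", 256_000, first_use=4, last_use=5),  # can reuse act1's buffer
    ]
    print("no reuse  :", naive_peak(tensors), "bytes")
    print("with reuse:", reuse_peak(tensors), "bytes")
```

In this invented schedule the reuse step cuts the planned peak from 704 kB to 448 kB; a production planner would typically assign byte offsets within a single arena and also account for weights and optimizer state.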
If this is right
- Full-network fine-tuning improves accuracy by up to 35 percent over non-adapted baselines on EEG and EOG data.
- It outperforms last-layer updates by roughly 7 percent during new-subject calibration.
- On-device training sustains 17 samples per second for EEG and 85 for EOG within a sub-50 mW power envelope; a back-of-envelope per-sample energy bound follows this list.
- Memory footprint falls 8.1x, from 5.4 MB to 0.67 MB, compared with conventional full-network fine-tuning using batch normalization.
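As noted above, a quick arithmetic check on the stated throughput and power figures: dividing the 50 mW envelope by the reported samples per second gives an upper bound on the energy spent per training sample. This uses only the numbers quoted in the pith and is not a measurement from the paper.

```python
# Back-of-envelope bound derived from the quoted figures only.
POWER_W = 0.050                      # stated sub-50 mW power envelope
THROUGHPUT = {"EEG": 17, "EOG": 85}  # stated on-device training samples/s

for model, samples_per_s in THROUGHPUT.items():
    energy_mj = POWER_W / samples_per_s * 1e3  # joules -> millijoules
    print(f"{model}: at most ~{energy_mj:.2f} mJ per training sample")
```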
Where Pith is reading between the lines
- The same memory-reduction techniques could support on-device adaptation for other variable sensor streams such as audio or motion data on similar hardware.
- Wearable health systems might use continuous on-device fine-tuning to track gradual changes in user signals without cloud round-trips.
- Porting the allocator and topology optimizations to additional MCU families would test how widely the 8x memory savings apply.
Load-bearing premise
The efficient memory allocator and network topology optimization preserve model accuracy while delivering the stated throughput and memory reductions across EEG and EOG models and real deployment conditions.
What would settle it
A measurement on the GAP9 MCU showing that BioTrain cannot sustain 17 samples per second for EEG models, or needs more than 0.67 MB of peak memory, while preserving the reported accuracy gains would falsify the central performance claims.
Original abstract
Biosignals exhibit substantial cross-subject and cross-session variability, inducing severe domain shifts that degrade post-deployment performance for small, edge-oriented AI models. On-device adaptation is therefore essential to both preserve user privacy and ensure system reliability. However, existing sub-100 mW MCU-based wearable platforms can only support shallow or sparse adaptation schemes due to the prohibitive memory footprint and computational cost of full backpropagation (BP). In this paper, we propose BioTrain, a framework enabling full-network fine-tuning of state-of-the-art biosignal models under milliwatt-scale power and sub-megabyte memory constraints. We validate BioTrain using both offline and on-device benchmarks on EEG and EOG datasets, covering Day-1 new-subject calibration and longitudinal adaptation to signal drift. Experimental results show that full-network fine-tuning achieves accuracy improvements of up to 35% over non-adapted baselines and outperforms last-layer updates by approximately 7% during new-subject calibration. On the GAP9 MCU platform, BioTrain enables efficient on-device training throughput of 17 samples/s for EEG and 85 samples/s for EOG models within a power envelope below 50 mW. In addition, BioTrain's efficient memory allocator and network topology optimization enable the use of a large batch size, reducing peak memory usage. For fully on-chip BP on GAP9, BioTrain reduces the memory footprint by 8.1x, from 5.4 MB to 0.67 MB, compared to conventional full-network fine-tuning using batch normalization with batch size 8.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BioTrain, a framework for enabling full-network fine-tuning of state-of-the-art biosignal models (EEG/EOG) on sub-100mW MCU platforms like GAP9. It claims to achieve this under sub-MB memory and sub-50mW power via an efficient memory allocator and network topology optimization, reporting up to 35% accuracy gains over non-adapted baselines, ~7% over last-layer updates, throughputs of 17 samples/s (EEG) and 85 samples/s (EOG), and an 8.1x memory reduction (5.4 MB to 0.67 MB) compared to conventional BP with batch-norm and batch size 8. Validation covers offline and on-device benchmarks for new-subject calibration and longitudinal drift adaptation.
Significance. If the optimizations truly preserve model capacity, gradients, and numerical behavior equivalent to standard backpropagation on unmodified SOTA architectures, the work would be significant for practical edge AI in biosignals. It targets a key barrier (memory/power for full BP) in privacy-preserving on-device adaptation to domain shifts, with plausible quantitative results on accuracy, throughput, and efficiency that could impact wearable health monitoring. The 8.1x memory cut and milliwatt-scale operation stand out, but significance is conditional on verifying the weakest assumption that accuracy is not traded for the reported reductions.
major comments (3)
- [Abstract] The central claim of 'full-network fine-tuning' of unmodified SOTA models is load-bearing for the accuracy results (35% and 7% gains), yet the abstract attributes the 8.1x memory reduction (5.4 MB to 0.67 MB) and large-batch BP to 'network topology optimization' without specifying whether this alters layer count, connectivity, normalization layers, or other structural elements. If the optimized topology differs from the reference, the gains cannot be interpreted as evidence for full BP on the original models.
- [Abstract] The memory and accuracy comparisons are to 'conventional full-network fine-tuning using batch normalization with batch size 8', while BioTrain uses 'large batch size' enabled by the allocator. This introduces a potential mismatch in training dynamics (batch size affects gradient noise and normalization statistics), undermining direct attribution of the 8.1x reduction and accuracy improvements to the allocator alone; explicit verification that gradients and loss landscapes remain equivalent is needed.
- [Abstract] The on-device results (17/85 samples/s throughput, <50 mW) and accuracy claims rest on the assumption that the memory allocator preserves exact numerical behavior (no selective recomputation or quantization artifacts). Without reported checks (e.g., gradient norm comparisons or floating-point equivalence tests between BioTrain and standard BP), the 35% improvement cannot be confidently linked to full-network adaptation rather than implementation differences.
minor comments (2)
- The abstract would benefit from explicit dataset names, sizes, and train/test splits for the EEG/EOG benchmarks to support reproducibility of the day-1 calibration and longitudinal adaptation results.
- The power (sub-50 mW) and memory (sub-MB) notation is clear, but the abstract could also report peak versus average power and the exact GAP9 MCU configuration.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments highlight important points about clarity in the abstract and the need for explicit verification of numerical equivalence. We have revised the manuscript to address each concern directly, updating the abstract for precision and adding supporting experiments and discussion in the main text.
Point-by-point responses
-
Referee: [Abstract] The central claim of 'full-network fine-tuning' of unmodified SOTA models is load-bearing for the accuracy results (35% and 7% gains), yet the abstract attributes the 8.1x memory reduction (5.4 MB to 0.67 MB) and large-batch BP to 'network topology optimization' without specifying whether this alters layer count, connectivity, normalization layers, or other structural elements. If the optimized topology differs from the reference, the gains cannot be interpreted as evidence for full BP on the original models.
Authors: We appreciate the referee identifying this ambiguity. The network topology optimization in BioTrain refers exclusively to memory-efficient execution strategies (e.g., in-place operations and optimized activation buffering) that do not change layer count, connectivity, or normalization layers. The underlying SOTA model architectures remain unmodified. We have revised the abstract to state this explicitly and added a clarifying paragraph in Section 3.2. revision: yes
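The response names in-place operations and activation buffering without code. As a minimal PyTorch illustration (not BioTrain's implementation), the snippet below shows why an in-place activation can eliminate a buffer without changing gradients: the ReLU overwrites the matmul output instead of allocating a second tensor, and the resulting gradients match the out-of-place reference.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)

# Reference: out-of-place ReLU allocates a second activation tensor.
loss_ref = torch.relu(x @ w).sum()
loss_ref.backward()
grads_ref = (x.grad.clone(), w.grad.clone())

x.grad, w.grad = None, None

# In-place ReLU overwrites the matmul output buffer instead; autograd
# still reconstructs the same gradients.
h = x @ w
loss_inp = torch.relu_(h).sum()
loss_inp.backward()

print(torch.allclose(grads_ref[0], x.grad), torch.allclose(grads_ref[1], w.grad))
```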
-
Referee: [Abstract] The memory and accuracy comparisons are to 'conventional full-network fine-tuning using batch normalization with batch size 8', while BioTrain uses 'large batch size' enabled by the allocator. This introduces a potential mismatch in training dynamics (batch size affects gradient noise and normalization statistics), undermining direct attribution of the 8.1x reduction and accuracy improvements to the allocator alone; explicit verification that gradients and loss landscapes remain equivalent is needed.
Authors: The referee correctly notes the batch-size difference. The large batch size is a direct outcome of the memory allocator overcoming the constraints that force conventional BP to batch size 8. We have added a dedicated discussion of batch-size effects on gradient noise and normalization, along with gradient-norm comparisons between the two regimes, to demonstrate that the reported gains are attributable to the allocator while maintaining comparable training dynamics. revision: yes
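The batch-size concern is statistical rather than implementation-specific, so a toy experiment can show its shape. The snippet below (logistic regression on synthetic data, not the paper's models, datasets, or added experiments) measures how far mini-batch gradients stray from the full-batch gradient at batch size 8 versus a larger batch, the kind of gradient-noise comparison the response points to.

```python
import torch

torch.manual_seed(0)
N, D = 512, 16
X = torch.randn(N, D)
y = (X @ torch.randn(D) > 0).float()
w = torch.zeros(D, requires_grad=True)

def batch_grad(idx):
    """Gradient of a logistic loss over one mini-batch."""
    logits = X[idx] @ w
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y[idx])
    return torch.autograd.grad(loss, w)[0]

full = batch_grad(torch.arange(N))  # full-batch reference gradient
for bs in (8, 64):
    devs = torch.stack([
        (batch_grad(torch.randint(0, N, (bs,))) - full).norm()
        for _ in range(200)
    ])
    print(f"batch size {bs:3d}: mean deviation from full-batch gradient {devs.mean():.4f}")
```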
-
Referee: [Abstract] The on-device results (17/85 samples/s throughput, <50 mW) and accuracy claims rest on the assumption that the memory allocator preserves exact numerical behavior (no selective recomputation or quantization artifacts). Without reported checks (e.g., gradient norm comparisons or floating-point equivalence tests between BioTrain and standard BP), the 35% improvement cannot be confidently linked to full-network adaptation rather than implementation differences.
Authors: We agree that explicit numerical-equivalence checks strengthen the claims. We have added offline verification experiments comparing gradient norms, loss curves, and final parameter values between BioTrain and a reference PyTorch backpropagation implementation, confirming equivalence within floating-point tolerance. These results are now reported in the revised experimental section. revision: yes
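The response describes comparing gradient norms, loss curves, and final parameters against a reference PyTorch backpropagation run. The harness below is a hypothetical sketch of that style of check; the parameter name and stand-in tensors are invented, and in practice the two dictionaries would hold gradient dumps from the reference run and from the BioTrain run.

```python
import torch

def compare_grads(ref: dict, opt: dict, rtol=1e-4, atol=1e-6):
    """Report per-parameter agreement between two gradient dumps."""
    assert ref.keys() == opt.keys(), "parameter sets differ"
    all_ok = True
    for name in ref:
        a, b = ref[name], opt[name]
        max_err = (a - b).abs().max().item()
        ok = torch.allclose(a, b, rtol=rtol, atol=atol)
        all_ok &= ok
        print(f"{name:20s} max_abs_err={max_err:.3e} allclose={ok}")
    return all_ok

if __name__ == "__main__":
    # Stand-in tensors; in practice these would be gradient dumps from the
    # reference PyTorch run and from the BioTrain run on (or simulating) GAP9.
    torch.manual_seed(0)
    ref = {"conv1.weight": torch.randn(16, 1, 3, 3)}
    opt = {k: v + 1e-7 * torch.randn_like(v) for k, v in ref.items()}
    print("equivalent within tolerance:", compare_grads(ref, opt))
```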
Circularity Check
No circularity; empirical benchmarks on datasets with direct measurements
Full rationale
The paper describes an engineering framework (BioTrain) with an efficient memory allocator and network topology optimization, validated through offline and on-device experiments on EEG/EOG datasets. Claims of accuracy gains (up to 35%), throughput (17-85 samples/s), and memory reduction (8.1x) are presented as measured outcomes under stated constraints, not as quantities derived from equations or fitted parameters within the paper. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained as implementation + benchmarking rather than mathematical reduction.
Reference graph
Works this paper leans on
- [1] S. Li, Z. Tang, M. Li, L. Yang, and Z. Shang, "A survey of neural signal decoding based on domain adaptation," Neurocomputing, vol. 657, p. 131653, Dec. 2025.
- [2] B. Yang, F. Rong, Y. Xie, D. Li, J. Zhang, F. Li, G. Shi, and X. Gao, "A multi-day and high-quality EEG dataset for motor imagery brain-computer interface," Scientific Data, vol. 12, no. 1, p. 488, Mar. 2025.
- [3] S. Zhu, T. Voigt, F. Rahimian, and J. Ko, "On-device training: A first overview on existing systems," ACM Trans. Sen. Netw., vol. 20, no. 6, pp. 118:1–118:39, Oct. 2024.
- [4] L. Mei, T. M. Ingolfsson, C. Cioflan, V. Kartsch, A. Cossettini, X. Wang, and L. Benini, "An ultra-low power wearable BMI system with continual learning capabilities," IEEE Transactions on Biomedical Circuits and Systems, vol. 19, no. 3, pp. 511–522, Jun. 2025.
- [5] M. Scherer, L. Macan, V. Jung, P. Wiese, L. Bompani, A. Burrello, F. Conti, and L. Benini, "Deeploy: Enabling energy-efficient deployment of small language models on heterogeneous microcontrollers," Aug. 2024.
- [6] D. Nadalini, M. Rusci, G. Tagliavini, L. Ravaglia, L. Benini, and F. Conti, "PULP-TrainLib: Enabling on-device training for RISC-V multi-core MCUs through performance-driven autotuning," in Embedded Computer Systems: Architectures, Modeling, and Simulation, A. Orailoglu, M. Reichenbach, and M. Jung, Eds. Cham: Springer International Publishing, 2022, vo...
- [7] S. Frey, M. A. Lucchini, V. Kartsch, T. M. Ingolfsson, A. H. Bernardi, M. Segessenmann, J. Osieleniec, S. Benatti, L. Benini, and A. Cossettini, "GAPses: Versatile smart glasses for comfortable and fully-dry acquisition and parallel ultra-low-power processing of EEG and EOG," IEEE Transactions on Biomedical Circuits and Systems, vol. 19, no. 3, pp. 616..., 2025.
- [8] Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang, "Parameter-efficient fine-tuning for large models: A comprehensive survey," arXiv preprint arXiv:2403.14608, 2024.
- [9] J. Song, J. Lee, I. S. Kweon, and S. Choi, "EcoTTA: Memory-efficient continual test-time adaptation via self-distilled regularization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11920–11929.
- [10] H. Jia, Y. Kwon, A. Orsino, T. Dang, D. Talia, and C. Mascolo, "TinyTTA: Efficient test-time adaptation via early-exit ensembles on edge devices," Advances in Neural Information Processing Systems, vol. 37, pp. 43274–43299, 2024.
- [11] J. Lin, L. Zhu, W.-M. Chen, W.-C. Wang, and S. Han, "Tiny machine learning: Progress and futures," IEEE Circuits and Systems Magazine, vol. 23, no. 3, pp. 8–34, 2023.
- [12] M. Amine Hamdi, F. Daghero, G. Maria Sarda, J. Van Delm, A. Symons, L. Benini, M. Verhelst, D. Jahier Pagliari, and A. Burrello, "MATCH: Model-aware TVM-based compilation for heterogeneous edge devices," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 10, pp. 3844–3857, Oct. 2025.
- [13] R. Wang, V. J. B. Jung, P. Wiese, F. Conti, A. Burrello, and L. Benini, "TrainDeeploy: Hardware-accelerated parameter-efficient fine-tuning of small transformer models at the extreme edge," in Design, Automation and Test in Europe Conference (DATE), 2026.
- [14] H. Ren, D. Anicic, and T. A. Runkler, "TinyOL: TinyML with online-learning on microcontrollers," in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
- [15] C. Profentzas, M. Almgren, and O. Landsiedel, "MiniLearn: On-device learning for low-power IoT devices," in EWSN, 2022, pp. 1–11.
- [16] J. Lin, L. Zhu, W.-M. Chen, W.-C. Wang, C. Gan, and S. Han, "On-device training under 256KB memory," Advances in Neural Information Processing Systems, vol. 35, pp. 22941–22954, 2022.
- [17] L. Wulfert, J. Kühnel, L. Krupp, J. Viga, C. Wiede, P. Gembaczka, and A. Grabmaier, "AIfES: A next-generation edge AI framework," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4519–4533, 2024.
- [18] Y. Wu and K. He, "Group normalization," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.