Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds

Disumi Pathirana; Kutila Gunasekera; Lakshani Manamperi; Nipun Premarathna; Thiwanka Pathirana

arxiv: 2605.20723 · v1 · pith:ALV2ILAQnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 07:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords DNN partitioning · mobile inference · Android crowds · memory efficiency · ONNX runtime · pipeline scheduling · edge AI · streaming dependencies

The pith

Partitioned scheduling runs large DNNs like DistilBERT on Android phones with under 45 MB RAM per device.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pipeline scheduling subsystem that distributes a deep neural network across multiple memory-limited Android handsets so that each device loads and runs only one model segment at a time. Five mechanisms work together: just-in-time deferred loading of partitions, a strict single-partition-resident rule, a four-tier affinity scheduler, compressed tensor transport, and a streaming model of one-to-one dependencies between segments. On a 67-million-parameter DistilBERT model the system keeps peak memory at 43 MB and battery draw at 50 mAh per run while cutting batch time by 34 percent compared with waiting for all devices to finish together. A sympathetic reader cares because this approach removes the need to shrink or rewrite large models before they can run on ordinary phones.

Core claim

The DNN pipeline scheduling subsystem of CROWDio achieves practical ONNX inference across resource-constrained Android workers without model modification by distributing memory pressure through JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT across five handsets over ten runs, the system holds peak per-device RSS to 43 ± 2 MB and limits battery draw to 50 ± 3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.

What carries the argument

The streaming 1:1 dependency model with single-partition-resident constraint, which lets each device hold only one executable model segment in memory while passing compressed tensors onward as soon as the segment finishes.
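The residency rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `SingleResidentWorker`, the `loaders` mapping, and the toy scaling functions that stand in for ONNX sessions are all hypothetical. The paper assigns each stage to a distinct worker; the sketch reuses one worker for two stages only to make the eviction step visible.

```python
from typing import Callable, Dict, List, Optional

Partition = Callable[[List[float]], List[float]]


class SingleResidentWorker:
    """Hypothetical worker that keeps at most one partition loaded."""

    def __init__(self, loaders: Dict[str, Callable[[], Partition]]):
        # Each loader is a zero-argument function that materialises its
        # partition only when called (deferred, just-in-time loading).
        self._loaders = loaders
        self._resident_id: Optional[str] = None
        self._resident_fn: Optional[Partition] = None
        self.load_count = 0

    @property
    def resident(self) -> Optional[str]:
        return self._resident_id

    def _ensure_resident(self, partition_id: str) -> None:
        if self._resident_id == partition_id:
            return  # already loaded, no reload needed
        # Drop the previous partition before loading the next one, so the
        # worker never references two partitions at the same time.
        self._resident_fn = None
        self._resident_id = None
        self._resident_fn = self._loaders[partition_id]()
        self._resident_id = partition_id
        self.load_count += 1

    def run(self, partition_id: str, tensor: List[float]) -> List[float]:
        self._ensure_resident(partition_id)
        assert self._resident_fn is not None
        return self._resident_fn(tensor)


def make_scale(factor: float) -> Callable[[], Partition]:
    # Stand-in for loading a model segment: returns a loader for a toy function.
    return lambda: (lambda xs: [x * factor for x in xs])


worker = SingleResidentWorker({"stage0": make_scale(2.0), "stage1": make_scale(0.5)})
out0 = worker.run("stage0", [1.0, 2.0])
out1 = worker.run("stage1", out0)
```

Releasing the Python reference only makes the old partition collectable; whether resident memory actually falls depends on the runtime and allocator, which is the point the referee report presses on.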

If this is right

Transformer-based models become executable on commodity Android handsets without pruning, quantization, or other modifications.
Inference workloads can be crowdsourced across groups of phones while keeping per-device memory and battery costs low.
Batch latency improves when devices stream partial results instead of synchronizing at every layer.
ONNX models can be deployed directly on edge crowds with the five listed mechanisms handling memory distribution.
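To see why streaming can shorten batch time relative to barrier synchronisation, the following toy model compares the two schedules. It is a sketch under simplifying assumptions that are not from the paper: each stage runs on its own worker, each worker processes one input at a time, transfer time is ignored, and the stage costs are arbitrary units chosen for illustration. It does not attempt to reproduce the reported 34 percent figure.

```python
from typing import List


def barrier_makespan(stage_costs: List[float], n_inputs: int) -> float:
    # Barrier: a stage must finish every input before the next stage starts,
    # so the stages run strictly one after another.
    return sum(cost * n_inputs for cost in stage_costs)


def streaming_makespan(stage_costs: List[float], n_inputs: int) -> float:
    # Streaming: input i can enter stage s as soon as it has left stage s-1
    # and the worker for stage s has finished input i-1.
    finish = [0.0] * len(stage_costs)  # latest completion time per stage
    for _ in range(n_inputs):
        upstream_done = 0.0
        for s, cost in enumerate(stage_costs):
            start = max(upstream_done, finish[s])
            finish[s] = start + cost
            upstream_done = finish[s]
    return finish[-1]


costs = [1.0, 1.0, 1.0]  # three stages, arbitrary time units
barrier_time = barrier_makespan(costs, n_inputs=5)
streaming_time = streaming_makespan(costs, n_inputs=5)
```

With equal stage costs the streaming schedule behaves like a classic pipeline, finishing in N + S - 1 time units instead of N × S. Real gains would be smaller once network transfer and uneven stage costs are included.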

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning and streaming logic could be tested on other mobile operating systems that support deferred model loading.
Real-world crowds would need to tolerate variable network delays that might reduce the reported 34 percent latency gain.
Partition boundaries chosen for DistilBERT may need re-examination for models with denser inter-layer dependencies.

Load-bearing premise

The model can be split into segments whose dependencies allow streaming tensor transport across devices without accuracy loss or high communication cost, and the Android runtime must support the described deferred loading and memory constraint on ordinary handsets.

What would settle it

Re-running the DistilBERT experiment on the same five Android handsets and finding either peak RSS above 50 MB per device or no measurable latency reduction when switching from barrier synchronisation to streaming concurrency would disprove the performance claims.

Figures

Figures reproduced from arXiv: 2605.20723 by Disumi Pathirana, Kutila Gunasekera, Lakshani Manamperi, Nipun Premarathna, Thiwanka Pathirana.

**Figure 1.** CROWDio three-layer architecture. The SDK submits jobs; the Foreman orchestrates scheduling and failure recovery; Workers execute inference under the single-residency constraint.

Excerpt from §3.2 (Pipeline Model and Memory Budget): A DNN is split into S ordered stages, each an independent ONNX artefact assigned to a distinct worker. Our reference workload is DistilBERT (Sanh et al., 2019) for SST-2 sentiment analysis (≈…

**Figure 2.** Task graph for the 3-stage pipeline (streaming, N=5). Stage 0 tasks t0–t4 are initially pending; each Stage 1/2 task carries a per-input 1:1 dependency, yielding maximal concurrency.
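The task graph in Figure 2 can be approximated with a small dependency map. This is a hedged sketch: the `(stage, input)` tuple encoding and the function names `build_dependencies` and `ready_tasks` are illustrative choices, not identifiers from CROWDio. The barrier variant is included only for contrast.

```python
from typing import Dict, List, Set, Tuple

Task = Tuple[int, int]  # (stage index, input index)


def build_dependencies(n_stages: int, n_inputs: int,
                       streaming: bool) -> Dict[Task, List[Task]]:
    deps: Dict[Task, List[Task]] = {}
    for s in range(n_stages):
        for i in range(n_inputs):
            if s == 0:
                deps[(s, i)] = []  # Stage 0 tasks have no prerequisites
            elif streaming:
                deps[(s, i)] = [(s - 1, i)]  # 1:1, same input upstream
            else:
                # Barrier: wait for every input of the previous stage.
                deps[(s, i)] = [(s - 1, j) for j in range(n_inputs)]
    return deps


def ready_tasks(deps: Dict[Task, List[Task]], done: Set[Task]) -> List[Task]:
    # A task is ready when it has not run yet and all its prerequisites have.
    return sorted(t for t, pre in deps.items()
                  if t not in done and all(p in done for p in pre))


stream_deps = build_dependencies(n_stages=3, n_inputs=5, streaming=True)
barrier_deps = build_dependencies(n_stages=3, n_inputs=5, streaming=False)
done = {(0, 0)}
stream_ready = ready_tasks(stream_deps, done)
barrier_ready = ready_tasks(barrier_deps, done)
```

After the first Stage 0 task completes, the streaming graph already exposes a Stage 1 task, while the barrier graph still offers only Stage 0 work.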

**Figure 3.** Partition loading sequence. Cell a is eagerly broadcast; cell b/cell c are JIT-loaded only after the first upstream task completes.

Excerpt from the adjacent text: … Wi-Fi run the DistilBERT SST-2 pipeline (N=5, S=3); all figures are per-device mean ± std over ten independent runs. We compare CROWDio Streaming (Section 3.3) against CROWDio Barrier (…

Original abstract

Deploying large deep neural networks on memory-constrained mobile devices is a central challenge in edge ML. While compression, pruning, and quantization reduce per-parameter cost, transformer-based models remain too large for the 3.3-7.4 GB RAM envelope of commodity Android handsets. We present the DNN pipeline scheduling subsystem of CROWDio, which achieves practical ONNX inference across resource-constrained Android workers without model modification, by distributing memory pressure across devices via five mechanisms: JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, a zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT (Sanh et al., 2019) (approximately 67 M parameters, SST-2) across five Android handsets over ten runs, our system holds peak per-device RSS to 43 ± 2 MB and limits battery draw to 50 ± 3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.
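The zlib-compressed tensor transport named in the abstract could look roughly like the following round trip. The wire format here (a rank-and-shape header followed by a compressed float32 payload) and the function names `pack_tensor` and `unpack_tensor` are assumptions made for illustration; the abstract does not describe the paper's actual framing. The sketch also assumes both devices share the same byte order.

```python
import struct
import zlib
from array import array
from typing import List, Tuple


def pack_tensor(values: List[float], shape: Tuple[int, ...],
                level: int = 6) -> bytes:
    # Header: rank, then each dimension, as little-endian unsigned 32-bit ints.
    header = struct.pack(f"<I{len(shape)}I", len(shape), *shape)
    # Payload: float32 values in native byte order, compressed with zlib.
    payload = array("f", values).tobytes()
    return header + zlib.compress(payload, level)


def unpack_tensor(blob: bytes) -> Tuple[List[float], Tuple[int, ...]]:
    (rank,) = struct.unpack_from("<I", blob, 0)
    shape = struct.unpack_from(f"<{rank}I", blob, 4)
    payload = zlib.decompress(blob[4 + 4 * rank:])
    values = array("f")
    values.frombytes(payload)
    return values.tolist(), tuple(shape)


# Toy tensor with values that float32 represents exactly.
original = [0.5, -1.25, 3.0, 0.0] * 64
blob = pack_tensor(original, (4, 64))
restored, restored_shape = unpack_tensor(blob)
```

The repetitive toy data compresses far better than real activations would; measured compression ratios on actual intermediate tensors are the kind of evidence the referee asks for.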

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers usable memory and power numbers for distributed DistilBERT on Android but needs to prove the runtime can enforce single-partition residency.

The authors have built a system that runs DistilBERT across multiple Android phones with very low per-device memory and battery use by splitting the model into segments and streaming tensors between them. This could matter for edge deployment in places with limited hardware. The work combines five existing ideas (deferred partition loading, single-partition residency, a tiered scheduler, compressed tensor shipping, and streaming over 1:1 links) into one Android-focused pipeline. It does well by testing on actual handsets rather than simulators, reporting RSS and battery figures from ten runs, and showing that streaming concurrency beats barrier synchronization by 34 percent on batch latency. Credit for getting real measurements on five different devices.

The main concern is whether the single-partition-resident constraint and JIT deferred loading actually work on stock Android ONNX runtimes without extra custom code. Most inference engines build the full graph upfront, so holding only one partition in memory at a time might require modifications that aren't standard. If transport buffers or multiple segments end up resident together, the 43 MB bound won't hold. The abstract also leaves out accuracy checks after partitioning, clear baseline comparisons, and any handling of variable network conditions, which makes it hard to gauge how general the gains are.

This is for mobile systems researchers who care about practical edge deployment of transformers in low-resource settings. A reader working on distributed inference or Android ML tooling would find the scheduling details and the empirical setup worth reading. The paper has enough implementation and measurement substance to go to a serious referee, even if revisions will be needed on the runtime claims. I recommend putting it through peer review.

Referee Report

3 major / 1 minor

Summary. The manuscript presents the DNN pipeline scheduling subsystem of CROWDio for memory-efficient partitioned ONNX inference of large models such as DistilBERT on resource-constrained Android devices. It relies on five mechanisms—JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, zlib-compressed tensor transport, and a streaming 1:1 dependency model—to distribute memory pressure across crowdsourced handsets without model modification. Evaluation across five Android devices over ten runs reports peak per-device RSS of 43±2 MB, battery draw limited to 50±3 mAh, and 34% lower batch latency under streaming concurrency versus barrier synchronization.

Significance. If the runtime mechanisms prove feasible on commodity Android and the empirical results hold under full scrutiny, the work would meaningfully advance edge deployment of transformer-scale models by leveraging device crowds rather than single high-memory devices. The use of real handset measurements and emphasis on unmodified ONNX models are notable strengths; reproducible code or machine-checked elements are not mentioned.

major comments (3)

[Abstract and §5 (Evaluation)] The headline claims of 43±2 MB RSS and 50±3 mAh battery draw rest on the single-partition-resident constraint and JIT deferred loading, yet no evidence or implementation details are supplied showing that stock Android ONNX Runtime (or equivalent) supports per-partition deferred loading without materializing the full graph or multiple partitions simultaneously.
[§5 (Evaluation)] No baselines, statistical tests, post-partitioning accuracy measurements, or controls for network variability are reported despite the concrete performance numbers from ten runs; this leaves the support for the central empirical claims unclear.
[§3 (Design)] The assumption that the DNN can be partitioned into executable segments with 1:1 dependencies permitting streaming tensor transport without accuracy loss or prohibitive overhead is load-bearing but receives no quantitative validation in the reported experiments.

minor comments (1)

[Abstract] Error notation should be standardized (e.g., 43 ± 2 MB rather than 43+-2 MB) for consistency with field conventions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions that will be incorporated to strengthen the manuscript.

Referee: [Abstract and §5 (Evaluation)] The headline claims of 43±2 MB RSS and 50±3 mAh battery draw rest on the single-partition-resident constraint and JIT deferred loading, yet no evidence or implementation details are supplied showing that stock Android ONNX Runtime (or equivalent) supports per-partition deferred loading without materializing the full graph or multiple partitions simultaneously.

Authors: We agree that the current manuscript would benefit from explicit implementation details on the JIT deferred loading mechanism. In the revised version we will expand §3 with a dedicated subsection describing how the Android ONNX Runtime Java bindings are used to load and execute individual partitions on demand. The approach instantiates only the active partition via subgraph loading APIs, ensuring the full model graph is never materialized in memory; this is feasible because ONNX Runtime permits independent execution of subgraphs without requiring the complete model to reside simultaneously. We will include relevant API references and a brief code sketch to substantiate the single-partition-resident constraint.
Revision: yes

Referee: [§5 (Evaluation)] No baselines, statistical tests, post-partitioning accuracy measurements, or controls for network variability are reported despite the concrete performance numbers from ten runs; this leaves the support for the central empirical claims unclear.

Authors: The evaluation section indeed omits several standard controls that would improve clarity. We will revise §5 to add: (i) a single-device baseline comparison on the highest-memory handset in the test set, (ii) statistical significance testing (e.g., paired t-tests) on the reported 34% latency reduction across the ten runs, (iii) post-partitioning accuracy figures on SST-2 to confirm equivalence with the unmodified model, and (iv) explicit description of the controlled Wi-Fi environment used to bound network variability. These additions will be presented without altering the core empirical results already obtained.
Revision: yes

Referee: [§3 (Design)] The assumption that the DNN can be partitioned into executable segments with 1:1 dependencies permitting streaming tensor transport without accuracy loss or prohibitive overhead is load-bearing but receives no quantitative validation in the reported experiments.

Authors: The 1:1 streaming dependency model follows directly from the layer-wise structure of DistilBERT, yet we acknowledge the need for explicit validation. In the revision we will augment §5 with two new measurements: (a) end-to-end accuracy on SST-2 after partitioned inference to demonstrate no degradation relative to the original model, and (b) per-tensor transport overhead (compressed size and latency) under the streaming schedule. These data will be obtained from the same ten-run experimental protocol already described.
Revision: yes
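The paired t-test proposed in the second response can be computed with the standard library alone. The sketch below is illustrative: the function name `paired_t_statistic` is hypothetical, and the latency values are placeholders chosen for easy arithmetic, not measurements from the paper. It returns only the t statistic; converting it to a p-value needs a t distribution with n - 1 degrees of freedom, which is left out here.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on per-run differences, since each run yields one pair.
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Placeholder latencies in seconds, NOT data from the paper.
barrier = [10.0, 12.0, 11.0, 13.0]
streaming = [7.0, 8.0, 8.0, 9.0]
t_value = paired_t_statistic(barrier, streaming)
```

Pairing matters because the barrier and streaming runs share devices and network conditions; testing the per-run differences removes that shared variation from the comparison.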

Circularity Check

0 steps flagged

No circularity; empirical measurements stand alone

The paper reports a systems implementation evaluated via direct runtime measurements (peak RSS 43 ± 2 MB, battery 50 ± 3 mAh, 34% latency reduction) on commodity Android handsets running DistilBERT. The five mechanisms (JIT deferred loading, single-partition-resident constraint, 4-tier scheduler, zlib transport, streaming 1:1 model) are presented as design choices whose effects are quantified experimentally. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claims do not reduce to their own inputs by construction.
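One common way to obtain a peak-RSS figure on a Linux-based system such as Android is to read the `VmHWM` field from `/proc/<pid>/status`. The sketch below parses that field from status text. Whether the paper measured RSS this way is not stated in the abstract, so this is an assumption; the function name `peak_rss_mb` and the sample text are made up for illustration.

```python
import re
from typing import Optional


def peak_rss_mb(status_text: str) -> Optional[float]:
    # VmHWM ("high water mark") is the peak resident set size, reported in kB.
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) / 1024.0


# Made-up sample in the format of /proc/<pid>/status, not a real measurement.
sample = "Name:\tworker\nVmHWM:\t    2048 kB\nVmRSS:\t    1500 kB\n"
peak = peak_rss_mb(sample)
```

On a device, the same function could be applied to the contents of `/proc/self/status` read from within the worker process.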

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about mobile OS capabilities and distributed execution rather than new axioms or invented entities; no free parameters are explicitly fitted in the abstract.

axioms (2)

Domain assumption: The target DNN can be partitioned into segments with simple 1:1 dependencies that support streaming execution without accuracy degradation.
Implicit in the description of the streaming 1:1 dependency model and single-partition-resident constraint.

Domain assumption: Commodity Android handsets provide sufficient runtime support for JIT deferred partition loading and zlib-compressed tensor transport.
Required for the system to function on the described 3.3-7.4 GB RAM devices.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: JIT deferred partition loading, single-partition-resident constraint, streaming 1:1 dependency model

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: 4-tier affinity scheduler with residency tiers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

[1] Chen, L. et al. Melon: Breaking the memory wall for resource-efficient on-device machine learning. In Proceedings of the 20th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys 2022), 2022.

[2] Dou, A. et al. Misco: A MapReduce framework for mobile systems. In Proceedings of the 3rd International Conference on Pervasive Technologies Related to Assistive Environments (PETRA 2010), 2010.

[3] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR 2016), 2016.

[4] Huang, Y. et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.

[5] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 2704–2713, 2018.

[6] Kang, Y. et al. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In Proceedings of the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017), pp. 615–629, 2017.

[7] Laskaridis, S. et al. SPINN: Synergistic progressive inference of neural networks over device and cloud. In Proceedings of the 26th ACM Annual International Conference on Mobile Computing and Networking (MobiCom 2020), pp. 1–15, 2020.

[8] Li, S. et al. PipeSwitch: Fast pipelined context switching for deep learning applications. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2020), pp. 499–514, 2020.

[9] CROWDio: A Practical Mobile Crowd Computing Framework with Developer-Oriented Design, Adaptive Scheduling, and Fault Resilience. arXiv preprint arXiv:2604.19363.
Marinelli, E. E. Hyrax: Cloud computing on mobile devices using MapReduce. Technical Report CMU-CS-09-164, Carnegie Mellon University.

[10] Moritz, P. et al. Ray: A distributed framework for emerging AI applications. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018), pp. 561–577, 2018.

[11] Narayanan, D. et al. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP 2019), 2019.

[12] Pramanik, P. K. D. and Biswas, T. Energy-efficiency analysis of different scheduling algorithms for mobile crowd computing. In Proceedings of the 15th International Conference on Computing, Communication and Networking Technologies (ICCCNT 2024). IEEE, 2024.

[13] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[14] Shah, S. et al. Edge-based compressed feature transmission for split inference. In Proceedings of the IEEE 43rd International Conference on Distributed Computing Systems (ICDCS 2023), 2023.

[15] Wei, X., Fan, J., Lu, Z., and Ding, K. Application scheduling in mobile cloud computing with load balancing. Journal of Applied Mathematics, 2013:409539, 2013.

[16] Zhao, Z. et al. EdgePipe: Pipelined deep learning inference on heterogeneous edge clusters. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM 2022), 2022.
