
arxiv: 2606.02958 · v1 · pith:FCDJBEIX · submitted 2026-06-01 · cs.CR · cs.AI

Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries

Pith reviewed 2026-06-28 13:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: secure aggregation, federated learning, privacy boundaries, language model adaptation, LoRA, auditable training, cross-organization ML, WAN latency

The pith

Echelon enables language-model adaptation across privacy boundaries by exchanging only aggregated boundary deltas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Echelon, a training architecture that treats non-export of device-level model state as a systems invariant. Devices train locally inside each boundary; the only payloads that cross boundaries are securely aggregated boundary-level deltas plus minimal coordination metadata. This aggregate-only rule forces the optimizer to cope with WAN delay, heterogeneous participation, churn, and non-IID data without ever seeing individual updates. The design combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer controller. On 1B-parameter LoRA adaptation across two boundaries, the system reaches validation loss 3.887 ± 0.010 over three seeds and is best or tied-best among tuned low-communication baselines under matched token, byte, wall-clock, and sync-count budgets.
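The proximal local objective is named here but not written out. A common form, borrowed from FedProx and assumed for illustration rather than taken from the paper, adds a quadratic penalty that pulls each device's weights toward the last synchronized model. The symbols $F_i$ (local loss on device $i$), $w^{t}$ (last synchronized weights), and $\mu$ (penalty strength) are illustrative choices:

```latex
\min_{w}\; h_i(w;\, w^{t}) \;=\; F_i(w) \;+\; \frac{\mu}{2}\,\lVert w - w^{t} \rVert_2^{2}
```

Under this assumed form, a larger $\mu$ keeps local steps closer to $w^{t}$, which limits drift between syncs at the cost of slower local progress.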

Core claim

By restricting all cross-boundary communication to aggregates of boundary-level deltas, Echelon maintains optimization stability under WAN delay, heterogeneous participation, churn, and non-IID data distributions without ever exposing per-device updates to the global plane.
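To make the aggregate-only rule concrete, the toy below shows one well-known way an aggregator can learn a sum without seeing any addend: pairwise additive masks that cancel in the total, in the spirit of classic secure-aggregation protocols. It is a sketch under loud assumptions, not Echelon's protocol: the integer quantization, `MODULUS`, the single shared `random.Random` standing in for pairwise key agreement, and all function names are hypothetical, and dropout handling is omitted.

```python
import random

MODULUS = 2**32  # assumption: updates are quantized to integers mod 2^32


def pairwise_masks(device_ids, dim, seed=0):
    """Build per-device masks that sum to zero (mod MODULUS) across devices."""
    masks = {d: [0] * dim for d in device_ids}
    rng = random.Random(seed)  # stand-in for pairwise-agreed PRG seeds
    for i, a in enumerate(device_ids):
        for b in device_ids[i + 1:]:
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[a][k] = (masks[a][k] + shared[k]) % MODULUS  # a adds
                masks[b][k] = (masks[b][k] - shared[k]) % MODULUS  # b subtracts
    return masks


def mask_update(update, mask):
    """What a device would send: its update plus its mask, mod MODULUS."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Sum masked vectors; the masks cancel, leaving only the true sum."""
    dim = len(masked_updates[0])
    return [sum(v[k] for v in masked_updates) % MODULUS for k in range(dim)]


updates = {"dev0": [1, 2, 3], "dev1": [10, 20, 30], "dev2": [100, 200, 300]}
masks = pairwise_masks(list(updates), dim=3)
masked = [mask_update(updates[d], masks[d]) for d in updates]
boundary_sum = aggregate(masked)
print(boundary_sum)  # [111, 222, 333]
```

Each masked vector looks random on its own, yet the sum matches the unmasked total, which is the property the aggregate-only surface relies on.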

What carries the argument

Buffered semi-asynchronous secure aggregation combined with staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller.
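A minimal sketch of how buffering and staleness-aware weighting might fit together inside one boundary. The polynomial decay `(1 + staleness) ** -alpha`, the quorum rule, and every name below are assumptions made for illustration; the summary does not give the paper's actual weighting function or buffer policy.

```python
from dataclasses import dataclass


@dataclass
class BufferedDelta:
    delta: list[float]  # clipped device delta, still inside the boundary
    base_round: int     # outer round the device started training from


def staleness_weight(staleness: int, alpha: float = 0.5) -> float:
    """Assumed polynomial decay: older deltas count for less."""
    return (1.0 + staleness) ** (-alpha)


def flush_buffer(buffer, current_round, min_quorum=2):
    """Return the staleness-weighted mean of buffered deltas.

    Returns None when fewer than `min_quorum` deltas are buffered, so that
    no aggregate over a too-small cohort is ever produced.
    """
    if len(buffer) < min_quorum:
        return None
    weights = [staleness_weight(current_round - b.base_round) for b in buffer]
    total = sum(weights)
    dim = len(buffer[0].delta)
    return [
        sum(w * b.delta[k] for w, b in zip(weights, buffer)) / total
        for k in range(dim)
    ]


buffer = [
    BufferedDelta(delta=[1.0, 1.0], base_round=10),  # fresh: staleness 0
    BufferedDelta(delta=[3.0, 3.0], base_round=7),   # stale: staleness 3
]
merged = flush_buffer(buffer, current_round=10)
print(merged)
```

With `alpha = 0.5` the fresh delta gets weight 1 and the stale one weight 0.5, so the result sits closer to the fresh delta than a plain mean would. In a real secure-aggregation pipeline the weights would have to be applied on-device before masking, since the aggregator never sees individual deltas.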

If this is right

  • The approach supplies a concrete, auditable surface consisting solely of boundary aggregates.
  • It sustains 2,139-2,176 tokens/s in OpenWebText stress tests across the evaluated WAN and non-IID treatments.
  • Quality degrades by at most 2.2% under 200 ms emulated latency or severe non-IID partitioning.
  • Echelon-DA improves time-to-target relative to a privacy-parity DiLoCo plus secure-aggregation baseline under WAN latency.
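The drift-aware controller behind Echelon-DA is described only at a high level on this page; the Figure 2 caption says the most-drifting boundary drives outer cadence. One hypothetical way to read that is sketched below: measure drift as the L2 norm of each boundary's aggregate delta, then halve or double the number of local steps between outer syncs. The drift measure, the thresholds, and the halve/double rule are all assumptions, not the paper's algorithm.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def next_sync_interval(boundary_deltas, current_interval,
                       drift_threshold=1.0, min_interval=8, max_interval=512):
    """Pick the number of local steps before the next outer sync.

    `boundary_deltas` maps a boundary name to its aggregate delta; only
    aggregates are inspected, never per-device values.
    """
    worst_drift = max(l2_norm(d) for d in boundary_deltas.values())
    if worst_drift > drift_threshold:
        proposed = current_interval // 2  # sync sooner when a boundary drifts
    else:
        proposed = current_interval * 2   # relax cadence when all are calm
    return max(min_interval, min(max_interval, proposed))


drifting = {"boundary_a": [0.1, 0.1], "boundary_b": [3.0, 4.0]}
calm = {"boundary_a": [0.1, 0.1], "boundary_b": [0.2, 0.0]}
print(next_sync_interval(drifting, current_interval=64))  # 32
print(next_sync_interval(calm, current_interval=64))      # 128
```

The point of the sketch is that the cadence decision needs nothing beyond boundary-level signals, which is consistent with the aggregate-only constraint.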

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregate-only controller could be tested on three or more boundaries to check whether drift correction continues to scale.
  • The stability mechanisms may transfer to other distributed training settings that face similar export restrictions.
  • Running the identical token budget on a larger base model would show whether the reported loss and throughput figures generalize beyond the 1B LoRA case.

Load-bearing premise

That the global optimizer can still converge when it receives only boundary aggregates rather than individual device updates, even under network delays and non-identical data.
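For plain synchronous averaging with full participation, the premise holds exactly: a size-weighted mean of boundary means equals the mean over all devices, so hiding individual updates loses nothing. The notation below ($\delta_i$ for device deltas, $n_b$ devices in boundary $b$, $N = \sum_b n_b$) is introduced here for illustration and is not the paper's:

```latex
\frac{1}{N}\sum_{b=1}^{M}\sum_{i \in b}\delta_i
  \;=\; \sum_{b=1}^{M}\frac{n_b}{N}\,\bar{\delta}_b,
\qquad
\bar{\delta}_b \;=\; \frac{1}{n_b}\sum_{i \in b}\delta_i .
```

The premise becomes a real assumption once staleness, partial participation, and multiple local steps break this equality, which is the regime the paper targets.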

What would settle it

A head-to-head run in which a baseline allowed to exchange per-device updates under identical token and byte budgets reaches materially lower validation loss or faster convergence than Echelon.

Figures

Figures reproduced from arXiv: 2606.02958 by Hina Dixit, Irene Tenison, Nevasini Sasikumar, Punit Kumar.

Figure 1: Boundary-first architecture and audit surface. Devices optimize and communicate only within their boundary. The global plane sees aggregate-only boundary deltas plus 𝑂(1) metadata. The dashed warning path shows flows forbidden by construction.
Figure 2: Execution flow in Echelon. Local device steps produce clipped deltas. Boundaries buffer and securely aggregate semi-asynchronously with staleness-aware weighting and participation windows. The global plane only observes boundary-level drift signals and aggregate deltas; the most-drifting boundary drives outer cadence.
Figure 3: Regime WR (OpenWebText): latency sensitivity under emulated WAN delay. We plot time-to-target perplexity versus median WAN latency for Echelon-DA and DiLoCo+SA; the gap widens at 100 ms.
Figure 4: Regime WR (OpenWebText): metric scorecard summarizing time-to-target and validation perplexity across baselines under the workload-realism setup. Methods are sorted by time-to-target; shorter bars are better.
Figure 5: Quality gap versus normalized 𝐵eff. Validation-loss delta (relative to the 𝑀 = 1 baseline of 3.27; negative is better) is plotted against 𝐵̃eff = 𝐵eff/10¹⁰ for the stressor sweep spanning 𝑀 ∈ {2, 3}, variable latency, and non-IID severity.
Figure 6: Communication volume versus normalized 𝐵eff. Cross-boundary bytes required to reach 𝐿val ≤ 3.52 (1.10x baseline) are plotted against 𝐵̃eff = 𝐵eff/10¹⁰. Censored runs are indicated with muted amber crosses.
Figure 7: Privacy red-team leakage markers under the tested buffer and quorum settings. When the buffer covers the whole boundary, the tested difference-of-sums and small-cohort reconstruction attacks become algebraically undefined because the required distinct aggregate equations are absent.
Figure 8: Operational audit trace under the tested cross-region conditions. The schema check confirms that per_device_payload remains exactly zero bytes, while aggregate and control metadata remain visible to the audit surface.
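The difference-of-sums attack named in the Figure 7 caption is easy to see on a toy example. The sketch below is not the paper's red-team harness; the device names, values, and the `aggregate` helper are hypothetical. It assumes a coordinator obtains two aggregates over the same deltas whose cohorts differ by exactly one device.

```python
def aggregate(deltas, cohort):
    """Sum the deltas of the devices in `cohort` (all the coordinator sees)."""
    dim = len(next(iter(deltas.values())))
    return [sum(deltas[d][k] for d in cohort) for k in range(dim)]


deltas = {"dev0": [1.0, -2.0], "dev1": [0.5, 0.5], "dev2": [4.0, 3.0]}

# Two aggregates whose cohorts differ by exactly one device (dev2).
sum_with = aggregate(deltas, ["dev0", "dev1", "dev2"])
sum_without = aggregate(deltas, ["dev0", "dev1"])

# Subtracting the two sums isolates the one device that differs.
leaked = [a - b for a, b in zip(sum_with, sum_without)]
print(leaked)  # [4.0, 3.0], which is dev2's private delta
```

If every aggregate covers the whole boundary, as in the full-buffer setting the caption describes, the second equation never exists and the subtraction has nothing to work with.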
Abstract

Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state (parameters, activations, optimizer state, and per-device updates) cannot be exported outside an administrative boundary. Existing distributed and federated stacks typically assume cross-site model exchange and then retrofit privacy mechanisms, which complicates compliance and makes auditing brittle. We present Echelon, a boundary-first training architecture that enforces device-level model-state non-export as a systems invariant. Devices train locally inside each boundary; the only cross-boundary payloads are securely aggregated boundary-level deltas plus O(1) coordination metadata, exposed through a concrete audit surface. Restricting exchange to aggregates changes the optimization problem: the system must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data even though the global plane never sees per-device updates. Echelon combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller. In 1B-parameter LoRA adaptation across M = 2 boundaries, a budget-matched contest over three seeds (24.88M tokens) reaches validation loss 3.887 ± 0.010 and is best or tied-best among tuned low-communication baselines under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets. In OpenWebText stress tests, Echelon sustains 2,139-2,176 tokens/s across evaluated WAN and non-IID treatments, Echelon-DA improves time-to-target under WAN latency relative to a privacy-parity DiLoCo+SA baseline, and quality degrades by at most 2.2% under 200 ms emulated latency or severe non-IID partitioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Echelon, a boundary-first architecture for cross-organization language-model adaptation that enforces non-export of per-device model state, optimizer state, and updates as an invariant. Only securely aggregated boundary-level deltas and O(1) metadata cross boundaries; the design combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal objectives, and a drift-aware controller. Empirical results for 1B-parameter LoRA adaptation across M = 2 boundaries show a validation loss of 3.887 ± 0.010 over three seeds (24.88M tokens) that is best or tied-best under four fixed budgets, with throughput of 2,139-2,176 tokens/s and at most 2.2% quality degradation under emulated WAN latency or severe non-IID partitioning.

Significance. If the stability claims hold, the work supplies a concrete systems approach to auditable aggregate-only adaptation under governance constraints that existing federated stacks do not satisfy by construction. The budget-matched contest and reported throughput numbers under WAN and non-IID treatments provide reproducible evidence of competitiveness against low-communication baselines.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'restricting exchange to aggregates changes the optimization problem' such that the system 'must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data' is load-bearing for extrapolating the 3.887 loss and budget-competitiveness results to realistic multi-boundary deployments, yet the reported experiments supply metrics and ablations only for emulated WAN latency and severe non-IID partitioning (max 2.2% drop) with no corresponding results, stress-test descriptions, or participation-rate ablations for device churn or heterogeneous participation.
  2. [Experiments] Experiments section (budget-matched contest): the claim that Echelon is 'best or tied-best among tuned low-communication baselines' under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets rests on the 3.887 +/-0.010 result, but the manuscript provides no details on baseline hyperparameter tuning methodology, data partitioning procedure, or statistical significance testing, which prevents independent verification that the reported ranking is robust rather than an artifact of untuned comparators.
minor comments (2)
  1. [Abstract] The abstract and introduction use the phrase 'O(1) coordination metadata' without specifying the exact metadata fields or their bit-width; a concrete enumeration would improve audit-surface clarity.
  2. [Figures] Figure captions for the throughput and degradation plots should explicitly state the number of independent runs and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. Below we respond point-by-point to the major comments, proposing targeted revisions for clarity and reproducibility while preserving the manuscript's core contributions.

  1. Referee: [Abstract] Abstract: the central claim that 'restricting exchange to aggregates changes the optimization problem' such that the system 'must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data' is load-bearing for extrapolating the 3.887 loss and budget-competitiveness results to realistic multi-boundary deployments, yet the reported experiments supply metrics and ablations only for emulated WAN latency and severe non-IID partitioning (max 2.2% drop) with no corresponding results, stress-test descriptions, or participation-rate ablations for device churn or heterogeneous participation.

    Authors: We agree the abstract's stability claim is broader than the reported experiments. The OpenWebText stress tests evaluate emulated WAN latency and severe non-IID partitioning (at most 2.2% degradation), while the mechanisms (buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal objectives, drift-aware controller) are explicitly designed to address heterogeneous participation and churn. We will revise the abstract to more precisely scope the empirical claims to the evaluated conditions and add a short discussion paragraph explaining how the design invariants target the untested factors without claiming new experimental coverage. revision: yes

  2. Referee: [Experiments] Experiments section (budget-matched contest): the claim that Echelon is 'best or tied-best among tuned low-communication baselines' under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets rests on the 3.887 +/-0.010 result, but the manuscript provides no details on baseline hyperparameter tuning methodology, data partitioning procedure, or statistical significance testing, which prevents independent verification that the reported ranking is robust rather than an artifact of untuned comparators.

    Authors: We accept that these methodological details are required for verification. The +/-0.010 reflects standard deviation across three random seeds. We will add an appendix subsection that specifies: (i) the hyperparameter ranges and search procedure applied to each baseline, (ii) the exact procedure used to induce non-IID partitioning across the two boundaries, and (iii) confirmation that no formal statistical significance tests beyond seed-wise mean and deviation were performed. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical systems architecture with no derivations or fitted predictions


The manuscript describes a boundary-first training architecture and reports empirical results from 1B-parameter LoRA experiments under fixed budgets. No equations, parameter-fitting procedures, uniqueness theorems, or ansatzes are presented that could reduce a claimed prediction or result to its own inputs by construction. The stability requirement under WAN delay, churn, and non-IID data is stated as a changed optimization problem but is not derived mathematically; it is addressed through described mechanisms whose effectiveness is evaluated experimentally. All load-bearing claims rest on measured validation loss and throughput numbers rather than self-referential definitions or self-citations that close a loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input provides no explicit free parameters, axioms, or invented entities; full manuscript would be required to identify any fitted weights, stability assumptions, or new entities.

pith-pipeline@v0.9.1-grok · 5852 in / 1129 out tokens · 23915 ms · 2026-06-28T13:36:48.462114+00:00 · methodology

