pith. sign in

arxiv: 2605.13346 · v1 · pith:GZOKMZK7new · submitted 2026-05-13 · 💻 cs.LG

Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning

Pith reviewed 2026-05-14 19:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords contextual banditshyperdimensional computingprobabilistic learninglow-precisionresource-constrained deviceson-device learningonline decision making
0
0 comments X

The pith

A probabilistic update rule lets hyperdimensional contextual bandits run on low-precision hardware while outperforming binarized versions and approaching full performance with 3 bits per component.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Contextual bandits solve sequential decision problems under partial feedback, but standard linear algorithms and even hyperdimensional variants demand too much memory and precision for small devices. The paper replaces the accumulation-based learning rule in HD-CB with a probabilistic version that updates only a random subset of vector components according to a time-decaying probability and keeps every component value bounded inside a fixed range. This change removes the need for periodic binarization, prevents overflow, and scales the update cost down with the fraction of components touched. Off-policy tests on standard synthetic benchmarks show the new rule beats binarized HD-CB at identical precision and nearly matches the original high-precision HD-CB once each component uses only three bits. The result matters because it opens a practical path for running adaptive decision agents directly on sensors and edge hardware that cannot afford floating-point arithmetic or large memory.

Core claim

The paper claims that replacing deterministic accumulation in HD-CB with a probabilistic update rule—selecting a random subset of components at each step, applying a time-decaying update probability, and clamping values to a fixed interval [-k, +k]—yields a low-precision algorithm that consistently outperforms its binarized counterpart at the same bit width and approaches the regret performance of full-precision HD-CB with as few as three bits per component, as measured by off-policy evaluation on standardized synthetic contextual bandit benchmarks.

What carries the argument

The probabilistic update rule that selects random vector components for update with a time-decaying probability and bounds their values to a fixed range, replacing accumulation and eliminating periodic binarization.

If this is right

  • Expected update cost drops in direct proportion to the fraction of components selected at each step.
  • Memory and energy budgets shrink because low-precision fixed-range values replace high-precision accumulation.
  • Decision quality improves over binarized HD-CB at every tested bit width on the evaluated benchmarks.
  • Deployment on strictly constrained hardware becomes feasible without sacrificing the convergence speed of the original HD-CB approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bounded probabilistic update could be tested on other hyperdimensional tasks such as sequence prediction or anomaly detection to see whether the precision savings transfer.
  • Hardware measurements of actual energy per decision on microcontrollers would quantify gains that synthetic regret curves cannot capture.
  • Adjusting the decay schedule of the update probability per environment might further reduce regret in non-stationary settings.

Load-bearing premise

The probabilistic rule with random subset selection and time-decaying probability preserves enough learning information to match or exceed binarized HD-CB without introducing bias or instability that would appear only outside synthetic benchmarks.

What would settle it

An on-device experiment in which probabilistic HD-CB produces higher cumulative regret or visible instability than binarized HD-CB at the same precision would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.13346 by Amy Loutfi, Antonello Rosato, Denis Kleyko, Kevin Johansson, Marco Angioli.

Figure 1
Figure 1. Figure 1: Cumulative rewards against the number of action [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Memory footprint (KiB, log scale) vs. context di [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Contextual bandits (CB) are online sequential decision-making problems under partial feedback that underpin many adaptive services. There is a growing demand to deploy CB agents directly on-device, under strict constraints on memory, compute, and energy. However, standard linear CB algorithms are often impractical for resource-constrained devices with their unfavorable scaling in computational and memory costs. Recently, HD-CB, a CB approach based on hyperdimensional computing principles, has been proposed to model and solve CB problems by moving into high-dimensional spaces. HD-CB offers faster convergence, favorable scalability, and improves memory efficiency compared to linear CB algorithms. However, its learning rule is accumulation-based: the values of action vectors grow over time, requiring high precision. While periodic binarization can prevent overflow in low-precision components, it may discard important information about magnitudes and degrade decision quality. This paper introduces probabilistic HD-CB, a low-precision variant that replaces deterministic accumulation with a probabilistic update rule. At each step, only a random subset of vector components is updated, with a time-decaying update probability, and component values are constrained to a predefined range [-k,+k]. This approach enables low-precision components, prevents overflow without periodic binarization, and reduces the expected update cost in proportion to the fraction of updated components. Off-policy evaluation on standardized synthetic CB benchmarks using the Open Bandit Pipeline shows that probabilistic HD-CB consistently outperforms binarized HD-CB at equal precision, while approaching the performance of HD-CB with as few as 3 bits per component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes probabilistic HD-CB as a low-precision variant of hyperdimensional computing-based contextual bandits (HD-CB). It replaces deterministic accumulation with a probabilistic update rule that selects a random subset of vector components according to a time-decaying probability, clamps component values to [-k, +k], and thereby avoids overflow and reduces expected update cost. Off-policy evaluation on synthetic benchmarks via the Open Bandit Pipeline is reported to show consistent outperformance over binarized HD-CB at equal precision and performance approaching that of full HD-CB with as few as 3 bits per component.

Significance. If the performance claims are confirmed under closed-loop conditions, the work would offer a practical route to deploying contextual bandits on memory- and energy-constrained devices, extending the applicability of HD-CB while preserving its favorable scaling properties. The reliance on standardized benchmarks is a methodological strength that supports reproducibility.

major comments (2)
  1. [Abstract and evaluation description] Abstract and evaluation description: the reported outperformance rests entirely on off-policy replay of fixed synthetic logs. Because the probabilistic subset selection and decay schedule alter action probabilities, the resulting data distribution in true online interaction may differ systematically from the logged data; no closed-loop experiments are described that would detect such bias or instability.
  2. [Method and experimental sections] Method and experimental sections: the update rule introduces free parameters k and the decay schedule whose values are not reported with sensitivity analysis or default settings; without these details it is impossible to determine whether the claimed gains at 3-bit precision are robust or depend on post-hoc tuning.
minor comments (2)
  1. The abstract states that results are 'consistent' but supplies no information on number of runs, variance, or statistical tests; adding these would strengthen the presentation.
  2. Notation for the time-decaying probability and the precise clamping operation should be given explicitly with an equation in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below with clarifications and planned revisions.

read point-by-point responses
  1. Referee: Abstract and evaluation description: the reported outperformance rests entirely on off-policy replay of fixed synthetic logs. Because the probabilistic subset selection and decay schedule alter action probabilities, the resulting data distribution in true online interaction may differ systematically from the logged data; no closed-loop experiments are described that would detect such bias or instability.

    Authors: We acknowledge the limitation of relying solely on off-policy replay via the Open Bandit Pipeline. While this is the standard methodology for reproducible contextual bandit evaluation on synthetic logs and controls for policy-induced biases in the logged data, the probabilistic updates could indeed induce distribution shifts in true closed-loop interaction. We will revise the evaluation description and add an explicit limitations paragraph noting this point and recommending closed-loop validation as future work. No new experiments are added at this stage. revision: partial

  2. Referee: Method and experimental sections: the update rule introduces free parameters k and the decay schedule whose values are not reported with sensitivity analysis or default settings; without these details it is impossible to determine whether the claimed gains at 3-bit precision are robust or depend on post-hoc tuning.

    Authors: We apologize for the omission. The default settings used were k=3 (clamping bound) and a linear decay of update probability from 0.5 to 0.05 over the horizon. We will revise the method section to state these values explicitly and add a sensitivity analysis subsection in the experiments, reporting performance for k in {1,2,3,4,5} and alternative decay schedules. This will confirm robustness of the reported 3-bit gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; probabilistic update rule and benchmark results are independently defined and evaluated

full rationale

The paper introduces probabilistic HD-CB via an explicit new update rule (random subset selection with time-decaying probability and clamping to [-k, +k]) that replaces the accumulation-based rule of prior HD-CB. Performance claims are supported solely by off-policy evaluation on external standardized synthetic benchmarks via the Open Bandit Pipeline, with no equations, parameters, or results reducing by construction to fitted inputs, self-definitions, or load-bearing self-citations. The derivation chain from rule definition to reported outperformance remains self-contained against the cited benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on standard contextual bandit assumptions and hyperdimensional vector operations while introducing a new probabilistic update rule whose parameters (bound k and decay schedule) are design choices without independent derivation.

free parameters (2)
  • k
    Bound on component values [-k, +k] chosen to constrain precision and prevent overflow.
  • update probability decay schedule
    Time-decaying probability for component updates is a tunable design parameter.
axioms (1)
  • domain assumption Contextual bandit problems can be effectively modeled and solved using hyperdimensional vector representations
    Directly extends the HD-CB framework without re-deriving its foundations.

pith-pipeline@v0.9.0 · 5585 in / 1343 out tokens · 43595 ms · 2026-05-14T19:05:43.806442+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    A contextual-bandit approach to personalized news article recommendation,

    L. Liet al., “A contextual-bandit approach to personalized news article recommendation,” inInt. Conf. on World Wide Web, 2010, pp. 661–670

  2. [2]

    Artwork personalization at Netflix,

    F. Amatet al., “Artwork personalization at Netflix,” inACM Conference on Recommender Systems (RecSys), 2018, p. 487–488

  3. [3]

    Multiworld testing decision service: A system for experimentation, learning, and decision-making,

    A. Agarwalet al., “Multiworld testing decision service: A system for experimentation, learning, and decision-making,”Whitepaper of Microsoft, pp. 1–40, 2016

  4. [4]

    How The New York Times is experimenting with recommendation algorithms,

    A. Coenen, “How The New York Times is experimenting with recommendation algorithms,” 2019, accessed: 2026-01-13. [Online]. Available: https://tinyurl.com/536zku2f

  5. [5]

    How we boosted app revenue by 10% with real-time personalization,

    J. Runge, “How we boosted app revenue by 10% with real-time personalization,” 2020, accessed: 2026-01-13. [Online]. Available: https://tinyurl.com/433eawyb

  6. [6]

    Automatic hardware accelerators reconfiguration through LinearUCB algorithms on a RISC-V processor,

    M. Angioliet al., “Automatic hardware accelerators reconfiguration through LinearUCB algorithms on a RISC-V processor,” inConf. on PhD Res. in Microelect. and Elect. (PRIME), 2023, pp. 169–172

  7. [7]

    Contextual-bandit anomaly detection for IoT data in distributed hierarchical edge computing,

    M. V . Ngoet al., “Contextual-bandit anomaly detection for IoT data in distributed hierarchical edge computing,” inIEEE Int. Conf. on Distributed Computing Systems (ICDCS), 2020, pp. 1227–1230

  8. [8]

    Edge computing in the dark: Leveraging contextual- combinatorial bandit and coded computing,

    C.-S. Yanget al., “Edge computing in the dark: Leveraging contextual- combinatorial bandit and coded computing,”IEEE/ACM Transactions on Networking, vol. 29, no. 3, pp. 1022–1031, 2021

  9. [9]

    Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis,

    A. Durandet al., “Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis,” inMachine Learning for Healthcare (MLHC), 2018, pp. 67–82

  10. [10]

    Efficient implementation of LinearUCB through algo- rithmic improvements and vector computing acceleration for embedded learning systems,

    M. Angioliet al., “Efficient implementation of LinearUCB through algo- rithmic improvements and vector computing acceleration for embedded learning systems,”ACM Transactions on Embedded Computing Systems, vol. 24, no. 4, pp. 1–23, 2025

  11. [11]

    Hyperdimensional computing: An introduction to comput- ing in distributed representation with high-dimensional random vectors,

    P. Kanerva, “Hyperdimensional computing: An introduction to comput- ing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009

  12. [12]

    A survey on hyperdimensional computing aka vector symbolic architectures, Part I: Models and data transformations,

    D. Kleykoet al., “A survey on hyperdimensional computing aka vector symbolic architectures, Part I: Models and data transformations,”ACM Computing Surveys, vol. 55, no. 6, pp. 1–40, 2022

  13. [13]

    HD-CB: The first exploration of hyperdimensional computing for contextual bandits problems,

    M. Angioliet al., “HD-CB: The first exploration of hyperdimensional computing for contextual bandits problems,”IEEE Open Journal of the Computer Society, vol. 7, pp. 105–116, 2026

  14. [14]

    Randomness in neural networks: An overview,

    S. Scardapane and D. Wang, “Randomness in neural networks: An overview,”Data Min. Know. Disc., vol. 7, no. 2, pp. 1–18, 2017

  15. [15]

    Reducing catastrophic forgetting with associative learn- ing,

    Y . Shenet al., “Reducing catastrophic forgetting with associative learn- ing,”Neural Computation, vol. 35, no. 11, pp. 1797–1819, 2023

  16. [16]

    HD-CBBIN: A lightweight approach for contextual bandit learning in real-time applications,

    M. Angioliet al., “HD-CBBIN: A lightweight approach for contextual bandit learning in real-time applications,” inInt. Joint Conf. on Neur. Netw. (IJCNN), 2025, pp. 1–8

  17. [17]

    A binary self-organizing map and its FPGA implemen- tation,

    K. Appiahet al., “A binary self-organizing map and its FPGA implemen- tation,” inInt. Joint Conf. on Neur. Netw. (IJCNN), 2009, pp. 164–171

  18. [18]

    Integer self-organizing maps for digital hardware,

    D. Kleykoet al., “Integer self-organizing maps for digital hardware,” in Int. Joint Conf. on Neur. Netw. (IJCNN), 2019, pp. 1–8

  19. [19]

    Survey on applications of multi-armed and contextual bandits,

    D. Bouneffoufet al., “Survey on applications of multi-armed and contextual bandits,” inIEEE Congr. Evol. Comput., 2020, p. 1–8

  20. [20]

    From ads to interventions: Contextual bandits in mobile health,

    A. Tewari and S. A. Murphy, “From ads to interventions: Contextual bandits in mobile health,” inMobile Health: Sensors, Analytic Methods, and Applications, 2017, pp. 495–517

  21. [21]

    DroneBandit: Multi-armed contextual bandits for col- laborative edge-to-cloud inference in resource-constrained nanodrones,

    G. Chacunet al., “DroneBandit: Multi-armed contextual bandits for col- laborative edge-to-cloud inference in resource-constrained nanodrones,” inGreat Lakes Symposium on VLSI (GLSVLSI), 2024, p. 98–104

  22. [22]

    To ask or not to ask: Human-in-the-loop contextual bandits with applications in robot-assisted feeding,

    R. Banerjeeet al., “To ask or not to ask: Human-in-the-loop contextual bandits with applications in robot-assisted feeding,” inIEEE Interna- tional Conf. on Robotics and Automation (ICRA), 2025, pp. 1378–1384

  23. [23]

    Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,

    L. Liet al., “Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,” inACM Int. Conf. on Web Search and Data Mining (WSDM), 2011, pp. 297–306

  24. [24]

    A comparison of vector symbolic architectures,

    K. Schlegelet al., “A comparison of vector symbolic architectures,” Artificial Intelligence Review, vol. 55, no. 6, pp. 4523–4555, 2022

  25. [25]

    Multiplicative binding, representation operators & anal- ogy,

    R. W. Gayler, “Multiplicative binding, representation operators & anal- ogy,” inAdvances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences, 1998, pp. 1–4

  26. [26]

    Vector symbolic architectures as a computing frame- work for emerging hardware,

    D. Kleykoet al., “Vector symbolic architectures as a computing frame- work for emerging hardware,”Proceedings of the IEEE, vol. 110, no. 10, pp. 1538–1571, 2022

  27. [27]

    Sparse binary distributed encoding of scalars,

    D. A. Rachkovskijet al., “Sparse binary distributed encoding of scalars,” J. Autom. Inf. Sci., vol. 37, no. 6, pp. 12–23, 2005

  28. [28]

    Density encoding enables resource-efficient randomly connected neural networks,

    D. Kleykoet al., “Density encoding enables resource-efficient randomly connected neural networks,”IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3777–3783, 2021

  29. [29]

    A family of binary spatter codes,

    P. Kanerva, “A family of binary spatter codes,” inInt. Conf. on Artificial Neural Networks (ICANN), 1995, pp. 517–522

  30. [30]

    Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation,

    Y . Saitoet al., “Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021, pp. 1–14