Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning
Pith reviewed 2026-05-14 19:05 UTC · model grok-4.3
The pith
A probabilistic update rule lets hyperdimensional contextual bandits run on low-precision hardware while outperforming binarized versions and approaching full performance with 3 bits per component.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that replacing deterministic accumulation in HD-CB with a probabilistic update rule—selecting a random subset of components at each step, applying a time-decaying update probability, and clamping values to a fixed interval [-k, +k]—yields a low-precision algorithm that consistently outperforms its binarized counterpart at the same bit width and approaches the regret performance of full-precision HD-CB with as few as three bits per component, as measured by off-policy evaluation on standardized synthetic contextual bandit benchmarks.
What carries the argument
The probabilistic update rule that selects random vector components for update with a time-decaying probability and bounds their values to a fixed range, replacing accumulation and eliminating periodic binarization.
If this is right
- Expected update cost drops in direct proportion to the fraction of components selected at each step.
- Memory and energy budgets shrink because low-precision fixed-range values replace high-precision accumulation.
- Decision quality improves over binarized HD-CB at every tested bit width on the evaluated benchmarks.
- Deployment on strictly constrained hardware becomes feasible without sacrificing the convergence speed of the original HD-CB approach.
Where Pith is reading between the lines
- The same bounded probabilistic update could be tested on other hyperdimensional tasks such as sequence prediction or anomaly detection to see whether the precision savings transfer.
- Hardware measurements of actual energy per decision on microcontrollers would quantify gains that synthetic regret curves cannot capture.
- Adjusting the decay schedule of the update probability per environment might further reduce regret in non-stationary settings.
Load-bearing premise
The probabilistic rule with random subset selection and time-decaying probability preserves enough learning information to match or exceed binarized HD-CB without introducing bias or instability that would appear only outside synthetic benchmarks.
What would settle it
An on-device experiment in which probabilistic HD-CB produces higher cumulative regret or visible instability than binarized HD-CB at the same precision would falsify the performance claim.
Figures
read the original abstract
Contextual bandits (CB) are online sequential decision-making problems under partial feedback that underpin many adaptive services. There is a growing demand to deploy CB agents directly on-device, under strict constraints on memory, compute, and energy. However, standard linear CB algorithms are often impractical for resource-constrained devices with their unfavorable scaling in computational and memory costs. Recently, HD-CB, a CB approach based on hyperdimensional computing principles, has been proposed to model and solve CB problems by moving into high-dimensional spaces. HD-CB offers faster convergence, favorable scalability, and improves memory efficiency compared to linear CB algorithms. However, its learning rule is accumulation-based: the values of action vectors grow over time, requiring high precision. While periodic binarization can prevent overflow in low-precision components, it may discard important information about magnitudes and degrade decision quality. This paper introduces probabilistic HD-CB, a low-precision variant that replaces deterministic accumulation with a probabilistic update rule. At each step, only a random subset of vector components is updated, with a time-decaying update probability, and component values are constrained to a predefined range [-k,+k]. This approach enables low-precision components, prevents overflow without periodic binarization, and reduces the expected update cost in proportion to the fraction of updated components. Off-policy evaluation on standardized synthetic CB benchmarks using the Open Bandit Pipeline shows that probabilistic HD-CB consistently outperforms binarized HD-CB at equal precision, while approaching the performance of HD-CB with as few as 3 bits per component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes probabilistic HD-CB as a low-precision variant of hyperdimensional computing-based contextual bandits (HD-CB). It replaces deterministic accumulation with a probabilistic update rule that selects a random subset of vector components according to a time-decaying probability, clamps component values to [-k, +k], and thereby avoids overflow and reduces expected update cost. Off-policy evaluation on synthetic benchmarks via the Open Bandit Pipeline is reported to show consistent outperformance over binarized HD-CB at equal precision and performance approaching that of full HD-CB with as few as 3 bits per component.
Significance. If the performance claims are confirmed under closed-loop conditions, the work would offer a practical route to deploying contextual bandits on memory- and energy-constrained devices, extending the applicability of HD-CB while preserving its favorable scaling properties. The reliance on standardized benchmarks is a methodological strength that supports reproducibility.
major comments (2)
- [Abstract and evaluation description] Abstract and evaluation description: the reported outperformance rests entirely on off-policy replay of fixed synthetic logs. Because the probabilistic subset selection and decay schedule alter action probabilities, the resulting data distribution in true online interaction may differ systematically from the logged data; no closed-loop experiments are described that would detect such bias or instability.
- [Method and experimental sections] Method and experimental sections: the update rule introduces free parameters k and the decay schedule whose values are not reported with sensitivity analysis or default settings; without these details it is impossible to determine whether the claimed gains at 3-bit precision are robust or depend on post-hoc tuning.
minor comments (2)
- The abstract states that results are 'consistent' but supplies no information on number of runs, variance, or statistical tests; adding these would strengthen the presentation.
- Notation for the time-decaying probability and the precise clamping operation should be given explicitly with an equation in the method section.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below with clarifications and planned revisions.
read point-by-point responses
-
Referee: Abstract and evaluation description: the reported outperformance rests entirely on off-policy replay of fixed synthetic logs. Because the probabilistic subset selection and decay schedule alter action probabilities, the resulting data distribution in true online interaction may differ systematically from the logged data; no closed-loop experiments are described that would detect such bias or instability.
Authors: We acknowledge the limitation of relying solely on off-policy replay via the Open Bandit Pipeline. While this is the standard methodology for reproducible contextual bandit evaluation on synthetic logs and controls for policy-induced biases in the logged data, the probabilistic updates could indeed induce distribution shifts in true closed-loop interaction. We will revise the evaluation description and add an explicit limitations paragraph noting this point and recommending closed-loop validation as future work. No new experiments are added at this stage. revision: partial
-
Referee: Method and experimental sections: the update rule introduces free parameters k and the decay schedule whose values are not reported with sensitivity analysis or default settings; without these details it is impossible to determine whether the claimed gains at 3-bit precision are robust or depend on post-hoc tuning.
Authors: We apologize for the omission. The default settings used were k=3 (clamping bound) and a linear decay of update probability from 0.5 to 0.05 over the horizon. We will revise the method section to state these values explicitly and add a sensitivity analysis subsection in the experiments, reporting performance for k in {1,2,3,4,5} and alternative decay schedules. This will confirm robustness of the reported 3-bit gains. revision: yes
Circularity Check
No significant circularity; probabilistic update rule and benchmark results are independently defined and evaluated
full rationale
The paper introduces probabilistic HD-CB via an explicit new update rule (random subset selection with time-decaying probability and clamping to [-k, +k]) that replaces the accumulation-based rule of prior HD-CB. Performance claims are supported solely by off-policy evaluation on external standardized synthetic benchmarks via the Open Bandit Pipeline, with no equations, parameters, or results reducing by construction to fitted inputs, self-definitions, or load-bearing self-citations. The derivation chain from rule definition to reported outperformance remains self-contained against the cited benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- k
- update probability decay schedule
axioms (1)
- domain assumption Contextual bandit problems can be effectively modeled and solved using hyperdimensional vector representations
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
probabilistic update rule... random subset... time-decaying update probability... constrained to [-κ,+κ]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A contextual-bandit approach to personalized news article recommendation,
L. Liet al., “A contextual-bandit approach to personalized news article recommendation,” inInt. Conf. on World Wide Web, 2010, pp. 661–670
work page 2010
-
[2]
Artwork personalization at Netflix,
F. Amatet al., “Artwork personalization at Netflix,” inACM Conference on Recommender Systems (RecSys), 2018, p. 487–488
work page 2018
-
[3]
Multiworld testing decision service: A system for experimentation, learning, and decision-making,
A. Agarwalet al., “Multiworld testing decision service: A system for experimentation, learning, and decision-making,”Whitepaper of Microsoft, pp. 1–40, 2016
work page 2016
-
[4]
How The New York Times is experimenting with recommendation algorithms,
A. Coenen, “How The New York Times is experimenting with recommendation algorithms,” 2019, accessed: 2026-01-13. [Online]. Available: https://tinyurl.com/536zku2f
work page 2019
-
[5]
How we boosted app revenue by 10% with real-time personalization,
J. Runge, “How we boosted app revenue by 10% with real-time personalization,” 2020, accessed: 2026-01-13. [Online]. Available: https://tinyurl.com/433eawyb
work page 2020
-
[6]
Automatic hardware accelerators reconfiguration through LinearUCB algorithms on a RISC-V processor,
M. Angioliet al., “Automatic hardware accelerators reconfiguration through LinearUCB algorithms on a RISC-V processor,” inConf. on PhD Res. in Microelect. and Elect. (PRIME), 2023, pp. 169–172
work page 2023
-
[7]
Contextual-bandit anomaly detection for IoT data in distributed hierarchical edge computing,
M. V . Ngoet al., “Contextual-bandit anomaly detection for IoT data in distributed hierarchical edge computing,” inIEEE Int. Conf. on Distributed Computing Systems (ICDCS), 2020, pp. 1227–1230
work page 2020
-
[8]
Edge computing in the dark: Leveraging contextual- combinatorial bandit and coded computing,
C.-S. Yanget al., “Edge computing in the dark: Leveraging contextual- combinatorial bandit and coded computing,”IEEE/ACM Transactions on Networking, vol. 29, no. 3, pp. 1022–1031, 2021
work page 2021
-
[9]
Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis,
A. Durandet al., “Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis,” inMachine Learning for Healthcare (MLHC), 2018, pp. 67–82
work page 2018
-
[10]
M. Angioliet al., “Efficient implementation of LinearUCB through algo- rithmic improvements and vector computing acceleration for embedded learning systems,”ACM Transactions on Embedded Computing Systems, vol. 24, no. 4, pp. 1–23, 2025
work page 2025
-
[11]
P. Kanerva, “Hyperdimensional computing: An introduction to comput- ing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009
work page 2009
-
[12]
D. Kleykoet al., “A survey on hyperdimensional computing aka vector symbolic architectures, Part I: Models and data transformations,”ACM Computing Surveys, vol. 55, no. 6, pp. 1–40, 2022
work page 2022
-
[13]
HD-CB: The first exploration of hyperdimensional computing for contextual bandits problems,
M. Angioliet al., “HD-CB: The first exploration of hyperdimensional computing for contextual bandits problems,”IEEE Open Journal of the Computer Society, vol. 7, pp. 105–116, 2026
work page 2026
-
[14]
Randomness in neural networks: An overview,
S. Scardapane and D. Wang, “Randomness in neural networks: An overview,”Data Min. Know. Disc., vol. 7, no. 2, pp. 1–18, 2017
work page 2017
-
[15]
Reducing catastrophic forgetting with associative learn- ing,
Y . Shenet al., “Reducing catastrophic forgetting with associative learn- ing,”Neural Computation, vol. 35, no. 11, pp. 1797–1819, 2023
work page 2023
-
[16]
HD-CBBIN: A lightweight approach for contextual bandit learning in real-time applications,
M. Angioliet al., “HD-CBBIN: A lightweight approach for contextual bandit learning in real-time applications,” inInt. Joint Conf. on Neur. Netw. (IJCNN), 2025, pp. 1–8
work page 2025
-
[17]
A binary self-organizing map and its FPGA implemen- tation,
K. Appiahet al., “A binary self-organizing map and its FPGA implemen- tation,” inInt. Joint Conf. on Neur. Netw. (IJCNN), 2009, pp. 164–171
work page 2009
-
[18]
Integer self-organizing maps for digital hardware,
D. Kleykoet al., “Integer self-organizing maps for digital hardware,” in Int. Joint Conf. on Neur. Netw. (IJCNN), 2019, pp. 1–8
work page 2019
-
[19]
Survey on applications of multi-armed and contextual bandits,
D. Bouneffoufet al., “Survey on applications of multi-armed and contextual bandits,” inIEEE Congr. Evol. Comput., 2020, p. 1–8
work page 2020
-
[20]
From ads to interventions: Contextual bandits in mobile health,
A. Tewari and S. A. Murphy, “From ads to interventions: Contextual bandits in mobile health,” inMobile Health: Sensors, Analytic Methods, and Applications, 2017, pp. 495–517
work page 2017
-
[21]
G. Chacunet al., “DroneBandit: Multi-armed contextual bandits for col- laborative edge-to-cloud inference in resource-constrained nanodrones,” inGreat Lakes Symposium on VLSI (GLSVLSI), 2024, p. 98–104
work page 2024
-
[22]
R. Banerjeeet al., “To ask or not to ask: Human-in-the-loop contextual bandits with applications in robot-assisted feeding,” inIEEE Interna- tional Conf. on Robotics and Automation (ICRA), 2025, pp. 1378–1384
work page 2025
-
[23]
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,
L. Liet al., “Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,” inACM Int. Conf. on Web Search and Data Mining (WSDM), 2011, pp. 297–306
work page 2011
-
[24]
A comparison of vector symbolic architectures,
K. Schlegelet al., “A comparison of vector symbolic architectures,” Artificial Intelligence Review, vol. 55, no. 6, pp. 4523–4555, 2022
work page 2022
-
[25]
Multiplicative binding, representation operators & anal- ogy,
R. W. Gayler, “Multiplicative binding, representation operators & anal- ogy,” inAdvances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences, 1998, pp. 1–4
work page 1998
-
[26]
Vector symbolic architectures as a computing frame- work for emerging hardware,
D. Kleykoet al., “Vector symbolic architectures as a computing frame- work for emerging hardware,”Proceedings of the IEEE, vol. 110, no. 10, pp. 1538–1571, 2022
work page 2022
-
[27]
Sparse binary distributed encoding of scalars,
D. A. Rachkovskijet al., “Sparse binary distributed encoding of scalars,” J. Autom. Inf. Sci., vol. 37, no. 6, pp. 12–23, 2005
work page 2005
-
[28]
Density encoding enables resource-efficient randomly connected neural networks,
D. Kleykoet al., “Density encoding enables resource-efficient randomly connected neural networks,”IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3777–3783, 2021
work page 2021
-
[29]
A family of binary spatter codes,
P. Kanerva, “A family of binary spatter codes,” inInt. Conf. on Artificial Neural Networks (ICANN), 1995, pp. 517–522
work page 1995
-
[30]
Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation,
Y . Saitoet al., “Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021, pp. 1–14
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.