Bandit algorithms
4 Pith papers cite this work.
citing papers explorer
- Multi-Armed Sampling Problem and the End of Exploration
  Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.
- Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning
  Presents a UCBVI-style algorithm achieving optimal distributional regret bounds O(sqrt(AT log(1/δ))) in multi-armed bandits, confirming a 2020 conjecture.
- COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams
  COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.
- When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
  Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.