Bandit algorithms

Tor Lattimore, Csaba Szepesvári · 2020

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

unclear: 1 · use method: 1

representative citing papers

Multi-Armed Sampling Problem and the End of Exploration

cs.LG · 2025-07-14 · conditional · novelty 8.0

Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Presents a UCBVI-style algorithm achieving optimal distributional regret bounds O(sqrt(AT log(1/δ))) in multi-armed bandits, confirming a 2020 conjecture.

COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

citing papers explorer

Multi-Armed Sampling Problem and the End of Exploration

cs.LG · 2025-07-14 · conditional · none · ref 15

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · none · ref 19

COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

cs.LG · 2026-04-20 · unverdicted · none · ref 10 · 2 links

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

cs.AI · 2026-04-07 · unverdicted · none · ref 27
