Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Ge Yu; Maosong Sun; Pengcheng Huang; Shaohua Duan; Shuo Wang; Xiaoyuan Yi; Xinze Li; Yu Gu; Yukun Yan; Zhenghao Liu

arxiv: 2508.13993 · v2 · submitted 2025-08-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 22:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: long-context modeling, multi-armed bandit, preference optimization, DPO, synthetic data generation, context chunking, LLM fine-tuning

The pith

Treating chunks of long context as arms in a multi-armed bandit lets LLMs sample more informative segments for preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that models pieces of a long document as arms in a bandit problem, learning over repeated trials which pieces produce better responses when fed to the model. By balancing exploration of untried chunks against exploitation of chunks that have already yielded high rewards, it assembles preference pairs that are both more varied and more reliable than those from full-context or random sampling. These pairs then train the model with direct preference optimization. The approach is tested on two model families, Llama and Qwen, and is reported to improve long-context reasoning benchmarks by more than 4%. A reader would care because current ways of creating synthetic long-context data often suffer from repetition and errors, limiting how well models handle real documents and conversations.

Core claim

By treating each chunk of a long context as an arm in a multi-armed bandit, selecting chunks according to their current expected reward, generating responses from the chosen chunks, and then updating the chunk scores from the reward feedback, the method collects higher-quality and more diverse preference data pairs that can be used for direct preference optimization, resulting in improved performance on long-context reasoning benchmarks for both Llama and Qwen models.

What carries the argument

Multi-armed bandit rollout that treats context chunks as arms, selects them by expected reward scores, and iteratively updates the scores from feedback on the generated responses.
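
The rollout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the upper-confidence-bound score quoted from the paper elsewhere on this page, UCB_t(C_i) = mu_i(t) + alpha * sqrt(2 ln t / (n_i(t) + eps)), and a running-mean update of each chunk's score. The names `rollout`, `reward_fn`, and `k` are hypothetical; `reward_fn` stands in for "generate a response from the selected chunks and score it", which the page does not specify.

```python
import math
from typing import Callable, List, Sequence, Tuple


def ucb_score(mean: float, pulls: int, t: int,
              alpha: float = 1.0, eps: float = 1e-6) -> float:
    """Expected reward plus an exploration bonus that shrinks as a chunk is reused."""
    return mean + alpha * math.sqrt(2.0 * math.log(t) / (pulls + eps))


def rollout(chunks: Sequence[str],
            reward_fn: Callable[[List[str]], float],  # hypothetical scorer
            steps: int,
            k: int = 1,
            alpha: float = 1.0) -> Tuple[List[float], List[int]]:
    """Run `steps` bandit rounds, selecting the top-k chunks by UCB each round."""
    means = [0.0] * len(chunks)   # running mean reward per chunk (arm)
    pulls = [0] * len(chunks)     # how often each chunk has been selected
    for t in range(1, steps + 1):
        scores = [ucb_score(means[i], pulls[i], t, alpha)
                  for i in range(len(chunks))]
        chosen = sorted(range(len(chunks)),
                        key=lambda i: scores[i], reverse=True)[:k]
        reward = reward_fn([chunks[i] for i in chosen])
        for i in chosen:
            pulls[i] += 1
            # Incremental mean update: no need to store past rewards.
            means[i] += (reward - means[i]) / pulls[i]
    return means, pulls
```

With `alpha = 0` the selection is purely greedy; larger values spend more rounds on rarely selected chunks. Sharing one reward across all `k` chunks in a round is a simplification assumed here, not something the page states.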

If this is right

The bandit selection produces preference pairs with greater diversity and fewer factual inconsistencies than full-context or random baselines.
The same procedure delivers measurable gains on long-context reasoning benchmarks for both Llama and Qwen.
Exploration and exploitation together let the model concentrate on the most relevant segments within each long input.
The collected data supports direct preference optimization that strengthens long-context capabilities without requiring new model architectures.
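
For readers unfamiliar with direct preference optimization, the per-pair objective that such data feeds into can be written compactly. The sketch below is the standard DPO loss for a single (chosen, rejected) pair, computed from sequence log-probabilities under the policy and a frozen reference model. The function name `dpo_loss` and the default `beta = 0.1` are illustrative assumptions; this page does not report the paper's training hyperparameters.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # How much more the policy prefers the chosen response than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and falls as the policy shifts probability toward the chosen response relative to the reference.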

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same chunk-as-arm idea could be applied to other long-context tasks such as summarization or multi-hop question answering.
Because only selected chunks are used for generation, the method may reduce the compute needed to create each training example compared with full-context sampling.
Chunk scores learned on one model or task might transfer to new models, allowing reuse of the bandit policy across different LLMs.
Testing the method with alternative reward signals or on contexts longer than those in the current benchmarks would clarify how far the gains extend.

Load-bearing premise

That reward feedback from the generated responses can be used to update chunk scores in a way that reliably surfaces the chunks most useful for creating good preference data.

What would settle it

An experiment on the same long-context reasoning benchmarks in which LongMab shows no gain over random chunk selection, or over always using the full context for response generation.

Original abstract

Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. Both exploration and exploitation during the rollout process enable the LLM to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Experimental results on both Llama and Qwen show the effectiveness of LongMab by achieving more than a 4% improvement on long-context reasoning benchmarks. All data and code will be released on https://github.com/NEUIR/LongMab-PO.
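
The abstract does not say how sampled responses are turned into preference pairs. One plausible construction, shown purely as an assumption, takes the highest-reward response from a rollout as "chosen" and the lowest-reward one as "rejected", and skips questions whose responses are not separated by a minimum reward margin. The names `Sample`, `build_preference_pair`, and `min_margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Sample:
    response: str   # text generated during one rollout step
    reward: float   # score assigned to that response


def build_preference_pair(question: str,
                          samples: Sequence[Sample],
                          min_margin: float = 0.0) -> Optional[Dict[str, str]]:
    """Return a DPO-style record, or None if no usable pair exists."""
    if len(samples) < 2:
        return None
    best = max(samples, key=lambda s: s.reward)
    worst = min(samples, key=lambda s: s.reward)
    # Skip pairs whose rewards are too close to express a real preference.
    if best.reward - worst.reward <= min_margin:
        return None
    return {"prompt": question,
            "chosen": best.response,
            "rejected": worst.response}
```

Raising `min_margin` trades data volume for cleaner preference signal, which is one of the confounds the referee report below asks the authors to control.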

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes LongMab, a Multi-Armed Bandit (MAB) framework that treats long-context chunks as arms, selects them via expected reward scores for LLM response generation, iteratively updates scores from reward feedback, and uses the resulting high-quality diverse pairs for DPO training to improve long-context reasoning. Experiments claim >4% gains on benchmarks for Llama and Qwen models.

Significance. If the central attribution holds, the work would offer a useful algorithmic tool for efficient synthetic data curation in long-context LLM optimization by leveraging exploration-exploitation to focus on informative segments, potentially addressing diversity and consistency issues more systematically than uniform sampling.

major comments (3)

[Experimental Results] Experimental Results section: the >4% benchmark improvements are reported without ablations that isolate MAB chunk selection from random sampling, fixed chunk selection, or equivalent-volume non-iterative baselines. This is load-bearing for the claim that the bandit mechanism (rather than data volume or DPO itself) drives the gains.
[§3] §3 (Method): the reward signal used to update chunk scores is described only at a high level as 'reward feedback from generated responses.' It is unclear whether this is LLM-as-judge, self-reward, length/fluency heuristics, or an external model, and no analysis shows it correlates with reasoning depth or factual grounding rather than superficial traits.
[§4.1] §4.1 and Table 1 (or equivalent results table): no statistical significance tests, variance across runs, or controls for confounds (e.g., total preference pairs generated, hyperparameter differences) are provided, leaving the support for the effectiveness claim moderate at best.

minor comments (2)

[Abstract] The GitHub link is given but the manuscript should confirm that released code includes the exact MAB rollout implementation, reward computation, and reproduction scripts for the reported experiments.
[§3] Notation for the MAB update rule (e.g., how expected reward scores are computed and decayed) could be made more explicit with a short algorithm box or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the empirical support and methodological clarity of LongMab. We address each major comment below and will incorporate revisions to improve the paper.

Referee: [Experimental Results] Experimental Results section: the >4% benchmark improvements are reported without ablations that isolate MAB chunk selection from random sampling, fixed chunk selection, or equivalent-volume non-iterative baselines. This is load-bearing for the claim that the bandit mechanism (rather than data volume or DPO itself) drives the gains.

Authors: We agree that additional ablations are necessary to attribute gains specifically to the MAB mechanism. In the revised manuscript, we will add experiments comparing LongMab to random chunk sampling, fixed (non-adaptive) chunk selection, and non-iterative baselines, while strictly controlling for the total number of generated preference pairs and data volume. These results will be reported in the Experimental Results section to isolate the contribution of the iterative bandit updates.
Revision: yes

Referee: [§3] §3 (Method): the reward signal used to update chunk scores is described only at a high level as 'reward feedback from generated responses.' It is unclear whether this is LLM-as-judge, self-reward, length/fluency heuristics, or an external model, and no analysis shows it correlates with reasoning depth or factual grounding rather than superficial traits.

Authors: We will clarify the reward computation in the revised Section 3. The reward is derived from an LLM-as-a-judge that scores response quality and relevance to the selected chunk using a structured prompt. We will include the exact judge prompt, scoring rubric, and an analysis on a held-out set showing correlation between these rewards and human ratings of reasoning depth and factual accuracy, to demonstrate that the signal prioritizes substantive qualities over superficial ones such as length.
Revision: yes

Referee: [§4.1] §4.1 and Table 1 (or equivalent results table): no statistical significance tests, variance across runs, or controls for confounds (e.g., total preference pairs generated, hyperparameter differences) are provided, leaving the support for the effectiveness claim moderate at best.

Authors: We acknowledge this limitation in the current reporting. The revised manuscript will include standard deviations across multiple random seeds, statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing LongMab to baselines, and explicit controls ensuring all methods use identical total preference pair counts and matched hyperparameter settings. These additions will be presented in §4.1 and the associated tables.
Revision: yes
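
The paired comparison across seeds that the rebuttal promises can be run with very little machinery. The sketch below substitutes an exact paired sign-flip permutation test for the t-test or Wilcoxon test named in the response, because it needs only the standard library and makes no normality assumption. The function name `paired_sign_flip_pvalue` is hypothetical, and the inputs are assumed to be per-seed benchmark scores for the method and a baseline, matched by seed.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(method_scores: Sequence[float],
                            baseline_scores: Sequence[float]) -> float:
    """Two-sided exact p-value for the mean paired difference being zero."""
    if len(method_scores) != len(baseline_scores) or not method_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to have
    # either sign, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:   # tolerance for float round-off
            hits += 1
    return hits / total
```

Enumeration costs 2^n evaluations, which is fine for the handful of seeds typical in fine-tuning studies. Note that with n seeds the smallest attainable two-sided p-value is 2 / 2^n, so at least six seeds are needed before a result can fall below 0.05.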

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

The paper introduces LongMab as an algorithmic framework treating context chunks as MAB arms, selecting them via expected rewards to generate responses for DPO preference pairs. The central claims rest on experimental results (>4% gains on long-context benchmarks for Llama and Qwen) rather than any closed-form derivation or prediction. No equations reduce a claimed output to a fitted input by construction, no self-citation chain bears the load of a uniqueness theorem, and no ansatz is smuggled in. The method is self-contained against external benchmarks and data generation, making it a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method introduces no new physical entities but relies on the domain assumption about reward feedback in the context of LLM preference optimization.

free parameters (1)

MAB exploration-exploitation parameter
Controls the balance in chunk selection during rollout, likely tuned on validation data.

axioms (1)

domain assumption: Reward signals from LLM-generated responses accurately reflect the quality and informativeness of context chunks.
This is central to the MAB update mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5800 in / 1233 out tokens · 48267 ms · 2026-05-18T22:06:10.144208+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

treat context chunks as arms of MAB, select chunks based on their expected reward scores... $\mathrm{UCB}_t(C_i) = \mu_i(t) + \alpha \sqrt{2 \ln t / (n_i(t) + \epsilon)}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

LongMab-PO significantly improves... achieving more than a 4% improvement on long-context reasoning benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.