pith. machine review for the scientific record.

arxiv: 2604.15453 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

(1D) Ordered Tokens Enable Efficient Test-Time Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords tokenization · autoregressive models · test-time search · image generation · coarse-to-fine tokens · verifier guidance · training-free generation

The pith

Coarse-to-fine 1D token sequences improve test-time search scaling in autoregressive image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive models for image generation typically predict tokens in a fixed order, such as a raster scan over a 2D grid. This paper tests whether ordering tokens from coarse to fine in a 1D sequence makes generation easier to steer with test-time search and verifiers. The key idea is that partial sequences in this order have clear semantic content that an image-text verifier can score. Experiments confirm that these models scale better with more search effort at test time. The structure even allows pure search, without any trained AR model, to generate images from text prompts guided only by the verifier.

Core claim

AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, pure test-time search over token sequences can perform training-free text-to-image generation when guided by an image-text verifier. Classical search algorithms interact more effectively with the ordered structure, and the approach works across different verifiers and AR priors.

What carries the argument

Coarse-to-fine 1D token ordering, where each prefix of the sequence represents a progressively refined image that carries semantic meaning for verifier scoring.

Load-bearing premise

Intermediate states in coarse-to-fine token sequences carry semantic meaning that verifiers can reliably evaluate to steer generation.

What would settle it

An experiment where coarse-to-fine ordered tokens show no better scaling or search performance than grid tokens, or where verifiers cannot score partial sequences meaningfully, would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.15453 by Afshin Dehghan, Ali Cy, Amir Zamir, Jesse Allardice, Mingqiao Ye, Nataša Jovanović, Oğuzhan Fatih Kar, Parham Rezaei, Roman Bachmann, Zhitong Gao.

Figure 1
Figure 1: (a) Intermediate readouts. 1D ordered tokens provide a coarse-to-fine structure with interpretable readouts amenable to test-time search. For the prompt “a potted plant and a donut”, tokens progressively capture concepts from high- to low-level, e.g., “plant” → “potted plant” → “a potted plant and an object”. This structure allows verifiers to effectively guide generation. In contrast, 2D grid tokens gener… view at source ↗
Figure 2
Figure 2: Ordered tokens induce a searchable latent structure. (a) FlexTok encodes images into a sequence of 1D ordered tokens trained to support variable-length decoding, imposing a coarse-to-fine hierarchy. (b) Illustration of search over the token vocabulary without an autoregressive model: candidate tokens are sampled using a token prior (here, uniform over the codebook, i.e., no AR model is assumed) and evaluat… view at source ↗
Figure 3
Figure 3: Visualization of images decoded from the first-token vocabulary in FlexTok. Each first-token entry is decoded using nine random seeds, producing nine images per token. These decoded images form semantically coherent clusters (e.g., plants, bags, food, and furniture), indicating that tokens capture a global distribution of concepts that can be searched over. view at source ↗
Figure 4
Figure 4: Direct search over 1D ordered tokens enables training-free text-to-image generation. We search over FlexTok (Bachmann et al., 2025) using a 5-beam strategy and ImageReward as the verifier. We show the best image obtained at each step. view at source ↗
Figure 5
Figure 5: Overview of the Search-over-Tokens (SoTo) evaluation framework. The framework studies test-time scaling behavior of image tokenizers when combined with autoregressive generation and search. (A) Search algorithms: different strategies for exploring the token space during generation, including Best-of-N sampling, Beam Search, and Lookahead Search. (B) Verifiers: scoring functions that guide search by evaluat… view at source ↗
Figure 6
Figure 6: Test-time scaling across token structures. We compare inference-time search algorithms on two tokenizers: 1D ordered tokens (FlexTok) and a controlled 2D grid tokenizer. While best-of-N and lookahead search exhibit similar scaling for both tokenizations, beam search yields substantially larger gains for 1D ordered tokens. The rightmost panel compares each tokenizer under its best-performing search algorith… view at source ↗
Figure 7
Figure 7: Test-time scaling compared with Janus. We compare FlexTok (1D ordered tokens) and Janus (Wu et al., 2024a) (2D grid tokens) under best-of-N sampling and beam search. While Janus achieves slightly higher performance without search, FlexTok exhibits stronger scaling under beam search as inference compute increases. Results are evaluated on the COCO validation set. view at source ↗
Figure 9
Figure 9: Image generation with zero-shot concept preservation via search. Search over 1D ordered tokens enables multimodal control without finetuning by incorporating an image similarity verifier (DreamSim (Fu et al., 2023)) at inference time. The top row shows direct autoregressive generation with FlexTok, while the bottom row shows generations guided by image-based verification. view at source ↗
Figure 11
Figure 11: Comparison of different verifiers. Each row reports search using one verifier. All methods use the same beam search algorithm on FlexTok. The best score in each column is highlighted in bold. The superscript in each cell represents the rank within that column’s metric, and the last column reports the average of these column-wise ranks, providing an overall rank for each verifier. view at source ↗
Figure 10
Figure 10: …and App. F.1 provide visual examples. Similarly, experiments with another 1D ordered tokenizer (Semanticist (Wen et al., 2025a)) show that text-to-image generation remains feasible even with a weak AR prior (e.g., a class-conditional prior); see App. E.1. view at source ↗
Figure 12
Figure 12: Visualization examples of Semanticist for class-to-image generation on ImageNet. We compare direct autoregressive generation, beam search with a simple prompt, and beam search with a complex prompt. Beam search generally improves image–text alignment, while complex prompts provide additional guidance beyond class priors. Below each group of images, we show the ImageNet class ID and name, along with the co… view at source ↗
Figure 13
Figure 13: Test-time scaling across token structures on GenEval. We compare three inference-time search algorithms (Best-of-N, beam search, and lookahead search) on two tokenizers: 1D ordered tokens (FlexTok) and a 2D grid tokenizer, evaluated on GenEval using ImageReward as the verifier. Top row: ImageReward score vs. inference compute; bottom row: GenEval accuracy. The rightmost panel shows each tokenizer paired w… view at source ↗
Figure 14
Figure 14: Inference time analysis for different search algorithms (H100 GPU). Top row (a–c): CLIPScore vs. wall-clock inference time per image for Best-of-N, Beam search, and Lookahead search (rollout length L=8), respectively. Each point corresponds to one configuration (N or number of search steps), with the open circle marking the no-search AR baseline (dashed line). Bottom row (d–f): empirical wall-clock time b… view at source ↗
Figure 15
Figure 15: Comparison of different verifiers on COCO. Each row reports search using one verifier. All methods use the same beam search algorithm on FlexTok. The best score in each column is highlighted in bold. The superscript in each cell represents the rank within that column’s metric, and the last column reports the average of these column-wise ranks, providing an overall rank for each verifier. view at source ↗
Figure 16
Figure 16: Verifier score trajectories during search. Each panel shows how optimizing one verifier affects all other verifier signals as well as GenEval accuracy. Curves are averaged over 15 prompts from GenEval using FlexTok with beam search on the first 32 tokens. Note that we only show verifier scores where they are comparable during the search process; we exclude likelihood because it always increases with longe… view at source ↗
Figure 17
Figure 17: Visual comparison when searching with different AR priors (Examples 1–3). Beam search guided by different AR priors on the GenEval benchmark. view at source ↗
Figure 18
Figure 18: Visual comparison when searching with different AR priors (Examples 4–6). Beam search guided by different AR priors on the GenEval benchmark. view at source ↗
Figure 19
Figure 19: Visual comparison when searching with different AR priors (Examples 7–9). Beam search guided by different AR priors on the GenEval benchmark. view at source ↗
Figure 20
Figure 20: Visual comparison when searching with different AR priors (Examples 10–12). Beam search guided by different AR priors on the GenEval benchmark. view at source ↗
Figure 21
Figure 21: Generation trajectories during verifier-guided search up to 256 tokens: cup. We show intermediate outputs for the prompt “a photo of a cup” at token positions 1, 2, 4, 8, 16, 32, 64, 128, 256. Even for a simple single-object prompt, different verifiers induce noticeably different search paths in object shape, texture, and realism before converging. view at source ↗
Figure 22
Figure 22: Generation trajectories during verifier-guided search up to 256 tokens: frisbee and vase. This prompt highlights how different verifiers handle a two-object composition with competing semantics. Some verifiers lock onto one object earlier, while others preserve both objects more reliably over the full search trajectory. view at source ↗
Figure 23
Figure 23: Generation trajectories during verifier-guided search up to 256 tokens: two snowboards. This counting-and-category prompt shows how verifiers differ in how quickly they commit to the correct duplicated object structure. Some prioritize realistic texture early, while others more directly organize the scene around the requested count. view at source ↗
Figure 24
Figure 24: Generation trajectories during verifier-guided search up to 256 tokens: three hot dogs. This counting prompt illustrates how verifier choice changes the search path even when the target concept is simple. Alignment-focused verifiers improve object identity quickly, while the ensemble and structural verifiers more reliably organize the scene toward the requested count. view at source ↗
Figure 25
Figure 25: Generation trajectories during verifier-guided search up to 256 tokens: black potted plant and yellow toilet. This prompt emphasizes unusual object and color combinations. Different verifiers stabilize realism and layout at different rates; structural and ensemble guidance more reliably steer the search toward the requested plant–toilet composition over the full trajectory. view at source ↗
Figure 26
Figure 26: DreamBench++ comparison between direct AR generation and DreamSim-guided search (Examples 1–3). Each panel compares the direct AR baseline (Base, top row) against verifier-guided search (Base + Search, bottom row) for the same reference subject and prompt set. Search consistently improves identity preservation and prompt-conditioned scene adaptation. view at source ↗
Figure 27
Figure 27: DreamBench++ comparison between direct AR generation and DreamSim-guided search (Examples 4–6). Additional subjects using the same visualization format as… view at source ↗
Figure 28
Figure 28: DreamBench++ reference images and generated results. Each row corresponds to a single subject from DreamBench++ (Peng et al., 2024), with the leftmost column showing the reference image and the remaining columns showing images generated by FlexTok (Bachmann et al., 2025) using beam search guided by the DreamSim verifier (Fu et al., 2023). view at source ↗
Original abstract

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
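Of the three search strategies the abstract names, best-of-N is the simplest: draw several complete generations and keep the one the verifier scores highest. A minimal sketch (our own illustration, not the paper's code, with `generate` and `verify` as placeholder callables for an image sampler and an image-text verifier such as ImageReward):

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[], T],
              verify: Callable[[T], float],
              n: int) -> T:
    """Draw n complete candidate generations and return the one the
    verifier scores highest. No intermediate states are scored, so this
    baseline cannot exploit coarse-to-fine token ordering."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)
```

Beam search and lookahead search differ from this baseline precisely by scoring partial sequences during generation, which is where the coarse-to-fine ordering is claimed to matter.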

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that autoregressive (AR) models for image generation benefit from 1D coarse-to-fine ordered tokenizers rather than classical 2D grid structures, because intermediate prefixes carry semantic meaning that enables effective verifier-guided test-time search. In controlled experiments, AR models trained on coarse-to-fine tokens show improved scaling with search methods (best-of-N, beam search, lookahead search). The paper further shows that pure test-time search over token sequences (no trained AR model) can perform training-free text-to-image generation when guided by an image-text verifier. The work systematically examines interactions between token structure, search algorithms, verifiers, and AR priors.

Significance. If the central claims hold after addressing controls, the result would be significant for test-time scaling in generative models: it would demonstrate that token ordering can make intermediate states more amenable to external guidance, allowing search to substitute for or augment learned priors. The training-free generation result is a notable strength, as is the systematic ablation of search algorithms and verifiers. These findings could inform tokenizer design for efficient inference without retraining.

major comments (2)
  1. [§4 (Experiments)] The attribution of improved test-time scaling to the 1D coarse-to-fine ordering is not isolated from tokenizer-level differences. The reported comparisons train AR models on distinct tokenizers (coarse-to-fine 1D vs. 2D grid), which typically vary in codebook design, hierarchy, receptive field, and partial reconstruction fidelity. Any advantage in verifier-guided search could therefore stem from easier-to-evaluate intermediate states rather than ordering per se. A controlled ablation that holds the tokenizer fixed while varying only sequence order is needed to support the central claim.
  2. [§5.3 (Training-free generation)] The mechanism for pure test-time search without a trained AR model is underspecified. It is unclear how candidate token sequences are proposed or expanded in the absence of an autoregressive prior, and therefore how the coarse-to-fine property is actually leveraged beyond the verifier. This detail is load-bearing for the claim that ordered structure alone enables training-free generation.
minor comments (3)
  1. [Abstract] The abstract states that experiments are 'controlled,' yet the tokenizer confound noted above is not addressed; a brief clarification of what variables were held fixed would improve transparency.
  2. [Throughout] Notation for 'coarse-to-fine' versus '1D ordered' should be defined consistently on first use and used uniformly throughout.
  3. [Figures] Figure captions would benefit from explicit mention of the token structure and search method shown in each panel to aid quick reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments help clarify the scope of our claims regarding token ordering and test-time search. We address each major comment below and will revise the manuscript to strengthen the presentation.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The attribution of improved test-time scaling to the 1D coarse-to-fine ordering is not isolated from tokenizer-level differences. The reported comparisons train AR models on distinct tokenizers (coarse-to-fine 1D vs. 2D grid), which typically vary in codebook design, hierarchy, receptive field, and partial reconstruction fidelity. Any advantage in verifier-guided search could therefore stem from easier-to-evaluate intermediate states rather than ordering per se. A controlled ablation that holds the tokenizer fixed while varying only sequence order is needed to support the central claim.

    Authors: We agree that the current experiments compare models trained on distinct tokenizers and that this leaves open the possibility of confounding factors beyond ordering. While the coarse-to-fine property is tightly coupled to the tokenizer design in practice, we acknowledge that a stricter isolation would better support the central claim. In the revised manuscript we will add a controlled ablation that fixes the underlying codebook and token vocabulary while varying only the sequence ordering (raster versus coarse-to-fine traversal), allowing us to attribute performance differences more directly to ordering. revision: yes

  2. Referee: [§5.3 (Training-free generation)] The mechanism for pure test-time search without a trained AR model is underspecified. It is unclear how candidate token sequences are proposed or expanded in the absence of an autoregressive prior, and therefore how the coarse-to-fine property is actually leveraged beyond the verifier. This detail is load-bearing for the claim that ordered structure alone enables training-free generation.

    Authors: We thank the referee for highlighting this underspecification. In the training-free regime, partial sequences are expanded by sampling tokens from the codebook at each step; the coarse-to-fine ordering ensures that any prefix can be decoded into a semantically meaningful (low-resolution) image that the verifier can score. Selection among candidates is performed by a verifier-guided beam search that retains only high-scoring prefixes. We will expand §5.3 with a precise algorithmic description and pseudocode in the revision so that the procedure is fully reproducible. revision: yes
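The procedure described in this response can be sketched as follows. This is our own minimal reading of the rebuttal, not the authors' code: `decode` and `verify` are placeholders for a coarse-to-fine detokenizer (e.g., FlexTok's decoder) and an image-text verifier, and a uniform codebook proposal stands in for the absent AR prior.

```python
import random

def training_free_beam_search(codebook_size, seq_len, beam_width,
                              samples_per_beam, decode, verify, seed=0):
    """Verifier-guided beam search over a 1D ordered token vocabulary
    with no autoregressive model (sketch under stated assumptions)."""
    rng = random.Random(seed)
    beams = [[]]  # start from the empty prefix
    for _ in range(seq_len):
        # Expand every surviving prefix with uniformly sampled codebook
        # tokens; this uniform proposal replaces the AR prior.
        candidates = [prefix + [rng.randrange(codebook_size)]
                      for prefix in beams
                      for _ in range(samples_per_beam)]
        # Coarse-to-fine ordering means any prefix decodes to a
        # meaningful low-detail image, so the verifier can rank
        # partial sequences and prune the beam.
        candidates.sort(key=lambda p: verify(decode(p)), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-scoring full sequence
```

Even with a toy decoder and verifier this exhibits the behavior under study: the verifier alone steers the sequence toward high-scoring regions of the token space, with no learned prior involved.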

Circularity Check

0 steps flagged

No circularity: empirical hypothesis tested via controlled comparisons

full rationale

The paper advances a hypothesis that coarse-to-fine 1D token ordering improves test-time search amenability because intermediate prefixes carry semantic meaning, then validates it through experiments comparing AR models trained on distinct tokenizers and pure search without AR training. No equations, derivations, or first-principles results are presented that reduce to inputs by construction; there are no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations. The work is self-contained as an empirical study whose central claims rest on observable scaling behavior and generation quality rather than tautological reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about semantic evaluability of intermediate states; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate
    Explicitly stated as the root of the hypothesis in the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1226 out tokens · 40231 ms · 2026-05-10T11:03:36.432586+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

  1. [1] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., and Isola, P. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.

  2. [2] Silver, D., Hubert, T., Schrittwieser, J., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

  3. [3] Wang, B., Yue, Z., Zhang, F., et al. Selftok: Discrete visual tokens of autoregression, by diffusion, and for reasoning. 2025. URL https://arxiv.org/abs/2505.07538.