arxiv: 2606.13558 · v1 · pith:SH4ZJM6Anew · submitted 2026-06-11 · 💻 cs.CV · cs.CL

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Pith reviewed 2026-06-27 06:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords text-guided image editing · visual autoregressive models · bitwise residual editing · training-free editing · localization mask · Bernoulli projection · PIE-Bench benchmark

The pith

BitResEdit couples per-bit guidance with masked residual re-injection to localize edits in visual autoregressive models without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BitResEdit, a training-free approach for text-guided editing of images from bitwise-residual visual autoregressive generators such as Infinity. It tilts per-bit log-odds predictions toward a target prompt while staying close to the original sampler via a Bernoulli-KL projection, then converts the changes into scale-specific residuals that are masked and added back through the model's native sum-of-scales code field. This uses the exact additive structure of the residuals to keep background regions unchanged while applying localized adjustments. A sympathetic reader would care because the method reports stronger prompt alignment on PIE-Bench than prior same-backbone editors, with competitive background fidelity and no requirement to modify or retrain the generator.

Core claim

BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it.
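The bit-guidance step can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the additive form of the tilt, the direction of the KL divergence, and the per-bit step shrinking are all assumptions made for illustration. The paper describes its projection as closed-form; the bisection below is a simple numerical stand-in. The default values (`eta0=3`, `delta_kl=1.0`, `lim=7`, `n_bisect=4`) are borrowed from the appendix excerpt quoted further down this page, though how the paper uses them may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_kl(q, p, eps=1e-12):
    """Elementwise KL(Bern(q) || Bern(p)) for probabilities q, p."""
    q = np.clip(q, eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))

def eta_schedule(num_scales, eta0=3.0):
    """Hypothetical per-scale guidance strength, annealed linearly from eta0 to 0."""
    return np.linspace(eta0, 0.0, num_scales)

def tilt_and_project(l_cfg, l_tgt, l_src, eta, delta_kl=1.0, lim=7.0, n_bisect=4):
    """Tilt post-CFG log-odds along (target - source), then shrink the step
    per bit until the tilted Bernoulli stays within delta_kl of the clean one.

    All inputs are arrays of per-bit log-odds with the same shape.
    """
    l_cfg = np.clip(l_cfg, -lim, lim)
    step = eta * (l_tgt - l_src)          # assumed source-negative contrast
    p = sigmoid(l_cfg)                    # clean CFG sampler probabilities

    def feasible(scale):
        q = sigmoid(np.clip(l_cfg + scale * step, -lim, lim))
        return bernoulli_kl(q, p) <= delta_kl

    full_ok = feasible(np.ones_like(l_cfg))
    lo = np.zeros_like(l_cfg)             # scale 0 gives KL 0, always feasible
    hi = np.ones_like(l_cfg)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        ok = feasible(mid)
        lo = np.where(ok, mid, lo)
        hi = np.where(ok, hi, mid)
    scale = np.where(full_ok, 1.0, lo)    # keep the full step when it already fits
    return np.clip(l_cfg + scale * step, -lim, lim)
```

Bisection is valid here because the KL divergence grows monotonically as the step is scaled up along a fixed direction, so the largest feasible scale per bit is well defined. Bits for which source and target agree receive a zero step and are left at their clean CFG log-odds.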

What carries the argument

BitResEdit, which tilts per-bit log-odds with source-negative contrast, projects via Bernoulli-KL, and re-injects masked residuals through the sum-of-scales code field.

If this is right

  • Stronger text alignment on the edited region than token-stream or feature-based VAR editors on the same backbone.
  • Masked-out regions of the latent code are preserved exactly, because the additive code arithmetic adds no residual outside the mask.
  • Bit-level guidance and residual-level masking play complementary roles, with each improving a different aspect of edit quality.
  • The approach works on any bitwise-residual VAR generator without architecture changes or additional training.
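The exact-preservation point in the list above can be made concrete with a toy sum-of-scales code field. The sketch below is an illustration under stated assumptions, not the paper's ResEdit: `upsample_nearest`, `sum_of_scales`, and `res_edit` are hypothetical names, nearest-neighbour upsampling stands in for whatever interpolation the real generator uses, and the mask is applied at full code resolution rather than per scale.

```python
import numpy as np

def upsample_nearest(r, size):
    """Nearest-neighbour upsample of a (C, h, w) array to (C, H, W).

    Assumes H and W are integer multiples of h and w.
    """
    _, h, w = r.shape
    H, W = size
    return np.repeat(np.repeat(r, H // h, axis=1), W // w, axis=2)

def sum_of_scales(residuals, size):
    """Assemble the code field as the sum of upsampled per-scale residuals."""
    return sum(upsample_nearest(r, size) for r in residuals)

def res_edit(src_residuals, edit_residuals, mask, size):
    """Add masked per-scale residual differences onto the source code field.

    mask is an (H, W) array with 1 inside the edit region and 0 elsewhere.
    """
    code = sum_of_scales(src_residuals, size)
    for r_src, r_edit in zip(src_residuals, edit_residuals):
        delta = upsample_nearest(r_edit - r_src, size)
        code = code + mask[None] * delta   # zero contribution outside the mask
    return code
```

With a binary mask, every term added outside the edit region is exactly zero, so those code entries are bit-identical to the source code field. Inside the region, the result matches the fully edited code field up to floating-point rounding.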

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bit-tilting plus residual gating pattern could be tested on other hierarchical generative models that assemble outputs from additive components.
  • Because background preservation is exact rather than approximate, the method may reduce error accumulation in multi-step editing workflows.
  • Operating at the per-bit prediction level rather than after token sampling may allow finer control when the target region requires subtle attribute changes.
  • The closed-form projection step suggests similar trust-region techniques could be applied to other probabilistic heads in autoregressive generators.

Load-bearing premise

The method assumes that source-negative tilting of per-bit log-odds followed by closed-form Bernoulli-KL projection and masked residual re-injection will produce coherent localized edits without introducing artifacts or requiring any model modification or retraining.
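The three operations named in this premise can be written out compactly. What follows is a sketch under assumed notation, not the paper's own equations: $\ell^{\mathrm{cfg}}_j$, $\ell^{\mathrm{tgt}}_j$, and $\ell^{\mathrm{src}}_j$ are hypothetical names for the post-CFG, target-conditioned, and source-conditioned log-odds of bit $j$; $\eta$ is a guidance strength, $\delta$ a trust-region radius, $M_k$ a mask at scale $k$, $r_k$ a per-scale residual, and $\mathrm{up}(\cdot)$ an upsampling operator. The direction of the KL divergence and the placement of the mask are assumptions.

```latex
\begin{align}
\tilde{\ell}_j &= \ell^{\mathrm{cfg}}_j + \eta\,\bigl(\ell^{\mathrm{tgt}}_j - \ell^{\mathrm{src}}_j\bigr),
\qquad p_j = \sigma\bigl(\ell^{\mathrm{cfg}}_j\bigr),\quad q_j = \sigma\bigl(\tilde{\ell}_j\bigr), \\
\mathrm{KL}\bigl(\mathrm{Bern}(q_j)\,\|\,\mathrm{Bern}(p_j)\bigr)
  &= q_j \log\frac{q_j}{p_j} + (1-q_j)\log\frac{1-q_j}{1-p_j} \;\le\; \delta, \\
z^{\mathrm{edit}} &= \sum_{k=1}^{K} \mathrm{up}\bigl(r^{\mathrm{src}}_k\bigr)
  + \sum_{k=1}^{K} \mathrm{up}\bigl(M_k \odot (r^{\mathrm{edit}}_k - r^{\mathrm{src}}_k)\bigr).
\end{align}
```

Wherever every masked term is zero after upsampling, the second sum vanishes and $z^{\mathrm{edit}}$ equals the source code, which is the sense in which preservation is exact at the code level.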

What would settle it

Running BitResEdit on PIE-Bench images and finding either no gain in edited-region CLIP score over prior editors or visible background artifacts from the residual re-injection would show the central claim does not hold.
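The background half of this test can be approximated with a masked image comparison. The sketch below is a stand-in, not the PIE-Bench evaluator: `background_psnr` is a hypothetical helper that computes PSNR only over pixels outside the edit mask, assuming 8-bit images of shape (H, W, 3) and a mask of shape (H, W). Reported numbers should come from the benchmark's own evaluator.

```python
import numpy as np

def background_psnr(src, edited, mask, max_val=255.0):
    """PSNR between src and edited, restricted to pixels where mask == 0.

    Returns inf when the background pixels are identical.
    """
    bg = mask == 0
    if not bg.any():
        raise ValueError("mask covers the whole image; no background to compare")
    diff = src[bg].astype(np.float64) - edited[bg].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A finite value here does not by itself contradict exact preservation in code space, since a decoder may still let changes inside the mask influence nearby pixels; it only flags that the decoded background moved.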

Original abstract

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes BitResEdit, a training-free editor for bitwise-residual VAR models such as Infinity. BitEdit tilts post-CFG per-bit log-odds via source-negative guidance on a shared edited prefix then projects updates with closed-form Bernoulli-KL trust regions; ResEdit converts sampled bits to per-scale residuals, gates them by a localization mask, and re-injects via the model's native sum-of-scales. The central empirical claim is that on PIE-Bench with Infinity-2B this yields the strongest text alignment among same-backbone VAR editors (+1.07 CLIP on the edited region over the strongest prior) while remaining competitive on background preservation, with ablations indicating the two components are complementary.

Significance. If the reported gains are reproducible, the work demonstrates how to exploit two under-used native structures of bitwise-residual VAR generators (per-bit Bernoulli heads and additive multi-scale residual codes) to achieve exact background preservation by code arithmetic and localized edits without retraining or architectural changes. This is a concrete strength: the method is parameter-free at inference time and directly couples decision-time bit guidance with combination-time code composition. The approach could influence training-free editing pipelines for other autoregressive image models.

major comments (1)
  1. [Experiments] Experiments section (and abstract): the central performance claim of +1.07 CLIP improvement on the edited region and 'strongest text alignment' is presented without implementation details, statistical tests, error bars, number of runs, or full experimental protocol, rendering the quantitative result impossible to assess or reproduce from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater experimental transparency. We address the single major comment below and will incorporate the requested details in the revised manuscript.

  1. Referee: [Experiments] Experiments section (and abstract): the central performance claim of +1.07 CLIP improvement on the edited region and 'strongest text alignment' is presented without implementation details, statistical tests, error bars, number of runs, or full experimental protocol, rendering the quantitative result impossible to assess or reproduce from the given text.

    Authors: We agree that the current presentation of the +1.07 CLIP gain lacks the necessary supporting details for full assessment and reproducibility. In the revision we will expand the Experiments section (and add a dedicated reproducibility subsection) with: (i) the complete evaluation protocol on PIE-Bench, including exact prompt templates, mask generation procedure, and inference hyperparameters for BitEdit and ResEdit; (ii) the number of independent runs (three random seeds), mean and standard-deviation error bars on all reported CLIP scores, and any statistical significance tests performed; (iii) the precise implementation of the Bernoulli-KL projection and gated residual injection steps. These additions will make the quantitative claims directly verifiable while preserving the existing results.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents BitResEdit as a training-free procedure that directly manipulates the native per-bit Bernoulli heads and additive multi-scale residual code field of an existing VAR generator (Infinity). The abstract and method description contain no fitted parameters that are later renamed as predictions, no self-definitional equations, and no load-bearing self-citations whose content reduces to the present claims. The reported +1.07 CLIP improvement is framed as an empirical outcome under the stated protocol rather than a quantity forced by construction from the method's own inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; no explicit free parameters, invented physical entities, or non-standard axioms are stated.

axioms (1)
  • [domain assumption] VAR generators such as Infinity possess a per-bit Bernoulli prediction head and an additive multi-scale residual code field that can be accessed at inference time.
    Abstract describes these as native structures underused by existing editors.
invented entities (1)
  • BitResEdit (no independent evidence)
    purpose: Training-free editor that couples bit guidance with residual code composition
    Method introduced in the abstract

pith-pipeline@v0.9.1-grok · 5814 in / 1255 out tokens · 31871 ms · 2026-06-27T06:49:54.646822+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 5 linked inside Pith

  1. [1] Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobari, Faez Ahmed, Han Zhang, Viet Anh Nguyen, and Dimitris Metaxas. Discrete noise inversion for next-scale autoregressive text-based image editing. arXiv preprint arXiv:2509.01984.

  2. [2] Amir El-Ghoussani, Marc Hölle, Gustavo Carneiro, and Vasileios Belagiannis. Prompt-guided image editing with masked logit nudging in visual autoregressive models. arXiv preprint arXiv:2604.14591.

  3. [3] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.

  4. [4] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

  5. [5] Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, and Zequn Jie. Flexvar: Flexible visual autoregressive modeling without residual prediction. arXiv preprint arXiv:2502.20313.

  6. [6] Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Infinitystar: Unified spacetime autoregressive modeling for visual generation. arXiv preprint arXiv:2511.04675, 2025a. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Le...

  7. [7] Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, and Igor Gilitschenski. S2edit: Text-guided image editing with precise semantic and spatial control. arXiv preprint arXiv:2507.04584, 2025b. Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, and Mengyu Wang. VAGS: Velocity adaptive guidance scale for image editing ...

  8. [8] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 2063–2072, February

  9. [9] Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Flowar: Scale-wise autoregressive image generation meets flow matching. arXiv preprint arXiv:2412.15205, 2024a. Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, and Cihang Xie. M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation....

  10. [10] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

  11. [11] Chuanming Tang, Kai Wang, Fei Yang, and Joost van de Weijer. LocInv: Localization-aware inversion for text-guided image editing. arXiv preprint arXiv:2405.01496.

  12. [12] Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, and Lei Zhang. Rethinking structure preservation in text-guided image editing with visual autoregressive models. arXiv preprint arXiv:2603.28367.

  13. [13] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.

  14. [14] A Implementation Details. BitResEdit. All BitResEdit results in this paper (Table 1, the latency in Table 7, the per-category curves in Figure 3, and the reference rows of §3.4) use one configuration on the released Infinity-2B stack: the 2B transformer, the D=32-bit BSQ tokenizer/VAE, and the Flan-T5-XL text encoder, run training-free in bfloat16 with the K...

  15. [15] Outputs are scored on all 700 images by the standard PIE-Bench evaluator [Ju et al., 2024] at 512². Latency is wall-clock around the editing call only (image loading and saving excluded), with CUDA synchronization before and after, on a single NVIDIA A100-80GB, the same protocol as the other rows of Table

  16. [16] The mean is 26.65 s per image (median 26.64 s, p95 26.72 s); the cost is constant across categories because the recipe is fixed-step. VAGS. The VAGS latency in Table 7, its per-category curves in Figure 3, and its column in Figure 4 come from our run of the authors’ released implementation [Luo et al., 2026] (FlowEdit on Stable Diffusion 3.5 Large, fp16, 512²) o...

  17. [17] The mean over all 700 images is 18.69 s per image (per-shard means 18.53–18.78 s across eight category-balanced shards); the cost is constant across categories because the recipe is fixed-step. Note on the FlowChef row of Table 1. Following the footnote of Table 1, the FlowChef row quotes reported numbers; they originate from the evaluation in Susladkar et al. [...

  18. [18] (see the preceding note); and VAREdit-8B scores 0.65–1.15 CLIP points below its quoted row under this paper’s fixed 512² evaluator, while filling in the preservation metrics its paper does not report. Table 6. PIE-Bench results of the baseline runs we measure ourselves, with the BitResEdit run of Table 1 repeated for reference. Configurations are as described i...

  19. [19] and accumulates it into the prefix carried to the next scale. Defaults (Appendix A): η_k linearly annealed from η_0 = 3 to 0, δ_KL = 1.0, clamp lim = 7, α = 1, N = 4 bisection steps; the CFG scales, temperature τ, and top-k/top-p truncation follow our Infinity [Han et al., 2025] setup. Code-space tensors are C × H_z × W_z; per-bit log-odds are P_k × D. Algorithm 2 BitEdit: ...