CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models

Bin Chen; Junhao Li; Shu-Tao Xia; Xinhao Zhong; Yaowei Wang; Yi Sun; Yuxia Qiao

arxiv: 2605.19750 · v1 · pith:PGKDF75Unew · submitted 2026-05-19 · 💻 cs.CV

CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models

Junhao Li , Xinhao Zhong , Yi sun , Yuxia Qiao , Bin Chen , Shu-Tao Xia , Yaowei Wang This is my paper

Pith reviewed 2026-05-20 06:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords continual personalizationvisual autoregressive modelscatastrophic forgettingmulti-concept compositionconcept neuron selectiontext-to-image generation

0 comments

The pith

Gradient-based neuron selection lets visual autoregressive models learn new user concepts sequentially without erasing earlier ones and compose multiple concepts with spatial control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that adapts visual autoregressive models to handle evolving personalization requests. It shows that selecting and freezing only the neurons most relevant to each new concept, while leaving others free, reduces forgetting across long sequences of user-specific styles. A second component uses spatial conditions to fuse features from multiple personalized concepts during generation so that attributes remain consistent and disentangled. If these mechanisms work as described, users could keep adding personal concepts indefinitely and still generate accurate combined scenes without retraining the entire model or storing past examples.

Core claim

The authors introduce Gradient-based Concept Neuron Selection (GCNS) that identifies concept-relevant neurons via gradients and constrains only the conflicting parameters across sequential tasks, together with a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion under spatial guidance; these two components together mitigate catastrophic forgetting in continual single-concept learning and reduce feature entanglement in multi-concept synthesis.

What carries the argument

Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks.

If this is right

Long sequences of personalization tasks become feasible without replay buffers or model expansion.
Multi-concept images can be synthesized with explicit spatial control over each concept's placement and attributes.
The same neuron-selection principle could be applied to other autoregressive or diffusion-based generators that suffer from sequential interference.
Generation time remains comparable to the base VAR model because no additional parameters are stored per concept.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If GCNS scales to open-ended user streams, personal image models could evolve continuously on-device without cloud retraining.
The spatial fusion mechanism might generalize to video or 3-D generation where concepts must occupy consistent locations across frames.
Testing whether the selected neurons remain stable when the base model is later fine-tuned on unrelated data would reveal limits of the method.

Load-bearing premise

Identifying a small set of concept-relevant neurons through gradients and freezing only the parameters that conflict with prior tasks is enough to block forgetting for any sequence of user concepts.

What would settle it

A sequence of ten or more distinct user concepts learned one after another; after the final concept is added, early concepts should still generate accurate images when prompted, without visible degradation or mixing of later attributes.

Figures

Figures reproduced from arXiv: 2605.19750 by Bin Chen, Junhao Li, Shu-Tao Xia, Xinhao Zhong, Yaowei Wang, Yi Sun, Yuxia Qiao.

**Figure 2.** Figure 2: Overall framework. (a) Gradient-based Concept Neuron Selection (GCNS) resolves catastrophic forgetting and enable continual personalization. (b) Context-aware Composition Strategy addresses the challenge of concept neglect in multi-concept image synthesis. failing to accommodate evolving user demands. Within the proposed Continual Personalized and Compositional Generation framework in VAR (CPC-VAR), we dec… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of single-concept customization with different baselines. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of multi-concept customization, where Main Prompt indicates the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study on the intervention scale [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of custom style transfer [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

read the original abstract

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CPC-VAR applies gradient neuron selection and localized attention fusion to continual personalization in VAR models, but the approach rests on an untested assumption about limited neuron overlap.

read the letter

The paper's core move is to treat sequential user concepts in VAR generation as a continual learning problem. GCNS picks concept-relevant neurons with gradients and constrains only the conflicting parameters to limit forgetting, while the composition side uses multi-branch features plus spatial-guided cross-attention to reduce entanglement. This combination has not been laid out for VAR before, so the specific framing and engineering choices count as new even if the building blocks are familiar from other continual learning and attention work. It does a clean job naming the two practical pain points—long-sequence forgetting and inconsistent multi-concept outputs—and offers solutions that avoid replay buffers or model growth, which matters for deployment where compute is tight. The framing is straightforward and the motivation is grounded in real user scenarios like evolving creative tools. The main soft spot is the central assumption that gradient-selected neurons will stay sufficiently disjoint across arbitrary sequences. If overlap is high, simply constraining the conflicting subset may not block cumulative interference in the shared autoregressive backbone, and the abstract gives no numbers on overlap, drift after five or more tasks, or an ablation that removes the constraint step. Without those checks the superiority claims are hard to evaluate. The work is aimed at people building or studying personalized generative systems who already know VAR and want ideas for sequential adaptation. A reader focused on efficient continual methods in vision models could pull useful implementation details even if the results need more scrutiny. I would send it to peer review so referees can examine the actual experiments, controls, and any overlap analysis that may be in the full text.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CPC-VAR, a unified framework for continual personalized and compositional generation in Visual Autoregressive (VAR) models. It addresses catastrophic forgetting in sequential single-concept learning via Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters without model expansion or replay. For multi-concept synthesis, it proposes a context-aware composition strategy using multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions. The authors claim that extensive experiments demonstrate significant improvements in long-sequence continual personalization and superior results in multi-concept image synthesis over existing baselines.

Significance. If the central claims hold, this work would be significant for enabling scalable continual personalization in efficient VAR architectures without requiring model growth or buffers, potentially advancing controllable multi-concept generation for evolving user demands in text-to-image systems.

major comments (2)

[§3.1] §3.1 (GCNS description): The claim that constraining only conflicting parameters after gradient-based neuron selection suffices to prevent catastrophic forgetting across arbitrary sequences rests on the unverified premise that selected neurons remain sufficiently disjoint; no explicit measurement of neuron overlap or cumulative parameter drift over sequences longer than 5 tasks is provided, which is load-bearing for the no-expansion/no-replay assertion.
[§4] §4 (Experiments): The reported superiority in long-sequence continual personalization lacks ablations that isolate the contribution of the parameter-constraint step (e.g., GCNS without constraints) and does not report retention metrics or statistical significance across task sequences, undermining verification of the mitigation effectiveness.

minor comments (2)

[Abstract] Abstract: The summary of results refers to 'extensive experiments' and 'existing baselines' without naming datasets, metrics, or the number of tasks/sequences evaluated, reducing clarity for readers.
[§3.2] Notation in §3.2: The context-aware composition strategy would benefit from an explicit equation or pseudocode for the localized cross-attention fusion to clarify how spatial conditions guide the multi-branch modeling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and outline the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [§3.1] §3.1 (GCNS description): The claim that constraining only conflicting parameters after gradient-based neuron selection suffices to prevent catastrophic forgetting across arbitrary sequences rests on the unverified premise that selected neurons remain sufficiently disjoint; no explicit measurement of neuron overlap or cumulative parameter drift over sequences longer than 5 tasks is provided, which is load-bearing for the no-expansion/no-replay assertion.

Authors: We agree that explicit measurements of neuron overlap and cumulative parameter drift would provide stronger empirical support for the premise that selected neurons remain sufficiently disjoint. In the revised manuscript, we will add an analysis section with visualizations and quantitative metrics of neuron overlap across tasks as well as cumulative parameter drift for sequences of up to 10 tasks. revision: yes
Referee: [§4] §4 (Experiments): The reported superiority in long-sequence continual personalization lacks ablations that isolate the contribution of the parameter-constraint step (e.g., GCNS without constraints) and does not report retention metrics or statistical significance across task sequences, undermining verification of the mitigation effectiveness.

Authors: We concur that isolating the parameter-constraint component and adding retention metrics with statistical significance would improve verification. We will include the requested ablations (GCNS without constraints) and report retention metrics together with standard deviations and significance tests over multiple random task sequences in the updated experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: method components and claims rest on experimental validation rather than self-referential definitions or fitted inputs

full rationale

The abstract and provided text introduce GCNS for neuron selection and a context-aware composition strategy as novel components to mitigate forgetting and enable disentangled synthesis in VAR models. No equations, derivations, or parameter-fitting steps are described that reduce claimed performance gains to inputs by construction. The central claims are supported by reference to extensive experiments comparing against baselines, without load-bearing self-citations, uniqueness theorems from prior author work, or renaming of known results as new derivations. The approach is presented as building on standard neural network techniques with added constraints, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from continual learning and transformer attention literature; no new free parameters, axioms, or invented entities are introduced beyond algorithmic choices.

axioms (2)

domain assumption Gradient signals reliably identify concept-specific neurons without excessive noise from task interference
Invoked when GCNS selects neurons based on gradients across sequential tasks.
domain assumption Spatial conditions can be used to guide disentangled feature fusion without introducing new artifacts
Used to justify the context-aware composition strategy.

pith-pipeline@v0.9.0 · 5775 in / 1290 out tokens · 31600 ms · 2026-05-20T06:57:09.303815+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Ltotal = Lvar(θt) + λ∥Mreg ⊙ (θt − θold)∥2 2

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

[1]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Sae- hoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/ coyo-dataset, 2022

work page 2022
[2]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021
[3]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

work page 2023
[4]

Anydoor: Zero-shot object-level image customization

Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6593–6602, 2024

work page 2024
[5]

Fine-tuning visual autogressive models for subject-driven generation

Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, and Jae-Pil Heo. Fine-tuning visual autogressive models for subject-driven generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19174–19184, 2025

work page 2025
[6]

How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural Information Processing Systems, 37:130057–130083, 2024

Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman Khan, and Fahad S Khan. How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural Information Processing Systems, 37:130057–130083, 2024

work page 2024
[7]

Federated class-incremental learning

Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu, Xiao Wang, and Qi Zhu. Federated class-incremental learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10164–10173, 2022

work page 2022
[8]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation.arXiv preprint arXiv:2310.12508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Training-free structured diffusion guidance for compositional text-to-image synthesis.arXiv preprint arXiv:2212.05032, 2022

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis.arXiv preprint arXiv:2212.05032, 2022

work page arXiv 2022
[10]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Encoder-based domain tuning for fast personalization of text-to-image models.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023

Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. Encoder-based domain tuning for fast personalization of text-to-image models.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023

work page 2023
[12]

Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

work page 2023
[13]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025. 10

work page 2025
[14]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022
[15]

Identity decoupling for multi- subject personalization of text-to-image models.Advances in Neural Information Processing Systems, 37:100895–100937, 2024

Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi- subject personalization of text-to-image models.Advances in Neural Information Processing Systems, 37:100895–100937, 2024

work page 2024
[16]

Selectively informative description can reduce undesired embedding entanglements in text-to-image personalization

Jimyeong Kim, Jungwon Park, and Wonjong Rhee. Selectively informative description can reduce undesired embedding entanglements in text-to-image personalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8312–8322, 2024

work page 2024
[17]

Multi- concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023

work page 1931
[18]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.International journal of computer vision, 128(7):1956–1981, 2020

work page 1956
[19]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

work page 2023
[21]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023

work page 2023
[22]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017
[23]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

Di- rected diffusion: Direct control of object placement through attention guidance

Wan-Duo Kurt Ma, Avisek Lahiri, John P Lewis, Thomas Leung, and W Bastiaan Kleijn. Di- rected diffusion: Direct control of object placement through attention guidance. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 4098–4106, 2024

work page 2024
[25]

Lego: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models.arXiv preprint arXiv:2311.13833, 2023

Saman Motamed, Danda Pani Paudel, and Luc Van Gool. Lego: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models.arXiv preprint arXiv:2311.13833, 2023

work page arXiv 2023
[26]

Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization

Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, and Seunggyu Chang. Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8100–8110, 2024

work page 2024
[27]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[28]

Orthogonal adaptation for modular customization of diffusion models

Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wetzstein. Orthogonal adaptation for modular customization of diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7964–7973, 2024

work page 2024
[29]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 11

work page 2021
[30]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

icarl: Incremental classifier and representation learning

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

work page 2001
[32]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[33]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023
[34]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

A survey of multimodal-guided image editing with text-to-image diffusion models

X Shuai, H Ding, X Ma, R Tu, YG Jiang, and D Tao. A survey of multimodal-guided image editing with text-to-image diffusion models. arxiv 2024.arXiv preprint arXiv:2406.14555

work page arXiv 2024
[36]

SmoothGrad: removing noise by adding noise

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smooth- grad: removing noise by adding noise.arXiv preprint arXiv:1706.03825, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

work page arXiv 2023
[38]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

work page 2024
[39]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

work page 2017
[40]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

work page 2096
[41]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

work page 2023
[42]

Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models.IEEE Transactions on Image Processing, 34:8145–8158, 2025

Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Xiaofei He, et al. Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models.IEEE Transactions on Image Processing, 34:8145–8158, 2025

work page 2025
[43]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Attention calibration for disentan- gled text-to-image personalization

Yanbing Zhang, Mengping Yang, Qin Zhou, and Zhe Wang. Attention calibration for disentan- gled text-to-image personalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4764–4774, 2024

work page 2024
[45]

Motiondirector: Motion customization of text-to-video diffusion models

Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng Shou. Motiondirector: Motion customization of text-to-video diffusion models. InEuropean Conference on Computer Vision, pages 273–290. Springer, 2024. 12

work page 2024
[46]

Image and video tokenization with binary spherical quantization.arXiv preprint arXiv:2406.07548, 2024

Yue Zhao, Yuanjun Xiong, and Philipp Krähenbühl. Image and video tokenization with binary spherical quantization.arXiv preprint arXiv:2406.07548, 2024

work page arXiv 2024
[47]

Closing the safety gap: Surgical concept erasure in visual autoregressive models.arXiv preprint arXiv:2509.22400, 2025

Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Xuan Wang, and Ke Xu. Closing the safety gap: Surgical concept erasure in visual autoregressive models.arXiv preprint arXiv:2509.22400, 2025. A Technical appendices and supplementary material A.1 Baseline method ARBooth[5]We clone the code base of ARBooth from official GitHub...

work page arXiv 2025
[48]

Orthogonal Adaptation[28]We adapt Orthogonal Adaption to the Infinity model by ourselves because official code is not available

During the inference phase, we employ Elastic Weight Aggregation to fuse the LoRA weights across distinct tasks. Orthogonal Adaptation[28]We adapt Orthogonal Adaption to the Infinity model by ourselves because official code is not available. We use the randomized orthogonal basis, which is consistent with the paper. The learning rate of text embedding is ...

work page

[1] [1]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Sae- hoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/ coyo-dataset, 2022

work page 2022

[2] [2]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021

[3] [3]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

work page 2023

[4] [4]

Anydoor: Zero-shot object-level image customization

Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6593–6602, 2024

work page 2024

[5] [5]

Fine-tuning visual autogressive models for subject-driven generation

Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, and Jae-Pil Heo. Fine-tuning visual autogressive models for subject-driven generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19174–19184, 2025

work page 2025

[6] [6]

How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural Information Processing Systems, 37:130057–130083, 2024

Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman Khan, and Fahad S Khan. How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural Information Processing Systems, 37:130057–130083, 2024

work page 2024

[7] [7]

Federated class-incremental learning

Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu, Xiao Wang, and Qi Zhu. Federated class-incremental learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10164–10173, 2022

work page 2022

[8] [8]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation.arXiv preprint arXiv:2310.12508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Training-free structured diffusion guidance for compositional text-to-image synthesis.arXiv preprint arXiv:2212.05032, 2022

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis.arXiv preprint arXiv:2212.05032, 2022

work page arXiv 2022

[10] [10]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Encoder-based domain tuning for fast personalization of text-to-image models.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023

Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. Encoder-based domain tuning for fast personalization of text-to-image models.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023

work page 2023

[12] [12]

Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

work page 2023

[13] [13]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025. 10

work page 2025

[14] [14]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022

[15] [15]

Identity decoupling for multi- subject personalization of text-to-image models.Advances in Neural Information Processing Systems, 37:100895–100937, 2024

Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Identity decoupling for multi- subject personalization of text-to-image models.Advances in Neural Information Processing Systems, 37:100895–100937, 2024

work page 2024

[16] [16]

Selectively informative description can reduce undesired embedding entanglements in text-to-image personalization

Jimyeong Kim, Jungwon Park, and Wonjong Rhee. Selectively informative description can reduce undesired embedding entanglements in text-to-image personalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8312–8322, 2024

work page 2024

[17] [17]

Multi- concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023

work page 1931

[18] [18]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.International journal of computer vision, 128(7):1956–1981, 2020

work page 1956

[19] [19]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

work page 2023

[21] [21]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023

work page 2023

[22] [22]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017

[23] [23]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

Di- rected diffusion: Direct control of object placement through attention guidance

Wan-Duo Kurt Ma, Avisek Lahiri, John P Lewis, Thomas Leung, and W Bastiaan Kleijn. Di- rected diffusion: Direct control of object placement through attention guidance. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 4098–4106, 2024

work page 2024

[25] [25]

Lego: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models.arXiv preprint arXiv:2311.13833, 2023

Saman Motamed, Danda Pani Paudel, and Luc Van Gool. Lego: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models.arXiv preprint arXiv:2311.13833, 2023

work page arXiv 2023

[26] [26]

Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization

Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, and Seunggyu Chang. Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8100–8110, 2024

work page 2024

[27] [27]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023

[28] [28]

Orthogonal adaptation for modular customization of diffusion models

Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wetzstein. Orthogonal adaptation for modular customization of diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7964–7973, 2024

work page 2024

[29] [29]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 11

work page 2021

[30] [30]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[31] [31]

icarl: Incremental classifier and representation learning

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

work page 2001

[32] [32]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[33] [33]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023

[34] [34]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[35] [35]

A survey of multimodal-guided image editing with text-to-image diffusion models

X Shuai, H Ding, X Ma, R Tu, YG Jiang, and D Tao. A survey of multimodal-guided image editing with text-to-image diffusion models. arxiv 2024.arXiv preprint arXiv:2406.14555

work page arXiv 2024

[36] [36]

SmoothGrad: removing noise by adding noise

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smooth- grad: removing noise by adding noise.arXiv preprint arXiv:1706.03825, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[37] [37]

Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

work page arXiv 2023

[38] [38]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

work page 2024

[39] [39]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

work page 2017

[40] [40]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

work page 2096

[41] [41]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

work page 2023

[42] [42]

Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models.IEEE Transactions on Image Processing, 34:8145–8158, 2025

Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Xiaofei He, et al. Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models.IEEE Transactions on Image Processing, 34:8145–8158, 2025

work page 2025

[43] [43]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[44] [44]

Attention calibration for disentan- gled text-to-image personalization

Yanbing Zhang, Mengping Yang, Qin Zhou, and Zhe Wang. Attention calibration for disentan- gled text-to-image personalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4764–4774, 2024

work page 2024

[45] [45]

Motiondirector: Motion customization of text-to-video diffusion models

Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng Shou. Motiondirector: Motion customization of text-to-video diffusion models. InEuropean Conference on Computer Vision, pages 273–290. Springer, 2024. 12

work page 2024

[46] [46]

Image and video tokenization with binary spherical quantization.arXiv preprint arXiv:2406.07548, 2024

Yue Zhao, Yuanjun Xiong, and Philipp Krähenbühl. Image and video tokenization with binary spherical quantization.arXiv preprint arXiv:2406.07548, 2024

work page arXiv 2024

[47] [47]

Closing the safety gap: Surgical concept erasure in visual autoregressive models.arXiv preprint arXiv:2509.22400, 2025

Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Xuan Wang, and Ke Xu. Closing the safety gap: Surgical concept erasure in visual autoregressive models.arXiv preprint arXiv:2509.22400, 2025. A Technical appendices and supplementary material A.1 Baseline method ARBooth[5]We clone the code base of ARBooth from official GitHub...

work page arXiv 2025

[48] [48]

Orthogonal Adaptation[28]We adapt Orthogonal Adaption to the Infinity model by ourselves because official code is not available

During the inference phase, we employ Elastic Weight Aggregation to fuse the LoRA weights across distinct tasks. Orthogonal Adaptation[28]We adapt Orthogonal Adaption to the Infinity model by ourselves because official code is not available. We use the randomized orthogonal basis, which is consistent with the paper. The learning rate of text embedding is ...

work page