Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Kangsheng Duan; Wenyu Liu; Xiaohu Ruan; Xiaoxin Chen; Xinggang Wang; Ziyang Xu

arxiv: 2606.19195 · v1 · pith:XBPRSZ5Nnew · submitted 2026-06-17 · 💻 cs.CV

Pith reviewed 2026-06-26 21:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords image inpainting · lightweight diffusion model · model compression · knowledge distillation · Local-λ Mix Interaction · efficiency · latent space

The pith

Moebius uses Local-λ Mix Interaction blocks and latent-space distillation to rival the inpainting quality of the 11.9B-parameter FLUX.1-Fill-Dev with a 0.22B model and more than 15× faster total inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a highly compressed diffusion backbone can overcome representation bottlenecks in image inpainting by summarizing spatial contexts and global semantics into fixed-size linear matrices. A sympathetic reader would care because current 10B-scale generalist models deliver strong results but remain too slow and expensive for practical use. The approach pairs the compressed architecture with an adaptive multi-granularity distillation strategy that operates entirely in latent space. According to the abstract, experiments on natural and portrait benchmarks show the resulting 0.22B model rivals or exceeds FLUX.1-Fill-Dev while using under 2% of the parameters.
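The size figures quoted on this page can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts stated here (0.22B for Moebius, 11.9B for FLUX.1-Fill-Dev, 0.86B for the PixelHacker teacher named in the Figure 4 caption); the variable names are illustrative.

```python
# Parameter counts in billions, as quoted on this page.
student_params_b = 0.22   # Moebius
flux_params_b = 11.9      # FLUX.1-Fill-Dev
teacher_params_b = 0.86   # PixelHacker (teacher, from the Figure 4 caption)

# Fraction of the generalist's parameters used by the student.
ratio_vs_flux = student_params_b / flux_params_b

# How many times smaller the student is than its teacher.
shrink_vs_teacher = teacher_params_b / student_params_b

print(f"student / FLUX parameters: {ratio_vs_flux:.2%}")
print(f"teacher / student parameters: {shrink_vs_teacher:.1f}x")
```

Both results are consistent with the "under 2%" and "4× smaller" wording, the latter being a rounding of roughly 3.9×.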

Core claim

By reconstructing the diffusion backbone with the Local-λ Mix Interaction (LλMI) block—composed of Local-λ and Interactive-λ modules that compress spatial contexts and global semantic priors into fixed-size linear matrices—Moebius preserves complex latent interactions. When combined with an adaptive multi-granularity distillation strategy that balances gradient-based losses strictly inside the latent space, the 0.22B model achieves inpainting quality that rivals or surpasses the 11.9B FLUX.1-Fill-Dev on natural and portrait benchmarks while delivering more than 15× faster total inference time.

What carries the argument

The Local-λ Mix Interaction (LλMI) block, which uses Local-λ and Interactive-λ modules to summarize spatial contexts and global semantic priors into fixed-size linear matrices while preserving latent interactions for inpainting.
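To make "summarize into a fixed-size linear matrix" concrete, here is a minimal NumPy sketch of the content lambda from LambdaNetworks (Bello, reference [1] in the list below). It is an illustration of the general mechanism under stated assumptions, not the paper's Local-λ or Interactive-λ implementation: the function names, the projection matrices `w_q`, `w_k`, `w_v`, and every dimension are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_summary(context, w_k, w_v):
    """Summarize m context vectors into one (k, v) matrix."""
    keys = softmax(context @ w_k, axis=0)  # (m, k), normalized over positions
    values = context @ w_v                 # (m, v)
    return keys.T @ values                 # (k, v): size does not depend on m

def apply_lambda(inputs, lam, w_q):
    """Apply the summary to each query as a plain linear map."""
    queries = inputs @ w_q                 # (n, k)
    return queries @ lam                   # (n, v)

rng = np.random.default_rng(0)
d, k, v = 16, 8, 16                        # hypothetical channel sizes
w_q, w_k, w_v = (rng.standard_normal((d, s)) * 0.1 for s in (k, k, v))

small = rng.standard_normal((64, d))       # e.g. an 8x8 latent grid, flattened
large = rng.standard_normal((4096, d))     # e.g. a 64x64 latent grid, flattened

lam_small = lambda_summary(small, w_k, w_v)
lam_large = lambda_summary(large, w_k, w_v)
out = apply_lambda(large, lam_large, w_q)

print(lam_small.shape, lam_large.shape, out.shape)
```

The summary has the same shape for 64 and 4096 positions, whereas an attention map over the larger grid would need 4096 × 4096 entries. Whether such a summary keeps enough detail for inpainting is exactly the premise the page flags as load-bearing.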

If this is right

High-fidelity inpainting becomes practical on devices with limited compute and memory.
Task-specific specialists can deliver performance comparable to much larger generalist models in targeted domains.
Latent-space multi-granularity distillation enables high-fidelity alignment without pixel-space decoding costs.
The efficiency gains set a new standard that other inpainting systems can be measured against.
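The page does not say how the paper balances its distillation losses, so the following is only a sketch of one common scheme: compute two latent-space losses at different granularities and rescale them so their gradients on a shared parameter have equal norm. The toy linear student, the random teacher targets, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))         # toy noisy latents (batch, channels)
w1 = rng.standard_normal((8, 8)) * 0.1   # student layer 1 (shared parameter)
w2 = rng.standard_normal((8, 8)) * 0.1   # student layer 2
feat_t = rng.standard_normal((32, 8))    # stand-in teacher intermediate latents
out_t = rng.standard_normal((32, 8))     # stand-in teacher output latents

# Student forward pass, entirely in latent space (no decoding to pixels).
feat_s = x @ w1
out_s = feat_s @ w2

# Two granularities: intermediate features and final latent prediction.
loss_feat = np.mean((feat_s - feat_t) ** 2)
loss_out = np.mean((out_s - out_t) ** 2)

# Analytic gradients of each loss with respect to the shared weights w1.
g_feat = x.T @ (2.0 * (feat_s - feat_t) / feat_s.size)
g_out = x.T @ ((2.0 * (out_s - out_t) / out_s.size) @ w2.T)

# Balance: scale each loss so its gradient norm equals the mean norm.
# The weights are treated as constants (no gradient flows through them).
norms = np.array([np.linalg.norm(g_feat), np.linalg.norm(g_out)])
weights = norms.mean() / (norms + 1e-12)

total_loss = weights[0] * loss_feat + weights[1] * loss_out
balanced_norms = weights * norms

print("raw gradient norms:     ", norms)
print("balanced gradient norms:", balanced_norms)
```

Recomputing the weights at every step keeps either objective from dominating the update as the losses change scale during training. A real implementation would obtain the gradients from an autodiff framework rather than by hand.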

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The matrix-summarization technique could extend to other diffusion tasks such as outpainting or conditional generation.
Fixed-size linear representations may reduce memory footprint in additional vision backbones beyond inpainting.
The approach suggests a route toward even smaller models if the distillation balance can be further tuned.

Load-bearing premise

The Local-λ and Interactive-λ modules can compress spatial and semantic information into fixed-size linear matrices without losing the complex latent interactions required for high-fidelity inpainting.

What would settle it

Side-by-side visual or metric evaluation on the same natural and portrait inpainting benchmarks where Moebius produces visibly lower quality or lower scores than FLUX.1-Fill-Dev despite the reported parameter count and speed.
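A per-image, side-by-side comparison of this kind could be scripted roughly as follows. This page does not list the metrics the paper reports, so PSNR restricted to the masked region is used here purely as an example; `masked_psnr` is a hypothetical helper and the images are synthetic stand-ins, not benchmark data or model outputs.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR computed only over pixels where mask is nonzero (the hole)."""
    hole = mask.astype(bool)
    mse = np.mean((pred[hole] - target[hole]) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=np.uint8)
mask[8:24, 8:24] = 1                      # square hole in the image center

n_images = 20
wins_a = 0
for _ in range(n_images):
    target = rng.random((32, 32, 3))      # stand-in ground-truth image
    # Stand-ins for two models' outputs: A is closer to the target than B.
    pred_a = target + rng.normal(0.0, 0.05, target.shape)
    pred_b = target + rng.normal(0.0, 0.10, target.shape)
    wins_a += masked_psnr(pred_a, target, mask) > masked_psnr(pred_b, target, mask)

print(f"model A scores higher on {wins_a}/{n_images} images")
```

For generative inpainting, pixel-wise scores such as PSNR penalize plausible but different completions, so a real test would pair them with perceptual or distribution-level metrics and a user study.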

Figures

Figures reproduced from arXiv: 2606.19195 by Kangsheng Duan, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang, Ziyang Xu.

**Figure 1.** Overall pipeline of Moebius. We adopt the Latent Diffusion Model (LDM) [32] framework equipped with Latent Categories Guidance (LCG) [54]. To achieve extreme architectural efficiency, the denoising U-Net is systematically restructured using our proposed LλMI blocks (detailed in Sec. 3.2). Furthermore, an adaptive multi-granularity distillation strategy (Sec. 3.3) is applied during training to align our lig…

**Figure 3.** Illustration of local context aggregation (Local-λ) and cross-embedding interaction (Interactive-λ) in the latent domain. In both modules, λ efficiently summarizes either spatial contexts or the global prior E_LCG into a fixed-size linear matrix, bypassing memory-intensive attention calculations. … maps by summarizing contextual and semantic information into fixed-size linear matrices (denoted as λ), allowi…

**Figure 4.** Small feature spaces can still maintain high representation quality. Moebius (0.22B) exhibits highly similar activation maps to the teacher model, PixelHacker (0.86B), across multiple spatial granularities, demonstrating that it maintains consistent representational quality despite a severely compressed (4× smaller) architecture. This validates the optimal synergy between our lightweight design and t…

**Figure 5.** Qualitative comparison with SOTA academic and industrial meth…

**Figure 6.** User study of Moebius (0.22B) against teacher and 10B-level gener…

**Figure 7.** Real-World Object Removal. Moebius handles realistic masks with superior consistency compared to baselines. … balance. Only the holistic integration of all structural modifications (Exp ⑨) unlocks the optimal efficiency front, reconfirming that extreme compactness demands rigorous architectural synergy over naive module substitution. Dissection of Distillation Objectives. Furthermore, Tab. 5 dissects our…

**Figure 8.** More qualitative comparison on natural scenes (Places2).

**Figure 9.** More qualitative comparison on portrait scenes (CelebA-HQ and…

**Figure 10.** Qualitative comparison with commercial edit systems.

**Figure 11.** Failure case analysis. Compared with its teacher model (PixelHacker), Moebius may exhibit minor detail loss or less plausible textures in extremely tiny background regions when context is limited. These instances illustrate the capacity-efficiency trade-off of our lightweight specialist

Abstract

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a >15× acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Moebius claims a 0.22B inpainting model matches 11.9B quality via a new LλMI block and latent distillation, but the abstract shows no metrics to support it.

The new elements are the Local-λ Mix Interaction block, built from Local-λ and Interactive-λ modules that compress spatial contexts and semantic priors into fixed-size linear matrices, plus an adaptive multi-granularity distillation run entirely in latent space with dynamic gradient balancing. That specific pairing is not described in prior work mentioned in the abstract.

The approach makes sense on paper. Reconstructing the diffusion backbone to shed parameters while trying to keep latent interactions intact is a direct response to the cost problem with large generalist models. Keeping distillation in latent space avoids extra decoding overhead, which fits the efficiency goal.

The clear limitation is the absence of any numbers. The abstract states that experiments on natural and portrait benchmarks show rival or better quality and a speed-up of more than 15×, but supplies no FID, PSNR, ablation tables, or direct comparisons. Without those, it is impossible to judge whether the modules actually preserve the needed interactions or whether the distillation balances the losses effectively. The assumption that linear matrices can hold complex priors for high-fidelity results is plausible but untested in the given text.

This is for researchers focused on deploying inpainting on limited hardware. A reader looking for concrete efficiency techniques would find the design description useful if the full paper includes solid results.

It deserves peer review because the efficiency target is practical and the architecture is internally consistent, even though the evidence must be checked in the complete manuscript.

Referee Report

1 major / 1 minor

Summary. The paper introduces Moebius, a 0.22B-parameter lightweight image inpainting framework that reconstructs the diffusion backbone via the Local-λ Mix Interaction (LλMI) block (with Local-λ and Interactive-λ modules that compress spatial contexts and global semantic priors into fixed-size linear matrices) and pairs it with an adaptive multi-granularity distillation strategy operating in latent space. It claims this enables performance rivaling or surpassing the 11.9B FLUX.1-Fill-Dev model on natural and portrait benchmarks while using <2% of the parameters and delivering >15× faster inference.

Significance. If the performance claims are substantiated by rigorous experiments, the work would be significant for demonstrating that extreme compression of task-specific inpainting models can match large generalist foundation models without sacrificing fidelity, advancing practical deployment of high-quality generative inpainting under resource constraints. The latent-space distillation approach is a sensible design choice to avoid pixel-space costs.

major comments (1)

[Abstract] The central claim that Moebius 'rivals or even surpasses' FLUX.1-Fill-Dev with 0.22B parameters and >15× acceleration is load-bearing for the entire contribution, yet the abstract (and provided text) supplies no quantitative metrics, tables, ablation studies, baseline comparisons, or error analysis to support it, rendering verification of the claim impossible.

minor comments (1)

[Abstract] The notation LλMI and the λ modules are introduced without an equation or diagram in the abstract; a formal definition or pseudocode would improve clarity even if present later in the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.

Referee: [Abstract] The central claim that Moebius 'rivals or even surpasses' FLUX.1-Fill-Dev with 0.22B parameters and >15× acceleration is load-bearing for the entire contribution, yet the abstract (and provided text) supplies no quantitative metrics, tables, ablation studies, baseline comparisons, or error analysis to support it, rendering verification of the claim impossible.

Authors: The abstract is a concise high-level summary and conventionally omits specific numerical values, tables, or detailed ablations to respect length constraints. The full manuscript contains all requested elements in the Experiments section, including quantitative metrics and direct baseline comparisons against FLUX.1-Fill-Dev on natural and portrait benchmarks, ablation studies validating the LλMI block and distillation components, inference-time measurements confirming the acceleration factor, and supporting analysis. These sections enable verification of the performance claims.

revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and architecture description introduce the LλMI block (with Local-λ and Interactive-λ modules) and adaptive multi-granularity distillation as novel components that summarize contexts into linear matrices and align in latent space. No equations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce the performance claims (rivaling FLUX.1-Fill-Dev at 0.22B params) to definitional equivalence with the inputs. The derivation chain remains self-contained via empirical validation on benchmarks, with no load-bearing reductions of the kind enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, background axioms, or new postulated entities; the LλMI block is presented as an introduced module rather than an independently evidenced physical entity.

pith-pipeline@v0.9.1-grok · 5820 in / 1201 out tokens · 24199 ms · 2026-06-26T21:01:49.196798+00:00 · methodology

Reference graph

Works this paper leans on

71 extracted references · 34 canonical work pages · 13 internal anchors

[1] Bello, I.: Lambdanetworks: Modeling long-range interactions without attention. arXiv preprint arXiv:2102.08602 (2021)

[2] Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques. p. 417–424. SIGGRAPH ’00, ACM Press/Addison-Wesley Publishing Co., USA (2000). https://doi.org/10.1145/344779.344972

[3] Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

[4] Chen, H., Zhao, Y.: Don’t look into the dark: Latent codes for pluralistic image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7591–7600 (June 2024)

[5] Chen, P., Liu, S., Zhao, H., Jia, J.: Distilling knowledge via knowledge review (2021), https://arxiv.org/abs/2104.09044

[6] Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: International Conference on Learning Representations (ICLR) (2024)

[7] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35, 16344–16359 (2022)

[8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

[9] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis (2021), https://arxiv.org/abs/2012.09841

[10] Fortin, A., Vernade, G., Kampf, K., Reshi, A.: Introducing gemini 2.5 flash image, our state-of-the-art image model. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/ (2025)
[11] Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019)

[12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceeding...

[13] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015), https://arxiv.org/abs/1503.02531

[14] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

[15] Inc, M.P.: Papers with code. https://web.archive.org/web/20250621114958/https://paperswithcode.com/ (2025)

[16] Jordan, K., Jin, Y., Boza, V., Jiacheng, Y., Cesista, F., Newhouse, L., Bernstein, J.: Muon: An optimizer for hidden layers in neural networks (2024), https://kellerjordan.github.io/posts/muon/

[17] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion (2024)

[18] Kang, M., Zhang, R., Barnes, C., Paris, S., Kwak, S., Park, J., Shechtman, E., Zhu, J.Y., Park, T.: Distilling Diffusion Models into Conditional GANs. In: European Conference on Computer Vision (ECCV) (2024)

[19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(12), 4217–4228 (2021). https://doi.org/10.1109/TPAMI.2020.2970919

[20] Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)
[21] Li, J., Zhao, R., Huang, J.T., Gong, Y.: Learning small-size dnn with output-distribution-based criteria. In: Interspeech 2014. pp. 1910–1914 (2014). https://doi.org/10.21437/Interspeech.2014-432

[22] Li, R., Yang, T., Guo, S., Zhang, L.: Rorem: Training a robust object remover with human-in-the-loop. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14024–14035 (June 2025)

[23] Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., Jia, J.: Mat: Mask-aware transformer for large hole image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

[24] Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., Ren, J.: Rethinking vision transformers for mobilenet size and speed. In: Proceedings of the IEEE international conference on computer vision (2023)

[25] Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models (2025), https://arxiv.org/abs/2410.11081

[26] Manukyan, H., Sargsyan, A., Atanyan, B., Wang, Z., Navasardyan, S., Shi, H.: Hd-painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models. In: The Thirteenth International Conference on Learning Representations (2023)

[27] Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: Structure guided image inpainting using edge prediction. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (Oct 2019)

[28] Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[29] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis (2023), https://arxiv.org/abs/2307.01952

[30] Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F., Wang, W., Banbury, C., Ye, C., Akin, B., et al.: Mobilenetv4-universal models for the mobile ecosystem. arXiv preprint arXiv:2404.10518 (2024)
[31] Reda, F., Kontkanen, J., Tabellion, E., Sun, D., Pantofaru, C., Curless, B.: Film: Frame interpolation for large motion. In: European Conference on Computer Vision (ECCV) (2022)

[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)

[33] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets (2015), https://arxiv.org/abs/1412.6550

[34] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models (2022), https://arxiv.org/abs/2202.00512

[35] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)

[36] Sargsyan, A., Navasardyan, S., Xu, X., Shi, H.: Mi-gan: A simple baseline for image inpainting on mobile devices. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7335–7345 (October 2023)

[37] Shazeer, N.: Glu variants improve transformer (2020), https://arxiv.org/abs/2002.05202

[38] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015), https://arxiv.org/abs/1409.1556

[39] Song, H., Huang, S., Dong, Y., Tu, W.W.: Robustness and generalizability of deepfake detection: A study with diffusion models. arXiv (2023)

[40] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models (2023), https://arxiv.org/abs/2303.01469
[41] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161 (2021)

[42] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

[43] Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Cui, J., Ding, H., Dong, M., Du, A., Du, C., Du, D., Du, Y., Fan, Y., Feng, Y., Fu, K., Gao, B., Gao, H., Gao, P., Gao, T., Gu, X., Guan, L., Guo, H., Guo, J., Hu, H., Hao, X., He, T., He, W., He, W., Hong, C., Hu, Y., Hu, Z., Huang, W., Huang, Z., ...

[44] Tero Karras, Timo Aila, S.L., Lehtinen, J.: Progressive growing of gans for improved quality, stability and variation. In: International Conference on Learning Representations (ICLR) (2018)

[45] Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation (2022), https://arxiv.org/abs/1910.10699

[46] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

[47] Wan, Z., Zhang, J., Chen, D., Liao, J.: High-fidelity pluralistic image completion with transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4672–4681 (2021). https://doi.org/10.1109/ICCV48922.2021.00465

[48] Wang, C., Chen, D., Mei, J.P., Zhang, Y., Feng, Y., Chen, C.: Semckd: Semantic calibration for cross-layer knowledge distillation. IEEE Transactions on Knowledge and Data Engineering 35(6), 6305–6319 (2023). https://doi.org/10.1109/TKDE.2022.3171571

[49] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)

[50] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...
[51] Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024), https://arxiv.org/abs/2410.10629

[52] Xie, E., Chen, J., Zhao, Y., Yu, J., Zhu, L., Lin, Y., Zhang, Z., Li, M., Chen, J., Cai, H., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer (2025), https://arxiv.org/abs/2501.18427

[53] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems (NeurIPS) (2021)

[54] Xu, Z., Duan, K., Shen, X., Ding, Z., Liu, W., Ruan, X., Chen, X., Wang, X.: Pixelhacker: Image inpainting with structural and semantic consistency (2025), https://arxiv.org/abs/2504.20438

[55] Xu, Z., Zhao, H., Cui, Z., Liu, W., Zheng, C., Wang, X.: Most-dsa: Modeling motion and structural interactions for direct multi-frame interpolation in dsa images. In: European Conference on Artificial Intelligence. pp. 537–544 (2024), https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA240531

[56] Xu, Z., Zhao, H., Liu, W., Wang, X.: Garamost: Parallel multi-granularity motion and structural modeling for efficient multi-frame interpolation in dsa images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 28530–28538 (2025)

[57] Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)

[58] Yang, S., Wang, B., Shen, Y., Panda, R., Kim, Y.: Gated linear attention transformers with hardware-efficient training. In: Proceedings of ICML (2024)

[59] Yang, S., Zhang, Y.: Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism (Jan 2024), https://github.com/fla-org/flash-linear-attention

[60] Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
[61] Yi, Z., Tang, Q., Azizi, S., Jang, D., Xu, Z.: Contextual residual aggregation for ultra high-resolution image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7508–7517 (2020)

[62] Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7130–7138 (2017). https://doi.org/10.1109/CVPR.2017.754

[63] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.: Free-form image inpainting with gated convolution. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4470–4479 (2019). https://doi.org/10.1109/ICCV.2019.00457

[64] Zeng, Y., Fu, J., Chao, H., Guo, B.: Aggregated contextual transformations for high-resolution image inpainting. IEEE Transactions on Visualization and Computer Graphics 29(7), 3266–3280 (2023). https://doi.org/10.1109/TVCG.2022.3156949

[65] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

[66] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6848–6856 (2018)

[67] Zhao, S., Cui, J., Sheng, Y., Dong, Y., Liang, X., Chang, E.I., Xu, Y.: Large scale image completion via co-modulated generative adversarial networks. In: International Conference on Learning Representations (ICLR) (2021)

[68] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)

[69] Zhu, L., Huang, Z., Liao, B., Liew, J.H., Yan, H., Feng, J., Wang, X.: Dig: Scalable and efficient diffusion models with gated linear attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7664–7674 (June 2025)

[70] Zhu, M., He, D., Li, X., Li, C., Li, F., Liu, X., Ding, E., Zhang, Z.: Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing 30, 4855–4866 (2021). https://doi.org/10.1109/TIP.2021.3076310

[71] Zhuang, J., Zeng, Y., Liu, W., Yuan, C., Chen, K.: A task is worth one word: Learning with task prompts for high-quality versatile image inpainting (2023)

Supplementary Materials of Moebius

Overview

In this supplementary material, we provide extensive qualitative results and further empirical analyses to reinforce the findings presented in the main manuscript. ...

2023

[1] Bello, I.: Lambdanetworks: Modeling long-range interactions without attention. arXiv preprint arXiv:2102.08602 (2021)

[2] Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques. p. 417–424. SIGGRAPH ’00, ACM Press/Addison-Wesley Publishing Co., USA (2000). https://doi.org/10.1145/344779.344972

[3] Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

[4] Chen, H., Zhao, Y.: Don’t look into the dark: Latent codes for pluralistic image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7591–7600 (June 2024)

[5] Chen, P., Liu, S., Zhao, H., Jia, J.: Distilling knowledge via knowledge review (2021), https://arxiv.org/abs/2104.09044

[6] Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: International Conference on Learning Representations (ICLR) (2024)

[7] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35, 16344–16359 (2022)

[8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

[9] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis (2021), https://arxiv.org/abs/2012.09841

[10] Fortin, A., Vernade, G., Kampf, K., Reshi, A.: Introducing gemini 2.5 flash image, our state-of-the-art image model. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/ (2025)

[11] Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019)

[12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceeding...

[13] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015), https://arxiv.org/abs/1503.02531

[14] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

[15] Inc, M.P.: Papers with code. https://web.archive.org/web/20250621114958/https://paperswithcode.com/ (2025)

[16] Jordan, K., Jin, Y., Boza, V., Jiacheng, Y., Cesista, F., Newhouse, L., Bernstein, J.: Muon: An optimizer for hidden layers in neural networks (2024), https://kellerjordan.github.io/posts/muon/

[17] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion (2024)

[18] Kang, M., Zhang, R., Barnes, C., Paris, S., Kwak, S., Park, J., Shechtman, E., Zhu, J.Y., Park, T.: Distilling Diffusion Models into Conditional GANs. In: European Conference on Computer Vision (ECCV) (2024)

[19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(12), 4217–4228 (2021). https://doi.org/10.1109/TPAMI.2020.2970919

[20] Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

[21] Li, J., Zhao, R., Huang, J.T., Gong, Y.: Learning small-size dnn with output-distribution-based criteria. In: Interspeech 2014. pp. 1910–1914 (2014). https://doi.org/10.21437/Interspeech.2014-432

[22] Li, R., Yang, T., Guo, S., Zhang, L.: Rorem: Training a robust object remover with human-in-the-loop. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14024–14035 (June 2025)

[23] Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., Jia, J.: Mat: Mask-aware transformer for large hole image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

[24] Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., Ren, J.: Rethinking vision transformers for mobilenet size and speed. In: Proceedings of the IEEE international conference on computer vision (2023)

[25] Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models (2025), https://arxiv.org/abs/2410.11081

[26] Manukyan, H., Sargsyan, A., Atanyan, B., Wang, Z., Navasardyan, S., Shi, H.: Hd-painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models. In: The Thirteenth International Conference on Learning Representations (2023)

[27] Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: Structure guided image inpainting using edge prediction. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (Oct 2019)

[28] Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[29] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis (2023), https://arxiv.org/abs/2307.01952

[30] Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F., Wang, W., Banbury, C., Ye, C., Akin, B., et al.: Mobilenetv4-universal models for the mobile ecosystem. arXiv preprint arXiv:2404.10518 (2024)

[31] Reda, F., Kontkanen, J., Tabellion, E., Sun, D., Pantofaru, C., Curless, B.: Film: Frame interpolation for large motion. In: European Conference on Computer Vision (ECCV) (2022)

[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)

[33] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets (2015), https://arxiv.org/abs/1412.6550

[34] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models (2022), https://arxiv.org/abs/2202.00512

[35] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)

[36] Sargsyan, A., Navasardyan, S., Xu, X., Shi, H.: Mi-gan: A simple baseline for image inpainting on mobile devices. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7335–7345 (October 2023)

[37] Shazeer, N.: Glu variants improve transformer (2020), https://arxiv.org/abs/2002.05202

[38] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015), https://arxiv.org/abs/1409.1556

[39] Song, H., Huang, S., Dong, Y., Tu, W.W.: Robustness and generalizability of deepfake detection: A study with diffusion models. arXiv (2023)

[40] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models (2023), https://arxiv.org/abs/2303.01469

[41] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161 (2021)

[42] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

[43] Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Cui, J., Ding, H., Dong, M., Du, A., Du, C., Du, D., Du, Y., Fan, Y., Feng, Y., Fu, K., Gao, B., Gao, H., Gao, P., Gao, T., Gu, X., Guan, L., Guo, H., Guo, J., Hu, H., Hao, X., He, T., He, W., He, W., Hong, C., Hu, Y., Hu, Z., Huang, W., Huang, Z., ...

[44] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability and variation. In: International Conference on Learning Representations (ICLR) (2018)

[45] Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation (2022), https://arxiv.org/abs/1910.10699

[46] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

[47] Wan, Z., Zhang, J., Chen, D., Liao, J.: High-fidelity pluralistic image completion with transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4672–4681 (2021). https://doi.org/10.1109/ICCV48922.2021.00465

[48] Wang, C., Chen, D., Mei, J.P., Zhang, Y., Feng, Y., Chen, C.: Semckd: Semantic calibration for cross-layer knowledge distillation. IEEE Transactions on Knowledge and Data Engineering 35(6), 6305–6319 (2023). https://doi.org/10.1109/TKDE.2022.3171571

[49] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)

[50] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

[51] Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024), https://arxiv.org/abs/2410.10629

[52] Xie, E., Chen, J., Zhao, Y., Yu, J., Zhu, L., Lin, Y., Zhang, Z., Li, M., Chen, J., Cai, H., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer (2025), https://arxiv.org/abs/2501.18427

[53] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems (NeurIPS) (2021)

[54] Xu, Z., Duan, K., Shen, X., Ding, Z., Liu, W., Ruan, X., Chen, X., Wang, X.: Pixelhacker: Image inpainting with structural and semantic consistency (2025), https://arxiv.org/abs/2504.20438

[55] Xu, Z., Zhao, H., Cui, Z., Liu, W., Zheng, C., Wang, X.: Most-dsa: Modeling motion and structural interactions for direct multi-frame interpolation in dsa images. In: European Conference on Artificial Intelligence. pp. 537–544 (2024), https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA240531

[56] Xu, Z., Zhao, H., Liu, W., Wang, X.: Garamost: Parallel multi-granularity motion and structural modeling for efficient multi-frame interpolation in dsa images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 28530–28538 (2025)

[57] Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)

[58] Yang, S., Wang, B., Shen, Y., Panda, R., Kim, Y.: Gated linear attention transformers with hardware-efficient training. In: Proceedings of ICML (2024)

[59] Yang, S., Zhang, Y.: Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism (Jan 2024), https://github.com/fla-org/flash-linear-attention

[60] Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

[61] Yi, Z., Tang, Q., Azizi, S., Jang, D., Xu, Z.: Contextual residual aggregation for ultra high-resolution image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7508–7517 (2020)

[62] Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7130–7138 (2017). https://doi.org/10.1109/CVPR.2017.754

[63] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.: Free-form image inpainting with gated convolution. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4470–4479 (2019). https://doi.org/10.1109/ICCV.2019.00457

[64] Zeng, Y., Fu, J., Chao, H., Guo, B.: Aggregated contextual transformations for high-resolution image inpainting. IEEE Transactions on Visualization and Computer Graphics 29(7), 3266–3280 (2023). https://doi.org/10.1109/TVCG.2022.3156949

[65] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

[66] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6848–6856 (2018)

[67] Zhao, S., Cui, J., Sheng, Y., Dong, Y., Liang, X., Chang, E.I., Xu, Y.: Large scale image completion via co-modulated generative adversarial networks. In: International Conference on Learning Representations (ICLR) (2021)

[68] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)

[69] Zhu, L., Huang, Z., Liao, B., Liew, J.H., Yan, H., Feng, J., Wang, X.: Dig: Scalable and efficient diffusion models with gated linear attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7664–7674 (June 2025)

[70] Zhu, M., He, D., Li, X., Li, C., Li, F., Liu, X., Ding, E., Zhang, Z.: Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing 30, 4855–4866 (2021). https://doi.org/10.1109/TIP.2021.3076310

[71] Zhuang, J., Zeng, Y., Liu, W., Yuan, C., Chen, K.: A task is worth one word: Learning with task prompts for high-quality versatile image inpainting (2023)

Supplementary Materials of Moebius

Overview

In this supplementary material, we provide extensive qualitative results and further empirical analyses to reinforce the findings presented in the main manuscript. ...