Recognition: no theorem link
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
Pith reviewed 2026-05-10 19:06 UTC · model grok-4.3
The pith
RefineAnything refines fine details inside a user-specified image region while leaving every non-selected pixel strictly unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reallocating the fixed VAE resolution budget to the user region through crop-and-resize and enforcing background identity via blended-mask paste-back plus a boundary-aware consistency loss, RefineAnything produces high-fidelity local corrections in both reference-based and reference-free settings while guaranteeing that all pixels outside the specified region remain identical to the input.
What carries the argument
Focus-and-Refine strategy that crops the user region, refines it at higher effective resolution, and pastes it back with a blended mask to enforce strict background preservation.
If this is right
- Supports both reference images and text prompts for the same local-refinement task.
- Enables iterative correction of defects such as logos, thin structures, and text without global side effects.
- Reduces seam visibility through explicit boundary consistency supervision.
- Provides a benchmark that separately scores region fidelity and background identity.
Where Pith is reading between the lines
- The same crop-and-resize idea may help other diffusion tasks where only a small fraction of the canvas needs high detail.
- Blended-mask paste-back could be combined with existing global editing models to create hybrid pipelines that first edit coarsely then refine locally.
- The approach implies that resolution allocation, rather than model scale alone, is a practical lever for precision in local image tasks.
Load-bearing premise
That cropping and resizing the target region improves local reconstruction quality under a fixed VAE resolution, and that blended-mask paste-back can guarantee zero change to background pixels without introducing artifacts.
What would settle it
Run the model on an input containing a small region with legible text or fine lines; if any background pixel value differs after refinement or if the text remains distorted, the central claim is falsified.
Figures
read the original abstract
We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RefineAnything, a multimodal diffusion model for region-specific image refinement given a user mask or box. It proposes Focus-and-Refine: crop-and-resize the region to reallocate VAE resolution, refine, then use blended-mask paste-back to return the result while claiming strict background preservation. A Boundary Consistency Loss is added to reduce seams. The authors release Refine-30K (20K reference-based + 10K reference-free) and RefineEval benchmark, reporting strong gains over baselines and near-perfect background preservation on RefineEval.
Significance. If the central claims hold, the work establishes a practical, high-precision local refinement pipeline that avoids the background drift common in instruction-driven editors. The new problem formulation, dataset, and benchmark are useful contributions for the community. The counter-intuitive crop-and-resize observation, if empirically validated, could influence how future VAE-based editors allocate resolution budgets. The emphasis on strict background invariance addresses a real pain point in applications such as product photography and document editing.
major comments (3)
- [Abstract, §3.2] Abstract and §3.2: The assertion that 'blended-mask paste-back guarantees strict background preservation without introducing artifacts' is load-bearing for the central claim yet rests on an unverified assumption. Blending necessarily interpolates pixels near the mask edge; if the refined crop differs from the original in illumination, texture, or high-frequency detail, residual changes can appear outside the strict mask even when the mask itself is binary. The Boundary Consistency Loss regularizes only during training and does not enforce pixel-level invariance at inference. Before/after difference maps or background-only metrics (e.g., PSNR/LPIPS restricted to non-masked pixels) are required to bound any leakage.
- [§3.1] §3.1: The key empirical observation that crop-and-resize substantially improves local reconstruction under fixed VAE input resolution is presented without quantitative support or ablation. The manuscript should report reconstruction error (or perceptual metrics) on small regions with and without the crop-and-resize step, ideally across multiple region sizes and VAE resolutions, to confirm the effect is not an artifact of the particular training regime.
- [§4.2, results table] §4.2 and Table 2 (or equivalent results table): The abstract claims 'strong improvements over competitive baselines' and 'near-perfect background preservation,' but the provided text does not include the actual numerical values, baseline names, or exact background-consistency metrics. Without these numbers it is impossible to judge whether the gains are practically meaningful or whether background preservation is truly near-perfect (e.g., background LPIPS < 0.01).
minor comments (2)
- [§3.2] The manuscript should clarify the exact blending function (alpha ramp width, interpolation method) used in the paste-back step and whether it is applied only at inference or also during training.
- [Figures 2-3] Figure captions and method diagrams should explicitly label the crop-and-resize and paste-back stages so readers can trace the resolution reallocation path.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript. We address each major comment below and will incorporate revisions to provide additional quantitative support, metrics, and clarifications as requested.
read point-by-point responses
-
Referee: [Abstract, §3.2] The assertion that 'blended-mask paste-back guarantees strict background preservation without introducing artifacts' is load-bearing for the central claim yet rests on an unverified assumption. Blending necessarily interpolates pixels near the mask edge; if the refined crop differs from the original in illumination, texture, or high-frequency detail, residual changes can appear outside the strict mask even when the mask itself is binary. The Boundary Consistency Loss regularizes only during training and does not enforce pixel-level invariance at inference. Before/after difference maps or background-only metrics (e.g., PSNR/LPIPS restricted to non-masked pixels) are required to bound any leakage.
Authors: We appreciate this point and agree that explicit verification strengthens the central claim. The blended-mask paste-back copies original background pixels directly outside the mask, applying blending only within a narrow boundary zone to minimize seams. However, to empirically bound any potential leakage from illumination or detail mismatches, we will add background-only PSNR and LPIPS metrics (computed solely on non-masked pixels) and input-output difference maps in the revised §3.2 and results. We will also clarify the inference-time invariance of the paste-back mechanism versus the training-time role of the Boundary Consistency Loss. revision: yes
-
Referee: [§3.1] The key empirical observation that crop-and-resize substantially improves local reconstruction under fixed VAE input resolution is presented without quantitative support or ablation. The manuscript should report reconstruction error (or perceptual metrics) on small regions with and without the crop-and-resize step, ideally across multiple region sizes and VAE resolutions, to confirm the effect is not an artifact of the particular training regime.
Authors: We acknowledge the need for quantitative validation of this observation. While the benefit is reflected in the overall RefineEval results and qualitative examples, the revised manuscript will include a new ablation subsection in §3.1. This will report PSNR and LPIPS reconstruction errors for small regions of varying sizes, comparing the crop-and-resize approach against direct fixed-resolution processing across multiple VAE input resolutions, to confirm the resolution reallocation effect. revision: yes
-
Referee: [§4.2, results table] §4.2 and Table 2 (or equivalent results table): The abstract claims 'strong improvements over competitive baselines' and 'near-perfect background preservation,' but the provided text does not include the actual numerical values, baseline names, or exact background-consistency metrics. Without these numbers it is impossible to judge whether the gains are practically meaningful or whether background preservation is truly near-perfect (e.g., background LPIPS < 0.01).
Authors: We apologize for any lack of clarity in presentation. The full manuscript contains Table 2 reporting results on RefineEval against baselines such as InstructPix2Pix and MagicBrush, with both region fidelity and background consistency metrics. In the revision, we will explicitly state all numerical values in the main text of §4.2, highlight background LPIPS scores (which fall below 0.01), and ensure baseline names and exact metrics are prominent in the table and surrounding discussion to allow direct assessment of the gains. revision: yes
Circularity Check
No circularity: method grounded in empirical observation, new dataset, and benchmark without self-referential reductions.
full rationale
The paper defines a new problem setting for region-specific refinement and proposes Focus-and-Refine based on a stated counter-intuitive observation about crop-and-resize reallocating VAE resolution. It constructs Refine-30K (20K reference-based + 10K reference-free) and RefineEval benchmark, then reports empirical gains. No equations, fitted parameters renamed as predictions, self-citations for uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation. The blended-mask paste-back and Boundary Consistency Loss are presented as design choices with training regularization, not as outputs forced by the inputs. The central claims rest on external evaluation rather than reducing to self-definition or self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption VAE-based diffusion models have a fixed input resolution that limits capture of fine local details when the region of interest is small.
- domain assumption Blended-mask paste-back can preserve background pixels strictly unchanged while allowing natural boundary transitions.
Forward citations
Cited by 1 Pith paper
-
From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark
A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.
Reference graph
Works this paper leans on
-
[1]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
BlackForest: Black forest labs; frontier ai lab (2024),https://blackforestlabs. ai/
2024
-
[3]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
2023
-
[4]
Cai, Q., Chen, J., Chen, Y., Li, Y., Long, F., Pan, Y., Qiu, Z., Zhang, Y., Gao, F., Xu, P., et al.: Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705 (2025)
-
[5]
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22560–22570 (October 2023)
2023
-
[6]
SAM 3: Segment Anything with Concepts
Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
In: ICLR
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wang, Z., Kwok, J.T., Luo, P., Lu, H., Li, Z.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: ICLR. OpenReview.net (2024)
2024
-
[8]
Chen, Z., Li, Y., Wang, H., Chen, Z., Jiang, Z., Li, J., Wang, Q., Yang, J., Tai, Y.: Ragd:Regional-awarediffusionmodelfortext-to-imagegeneration.In:Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19331–19341 (2025)
2025
-
[9]
Dip: Taming diffusion models in pixel space
Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)
-
[10]
Altclip: Altering the lan- guage encoder in clip for extended language capabilities
Chen, Z., Liu, G., Zhang, B.W., Ye, F., Yang, Q., Wu, L.: Altclip: Altering the language encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679 (2022)
-
[11]
Emerging Properties in Unified Multimodal Pretraining
Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
arXiv preprint arXiv:2503.23461 (2025)
Du, N., Chen, Z., Gao, S., Chen, Z., Chen, X., Jiang, Z., Yang, J., Tai, Y.: Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461 (2025)
-
[13]
In: ICML (2024)
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)
2024
-
[14]
In: NeurIPS
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS. pp. 6840–6851 (2020)
2020
-
[15]
ICLR1(2), 3 (2022)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)
2022
-
[16]
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017)
2017
-
[17]
Auto-Encoding Variational Bayes
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 16 D. Zhou et al
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[18]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dock- horn, T., English, J., English, Z., Esser, P., et al.: Flux. 1 kontext: Flow match- ing for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
arXiv preprint arXiv:2506.00596 (2025)
Li, D., Zhang, H., Wang, S., Li, J., Wu, Z.: Seg2any: Open-set segmentation- mask-to-image generation with precise shape and semantic control. arXiv preprint arXiv:2506.00596 (2025)
-
[20]
arXiv preprint arXiv:2404.07987 (2024)
Li, M., Yang, T., Kuang, H., Wu, J., Wang, Z., Xiao, X., Chen, C.: Controlnet++: Improving conditional controls with efficient consistency feedback. arXiv preprint arXiv:2404.07987 (2024)
-
[21]
In: European Conference on Computer Vision
Li, M., Yang, T., Kuang, H., Wu, J., Wang, Z., Xiao, X., Chen, C.: Controlnet++: Improving conditional controls with efficient consistency feedback. In: European Conference on Computer Vision. pp. 129–147. Springer (2025)
2025
-
[22]
Li, Y., Ma, F., Yang, Y.: Anysynth: Harnessing the power of image synthetic data generation for generalized vision-language tasks. arXiv preprint arXiv:2411.16749 (2024)
-
[23]
In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)
Li, Y., Ma, F., Yang, Y.: Imagine and seek: Improving composed image retrieval with an imagined proxy. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 3984–3993 (June 2025)
2025
-
[24]
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
Li, Y., Zhou, D., Ma, F., Li, F., He, D., Yang, Y.: Foleydirector: Fine-grained tem- poral steering for video-to-audio generation via structured scripts. arXiv preprint arXiv:2603.19857 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
Li, Z., Zhang, J., Lin, Q., Xiong, J., Long, Y., Deng, X., Zhang, Y., Liu, X., Huang, M., Xiao, Z., Chen, D., He, J., Li, J., Li, W., Zhang, C., Quan, R., Lu, J., Huang, J., Yuan, X., Zheng, X., Li, Y., Zhang, J., Zhang, C., Chen, M., Liu, J., Fang, Z., Wang, W., Xue, J., Tao, Y., Zhu, J., Liu, K., Lin, S., Sun, Y., Li, Y., Wang, D., Chen, M., Hu, Z., Xia...
2024
-
[26]
Step1X-Edit: A Practical Framework for General Image Editing
Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
work page internal anchor Pith review arXiv 2025
- [27]
-
[28]
In: ICCV (2023)
Lu, S., Liu, Y., Kong, A.W.K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In: ICCV (2023)
2023
-
[29]
CVPR (2024)
Lu, S., Wang, Z., Li, L., Liu, Y., Kong, A.W.K.: Mace: Mass concept erasure in diffusion models. CVPR (2024)
2024
-
[30]
Robust watermarking using generative priors against image editing: From benchmarking to advances
Lu, S., Zhou, Z., Lu, J., Zhu, Y., Kong, A.W.K.: Robust watermarking using gener- ative priors against image editing: From benchmarking to advances. arXiv preprint arXiv:2410.18775 (2024)
-
[31]
com / index / introducing - 4o - image - generation/(2025)
OpenAI: Gpt-4o.https : / / openai . com / index / introducing - 4o - image - generation/(2025)
2025
-
[32]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) RefineAnything 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
In: ICML
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)
2021
-
[35]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022)
2022
-
[36]
arXiv preprint arXiv:2511.18333 (2025)
Shi, X., Li, B., Han, X., Cai, Z., Yang, L., Lin, D., Wang, Q.: Consistcom- pose: Unified multimodal layout control for image composition. arXiv preprint arXiv:2511.18333 (2025)
- [37]
-
[38]
Team, G.: Gemini 2.5 flash & gemini 2.5 flash image model card (2025)
2025
-
[39]
Team, G.: Gemini 3.0 pro & gemini 3.0 pro image model card (2025)
2025
-
[40]
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team, S., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
work page internal anchor Pith review arXiv 2025
-
[41]
Gpt-image-edit-1.5 m: A million-scale, gpt-generated image dataset
Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025)
-
[42]
IEEE transactions on image processing 13(4), 600–612 (2004)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
2004
-
[43]
Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025a
Xie, J., Darrell, T., Zettlemoyer, L., Wang, X.: Reconstruction alignment improves unified multimodal models. arXiv preprint arXiv:2509.07295 (2025)
-
[46]
Xu, R., Zhou, D., Ma, F., Yang, Y.: Contextgen: Contextual layout anchoring for identity-consistent multi-instance generation. arXiv preprint arXiv:2510.11000 (2025)
-
[47]
arXiv preprint arXiv:2410.09400 , year=
Xu, Y., He, Z., Shan, S., Chen, X.: Ctrlora: An extensible and efficient framework for controllable image generation. arXiv preprint arXiv:2410.09400 (2024)
-
[48]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
ImgEdit: A Unified Image Editing Dataset and Benchmark
Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)
work page internal anchor Pith review arXiv 2025
-
[50]
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
work page internal anchor Pith review arXiv 2022
-
[51]
arXiv preprint arXiv:2501.01097 (2025)
Zhang, H., Duan, Z., Wang, X., Chen, Y., Zhang, Y.: Eligen: Entity-level controlled image generation with regional attention. arXiv preprint arXiv:2501.01097 (2025)
-
[52]
Zhang, H., Duan, Z., Wang, X., Chen, Y., Zhao, Y., Zhang, Y.: Nexus-gen: A unified model for image understanding, generation, and editing. arXiv preprint arXiv:2504.21356 (2025)
-
[53]
Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation
Zhang,H.,Hong,D.,Gao,T.,Wang,Y.,Shao,J.,Wu,X.,Wu,Z.,Jiang,Y.G.:Cre- atilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation. arXiv preprint arXiv:2412.03859 (2024) 18 D. Zhou et al
-
[54]
arXiv preprint arXiv:2505.19114 (2025)
Zhang, H., Hong, D., Yang, M., Cheng, Y., Zhang, Z., Shao, J., Wu, X., Wu, Z., Jiang, Y.G.: Creatidesign: A unified multi-conditional diffusion transformer for creative graphic design. arXiv preprint arXiv:2505.19114 (2025)
-
[55]
Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing within-contextgenerationinlargescalediffusiontransformer.In:TheThirty-ninth Annual Conference on Neural Information Processing Systems (2025)
2025
-
[56]
CVPR (2024)
Zhao, C., Cai, W., Dong, C., Hu, C.: Wavelet-based fourier information interac- tion with frequency diffusion adjustment for underwater image restoration. CVPR (2024)
2024
-
[57]
In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Zhao, C., Cai, W., Dong, C., Zeng, Z.: Toward sufficient spatial-frequency in- teraction for gradient-aware underwater image enhancement. In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3220–3224. IEEE (2024)
2024
-
[58]
Zhao, C., Chen, J., Li, H., Kang, Z., Lu, S., Wei, X., Zhang, K., Yang, J., Tai, Y.: Luve: Latent-cascaded ultra-high-resolution video generation with dual frequency experts. arXiv preprint arXiv:2602.11564 (2026)
-
[59]
In: Proceedings of the Computer Vision and Pat- tern Recognition Conference
Zhao, C., Chen, Z., Xu, Y., Gu, E., Li, J., Yi, Z., Wang, Q., Yang, J., Tai, Y.: From zero to detail: Deconstructing ultra-high-definition image restoration from progressive spectral perspective. In: Proceedings of the Computer Vision and Pat- tern Recognition Conference. pp. 17935–17946 (2025)
2025
-
[60]
arXiv preprint arXiv:2510.20661 (2025)
Zhao, C., Ci, E., Xu, Y., Fan, T., Guan, S., Ge, Y., Yang, J., Tai, Y.: Ultrahr- 100k: Enhancing uhr image synthesis with a large-scale high-quality dataset. arXiv preprint arXiv:2510.20661 (2025)
-
[61]
arXiv preprint arXiv:2403.01497 (2024)
Zhao, C., Dong, C., Cai, W.: Learning a physical-aware diffusion model based on transformer for underwater image enhancement. arXiv preprint arXiv:2403.01497 (2024)
-
[62]
Zhou, D., Li, M., Yang, Z., Lu, Y., Xu, Y., Wang, Z., Huang, Z., Yang, Y.: Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268 (2025)
-
[63]
In: ICCV (2025)
Zhou, D., Li, M., Yang, Z., Yang, Y.: Dreamrenderer: Taming multi-instance at- tribute control in large-scale text-to-image models. In: ICCV (2025)
2025
-
[64]
In: CVPR (2024)
Zhou, D., Li, Y., Ma, F., Zhang, X., Yang, Y.: Migc: Multi-instance generation controller for text-to-image synthesis. In: CVPR (2024)
2024
-
[65]
Zhou, D., Xie, J., Yang, Z., Yang, Y.: 3dis: Depth-driven decoupled instance syn- thesis for text-to-image generation. arXiv preprint arXiv:2410.12669 (2024)
-
[66]
In: IJCAI (2023)
Zhou, D., Yang, Z., Yang, Y.: Pyramid diffusion models for low-light image en- hancement. In: IJCAI (2023)
2023
-
[67]
arXiv preprint arXiv:2510.02253 (2025)
Zhou, Z., Lu, S., Leng, S., Zhang, S., Lian, Z., Yu, X., Kong, A.W.K.: Dragflow: Unleashing dit priors with region based supervision for drag editing. arXiv preprint arXiv:2510.02253 (2025)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.