High-Fidelity Mural Restoration via a Unified Hybrid Mask-Aware Transformer
Pith reviewed 2026-05-13 17:35 UTC · model grok-4.3
The pith
The Hybrid Mask-Aware Transformer restores ancient murals by combining local texture modeling with long-range structural inference while preserving undamaged regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HMAT integrates Mask-Aware Dynamic Filtering for robust local texture modeling, a Transformer bottleneck for long-range structural inference, a mask-conditional style fusion module that dynamically guides generation, and a Teacher-Forcing Decoder with hard-gated skip connections that enforce fidelity in undamaged regions while focusing reconstruction on missing areas.
What carries the argument
The Hybrid Mask-Aware Transformer (HMAT) framework, which uses mask-aware dynamic filtering and a transformer bottleneck together with mask-conditional style fusion and hard-gated skip connections.
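One plausible reading of a hard-gated skip connection, sketched here as an assumption rather than the paper's actual implementation, is a per-position binary selection: the encoder feature passes through wherever the mask marks a valid pixel, and the decoder feature is used wherever the pixel is missing. The function name `hard_gated_skip`, the 2-D list representation, and the convention that 1 means valid are illustrative choices that do not come from the paper.

```python
def hard_gated_skip(encoder_feat, decoder_feat, mask):
    """Pick encoder values where mask == 1 (valid) and decoder values where mask == 0 (missing).

    All three arguments are 2-D lists of equal shape. The mask convention
    (1 = valid, 0 = missing) is an assumption made for this sketch.
    """
    fused = []
    for enc_row, dec_row, mask_row in zip(encoder_feat, decoder_feat, mask):
        fused_row = []
        for enc, dec, m in zip(enc_row, dec_row, mask_row):
            if m not in (0, 1):
                raise ValueError("hard gating expects a binary mask")
            # Hard gate: no blending, each position comes from exactly one source.
            fused_row.append(enc if m == 1 else dec)
        fused.append(fused_row)
    return fused


encoder_feat = [[0.9, 0.8], [0.7, 0.6]]  # toy features from the undamaged input
decoder_feat = [[0.1, 0.2], [0.3, 0.4]]  # toy features synthesized by the decoder
mask = [[1, 0], [0, 1]]                  # 1 = valid, 0 = missing

fused = hard_gated_skip(encoder_feat, decoder_feat, mask)
print(fused)  # [[0.9, 0.2], [0.3, 0.6]]
```

Because the gate is a selection rather than a weighted blend, values at valid positions are copied unchanged, which is the property the fidelity claim depends on.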
Load-bearing premise
The mask-conditional style fusion and hard-gated skip connections will generalize across diverse real-world mural degradation patterns beyond the DHMural and Nine-Colored Deer datasets.
What would settle it
Running HMAT on a fresh collection of murals that exhibit degradation morphologies absent from the two training datasets and observing clear drops in structural coherence or fidelity scores compared with baseline methods.
Original abstract
Ancient murals are valuable cultural artifacts, but many have suffered severe degradation due to environmental exposure, material aging, and human activity. Restoring these artworks is challenging because it requires both reconstructing large missing structures and strictly preserving authentic, undamaged regions. This paper presents the Hybrid Mask-Aware Transformer (HMAT), a unified framework for high-fidelity mural restoration. HMAT integrates Mask-Aware Dynamic Filtering for robust local texture modeling with a Transformer bottleneck for long-range structural inference. To further address the diverse morphology of degradation, we introduce a mask-conditional style fusion module that dynamically guides the generative process. In addition, a Teacher-Forcing Decoder with hard-gated skip connections is designed to enforce fidelity in valid regions and focus reconstruction on missing areas. We evaluate HMAT on the DHMural dataset and a curated Nine-Colored Deer dataset under varying degradation levels. Experimental results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches, while producing more structurally coherent and visually faithful restorations. These findings suggest that HMAT provides an effective solution for the digital restoration of cultural heritage murals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Hybrid Mask-Aware Transformer (HMAT) for high-fidelity restoration of degraded ancient murals. It integrates Mask-Aware Dynamic Filtering for local texture modeling with a Transformer bottleneck for long-range structural inference, introduces a mask-conditional style fusion module to dynamically guide generation according to degradation morphology, and proposes a Teacher-Forcing Decoder with hard-gated skip connections to enforce fidelity in valid regions while focusing reconstruction on missing areas. The method is evaluated on the DHMural dataset and a curated Nine-Colored Deer dataset under varying degradation levels, with the central claim that it achieves competitive performance against state-of-the-art approaches while producing more structurally coherent and visually faithful restorations.
Significance. If the empirical results hold, this work offers a targeted advance for digital cultural heritage preservation by providing a unified hybrid architecture that simultaneously handles large missing structures and strict preservation of authentic regions. The mask-aware components and gated skips represent a concrete contribution to conditional image restoration, with potential applicability to other domains involving partial degradation. The paper's emphasis on both local filtering and global Transformer inference is a strength, as is the focus on real cultural artifacts rather than synthetic benchmarks alone.
major comments (2)
- [§4] The central empirical claim (abstract and §4) that HMAT produces more structurally coherent restorations rests on evaluation solely on DHMural and Nine-Colored Deer under controlled degradation levels. No cross-dataset, cross-domain, or out-of-distribution tests are reported, which directly bears on whether the mask-conditional style fusion and hard-gated skip connections generalize to arbitrary real-world mural patterns (e.g., different crack topologies or pigment fading). This is a load-bearing gap for the generalization assertion.
- [§4] §4 (and associated tables/figures): the abstract asserts competitive results with superior coherence, yet the evaluation description provides no concrete metrics (PSNR, SSIM, LPIPS, or user-study scores), no listed baselines, no ablation tables isolating the contribution of the style fusion or gated skips, and no error analysis. Without these quantitative details, the support for the performance claim cannot be verified.
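To make the requested metrics concrete, the following is a minimal sketch of PSNR computed separately over the missing and the valid regions, so that reconstruction quality and preservation fidelity can be reported side by side. The function name `masked_psnr`, the toy pixel values, and the mask convention (0 = missing, 1 = valid) are assumptions for illustration; the paper's evaluation protocol is not described in the material reviewed here.

```python
import math


def masked_psnr(reference, restored, mask, region=0, max_value=255.0):
    """PSNR computed only over pixels whose mask value equals `region`.

    Images and mask are 2-D lists of equal shape. By the convention assumed
    here, mask value 0 marks missing pixels and 1 marks valid ones.
    """
    squared_error = 0.0
    count = 0
    for ref_row, res_row, mask_row in zip(reference, restored, mask):
        for ref, res, m in zip(ref_row, res_row, mask_row):
            if m == region:
                squared_error += (ref - res) ** 2
                count += 1
    if count == 0:
        raise ValueError("no pixels in the requested region")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical pixels: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 grayscale example (values are made up, not results from the paper).
reference = [[100, 120], [140, 160]]
restored = [[100, 125], [135, 160]]
mask = [[1, 0], [0, 1]]

hole_psnr = masked_psnr(reference, restored, mask, region=0)
valid_psnr = masked_psnr(reference, restored, mask, region=1)
print(f"hole PSNR: {hole_psnr:.2f} dB, valid PSNR: {valid_psnr}")
```

Reporting the two regions separately would show whether a gain comes from better reconstruction in missing areas or simply from copying valid pixels through.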
minor comments (2)
- [Abstract] Abstract: consider adding one or two key quantitative results (e.g., average PSNR improvement) to make the 'competitive performance' claim immediately concrete for readers.
- [§3] Notation: the distinction between 'mask-conditional style fusion' and 'Mask-Aware Dynamic Filtering' should be clarified with a short equation or diagram reference in §3 to avoid reader confusion about module roles.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and will revise the paper to strengthen the empirical evaluation and generalization analysis.
- Referee: [§4] The central empirical claim (abstract and §4) that HMAT produces more structurally coherent restorations rests on evaluation solely on DHMural and Nine-Colored Deer under controlled degradation levels. No cross-dataset, cross-domain, or out-of-distribution tests are reported, which directly bears on whether the mask-conditional style fusion and hard-gated skip connections generalize to arbitrary real-world mural patterns (e.g., different crack topologies or pigment fading). This is a load-bearing gap for the generalization assertion.
  Authors: We agree that cross-dataset and out-of-distribution evaluation is necessary to substantiate the generalization of the mask-aware components. In the revision we will add experiments on additional real mural images with varied degradation patterns (e.g., different crack topologies and pigment fading) drawn from public cultural-heritage collections, together with controlled synthetic OOD degradations, to directly test the robustness of the style fusion and gated-skip mechanisms.
  revision: yes
- Referee: [§4] §4 (and associated tables/figures): the abstract asserts competitive results with superior coherence, yet the evaluation description provides no concrete metrics (PSNR, SSIM, LPIPS, or user-study scores), no listed baselines, no ablation tables isolating the contribution of the style fusion or gated skips, and no error analysis. Without these quantitative details, the support for the performance claim cannot be verified.
  Authors: We acknowledge that the current presentation of §4 lacks sufficient quantitative detail. The revised manuscript will explicitly report PSNR, SSIM, LPIPS, and user-study scores; list all baselines with implementation details; include ablation tables that isolate the mask-conditional style fusion and hard-gated skip connections; and add a dedicated error-analysis subsection with failure-case visualizations and quantitative breakdown by degradation type.
  revision: yes
Circularity Check
No significant circularity; empirical claims rest on independent evaluation
The paper describes a neural architecture (HMAT) with modules such as Mask-Aware Dynamic Filtering, mask-conditional style fusion, and hard-gated skip connections, then reports competitive performance on DHMural and Nine-Colored Deer datasets. No equations, derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims are therefore self-contained empirical statements rather than tautological reductions, consistent with a standard computer-vision methods paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters
axioms (1)
- domain assumption: Degradation can be accurately represented by binary masks that separate valid and missing regions
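The binary-mask assumption can be tested mechanically. The sketch below, with hypothetical helper names `is_binary_mask` and `max_valid_deviation`, first confirms that a mask contains only 0 and 1, then measures how far a restored image drifts from the original at pixels marked valid. The 1 = valid convention and the toy values are assumptions, not details taken from the paper.

```python
def is_binary_mask(mask):
    """Return True if every entry of the 2-D mask is exactly 0 or 1."""
    return all(m in (0, 1) for row in mask for m in row)


def max_valid_deviation(original, restored, mask):
    """Largest absolute pixel change among positions marked valid (mask == 1).

    A result of 0 means the valid region was preserved exactly.
    """
    deviations = [
        abs(o - r)
        for orig_row, rest_row, mask_row in zip(original, restored, mask)
        for o, r, m in zip(orig_row, rest_row, mask_row)
        if m == 1
    ]
    return max(deviations, default=0)


# Toy data: one missing pixel (top right) and one valid pixel altered by 1.
original = [[10, 20], [30, 40]]
restored = [[10, 99], [30, 41]]
mask = [[1, 0], [1, 1]]

print(is_binary_mask(mask))                           # True
print(is_binary_mask([[1, 0.5]]))                     # False: a soft mask breaks the assumption
print(max_valid_deviation(original, restored, mask))  # 1
```

A mask with fractional values fails the first check, which is one simple way to flag inputs that fall outside the stated assumption.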