You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

Anna Briotto; Endi Hysa; Lamberto Ballan; Marco Fiorucci; Riccardo Carraro

arxiv: 2605.14166 · v2 · pith:ITMISEQZnew · submitted 2026-05-13 · 💻 cs.CV


Riccardo Carraro, Anna Briotto, Endi Hysa, Marco Fiorucci, Lamberto Ballan

Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords: face super-resolution, U-Net, YOLO-World, landmark heatmaps, lightweight model, image upscaling, CelebA, spatial loss weighting


The pith

A lightweight U-Net reconstructs 128x128 faces from 16x16 inputs by weighting its loss with YOLO-World landmark heatmaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a U-Net architecture for extreme face super-resolution that reuses heatmaps from an open-vocabulary detector as spatial weights in the training loss. These weights direct reconstruction effort toward eyes, nose, and mouth without any auxiliary landmark network or adversarial component. The method trains and runs efficiently because the detector runs once to supply fixed priors rather than being integrated into the pipeline. On aligned CelebA images the weighted loss raises standard metrics and yields visibly sharper outputs than an unweighted baseline. The approach shows that detection outputs can serve directly as perceptual guidance for lightweight upscaling.

Core claim

Heatmaps produced by YOLO-World on the low-resolution input are converted into per-pixel weights that multiply the pixel-wise reconstruction loss; the resulting heatmap-guided objective trains a standard U-Net to emphasize facial landmarks, delivering 128x128 outputs from 16x16 inputs that are quantitatively and perceptually superior to the same network trained without the weighting.
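The claim does not spell out the functional form of the weighting, so the following is only one plausible reading. It assumes a heatmap $H$ with values in $[0, 1]$ that has been resized to the output resolution, a hypothetical emphasis factor $\alpha \ge 0$, and an L1 pixel loss; none of these three choices is confirmed by the source.

$$
W_p = 1 + \alpha H_p, \qquad
\mathcal{L}_{\text{guided}} = \frac{1}{N} \sum_{p=1}^{N} W_p \, \lvert \hat{Y}_p - Y_p \rvert
$$

Here $\hat{Y}$ is the U-Net output, $Y$ is the ground-truth image, and $N = 128 \cdot 128$ is the number of output pixels. Under this reading, setting $\alpha = 0$ (or $H \equiv 0$) gives $W_p = 1$ everywhere and recovers the plain loss $\frac{1}{N} \sum_p \lvert \hat{Y}_p - Y_p \rvert$, which is why a comparison against the same network without weighting isolates the effect of the heatmaps.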

What carries the argument

YOLO-World landmark heatmaps turned into spatial weights for a heatmap-guided reconstruction loss that emphasizes errors around eyes, nose, and mouth.
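A minimal sketch of how landmark locations might be turned into spatial weights, in pure Python for readability. The Gaussian bump shape, the `sigma` and `alpha` parameters, the `1 + alpha * heat` mapping, and all function names are illustrative assumptions; the source does not state how the paper builds its heatmaps or weights.

```python
import math

def gaussian_heatmap(size, centers, sigma):
    """Return a size x size map with a Gaussian bump at each (row, col) center.

    ASSUMPTION: Gaussian bumps are a common choice, not the paper's stated one.
    """
    heat = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            # Keep the strongest response where bumps overlap; 0.0 if no centers.
            heat[r][c] = max(
                (math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
                 for cr, cc in centers),
                default=0.0,
            )
    return heat

def weight_map(heat, alpha):
    """Map a heatmap in [0, 1] to weights in [1, 1 + alpha] (hypothetical form)."""
    return [[1.0 + alpha * h for h in row] for row in heat]

def weighted_l1(pred, target, weights):
    """Mean over pixels of weight * absolute error."""
    total = 0.0
    count = 0
    for pred_row, target_row, weight_row in zip(pred, target, weights):
        for p, t, w in zip(pred_row, target_row, weight_row):
            total += w * abs(p - t)
            count += 1
    return total / count

# Illustrative landmark positions on an 8x8 grid (made-up values).
heat = gaussian_heatmap(8, [(2, 2), (5, 5)], sigma=1.0)
weights = weight_map(heat, alpha=4.0)
```

With an empty list of centers the heatmap is all zeros, every weight is 1, and `weighted_l1` equals the ordinary mean absolute error, so a detector that finds nothing leaves the baseline objective unchanged in this sketch.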

If this is right

No separate landmark or alignment network is required at training or test time.
The full pipeline stays lightweight because the detector is used only once to generate fixed weights.
Quantitative metrics and visual sharpness improve consistently on the CelebA test set.
Adversarial training is unnecessary to obtain realistic reconstructions under the guided loss.
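The "used only once" point in the list above can be sketched as a simple cache: the detector is called a single time per training image, and every later epoch reads the stored weight map. `build_weight_cache`, `detect_heatmap`, and `alpha` are hypothetical names introduced for illustration, and the stub detector stands in for a real YOLO-World call, whose API is not shown in the source.

```python
def build_weight_cache(image_ids, detect_heatmap, alpha=4.0):
    """Run the detector once per image and store the resulting weight map.

    ASSUMPTION: detect_heatmap(image_id) returns a 2D list with values in [0, 1],
    and weights take the hypothetical form 1 + alpha * heat.
    """
    cache = {}
    for image_id in image_ids:
        heat = detect_heatmap(image_id)  # expensive call, made exactly once
        cache[image_id] = [[1.0 + alpha * h for h in row] for row in heat]
    return cache

# Stand-in detector that records how often it is called.
detector_calls = []

def stub_detector(image_id):
    detector_calls.append(image_id)
    return [[0.0, 1.0], [0.5, 0.0]]

image_ids = ["img_0", "img_1"]
cache = build_weight_cache(image_ids, stub_detector, alpha=2.0)

# Several "epochs" reuse the cached weights without touching the detector.
for _epoch in range(3):
    for image_id in image_ids:
        weights = cache[image_id]
```

After the loop, `detector_calls` still holds one entry per image, which is the property that keeps the detector out of both the training loop and inference.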

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weighting strategy could be applied to other restoration tasks where an off-the-shelf detector supplies region priors.
Freezing the detector and reusing its outputs may allow the super-resolution model to be trained with fewer epochs or smaller batches.
Performance on unaligned or real-world low-resolution faces remains untested and would determine whether the method requires an explicit alignment stage.
Replacing the pixel loss entirely with a detector-derived perceptual loss might further simplify the objective.

Load-bearing premise

The heatmaps generated by YOLO-World on 16x16 degraded inputs remain sufficiently accurate and aligned to serve as reliable spatial weights.

What would settle it

The claim would be falsified by a controlled test in which inputs with visibly misplaced or missing YOLO-World landmark heatmaps yield lower PSNR/SSIM and blurrier faces than the unweighted baseline.
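The PSNR half of such a test could look like the sketch below, which compares guided and baseline outputs on a chosen subset of images. The function names and the assumption that pixel values lie in [0, 1] are illustrative; SSIM is omitted because it needs a windowed implementation.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2D images."""
    squared = [(p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def mean_psnr_gap(guided, baseline, targets):
    """Average of PSNR(guided) - PSNR(baseline) over a list of images.

    Assumes no output matches its target exactly (PSNR would be infinite).
    """
    gaps = [psnr(g, t) - psnr(b, t)
            for g, b, t in zip(guided, baseline, targets)]
    return sum(gaps) / len(gaps)
```

A negative `mean_psnr_gap` on the subset where the heatmaps are visibly wrong would be the quantitative part of the falsifying evidence described above.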

Figures

Figures reproduced from arXiv: 2605.14166 by Anna Briotto, Endi Hysa, Lamberto Ballan, Marco Fiorucci, Riccardo Carraro.

**Figure 1.** Efficient U-Net architecture for image super-resolution, transforming a low-resolution […]

**Figure 2.** Heatmaps generated with landmarks detected by YOLO-World.

**Figure 3.** Qualitative comparison on aligned CelebA […]

**Figure 4.** Qualitative comparison on aligned CelebA […]

Original abstract

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net based-architecture designed to reconstructs $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a lightweight U-Net for 8x face SR that reuses YOLO-World heatmaps as spatial loss weights without extra networks, but the abstract supplies no numbers or checks on whether those heatmaps work on 16x16 inputs.


The core idea is straightforward: take a plain lightweight U-Net, run YOLO-World on the low-res 16x16 input to get heatmaps for eyes, nose, and mouth, turn those into per-pixel weights, and use them to weight the reconstruction loss. This avoids any separate landmark detector or alignment module and skips adversarial training. On aligned CelebA it reportedly gives sharper outputs and better metrics than the baseline U-Net alone.

That reuse of an off-the-shelf open-vocabulary detector is the main practical difference from prior face SR work that builds dedicated sub-networks for landmarks or priors. It keeps training and inference cheap, which matters for edge devices. The abstract is clear that the heatmaps come directly from the detector and are not learned inside the U-Net, so there is no circularity in the supervision. The approach is honest about staying simple.

The main gap is that the abstract states consistent metric gains and more realistic faces but reports zero numbers, no table of PSNR/SSIM/LPIPS against standard baselines, no ablation that isolates the weighting term, and no check on how often YOLO-World actually produces usable heatmaps at 16x16 resolution. YOLO-World was trained on normal-resolution images, so feeding it severely downsampled aligned crops risks collapsed or noisy detections; if that happens the guided loss degrades to ordinary pixel loss plus artifacts. Without those diagnostics the central claim stays unverified.

This is the kind of paper that could be useful to people already working on efficient face restoration pipelines who want a quick way to inject semantic emphasis without extra parameters. It does not reorganize super-resolution theory and the application is narrow, but the implementation looks reproducible enough to test in a day or two. I would send it to peer review so the experiments can be examined properly; the idea is simple enough that a referee can quickly decide if the heatmaps are doing real work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a lightweight U-Net for 8× face super-resolution (16×16 degraded inputs to 128×128 outputs) on aligned CelebA. It introduces an auxiliary-training-free heatmap-guided reconstruction loss that converts outputs from the pre-trained open-vocabulary YOLO-World detector into spatial weights emphasizing eyes, nose, and mouth regions. The approach avoids adversarial training, separate alignment networks, or heavy architectures, and claims consistent quantitative metric improvements plus sharper reconstructions.

Significance. If the YOLO-World heatmaps remain reliable on severely degraded 16×16 inputs, the method offers a low-overhead way to inject semantic priors into SR losses without extra parameters or training stages. This could be useful for resource-constrained pipelines, but the absence of reported metric values, baseline comparisons, or heatmap-quality diagnostics in the provided text makes the practical significance difficult to evaluate at present.

major comments (2)

[Abstract] Abstract: the central empirical claim states that the proposed loss 'consistently improves quantitative metrics,' yet no PSNR, SSIM, LPIPS, or other numerical values, baseline comparisons, or ablation results are supplied. Without these data the magnitude and reliability of the reported gains cannot be assessed.
[Method] Method (heatmap generation paragraph): YOLO-World is applied directly to 16×16 inputs to produce landmark heatmaps used as spatial weights. No quantitative validation of heatmap accuracy (e.g., landmark localization error, failure rate, or comparison against ground-truth landmarks on the same degraded inputs) is provided. If detections collapse or misalign, the guided loss reduces to standard pixel loss plus potential artifacts, undermining the 'detection-driven priors' contribution.
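The diagnostics requested in the second major comment could be computed roughly as follows. This is a sketch under stated assumptions: detections are given as `(x, y)` points or `None` when the detector returns nothing, ground truth comes from an annotated source (for example CelebA's own landmark annotations rescaled to the 16×16 grid), and the normalization length `norm` and failure `threshold` are hypothetical choices, not values from the paper.

```python
import math

def landmark_diagnostics(predicted, ground_truth, norm, threshold=0.1):
    """Return (mean normalized error over detected landmarks, failure rate).

    predicted:    per-image lists of (x, y) tuples, or None for a missed landmark
    ground_truth: per-image lists of (x, y) tuples in the same order
    norm:         length used to normalize distances (e.g. image width)
    A landmark counts as a failure if it is missing or its normalized
    error exceeds the threshold.
    """
    errors = []
    failures = 0
    total = 0
    for pred_points, gt_points in zip(predicted, ground_truth):
        for pred, gt in zip(pred_points, gt_points):
            total += 1
            if pred is None:
                failures += 1
                continue
            err = math.dist(pred, gt) / norm
            errors.append(err)
            if err > threshold:
                failures += 1
    mean_error = sum(errors) / len(errors) if errors else math.nan
    failure_rate = failures / total if total else math.nan
    return mean_error, failure_rate
```

Reporting both numbers on the degraded inputs would show directly whether the spatial weights land on the eyes, nose, and mouth or on arbitrary pixels.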

minor comments (2)

[Abstract] Abstract contains a grammatical error: 'designed to reconstructs' should read 'designed to reconstruct'.
[Experiments] The manuscript should include a dedicated subsection or table reporting the exact quantitative results, chosen baselines, and ablation isolating the heatmap weighting effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.


Referee: [Abstract] Abstract: the central empirical claim states that the proposed loss 'consistently improves quantitative metrics,' yet no PSNR, SSIM, LPIPS, or other numerical values, baseline comparisons, or ablation results are supplied. Without these data the magnitude and reliability of the reported gains cannot be assessed.

Authors: We agree that the abstract would benefit from greater specificity. The Experiments section reports the full set of quantitative results, including PSNR, SSIM, and LPIPS values with baseline comparisons. In the revised manuscript we will update the abstract to cite the key metric improvements achieved by the heatmap-guided loss.

Revision: yes

Referee: [Method] Method (heatmap generation paragraph): YOLO-World is applied directly to 16×16 inputs to produce landmark heatmaps used as spatial weights. No quantitative validation of heatmap accuracy (e.g., landmark localization error, failure rate, or comparison against ground-truth landmarks on the same degraded inputs) is provided. If detections collapse or misalign, the guided loss reduces to standard pixel loss plus potential artifacts, undermining the 'detection-driven priors' contribution.

Authors: This is a valid concern. The current manuscript does not include explicit quantitative diagnostics of YOLO-World performance on the 16×16 inputs. In the revision we will add a short analysis (new paragraph or table) reporting landmark localization error and detection success rate on the degraded inputs relative to ground-truth landmarks, thereby confirming that the heatmaps remain sufficiently reliable to provide meaningful spatial guidance.

Revision: yes

Circularity Check

0 steps flagged

No circularity: external pre-trained detector supplies independent supervision


The manuscript's core mechanism converts outputs from the external, pre-trained YOLO-World detector into spatial weights for a reconstruction loss. This signal is generated outside the U-Net training loop and does not depend on any parameters or fitted quantities internal to the proposed model. No equations, self-citations, or ansatzes are shown that would reduce the claimed metric improvements to a tautological re-expression of the inputs. The derivation chain therefore remains self-contained, with the performance gains presented as empirical outcomes on CelebA rather than predictions forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a general-purpose detector produces usable landmark heatmaps on 16x16 degraded faces; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: YOLO-World produces reliable spatial heatmaps for eyes, nose, and mouth when run on 16x16 severely degraded face images.
The heatmap-guided loss is defined directly from these detector outputs; if the assumption fails, the weighting provides no useful signal.

