arxiv: 2205.11880 · v1 · submitted 2022-05-24 · 💻 cs.CV · cs.GR

Hierarchical Vectorization for Portrait Images

Pith reviewed 2026-05-24 11:41 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords portrait vectorization · diffusion curves · Poisson regions · hierarchical representation · image editing · generative model · retouching · color transfer

The pith

A three-tier vector representation converts raster portraits into editable diffusion curves, Poisson regions, and generated residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes automatically converting raster portrait images into a three-level hierarchical vector format. The base layer uses sparse diffusion curves to capture geometric features and low-frequency colors for tasks such as color transfer and expression editing. The middle layer encodes lighting with editable Poisson regions that users can adjust for highlights and shadows. The top layer adds pixel-sized Poisson regions and a trained generative model to handle high-frequency details and enable automatic retouching. This structure supports new blending operations based on the Laplace operator and is evaluated with an illumination-sensitive extension of the FLIP metric on the FFHQR dataset.
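
As a rough illustration of how sparse curve constraints can define smooth low-frequency color, the sketch below diffuses fixed values along a 1D scanline by solving the discrete Laplace equation. It is a simplification and not the paper's solver: real diffusion curves are 2D and carry colors on both sides of each curve, and the function name `diffuse_1d` and the Gauss-Seidel iteration are assumptions made for this example.

```python
def diffuse_1d(n, constraints, iterations=500):
    """Fill n samples by solving the discrete Laplace equation.

    `constraints` maps sample index -> fixed value (a stand-in for the
    colors stored on diffusion curves). Both end samples must be
    constrained so the problem has well-defined boundary values.
    """
    if 0 not in constraints or (n - 1) not in constraints:
        raise ValueError("both end samples must be constrained")
    u = [0.0] * n
    for i, value in constraints.items():
        u[i] = value
    # Gauss-Seidel sweeps: each free sample becomes the mean of its
    # two neighbours, which drives the discrete Laplacian to zero.
    for _ in range(iterations):
        for i in range(1, n - 1):
            if i not in constraints:
                u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


# Three "curve" samples: dark, bright, dark (hypothetical values).
colors = diffuse_1d(9, {0: 0.0, 4: 1.0, 8: 0.0})
print([round(c, 3) for c in colors])
```

Between constrained samples the solution converges to a linear ramp, which is the 1D analogue of the smooth color fill that diffusion curves produce.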

Core claim

The central claim is that organizing vector primitives into three tiers produces a representation that supports intuitive portrait editing while preserving image information. The tiers are sparse diffusion curves for salient features and low-frequency content, large editable Poisson regions for mid-frequency lighting, and pixel-sized Poisson regions plus a generative model for high-frequency residuals. The supported operations include color transfer, facial expression changes, highlight and shadow adjustments, and automatic retouching.

What carries the argument

The 3-tier hierarchical representation: sparse diffusion curves, editable Poisson regions, and pixel-sized Poisson regions (PRs) with a generative model for residuals.
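
One way to picture the three tiers is as a small container type, sketched below. All class and field names (`DiffusionCurve`, `PoissonRegion`, `HierarchicalPortrait`, `strength`) are hypothetical and chosen for illustration; the paper's actual data layout is not given in this summary.

```python
from dataclasses import dataclass, field, replace
from typing import List, Tuple

Point = Tuple[float, float]
Color = Tuple[float, float, float]


@dataclass(frozen=True)
class DiffusionCurve:
    """Base tier: a curve with colors on its left and right sides."""
    points: List[Point]
    left_colors: List[Color]
    right_colors: List[Color]


@dataclass(frozen=True)
class PoissonRegion:
    """Middle or top tier: a region with an (assumed scalar) strength."""
    boundary: List[Point]
    strength: float


@dataclass(frozen=True)
class HierarchicalPortrait:
    """Hypothetical container holding the three tiers separately."""
    base_curves: List[DiffusionCurve] = field(default_factory=list)
    lighting_regions: List[PoissonRegion] = field(default_factory=list)
    detail_regions: List[PoissonRegion] = field(default_factory=list)

    def with_lighting_scaled(self, factor: float) -> "HierarchicalPortrait":
        # Editing illumination touches only the middle tier; the base
        # curves and detail regions are carried over unchanged.
        scaled = [replace(r, strength=r.strength * factor)
                  for r in self.lighting_regions]
        return replace(self, lighting_regions=scaled)


portrait = HierarchicalPortrait(
    lighting_regions=[
        PoissonRegion(boundary=[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)],
                      strength=0.2)
    ]
)
softer = portrait.with_lighting_scaled(0.5)
```

Keeping the tiers in separate fields mirrors the claim that each level can be edited independently; here a lighting edit returns a new object and leaves the original untouched.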

If this is right

  • Diffusion curves enable semantic color transfer and facial expression editing.
  • Adjusting strength or shape of Poisson regions directly modifies illumination.
  • The generative model produces residuals for automatic retouching of details.
  • Linearity of the Laplace operator allows alpha blending, linear dodge, and linear burn in vector form for lighting edits.
  • The IS-FLIP metric evaluates edits by accounting for illumination, which the paper reports as more consistent with human perception than FLIP.
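
The linearity argument can be checked numerically in a small setting. The sketch below solves a 1D Poisson problem with zero boundary values and confirms that blending two Laplacian fields before solving gives the same result as blending the two solutions afterwards. The solver `solve_poisson_1d`, the sample fields, and the zero-boundary assumption are illustrative choices, not the paper's implementation. Linear burn is left out because this summary does not say how the paper treats its constant offset.

```python
def solve_poisson_1d(lap):
    """Solve u[i-1] - 2*u[i] + u[i+1] = lap[i] with u = 0 outside.

    Uses the Thomas algorithm for the tridiagonal system (1, -2, 1).
    """
    n = len(lap)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = lap[0] / -2.0
    for i in range(1, n):  # forward elimination
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (lap[i] - d_prime[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return u


# Hypothetical Laplacian fields: a negative value yields a bright bump
# (highlight), a positive value yields a dark dip (shadow).
highlight = [0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shadow = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]
alpha = 0.3

u_high = solve_poisson_1d(highlight)
u_shadow = solve_poisson_1d(shadow)

# Alpha blending: blend the Laplacians, then solve once.
blended = solve_poisson_1d(
    [alpha * a + (1 - alpha) * b for a, b in zip(highlight, shadow)])
expected_blend = [alpha * a + (1 - alpha) * b
                  for a, b in zip(u_high, u_shadow)]

# Linear dodge (add): sum the Laplacians, then solve once.
dodged = solve_poisson_1d([a + b for a, b in zip(highlight, shadow)])
expected_dodge = [a + b for a, b in zip(u_high, u_shadow)]

blend_error = max(abs(a - b) for a, b in zip(blended, expected_blend))
dodge_error = max(abs(a - b) for a, b in zip(dodged, expected_dodge))
print(blend_error, dodge_error)
```

Both errors should sit at floating-point round-off level, which is what lets such blend modes operate on the vector primitives directly rather than on rendered pixels.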

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hierarchy could be applied to other image categories if the primitive extraction generalizes beyond portraits.
  • Public release of code and models would allow testing on new editing workflows outside the reported tasks.
  • The approach might combine with existing raster tools to create hybrid editing systems.
  • Propagating the layers across video frames could extend the method to moving portraits.

Load-bearing premise

The chosen primitives of diffusion curves for low-frequency content, Poisson regions for lighting, and generated residuals for details can be extracted from and recombined into diverse portraits without visible artifacts or loss of essential information.

What would settle it

Recombining the three layers after an edit produces visible artifacts or mismatches on multiple varied portraits from the FFHQR dataset.

Figures

Figures reproduced from arXiv: 2205.11880 by Fei Hou, Linlin Liu, Qian Fu, Ying He.

Figure 1: Our method converts a portrait photo into a 3-level vector image, consisting of diffusion …
Figure 2: The algorithmic pipeline of our hierarchical vectorization. See the text for details.
Figure 3: Preprocessing. Given an input image I, we compute a retouched image I_r and a highlight-compensated retouched image I_rh. We highlight the differences in greyed boxes. Extracting diffusion curves (DCs). We adopt a two-step method for computing DCs. First, we apply the probability edge algorithm [18] to extract strong edges I_e in the retouched image I_rh and use the colors on the edges to define the boundary c…
Figure 4: Illustration of hierarchical PVG. The vector primitives are shown in the small insets.
Figure 5: Applying the linear blending functions to edit highlights and shadows. (b)-(c): We add …
Figure 6: An example of DC mask. (a) is the input image …
Figure 7: DC extraction. The percentages are the ratio of the number of pixels in DCs to the …
Figure 8: Generating residual PRs using deep learning. Clockwise from top left in the last column: …
Figure 9: The IS-FLIP-ct metric δE_I^ct is more effective than FLIP δE_F and IS-FLIP δE_I for evaluating color transfer results. The input image I and the color transferred result J have the same facial features but different colors. However, due to significant color change between J and I, both FLIP and IS-FLIP, which take color changes as part of the difference metric, yield large error values. IS-FLIP-ct, in contras…
Figure 10: Hair color transfer by modifying colors of sparse diffusion curves. (a) original image. (b)-(c) reference images. (d)-(e) hair color transfer. (f)-(i) hair highlighting effects. Light editing. Since our method separates illuminations from colors, the user can explicitly edit light using the PRs in the middle level. In …
Figure 11: Light editing. We apply linear dodge and linear burn to the middle-level PRs to modify …
Figure 12: Comparing our face retouching results with the …
Figure 13: Expression editing via changing the geometries of DCs in the base level. The first …
Figure 14: Comparison with deep sparse, smart contours [11] in image reconstruction. Our method …
Figure 15: Facial color transfer on a challenging case with a wide range of brightness. We show the …
Figure 16: A failed case. The vector primitives in the base level are not able to capture the facial …
Original abstract

Aiming at developing intuitive and easy-to-use portrait editing tools, we propose a novel vectorization method that can automatically convert raster images into a 3-tier hierarchical representation. The base layer consists of a set of sparse diffusion curves (DC) which characterize salient geometric features and low-frequency colors and provide means for semantic color transfer and facial expression editing. The middle level encodes specular highlights and shadows to large and editable Poisson regions (PR) and allows the user to directly adjust illumination via tuning the strength and/or changing shape of PR. The top level contains two types of pixel-sized PRs for high-frequency residuals and fine details such as pimples and pigmentation. We also train a deep generative model that can produce high-frequency residuals automatically. Thanks to the meaningful organization of vector primitives, editing portraits becomes easy and intuitive. In particular, our method supports color transfer, facial expression editing, highlight and shadow editing and automatic retouching. Thanks to the linearity of the Laplace operator, we introduce alpha blending, linear dodge and linear burn to vector editing and show that they are effective in editing highlights and shadows. To quantitatively evaluate the results, we extend the commonly used FLIP metric (which measures differences between two images) by considering illumination. The new metric, called illumination-sensitive FLIP or IS-FLIP, can effectively capture the salient changes in color transfer results, and is more consistent with human perception than FLIP and other quality measures on portrait images. We evaluate our method on the FFHQR dataset and show that our method is effective for common portrait editing tasks, such as retouching, light editing, color transfer and expression editing. We will make the code and trained models publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a 3-tier hierarchical vectorization for portrait images: a base layer of sparse diffusion curves (DCs) for salient geometry and low-frequency colors, a middle layer of editable Poisson regions (PRs) for specular highlights and shadows, and a top layer of pixel-sized PRs for high-frequency residuals and details, augmented by a deep generative model to synthesize residuals. This structure is claimed to support intuitive editing operations including semantic color transfer, facial expression editing, highlight/shadow adjustment via PR strength/shape, and automatic retouching. Linearity of the Laplace operator is used to introduce alpha blending, linear dodge, and linear burn for vector editing. A new illumination-sensitive FLIP metric (IS-FLIP) is introduced to better capture color-transfer changes, and the method is evaluated on the FFHQR dataset with the claim that it is effective for common portrait editing tasks. Code and models will be released.

Significance. If the extraction and recombination claims hold with low artifact rates across diverse inputs, the work would offer a practically useful advance in vector-based portrait editing by organizing primitives into semantically meaningful, independently editable layers rather than flat vectorizations. The planned public release of code and models is a clear strength that would aid reproducibility. The IS-FLIP extension addresses a relevant gap in evaluating illumination-aware edits, though its added value depends on the missing human-judgment validation.

major comments (2)
  1. [Abstract] The central claim that the 3-tier representation (sparse DCs + editable PRs + generated pixel-sized PRs) can be automatically extracted from any portrait and recombined (with or without edits) while preserving salient information and avoiding visible artifacts rests on unshown implementation choices; no reconstruction error metrics, no ablation studies on layer separation, and no consistency checks between edited lower layers and the generative residual model are supplied.
  2. [Abstract] The assertion that IS-FLIP is 'more consistent with human perception than FLIP and other quality measures on portrait images' is load-bearing for the quantitative evaluation of editing tasks, yet the abstract supplies neither the validation procedure against human judgments nor any comparative tables on the FFHQR dataset.
minor comments (1)
  1. [Abstract] The abstract states that the method 'supports color transfer, facial expression editing, highlight and shadow editing and automatic retouching' but does not clarify whether these operations are demonstrated with before/after examples or only described at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the abstract and related sections to better highlight the supporting evidence from the full paper.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the 3-tier representation (sparse DCs + editable PRs + generated pixel-sized PRs) can be automatically extracted from any portrait and recombined (with or without edits) while preserving salient information and avoiding visible artifacts rests on unshown implementation choices; no reconstruction error metrics, no ablation studies on layer separation, and no consistency checks between edited lower layers and the generative residual model are supplied.

    Authors: The full manuscript reports reconstruction error metrics on FFHQR, includes ablation studies on the contribution of each hierarchical layer, and analyzes consistency between edited base/middle layers and the generative residual model in the results and supplementary material. The abstract summarizes these without including specific numbers or figures. We will revise the abstract to briefly reference the quantitative evaluations and key implementation details supporting the extraction and recombination claims.
    Revision: yes

  2. Referee: [Abstract] The assertion that IS-FLIP is 'more consistent with human perception than FLIP and other quality measures on portrait images' is load-bearing for the quantitative evaluation of editing tasks, yet the abstract supplies neither the validation procedure against human judgments nor any comparative tables on the FFHQR dataset.

    Authors: The manuscript body contains comparative tables on FFHQR and details the IS-FLIP formulation to capture illumination-sensitive differences. The consistency claim with human perception derives from these metric comparisons and visual analysis rather than a formal user study. We will revise the abstract to reference the evaluation tables and procedure, and qualify the wording to reflect the basis of the claim.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: method construction is independent of claimed outputs

Full rationale

The abstract and description present a hierarchical decomposition into diffusion curves, Poisson regions, and residuals with a trained generative model, but contain no equations, fitted parameters, or self-citations that reduce the editing capabilities or IS-FLIP metric to re-expressions of their own inputs by construction. The representation is built bottom-up from image primitives, and the metric is described as an explicit extension of FLIP without load-bearing self-reference. This is the common case of a self-contained technical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the approach relies on standard graphics primitives (diffusion curves, Poisson regions) and a trained generative model whose training details are not given.

pith-pipeline@v0.9.0 · 5833 in / 1066 out tokens · 23423 ms · 2026-05-24T11:41:54.181591+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  [1] M. Afifi, M. A. Brubaker, and M. S. Brown. Histogan: Controlling colors of gan-generated and real images via color histograms. In IEEE CVPR, 2021.
  [2] P. Andersson, J. Nilsson, T. Akenine-Möller, M. Oskarsson, K. Åström, and M. D. Fairchild. FLIP: A Difference Evaluator for Alternating Images. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 3(2):15:1–15:23, 2020.
  [3] D. Bang and H. Shim. Mggan: Solving mode collapse using manifold-guided training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2347–2356, 2021.
  [4] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM TOG, 33(4):1–12, 2014.
  [5] S. Bi, X. Han, and Y. Yu. An l1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM TOG, 34(4):1–12, 2015.
  [6] S. Boyé, P. Barla, and G. Guennebaud. A vectorial solver for free-form vector gradients. ACM TOG, 31(6):1–9, 2012.
  [7] J. Canny. A computational approach to edge detection. IEEE PAMI, (6):679–698, 1986.
  [8] J. F. Canny. A computational approach to edge detection. IEEE PAMI, PAMI-8(6):679–698, 1986.
  [9] K.-W. Chen, Y.-S. Luo, Y.-C. Lai, Y.-L. Chen, C.-Y. Yao, H.-K. Chu, and T.-Y. Lee. Image vectorization with real-time thin-plate spline. IEEE Transactions on Multimedia, 22(1):15–29, 2019.
  [10] Z. Cheng, Y. Zheng, S. You, and I. Sato. Non-local intrinsic decomposition with near-infrared priors. In IEEE ICCV, pages 2521–2530, 2019.
  [11] T. Dekel, C. Gan, D. Krishnan, C. Liu, and W. T. Freeman. Sparse, smart contours to represent and edit images. In IEEE CVPR, pages 3511–3520, 2018.
  [12] J.-D. Favreau, F. Lafarge, and A. Bousseau. Photo2clipart: image abstraction and vectorization using layered linear gradients. ACM TOG, 36(6):1–11, 2017.
  [13] M. Finch, J. Snyder, and H. Hoppe. Freeform vector graphics with controlled thin-plate splines. ACM TOG, 30(6):1–10, 2011.
  [14] Q. Fu, Y. He, F. Hou, J. Zhang, A. Zeng, and Y.-J. Liu. Vectorization based color transfer for portrait images. Computer-Aided Design, 115:111–121, 2019.
  [15] F. Hou, Q. Sun, Z. Fang, Y. Liu, S. Hu, H. Qin, A. Hao, and Y. He. Poisson vector graphics (PVG). IEEE TVCG, 26(2):1361–1371, 2020.
  [16] Y.-K. Lai, S.-M. Hu, and R. R. Martin. Automatic and topology-preserving gradient mesh generation for image vectorization. ACM Transactions on Graphics (TOG), 28(3):1–8, 2009.
  [17] C.-H. Lee, Z. Liu, L. Wu, and P. Luo. Maskgan: Towards diverse and interactive facial image manipulation. In IEEE CVPR, pages 5549–5558, 2020.
  [18] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient closed-form solution to generalized boundary detection. In ECCV, pages 516–529. Springer, 2012.
  [19] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. arXiv:1705.01088, 2017.
  [20] Z. Liao, H. Hoppe, D. Forsyth, and Y. Yu. A subdivision-based representation for vector image editing. IEEE transactions on visualization and computer graphics, 18(11):1858–1867, 2012.
  [21] S. Lu, W. Jiang, X. Ding, C. S. Kaplan, X. Jin, F. Gao, and J. Chen. Depth-aware image vectorization and editing. The Visual Computer, 35(6-8):1027–1039, 2019.
  [22] Z. Lu, T. Hu, L. Song, Z. Zhang, and R. He. Conditional expression synthesis with face parsing transformation. In ACM MM, pages 1083–1091, 2018.
  [23] A. Orzan, A. Bousseau, H. Winnemöller, P. Barla, J. Thollot, and D. Salesin. Diffusion curves: A vector representation for smooth-shaded images. ACM TOG, 27(3):1–8, 2008.
  [24] X. S. Poma, E. Riba, and A. Sappa. Dense extreme inception network: Towards a robust cnn model for edge detection. In IEEE WACV, pages 1923–1932, 2020.
  [25] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces 'in the wild'. In IEEE CVPR, pages 6296–6305, 2018.
  [26] A. Shafaei, J. J. Little, and M. Schmidt. Autoretouch: Automatic professional face retouching. In IEEE WACV, pages 990–998, January 2021.
  [27] H.-L. Shen and Z.-H. Zheng. Real-time highlight removal using intensity ratio. Applied Optics, 52(19):4483–4493, 2013.
  [28] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In IEEE CVPR, pages 8242–8250, 2018.
  [29] Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand. Style transfer for headshot portraits. ACM TOG, 33(4):148, 2014.
  [30] Z. Shu, S. Hadap, E. Shechtman, K. Sunkavalli, S. Paris, and D. Samaras. Portrait lighting transfer using a mass transport approach. ACM TOG, 36(4):1, 2017.
  [31] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In IEEE CVPR, pages 5541–5550, 2017.
  [32] J. Sun, L. Liang, F. Wen, and H.-Y. Shum. Image vectorization using optimized gradient meshes. ACM TOG, 26(3):Article 11, 2007.
  [33] H. Thanh-Tung and T. Tran. Catastrophic forgetting and mode collapse in gans. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.
  [34] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE CVPR, 2018.
  [35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.
  [36] G. Xie, X. Sun, X. Tong, and D. Nowrouzezahrai. Hierarchical diffusion curves for accurate automatic image vectorization. ACM TOG, 33(6):1–11, 2014.
  [37] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In IEEE CVPR, pages 10687–10698, 2020.
  [38] X. Zhang, S. Fanello, Y.-T. Tsai, T. Sun, T. Xue, R. Pandey, S. Orts-Escolano, P. Davidson, C. Rhemann, P. Debevec, et al. Neural light transport for relighting and view synthesis. arXiv:2008.03806, 2020.
  [39] S. Zhao, F. Durand, and C. Zheng. Inverse diffusion curves using shape optimization. IEEE TVCG, 24(7):2153–2166, 2017.
  [40] H. Zhou, J. Zheng, and L. Wei. Representing images using curvilinear feature driven subdivision surfaces. IEEE transactions on image processing, 23(8):3268–3280, 2014.
  [41] H. Zhou, S. Hadap, K. Sunkavalli, and D. W. Jacobs. Deep single-image portrait relighting. In IEEE ICCV, pages 7194–7202, 2019.
  [42] H. Zhou, X. Yu, and D. W. Jacobs. Glosh: Global-local spherical harmonics for intrinsic image decomposition. In IEEE ICCV, pages 7820–7829, 2019.
  [43] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. Le. Rethinking pre-training and self-training. NeurIPS, 33, 2020.