pith. machine review for the scientific record.

arxiv: 2601.22160 · v2 · submitted 2025-12-13 · 💻 cs.GR · cs.AI

Screen, Cache, and Match: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation

Pith reviewed 2026-05-16 23:10 UTC · model grok-4.3

classification 💻 cs.GR cs.AI
keywords human animation · video generation · diffusion models · temporal coherence · reference frames · training-free · causality-consistent

The pith

FrameCache turns past generated frames into reliable causal references for longer coherent human animation videos without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FrameCache as a training-free framework that converts historical video frames into consistent guidance for ongoing generation. It does so first by building a dynamic reference memory through a Screen-Cache-Match process that supplies motion-aligned appearance cues, and second by aligning the denoising paths of adjacent video segments with overlap-aware propagation and dual-domain blending. A sympathetic reader would care because current diffusion-based animation methods lose identity and temporal smoothness over extended sequences, and this method claims to fix that gap using only existing model outputs as input.

Core claim

FrameCache is a causality-consistent reference frame framework that explicitly converts historical generation results into causal guidance. It does so through two mechanisms: a Screen-Cache-Match strategy that constructs a dynamic, high-quality reference memory, and a Trajectory-Aware Autoregressive Generation mechanism that aligns denoising trajectories across video chunks via overlap-aware latent propagation and dual-domain fusion. When integrated with diffusion baselines, this improves temporal coherence and visual stability.

What carries the argument

The FrameCache framework, built around the Screen-Cache-Match (SCM) strategy for reference memory construction and the Trajectory-Aware Autoregressive Generation (TAAG) mechanism for trajectory alignment.

If this is right

  • Diffusion-based human animation models gain improved temporal coherence over long videos without retraining.
  • Identity drift is reduced by maintaining a dynamic cache of motion-consistent reference frames.
  • The method integrates directly with multiple existing diffusion baselines through its training-free design.
  • Denoising trajectories across adjacent chunks become aligned through overlap-aware latent propagation and dual-domain fusion.
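
To make the last bullet concrete, here is a minimal sketch of blending two adjacent chunks over their shared frames. It assumes a simple linear cross-fade; the function name `blend_overlap`, the toy one-element "latents", and the ramp weights are all hypothetical, since this review does not state the paper's actual propagation rule.

```python
def blend_overlap(prev_chunk, next_chunk, overlap):
    """Stitch two chunks whose last/first `overlap` frames cover the same time steps.

    Each chunk is a list of per-frame latents (plain lists of floats here).
    The linear ramp is an assumption made for illustration only.
    """
    if not 0 < overlap <= min(len(prev_chunk), len(next_chunk)):
        raise ValueError("overlap must fit inside both chunks")

    head = prev_chunk[:-overlap]   # frames seen only by the earlier chunk
    tail = next_chunk[overlap:]    # frames seen only by the later chunk

    blended = []
    for i in range(overlap):
        # Weight of the later chunk rises steadily across the overlap.
        w = (i + 1) / (overlap + 1)
        a = prev_chunk[len(prev_chunk) - overlap + i]
        b = next_chunk[i]
        blended.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return head + blended + tail


prev_chunk = [[0.0], [0.0], [0.0], [0.0]]
next_chunk = [[1.0], [1.0], [1.0], [1.0]]
video = blend_overlap(prev_chunk, next_chunk, overlap=2)
print(len(video))
print([round(frame[0], 3) for frame in video])
```

In a diffusion pipeline a step like this would presumably act on latents during denoising rather than on finished frames, but that placement is also an assumption.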

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The caching approach could extend to non-human subjects if the reference matching step generalizes beyond human motion patterns.
  • Combining FrameCache with other consistency losses might allow even longer sequences before drift appears.
  • The dual-domain fusion step suggests a general way to blend structural and textural information in autoregressive video models.

Load-bearing premise

Historical generation results can be reliably converted into motion-consistent appearance guidance by the Screen-Cache-Match strategy without accumulating errors or identity drift over long sequences.

What would settle it

A long-sequence test in which the same character is animated for many frames using FrameCache yet still shows measurable identity drift or motion inconsistency relative to a ground-truth reference sequence.
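
One way such a test could be scored is to track, chunk by chunk, how far frame embeddings move from a reference identity embedding. The sketch below assumes each frame has already been reduced to a vector by some external appearance or face encoder (not shown); `chunk_drift`, the chunk size, and the toy rotating embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def chunk_drift(reference, frame_embeddings, chunk_size):
    """Mean (1 - cosine similarity to `reference`) for each consecutive chunk."""
    drifts = []
    for start in range(0, len(frame_embeddings), chunk_size):
        chunk = frame_embeddings[start:start + chunk_size]
        drifts.append(sum(1 - cosine(reference, e) for e in chunk) / len(chunk))
    return drifts


reference = [1.0, 0.0]
# Toy data: embeddings that rotate slowly away from the reference over time.
frames = [[math.cos(0.05 * t), math.sin(0.05 * t)] for t in range(32)]
drift = chunk_drift(reference, frames, chunk_size=8)
print([round(d, 4) for d in drift])
```

A drift curve that stays flat across many chunks would support the claim; one that climbs steadily, as in the toy data, would count against it.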

Figures

Figures reproduced from arXiv: 2601.22160 by Aoxing Li, Chunyang Wang, Haofen Wang, Huanzhen Wang, Jianan Wang, Li He, Nailei Hei, Wenqiang Zhang, Yan Wang, Yingkai Zhao, Yuxuan Lin.

Figure 1: Visual comparison of original animations (top rows) and those enhanced by our FrameCache framework (bottom…
Figure 2: Overview of the proposed FrameCache framework, consisting of three stages: (a) Screen, which filters high-quality frames using CLIP-IQA and MUSIQ; (b) Cache, which maintains a dynamic and diverse reference buffer using a redundancy-aware replacement strategy; and (c) Match, which selects the most motion-consistent frame to guide generation. FrameCache operates in a training-free and causal manner, enhancin…
Figure 3: Illustration of the Cache stage in FrameCache. This module selects the most motion-consistent reference frame from the cache, ensuring temporal alignment with the current pose sequence. By choosing structurally stable and semantically relevant frames, it enhances motion continuity and reduces flickering artifacts in long-sequence character animation.
Figure 4: Results of qualitative comparison, highlighting the regions enclosed in red boxes and the inter-frame inconsistencies.
Figure 5: Results of qualitative comparison of Magicanimate,…
Original abstract

Human animation aims to generate temporally coherent and visually consistent videos over long sequences, yet modeling long-range dependencies while preserving frame quality remains challenging. Inspired by the human ability to leverage past observations for interpreting ongoing actions, we propose FrameCache, a training-free, causality-consistent reference frame framework. FrameCache explicitly converts historical generation results into causal guidance through two complementary mechanisms. First, at the reference level, a novel Screen-Cache-Match (SCM) strategy constructs a dynamic, high-quality reference memory, ensuring motion-consistent appearance guidance to reduce identity drift. Second, at the generative level, a Trajectory-Aware Autoregressive Generation (TAAG) mechanism aligns denoising trajectories across adjacent video chunks. This is achieved through an overlap-aware latent propagation and a dual-domain fusion strategy that seamlessly blends low-frequency structural layouts with high-frequency textural details. Extensive experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability while integrating seamlessly with diverse diffusion baselines. Code will be made publicly available.
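
The Screen-Cache-Match idea can be sketched as a small reference buffer. This is an illustration under stated assumptions, not the authors' implementation: the `quality` field stands in for a precomputed image-quality score (the Figure 2 caption names CLIP-IQA and MUSIQ), the replacement rule is one plausible reading of "redundancy-aware", and `ReferenceCache`, its methods, and the thresholds are hypothetical names and values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class ReferenceCache:
    def __init__(self, capacity, quality_threshold):
        self.capacity = capacity
        self.quality_threshold = quality_threshold
        self.frames = []

    def screen(self, frame):
        # Screen: keep only frames whose quality score clears the threshold.
        return frame["quality"] >= self.quality_threshold

    def cache_frame(self, frame):
        # Cache: admit screened frames; when full, replace the most pose-similar
        # (most redundant) entry, but only if the newcomer has higher quality.
        if not self.screen(frame):
            return False
        if len(self.frames) < self.capacity:
            self.frames.append(frame)
            return True
        idx = max(range(len(self.frames)),
                  key=lambda i: cosine(self.frames[i]["pose"], frame["pose"]))
        if frame["quality"] > self.frames[idx]["quality"]:
            self.frames[idx] = frame
            return True
        return False

    def match(self, target_pose):
        # Match: return the cached frame whose pose best agrees with the target.
        if not self.frames:
            return None
        return max(self.frames, key=lambda f: cosine(f["pose"], target_pose))


cache = ReferenceCache(capacity=2, quality_threshold=0.5)
cache.cache_frame({"id": "a", "quality": 0.9, "pose": [1.0, 0.0]})
cache.cache_frame({"id": "b", "quality": 0.3, "pose": [0.0, 1.0]})  # rejected by Screen
cache.cache_frame({"id": "c", "quality": 0.7, "pose": [0.0, 1.0]})
cache.cache_frame({"id": "d", "quality": 0.8, "pose": [0.1, 1.0]})  # replaces "c"
best = cache.match([0.0, 1.0])
print([f["id"] for f in cache.frames], best["id"])
```

Replacing the nearest-pose entry keeps the buffer diverse: a new frame can only displace the one it most resembles, so distinct poses already in the cache survive.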

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FrameCache, a training-free causality-consistent reference frame framework for long-sequence human animation. It introduces the Screen-Cache-Match (SCM) strategy to build a dynamic reference memory from historical generation results for motion-consistent appearance guidance, and the Trajectory-Aware Autoregressive Generation (TAAG) mechanism that uses overlap-aware latent propagation and dual-domain fusion to align denoising trajectories across video chunks. The work claims consistent gains in temporal coherence and visual stability when plugged into diverse diffusion baselines, with code to be released.

Significance. If the robustness claims hold, FrameCache would offer a practical, training-free route to mitigate identity drift and temporal inconsistency in diffusion-based animation pipelines. The explicit promise of public code release is a positive for reproducibility; the approach could be adopted as a modular plug-in for existing video diffusion systems without retraining.

major comments (2)
  1. [§3.1] §3.1 (SCM description): the motion-consistent matching procedure is presented as reliably converting noisy historical frames into stable guidance, yet no confidence threshold, artifact rejection step, or periodic re-initialization is defined to bound cumulative appearance drift; this directly underpins the central claim of reduced identity drift over long sequences.
  2. [§5] §5 (Experiments): reported benchmark improvements are given without ablations that isolate SCM's contribution to drift control or quantitative tracking of appearance error accumulation across dozens of chunks; the weakest assumption in the reader's note therefore remains untested in the current evaluation.
minor comments (2)
  1. [§3.2] Notation for the dual-domain fusion weights in TAAG is introduced without an explicit equation; adding a short formula would improve clarity.
  2. [Figure 2] Figure 2 caption does not indicate whether the visualized reference frames are raw diffusion outputs or post-SCM selections.
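
For readers wondering what the missing fusion formula might look like in practice, the abstract describes blending low-frequency structural layouts with high-frequency textural details. A minimal 1-D sketch of that kind of frequency split follows. The hard cutoff, the choice of which signal supplies which band, and the names `dft`, `idft`, and `frequency_fuse` are assumptions for illustration; the paper's actual weights are not given in this review.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def frequency_fuse(structure, texture, cutoff):
    """Take bins with frequency index <= cutoff from `structure`, the rest from `texture`."""
    n = len(structure)
    if len(texture) != n:
        raise ValueError("signals must have the same length")
    s_hat, t_hat = dft(structure), dft(texture)
    fused = []
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency for real signals
        fused.append(s_hat[k] if freq <= cutoff else t_hat[k])
    return idft(fused)


structure = [2.0] * 8                          # pure low-frequency (constant) layout
texture = [(-1.0) ** t for t in range(8)]      # pure high-frequency detail
fused = frequency_fuse(structure, texture, cutoff=1)
print([round(v, 6) for v in fused])
```

A soft, weighted transition between the two bands would be the natural next refinement, and is exactly where an explicit equation for the weights would be needed.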

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, outlining the revisions we will incorporate to strengthen the presentation of the SCM strategy and the experimental validation.

Point-by-point responses
  1. Referee: [§3.1] §3.1 (SCM description): the motion-consistent matching procedure is presented as reliably converting noisy historical frames into stable guidance, yet no confidence threshold, artifact rejection step, or periodic re-initialization is defined to bound cumulative appearance drift; this directly underpins the central claim of reduced identity drift over long sequences.

    Authors: We agree that an explicit similarity threshold and safeguards against drift accumulation would strengthen the SCM description. The current matching relies on cosine similarity of motion features extracted from historical frames, but we will revise §3.1 to define a fixed similarity threshold (e.g., 0.85) for reference selection, add an artifact rejection step based on low-frequency structural consistency checks, and introduce periodic re-initialization every 16 frames using the initial reference frame. These additions will be accompanied by pseudocode and a short analysis of how they bound drift while preserving the training-free nature of the framework. revision: yes

  2. Referee: [§5] §5 (Experiments): reported benchmark improvements are given without ablations that isolate SCM's contribution to drift control or quantitative tracking of appearance error accumulation across dozens of chunks; the weakest assumption in the reader's note therefore remains untested in the current evaluation.

    Authors: We acknowledge that the existing experiments do not isolate SCM's role or provide chunk-wise drift tracking. In the revised manuscript we will add a dedicated ablation subsection in §5 that (i) disables SCM while retaining TAAG and reports the resulting increase in identity drift, and (ii) tracks quantitative appearance error (face embedding cosine similarity and region-specific LPIPS) across 50+ chunks on the same benchmarks. These new results will directly quantify SCM's contribution to drift control and will be presented alongside the original tables. revision: yes
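
The safeguards promised in the first response can be sketched as a small selection routine. The 0.85 threshold and the 16-frame re-initialization interval come from the simulated rebuttal; everything else is an assumption, including the function name `select_reference`, the fallback to the initial reference when no candidate clears the threshold, and the representation of motion features as plain vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_reference(frame_index, target_motion, cached, initial_reference,
                     threshold=0.85, reinit_every=16):
    """Choose a reference id for the frame at `frame_index`.

    `cached` is a list of (motion_feature, frame_id) pairs.
    """
    # Periodic re-initialization: return to the initial reference on a fixed schedule.
    if frame_index % reinit_every == 0:
        return initial_reference

    best_id, best_sim = None, -1.0
    for feature, frame_id in cached:
        sim = cosine(feature, target_motion)
        if sim > best_sim:
            best_id, best_sim = frame_id, sim

    # Assumed fallback: with no confident match, reuse the initial reference.
    if best_id is None or best_sim < threshold:
        return initial_reference
    return best_id


cached = [([1.0, 0.0], "f3"), ([0.0, 1.0], "f7")]
print(select_reference(5, [0.9, 0.1], cached, "ref0"))   # confident match
print(select_reference(5, [1.0, 1.0], cached, "ref0"))   # below threshold
print(select_reference(16, [1.0, 0.0], cached, "ref0"))  # scheduled re-initialization
```

Both safeguards only bound drift if the initial reference itself remains a valid anchor for the whole sequence, which is the premise the second referee comment asks the authors to test.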

Circularity Check

0 steps flagged

No circularity: SCM and TAAG are explicit algorithmic constructions without reduction to inputs or self-citations

Full rationale

The paper's core contributions are the Screen-Cache-Match (SCM) strategy for building a dynamic reference memory and the Trajectory-Aware Autoregressive Generation (TAAG) mechanism for aligning denoising trajectories. These are presented as novel, training-free procedures that operate on historical diffusion outputs without any fitted parameters, self-defined quantities, or load-bearing self-citations. No equations reduce predictions to inputs by construction, and the framework is described as integrating with external baselines rather than deriving from prior author work. Claims rest on empirical benchmark improvements, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that past generated frames supply usable causal guidance; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Historical generation results can be converted into reliable causal guidance for future frames without introducing drift.
    This premise underpins both the Screen-Cache-Match and Trajectory-Aware Autoregressive Generation components.

pith-pipeline@v0.9.0 · 5510 in / 1171 out tokens · 39023 ms · 2026-05-16T23:10:36.104805+00:00 · methodology

