Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 08:29 UTC · model grok-4.3
The pith
Tiled prompts overcome prompt misguidance in super-resolution by conditioning each latent tile on its own local text description.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single global caption used with latent tiling causes prompt misguidance: coarse text misses localized details (omission errors) and supplies locally irrelevant guidance (commission errors). Tiled Prompts generates a tile-specific prompt for each latent tile and performs super-resolution under the locally text-conditioned posterior, resolving both error types with minimal overhead. Experiments on high-resolution real-world images and videos confirm consistent gains in perceptual quality and fidelity together with fewer hallucinations and tile-level artifacts than global-prompt baselines.
What carries the argument
The Tiled Prompts framework, which generates and applies a distinct text prompt to each latent tile rather than conditioning every tile on a single global caption.
If this is right
- Consistent gains in perceptual quality and fidelity appear on real high-resolution images and videos.
- Hallucinations and tile-level artifacts decrease compared with global-prompt baselines.
- The same tiled-prompt procedure unifies image and video super-resolution without separate pipelines.
- Overhead remains minimal because prompt generation is performed once per tile before diffusion.
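The bullets above presuppose a concrete per-tile loop: caption each latent tile, denoise it under its own prompt, and blend overlapping tiles. As a rough sketch of that loop (all function names here are hypothetical stand-ins, since the summary does not specify the paper's exact pipeline; `caption_fn` plays the role of a VLM captioner and `denoise_fn` a text-conditioned diffusion step):

```python
import numpy as np

def tile_coords(size, tile, stride):
    """Start offsets covering [0, size) with a final tile flush to the edge."""
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts

def tiled_prompt_sr(latent, tile, overlap, caption_fn, denoise_fn):
    """Sketch of per-tile prompting on a 2-D latent: each tile gets its own
    caption and is refined under that local prompt; overlapping results are
    blended with uniform averaging. Illustrative only, not the paper's code."""
    h, w = latent.shape
    stride = tile - overlap
    out = np.zeros_like(latent, dtype=float)
    weight = np.zeros_like(latent, dtype=float)
    for y in tile_coords(h, tile, stride):
        for x in tile_coords(w, tile, stride):
            patch = latent[y:y + tile, x:x + tile]
            prompt = caption_fn(patch)           # tile-specific text prompt
            refined = denoise_fn(patch, prompt)  # locally text-conditioned SR
            out[y:y + tile, x:x + tile] += refined
            weight[y:y + tile, x:x + tile] += 1.0
    return out / weight
```

With an identity `denoise_fn` the blended output reproduces the input, which is a convenient sanity check that the tiling and averaging introduce no seams by themselves.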
Where Pith is reading between the lines
- The same local-prompt idea could be tested on other tiled diffusion tasks such as inpainting or outpainting where global text also produces local inconsistencies.
- Automatic methods for creating the per-tile prompts might be swapped in without changing the rest of the pipeline, offering a modular upgrade path.
- Real-time or streaming video applications could benefit if the tile prompts are pre-computed or predicted from neighboring frames.
Load-bearing premise
Tile-specific prompts can be generated accurately enough to correct both omission and commission errors in local regions while adding only minimal overhead.
What would settle it
A controlled test set of high-resolution images and videos, scored with the same perceptual and fidelity metrics for both methods: if tiled-prompt outputs showed no measurable reduction in hallucination rate or tile-level artifacts relative to the global-prompt baseline, the central claim would be refuted.
read the original abstract
Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, and modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions. In practice, a single global caption is used with the latent tiling, often causing prompt misguidance. Specifically, a coarse global prompt often misses localized details (errors of omission) and provides locally irrelevant guidance (errors of commission) which leads to substandard results at the tile level. To solve this, we propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors to resolve prompt misguidance with minimal overhead. Our experiments on high resolution real-world images and videos show that tiled prompts bring consistent gains in perceptual quality and fidelity, while reducing hallucinations and tile-level artifacts that can be found in global-prompt baselines. Project Page: https://bryanswkim.github.io/tiled-prompts/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Tiled Prompts, a unified framework for diffusion-based image and video super-resolution that replaces a single global caption with per-tile text prompts generated for each latent tile. This is intended to correct prompt misguidance arising from errors of omission (missed local details) and commission (locally irrelevant guidance) when latent tiling is used to scale to high resolutions, while incurring only minimal overhead. Experiments on high-resolution real-world images and videos are reported to yield consistent gains in perceptual quality and fidelity together with fewer hallucinations and tile-level artifacts relative to global-prompt baselines.
Significance. If the quantitative results and ablations hold, the work supplies a lightweight, practical fix for a recurring limitation in tiled high-resolution generation pipelines. The unified treatment of images and video, together with the emphasis on local semantic conditioning, could be adopted quickly in existing SR workflows and may reduce the need for heavier architectural changes.
major comments (3)
- [Abstract] Abstract: the central claim of 'consistent gains in perceptual quality and fidelity' and 'reducing hallucinations and tile-level artifacts' is stated without any numerical metrics, dataset sizes, or baseline comparisons; the full manuscript must supply tables reporting LPIPS, FID, or human preference scores together with statistical significance to substantiate the claim.
- [§3] §3 (Method): the mechanism for generating tile-specific prompts is described only at a high level; it is unclear whether a VLM is applied independently per tile, how tile context or overlap is handled, and whether any verification of prompt accuracy (e.g., CLIP similarity to local ground-truth regions) is performed. This detail is load-bearing because inaccurate local prompts would fail to resolve omission/commission errors and could introduce new artifacts.
- [§4] §4 (Experiments): no ablation isolating the contribution of prompt quality from the mere act of tiling is reported. Without such controls it remains possible that observed improvements stem from increased conditioning diversity rather than the proposed correction of misguidance, undermining the causal link asserted in the abstract.
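On the statistical-significance point in the first major comment, a minimal paired analysis needs nothing heavier than an exact sign test on per-image metric differences. The sketch below is my illustration, not anything from the paper: `deltas` holds per-image values such as baseline LPIPS minus tiled-prompt LPIPS, so positive entries favor tiled prompts.

```python
import math

def paired_sign_test(deltas):
    """Two-sided exact sign test under H0: P(delta > 0) = 0.5.
    Ties (delta == 0) are dropped, the usual convention."""
    wins = sum(d > 0 for d in deltas)
    n = sum(d != 0 for d in deltas)
    if n == 0:
        return 1.0
    # exact binomial tail: probability of an outcome at least this lopsided
    tail = sum(math.comb(n, k) for k in range(min(wins, n - wins) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Ten images all favoring tiled prompts already yield p ≈ 0.002, so even modest test sets can support the kind of significance statement the report asks for.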
minor comments (2)
- Figure captions should explicitly state the resolution of the input images/videos and the number of tiles used so that readers can reproduce the exact experimental setting.
- [Abstract] The project-page URL in the abstract should be checked for accessibility and should contain the promised code and model weights.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of Tiled Prompts. We address each major comment point by point below, indicating planned revisions where they strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of 'consistent gains in perceptual quality and fidelity' and 'reducing hallucinations and tile-level artifacts' is stated without any numerical metrics, dataset sizes, or baseline comparisons; the full manuscript must supply tables reporting LPIPS, FID, or human preference scores together with statistical significance to substantiate the claim.
Authors: The full manuscript already supplies the requested tables in §4, including LPIPS, FID, and human preference scores with baseline comparisons on datasets of specified sizes. To directly address the abstract's conciseness, we will revise it to include key quantitative highlights (e.g., average metric gains) drawn from those tables while remaining within length limits. revision: yes
-
Referee: [§3] §3 (Method): the mechanism for generating tile-specific prompts is described only at a high level; it is unclear whether a VLM is applied independently per tile, how tile context or overlap is handled, and whether any verification of prompt accuracy (e.g., CLIP similarity to local ground-truth regions) is performed. This detail is load-bearing because inaccurate local prompts would fail to resolve omission/commission errors and could introduce new artifacts.
Authors: We agree the current description in §3 is high-level. The framework applies a VLM independently to each latent tile's visual features, incorporates overlapping tile context via shared global scene information, and blends adjacent prompts for consistency. No explicit post-generation verification (such as CLIP similarity checks) is performed. We will expand §3 with additional procedural details, pseudocode, and a brief discussion of prompt accuracy to make the mechanism fully transparent. revision: yes
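The "blends adjacent prompts for consistency" step is described only verbally. One plausible mechanism, and purely an assumption on my part (applied here to tile outputs rather than to the prompts or their embeddings), is a linear cross-fade over the shared overlap region:

```python
import numpy as np

def ramp_blend(left, right, overlap):
    """Cross-fade two horizontally adjacent tile outputs over their shared
    `overlap` columns with a linear ramp (0 = keep left, 1 = keep right).
    A hypothetical reading of the blending step, not the paper's code."""
    w = np.linspace(0.0, 1.0, overlap)
    mixed = (1 - w) * left[:, -overlap:] + w * right[:, :overlap]
    return np.concatenate([left[:, :-overlap], mixed, right[:, overlap:]], axis=1)
```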
-
Referee: [§4] §4 (Experiments): no ablation isolating the contribution of prompt quality from the mere act of tiling is reported. Without such controls it remains possible that observed improvements stem from increased conditioning diversity rather than the proposed correction of misguidance, undermining the causal link asserted in the abstract.
Authors: We respectfully note that the existing experimental design already isolates the contribution of prompt quality. All reported baselines employ identical latent tiling and diffusion pipelines; the sole variable is global versus per-tile prompt conditioning. Observed gains in perceptual quality, fidelity, and reduced tile artifacts are therefore attributable to the correction of omission/commission errors rather than tiling or diversity effects alone. We will add a clarifying sentence in §4 to make this isolation explicit. revision: no
Circularity Check
No circularity: empirical framework with independent experimental support
full rationale
The paper introduces Tiled Prompts as a practical framework for addressing prompt misguidance via per-tile prompt generation in diffusion-based super-resolution. No equations, derivations, or fitted parameters are present that could reduce to self-definition or construction. Claims rest on experimental results from high-resolution images and videos rather than any self-referential logic, self-citation chains, or renamed known results. The central mechanism (tile-specific prompting) is presented as a novel engineering choice with reported perceptual gains, not as a quantity derived from its own inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Proposition 1: ΔI = E[D_KL(p(x_H | x_L, c*) ‖ p(x_H | x_L, c))] lower-bounds accumulated prompt misguidance via ∫ ‖δ_i(c)(t)‖² dt.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Tiled prompts reduce ΔI_ℓ ≤ ΔI_g by replacing the global prompt c_global with per-tile prompts c_local.
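Rendered in standard notation, the two quoted passages read as follows (this is only a typeset rendering: the inequality direction follows from "lower-bounds", the symbols are as quoted, and no further structure is asserted):

```latex
\Delta I(c) \;=\; \mathbb{E}\!\left[\, D_{\mathrm{KL}}\!\big(\, p(x_H \mid x_L, c^{*}) \,\big\|\, p(x_H \mid x_L, c) \,\big) \right]
\;\le\; \int \big\lVert \delta_i(c)(t) \big\rVert^{2} \, dt ,
\qquad
\Delta I_{\ell} \;\le\; \Delta I_{g}
\quad \text{for per-tile } c_{\mathrm{local}} \text{ in place of } c_{\mathrm{global}} .
```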
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.