Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing
Pith reviewed 2026-05-08 04:20 UTC · model grok-4.3
The pith
A new database collects human annotations for cropping significant regions from landscape videos into portrait format.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the LIVE-YT VC database featuring 1800 videos annotated by 90 human subjects for portrait region cropping, along with a temporally smoothed version called LIVE-YT VC++ using an intra-frame temporal filter, and demonstrate its usefulness for video aspect ratio transformation using SmartVidCrop and state-of-the-art models.
What carries the argument
The LIVE-YT VC database of subjective portrait region annotations and the intra-frame temporal filter for smoothing annotations within each video.
If this is right
- Models trained or fine-tuned on this dataset can better adapt videos to mobile display aspect ratios while preserving meaning.
- The dataset provides a benchmark for future research on subjective video cropping and grounding.
- Labels resembling saliency annotations allow exploration of similarities between cropping regions and saliency predictions.
- Opening the data to the community will advance development of quality-preserving video reshaping techniques.
Where Pith is reading between the lines
- If the annotations reflect broad human judgments of significance, automated systems could use them to retain narrative elements when resizing videos for different platforms.
- The temporal smoothing method could be tested on other types of subjective video labels to improve consistency.
- Expanding the database to include more diverse video sources might test how well the annotations generalize beyond the original collection.
Load-bearing premise
The annotations provided by the 90 subjects accurately and reliably identify the significant regions that should be kept when cropping videos for different viewers and contexts.
What would settle it
If independent human evaluators consistently prefer crops generated by models not using this dataset over those trained on it, when judging preservation of important content and visual quality.
Figures
read the original abstract
With the rise of mobile video consumption on diverse handheld display resolutions and orientation modes, altering videos to aspect ratios poses challenges. Static cropping and border padding often compromises visual quality, while warping may distort a video's intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos, annotated by 90 human subjects. Using videos sourced from the YouTube-UGC and LSVQ Databases, this new resource is the largest publicly-available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, whereby a novel intra-frame temporal filter was deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels bear resemblances to video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks, and fine-tuned them on our dataset. As a service to the research community, we plan to open source the project.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the LIVE-YouTube Video Cropping (LIVE-YT VC) database of 1800 videos annotated by 90 human subjects for subjective portrait region cropping, drawn from YouTube-UGC and LSVQ sources. It also describes LIVE-YT VC++, a smoothed variant created via a novel intra-frame temporal filter. The work claims this is the largest publicly-available database for the task, demonstrates its utility via the SmartVidCrop algorithm and fine-tuned video grounding models, compares annotations to saliency predictions, and states plans to open source the resource as a benchmark for aspect-ratio adaptation in mobile video.
Significance. If the database is released and the annotations are shown to be reliable, this would supply a valuable large-scale subjective resource for video cropping research, filling a gap for methods that adapt landscape video to portrait/mobile formats while preserving content. The temporal smoothing idea directly targets annotation variability, and the empirical benchmarking against existing models provides usable baselines. The scale (1800 videos, 90 annotators) and connection to saliency tasks strengthen its potential as a community benchmark.
major comments (2)
- [Abstract] Abstract: the statement that 'this new resource is the largest publicly-available subjective video portrait region cropping database' is unsupported. The abstract ends by stating 'we plan to open source the project' (future tense) with no release link, DOI, or confirmation of current public availability, directly falsifying the present-tense claim that is central to the paper's contribution.
- [Database construction and LIVE-YT VC++] Database construction and LIVE-YT VC++ sections: the manuscript provides no details on inter-annotator agreement among the 90 subjects, the exact equations or parameters of the intra-frame temporal filter, or statistical validation of the smoothed ++ version relative to the raw annotations. These omissions prevent assessment of whether the resource reliably captures 'significant regions' and can serve as a reproducible benchmark.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the presentation and ensure accuracy.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that 'this new resource is the largest publicly-available subjective video portrait region cropping database' is unsupported. The abstract ends by stating 'we plan to open source the project' (future tense) with no release link, DOI, or confirmation of current public availability, directly falsifying the present-tense claim that is central to the paper's contribution.
Authors: We agree that the abstract contains an inconsistency in tense that requires correction. The database has been fully collected and processed, but public release is planned upon acceptance to permit any final refinements. We will revise the abstract to state that the resource 'will be the largest publicly-available' upon release and will add a footnote or dedicated section providing a planned release link or repository identifier. This change ensures the claim is accurately supported without misrepresentation. revision: yes
-
Referee: [Database construction and LIVE-YT VC++] Database construction and LIVE-YT VC++ sections: the manuscript provides no details on inter-annotator agreement among the 90 subjects, the exact equations or parameters of the intra-frame temporal filter, or statistical validation of the smoothed ++ version relative to the raw annotations. These omissions prevent assessment of whether the resource reliably captures 'significant regions' and can serve as a reproducible benchmark.
Authors: We concur that these details are necessary to demonstrate reliability and enable reproducibility. In the revised manuscript we will expand the relevant sections to report: quantitative inter-annotator agreement metrics (e.g., mean IoU and Fleiss' kappa across the 90 subjects); the exact mathematical formulation and all parameter values of the intra-frame temporal filter (including smoothing coefficients and window sizes); and statistical comparisons between the raw and smoothed annotations (e.g., variance reduction and temporal consistency measures). These additions will allow readers to evaluate the resource's suitability as a benchmark. revision: yes
Circularity Check
No circularity: empirical dataset introduction with no derivation chain
full rationale
The paper describes collection of subjective annotations from 90 human subjects on 1800 videos drawn from YouTube-UGC and LSVQ, followed by temporal smoothing to create LIVE-YT VC and LIVE-YT VC++ versions. It benchmarks existing algorithms (SmartVidCrop, video grounding models) and plans to release the data. No equations, fitted parameters, predictions, or first-principles derivations appear that could reduce to self-defined inputs or self-citations. The size/availability claim is a direct factual statement about the new resource rather than a derived result. Self-citations, if any, are not load-bearing for any central claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human subjects can consistently identify 'significant' portrait regions in landscape videos when given standard instructions
Reference graph
Works this paper leans on
-
[1]
A fast smart-cropping method and dataset for video retargeting,
K. Apostolidis and V . Mezaris, “A fast smart-cropping method and dataset for video retargeting,” in2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 2618–2622
work page 2021
-
[2]
S. Gu, Z. Pan, C. Hong, C. Liu, and Z. Cao, “Dynamic beauty is easy to find: A large-scale composition-aware dataset and an end-to- end framework for video reframing,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 7776–7784
work page 2025
-
[3]
Youtube ugc dataset for video compression research,
Y . Wang, S. Inguva, and B. Adsumilli, “Youtube ugc dataset for video compression research,” in2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), 2019, pp. 1–5
work page 2019
-
[4]
Patch-vq:’patching up’the video quality problem,
Z. Ying, M. Mandal, D. Ghadiyaram, and A. Bovik, “Patch-vq:’patching up’the video quality problem,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2021, pp. 14 019–14 029
work page 2021
-
[5]
A2-rl: Aesthetics aware reinforcement learning for image cropping,
D. Li, H. Wu, J. Zhang, and K. Huang, “A2-rl: Aesthetics aware reinforcement learning for image cropping,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2018, pp. 8193–8201
work page 2018
-
[6]
Deep cropping via attention box prediction and aesthetics assessment,
W. Wang and J. Shen, “Deep cropping via attention box prediction and aesthetics assessment,” inProc. of the IEEE Int’l Conference on computer vision, 2017, pp. 2186–2194
work page 2017
-
[7]
Reliable and efficient image cropping: A grid anchor based approach,
H. Zeng, L. Li, Z. Cao, and L. Zhang, “Reliable and efficient image cropping: A grid anchor based approach,” inProc. of the IEEE Confer- ence on Comput. Vision Pattern Recog., 2019, pp. 5949–5957
work page 2019
-
[8]
Fast-at: Fast automatic thumbnail generation using deep neural networks,
S. Esmaeili, B. Singh, and L. Davis, “Fast-at: Fast automatic thumbnail generation using deep neural networks,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2017, pp. 4622–4630
work page 2017
-
[9]
User constrained thumbnail generation using adaptive convolutions,
P. Kishore, A. Bhunia, S. Ghose, and P. Roy, “User constrained thumbnail generation using adaptive convolutions,” inICASSP 2019- 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 1677–1681
work page 2019
-
[10]
Learning to learn cropping models for different aspect ratio requirements,
D. Li, J. Zhang, and K. Huang, “Learning to learn cropping models for different aspect ratio requirements,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2020, pp. 12 685–12 694
work page 2020
-
[11]
Non-homogeneous content- driven video-retargeting,
L. Wolf, M. Guttmann, and D. Cohen-Or, “Non-homogeneous content- driven video-retargeting,” inIEEE 11th International Conference on Computer Vision, 2007, pp. 1–6
work page 2007
-
[12]
A system for retargeting of streaming video,
P. Kr ¨ahenb¨uhl, M. Lang, A. Hornung, and M. Gross, “A system for retargeting of streaming video,” inACM SIGGRAPH Asia 2009 papers, 2009, pp. 1–10
work page 2009
-
[13]
Improved seam carving for video retargeting,
M. Rubinstein, A. Shamir, and S. Avidan, “Improved seam carving for video retargeting,”ACM Transactions on Graphics (TOG), vol. 27, no. 3, pp. 1–9, 2008. JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, MONTH 20XX 12
work page 2008
-
[14]
Multi-operator media retargeting,
——, “Multi-operator media retargeting,”ACM Transactions on Graph- ics (TOG), vol. 28, no. 3, pp. 1–11, 2009
work page 2009
-
[15]
Yfcc100m: The new data in multimedia research,
B. Thomee, D. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016
work page 2016
-
[16]
Trigonometric interpolation of empirical and analytical functions,
C. Lanczos, “Trigonometric interpolation of empirical and analytical functions,” inJournal of Mathematical Physics, vol. 59, 1938, pp. 123– 199
work page 1938
-
[17]
Revisiting video saliency: A large-scale benchmark and a new model,
W. Wang, J. Shen, F. Guo, M. Cheng, and A. Borji, “Revisiting video saliency: A large-scale benchmark and a new model,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2018, pp. 4894– 4903
work page 2018
-
[18]
Measuring colorfulness in natural images,
D. Hasler and S. E. Suesstrunk, “Measuring colorfulness in natural images,” inHuman Vision and Electronic Imaging VIII, vol. 5007. SPIE, 2003, pp. 87–95
work page 2003
-
[19]
Analysis of public image and video databases for quality assessment,
S. Winkler, “Analysis of public image and video databases for quality assessment,”IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 616–625, 2012
work page 2012
-
[20]
International Telecommunication Union, “Itu-t recommendation p.910: Subjective video quality assessment methods for multimedia applica- tions,” 2008
work page 2008
-
[21]
Lof: Identifying density- based local outliers
M. Breunig, P. Kr ¨oger, R. Ng, and J. Sander, “Lof: Identifying density- based local outliers.” vol. 29, 06 2000, pp. 93–104
work page 2000
-
[22]
V olume 16: How to Detect and Handle Outliers,
B. I. and D. C. H., “V olume 16: How to Detect and Handle Outliers,” The ASQC Basic References in Quality Control: Statistical Techniques, 1993
work page 1993
-
[23]
Lucas-kanade 20 years on: A unifying framework,
S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,”International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004
work page 2004
-
[24]
Laion- 5b: An open large-scale dataset for training next generation image-text models,
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsmanet al., “Laion- 5b: An open large-scale dataset for training next generation image-text models,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022
work page 2022
-
[25]
Unified image and video saliency modeling,
R. Droste, J. Jiao, and J. A. Noble, “Unified image and video saliency modeling,” in16th European Conference on Computer Vision, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, pp. 419– 435
work page 2020
-
[26]
Human-centric spatio-temporal video grounding with visual transform- ers,
Z. Tang, Y . Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu, “Human-centric spatio-temporal video grounding with visual transform- ers,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8238–8249, 2021
work page 2021
-
[27]
Where does it exist: Spatio-temporal video grounding for multi-form sentences,
Z. Zhang, Z. Zhao, Y . Zhao, Q. Wang, H. Liu, and L. Gao, “Where does it exist: Spatio-temporal video grounding for multi-form sentences,” in Proc. of the IEEE Conference on Comput. Vision Pattern Recog., 2020, pp. 10 668–10 677
work page 2020
-
[28]
Object-aware multi-branch relation networks for spatio-temporal video grounding,
Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and N. J. Yuan, “Object-aware multi-branch relation networks for spatio-temporal video grounding,” arXiv preprint arXiv:2008.06941, 2020
-
[29]
Tubedetr: Spatio- temporal video grounding with transformers,
A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Tubedetr: Spatio- temporal video grounding with transformers,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2022, pp. 16 442–16 453
work page 2022
-
[30]
Embracing consistency: A one-stage approach for spatio-temporal video grounding,
Y . Jin, Z. Yuan, Y . Muet al., “Embracing consistency: A one-stage approach for spatio-temporal video grounding,”Advances in Neural Information Processing Systems, vol. 35, pp. 29 192–29 204, 2022
work page 2022
-
[31]
Collaborative static and dynamic vision-language streams for spatio-temporal video grounding,
Z. Lin, C. Tan, J.-F. Hu, Z. Jin, T. Ye, and W.-S. Zheng, “Collaborative static and dynamic vision-language streams for spatio-temporal video grounding,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2023, pp. 23 100–23 109
work page 2023
-
[32]
Context-guided spatio- temporal video grounding,
X. Gu, H. Fan, Y . Huang, T. Luo, and L. Zhang, “Context-guided spatio- temporal video grounding,” inProc. of the IEEE Conference on Comput. Vision Pattern Recog., 2024, pp. 18 330–18 339
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.