QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs
Pith reviewed 2026-05-21 06:06 UTC · model grok-4.3
The pith
QwenSafe outperforms existing vision-language models at classifying content rating descriptors by using preference alignment on multimodal app data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adapting Qwen3-VL-8B on data synthesized by the metadata2CRD pipeline, using supervised fine-tuning followed by direct preference optimization, the resulting QwenSafe model classifies Apple content rating descriptors more accurately in the binary setting than the base model and other leading vision-language models. The gains are most notable in positive-class recall: improvements of 111.8%, 36.1%, and 2.1% over Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash, respectively. This is taken to show that aligning model predictions to descriptor-specific multimodal evidence improves automated content rating.
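To make the recall figures concrete, the sketch below shows how a gain in positive-class recall can be computed. It assumes the reported percentages are relative gains, which the 111.8% figure implies because recall itself cannot exceed 100%. The counts are hypothetical and are not taken from the paper, and the function names are illustrative only.

```python
def positive_recall(tp: int, fn: int) -> float:
    """Recall on the positive class: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive examples")
    return tp / (tp + fn)


def relative_improvement(new: float, base: float) -> float:
    """Percentage gain of `new` over `base` (relative, not percentage points)."""
    if base == 0:
        raise ValueError("baseline recall is zero; relative gain is undefined")
    return 100.0 * (new - base) / base


# Hypothetical confusion counts for one descriptor (NOT from the paper).
baseline_recall = positive_recall(tp=40, fn=60)  # 40 of 100 positives found
tuned_recall = positive_recall(tp=80, fn=20)     # 80 of 100 positives found
gain = relative_improvement(tuned_recall, baseline_recall)
print(f"baseline={baseline_recall:.2f} tuned={tuned_recall:.2f} gain={gain:.1f}%")
```

Under this reading, a gain above 100% means recall more than doubled relative to the baseline, which is why the absolute recall values and sample counts matter for interpreting it.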
What carries the argument
The metadata2CRD pipeline, which creates descriptor-aligned question-answer pairs, combined with direct preference optimization, which aligns the VLM's outputs with the visual and textual evidence for each content rating descriptor.
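For readers unfamiliar with direct preference optimization, here is a minimal sketch of the per-pair DPO loss as defined in the DPO paper cited in the reference list (Rafailov et al.). It is not the authors' training code: the function name, the scalar log-probability inputs, and the value of `beta` are assumptions for illustration, and the paper's actual hyperparameters are not given on this page.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) answer pair.

    Inputs are summed token log-probabilities of each answer under the
    policy being trained and under the frozen reference model.
    beta=0.1 is an assumed value, not one reported by the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the loss is log(2).
neutral = dpo_pair_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen answer and lowering the rejected one reduces the loss.
improved = dpo_pair_loss(-4.0, -8.0, -5.0, -7.0)
print(neutral, improved)
```

In this task, the chosen and rejected answers would presumably be descriptor-specific responses of differing quality, but how the preference pairs are built is not described on this page.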
Load-bearing premise
The data generated by the metadata2CRD pipeline produces high-quality pairs that represent real app content and enable the model to generalize without biases introduced by synthesis or image interpretation.
What would settle it
A large-scale evaluation against human expert labels on actual submitted apps, checking whether the reported recall improvements persist outside the synthetic training distribution.
Original abstract
Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents QwenSafe, a VLM based on Qwen3-VL-8B that is fine-tuned with SFT followed by DPO to identify the presence of 12 Apple-defined content rating descriptors (CRDs) from joint app metadata and screenshots. It introduces the metadata2CRD pipeline to synthesize descriptor-aligned QA pairs from descriptions, screenshots, and formal definitions. The central empirical claim is that QwenSafe outperforms baselines (Qwen3-VL, LLaVA-1.6, Gemini-2.5-Flash) in binary CRD classification, with positive-class recall gains of 111.8%, 36.1%, and 2.1% respectively.
Significance. If the reported gains reflect genuine multimodal generalization rather than pipeline artifacts, the work could support more scalable and consistent automated content rating for app marketplaces. The combination of descriptor-specific definitions with DPO for evidence alignment is a sensible technical choice for this safety-oriented task. However, the absence of dataset statistics, split details, and external validation substantially weakens the strength of the conclusions.
major comments (2)
- [§4.2 and Table 1] §4.2 and Table 1: The manuscript reports large positive-class recall improvements but provides no information on evaluation dataset size, per-descriptor sample counts, train-test split ratios, or statistical significance testing. This information is required to assess whether the 111.8%, 36.1%, and 2.1% gains are reliable or could arise from variance or imbalance.
- [§3.1] §3.1 (metadata2CRD pipeline): The evaluation set is generated by the same synthesis procedure used for training data, with no experiments or analysis addressing possible data leakage, keyword injection, or distribution shift. The central claim that QwenSafe performs robust joint metadata+screenshot reasoning therefore requires external validation on human-annotated real-world apps, which is not reported.
minor comments (2)
- [Abstract] Abstract: The sentence reporting recall improvements lists three percentages but does not explicitly map them to the three named baselines; adding this mapping would improve clarity.
- [§2] §2 (Related Work): The discussion of prior VLM safety and content moderation work is brief; adding references to recent multimodal safety benchmarks would better situate the contribution.
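The leakage concern in the major comments turns on whether question-answer pairs from the same app can land in both training and evaluation data. The following is a minimal sketch of an app-level split that rules this out by construction. The `app_id` field, the record layout, and the 20% evaluation fraction are hypothetical; the page does not describe how the paper's splits were made.

```python
import hashlib


def split_for_app(app_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministically assign an app to 'train' or 'eval' by hashing its ID."""
    digest = hashlib.sha256(app_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


def split_pairs(pairs, eval_fraction: float = 0.2):
    """Split QA pairs so that every pair from one app stays in one split."""
    train, evaluation = [], []
    for pair in pairs:
        if split_for_app(pair["app_id"], eval_fraction) == "eval":
            evaluation.append(pair)
        else:
            train.append(pair)
    return train, evaluation


# Hypothetical records: several descriptor questions per app.
pairs = [
    {"app_id": f"app-{i}", "descriptor": d}
    for i in range(50)
    for d in ("descriptor-a", "descriptor-b")
]
train, evaluation = split_pairs(pairs)
shared_apps = {p["app_id"] for p in train} & {p["app_id"] for p in evaluation}
print(len(train), len(evaluation), len(shared_apps))
```

Splitting by app prevents direct overlap but does not address the referee's other points: both splits would still come from the same synthesis procedure, so keyword injection and distribution shift need separate checks.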
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below, indicating the revisions we will incorporate to address the concerns raised.
-
Referee: [§4.2 and Table 1] §4.2 and Table 1: The manuscript reports large positive-class recall improvements but provides no information on evaluation dataset size, per-descriptor sample counts, train-test split ratios, or statistical significance testing. This information is required to assess whether the 111.8%, 36.1%, and 2.1% gains are reliable or could arise from variance or imbalance.
Authors: We agree that these details are necessary to properly evaluate the reliability of the reported gains. The current manuscript provides only high-level dataset descriptions in §4. In the revised version we will expand §4.2 and Table 1 to report the total size of the evaluation set, the number of positive and negative samples per descriptor, the train-test split ratios employed, and the results of statistical significance tests (e.g., McNemar’s test) comparing QwenSafe against the baselines. These additions will allow readers to assess whether the observed improvements are robust to variance and class imbalance.
Revision: yes
-
Referee: [§3.1] §3.1 (metadata2CRD pipeline): The evaluation set is generated by the same synthesis procedure used for training data, with no experiments or analysis addressing possible data leakage, keyword injection, or distribution shift. The central claim that QwenSafe performs robust joint metadata+screenshot reasoning therefore requires external validation on human-annotated real-world apps, which is not reported.
Authors: We acknowledge the validity of this concern. The metadata2CRD pipeline relies on formal descriptor definitions rather than surface-level keywords, and we used disjoint app sets for training and evaluation to reduce direct leakage. Nevertheless, we did not include explicit ablation studies on keyword injection or distribution shift. In the revision we will add such analysis (e.g., performance after removing obvious keyword cues) and clarify the steps taken to ensure separation between splits. We agree that external validation on independently human-annotated real-world apps would provide stronger evidence of generalization beyond the synthetic distribution; we will explicitly note this as a limitation and outline it as future work.
Revision: partial
- External validation on human-annotated real-world apps is not available in the current study and would require new data collection outside the scope of this work.
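The first response above proposes McNemar's test for comparing QwenSafe against each baseline. As a rough sketch of what that involves, the code below computes the exact two-sided McNemar p-value from paired predictions on the same evaluation items. The function names and the toy labels are assumptions for illustration; no such results are reported on this page.

```python
from math import comb


def discordant_counts(labels, preds_a, preds_b):
    """Count items where exactly one of the two models is correct.

    Returns (b, c): b = A right and B wrong, c = A wrong and B right.
    """
    b = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa != y and pb == y)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant item favours either
    model with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Toy example (hypothetical predictions, NOT results from the paper).
labels = [1, 1, 0, 0]
model_a = [1, 1, 0, 1]
model_b = [1, 0, 0, 0]
b, c = discordant_counts(labels, model_a, model_b)
print(b, c, mcnemar_exact(b, c))
```

Because the test uses only the discordant items, it needs paired predictions per evaluation example, which is one more reason the per-descriptor sample counts requested by the referee matter.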
Circularity Check
No circularity in empirical evaluation of VLM fine-tuning pipeline
The paper presents a standard empirical ML workflow: it introduces a data synthesis pipeline (metadata2CRD) to generate training pairs from app metadata, screenshots, and descriptor definitions, performs supervised fine-tuning followed by DPO on Qwen3-VL-8B, and reports direct performance metrics (positive-class recall improvements) against external baselines on a binary classification task for 12 descriptors. These metrics are measured outcomes on an evaluation set and do not reduce by any equations or definitions to quantities that are tautologically equivalent to the training inputs or fitted parameters. No mathematical derivations, self-citations, uniqueness theorems, or ansatzes are present in the provided text that would create a load-bearing circular chain. The central claims rest on observable model outputs rather than self-referential constructions.
Axiom & Free-Parameter Ledger
free parameters (1)
- DPO and SFT hyperparameters
axioms (1)
- domain assumption: App metadata and screenshots jointly contain sufficient evidence to determine the presence or absence of each CRD.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence..."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Apple Newsroom: Apple expands tools to help parents protect kids and teens online (2025), https://www.apple.com/au/newsroom/2025/06/apple-expands-tools-to-help-parents-protect-kids-and-teens-online/#:~:text=12%20June%202025-,Apple%20expands%20tools%20to%20help%20parents%20protect%20kids%20and%20teens,they%20set%20up%20their%20device
- [2] 42matters: Google play statistics and trends 2025 (2025), https://42matters.com/google-play-statistics-and-trends
- [3] 42matters: ios apple app store statistics and trends 2025 (2025), https://42matters.com/ios-apple-app-store-statistics-and-trends
- [4] Apple Inc.: Age ratings values and definitions (2025), https://developer.apple.com/help/app-store-connect/reference/age-ratings-values-and-definitions
- [5] Apple Inc.: Choosing a category. https://developer.apple.com/app-store/categories/ (2025)
- [6] Apple Inc.: Set an app age rating. https://developer.apple.com/help/app-store-connect/manage-app-information/set-an-app-age-rating (2025)
- [7] Australasian Legal Information Institute: Online content regulation. https://www.austlii.edu.au/cgi-bin/viewdb/au/legis/cth/consol_act/bsa1992214/ (1992)
- [8] Australasian Legal Information Institute: ONLINE SAFETY ACT 2021 - SECT 105. https://www.austlii.edu.au/cgi-bin/viewdoc/au/legis/cth/consol_act/osa2021154/s105.html (2021)
- [9] Bai, S., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
- [10] Board, E.S.R.: Rating guide (1994), https://www.esrb.org/ratings-guide/
- [11] Canadian centre for child protection: Reviewing the enforcement of app age ratings in apple’s app store and google play. https://content.c3p.ca/pdfs/C3P_AppAgeRatingReport_en.pdf (2022)
- [12] Cao, R., Hee, M.S., Kuek, A., Chong, W.H., Lee, R.K.W., Jiang, J.: Pro-cap: Leveraging a frozen vision-language model for hateful meme detection. In: Proceedings of the 31st ACM international conference on multimedia. pp. 5244–5252 (2023)
- [13] Carter, M., Zhangshao, T., Hardwick, T., Egliston, B., Xiao, L.Y.: Investigating mobile games’ compliance with australia’s 2024 mandatory minimum age classifications scheme for gambling-like mechanics. Available at SSRN (2025)
- [14] Chen, Y., Xu, H., Zhou, Y., Zhu, S.: Is this app safe for children? a comparison study of maturity ratings on Android and iOS applications. In: Proceedings of the 22nd international conference on World Wide Web. pp. 201–212 (2013)
- [15] Chiu, K.L., Collins, A., Alexander, R.: Detecting hate speech with gpt-3. arXiv preprint arXiv:2103.12407 (2021)
- [16] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
- [17] Denipitiyage, D., Silva, B., Seneviratne, S., Seneviratne, A., Chawla, S.: Detecting content rating violations in android applications: A vision-language approach. arXiv preprint arXiv:2502.15739 (2025)
- [18] eSafety commissioner, Australia: Illegal and restricted online content. https://www.esafety.gov.au/key-topics/Illegal-restricted-content (2024)
- [19] eSafety commissioner, Australia: Illegal and restricted online content (2024)
- [20] European General Data Protection Regulation: General data protection regulation gdpr. (2016), https://gdpr-info.eu/
- [21] Google: App Discovery with Google Play, Part 3: Machine Learning to Fight Spam and Abuse at Scale. https://research.google/blog/app-discovery-with-google-play-part-3-machine-learning-to-fight-spam-and-abuse-at-scale/ (Mar 2015)
- [22] Google: Keeping google play safe for users and developers: June 29, 2023 (2023), https://support.google.com/googleplay/android-developer/answer/13721042?hl=en
- [23] Google: Apps & games content ratings on google play (2025), https://support.google.com/googleplay/answer/6209544?hl=en
- [24] Guo, K., Hu, A., Mu, J., Shi, Z., Zhao, Z., Vishwamitra, N., Hu, H.: An investigation of large language models for real-world hate speech detection. In: 2023 International Conference on Machine Learning and Applications (ICMLA). pp. 1568–1573. IEEE (2023)
- [25] Haotian Liu, Chunyuan Li, Y.L., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
- [26] Hu, B., Liu, B., Gong, N.Z., Kong, D., Jin, H.: Protecting your children from inappropriate content in mobile apps: An automatic maturity rating framework. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1111–1120 (2015)
- [27] Ibrahim, H.: Google play review times: Expectations and tips to streamline approval (Nov 2024), https://median.co/blog/google-play-review-times-what-to-expect-and-how-to-streamline-approval
- [28] Interactive Software Federation of Europe (ISFE): Pegi-pan-european game information. http://www.pegi.info/en/index/id/952 (2003)
- [29] International Age Rating Coalition: How iarc works. (2025), https://www.globalratings.com/how-iarc-works.aspx
- [30] Jensen, K.T.: 11 iphone apps that got banned and why (2022), https://au.pcmag.com/iphone-apps/95993/11-iphone-apps-that-got-banned-and-why
- [31] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., Testuggine, D.: The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems 33, 2611–2624 (2020)
- [32] Liu, M., Wang, H., Guo, Y., Hong, J.: Identifying and analyzing the privacy of apps for kids. In: Proceedings of the 17th International Workshop on Mobile Computing Systems and Applications. pp. 105–110 (2016)
- [33] Mathew, B., Saha, P., Yimam, S.M., Biemann, C., Goyal, P., Mukherjee, A.: Hatexplain: A benchmark dataset for explainable hate speech detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 14867–14875 (2021)
- [34] Motion Picture Association: Film rating (1968), https://www.motionpictures.org/film-ratings/
- [35] National Archives: Children’s online privacy protection rule (2022), https://www.ecfr.gov/current/title-16/chapter-I/subchapter-C/part-312
- [36] Play Store: Creating apps and games for children and families. https://play.google.com/console/about/programs/families/ (2015)
- [37] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)
- [38] Regional Development Department of Infrastructure, Transport and Communication.: The advisory categories for films and computer games. (2025), https://www.classification.gov.au/classification-ratings/what-are-ratings
- [39] Sun, R., Xue, M., Tyson, G., Wang, S., Camtepe, S., Nepal, S.: Not seen, not heard in the digital world! measuring privacy practices in children’s apps. In: Proceedings of the ACM Web Conference 2023. pp. 2166–2177 (2023)
- [40] Unterhaltungssoftware Selbstkontrolle: SK age categories. (2025), https://usk.de/en/the-usk/faqs/age-categories/
- [41] Wang, H., Tan, R.Y., Lee, R.K.W.: Cross-modal transfer from memes to videos: Addressing data scarcity in hateful video detection. In: Proceedings of the ACM on Web Conference 2025. pp. 5255–5263 (2025)
- [42] Xiao, L.Y., Lund, M.L.: Non-compliance with and non-enforcement of uk loot box industry self-regulation on the apple app store: a longitudinal study on poor implementation. Royal Society Open Science 12(5), 250704 (2025)
- [43] Zhou, C., Zhan, X., Li, L., Liu, Y.: Automatic maturity rating for Android apps. In: Proceedings of the 13th Asia-Pacific Symposium on Internetware. pp. 16–27 (2022)
A Appendix
Due to space constraints, we provide the complete multi-class classification results across all descriptors in Table 3. This table reports mild and strong precision and recall for all methods...