
arxiv: 2605.17610 · v1 · pith:YDI27EI4new · submitted 2026-05-17 · 💻 cs.CV · cs.CL

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

Pith reviewed 2026-05-20 13:59 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: video guardrails · content moderation · fast-and-slow inference · AI safety · video datasets · chain of thought · influence filtering

The pith

SafeLens delivers state-of-the-art video moderation through fast-and-slow screening at reduced cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SafeLens, a framework for video guardrails that uses a fast-and-slow inference approach to handle most videos with quick pattern matching while applying deeper reasoning only when needed. It also creates a compact high-quality training set by influence-guided filtering that keeps just 2.4 percent of the original SafeWatch data and adds chain-of-thought traces to support test-time reasoning. This design achieves better accuracy than both open-source and closed-source models on real-world and AI-generated video benchmarks while lowering inference costs. The approach suggests that thoughtful architecture can outperform simple scaling of data and model size for safety tasks.

Core claim

SafeLens combines a fast-and-slow screening architecture with a filtered training dataset and structured chain-of-thought augmentation to perform accurate and efficient video content moderation, outperforming existing guardrails on benchmarks while reducing computational expense.

What carries the argument

The fast-and-slow inference architecture, which routes simple inputs to fast pattern recognition and complex ones to slower, more deliberate reasoning.
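
The routing idea can be illustrated with a minimal sketch. It assumes the fast stage returns a label with a confidence score and that inputs below a confidence threshold are escalated to the slow stage; the function names, the `Verdict` type, the toy models, and the threshold value are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# the fast model returns (label, confidence in [0, 1]); the slow model returns a label.
FastModel = Callable[[str], Tuple[str, float]]
SlowModel = Callable[[str], str]


@dataclass
class Verdict:
    label: str
    route: str  # "fast" if the fast stage decided, "slow" if escalated


def screen(video: str, fast: FastModel, slow: SlowModel, threshold: float = 0.9) -> Verdict:
    """Accept the fast label when it is confident enough, otherwise escalate."""
    label, confidence = fast(video)
    if confidence >= threshold:
        return Verdict(label, "fast")
    return Verdict(slow(video), "slow")


# Toy stand-ins so the sketch runs; their outputs are arbitrary.
def toy_fast(video: str) -> Tuple[str, float]:
    return ("safe", 0.98) if video == "easy_clip" else ("safe", 0.55)


def toy_slow(video: str) -> str:
    return "unsafe" if video == "ambiguous_clip" else "safe"


results = [screen(v, toy_fast, toy_slow) for v in ["easy_clip", "ambiguous_clip"]]
for r in results:
    print(r)
```

Under this assumed design, raising `threshold` sends more videos to the slow stage, which trades runtime for deliberation.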

If this is right

  • Video platforms can moderate content with lower latency and resource use.
  • AI-generated video safety checks become more practical at scale.
  • Training on smaller but higher-quality datasets can match or exceed results from larger ones.
  • Test-time reasoning augmentation improves performance without additional training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fast-and-slow designs might apply to other content moderation tasks like text or image safety.
  • Reducing inference cost could enable real-time moderation on smaller hardware.
  • The method highlights the value of data filtering over data scaling in safety applications.

Load-bearing premise

The small filtered subset of the SafeWatch Dataset still represents the full distribution of policy-violating and non-violating videos well enough for accurate generalization.
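
To make the premise concrete, here is a minimal sketch of keeping a fixed fraction of a dataset by score. It assumes per-example influence scores are already computed and that the highest-scoring examples are the ones retained; this page does not state the paper's actual selection rule, so `keep_top_fraction` and the synthetic scores are illustrative only.

```python
def keep_top_fraction(scores, fraction=0.024):
    """Return the sorted indices of the highest-scoring examples.

    `scores` holds one precomputed influence score per training example
    (how the scores are obtained is outside this sketch).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Synthetic, distinct scores for 1000 examples (placeholder data).
scores = [((i * 37) % 1000) / 1000 for i in range(1000)]
kept = keep_top_fraction(scores)
print(len(kept), "of", len(scores), "examples kept")
```

Whether such a subset still covers the full distribution is exactly what the premise above leaves open; the selection step itself says nothing about coverage.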

What would settle it

Training the same model architecture on the unfiltered full SafeWatch Dataset and comparing its benchmark performance and inference cost to SafeLens would test whether the filtering step is necessary or beneficial.

Figures

Figures reproduced from arXiv: 2605.17610 by Anshuman Chhabra, Hadi Askari, Muhao Chen, Shahriar Kabir Nahin.

Figure 1: Example of fast-and-slow reasoning: (a) depicts a group study scene from a video that can be quickly classified as safe; (b) the video requires more detailed analysis to determine safety, as it shows a person lying down, potentially injured.
Figure 2: Our SAFELENS framework: SafeLens-S1 performs fast screening, followed by SafeLens-S2 for slow-thinking.
Figure 3: Analyzing runtime (seconds) across SAFELENS and baselines.
Figure 4: Avg. accuracy and runtime of SAFELENS across different threshold values.
Figure 5: SAFELENS-S2 accuracy-runtime trade-off varying the embedding and reasoning models.
Figure 7: Analyzing runtime (seconds) across SAFELENS and baselines on the validation set.
Figure 8: Overview of the six harmful content categories.
Figure 9: Examples of potential incorrect annotations in SafeWatch training dataset.
Figure 10: Examples of corrected annotations in the SafeWatch-Real validation dataset.
Figure 11: Example of SafeWatch policy prompt used for all baselines.
Figure 12: Example of SAFELENS-S1 policy prompt.
Figure 13: Example of SAFELENS-S2 policy prompt.
Original abstract

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design serves to be more effective than scaling data or model size alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SafeLens, a fast-and-slow video guardrail framework that applies influence-guided filtering to retain only 2.4% of the SafeWatch Dataset, augments the subset with structured Chain-of-Thought traces, and deploys variable-depth inference to achieve state-of-the-art performance on real-world and AI-generated video benchmarks while reducing inference cost relative to larger open-source models (SafeWatch-8B, OmniGuard-7B) and closed-source models (GPT-5.4, Gemini-3.1-pro).

Significance. If the empirical claims hold after proper validation, the work would show that deliberate data curation combined with test-time reasoning can outperform uniform scaling of model size or training data volume in safety guardrails, offering a practical route to lower-cost deployment on video platforms.

major comments (2)
  1. [§3] §3 (Dataset Construction): The central SOTA claim rests on training and evaluating on the influence-filtered 2.4% subset. No coverage metrics, t-SNE embeddings, or performance numbers on the discarded 97.6% are reported, leaving open the possibility that the preferentially retained high-influence examples do not represent the full distribution of temporal and nuanced policy violations needed for generalization to the benchmarks.
  2. [§4] §4 (Experiments): The abstract asserts outperformance and cost reduction, yet the text provides neither quantitative tables with error bars, ablation results isolating the contribution of the fast-and-slow router versus the filtered data, nor explicit comparison protocols against the cited baselines; without these, the load-bearing performance claims cannot be verified.
minor comments (1)
  1. [Abstract] Abstract: The phrasing 'significantly reducing inference cost' is not accompanied by concrete latency or FLOPs numbers even at the abstract level.
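
One simple form the coverage metric requested in major comment 1 could take is sketched below: it compares the category distribution of the full dataset with that of the retained subset. The function names and the toy labels are hypothetical; the sketch does not reproduce any analysis from the paper.

```python
from collections import Counter


def category_shares(labels):
    """Map each category to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def coverage_gap(full_labels, subset_labels):
    """Return (total variation distance, categories missing from the subset)."""
    full = category_shares(full_labels)
    sub = category_shares(subset_labels)
    categories = set(full) | set(sub)
    tv = 0.5 * sum(abs(full.get(c, 0.0) - sub.get(c, 0.0)) for c in categories)
    missing = sorted(c for c in full if c not in sub)
    return tv, missing


# Toy labels (placeholder data, not SafeWatch statistics).
full = ["safe"] * 60 + ["violence"] * 30 + ["hate"] * 10
subset = ["safe"] * 6 + ["violence"] * 4
tv, missing = coverage_gap(full, subset)
print(f"total variation distance: {tv:.2f}, missing categories: {missing}")
```

A distance near 0 with no missing categories would suggest the subset mirrors the label mix of the full set; it would not by itself show that temporal or nuanced cases are covered.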

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

  1. Referee: [§3] §3 (Dataset Construction): The central SOTA claim rests on training and evaluating on the influence-filtered 2.4% subset. No coverage metrics, t-SNE embeddings, or performance numbers on the discarded 97.6% are reported, leaving open the possibility that the preferentially retained high-influence examples do not represent the full distribution of temporal and nuanced policy violations needed for generalization to the benchmarks.

    Authors: We agree that additional analysis of the filtered subset's coverage is necessary to fully support the generalization claims. The influence-guided selection prioritizes examples with high impact on model behavior, but we did not report explicit distribution comparisons in the original submission. In the revised version we will add t-SNE embeddings of the full SafeWatch dataset versus the retained 2.4% subset, together with performance numbers obtained when models are trained on the discarded portion, to demonstrate that the high-influence examples preserve the necessary temporal and policy-violation diversity. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts outperformance and cost reduction, yet the text provides neither quantitative tables with error bars, ablation results isolating the contribution of the fast-and-slow router versus the filtered data, nor explicit comparison protocols against the cited baselines; without these, the load-bearing performance claims cannot be verified.

    Authors: We acknowledge that the current experimental presentation lacks the quantitative rigor needed to verify the central claims. While the manuscript reports comparative results, it does not include error bars, isolated ablations, or detailed protocol descriptions. We will expand §4 with tables reporting mean performance and standard deviations across multiple runs, ablation studies that separately quantify the fast-and-slow router and the influence-filtered data, and an explicit subsection detailing the evaluation protocol, prompt templates, and inference settings used for all baselines including SafeWatch-8B, OmniGuard-7B, GPT-5.4, and Gemini-3.1-pro. revision: yes
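
The reporting promised in the second response (mean performance and standard deviation across multiple runs) can be sketched as follows. The variant names and accuracy values are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize(runs):
    """Map each variant name to (mean, sample standard deviation) over its runs."""
    return {name: (mean(values), stdev(values)) for name, values in runs.items()}


# Placeholder accuracies from three hypothetical seeds per variant.
runs = {
    "variant_a": [0.80, 0.82, 0.84],
    "variant_b": [0.70, 0.70, 0.70],
}

summary = summarize(runs)
for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} ± {s:.3f}")
```

`statistics.stdev` computes the sample standard deviation and needs at least two runs per variant.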

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark evaluation rather than self-referential derivations


The paper presents SafeLens as a fast-and-slow architecture trained on an influence-filtered subset (2.4% of SafeWatch) augmented with CoT traces, with SOTA performance reported as direct empirical outcomes on real-world and AI-generated video benchmarks. No equations, fitted parameters, or mathematical derivations appear that would reduce a claimed prediction back to the training choices by construction. Dataset filtering and augmentation are methodological steps whose validity is asserted via external benchmark comparisons rather than tautological self-definition. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are described in the provided text. The results are therefore self-contained against external benchmarks and falsifiable independently of the paper's internal choices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work is an empirical system description relying on standard ML training assumptions not enumerated here.

pith-pipeline@v0.9.0 · 5774 in / 1173 out tokens · 45556 ms · 2026-05-20T13:59:28.643641+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 8 internal anchors

  1. [1] Fabrício Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, and Keith Ross. Video interactions in online video social networks. ACM Trans. Multimedia Comput. Commun. Appl., 5(4), November 2009

  2. [2] Xingyu Lu, Tianke Zhang, Chang Meng, Xiaobei Wang, Jinpeng Wang, Yi-Fan Zhang, Shisong Tang, Changyi Liu, Haojie Ding, Kaiyu Jiang, Kaiyu Tang, Bin Wen, Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, and Kun Gai. Vlm as policy: Common-law content moderation framework for short video platform. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge ...

  3. [3] Fatmaelzahraa Eltaher, Rahul Krishna Gajula, Luis Miralles-Pechuán, Patrick Crotty, Juan Martínez-Otero, Christina Thorpe, and Susan McKeever. Protecting young users on social media: Evaluating the effectiveness of content moderation and legal safeguards on video sharing platforms. arXiv preprint arXiv:2505.11160, 2025

  4. [4] Faraz Waseem and Muhammad Shahzad. Video is worth a thousand images: Exploring the latest trends in long video generation. ACM Comput. Surv., 58(6), December 2025

  5. [5] Sarah Fisher, Jeffrey Howard, and Beatriz Kira. Moderating synthetic content: the challenge of generative ai. Philosophy & Technology, 37, 11 2024

  6. [6] Akash Bonagiri, Lucen Li, Rajvardhan Oak, Zeerak Babar, Magdalena Wojcieszak, and Anshuman Chhabra. Towards safer social media platforms: scalable and performant few-shot harmful content moderation using large language models. arXiv preprint arXiv:2501.13976, 2025

  7. [7] Adi Levi, Or Levi, Sardhendu Mishra, and Jonathan Morra. Ai vs. human moderators: A comparative evaluation of multimodal llms in content moderation for brand safety. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5965–5973, 2025

  8. [8] Rajvardhan Oak, Muhammad Haroon, Claire Wonjeong Jo, Magdalena Wojcieszak, and Anshuman Chhabra. Re-ranking using large language models for mitigating exposure to harmful content on social media platforms. In ACL, 2025

  9. [9] Kuleen Sasse, Efsun Sarioglu Kayi, and Arun Reddy. Controllable hybrid captioner for improved long-form video understanding. arXiv preprint arXiv:2507.17047, 2025

  10. [10] Linhao Yu, Xingguang Ji, Yahui Liu, Fanheng Kong, Chenxi Sun, Jingyuan Zhang, Hongzhi Zhang, Victoria W., Fuzheng Zhang, and Deyi Xiong. Evaluating multimodal large language models on video captioning via Monte Carlo tree search. In ACL, 2025

  11. [11] Li Liu, Diji Yang, Sijia Zhong, Kalyana S Tholeti, Lei Ding, Yi Zhang, and Leilani H Gilpin. Right this way: Can vlms guide us to see more to answer questions? Advances in Neural Information Processing Systems, 37:132946–132976, 2024

  12. [12] Neelabh Sinha, Vinija Jain, and Aman Chadha. Guiding vision-language model selection for visual question-answering across tasks, domains, and knowledge types. In Wei Emma Zhang, Xiang Dai, Desmond Elliot, Byron Fang, Mongyuan Sim, Haojie Zhuang, and Weitong Chen, editors, Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 76–9...

  13. [13] Bodhisatta Maiti. Multilingual evaluation of image-text retrieval in vision–language models: A metric-based perspective. In Proceedings of the 4th International Workshop on Multimodal Human Understanding for the Web and Social Media, MUWS ’25, page 10–16, New York, NY, USA, 2025. Association for Computing Machinery

  14. [14] Bulat Khaertdinov, Mirela Popa, and Nava Tintarev. A little more like this: Text-to-image retrieval with vision-language models using relevance feedback. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3825–3834, 2026

  15. [15] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024

  16. [16] Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, and Shuigeng Zhou. Towards policy-adaptive image guardrail: Benchmark and method. arXiv preprint arXiv:2603.01228, 2026

  17. [17] Prince Jha, Raghav Jain, Kumar Mandal, Aman Chadha, Sriparna Saha, and Pushpak Bhattacharyya. Memeguard: An llm and vlm-based framework for advancing content moderation via meme intervention. In Annual Meeting of the Association for Computational Linguistics, 2024

  18. [18] Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, et al. Shieldgemma 2: Robust and tractable image content moderation, 2025. URL https://arxiv.org/abs/2504.01081

  19. [19] Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, and Patrick Schramowski. Llavaguard: An open vlm-based framework for safeguarding vision datasets and models. arXiv preprint arXiv:2406.05113, 2024

  20. [20] Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414, 2024

  21. [21] Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, and Chandan Singh. MULTIGUARD: An efficient approach for AI safety moderation across languages and modalities. In EMNLP, 2025

  22. [22] Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl. arXiv preprint arXiv:2510.02282, 2025

  23. [23] Zixuan Wang, Jinghao Shi, Hanzhong Liang, Xiang Shen, Vera Wen, Zhiqian Chen, Yifan Wu, Zhixin Zhang, and Hongyu Xiong. Filter-and-refine: A MLLM based cascade system for industrial-scale video content moderation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), 2025

  24. [24] Zhenhao Zhu, Yue Liu, Yanpei Guo, Wenjie Qu, Cancan Chen, Yufei He, Yibo Li, Yulin Chen, Tianyi Wu, Huiying Xu, et al. Guardreasoner-omni: A reasoning-based multi-modal guardrail for text, image, and video. arXiv preprint arXiv:2602.03328, 2026

  25. [25] Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li. Safewatch: An efficient safety-policy following video guardrail model with transparent explanations. In International Conference on Learning Representations, volume 2025, pages 76566–76608, 2025

  26. [26] Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning with noisy labels revisited: A study using real-world human annotations. arXiv preprint arXiv:2110.12088, 2021

  27. [27] Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi Raghunathan, and J Zico Kolter. Scaling laws for data filtering–data curation cannot be compute agnostic, 2024. URL https://arxiv.org/abs/2404.07177

  28. [28] Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, and Nirmalya Roy. A survey on efficient vision-language models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(3):e70036, 2025

  29. [29] Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. Omniguard: Unified omni-modal guardrails with deliberate reasoning. arXiv preprint arXiv:2512.02306, 2025

  30. [30] Valerie Thompson. Dual-process theories: A metacognitive perspective, pages 171–196. 01 2009

  31. [31] Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. ThinkGuard: Deliberative slow thinking leads to cautious guardrails. In ACL (Findings), 2025

  32. [32] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674, 2(6):15, 2024

  33. [33] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. Advances in neural information processing systems, 37:8093–8131, 2024

  34. [34] Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, and Chien-Sheng Wu. Bingoguard: Llm content moderation tools with risk levels. arXiv preprint arXiv:2503.06550, 2025

  35. [35] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma, 2024. URL https://arxiv.org/abs/2407.21772

  36. [36] Jonathan St.B.T. Evans. In two minds: dual-process accounts of reasoning. Trends in Cognitive Sciences, 7(10):454–459, 2003

  37. [37] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011

  38. [38] Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. Advances in Neural Information Processing Systems, 36:23813–23825, 2023

  39. [39] Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, and Haizhou Li. Dynathink: Fast or slow? a dynamic decision-making framework for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14686–14695, 2024

  40. [40] Andy DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces. In International Conference on Learning Representations, volume 2025, pages 95080–95117, 2025

  41. [41] Wenyi Xiao and Leilei Gan. Fast-slow thinking grpo for large vision-language model reasoning. Advances in Neural Information Processing Systems, 38:171601–171631, 2026

  42. [42] Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, and Kaiyang Zhou. Learning to think fast and slow for visual language models. arXiv preprint arXiv:2511.16670, 2025

  43. [43] Kangan Qian, Zhikun Ma, Yangfan He, Ziang Luo, Tianyu Shi, Tianze Zhu, Jiayin Li, Jianhui Wang, Ziyu Chen, Xiao He, et al. Fasionad: Fast and slow fusion thinking systems for human-like autonomous driving with adaptive feedback. arXiv preprint arXiv:2411.18013, 2024

  44. [44] Alex Havrilla and Maia Iyer. Understanding the effect of noise in llm training data with algorithmic chains of thought. arXiv preprint arXiv:2402.04004, 2024

  45. [45] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885–1894. PMLR, 2017

  46. [46] Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, and Muhao Chen. LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions. In Advances in Neural Information Processing Systems, 2025

  47. [47] Yongchan Kwon, Eric Wu, Kevin Wu, and James Y Zou. Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models. In International Conference on Learning Representations, volume 2024, pages 21921–21942, 2024

  48. [48] Anshuman Chhabra, Peizhao Li, Prasant Mohapatra, and Hongfu Liu. What Data Benefits My Classifier? Enhancing Model Performance and Interpretability through Influence-Based Data Selection. In International Conference on Learning Representations, 2024

  49. [49] Prateek Humane, Paolo Cudrano, Daniel Z Kaplan, Matteo Matteucci, Supriyo Chakraborty, and Irina Rish. Influence functions for efficient data selection in reasoning. arXiv preprint arXiv:2510.06108, 2025

  50. [50] Qirun Dai, Dylan Zhang, Jiaqi W. Ma, and Hao Peng. Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities. In EMNLP (Findings), 2025

  51. [51]

    First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

    Dmytro Vitel and Anshuman Chhabra. First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation. In International Conference on Learning Representations, 2026

  52. [52]

    Estimating training data influence by tracing gradient descent

    Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 33:19920–19930, 2020

  53. [53]

    Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

    Anshuman Chhabra, Bo Li, Jian Chen, Prasant Mohapatra, and Hongfu Liu. Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models. In International Conference on Machine Learning, 2025

  54. [54]

    Efficient knowledge probing of large language models by adapting pre-trained embeddings

    Kartik Sharma, Yiqiao Jin, Rakshit Trivedi, and Srijan Kumar. Efficient knowledge probing of large language models by adapting pre-trained embeddings. arXiv preprint arXiv:2508.06030, 2025

  55. [55]

    Building production-ready probes for gemini

    János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for gemini. arXiv preprint arXiv:2601.11516, 2026

  56. [56]

    Training data influence analysis and estimation: a survey

    Zayd Hammoudeh and Daniel Lowd. Training data influence analysis and estimation: a survey. Machine Learning, 113(5):2351–2403, March 2024

  57. [57]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

  58. [58]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025

  59. [59]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  60. [60]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

  61. [61]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  62. [62]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  63. [63]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  64. [64]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  65. [65]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  66. [66]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  67. [67]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4818–4829, June 2024

  68. [68]

    Lfm2 technical report

    Liquid AI. Lfm2 technical report. arXiv preprint arXiv:2511.23404, 2025

  69. [69]

    GRM-2.5-Air

    Orion LLM Labs. GRM-2.5-Air. https://huggingface.co/OrionLLM/GRM-2.5-Air, 2026.

Appendix A Limitations

SafeLens demonstrates strong performance and efficiency across benchmarks, but there are some limitations. Runtime depends on hardware, inference stack, and implementation details. While our results are based on B200 GPUs (using the HuggingFace inferen...