DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3
The pith
A 3-billion-parameter model fine-tuned on real police interactions outperforms a much larger general-purpose LLM in de-escalation dialogue.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeEscalWild distills 5,000 raw inputs derived from real-world police-civilian video into 1,500 filtered scenarios comprising 285,887 dialogue turns; fine-tuning small models on this corpus yields higher ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human-evaluation results than the base SLMs, with the tuned Qwen 2.5 3B-Instruct exceeding Gemini 2.5 Flash under matched conditions.
What carries the argument
The hybrid human-plus-LLM filtering pipeline that distills 1,500 high-fidelity scenarios from 5,000 raw video-derived inputs while preserving dialogue turns and token volume.
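The two-stage filter described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the `judge_score` field, the 7.0 threshold, and the set of human-approved IDs are all invented stand-ins for whatever rubric the paper's pipeline actually uses.

```python
# Sketch of a hybrid LLM-as-Judge + human-in-the-loop filter.
# All field names and thresholds here are hypothetical.

def llm_judge_score(scenario: dict) -> float:
    """Placeholder: a real pipeline would call an LLM judge with a fidelity rubric."""
    return scenario.get("judge_score", 0.0)

def hybrid_filter(raw_scenarios, human_approved_ids, threshold=7.0):
    """Keep only scenarios that clear BOTH the LLM judge and human review."""
    kept = []
    for s in raw_scenarios:
        if llm_judge_score(s) >= threshold and s["id"] in human_approved_ids:
            kept.append(s)
    return kept

raw = [
    {"id": 1, "judge_score": 8.2},
    {"id": 2, "judge_score": 4.1},  # fails the judge threshold
    {"id": 3, "judge_score": 9.0},  # fails human review
]
kept = hybrid_filter(raw, human_approved_ids={1, 2})
# Only scenario 1 survives: it clears the judge AND human review.
```

The conjunction of the two stages is the point: either filter alone would have kept a scenario the other rejects.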
Where Pith is reading between the lines
- The same video-to-scenario pipeline could be reused to create training data for other high-stakes verbal skills such as crisis negotiation or medical communication.
- Local deployment of these tuned models on edge hardware would let officers run private, repeated practice sessions without sending sensitive interaction data to external servers.
- If the performance edge holds across more officer cohorts, training programs could shift from scripted role-play to open-ended model-driven simulations at much lower ongoing cost.
Load-bearing premise
The filtering process selects scenarios that truly represent typical police-civilian encounters without favoring easier or less representative cases.
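One simple way to probe this premise would be to compare the mix of scenario types before and after filtering; a large shift would suggest the filter favors particular encounter types. The scenario categories and counts below are invented for illustration; the paper does not report such a breakdown.

```python
# Compare scenario-type distributions pre- vs post-filter using total
# variation distance. A value near 0 means the filter preserved the mix;
# a large value means it skewed it. Categories and counts are hypothetical.

def distribution(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

raw_mix = {"traffic_stop": 2500, "domestic": 1500, "mental_health": 1000}
kept_mix = {"traffic_stop": 900, "domestic": 400, "mental_health": 200}

tvd = total_variation(distribution(raw_mix), distribution(kept_mix))
# tvd == 0.1 here: the filter mildly over-represents traffic stops.
```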
What would settle it
A controlled test in which experienced officers rate the realism and helpfulness of responses from the fine-tuned 3B model lower than those from the untuned base model or the larger general model would falsify the performance claim.
Original abstract
Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model when evaluated under equivalent conditions, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge. We publicly release our code(https://github.com/Hasebul/DeEscalWild-Benchmark-Framework) and dataset(https://doi.org/10.7910/DVN/CWMCZI).
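Of the automatic metrics listed in the abstract, ROUGE-L is the most self-contained: it scores a candidate response against a reference by their longest common subsequence (LCS). A minimal sketch from the definition (real evaluations would use a library such as rouge-score, which also handles stemming and tokenization):

```python
# Minimal ROUGE-L F1 from the LCS definition, for intuition only.

def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("please step back and stay calm",
                   "please stay calm and step back")
# score == 0.5: the LCS is 3 tokens out of 6 on each side.
```

Note what this illustrates about the referee's concern below: a response can reorder the same de-escalation phrases and lose half its score, so lexical-overlap metrics reward surface patterns as much as substance.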
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeEscalWild, a benchmark dataset of 1,500 high-fidelity police-civilian de-escalation scenarios curated from 5,000 raw video-derived inputs via a hybrid human-in-the-loop and LLM-as-Judge filtering pipeline, yielding 285,887 dialogue turns. It reports that SLMs fine-tuned on this data, notably Qwen 2.5 (3B-Instruct), significantly outperform their base counterparts and even surpass the general-purpose Gemini 2.5 Flash model across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluations, while releasing the dataset and code.
Significance. If the central empirical claims hold after addressing evaluation details, the work is significant for providing a scalable, real-world dataset that enables efficient SLM-based training simulations for law enforcement, demonstrating that domain-adapted small models can deliver superior performance at lower computational cost than larger generalist LLMs and supporting privacy-preserving edge deployment.
major comments (2)
- [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions is load-bearing on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than a reflection of real-world robustness.
- [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.
minor comments (1)
- [Abstract] The public release of code and dataset is a strength that supports reproducibility and should be highlighted more explicitly in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's transparency and rigor.
Point-by-point responses
Referee: [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions is load-bearing on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than a reflection of real-world robustness.
Authors: We appreciate this concern regarding potential artifacts from the LLM-as-Judge component. The pipeline mitigates this through a subsequent human-in-the-loop verification stage, in which domain experts (including law enforcement trainers) reviewed all retained scenarios for realism, diversity of conflict arcs, and adherence to de-escalation principles, explicitly discarding any with overly formulaic or predictable structures. Furthermore, the fine-tuned model's outperformance holds on human evaluations, not solely automatic metrics, and human judgments are less susceptible to lexical artifacts. To address the referee's point directly, we will expand the Methods section with additional details on the LLM-judge prompts, the human review criteria, and examples of filtered-out dialogues.
Revision: partial
Referee: [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.
Authors: We agree these reporting details are necessary for assessing robustness. In the revised manuscript, we will add: (1) explicit baseline selection criteria, explaining our choice of base SLM variants and Gemini 2.5 Flash as a strong general-purpose comparator to isolate domain-adaptation effects; (2) statistical significance testing (paired t-tests with Bonferroni correction and reported p-values) for all metric gains; and (3) inter-annotator agreement statistics (e.g., Fleiss' kappa) for the Realism Score and human evaluation annotations. These will be integrated into the Experiments section with a new subsection on evaluation reliability.
Revision: yes
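The agreement statistic the authors propose, Fleiss' kappa, is computable directly from an item-by-category matrix of rating tallies. A minimal sketch; the three-rater binary rating matrix below is invented for illustration, and a real analysis would typically use a library such as statsmodels.

```python
# Fleiss' kappa from first principles: observed agreement minus chance
# agreement, normalized. counts[i][j] = number of raters who put item i
# in category j; every item must receive the same number of ratings.

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from marginal category proportions
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three raters give four dialogues a binary "realistic?" judgment.
ratings = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(ratings)
# kappa == 0.625: substantial, above-chance agreement on this toy data.
```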
Circularity Check
No significant circularity in the empirical benchmark and evaluation.
Full rationale
The paper presents an empirical dataset curation pipeline from real-world video sources followed by SLM fine-tuning and evaluation on held-out scenarios using standard metrics (ROUGE-L, BLEU, etc.) against external baselines including Gemini 2.5 Flash. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a direct performance comparison on independently filtered test data and remains self-contained without reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Publicly available police-civilian interaction videos contain sufficient high-quality, representative de-escalation dialogues for training purposes.
- domain assumption: LLM-as-a-Judge combined with human verification reliably identifies high-fidelity scenarios without introducing systematic bias.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios... fine-tuned Qwen 2.5 (3B-Instruct) surpasses... Gemini 2.5 Flash"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "QLoRA adapters... rank r=16, scaling factor α=32... evaluation on ROUGE-L, BLEU-4, METEOR, BERTScore"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Anand, A. and Polyak, E. Exploring the potential of large language models for enhanced virtual non-player character interactions. In INTED2024 Proceedings, pp. 4895–4898. IATED, 2024. doi:10.21125/inted.2024.1269.
[2] Banerjee, S. and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, Ann Arbor, Michigan, 2005.
[3]
[4] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[5] Falcon-LLM Team. The Falcon 3 family of open models. https://huggingface.co/blog/falcon3, December 2024. Accessed 2026-01-24.
[6] Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1): 1–24, 2024.
[7] Granite Team, I. Granite 3.0 language models. https://github.com/ibm-granite/granite-3.0-language-models, 2024.
[8] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[9] Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
[10] Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics. https://aclanthology.org/W04-1013/
[11] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, 2002.
[12] Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.
[13] Pecher, B., Srba, I., and Bielikova, M. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[14] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 2023.
[15] Rosas-Smith, J., Bartelds, M., Huang, R., García-Perera, L. P., Livescu, K., Jurafsky, D., and Field, A. Constructing datasets from public police body camera footage. In ICASSP 2025, pp. 1–5. IEEE, 2025.
[16] Srbinovska, A., Srbinovska, A., Senthil, V., Martin, A., McCluskey, J., Bateman, J., and Fokou A. S., E. Towards AI-driven policing: Interdisciplinary knowledge discovery from police body-worn camera footage. arXiv preprint arXiv:2504.20007, 2025.
[17] Sridhar, E. P., Lopez, J., Islam, M., and Deb, S. Adaptive de-escalation trainer: Piloting a RAG-enhanced, emotionally modulated AI simulator for police training. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 69, pp. 171–175. SAGE Publications, 2025.
[18] Gemma Team, Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[19] Violakis, P. Leveraging large language models for enhanced simulation-based learning in police and law enforcement. Policing: A Journal of Policy and Practice, 19: paaf012, 2025.
[20] Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., and Eberhardt, J. L. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25): 6521–6526, 2017.
[21] Wang, N., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Yang, J., et al. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 14743–14777, 2024.
[22] Wang, X., Wang, H., Zhang, Y., Yuan, X., Xu, R., Huang, J.-t., Yuan, S., Guo, H., Chen, J., Zhou, S., Wang, W., and Xiao, Y. CoSER: Coordinating LLM-based persona simulation of established roles. In Forty-second International Conference on Machine Learning, 2025. https://openreview.net/forum?id=BOrR7YqKUt
[23] Xu, C., Xu, Y., Wang, S., Liu, Y., Zhu, C., and McAuley, J. Small models are valuable plug-ins for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 283–294, Bangkok, Thailand, 2024.
[24] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[25] Zhan, H., Wang, Y., Li, Z., Feng, T., Hua, Y., Sharma, S., Qu, L., Semnani-Azad, Z., Zukerman, I., and Haffari, R. Let's negotiate! A survey of negotiation dialogue systems. In EACL (Findings), 2024.
[26] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=SkeHuCVFDr