pith. machine review for the scientific record.

arxiv: 2604.13075 · v2 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links


DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: de-escalation training · small language models · benchmark dataset · police-civilian interactions · fine-tuning · dialogue generation · real-world scenarios · law enforcement simulation

The pith

A 3-billion-parameter model fine-tuned on real police interactions outperforms a much larger general-purpose LLM in de-escalation dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called DeEscalWild by pulling police-civilian encounters from public videos and filtering them down to 1,500 high-quality scenarios through a mix of human review and LLM judging. This data lets small language models be trained specifically for de-escalation, producing more realistic and effective responses than their untuned versions. The key result is that one such tuned 3B model beats a general-purpose larger model on standard metrics and human judgments while using far less compute. The work aims to make dynamic training simulations practical on portable devices that officers can actually carry into the field.

Core claim

DeEscalWild distills 5,000 raw video-derived inputs from real-world police-civilian footage into 1,500 filtered scenarios comprising 285,887 dialogue turns; fine-tuning small models on this corpus yields higher scores on ROUGE-L, BLEU-4, METEOR, BERTScore, realism, and human evaluation than the base SLMs, with the tuned Qwen 2.5 3B-Instruct exceeding Gemini 2.5 Flash under matched conditions.
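The automatic metrics in the claim are surface-overlap scores. As a concrete reminder of what ROUGE-L measures, here is a minimal LCS-based implementation in pure Python; this is an illustration of the metric family, not the paper's evaluation code, and the beta weighting is a common convention rather than a value reported by the paper.

```python
def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    # ROUGE-L F-score: weighted harmonic mean of LCS precision and recall.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(round(rouge_l("please step back and stay calm",
                    "please stay calm and step back"), 3))  # → 0.5
```

Because ROUGE-L only rewards shared word order, a reordered paraphrase like the one above scores 0.5 despite identical vocabulary, which is exactly the kind of lexical sensitivity the referee report below worries about.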

What carries the argument

The hybrid human-plus-LLM filtering pipeline that distills 1,500 high-fidelity scenarios from 5,000 raw video-derived inputs while preserving dialogue turns and token volume.
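A hedged sketch of how such a two-stage gate might be wired together. The field names, thresholds, and the minimum-turn check are illustrative assumptions for exposition, not the paper's implementation; the paper's actual criteria are in its Figure 2 and Figure 5.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    transcript: str
    turns: int
    human_ok: bool = False      # hypothetical human-annotator verdict
    judge_score: float = 0.0    # hypothetical LLM-as-Judge fidelity in [0, 1]

def hybrid_filter(raw, judge_threshold=0.8, min_turns=5):
    """Keep scenarios that pass BOTH the human check and the LLM judge."""
    kept = []
    for s in raw:
        if s.turns < min_turns:
            continue  # too short to carry a de-escalation arc
        if not s.human_ok:
            continue  # failed human-in-the-loop verification
        if s.judge_score < judge_threshold:
            continue  # failed LLM-as-Judge fidelity screening
        kept.append(s)
    return kept

raw = [Scenario("a", turns=10, human_ok=True, judge_score=0.9),
       Scenario("b", turns=3,  human_ok=True, judge_score=0.9),
       Scenario("c", turns=10, human_ok=False, judge_score=0.9),
       Scenario("d", turns=10, human_ok=True, judge_score=0.5)]
print([s.transcript for s in hybrid_filter(raw)])  # → ['a']
```

The conjunction of gates is the point: a scenario survives only if every filter passes it, which is what makes any systematic bias in the LLM judge propagate directly into the final corpus.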

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same video-to-scenario pipeline could be reused to create training data for other high-stakes verbal skills such as crisis negotiation or medical communication.
  • Local deployment of these tuned models on edge hardware would let officers run private, repeated practice sessions without sending sensitive interaction data to external servers.
  • If the performance edge holds across more officer cohorts, training programs could shift from scripted role-play to open-ended model-driven simulations at much lower ongoing cost.

Load-bearing premise

The filtering process selects scenarios that truly represent typical police-civilian encounters without favoring easier or less representative cases.

What would settle it

A controlled test in which experienced officers rate the realism and helpfulness of responses from the fine-tuned 3B model lower than those from the untuned base model or the larger general model would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2604.13075 by Eshwara Prasad Sridhar, Krity Haque Charu, Md Hasebul Hasan, Mohammad A. Islam, Shuchisnigdha Deb.

Figure 1
Figure 1: Architecture of the Multimodal Virtual De-escalation System. The pipeline operates as a closed-loop system where the Police Officer's multimodal input is processed by the Perception Layer, reasoned upon by the specialized SLM Core, and rendered via the Synthesis Layer to control the Virtual Avatar in real time. view at source ↗
Figure 2
Figure 2: The end-to-end data processing pipeline. The workflow consists of three main stages: (1) Data Acquisition from social media platforms, (2) Data Preparation involving manual review, automated transcription, and hybrid filtering, and (3) Quality Validation including diarization checks to produce the final validated dataset. view at source ↗
Figure 3
Figure 3: Training Dynamics and Convergence Analysis. Training (blue) and validation (red) loss trajectories for (a) Qwen 2.5, (b) Llama 3.2, (c) Gemma 2, (d) Granite 3, and (e) Falcon 3 during fine-tuning on the DeEscalWild dataset. All models exhibit rapid initial convergence within the first 50 steps, followed by a stabilization phase. The close alignment between training and validation curves acros… view at source ↗
Figure 4
Figure 4: Full Feature Schema Taxonomy used for filtering (Phase 2, LLM-based feature extraction: an LLM maps every transcript to a binary vector marking the presence or absence of each signal in the schema). view at source ↗
Figure 5
Figure 5: The zero-shot prompt used for feature extraction. The caption also states the validity rule (Eq. 1):

C_valid(v) = (c(S_noise) = 0) ∧ (c(S_police) ≥ 2) ∧ (c(S_esc) ≥ 3 ∨ c(S_deesc) ≥ 3)

where:
  • Context (first conjunct): ensures the video is not an advertisement or irrelevant content.
  • Relevance (second conjunct): requires at least two distinct signals confirming active police participation.
  • Intensity (third conjunct): requires the interaction to have signif… view at source ↗
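The validity rule quoted in the Figure 5 caption reads directly as a predicate over signal counts. A minimal sketch in Python; the string labels for the signal kinds are illustrative assumptions, since the paper works with binary feature vectors rather than tagged lists.

```python
def c(signals, kind):
    # Count occurrences of a signal kind in the extracted annotation.
    return sum(1 for s in signals if s == kind)

def is_valid(signals):
    """Eq. 1 from the Figure 5 caption: Context AND Relevance AND Intensity."""
    context = c(signals, "noise") == 0           # not an ad / irrelevant clip
    relevance = c(signals, "police") >= 2        # >= 2 police-participation signals
    intensity = (c(signals, "esc") >= 3          # >= 3 escalation signals, or
                 or c(signals, "deesc") >= 3)    # >= 3 de-escalation signals
    return context and relevance and intensity

print(is_valid(["police", "police", "esc", "esc", "esc"]))  # → True
```

Note the asymmetry: a single noise signal vetoes a video outright, while escalation and de-escalation signals substitute for one another inside the intensity disjunct.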
Figure 6
Figure 6: Comprehensive Diversity Analysis of the Dataset. The figure presents a detailed breakdown of sociodemographic and situational attributes. (a) and (b) show the primary demographic composition; (c) and (d) illustrate age and linguistic diversity; (e) and (f) highlight the distribution of incident types and the predominance of high-severity interactions. These distributions confirm the dataset's focus on high… view at source ↗
Figure 7
Figure 7: Structure of the prompt fed to Gemini 2.5 Flash, consisting of a system persona, static few-shot examples, and dynamic transcript history. view at source ↗
read the original abstract

Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model when evaluated under equivalent conditions, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge. We publicly release our code (https://github.com/Hasebul/DeEscalWild-Benchmark-Framework) and dataset (https://doi.org/10.7910/DVN/CWMCZI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DeEscalWild, a benchmark dataset of 1,500 high-fidelity police-civilian de-escalation scenarios curated from 5,000 raw video-derived inputs via a hybrid human-in-the-loop and LLM-as-Judge filtering pipeline, yielding 285,887 dialogue turns. It reports that SLMs fine-tuned on this data, notably Qwen 2.5 (3B-Instruct), significantly outperform their base counterparts and even surpass the general-purpose Gemini 2.5 Flash model across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluations, while releasing the dataset and code.

Significance. If the central empirical claims hold after addressing evaluation details, the work is significant for providing a scalable, real-world dataset that enables efficient SLM-based training simulations for law enforcement, demonstrating that domain-adapted small models can deliver superior performance at lower computational cost than larger generalist LLMs and supporting privacy-preserving edge deployment.

major comments (2)
  1. [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions rests on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than reflecting real-world robustness.
  2. [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.
minor comments (1)
  1. [Abstract] The public release of code and dataset is a strength that supports reproducibility and should be highlighted more explicitly in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's transparency and rigor.

read point-by-point responses
  1. Referee: [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions rests on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than reflecting real-world robustness.

    Authors: We appreciate this concern regarding potential artifacts from the LLM-as-Judge component. The pipeline mitigates this through a subsequent human-in-the-loop verification stage, where domain experts (including law enforcement trainers) reviewed all retained scenarios for realism, diversity of conflict arcs, and adherence to de-escalation principles, explicitly discarding any with overly formulaic or predictable structures. Furthermore, the fine-tuned model's outperformance holds on human evaluations (not solely automatic metrics), which are less susceptible to lexical artifacts. To address the referee's point directly, we will expand the Methods section with additional details on the LLM judge prompts, human review criteria, and examples of filtered-out dialogues. revision: partial

  2. Referee: [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.

    Authors: We agree these reporting details are necessary for assessing robustness. In the revised manuscript, we will add: (1) explicit baseline selection criteria, explaining our choice of base SLM variants and Gemini 2.5 Flash as a strong general-purpose comparator to isolate domain-adaptation effects; (2) statistical significance testing (paired t-tests with Bonferroni correction and reported p-values) for all metric gains; and (3) inter-annotator agreement statistics (e.g., Fleiss' kappa) for the Realism Score and human evaluation annotations. These will be integrated into the Experiments section with a new subsection on evaluation reliability. revision: yes
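Fleiss' kappa, which the rebuttal proposes for reporting inter-annotator agreement, is computed from an items-by-categories count matrix. A minimal pure-Python sketch, offered here as an illustration of the statistic rather than the authors' code:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning
    item i to category j; every row must sum to the same rater count n."""
    N = len(counts)
    n = sum(counts[0])
    k = len(counts[0])
    # Mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement P_e from category marginals
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

print(fleiss_kappa([[3, 0], [0, 3]]))  # perfect agreement → 1.0
```

Perfect agreement yields kappa = 1.0, while agreement no better than the category marginals predict yields 0, which is why the statistic is a more informative reliability report for the Realism Score than raw percent agreement.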

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark and evaluation

full rationale

The paper presents an empirical dataset curation pipeline from real-world video sources followed by SLM fine-tuning and evaluation on held-out scenarios using standard metrics (ROUGE-L, BLEU, etc.) against external baselines including Gemini 2.5 Flash. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a direct performance comparison on independently filtered test data and remains self-contained without reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that video-derived dialogues filtered by the described hybrid process constitute representative, high-fidelity training data; no free parameters are introduced in the abstract, and no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption Publicly available police-civilian interaction videos contain sufficient high-quality, representative de-escalation dialogues for training purposes.
    Invoked when selecting the initial 5,000 raw inputs and when claiming the final 1,500 scenarios are realistic.
  • domain assumption LLM-as-a-Judge combined with human verification reliably identifies high-fidelity scenarios without introducing systematic bias.
    Central to the multi-stage filtering process described in the abstract.

pith-pipeline@v0.9.0 · 5641 in / 1406 out tokens · 34797 ms · 2026-05-15T07:43:35.384118+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 4 internal anchors

  1. [1]

    Exploring the potential of large language models for enhanced virtual non-player character interactions

    Anand, A. and Polyak, E. Exploring the potential of large language models for enhanced virtual non-player character interactions. In INTED2024 Proceedings, 18th International Technology, Education and Development Conference, pp.\ 4895--4898. IATED, 4-6 March, 2024 2024. ISBN 978-84-09-59215-9. doi:10.21125/inted.2024.1269. URL https://doi.org/10.21125/int...

  2. [2]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

    Banerjee, S. and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Goldstein, J., Lavie, A., Lin, C.-Y., and Voss, C. (eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65--72, Ann Arbor, Michigan, June 2005. As...

  3. [3]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

    Bredin, H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In 24th INTERSPEECH Conference (INTERSPEECH 2023), pp. 1983--1987. ISCA, 2023

  4. [4]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  5. [5]

    The falcon 3 family of open models

    Falcon-LLM Team . The falcon 3 family of open models. https://huggingface.co/blog/falcon3, December 2024. Accessed: 2026-01-24

  6. [6]

    Large language models empowered agent-based modeling and simulation: A survey and perspectives

    Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1): 1--24, 2024

  7. [7]

    Granite 3.0 language models

    Granite Team, I. Granite 3.0 language models. URL: https://github.com/ibm-granite/granite-3.0-language-models, 2024

  8. [8]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.\ 1207--1216, Stanford, CA, 2000. Morgan Kaufmann

  10. [10]

    ROUGE: A package for automatic evaluation of summaries

    Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74--81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

  11. [11]

    BLEU: a method for automatic evaluation of machine translation

    Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Isabelle, P., Charniak, E., and Lin, D. (eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311--318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics....

  12. [12]

    Generative agents: Interactive simulacra of human behavior

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp.\ 1--22, 2023

  13. [13]

    Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance

    Pecher, B., Srba, I., and Bielikova, M. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 165--...

  14. [14]

    Robust speech recognition via large-scale weak supervision

    Radford, A., Kim, J. W., Xu, T., Brockman, G., Mcleavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, ...

  15. [15]

    Constructing datasets from public police body camera footage

    Rosas-Smith, J., Bartelds, M., Huang, R., García-Perera, L. P., Livescu, K., Jurafsky, D., and Field, A. Constructing datasets from public police body camera footage. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5. IEEE, 2025

  16. [16]

    Towards ai-driven policing: Interdisciplinary knowledge discovery from police body-worn camera footage

    Srbinovska, A., Srbinovska, A., Senthil, V., Martin, A., McCluskey, J., Bateman, J., and Fokou A S , E. Towards ai-driven policing: Interdisciplinary knowledge discovery from police body-worn camera footage. arXiv preprint arXiv:2504.20007, 2025

  17. [17]

    Adaptive de-escalation trainer: Piloting a RAG-enhanced, emotionally modulated AI simulator for police training

    Sridhar, E. P., Lopez, J., Islam, M., and Deb, S. Adaptive de-escalation trainer: Piloting a rag-enhanced, emotionally modulated ai simulator for police training. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 69, pp.\ 171--175. SAGE Publications Sage CA: Los Angeles, CA, 2025

  18. [18]

    Gemma 2: Improving Open Language Models at a Practical Size

    Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  19. [19]

    Leveraging large language models for enhanced simulation-based learning in police and law enforcement

    Violakis, P. Leveraging large language models for enhanced simulation-based learning in police and law enforcement. Policing: A Journal of Policy and Practice, 19: paaf012, 2025

  20. [20]

    Language from police body camera footage shows racial disparities in officer respect

    Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., and Eberhardt, J. L. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25): 6521--6526, 2017

  21. [21]

    Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

    Wang, N., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Yang, J., et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 14743--14777, 2024

  22. [22]

    CoSER: Coordinating LLM-based persona simulation of established roles

    Wang, X., Wang, H., Zhang, Y., Yuan, X., Xu, R., tse Huang, J., Yuan, S., Guo, H., Chen, J., Zhou, S., Wang, W., and Xiao, Y. CoSER: Coordinating LLM-based persona simulation of established roles. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=BOrR7YqKUt

  23. [23]

    Small models are valuable plug-ins for large language models

    Xu, C., Xu, Y., Wang, S., Liu, Y., Zhu, C., and McAuley, J. Small models are valuable plug-ins for large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 283--294, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.fin...

  24. [24]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  25. [25]

    Let's negotiate! a survey of negotiation dialogue systems

    Zhan, H., Wang, Y., Li, Z., Feng, T., Hua, Y., Sharma, S., Qu, L., Semnani-Azad, Z., Zukerman, I., and Haffari, R. Let's negotiate! a survey of negotiation dialogue systems. In EACL (Findings), 2024

  26. [26]

    BERTScore: Evaluating text generation with BERT

    Zhang*, T., Kishore*, V., Wu*, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr
