pith. machine review for the scientific record.

arxiv: 2604.13075 · v2 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links


DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: de-escalation training · small language models · benchmark dataset · police-civilian interactions · fine-tuning · dialogue generation · real-world scenarios · law enforcement simulation

The pith

A 3-billion-parameter model fine-tuned on real police interactions outperforms a much larger general-purpose LLM in de-escalation dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called DeEscalWild by pulling police-civilian encounters from public videos and filtering them down to 1,500 high-quality scenarios through a mix of human review and LLM judging. This data lets small language models be trained specifically for de-escalation, producing more realistic and effective responses than their untuned versions. The key result is that one such tuned 3B model beats a general-purpose larger model on standard metrics and human judgments while using far less compute. The work aims to make dynamic training simulations practical on portable devices that officers can actually carry into the field.

Core claim

DeEscalWild distills 5,000 raw video-derived inputs from real-world police-civilian footage into 1,500 filtered scenarios comprising 285,887 dialogue turns; fine-tuning small models on this corpus yields higher scores on ROUGE-L, BLEU-4, METEOR, BERTScore, realism, and human evaluation than the base SLMs, with the tuned Qwen 2.5 3B-Instruct exceeding Gemini 2.5 Flash under matched conditions.
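The automatic metrics in the claim are surface-overlap scores. As a concrete reminder of what ROUGE-L measures, here is a minimal LCS-based implementation in pure Python; this is an illustration of the metric family, not the paper's evaluation code, and the beta weighting is a common convention rather than a value reported by the paper.

```python
def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    # ROUGE-L F-score: weighted harmonic mean of LCS precision and recall.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(round(rouge_l("please step back and stay calm",
                    "please stay calm and step back"), 3))  # → 0.5
```

Because ROUGE-L only rewards shared word order, a reordered paraphrase like the one above scores 0.5 despite identical vocabulary, which is exactly the kind of lexical sensitivity the referee report below worries about.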

What carries the argument

The hybrid human-plus-LLM filtering pipeline that distills 1,500 high-fidelity scenarios from 5,000 raw video-derived inputs while preserving dialogue turns and token volume.
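A hedged sketch of how such a two-stage gate might be wired together. The field names, thresholds, and the minimum-turn check are illustrative assumptions for exposition, not the paper's implementation; the paper's actual criteria are in its Figure 2 and Figure 5.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    transcript: str
    turns: int
    human_ok: bool = False      # hypothetical human-annotator verdict
    judge_score: float = 0.0    # hypothetical LLM-as-Judge fidelity in [0, 1]

def hybrid_filter(raw, judge_threshold=0.8, min_turns=5):
    """Keep scenarios that pass BOTH the human check and the LLM judge."""
    kept = []
    for s in raw:
        if s.turns < min_turns:
            continue  # too short to carry a de-escalation arc
        if not s.human_ok:
            continue  # failed human-in-the-loop verification
        if s.judge_score < judge_threshold:
            continue  # failed LLM-as-Judge fidelity screening
        kept.append(s)
    return kept

raw = [Scenario("a", turns=10, human_ok=True, judge_score=0.9),
       Scenario("b", turns=3,  human_ok=True, judge_score=0.9),
       Scenario("c", turns=10, human_ok=False, judge_score=0.9),
       Scenario("d", turns=10, human_ok=True, judge_score=0.5)]
print([s.transcript for s in hybrid_filter(raw)])  # → ['a']
```

The conjunction of gates is the point: a scenario survives only if every filter passes it, which is what makes any systematic bias in the LLM judge propagate directly into the final corpus.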

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same video-to-scenario pipeline could be reused to create training data for other high-stakes verbal skills such as crisis negotiation or medical communication.
  • Local deployment of these tuned models on edge hardware would let officers run private, repeated practice sessions without sending sensitive interaction data to external servers.
  • If the performance edge holds across more officer cohorts, training programs could shift from scripted role-play to open-ended model-driven simulations at much lower ongoing cost.

Load-bearing premise

The filtering process selects scenarios that truly represent typical police-civilian encounters without favoring easier or less representative cases.

What would settle it

A controlled test in which experienced officers rate the realism and helpfulness of responses from the fine-tuned 3B model lower than those from the untuned base model or the larger general model would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2604.13075 by Eshwara Prasad Sridhar, Krity Haque Charu, Md Hasebul Hasan, Mohammad A. Islam, Shuchisnigdha Deb.

Figure 1
Figure 1: Architecture of the Multimodal Virtual De-escalation System. The pipeline operates as a closed-loop system where the Police Officer's multimodal input is processed by the Perception Layer, reasoned upon by the specialized SLM Core, and rendered via the Synthesis Layer to control the Virtual Avatar in real time. view at source ↗
Figure 2
Figure 2: The end-to-end data processing pipeline. The workflow consists of three main stages: (1) Data Acquisition from social media platforms, (2) Data Preparation involving manual review, automated transcription, and hybrid filtering, and (3) Quality Validation including diarization checks to produce the final validated dataset. view at source ↗
Figure 3
Figure 3: Training Dynamics and Convergence Analysis. Training (blue) and validation (red) loss trajectories for (a) Qwen 2.5, (b) Llama 3.2, (c) Gemma 2, (d) Granite 3, and (e) Falcon 3 during fine-tuning on the DeEscalWild dataset. All models exhibit rapid initial convergence within the first 50 steps, followed by a stabilization phase. The close alignment between training and validation curves acros… view at source ↗
Figure 4
Figure 4: Full Feature Schema Taxonomy used for filtering (Phase 2, LLM-based feature extraction: an LLM maps every transcript to a binary vector marking the presence or absence of each signal in the schema). view at source ↗
Figure 5
Figure 5: The zero-shot prompt used for feature extraction. The caption also states the validity rule (Eq. 1):

C_valid(v) = (c(S_noise) = 0) ∧ (c(S_police) ≥ 2) ∧ (c(S_esc) ≥ 3 ∨ c(S_deesc) ≥ 3)

where:
  • Context (first conjunct): ensures the video is not an advertisement or irrelevant content.
  • Relevance (second conjunct): requires at least two distinct signals confirming active police participation.
  • Intensity (third conjunct): requires the interaction to have signif… view at source ↗
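The validity rule quoted in the Figure 5 caption reads directly as a predicate over signal counts. A minimal sketch in Python; the string labels for the signal kinds are illustrative assumptions, since the paper works with binary feature vectors rather than tagged lists.

```python
def c(signals, kind):
    # Count occurrences of a signal kind in the extracted annotation.
    return sum(1 for s in signals if s == kind)

def is_valid(signals):
    """Eq. 1 from the Figure 5 caption: Context AND Relevance AND Intensity."""
    context = c(signals, "noise") == 0           # not an ad / irrelevant clip
    relevance = c(signals, "police") >= 2        # >= 2 police-participation signals
    intensity = (c(signals, "esc") >= 3          # >= 3 escalation signals, or
                 or c(signals, "deesc") >= 3)    # >= 3 de-escalation signals
    return context and relevance and intensity

print(is_valid(["police", "police", "esc", "esc", "esc"]))  # → True
```

Note the asymmetry: a single noise signal vetoes a video outright, while escalation and de-escalation signals substitute for one another inside the intensity disjunct.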
Figure 6
Figure 6: Comprehensive Diversity Analysis of the Dataset. The figure presents a detailed breakdown of sociodemographic and situational attributes. (a) and (b) show the primary demographic composition; (c) and (d) illustrate age and linguistic diversity; (e) and (f) highlight the distribution of incident types and the predominance of high-severity interactions. These distributions confirm the dataset's focus on high… view at source ↗
Figure 7
Figure 7: Structure of the prompt fed to Gemini 2.5 Flash, consisting of a system persona, static few-shot examples, and dynamic transcript history. view at source ↗
read the original abstract

Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model when evaluated under equivalent conditions, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge. We publicly release our code (https://github.com/Hasebul/DeEscalWild-Benchmark-Framework) and dataset (https://doi.org/10.7910/DVN/CWMCZI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DeEscalWild, a benchmark dataset of 1,500 high-fidelity police-civilian de-escalation scenarios curated from 5,000 raw video-derived inputs via a hybrid human-in-the-loop and LLM-as-Judge filtering pipeline, yielding 285,887 dialogue turns. It reports that SLMs fine-tuned on this data, notably Qwen 2.5 (3B-Instruct), significantly outperform their base counterparts and even surpass the general-purpose Gemini 2.5 Flash model across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluations, while releasing the dataset and code.

Significance. If the central empirical claims hold after addressing evaluation details, the work is significant for providing a scalable, real-world dataset that enables efficient SLM-based training simulations for law enforcement, demonstrating that domain-adapted small models can deliver superior performance at lower computational cost than larger generalist LLMs and supporting privacy-preserving edge deployment.

major comments (2)
  1. [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions rests on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than reflecting real-world robustness.
  2. [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.
minor comments (1)
  1. [Abstract] The public release of code and dataset is a strength that supports reproducibility and should be highlighted more explicitly in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's transparency and rigor.

read point-by-point responses
  1. Referee: [Abstract] The claim that fine-tuned Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash under equivalent conditions rests on the hybrid filtering pipeline (5,000 raw inputs to 1,500 scenarios) producing unbiased, high-fidelity data; the LLM-as-Judge step risks retaining dialogues with predictable lexical or arc patterns that align with supervised fine-tuning objectives, potentially widening gaps on automatic metrics as an artifact rather than reflecting real-world robustness.

    Authors: We appreciate this concern regarding potential artifacts from the LLM-as-Judge component. The pipeline mitigates this through a subsequent human-in-the-loop verification stage, where domain experts (including law enforcement trainers) reviewed all retained scenarios for realism, diversity of conflict arcs, and adherence to de-escalation principles, explicitly discarding any with overly formulaic or predictable structures. Furthermore, the fine-tuned model's outperformance holds on human evaluations (not solely automatic metrics), which are less susceptible to lexical artifacts. To address the referee's point directly, we will expand the Methods section with additional details on the LLM judge prompts, human review criteria, and examples of filtered-out dialogues. revision: partial

  2. Referee: [Experiments] No information is provided on baseline selection criteria, statistical significance testing for the reported metric gains, or inter-annotator agreement for the Realism Score and human evaluations; these omissions undermine confidence that the consistent improvements are robust rather than sensitive to post-hoc choices.

    Authors: We agree these reporting details are necessary for assessing robustness. In the revised manuscript, we will add: (1) explicit baseline selection criteria, explaining our choice of base SLM variants and Gemini 2.5 Flash as a strong general-purpose comparator to isolate domain-adaptation effects; (2) statistical significance testing (paired t-tests with Bonferroni correction and reported p-values) for all metric gains; and (3) inter-annotator agreement statistics (e.g., Fleiss' kappa) for the Realism Score and human evaluation annotations. These will be integrated into the Experiments section with a new subsection on evaluation reliability. revision: yes
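Fleiss' kappa, which the rebuttal proposes for reporting inter-annotator agreement, is computed from an items-by-categories count matrix. A minimal pure-Python sketch, offered here as an illustration of the statistic rather than the authors' code:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning
    item i to category j; every row must sum to the same rater count n."""
    N = len(counts)
    n = sum(counts[0])
    k = len(counts[0])
    # Mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement P_e from category marginals
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

print(fleiss_kappa([[3, 0], [0, 3]]))  # perfect agreement → 1.0
```

Perfect agreement yields kappa = 1.0, while agreement no better than the category marginals predict yields 0, which is why the statistic is a more informative reliability report for the Realism Score than raw percent agreement.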

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark and evaluation

full rationale

The paper presents an empirical dataset curation pipeline from real-world video sources followed by SLM fine-tuning and evaluation on held-out scenarios using standard metrics (ROUGE-L, BLEU, etc.) against external baselines including Gemini 2.5 Flash. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a direct performance comparison on independently filtered test data and remains self-contained without reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that video-derived dialogues filtered by the described hybrid process constitute representative, high-fidelity training data; no free parameters are introduced in the abstract, and no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption Publicly available police-civilian interaction videos contain sufficient high-quality, representative de-escalation dialogues for training purposes.
    Invoked when selecting the initial 5,000 raw inputs and when claiming the final 1,500 scenarios are realistic.
  • domain assumption LLM-as-a-Judge combined with human verification reliably identifies high-fidelity scenarios without introducing systematic bias.
    Central to the multi-stage filtering process described in the abstract.

pith-pipeline@v0.9.0 · 5641 in / 1406 out tokens · 34797 ms · 2026-05-15T07:43:35.384118+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 4 internal anchors

  1. [1]

    Exploring the potential of large language models for enhanced virtual non-player character interactions

    Anand, A. and Polyak, E. Exploring the potential of large language models for enhanced virtual non-player character interactions. In INTED2024 Proceedings, 18th International Technology, Education and Development Conference, pp.\ 4895--4898. IATED, 4-6 March, 2024 2024. ISBN 978-84-09-59215-9. doi:10.21125/inted.2024.1269. URL https://doi.org/10.21125/int...

  2. [2]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

    Banerjee, S. and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Goldstein, J., Lavie, A., Lin, C.-Y., and Voss, C. (eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65--72, Ann Arbor, Michigan, June 2005. As...

  3. [3]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

    Bredin, H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In 24th INTERSPEECH Conference (INTERSPEECH 2023), pp. 1983--1987. ISCA, 2023

  4. [4]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  5. [5]

    The falcon 3 family of open models

    Falcon-LLM Team . The falcon 3 family of open models. https://huggingface.co/blog/falcon3, December 2024. Accessed: 2026-01-24

  6. [6]

    Large language models empowered agent-based modeling and simulation: A survey and perspectives

    Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1): 1--24, 2024

  7. [7]

    Granite 3.0 language models

    Granite Team, I. Granite 3.0 language models. URL: https://github.com/ibm-granite/granite-3.0-language-models, 2024

  8. [8]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.\ 1207--1216, Stanford, CA, 2000. Morgan Kaufmann

  10. [10]

    ROUGE: A package for automatic evaluation of summaries

    Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74--81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

  11. [11]

    BLEU: a method for automatic evaluation of machine translation

    Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Isabelle, P., Charniak, E., and Lin, D. (eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311--318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics....

  12. [12]

    Generative agents: Interactive simulacra of human behavior

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp.\ 1--22, 2023

  13. [13]

    Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance

    Pecher, B., Srba, I., and Bielikova, M. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 165--...

  14. [14]

    Robust speech recognition via large-scale weak supervision

    Radford, A., Kim, J. W., Xu, T., Brockman, G., Mcleavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, ...

  15. [15]

    Constructing datasets from public police body camera footage

    Rosas-Smith, J., Bartelds, M., Huang, R., García-Perera, L. P., Livescu, K., Jurafsky, D., and Field, A. Constructing datasets from public police body camera footage. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5. IEEE, 2025

  16. [16]

    Towards ai-driven policing: Interdisciplinary knowledge discovery from police body-worn camera footage

    Srbinovska, A., Srbinovska, A., Senthil, V., Martin, A., McCluskey, J., Bateman, J., and Fokou A S , E. Towards ai-driven policing: Interdisciplinary knowledge discovery from police body-worn camera footage. arXiv preprint arXiv:2504.20007, 2025

  17. [17]

    Adaptive de-escalation trainer: Piloting a RAG-enhanced, emotionally modulated AI simulator for police training

    Sridhar, E. P., Lopez, J., Islam, M., and Deb, S. Adaptive de-escalation trainer: Piloting a rag-enhanced, emotionally modulated ai simulator for police training. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 69, pp.\ 171--175. SAGE Publications Sage CA: Los Angeles, CA, 2025

  18. [18]

    Gemma 2: Improving Open Language Models at a Practical Size

    Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  19. [19]

    Leveraging large language models for enhanced simulation-based learning in police and law enforcement

    Violakis, P. Leveraging large language models for enhanced simulation-based learning in police and law enforcement. Policing: A Journal of Policy and Practice, 19: paaf012, 2025

  20. [20]

    Language from police body camera footage shows racial disparities in officer respect

    Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., and Eberhardt, J. L. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25): 6521--6526, 2017

  21. [21]

    Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

    Wang, N., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Yang, J., et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 14743--14777, 2024

  22. [22]

    CoSER: Coordinating LLM-based persona simulation of established roles

    Wang, X., Wang, H., Zhang, Y., Yuan, X., Xu, R., tse Huang, J., Yuan, S., Guo, H., Chen, J., Zhou, S., Wang, W., and Xiao, Y. CoSER: Coordinating LLM-based persona simulation of established roles. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=BOrR7YqKUt

  23. [23]

    Small models are valuable plug-ins for large language models

    Xu, C., Xu, Y., Wang, S., Liu, Y., Zhu, C., and McAuley, J. Small models are valuable plug-ins for large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 283--294, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.fin...

  24. [24]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  25. [25]

    Let's negotiate! a survey of negotiation dialogue systems

    Zhan, H., Wang, Y., Li, Z., Feng, T., Hua, Y., Sharma, S., Qu, L., Semnani-Azad, Z., Zukerman, I., and Haffari, R. Let's negotiate! a survey of negotiation dialogue systems. In EACL (Findings), 2024

  26. [26]

    BERTScore: Evaluating text generation with BERT

    Zhang*, T., Kishore*, V., Wu*, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr
