LaMSUM: Amplifying Voices Against Harassment through LLM Guided Extractive Summarization of User Incident Reports
Pith reviewed 2026-05-24 00:09 UTC · model grok-4.3
The pith
LaMSUM uses multi-level voting to make LLMs output extractive summaries from large collections of code-mixed harassment reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LaMSUM is a novel multi-level framework combining summarization with different voting methods to generate extractive summaries for large collections of incident reports using LLMs. It addresses LLMs' tendency to produce abstractive outputs and their limited context windows when processing code-mixed languages. Extensive evaluation using four popular LLMs (Llama, Mistral, Claude and GPT-4o) demonstrates that LaMSUM outperforms state-of-the-art extractive summarization methods. The work is presented as one of the first attempts to achieve extractive summarization through LLMs.
What carries the argument
A multi-level framework that pairs staged summarization with voting methods to steer LLMs toward selecting verbatim excerpts.
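The staged selection plus voting idea can be illustrated with a short sketch. This is not the paper's implementation: the page does not give the prompts, chunk sizes, or voting rules, so the plurality-style vote over shuffled repeated runs, the names `vote_select` and `multi_level_summary`, and the stand-in `longest_first_selector` (which replaces a real LLM call) are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical interface: a selector receives candidate sentences and k,
# and returns the sentences it wants to keep (an LLM call in practice).
Selector = Callable[[List[str], int], List[str]]


def longest_first_selector(sentences: List[str], k: int) -> List[str]:
    """Toy stand-in for an LLM: keep the k longest sentences."""
    return sorted(sentences, key=len, reverse=True)[:k]


def vote_select(sentences: List[str], k: int, selector: Selector,
                rounds: int = 3, seed: int = 0) -> List[str]:
    """Query the selector several times on shuffled input and keep the
    k input sentences with the most votes (plurality-style, assumed)."""
    rng = random.Random(seed)
    allowed = set(sentences)
    votes: Counter = Counter()
    for _ in range(rounds):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        for picked in set(selector(shuffled, k)):
            if picked in allowed:  # paraphrased output earns no vote
                votes[picked] += 1
    # Stable sort: ties keep their original input order.
    ranked = sorted(sentences, key=lambda s: -votes[s])
    return ranked[:k]


def multi_level_summary(sentences: List[str], k: int, chunk_size: int,
                        selector: Selector) -> List[str]:
    """Reduce a large pool level by level until it fits in one chunk."""
    pool = list(dict.fromkeys(sentences))  # de-duplicate, keep order
    while len(pool) > chunk_size:
        next_pool: List[str] = []
        for i in range(0, len(pool), chunk_size):
            chunk = pool[i:i + chunk_size]
            next_pool.extend(vote_select(chunk, min(k, len(chunk)), selector))
        if len(next_pool) >= len(pool):  # guard: pool did not shrink
            break
        pool = next_pool
    return vote_select(pool, min(k, len(pool)), selector)
```

Because the result is always drawn from the input pool, the output is extractive by construction in this sketch, even if the selector paraphrases. Whether the paper enforces the property this way is not stated on this page.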
If this is right
- Stakeholders receive a single overview covering thousands of reports without reading each one.
- Policy makers can identify recurring patterns in harassment incidents more efficiently.
- The same LLM back-ends (Llama, Mistral, Claude, GPT-4o) produce higher-quality extractive summaries than prior dedicated extractive systems.
- Code-mixed text common in real user reports is handled without language-specific preprocessing.
- The framework offers an early route to extractive rather than abstractive output from LLMs.
Where Pith is reading between the lines
- The voting mechanism might be reusable in other LLM tasks where strict fidelity to source text is required, such as legal document extraction.
- Adding more hierarchy levels could allow processing of even larger report sets that current context limits still block.
- Testing the framework on complaint data from unrelated domains would reveal whether the multi-level voting pattern generalizes beyond harassment reports.
Load-bearing premise
Voting across staged LLM outputs can reliably force selection of original text excerpts rather than new paraphrased sentences when reports mix languages and exceed single context windows.
What would settle it
Generate summaries on a test collection of 50 reports and verify whether every sentence in each output appears as a contiguous verbatim substring in the input set, with zero added or reworded content.
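A minimal sketch of such a check follows. It assumes a naive punctuation-based sentence splitter and collapses runs of whitespace before matching, which is a slight relaxation of strict verbatim matching. The function names are hypothetical and the example reports are invented for illustration.

```python
import re
from typing import List


def normalize_ws(text: str) -> str:
    """Collapse whitespace runs so line wraps do not cause false alarms."""
    return " ".join(text.split())


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def non_verbatim_sentences(summary: str, reports: List[str]) -> List[str]:
    """Return summary sentences that are NOT a contiguous substring
    of any single input report. An empty list means fully extractive."""
    corpus = [normalize_ws(r) for r in reports]
    offenders = []
    for sentence in split_sentences(summary):
        needle = normalize_ws(sentence)
        if not any(needle in report for report in corpus):
            offenders.append(sentence)
    return offenders


if __name__ == "__main__":
    reports = [
        "He followed me to the bus stop. Koi madad nahi mili.",
        "A man was staring at me in the metro.",
    ]
    print(non_verbatim_sentences("A man stared at me in the metro.", reports))
```

A real run would need a splitter that copes with code-mixed text and missing punctuation; the regular expression here is only a placeholder.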
Original abstract
Citizen reporting platforms help the public and authorities stay informed about sexual harassment incidents. However, the high volume of data shared on these platforms makes reviewing each individual case challenging. Therefore, a summarization algorithm capable of processing and understanding various code-mixed languages is essential. In recent years, Large Language Models (LLMs) have shown exceptional performance in NLP tasks, including summarization. LLMs inherently produce abstractive summaries by paraphrasing the original text, while the generation of extractive summaries - selecting specific subsets from the original text - through LLMs remains largely unexplored. Moreover, LLMs have a limited context window size, restricting the amount of data that can be processed at once. We tackle these challenges by introducing LaMSUM, a novel multi-level framework combining summarization with different voting methods to generate extractive summaries for large collections of incident reports using LLMs. Extensive evaluation using four popular LLMs (Llama, Mistral, Claude and GPT-4o) demonstrates that LaMSUM outperforms state-of-the-art extractive summarization methods. Overall, this work represents one of the first attempts to achieve extractive summarization through LLMs, and is likely to support stakeholders by offering a comprehensive overview and enabling them to develop effective policies to minimize incidents of unwarranted harassment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LaMSUM, a novel multi-level framework that combines LLM summarization with voting methods to generate extractive summaries from large collections of code-mixed incident reports on sexual harassment. The central claim is that LaMSUM outperforms state-of-the-art extractive summarization methods, as demonstrated through extensive evaluation using four popular LLMs: Llama, Mistral, Claude, and GPT-4o. The work aims to address challenges of context window limits and the tendency of LLMs to produce abstractive rather than extractive summaries.
Significance. If the results hold and the outputs are verifiably extractive, this could be a significant contribution to applying LLMs for extractive summarization in challenging settings involving code-mixed languages and large document collections, potentially aiding stakeholders in understanding harassment patterns and developing policies. The use of multiple LLMs and voting methods is a promising direction for steering LLMs towards extractive outputs.
major comments (2)
- [Abstract] Abstract: the claim that LaMSUM 'outperforms state-of-the-art extractive summarization methods' is asserted without any quantitative metrics, baselines, dataset sizes, or evaluation protocol described, leaving the central empirical claim without visible supporting evidence.
- [LaMSUM framework description] LaMSUM framework description (multi-level chunking + voting): no post-hoc verification is described that the generated summaries remain strictly extractive (e.g., sentence-level overlap ratio, ROUGE-L computed only against source sentences, or an ablation disabling voting to measure extractiveness drop). LLMs default to abstractive output, so without such a check the reported gains could reflect improved abstractive content rather than the claimed extractive property, undermining direct comparison to classical extractive baselines that are extractive by construction.
minor comments (1)
- [Abstract] Abstract: the statement that this is 'one of the first attempts' to achieve extractive summarization through LLMs would benefit from explicit citations to any prior LLM-based extractive work for context.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.
- Referee: [Abstract] Abstract: the claim that LaMSUM 'outperforms state-of-the-art extractive summarization methods' is asserted without any quantitative metrics, baselines, dataset sizes, or evaluation protocol described, leaving the central empirical claim without visible supporting evidence.
  Authors: We acknowledge that the abstract presents the performance claim at a high level. While the full quantitative results, baselines, dataset details, and evaluation protocol are provided in the Experiments and Results sections, we agree that including key supporting metrics in the abstract would make the central claim more self-contained. In the revised manuscript we will add a concise statement of the main ROUGE improvements, number of baselines, and dataset size to the abstract.
  Revision: yes
- Referee: [LaMSUM framework description] LaMSUM framework description (multi-level chunking + voting): no post-hoc verification is described that the generated summaries remain strictly extractive (e.g., sentence-level overlap ratio, ROUGE-L computed only against source sentences, or an ablation disabling voting to measure extractiveness drop). LLMs default to abstractive output, so without such a check the reported gains could reflect improved abstractive content rather than the claimed extractive property, undermining direct comparison to classical extractive baselines that are extractive by construction.
  Authors: This concern is valid. Although the LaMSUM design (multi-level chunking followed by sentence-level voting) is intended to enforce extractiveness by selecting verbatim sentences from the source, the submitted manuscript does not include explicit post-hoc verification. We will add a dedicated subsection that reports (1) sentence-level overlap ratios, (2) ROUGE-L computed exclusively against source sentences, and (3) an ablation that disables the voting stage to quantify any drop in extractiveness. These additions will directly address the possibility of abstractive leakage and strengthen the comparison to classical extractive baselines.
  Revision: yes
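The first two checks named in the rebuttal could be computed roughly as follows. This is a sketch under stated assumptions, not the authors' evaluation code: it uses whitespace tokenization, an LCS-based ROUGE-L F1 with equal weight on precision and recall instead of the official ROUGE package, and a hypothetical function name `extractiveness_report`.

```python
from typing import Dict, List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 over whitespace tokens (simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def extractiveness_report(summary_sents: List[str],
                          source_sents: List[str]) -> Dict[str, float]:
    """Overlap ratio: share of summary sentences found verbatim in the source.
    Mean best ROUGE-L: each summary sentence scored against its closest
    source sentence, then averaged."""
    n = len(summary_sents)
    if n == 0:
        return {"overlap_ratio": 0.0, "mean_best_rouge_l": 0.0}
    source_set = set(source_sents)
    exact = sum(1 for s in summary_sents if s in source_set)
    best = [max((rouge_l_f1(s, src) for src in source_sents), default=0.0)
            for s in summary_sents]
    return {"overlap_ratio": exact / n, "mean_best_rouge_l": sum(best) / n}
```

Running the same report on summaries produced with and without the voting stage would give the ablation comparison the authors describe. An overlap ratio below 1.0 flags abstractive leakage directly.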
Circularity Check
No circularity; empirical comparison is independent of inputs
The paper presents an empirical framework (LaMSUM) for LLM-based extractive summarization and reports performance gains against external baselines using four named LLMs. No equations, fitted parameters, or self-citations are invoked as load-bearing premises that reduce the central claim to a tautology or prior author result. The evaluation is described as direct comparison on held-out incident reports; the extractiveness property is asserted via the multi-level voting design rather than being defined in terms of the measured outcomes. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can be guided via multi-level processing and voting to output extractive rather than abstractive summaries despite limited context windows.
invented entities (1)
- LaMSUM framework: no independent evidence