Recognition: 1 theorem link
LLM-based uncertainty assessment of social media situational signals for crisis reporting
Pith reviewed 2026-05-15 09:41 UTC · model grok-4.3
The pith
An uncertainty layer lets LLMs judge how plausible each social media disaster claim is when checked against external proxy data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an uncertainty-aware framework for automated situational awareness reporting that explicitly accounts for the plausibility of social media claims. After classifying posts according to an established schema, we add an uncertainty assessment layer that evaluates whether individual situational claims plausibly reflect real-world conditions when conditioned on external proxy data, while explicitly eliciting the model's confidence in this judgment. These assessments are then used to generate crisis reports that communicate not only what is being reported but how certain those reports are.
What carries the argument
The uncertainty assessment layer, which conditions an LLM's plausibility judgment of each social media claim on external proxy data such as USGS PAGER summaries and elicits an explicit confidence score for that judgment.
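The mechanics of that layer can be sketched in miniature: build a prompt that conditions the judgment on a proxy summary, elicit a yes/no verdict plus a confidence score, and parse the reply into a structured record. The prompt wording, the `"yes, 0.85"`-style reply format, and the stubbed model call below are assumptions for illustration, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    plausible: bool    # binary plausibility judgment
    confidence: float  # elicited confidence in [0, 1]

def build_assessment_prompt(claim: str, proxy_summary: str) -> str:
    """Condition the plausibility judgment on external proxy data
    (e.g., a USGS PAGER impact summary) and elicit a confidence score."""
    return (
        "You are an expert crisis analysis assistant.\n"
        f"External impact summary (USGS PAGER):\n{proxy_summary}\n\n"
        f"Social media claim:\n{claim}\n\n"
        "Does the claim plausibly reflect real-world conditions? "
        "Answer 'yes' or 'no', then give your confidence from 0.0 to 1.0."
    )

def parse_assessment(raw: str) -> Assessment:
    """Parse a 'yes, 0.85'-style model reply into a structured record."""
    verdict, score = (part.strip() for part in raw.split(","))
    return Assessment(plausible=verdict.lower() == "yes",
                      confidence=float(score))

def judge(claim: str, proxy_summary: str) -> Assessment:
    """Stub standing in for an LLM call; a real pipeline would send
    build_assessment_prompt(...) to a model API and parse its reply."""
    _prompt = build_assessment_prompt(claim, proxy_summary)
    reply = "yes, 0.85"  # placeholder model reply for the sketch
    return parse_assessment(reply)
```

In a deployed pipeline the parsed records would be attached to each classified post and carried forward into report generation.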
If this is right
- Crisis reports can include explicit uncertainty levels that let responders focus first on the most plausible signals.
- External proxy data sources can be systematically folded into LLM pipelines for situational awareness.
- The same classification-plus-uncertainty pipeline scales to hundreds of thousands of posts without requiring equal trust in every message.
- Human crisis communicators receive an additional signal for prioritizing information under time pressure.
Where Pith is reading between the lines
- The approach could be extended to other disaster types by swapping in appropriate proxy data such as flood maps or wildfire perimeters.
- Uncertainty scores might serve as weights when fusing social media with traditional sensor feeds.
- Repeated application across events could reveal systematic biases in which kinds of claims LLMs tend to over- or under-estimate.
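The fusion idea above can be made concrete: treat the elicited confidence as a mixing weight between a trusted sensor-derived estimate and the social-media-derived one. The paper does not specify a fusion rule; this linear blend is a hypothetical sketch:

```python
def fused_estimate(sensor_value: float,
                   social_value: float,
                   social_confidence: float) -> float:
    """Blend a trusted sensor reading with a social-media-derived
    estimate, using the elicited confidence as the fusion weight."""
    w = max(0.0, min(1.0, social_confidence))  # clamp to [0, 1]
    return (1.0 - w) * sensor_value + w * social_value
```

A low-confidence social signal barely moves the sensor estimate, while a high-confidence one pulls the fused value toward it; claims judged implausible would simply be assigned weight zero.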
Load-bearing premise
Large language models can reliably decide whether a social media claim about a disaster matches real-world conditions once they are given external proxy summaries.
What would settle it
Direct comparison of the framework's uncertainty scores against independent human expert ratings or post-event verified ground truth on the same collection of social media posts.
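A standard way to run that comparison is a calibration check: bin claims by elicited confidence and, in each bin, compare the mean confidence to the fraction of claims verified true (expected calibration error). A minimal sketch, assuming a ground-truth label is available per claim:

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: weighted average, over confidence bins, of the gap between
    mean elicited confidence and observed accuracy in that bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # keep conf=1.0 in last bin
        bins[idx].append((conf, outcome))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

An ECE near zero on post-event verified labels would support the prioritization claim; a large ECE would suggest the scores reflect the model's internal priors more than real-world plausibility.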
read the original abstract
Social media has become a critical source of situational awareness during disasters, providing real-time insights into evolving impacts and emerging needs. To support crisis response at scale, recent work has increasingly leveraged large language models (LLMs) to automatically classify and summarize situational information from social media streams. However, existing approaches implicitly assume that extracted situational claims are equally plausible, despite information quality varying substantially as a crisis unfolds. In this work, we propose an uncertainty-aware framework for automated situational awareness reporting that explicitly accounts for the plausibility of social media claims. First, we classify social media posts according to an established situational awareness schema. Second, we introduce an uncertainty assessment layer that evaluates whether individual situational claims plausibly reflect real-world conditions when conditioned on external proxy data, while explicitly eliciting the model's confidence in this judgment. Third, we use these uncertainty assessments to generate crisis reports that communicate not only what is being reported, but how certain those reports are. We apply this framework to over 200,000 earthquake-related Twitter/X posts, using impact summaries from the USGS PAGER as a representative external proxy. We argue that explicitly representing uncertainty supports human crisis communicators in prioritizing information under time pressure, and provides a framework for integrating external proxy data into LLM-based situational awareness pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a three-step LLM-based framework for uncertainty-aware situational awareness from social media during crises. Posts are first classified according to an established schema; an uncertainty assessment layer then evaluates the plausibility of individual claims when conditioned on external proxy data such as USGS PAGER summaries while eliciting the model's confidence; finally, these assessments are used to generate reports that communicate both the situational information and its assessed certainty. The framework is applied to over 200,000 earthquake-related Twitter/X posts.
Significance. If the LLM-generated uncertainty scores prove well-calibrated against real-world outcomes, the work would provide a practical method for prioritizing information under time pressure and a reusable template for incorporating external proxy data into LLM pipelines for crisis informatics.
major comments (2)
- [Abstract / Proposed Framework] The manuscript describes the three-step pipeline (classification, uncertainty assessment conditioned on PAGER, and report generation) but supplies no quantitative results, validation metrics, calibration plots, error analysis, or ablation studies. Without measuring whether low-uncertainty claims align with verified impacts at higher rates than high-uncertainty ones, the central claim that the uncertainty layer supports prioritization remains an untested modeling choice.
- [Uncertainty Assessment Layer] The uncertainty assessment step relies on LLM binary plausibility judgments plus elicited confidence when conditioned on USGS PAGER summaries, yet no external validation against ground truth (post-event verified damage reports, official casualty figures, or expert annotations on the same posts) is reported. This leaves open whether the confidence scores are calibrated or merely reflect the model's internal priors.
minor comments (1)
- The abstract states the framework is applied to over 200,000 posts but does not report the number of generated reports, any example outputs, or the distribution of uncertainty scores.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We agree that quantitative validation is crucial for substantiating the effectiveness of the uncertainty assessment layer and will revise the paper accordingly to include relevant metrics and analyses.
read point-by-point responses
-
Referee: [Abstract / Proposed Framework] The manuscript describes the three-step pipeline (classification, uncertainty assessment conditioned on PAGER, and report generation) but supplies no quantitative results, validation metrics, calibration plots, error analysis, or ablation studies. Without measuring whether low-uncertainty claims align with verified impacts at higher rates than high-uncertainty ones, the central claim that the uncertainty layer supports prioritization remains an untested modeling choice.
Authors: We acknowledge the validity of this observation. The current version of the manuscript focuses on introducing the framework and demonstrating its application on a large corpus of posts, but does not include the quantitative evaluations mentioned. In the revised manuscript, we will add validation metrics, including calibration analysis of the uncertainty scores against available ground truth data from post-event reports, and ablation studies to assess the contribution of the uncertainty layer. revision: yes
-
Referee: [Uncertainty Assessment Layer] The uncertainty assessment step relies on LLM binary plausibility judgments plus elicited confidence when conditioned on USGS PAGER summaries, yet no external validation against ground truth (post-event verified damage reports, official casualty figures, or expert annotations on the same posts) is reported. This leaves open whether the confidence scores are calibrated or merely reflect the model's internal priors.
Authors: We agree that external validation is necessary to confirm the calibration of the elicited confidence scores. We will incorporate comparisons with verified impact data from USGS and other sources in the revision. This will include analyzing the correlation between low-uncertainty claims and actual reported damages. revision: yes
Circularity Check
No significant circularity; framework relies on independent external proxy data
full rationale
The paper's core pipeline classifies social media posts, then prompts an LLM for plausibility judgments explicitly conditioned on external USGS PAGER summaries as proxy data, and finally generates uncertainty-aware reports. This chain does not reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The uncertainty layer is constructed by feeding independent external summaries into the LLM rather than deriving it from the posts alone or from prior author results. No equations or ansatzes are shown to loop back to the input claims by construction. The approach is self-contained against external benchmarks and receives a normal non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can reliably evaluate the plausibility of social media claims when conditioned on external proxy data such as USGS PAGER summaries
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction: reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We introduce an uncertainty assessment layer that evaluates whether individual situational claims plausibly reflect real-world conditions when conditioned on external proxy data... using impact summaries from the USGS PAGER
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Riccardo Cantini, Cristian Cosentino, Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. Harnessing prompt-based Large Language Models for disaster monitoring and automated reporting from social media feedback. Online Social Networks and Media, 45:100295, 2025.
- [2] Loris Belcastro, Cristian Cosentino, Fabrizio Marozzo, Merve Gündüz-Cüre, and Şule Öztürk-Birim. Multistakeholder disaster insights from social media using Large Language Models. IEEE Transactions on Computational Social Systems, 2025.
- [3] Minsun Shim and Heui Sug Jo. What quality factors matter in enhancing the perceived benefits of online health information sites? Application of the updated DeLone and McLean information systems success model. International Journal of Medical Informatics, 137:104093, 2020.
- [4] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR), 47(4):1–38, 2015.
- [5] Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoubfar, Seyed Amir Ahmad Safavi-Naini, and Ali Soroush. Evaluating prompt engineering techniques for accuracy and confidence elicitation in medical LLMs. In Proceedings of the International Workshop on Explainable, Trustworthy, and Responsible Artificial Intelligence and Multi-Agent Systems, pages 67–84. S..., 2025.
- [6] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP..., 2023.
- [7] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063, 2023.
- [8] Lingyao Li, Michelle Bensi, and Gregory Baecher. Exploring the potential of social media crowdsourcing for post-earthquake damage assessment. International Journal of Disaster Risk Reduction, 98:104062, 2023.
- [9] Lingyao Li, Michelle Bensi, Qingbin Cui, Gregory B Baecher, and You Huang. Social media crowdsourcing for rapid damage assessment following a sudden-onset natural hazard event. International Journal of Information Management, 60:102378, 2021.
- [10] Sarah Vieweg, Amanda L Hughes, Kate Starbird, and Leysia Palen. Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'10), pages 1079–1088, 2010.
- [11] Zheye Wang and Xinyue Ye. Space, time, and situational awareness in natural hazards: A case study of Hurricane Sandy with social media data. Cartography and Geographic Information Science, 46(4):334–346, 2019.
- [12] Alan M MacEachren, Anuj Jaiswal, Anthony C Robinson, Scott Pezanowski, Alexander Savelyev, Prasenjit Mitra, Xiao Zhang, and Justine Blanford. SensePlace2: Geotwitter analytics support for situational awareness. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 181–190. IEEE, 2011.
- [13] Qunying Huang and Yu Xiao. Geographic situational awareness: Mining tweets for disaster preparedness, emergency response, impact, and recovery. ISPRS International Journal of Geo-Information, 4(3):1549–1568, 2015.
- [14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Extracting information nuggets from disaster-related messages in social media. In Proceedings of the International Conference on Information Systems for Crisis Response and Management (ISCRAM'13), pages 791–800, 2013.
- [15] Himanshu Zade, Kushal Shah, Vaibhavi Rangarajan, Priyanka Kshirsagar, Muhammad Imran, and Kate Starbird. From situational awareness to actionability: Towards improving the utility of social media data for crisis response. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–18, 2018.
- [16] Philipp Seeberger and Korbinian Riedhammer. Crisis2sum: An exploratory study on disaster summarization from multiple streams. In International Conference on Information Systems for Crisis Response and Management (ISCRAM'24), 2024.
- [17] Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, and Marco Guerini. Crisitext: A dataset of warning messages for LLM training in emergency communication. arXiv preprint arXiv:2510.09243, 2025.
- [18] Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra. Calibrating long-form generations from Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP'24), pages 13441–13460. ACL, 2024.
- [19] Alice Plebe, Timothy Douglas, Diana Riazi, and R Maria del Rio-Chanona. I'll believe it when I see it: Images increase misinformation sharing in vision-language models. arXiv preprint arXiv:2505.13302, 2025.
- [20] Jungjae Lee, Yubin Choi, Minhyuk Song, and Sanghyun Park. Chatfive: Enhancing user experience in Likert scale personality test through interactive conversation with LLM agents. In Proceedings of the 6th ACM Conference on Conversational User Interfaces (CUI'24), volume 6, pages 1–8, New York, NY, USA, 2024. ACM.
- [21] Clara Murray and Cristina Criddle. Growth stalls at Elon Musk's X. https://www.ft.com/content/1829abb6-d8d0-4d67-9b1b-628f583b3291, 2024. Accessed: 2026-01-08.
- [22] Roben Delos Reyes, Timothy Douglas, and Asanobu Kitamoto. Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets, 2026.
- [23] Timothy Douglas, Licia Capra, and Mirco Musolesi. Tracking the temporal evolution of online social relationships in times of crisis. Journal of Social Computing, 6:1–24, 2026.
- [24] Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. Ten social dimensions of conversations and relationships. In Proceedings of The International World Wide Web Conference (WWW'20), pages 1514–1525, 2020.
- [25] Guanren Zhou and Khalid M Mosalam. Automated virtual earthquake reconnaissance reporting using natural language processing. Natural Hazards Review, 26(3):04025018, 2025.
- [26] Kiran Zahra, Muhammad Imran, and Frank O Ostermann. Automatic identification of eyewitness messages on Twitter during disasters. Information Processing & Management, 57(1):102107, 2020.
- [27] Koustav Rudra, Niloy Ganguly, Pawan Goyal, and Saptarshi Ghosh. Extracting and summarizing situational information from the Twitter social media during disasters. ACM Transactions on the Web (TWEB), 12(3):1–35, 2018.
- [28] Sudha Verma, Sarah Vieweg, William Corvey, Leysia Palen, James Martin, Martha Palmer, Aaron Schram, and Kenneth Anderson. Natural Language Processing to the rescue? Extracting "situational awareness" tweets during mass emergency. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM'11), volume 5, pages 385–392, 2011.
- [29] Reuters. Thailand ends search at site of skyscraper that collapsed during quake. https://www.reuters.com/world/asia-pacific/thailand-ends-search-site-skyscraper-that-collapsed-during-quake-2025-05-13/, May 2025. Accessed: 2026-03-02.