pith. machine review for the scientific record.

arxiv: 2605.00829 · v1 · submitted 2026-03-17 · 💻 cs.CY

Recognition: 1 theorem link · Lean Theorem

LLM-based uncertainty assessment of social media situational signals for crisis reporting

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:41 UTC · model grok-4.3

classification 💻 cs.CY
keywords uncertainty assessment · large language models · social media · crisis reporting · situational awareness · proxy data · earthquakes

The pith

An uncertainty layer lets LLMs judge how plausible each social media disaster claim is when checked against external proxy data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that adds explicit uncertainty scoring to LLM analysis of social media during crises. Posts are first sorted into standard situational awareness categories. An additional assessment step then asks the model whether each claim is likely to match real conditions, given outside data such as USGS earthquake impact summaries, and to state its own confidence in that judgment. The resulting reports therefore convey both the reported information and a measure of its reliability. The authors apply the approach to more than 200,000 earthquake-related posts and argue that the uncertainty signals help human responders prioritize messages when time is short.

Core claim

We propose an uncertainty-aware framework for automated situational awareness reporting that explicitly accounts for the plausibility of social media claims. After classifying posts according to an established schema, we add an uncertainty assessment layer that evaluates whether individual situational claims plausibly reflect real-world conditions when conditioned on external proxy data, while explicitly eliciting the model's confidence in this judgment. These assessments are then used to generate crisis reports that communicate not only what is being reported but how certain those reports are.

What carries the argument

The uncertainty assessment layer, which conditions an LLM's plausibility judgment of each social media claim on external proxy data such as USGS PAGER summaries and elicits an explicit confidence score for that judgment.
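A plausibility-plus-confidence judgment elicited this way only helps if the pipeline survives malformed model output. A minimal validation sketch follows; the JSON shape (`plausible`, `confidence`) mirrors the prompt boxes in the paper's supplementary material, while the clamping and fallback rules are our own assumptions.

```python
import json

def parse_assessment(raw: str) -> dict:
    """Validate one elicited plausibility judgment.

    Expects JSON like {"plausible": true, "confidence": 0.85}; malformed
    output falls back to maximum uncertainty rather than crashing the run.
    """
    try:
        data = json.loads(raw)
        plausible = bool(data["plausible"])
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return {"plausible": False, "confidence": 0.0, "valid": False}
    # Clamp confidence into [0, 1] in case the model over- or under-shoots.
    confidence = min(1.0, max(0.0, confidence))
    return {"plausible": plausible, "confidence": confidence, "valid": True}
```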

If this is right

  • Crisis reports can include explicit uncertainty levels that let responders focus first on the most plausible signals.
  • External proxy data sources can be systematically folded into LLM pipelines for situational awareness.
  • The same classification-plus-uncertainty pipeline scales to hundreds of thousands of posts without requiring equal trust in every message.
  • Human crisis communicators receive an additional signal for prioritizing information under time pressure.
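The prioritization in the first bullet could be as simple as collapsing each verdict-confidence pair into one ranking score. The mapping below is a hypothetical choice, not the paper's:

```python
def priority_score(assessment: dict) -> float:
    """Map a (plausible, confidence) pair onto a single ranking score in [0, 1].

    A confident 'plausible' ranks highest, a confident 'implausible' lowest,
    and low-confidence judgments of either kind sit near the middle.
    """
    c = assessment["confidence"]
    return 0.5 + 0.5 * c if assessment["plausible"] else 0.5 - 0.5 * c

def triage(assessed_posts: list[dict]) -> list[dict]:
    # Most plausible, most confidently judged signals first.
    return sorted(assessed_posts, key=priority_score, reverse=True)
```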

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to other disaster types by swapping in appropriate proxy data such as flood maps or wildfire perimeters.
  • Uncertainty scores might serve as weights when fusing social media with traditional sensor feeds.
  • Repeated application across events could reveal systematic biases in which kinds of claims LLMs tend to over- or under-estimate.
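The second extension, uncertainty scores as fusion weights, might look like this in its simplest form; the linear weighting scheme is purely illustrative:

```python
def fuse(sensor_value: float, social_value: float, social_confidence: float,
         sensor_weight: float = 1.0) -> float:
    """Confidence-weighted average of a sensor reading and a social media signal.

    social_confidence is the uncertainty layer's score in [0, 1]: a low score
    lets the sensor feed dominate, a high score lets the crowd signal pull.
    """
    return (sensor_weight * sensor_value + social_confidence * social_value) / (
        sensor_weight + social_confidence
    )
```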

Load-bearing premise

Large language models can reliably decide whether a social media claim about a disaster matches real-world conditions once they are given external proxy summaries.

What would settle it

Direct comparison of the framework's uncertainty scores against independent human expert ratings or post-event verified ground truth on the same collection of social media posts.
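Such a comparison could start with something as plain as agreement between model verdicts and expert labels, split by confidence bin: if the elicited confidence means anything, agreement should climb with it. The binning below is an assumed setup, not an evaluation the paper reports.

```python
def agreement_by_confidence(assessments, expert_labels, n_bins=2):
    """Rate at which model plausibility verdicts match expert verdicts,
    reported per confidence bin (None for empty bins)."""
    bins = [[] for _ in range(n_bins)]
    for a, truth in zip(assessments, expert_labels):
        idx = min(int(a["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(a["plausible"] == truth)
    return [sum(b) / len(b) if b else None for b in bins]
```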

read the original abstract

Social media has become a critical source of situational awareness during disasters, providing real-time insights into evolving impacts and emerging needs. To support crisis response at scale, recent work has increasingly leveraged large language models (LLMs) to automatically classify and summarize situational information from social media streams. However, existing approaches implicitly assume that extracted situational claims are equally plausible, despite information quality varying substantially as a crisis unfolds. In this work, we propose an uncertainty-aware framework for automated situational awareness reporting that explicitly accounts for the plausibility of social media claims. First, we classify social media posts according to an established situational awareness schema. Second, we introduce an uncertainty assessment layer that evaluates whether individual situational claims plausibly reflect real-world conditions when conditioned on external proxy data, while explicitly eliciting the model's confidence in this judgment. Third, we use these uncertainty assessments to generate crisis reports that communicate not only what is being reported, but how certain those reports are. We apply this framework to over 200,000 earthquake-related Twitter/X posts, using impact summaries from the USGS PAGER as a representative external proxy. We argue that explicitly representing uncertainty supports human crisis communicators in prioritizing information under time pressure, and provides a framework for integrating external proxy data into LLM-based situational awareness pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a three-step LLM-based framework for uncertainty-aware situational awareness from social media during crises. Posts are first classified according to an established schema; an uncertainty assessment layer then evaluates the plausibility of individual claims when conditioned on external proxy data such as USGS PAGER summaries while eliciting the model's confidence; finally, these assessments are used to generate reports that communicate both the situational information and its assessed certainty. The framework is applied to over 200,000 earthquake-related Twitter/X posts.

Significance. If the LLM-generated uncertainty scores prove well-calibrated against real-world outcomes, the work would provide a practical method for prioritizing information under time pressure and a reusable template for incorporating external proxy data into LLM pipelines for crisis informatics.

major comments (2)
  1. [Abstract / Proposed Framework] The manuscript describes the three-step pipeline (classification, uncertainty assessment conditioned on PAGER, and report generation) but supplies no quantitative results, validation metrics, calibration plots, error analysis, or ablation studies. Without measuring whether low-uncertainty claims align with verified impacts at higher rates than high-uncertainty ones, the central claim that the uncertainty layer supports prioritization remains an untested modeling choice.
  2. [Uncertainty Assessment Layer] The uncertainty assessment step relies on LLM binary plausibility judgments plus elicited confidence when conditioned on USGS PAGER summaries, yet no external validation against ground truth (post-event verified damage reports, official casualty figures, or expert annotations on the same posts) is reported. This leaves open whether the confidence scores are calibrated or merely reflect the model's internal priors.
minor comments (1)
  1. The abstract states the framework is applied to over 200,000 posts but does not report the number of generated reports, any example outputs, or the distribution of uncertainty scores.
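The calibration check asked for in major comment 2 is commonly quantified as expected calibration error against verified outcomes. A pure-Python sketch, with inputs assumed rather than drawn from the paper:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - mean confidence| over confidence bins,
    weighted by bin size. Lower means better calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```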

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We agree that quantitative validation is crucial for substantiating the effectiveness of the uncertainty assessment layer and will revise the paper accordingly to include relevant metrics and analyses.

read point-by-point responses
  1. Referee: [Abstract / Proposed Framework] The manuscript describes the three-step pipeline (classification, uncertainty assessment conditioned on PAGER, and report generation) but supplies no quantitative results, validation metrics, calibration plots, error analysis, or ablation studies. Without measuring whether low-uncertainty claims align with verified impacts at higher rates than high-uncertainty ones, the central claim that the uncertainty layer supports prioritization remains an untested modeling choice.

    Authors: We acknowledge the validity of this observation. The current version of the manuscript focuses on introducing the framework and demonstrating its application on a large corpus of posts, but does not include the quantitative evaluations mentioned. In the revised manuscript, we will add validation metrics, including calibration analysis of the uncertainty scores against available ground truth data from post-event reports, and ablation studies to assess the contribution of the uncertainty layer. revision: yes

  2. Referee: [Uncertainty Assessment Layer] The uncertainty assessment step relies on LLM binary plausibility judgments plus elicited confidence when conditioned on USGS PAGER summaries, yet no external validation against ground truth (post-event verified damage reports, official casualty figures, or expert annotations on the same posts) is reported. This leaves open whether the confidence scores are calibrated or merely reflect the model's internal priors.

    Authors: We agree that external validation is necessary to confirm the calibration of the elicited confidence scores. We will incorporate comparisons with verified impact data from USGS and other sources in the revision. This will include analyzing the correlation between low-uncertainty claims and actual reported damages. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework relies on independent external proxy data

full rationale

The paper's core pipeline classifies social media posts, then prompts an LLM for plausibility judgments explicitly conditioned on external USGS PAGER summaries as proxy data, and finally generates uncertainty-aware reports. This chain does not reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The uncertainty layer is constructed by feeding independent external summaries into the LLM rather than deriving it from the posts alone or from prior author results. No equations or ansatzes are shown to loop back to the input claims by construction. The approach is anchored to external benchmarks and receives a normal non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger reflects the high-level claims. No explicit free parameters or invented entities are named. The central assumption is treated as a domain assumption rather than derived.

axioms (1)
  • domain assumption LLMs can reliably evaluate the plausibility of social media claims when conditioned on external proxy data such as USGS PAGER summaries
    The framework's second step rests on this capability without reported validation in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1280 out tokens · 30001 ms · 2026-05-15T09:41:41.980966+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    Harnessing prompt-based Large Language Models for disaster monitoring and automated reporting from social media feedback. Online Social Networks and Media, 45:100295, 2025

    Riccardo Cantini, Cristian Cosentino, Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. Harnessing prompt-based Large Language Models for disaster monitoring and automated reporting from social media feedback. Online Social Networks and Media, 45:100295, 2025

  2. [2]

    Multistakeholder disaster insights from social media using Large Language Models. IEEE Transactions on Computational Social Systems, 2025

    Loris Belcastro, Cristian Cosentino, Fabrizio Marozzo, Merve Gündüz-Cüre, and Şule Öztürk-Birim. Multistakeholder disaster insights from social media using Large Language Models. IEEE Transactions on Computational Social Systems, 2025

  3. [3]

    Minsun Shim and Heui Sug Jo. What quality factors matter in enhancing the perceived benefits of online health information sites? Application of the updated DeLone and McLean information systems success model. International Journal of Medical Informatics, 137:104093, 2020

  4. [4]

    Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR), 47(4):1–38, 2015

    Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR), 47(4):1–38, 2015

  5. [5]

    Evaluating prompt engineering techniques for accuracy and confidence elicitation in medical LLMs

    Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoubfar, Seyed Amir Ahmad Safavi-Naini, and Ali Soroush. Evaluating prompt engineering techniques for accuracy and confidence elicitation in medical LLMs. In Proceedings of the International Workshop on Explainable, Trustworthy, and Responsible Artificial Intelligence and Multi-Agent Systems, pages 67–84. S...

  6. [6]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP...

  7. [7]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063, 2023

  8. [8]

    Exploring the potential of social media crowdsourcing for post-earthquake damage assessment. International Journal of Disaster Risk Reduction, 98:104062, 2023

    Lingyao Li, Michelle Bensi, and Gregory Baecher. Exploring the potential of social media crowdsourcing for post-earthquake damage assessment. International Journal of Disaster Risk Reduction, 98:104062, 2023

  9. [9]

    Social media crowdsourcing for rapid damage assessment following a sudden-onset natural hazard event. International Journal of Information Management, 60:102378, 2021

    Lingyao Li, Michelle Bensi, Qingbin Cui, Gregory B Baecher, and You Huang. Social media crowdsourcing for rapid damage assessment following a sudden-onset natural hazard event. International Journal of Information Management, 60:102378, 2021

  10. [10]

    Microblogging during two natural hazards events: what Twitter may contribute to situational awareness

    Sarah Vieweg, Amanda L Hughes, Kate Starbird, and Leysia Palen. Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’10), pages 1079–1088, 2010

  11. [11]

    Zheye Wang and Xinyue Ye. Space, time, and situational awareness in natural hazards: A case study of hurricane sandy with social media data. Cartography and Geographic Information Science, 46(4):334–346, 2019

  12. [12]

    SensePlace2: GeoTwitter analytics support for situational awareness

    Alan M MacEachren, Anuj Jaiswal, Anthony C Robinson, Scott Pezanowski, Alexander Savelyev, Prasenjit Mitra, Xiao Zhang, and Justine Blanford. SensePlace2: GeoTwitter analytics support for situational awareness. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 181–190. IEEE, 2011

  13. [13]

    Qunying Huang and Yu Xiao. Geographic situational awareness: Mining tweets for disaster preparedness, emergency response, impact, and recovery. ISPRS International Journal of Geo-Information, 4(3):1549–1568, 2015

  14. [14]

    Extracting information nuggets from disaster-related messages in social media

    Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Extracting information nuggets from disaster-related messages in social media. In Proceedings of the International Conference on Information Systems for Crisis Response and Management (ISCRAM’13), pages 791–800, 2013

  15. [15]

    From situational awareness to actionability: Towards improving the utility of social media data for crisis response

    Himanshu Zade, Kushal Shah, Vaibhavi Rangarajan, Priyanka Kshirsagar, Muhammad Imran, and Kate Starbird. From situational awareness to actionability: Towards improving the utility of social media data for crisis response. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–18, 2018

  16. [16]

    Crisis2sum: An exploratory study on disaster summarization from multiple streams

    Philipp Seeberger and Korbinian Riedhammer. Crisis2sum: An exploratory study on disaster summarization from multiple streams. In International Conference on Information Systems for Crisis Response and Management (ISCRAM’24), 2024

  17. [17]

    Crisitext: A dataset of warning messages for LLM training in emergency communication. arXiv preprint arXiv:2510.09243, 2025

    Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, and Marco Guerini. Crisitext: A dataset of warning messages for LLM training in emergency communication. arXiv preprint arXiv:2510.09243, 2025

  18. [18]

    Calibrating long-form generations from Large Language Models

    Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra. Calibrating long-form generations from Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP’24), pages 13441–13460. ACL, 2024

  19. [19]

    Images Amplify Misinformation Sharing in Vision-Language Models

    Alice Plebe, Timothy Douglas, Diana Riazi, and R Maria del Rio-Chanona. I’ll believe it when I see it: Images increase misinformation sharing in vision-language models. arXiv preprint arXiv:2505.13302, 2025

  20. [20]

    ChatFive: Enhancing user experience in Likert scale personality test through interactive conversation with LLM agents

    Jungjae Lee, Yubin Choi, Minhyuk Song, and Sanghyun Park. ChatFive: Enhancing user experience in Likert scale personality test through interactive conversation with LLM agents. In Proceedings of the 6th ACM Conference on Conversational User Interfaces (CUI’24), volume 6, pages 1–8, New York, NY, USA, 2024. ACM

  21. [21]

    Growth stalls at Elon Musk’s X

    Clara Murray and Cristina Criddle. Growth stalls at Elon Musk’s X. https://www.ft.com/content/1829abb6-d8d0-4d67-9b1b-628f583b3291, 2024. Accessed: 2026-01-08

  22. [22]

    Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets, 2026

    Roben Delos Reyes, Timothy Douglas, and Asanobu Kitamoto. Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets, 2026

  23. [23]

    Tracking the temporal evolution of online social relationships in times of crisis. Journal of Social Computing, 6:1–24, 2026

    Timothy Douglas, Licia Capra, and Mirco Musolesi. Tracking the temporal evolution of online social relationships in times of crisis. Journal of Social Computing, 6:1–24, 2026

  24. [24]

    Ten social dimensions of conversations and relationships

    Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. Ten social dimensions of conversations and relationships. In Proceedings of The International World Wide Web Conference (WWW’20), pages 1514–1525, 2020

  25. [25]

    Automated virtual earthquake reconnaissance reporting using natural language processing. Natural Hazards Review, 26(3):04025018, 2025

    Guanren Zhou and Khalid M Mosalam. Automated virtual earthquake reconnaissance reporting using natural language processing. Natural Hazards Review, 26(3):04025018, 2025

  26. [26]

    Automatic identification of eyewitness messages on Twitter during disasters. Information Processing & Management, 57(1):102107, 2020

    Kiran Zahra, Muhammad Imran, and Frank O Ostermann. Automatic identification of eyewitness messages on Twitter during disasters. Information Processing & Management, 57(1):102107, 2020

  27. [27]

    Extracting and summarizing situational information from the Twitter social media during disasters. ACM Transactions on the Web (TWEB), 12(3):1–35, 2018

    Koustav Rudra, Niloy Ganguly, Pawan Goyal, and Saptarshi Ghosh. Extracting and summarizing situational information from the Twitter social media during disasters. ACM Transactions on the Web (TWEB), 12(3):1–35, 2018

  28. [28]

    Natural Language Processing to the rescue? Extracting “situational awareness” tweets during mass emergency

    Sudha Verma, Sarah Vieweg, William Corvey, Leysia Palen, James Martin, Martha Palmer, Aaron Schram, and Kenneth Anderson. Natural Language Processing to the rescue? Extracting “situational awareness” tweets during mass emergency. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM’11), volume 5, pages 385–392, 2011

  29. [29]

    Thailand ends search at site of skyscraper that collapsed during quake

    Reuters. Thailand ends search at site of skyscraper that collapsed during quake. https://www.reuters.com/world/asia-pacific/thailand-ends-search-site-skyscraper-that-collapsed-during-quake-2025-05-13/, May 2025. Accessed: 2026-03-02. Supplementary Material, Box 1: Classification Prompt. System role. You are an information extracti...

  30. [30]

    Do not use markdown code blocks (e.g., ```json)

  31. [31]

    Do not add explanations, headers, or separators (e.g., “—”)

  32. [32]

    The JSON must start with [ and end with ]

  33. [33]

    Do not wrap the JSON in any other text or formatting

  34. [34]

    index": <original_index_number>,

    Do not include example output or placeholder text. Supplementary Material, Box 2: Uncertainty Assessment Prompt. System role. You are an expert crisis analysis assistant. Task. Evaluate whether social media situational awareness (SA) signals are plausibly representative of real-world conditions during a crisis event, given automatically generated im...