pith. sign in

arxiv: 2607.00849 · v1 · pith:AXY3GM2Anew · submitted 2026-07-01 · 💻 cs.CL

The Course of News Events: A Comparison of Bottom-Up and Top-Down Approaches for Collecting Text-Based Data about Disasters

Pith reviewed 2026-07-02 13:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords disaster news databottom-up clusteringtop-down inventory searchmedia coverage biaslandslide eventstext-based disaster monitoringsample selection methods
0
0 comments X

The pith

The choice between querying news databases with an existing disaster list or clustering articles by time and location changes which events enter the data sample.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two common ways to gather news articles on disasters. One starts from a known list of events and searches downward; the other lets the texts group themselves upward through patterns in dates and places. Using German coverage of landslides around the world, the authors show that the two routes produce noticeably different sets of events. A reader should care because the resulting sample then feeds studies of media bias, real-time disaster tracking, and efforts to improve official inventories.

Core claim

Using a dataset of German news about landslides worldwide, the authors compare top-down querying of news databases with the aid of an existing disaster inventory against bottom-up NLP clustering of news texts based on temporal and spatial features, and they document variations in event coverage that follow from the choice of method.

What carries the argument

The direct side-by-side comparison of top-down inventory-guided search versus bottom-up temporal-spatial text clustering on the same German landslide news corpus.

If this is right

  • Different selection methods produce different distributions of covered events.
  • Studies of inequality in media attention to disasters become sensitive to the upstream sampling choice.
  • Disaster monitoring and inventory enrichment projects inherit whatever coverage gaps the chosen method introduces.
  • Researchers must document and justify the selection route before interpreting patterns in the collected news.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether the same divergence appears for other hazard types such as floods or earthquakes.
  • One practical step would be to run both methods in parallel on new corpora and measure overlap before choosing one.
  • The observed differences may also affect how well news-derived data can be merged with satellite or official loss records.

Load-bearing premise

The bottom-up clustering method can be treated as producing a sample that is comparable in coverage and representativeness to the top-down inventory method.

What would settle it

A systematic count showing that one method consistently includes or excludes whole classes of landslide events (for example, small rural slides versus large urban ones) that the other method captures at different rates.

Figures

Figures reproduced from arXiv: 2607.00849 by Andreas Niekler, Brielen Madureira, Mariana Madruga de Brito.

Figure 1
Figure 1. Figure 1: Illustrative comparison of two approaches to [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the event matching procedures. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: depicts the confusion matrices with the overlap between aligned and queried events in each type of event source. The 851 success￾ful queries covered 779 unique news events. 89 queries matched news events midway through and 60 news events were queried by more than one EM-DAT entry. Such cases require post-processing decisions on whether two distinct news topics were inappropriately merged in the bottom-up a… view at source ↗
Figure 4
Figure 4. Figure 4: Number of detected news events by country. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Broad overview of the temporal dispersion of the onset days of EM-DAT entries (green circles) and initial [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Broad overview of the temporal dispersion of the onset days of EM-DAT entries (green circles) and initial [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Spatial distribution of EM-DAT entries referring to landslides. Germany (in black) was not analysed. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Spatial distribution of EM-DAT entries referring to landslides that could be queried (top-down) in the [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Spatial distribution of (bottom-up) news events referring to landslides. Germany (in black) was not [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
read the original abstract

News articles are an important source of information on disaster impacts and adaptation. A key methodological challenge in socio-environmental studies is how to select a representative data sample. Two approaches are common: querying news databases top-down with the aid of an existing disaster inventory or using NLP methods to cluster news texts bottom-up based on temporal and spatial features. Using a dataset of German news about landslides worldwide, we compare these approaches and discuss variations in event coverage. Such research design decision can influence the resulting news sample, affecting its use in studies of inequality in media coverage, disaster monitoring and inventory enrichment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper compares top-down (inventory-guided querying of news databases) and bottom-up (NLP-based clustering of news texts using temporal and spatial features) approaches for collecting data on disasters from German news articles about landslides. It finds variations in event coverage and concludes that the choice of approach can influence the news sample, impacting studies of media inequality, disaster monitoring, and inventory enrichment.

Significance. If the empirical comparison is robust, the result would be significant for methodological practice in computational social science and socio-environmental research, as it would demonstrate that data-collection decisions materially affect downstream analyses of coverage patterns. The work usefully flags implications for inequality studies and inventory enrichment.

major comments (2)
  1. [Abstract] Abstract: the description is high-level only and supplies no implementation details, metrics, statistical tests, or data-exclusion rules, preventing assessment of whether observed sample differences support the central claim.
  2. [Bottom-up method] Bottom-up method (wherever described): no external validation of clusters against known events (precision/recall, event-matching, or overlap with independent ground truth) is reported. This is load-bearing, because without it the claim that differences reflect genuine coverage variation rather than clustering artifacts cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the description is high-level only and supplies no implementation details, metrics, statistical tests, or data-exclusion rules, preventing assessment of whether observed sample differences support the central claim.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the empirical support for our claims. In the revised version, we will expand the abstract to include key implementation details for both the top-down and bottom-up methods, the primary comparison metrics (e.g., event overlap rates), any statistical tests performed, and explicit data-exclusion rules. This change will directly address the concern while maintaining the abstract's brevity. revision: yes

  2. Referee: [Bottom-up method] Bottom-up method (wherever described): no external validation of clusters against known events (precision/recall, event-matching, or overlap with independent ground truth) is reported. This is load-bearing, because without it the claim that differences reflect genuine coverage variation rather than clustering artifacts cannot be evaluated.

    Authors: We acknowledge that the manuscript does not report formal external validation metrics such as precision/recall against an independent ground truth for the bottom-up clusters. The comparison with the top-down inventory serves as an internal cross-check, but we agree this does not fully substitute for explicit validation. In revision, we will add a dedicated subsection describing any available overlap-based matching with the inventory events, manual inspection procedures used to assess cluster quality, and a limitations discussion on potential clustering artifacts. If additional independent ground truth becomes available, we will incorporate quantitative metrics; otherwise, we will clearly flag the reliance on comparative evidence. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical methodological comparison with no fitted predictions or self-referential derivations

full rationale

The paper performs an empirical side-by-side comparison of two news-sampling strategies (top-down inventory queries vs. bottom-up NLP clustering on temporal/spatial features) using a German landslide news corpus. No equations, parameter fits, or 'predictions' are defined; the central claim is simply that the two methods produce measurably different samples. No self-citations are invoked to justify uniqueness or to close any derivation loop, and the work does not rename known results or smuggle ansatzes. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical methodological comparison paper; no free parameters, mathematical axioms, or invented entities are introduced or required by the central claim.

pith-pipeline@v0.9.1-grok · 5634 in / 984 out tokens · 27351 ms · 2026-07-02T13:25:49.645368+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 17 canonical work pages

  1. [1]

    and Speybroeck, Niko , year =

    Delforge, Damien and Wathelet, Valentin and Below, Regina and Sofia, Cinzia Lanfredi and Tonnelier, Margo and van Loenhout, Joris A.F. and Speybroeck, Niko , year =. EM-DAT: the Emergency Events Database , volume =. doi:10.1016/j.ijdrr.2025.105509 , journal =

  2. [2]

    EGUsphere , VOLUME =

    Valkenborg, Bram and Dewitte, Olivier and Smets, Benoît , TITLE =. EGUsphere , VOLUME =. 2026 , PAGES =

  3. [3]

    and Malamud, Bruce D

    Taylor, Faith E. and Malamud, Bruce D. and Freeborough, Katy and Demeritt, David , year =. Enriching Great Britain’s National Landslide Database by searching newspaper archives , volume =. doi:10.1016/j.geomorph.2015.05.019 , journal =

  4. [4]

    An automated approach for developing geohazard inventories using news: integrating natural language processing (NLP), machine learning, and mapping , volume =

    Avcıoğlu, Aydoğan and Demir, Og\". An automated approach for developing geohazard inventories using news: integrating natural language processing (NLP), machine learning, and mapping , volume =. Natural Hazards and Earth System Sciences , publisher =. 2025 , month =. doi:10.5194/nhess-25-2421-2025 , number =

  5. [5]

    Environmental Research Letters , author =

    Flash droughts and their impacts—using newspaper articles to assess the perceived consequences of rapidly emerging droughts , volume =. Environmental Research Letters , author =. 2024 , pages =. doi:10.1088/1748-9326/ad58fa , number =

  6. [6]

    Proceedings of the International AAAI Conference on Web and Social Media , author=

    Identifying and Investigating Global News Coverage of Critical Events Such as Disasters and Terrorist Attacks , volume=. Proceedings of the International AAAI Conference on Web and Social Media , author=. 2025 , month=. doi:10.1609/icwsm.v19i1.35818 , number=

  7. [7]

    The untold story of missing data in disaster research: a systematic review of the empirical literature utilising the Emergency Events Database (EM-DAT) , volume =

    Jones, Rebecca Louise and Kharb, Aditi and Tubeuf, Sandy , year =. The untold story of missing data in disaster research: a systematic review of the empirical literature utilising the Emergency Events Database (EM-DAT) , volume =. Environmental Research Letters , publisher =. doi:10.1088/1748-9326/acfd42 , number =

  8. [8]

    Human and economic impacts of natural disasters: can we trust the global data? , volume =

    Jones, Rebecca Louise and Guha-Sapir, Debarati and Tubeuf, Sandy , year =. Human and economic impacts of natural disasters: can we trust the global data? , volume =. Scientific Data , publisher =. doi:10.1038/s41597-022-01667-x , number =

  9. [9]

    The AVI project: A bibliographical and archive inventory of landslides and floods in Italy , volume =

    Guzzetti, Fausto and Cardinali, Mauro and Reichenbach, Paola , year =. The AVI project: A bibliographical and archive inventory of landslides and floods in Italy , volume =. Environmental Management , publisher =. doi:10.1007/bf02400865 , number =

  10. [10]

    Llasat, M. C. and Llasat-Botija, M. and López, L. , year =. A press database on natural risks and its application in the study of floods in Northeastern Spain , volume =. Natural Hazards and Earth System Sciences , publisher =. doi:10.5194/nhess-9-2049-2009 , number =

  11. [11]

    and de Brito, Mariana Madruga , year =

    Sodoge, Jan and Kuhlicke, Christian and Mahecha, Miguel D. and de Brito, Mariana Madruga , year =. Text mining uncovers the unique dynamics of socio-economic impacts of the 2018–2022 multi-year drought in Germany , volume =. Natural Hazards and Earth System Sciences , publisher =. doi:10.5194/nhess-24-1757-2024 , number =

  12. [12]

    2026 , howpublished=

    Climate Change and Migration in Central America: Evidence from New Environmental Event Data , author=. 2026 , howpublished=

  13. [13]

    Wikimpacts 1.0: A new global climate impact database based on automated information extraction from Wikipedia , url =

    Li, Ni and Thiery, Wim and Zahra, Shorouq and Madruga de Brito, Mariana and Worou, Koffi and Kurfalı, Murathan and Lampe, Seppe and Muñoz, Paul and Flynn, Clare and Trigoso, Camila and Nivre, Joakim and Zscheischler, Jakob and Messori, Gabriele , year =. Wikimpacts 1.0: A new global climate impact database based on automated information extraction from Wi...

  14. [14]

    2026 , eprint=

    How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World , author=. 2026 , eprint=

  15. [15]

    2026 , eprint=

    Assessing socio-economic climate impacts from text data , author=. 2026 , eprint=

  16. [16]

    Purves , title =

    Inhye Kong and Ross S. Purves , title =. Annals of the American Association of Geographers , volume =. 2026 , publisher =. doi:10.1080/24694452.2025.2564220 , URL =

  17. [17]

    The Sky Is Falling: Predictors of News Coverage of Natural Disasters Worldwide , volume =

    Yan, Yan and Bissell, Kim , year =. The Sky Is Falling: Predictors of News Coverage of Natural Disasters Worldwide , volume =. Communication Research , publisher =. doi:10.1177/0093650215573861 , number =

  18. [18]

    Meehl and Thomas Karl and David R

    Gerald A. Meehl and Thomas Karl and David R. Easterling and Stanley Changnon and Roger Pielke and David Changnon and Jenni Evans and Pavel Ya. Groisman and Thomas R. Knutson and Kenneth E. Kunkel and Linda O. Mearns and Camille Parmesan and Roger Pulwarty and Terry Root and Richard T. Sylves and Peter Whetton and Francis Zwiers. An Introduction to Trends ...

  19. [19]

    and Chang, Heejun and Chester, Mikhail V

    McPhillips, Lauren E. and Chang, Heejun and Chester, Mikhail V. and Depietri, Yaella and Friedman, Erin and Grimm, Nancy B. and Kominoski, John S. and McPhearson, Timon and Méndez-Lázaro, Pablo and Rosi, Emma J. and Shafiei Shiva, Javad , title =. Earth's Future , volume =. doi:https://doi.org/10.1002/2017EF000686 , url =. https://agupubs.onlinelibrary.wi...

  20. [20]

    2006 , publisher=

    Extreme events in nature and society , author=. 2006 , publisher=

  21. [21]

    and Hornsey, Matthew J

    Chapman, Cassandra M. and Hornsey, Matthew J. and Fielding, Kelly S. and Gulliver, Robyn , year =. International media coverage promotes donations to a climate disaster , volume =. Disasters , publisher =. doi:10.1111/disa.12557 , number =

  22. [22]

    Routledge handbook of public policy , pages=

    Mass media and policy-making , author=. Routledge handbook of public policy , pages=. 2012 , publisher=

  23. [23]

    Handbuch Umweltsoziologie , pages=

    Computational Social Sciences in der Umweltsoziologie , author=. Handbuch Umweltsoziologie , pages=. 2023 , publisher=

  24. [24]

    Real-Time News Event Extraction for Global Crisis Monitoring , ISBN =

    Tanev, Hristo and Piskorski, Jakub and Atkinson, Martin , pages =. Real-Time News Event Extraction for Global Crisis Monitoring , ISBN =. doi:10.1007/978-3-540-69858-6_21 , booktitle =