
arxiv: 2506.05611 · v2 · submitted 2025-06-05 · 💻 cs.CR

How Tough Is Location Anonymization? Re-identifying 100K Real-User Trajectories in Japan

Pith reviewed 2026-05-19 10:21 UTC · model grok-4.3

classification 💻 cs.CR
keywords mobility trajectories · location anonymization · re-identification attack · privacy metrics · differential privacy · trajectory privacy · Japan mobility data

The pith

Re-identification succeeds on anonymized mobility data from 100,000 users in Japan by recovering hidden locations and times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether standard anonymization techniques adequately protect large sets of movement records. It applies re-identification methods to a released dataset of 100,000 trajectories in Japan and finds that spatial density patterns and daily activity rhythms allow recovery of the underlying map and calendar. The analysis then uses several metrics to show that individual users can often be singled out from just a few location points or visits to unique places. Tests of common privacy tools reveal that settings strong enough to block re-identification also make the data useless for most practical purposes. Readers should care because governments and companies continue to publish such data for planning and research under the assumption it is safe.

Core claim

The sanitization applied to the YJMob100K dataset leaves enough spatial and temporal structure to recover both the real-world geographic frame and the actual calendar timeline by exploiting density signatures, urban correlations, and temporal activity profiles. On top of this reconstruction, metrics capturing spatio-temporal k-anonymity, point unicity, home-work uniqueness, and exposure to sensitive locations reveal extensive re-identification surfaces. Representative sanitization strategies like geo-indistinguishability and local differential privacy either destroy utility at strong privacy levels or leave structural leakage intact at utility-preserving levels.
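Of the metrics named above, home-work uniqueness is the simplest to illustrate. The following is a minimal sketch, not the paper's implementation. It assumes records of the form (uid, d, t, x, y) with t a 30-minute slot index from 0 to 47; the night and working-hour windows and all function names are illustrative choices, and weekday filtering is ignored.

```python
from collections import Counter, defaultdict

# Assumed time windows in 30-minute slots (illustrative, not from the paper).
NIGHT = set(range(0, 12)) | set(range(44, 48))   # 22:00-06:00
WORK = set(range(18, 36))                        # 09:00-18:00

def anchors(records):
    """Map each user to (home cell, work cell): the most visited cell per window."""
    night, day = defaultdict(Counter), defaultdict(Counter)
    for uid, d, t, x, y in records:
        if t in NIGHT:
            night[uid][(x, y)] += 1
        elif t in WORK:
            day[uid][(x, y)] += 1
    result = {}
    for uid in night.keys() & day.keys():
        home = night[uid].most_common(1)[0][0]
        work = day[uid].most_common(1)[0][0]
        result[uid] = (home, work)
    return result

def unique_fraction(anchor_map):
    """Fraction of users whose (home, work) pair is shared with nobody else."""
    counts = Counter(anchor_map.values())
    unique = sum(1 for pair in anchor_map.values() if counts[pair] == 1)
    return unique / len(anchor_map) if anchor_map else 0.0

# Toy data: users 1 and 2 share both anchors, user 3 has a different home cell.
demo = [
    (1, 0, 2, 10, 10), (1, 0, 20, 50, 50),
    (2, 0, 3, 10, 10), (2, 0, 21, 50, 50),
    (3, 0, 4, 11, 10), (3, 0, 22, 50, 50),
]
print(round(unique_fraction(anchors(demo)), 3))  # prints 0.333
```

On real data the same count can be repeated at coarser cell sizes to see how quickly the anonymity set around each anchor pair grows.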

What carries the argument

Reconstruction of geographic and temporal context from density signatures, urban correlations, and temporal activity profiles, followed by trajectory-level privacy metrics such as k-anonymity and anchor uniqueness.
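The spatial half of this reconstruction can be sketched as a search over candidate cities and grid transformations, scoring each pair by Spearman correlation against a reference population grid. The sketch below is a hedged illustration in plain Python: the eight rotations and reflections, the toy 3x3 grids, and every function name are assumptions made here, not details confirmed by the paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def rot90(g):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def flip(g):
    """Mirror a grid left to right."""
    return [row[::-1] for row in g]

def transformations(g):
    """Yield (name, grid) for the 8 rotations and reflections of a square grid."""
    for mirrored in (False, True):
        cur = flip(g) if mirrored else g
        for quarter_turns in range(4):
            yield f"flip={mirrored},rot={90 * quarter_turns}", cur
            cur = rot90(cur)

def best_match(anon_grid, city_grids):
    """Return the (city, transformation) with the highest correlation, plus all scores."""
    scores = {}
    for city, ref in city_grids.items():
        flat_ref = [v for row in ref for v in row]
        for name, g in transformations(anon_grid):
            scores[(city, name)] = spearman([v for row in g for v in row], flat_ref)
    return max(scores, key=scores.get), scores

# Toy check: the "anonymized" grid is city_a rotated once and rescaled.
city_a = [[9, 7, 1], [6, 5, 2], [3, 0, 4]]
city_b = [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
anon = [[10 * v for v in row] for row in rot90(city_a)]
best, scores = best_match(anon, {"city_a": city_a, "city_b": city_b})
print(best)  # prints ('city_a', 'flip=False,rot=270')
```

Rank correlation is used because the density counts in a released grid and the reference population counts need not be on the same scale; only their ordering across cells has to agree.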

If this is right

  • Strong privacy parameters in sanitization methods destroy the data's utility for downstream analysis.
  • Utility-preserving parameter settings leave the structural leakage largely intact.
  • A small number of observations or visits to sensitive venues often suffices to uniquely identify users.
  • Current sanitization techniques prove insufficient for protecting large-scale mobility data.
  • Trajectory-aware privacy mechanisms and stronger publication standards are needed.
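The claim that a few observations can single out a user is usually measured as point unicity, in the style of "Unique in the Crowd" [2]. Below is a minimal sketch under stated assumptions: records are (uid, d, t, x, y) tuples, a point is the full (d, t, x, y) combination, and the function name, trial count, and brute-force matching are illustrative rather than the paper's procedure.

```python
import random
from collections import defaultdict

def unicity(records, p, trials=20, seed=0):
    """Estimate the fraction of random p-point samples that match exactly one user."""
    rng = random.Random(seed)
    traj = defaultdict(set)
    for uid, d, t, x, y in records:
        traj[uid].add((d, t, x, y))
    hits = total = 0
    for uid, points in traj.items():
        if len(points) < p:
            continue  # not enough observations to draw p points
        ordered = sorted(points)
        for _ in range(trials):
            sample = set(rng.sample(ordered, p))
            # Count every trajectory (including the owner's) containing the sample.
            matches = sum(1 for other in traj.values() if sample <= other)
            hits += matches == 1
            total += 1
    return hits / total if total else 0.0

# Toy data: all users share one point and each has one distinguishing point.
demo = [
    (1, 0, 0, 5, 5), (1, 0, 1, 6, 5),
    (2, 0, 0, 5, 5), (2, 0, 1, 7, 5),
    (3, 0, 0, 5, 5), (3, 0, 1, 8, 5),
]
print(unicity(demo, p=2))  # prints 1.0
```

The matching loop is quadratic in the number of users, so at the scale of 100,000 trajectories an inverted index from points to users would be needed; the sketch only shows the definition.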

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar re-identification risks likely apply to mobility datasets released in other countries with comparable urban structures.
  • Organizations publishing trajectory data may need to adopt more sophisticated de-identification that accounts for spatio-temporal correlations.
  • Policy makers could develop guidelines requiring validation against re-identification attacks before release.
  • Researchers might explore hybrid methods combining differential privacy with trajectory-specific noise.

Load-bearing premise

The anonymized data still contains enough recognizable patterns from movement density and timing to map it back to real places and dates.
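The temporal half of this premise can be checked with a small calendar search. The sketch below takes the two-class day labels listed for days 0 to 27 in the Figure 4 excerpt on this page, treats class A as rest days, and looks for start dates whose weekends and public holidays reproduce that pattern. The search window, the restriction to the first 28 days, and the function names are assumptions made for illustration; the paper's own procedure may use more constraints.

```python
from datetime import date, timedelta

# Day classes for days 0-27 as listed in the Figure 4 excerpt ("A" = rest-day profile).
CLASSES = "AABBBBAAABBBBA" + "ABBBBBAABBBBBA"

# Japanese public holidays that fell on weekdays, mid-September to early November 2019.
HOLIDAYS = {
    date(2019, 9, 16), date(2019, 9, 23), date(2019, 10, 14),
    date(2019, 10, 22), date(2019, 11, 4),
}

def is_rest_day(day):
    """Saturday, Sunday, or a listed public holiday."""
    return day.weekday() >= 5 or day in HOLIDAYS

def consistent_starts(classes, first, last):
    """Return every candidate Day 0 in [first, last] that reproduces the class pattern."""
    matches = []
    start = first
    while start <= last:
        if all(is_rest_day(start + timedelta(days=i)) == (label == "A")
               for i, label in enumerate(classes)):
            matches.append(start)
        start += timedelta(days=1)
    return matches

print(consistent_starts(CLASSES, date(2019, 9, 1), date(2019, 10, 15)))
# prints [datetime.date(2019, 9, 15)]
```

Within this window only one start date survives, which agrees with the 15 September 2019 start stated in the Figure 5 excerpt. Two Monday holidays exactly one week apart are what make the pattern so distinctive.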

What would settle it

The claim would be falsified if the geographic frame reconstructed from density signatures failed to align with actual locations in Japan when checked against public geographic data.

Figures

Figures reproduced from arXiv: 2506.05611 by Abhishek Kumar Mishra, Heber H. Arcolezi, Mathieu Cunche.

Figure 1: Visual comparison of population densities in original vs. re-identified version. (a) Grid-based trajectories in the released Yjmob100k dataset exhibit structured mobility aligned with urban infrastructure. (b) Actual map of the city of Nagoya. (c) The same dataset after re-identification, revealing the city of Nagoya, as described in Algorithm 1.
Surrounding text: This work builds on that foundation by showing that naive ano…
Figure 2: Spearman correlation coefficients between the anonymized dataset and population data across 10 major Japanese cities. Nagoya clearly emerges as the most probable location.
Surrounding text: a transformation set T. The goal is to identify the most likely city of origin C* and the transformation T* that maximize spatial similarity, with all intermediate correlation values stored in a dictionary S. We define a set of geometr…
Figure 3: Overall activity.
Surrounding text: displacement ranges. By inverting the transformations, we recover structural alignment with known geographies.
4.3 Fine-grained spatial re-identification. To refine the spatial alignment after identifying Nagoya as the most probable city, we perform a fine-tuning step using hill climbing over a 100 km × 100 km area centered around Nagoya’s coordinates. The goal is to determine the optimal …
Figure 4: Daily temporal activity in top 10 residential areas organized in two classes obtained from a 2-class clusterisation.

Day      0   1   2   3   4   5   6   7   8   9   10  11  12  13
Class    A   A   B   B   B   B   A   A   A   B   B   B   B   A
Weekday  Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat

Day      14  15  16  17  18  19  20  21  22  23  24  25  26  27
Class    A   B   B   B   B   B   A   A   B   B   B   B   B   A
Weekday  Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat

Day      28  29  30  3…
Figure 5: Activity at Port Messe Nagoya (grid 69,110; GPS 35.05016252379402, 137.39573944062153) with hosted events.
Surrounding text: the four remaining dates in H align with known holidays. Only one sequence satisfies all constraints: the dataset begins on 15 September 2019 (Day 0). Under this alignment, the five holidays in H match exactly with Japan’s national holidays: Respect for the Aged Day (16/09/2019 - Day 1), Autumn Equino…
Figure 6: Sensitive user attributes with varying seclusion thresholds.
Surrounding text: 6 What Are the Privacy Risks for Users in This Dataset? Having assessed that the implemented protection measures for Yjmob100k can be reversed, we want to study the potential threat to the user privacy of the dataset with actual date and location. We assess the privacy impact of our spatio-temporal re-identification by quantifying individual-leve…
Figure 7: (a) Clustered correlation after applying Geo-Ind using radius as 1 km. The results still indicate a significantly higher correlation for Nagoya, allowing for successful location re-identification. (b) Error in home location inference when comparing before and after applying perturbations using Geo-Ind.
Surrounding text: 7.2 Geo-indistinguishability. We apply Geo-indistinguishability (Geo-Ind) [18] to perturb individual location …
Figure 8: Effect of GRR-based LDP on home and work inference. (a) Re-identification rate of users’ home and work locations under varying ε. (b) KL divergence between true and estimated population distributions, reflecting the utility loss.
Plot residue: panel (a) Clustered Correlation; x-axis City (Tokyo, Yokohama, Osaka, Nagoya, Sapporo, Kobe, Fukuoka, Kawasaki, Kyoto, Sendai); y-axis Spearman Correlation; a further axis label is truncated (Manhatt…).
Figure 9: Evaluation of spatial de-structuring effects. (a) Spearman correlation between spatially-permuted user distributions and official census data for the top 10 Japanese cities. (b) CDF of Manhattan distance errors between inferred and true home/work locations. (c) CDF of KL divergence between temporal-spatial distributions before and after permutation.
Surrounding text: timestamp. For low ε values (e.g., ε ≤ 2), re-identificat…
Original abstract

Mobility traces are among the most revealing forms of personal data, yet trajectory releases are often protected only by ad hoc transformations. We stress-test such practices on recently-released YJMob100K, an anonymized dataset of 100,000 user trajectories in Japan. First, we show that the applied protection leaves enough spatial and temporal structure to recover both the real-world geographic frame and the actual calendar timeline by exploiting density signatures, urban correlations, and temporal activity profiles. On top of this reconstruction, we quantify privacy risks through trajectory-level metrics that capture spatio-temporal k-anonymity, -point unicity, home-work and multi-anchor uniqueness, and exposure to secluded and sensitive locations. These metrics reveal extensive re-identification surfaces: a small number of observations, anchors, or sensitive venues often suffices to uniquely pinpoint users or their social neighborhoods. Finally, we evaluate representative sanitization strategies: geo-indistinguishability, local differential privacy, and aggressive spatial de-structuring; and observe a consistent pattern: strong privacy parameters destroy downstream utility, while utility-preserving settings leave structural leakage largely intact. Overall, our findings show that current sanitization techniques are insufficient for large-scale mobility data, and they highlight the urgent need for trajectory-aware privacy mechanisms and stronger publication standards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper stress-tests ad hoc anonymization on the YJMob100K dataset of 100,000 real-user trajectories in Japan. It claims that the released data retains sufficient spatial and temporal structure to reconstruct the true geographic frame and calendar timeline via density signatures, urban correlations, and activity profiles. Building on this, it applies trajectory-level metrics (spatio-temporal k-anonymity, point unicity, home-work and multi-anchor uniqueness, exposure to sensitive locations) to quantify re-identification surfaces, then evaluates sanitization strategies (geo-indistinguishability, local differential privacy, spatial de-structuring) and reports that strong privacy parameters destroy utility while utility-preserving parameters leave structural leakage largely intact.

Significance. If the reconstruction accuracy and metric results hold, the work is significant as a large-scale empirical demonstration on a recently released real-user mobility dataset. It uses standard privacy metrics without circularity or invented parameters and provides concrete trade-off evidence between privacy and utility, supporting calls for trajectory-aware mechanisms and improved publication standards in location privacy.

major comments (3)
  1. [Abstract / Reconstruction section] Reconstruction pipeline (described in the abstract and presumably §3): the central claim that density signatures, urban correlations, and temporal profiles suffice to recover both the real-world geographic frame and actual calendar timeline lacks reported quantitative accuracy metrics, error rates, or validation against ground truth; without these, the extent of recoverable structure cannot be assessed.
  2. [Metrics and results] Privacy risk quantification (abstract and presumably §4): the metrics for spatio-temporal k-anonymity, point unicity, home-work uniqueness, and sensitive-location exposure are introduced, but the manuscript provides no tables or distributions showing the fraction of users with low k or high unicity; this is load-bearing for the claim of 'extensive re-identification surfaces'.
  3. [Sanitization experiments] Sanitization evaluation (abstract and presumably §5): the reported pattern that 'strong privacy parameters destroy downstream utility, while utility-preserving settings leave structural leakage largely intact' requires explicit before/after comparisons (e.g., metric values or utility scores at specific ε for geo-indistinguishability); the current high-level statement is insufficient to support the insufficiency conclusion.
minor comments (2)
  1. [Metrics definitions] Define or cite the exact formulas used for 'multi-anchor uniqueness' and 'exposure to secluded locations' to ensure reproducibility.
  2. [Dataset description] Add a table summarizing the dataset statistics (number of points per trajectory, spatial/temporal granularity) early in the paper.
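The before/after comparison requested in major comment 3 can be illustrated for the local differential privacy case. The sketch below applies generalized randomized response (GRR) to a categorical location, debiases the reported counts, and measures KL divergence between the true and estimated population distributions at several ε values, mirroring the utility measure named in the Figure 8 caption. The five-cell domain, the sample size, the clipping step, and the function names are assumptions for illustration only.

```python
import math
import random
from collections import Counter

def grr_perturb(value, domain, eps, rng):
    """Report the true value with probability p, otherwise a uniformly chosen other value."""
    k = len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, eps):
    """Debias the report frequencies, clip negatives, and renormalize."""
    k, n = len(domain), len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    counts = Counter(reports)
    clipped = [max((counts[v] / n - q) / (p - q), 0.0) for v in domain]
    total = sum(clipped)
    return [x / total for x in clipped] if total else [1.0 / k] * k

def kl_divergence(p_true, p_est, floor=1e-9):
    """KL(p_true || p_est), with a small floor to avoid division by zero."""
    return sum(p * math.log(p / max(q, floor)) for p, q in zip(p_true, p_est) if p > 0)

def utility_loss(true_dist, domain, eps, n=20000, seed=0):
    """Simulate n users, perturb their cells with GRR, and return the KL divergence."""
    rng = random.Random(seed)
    truth = rng.choices(domain, weights=true_dist, k=n)
    reports = [grr_perturb(v, domain, eps, rng) for v in truth]
    return kl_divergence(true_dist, grr_estimate(reports, domain, eps))

# Hypothetical five-cell domain with a skewed population distribution.
domain = ["c0", "c1", "c2", "c3", "c4"]
true_dist = [0.5, 0.2, 0.15, 0.1, 0.05]
for eps in (0.5, 2.0, 5.0):
    print(eps, round(utility_loss(true_dist, domain, eps), 5))
```

A table of this shape, with a re-identification rate reported next to each KL value, is the kind of explicit evidence the comment asks for. Note that a realistic location domain has tens of thousands of cells, which makes GRR far noisier than in this toy setting.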

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of our empirical analysis on the YJMob100K dataset. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

  1. Referee: [Abstract / Reconstruction section] Reconstruction pipeline (described in the abstract and presumably §3): the central claim that density signatures, urban correlations, and temporal profiles suffice to recover both the real-world geographic frame and actual calendar timeline lacks reported quantitative accuracy metrics, error rates, or validation against ground truth; without these, the extent of recoverable structure cannot be assessed.

    Authors: We agree that the reconstruction claims would be strengthened by explicit quantitative validation. In the revision we will expand §3 with a new subsection reporting accuracy metrics, error rates, and ground-truth validation results for both geographic frame recovery and calendar timeline alignment. revision: yes

  2. Referee: [Metrics and results] Privacy risk quantification (abstract and presumably §4): the metrics for spatio-temporal k-anonymity, point unicity, home-work uniqueness, and sensitive-location exposure are introduced, but the manuscript provides no tables or distributions showing the fraction of users with low k or high unicity; this is load-bearing for the claim of 'extensive re-identification surfaces'.

    Authors: The current version describes the metrics but does not include the requested aggregate statistics. We will add summary tables and distributions in the revised §4 that report the fraction of users below given k thresholds and above given unicity levels, directly supporting the extent of re-identification surfaces. revision: yes

  3. Referee: [Sanitization experiments] Sanitization evaluation (abstract and presumably §5): the reported pattern that 'strong privacy parameters destroy downstream utility, while utility-preserving settings leave structural leakage largely intact' requires explicit before/after comparisons (e.g., metric values or utility scores at specific ε for geo-indistinguishability); the current high-level statement is insufficient to support the insufficiency conclusion.

    Authors: We accept that concrete parameter-level comparisons are needed. The revised §5 will include tables and figures with before/after values of the privacy metrics and utility scores at specific ε (and equivalent parameters for the other mechanisms), making the trade-off evidence explicit. revision: yes
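For geo-indistinguishability, the parameter-level comparison promised in the third response could start from a sketch like the following. It draws planar Laplace noise as in Andrés et al. [18]: the angle is uniform and the radius follows a Gamma distribution with shape 2 and scale 1/ε, so the expected displacement is 2/ε. Coordinates in kilometres, ε expressed per kilometre, the 0.5 km cell size, and the function names are assumptions made here; how the paper maps its "radius" setting to ε is not stated on this page.

```python
import math
import random

def planar_laplace(x_km, y_km, eps_per_km, rng):
    """Perturb a point with planar Laplace noise (uniform angle, Gamma(2, 1/eps) radius)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.gammavariate(2.0, 1.0 / eps_per_km)
    return x_km + radius * math.cos(theta), y_km + radius * math.sin(theta)

def to_cell(x_km, y_km, cell_km=0.5):
    """Snap a continuous position back to grid indices (assumed 0.5 km cells)."""
    return math.floor(x_km / cell_km), math.floor(y_km / cell_km)

def mean_displacement(eps_per_km, n=20000, seed=0):
    """Empirical mean distance between true and perturbed points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        px, py = planar_laplace(0.0, 0.0, eps_per_km, rng)
        total += math.hypot(px, py)
    return total / n

for eps in (0.5, 2.0, 8.0):
    # The theoretical mean displacement is 2 / eps kilometres.
    print(eps, round(mean_displacement(eps), 3), 2.0 / eps)
```

Pairing the mean displacement at each ε with the resulting city-level correlation and home-inference error would make the trade-off explicit. Per-point noise of this kind is independent across observations, which is one reason aggregate density structure can survive it.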

Circularity Check

0 steps flagged

No significant circularity


This paper is an empirical analysis of an externally released dataset (YJMob100K) using standard privacy metrics including spatio-temporal k-anonymity, point unicity, and anchor uniqueness. The central claims rest on direct reconstruction of geographic and temporal frames from density signatures and quantitative evaluation of sanitization strategies, with no equations, fitted parameters, or self-citations that reduce results to inputs by construction. The argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the released dataset's anonymization preserves identifiable density and activity patterns; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The YJMob100K dataset was protected only by ad hoc transformations that leave spatial and temporal structure intact.
    Stated in the opening of the abstract as the premise enabling reconstruction.




Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  [1] Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782 (2008)

  [2] Montjoye, Y.-A. d., Hidalgo, C. A., Verleysen, M. & Blondel, V. D. Unique in the Crowd: The privacy bounds of human mobility. Sci. Reports 3, 1376, DOI: 10.1038/srep01376 (2013)

  [3] Yabe, T. et al. Yjmob100k: City-scale and longitudinal dataset of anonymized human mobility trajectories. Sci. Data 11, 397 (2024)

  [4] Douriez, M., Doraiswamy, H., Freire, J. & Silva, C. T. Anonymizing NYC Taxi Data: Does It Matter? In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on, 140–148 (IEEE, 2016)

  [5] Pintér, G. Revealing urban area from mobile positioning data. Sci. Reports 14, DOI: 10.1038/s41598-024-82006-5 (2024)

  [6] Zhong, Y., Yuan, N. J., Zhong, W., Zhang, F. & Xie, X. You Are Where You Go: Inferring Demographic Attributes from Location Check-ins. In ACM WSDM, 295–304, DOI: 10.1145/2684822.2685287 (ACM, Shanghai, China, 2015)

  [7] Drakonakis, K., Ilia, P., Ioannidis, S. & Polakis, J. Please Forget Where I Was Last Summer: The Privacy Risks of Public Location (Meta)Data, DOI: 10.48550/arXiv.1901.00897 (2019). arXiv:1901.00897 [cs]

  [8] Fung, B. C., Wang, K., Chen, R. & Yu, P. S. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. (Csur) 42, 1–53 (2010)

  [9] Acs, G. & Castelluccia, C. A Case Study: Privacy Preserving Release of Spatio-temporal Density in Paris. In ACM SIGKDD, KDD ’14, 1679–1688, DOI: 10.1145/2623330.2623361 (ACM, New York, NY, USA, 2014)

  [10] Fiore, M. et al. Privacy in trajectory micro-data publishing: a survey. arXiv:1903.12211 [cs] (2020)

  [11] Japan: High resolution population density maps + demographic estimates (2018). Available online: https://data.humdata.org/dataset/japan-high-resolution-population-density-maps-demographic-estimates (accessed on 3 June 2025)

  [12] Time and Date AS. Japanese holidays calendar. https://www.timeanddate.com/calendar/custom.html?year=2021&country=26&cols=3&hol=9&df=1 (2021). Accessed: 2025-06-03

  [13] BBC News. Typhoon hagibis: Japan suffers deadly floods and landslides from storm. https://www.bbc.com/news/world-asia-50020108 (2019). Accessed: 2025-06-03

  [14] Gramaglia, M., Fiore, M., Furno, A. & Stanica, R. Glove: Towards privacy-preserving publishing of record-level-truthful mobile phone trajectories. ACM/IMS Trans. Data Sci. 2, DOI: 10.1145/3451178 (2021)

  [15] Chatzikokolakis, K., Andrés, M. E., Bordenabe, N. E. & Palamidessi, C. Broadening the scope of differential privacy using metrics. In PETS, 82–102, DOI: 10.1007/978-3-642-39077-7_5 (Springer, 2013)

  [16] Kairouz, P., Bonawitz, K. & Ramage, D. Discrete distribution estimation under local privacy. In International Conference on Machine Learning, 2436–2444 (PMLR, 2016)

  [17] Wang, H. et al. PrivTrace: Differentially private trajectory synthesis by adaptive markov models. In 32nd USENIX Security Symposium (USENIX Security 23), 1649–1666 (2023)

  [18] Andrés, M. E., Bordenabe, N. E., Chatzikokolakis, K. & Palamidessi, C. Geo-indistinguishability: differential privacy for location-based systems. In ACM CCS, CCS ’13, 901–914, DOI: 10.1145/2508859.2516735 (2013)

  [19] Commission Nationale de l’Informatique et des Libertés. CNIL – French Data Protection Authority. https://www.cnil.fr (2025). Accessed: 2025-06-03

  [20] Personal Information Protection Commission Japan. Personal Information Protection Commission (PPC). https://www.ppc.go.jp/en/ (2025). Accessed: 2025-06-03

  [21] Yabe, T. et al. YJMob100K: City-Scale and Longitudinal Dataset of Anonymized Human Mobility Trajectories, DOI: 10.5281/zenodo.10836269 (2024)

Appendix excerpt following the references:

A Additional data
A.1 Top 10 residential grids
We identify the top 10 residential grid cells by user count as follows: (82.0, 135.0), (77.0, 135.0), (81.0, 135.0), (82.0, 149.0), (77.0, 134.0), (87.0, 141.0), (80.0...