
arXiv: 2606.00430 · v1 · pith:ZHO3QDAOnew · submitted 2026-05-29 · ⚛️ physics.soc-ph · cs.CY · cs.DB

SF-LIFE: A Large-Scale Simulated Movement Dataset for the San Francisco Bay Area

Pith reviewed 2026-06-28 19:19 UTC · model grok-4.3

classification ⚛️ physics.soc-ph · cs.CY · cs.DB
keywords simulated mobility dataset · San Francisco Bay Area · agent-based simulation · multi-modal trajectories · GTFS transit data · urban computing · location records · human mobility analysis

The pith

SF-LIFE supplies over three trillion noise-free location records from 500,000 simulated agents in the San Francisco Bay Area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SF-LIFE as a dataset built from an agent-based simulation that assigns daily activity agendas to agents and then traces their exact paths across real transit and road networks. The resulting records are complete and unencumbered by privacy constraints, with one location per second for each of half a million agents across seventy days. Because real tracking data is limited by cost, noise, and consent rules, this simulated resource is positioned to support large-scale studies of multi-modal travel that would otherwise be impractical. The authors supply the raw trajectories, activity labels, demographics, and the underlying map data so that others can use the collection directly or subsample it.

Core claim

SF-LIFE contains 3,024,000,000,000 location records that capture the complete, noise-free, multi-modality trajectories of 500,000 simulated agents observed at 1 Hz while navigating the San Francisco Bay Area network over a 70-day period. The trajectories combine needs-driven daily agendas produced by agent-based simulation with kinematic paths derived from GTFS schedules of more than forty transit agencies and the OpenStreetMap street network. The release includes the full annotated trajectories, downsampled versions, activity and demographic metadata, and the source road and building layers.
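The headline record count follows directly from the stated agent count, sampling rate, and duration. The short check below is only an arithmetic sketch; the constant and function names are illustrative and do not come from the paper.

```python
# Arithmetic check of the stated record count.
# All names here are illustrative; only the numbers come from the paper.
AGENTS = 500_000          # simulated agents
SAMPLE_RATE_HZ = 1        # one location record per second
DAYS = 70                 # observation period
SECONDS_PER_DAY = 24 * 60 * 60


def total_records(agents: int, rate_hz: int, days: int) -> int:
    """Records produced if every agent is sampled continuously."""
    return agents * rate_hz * days * SECONDS_PER_DAY


records = total_records(AGENTS, SAMPLE_RATE_HZ, DAYS)
print(f"{records:,}")  # 3,024,000,000,000
```

The figure matches only if every agent emits a record for every second of the 70 days, including time spent stationary at home, work, or school.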

What carries the argument

An agent-based simulation that first generates needs-driven daily agendas for each agent and then converts those agendas into 1 Hz kinematic trajectories using GTFS transit schedules and OpenStreetMap geometry.

If this is right

  • Researchers can train and test machine-learning models on complete, labeled multi-modal trajectories at a scale impossible with real data.
  • Transit planners can examine the effects of schedule changes or new routes on the full population of agents without privacy barriers.
  • Urban-computing studies gain a ground-truth benchmark that includes every location update and the causal activity behind each stop.
  • Subsampled versions allow experiments at lower temporal resolution while preserving the original activity and demographic annotations.
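As a rough illustration of the last point, the sketch below thins a 1 Hz trajectory to a coarser sampling period while carrying the mode and activity labels along unchanged. The `Record` layout and the `downsample` helper are assumptions made for this example; the paper's actual file schema and downsampling procedure are not described here.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical record layout; the real SF-LIFE schema may differ."""
    t: int          # seconds since the start of the simulation
    lat: float
    lon: float
    mode: str       # transportation mode label
    activity: str   # activity label attached to the record


def downsample(records: List[Record], period_s: int) -> List[Record]:
    """Keep one record every `period_s` seconds, with labels untouched."""
    if period_s < 1:
        raise ValueError("period_s must be a positive number of seconds")
    return [r for r in records if r.t % period_s == 0]


# Two minutes of synthetic 1 Hz data for one walking agent.
one_hz = [Record(t, 37.77 + t * 1e-5, -122.42, "walk", "commute") for t in range(120)]
coarse = downsample(one_hz, 60)
print([r.t for r in coarse])  # [0, 60]
```

Selecting existing records, rather than averaging positions, keeps every retained point on the original path and keeps its labels valid.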

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the simulation captures the dominant drivers of mobility, the same pipeline could be re-run for other metropolitan areas once their GTFS and OSM data are available.
  • The dataset could serve as a testbed for evaluating privacy-preserving aggregation methods by comparing results against the known full trajectories.
  • Varying the underlying activity-generation rules or infrastructure layers would let users explore counterfactual mobility under different policy assumptions.

Load-bearing premise

The simulated agendas and paths reproduce the statistical patterns of real human movement in the Bay Area closely enough for the intended research applications.

What would settle it

A side-by-side comparison in which the simulated distributions of trip lengths, mode shares, or daily activity timing differ markedly from independent real-world surveys or anonymized tracking studies of the same region.
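One way such a comparison could be scored is with the total variation distance between simulated and surveyed mode shares. The sketch below is an assumption about how a reader might run the check, not a method from the paper, and every number in it is a made-up placeholder rather than a value from SF-LIFE or any survey.

```python
from collections import Counter
from typing import Dict, Iterable


def mode_shares(trip_modes: Iterable[str]) -> Dict[str, float]:
    """Fraction of trips made with each mode."""
    counts = Counter(trip_modes)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 for identical shares, 1 for disjoint ones."""
    modes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in modes)


# Placeholder inputs only: NOT real SF-LIFE output or survey results.
simulated = mode_shares(["car"] * 6 + ["bus"] * 2 + ["walk"] * 2)
survey = {"car": 0.7, "bus": 0.1, "walk": 0.2}

tv = total_variation(simulated, survey)
print(f"total variation distance: {tv:.3f}")
```

The same function applies to binned trip lengths or activity start times once both sources share the same bins. What counts as a marked difference is a judgment the source does not quantify.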

Figures

Figures reproduced from arXiv: 2606.00430 by Andreas Züfle, Andrew Crooks, Boyu Wang, Carola Wenk, Chanuka Algama, Dieter Pfoser, Doug Taylor, Erfan Hosseini Sereshgi, Hamdi Kavak, Henrique Ferraz de Arruda, John Hunter, Lance Kennedy, Mauryan Uppalapati, Nathan Holt, Sandro Martinelli Reia, Taylor Anderson, Yueyang Liu.

Figure 1: Simulation Architecture: An agent-based city-level simulation uses Maslowian Needs […]
Figure 4: Gender vs agent type.
Figure 5: Vehicle Ownership
Figure 7: Life patterns of agent 149857 (worker).
Figure 6: List of colors and their corresponding location or […]
Figure 9: Life patterns of agent 261254 (homemaker)
Figure 10: Life patterns of agent 360916 (homemaker).
Figure 11: Life patterns of agent 49270 (student).
Figure 12: Life patterns of agent 66982 (worker).
Figure 17: Agent 261254 (homemaker). Staypoint colors indi[…]
Figure 16: Agent 150502 (worker). Staypoint colors indicate […]
Figure 21: Agent 439557 (worker). Staypoint colors indicate […]
Figure 20: Agent 66982 (worker). Staypoint colors indicate […]
Figure 23: Number of agents present at a given school during […]
Figure 24: Number of agents present at a given school during […]
Original abstract

We introduce SF-LIFE, a large-scale simulated movement dataset designed to accelerate research in transportation, mobility, and machine learning. The dataset contains 3,024,000,000,000 location records capturing complete, noise-free, multi-modality trajectories of 500,000 simulated agents observed at a 1Hz frequency navigating the San Francisco Bay Area network over a 70-day period. The data captures (1) needs-driven daily agendas of individual agents generated by an agent-based simulation of human patterns of life and (2) detailed kinematic trajectories moving agents across the OpenStreetMap representation of San Francisco using data from 40+ transit agencies across 9 counties. SF-LIFE provides unprecedented scale and detail as trajectories are based on real transit infrastructure using San Francisco General Transit Feed Specification (GTFS) data, having agent movements across multiple modalities, including bus, rail, bike, automobile, and walking. For this high-fidelity simulated representation of San Francisco, we provide (1) the full trajectory data annotated with transportation mode labels, (2) reduced-size versions of the trajectory data with reduced temporal frequency, (3) agent activity information describing the causal activity why an agent visits a place, (4) agent demographic data, and (5) the underlying OSM road network and building data. As the first dataset of its scale and level of detail, SF-LIFE overcomes the privacy, noise, and completeness limitations inherent in real-world tracking data, providing a robust and ethically sourced resource for research in transit optimization, human mobility analysis, and urban computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SF-LIFE, a large-scale simulated movement dataset for the San Francisco Bay Area. It contains 3,024,000,000,000 location records from 500,000 agents observed at 1 Hz over 70 days, generated via agent-based simulation of needs-driven daily agendas combined with kinematic multi-modal trajectories (bus, rail, bike, car, walk) derived from GTFS data of 40+ agencies and OpenStreetMap networks. The release includes full annotated trajectories, downsampled versions, activity labels, demographic attributes, and the underlying road/building network data, positioned as overcoming privacy, noise, and completeness limitations of real tracking data.

Significance. If the generated trajectories are sufficiently representative of real mobility, the dataset would offer substantial value for transportation, urban computing, and machine-learning research by supplying complete, noise-free, privacy-preserving trajectories at a scale and annotation level not available from empirical sources, supporting studies on transit optimization, mode choice, and activity patterns.

major comments (2)
  1. [Abstract] The claim that SF-LIFE constitutes a 'robust' resource for mobility research rests on the representativeness of the needs-driven agent-based simulation and GTFS/OSM-derived trajectories, yet the manuscript reports no quantitative validation metrics, comparisons to the Bay Area Travel Survey, census commuting flows, or observed mode shares.
  2. [Dataset generation description] No error analysis, sensitivity tests, or fidelity checks are supplied for the kinematic trajectory generation step or the mapping from external GTFS/OSM inputs to 1 Hz multi-modal paths, leaving the central claim of high-fidelity simulation unverified.
minor comments (2)
  1. [Abstract] The abstract lists five provided data products but does not indicate file formats, access method, or total storage size, which would aid potential users.
  2. Clarify whether the 70-day period includes weekends and holidays or is restricted to weekdays, as this affects the interpretation of daily agenda patterns.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the need for validation metrics and fidelity checks in the SF-LIFE manuscript. We address each major comment below and will revise the manuscript to incorporate quantitative comparisons and analysis where feasible.

Point-by-point responses
  1. Referee: [Abstract] The claim that SF-LIFE constitutes a 'robust' resource for mobility research rests on the representativeness of the needs-driven agent-based simulation and GTFS/OSM-derived trajectories, yet the manuscript reports no quantitative validation metrics, comparisons to the Bay Area Travel Survey, census commuting flows, or observed mode shares.

    Authors: We agree that the manuscript would be strengthened by explicit quantitative validation. The current version focuses on dataset construction and scale but does not include direct empirical comparisons. In the revised manuscript, we will add a dedicated validation subsection reporting comparisons of simulated mode shares, trip purpose distributions, and aggregate commuting flows against the Bay Area Travel Survey and relevant census data, including appropriate statistical metrics.
    Revision: yes

  2. Referee: [Dataset generation description] No error analysis, sensitivity tests, or fidelity checks are supplied for the kinematic trajectory generation step or the mapping from external GTFS/OSM inputs to 1 Hz multi-modal paths, leaving the central claim of high-fidelity simulation unverified.

    Authors: We acknowledge the absence of reported error analysis or sensitivity tests for the kinematic mapping step. The trajectories are constructed directly from GTFS schedules and OSM networks using deterministic path-finding and interpolation to 1 Hz, but no quantitative fidelity or error metrics were included. In revision, we will add a section describing the mapping procedure in greater detail along with any available checks on path accuracy, timing discrepancies, and sensitivity to key parameters in the agent-based agenda generation.
    Revision: yes
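To make "interpolation to 1 Hz" concrete, the sketch below fills in one position per second between timed waypoints by straight-line interpolation. It is only an assumed, simplified reading of that phrase: the waypoint format and the `interpolate_1hz` helper are hypothetical, and the actual pipeline may follow road geometry, dwell times, or speed profiles that are not described in the text above.

```python
from typing import List, Tuple

Waypoint = Tuple[int, float, float]  # (time in whole seconds, lat, lon)


def interpolate_1hz(waypoints: List[Waypoint]) -> List[Waypoint]:
    """Emit one point per second along straight segments between waypoints.

    Assumes waypoint times are strictly increasing integers, such as
    scheduled stop times or the nodes of a routed path.
    """
    if not waypoints:
        return []
    out: List[Waypoint] = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(waypoints, waypoints[1:]):
        if t1 <= t0:
            raise ValueError("waypoint times must be strictly increasing")
        for t in range(t0, t1):
            f = (t - t0) / (t1 - t0)  # fraction of the segment completed
            out.append((t, lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)))
    out.append(waypoints[-1])  # close the trajectory at the final waypoint
    return out


# Hypothetical 4-second hop between two points on the same meridian.
points = interpolate_1hz([(0, 37.000, -122.0), (4, 37.004, -122.0)])
print(len(points))  # 5
```

A fidelity check of the kind the referee requests could compare such interpolated positions against the underlying street or track geometry and report the deviation.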

Circularity Check

0 steps flagged

No circularity; data generation from external inputs

Full rationale

The paper presents a simulated dataset generated via agent-based modeling of needs-driven agendas combined with kinematic routing on external GTFS transit schedules and OpenStreetMap networks. No equations, fitted parameters, predictions, or uniqueness theorems are defined or invoked within the paper that reduce to its own inputs by construction. All load-bearing components (agent behaviors, trajectories, modalities) derive from cited external data sources rather than self-referential definitions or self-citations. This is a standard data-generation contribution whose central assumption (representativeness) is external and untested here, but that does not constitute circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions from agent-based modeling and use of public infrastructure data; no free parameters, invented entities, or ad-hoc axioms beyond domain assumptions about simulation fidelity are evident from the abstract.

axioms (2)
  • domain assumption: Agent-based simulation generates needs-driven daily agendas that reflect real human patterns of life.
    Invoked when describing how individual agent trajectories are produced.
  • domain assumption: Kinematic trajectories correctly follow the real OpenStreetMap road network and GTFS transit schedules from 40+ agencies.
    Stated as the foundation for multi-modal movement across the Bay Area.

pith-pipeline@v0.9.1-grok · 5891 in / 1410 out tokens · 28036 ms · 2026-06-28T19:19:43.338661+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

    cs.AI · 2026-06 · unverdicted · novelty 5.0

    Presents an end-to-end system using LLM agents to add behavioral anomalies to simulated trajectories, then applies map routing and noise to generate realistic annotated anomaly datasets for mobility research.

Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages · cited by 1 Pith paper

  1. [1]

    John M Abowd, John Haltiwanger, and Julia Lane. 2004. Integrated longitudinal employer-employee data for the United States. American Economic Review 94, 2 (2004), 224–229

  2. [2]

    Hossein Amiri, Will Kohn, Shiyang Ruan, Joon-Seok Kim, Hamdi Kavak, Andrew Crooks, Dieter Pfoser, Carola Wenk, and Andreas Züfle. 2024. The patterns of life human mobility simulation. In Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems. 653–656

  3. [3]

    Hossein Amiri, Richard Yang, Shiyang Ruan, Joon-Seok Kim, Hamdi Kavak, Andrew Crooks, Dieter Pfoser, Carola Wenk, and Andreas Züfle. 2025. HD-GEN: A Software System for Large-Scale Human Mobility Data Generation Based on Patterns of Life. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems. 407–410

  4. [4]

    Trevor C Bailey and Anthony C Gatrell. 1995. Spatial data analysis: Theory and practice. Journal of the Royal Statistical Society: Series A 158, 3 (1995), 461–462

  5. [5]

    Ana LC Bazzan and Franziska Klügl. 2013. Agent-based modeling and simulation for transportation systems. Transportation Research Part C: Emerging Technologies 37 (2013), 1–3

  6. [6]

    Richard A Becker, Ramón Cáceres, Karrie Hanson, Ji Meng Loh, Simon Urbanek, Alex Varshavsky, and Chris Volinsky. 2011. Large-scale analysis of urban mobility patterns using GPS data. In Proceedings of the 2011 ACM SIGKDD international conference on Knowledge discovery and data mining. 311–319

  7. [7]

    Stacey Bricka, Timothy Reuscher, Paul Schroeder, Mitchell Fisher, Justina Beard, and Xiaoyuan Layla Sun. 2024. Summary of travel trends: 2022 national household travel survey. (2024)

  8. [8]

    Nicholson Collier and Jonathan Ozik. 2022. Distributed agent-based simulation with Repast4Py. In 2022 Winter Simulation Conference (WSC) (Singapore). IEEE, 192–206. doi:10.1109/WSC57314.2022.10015389

  9. [9]

    Ketevan Gallagher, Taylor Anderson, Andrew Crooks, and Andreas Züfle. 2023. Synthetic geosocial network generation. In Proceedings of the 7th ACM SIGSPATIAL Workshop on Location-based Recommendations, Geosocial Networks and Geoadvertising. 15–24

  10. [10]

    Marta C Gonzalez, Cesar A Hidalgo, and Albert-Laszlo Barabasi. 2008. Understanding human mobility patterns from large-scale trajectory data. Nature 453, 7196 (2008), 779–782

  11. [11]

    Google Transit. 2023. General Transit Feed Specification Reference. Google Developers (2023). https://developers.google.com/transit/gtfs/reference

  12. [12]

    Erfan Hosseini Sereshgi, Mauryan Uppalapati, Yueyang Liu, Lance Kennedy, Andreas Züfle, and Carola Wenk. 2025. Semantic Anomaly Detection in Human Trajectories: Preserving Behavioral Patterns Through Calendar Representations. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Geospatial Anomaly Detection (GeoAnomalies ’25). Association for ...

  13. [13]

    Na Jiang, Fuzhen Yin, Boyu Wang, and Andrew T Crooks. 2024. A large-scale geographically explicit synthetic population with social networks for the United States. Scientific Data 11, 1 (2024), 1204

  14. [14]

    Abraham H Maslow. 1943. A theory of human motivation. Psychological Review 50, 4 (1943), 370

  15. [15]

    Frank Primerano, Michael AP Taylor, Ladda Pitaksringkarn, and Peter Tisato

  16. [16]

    Defining and understanding trip chaining behaviour. Transportation 35, 1 (2008), 55–72

  17. [17]

    Sandro M Reia, Henrique F de Arruda, Shiyang Ruan, Taylor Anderson, Hamdi Kavak, and Dieter Pfoser. 2026. Towards Universal Urban Patterns-of-Life Simulation. arXiv preprint arXiv:2601.22099 (2026)

  18. [18]

    Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis. 2008. Privacy-preserving trajectory data publishing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 591–602

  19. [19]

    Jingyuan Wang, Xiangjie Kong, Feng Xia, and Lianyue Sun. 2019. Machine learning for urban mobility: A survey. In Proceedings of the 2019 IEEE International Conference on Big Data. IEEE, 5587–5596

  20. [20]

    Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440–442

  21. [21]

    Lei Zhang, Xiaolei Wang, and Feng Chen. 2016. Transit network optimization using agent-based simulation. In Transportation Research Board 95th Annual Meeting

  22. [22]

    Yu Zheng. 2015. Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology (TIST) 6, 3 (2015), 1–41

  23. [23]

    Yu Zheng, Loren Capra, Ouri Wolfson, and Hai Yang. 2014. Urban computing: concepts, methodologies, and applications. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1103–1112

  24. [24]

    Yu Zheng, Loren Capra, Ouri Wolfson, and Hai Yang. 2015. Urban mobility analysis with large-scale trajectory data. In Proceedings of the IEEE, Vol. 103. IEEE, 136–154

  25. [25]

    Andreas Züfle, Carola Wenk, Dieter Pfoser, Andrew Crooks, Joon-Seok Kim, Hamdi Kavak, Umar Manzoor, and Hyunjee Jin. 2023. Urban life: a model of people and places. Computational and Mathematical Organization Theory 29, 1 (2023), 20–51