SF-LIFE: A Large-Scale Simulated Movement Dataset for the San Francisco Bay Area
Pith reviewed 2026-06-28 19:19 UTC · model grok-4.3
The pith
SF-LIFE supplies over three trillion noise-free location records from 500,000 simulated agents in the San Francisco Bay Area.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SF-LIFE contains 3,024,000,000,000 location records that capture the complete, noise-free, multi-modality trajectories of 500,000 simulated agents observed at 1 Hz while navigating the San Francisco Bay Area network over a 70-day period. The trajectories combine needs-driven daily agendas produced by agent-based simulation with kinematic paths derived from GTFS schedules of more than forty transit agencies and the OpenStreetMap street network. The release includes the full annotated trajectories, downsampled versions, activity and demographic metadata, and the source road and building layers.
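The headline record count can be reproduced from the stated parameters. The check below is a minimal sketch, assuming every agent is recorded once per second for all 24 hours of each of the 70 days; that assumption is not spelled out in the summary, but it is consistent with the reported total.

```python
# Sanity check of the reported record count.
# Assumption (not stated explicitly in the source): each agent is sampled
# once per second, around the clock, on every one of the 70 days.
AGENTS = 500_000
DAYS = 70
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
SAMPLE_RATE_HZ = 1

records = AGENTS * DAYS * SECONDS_PER_DAY * SAMPLE_RATE_HZ
print(f"{records:,}")  # 3,024,000,000,000
```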
What carries the argument
An agent-based simulation that first generates needs-driven daily agendas for each agent and then converts those agendas into 1 Hz kinematic trajectories using GTFS transit schedules and OpenStreetMap geometry.
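To make the second stage concrete, the sketch below turns a routed path into one position per second. It is an illustration only: the function name `interpolate_1hz`, the planar coordinates in metres, and the constant-speed assumption are all hypothetical, and the kinematic model actually used for SF-LIFE is not described in this summary.

```python
import math

def interpolate_1hz(waypoints, speed_mps):
    """Return one (x, y) position per second along a polyline.

    Hypothetical helper: assumes planar coordinates in metres and a
    constant travel speed, which is a simplification for illustration.
    """
    if speed_mps <= 0:
        raise ValueError("speed_mps must be positive")
    if not waypoints:
        return []

    # Cumulative distance travelled at each waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    positions = []
    seg = 0  # index of the segment the agent is currently on
    t = 0    # elapsed seconds
    while True:
        d = t * speed_mps
        if d >= total:
            # Path finished: emit the final waypoint and stop.
            positions.append(waypoints[-1])
            break
        # Advance to the segment containing distance d.
        while cum[seg + 1] <= d:
            seg += 1
        frac = (d - cum[seg]) / (cum[seg + 1] - cum[seg])
        (x0, y0), (x1, y1) = waypoints[seg], waypoints[seg + 1]
        positions.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
        t += 1
    return positions

# A 15 m L-shaped path walked at 2.5 m/s yields 7 samples (t = 0..6 s).
path = interpolate_1hz([(0, 0), (10, 0), (10, 5)], speed_mps=2.5)
print(len(path), path[5])
```

A transit leg would replace the constant speed with timings taken from the GTFS schedule; that step is omitted here.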
If this is right
- Researchers can train and test machine-learning models on complete, labeled multi-modal trajectories at a scale impossible with real data.
- Transit planners can examine the effects of schedule changes or new routes on the full population of agents without privacy barriers.
- Urban-computing studies gain a ground-truth benchmark that includes every location update and the causal activity behind each stop.
- Subsampled versions allow experiments at lower temporal resolution while preserving the original activity and demographic annotations.
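The subsampling point can be illustrated with a small sketch. The record layout (`t`, `x`, `y`, `mode`) and the keep-every-k-th-second rule are assumptions made for this example; the summary does not say how the released reduced-frequency versions were produced.

```python
def downsample(records, period_s):
    """Keep one record every `period_s` seconds, annotations untouched.

    Hypothetical helper: `records` is a list of dicts with an integer
    timestamp `t` (seconds) plus any annotation fields such as `mode`.
    """
    if period_s < 1:
        raise ValueError("period_s must be at least 1")
    return [r for r in records if r["t"] % period_s == 0]

# Two minutes of made-up 1 Hz data: walking for the first minute, then bus.
full = [
    {"t": t, "x": float(t), "y": 0.0, "mode": "walk" if t < 60 else "bus"}
    for t in range(120)
]
coarse = downsample(full, period_s=60)
print([(r["t"], r["mode"]) for r in coarse])  # [(0, 'walk'), (60, 'bus')]
```

Fixed-interval sampling like this can miss short mode changes between kept samples, so a real pipeline might also retain the records at which the mode label changes.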
Where Pith is reading between the lines
- If the simulation captures the dominant drivers of mobility, the same pipeline could be re-run for other metropolitan areas once their GTFS and OSM data are available.
- The dataset could serve as a testbed for evaluating privacy-preserving aggregation methods by comparing results against the known full trajectories.
- Varying the underlying activity-generation rules or infrastructure layers would let users explore counterfactual mobility under different policy assumptions.
Load-bearing premise
The simulated agendas and paths reproduce the statistical patterns of real human movement in the Bay Area closely enough for the intended research applications.
What would settle it
A side-by-side comparison in which the simulated distributions of trip lengths, mode shares, or daily activity timing differ markedly from independent real-world surveys or anonymized tracking studies of the same region.
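One way such a comparison could be scored is the total variation distance between simulated and surveyed mode shares. The sketch below uses placeholder numbers invented purely for illustration; they are not taken from SF-LIFE or from any real survey, and the helper names are hypothetical.

```python
from collections import Counter

def mode_shares(trip_modes):
    """Convert a list of per-trip mode labels into a share per mode."""
    counts = Counter(trip_modes)
    n = sum(counts.values())
    return {mode: c / n for mode, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    modes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in modes)

# Placeholder inputs, NOT real data.
simulated_trips = ["car"] * 6 + ["bus"] * 2 + ["walk"] * 2
survey_shares = {"car": 0.5, "bus": 0.2, "walk": 0.2, "bike": 0.1}

tv = total_variation(mode_shares(simulated_trips), survey_shares)
print(round(tv, 3))  # 0.1
```

A value near 0 would indicate closely matching mode shares; what counts as differing "markedly" is a judgement the comparison would have to justify.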
Original abstract
We introduce SF-LIFE, a large-scale simulated movement dataset designed to accelerate research in transportation, mobility, and machine learning. The dataset contains 3,024,000,000,000 location records capturing complete, noise-free, multi-modality trajectories of 500,000 simulated agents observed at a 1Hz frequency navigating the San Francisco Bay Area network over a 70-day period. The data captures (1) needs-driven daily agendas of individual agents generated by an agent-based simulation of human patterns of life and (2) detailed kinematic trajectories moving agents across the OpenStreetMap representation of San Francisco using data from 40+ transit agencies across 9 counties. SF-LIFE provides unprecedented scale and detail as trajectories are based on real transit infrastructure using San Francisco General Transit Feed Specification (GTFS) data, having agent movements across multiple modalities, including bus, rail, bike, automobile, and walking. For this high-fidelity simulated representation of San Francisco, we provide (1) the full trajectory data annotated with transportation mode labels, (2) reduced-size versions of the trajectory data with reduced temporal frequency, (3) agent activity information describing the causal activity why an agent visits a place, (4) agent demographic data, and (5) the underlying OSM road network and building data. As the first dataset of its scale and level of detail, SF-LIFE overcomes the privacy, noise, and completeness limitations inherent in real-world tracking data, providing a robust and ethically sourced resource for research in transit optimization, human mobility analysis, and urban computing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SF-LIFE, a large-scale simulated movement dataset for the San Francisco Bay Area. It contains 3,024,000,000,000 location records from 500,000 agents observed at 1 Hz over 70 days, generated via agent-based simulation of needs-driven daily agendas combined with kinematic multi-modal trajectories (bus, rail, bike, car, walk) derived from GTFS data of 40+ agencies and OpenStreetMap networks. The release includes full annotated trajectories, downsampled versions, activity labels, demographic attributes, and the underlying road/building network data, positioned as overcoming privacy, noise, and completeness limitations of real tracking data.
Significance. If the generated trajectories are sufficiently representative of real mobility, the dataset would offer substantial value for transportation, urban computing, and machine-learning research by supplying complete, noise-free, privacy-preserving trajectories at a scale and annotation level not available from empirical sources, supporting studies on transit optimization, mode choice, and activity patterns.
major comments (2)
- [Abstract] The claim that SF-LIFE constitutes a 'robust' resource for mobility research rests on the representativeness of the needs-driven agent-based simulation and GTFS/OSM-derived trajectories, yet the manuscript reports no quantitative validation metrics, comparisons to the Bay Area Travel Survey, census commuting flows, or observed mode shares.
- [Dataset generation description] No error analysis, sensitivity tests, or fidelity checks are supplied for the kinematic trajectory generation step or the mapping from external GTFS/OSM inputs to 1 Hz multi-modal paths, leaving the central claim of high-fidelity simulation unverified.
minor comments (2)
- [Abstract] The abstract lists five provided data products but does not indicate file formats, access method, or total storage size, which would aid potential users.
- Clarify whether the 70-day period includes weekends and holidays or is restricted to weekdays, as this affects the interpretation of daily agenda patterns.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the need for validation metrics and fidelity checks in the SF-LIFE manuscript. We address each major comment below and will revise the manuscript to incorporate quantitative comparisons and analysis where feasible.
Point-by-point responses
- Referee: [Abstract] The claim that SF-LIFE constitutes a 'robust' resource for mobility research rests on the representativeness of the needs-driven agent-based simulation and GTFS/OSM-derived trajectories, yet the manuscript reports no quantitative validation metrics, comparisons to the Bay Area Travel Survey, census commuting flows, or observed mode shares.
  Authors: We agree that the manuscript would be strengthened by explicit quantitative validation. The current version focuses on dataset construction and scale but does not include direct empirical comparisons. In the revised manuscript, we will add a dedicated validation subsection reporting comparisons of simulated mode shares, trip purpose distributions, and aggregate commuting flows against the Bay Area Travel Survey and relevant census data, including appropriate statistical metrics.
  Revision: yes
- Referee: [Dataset generation description] No error analysis, sensitivity tests, or fidelity checks are supplied for the kinematic trajectory generation step or the mapping from external GTFS/OSM inputs to 1 Hz multi-modal paths, leaving the central claim of high-fidelity simulation unverified.
  Authors: We acknowledge the absence of reported error analysis or sensitivity tests for the kinematic mapping step. The trajectories are constructed directly from GTFS schedules and OSM networks using deterministic path-finding and interpolation to 1 Hz, but no quantitative fidelity or error metrics were included. In revision, we will add a section describing the mapping procedure in greater detail along with any available checks on path accuracy, timing discrepancies, and sensitivity to key parameters in the agent-based agenda generation.
  Revision: yes
Circularity Check
No circularity; data generation from external inputs
The paper presents a simulated dataset generated via agent-based modeling of needs-driven agendas combined with kinematic routing on external GTFS transit schedules and OpenStreetMap networks. No equations, fitted parameters, predictions, or uniqueness theorems are defined or invoked within the paper that reduce to its own inputs by construction. All load-bearing components (agent behaviors, trajectories, modalities) derive from cited external data sources rather than self-referential definitions or self-citations. This is a standard data-generation contribution whose central assumption (representativeness) is external and untested here, but that does not constitute circularity in any derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Agent-based simulation generates needs-driven daily agendas that reflect real human patterns of life.
- Domain assumption: Kinematic trajectories correctly follow the real OpenStreetMap road network and the GTFS transit schedules from 40+ agencies.
Forward citations
Cited by 1 Pith paper
- Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints. Presents an end-to-end system using LLM agents to add behavioral anomalies to simulated trajectories, then applies map routing and noise to generate realistic annotated anomaly datasets for mobility research.