GenWorld: Empirically Grounded Urban Simulation Infrastructure for Scalable LLM-Agent Studies
Pith reviewed 2026-06-29 02:51 UTC · model grok-4.3
The pith
GenWorld combines a building-level synthetic city with offline LLM policy compilation to enable scalable urban agent simulations grounded in real census and mobility data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenWorld supplies an empirically grounded urban simulation infrastructure that merges a building-level synthetic city, a structured agent-environment interface, and offline compilation of LLM-derived decision signals into lookup policies, allowing scalable rollout. The reference implementation for Higashihiroshima anchors 196,608 synthetic residents in census and geospatial data, validates demographic consistency against census tabulations, and uses YJMob100K data as a commuting-distance check. Demonstrations cover a full-city weekday rollout, a weekday-weekend contrast, and a warning-response perturbation.
What carries the argument
Offline compilation of LLM-derived decision signals into lookup policies, which replaces repeated online calls with fast table lookups while retaining signals from the original model outputs.
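A minimal sketch of the compile-then-lookup pattern described above may help fix ideas. It is not the paper's implementation: the state features (`AGE_GROUPS`, `DAY_TYPES`, `HOURS`), the activity labels, and the `query_decision_model` stub standing in for an LLM call are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical discretized state features (not taken from the paper).
AGE_GROUPS = ("child", "adult", "senior")
DAY_TYPES = ("weekday", "weekend")
HOURS = tuple(range(24))


def query_decision_model(age_group, day_type, hour):
    """Stand-in for an expensive LLM call; returns an activity label.

    A real pipeline would prompt a model here. This stub is deterministic
    so the example runs offline.
    """
    if hour < 7 or hour >= 22:
        return "home"
    if day_type == "weekday" and age_group != "senior" and 9 <= hour < 17:
        return "school" if age_group == "child" else "work"
    return "leisure"


def compile_policy():
    """Query the decision model once per discrete state and cache the result."""
    return {
        state: query_decision_model(*state)
        for state in product(AGE_GROUPS, DAY_TYPES, HOURS)
    }


def rollout(policy, agents, day_type):
    """Simulate one day by table lookup only; no model calls happen here."""
    return {
        agent_id: [policy[(age_group, day_type, hour)] for hour in HOURS]
        for agent_id, age_group in agents.items()
    }


policy = compile_policy()
agents = {0: "adult", 1: "child", 2: "senior"}  # toy population
weekday = rollout(policy, agents, "weekday")
```

In this sketch the number of model queries depends only on the number of discrete states, not on the number of agents, which is the property the scaling argument relies on.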
If this is right
- A full-city weekday simulation becomes computationally feasible for hundreds of thousands of agents.
- Weekday-weekend behavioral contrasts can be generated reproducibly from the same infrastructure.
- Perturbation experiments such as warning responses can include auditable replanning traces.
- Demographic consistency checks against census tabulations can be repeated for new cities.
- Mobile-phone data can serve as an external diagnostic for commuting distances.
Where Pith is reading between the lines
- The same compilation approach could be tested on other cities if equivalent census and mobility datasets exist.
- If the lookup policies prove stable, the framework might later support controlled experiments on policy interventions such as evacuation routes.
- Calibration against observed traffic or evacuation outcomes would be needed before treating outputs as forecasts.
- The structured interface might allow swapping in different agent decision models without rebuilding the city layer.
Load-bearing premise
Offline lookup policies compiled from LLM outputs will preserve enough behavioral fidelity for the intended uses even without direct quantitative checks against live LLM behavior during rollout.
What would settle it
A side-by-side comparison in which live LLM agents and their compiled lookup-policy counterparts produce statistically different aggregate statistics on commuting distances or activity patterns when both are run on the same synthetic city.
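One way such a comparison could be scored is sketched below, assuming both policies are evaluated on the same ordered list of states. The per-state agreement rate and the Jensen-Shannon divergence between activity distributions are illustrative choices, not metrics reported in the paper, and the action lists are toy data.

```python
import math
from collections import Counter


def agreement_rate(live, compiled):
    """Fraction of states where both policies choose the same action."""
    if len(live) != len(compiled) or not live:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a == b for a, b in zip(live, compiled)) / len(live)


def distribution(actions):
    """Empirical probability of each action label."""
    counts = Counter(actions)
    return {action: n / len(actions) for action, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    support = set(p) | set(q)
    mix = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a):
        # KL divergence from the mixture; terms with zero mass contribute 0.
        return sum(a[x] * math.log2(a[x] / mix[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy action sequences on the same four held-out states (made-up data).
live = ["home", "work", "work", "leisure"]
compiled = ["home", "work", "leisure", "leisure"]

rate = agreement_rate(live, compiled)
divergence = js_divergence(distribution(live), distribution(compiled))
print(f"agreement={rate:.2f}, JS divergence={divergence:.3f} bits")
```

Agreement rate is sensitive to per-state mismatches, while the divergence only compares aggregate activity shares, so the two can disagree; reporting both would show whether errors cancel out in aggregate.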
Original abstract
LLM-agent simulation faces a joint grounding and scaling problem: agents should act in environments that reflect real urban constraints, yet direct online LLM calls for city-scale populations are computationally prohibitive. We present GenWorld, an empirically grounded urban simulation infrastructure that combines a building-level synthetic city, a structured agent-environment interface, and offline compilation of LLM-derived decision signals into lookup policies for scalable rollout. In a reference instantiation for Higashihiroshima, Japan, GenWorld grounds 196,608 synthetic residents in census and geospatial data, validates demographic consistency against census tabulations, and uses YJMob100K mobile-phone data as a commuting-distance diagnostic. We demonstrate the infrastructure through three reproducible cases: a full-city weekday rollout, a weekday-weekend behavioral contrast, and a warning-response perturbation with auditable replanning traces. These cases support GenWorld as a reproducible platform for grounded and scalable LLM-agent studies, while calibrated forecasting for traffic, evacuation, or policy outcomes remains future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GenWorld, an infrastructure designed to address the grounding and scaling challenges in LLM-agent urban simulations. It integrates a building-level synthetic city model, a structured agent-environment interface, and an offline compilation method that converts LLM-derived decision signals into lookup policies. The system is instantiated for Higashihiroshima, Japan, grounding 196,608 synthetic residents using census and geospatial data, with demographic validation against census tabulations and commuting-distance diagnostics from YJMob100K mobile-phone data. Three reproducible demonstration cases are provided: a full-city weekday rollout, a weekday-weekend behavioral contrast, and a warning-response perturbation with auditable replanning traces. The authors position GenWorld as a platform for grounded and scalable LLM-agent studies, noting that calibrated forecasting is future work.
Significance. If the offline-compiled lookup policies retain sufficient behavioral fidelity to the original LLM decisions, GenWorld would represent a significant contribution by enabling city-scale simulations that are both empirically grounded in real data and computationally scalable. The grounding in census and mobile data, combined with the reproducible demonstration cases, would support its use as a platform for studying agent behaviors in urban settings. The explicit acknowledgment that calibrated forecasting remains future work appropriately scopes the current contribution.
major comments (2)
- [§3 (Policy Compilation)] The central claim that GenWorld enables grounded, scalable LLM-agent studies via offline compilation of LLM decision signals into lookup policies requires that the compiled policies preserve behavioral fidelity. No quantitative metric (e.g., action-distribution divergence, trajectory statistics, or decision agreement rate) is reported comparing live LLM outputs to the lookup tables on held-out states. This is load-bearing for asserting that the scalability benefit retains the grounding property.
- [§2 (Reference Instantiation)] Demographic consistency is asserted against census tabulations and YJMob100K is invoked as a commuting-distance diagnostic for the 196,608 residents, but no error metrics, sample sizes, or exclusion rules are supplied. This weakens the ability to assess the strength of the empirical grounding claim.
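The error metrics requested in the second major comment could take a form like the following sketch, which compares a synthetic marginal against a census marginal. The age bands and counts are made-up toy values, and the two metrics (total variation distance between shares, worst per-category relative error) are assumptions about what a revision might report, not figures from the manuscript.

```python
def shares(counts):
    """Convert category counts to proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(census_counts, synthetic_counts):
    """Half the L1 distance between the two share vectors (0 = identical)."""
    p, q = shares(census_counts), shares(synthetic_counts)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def max_relative_error(census_counts, synthetic_counts):
    """Largest per-category count error relative to the census count."""
    return max(
        abs(synthetic_counts.get(c, 0) - n) / n
        for c, n in census_counts.items()
        if n > 0
    )


# Toy marginals for illustration only; not real census figures.
census = {"0-14": 200, "15-64": 600, "65+": 200}
synthetic = {"0-14": 210, "15-64": 590, "65+": 200}

tvd = total_variation_distance(census, synthetic)
worst = max_relative_error(census, synthetic)
print(f"TVD={tvd:.3f}, worst relative error={worst:.1%}")
```

Reporting the sample size behind each marginal alongside such numbers would let readers judge whether a small discrepancy is meaningful.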
minor comments (2)
- [Abstract] Including at least one quantitative result (e.g., a specific error rate from the demographic validation) would strengthen the 'empirically grounded' assertion without altering the scope.
- [Demonstration cases] The reproducibility of the three cases is positive, but the manuscript could specify the state-space cardinality or number of unique states compiled into the lookup policies to better contextualize the scalability gain.
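To illustrate why state-space cardinality matters for the last comment, the sketch below compares the number of entries in a lookup table with the number of online calls a per-agent, per-step approach would need for one simulated day. Only the resident count comes from the manuscript; the feature names, bin counts, and hourly step size are hypothetical.

```python
import math

N_RESIDENTS = 196_608   # resident count stated in the manuscript
STEPS_PER_DAY = 24      # assumption: one decision per simulated hour

# Hypothetical state features and bin counts; the manuscript reports none.
STATE_BINS = {
    "age_group": 6,
    "employment_status": 4,
    "day_type": 2,
    "hour_of_day": 24,
    "warning_active": 2,
}


def table_size(bins):
    """Number of distinct states, i.e. offline model queries to compile."""
    return math.prod(bins.values())


def online_calls(n_agents, steps):
    """Model queries needed if every agent calls the model at every step."""
    return n_agents * steps


entries = table_size(STATE_BINS)
calls = online_calls(N_RESIDENTS, STEPS_PER_DAY)
reduction = calls / entries
print(f"{entries} table entries vs {calls} online calls ({reduction:.0f}x fewer)")
```

Under these assumed bins the ratio is fixed by the discretization, so a finer state space shrinks the gain; that trade-off is what the requested cardinality figure would make visible.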
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the requirements for substantiating our claims on scalable grounded simulation. We respond to each major comment below.
- Referee: [§3 (Policy Compilation)] The central claim that GenWorld enables grounded, scalable LLM-agent studies via offline compilation of LLM decision signals into lookup policies requires that the compiled policies preserve behavioral fidelity. No quantitative metric (e.g., action-distribution divergence, trajectory statistics, or decision agreement rate) is reported comparing live LLM outputs to the lookup tables on held-out states. This is load-bearing for asserting that the scalability benefit retains the grounding property.
  Authors: We agree that a quantitative fidelity assessment is necessary to support the claim that offline compilation preserves the grounding property while enabling scale. The manuscript demonstrates the infrastructure via three reproducible cases that rely on the compiled policies but does not report a direct comparison (e.g., action-distribution divergence or decision agreement) against live LLM outputs on held-out states. In revision we will add this evaluation to §3, using a held-out set of states drawn from the same agent population. Revision: yes.
- Referee: [§2 (Reference Instantiation)] Demographic consistency is asserted against census tabulations and YJMob100K is invoked as a commuting-distance diagnostic for the 196,608 residents, but no error metrics, sample sizes, or exclusion rules are supplied. This weakens the ability to assess the strength of the empirical grounding claim.
  Authors: We accept that the current presentation of the grounding validation lacks the quantitative detail needed for readers to evaluate its strength. The manuscript states that demographic consistency was checked against census tabulations and that YJMob100K served as a commuting-distance diagnostic, but supplies no error metrics, sample sizes, or exclusion criteria. In the revised manuscript we will expand §2 with these specifics, including the error metric(s) employed, the exact sample sizes for each comparison, and the data-exclusion rules applied to the mobile-phone traces. Revision: yes.
Circularity Check
No circularity; derivation relies on independent external data sources
The paper constructs a synthetic population from census and geospatial data, validates demographics directly against census tabulations, and uses YJMob100K as an external commuting-distance diagnostic. The offline compilation step converts LLM outputs to lookup policies for scalability but does not define any quantity in terms of itself or rename a fitted parameter as a prediction. No equations, self-citations, or uniqueness theorems are presented that would reduce the grounding claim to a tautology. The infrastructure is therefore self-contained against external benchmarks rather than circular by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Census and geospatial data can be used to instantiate a representative synthetic population whose aggregate statistics match official tabulations.
- Domain assumption: Offline lookup policies compiled from LLM outputs retain enough behavioral fidelity to support the claimed simulation studies.