Trajeglish: Traffic Modeling as Next-Token Prediction
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. In pursuit of this functionality, we apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Using a simple data-driven tokenization scheme, we discretize trajectories to centimeter-level resolution using a small vocabulary. We then model the multi-agent sequence of discrete motion tokens with a GPT-like encoder-decoder that is autoregressive in time and takes into account intra-timestep interaction between agents. Scenarios sampled from our model exhibit state-of-the-art realism; our model tops the Waymo Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%. We ablate our modeling choices in full autonomy and partial autonomy settings, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes. We additionally evaluate the scalability of our model with respect to parameter count and dataset size, and use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.
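As a rough illustration of what discretizing trajectories with a small vocabulary can look like, the sketch below quantizes a 2D trajectory into motion tokens and decodes it back. It is not the paper's tokenizer: the abstract describes a data-driven scheme without giving details, so the grid vocabulary built by `make_vocab`, the `(distance, heading change)` token format, and all parameter values here are hypothetical stand-ins. The one idea it tries to capture is that each token is chosen relative to the pose reconstructed so far, so quantization error is corrected at the next step instead of accumulating.

```python
import math


def make_vocab(max_step=2.0, n_dist=5, n_turn=5, max_turn=0.2):
    """Hypothetical vocabulary: a grid of (forward distance, heading change) pairs.

    A data-driven scheme would fit these templates to recorded motion instead.
    """
    vocab = []
    for i in range(n_dist):
        d = max_step * i / (n_dist - 1)
        for j in range(n_turn):
            dtheta = -max_turn + 2 * max_turn * j / (n_turn - 1)
            vocab.append((d, dtheta))
    return vocab


def apply_token(pose, token):
    """Turn by the token's heading change, then move forward by its distance."""
    x, y, heading = pose
    d, dtheta = token
    new_heading = heading + dtheta
    return (x + d * math.cos(new_heading), y + d * math.sin(new_heading), new_heading)


def tokenize(points, start_heading, vocab):
    """Greedily pick, at each step, the token that lands closest to the next point."""
    pose = (points[0][0], points[0][1], start_heading)
    tokens = []
    for target in points[1:]:
        best = min(
            range(len(vocab)),
            key=lambda k: math.dist(apply_token(pose, vocab[k])[:2], target),
        )
        tokens.append(best)
        # Continue from the reconstructed pose, not the ground-truth one.
        pose = apply_token(pose, vocab[best])
    return tokens


def detokenize(start, start_heading, tokens, vocab):
    """Replay the tokens from the start pose to recover an approximate trajectory."""
    pose = (start[0], start[1], start_heading)
    out = [(pose[0], pose[1])]
    for t in tokens:
        pose = apply_token(pose, vocab[t])
        out.append((pose[0], pose[1]))
    return out


vocab = make_vocab()
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
tokens = tokenize(straight, 0.0, vocab)
recovered = detokenize(straight[0], 0.0, tokens, vocab)
```

The resulting integer sequence is the kind of input a next-token model could be trained on; how the paper orders tokens across agents within a timestep is not covered by this sketch.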
Forward citations
Cited by 4 Pith papers
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
  AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
- Bridging Local Observation and Global Simulation in Closed-Loop Traffic Modeling
  CRAFT reduces collisions by 31.2% and traffic violations by 33.2% in closed-loop traffic simulation by discovering context-induced failures in what-if rollouts and using a contextual preference evaluator to reweight a...
- Human-like autonomy emerges from self-play and a pinch of human data
  Self-play RL regularized with 30 minutes of human data produces driving policies that coordinate with humans, training in 15 hours on one GPU with 2500x less data than imitation learning.
- DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
  DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.