arxiv: 2604.15782 · v1 · submitted 2026-04-17 · 💻 cs.LG · physics.soc-ph

Recognition: unknown

Fusing Cellular Network Data and Tollbooth Counts for Urban Traffic Flow Estimation

Oluwaleke Yusuf, Shaira Tabassum

Pith reviewed 2026-05-10 09:02 UTC · model grok-4.3

keywords: cellular network data, tollbooth counts, origin-destination matrices, traffic flow estimation, machine learning, urban planning, vehicle categories, mobility data fusion

The pith

Machine learning corrects aggregated cellular mobility data using sparse tollbooth counts to produce vehicle-specific origin-destination matrices for urban traffic planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a machine learning framework that learns to adjust and break down large-scale but imprecise cellular network activity data by training on accurate vehicle counts from limited tollbooth sensors. Temporal and spatial features capture the link between crowd movement patterns and actual vehicular flows by category. The corrected data then feeds into logic that infers trip destinations and distributes flows across origin-destination pairs. The framework is demonstrated on planning for a bus depot expansion, yielding hourly matrices that support traffic simulations where detailed data is otherwise unavailable. Readers care because such estimates let planners test how infrastructure changes affect background traffic without relying solely on expensive new sensors.

Core claim

The framework infers destinations from transit routes and implements routing logic to distribute corrected flows between OD pairs. This approach is applied to a bus depot expansion in Trondheim, Norway, generating hourly OD matrices by vehicle length category. The results show how limited but accurate sensor measurements can correct extensive but aggregated mobility data to produce grounded estimates of background vehicular traffic flows. These macro-scale estimates can be refined for micro-scale analysis at desired locations.
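The flow-distribution step can be pictured with a minimal sketch, assuming destination shares per origin have already been inferred from transit routes. The function `distribute_flows`, the zone names, the category labels, and all numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def distribute_flows(origin_flows, dest_shares):
    """Split corrected hourly flows at each origin across destinations.

    origin_flows: {(origin, category): vehicles per hour}
    dest_shares:  {origin: {destination: share}}
    Returns {category: {(origin, destination): vehicles per hour}}.
    """
    od = defaultdict(dict)
    for (origin, category), flow in origin_flows.items():
        shares = dest_shares[origin]
        total = sum(shares.values())
        for dest, share in shares.items():
            # Normalise so each origin's row total equals its input flow.
            od[category][(origin, dest)] = flow * share / total
    return dict(od)

# Hypothetical inputs (illustrative values only).
origin_flows = {("depot", "short"): 120.0, ("depot", "long"): 30.0}
dest_shares = {"depot": {"centre": 0.5, "north": 0.25, "south": 0.25}}
od_matrices = distribute_flows(origin_flows, dest_shares)
```

Each vehicle category gets its own matrix, and every origin row sums back to the corrected flow it was built from, which is the property a downstream simulation would rely on.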

What carries the argument

Machine learning model using temporal and spatial features to map aggregated cellular mobility data onto vehicle-category counts from tollbooths, followed by destination inference from transit routes and routing logic to allocate flows to specific origin-destination pairs.
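To make the correction idea concrete, here is a deliberately simplified stand-in. The Figure 3 caption names XGBoost as the model; the sketch below instead learns one scaling ratio per (hour, category) at tollbooth locations and applies it to cellular aggregates elsewhere. The function names, category labels, and numbers are assumptions for illustration only.

```python
from collections import defaultdict

def fit_ratios(training_rows):
    """Learn one scaling ratio per (hour, category).

    training_rows: iterable of (hour, category, cellular_count, toll_count)
    observed where a tollbooth provides ground truth.
    """
    cell_sum = defaultdict(float)
    toll_sum = defaultdict(float)
    for hour, category, cellular, toll in training_rows:
        cell_sum[(hour, category)] += cellular
        toll_sum[(hour, category)] += toll
    return {k: toll_sum[k] / cell_sum[k] for k in cell_sum if cell_sum[k] > 0}

def correct(ratios, hour, category, cellular):
    """Estimate vehicles of one category from a cellular aggregate."""
    return ratios[(hour, category)] * cellular

# Hypothetical training rows: (hour, category, cellular count, toll count).
rows = [
    (8, "short", 1000.0, 400.0),
    (8, "short", 500.0, 200.0),
    (8, "long", 1000.0, 50.0),
    (8, "long", 500.0, 25.0),
]
ratios = fit_ratios(rows)
estimate = correct(ratios, 8, "short", 800.0)
```

A ratio table like this cannot capture interactions between time and place, which is presumably why a tree-based learner with richer features is used in the paper; the sketch only shows the direction of the mapping.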

If this is right

Hourly origin-destination matrices broken down by vehicle length category become available for running traffic simulations of infrastructure projects.
Background vehicular flows can be estimated across a city even in locations lacking direct sensors.
The outputs support evaluation of interventions such as depot expansions by providing category-specific flow inputs.
Macro-level estimates can be zoomed in for detailed analysis at chosen sites.
The overall method offers a repeatable way to create origin-destination data from cellular sources in settings with limited ground truth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same correction step could be applied to other cities that already collect cellular aggregates and occasional toll counts, testing transferability of the learned mapping.
Adding public transport schedule data might further separate modes within the cellular signals for even finer disaggregation.
The matrices could serve as a low-cost baseline to decide where to place additional sensors for ongoing validation.
Streaming versions of the cellular input might allow periodic updates to the matrices for dynamic planning use.

Load-bearing premise

That temporal and spatial features alone let the model learn the full relationship between cellular aggregates and real vehicle counts without leaving systematic biases that would distort the resulting matrices.
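One way to probe this premise is to look at signed residuals by group rather than at a single pooled error. The sketch below assumes held-out records with predicted and observed counts; the field names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def residual_bias(records, key):
    """Mean signed error (predicted - observed) per group.

    records: iterable of dicts with 'predicted', 'observed' and grouping fields.
    key: name of the field to group by, e.g. 'hour' or 'category'.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["predicted"] - r["observed"])
    return {g: mean(errs) for g, errs in groups.items()}

# Hypothetical held-out records (vehicles per hour).
records = [
    {"hour": 8, "category": "short", "predicted": 410.0, "observed": 400.0},
    {"hour": 8, "category": "long", "predicted": 40.0, "observed": 50.0},
    {"hour": 17, "category": "short", "predicted": 380.0, "observed": 400.0},
    {"hour": 17, "category": "long", "predicted": 45.0, "observed": 50.0},
]
bias_by_category = residual_bias(records, "category")
bias_by_hour = residual_bias(records, "hour")
```

In this toy data the bias at hour 8 averages to zero only because the two categories err in opposite directions, which is the kind of masked systematic error the premise would need to rule out.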

What would settle it

Direct comparison of the generated hourly OD matrices against independent vehicle counts or camera observations collected at multiple non-tollbooth road segments would reveal whether the estimates align with actual flows.
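Such a comparison could be scored with the GEH statistic, a conventional measure in traffic modelling for matching modelled and counted hourly flows; it is not mentioned in the paper and is used here only as an illustration. For modelled flow M and observed count C, GEH = sqrt(2(M - C)^2 / (M + C)), and values below 5 are often treated as an acceptable match. The link flows below are hypothetical.

```python
import math

def geh(modelled, observed):
    """GEH statistic for one hourly link flow (vehicles per hour)."""
    if modelled + observed == 0:
        return 0.0
    return math.sqrt(2.0 * (modelled - observed) ** 2 / (modelled + observed))

def share_within(pairs, threshold=5.0):
    """Fraction of (modelled, observed) pairs with GEH below the threshold."""
    scores = [geh(m, c) for m, c in pairs]
    return sum(s < threshold for s in scores) / len(scores)

# Hypothetical pairs: (flow implied by the OD matrices, independent count).
pairs = [(520.0, 480.0), (130.0, 200.0), (75.0, 75.0), (900.0, 1000.0)]
fraction_ok = share_within(pairs)
```

Because GEH scales with flow magnitude, it tolerates larger absolute errors on busy links than on quiet ones, which suits a network where non-tollbooth segments vary widely in volume.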

Figures

Figures reproduced from arXiv: 2604.15782 by Oluwaleke Yusuf, Shaira Tabassum.

Figure 2: Differences between total vehicular counts and …

Figure 3: The ML-based pipeline adopted for data fusion. The XGBoost model is trained to learn the relationship between …

Figure 5: SHAP summary plot showing the global importance of …

Figure 6: Hourly inferred traffic flow distributions across the eight destinations for various NPRA vehicle length categories, …

Figure 7: Outputs from the Aimsun traffic simulation, showing (a) the reference scenario with estimated background traffic and …

Original abstract

Traffic simulations, essential for planning urban transit infrastructure interventions, require vehicle-category-specific origin-destination (OD) data. Existing data sources are imperfect: sparse tollbooth sensors provide accurate vehicle counts by category, while extensive mobility data from cellular network activity captures aggregated crowd movement, but lack modal disaggregation and have systematic biases. This study develops a machine learning framework to correct and disaggregate cellular network data using sparse tollbooth counts as ground truth. The model uses temporal and spatial features to learn the complex relationship between aggregated mobility data and vehicular data. The framework infers destinations from transit routes and implements routing logic to distribute corrected flows between OD pairs. This approach is applied to a bus depot expansion in Trondheim, Norway, generating hourly OD matrices by vehicle length category. The results show how limited but accurate sensor measurements can correct extensive but aggregated mobility data to produce grounded estimates of background vehicular traffic flows. These macro-scale estimates can be refined for micro-scale analysis at desired locations. The framework provides a generalisable approach for generating origin-destination data from cellular network data. This enables downstream tasks, like detailed traffic simulations for infrastructure planning in data-scarce contexts, supporting urban planners in making informed decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows a workable way to correct cellular mobility data with tollbooth counts for vehicle-category OD matrices in a real planning case, but the validation details and bias checks are too thin to judge how well it holds up.

The core contribution is a straightforward ML pipeline that takes aggregated cellular activity, adds temporal and spatial features, and uses tollbooth counts as ground truth to produce hourly origin-destination matrices broken down by vehicle length. They apply it to a bus depot expansion in Trondheim and add routing logic to spread the corrected flows across pairs.

That combination of data sources and the downstream use for infrastructure simulation is the practical piece worth noting. It fills a gap for cities that have some toll sensors but no dense traffic counters, and the Trondheim example gives a concrete sense of the output format planners could actually use. The routing step to infer destinations from transit routes is a reasonable engineering choice rather than a theoretical leap. The paper is clear that the goal is macro-scale estimates that can be refined locally, which matches how these tools get used in practice.

The main weakness is the lack of reported error metrics, hold-out validation, or baseline comparisons. The abstract and methods description do not show how much residual bias remains after the correction, especially for vehicle categories or time periods not well covered by the tollbooths. Without those numbers it is hard to know whether the temporal and spatial features really capture the sampling biases in the cellular data or just smooth over them. The claim that the framework is generalisable rests on one city case, so readers will want to see sensitivity tests or cross-validation results before treating the outputs as grounded.

This is the kind of applied methods paper that transportation planners and simulation groups would find useful for data-scarce settings. It is not a broad methodological advance, but the workflow is concrete enough that a serious referee could check the implementation details and ask for the missing quantitative checks. I would send it to review rather than desk reject, with the expectation that the authors add error tables and a clearer limitations section.
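The hold-out validation asked for here could take the form of leave-one-station-out folds, so that the model is always scored on a tollbooth it never saw. This is a sketch of the split only, under the assumption that each row carries a station identifier; the field names and values are hypothetical.

```python
def leave_one_station_out(rows):
    """Yield (held_out_station, train_rows, test_rows) for each station.

    rows: list of dicts that each carry a 'station' field.
    """
    stations = sorted({r["station"] for r in rows})
    for station in stations:
        train = [r for r in rows if r["station"] != station]
        test = [r for r in rows if r["station"] == station]
        yield station, train, test

# Hypothetical tollbooth observations.
rows = [
    {"station": "A", "hour": 8, "count": 400},
    {"station": "A", "hour": 17, "count": 450},
    {"station": "B", "hour": 8, "count": 120},
    {"station": "C", "hour": 8, "count": 300},
]
folds = list(leave_one_station_out(rows))
```

Splitting by station rather than by random row matters because the framework is meant to estimate flows at locations without sensors; a random split would let the model see other hours from the same tollbooth.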

Referee Report

3 major / 2 minor

Summary. The paper develops a machine learning framework to fuse sparse but accurate tollbooth vehicle counts with extensive but aggregated cellular network mobility data. Using temporal and spatial features, the model corrects and disaggregates the cellular data to produce vehicle-category-specific origin-destination (OD) matrices. The framework includes logic to infer destinations from transit routes and distribute flows via routing. It is applied to a case study of bus depot expansion in Trondheim, Norway, to generate hourly OD matrices by vehicle length category for background traffic estimation.

Significance. If the quantitative performance is strong, this work could provide a practical, generalizable method for creating detailed vehicular OD data in urban areas with limited sensor coverage. By leveraging limited ground-truth toll data to calibrate broader mobility datasets, it addresses a key data gap for traffic simulations used in infrastructure planning. The approach's applicability to data-scarce contexts is a notable strength.

major comments (3)

[Abstract and §3 (Model Description)] The manuscript provides no details on the machine learning model architecture, specific temporal and spatial features used, loss function, or training/validation procedure. Without these, it is impossible to assess whether the model can adequately capture biases in cellular data (e.g., demographic sampling, modal misattribution) using only the chosen features.
[§5 (Results)] No quantitative results, error metrics (such as MAE, RMSE for counts or OD flows), validation against held-out tollbooth data, or comparisons to baselines are reported. The claim that the framework produces 'grounded estimates' lacks empirical support, which is load-bearing for the central contribution.
[§4 (Framework)] The routing logic and destination inference from transit routes are described at a high level, but no pseudocode, algorithmic details, or sensitivity analysis to routing assumptions are provided. This could introduce unquantified errors in the OD matrix generation.
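For reference, the error metrics requested in the second comment are simple to report alongside a baseline. The sketch below computes MAE and RMSE on held-out hours for a corrected estimate and for an uncorrected cellular aggregate; all values are hypothetical and do not come from the paper.

```python
import math

def mae(pred, obs):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

# Hypothetical held-out tollbooth hours (vehicles per hour).
observed = [400.0, 420.0, 380.0, 50.0]
corrected = [410.0, 400.0, 390.0, 40.0]     # model output
baseline = [1000.0, 1100.0, 900.0, 1000.0]  # uncorrected cellular aggregate

report = {
    "corrected": (mae(corrected, observed), rmse(corrected, observed)),
    "baseline": (mae(baseline, observed), rmse(baseline, observed)),
}
```

Reporting both rows side by side is what would show whether the correction step adds value over using the cellular aggregates directly.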

minor comments (2)

[Abstract] The abstract mentions 'the results show' but does not preview any specific findings or metrics, which is atypical for a methods paper.
[Notation] Clarify the exact definition of 'aggregated mobility data' versus 'vehicular data' early in the paper to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will make revisions to enhance the technical completeness, empirical support, and transparency of the work.

Referee: [Abstract and §3 (Model Description)] The manuscript provides no details on the machine learning model architecture, specific temporal and spatial features used, loss function, or training/validation procedure. Without these, it is impossible to assess whether the model can adequately capture biases in cellular data (e.g., demographic sampling, modal misattribution) using only the chosen features.

Authors: We agree that the model description in §3 is currently high-level and insufficient for full reproducibility or bias assessment. In the revised manuscript, we will expand §3 (and update the abstract if needed) to specify the model architecture, the exact temporal features (e.g., hour-of-day, weekday/weekend indicators) and spatial features (e.g., zone-level aggregates, distance to tollbooths), the loss function, and the training/validation procedure including data partitioning and any regularization techniques. This will clarify how the model uses tollbooth ground truth to correct cellular biases. revision: yes
Referee: [§5 (Results)] No quantitative results, error metrics (such as MAE, RMSE for counts or OD flows), validation against held-out tollbooth data, or comparisons to baselines are reported. The claim that the framework produces 'grounded estimates' lacks empirical support, which is load-bearing for the central contribution.

Authors: We acknowledge that the current §5 focuses on the Trondheim case study application without quantitative metrics or validation. This is a substantive gap. In the revision, we will add error metrics (MAE, RMSE) on held-out tollbooth counts and OD flows, describe the validation split, and include baseline comparisons (e.g., uncorrected cellular data or naive disaggregation). These additions will provide empirical grounding for the estimates. revision: yes
Referee: [§4 (Framework)] The routing logic and destination inference from transit routes are described at a high level, but no pseudocode, algorithmic details, or sensitivity analysis to routing assumptions are provided. This could introduce unquantified errors in the OD matrix generation.

Authors: We agree that the framework in §4 would benefit from greater specificity. We will add pseudocode and step-by-step algorithmic details for destination inference and flow distribution, along with a sensitivity analysis varying key routing assumptions (e.g., route choice parameters or transit route interpretations). This will help quantify and bound potential errors in the generated OD matrices. revision: yes
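A sensitivity analysis of the kind promised in the last response might sweep a route-choice parameter and record how the allocated flows move. The paper's routing logic is not specified, so the sketch below assumes a multinomial-logit split over two routes purely for illustration; the route names, costs, and the dispersion parameter `theta` are hypothetical.

```python
import math

def logit_shares(costs, theta):
    """Multinomial-logit split over routes given travel costs (minutes)."""
    weights = {r: math.exp(-theta * c) for r, c in costs.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

def sweep(costs, flow, thetas):
    """Route flows for each theta, to see how sensitive the split is."""
    return {t: {r: flow * s for r, s in logit_shares(costs, t).items()}
            for t in thetas}

# Hypothetical costs for two routes serving one OD pair.
costs = {"ring_road": 12.0, "centre": 15.0}
results = sweep(costs, flow=200.0, thetas=[0.0, 0.2, 1.0])
```

With `theta = 0` the flow splits evenly, and larger values push more traffic onto the cheaper route; the spread across the sweep gives a rough bound on how much the OD allocation depends on that single assumption.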

Circularity Check

0 steps flagged

No significant circularity; external tollbooth counts serve as independent ground truth for ML training

The paper trains an ML model on temporal and spatial features to correct and disaggregate cellular mobility data, explicitly using sparse tollbooth counts as external ground truth for vehicle-category-specific flows. It then applies separate routing logic derived from transit routes to produce OD matrices. No load-bearing step reduces by construction to a fitted input renamed as prediction, a self-definition, or a self-citation chain; the derivation remains self-contained against the distinct data sources and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about the accuracy of tollbooth counts and the presence of correctable biases in cellular data, plus the capacity of an unspecified ML model to learn the required mapping from temporal and spatial features alone.
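As an illustration of what "temporal and spatial features" might look like in practice, the sketch below builds features of the kind the simulated rebuttal lists as examples (hour-of-day, weekday/weekend indicators, distance to tollbooths). The cyclic hour encoding, the planar coordinates, and the function names are assumptions, not details from the paper.

```python
import math
from datetime import datetime

def temporal_features(ts):
    """Encode a timestamp as cyclic hour-of-day plus a weekend flag."""
    angle = 2.0 * math.pi * ts.hour / 24.0
    return {
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "is_weekend": int(ts.weekday() >= 5),  # Monday is 0, Sunday is 6
    }

def spatial_features(zone_xy, tollbooths_xy):
    """Distance from a zone centroid to the nearest tollbooth.

    Coordinates are assumed planar; the result is in the same units.
    """
    return {"dist_nearest_toll": min(math.dist(zone_xy, t) for t in tollbooths_xy)}

# Hypothetical timestamp and coordinates.
temporal = temporal_features(datetime(2026, 4, 18, 6, 0))
spatial = spatial_features((0.0, 0.0), [(3.0, 4.0), (10.0, 0.0)])
```

The sine and cosine pair keeps 23:00 and 00:00 adjacent in feature space, which a raw hour integer would not; whether that matters for a tree-based learner is an empirical question.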

free parameters (1)

Machine learning model parameters
The framework learns the relationship between mobility and vehicular data, implying parameters fitted during training whose values are not reported in the abstract.

axioms (2)

domain assumption: Sparse tollbooth sensors provide accurate vehicle counts by category that can serve as ground truth
Explicitly used as the reference to correct cellular data.
domain assumption: Cellular network activity captures aggregated crowd movement but contains systematic biases and lacks modal disaggregation
Stated as the motivation for the correction step.

pith-pipeline@v0.9.0 · 5516 in / 1565 out tokens · 45533 ms · 2026-05-10T09:02:54.540721+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 2 canonical work pages

[1]

Cars, planes, trains: Where do CO2 emissions from transport come from?

H. Ritchie and M. Roser, “Cars, planes, trains: Where do CO2 emissions from transport come from?” Our World in Data, Oct. 2020. [Online]. Available: https://ourworldindata.org/co2-emissions-from-transport

2020
[2]

Sustainable transport modes, travel satisfaction, and emotions: Evidence from car-dependent compact cities,

K. Mouratidis, J. De Vos, A. Yiannakou, and I. Politis, “Sustainable transport modes, travel satisfaction, and emotions: Evidence from car-dependent compact cities,” Travel Behaviour and Society, vol. 33, p. 100613, Oct. 2023

2023
[3]

About Trafikkdata,

Statens vegvesen, “About Trafikkdata,” Jan. 2023. [Online]. Available: https://trafikkdata.atlas.vegvesen.no/om-trafikkdata

2023
[4]

Crowd Insights Methodology - Telia,

Telia, “Crowd Insights Methodology - Telia,” Oct. 2021. [Online]. Available: https://coda.io/@data-insights/telia-webinars-and-training/crowd-insights-methodology-training-27

2021
[5]

Advancing Dynamic Origin-Destination Matrices Estimation Models Using Crowd-Sourced Flexibility Data,

M. Castiglione, G. Cantelmo, M. Nigro, and E. Cipriani, “Advancing Dynamic Origin-Destination Matrices Estimation Models Using Crowd-Sourced Flexibility Data,” in 12th Triennial Symposium on Transportation Analysis, Okinawa, Japan, Jun. 2025. [Online]. Available: https://tristan2025.org/proceedings/TRISTAN2025 ExtendedAbstract 207.pdf

2025
[6]

Estimating Origin-Destination Matrices in Helsinki’s Public Transport through Multi-Source Data Fusion,

E. Fernandes, “Estimating Origin-Destination Matrices in Helsinki’s Public Transport through Multi-Source Data Fusion,” Master’s thesis, Aalto University, Sep. 2025. [Online]. Available: https://aaltodoc.aalto.fi/handle/123456789/140249

work page arXiv 2025
[7]

Using Multiple Biased Data Sets to Recover Missing Trips with a Behaviorally Informed Model,

X. Guan, S. Huang, and C. Chen, “Using Multiple Biased Data Sets to Recover Missing Trips with a Behaviorally Informed Model,” Transportation Science, vol. 59, no. 4, pp. 743–762, Jul. 2025

2025
[8]

Origin–destination prediction via knowledge-enhanced hybrid learning,

Z. Xing, E. Chung, Y. Wang, A. Toriumi, T. Oguchi, and Y. Wu, “Origin–destination prediction via knowledge-enhanced hybrid learning,” Computer-Aided Civil and Infrastructure Engineering, vol. 40, no. 17, pp. 2498–2521, 2025

2025
[9]

Origin-destination prediction from road average speed data using GraphResLSTM model,

G. Hu and J. Zhang, “Origin-destination prediction from road average speed data using GraphResLSTM model,” PeerJ Computer Science, vol. 11, p. e2709, Feb. 2025

2025
[10]

Estimating Erratic Measurement Errors in Network-Wide Traffic Flow via Virtual Balance Sensors,

Z. Zheng, Z. Wang, H. Fu, and W. Ma, “Estimating Erratic Measurement Errors in Network-Wide Traffic Flow via Virtual Balance Sensors,” Transportation Science, vol. 59, no. 4, pp. 721–742, Jul. 2025

2025
[11]

Traffic Prediction Using LSTM, RF and XGBoost,

K. N. Lam, “Traffic Prediction Using LSTM, RF and XGBoost,” in Proceedings of the 2nd International Conference on Data Analysis and Machine Learning - DAML. SciTePress / INSTICC, 2025, pp. 267–274

2025
[12]

CTCam: Enhancing Transportation Evaluation through Fusion of Cellular Traffic and Camera-Based Vehicle Flows,

C. Lin, S.-L. Tung, H.-T. Su, and W. H. Hsu, “CTCam: Enhancing Transportation Evaluation through Fusion of Cellular Traffic and Camera-Based Vehicle Flows,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, ser. CIKM ’23. New York, NY, USA: Association for Computing Machinery, Oct. 2023, pp. 5341–5345

2023
[13]

XGBoost: A scalable tree boosting system,

T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794

2016
[14]

A Unified Approach to Interpreting Model Predictions,

S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., Dec. 2017, pp. 4768–4777. [Online]. Available: https://dl.acm.org/doi/10.5555/3295222.3295230

work page doi:10.5555/3295222.3295230 2017
[15]

Austroads Extended Vehicle Classification Scheme for Traffic and Transport Surveys,

D. Gaynor, D. Johnston, M. Coleman, C. Chin, and S. Cropley, “Austroads Extended Vehicle Classification Scheme for Traffic and Transport Surveys,” Austroads, Sydney, New South Wales, Publication AP-G104-23, Sep. 2023. [Online]. Available: https://austroads.gov.au/publications/traffic-management/ap-g104-23

2023
[16]

Aimsun Next Traffic Modelling Software,

Aimsun, “Aimsun Next Traffic Modelling Software,” Barcelona, Spain,
[17]

Available: https://www.aimsun.com

[Online]. Available: https://www.aimsun.com
[18]

How More Buses Could Affect Traffic:: A Digital Twin of Trondheim’s Sandmoen Bus Depot,

S. Tabassum, U. Oluwaleke, B.-A. Raanes, G. Kiss, and F. Lindseth, “How More Buses Could Affect Traffic:: A Digital Twin of Trondheim’s Sandmoen Bus Depot,” Moderne mobilitet og infrastruktur, vol. 3, no. 2, Nov. 2025

2025