FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI
Pith reviewed 2026-05-20 23:17 UTC · model grok-4.3
The pith
Modeling delay propagation through aircraft rotation chains raises flight delay prediction AUC from 0.732 to 0.879.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deriving eleven delay propagation features from aircraft rotation chains identified by tail-number tracking in BTS records produces the largest accuracy gain, moving test AUC from 0.732 on schedule features to 0.875, with a final value of 0.879 after adding NOAA weather data across ten airports; the same system runs as a production MLOps service with real-time inference and natural-language query handling.
What carries the argument
The eleven delay propagation features extracted from aircraft rotation chains via tail-number tracking, which quantify how delays accumulate and transfer between consecutive legs of the same aircraft.
If this is right
- The rotation-chain features account for the dominant performance increase over the schedule-only baseline.
- Real-time inference remains feasible when the model is served through SageMaker with live weather updates.
- Natural-language queries about current delays can be answered by routing through a tool-use conversational agent.
- The full pipeline runs end-to-end in production without retraining at each step.
Where Pith is reading between the lines
- The same rotation-tracking idea could be tested on other scheduled transport networks where vehicles or crews cycle through repeated routes.
- Performance in live settings may degrade if tail-number data arrive with higher latency or missing values than in archived records.
- Extending the feature set to include crew or gate constraints might capture additional propagation paths the current model leaves implicit.
Load-bearing premise
The eleven features built from tail-number sequences in the historical BTS dataset capture genuine dynamic delay spread that will hold in live operations without data leakage or selection effects.
What would settle it
Retraining and testing the final model on a completely held-out later year of BTS data or on live operational streams and observing AUC fall substantially below 0.875 would falsify the claim that the rotation features generalize.
Figures
read the original abstract
Flight delays impose cascading operational and financial burdens across the aviation network, costing the U.S. economy billions of dollars annually by disrupting interconnected aircraft rotation systems. While prior machine learning approaches have demonstrated strong predictive performance, most treat upstream delays as static input variables rather than explicitly modeling how delays propagate dynamically through aircraft rotation chains, and none have deployed such systems alongside a live weather-aware conversational AI interface for end-user interaction. This paper presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction built through a progressive three-version feature engineering framework. Version 1 trains an XGBoost classifier on 11 schedule-based features establishing a baseline ROC AUC of 0.732 on 7.07 million BTS 2018 On-Time Performance records. Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, yielding the dominant performance gain (AUC 0.732 to 0.875) and surpassing the single-stage XGBoost baseline reported by Zhou (2025). Version 3 integrates five NOAA meteorological features across 10 major U.S. airports, achieving a final test set AUC of 0.879. FlightSense is deployed as a production AWS MLOps pipeline incorporating live weather ingestion via Lambda, real-time SageMaker inference, an interactive Streamlit dashboard, and an Amazon Bedrock Nova Micro conversational assistant answering natural-language delay queries via a tool-use architecture.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction. It uses a three-version progressive feature engineering pipeline on 7.07 million BTS 2018 On-Time Performance records. Version 1 establishes a baseline with 11 schedule-based features (AUC 0.732). Version 2 adds 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, producing the main lift to AUC 0.875 and surpassing a prior single-stage XGBoost baseline. Version 3 adds five NOAA meteorological features for a final test AUC of 0.879. The system is deployed as a production AWS pipeline with live weather ingestion, SageMaker inference, a Streamlit dashboard, and an Amazon Bedrock agentic conversational interface.
Significance. If the rotation-chain features are constructed without temporal leakage, the reported AUC improvement from 0.732 to 0.875 would provide concrete evidence that explicitly modeling dynamic delay propagation through tail-number-tracked rotations adds substantial predictive value beyond static schedule features. The large-scale BTS dataset, staged ablation, and production MLOps deployment with conversational AI would together strengthen the practical contribution to aviation operations research.
major comments (2)
- [Abstract / Version 2] Abstract / Version 2 description: The dominant performance gain is attributed to the 11 delay propagation features built from aircraft rotation chains via tail-number tracking. However, the manuscript provides no explicit description of how these features enforce strict temporal cutoffs (e.g., using only departures and delays prior to the target flight's scheduled departure time). Without such safeguards, the AUC jump from 0.732 to 0.875 could result from lookahead bias rather than genuine causal propagation modeling.
- [Version 2] Version 2 feature construction: The paper must demonstrate that the 11 propagation features are computed exclusively from historical data available at prediction time for each record. If the rotation chains aggregate delays from segments occurring after the prediction timestamp or rely on post-hoc selection of the full chain, the cross-validation and test-set results would be compromised by non-causal information.
minor comments (2)
- [Abstract] The citation to Zhou (2025) as the single-stage XGBoost baseline should include the full reference details and a brief comparison of feature sets to allow readers to assess the claimed improvement.
- [Version 2] The manuscript would benefit from a table summarizing the exact definitions and computation windows for each of the 11 propagation features.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight an important point about ensuring and documenting the absence of temporal leakage in the rotation-chain features. We address each major comment below and have revised the manuscript to strengthen the description of the feature construction process.
read point-by-point responses
-
Referee: [Abstract / Version 2] Abstract / Version 2 description: The dominant performance gain is attributed to the 11 delay propagation features built from aircraft rotation chains via tail-number tracking. However, the manuscript provides no explicit description of how these features enforce strict temporal cutoffs (e.g., using only departures and delays prior to the target flight's scheduled departure time). Without such safeguards, the AUC jump from 0.732 to 0.875 could result from lookahead bias rather than genuine causal propagation modeling.
Authors: We agree that the original manuscript did not provide a sufficiently explicit description of the temporal cutoffs in the abstract or high-level overview. The feature engineering in Section 3.2 was intended to be causal, but we acknowledge the need for clearer documentation. In the revised manuscript we have expanded the abstract and added a dedicated paragraph in Section 3.2.2 that states: for every target flight with scheduled departure time T, the 11 propagation features are derived exclusively from prior flights of the same tail number whose actual departure occurred before T. This filtering is applied before any aggregation of delay statistics, ensuring no future information enters the feature vector. revision: yes
-
Referee: [Version 2] Version 2 feature construction: The paper must demonstrate that the 11 propagation features are computed exclusively from historical data available at prediction time for each record. If the rotation chains aggregate delays from segments occurring after the prediction timestamp or rely on post-hoc selection of the full chain, the cross-validation and test-set results would be compromised by non-causal information.
Authors: We confirm that the original implementation enforced causality by restricting each rotation chain to historical segments available at the prediction timestamp. To address the referee's request for explicit demonstration, the revised Section 3.2.2 now includes (1) a formal definition of the temporal filter, (2) a small illustrative example with actual BTS timestamps showing that only pre-T departures are included, and (3) a note that the same causal construction was used uniformly for the 5-fold cross-validation and the held-out test set. No post-hoc selection of the full chain occurs; chains are built incrementally using only records whose departure time precedes the target flight's scheduled departure. revision: yes
Circularity Check
No significant circularity; performance gains measured on held-out test data
full rationale
The paper's central results consist of empirical AUC improvements obtained by training XGBoost models on successively enriched feature sets derived from the 2018 BTS historical records and evaluating on a held-out test partition. Version 1 uses 11 schedule features (baseline AUC 0.732), Version 2 adds 11 rotation-chain features computed via tail-number tracking, and Version 3 adds meteorological variables. These steps are standard supervised-learning feature engineering followed by out-of-sample evaluation; the reported lift is not equivalent to any input by construction, nor does any load-bearing claim reduce to a self-citation or a fitted parameter that is then relabeled as a prediction. No equations or uniqueness theorems are invoked that would create definitional circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- XGBoost hyperparameters
- Propagation feature definitions
axioms (1)
- domain assumption BTS On-Time Performance records provide accurate tail numbers and timestamps sufficient to reconstruct actual aircraft rotations without significant missing or erroneous links.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking... yielding the dominant performance gain (AUC 0.732 to 0.875)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Version 3 integrates five NOAA meteorological features... final test set AUC of 0.879
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
M. Ball, C. Barnhart, M. Dresner, M. Hansen, K. Neels, A. Odoni, E. Peterson, L. Sherry, A. Trani, and B. Zou, “Total delay impact study: A comprehensive assessment of the costs and impacts of flight delay in the United States,” NEXTOR Research Report, 2010
work page 2010
-
[2]
U.S. Department of Transportation, “Air Travel Consumer Report,” January 2026
work page 2026
-
[3]
Characterization and prediction of air traffic delays,
J. J. Rebollo and H. Balakrishnan, “Characterization and prediction of air traffic delays,”Transportation Research Part C: Emerging Technologies, vol. 44, pp. 231–241, 2014
work page 2014
-
[4]
Flight delay prediction for commercial air transport: A deep learning approach,
B. Yu, Z. Guo, S. Asian, H. Wang, and G. Chen, “Flight delay prediction for commercial air transport: A deep learning approach,”Transportation Research Part E: Logistics and Transportation Review, vol. 125, pp. 203–221, 2019
work page 2019
-
[5]
Modeling flight delay propagation: A new analytical-econometric approach,
N. Kafle and B. Zou, “Modeling flight delay propagation: A new analytical-econometric approach,”Transportation Research Part B: Methodological, vol. 93, pp. 520–542, 2016
work page 2016
-
[6]
Integrating delay-absorption capability into flight departure delay prediction,
J. Zhou, “Integrating delay-absorption capability into flight departure delay prediction,”arXiv preprint arXiv:2512.08197, George Mason University, 2025
-
[7]
A review of research on flight delay propagation: Current situation and prospect,
N. Li and H. G. Yao, “A review of research on flight delay propagation: Current situation and prospect,”Journal of Advanced Transportation, vol. 2025, Article ID 4851103, 2025
work page 2025
-
[8]
Flight delay prediction based on aviation big data and machine learning,
G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, “Flight delay prediction based on aviation big data and machine learning,”IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 140–150, 2019
work page 2019
-
[9]
LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool,
M. R. Schlichting, V . Rasmussen, H. Alazzeh, H. Liu, K. Jafari, A. F. Hardy, D. M. Asmar, and M. J. Kochenderfer, “LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool,”arXiv preprint arXiv:2503.16477, 2025
-
[10]
T. Phisannupawong, J. J. Damanik, and H.-L. Choi, “Flight delay prediction via cross-modality adaptation of large language models and aircraft trajectory representation,”arXiv preprint arXiv:2510.23636, KAIST, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Flight delay prediction from spatial and temporal perspective,
Q. Li and R. Jing, “Flight delay prediction from spatial and temporal perspective,”Expert Systems with Applications, vol. 205, p. 117662, 2022
work page 2022
-
[12]
Z. Guo, B. Yu, M. Hao, W. Wang, Y . Jiang, and F. Zong, “A novel hybrid method for flight departure delay prediction using random forest regression and maximal information coefficient,”Aerospace Science and Technology, vol. 116, p. 106822, 2021
work page 2021
-
[13]
Forecasting flight delays using machine learning,
B. Hari Chandana, N. Harshitha, D. Anwar, T. Harshitha, and G. Har- shavardhan Reddy, “Forecasting flight delays using machine learning,” inProc. 1st Int. Conf. Research and Development in Information, Com- munication, and Computing Technologies (ICRDICCT’25), SciTePress, pp. 735–743, 2025, doi: 10.5220/0013889300004919
-
[14]
A review of network delay prediction and advances in large language models for air traffic,
M. Sun, Y . Tian, J. Li, C.-L. Wu, L. Peng, and S. Xu, “A review of network delay prediction and advances in large language models for air traffic,”Artificial Intelligence Review, vol. 59, no. 36, 2026, doi: 10.1007/s10462-025-11400-w
-
[15]
M. Lambelho, M. Mitici, S. Pickup, and A. Marsden, “Assessing strategic flight schedules at an airport using machine learning-based flight delay and cancellation predictions,”Journal of Air Transport Management, vol. 82, p. 101737, 2020
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.