
arxiv: 2605.07364 · v2 · pith:5RU3JDLQnew · submitted 2026-05-08 · 💻 cs.LG

FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI

Pith reviewed 2026-05-20 23:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords flight delay prediction · delay propagation · aircraft rotation chains · tail number tracking · XGBoost classifier · MLOps pipeline · conversational AI · weather integration

The pith

Modeling delay propagation through aircraft rotation chains raises flight delay prediction AUC from 0.732 to 0.879.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that delays do not occur in isolation but travel forward along the sequence of flights an aircraft performs, and that capturing this movement with specific features improves forecasts. A sympathetic reader would care because cascading delays cost the U.S. economy billions each year and most existing models treat upstream delays as fixed numbers rather than dynamic effects. The authors test this idea by building three successive versions of an XGBoost model on 7.07 million BTS 2018 records: first with eleven schedule features alone, then adding eleven rotation-derived features, then adding five NOAA weather variables. They wrap the best model in a live AWS pipeline that ingests fresh weather and answers user questions through a conversational interface.

Core claim

Deriving eleven delay propagation features from aircraft rotation chains identified by tail-number tracking in BTS records produces the largest accuracy gain, moving test AUC from 0.732 on schedule features to 0.875, with a final value of 0.879 after adding NOAA weather data across ten airports; the same system runs as a production MLOps service with real-time inference and natural-language query handling.

What carries the argument

The eleven delay propagation features extracted from aircraft rotation chains via tail-number tracking, which quantify how delays accumulate and transfer between consecutive legs of the same aircraft.
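As a rough illustration of what a rotation-chain feature might look like, the sketch below groups flights by tail number, orders each group by scheduled departure, and attaches the previous leg's arrival delay and the leg's position in the chain. The record fields (`tail`, `sched_dep`, `arr_delay`), the function name, and the two derived features are hypothetical stand-ins; this summary does not define the paper's eleven features.

```python
from collections import defaultdict

def rotation_chain_features(flights):
    """Attach two illustrative propagation features to each flight.

    `flights` is a list of dicts with hypothetical keys:
      tail:      aircraft tail number
      sched_dep: scheduled departure, in minutes since some epoch
      arr_delay: arrival delay of this leg, in minutes
    """
    # Group legs by aircraft so each group is one rotation chain.
    chains = defaultdict(list)
    for f in flights:
        chains[f["tail"]].append(f)

    out = []
    for legs in chains.values():
        legs.sort(key=lambda f: f["sched_dep"])  # chronological order within a chain
        prev = None
        for position, leg in enumerate(legs):
            out.append({
                **leg,
                "leg_position": position,
                # First leg of a chain has no upstream delay to inherit.
                "prev_arr_delay": prev["arr_delay"] if prev else 0,
            })
            prev = leg
    return out

# Toy records, invented for illustration.
flights = [
    {"tail": "N101", "sched_dep": 600, "arr_delay": 5},
    {"tail": "N101", "sched_dep": 480, "arr_delay": 40},
    {"tail": "N202", "sched_dep": 500, "arr_delay": 0},
]
features = rotation_chain_features(flights)
```

Note that this sketch orders legs by schedule only; whether the previous leg's arrival delay is actually known at prediction time is a separate question, and it is the one the referee report raises.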

If this is right

  • The rotation-chain features account for the dominant performance increase over the schedule-only baseline.
  • Real-time inference remains feasible when the model is served through SageMaker with live weather updates.
  • Natural-language queries about current delays can be answered by routing through a tool-use conversational agent.
  • The full pipeline runs end-to-end in production without retraining at each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rotation-tracking idea could be tested on other scheduled transport networks where vehicles or crews cycle through repeated routes.
  • Performance in live settings may degrade if tail-number data arrive with higher latency or missing values than in archived records.
  • Extending the feature set to include crew or gate constraints might capture additional propagation paths the current model leaves implicit.

Load-bearing premise

The eleven features built from tail-number sequences in the historical BTS dataset capture genuine dynamic delay spread that will hold in live operations without data leakage or selection effects.

What would settle it

Retraining and testing the final model on a completely held-out later year of BTS data or on live operational streams and observing AUC fall substantially below 0.875 would falsify the claim that the rotation features generalize.
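The check described above could be scored as in the following minimal sketch, which computes ROC AUC as the fraction of (delayed, on-time) pairs that the model ranks correctly and compares the held-out value against the 0.875 reference. The `margin` of 0.05 that stands in for "substantially below" is an assumption made for illustration, as are the function names and the toy data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(n_pos * n_neg), which is fine
    for a small illustration but not for millions of records.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def generalizes(heldout_auc, reference_auc=0.875, margin=0.05):
    """True if held-out AUC stays within `margin` of the reference.

    The margin is an assumed threshold, not a value from the paper.
    """
    return heldout_auc >= reference_auc - margin

# Toy held-out labels (1 = delayed) and model scores, invented for illustration.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.4]
auc = roc_auc(labels, scores)
```

For the evaluation to count as a later-year hold-out, the labels and scores would need to come from flights dated strictly after every training record.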

Figures

Figures reproduced from arXiv: 2605.07364 by Aditi J. Shelke, Nitin P. Hazarani, Renuka J. Shelke, Yash M. Kamerkar.

Figure 1: Pipeline Order
  • Origin snowfall (origin_snow): Total daily snowfall in inches. Snow produces the most operationally severe delay effects: deicing requirements add 20–45 minutes of gate time per departure, and active snowfall triggers ground stops.
  • Origin maximum temperature (origin_tmax): Daily maximum temperature in °F. Extreme heat reduces aircraft climb performance margins through density altitude …
Figure 2: Streamlit Dashboard
Figure 3: Bedrock chatbot response
E. Streamlit Dashboard. The FlightSense prediction dashboard is hosted on an EC2 t3.micro instance and provides an interactive interface for real-time delay risk assessment. The dashboard accepts a 27-feature flight query and returns a calibrated delay probability with four-tier risk classification: high (>70%), moderate (>50%), low (>30%), and on-time (≤30%). When a query is submit…
Figure 5: Three-version ablation comparison.
… is the primary driver of performance gains, not hyperparameter tuning alone. One honest limitation: the deep untuned baseline (AUC 0.886) slightly exceeds V3+HPO (AUC 0.879), indicating the 5-job HPO budget did not fully explore the search space. An expanded search of 20–50 jobs would likely close this gap. D. Feature Importance Analysis. Table IX presents the top 10 featu…
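The four-tier risk classification quoted in the Figure 3 excerpt can be written as a small threshold function. The thresholds are taken from that excerpt; the function name and the input validation are assumptions, and this is not the dashboard's actual code.

```python
def risk_tier(probability):
    """Map a delay probability in [0, 1] to one of the four quoted tiers."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability > 0.70:
        return "high"
    if probability > 0.50:
        return "moderate"
    if probability > 0.30:
        return "low"
    return "on-time"  # probability <= 0.30
```

Because every boundary is a strict inequality, a probability of exactly 0.70 falls in the moderate tier and exactly 0.30 falls in the on-time tier.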
Original abstract

Flight delays impose cascading operational and financial burdens across the aviation network, costing the U.S. economy billions of dollars annually by disrupting interconnected aircraft rotation systems. While prior machine learning approaches have demonstrated strong predictive performance, most treat upstream delays as static input variables rather than explicitly modeling how delays propagate dynamically through aircraft rotation chains, and none have deployed such systems alongside a live weather-aware conversational AI interface for end-user interaction. This paper presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction built through a progressive three-version feature engineering framework. Version 1 trains an XGBoost classifier on 11 schedule-based features establishing a baseline ROC AUC of 0.732 on 7.07 million BTS 2018 On-Time Performance records. Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, yielding the dominant performance gain (AUC 0.732 to 0.875) and surpassing the single-stage XGBoost baseline reported by Zhou (2025). Version 3 integrates five NOAA meteorological features across 10 major U.S. airports, achieving a final test set AUC of 0.879. FlightSense is deployed as a production AWS MLOps pipeline incorporating live weather ingestion via Lambda, real-time SageMaker inference, an interactive Streamlit dashboard, and an Amazon Bedrock Nova Micro conversational assistant answering natural-language delay queries via a tool-use architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction. It uses a three-version progressive feature engineering pipeline on 7.07 million BTS 2018 On-Time Performance records. Version 1 establishes a baseline with 11 schedule-based features (AUC 0.732). Version 2 adds 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, producing the main lift to AUC 0.875 and surpassing a prior single-stage XGBoost baseline. Version 3 adds five NOAA meteorological features for a final test AUC of 0.879. The system is deployed as a production AWS pipeline with live weather ingestion, SageMaker inference, a Streamlit dashboard, and an Amazon Bedrock agentic conversational interface.

Significance. If the rotation-chain features are constructed without temporal leakage, the reported AUC improvement from 0.732 to 0.875 would provide concrete evidence that explicitly modeling dynamic delay propagation through tail-number-tracked rotations adds substantial predictive value beyond static schedule features. The large-scale BTS dataset, staged ablation, and production MLOps deployment with conversational AI would together strengthen the practical contribution to aviation operations research.

major comments (2)
  1. [Abstract / Version 2] Abstract / Version 2 description: The dominant performance gain is attributed to the 11 delay propagation features built from aircraft rotation chains via tail-number tracking. However, the manuscript provides no explicit description of how these features enforce strict temporal cutoffs (e.g., using only departures and delays prior to the target flight's scheduled departure time). Without such safeguards, the AUC jump from 0.732 to 0.875 could result from lookahead bias rather than genuine causal propagation modeling.
  2. [Version 2] Version 2 feature construction: The paper must demonstrate that the 11 propagation features are computed exclusively from historical data available at prediction time for each record. If the rotation chains aggregate delays from segments occurring after the prediction timestamp or rely on post-hoc selection of the full chain, the cross-validation and test-set results would be compromised by non-causal information.
minor comments (2)
  1. [Abstract] The citation to Zhou (2025) as the single-stage XGBoost baseline should include the full reference details and a brief comparison of feature sets to allow readers to assess the claimed improvement.
  2. [Version 2] The manuscript would benefit from a table summarizing the exact definitions and computation windows for each of the 11 propagation features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight an important point about ensuring and documenting the absence of temporal leakage in the rotation-chain features. We address each major comment below and have revised the manuscript to strengthen the description of the feature construction process.

point-by-point responses
  1. Referee: [Abstract / Version 2] Abstract / Version 2 description: The dominant performance gain is attributed to the 11 delay propagation features built from aircraft rotation chains via tail-number tracking. However, the manuscript provides no explicit description of how these features enforce strict temporal cutoffs (e.g., using only departures and delays prior to the target flight's scheduled departure time). Without such safeguards, the AUC jump from 0.732 to 0.875 could result from lookahead bias rather than genuine causal propagation modeling.

    Authors: We agree that the original manuscript did not provide a sufficiently explicit description of the temporal cutoffs in the abstract or high-level overview. The feature engineering in Section 3.2 was intended to be causal, but we acknowledge the need for clearer documentation. In the revised manuscript we have expanded the abstract and added a dedicated paragraph in Section 3.2.2 that states: for every target flight with scheduled departure time T, the 11 propagation features are derived exclusively from prior flights of the same tail number whose actual departure occurred before T. This filtering is applied before any aggregation of delay statistics, ensuring no future information enters the feature vector. revision: yes

  2. Referee: [Version 2] Version 2 feature construction: The paper must demonstrate that the 11 propagation features are computed exclusively from historical data available at prediction time for each record. If the rotation chains aggregate delays from segments occurring after the prediction timestamp or rely on post-hoc selection of the full chain, the cross-validation and test-set results would be compromised by non-causal information.

    Authors: We confirm that the original implementation enforced causality by restricting each rotation chain to historical segments available at the prediction timestamp. To address the referee's request for explicit demonstration, the revised Section 3.2.2 now includes (1) a formal definition of the temporal filter, (2) a small illustrative example with actual BTS timestamps showing that only pre-T departures are included, and (3) a note that the same causal construction was used uniformly for the 5-fold cross-validation and the held-out test set. No post-hoc selection of the full chain occurs; chains are built incrementally using only records whose departure time precedes the target flight's scheduled departure. revision: yes
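The temporal filter described in the rebuttal (use only prior flights of the same tail number whose actual departure occurred before the target's scheduled departure time T) might look like the sketch below. The field names (`tail`, `actual_dep`, `dep_delay`), the function names, and the choice of a mean as the aggregate are assumptions for illustration; the revised Section 3.2.2 is not reproduced in this summary.

```python
from statistics import mean

def causal_history(flights, tail, target_sched_dep):
    """Legs of `tail` whose actual departure precedes the target's scheduled departure T."""
    return [
        f for f in flights
        if f["tail"] == tail and f["actual_dep"] < target_sched_dep
    ]

def mean_prior_delay(flights, tail, target_sched_dep):
    """Mean departure delay over the causal history; 0.0 when there is none.

    The filter runs before aggregation, so no record from T onward
    can influence the feature value.
    """
    history = causal_history(flights, tail, target_sched_dep)
    return mean(f["dep_delay"] for f in history) if history else 0.0

# Toy records with times in minutes, invented for illustration.
flights = [
    {"tail": "N101", "actual_dep": 400, "dep_delay": 10},
    {"tail": "N101", "actual_dep": 550, "dep_delay": 30},
    {"tail": "N101", "actual_dep": 700, "dep_delay": 90},  # departs after T, excluded
    {"tail": "N202", "actual_dep": 300, "dep_delay": 60},  # different aircraft, excluded
]
T = 600
feature = mean_prior_delay(flights, "N101", T)
```

A departure delay is observable once a leg has departed, which is why the sketch aggregates departure delays; an arrival delay for a leg still in the air at T would need a separate availability check.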

Circularity Check

0 steps flagged

No significant circularity; performance gains measured on held-out test data

full rationale

The paper's central results consist of empirical AUC improvements obtained by training XGBoost models on successively enriched feature sets derived from the 2018 BTS historical records and evaluating on a held-out test partition. Version 1 uses 11 schedule features (baseline AUC 0.732), Version 2 adds 11 rotation-chain features computed via tail-number tracking, and Version 3 adds meteorological variables. These steps are standard supervised-learning feature engineering followed by out-of-sample evaluation; the reported lift is not equivalent to any input by construction, nor does any load-bearing claim reduce to a self-citation or a fitted parameter that is then relabeled as a prediction. No equations or uniqueness theorems are invoked that would create definitional circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of BTS tail-number data for constructing rotation chains and on standard supervised learning assumptions; no new physical entities are introduced.

free parameters (2)
  • XGBoost hyperparameters
    Model training implies selection or tuning of tree depth, learning rate, and regularization parameters, though values are not stated in the abstract.
  • Propagation feature definitions
    The exact formulas and any thresholds used to derive the 11 delay propagation features from rotation chains are data-dependent choices.
axioms (1)
  • domain assumption: BTS On-Time Performance records provide accurate tail numbers and timestamps sufficient to reconstruct actual aircraft rotations without significant missing or erroneous links.
    All propagation features depend on this dataset property holding for the 2018 records used.




Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

  1. M. Ball, C. Barnhart, M. Dresner, M. Hansen, K. Neels, A. Odoni, E. Peterson, L. Sherry, A. Trani, and B. Zou, “Total delay impact study: A comprehensive assessment of the costs and impacts of flight delay in the United States,” NEXTOR Research Report, 2010.

  2. U.S. Department of Transportation, “Air Travel Consumer Report,” January 2026.

  3. J. J. Rebollo and H. Balakrishnan, “Characterization and prediction of air traffic delays,” Transportation Research Part C: Emerging Technologies, vol. 44, pp. 231–241, 2014.

  4. B. Yu, Z. Guo, S. Asian, H. Wang, and G. Chen, “Flight delay prediction for commercial air transport: A deep learning approach,” Transportation Research Part E: Logistics and Transportation Review, vol. 125, pp. 203–221, 2019.

  5. N. Kafle and B. Zou, “Modeling flight delay propagation: A new analytical-econometric approach,” Transportation Research Part B: Methodological, vol. 93, pp. 520–542, 2016.

  6. J. Zhou, “Integrating delay-absorption capability into flight departure delay prediction,” arXiv preprint arXiv:2512.08197, George Mason University, 2025.

  7. N. Li and H. G. Yao, “A review of research on flight delay propagation: Current situation and prospect,” Journal of Advanced Transportation, vol. 2025, Article ID 4851103, 2025.

  8. G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, “Flight delay prediction based on aviation big data and machine learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 140–150, 2019.

  9. M. R. Schlichting, V. Rasmussen, H. Alazzeh, H. Liu, K. Jafari, A. F. Hardy, D. M. Asmar, and M. J. Kochenderfer, “LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool,” arXiv preprint arXiv:2503.16477, 2025.

  10. T. Phisannupawong, J. J. Damanik, and H.-L. Choi, “Flight delay prediction via cross-modality adaptation of large language models and aircraft trajectory representation,” arXiv preprint arXiv:2510.23636, KAIST, 2025.

  11. Q. Li and R. Jing, “Flight delay prediction from spatial and temporal perspective,” Expert Systems with Applications, vol. 205, p. 117662, 2022.

  12. Z. Guo, B. Yu, M. Hao, W. Wang, Y. Jiang, and F. Zong, “A novel hybrid method for flight departure delay prediction using random forest regression and maximal information coefficient,” Aerospace Science and Technology, vol. 116, p. 106822, 2021.

  13. B. Hari Chandana, N. Harshitha, D. Anwar, T. Harshitha, and G. Harshavardhan Reddy, “Forecasting flight delays using machine learning,” in Proc. 1st Int. Conf. Research and Development in Information, Communication, and Computing Technologies (ICRDICCT’25), SciTePress, pp. 735–743, 2025, doi: 10.5220/0013889300004919.

  14. M. Sun, Y. Tian, J. Li, C.-L. Wu, L. Peng, and S. Xu, “A review of network delay prediction and advances in large language models for air traffic,” Artificial Intelligence Review, vol. 59, no. 36, 2026, doi: 10.1007/s10462-025-11400-w.

  15. M. Lambelho, M. Mitici, S. Pickup, and A. Marsden, “Assessing strategic flight schedules at an airport using machine learning-based flight delay and cancellation predictions,” Journal of Air Transport Management, vol. 82, p. 101737, 2020.