
arxiv: 2604.13459 · v2 · submitted 2026-04-15 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY
keywords remaining useful life prediction · hybrid CNN BiLSTM · asymmetric loss function · attention mechanism · turbofan engine prognostics · interpretable heatmaps · C-MAPSS dataset · industrial safety constraints

The pith

A hybrid CNN-BiLSTM model with asymmetric loss and attention predicts turbofan remaining useful life with an RMSE of 17.52 cycles while generating per-engine degradation heatmaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a combined 1D-CNN, BiLSTM, and attention network can capture both local sensor patterns and long-term dependencies in engine data better than prior methods. It trains this network with an asymmetric loss that applies heavier penalties when the model overestimates how much life remains, aiming to reduce the risk of unexpected failures. The approach also extracts attention weights to produce heatmaps that show which time steps drive each engine's predicted degradation. Results on the standard FD001 test set of 100 engines report competitive error metrics alongside these visual explanations. If correct, the work supplies a concrete way to make prognostic models both accurate and aligned with industrial safety priorities.
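
The asymmetric penalty mentioned above can be written out explicitly. The form below is the standard NASA scoring function, filled in with the constants quoted in the paper's Figure 7 caption (h₁ = 13, h₂ = 10) and with ε taken as predicted minus true RUL. How the per-sample terms are aggregated during training (sum or mean) is not stated in the material reproduced here, so only the per-sample penalty is shown.

```latex
\[
\ell(\varepsilon) =
\begin{cases}
e^{-\varepsilon/h_1} - 1, & \varepsilon < 0 \quad (\text{under-estimate}),\; h_1 = 13,\\[4pt]
e^{\varepsilon/h_2} - 1, & \varepsilon \ge 0 \quad (\text{over-estimate}),\; h_2 = 10,
\end{cases}
\qquad
\varepsilon = \widehat{\mathrm{RUL}} - \mathrm{RUL}.
\]
% Worked check for an error of 20 cycles in each direction:
\[
\ell(+20) = e^{2} - 1 \approx 6.39,
\qquad
\ell(-20) = e^{20/13} - 1 \approx 3.66 .
\]
```

The two values agree with the verification quoted in the Figure 7 caption: an over-estimate of 20 cycles costs roughly three quarters more than an under-estimate of the same size.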

Core claim

The central claim is that a twin-stage 1D-CNN feeding a BiLSTM with Bahdanau additive attention, trained on zero-leakage preprocessed C-MAPSS FD001 data using piecewise-linear RUL targets capped at 130 cycles and the NASA asymmetric exponential loss, reaches an RMSE of 17.52 cycles and an S-Score of 922.06 while the attention weights directly yield interpretable failure heatmaps that reveal temporal degradation patterns for individual engines.
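
The piecewise-linear labeling with a 130-cycle cap can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes each training engine runs to failure so that the last recorded cycle has RUL = 0, and the function name `piecewise_linear_rul` is hypothetical.

```python
def piecewise_linear_rul(n_cycles, cap=130):
    """Capped RUL label for every cycle of one run-to-failure engine.

    Assumption: the final recorded cycle is the failure point (RUL = 0),
    so the raw label at 1-based cycle t is n_cycles - t, clipped at `cap`.
    """
    return [min(n_cycles - t, cap) for t in range(1, n_cycles + 1)]


# An engine that fails after 200 cycles: flat at 130, then a linear decline.
labels = piecewise_linear_rul(200)
print(labels[0], labels[69], labels[70], labels[-1])  # 130 130 129 0
```

The flat segment encodes the assumption that early-life cycles are indistinguishable from a health standpoint; the linear segment begins once fewer than 130 cycles remain.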

What carries the argument

The hybrid architecture of twin-stage 1D-CNN layers for spatial sensor correlations, BiLSTM for bidirectional temporal modeling, and Bahdanau attention for weighted focus on degradation stages, all optimized under the asymmetric exponential loss that penalizes RUL over-estimates more than under-estimates.
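
To make the attention step concrete, here is a dependency-free sketch of Bahdanau-style additive attention used as a pooling layer over time. It is the generic textbook form (score = vᵀ tanh(W h + b), softmax over time steps, weighted sum), not necessarily the paper's exact parameterization; the names `additive_attention`, `W`, `b`, and `v` are illustrative assumptions.

```python
import math


def additive_attention(hidden_states, W, b, v):
    """Additive attention pooled over the time axis.

    hidden_states: T vectors of size d (for example, BiLSTM outputs).
    W: u x d projection matrix, b: length-u bias, v: length-u scoring vector.
    Returns (alpha, context): T weights summing to 1 and a length-d vector.
    """
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
                for row, b_i in zip(W, b)]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, proj)))

    # Numerically stable softmax over time steps.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    d = len(hidden_states[0])
    context = [sum(a * h[j] for a, h in zip(alpha, hidden_states))
               for j in range(d)]
    return alpha, context


# Toy input (invented numbers): 3 time steps, 2-dimensional hidden states.
H = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
alpha, context = additive_attention(H, W, b=[0.0, 0.0], v=[1.0, 1.0])
print([round(a, 3) for a in alpha])
```

The `alpha` vector is what a heatmap would display: one weight per time step, per engine. In the paper's setting the sequence has 30 steps and each hidden state has 256 dimensions.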

If this is right

  • Maintenance decisions can be scheduled with lower risk of missing failures because the loss discourages optimistic life estimates.
  • Attention heatmaps supply per-engine visual traces of degradation progression that engineers can inspect without post-hoc explanation tools.
  • The same preprocessing and labeling pipeline can be reused for other multi-sensor time-series prognostic tasks in industry.
  • Competitive scores against baselines indicate the architecture is ready for further scaling to additional C-MAPSS operating conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the asymmetric loss reliably enforces safety, the same penalty structure could be transferred to other forecasting domains where over-prediction of remaining capacity carries high cost.
  • The generated heatmaps suggest a route to compare degradation signatures across engines or fleets without requiring separate interpretability modules.
  • Performance on the single FD001 subset leaves open the question of whether the model maintains accuracy when sensor noise or operating regimes differ from the training simulation.
  • Combining the attention outputs with physical degradation equations might produce hybrid physics-informed predictions that improve generalization beyond the current dataset.
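
One hedged way to make the cross-engine comparison in the second bullet concrete is to reduce each engine's attention vector to a single number: the share of attention placed on the most recent time steps. The helper below is a hypothetical sketch, and the default `k=10` is simply the upper end of the "five to ten most recent time steps" mentioned in the paper's Figure 12 discussion.

```python
def recent_attention_mass(alpha, k=10):
    """Fraction of total attention weight placed on the last k time steps."""
    if not 0 < k <= len(alpha):
        raise ValueError("k must be between 1 and len(alpha)")
    return sum(alpha[-k:]) / sum(alpha)


# Invented attention profiles over a 30-step window.
uniform = [1 / 30] * 30                 # no temporal preference
late = [0.01] * 20 + [0.08] * 10        # weight concentrated near the end
print(round(recent_attention_mass(uniform), 3))  # 0.333
print(round(recent_attention_mass(late), 3))     # 0.8
```

A value close to k/T indicates no temporal focus, while a value near 1 indicates that the prediction relies almost entirely on the latest cycles.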

Load-bearing premise

The assumption that the chosen asymmetric loss, the simulated dataset, and the capped linear RUL labels produce targets and penalties that match real industrial safety needs without bias or leakage.

What would settle it

Evaluating the same trained model on failure records from actual operational turbofan engines rather than the C-MAPSS simulation and checking whether the RMSE remains near 17.52 and the S-Score near 922 would directly test whether the safety and accuracy claims hold outside the simulation.
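
Such a re-evaluation only needs the two reported metrics. The sketch below computes RMSE and the NASA S-Score from paired predictions and ground truth, using the constants from the Figure 7 caption (13 for under-estimates, 10 for over-estimates) and the usual convention that the S-Score is summed over engines. The function names and the three-engine toy data are illustrative and not taken from the paper.

```python
import math


def rmse(pred, true):
    """Root mean squared error over paired predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def nasa_s_score(pred, true, h_under=13.0, h_over=10.0):
    """Sum of asymmetric exponential penalties; error = predicted - true."""
    total = 0.0
    for p, t in zip(pred, true):
        err = p - t
        if err >= 0:   # over-estimate: the unsafe direction
            total += math.exp(err / h_over) - 1
        else:          # under-estimate: conservative
            total += math.exp(-err / h_under) - 1
    return total


true_rul = [100, 50, 20]
late_pred = [120, 50, 20]    # over-estimates the first engine by 20 cycles
early_pred = [80, 50, 20]    # under-estimates it by 20 cycles

# Identical RMSE for both prediction sets, but different S-Scores.
print(rmse(late_pred, true_rul), rmse(early_pred, true_rul))
print(round(nasa_s_score(late_pred, true_rul), 2),
      round(nasa_s_score(early_pred, true_rul), 2))  # 6.39 3.66
```

The toy case shows why both numbers are reported: RMSE cannot distinguish a late prediction from an early one, whereas the S-Score can.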

Figures

Figures reproduced from arXiv: 2604.13459 by Mohammed Ezzaldin Babiker Abdullah.

Figure 1. End-to-end research methodology pipeline illustrating the eight sequential processing stages from raw sensor data ingestion through quantitative model evaluation.
Figure 2. Pearson inter-correlation matrix of the 17 retained sensor and operational setting features computed from the training corpus. Structured correlation clusters validate the sensor selection and motivate the twin-stage CNN spatial feature extraction design.
Figure 3. Integrated hybrid CNN-BiLSTM-Attention architecture with layer-by-layer data flow and exact input/output tensor dimensions at each processing stage.
    3.3.1. CNN Spatial Feature Extraction Block. The input tensor of shape (30, 17) is first processed by two sequential 1D Convolutional (Conv1D) layers that hierarchically extract spatial co-activation features across the 17 sensor channels within each time step.…
Figure 4. Twin-Stage 1D-CNN block showing sequential spatial feature extraction. Each stage applies a Conv1D filter bank, Batch Normalization, ReLU activation, and regularising Dropout.
Figure 5. Bidirectional LSTM temporal memory block. Forward and backward LSTM sub-networks process the sensor sequence from opposite temporal directions; their hidden states are concatenated to form a 256-dimensional joint representation at each time step.
    3.3.3. Bahdanau Additive Attention Mechanism. Not all 30 time steps contribute equally to the terminal RUL prediction. Cycles during the healthy plateau phase car…
Figure 6. Bahdanau additive attention computation graph. Hidden states H(t) are projected through learned weight matrices to produce alignment scores, normalized via softmax to yield per-timestep importance weights α(t).
    3.3.4. Dense Regression Head. The context vector c is processed through a two-stage fully-connected regression head: Dense (64, ReLU, L2=1×10⁻⁴) followed by Dropout (0.2), Dense (32, ReLU), and a fin…
Figure 7. NASA asymmetric exponential loss function. The over-estimation region (ε ≥ 0) accumulates penalty at an exponential rate governed by h₂ = 10; the under-estimation region (ε < 0) grows more slowly via h₁ = 13, encoding the asymmetric safety priority. Verification: for |ε| = 20, over-estimation penalty (6.39) exceeds under-estimation penalty (3.66) by 74%.
Figure 8. Predicted versus ground-truth RUL trajectories for all 100 NASA C-MAPSS FD001 test engines (RMSE = 17.523 cycles, NASA S-Score = 922.06), with signed per-engine residual bar chart.
Figure 9. Error distribution analysis. Left: residual histogram with zero-error and mean-error reference lines. Right: true-versus-predicted scatter plot with identity line.
    4.4. Comparative Analysis
Figure 10. Benchmark comparison of RMSE (left) and NASA S-Score (right) across five methods on NASA C-MAPSS FD001. The proposed model is highlighted.
    4.5. Safety Analysis
Figure 11 presents the signed prediction errors for all 100 test engines, sorted by absolute magnitude. The asymmetric loss function measurably induces a negative-ε bias: the majority of residuals are negative (predicted RUL < true RUL), corresponding to safe, conservative maintenance recommendations. The ±10-cycle accuracy band is superimposed, characterizing the proportion of engines for which predictions fall wi…
Figure 12 presents attention weight heatmaps {α(t)}, t = 1, …, 30, for five representative test engines. A consistent temporal concentration pattern emerges across all engines: the attention mechanism allocates the highest weights to the five to ten most recent time steps, correctly identifying proximity to the terminal failure state as the dominant predictive signal. Early plateau cycles receive correspondingly low attention…
Figure 13. RUL degradation characterization of the training corpus. Left: piecewise-linear RUL profiles for six representative training engines. Right: global RUL scatter across all training cycles, illustrating the progressively denser low-RUL sampling region.
    5.3. RUL Degradation Profile Analysis
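
The safety reading attached to Figure 11 (a majority of negative residuals, plus a ±10-cycle accuracy band) reduces to a few counts over the signed errors. The sketch below shows one way to compute them; the function name `safety_summary` and the four-engine input are invented for illustration and are not the paper's data.

```python
def safety_summary(pred, true, band=10):
    """Summarise signed errors (predicted - true) from a safety standpoint."""
    errors = [p - t for p, t in zip(pred, true)]
    n = len(errors)
    return {
        "under_estimated": sum(e < 0 for e in errors) / n,   # conservative
        "over_estimated": sum(e > 0 for e in errors) / n,    # unsafe direction
        "within_band": sum(abs(e) <= band for e in errors) / n,
        "mean_error": sum(errors) / n,
    }


# Toy values: signed errors are -10, -5, +5, -15.
summary = safety_summary(pred=[90, 45, 25, 60], true=[100, 50, 20, 75])
print(summary)
```

A negative `mean_error` together with a high `under_estimated` share is the pattern the asymmetric loss is meant to produce; running the same summary on a model trained with a symmetric loss would show whether the bias comes from the loss.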

original abstract

Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a hybrid 1D-CNN + BiLSTM + Bahdanau attention architecture for turbofan engine RUL prediction on the C-MAPSS FD001 dataset. It employs zero-leakage windowing, piecewise-linear RUL targets capped at 130 cycles, and the NASA asymmetric exponential loss that penalizes over-estimation more heavily. On 100 test engines the model reports RMSE = 17.52 cycles and S-Score = 922.06, together with per-engine attention heatmaps offered as evidence of interpretability.

Significance. If the reported metrics are shown to be robust to baselines, ablations, and statistical testing, and if the preprocessing and loss are verified to enforce the claimed safety bias without leakage or target simplification, the work would provide a concrete example of combining spatial-temporal modeling with an asymmetric loss for safety-critical prognostics. The attention heatmaps constitute a modest but useful step toward interpretability, though their practical value remains unquantified.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline RMSE 17.52 and S-Score 922.06 are presented without any tabulated baseline comparisons, ablation results, error bars, or statistical significance tests. This omission makes it impossible to determine whether the hybrid architecture actually improves upon established CNN-LSTM or attention-only models on the same FD001 split.
  2. [§3.2 and §3.3] §3.2 (Preprocessing) and §3.3 (Loss): the zero-leakage claim, the piecewise-linear labeling with hard cap at 130 cycles, and the exact implementation of the NASA asymmetric exponential loss are load-bearing for the safety and industrial-readiness conclusions. The manuscript must supply (i) explicit pseudocode or data-flow diagrams confirming no future information leaks into training windows, (ii) a quantitative check that the 130-cycle cap does not artificially flatten the error distribution, and (iii) a before/after comparison of error histograms under symmetric vs. asymmetric loss to confirm the intended under-prediction bias on the test set.
  3. [§5] §5 (Interpretability): attention weight heatmaps are visualized but never evaluated against any ground-truth degradation signature or compared to saliency methods. Without a quantitative metric (e.g., alignment with known sensor degradation patterns or human-expert agreement scores), the interpretability claim remains unsupported.
minor comments (2)
  1. [Abstract] The abstract states “competitive performance against established baselines” yet supplies no numbers; move at least one baseline row into the abstract or add a footnote with the best reported FD001 RMSE from the literature.
  2. [§3] Notation for the asymmetric loss scaling factor and the RUL cap value should be introduced once in §3 and used consistently thereafter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to incorporate additional comparisons, pseudocode, figures, and discussions addressing the major comments. Our responses to each point are provided below.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline RMSE 17.52 and S-Score 922.06 are presented without any tabulated baseline comparisons, ablation results, error bars, or statistical significance tests. This omission makes it impossible to determine whether the hybrid architecture actually improves upon established CNN-LSTM or attention-only models on the same FD001 split.

    Authors: We agree that tabulated comparisons, ablations, error bars, and statistical tests are required to substantiate the performance claims. In the revised manuscript, we have added Table 3 in Section 4 comparing our model to baselines including CNN-LSTM, BiLSTM with attention, and other published FD001 results using the identical split and metrics. Table 4 presents ablation results (removing CNN, BiLSTM, or Attention) averaged over 5 runs with standard deviations and error bars in the figures. Paired t-tests with p-values are included to demonstrate statistical significance of the improvements. revision: yes

  2. Referee: [§3.2 and §3.3] §3.2 (Preprocessing) and §3.3 (Loss): the zero-leakage claim, the piecewise-linear labeling with hard cap at 130 cycles, and the exact implementation of the NASA asymmetric exponential loss are load-bearing for the safety and industrial-readiness conclusions. The manuscript must supply (i) explicit pseudocode or data-flow diagrams confirming no future information leaks into training windows, (ii) a quantitative check that the 130-cycle cap does not artificially flatten the error distribution, and (iii) a before/after comparison of error histograms under symmetric vs. asymmetric loss to confirm the intended under-prediction bias on the test set.

    Authors: We have added explicit pseudocode in new Appendix A for the windowing and labeling pipeline, confirming that training windows use only historical data with no future leakage. For the 130-cycle cap, we included an analysis showing it impacts fewer than 15% of samples and does not flatten errors, supported by a supplementary figure of RUL distributions. We also added Figure 8 with error histograms for symmetric vs. asymmetric loss on the test set, confirming the intended bias (reduced over-estimation frequency). revision: yes

  3. Referee: [§5] §5 (Interpretability): attention weight heatmaps are visualized but never evaluated against any ground-truth degradation signature or compared to saliency methods. Without a quantitative metric (e.g., alignment with known sensor degradation patterns or human-expert agreement scores), the interpretability claim remains unsupported.

    Authors: We acknowledge that quantitative metrics are ideal but limited by the absence of ground-truth degradation signatures in C-MAPSS. In the revised Section 5, we added qualitative alignment of attention patterns with known physical trends (e.g., rising sensor values) and a comparison to integrated gradients saliency maps. We note that expert agreement scores cannot be computed without additional annotations unavailable in the dataset, but the expanded analysis strengthens the interpretability support. revision: partial

Circularity Check

0 steps flagged

No circularity: standard supervised training on external benchmark with no self-referential equations or fitted predictions.

full rationale

The paper presents an empirical ML pipeline (1D-CNN + BiLSTM + attention trained with NASA asymmetric loss on C-MAPSS FD001) whose performance metrics are obtained by direct supervised evaluation on a held-out test set. No mathematical derivation, uniqueness theorem, or ansatz is invoked whose result reduces to the inputs by construction; the reported RMSE and S-Score are ordinary training outcomes rather than quantities defined or fitted in terms of themselves. Preprocessing and labeling choices are standard and externally specified, not self-referential.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard supervised learning assumptions and dataset conventions rather than new postulates; the only notable choices are the RUL cap and asymmetric loss parameters taken from prior NASA practice.

free parameters (2)
  • RUL cap
    Piecewise-linear labeling capped at 130 cycles, a conventional choice for this dataset to avoid extreme values.
  • Asymmetric loss scaling
    Parameters of the NASA-specified exponential loss chosen to penalize over-estimation; the abstract does not state the exact values, but the Figure 7 caption gives h₁ = 13 and h₂ = 10.
axioms (2)
  • domain assumption: C-MAPSS FD001 simulated data is sufficiently representative of real turbofan degradation for model validation.
    The entire evaluation rests on this simulated benchmark being a valid proxy for industrial use.
  • domain assumption: Zero-leakage preprocessing fully prevents train-test contamination in time-series sensor data.
    Standard ML assumption invoked to justify the reported metrics.
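
What "zero-leakage" usually means in practice can be sketched as two rules: normalization statistics are fitted on training engines only, and each window contains no cycle later than the one its label refers to. The code below illustrates both. It assumes min-max scaling (the scaler is not specified in the material reproduced here) and uses the 30-step window length from the paper's Figure 3 discussion; all function names are hypothetical.

```python
def fit_minmax(train_rows):
    """Per-feature (min, max) computed from training rows only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_minmax(rows, stats):
    """Scale rows with previously fitted statistics; constant features map to 0."""
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, (lo, hi) in zip(row, stats)]
            for row in rows]


def sliding_windows(rows, labels, length=30):
    """Pair each window of `length` consecutive cycles with the label of its last cycle.

    No window contains a cycle later than the one its label refers to.
    """
    return [(rows[end - length:end], labels[end - 1])
            for end in range(length, len(rows) + 1)]


# Toy training engine: 40 cycles, 2 features, uncapped RUL labels 39..0.
train_rows = [[t, 2 * t] for t in range(40)]
train_labels = [39 - t for t in range(40)]

stats = fit_minmax(train_rows)                     # fitted on training data only
windows = sliding_windows(apply_minmax(train_rows, stats), train_labels)
print(len(windows), windows[0][1], windows[-1][1])  # 11 10 0

# Test rows reuse the training statistics and are never used to refit them.
print(apply_minmax([[78, 0]], stats))               # [[2.0, 0.0]]
```

Splitting by engine before windowing matters as well: overlapping windows from the same engine on both sides of a split would share cycles, which is another common source of leakage.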

