Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps

Mohammed Ezzaldin Babiker Abdullah

arXiv: 2604.13459 · v2 · submitted 2026-04-15 · cs.LG · cs.AI · cs.SY · eess.SY

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.SY · eess.SY

keywords remaining useful life prediction · hybrid CNN BiLSTM · asymmetric loss function · attention mechanism · turbofan engine prognostics · interpretable heatmaps · C-MAPSS dataset · industrial safety constraints

The pith

A hybrid CNN-BiLSTM model with asymmetric loss and attention predicts turbofan remaining useful life at 17.52 cycles RMSE while generating per-engine degradation heatmaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a combined 1D-CNN, BiLSTM, and attention network can capture both local sensor patterns and long-term dependencies in engine data better than prior methods. It trains this network with an asymmetric loss that applies heavier penalties when the model overestimates how much life remains, aiming to reduce the risk of unexpected failures. The approach also extracts attention weights to produce heatmaps that show which time steps drive each engine's predicted degradation. Results on the standard FD001 test set of 100 engines report competitive error metrics alongside these visual explanations. If correct, the work supplies a concrete way to make prognostic models both accurate and aligned with industrial safety priorities.

Core claim

The central claim is that a twin-stage 1D-CNN feeding a BiLSTM with Bahdanau additive attention, trained on zero-leakage preprocessed C-MAPSS FD001 data using piecewise-linear RUL targets capped at 130 cycles and the NASA asymmetric exponential loss, reaches an RMSE of 17.52 cycles and an S-Score of 922.06 while the attention weights directly yield interpretable failure heatmaps that reveal temporal degradation patterns for individual engines.
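
As a minimal sketch of the piecewise-linear labeling mentioned above, the snippet below builds capped RUL targets for one run-to-failure training engine. The 130-cycle cap comes from the source; the function name `piecewise_linear_rul` and the convention that the last recorded cycle has RUL 0 are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

RUL_CAP = 130  # cap stated in the source


def piecewise_linear_rul(n_cycles: int, cap: int = RUL_CAP) -> np.ndarray:
    """Capped RUL targets for one engine observed for n_cycles cycles.

    Assumption: the final recorded cycle is the failure point (RUL = 0).
    """
    # Linear countdown n_cycles-1, ..., 1, 0
    linear = np.arange(n_cycles - 1, -1, -1)
    # Flat plateau at the cap while the engine is still considered healthy
    return np.minimum(linear, cap)


labels = piecewise_linear_rul(200)
print(labels[:3], labels[68:72], labels[-3:])
```

For a 200-cycle engine the first 70 targets sit on the 130-cycle plateau, after which the target decreases by one per cycle down to zero.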

What carries the argument

The hybrid architecture of twin-stage 1D-CNN layers for spatial sensor correlations, BiLSTM for bidirectional temporal modeling, and Bahdanau attention for weighted focus on degradation stages, all optimized under the asymmetric exponential loss that penalizes RUL over-estimates more than under-estimates.
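
The asymmetric penalty can be sketched as follows, using the constants quoted in the Figure 7 caption (h₁ = 13 for under-estimation, h₂ = 10 for over-estimation) and the sign convention ε = predicted − true. The function names are hypothetical, and whether the paper averages or sums the per-sample penalty during training is not stated here, so only the per-sample term and the summed S-Score are shown.

```python
import numpy as np

H_UNDER = 13.0  # h1: governs the slower growth when RUL is under-estimated (eps < 0)
H_OVER = 10.0   # h2: governs the faster growth when RUL is over-estimated (eps >= 0)


def asymmetric_penalty(y_pred, y_true):
    """Per-sample exponential penalty with eps = predicted - true."""
    eps = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    over = np.exp(eps / H_OVER) - 1.0     # used where eps >= 0
    under = np.exp(-eps / H_UNDER) - 1.0  # used where eps < 0
    return np.where(eps >= 0, over, under)


def s_score(y_pred, y_true):
    """NASA S-Score: sum of the per-sample penalties over all engines."""
    return float(asymmetric_penalty(y_pred, y_true).sum())


# Check against the Figure 7 caption for |eps| = 20 cycles
print(f"{float(asymmetric_penalty(120.0, 100.0)):.2f}")  # over-estimate by 20 -> 6.39
print(f"{float(asymmetric_penalty(80.0, 100.0)):.2f}")   # under-estimate by 20 -> 3.66
```

The two printed values match the caption's verification figures, which is a quick way to confirm that the sign convention and the two constants are wired the right way round.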

If this is right

Maintenance decisions can be scheduled with lower risk of missing failures because the loss discourages optimistic life estimates.
Attention heatmaps supply per-engine visual traces of degradation progression that engineers can inspect without post-hoc explanation tools.
The same preprocessing and labeling pipeline can be reused for other multi-sensor time-series prognostic tasks in industry.
Competitive scores against baselines indicate the architecture is ready for further scaling to additional C-MAPSS operating conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the asymmetric loss reliably enforces safety, the same penalty structure could be transferred to other forecasting domains where over-prediction of remaining capacity carries high cost.
The generated heatmaps suggest a route to compare degradation signatures across engines or fleets without requiring separate interpretability modules.
Performance on the single FD001 subset leaves open the question of whether the model maintains accuracy when sensor noise or operating regimes differ from the training simulation.
Combining the attention outputs with physical degradation equations might produce hybrid physics-informed predictions that improve generalization beyond the current dataset.

Load-bearing premise

The assumption that the chosen asymmetric loss, the simulated dataset, and the capped linear RUL labels produce targets and penalties that match real industrial safety needs without bias or leakage.

What would settle it

Evaluating the same trained model on failure records from actual operational turbofan engines rather than the C-MAPSS simulation and checking whether the RMSE remains near 17.52 and the S-Score near 922 would directly test whether the safety and accuracy claims hold outside the simulation.

Figures

Figures reproduced from arXiv: 2604.13459 by Mohammed Ezzaldin Babiker Abdullah.

**Figure 1.** End-to-end research methodology pipeline illustrating the eight sequential processing stages from raw sensor data ingestion through quantitative model evaluation.

**Figure 2.** Pearson inter-correlation matrix of the 17 retained sensor and operational setting features computed from the training corpus. Structured correlation clusters validate the sensor selection and motivate the twin-stage CNN spatial feature extraction design.

**Figure 3.** Integrated hybrid CNN-BiLSTM-Attention architecture with layer-by-layer data flow and exact input/output tensor dimensions at each processing stage.
3.3.1. CNN Spatial Feature Extraction Block. The input tensor of shape (30, 17) is first processed by two sequential 1D Convolutional (Conv1D) layers that hierarchically extract spatial co-activation features across the 17 sensor channels within each time step.…

**Figure 4.** Twin-stage 1D-CNN block showing sequential spatial feature extraction. Each stage applies a Conv1D filter bank, Batch Normalization, ReLU activation, and regularising Dropout.

**Figure 5.** Bidirectional LSTM temporal memory block. Forward and backward LSTM sub-networks process the sensor sequence from opposite temporal directions; their hidden states are concatenated to form a 256-dimensional joint representation at each time step.
3.3.3. Bahdanau Additive Attention Mechanism. Not all 30 time steps contribute equally to the terminal RUL prediction. Cycles during the healthy plateau phase car…

**Figure 6.** Bahdanau additive attention computation graph. Hidden states H(t) are projected through learned weight matrices to produce alignment scores, normalized via softmax to yield per-timestep importance weights α(t).
3.3.4. Dense Regression Head. The context vector c is processed through a two-stage fully-connected regression head: Dense (64, ReLU, L2=1×10⁻⁴) followed by Dropout (0.2), Dense (32, ReLU), and a fin…

**Figure 7.** NASA asymmetric exponential loss function. The over-estimation region (ε ≥ 0) accumulates penalty at an exponential rate governed by h₂ = 10; the under-estimation region (ε < 0) grows more slowly via h₁ = 13, encoding the asymmetric safety priority. Verification: for |ε| = 20, over-estimation penalty (6.39) exceeds under-estimation penalty (3.66) by 74%.

**Figure 8.** Predicted versus ground-truth RUL trajectories for all 100 NASA C-MAPSS FD001 test engines (RMSE = 17.523 cycles, NASA S-Score = 922.06), with signed per-engine residual bar chart.

**Figure 9.** Error distribution analysis. Left: residual histogram with zero-error and mean-error reference lines. Right: true-versus-predicted scatter plot with identity line.

**Figure 10.** Benchmark comparison of RMSE (left) and NASA S-Score (right) across five methods on NASA C-MAPSS FD001. The proposed model is highlighted.

**Figure 11.** Figure 11 presents the signed prediction errors for all 100 test engines, sorted by absolute magnitude. The asymmetric loss function measurably induces a negative-ε bias: the majority of residuals are negative (predicted RUL < true RUL), corresponding to safe, conservative maintenance recommendations. The ±10-cycle accuracy band is superimposed, characterizing the proportion of engines for which predictions fall wi…

**Figure 12.** Figure 12 presents attention weight heatmaps {α(t)}, t = 1 to 30, for five representative test engines. A consistent temporal concentration pattern emerges across all engines: the attention mechanism allocates the highest weights to the five to ten most recent time steps, correctly identifying proximity to the terminal failure state as the dominant predictive signal. Early plateau cycles receive correspondingly low attention…

**Figure 13.** RUL degradation characterization of the training corpus. Left: piecewise-linear RUL profiles for six representative training engines. Right: global RUL scatter across all training cycles, illustrating the progressively denser low-RUL sampling region.
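
The additive attention pooling described in the Figure 5 and Figure 6 captions can be sketched in NumPy as below. The sequence length (30) and hidden size (256) come from those captions; the projection width `u = 64`, the random weights, and the query-free form (a single learned projection followed by a scoring vector) are assumptions for illustration and may differ from the paper's custom layer.

```python
import numpy as np


def additive_attention(H, W, b, v):
    """Pool hidden states H (T, d) into one context vector.

    W: (d, u) projection, b: (u,) bias, v: (u,) scoring vector.
    Returns (context of shape (d,), weights alpha of shape (T,)).
    """
    scores = np.tanh(H @ W + b) @ v   # one alignment score per time step
    scores = scores - scores.max()    # shift for a numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H               # attention-weighted sum of hidden states
    return context, alpha


rng = np.random.default_rng(0)
T, d, u = 30, 256, 64                 # u is a hypothetical projection width
H = rng.standard_normal((T, d))       # stand-in for BiLSTM outputs
W = rng.standard_normal((d, u)) * 0.05
b = np.zeros(u)
v = rng.standard_normal(u)

context, alpha = additive_attention(H, W, b, v)
print(context.shape, alpha.shape, round(float(alpha.sum()), 6))
```

Stacking the `alpha` vectors of several engines row by row gives the kind of engine-by-time-step heatmap that the Figure 12 caption describes.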

Original abstract

Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard hybrid CNN-BiLSTM-attention stack on C-MAPSS FD001 with asymmetric loss and heatmaps, but missing ablations and baselines leave the performance edge unclear.

This paper takes the usual 1D-CNN for local sensor patterns, BiLSTM for sequences, Bahdanau attention, and the NASA asymmetric exponential loss, then runs it on the FD001 turbofan data with piecewise RUL capped at 130 cycles and zero-leakage windowing. It reports RMSE 17.52 and S-Score 922 on the 100 test engines plus attention weight heatmaps.

The asymmetric loss and the heatmaps are the parts that line up with real maintenance needs: penalizing over-prediction of life is the right safety bias, and the maps let you see which time steps drive the degradation timeline for each engine. Those choices are sensible and executed without obvious circularity or invented math. The work is otherwise a direct assembly of components already common in the RUL literature, so the main output is an empirical report rather than a new method or derivation.

The soft spots sit in the validation. The abstract and results give headline numbers but no side-by-side runs against published baselines under identical preprocessing, no ablation tables showing what the attention or the hybrid actually buys, and no error bars or significance checks. The claim that the heatmaps support informed decisions is asserted from the visuals alone; there is no user study or proxy metric that measures whether they improve maintenance choices. The stress-test points on leakage, the 130-cycle cap, and whether the loss actually produces the desired under-prediction bias on held-out engines are fair to raise until the paper shows the error distributions or sensitivity checks.

This is useful for applied teams in aerospace or industrial prognostics who need a working model with some built-in explainability and are willing to re-validate it themselves. Researchers looking for architectural novelty or tightly controlled experiments will find less here. It is worth sending to peer review once the authors add the missing comparisons and checks, because the application domain is practical and the core implementation looks honest.

Referee Report

3 major / 2 minor

Summary. The paper proposes a hybrid 1D-CNN + BiLSTM + Bahdanau attention architecture for turbofan engine RUL prediction on the C-MAPSS FD001 dataset. It employs zero-leakage windowing, piecewise-linear RUL targets capped at 130 cycles, and the NASA asymmetric exponential loss that penalizes over-estimation more heavily. On 100 test engines the model reports RMSE = 17.52 cycles and S-Score = 922.06, together with per-engine attention heatmaps offered as evidence of interpretability.

Significance. If the reported metrics are shown to be robust to baselines, ablations, and statistical testing, and if the preprocessing and loss are verified to enforce the claimed safety bias without leakage or target simplification, the work would provide a concrete example of combining spatial-temporal modeling with an asymmetric loss for safety-critical prognostics. The attention heatmaps constitute a modest but useful step toward interpretability, though their practical value remains unquantified.

major comments (3)

[Abstract and §4 (Experiments)] The headline RMSE 17.52 and S-Score 922.06 are presented without any tabulated baseline comparisons, ablation results, error bars, or statistical significance tests. This omission makes it impossible to determine whether the hybrid architecture actually improves upon established CNN-LSTM or attention-only models on the same FD001 split.
[§3.2 (Preprocessing) and §3.3 (Loss)] The zero-leakage claim, the piecewise-linear labeling with hard cap at 130 cycles, and the exact implementation of the NASA asymmetric exponential loss are load-bearing for the safety and industrial-readiness conclusions. The manuscript must supply (i) explicit pseudocode or data-flow diagrams confirming no future information leaks into training windows, (ii) a quantitative check that the 130-cycle cap does not artificially flatten the error distribution, and (iii) a before/after comparison of error histograms under symmetric vs. asymmetric loss to confirm the intended under-prediction bias on the test set.
[§5 (Interpretability)] Attention weight heatmaps are visualized but never evaluated against any ground-truth degradation signature or compared to saliency methods. Without a quantitative metric (e.g., alignment with known sensor degradation patterns or human-expert agreement scores), the interpretability claim remains unsupported.

minor comments (2)

[Abstract] The abstract states “competitive performance against established baselines” yet supplies no numbers; move at least one baseline row into the abstract or add a footnote with the best reported FD001 RMSE from the literature.
[§3] Notation for the asymmetric loss scaling factor and the RUL cap value should be introduced once in §3 and used consistently thereafter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to incorporate additional comparisons, pseudocode, figures, and discussions addressing the major comments. Our responses to each point are provided below.

Referee: [Abstract and §4 (Experiments)] The headline RMSE 17.52 and S-Score 922.06 are presented without any tabulated baseline comparisons, ablation results, error bars, or statistical significance tests. This omission makes it impossible to determine whether the hybrid architecture actually improves upon established CNN-LSTM or attention-only models on the same FD001 split.

Authors: We agree that tabulated comparisons, ablations, error bars, and statistical tests are required to substantiate the performance claims. In the revised manuscript, we have added Table 3 in Section 4 comparing our model to baselines including CNN-LSTM, BiLSTM with attention, and other published FD001 results using the identical split and metrics. Table 4 presents ablation results (removing CNN, BiLSTM, or Attention) averaged over 5 runs with standard deviations and error bars in the figures. Paired t-tests with p-values are included to demonstrate statistical significance of the improvements.
Revision: yes
Referee: [§3.2 (Preprocessing) and §3.3 (Loss)] The zero-leakage claim, the piecewise-linear labeling with hard cap at 130 cycles, and the exact implementation of the NASA asymmetric exponential loss are load-bearing for the safety and industrial-readiness conclusions. The manuscript must supply (i) explicit pseudocode or data-flow diagrams confirming no future information leaks into training windows, (ii) a quantitative check that the 130-cycle cap does not artificially flatten the error distribution, and (iii) a before/after comparison of error histograms under symmetric vs. asymmetric loss to confirm the intended under-prediction bias on the test set.

Authors: We have added explicit pseudocode in new Appendix A for the windowing and labeling pipeline, confirming that training windows use only historical data with no future leakage. For the 130-cycle cap, we included an analysis showing it impacts fewer than 15% of samples and does not flatten errors, supported by a supplementary figure of RUL distributions. We also added Figure 8 with error histograms for symmetric vs. asymmetric loss on the test set, confirming the intended bias (reduced over-estimation frequency).
Revision: yes
Referee: [§5 (Interpretability)] Attention weight heatmaps are visualized but never evaluated against any ground-truth degradation signature or compared to saliency methods. Without a quantitative metric (e.g., alignment with known sensor degradation patterns or human-expert agreement scores), the interpretability claim remains unsupported.

Authors: We acknowledge that quantitative metrics are ideal but limited by the absence of ground-truth degradation signatures in C-MAPSS. In the revised Section 5, we added qualitative alignment of attention patterns with known physical trends (e.g., rising sensor values) and a comparison to integrated gradients saliency maps. We note that expert agreement scores cannot be computed without additional annotations unavailable in the dataset, but the expanded analysis strengthens the interpretability support.
Revision: partial
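
The leakage question raised in the exchange above can be made concrete with a small sketch. It assumes a 30-step window over 17 features (the input shape quoted in the Figure 3 caption), min-max scaling whose statistics are fitted on training rows only, and a label taken from the last cycle of each window. The helper names `fit_minmax` and `make_windows` and the choice of min-max scaling are hypothetical; the paper's actual pipeline is not reproduced here.

```python
import numpy as np

WINDOW = 30  # window length from the Figure 3 caption


def fit_minmax(train_features):
    """Scaling statistics computed from training rows only (no test data)."""
    lo = train_features.min(axis=0)
    hi = train_features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return lo, span


def make_windows(features, rul, window=WINDOW):
    """Slide a window over one engine; each label is the RUL at the window's last cycle."""
    n = features.shape[0]
    if n < window:
        return np.empty((0, window, features.shape[1])), np.empty((0,))
    X = np.stack([features[i:i + window] for i in range(n - window + 1)])
    y = rul[window - 1:]  # the window never sees cycles after its label's cycle
    return X, y


rng = np.random.default_rng(0)
train = rng.standard_normal((50, 17))         # stand-in for one 50-cycle training engine
rul = np.minimum(np.arange(49, -1, -1), 130)  # capped countdown targets
lo, span = fit_minmax(train)
X, y = make_windows((train - lo) / span, rul)
print(X.shape, y.shape, y[0], y[-1])
```

Test engines would be scaled with the same `lo` and `span`, so no statistic derived from test data reaches the model.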

Circularity Check

0 steps flagged

No circularity: standard supervised training on external benchmark with no self-referential equations or fitted predictions.

The paper presents an empirical ML pipeline (1D-CNN + BiLSTM + attention trained with NASA asymmetric loss on C-MAPSS FD001) whose performance metrics are obtained by direct supervised evaluation on a held-out test set. No mathematical derivation, uniqueness theorem, or ansatz is invoked whose result reduces to the inputs by construction; the reported RMSE and S-Score are ordinary training outcomes rather than quantities defined or fitted in terms of themselves. Preprocessing and labeling choices are standard and externally specified, not self-referential.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard supervised learning assumptions and dataset conventions rather than new postulates; the only notable choices are the RUL cap and asymmetric loss parameters taken from prior NASA practice.

free parameters (2)

RUL cap: piecewise-linear labeling capped at 130 cycles, a conventional choice for this dataset to avoid extreme values.
Asymmetric loss scaling: parameters of the NASA-specified exponential loss chosen to penalize over-estimation; exact values not stated in abstract.

axioms (2)

Domain assumption: C-MAPSS FD001 simulated data is sufficiently representative of real turbofan degradation for model validation. The entire evaluation rests on this simulated benchmark being a valid proxy for industrial use.
Domain assumption: Zero-leakage preprocessing fully prevents train-test contamination in time-series sensor data. Standard ML assumption invoked to justify the reported metrics.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 2 internal anchors

[1] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization [Preprint]. arXiv. arXiv:1607.06450. Babu, G. S., Zhao, P., & Li, X. L. (2016). Deep convolutional neural network based regression approach for estimation of remaining useful life. In S. B. Navathe, W. Wu, S. Shekhar, X. Du, X. S. Wang, & H. Xiong (Eds.), Database Systems for Advanced Ap... doi:10.1007/978-3-319-32025-0

[2] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (Vol. 37, pp. 448–456). PMLR. Khelif, R., Chebel-Morello, B., Malinowski, S., Laajili, E., Fnaiech, F., & Zerhouni, N. (2016). Direct remaining useful life es...

[3] Zheng, S., Ristovski, K., Farahat, A., & Gupta, C. (2017). Long short-term memory network for remaining useful life estimation. In Proceedings of the 2017 IEEE International Conference on Prognostics and Health Management (pp. 88–95). IEEE.
