## Abstract

Prognostics and health management (PHM) of bearings is crucial for reducing the risk of failure and the cost of maintenance for rotating machinery. Model-based prognostic methods develop closed-form mathematical models based on underlying physics. However, the physics of complex bearing failures under varying operating conditions is not well understood yet. To complement model-based prognostics, data-driven methods have been increasingly used to predict the remaining useful life (RUL) of bearings. As opposed to other machine learning methods, ensemble learning methods can achieve higher prediction accuracy by combining multiple learning algorithms of different types. The rationale behind ensemble learning is that higher performance can be achieved by combining base learners that overestimate and underestimate the RUL of bearings. However, building an effective ensemble remains a challenge. To address this issue, the impact of diversity in base learners and extracted features in different degradation stages on the performance of ensemble learning is investigated. The degradation process of bearings is classified into three stages, including normal wear, smooth wear, and severe wear, based on the root-mean-square (RMS) of vibration signals. To evaluate the impact of diversity on prediction performance, vibration data collected from rolling element bearings was used to train predictive models. Experimental results have shown that the performance of the proposed ensemble learning method is significantly improved by selecting diverse features and base learners in different degradation stages.

## 1 Introduction

Bearing faults constitute up to 44% of the total faults in large induction motors [1]. Common causes of bearing failure include inappropriate lubrication, misalignment, load imbalance, fatigue, corrosion, vibrations, and excessive temperature [2]. Prognostics and health management (PHM) of bearings is crucial for reducing unplanned machine downtime for rotating machinery as well as for improving system safety and reliability [3–8]. PHM techniques for bearings can be classified into two categories: model-based and data-driven methods [5,9,10]. Model-based methods such as Kalman filter [11,12] and particle filter [2] develop closed-form models based on underlying physics. In contrast, data-driven PHM methods such as artificial neural networks [3], deep convolution learning [13], relevance vector machines [14], and principal component analysis [15] make predictions based on hidden patterns and inference without explicit mathematical models. Because the physics of complex bearing failures under varying operating conditions is not well understood yet, data-driven methods based on machine learning have been increasingly used to predict the remaining useful life (RUL) of bearings.

Various machine learning algorithms have been demonstrated on RUL prediction of bearings. However, little research has been conducted to develop ensemble learning-based PHM approaches to predicting the RUL of bearings [16,17]. As one of the most effective machine learning algorithms, ensemble learning methods fuse machine learning algorithms of different types (also known as base learners) to achieve better prediction accuracy than the individual machine learning algorithms. The objective of this study is to develop an enhanced ensemble learning algorithm by selecting diverse base learners and features at varying degradation stages of a bearing. In particular, the impact of diversity in base learners and features on RUL prediction accuracy is investigated. We hypothesize that the accuracy and robustness of a predictive model can be improved by combining multiple weak learners as well as selecting varying features at varying degradation stages. The remainder of this paper is organized as follows: Sec. 2 presents a literature review on PHM for bearings. Section 3 presents a computational framework based on an enhanced ensemble learning algorithm. This framework consists of classification of degradation stages, dynamic base learner selection, and dynamic feature selection. Section 4 presents a case study to demonstrate the effectiveness of the proposed framework. Section 5 presents conclusion and future work.

## 2 Related Work

This section provides an overview of model-based and data-driven approaches to predicting the RUL of bearings.

Li et al. [18] proposed an improved exponential model for predicting the RUL of rolling element bearings. Particle filtering was used to reduce the random error of the stochastic bearing degradation process. The proposed method was demonstrated on the FEMTO bearing data set. Li and Liang [19] introduced an approach based on improved rescaled range (R/S) statistic and fractional Brownian motion to predict bearing degradation trends. Classical R/S methods are sensitive to heteroscedasticity and short-term dependence. To address this issue, an improved R/S statistic model with an auto-covariance estimator was introduced. The FEMTO bearing data set was used to demonstrate the proposed method. Li et al. [20] developed a stochastic defect-propagation model for predicting the RUL of rolling element bearings. An augmented stochastic differential equation system was developed by taking into account uncertainties in parameter estimation. The proposed method was demonstrated using both numerical simulations and vibration signals collected from experiments. Qian and Yan [2] developed an enhanced particle filter-based approach for predicting the RUL of rolling element bearings. Particles were used to determine an adaptive importance density function and a backpropagation neural network in each recursive step. Experimental results have shown that the proposed method outperformed traditional particle filters and support vector regression. Boskoski et al. [21] developed an approach to RUL prediction of bearings based on Rényi entropy-based features and the Gaussian process model. The FEMTO bearing data set was used to evaluate the proposed approach. Experimental results have shown that the proposed approach was capable of predicting the RUL of bearings. Singleton et al. [22] introduced an extended Kalman filter-based method for predicting the RUL of bearings. The FEMTO bearing data set was used to demonstrate that the proposed method achieved high prediction accuracy. Lei et al. 
[23] proposed a model-based method for bearing RUL prediction. A health indicator was introduced by fusing information correlated with the degradation process from multiple features. The proposed health indicator and parameters initialized by maximum likelihood estimation were used to predict the RUL of bearings using particle filtering.

Dong and Luo [15] proposed a data-driven approach to tracking the degradation process of bearings. Principal component analysis was used to fuse the features as well as to reduce data dimensionality. A least-squares support vector machine (SVM) was used to predict the degradation process using the fused features. The proposed method was demonstrated on a run-to-failure bearing data set. Ben Ali et al. [3] introduced a data-driven approach to RUL prediction of bearings by combining a simplified fuzzy adaptive resonance theory map neural network and Weibull distribution. A new feature called root-mean-square entropy estimator was introduced to track bearing degradation processes. Condition-monitoring data collected from double row bearings were used to validate the proposed method. Experimental results have shown that the proposed method achieved a high classification rate of bearing failures. Gebraeel et al. [24] proposed a neural network-based approach to RUL prediction of bearings. Experimental data were collected from a group of identical thrust bearings running at specified conditions. The results showed that the proposed method was accurate. Guo et al. [25] defined a health indicator based on a recurrent neural network (RNN) to predict the RUL of bearings. The most sensitive features were extracted based on correlation and monotonicity. Liao et al. [26] introduced a data-driven approach based on a restricted Boltzmann machine (RBM) and an unsupervised self-organizing map. The proposed method was validated using the experimental data collected from a spindle testbed. Experimental results have shown that the RBM was capable of predicting the RUL of bearings with high accuracy. Huang et al. [27] proposed a data-driven approach to RUL prediction of bearings by combining self-organizing map (SOM) and back propagation neural networks. The SOM was used to determine the minimum quantization error indicator using vibration features. The back propagation neural network was trained using the indicators. A bearing run-to-failure experiment was conducted to demonstrate the effectiveness of the proposed method. Li et al. [17] proposed an ensemble-based approach to RUL prediction by combining different algorithms with different degradation-dependent weights. A degradation-dependent weight vector was determined by minimizing the cross-validation error. A simulation data set on bearing degradation was used to demonstrate the proposed method. The results have shown that the proposed method was accurate for RUL prediction.

In summary, little research has been reported on predicting the RUL of bearings using ensemble learning by taking into account bearing degradation stages as well as the impact of diversity in base learners and features at varying degradation stages on prediction accuracy. To fill this research gap, the impact of diverse base learners and features in different degradation stages on predicting the RUL of bearings is investigated.

## 3 Methodology

Here, $A$ is an $m \times n$ matrix, $b \in \mathbb{R}^{m}$ is a column vector of response variables, $x_i \ge 0$ is the weight assigned to each base learner, and $\Vert \cdot \Vert_2$ denotes the Euclidean norm.
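For orientation, the symbols defined above match the standard non-negative least squares (NNLS) problem. Assuming that this is the formulation intended, and reading the columns of $A$ as the predictions of the individual base learners (an interpretation, not stated in the surviving text), the weights would solve:

$$
\min_{x} \; \Vert A x - b \Vert_2 \quad \text{subject to} \quad x_i \ge 0, \; i = 1, \ldots, n
$$

The non-negativity constraint keeps every base learner's contribution additive, so a learner can be switched off (weight zero) but never subtracted from the ensemble.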

$y_t$ at time $t$.

$\theta$.

The performance of the ensemble learning method is highly dependent on the selection of base learners [32]. While previous research has demonstrated that improving diversity in an ensemble can improve prediction performance [33], we conducted a systematic study on the impact of diversity in base learners and features in different degradation stages on the accuracy of RUL prediction as shown in Fig. 1.

The computational framework of the proposed ensemble learning approach is illustrated in Fig. 2. The input of the framework is health-monitoring data such as vibration signals. The diversified ensemble approach introduces a new step where the degradation process of a bearing is classified into multiple stages. The base learners and features are selected based on the degradation stages. In this study, base learners of three different types, including decision tree-based, instance-based, and linear model-based, are selected. Section 3.1 presents the classification of degradation stages of bearings. Sections 3.2 and 3.3 introduce dynamic base learner selection and dynamic feature selection.

### 3.1 Classification of Degradation Stages.

Here, $K$ is the number of change points to be detected from the input data, and $k_0$ and $k_r$ are the first and the $(r+1)$th sample of the signal, respectively. $\beta$ is a proportionality constant. By determining the empirical estimate $\chi$ and the deviation measurement $\Delta$ of the mean value of the segment divided by the $(r+1)$th sample, the deviation function $\sum_{i=m}^{n} \Delta(x_i; \chi([x_m \cdots x_n]))$ can be solved by Eq. (6).

Here, $x_m \cdots x_n$ are all the samples between the $m$th and the $n$th samples.

After *K* change points are detected, the input data will be divided into *K +* 1 segments. For each change point, the mean values of two adjacent segments near the change point will be compared. If the mean value of the segment after the change point is more than twice the mean value of the previous segment, the change point will be considered as an anomaly point and the two segments will be classified into two stages.

Here, $x_1 \ldots x_r \ldots x_n$ are all the samples.
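The stage-assignment rule described in this section (a change point starts a new stage only when the mean after it is more than twice the mean before it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `classify_stages`, the `ratio` parameter, and the toy RMS values are assumptions, and the change points are taken as already detected.

```python
from statistics import mean


def classify_stages(signal, change_points, ratio=2.0):
    """Assign a stage index (starting at 1) to every sample.

    signal        -- sequence of feature values, e.g. RMS per sampling window
    change_points -- indices where a new segment starts (assumed to lie
                     strictly between 0 and len(signal), so no segment is empty)
    ratio         -- a change point opens a new stage when the mean of the
                     segment after it exceeds `ratio` times the mean before it
    """
    bounds = [0] + sorted(change_points) + [len(signal)]
    stage = 1
    # The first segment always belongs to stage 1.
    stages = [stage] * (bounds[1] - bounds[0])
    for left, cut, right in zip(bounds[:-2], bounds[1:-1], bounds[2:]):
        # Compare the two segments adjacent to this change point.
        if mean(signal[cut:right]) > ratio * mean(signal[left:cut]):
            stage += 1
        stages.extend([stage] * (right - cut))
    return stages


# Toy RMS trace: the first change point is not an anomaly, the other two are.
rms = [1.0] * 5 + [1.2] * 5 + [3.0] * 5 + [10.0] * 5
labels = classify_stages(rms, change_points=[5, 10, 15])
```

In this toy trace the first two segments stay in one stage because 1.2 is not more than twice 1.0, while the later jumps each open a new stage.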

### 3.2 Diversity in Base Learners.

Because the performance of base learners varies in different degradation stages of bearings, different base learners are used to construct an ensemble in each degradation stage. Some base learners might overestimate the RUL of bearings; others might underestimate it. The ensemble with the best performance should combine base learners that overestimate and underestimate the RUL. The weights assigned to the selected base learners in different stages are determined by minimizing the cross-validation error using non-negative least squares (NNLS). In this study, 16 candidate base learners from three different categories, including decision tree-based, instance-based, and linear model-based algorithms, were tested. Five of the tested algorithms were selected as base learners to achieve the minimum cross-validation error. More details about the selected base learners are given in Secs. 3.2.1–3.2.5.
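To illustrate why combining an overestimating and an underestimating learner helps, the sketch below fits non-negative weights to two toy learners. It uses plain projected gradient descent as a simple stand-in for a dedicated NNLS solver (in practice a library routine such as `scipy.optimize.nnls` would normally be used); the function name `nnls_weights` and all numbers are hypothetical and not taken from the study.

```python
def nnls_weights(predictions, targets, iterations=20000):
    """Approximate non-negative least-squares weights by projected gradient descent.

    predictions -- list of rows; row i holds each base learner's prediction for sample i
    targets     -- observed RUL values, one per row
    """
    n = len(predictions[0])
    # Normal-equation pieces: gram = A^T A and rhs = A^T b.
    gram = [[sum(row[j] * row[k] for row in predictions) for k in range(n)]
            for j in range(n)]
    rhs = [sum(row[j] * y for row, y in zip(predictions, targets)) for j in range(n)]
    # trace(A^T A) bounds the largest eigenvalue, so this step size is safe.
    step = 1.0 / sum(gram[j][j] for j in range(n))
    weights = [0.0] * n
    for _ in range(iterations):
        grad = [sum(gram[j][k] * weights[k] for k in range(n)) - rhs[j]
                for j in range(n)]
        # Gradient step, then projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, grad)]
    return weights


# Hypothetical data: learner 0 overestimates by 1, learner 1 underestimates by 1.
true_rul = [10.0, 8.0, 6.0, 4.0, 2.0]
preds = [[t + 1.0, t - 1.0] for t in true_rul]
w = nnls_weights(preds, true_rul)
```

With these toy learners the two biases cancel when both receive roughly equal weight, which is the intuition behind mixing overestimating and underestimating base learners.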

#### 3.2.1 Extra Trees.

Extra Trees has three main parameters: the number of attributes randomly selected at each node ($K$), the minimum sample size of a split ($n_{\min}$), and the number of trees ($M$).

$K$ contributes to the strength of the attribute selection process. The strength of the average noise of the output is determined by $n_{\min}$. The strength of variance reduction for the aggregation process is determined by $M$. The final prediction is obtained by aggregating the predictions of the trees: the arithmetic average for regression and a majority vote for classification. The typical form of approximation by Extra Trees is shown as Eq. (8) [39].

Here, $N$ is the sample size, $I_{(i_1,\ldots,i_n)}(x)$ is the characteristic function of the hyper-interval, and the real-valued parameters $\lambda_{(i_1,\ldots,i_n)}^{X}$ depend on input $x_j$ and output $y_j$ of the method.

#### 3.2.2 Random Forests.

Here, $j$ represents a splitting variable and $s$ is the cutting point; $R_1(j,s) = \{X \mid X_j \le s\}$ with $c_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j,s))$, and $R_2(j,s) = \{X \mid X_j > s\}$ with $c_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j,s))$.

The splitting process repeats until a stopping criterion is satisfied. After all the decision trees in the forest have reached the threshold and stopped splitting, a final prediction is made by taking the average of the predictions from the regression trees.
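A single split of the kind described above can be sketched for one splitting variable. The sketch assumes the usual regression-tree criterion (choose the cut point that minimizes the summed squared error around the two region averages $c_1$ and $c_2$); the function name `best_split` and the toy data are hypothetical.

```python
def best_split(xs, ys):
    """Return (s, cost) for the cut point s on one feature that minimizes
    the summed squared error of the two resulting regions."""

    def sse(values):
        # Squared error around the region average (c1 or c2 in the text).
        if not values:
            return 0.0
        avg = sum(values) / len(values)
        return sum((v - avg) ** 2 for v in values)

    best = None
    # The largest value is skipped because it would leave region R2 empty.
    for s in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= s]   # region R1
        right = [y for x, y in zip(xs, ys) if x > s]   # region R2
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (s, cost)
    return best


# Toy data with an obvious step between x = 2 and x = 3.
split = best_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0])
```

A full tree repeats this search over every candidate variable $j$ and then recurses into each region until the stopping criterion is met.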

#### 3.2.3 XGBoost.

Here, $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}^{(t-1)}$ and the target $y_i$. The second term $\Omega$ penalizes the complexity of the model. The first- and second-order gradients of the loss function are $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$.

The efficiency of this method is guaranteed by parallel and distributed computing. The method is demonstrated to be both faster and more accurate than most classical tree bagging algorithms [41].
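As a worked instance of the gradient statistics $g_i$ and $h_i$, consider the squared-error loss. The choice of loss is an assumption made here for illustration; the text does not state which loss was used in the study.

$$
l(y_i, \hat{y}^{(t-1)}) = \tfrac{1}{2}\left(y_i - \hat{y}^{(t-1)}\right)^2
\quad\Longrightarrow\quad
g_i = \hat{y}^{(t-1)} - y_i, \qquad h_i = 1
$$

Under this loss, $g_i$ is simply the current residual with its sign flipped, so each new tree is fitted to the errors left by the previous trees, and the constant $h_i$ means every sample carries equal curvature weight.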

#### 3.2.4 Support Vector Machines.

SVMs minimize the upper bound of the generalization error by maximizing the margin between the hyperplane and the data [42]. In the two-dimensional case, the hyperplane is a line, and each class of the data lies on a different side of it. The performance of an SVM mainly depends on the selection of a good kernel function [43]. An SVM can also select the model automatically by obtaining the optimal number and locations of the basis functions during training [44]. SVM yielded lower error rates than other instance-based methods. The proposed method applied the polynomial kernel $K(x, x_i) = 1 + \sum (x \times x_i)^{d}$ and the exponential kernel $K(x, x_i) = \exp(-\gamma \times \sum (x - x_i)^{2})$.

#### 3.2.5 Generalized Additive Models.

$Y$ to the exponential distribution of predictor variables $x_i$ is calculated by the link function shown in Eq. (11) [45].

### 3.3 Diversity in Features.

To select the optimal features in different stages, each extracted feature will be scored using the selected criteria. A threshold is determined for each criterion in a degradation stage. To determine the threshold of a criterion, the proposed method trains ensemble models using features with different scores assigned by the criterion and compares the RMSE of each model. The score with the smallest cross-validation RMSE is set as the threshold. Features with scores greater than all thresholds are selected for a stage.

To understand the impact of feature diversity on prediction accuracy, we tested three types of features: 13 time-domain features, 16 frequency-domain features, and 8 time-frequency domain features. The significance of each feature was evaluated, and the cross-validation error of each combination was compared. Different features were selected for different base learner selections. The results of feature selections were also different for different stages. The results have shown that feature diversity could improve the performance of the proposed method.

In theory, feature selection for ensembles is not necessarily the same as feature selection for a single base learner because the overall performance drives different criteria into each base learner. As stated earlier, an ensemble leverages the strength of base learners in different regions along the trajectory. This behavior could be accomplished in part through selecting different features. In this study, three of the most popular measures, namely trendability, monotonicity, and prognosability, are used to evaluate the significance of the features.

*Trendability*: Trendability measures similarity within trajectories of a feature corresponding to time series [46]. The constant features will have zero correlation with time, and therefore zero trendability, and the features with linear functions will have strong correlations with time, showing large trendability. Features with good trendability represent the state and degradation of the system in the time series. The expression of trendability is shown in Eq. (12).

Here, $x$ is a vector of observations of the feature, $y$ is the time index of the feature, and $n$ is the number of observations.

*Monotonicity*: Monotonicity measures the consistent increase or decrease of a feature. It is measured by the absolute difference between the numbers of positive and negative derivatives of the feature [46]. The expression of monotonicity is shown in Eq. (13).

Here, $n$ is the number of the observations $x$. The value of Eq. (13) ranges from 0 to 1, where 0 means the feature is non-monotonic and 1 means the feature is monotonically decreasing or increasing.

*Prognosability*: Prognosability measures the variance of the critical value of failures in a population of systems [47]. It is the deviation of the final failure values of each path divided by the range of the mean path. The expression of prognosability is shown in Eq. (14).

Here, $x_j$ is the measurements of a feature on the $j$th system, $M$ represents the number of monitored systems, and $N_j$ represents the number of measurements on the $j$th system.
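The three measures can be sketched in a few lines. The forms below are assumptions inferred from the prose and from common usage, not reproductions of Eqs. (12)–(14): trendability is taken as the absolute Pearson correlation between a feature and its time index, monotonicity as the normalized difference between the counts of positive and negative increments, and prognosability as an exponential of the spread of final values over the mean range (the exponential form and the use of the sample standard deviation are both assumptions).

```python
import math
from statistics import mean, stdev


def trendability(x):
    """Absolute Pearson correlation between a feature and its time index."""
    n = len(x)
    t = list(range(n))
    sx, st = sum(x), sum(t)
    var_x = n * sum(a * a for a in x) - sx ** 2
    var_t = n * sum(b * b for b in t) - st ** 2
    if var_x <= 0 or var_t <= 0:
        # A constant feature has no correlation with time.
        return 0.0
    cov = n * sum(a * b for a, b in zip(x, t)) - sx * st
    return abs(cov) / math.sqrt(var_x * var_t)


def monotonicity(x):
    """|#positive increments - #negative increments| / (n - 1)."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    positive = sum(d > 0 for d in diffs)
    negative = sum(d < 0 for d in diffs)
    return abs(positive - negative) / (len(x) - 1)


def prognosability(trajectories):
    """exp(-std of final values / mean absolute start-to-end range).

    trajectories -- one list of measurements per monitored system (at least two)
    """
    finals = [traj[-1] for traj in trajectories]
    ranges = [abs(traj[0] - traj[-1]) for traj in trajectories]
    return math.exp(-stdev(finals) / mean(ranges))
```

All three scores lie between 0 and 1 under these definitions, which is what makes a single threshold per criterion meaningful.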

## 4 Case Study

### 4.1 Experimental Setup.

The experimental data used in this case study were collected using the PRONOSTIA platform designed by the FEMTO-ST Institute [48]. This data set was also used in the IEEE PHM 2012 challenge. The PRONOSTIA testbed can accelerate the degradation process of bearings such that critical failure will occur within several hours under constant or varying operating conditions. This testbed consists of a rotational component, a load generation component, and a measurement component. A synchronous motor with a gearbox and a speed controller is used to control the rotational speed of the bearings. A pneumatic jack and a digital electro-pneumatic pressure regulator are used to control the load up to 4000 N. More details about the PRONOSTIA testbed are shown in Figs. 3 and 4.

### 4.2 Data Description.

The IEEE PHM 2012 challenge data sets were collected under three different operating conditions. One of the data sets was used to demonstrate the proposed method. The raw data were collected under a rotating speed of 1800 rpm and a load force of 4000 N. Seven sub-data sets, Bearing1_1 to Bearing1_7, were used for training and validating the predictive model. Each sub-data set contains vibration signals in both horizontal and vertical directions (Fig. 5) that were collected using a set of high-frequency accelerometers. The sampling frequency for the vibration signal was 25.6 kHz, and 2560 samples were recorded in the first 0.1 s of every 10 s. To avoid damage to the testbed, run-to-failure tests were terminated when the amplitude of the vibration signal exceeded 20 g. Table 1 shows the monitored useful life of each bearing.

### 4.3 Feature Extraction.

Thirty-seven (37) features, including thirteen (13) time-domain features and twenty-four (24) frequency-domain features, were extracted from each of the horizontal and vertical vibration signals. In total, seventy-four (74) features were extracted. The time-domain features include maximum, minimum, standard deviation, root-mean-square, kurtosis, skewness, mean, peak–peak value, variance, upper bound, entropy, standard deviation of inverse sine, and standard deviation of inverse tangent [49]. The frequency-domain features were extracted using a fast Fourier transform for each sampling period. The frequency-domain features include the maximum, frequency of maximum amplitude, bandwidth, energy, and entropy. These frequency-domain features were computed on both the frequency–time spectrum and the power–density spectrum.
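A subset of the time-domain features listed above can be computed per sampling window as sketched below. The function name `time_domain_features` is hypothetical, and the population (divide-by-n) definitions of variance, skewness, and kurtosis are an assumption, since the text does not specify the estimators used.

```python
import math


def time_domain_features(window):
    """Compute basic time-domain statistics for one window of vibration samples."""
    n = len(window)
    avg = sum(window) / n
    variance = sum((v - avg) ** 2 for v in window) / n  # population variance
    std = math.sqrt(variance)
    features = {
        "maximum": max(window),
        "minimum": min(window),
        "mean": avg,
        "standard_deviation": std,
        "variance": variance,
        "root_mean_square": math.sqrt(sum(v * v for v in window) / n),
        "peak_to_peak": max(window) - min(window),
    }
    if std > 0:
        # Standardized third and fourth moments; undefined for a constant window.
        features["skewness"] = sum((v - avg) ** 3 for v in window) / (n * std ** 3)
        features["kurtosis"] = sum((v - avg) ** 4 for v in window) / (n * std ** 4)
    return features


# Toy window standing in for one 2560-sample recording.
feats = time_domain_features([3.0, -3.0, 3.0, -3.0])
```

Applying the function to each recording of the horizontal and vertical channels yields one feature trajectory per channel, which is the form needed for stage detection and feature scoring.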

### 4.4 Degradation Stages.

A bearing may experience varying degradation patterns or stages during its in-service life. Detecting change points and degradation stages can improve RUL prediction accuracy. As shown in Fig. 5, the bearing degradation stages are correlated with statistical features of raw data. Thirty-seven features were investigated, and Fig. 6 shows three example features. Among them, RMS was the most effective feature for detecting change points.

The proposed method was able to detect three different stages, including (1) normal condition, (2) smooth wear condition, and (3) severe wear condition, before failure occurs [38]. Table 2 shows two different cases where three and two degradation stages were detected. For example, Bearing1_1 and Bearing1_3 have three degradation stages, including normal condition, smooth wear, and severe wear. Other bearing data sets have two degradation stages, including normal condition and severe wear. Figure 7 shows three degradation stages detected in Bearing1_1 and two degradation stages detected in Bearing1_7 data sets.

| | Data sets with three stages | Data sets with two stages |
|---|---|---|
| Data sets index | Bearing1_1 and 1_3 | Bearing1_2, 1_4, 1_5, 1_6 and 1_7 |

To demonstrate that classifying degradation stages can improve prediction performance, the prediction accuracy of the ensemble learning algorithm with and without classifying degradation stages was compared as shown in Fig. 8. Bearing1_2 to Bearing1_7 were used for training; Bearing1_1 was used for testing. When training a predictive model without classifying degradation stages, only one predictive model was built using all the training data. This predictive model was not able to track the changes in degradation patterns at varying degradation stages as shown in Fig. 8(a). In contrast, a predictive model was trained for each degradation stage after degradation stages were detected. The predictive model trained for each degradation stage was able to track the degradation pattern at each stage with higher accuracy as shown in Fig. 8(b). Table 3 shows more details about prediction performance in terms of relative error and *R*^{2}.

| | Relative error (without detecting degradation stages) | *R*^{2} (without detecting degradation stages) | Relative error (with detecting degradation stages) | *R*^{2} (with detecting degradation stages) |
|---|---|---|---|---|
| Stage 1 | 27.83% | 0.8180 | 22.18% | 0.7866 |
| Stage 2 | 73.05% | 0.6697 | 27.86% | 0.8035 |
| Stage 3 | 343.47% | 0.3344 | 163.19% | 0.9482 |
| Overall | 44.26% | 0.8491 | 26.25% | 0.9062 |

Figure 9 shows a comparison of prediction performance at stage 3 for Bearing1_1 and Bearing1_3 data sets where three degradation stages were observed. Classifying degradation stages improves prediction performance significantly. For the Bearing1_1 data set, the relative error decreased from 343.47% to 163.19%, and *R*^{2} increased from 0.3344 to 0.9482. For the Bearing1_3 data set, the relative error decreased from 156.8% to 67.5%, while *R*^{2} changed only marginally, from 0.723 to 0.721. The results also showed that classifying degradation stages reduced overestimation.

Figure 10 shows a comparison of prediction performance at stage 3 for Bearing1_2 and Bearing1_7 data sets where two degradation stages were observed. The results have shown that prediction performance at stage 3 for both data sets was improved significantly in terms of both *R*^{2} and relative error by classifying degradation stages. For Bearing1_2, the relative error decreased from 969.6% to 186.3%, and *R*^{2} increased from 0.1773 to 0.8350. For Bearing1_7, the relative error decreased from 1297% to 91.47%, and *R*^{2} increased from 0.3726 to 0.7526. Similar to the results shown in Fig. 9, classifying degradation stages significantly reduced overestimation.

Figure 11 shows a comparison of prediction performance at stage 3 for Bearing1_1 to Bearing1_7 data sets. Six of the seven data sets were used for training; the remaining data set was used for testing. The results have shown that a significant performance improvement was achieved for Bearing1_1, Bearing1_2, Bearing1_6, and Bearing1_7 data sets in terms of both relative error and *R*^{2} by classifying degradation stages. A minor performance improvement was achieved for Bearing1_3, Bearing1_4, and Bearing1_5 data sets.

### 4.5 Impact of Diversity in Base Learners.

To further improve prediction accuracy by increasing base learner diversity, different base learners were selected in different degradation stages using the method described in Sec. 3.2. The hypothesis is that the performance of different base learners varies in different degradation stages. In this case study, five base learners were selected from three different categories, including decision tree-based, instance-based, and linear model-based algorithms. Different weights were assigned to the selected base learners in different degradation stages. Table 4 lists the optimal weights that minimized the cross-validation error in each degradation stage. The results have shown that only three methods, all tree-based, received non-zero weights in the fixed model trained without stage classification. The proposed method therefore selected a more diverse set of base learners than the fixed model. Table 5 shows a comparison between fixed base learner selection and dynamic base learner selection. The results have shown that by increasing diversity in base learner selection, the performance of the prediction model was improved. For the Bearing1_1 data set, the relative error decreased from 26.25% to 25.16%, and *R*^{2} increased from 0.9062 to 0.9482.

| | Extra trees | XGBoost | SVM | Random forest | GAM |
|---|---|---|---|---|---|
| Without stage | 0.4627 | 0.4059 | 0.0000 | 0.1314 | 0.0000 |
| Stage 1 | 0.3372 | 0.5174 | 0.0000 | 0.1454 | 0.0000 |
| Stage 2 | 0.0000 | 0.1155 | 0.0170 | 0.3935 | 0.4739 |
| Stage 3 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
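The stage-dependent weights in the table above can be applied as a simple weighted sum of the base learners' outputs. The sketch below copies the weights from the table; the function name `ensemble_rul`, the dictionary keys, and the example predictions are hypothetical.

```python
# Weights per degradation stage, copied from the table above.
STAGE_WEIGHTS = {
    1: {"extra_trees": 0.3372, "xgboost": 0.5174, "svm": 0.0000,
        "random_forest": 0.1454, "gam": 0.0000},
    2: {"extra_trees": 0.0000, "xgboost": 0.1155, "svm": 0.0170,
        "random_forest": 0.3935, "gam": 0.4739},
    3: {"extra_trees": 1.0000, "xgboost": 0.0000, "svm": 0.0000,
        "random_forest": 0.0000, "gam": 0.0000},
}


def ensemble_rul(stage, base_predictions):
    """Combine base-learner RUL predictions with the weights of the given stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weights[name] * base_predictions[name] for name in weights)


# Hypothetical base-learner outputs (arbitrary time units) for one sample.
example = {"extra_trees": 120.0, "xgboost": 100.0, "svm": 90.0,
           "random_forest": 110.0, "gam": 95.0}
stage3_rul = ensemble_rul(3, example)
stage2_rul = ensemble_rul(2, example)
```

In stage 3 the ensemble reduces to Extra Trees alone, whereas stage 2 draws on four learners from all three categories, which is where the added diversity shows up.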

### 4.6 Impact of Diversity in Features.

To further improve prediction accuracy by increasing feature diversity, different features were selected in different degradation stages using the method described in Sec. 3.3. The hypothesis is that feature selection for ensembles is not necessarily the same as feature selection for a single base learner. An ensemble model leverages the strength of base learners in different degradation stages. As mentioned in Sec. 3.3, three of the most popular measures, namely trendability, monotonicity, and prognosability, were used to evaluate the significance of extracted features. Models with different thresholds were trained with data from Bearing1_2 to Bearing1_7. The cross-validation errors of all the models were compared, and the threshold of each criterion was determined to minimize the cross-validation error in the training domain. Features that satisfied the thresholds of all criteria were selected. To evaluate the dynamic feature selection, the performance of the model using the selected features was tested on each of the seven bearing data sets (leave-one-out training and testing). The results have shown that dynamic feature selection can improve the prediction accuracy and reduce training time. The dynamic feature selection method consists of two steps:

*Step 1*: Remove linearly dependent features for stages 1, 2, and 3. The linear dependency of the extracted 74 features was evaluated using the S Function in a linear regression model [50]. Twenty (20) features were removed for stages 1 and 3; twenty-one (21) features were removed for stage 2. Because these features are linearly dependent on the remaining features, removing these features did not affect the prediction accuracy of the predictive model.

*Step 2*: Evaluate the importance of the remaining features for stages 1, 2, and 3 based on three criteria, including monotonicity, trendability, and prognosability. A threshold for each criterion is determined based on the RMSE of the predictive model.

For stage 1, the thresholds for monotonicity, trendability, and prognosability are 0.01, 0.02, and 0.08, respectively, because the smallest RMSEs were achieved with these thresholds, as shown in Fig. 12. Twenty (20) features were selected for stage 1.

For stage 2, the thresholds for monotonicity, trendability, and prognosability are 0.16, 0.02, and 0.12, respectively, because the smallest RMSEs were achieved with these thresholds, as shown in Fig. 13. Forty (40) features were selected for stage 2.

For stage 3, the thresholds for monotonicity, trendability, and prognosability are 0.1, 0.24, and 0.15, respectively, because the smallest RMSEs were achieved with these thresholds, as shown in Fig. 14. Forty-five (45) features were selected for stage 3.
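The selection rule (keep a feature only if its score exceeds the threshold of every criterion for the stage) can be sketched with the thresholds reported above. The function name `select_features`, the feature names, and the scores are hypothetical and serve only to show the mechanics.

```python
# Thresholds per stage, as reported in the text.
STAGE_THRESHOLDS = {
    1: {"monotonicity": 0.01, "trendability": 0.02, "prognosability": 0.08},
    2: {"monotonicity": 0.16, "trendability": 0.02, "prognosability": 0.12},
    3: {"monotonicity": 0.10, "trendability": 0.24, "prognosability": 0.15},
}


def select_features(scores, stage):
    """Return the names of features whose scores exceed every threshold of a stage.

    scores -- mapping from feature name to a dict of criterion scores
    """
    thresholds = STAGE_THRESHOLDS[stage]
    return [name for name, score in scores.items()
            if all(score[criterion] > limit for criterion, limit in thresholds.items())]


# Hypothetical scores for two features.
scores = {
    "rms_horizontal": {"monotonicity": 0.20, "trendability": 0.50, "prognosability": 0.60},
    "kurtosis_vertical": {"monotonicity": 0.05, "trendability": 0.30, "prognosability": 0.20},
}
stage1 = select_features(scores, stage=1)
stage3 = select_features(scores, stage=3)
```

Because the thresholds differ by stage, the same feature can be kept in one stage and dropped in another, which is the mechanism behind dynamic feature selection.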

Figure 15 shows a comparison between fixed and dynamic feature selection for stage 1. Six (6) data sets were used for training; the remaining one was used for testing. As shown in Fig. 15, by using dynamic feature selection, the accuracy of the predictive model for stage 1 slightly improved in terms of relative error and *R*^{2}.

Figure 16 shows a comparison between fixed and dynamic feature selection for stage 3. Six (6) data sets were used for training; the remaining one was used for testing. As shown in Fig. 16, the accuracy of the predictive model for stage 3 did not improve using dynamic feature selection in terms of relative error and *R*^{2}. However, removing redundant features can increase computational efficiency.

Table 6 shows a comparison of overall performance for Bearing1_1 between fixed and dynamic feature selection. By using dynamic feature selection, the relative error improved from 25.16% to 21.35%, and *R*^{2} improved from 0.9482 to 0.9647. The average training time was reduced from 137 min to 45 min. The results have shown that the prediction accuracy can be improved by increasing the diversity in features.
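As a point of reference for the two accuracy metrics reported above, the following Python sketch computes a mean relative error and *R*^{2} for a set of RUL predictions. It assumes that relative error is the mean absolute percentage deviation from the actual RUL, which may differ from the exact definition used in this paper, and that the actual RUL values are nonzero. The function names and sample values are hypothetical.

```python
import numpy as np

def mean_relative_error(actual, predicted):
    # Mean absolute deviation as a percentage of the actual RUL.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs(actual - predicted) / np.abs(actual)))

def r_squared(actual, predicted):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up RUL values for illustration only.
actual_rul = [100.0, 80.0, 60.0, 40.0]
predicted_rul = [90.0, 80.0, 66.0, 40.0]

rel_err = mean_relative_error(actual_rul, predicted_rul)
r2 = r_squared(actual_rul, predicted_rul)
```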

### 4.7 Performance Comparison.

Relative error, *R*^{2}, and a score function were used to evaluate the prediction accuracy of machine learning algorithms. Relative error and *R*^{2} have been widely used to evaluate prediction accuracy. The score function is another model evaluation metric where different penalties are allocated for underestimates (negative absolute error) and overestimates (positive absolute error) [48]. A smaller penalty is assigned for underestimates, while a greater penalty is assigned for overestimates. In other words, to ensure system safety, underestimates are more desirable than overestimates. The score function ranges between 0 and 1. Greater scores indicate better prediction performance. The score function is defined in Eq. (15).

In Eq. (15), *A*_{i} is the score of accuracy, *i* is the index of the bearing test data set, and *Er*_{i} is the relative error. The final score is defined in Eq. (16).
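The asymmetric scoring idea can be illustrated with a short Python sketch. It assumes an exponential penalty of the form used in the IEEE PHM 2012 challenge, with the sign convention described above (a negative error is an underestimate). The decay constants (20 and 5), the averaging used for the final score, and the function names are assumptions for illustration and may differ from Eqs. (15) and (16).

```python
import math

def accuracy_score(error_pct, under_scale=20.0, over_scale=5.0):
    # error_pct: signed relative error in percent, assumed to be
    # 100 * (predicted - actual) / actual, so negative = underestimate.
    # Underestimates decay slowly (scale 20); overestimates decay quickly (scale 5).
    scale = under_scale if error_pct <= 0 else over_scale
    return math.exp(math.log(0.5) * abs(error_pct) / scale)

def final_score(errors_pct):
    # Assumed final score: average of the per-data-set scores.
    scores = [accuracy_score(e) for e in errors_pct]
    return sum(scores) / len(scores)

# A 10% underestimate is penalized less than a 10% overestimate.
under = accuracy_score(-10.0)
over = accuracy_score(10.0)
```

Under these assumptions, a perfect prediction scores 1, and the score halves for every 20 percentage points of underestimation or every 5 percentage points of overestimation.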

The performance of the proposed method was compared with that of two deep learning algorithms reported in the literature (Table 7). Guo et al. [25] trained a predictive model using an RNN on Bearing1_1 and Bearing1_2. The predictive model was validated on the Bearing1_3 to Bearing1_7 data sets. The relative errors for Bearing1_3, Bearing1_4, Bearing1_5, Bearing1_6, and Bearing1_7 are 43.28%, 67.55%, −22.98%, 21.23%, and 17.83%, respectively. The average relative error is 32.48%. Our method achieved an average relative error of 25.73% and a score of 0.95 using the same training and test data sets. In addition, Liao et al. [26] trained a predictive model using a restricted Boltzmann machine (RBM) on Bearing1_1 to Bearing1_5. The predictive model was validated on the Bearing1_6 and Bearing1_7 data sets. A score of 0.57 was achieved. Our method achieved an average relative error of 26.63% and a score of 0.96 using the same training and test data sets.

## 5 Conclusions and Future Work

A novel ensemble learning-based approach to PHM was developed by selecting diverse base learners and features in different degradation stages. To demonstrate the proposed method, the IEEE PHM 2012 challenge data were used to predict the RUL of rolling element bearings. The degradation process of the bearings was classified into three stages, including normal wear, smooth wear, and severe wear, based on the variation in RMS of the vibration signals. A predictive model was built for each degradation stage. The base learners of the ensemble learning algorithm were dynamically selected from machine learning algorithms of three different types, including decision tree-based, instance-based, and generalized linear models. To increase the diversity in features, the features fed into the proposed method were also dynamically selected for each degradation stage. The experimental results have shown that dynamic feature selection and dynamic base learner selection in different degradation stages can increase the diversity in features and base learners, thereby improving the performance of ensemble learning.

The proposed method with increased diversity in base learners and features was capable of estimating the RUL of bearings with higher accuracy than that of two deep learning algorithms (i.e., RNN and RBM). In the future, we will feed different features to different base learners at varying degradation stages in order to further improve prediction accuracy. In addition, more advanced change point detection techniques such as deep learning will be tested. Moreover, uncertainty quantification methods will be used to provide quantile predictions.

## Acknowledgment

The research reported in this paper is partially supported by the NASA Ames Research Center (Grant No. 80NSSC18M108). Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of NASA Ames Research Center.