Transfer Learning for Fault Detection with Application to Wind Turbine SCADA Data
Silvio Simani 1,*, Saverio Farsoni 1, Paolo Castaldi 2
1 Department of Engineering, University of Ferrara, Via Saragat 1E, Ferrara, 44122, FE, Italy
2 Department of Electrical, Electronic and Information Engineering, University of Bologna, Via Fontanelle, Forlì, 47121, FC, Italy
* Correspondence: Silvio Simani
Academic Editor: Andrés Elías Feijóo Lorenzo
Collection: Wind Energy
Received: February 02, 2023 | Accepted: March 15, 2023 | Published: March 21, 2023
Journal of Energy and Power Technology 2023, Volume 5, Issue 1, doi:10.21926/jept.2301011
Recommended citation: Simani S, Farsoni S, Castaldi P. Transfer Learning for Fault Detection with Application to Wind Turbine SCADA Data. Journal of Energy and Power Technology 2023; 5(1): 011; doi:10.21926/jept.2301011.
© 2023 by the authors. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is correctly cited.
Abstract
The installed wind power capacity is growing worldwide. Remote condition monitoring of wind turbines is employed to achieve higher uptimes and lower maintenance costs. Machine learning models can detect developing damage in wind turbines. This paper demonstrates that cross–turbine transfer learning can drastically improve the accuracy of fault detection models in turbines with scarce SCADA data. In particular, it shows that combining the knowledge from turbines with scarce data and turbines with plentiful data enables earlier detection of faults than prior art methods. Training fault detection models requires large amounts of past and present SCADA data, but these data are often unavailable or not representative of the current operation behavior. Newly commissioned wind farms lack SCADA data from previous operation. Due to control software updates or hardware replacements, older turbines may also lack representative SCADA data: after such events, a turbine’s operation behavior can change significantly, so that its SCADA data no longer represent its current behavior. Therefore, this work highlights how to reuse and transfer knowledge across wind turbines to overcome this lack of data and enable the earlier detection of faults in wind turbines.
Keywords
Condition monitoring; diagnostics; wind turbines; transfer learning; fault detection; convolutional neural networks
1. Introduction
The globally installed wind power capacity is growing steadily, in line with efforts to increase the share of renewable energy production worldwide [1,2]. The operation and maintenance costs of wind farms make up about one-third of the electricity production costs over the lifetime of a wind farm [3].
Condition monitoring and intelligent fault detection approaches can enable a transition towards condition–based maintenance of Wind Turbines (WT)s, thereby increasing the up-time of the monitored WTs and reducing their maintenance costs. Fault detection algorithms that rely on machine learning are designed to notify the operator when unusual operation behavior is observed. This requires models of the WT’s normal operation behavior which can be trained on condition data from healthy WTs.
The use of condition data from the Supervisory Control And Data Acquisition (SCADA) system was successfully demonstrated to train such Normal Behaviour Models (NBM)s especially for fault detection tasks, as shown e.g., in [4,5,6,7,8,9].
However, this approach requires that condition monitoring data be available over a sufficiently long operation period covering a range of operating conditions. Indeed, when condition monitoring data are scarce or no longer representative of the WT’s current behavior, fault detection based on NBMs is usually not an option because NBMs cannot be trained then. This is the case for newly commissioned WTs and in the initial stage of the operation life of a WT. Moreover, major changes in the operation conditions, regardless of WT age, can result in a lack of SCADA data representative of the new operation behavior. This scenario can arise, e.g., after control software updates or hardware retrofits.
Therefore, the use of Transfer Learning (TL) in connection with Deep Learning (DL) is considered in this work. DL methods have become popular among researchers in the field of fault detection. However, their performance depends on the availability of big data sets [10]. To overcome this problem, researchers started applying TL to achieve good performance from small available data sets, by leveraging multiple prediction models over similar machines and working conditions. However, the influence of negative TL limits their application. Negative transfer among prediction models increases when the environment and working conditions change continuously. To overcome the effect of negative transfer, the paper also proposes a TL method for WT fault detection that prevents negative transfer and focuses only on relevant information from the source machine.
In particular, this study demonstrates the potential of TL for enabling SCADA–based fault detection with normal behavior models even when NBM training data are scarce. It is investigated how NBMs can be transferred from a WT with sufficient training data (source WT) to a WT with scarce operation data (target WT) to achieve accurate NBMs and, thus, reduce fault detection delays. The work investigates multiple TL strategies for training an NBM in the target WT, and studies how they affect the accuracy of the trained NBM. It is also analyzed how the preferable strategy depends on the amount of training data available. The proposed approach drastically reduces the SCADA data required to train an NBM for fault detection tasks in the target WT. It is shown that TL results in a higher NBM accuracy and, thus, earlier fault detection when less than a year of training data is available.
It is worth noting that recent advancements in DL have achieved enormous success in many fields [11], ranging from computer vision [12] to speech recognition [13]. Therefore, DL is also attracting researchers in fault detection [14,15], as considered in this paper. However, compared with other fields, fault detection faces a data imbalance problem: most of the available data represent healthy cases, with only a very small minority corresponding to fault cases [16]. This study developed a method to build fault detection models for WTs by exploiting the knowledge learned among the WTs. Even though the WTs are practically identical, their behavior differs due to different environments and Working Conditions (WC)s [17,18]. The recent progress in the Internet of Things (IoT) allows for collecting data from remote sources, such as SCADA systems.
The goal of TL is to improve a learner's performance by transferring knowledge from another related domain, whereas traditional machine learning models are based on the assumption that both the training and testing data belong to the same data distribution. TL is inspired by human learning behavior, where humans use previous related knowledge while learning to solve new problems. For example, a person who already knows how to ride a bike learns to drive a car more quickly than a person starting from scratch without any road experience. As shown in this study, TL enables machine learning models to transfer learned knowledge from source domains to a target domain to improve the performance of the target learning function, even when the source and target domains have different data distributions [10]. Moreover, the source domain data samples can be transferred to improve the learning of the target model [19]. TL is becoming popular, and it has been applied in many fields including fault detection, but some research gaps still need to be filled.
This paper addresses only TL for fault detection, since the research is focused on this topic. TL for fault detection transfers the learned knowledge about faults from a source machine to a target machine. The currently available methods only use one machine as a source. The paper [20] proposes domain adaptation in fault diagnosis to learn features combined from the source and the target domain. Then a support vector machine classifier is used to predict faults. A TL method was used to predict bearing inner, ball, and outer race faults in changing working conditions [21]. The paper [22] proposed an improved deep neural network optimized by a particle swarm optimization algorithm and a regularization method to classify gear pitting faults.
Data labeling is a difficult task, and fault data are imbalanced. However, transferring the knowledge to a target without labeled data is also possible. The maximum mean discrepancy was used to minimize the difference between the source and the target domains when labeled data were unavailable for the target. Along with the domain adaptation using maximum mean discrepancy, different DL models are used for condition recognition, such as the sparse autoencoder [23] or the Convolutional Neural Network (CNN) [24]. The paper [25] proposed a feature–based CNN to extract transferable features from the raw vibration signals. Then multi-layer domain adaptation is added to reduce the distribution discrepancy of the learned transferable features, and pseudo-label learning [26] is used to train from unlabeled target domain samples. On the other hand, pre–trained networks can be exploited to train a deep learning network for fault classification, as shown in [27]. The sensor data are converted to image data by plotting [28] or by using the wavelet transformation [29] to obtain a time-frequency distribution to fine-tune the high-level network layers. The low-level features are extracted from the pre-trained network. Research using the TL approach to fault detection was validated on lab-generated data from a bearing dataset [30].
The current state-of-the-art methods do not account for the dissimilarities between the source and the target machines. These discrepancies, referred to as negative transfer, may hinder the performance of the target model, in contrast with the valuable, positive information from the source, as addressed e.g. in [10]. Combining DL with TL makes it possible to train nonlinear, high-dimensional Deep Transfer Learning (DTL) models with a small data set. DTL can be classified into four categories: instance-based DTL, network-based DTL, mapping-based DTL, and adversarial-based DTL. In order to address the negative transfer problem, the work [10] proposed, e.g., an instance-based DTL method.
On the other hand, concerning DL applied to fault diagnosis, DL–based fault diagnosis models are relatively widespread, such as, e.g., the deep variational autoencoder [31,32], the multiscale deep belief network [33], deep hybrid learning [34] and the stacked denoising autoencoder [35].
This research, in particular, combines TL with DL for fault diagnosis; the work is focused on TL, which allows data to be combined over multiple machines. To deal with large data sets and capture the nonlinear trends from different measurements, a method like DL is needed instead of shallow machine learning models. TL with DL is employed to transfer the knowledge learned from a WT with an extensive fault history to a WT whose fault history is too scarce to train traditional machine learning models. With respect to previous works by the authors [36], this paper exploits DL and TL tools for a fault diagnosis application to WT SCADA data. Other well–established traditional approaches showing the application of artificial intelligence tools for fault diagnosis can be found, e.g., in [37,38,39,40,41,42,43,44].
This paper is organized as follows. Section 2 reviews existing TL approaches and how they can be beneficially applied in wind power applications. Recently proposed first applications of TL to condition monitoring tasks in wind farms are also considered. SCADA-based modeling of the normal operation behavior and detection of deviations from that behavior is one of the most relevant fault detection approaches in WTs in practice [8]. Despite its relevance, the potential of TL for this fault detection task has not been investigated so far. This work demonstrates how TL can be beneficial for extracting operation knowledge from WTs with sufficient training data and for re-using this knowledge to improve the fault detection accuracy in WTs with scarce training data. Section 3 introduces new strategies for training an NBM for fault detection despite scarce target WT training data. Section 4 presents the results of a case study of the presented strategies and their performances, which are then discussed in Section 5. Finally, Section 6 presents some conclusions and possible directions for future research.
2. Transfer Learning for Fault Detection
Fault detection methods for WTs usually assume that the training and the test data originate from the same multivariate distribution in the same feature space. For example, it is usually assumed that a model of the WT’s normal operation behavior, once trained on past operation data, will continue to perform well when applied to data from the future operation of the WT. However, this assumption is often not correct in practice as the distribution of the operation data can evolve over time and even change abruptly. This can be reflected in distribution shifts, such as shifts of the power curve or of other correlated SCADA variables. Various processes can cause these distribution changes, including control software updates and hardware replacements. In such situations, SCADA data that are representative of the current normal operation of the WT are scarce. Representative SCADA data are also scarce after the WT commissioning and at the beginning of its operational life. If condition data are barely available, the feature space will be populated only sparsely, so a probability density distribution representative of the WT’s normal operation behavior cannot be estimated reliably. For instance, sufficient SCADA data might not be available at high wind speeds, so the power generation cannot be estimated accurately.
On the other hand, after software and hardware updates, SCADA data from before the update are often plentifully available but no longer representative of the current operation behavior. The feature space is comprehensively populated in this case, but the data distribution does not fully reflect the WT’s current normal operation behaviour. If nevertheless, an NBM is trained on past operation data, it tends to be less capable of detecting anomalous operation behavior and, thus, causes delays in the detection of incipient faults [5].
TL relies on the mathematical concepts of domain and task. Formally, a domain D = {X, P(X)} consists of a feature space X and a marginal probability distribution P(X). A task T = {Y, P(Y|X)} is defined by a label set Y and a conditional probability distribution P(Y|X), which can be estimated from a training set {(xi, yi) | xi ∈ X, yi ∈ Y}. TL considers a source domain Ds = {Xs, P(Xs)} with a source task Ts and a target domain Dt with a target task Tt, and aims to estimate the conditional probability distribution in the target domain, P(Yt|Xt), from information extracted from the source domain and the source task [45,46].
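The setting can be summarized compactly in standard TL notation (a restatement of the definitions above, not an additional assumption of the paper):

\[
\mathcal{D}_s = \{X_s, P(X_s)\},\ \mathcal{T}_s = \{Y_s, P(Y_s \mid X_s)\}, \qquad
\mathcal{D}_t = \{X_t, P(X_t)\},\ \mathcal{T}_t = \{Y_t, P(Y_t \mid X_t)\},
\]
\[
\text{TL objective: estimate } P(Y_t \mid X_t) \text{ using knowledge from } \mathcal{D}_s \text{ and } \mathcal{T}_s,
\text{ with } \mathcal{D}_s \neq \mathcal{D}_t \text{ or } \mathcal{T}_s \neq \mathcal{T}_t.
\]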
Thus, it enables the transfer of information learned on a source feature space and task to a different but related target feature space and task. In this way, it can help to address mismatches in the training and the test data. TL has shown success when the source and target data follow similar but different distributions or populate similar but different feature spaces; it can also be successful for applications when the source task and target task are similar but different [46,47,48,49].
Such situations frequently arise in machine health applications in industrial fleets. Wind farms comprise different but similar WT types, operating conditions, and SCADA systems. TL has a strong potential for improving WT condition monitoring tasks including performance monitoring, fault diagnostics and machine health state prognostics. This study demonstrates the application of TL for training accurate models of the normal operation behavior of WTs for fault detection tasks. The representativeness and accuracy of normal behavior models are of major importance for enabling the early detection of incipient faults [5]. TL strategies can be particularly useful when much training data in the source domain are available but little training data in the target domain, which is the case in the situations of scarce WT condition data considered in this study.
Few studies have investigated the reuse and transfer of knowledge across different WTs. The potential of TL methods in wind energy applications, notably fault diagnostics tasks, has barely been studied. The work [50] proposed applying a fault detection model across WTs without retraining the model on the target WT. They made use of an offshore source WT and multiple onshore target WTs. Fault labels were available for the source WT and unavailable for the target WT. The model was trained on data from the SCADA system of the source WT. The fault detection model was applied to the target domain without retraining; the entire target domain data set was used as a test set. The paper [51] addressed image-based blade damage detection methods in WTs. They applied a TL approach to demonstrate improved feature extraction and damage detection accuracy in the images. The work [52] applied TL to fault diagnosis tasks in WTs. Based on SCADA data, they investigated the application to gear cog belt fractures and blade icing detection. The authors compared classical machine learning methods and TL with an algorithm [45] for the two fault classification tasks, finding the superior performance of the TL–based method. The paper [53] presented a fault diagnosis method for WT gearboxes that uses generative TL. They demonstrated their approach in the laboratory with accelerometer measurements from the gearbox of a 3–kW WT. The work [54] investigated the transfer of fault diagnosis tasks on SCADA and failure status data sets. The authors considered different fault types, focused on the transfer of fault diagnostics models, and demonstrated the application of autoencoders. Finally, the work [55] proposed a TL approach based on multiple autoencoders with step-wise customization. They chose the application field of short-term wind power forecasting and demonstrated their proposed approach in 50 commercial WTs.
Surprisingly, none of the existing TL methods for SCADA–based fault detection involves normal behavior modeling, even though it is one of the most relevant data-driven methods for early fault detection in WTs. This study investigates TL strategies for fault detection based on normal behavior models. The research issues that this work aims at solving are the following.
- How to reuse and transfer operational knowledge across WTs;
- The effect of the transfer on the performance of fault detection models for the target WT;
- The results obtained with a one–fits–all normal behavior model for fault detection at the target WT compared to a model customized to that WT;
- The improvement in the accuracy of SCADA–based fault detection with TL of an NBM compared to training an NBM from scratch;
- The relation between the accuracy improvement and the amount of training data from the target WT.
3. Proposed Methodology
The task in this study is the SCADA–based detection of incipient faults in WTs whose SCADA data are scarce or no longer representative of the current normal operation behavior. To this end, the work demonstrates how to train an NBM and transfer it to a target WT, as shown in Figure 1. It is sketched how TL strategies can facilitate the re–use and transfer of knowledge about the normal operation behavior of a source WT to a target WT for fault detection tasks.
Figure 1 Transfer learning strategy enhancement.
A normal behavior model f is a statistical or machine learning model that was trained on SCADA features x1(t), x2(t), ..., xm(t) to learn the distribution of one or several target SCADA variables y1(t), y2(t), ..., yn(t) to be monitored, as shown in Figure 2. In particular, the flow chart of Figure 2 illustrates the SCADA–based fault detection methods based on training and application of normal behavior models.
Figure 2 Flow chart of the fault detection strategy.
The model f maps the SCADA features x1(t), x2(t), ..., xm(t) to the SCADA variable estimates yˆ1(t), yˆ2(t), ..., yˆn(t) expected under normal operation behavior, with m, n ∈ ℕ. The features x1(t), x2(t), ..., xm(t) on which f is trained are chosen to reflect the full range of operating conditions that can arise in a fault–free, normally functioning WT and that are relevant for predicting the estimated values of the monitored variables y1(t), y2(t), ..., yn(t). The SCADA input features typically include environmental conditions, particularly wind speed. The SCADA target variables to be monitored should be suitable indicators of fault conditions. For example, operation temperatures of critical components and fluids are often selected as the SCADA target variables to be monitored for early fault detection tasks. This is because unusual heat generation, such as that caused by excessive friction, can indicate operating problems and incipient faults.
Note that, as recalled in Figure 2, this paper will not consider the problem of the evaluation of the diagnostic signals ϵi = yi - yˆi, with i = 1, ..., n, since it focuses on the improvement of the accuracy of the estimates yˆi. However, it is worth noting that, as already remarked, the paper focuses on transfer learning approaches: fault diagnosis represents a possible application, even if it is not the main point of the work. As proposed, e.g., in [56], the selection of the fault detection threshold applied to ϵi can be based on a simple geometrical test or on more complicated statistical schemes. In particular for this study, the fault detection thresholds are set using an empirical solution that considers a margin of ±10% of the minimal and maximal values of the fault-free values of ϵi = yi - yˆi. However, more complex approaches may be exploited to improve fault detection accuracy, as addressed in [56].
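As an illustration only, the following minimal Python sketch shows how such empirical thresholds could be computed from fault-free residuals and then used to flag anomalies; the variable names, the NumPy-based implementation and the interpretation of the ±10% margin (here read as a widening of the fault-free minimum and maximum by 10% of their range) are assumptions, not the authors' code.

    import numpy as np

    def empirical_thresholds(residuals_fault_free, margin=0.10):
        """Lower/upper thresholds: fault-free min/max of each residual, widened by a 10% margin.

        residuals_fault_free: array of shape (T, n), one column per monitored variable.
        """
        lo = residuals_fault_free.min(axis=0)
        hi = residuals_fault_free.max(axis=0)
        span = hi - lo
        return lo - margin * span, hi + margin * span

    def detect_fault(y, y_hat, lower, upper):
        """Flag time steps where any residual eps_i = y_i - y_hat_i leaves its fault-free band."""
        eps = y - y_hat
        return np.any((eps < lower) | (eps > upper), axis=1)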
The aim of the work is the early detection of faults based on an accurate estimation of the normal operation behavior of the monitored WT. In a TL sense, the task to be learned is the same on the source and the target data sets, namely the accurate estimation of the normal operating behavior to enable early detection of developing faults based on the deviation from normal behavior. Observations of actual WT faults are relatively scarce and of diverse types. Therefore, many automated fault detection algorithms rely on evaluating signal reconstruction errors in normal behavior modeling approaches instead of learning fault signatures [7,8]. Normal behavior models were shown to successfully detect developing faults in WT drive trains, e.g., in [6]. They constitute one of the most relevant SCADA–based fault detection approaches in practice [8]. Single- and multi-target NBMs have been proposed, with multi-target NBMs being capable of exploiting the covariance among the target variables y1(t), y2(t), ..., yn(t), which can enable more accurate normal behavior models and earlier fault detection [4,5]. In practice, a separate WT–specific NBM is usually trained and deployed to operation for every WT. This is done to account for site–specific aspects, such as wake effects, on the WT operation, to achieve a high NBM monitoring accuracy and, thus, enable earlier detection of developing faults. Training an NBM requires the availability of training data {x1(t), x2(t), ..., xm(t), y1(t), y2(t), ..., yn(t)} that are representative of the current normal operation behavior of the WT under all operating conditions. In practice, ideally at least one year of SCADA data is used for training an NBM. However, so much representative training data are not always available. Newly commissioned WTs and WTs in their first year of operation naturally lack these training data. Even older WTs may be scarce in representative SCADA data that can be used for training an NBM. For example, this may be the case after control software or hardware updates.
To overcome the scarcity of training data, this work proposes to exploit the information contained in the NBM of a WT abundant in training data (source WT) and to transfer that information to the WT lacking representative training data (target WT), as remarked in Figure 1. Different computational strategies will be analyzed for this knowledge transfer in the following. The NBMs trained in this study are convolutional neural networks as detailed below. As shown in Figure 3, five different transfer strategies are considered.
Figure 3 Strategy 1 for training an NBM for fault detection tasks on a target WT.
In particular, three different transfer strategies (strategies 1-3) are compared to two strategies in which an NBM is trained from scratch at least partly on SCADA data from the target WT (strategies 4 and 5).
Therefore, according to transfer strategy 1 of Figure 3, the model is trained on the training set of the source WT. The scarce training data of the target WT are used to retrain only the last layers of the model. The network weights of the other layers are transferred to the target WT without any changes. In more detail, the NBM was trained and optimized on the source WT training and validation sets. The NBM for the target WT was initialized with the weights of the CNN trained on the source WT. Thus, the initial state of the target NBM already comprised knowledge about the normal operation behavior of a WT and the covariance among the input and target variables even before the training on the target WT’s training data started. Next, the weights of the two last layers of the NBM were retrained on the 24-hour sequences of the training set of the target WT. The last two layers were the two fully connected layers with eight neurons in the hidden layer and seven in the output layer, as detailed below.
Strategy 2 in Figure 4 is the same as strategy 1, except that all the network layers are retrained on the target WT data. In particular, this strategy involves an NBM trained and optimized on the source WT training and validation sets, followed by initializing the target WT’s NBM with the thus trained CNN and its weights. Unlike strategy 1, however, the weights of all layers were retrained on the target WT’s training set.
Figure 4 Strategy 2 for training an NBM for fault detection tasks on a target WT.
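For illustration, a minimal Keras-style sketch of the weight transfer used in strategies 1 and 2 is reported below. It assumes a source model source_nbm has already been trained with the CNN architecture described in Section 4; freezing all but the last two (dense) layers corresponds to strategy 1, while leaving every layer trainable corresponds to strategy 2. This is a sketch under these assumptions, not the authors' implementation.

    import tensorflow as tf

    def transfer_nbm(source_nbm, x_target_train, y_target_train,
                     retrain_last_layers_only=True):
        """Initialize the target NBM with the source weights, then fine-tune it.

        retrain_last_layers_only=True  -> strategy 1 (only the two dense layers are retrained)
        retrain_last_layers_only=False -> strategy 2 (all layers are retrained)
        """
        target_nbm = tf.keras.models.clone_model(source_nbm)     # same architecture as the source NBM
        target_nbm.set_weights(source_nbm.get_weights())          # transfer the learned weights
        if retrain_last_layers_only:
            for layer in target_nbm.layers[:-2]:
                layer.trainable = False                            # keep the transferred feature extractor fixed
        target_nbm.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')
        target_nbm.fit(x_target_train, y_target_train,
                       epochs=100, batch_size=256)                 # scarce 24-hour sequences of the target WT
        return target_nbm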
Strategy 3 of Figure 5 involves training the model on the source WT and transferring and applying the trained model to the target WT data without any modifications. In more detail, an NBM was trained on the source WT’s data as discussed above. The trained model was then applied to the test data of the target WT without any retraining or other customization to the target WT. Therefore, no condition monitoring data from the target WT were used as part of the training or validation according to strategy 3.
Figure 5 Strategy 3 for training an NBM for fault detection tasks on a target WT.
Strategy 4 of Figure 6 provides that the NBM is trained from scratch on the merged SCADA training data sets from the source and the target WTs. In particular, this strategy provides that the SCADA data of the source and the target WTs are combined prior to the NBM training. A CNN was trained on the merged data set of source and target WT SCADA data. The trained NBM was then tested on the target WT.
Figure 6 Strategy 4 for training an NBM for fault detection tasks on a target WT.
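A sketch of the data pooling step of strategy 4 is reported below; the array names and the build_cnn_nbm helper (a fresh, untrained CNN with the architecture described in Section 4, where a sketch of it is given) are hypothetical, not the authors' code.

    import numpy as np

    # Strategy 4: pool the source and the (scarce) target training sequences,
    # then train a fresh NBM from scratch on the merged data set.
    x_merged = np.concatenate([x_source_train, x_target_train], axis=0)
    y_merged = np.concatenate([y_source_train, y_target_train], axis=0)
    merged_nbm = build_cnn_nbm()                                    # assumed helper, see Section 4
    merged_nbm.fit(x_merged, y_merged, epochs=100, batch_size=256)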
Finally, strategy 5 in Figure 7 involves training an NBM from scratch on the scarce training data set of the target WT, without making use of the NBM or of the training data from the source WT. In more detail for this strategy, no SCADA data or other information from the source WT were used to train the target WT’s NBM. Thus, training, validation and testing were performed from scratch on the SCADA data of the target WT. The performance of each strategy was measured in terms of the Root Mean Square Error (RMSE) of the trained normal behavior model on the target WT test set.
Figure 7 Strategy 5 for training an NBM for fault detection tasks on a target WT.
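The RMSE used to compare the strategies can be computed on the target WT test set, for instance, as in the following sketch (hypothetical names, not the authors' code):

    import numpy as np

    def nbm_rmse(nbm, x_test, y_test):
        """Root mean square error of the NBM estimates on the target WT test set."""
        y_hat = nbm.predict(x_test)
        return float(np.sqrt(np.mean((y_test - y_hat) ** 2)))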
4. Simulation Results
SCADA data from six 3–MW onshore WTs were used for a case study in which the performances of the proposed TL strategies were compared. Eight combinations of source and target WTs were randomly selected from the available six WTs. The WTs are commercially operated horizontal-axis variable-speed machines with pitch regulation. Fourteen months of SCADA data were available from each WT. The SCADA data were provided at 10-minute resolution as mean values computed over periods of ten minutes. The location and observation time of the WTs are not disclosed to ensure the operator's privacy.
The SCADA target variables y1(t), y2(t), ..., y7(t) to be monitored in the case study considered in this work are provided in Table 1 and Table 2 along with the SCADA features x1(t), x2(t), x3(t) to estimate them.
Table 2 SCADA monitored target.
In particular, Table 2 reports the target variables yi(t) to be monitored and the features xi(t) used for estimating the normal behavior model based on which fault detection is performed. The features comprise the 10–minute mean wind speeds and wind directions measured at the nacelle and the ambient air temperatures. Wind speed is the main predictor of the target variables. Wind speed and air temperature are also relevant features for the environmental conditions, including wake effects and air pressure changes. The ambient temperature is an imperfect proxy of the air pressure in this study because air pressure measurements were unavailable from the WTs.
It is worth noting that the SCADA features xi(t), such as wind speed and air temperature reported in Table 1 and Table 2 are inherently included in the target data, since the wind turbine control system regulates the wind turbine outputs by monitoring these variables, as described, e.g., in [57].
The SCADA data were normalized to ensure all features have a similar order of magnitude and to enable fast model training. Then, the data of the source and target WTs were each split into training, validation and test sets, with 20% of both data sets used as the respective test sets and 10% as validation sets. The same split was applied for training and testing all five strategies and models compared in this study.
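A minimal sketch of this preprocessing step is given below, assuming the SCADA records are held as NumPy arrays, that a chronological split is used and that min-max scaling is applied; the exact normalization used by the authors is not specified, so these are assumptions.

    import numpy as np

    def normalize(data):
        """Min-max scaling so that all SCADA features share a similar order of magnitude."""
        d_min, d_max = data.min(axis=0), data.max(axis=0)
        return (data - d_min) / (d_max - d_min + 1e-12)

    def split(data, test_frac=0.2, val_frac=0.1):
        """Chronological split into training, validation and test sets (20% test, 10% validation)."""
        n = len(data)
        n_test, n_val = int(test_frac * n), int(val_frac * n)
        return (data[: n - n_val - n_test],            # training set
                data[n - n_val - n_test: n - n_test],  # validation set
                data[n - n_test:])                     # test set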
The SCADA features served as input to train a normal behavior model that provides estimates yˆ1(t), yˆ2(t), ..., yˆ7(t) of the target variables. A multi-target Convolutional Neural Network (CNN) model f was trained to this end. The WTs’ normal operation behavior was modeled with a CNN to capture the time dependence of the WT operation patterns. Sequences of 144 time steps, corresponding to 24 hours, were created from the normalized time series as input to the CNN. Thus, the CNN predicts the target values based on a sequence of past operation states rather than on same-time observations only. Each input sample corresponded to 144 time steps by 3 input variables, obtained with a moving window over the past 144 time steps. The CNN was trained on the resulting sequences of the source WT’s training set to compare strategies 1-5, as shown in Figures 3, 4, 5, 6, and 7.
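As an illustration, such a sliding-window construction of the input sequences might look as follows (a sketch assuming 10-minute NumPy time series; not the authors' code):

    import numpy as np

    def make_sequences(x, y, window=144):
        """Build (window, n_features) input sequences with the corresponding same-time targets.

        x: array (T, 3) of SCADA features, y: array (T, 7) of monitored target variables.
        Each sample covers the 144 ten-minute steps (24 hours) ending at time t.
        """
        xs, ys = [], []
        for t in range(window, len(x) + 1):
            xs.append(x[t - window: t])
            ys.append(y[t - 1])
        return np.asarray(xs), np.asarray(ys)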
CNNs are artificial neural networks that comprise layers from which feature maps are compiled by filters moving over the SCADA sequences with a convolution operation [11,58]. Note that the architecture of the CNN was defined as part of a hyper-parameter optimization process in which a grid search was performed over the number of layers and the layer sizes that resulted in the lowest validation set error. Architectures with up to two convolutional and two dense hidden layers with different numbers of neurons were compared to this end. Hidden layer sizes of 4, 8 and 16 neurons were assessed, and convolutional filter sizes of 2 and 3 were included in the optimization process. The mean squared error was chosen as the loss function for training the CNN and to compare the models’ performances. The resulting CNN architecture comprises a convolutional layer with 16 convolutional filters, a max pooling layer, and a second convolutional layer with 16 filters. The features thus extracted were passed to a dense hidden layer with 8 neurons connected to a dense output layer with seven neurons corresponding to the seven target variables. The CNN was trained and optimized on the training and validation sets of the source WT to establish a normal behavior model of the source WT. The Adaptive Moment Estimation (Adam) optimizer was employed for stochastic gradient descent [59]. The training was conducted over 100 epochs with a batch size of 256 samples.
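The resulting architecture can be sketched in Keras as follows; the kernel sizes, the pooling size, the activations and the flattening step are assumptions where the text does not specify them, so this is an approximate reconstruction rather than the authors' exact model.

    import tensorflow as tf

    def build_cnn_nbm(window=144, n_features=3, n_targets=7):
        """Multi-target CNN normal behavior model (approximate reconstruction)."""
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(window, n_features)),                    # 24-hour SCADA sequences
            tf.keras.layers.Conv1D(16, kernel_size=2, activation='relu'),  # 16 convolutional filters
            tf.keras.layers.MaxPooling1D(pool_size=2),
            tf.keras.layers.Conv1D(16, kernel_size=2, activation='relu'),  # second convolutional layer
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(8, activation='relu'),                   # dense hidden layer
            tf.keras.layers.Dense(n_targets),                              # one output per monitored variable
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')    # Adam + mean squared error
        return model

    # Training on the source WT (hypothetical array names):
    # source_nbm = build_cnn_nbm()
    # source_nbm.fit(x_source_train, y_source_train,
    #                validation_data=(x_source_val, y_source_val),
    #                epochs=100, batch_size=256)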
The TL strategies of Figures 3, 4, 5, 6, and 7 were compared in this case study as follows.
For strategy 1, the CNN normal behavior model was trained and optimized on the source WT training and validation sets as discussed above. The CNN NBM for the target WT was initialized with the weights of the CNN trained on the source WT. Thus, the initial state of the target NBM already comprised knowledge about the normal operation behavior of a WT and the covariance among the input and target variables even before the training on the target WT’s training data started. Next, the weights of the two last layers of the CNN were retrained on the 24-hour sequences of the training set of the target WT. The last two layers were the two fully connected layers of the CNN with eight neurons in the hidden layer and seven in the output layer, as described above. The training was again performed with 100 epochs and a batch size of 256.
Strategy 2 involved a CNN NBM trained and optimized on the source WT training and validation sets, followed by initializing the target WT’s NBM with the thus trained CNN and its weights. Unlike strategy 1, however, the weights of all layers were retrained on the target WT’s training set.
According to strategy 3, a CNN was trained as the NBM on the source WT’s data as discussed above. The trained model was then applied to the test data of the target WT without any retraining or other customization to the target WT. So no condition monitoring data from the target WT were used as part of the training or validation according to strategy 3.
Strategy 4 provided that the SCADA data of the source and the target WTs were combined prior to the NBM training. A CNN was trained on the merged data set of source and target WT SCADA data. The trained NBM was then tested on the target WT.
According to strategy 5, no SCADA data or other information from the source WT were used to train the target WT’s NBM. Thus, training, validation and testing were performed from scratch on the SCADA data of the target WT. The performance of each strategy was measured in terms of RMSE of the trained normal behavior model on the target WT test set.
An issue of particular interest was how the five strategies compare for various amounts of training data from the target WT. The performances of the five strategies were compared for a range of scarce training set sizes of the target WT in the case study. To this end, a source WT was randomly selected from among the six WTs and the transfer to four other, randomly selected target WTs was investigated. Figure 8 shows how the performance of the normal behavior models trained by each of the five strategies varies with the size of the training data set of the target WT. Training set sizes of the target WT between 0.3 and 11 months were investigated.
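This evaluation can be organized as a simple loop over training-set sizes and strategies, sketched below; the strategy callables, the assumed 11 months of available target training data and the use of the nbm_rmse helper from Section 3 are illustrative assumptions, not the authors' code.

    def evaluate_strategies(strategies, source_nbm, x_tr_full, y_tr_full,
                            x_test, y_test, sizes_months):
        """RMSE on the target test set for each training-set size and each strategy.

        strategies: dict mapping a strategy name to a callable
                    (source_nbm, x_train, y_train) -> trained target NBM.
        """
        results = {}
        for months in sizes_months:                      # e.g. from 0.3 to 11 months
            n = int(len(x_tr_full) * months / 11.0)      # assumes 11 months of target training data in total
            x_tr, y_tr = x_tr_full[:n], y_tr_full[:n]
            for name, train in strategies.items():       # strategies 1-5
                nbm = train(source_nbm, x_tr, y_tr)
                results[(months, name)] = nbm_rmse(nbm, x_test, y_test)
        return results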
Figure 8 NBM accuracy for 4 randomly selected source WTs.
In particular, Figure 8 depicts the accuracy of the NBM trained for a randomly selected source WT and 4 randomly selected target WTs for varying amounts of target WT training data. Each subfigure corresponds to a different target WT.
Subsequently, this process was repeated by randomly selecting another source WT and four target WTs from among the six WTs, and then training the NBMs for varying amounts of target training data. The accuracy of the normal behavior models resulting from the five learning strategies is shown in Figure 9 for target WT training set sizes of between 0.3 and 11 months.
Figure 9 NBM accuracy for different randomly selected source WTs.
In particular, Figure 9 depicts the accuracy of the NBMs trained for another randomly selected source WT, which differs from the one in Figure 8. Also in this case, four randomly selected target WTs were considered for varying amounts of target WT training data. Each subfigure corresponds to a different target WT.
A separate CNN NBM was trained for each strategy and each training set size. A random sampling of the source and target WTs was applied to assess eight combinations of source and target WTs instead of evaluating all 30 possible combinations. The sampling of eight out of 30 pairs of source and target WTs was performed due to the high computational cost of the analysis. Each subfigure of Figure 8 and Figure 9 involved the training, validation and testing of 55 separate CNNs, corresponding to several hours of run time on a standard medium-performance PC.
Finally, the algorithmic complexity depends on the complexity of the CNN models. Given the high variability within each class and the high similarity between certain data classes, several convolutional layers and fully connected dense layers are needed to accurately distinguish between the 4 classes. The model consists of alternating 2D convolutional layers and max-pooling layers. The final two layers are fully connected dense layers. The first dense layer has 64 dimensions and the second layer has 4 dimensions, representing the 4 classes. The final dense layer outputs a 1D array of length 4, whose highest value indicates the prediction of the model for the given pattern and target data.
The models were trained for about 200 epochs with a learning rate of 0.0005. Slow learning was preferred in these models due to their complexity. Slow learning also allowed for size constriction within the models, as previous attempts in the training process with a faster learning rate required significantly larger models to achieve comparable results.
Using smaller models was thus found to be less resource-, memory- and time-intensive. The resulting model consisted of 60,772 trainable parameters, about 99.78% smaller than the average size of the comparable models.
5. Discussion and Remarks
The results show that the TL schemes (strategies 1 and 2) drastically improved the accuracy of the normal behavior models for fault detection tasks with scarce target training data compared to strategies 3 and 5. The latter ones are referred to as naive strategies in the sense that they only use either the operation knowledge of the source WT or the scarce operation knowledge of the target WT. Strategy 3 involves training an NBM of the source WT and applying that model to the target WT without any modifications. Strategy 3 can be considered a simple TL strategy because the NBM is directly transferred and applied to the target WT without any customization. This strategy provides a constant NBM accuracy on the target WT as the model training does not involve any SCADA data from the target WT. Strategy 5 provides that an NBM of the target WT is trained from scratch on the scarce training data of the target WT. Figures 8 and 9 show that this strategy is by far the worst among all investigated strategies in case of scarce target WT data. It results in the lowest NBM accuracy on the target WT test set. The scarcer the target training data, the worse the performance of strategy 5 in absolute and relative terms compared to the other strategies. With increasing amounts of target WT training data, the performance of strategy 5 starts to approach the performances of the other four strategies and converges with them at about 8–10 months of target WT training data. At the same time, applying the source WT model without modifications (strategy 3) tends to be outperformed by the other four strategies at these target WT training data sizes. This is because strategy 3 is the only strategy that does not consider any operational data from the target WT in the target NBM training. This reduces its performance relative to the other strategies when more and more target WT training data become available.
The study also highlighted that strategies using combined operational knowledge from source and target WTs outperform naive strategies 3 and 5. As expected, the performance of TL strategies 1 and 2 and of strategy 4 tends to increase with larger amounts of target WT training data. If training data are scarce, naive strategies 3 and 5 result in lower NBM accuracy and, therefore, in larger fault detection delays in the target WT than the other strategies. The latter also includes strategy 4 which requires the NBM training to be performed on the combined training sets of the source and target WTs. Strategy 4 achieves similar NBM accuracy and, thus, similar fault detection delays as strategies 1 and 2, thereby clearly outperforming the naive strategies. Transfer learning strategies 1 and 2 perform very similarly, with strategy 1 (retraining last layers only) being moderately outperformed by strategy 2 (retraining all layers) in the case of more dissimilar source and target WTs as measured by the accuracy of the NBM trained following strategy 3. Strategies 3 and 4 converge towards identical training datasets and, thus, the same degree of accuracy as the size of the target WT training set is decreased.
The main research issues investigated in this work concern the reuse and the transfer of operational knowledge across WTs; the effect of this procedure on the performance of fault detection models for the target WT; the improvement in accuracy provided by TL for SCADA–based fault detection with scarce target WT training data, compared to training from scratch; the dependence of this accuracy improvement on the size of the target WT training set; and the effect of a one-fits-all NBM used for fault detection relative to an NBM customized to the target WT. These topics were analyzed using five knowledge transfer strategies and investigated in six commercial multi–MW WTs.
Major advantages of TL that were demonstrated in other fields include that it can enable the training of models for the target system even if few or no training data are available for that system [48]. Hence, less training data are required and more accurate model predictions can be obtained with the scarce training data. The work confirmed these advantages in the performed case study. Comparing the performances of the five proposed strategies, it was found that the TL strategies 1 and 2 and strategy 4, which involves training on the combined source and target data, outperform the naive strategies 3 and 5. Thus, strategies that use the combined operational knowledge available from both the source and the target WTs outperform the naive strategies in case of scarce training data of the target WT. Strategies 3 and 5 use training data from only the source or the target WT, which results in less accurate normal behavior models and larger fault detection delays when target WT training data are scarce. It was demonstrated that operators and asset managers could overcome a lack of representative training data by reusing and transferring operational knowledge across WTs by adopting the proposed TL strategies 1 and 2 or strategy 4. Therefore, knowledge transfer across WTs can drastically improve the accuracy of SCADA-based normal behavior models and fault detection and, thus, reduce fault detection delays in the case of scarce target WT training data. This improvement is particularly pronounced when compared to training an NBM from scratch on the scarce target WT training data. In fact, the scarcer the target WT training data, the more strongly strategies 1, 2 and 4 outperform the training from scratch on the target WT (strategy 5). NBMs trained with strategies 1, 2 or 4 also tended to outperform one-fits-all NBMs trained on the source WT and applied to the target WT without modifications (strategy 3).
Even though SCADA data from the source WT tend to provide only an imperfect description of the NBM of the target WT, they are plentifully available. They can be used to train a more accurate NBM for the target WT in case of scarce training data. The results demonstrated that even small amounts of target WT training data could be applied beneficially to fine-tune and improve the performance of the NBM of the target WT. It was also found that an NBM should be customized for each WT. A one-fits-all NBM for all WTs in the wind farm (strategy 3) and training from scratch despite the lack of training data (strategy 5) are the worst performing ones of the five compared strategies. The developed knowledge transfer strategies facilitate early fault detection in WTs with scarce SCADA data. They help improve the uptime of WT fleets by facilitating earlier and more informed responses to unforeseen maintenance actions, particularly for target WTs with limited availability of past condition monitoring data.
6. Conclusion
Data-driven fault detection in wind turbines requires sensor data that represent the past and present normal operation behaviour of the monitored turbines. Such data are often unavailable, for example, after the commissioning and in the first months of operation of the turbines, but also after major turbine updates. Fault detection models can barely be trained under these circumstances. To overcome these limitations, the paper proposed and demonstrated the reuse and transfer of knowledge about the normal operation behavior from turbines with sufficient training data to turbines with scarce training data. The results are very encouraging and open up further research questions. Future studies should investigate, for example, how to select the optimal source wind turbines for the knowledge transfer of machine health monitoring tasks. It will also be interesting to study transfer learning strategies for varying degrees of operational similarity between machines of the same fleet and across different fleets. Moreover, transfer learning strategies might reduce the amount of SCADA data storage needed for condition monitoring tasks. Future studies should also investigate this use case in more detail.
Author Contributions
Prof. Silvio Simani was responsible for project development, for paper writing and corrections, Dr. Saverio Farsoni conducted data collection, and Prof. Paolo Castaldi supervised the project.
Competing Interests
The authors have declared that no competing interests exist.
References
- Frank S, Gusti M, Havlík P, Lauri P, DiFulvio F, Forsell N, et al. Land-based climate change mitigation potentials within the agenda for sustainable development. Environ Res Lett. 2021; 16: 024006. [CrossRef]
- Liu H, Khan I, Zakari A, Alharthi M. Roles of trilemma in the world energy sector and transition towards sustainable energy: A study of economic growth and the environment. Energy Policy. 2022; 170: 113238. [CrossRef]
- Gonzalo AP, Benmessaoud T, Entezami M, Márquez FPG. Optimal maintenance management of offshore wind turbines by minimizing the costs. Sustain Energy Technol Assess. 2022; 52: 102230. [CrossRef]
- Meyer A. Multi-target normal behaviour models for wind farm condition monitoring. Appl Energy. 2021; 300: 117342. [CrossRef]
- Meyer A. Early fault detection with multi-target neural networks. Proceeding of the computational science and its applications–ICCSA 2021; 2021 September 13-16; Cagliari, Italy. Cham: Springer. [CrossRef]
- Schlechtingen M, Santos IF, Achiche S. Wind turbine condition monitoring based on SCADA data using normal behavior models. Part 1: System description. Appl Soft Comput. 2013; 13: 259-270. [CrossRef]
- Stetco A, Dinmohammadi F, Zhao X, Robu V, Flynn D, Barnes M, et al. Machine learning methods for wind turbine condition monitoring: A review. Renew Energy. 2019; 133: 620-635. [CrossRef]
- Tautz‐Weinert J, Watson SJ. Using SCADA data for wind turbine condition monitoring–A review. IET Renew Power Gener. 2017; 11: 382-394. [CrossRef]
- Zaher A, McArthur S, Infield D, Patel Y. Online wind turbine fault detection through automated SCADA data analysis. Wind Energy. 2009; 12: 574-593. [CrossRef]
- Jamil F, Verstraeten T, Nowé A, Peeters C, Helsen J. A deep boosted transfer learning method for wind turbine gearbox fault detection. Renew Energy. 2022; 197: 331-341. [CrossRef]
- LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015; 521: 436-444. [CrossRef]
- Gavali P, Banu JS. Deep convolutional neural network for image classification on CUDA platform. In: Deep learning and parallel computing environment for bioengineering systems. Amsterdam, Netherlands: Elsevier Inc.; 2019. pp. 99-122. [CrossRef]
- Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991; 4: 251-257. [CrossRef]
- Lei Y, Jia F, Lin J, Xing S, Ding SX. An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Trans Ind Electron. 2016; 63: 3137-3147. [CrossRef]
- Zhang W, Li C, Peng G, Chen Y, Zhang Z. A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load. Mech Syst Signal Process. 2018; 100: 439-453. [CrossRef]
- Chen L, Xu G, Zhang Q, Zhang X. Learning deep representation of imbalanced SCADA data for fault detection of wind turbines. Measurement. 2019; 139: 370-379. [CrossRef]
- Li C, Zhang S, Qin Y, Estupinan E. A systematic review of deep transfer learning for machinery fault diagnosis. Neurocomputing. 2020; 407: 121-135. [CrossRef]
- Verstraeten T, Nowé A, Keller J, Guo Y, Sheng S, Helsen J. Fleetwide data-enabled reliability improvement of wind turbines. Renew Sust Energ Rev. 2019; 109: 428-437. [CrossRef]
- Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. Artificial neural networks and machine learning–ICANN 2018; 2018 October 4-7; Rhodes, Greece. Cham: Springer. [CrossRef]
- Lu W, Liang B, Cheng Y, Meng D, Yang J, Zhang T. Deep model based domain adaptation for fault diagnosis. IEEE Trans Ind Electron. 2017; 64: 2296-2305. [CrossRef]
- Zhang R, Tao H, Wu L, Guan Y. Transfer learning with neural networks for bearing fault diagnosis in changing working conditions. IEEE Access. 2017; 5: 14347-14357. [CrossRef]
- Bhuiyan MR, Uddin J. Deep transfer learning models for industrial fault diagnosis using vibration and acoustic sensors data: A review. Vibration. 2023; 6: 218-238. [CrossRef]
- Wen L, Gao L, Li X. A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Trans Syst Man Cybern. 2019; 49: 136-144. [CrossRef]
- Guo L, Lei Y, Xing S, Yan T, Li N. Deep convolutional transfer learning network: A new method for intelligent fault diagnosis of machines with unlabeled data. IEEE Trans Ind Electron. 2019; 66: 7316-7325. [CrossRef]
- Yang B, Lei Y, Jia F, Xing S. An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings. Mech Syst Signal Process. 2019; 122: 692-706. [CrossRef]
- Yang X, Song Z, King I, Xu Z. A survey on deep semi-supervised learning. IEEE Trans Knowl Data Eng. 2022; 1-20. doi:10.1109/TKDE.2022.3220219. [CrossRef]
- Shao S, McAleer S, Yan R, Baldi P. Highly accurate machine fault diagnosis using deep transfer learning. IEEE Trans Industr Inform. 2019; 15: 2446-2455. [CrossRef]
- Cao P, Zhang S, Tang J. Preprocessing-free gear fault diagnosis using small datasets with deep convolutional neural network-based transfer learning. IEEE Access. 2018; 6: 26241-26253. [CrossRef]
- Antonini M, Barlaud M, Mathieu P, Daubechies I. Image coding using wavelet transform. IEEE Trans Image Process. 1992; 1: 205-220. [CrossRef]
- Smith WA, Randall RB. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech Syst Signal Process. 2015; 64: 100-131. [CrossRef]
- Yan X, She D, Xu Y. Deep order-wavelet convolutional variational autoencoder for fault identification of rolling bearing under fluctuating speed conditions. Expert Syst Appl. 2023; 216: 119479. [CrossRef]
- Yan X, She D, Xu Y, Jia M. Deep regularized variational autoencoder for intelligent fault diagnosis of rotor–bearing system within entire life-cycle process. Knowl Based Syst. 2021; 226: 107142. [CrossRef]
- Yan X, Liu Y, Jia M. Multiscale cascading deep belief network for fault identification of rotating machinery under various working conditions. Knowl Based Syst. 2020; 193: 105484. [CrossRef]
- Yan X, Liu Y, Xu Y, Jia M. Multistep forecasting for diurnal wind speed based on hybrid deep learning model with improved singular spectrum decomposition. Energy Convers Manag. 2020; 225: 113456. [CrossRef]
- Yan X, Liu Y, Jia M. Health condition identification for rolling bearing using a multi-domain indicator-based optimized stacked denoising autoencoder. Struct Health Monit. 2020; 19: 1602-1626. [CrossRef]
- Farsoni S, Simani S, Castaldi P. Fuzzy and neural network approaches to wind turbine fault diagnosis. Appl Sci. 2021; 11: 5035. [CrossRef]
- Hoskins J, Kaliyur K, Himmelblau DM. Fault diagnosis in complex chemical plants using artificial neural networks. AIChE J. 1991; 37: 137-141. [CrossRef]
- Himmelblau DM, Barker RW, Suewatanakul W. Fault classification with the AID of artificial neural networks. In: Fault detection, supervision and safety for technical processes. Baden, Germany: Elsevier Ltd.; 1991. pp. 369-373.
- Shaker MS, Patton RJ. Active sensor fault tolerant output feedback tracking control for wind turbine systems via T–S model. Eng Appl Artif Intell. 2014; 34: 1-12. [CrossRef]
- Uppal FJ, Patton RJ, Palade V. Neuro-fuzzy based fault diagnosis applied to an electro-pneumatic valve. IFAC Proc Vol. 2002; 35: 477-482. [CrossRef]
- Watanabe K, Matsuura I, Abe M, Kubota M, Himmelblau DM. Incipient fault diagnosis of chemical processes via artificial neural networks. AIChE J. 1989; 35: 1803-1812. [CrossRef]
- Xu D, Jiang B, Shi P. Nonlinear actuator fault estimation observer: An inverse system approach via a TS fuzzy model. Int J Appl Math Comput Sci. 2012; 22: 183-196. [CrossRef]
- Isermann R. On fuzzy logic applications for automatic control, supervision, and fault diagnosis. IEEE Trans Syst Man Cybern. 1998; 28: 221-235. [CrossRef]
- Palade V, Patton RJ, Uppal FJ, Quevedo J, Daley S. Fault diagnosis of an industrial gas turbine using neuro-fuzzy methods. IFAC Proc Vol. 2002; 35: 471-476. [CrossRef]
- Dai W, Yang Q, Xue GR, Yu Y. Boosting for transfer learning. Proceedings of the 24th international conference on machine learning; 2007 June 20-24; Corvalis Oregon USA. New York, NY, United States: Association for Computing Machinery. [CrossRef]
- Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010; 22: 1345-1359. [CrossRef]
- Olivas ES, Guerrero JDM, Martinez-Sober M, Magdalena-Benedito JR, Serrano L. Handbook of research on machine learning applications and trends: Algorithms, methods, and techniques. Hershey, PA: IGI global; 2009. [CrossRef]
- Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016; 3: 1-40. [CrossRef]
- Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, et al. A comprehensive survey on transfer learning. Proc IEEE. 2020; 109: 43-76. [CrossRef]
- Chatterjee J, Dethlefs N. Deep learning with knowledge transfer for explainable anomaly prediction in wind turbines. Wind Energy. 2020; 23: 1693-1710. [CrossRef]
- Yang X, Zhang Y, Lv W, Wang D. Image recognition of wind turbine blade damage based on a deep learning model with transfer learning and an ensemble learning classifier. Renew Energ. 2021; 163: 386-397. [CrossRef]
- Chen W, Qiu Y, Feng Y, Li Y, Kusiak A. Diagnosis of wind turbine faults with transfer learning algorithms. Renew Energ. 2021; 163: 2053-2067. [CrossRef]
- Guo J, Wu J, Zhang S, Long J, Chen W, Cabrera D, et al. Generative transfer learning for intelligent fault diagnosis of the wind turbine gearbox. Sensors. 2020; 20: 1361. [CrossRef]
- Li Y, Jiang W, Zhang G, Shu L. Wind turbine fault diagnosis based on transfer learning and convolutional autoencoder with small-scale data. Renew Energ. 2021; 171: 103-115. [CrossRef]
- Liu X, Cao Z, Zhang Z. Short-term predictions of multiple wind turbine power outputs based on deep neural networks with transfer learning. Energy. 2021; 217: 119356. [CrossRef]
- Chen J, Patton RJ. Robust model-based fault diagnosis for dynamic systems. Boston: Kluwer Academic Publishers; 1999. [CrossRef]
- Odgaard PF, Stoustrup J, Kinnaert M. Fault-tolerant control of wind turbines: A benchmark model. IEEE Trans Control Syst Technol. 2013; 21: 1168-1182. [CrossRef]
- LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. Handb Brain Theory Neural Netw. 1995; 3361: 1995.
- Li S, Chen H, Wang M, Heidari AA, Mirjalili S. Slime mould algorithm: A new method for stochastic optimization. Future Gener Comput Syst. 2020; 111: 300-323. [CrossRef]