Keywords

1 Introduction

Electrocardiogram (ECG) is a kind of weak electrical signal that reflects the beating law of the heart. As a common physiological signal in human body, ECG signals contain measurable characteristic discrepancies among different individuals. Generally, ECG signals are periodic and composed of the similar P-QRS-T waves. However, for different individuals, the position, period and amplitude of each characteristic point are different, which is the basis for ECG signals to be used in personal identification [1]. ECG signals include the following advantages: universality, uniqueness, stability and measurability, which are the necessary conditions of biometrics. In addition, ECG, as a biological feature of human body, is not easy to be stolen, and its safety is relatively higher. Meanwhile, since the ECG signal is one-dimensional, there is low computational complexity in preprocessing and feature processing.

At present, the research on ECG identification algorithm can be divided into two categories [2]: feature extraction algorithm based on reference point detection and feature extraction algorithm based on non-reference point detection. The feature extraction algorithm based on reference point detection mainly extracts the amplitude, interval, slope, area, angle and other geometric features of ECG signals for identification. Chen et al. [3] extracted five features of Q wave position, S wave position, QRS interval, RQ amplitude difference and RS amplitude difference, this method used Support Vector Machine (SVM) to classify and recognize; Palaniappan et al. [4] intercepted the QRS segments of ECG signals, and extracted five feature points and one morphological coefficient of the segment, this method used the Back-Propagation neural network (BP) to classify and recognize. The feature extraction algorithm based on non-reference point detection is mainly based on transform features, such as time-frequency transform, wavelet transform and sparse coding. Zhao et al. [5] used Fast Fourier’s matching tracking method and sparse decomposition of characteristic coefficients to classify and recognize by SVM; Chen et al. [6] used the singular value and dissimilarity distance of the wavelet transform matrix of ECG as the characteristic parameters, SVM was used to classify and 40 samples were identified, the recognition accuracy was 97.82%.

In the above literature, the feature extraction use the reference points overly relies on the positioning accuracy of the reference points, it only focuses on the local information and ignore the overall characteristics of the signal. The feature extraction algorithm based on non-reference point use all the information of the signal, so that this algorithm contains a large amount of redundant information, and the computation complexity is increased by feature transformation. For the common classification models, k-Nearest-Neighbor (KNN) is not regularized for identification, and class deviation is easy to occur in case of sample imbalance. Although SVM performs well, it is sensitive to missing data and the kernel functions should be selected carefully. The learning speed of neural network is slow, and it is easy to fall into local minimum.

Based on the above problems, the feature point location, feature redundancy and classification model selection are analyzed. An ECG identification algorithm based on PCA and Adaboost classifier is proposed in this paper. We extracted complete heart beats through R points, then the PCA was used to process the multidimensional features, removed inter correlation and redundant information, the PCA method can reduce the high-dimensional features to low dimensional features. Finally, Adaboost algorithm was used to construct strong classifier for classification and match. Adaboost algorithm has been successfully applied to face recognition [7], license plate recognition [8], disease diagnosis [9] and other fields because of its simple flow and ideal classification effect. The strategy of “reassigning weights” is adopted to combine weak classifiers weighted into strong classifiers with higher accuracy, which can improve the accuracy of ECG identification. In this paper, the simulation experiment based on ECG-ID database shows that the recognition accuracy and timeliness of the proposed algorithm are improved, which proves that the algorithm has a better performance.

2 ECG Signal Preprocessing and R Point Location

2.1 Denoising

ECG signal is a kind of weak electrical signal. It is easy to be disturbed by noise when collecting, which will affect the accuracy of identification. So the ECG signal needs to be denoised. The noises in the ECG signal mainly include baseline drift (<1 Hz), power frequency interference (50 Hz or 60 Hz), electromyographical interference (30–300 Hz). According to the frequency distribution of noise and ECG signal, we adopt the adaptive wavelet soft threshold method [10] for denoising. The sampling frequency of the signal in the ECG-ID database is 500 Hz. According to the Nyquist sampling theorem, we can know that the frequency of ECG signal is 0–250 Hz, the db4 wavelet is used to decompose the signal with 9 layers. The frequency distribution is shown in Table 1. The frequency of ECG signal mainly distributes in 0.05–100 Hz, of which QRS frequency is 3–40 Hz and S-T frequency is 0.7–10 Hz. So we use 9-layer db4 wavelet to decompose and reconstruct. According to the frequency distribution in the Table 1, the wavelet coefficients of the high frequency component in first layer can be directly set to zero, and the wavelet coefficients of the low frequency component in ninth layer can be set to zero. Then the soft threshold method is used to remove the noise which frequency is mixed with ECG signal.

Table 1. The 9 layer decomposition of ECG signal

First, the threshold of the high-frequency coefficients of each layer is determined, the formula is defined as follows:

$$ thr = \alpha_{k} \times \sqrt {2\log (n)} \times \sigma $$
(1)

Where \( n \) is the signal length of the threshold processing. \( a_{k} \) is the weighted threshold coefficient of the decomposition of each layer, and \( \sigma \) is defined as follows:

$$ \sigma = \frac{{median\left( {\left| {d(k)} \right|} \right)}}{0.6745} $$
(2)

Where \( d(k) \) is the wavelet coefficients of each scale. \( k \) is the number of layers being processed.

In order to adapt to the wavelet decomposition thresholds of different layers, the weighted threshold method is adopted and the weight is designed, the formula is defined as follows:

$$ \alpha_{k} = \left\{ {\begin{array}{*{20}l} {0.25} \hfill & {f \le 30\,{\text{Hz}}} \hfill \\ {0.5} \hfill & {30\,{\text{Hz}} < \,f\, < 125\,{\text{Hz}}} \hfill \\ 0 \hfill & {f \ge 125\,{\text{Hz}}} \hfill \\ \end{array} } \right. $$
(3)

Finally, the processed coefficients of the filtered coefficients are reconstructed, and the denoising results of individuals 3 and 9 are shown in Fig. 1. From the graph, we can see that the noise is basically removed, which can meet the needs of identification.

Fig. 1.
figure 1

The diagram of de-noising effect

2.2 R Point Location

The peak value of R wave is the largest in ECG signal, and its location is the simplest. In this paper, the second order difference threshold method [11] is used to locate the peak value of R wave. Figure 2 is the R point detection chart.

Fig. 2.
figure 2

The R point detection diagram

3 Our Algorithm

3.1 Principal Component Analysis (PCA)

Principal component analysis (PCA) [12] is a multivariate statistical algorithm for optimal orthogonal transformation in pattern recognition. It is mainly based on orthogonal projection to remove the correlation between data and maximize the variance of projection data. Using a few main features to replace the original ECG data can remove the correlation of ECG waveform characteristics. This algorithm reduces the data dimension and highlights the main characteristics of the data while retaining the main information of the ECG signal.

The process of principal component analysis is as follows:

  1. (1)

    Suppose the sample set \( X = (x^{ 1} ,x^{ 2} , \ldots ,x^{\text{m}} ) \) is composed of \( m \) heart beat samples, the dimension of each sample is \( n \). Take the sample set to remove the mean:

    $$ x_{ij}^{ * } = x_{ij} - \bar{x}_{i} \quad (i = 1,2, \ldots ,n;\;j = 1,2, \ldots ,m) $$
    (4)
  2. (2)

    Calculate the covariance matrix:

    $$ \Sigma = \frac{1}{m}XX^{T} $$
    (5)
  3. (3)

    Calculate the eigenvalue \( (\lambda^{ 1} \ge \lambda^{ 2} \ge , \ldots , \ge \lambda^{n} ) \) of the Σ and its corresponding eigenvector \( (\omega^{ 1} \ge \omega^{ 2} \ge , \ldots , \ge \omega^{n} ) \).

  4. (4)

    Determine the required low dimension \( d^{\prime } \) according to contribution rate

    $$ \varphi = \frac{{\sum\nolimits_{i = 1}^{{d^{\prime } }} {\lambda_{i} } }}{{\sum\nolimits_{{i{ = 1}}}^{n} {\lambda_{i} } }} $$
    (6)

    Where \( d^{\prime } \) is determined according to the demand of principal component contribution rate φ.

  5. (5)

    Transform n-dimensional sample set X into \( d^{\prime} \) dimension space:

    $$ Z = W^{T} X $$
    (7)
    $$ W = (\omega_{1} ,\omega_{2} , \ldots ,\omega_{{d^{\prime } }} ) \, $$
    (8)

3.2 Adaboost Algorithm

Adaboost algorithm was proposed and developed by Freund [13] in 1996. The Adaboost algorithm adopts the strategy of “reassigning weights” to train the weak classifiers for several rounds, and automatically improve the weights of the wrong samples in the previous training. The weights of weak classifiers with low misclassification rate are added, and the weight of every weak classifier is combined to the weight of the final strong classifier. Adaboost algorithm is often used in the study of binary classification problems, and ECG identification is a multi-classification problem, so the traditional Adaboost algorithm needs to be improved. Zhu et al. [14] proposed an improved algorithm called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME). It extends the Adaboost algorithm directly to multiple types of problems. The SAMME algorithm reduces the requirement for the correct class rate of the weak classifier from 1/2 to 1/k, which means that in the multi-classification problem, the performance of the weak classifier can be accepted as long as it is better than random guess.

The steps of this algorithm are as follows:

The sample set is given: \( \left\{ {(x_{1} ,y_{1} ),(x_{2} ,y_{2} ), \ldots ,(x_{m} ,y_{m} )} \right\} \) which \( x_{i} \) represent the ECG signal eigenvector, \( y_{i} \) represent the category label of \( x_{i} \), and \( y_{i} \in Y = (1,2, \ldots ,K) \).

  1. Step 1:

    Initialize the weight of training samples:

    $$ w_{1} = \frac{1}{m} (i = 1,2, \ldots ,m) $$
    (9)
  2. Step 2:

    For \( t = 1,2, \ldots ,T \) (\( T \) is the number of iterations):

    1. (1)

      Train the weak classifier \( h_{t} \) according to the sample distribution \( \omega_{t} \).

    2. (2)

      Calculate the Prediction Error of Weak Classifier \( h_{\text{t}} \):

      $$ e_{t} = \sum\limits_{i = 1}^{N} {\omega_{1} (i) \bullet 1[y_{i} \ne h_{t} (x_{i} )]} $$
      (10)

      Where \( 1[*] \) means that when \( [*] \) is established, it equals 1, otherwise it equals 0.

    3. (3)

      Calculate the weight of weak classifier \( \alpha_{t} \) based on prediction error \( e_{t} \):

      $$ a_{t} = \frac{1}{2}\ln (\frac{{1 - e_{t} }}{{e_{t} }}) + \log (K - 1) $$
      (11)

      Where K is the number of categories.

    4. (4)

      Reassign the next training sample according to the weight \( a_{t} \):

      $$ \omega_{t + 1} (i) = \frac{{\omega_{t} (i)}}{{B_{T} }}*\left\{ {\begin{array}{*{20}l} {e^{{ - a_{t} }} ,} \hfill & {if\quad y_{i} = h_{t} (x_{i} )} \hfill \\ {e^{{a_{t} }} ,} \hfill & {if\quad y_{i} = h_{t} (x_{i} )} \hfill \\ \end{array} } \right. $$
      (12)

      Where \( B_{T} \) is a normalization factor that normalizes \( \omega_{t}^{t + 1} \).

  3. Step 3:

    Built final strong classifier:

    $$ H(x) = \arg \mathop {\hbox{max} }\limits_{y \in Y} \sum\limits_{t = 1}^{T} {a_{t} \bullet 1[h_{t} (x_{i} ) = y_{i} ]} $$
    (13)

3.3 The Flow of PCA_Adaboost Algorithm

Based on the simple and efficient Adaboost algorithm, a strong classifier is formed by combining PCA and Adaboost. It improves the timeliness and accuracy of ECG recognition, and has a good feasibility and practical significance. The flow chart is shown in Fig. 3.

Fig. 3.
figure 3

The flow chart of PCA_Adaboost algorithm

4 Experiments and Results

4.1 Experimental Environment and Database

The experimental environment in this paper is the personal PC of the Windows10 operating system, the processor is Intel (R) Core (TM) i5-7500, the memory is 4 GB, and the compilation environments are Matlab2017b and Python3.6.

This data is derived from the ECG-ID database [15] in Physionet, which contains ECG signals from 90 people. The signal acquisition frequency is 500 Hz and the resolution is 12bit, of which the seventy-fourth individual only collect one signal, and the remaining 89 people have at least two ECG signals collected at different times. Since only one signal was collected, the seventy-fourth individuals does not satisfy the experimental conditions of identification in this paper, we will eliminate the number seventy-fourth individual. This paper will use the remaining ECG signals from 89 people to carry out the identification experiment.

4.2 Experiment and Result Analysis

In this paper, two ECG signals of each people in the database are taken as training data and test data respectively. First, the pretreatment of denoising is used. Then, 150 points are intercepted forward based on R-point location, and 300 points are intercepted backward to extract a total of 450-dimensionals waveform features, including 2116 training dataset beats and 2110 test dataset beats. PCA is used to reduce the dimension of waveform features.

As shown in Fig. 4, the cumulative contribution rate of the first 10 principal components is 93.56%. In view of the special security requirements of identity recognition, combined with the actual effect of Adaboost classifier classification, we finally selected the 30 dimensions of principal components, the cumulative contribution rate is 99.72%, and the loss of information can be ignored.

Fig. 4.
figure 4

The principal component contribution rate

Figure 5 is a comparison map of waveform characteristics and PCA dimension reduction characteristics. From the graph, we can see that the heart beat waveform of individual A is relatively close at different times, but it is quite different from individual B and individual C. For PCA dimensionality reduction features, it can be seen that the characteristics of the same individual are closer, and different individuals have more obvious differences. PCA dimensionality reduction highlights the main features of ECG data, reduces the dimensionality of ECG data, and reduces the correlation between ECG signals and the redundant information that interferes with the recognition accuracy. This algorithm improves the recognition efficiency and recognition accuracy.

Fig. 5.
figure 5

The waveform features and PCA dimensionality reduction features

Under the same experimental standard, the decision tree is used as the base classifier. And we compared the recognition accuracy of Adaboost algorithm with PCA feature dimension reduction and without PCA feature dimension reduction at different iterations, as shown in Fig. 6.

Fig. 6.
figure 6

The relationship between the accuracy and the number of iterations

It can be seen from the Fig. 6 that the recognition accuracy is positively correlated with the different number of iterations. The recognition accuracy is basically stable when the number of iterations is 40 without PCA feature reduction. The recognition accuracy is basically stable when the number of iterations is 30 with PCA feature reduction, and the recognition time is increased with the number of iterations. In this paper, 40 iterations were used to construct the recognition model.

In the same data set the algorithm proposed in this paper was compared with KNN, SVM, BP. And the accuracy rate and average time consuming is shown in Table 2. Among them, KNN classifier parameter K = 5; SVM classifier uses linear kernel function; BP neural network adopts 30-20-89 network [16], the transfer function is “tansig”, the minimum gradient is 1.0−e7, and the upper limit of iteration is 500.

Table 2. Comparison of different classifier recognition results

In this section, we defined the PCA and KNN recognition method is PCA_KNN, PCA and SVM recognition method is PCA_SVM, PCA and BP recognition method is PCA_BP. This algorithms were compared with KNN, SVM, BP, and Adaboost algorithms. And as shown in the Table 2, the methods combined with the PCA feature dimensionality reduction can improve the recognition accuracy and recognition timeliness. Among them, the accuracy of PCA_KNN algorithm is 92.13%, the accuracy of PCA_BP algorithm is 96.63%, the accuracy of PCA_SVM algorithm is 96.63%, and the accuracy of PCA_Adaboost algorithm can reach 98.88%. In view of the time-consuming of recognition, the PCA feature dimension reduction greatly reduces the time recognition required, while the average time consuming is basically reduced by one order of magnitude, and the timeliness of recognition is improved. Compared with the PCA_KNN algorithm, PCA _SVM algorithm PCA_BP method, the recognition time of the proposed algorithm is less than PCA_BP and PCA_SVM. Although it takes much more time than the PCA_KNN algorithm, the recognition accuracy is much higher than PCA_KNN algorithm. Experiments show that PCA algorithm extracts the principal components of ECG signals instead of the original ECG waveform features, and removes the correlation between ECG signals and redundant information interfering the recognition accuracy. The PCA algorithm improves the recognition accuracy, reduces the data dimension of ECG signals, reduces the computational complexity of the algorithm, and improves the recognition efficiency. The Adaboost algorithm adopts the strategy of “weight assignment”, to trains the training samples of ECG signals through weak classifiers, it also improves the weight of the wrong samples of the previous ECG signals, reduces the weights of the correct samples of the ECG signals, and increases the weights of the classifiers with small error rates, and reduces the weight of the classifier with large error rates. Several rounds of training and automatic adjustment are combined into a strong classifier with high recognition accuracy.

5 Conclusions

In order to improve the timeliness and accuracy of ECG identification, an ECG identity recognition method based on PCA and Adaboost algorithm was proposed in this paper. This paper mainly studies the dimensionality reduction and classification using waveform characteristics. Firstly, we extracted the single beat based on R point positioning, and PCA was used to reduce the dimension of this feature. Since PCA can remove the correlation and redundant information in original waveform features, the computation complexity of ECG identification using PCA features will be reduced. As a result, not only the running time of the algorithm can be decreased significantly, but also the need for timeliness of algorithm can be well satisfied. Then the Adaboost algorithm was applied to PCA features for model training. Specially, the “reassigning weights” strategy was adopted to combine multiple weak classifiers into one strong classifier for better classification effect. In our experiments, the algorithm was evaluated on ECG-ID database and accuracy of 98.88% could be achieved within 7 s. The experimental results show that our method has good performance on both identification accuracy and timeliness. The next step is to verify the reliability of the algorithm for the ECG signal data collected by ourselves.