Predicting Ischemic Stroke in Patients with Atrial Fibrillation Using Machine Learning

Background: Atrial fibrillation (AF) is a well-known risk factor for stroke. Predicting this risk is important for preventing first and recurrent cerebrovascular events through early treatment. This study aimed to predict ischemic stroke in AF patients from the massive and complex Korean National Health Insurance Service (KNHIS) data using a machine learning approach. Methods: We extracted 65-dimensional features, including demographics, health examination, and medical history information, of 754,949 patients with AF from KNHIS. Logistic regression was used to determine whether the extracted features had a statistically significant association with ischemic stroke occurrence. Then, we constructed the ischemic stroke prediction model using an attention-based deep neural network. The extracted features were used as input, and the occurrence of ischemic stroke after the diagnosis of AF was the output used to train the model. Results: We found 48 features significantly associated with ischemic stroke occurrence through regression analysis (p-value < 0.001). When the proposed deep learning model was applied to 150,989 AF patients, it predicted the occurrence of ischemic stroke with a higher AUROC (0.727 ± 0.003) than the CHA2DS2-VASc score (AUROC = 0.651 ± 0.007) and other machine learning methods. Conclusions: As part of preventive medicine, this study could help AF patients prepare for ischemic stroke prevention based on predicted stroke-associated features and risk scores.


Introduction
In Korea, cerebrovascular diseases are the fourth leading cause of death [1]. They can lead to various functional impairments such as motor weakness, sensory deficit, dysphagia, dysarthria, aphasia, cognitive impairment, and emotional disturbances [2][3][4]. Therefore, it is important to prevent primary and secondary stroke by providing appropriate treatments, such as oral anticoagulation in patients with atrial fibrillation (AF), through early detection.
AF is a common risk factor of cardioembolic cerebral infarction [5]. It accounts for 7 to 31 percent of stroke patients aged 60 years or older [6][7][8]. Thromboembolism in the left atrium caused by AF increases the risk of stroke four- to fivefold [7][8][9]. A recent population-based study identified AF as an independent predictor of 30-day and one-year mortality after a first ischemic stroke [10]. Approximately 17 percent of all deaths were attributable to ischemic stroke with AF. A previous observational study showed that stroke with AF leads to functional limitation and compromised quality of life [11]. Because of the high risk of recurrent embolism, risk calculation methods for stroke with AF are under development. In particular, because the pathophysiology of stroke in AF differs from that in non-AF patients, there is a need for a method that considers these characteristics [12][13][14].
Most previous studies for predicting ischemic stroke risk in patients with AF were based on statistical methods, such as the CHADS2 and CHA2DS2-VASc scores [15][16][17]. The CHADS2 score represents the risk of incident stroke using five factors: congestive heart failure, hypertensive diseases, age of 75 years or older, diabetes mellitus, and previous cerebrovascular attack [18]. However, CHADS2 is limited in that it cannot accurately evaluate low-risk groups. To improve the predictive performance in the low-risk group, the CHA2DS2-VASc score was proposed, which additionally considers the presence or absence of vascular diseases, age of 65-74 years, and female sex. Many clinicians have also used CHA2DS2-VASc scores as an indicator of bleeding risk when deciding on oral anticoagulants, which could explain the low use of anticoagulants in stroke patients with AF who have high CHA2DS2-VASc scores [19]. Although the CHA2DS2-VASc score was devised to compensate for the shortcomings of CHADS2, it still has limitations. First, CHA2DS2-VASc scores are limited in considering the various characteristics of ischemic stroke. For example, only vascular diseases are considered, and other mechanisms, such as large-artery atherosclerosis and small-vessel occlusion, are not. Second, CHA2DS2-VASc scores have modest performance in stroke risk prediction [20,21].
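The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function name and argument names are hypothetical, and the point weights (two points for age of 75 years or older and for previous stroke, one point for every other factor) are the commonly published CHA2DS2-VASc weights, which the text above does not spell out.

```python
def cha2ds2_vasc(age, female=False, heart_failure=False, hypertension=False,
                 diabetes=False, prior_stroke=False, vascular_disease=False):
    """Return a CHA2DS2-VASc score in the range 0-9.

    Weights are the commonly published ones (an assumption here,
    not taken from the study text).
    """
    score = 0
    score += 1 if heart_failure else 0      # C: congestive heart failure
    score += 1 if hypertension else 0       # H: hypertension
    if age >= 75:                           # A2: age 75 or older counts double
        score += 2
    elif age >= 65:                         # A: age 65-74
        score += 1
    score += 1 if diabetes else 0           # D: diabetes mellitus
    score += 2 if prior_stroke else 0       # S2: previous stroke counts double
    score += 1 if vascular_disease else 0   # V: vascular disease
    score += 1 if female else 0             # Sc: sex category (female)
    return score


# Hypothetical patient: 70-year-old woman with hypertension only.
example_score = cha2ds2_vasc(age=70, female=True, hypertension=True)
```

Because the score is a sum of a handful of binary indicators, it can take only ten distinct values, which is one reason its discrimination is limited compared with a model that uses many features.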
Herein, we present a machine learning-based method to predict the occurrence of ischemic stroke in AF patients based on the Korean National Health Insurance Service (KNHIS) data. Recent studies have demonstrated that patient data accumulated in electronic medical records (EMR) can be utilized to predict potential disease risk [22,23]. In Korea, more than 97% of the population is covered by the KNHIS program, and the remaining three percent are covered by a medical aid program operated by the KNHIS [24]. The KNHIS contains demographic, health examination, and medical use/transaction information on Koreans. Therefore, we hypothesized that the accumulated large-scale KNHIS information of AF patients can be used to predict the later occurrence of ischemic stroke. To handle the massive and complex KNHIS information, we adopted a deep neural network, which identifies patterns in the data through several hidden layers and expresses the influence of each column on the output as a weight. The evaluation results showed that many ischemic stroke patients were identified with high AUROC.

Data Sources
This study used KNHIS data from January 1, 2005 to December 31, 2018. Since 1995, KNHIS, the single national health insurer, has provided health examinations for all Koreans. The KNHIS database contains complete health information about approximately 50 million Koreans [25]. In this study, case subjects were defined as patients with AF who were newly diagnosed with ischemic stroke, and control subjects were those with AF who had not been diagnosed with ischemic stroke. We used the International Classification of Diseases, 10th revision (ICD-10) codes to identify patients with AF and those who had experienced ischemic stroke from the health claim records [26]. We identified patients diagnosed with AF (ICD-10: I48) between 2005 and 2013. Subsequently, we checked whether the selected patients were hospitalized for ischemic stroke (ICD-10: I63) within five years after the diagnosis of AF. Next, we collected demographic, health examination, and medical history information of the subjects from the KNHIS database. Demographic information contains gender, age, occupational status, and income level. The medical history includes information on the occurrence of 43 diseases (e.g., hypertensive disease, hemolytic anemia, chronic gastritis, hyperlipidemia, and thyroid diseases). The medical history information in the KNHIS database is built from the medical bills that medical service providers claimed for their expenses. Health examination includes the results of nine general laboratory tests (e.g., blood pressure and urinary protein) and six questionnaires on lifestyle and behavior (e.g., smoking, exercise, and drinking). A detailed description of the extracted information is provided in Supplementary Table 1.
The study protocol was approved by the Institutional Review Board of the National Health Insurance Service in Korea (NHIS-2020-4-109). The authors confirm that all methods were performed in accordance with relevant guidelines and regulations. The need for informed consent from participants was waived by the ethics committee of the Chonnam National University because this study involved routinely collected medical data that were anonymized at all stages to protect an individual's privacy.

Regression-based Statistical Analysis
Logistic regression is a statistical technique that estimates the relationship between a categorical dependent variable and several independent variables, and it is divided into two types according to the number of categories of the dependent variable [27]. Binary logistic regression is used when the dependent variable has two categories, 0 or 1, and multinomial logistic regression is used when the dependent variable has more than two categories. The binary logistic regression used in this study is expressed through the logit, the inverse of the logistic function, as shown in Eqns. 1 and 2, so that the relationship between the independent variables and the dependent variable becomes linear [28].
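The equations referred to as Eqns. 1 and 2 are not preserved in this text. For reference, the standard textbook form of the logit and its inverse, the logistic function, is given here. The symbols are assumptions: p is the probability of ischemic stroke occurrence, x_1, …, x_k are the independent variables, and β_0, …, β_k are the regression coefficients. The authors' original notation may differ.

```latex
\begin{align}
\operatorname{logit}(p) &= \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \tag{1}\\
p &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}} \tag{2}
\end{align}
```

The first line is linear in the independent variables, which is why each coefficient can be read as the change in log-odds per unit increase of its variable.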
The regression coefficient, standard error, Wald chi-square, and p-value were used for binary logistic regression analysis with maximum likelihood estimation. The regression coefficient indicates how much the log-odds of the dependent variable increases or decreases when the independent variable increases by one unit [29]. A positive coefficient indicates a positive association, and a negative coefficient indicates a negative one. The closer the coefficient is to zero, the smaller the effect of the independent variable [29,30]. The standard error is the standard deviation of the sampling distribution of the estimate and is used to judge whether the regression coefficient could have arisen by chance, revealing how close the sample estimate is to the population value [31]. The smaller the standard error, the closer the estimate is to the population value and the less likely it is that the regression coefficient arose by chance. This indicates that the association between the independent variable and the dependent variable is significant. The Wald chi-square is an index for evaluating the importance of each independent variable [27].
$$W = \left(\frac{\beta}{SE}\right)^{2}$$
where β is the coefficient, and SE is its standard error. The Wald chi-square is the ratio of the squared regression coefficient to its squared standard error and follows a chi-square distribution [27]. The higher the value, the smaller the p-value, indicating that the variable is important in explaining the dependent variable. The p-value is the probability of observing a result at least as extreme as that in the sample, assuming that the null hypothesis is correct [32]. A p-value less than a chosen significance level implies that the observed result is improbable under the null hypothesis and that there is a significant association between the dependent variable and the corresponding independent variable. Conversely, a p-value greater than the significance level indicates that there is no significant association between them. Therefore, the p-value for each feature tests the null hypothesis that the feature is not associated with the occurrence of ischemic stroke. In this study, we set the significance level of the p-value to 0.001. Multicollinearity was assessed with the tolerance and the variance inflation factor (VIF). The tolerance is defined as 1 − R², where R² is the coefficient of determination for the regression of a variable on the other independent variables. The VIF is defined as the reciprocal of the tolerance. A VIF value exceeding 10 is considered to indicate multicollinearity.
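The tolerance and VIF definitions above can be illustrated with a short NumPy sketch. This is an illustration under stated assumptions rather than the analysis code of this study: the function name `tolerance_and_vif` and the synthetic data are hypothetical, and each auxiliary regression is fitted by ordinary least squares with an intercept.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return (tolerance, VIF) per column of X, shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    tolerance = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        tolerance[j] = ss_res / ss_tot  # equals 1 - R^2 of the auxiliary fit
    return tolerance, 1.0 / tolerance


# Synthetic data (hypothetical): three independent columns versus a set in
# which the third column is almost the sum of the first two.
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 3))
a, b, noise = rng.normal(size=(3, 500))
collinear = np.column_stack([a, b, a + b + 0.05 * noise])

_, vif_independent = tolerance_and_vif(independent)
_, vif_collinear = tolerance_and_vif(collinear)
```

On data like this, the independent columns give VIF values close to 1, while the nearly collinear set exceeds the threshold of 10 by a wide margin.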

Deep Neural Network for Predicting The Occurrence of Ischemic Stroke in AF Patients
In this study, we used a deep neural network to predict the occurrence of ischemic stroke in AF patients based on KNHIS data (Fig. 1). The deep neural network is composed of multiple hidden layers between an input layer and an output layer. The multiple hidden layers enable the modeling of complex nonlinear relationships: each higher layer combines the features of the layer below it, so the network learns from data a complex function mapping the input to the output [33]. Among the 75 features extracted from KNHIS, we used 4 demographic features, 31 medical history features, and 13 health examination features, which were considered statistically significant through regression analysis. The dataset was split 6:2:2 into training, validation, and test sets, respectively. Then, the self-attention mechanism was applied to the deep learning model. The self-attention mechanism improves the prediction performance by estimating the importance of each feature [34]. Input features were fed to a fully-connected layer and a softmax layer to calculate the self-attention scores.
$$\mathrm{Attention}(X) = \mathrm{softmax}(g(X))$$
where X is the vector of selected input features, and g(•) is the fully-connected layer without activation, which can be represented as below.
$$g(X) = WX + b$$
where W = [w_1, w_2, …, w_n] is the weight matrix, and b is the bias of each unit. In this study, the output of the linear operator g(•) has the same size as the input; therefore, $W \in \mathbb{R}^{48 \times 48}$ and $b \in \mathbb{R}^{48}$. The output of g(•) is then fed into the softmax function, which returns a vector of non-negative numbers that sum to one.
Then, the component-wise multiplication between the input features and the self-attention score vector was performed.
$$o = X \odot \mathrm{softmax}(g(X))$$
where ⊙ is the component-wise multiplication operator. Then, we concatenated the output vector o and input fea-
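The attention step described in this section can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random initialization, the orientation of the product `W @ x`, and the order of the concatenation are all hypothetical choices, and the three-layer classifier that follows the attention step is omitted.

```python
import numpy as np


def softmax(z):
    """Map a vector to non-negative weights that sum to one."""
    shifted = z - np.max(z)  # shift for numerical stability; result unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_block(x, W, b):
    """Apply feature-wise self-attention to one input vector x of shape (n,)."""
    scores = softmax(W @ x + b)            # fully-connected layer, then softmax
    o = x * scores                         # component-wise multiplication
    combined = np.concatenate([o, x])      # concatenation order is an assumption
    return scores, o, combined


# Hypothetical example with 48 selected features and untrained weights.
rng = np.random.default_rng(0)
n_features = 48
x = rng.normal(size=n_features)
W = rng.normal(scale=0.1, size=(n_features, n_features))
b = np.zeros(n_features)

scores, o, combined = attention_block(x, W, b)
```

Because the softmax output sums to one, the scores act as a relative weighting across the 48 features rather than as independent gates.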

General Characteristic of the Study Population
From June 2005 to March 2013, a total of 754,949 patients were diagnosed with AF, of whom 62,226 (8.24%) were diagnosed with ischemic stroke within five years after the diagnosis of AF. Table 1 shows the frequency and proportion of the stroke and non-stroke groups for each variable. The insurance fee is the exception: because it is continuous, we report it as mean ± SD (range). The CHA2DS2-VASc score ranges from 0 to 9, with higher scores indicating higher risk. The mean CHA2DS2-VASc score was 2.15 points in the non-stroke group and 3.01 points in the stroke group. As expected, the stroke group had higher CHA2DS2-VASc scores than the non-stroke group. Next, we checked the individual risk factors of the CHA2DS2-VASc score, including five medical history factors, age, and sex. For all five diseases considered in the CHA2DS2-VASc score, the stroke group had a higher proportion of patients with that medical history than the non-stroke group. The mean (± SD) age of the patients was 64.6 ± 13.3 years in the non-stroke group and 71.5 ± 9.5 years in the stroke group. We also observed that the stroke group had a higher proportion of patients aged 65 to 74 years and 75 years or older than the non-stroke group. Regarding sex, the non-stroke group had a higher proportion of males (59.21%), whereas the stroke group had a higher proportion of females (50.34%). These results indicate that the CHA2DS2-VASc score reflects the characteristics of stroke because it assigns high scores to the five medical history features, older age, and female sex, all of which were more common in the stroke group. However, the difference between the two groups was not significant, and the proportion of patients who scored five or higher in the stroke group did not reach 20%. To estimate the strength of the association between the independent variables and the occurrence of ischemic stroke, relative risks are provided. A relative risk of 1 means that the independent variable does not affect the result. A relative risk higher than 1 means that the independent variable increases the risk of ischemic stroke. Conversely, a relative risk less than 1 means that the independent variable decreases the risk.

[Figure caption] Demographic, medical history, and health examination information was used as input, and occurrence of ischemic stroke in AF patients was used as output. We calculated attention scores for the input features and concatenated the attention scores with input features. The occurrence of ischemic stroke was predicted by a three-layer fully-connected neural network with non-linear activation function.
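The interpretation of relative risk given in this section can be made concrete with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and the counts are invented for the example and are not taken from Table 1.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts: 50 strokes among 200 patients with a given condition,
# 100 strokes among 800 patients without it.
rr = relative_risk(50, 200, 100, 800)
```

With these invented counts the risk is 25% in the exposed group and 12.5% in the unexposed group, so the relative risk is 2: the condition is associated with a doubled risk.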

Identifying Relationships between Features and Ischemic Stroke Occurrence
We used the coefficient values and p-values of logistic regression to identify the features related to ischemic stroke occurrence. The VIF values of the independent variables are reported in Supplementary Table 3. None of the VIF values exceeded 3. The results showed that age, sex, and occupational status were important factors in demographic information. We identified important features from medical history, including thyroid diseases, other cardiac arrhythmias, chronic lower respiratory diseases, hemolytic anemia, cancer, hemorrhoids, diabetes mellitus, hypertensive diseases, chronic kidney diseases, heart failure, hyperlipidemia, peripheral vascular disease, gout, noninflammatory gynecological problems, pulmonary embolism, and chronic gastritis. Features found to be significant by their p-values were further analyzed based on the coefficient, Wald chi-square, and odds ratio with 95% confidence interval (CI) (Supplementary Section 2, Supplementary Table 3 and Supplementary Fig. 1).
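The quantities reported for each significant feature (Wald chi-square, odds ratio, and 95% CI) follow directly from a coefficient and its standard error. The sketch below shows the usual calculation under the normal approximation; the function name and the example values of the coefficient and standard error are hypothetical and are not taken from Supplementary Table 3.

```python
import math


def wald_and_odds_ratio(beta, se, z=1.96):
    """Return the Wald chi-square, odds ratio, and CI bounds for one coefficient.

    z = 1.96 gives an approximate 95% confidence interval.
    """
    wald = (beta / se) ** 2
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - z * se)
    ci_high = math.exp(beta + z * se)
    return wald, odds_ratio, ci_low, ci_high


# Hypothetical coefficient and standard error for a single feature.
wald, odds_ratio, ci_low, ci_high = wald_and_odds_ratio(beta=0.5, se=0.1)
```

An interval that excludes 1 corresponds to a coefficient whose sign is supported at the chosen confidence level.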

Predicting the Occurrence of Ischemic Stroke in Patients with AF
Our method predicts the risk of ischemic stroke in AF patients based on KNHIS data. To assess the predictive performance, we evaluated the area under the receiver operating characteristic curve (AUROC), reported as the mean and SD over 10-fold cross-validation. We tested the performance for five different types of input feature sets: (i) using all features without feature selection (FS); (ii) using all features with FS; (iii) using demographic features with FS only; (iv) using medical his-  (Fig. 2a). From the results, we found that using all features with FS (AUROC = 0.722 ± 0.004) exhibited better performance than using all features without FS (AUROC = 0.714 ± 0.003) and using a single feature set only (AUROC = 0.613 to 0.679). These results suggest that the proposed model captures the complex associations of large-scale feature sets. Next, we compared our method with other machine learning methods, including logistic regression, XGBoost, and random forest (Fig. 2b). The hyperparameters of both XGBoost and random forest were optimized with Bayesian optimization. In both methods, the maximum depth of the tree was tuned in the range of 5 to 10, and the number of classifiers was tuned in the range of 10 to 500. To prevent overfitting, the minimum child weight (1-10) and learning rate (0.01-0.1) were tuned for XGBoost, and the minimum sample split (2-10) and minimum sample leaf (0.01-0.5) were tuned for random forest. The results indicate that the proposed deep learning model has a better AUROC than logistic regression (AUROC = 0.691 ± 0.005), XGBoost (AUROC = 0.708 ± 0.004), and random forest (AUROC = 0.694 ± 0.009). Furthermore, we compared the prediction performance of our method with the CHA2DS2-VASc scores (Fig. 2c). Since CHA2DS2-VASc scores are predictors of cardioembolic sources, we restricted the ischemic stroke patients to those with cerebral infarction due to embolism of cerebral arteries (ICD-10: I63.4).
A previous study indicated that about 73% of patients with the ICD-10 code I63.4 have a subtype diagnosis of cerebral embolism [35]. Through this process, we finally selected 954 ischemic stroke patients and conducted the experiment. The AUROC of the proposed method was 0.727 ± 0.003, and the AUROC of the CHA2DS2-VASc scores was 0.651 ± 0.007. For the prediction of ischemic stroke occurrence in AF patients, we first calculated precision at different positive/negative ratios to evaluate precision under various degrees of dataset skewness (Table 2) [36]. To do this, negative sets were generated by randomly sampling AF patients without ischemic stroke at various rates. The negative sets were generated ten times at each rate, and the performance for each case was evaluated by averaging the results. The precision of the proposed deep learning model decreased as skewness increased. In the realistic scenario, the precision of the proposed method was 0.132 ± 0.011, whereas that of logistic regression was 0.095 ± 0.013, XGBoost 0.121 ± 0.012, and random forest 0.109 ± 0.013. Next, we checked the recall performance. Out of 12,445 ischemic stroke patients in the test dataset, the proposed model covered 9508 patients (r = 0.764 ± 0.005). Furthermore, the proposed method performed better than the other machine learning methods (r = 0.678 to 0.717).
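AUROC, the metric used for the comparisons in this section, can be computed directly from its rank interpretation: it is the probability that a randomly chosen stroke patient receives a higher score than a randomly chosen non-stroke patient, with ties counted as one half. The sketch below is a plain-Python illustration with hypothetical labels and scores; it is quadratic in the number of patients and is not the evaluation code of this study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = stroke) and model outputs for four patients.
example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The tie rule matters for an integer-valued score such as CHA2DS2-VASc, where many stroke and non-stroke patients share the same value.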
The model output can be interpreted as an approximate probability of ischemic stroke occurrence and takes a value between 0 and 1. In general, the decision threshold on the model output for predicting ischemic stroke occurrence is 0.5. However, the default threshold may not represent an optimal interpretation of the predicted probabilities [37]. In this study, the class distribution of the dataset is skewed, and the predicted probabilities are not calibrated. This is a classification problem with imbalanced classes [38]. To address this, we identified the optimal threshold on the model output for judging the occurrence of ischemic stroke. We calculated the F1-score, which is the harmonic mean of precision and recall, while varying the threshold of the model output. The best performance (F1-score = 0.223) was obtained at a threshold of 0.519.
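The threshold search described above can be sketched as a simple sweep over candidate thresholds. This is a minimal illustration, assuming a grid of thresholds and a toy set of labels and predicted probabilities; the function names and data are hypothetical, and the search procedure actually used in this study is not specified in the text.

```python
def f1_at_threshold(labels, probs, threshold):
    """F1-score when patients with prob >= threshold are predicted positive."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    if tp == 0:
        return 0.0  # avoid division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, probs, candidates):
    """Return the candidate threshold with the highest F1-score."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, probs, t))


# Toy data (hypothetical): 1 = stroke, probs = model outputs.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.10, 0.20, 0.30, 0.45, 0.50, 0.55, 0.70, 0.90]
candidates = [i / 100 for i in range(1, 100)]

threshold = best_threshold(labels, probs, candidates)
best_f1 = f1_at_threshold(labels, probs, threshold)
```

In practice the threshold would be chosen on the validation set and then applied unchanged to the test set, so that the reported F1-score is not optimistically biased.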

Discussion
AF is the most common sustained arrhythmia. The CHADS2 and CHA2DS2-VASc scores are the most popular methods for predicting the risk of ischemic stroke in patients with AF. However, these scores may not be enough to predict the incidence of stroke as they only use five to seven features based on limited information. Previous studies demonstrated that the pathogenesis of stroke in AF patients is complex and involves various factors, such as hypertension, diabetes, dementia, and obesity [39,40]. Moreover, based on KNHIS data, this study found that 32 features were statistically significantly associated with stroke. Therefore, more accurate predictions will be possible if the information from ischemic stroke patients with AF can be completely utilized.
EMR data accumulated in hospitals are suited to this approach because they contain various medical information about patients. However, sharing or releasing EMR data is very difficult owing to privacy and confidentiality issues [41,42]. Therefore, the past medical information of patients who have used several hospitals can be analyzed only to a limited extent. In recent years, the Observational Health Data Sciences and Informatics (OHDSI) project has been attempting to standardize and expand the EMR information of hospitals; however, it is currently being conducted only for a few hospitals, and technical and institutional improvements are needed [43]. In Korea, the records of diagnosis, prescriptions, and health examination generated by all medical institutions are collected by KNHIS. This can overcome the limitation of EMR data being accessible only to certain hospitals. Another strength is that the model is specific to a particular race and region. Most previous risk prediction models were developed in different cohort studies [44]. They are not suitable for the Korean population as their clinical trial cohorts include information on people from different races and regions. This study is significant in that the prediction model was developed by considering the characteristics of the Korean population and achieved a higher AUROC than the CHA2DS2-VASc score.
There are additional considerations that may improve our study. First, the type I error (false positive) in the study subjects increases because only ICD-10 codes are considered, without prescription information, when extracting the AF and ischemic stroke patients. Type I error refers to a situation in which the result incorrectly indicates the presence of a disease, and type II error (false negative) is the opposite situation, in which the result fails to indicate a disease that is present [45]. If we use both ICD-10 codes and prescription information, the type I error decreases but the type II error increases. In this study, only ICD-10 codes were used because it was considered important to reduce the type II error. However, it may be more important to reduce type I errors depending on the research design. Therefore, we plan to conduct additional analysis considering various combinations of subject selection. The best way to solve this is to use medical examination results, but currently, KNHIS does not collect medical examination results. Second, the prediction results can vary significantly depending on the definition of the case target. In this study, we selected the case subjects by checking whether ischemic stroke occurred during the 5-year period after AF. This period was chosen to extract as many ischemic stroke patients as possible. If the purpose of the study were to predict acute stroke, the period should be shorter. To predict the various occurrence characteristics of ischemic stroke, we can consider training the model on various datasets and applying ensemble methods. Third, our method had better performance than conventional methods but requires improvement for practical application. Many existing studies typically report predictive performance at approximately 0.850 AUROC [46]. However, because of differences in study design, case subject definition, and datasets used, a simple comparison of performance values is of limited use.
Therefore, we have shown how well the proposed method performs compared to conventional methods. However, since this is also only fragmentary performance on the particular dataset, it is necessary to consider the number of different cases in which ischemic stroke occurs in AF patients. In addition, improvement of precision is necessary. The precision performance of the proposed method indicates that only 13.2% (n = 9508) of those who were predicted to have ischemic stroke (n = 72,032) have actually experienced ischemic stroke. In order to be used in practical application, it is necessary to increase true positive predictions and decrease false positive predictions of the model.
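The precision and recall figures quoted in this discussion can be reproduced from the reported counts, and a short calculation also shows why precision is low when the positive class is rare. In the sketch below, the three counts come from the text; the function name, the recall of 0.764 reused as a fixed operating point, and the false positive rate of 0.45 are assumptions chosen only for illustration.

```python
# Counts reported in the text for the test set.
true_positives = 9508
predicted_positives = 72032
actual_positives = 12445

precision = true_positives / predicted_positives
recall = true_positives / actual_positives
false_positives = predicted_positives - true_positives


def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by a fixed recall and false positive rate."""
    tp_share = recall * prevalence
    fp_share = false_positive_rate * (1.0 - prevalence)
    return tp_share / (tp_share + fp_share)


# Same classifier behaviour (assumed), two class balances:
# a balanced set versus the 8.24% stroke rate of the study population.
balanced = precision_at_prevalence(0.764, 0.45, 0.5)
skewed = precision_at_prevalence(0.764, 0.45, 0.0824)
```

With the classifier's behaviour held fixed, precision falls as the share of stroke patients falls, which is consistent with the decrease in precision under increasing skewness reported in the results.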
In the future, we plan to construct diverse subject datasets, which will allow us to ensemble various machine learning models considering the characteristics of the subjects to improve the predictive performance. This study will allow more accurate prediction through future experiments, and will be useful for patients with AF to prepare for ischemic stroke prevention as part of preventive medicine.

Conclusions
This study proposes a new model to predict the occurrence of ischemic stroke in patients with AF. To prevent ischemic stroke, a system for early detection of its occurrence should be established. This study predicted the occurrence of ischemic stroke in AF patients with a machine learning approach that utilizes the massive and complex KNHIS data. The validation results showed that the proposed machine learning model has a higher AUROC than the CHA2DS2-VASc scores. However, for the proposed method to be used in practice, the challenge of improving predictive performance remains. Nevertheless, this study suggested a method of using NHIS data in the development of healthcare applications. In addition, it is expected that further studies will be able to apply this approach not only to ischemic stroke but also to the prediction of various other diseases.

Author Contributions
SY proposed the objective and motivation of this work and designed overall method. SJ, MKS, EL, YYK and SB performed data-preprocessing. SJ, MJL, and SY performed preliminary study. DL, MKS and SY helped to write the main manuscript text and provided comments that improve introduction and method parts. SJ, EL, and SY performed evaluation process. EL and MKS provided some ideas in discussion. MJL, and SY supervised this work.

Ethics Approval and Consent to Participate
The study protocol was approved by the Institutional Review Board of the National Health Insurance Service in Korea (NHIS-2020-4-109). The authors confirm that all methods were performed in accordance with relevant guidelines and regulations. The need for informed consent from participants was waived by the ethics committee of the Chonnam National University because this study involved routinely collected medical data that were anonymized at all stages to protect an individual's privacy.