A Machine Learning Framework for Diagnosing and Predicting the Severity of Coronary Artery Disease

Background: Although machine learning (ML)-based prediction of coronary artery disease (CAD) has gained increasing attention, assessment of the severity of suspected CAD in symptomatic patients remains challenging. Methods: The training set for this study consisted of 284 retrospective participants, while the test set included 116 prospectively enrolled participants from whom we collected 53 baseline variables and coronary angiography results. The data were pre-processed with outlier handling and one-hot encoding. In the first stage, we constructed a dichotomous ML model that used baseline information to predict the presence of CAD. In the second stage, baseline information was used to construct ML regression models for predicting the severity of CAD; the non-CAD population was included, and two different scores were used as output variables. Finally, statistical analysis and SHAP plot visualization were employed to explore the relationship between baseline information and CAD. Results: The study included 269 CAD patients and 131 healthy controls. The eXtreme Gradient Boosting (XGBoost) model exhibited the best performance among the models for predicting CAD, with an area under the receiver operating characteristic curve of 0.728 (95% CI 0.623–0.824). The main correlates were left ventricular ejection fraction, homocysteine, and hemoglobin (p < 0.001). The XGBoost model also performed best for predicting the SYNTAX score, with the main correlates being brain natriuretic peptide (BNP), left ventricular ejection fraction, and glycated hemoglobin (p < 0.001). The main relevant features in the model predicting the GENSINI score were BNP, high-density lipoprotein, and homocysteine (p < 0.001). Conclusions: This data-driven approach provides a foundation for the risk stratification and severity assessment of CAD. Clinical Trial Registration: The study was registered in the www.clinicaltrials.gov protocol registration system (number NCT05018715).


Introduction
Artificial intelligence is an important tool in the current era of big data and can improve human productivity by simulating human learning processes and analyzing complex data [1]. Currently, machine learning (ML) and its subset, deep learning, are the most common methods used in artificial intelligence [2]. The inception of ML can be traced back to the 1950s and 1960s [2], when scholars began investigating the feasibility of using computers for self-regulating learning and autonomous decision-making through the construction of mathematical models and algorithms [3]. This approach empowers computers to continually improve and optimize their performance by processing and learning from data [4]. ML deviates from traditional rule-based programming by placing a greater emphasis on automatic pattern recognition in data, thereby precluding the need for manual rule design. Deep learning is commonly used to analyze raw clinical data and imaging data [4], while ML can be used to predict the severity and prognosis of cardiovascular disease [5]. Artificial intelligence is now commonly used in medicine and has been advancing progressively in the cardiovascular field [6].
The diagnosis of coronary artery disease (CAD) and early intervention in symptomatic patients with suspected CAD are challenging [7], and definitive diagnosis in clinical practice remains complicated [8]. Although current methods reduce the probability of misdiagnosis of stable CAD, the invasive diagnostic procedures used can be considered overmedicalization. Therefore, the development of a scoring system that accurately predicts coronary artery stenosis and its severity in patients with suspected CAD could reduce the number of downstream and invasive diagnostic tests [9]. Thus far, investigators have proposed multiple testing strategies to effectively screen patients with suspected CAD, the most notable being the Diamond-Forrester model. However, research suggests this model has a high false-positive rate. As a result, a "battle of the scores" has ensued over the past decade for predicting the pretest probability of coronary heart disease. Many "up-to-date" risk assessment models have emerged based on the latest clinical trial data. However, these methods still cannot accurately assess the complications of CAD, and hence their application in clinical practice remains limited [8].
The GENSINI score reflects plaque burden, but not bifurcation, calcification, or tortuosity characteristics of lesions. The SYNTAX score, on the other hand, reflects the type of plaque and the complexity of percutaneous coronary intervention (PCI). It also describes the anatomy of the coronary lesion and provides guidance to clinicians when developing optimal treatment plans for high-risk patients. The SYNTAX score can help with treatment decisions for patients with lesions suitable for both PCI and coronary artery bypass graft (CABG) and in whom the surgical mortality rate is expected to be low.
The goal of this study was therefore to develop a ML model based on the clinical characteristics of a retrospective cohort comprising CAD patients and healthy controls. The model was then tested in a prospective cohort. The objectives of the study were first to use ML and statistical methods to identify new risk factors associated with disease severity in CAD, and second to develop an electronic medical record- and coronary score-driven ML model predictive for the detection of severe CAD.

Methods
A three-step modeling procedure was used to achieve the research goals [10]. In the first step, patients were divided into two groups based on coronary angiographic findings: a coronary group (stenosis ≥50%) and a non-coronary group (stenosis <50%, or no stenosis) [11,12]. The second step was to provide estimates of the SYNTAX and GENSINI scores for patients undergoing coronary angiography [13]. In the third step, 53 clinical characteristics (for example, sex, age, and BMI) were used as input to predict the diagnosis and the GENSINI and SYNTAX scores (Table 1) [14]. Feature selection techniques were also used, and these provide a way to identify potential risk factors for CAD based on ML. This allows a better understanding of the medical and clinical features associated with the presence or absence of CAD, with the outcome derived from the SYNTAX score distribution, and with the outcome derived from the GENSINI score. The methodology to be evaluated is designed to provide a uniform risk score that can help to determine the need for invasive or functional noninvasive tests in patients with suspected CAD, as well as for patients with complex CAD who need more rigorous coronary revascularization surgery. The development of an automated recommendation system based on data-driven ML algorithms should thus provide an auxiliary means for personalized treatment in routine clinical practice.
Unlike previous studies [8], the present investigation included the population with <50% coronary stenosis in the regression analysis. There were two reasons for this. First, a significant proportion of patients in our study cohort had coronary stenosis in the 0–50% range but exhibited clear symptoms of CAD. Recent research on this population suggests that disease progression without early intervention can have serious consequences. These patients were therefore included with the aim of guiding physicians in the development of protocols for early coronary prevention. Second, this population is also an accurate representation of the real-world population, thus making the work easier to reproduce in the future. SYNTAX scores were obtained using the online calculator at http://www.syntaxscore.org/. The GENSINI score is based on coronary angiographic findings and was calculated by multiplying the stenosis score at the site of the lesion by the appropriate weighting factor. The sum of all the lesion branch scores is the GENSINI score [15].
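The GENSINI calculation described above can be sketched as follows. The stenosis-to-score mapping and the segment weights shown are the standard published GENSINI values (only a subset of segments is listed), and the example lesion list is hypothetical.

```python
def stenosis_score(pct):
    """Map percent stenosis to the standard GENSINI severity score."""
    if pct < 25:
        return 1
    if pct < 50:
        return 2
    if pct < 75:
        return 4
    if pct < 90:
        return 8
    if pct < 99:
        return 16
    return 32  # total or near-total occlusion

# A subset of the standard GENSINI segment weighting factors, for illustration.
SEGMENT_WEIGHT = {
    "LM": 5.0,             # left main
    "LAD_proximal": 2.5,
    "LAD_mid": 1.5,
    "RCA": 1.0,
}

def gensini(lesions):
    """lesions: iterable of (segment, stenosis_percent) pairs.
    Each lesion contributes stenosis score x segment weight; the total is the GENSINI score."""
    return sum(stenosis_score(pct) * SEGMENT_WEIGHT[seg] for seg, pct in lesions)

# Hypothetical patient: 80% proximal LAD lesion and 60% RCA lesion.
print(gensini([("LAD_proximal", 80), ("RCA", 60)]))  # 8*2.5 + 4*1.0 = 24.0
```

A real implementation would cover all coronary segments and collateral adjustments; the sketch only shows the multiply-and-sum structure described in the text.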
This study included patients who underwent elective or urgent coronary angiography at the First Affiliated Hospital of Xinjiang Medical University.We attempted to develop new risk prediction algorithms for CAD-related risk factors and for CAD severity using clinical indicators in combination with coronary angiographic features and with two different scoring criteria.

Participants
The training set consisted of data from 284 retrospective participants, while the test set comprised 116 prospectively enrolled participants [16]. Patients were eligible for the test set only if they were judged to have stable angina. The exclusion criteria were a previous diagnosis of CAD, previous diagnosis of acute coronary syndrome (ACS), previous history of PCI or CABG, severe infection, or renal or pulmonary comorbidities [17].

Model Building Process
A three-step approach was used for building the model: database creation, model construction, and model interpretation and evaluation.The detailed technical path is shown in Fig. 1.

Database Creation
In the first step, each patient's medical data was collected from electronic medical records.The SYNTAX and GENSINI scores for each patient were independently assessed by two cardiologists.Disagreements in the coronary evaluation were assessed by a third specialist who then made the final decision.
All of the original data were summarized and stored, and then carefully checked to ensure they met the quality standards for the tasks performed [18]. To this end, descriptive statistical methods and visualization techniques were used to summarize patient characteristics for assessment by the cardiologists and to identify features that are meaningful for the construction of ML models [19].

Data Processing and Feature Selection
The original dataset contained 53 feature attributes. These were initially processed using the Pandas package in Python to convert the raw data into integer and float types suitable for ML operations [20]. Filter and embedded methods were applied for analysis of the clinical features [21]. The filter method primarily employs techniques such as the chi-square test and correlation coefficients, whereas the embedded method integrates feature selection into ML algorithms to identify the most relevant features through the learning process. Notably, the extreme gradient boosting (XGBoost) and random forest (RF) algorithms are the most relevant approaches in this context [22,23]. The XGBoost algorithm is well-suited for the processing of clinical data [22,24], while the RF algorithm has the advantages of high accuracy in feature selection, avoidance of overfitting, and broad applicability [23]. In view of the dimensionality and feature relevance of the dataset, we chose to use the XGBoost regressor and RF regressor function packages to filter the clinical features [25]. Ultimately, the feature-filtering algorithm that yielded the largest area under the receiver operating characteristic (ROC) curve was selected as the best algorithm for constructing the dataset [26].
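As a sketch of the embedded feature-selection step described above, the snippet below ranks candidate features by tree-based importance, using scikit-learn's RandomForestRegressor as a stand-in for the XGBoost/RF regressor packages; the synthetic data are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only features 0 and 1 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                      # 6 candidate features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance and keep the top ones.
ranking = np.argsort(model.feature_importances_)[::-1]
top2 = sorted(int(i) for i in ranking[:2])
print(top2)  # features 0 and 1 dominate by construction
```

In the study's setting, the ranked features would feed into candidate datasets whose downstream ROC performance decides the final feature subset.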

Model Building
The building phase for our experimental model consisted of two steps. In the first, the binary classification problem was addressed, with the model built after labeling patients as either "diseased" or "disease-free" based on their coronary angiography results [11,27].
During model training, regularization techniques and sample weight adjustment were employed to enhance the predictive ability of the model, given the limited sample size and the unbalanced categories in the dataset. Specifically, L1 and L2 regularization techniques were used to select important features, to reduce the weighting of unimportant features, and to avoid overfitting. Sample-based weight adjustment was also used to balance the dataset by assigning higher weights to minority-category samples [29], which drives the model to give these samples greater influence during training [30]. Sample weights were determined by calculating the ratio of the weights of positive samples (representing the minority category) to the weights of negative samples (representing the majority category); more specifically, this ratio was calculated as the number of samples in the majority category divided by the number of samples in the minority category. Furthermore, 5-fold cross-validation was used for model selection (Supplementary Fig. 1) [28] and for hyperparameter adjustment to prevent overfitting and to improve model generalization. The model parameters are shown in Supplementary Table 1. Following the completion of training on the training dataset, the model was tested on the test dataset to validate the performance metrics [30]. The second step in the model building involved regression analysis and was modeled based on the SYNTAX and GENSINI scores. From the large number of candidate models available for classification and regression, a total of 7 dichotomous classification models and 9 regression models were selected [31]. The input and output for these models are described in detail in Table 1.
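The sample-weight ratio described above (majority count divided by minority count) can be sketched as follows. The label counts mirror the 269/131 split reported in the abstract but are otherwise illustrative.

```python
import numpy as np

# Hypothetical binary labels: 269 majority-class and 131 minority-class samples.
y = np.array([0] * 269 + [1] * 131)

n_majority = int(np.sum(y == 0))
n_minority = int(np.sum(y == 1))

# Weight applied to each minority-class sample, e.g. XGBoost's
# scale_pos_weight when the positive class is the minority.
weight_ratio = n_majority / n_minority
print(round(weight_ratio, 3))  # 2.053
```

Passing this ratio as the minority-class weight makes each minority sample count for roughly two majority samples in the loss, counteracting the class imbalance.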

Model Interpretation and Evaluation
To address the challenge of limited model interpretability, the SHAP framework was incorporated to provide an explanation of the model outcomes, thereby increasing confidence in the results. The SHAP value quantifies the extent to which each feature in the model contributes to the prediction. It also facilitates visualization of the results.
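As a toy illustration of the per-feature contributions that SHAP quantifies, the snippet below uses the closed form for a linear model, where the SHAP value of feature i is w_i(x_i − E[x_i]); the weights and data are hypothetical, not the study's model.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.0])            # hypothetical linear model weights
X = np.array([[1.0, 0.0, 5.0],
              [3.0, 2.0, 5.0]])
base = X.mean(axis=0)                      # expected value of each feature

def shap_linear(x):
    """Exact SHAP values for a linear model: per-feature contribution
    relative to the average prediction."""
    return w * (x - base)

phi = shap_linear(X[1])
print(phi)  # the contributions sum to f(x) - E[f(X)]
```

Tree models such as XGBoost require the TreeSHAP algorithm instead of this closed form, but the interpretation of the resulting per-feature contributions is the same.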
For performance evaluation, various metrics were employed to assess the efficacy of the ML models in both the classification and regression tasks. For the dichotomous models, evaluation metrics were employed across five dimensions: area under the ROC curve, accuracy, precision, recall, and F1-score [32]. For the regression models, performance evaluation was carried out by calculating the mean squared error (MSE), mean absolute error (MAE), mean absolute prediction error, and coefficient of determination (R2) [33], as detailed below.

$$\mathrm{Accuracy} = \frac{TP + TN}{N} \qquad \mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \qquad R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

Where N denotes the total number of samples tested, TP (true positive) denotes a true positive case, TN (true negative) denotes a true negative case, FP (false positive) denotes a false positive case, and FN (false negative) denotes a false negative case. $y_i$ denotes the true value of the i-th sample, $\hat{y}_i$ denotes the predicted value of the i-th sample, and $\bar{y}$ denotes the mean of the true values over all samples.
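These metrics can be computed directly; the snippet below does so with plain NumPy on hypothetical labels and scores (scikit-learn's metrics module offers equivalent functions).

```python
import numpy as np

# Hypothetical classification labels and predictions.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical regression targets and predictions.
y = np.array([10.0, 20.0, 30.0])
yhat = np.array([12.0, 18.0, 33.0])

mae = float(np.mean(np.abs(y - yhat)))
mse = float(np.mean((y - yhat) ** 2))
r2 = float(1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2))

print(accuracy, precision, recall, f1, round(mae, 3), round(r2, 3))
```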
The ML framework proposed in this study was implemented in the Python programming language. Differences were considered statistically significant when two-sided tests showed a p-value < 0.05. p-values were corrected for multiple testing using the Benjamini-Hochberg procedure [34]. All tests were two-tailed (non-directional), i.e., the alternative hypothesis was that the indicators being compared were not equal.
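A minimal sketch of the Benjamini-Hochberg adjustment referenced above (statsmodels' `multipletests` with `method='fdr_bh'` implements the same procedure); the p-values here are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (FDR control)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    prev = 1.0
    # Walk from the largest p-value down, scaling by m/rank and
    # enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p[i] * m / rank)
        adj[i] = prev
    return adj

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```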

Visualization Analysis
After screening for the key features that affect the diagnostic and regression models, SHAP visualization analysis was performed in two separate parts. Fig. 2 summarizes the risk factors that had a significant impact on diagnosis and on the SYNTAX and GENSINI scores. Left ventricular ejection fraction, homocysteine, hemoglobin, HDL, and BNP each had a significant effect in the diagnostic model. BNP, EF%, MB, GHb, and TC were important features in the regression model for accurate prediction of the SYNTAX score, while BNP, HDL, GHb, glucose, and age were important for accurate prediction of the GENSINI score.
Nine indicators outside of the statistical analysis of SYNTAX score correlations were identified by the machine learning algorithm based on SYNTAX scores (Fig. 3). These factors were not identified by assessing Spearman correlation coefficients. In fact, statistical evaluation of the SYNTAX scores found that K+ had a significant positive correlation (p = 0.025, r = 0.611). Education had a significant negative correlation (p = 0.026, r = -0.111), but the r value was close to 0, indicating only a weak correlation. In the machine learning algorithm based on GENSINI scores, 10 indicators were identified outside of the analysis of GENSINI score correlations. None of these factors was found to be significant by assessing Spearman's correlation coefficient. Indeed, statistical evaluation of the GENSINI scores showed very weak correlations with Leu (p = 0.032, r = 0.107) and CRP (p = 0.05, r = 0.142).
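The Spearman rank-correlation check used above can be reproduced with SciPy; the paired values below are hypothetical and chosen to be perfectly monotone, so the rank correlation is exactly 1.

```python
from scipy.stats import spearmanr

# Hypothetical paired measurements: a monotone but nonlinear relationship.
x = [1, 2, 3, 4, 5, 6]
score = [1, 4, 9, 16, 25, 36]

r, p = spearmanr(x, score)
print(round(r, 3))  # 1.0 for a perfectly monotone relationship
```

Because Spearman correlation works on ranks, it captures monotone associations but, unlike the tree-based models, misses nonmonotone effects, which is one reason the ML feature rankings can flag indicators the correlation analysis does not.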

Model Evaluation
We next evaluated the performance of the classifiers and regression models, as summarized in Tables 4 and 5. Two specific classification models were found to have advantages. Multidimensional evaluation revealed that the RF model performed best in terms of sensitivity, specificity, and recall in a balanced manner, while the XGBoost classifier performed best in terms of the area under the ROC curve (Fig. 4a). For the regression models, XGBoost dominated for the prediction of both the SYNTAX and GENSINI scores (Fig. 4b,c). A key issue from the clinician's perspective is whether the method can explain its results. Practical evidence suggests that BNP, EF%, lipids, age, and glucose are some of the main risk factors for the development and progression of cardiovascular disease, for cardiovascular disease prognosis, and for the occurrence of adverse cardiovascular events.

Discussion
The accuracy of early coronary risk assessment during hospitalization is critical for the proper management of CAD, which requires different treatment modalities according to the level of disease severity. During the risk assessment of cardiovascular disease in routine clinical practice, clinicians tend to focus overly on laboratory indicators, while non-laboratory patient characteristics such as BMI and gender are often underestimated. Although the latter are important risk factors for cardiovascular disease, they are often considered less important when assessing disease severity. Coronary angiography can be a good diagnostic tool for CAD, but has the disadvantages of being complicated to perform and prone to adverse reactions. For example, in one study vascular complications reached 11.7% and the incidence of contrast nephropathy reached 3.3% [35]. Patients are also inclined to refuse the test in the early stages of the disease. Therefore, coronary angiography is generally used to confirm the diagnosis of CAD after the onset of obvious symptoms. It is not used for the purpose of early screening or diagnosis, thus leading to many problems such as untimely treatment of patients and poor disease control.
In the current study we selected 53 clinical indicators and built ML models to investigate the nonlinear relationship between these indicators and the diagnostic outcome of CAD patients. Additionally, we constructed ML models with the aim of assessing the severity of CAD based on clinical indicators. Our findings demonstrate that ML algorithms can be used to predict the risk of coronary heart disease, thereby assisting physicians to diagnose the disease more accurately. We evaluated multiple models to compare the efficacy of different ML algorithms. The results showed that ensemble learning, which combines the results of multiple classifiers, outperformed other methods for diagnosing coronary heart disease. In particular, the XGBoost [36] model identified the top 15 indicators important for disease prediction (EF%, BNP, HCY, etc.), with an accuracy >90%. We found that XGBoost is well-suited for typical structured data such as tabular and time series data, and can be used for both classification and regression tasks. XGBoost also outperformed traditional decision tree models in terms of training speed and accuracy, while still retaining good explanatory power [37]. Based on our evaluation of model performance, we consider XGBoost to be the most effective model for classifying the individual risk of CAD. Gupta et al. have applied ML in many areas including software maintenance [21,32], smart homes [28], and medical tasks [24] with outstanding results [29]. Mittas et al.
[8] made the first attempt at applying ML for CAD assessment. They excluded patients with coronary angiography results suggestive of non-CAD, and then proceeded to construct a deep learning model with a mean absolute error (MAE) of 5.6916. However, their model had a major limitation in that it excluded the non-diseased population upfront, thereby reducing interference from zero scores [8]. It is important to note that the application of ML algorithms in the medical field still faces multiple challenges and limitations. These include ensuring the transparency and interpretability of algorithms, as well as addressing data imbalance and privacy issues. Further research is necessary to overcome these obstacles and to advance the application of ML in the field of CAD diagnosis [38].
Exploratory and statistical analyses have shown that several risk factors for CAD are important for predicting whether patients have this disease [39]. In the present study, we provided objective evidence of risk factors that affect SYNTAX and GENSINI scores in the absence of knowledge about the relationship between SYNTAX scores and predictors [24].
Regarding future research on the application of machine learning in the diagnosis of coronary heart disease, insights can be gleaned from other fields of study. Yu et al. [40] explored the issue of disease causality inference by constructing a machine learning knowledge base to identify correlations among multiple diseases. Shamseddine et al. [41] proposed privacy-preserving federated learning models, providing novel ideas for developing machine learning models that protect patient privacy. Similarly, Wassan et al. [24] developed a solution to patient privacy concerns by utilizing federated machine learning to facilitate mobile collaborative development of standard prediction models while storing all training data locally, thereby separating machine learning from data storage in the cloud to prevent privacy issues in medical data sharing. In the future, building a coronary heart disease knowledge base could aid in comprehending the linkages between coronary heart disease and multiple related illnesses. Furthermore, to mitigate the challenge of insufficient medical data for machine learning modeling due to patient privacy issues, adopting a federated learning approach may be worthwhile.
In summary, the main findings of this study concerned the diagnosis of CAD and the evaluation of its severity. It is important to accurately predict whether a patient has CAD before referral for further examination. The results of the feature selection algorithm identified some of the risk factors that contribute to variation in the distribution of SYNTAX and GENSINI scores. The application of ML prediction models to cardiovascular disease has been evaluated previously in patients with ACS [42]. ML algorithms for CAD have been applied in some clinical settings, including (i) the prediction of CAD using clinical variables and an interdisciplinary approach; (ii) improving the detection of functional CAD using computational hemodynamics (e.g., FFR-based algorithms); and (iii) assessing the ability to automatically predict CAD based on myocardial perfusion imaging. Current clinical practice for patients with suspected CAD relies on invasive coronary angiography and the post hoc calculation of a score based on the coronary angiographic findings to guide further treatment.
There have been few comprehensive studies of CAD through the lens of ML [38]. In the clinical setting, the individual risk model established here based on the XGBoost algorithm could be further developed into a supplementary diagnostic system. The model could be applied for screening CAD in the population and also to assist physicians in diagnosing CAD during outpatient visits. This could ultimately improve early detection and control, with a high degree of practicality and feasibility. The model could also provide a realistic approximation of the coronary load score to assess the complexity of CAD.
The present study has several limitations. The large number of patients with a coronary score of zero and the non-homogeneous data created difficulties for the modeling process, compounded by the limited sample size. The distribution of patients with non-zero scores was also not concentrated [43]. Nevertheless, the risk stratification ML framework was developed to help clinicians identify patients with suspected coronary heart disease who should be referred for further examination, or who should undergo emergency surgery.
Further work is needed to optimize the model by using data from multicenter studies with large sample sizes. The model then needs to be validated in a prospective cohort and deployed into the community and clinic. In addition, multidisciplinary factors could be integrated into the model by using bioinformatic and pharmacogenomic analysis to extract other validated biomarkers such as specific genotypes. In brief, once validated using prospective external cohorts, the model established in this study could help clinicians to make decisions that are often still quite challenging. This will eventually ease the pressure on hospitals and doctors in the COVID-19 era and speed up the diagnosis and treatment process.

Conclusions
Machine learning models based on electronic medical records can effectively assess the severity of coronary heart disease and can identify a new set of risk factors for the disease. This study thus points to new research directions for future work.

Fig. 1.
Fig. 1. The research process is depicted in the diagram: the raw data are initially subjected to pre-processing and fed into distinct regression and classification algorithms. Following model training and hyperparameter tuning, the final prediction outcomes are generated, and the SHAP framework is employed for model interpretation.

Fig. 2.
Fig. 2. Results of screening clinical features using machine learning algorithms. (a) Distribution of Shapley values for the screened clinical features of the best-performing diagnostic model. (b) Distribution of Shapley values for the screened clinical features of the best-performing model based on the SYNTAX score. (c) Distribution of Shapley values for the screened clinical features of the best-performing model based on the GENSINI score. EF%, ejection fraction; Hcy, homocysteine; Hb, hemoglobin; HDL, high density lipoprotein; BNP, brain natriuretic peptide; Glu, glucose; TC, total cholesterol; GHb, glycosylated hemoglobin; MB, myoglobin; CK, creatine kinase; TG, triglyceride; LDL, low-density lipoprotein; IL6, interleukin 6; WBC, white blood cell; β-blocker, beta blockers; CRP, C-reactive protein; Leu, leucocyte; SBP, systolic blood pressure; DBP, diastolic blood pressure.

Fig. 3.
Fig. 3. Similarities and differences in correlation factors in the regression models and statistical analysis. S, SYNTAX score; MS, machine learning SYNTAX score; G, GENSINI score; MG, machine learning GENSINI score. The frequency represents the number of times that a factor was considered to have an effect on the score in the S, MS, G, and MG scoring methods. EF%, ejection fraction; Hcy, homocysteine; Hb, hemoglobin; HDL, high density lipoprotein; BNP, brain natriuretic peptide; Glu, glucose; TC, total cholesterol; MB, myoglobin; CK, creatine kinase; LDL, low-density lipoprotein; IL6, interleukin 6; WBC, white blood cell; CRP, C-reactive protein; Leu, leucocyte; SBP, systolic blood pressure; DBP, diastolic blood pressure; TnI, troponin I; CPK, creatine phosphokinase; CT, clotting time; GHb, glycosylated hemoglobin; WMA, left ventricular wall motion abnormalities. Green indicates statistically significant or meaningful in the machine learning models, while red indicates not meaningful.

Fig. 4.
Fig. 4. Model evaluation. (a) Comparison of the ROC curves of the diagnostic models. (b) Scatterplot of the regression model based on the SYNTAX score. (c) Scatterplot of the regression model based on the GENSINI score. ROC, receiver operating characteristic.

Table 2. Descriptive and exploratory analyses for categorical risk factors and scores.
pa, p value of diagnosis performance; pb, p value of SYNTAX score; pc, p value of GENSINI score; LVWMAs, left ventricular wall motion abnormalities; ARBs, angiotensin receptor blockers; CCBs, calcium channel blockers. Median (P25, P75), the median (25th percentile, 75th percentile). These p-values were corrected for multiple testing.