IMR Press / FBL / Volume 27 / Issue 7 / DOI: 10.31083/j.fbl2707211
Open Access Original Research
A Machine Learning Model Based on Genetic and Traditional Cardiovascular Risk Factors to Predict Premature Coronary Artery Disease
1 Guangzhou Institute of Cardiovascular Disease, Guangdong Key Laboratory of Vascular Diseases, State Key Laboratory of Respiratory Disease, The Second Affiliated Hospital, Guangzhou Medical University, 510260 Guangzhou, Guangdong, China
2 Department of Laboratory Medicine, Panyu Hospital of Chinese Medicine, Guangzhou University of Chinese Medicine, 511400 Guangzhou, Guangdong, China
3 Department of Emergency, The Second Affiliated Hospital, Guangzhou Medical University, 510260 Guangzhou, Guangdong, China
4 General Practice, Guangzhou Medical University, 510182 Guangzhou, Guangdong, China
*Correspondence: Chao-Wei Tian; Shi-Ming Liu
Academic Editors: Wei Lan, Qingfeng Chen and Khondaker Miraz Rahman
Front. Biosci. (Landmark Ed) 2022, 27(7), 211
Submitted: 27 April 2022 | Revised: 16 June 2022 | Accepted: 24 June 2022 | Published: 4 July 2022
Copyright: © 2022 The Author(s). Published by IMR Press.
This is an open access article under the CC BY 4.0 license.

Background: Premature coronary artery disease (PCAD) has a poor prognosis and high rates of mortality and disability. Accurate prediction of PCAD risk is important for the prevention and early diagnosis of this disease. Machine learning (ML) has proven to be a reliable method for disease diagnosis and for building risk prediction models based on complex factors. The aim of the present study was to develop an accurate prediction model of PCAD risk that allows early intervention.

Methods: We performed a retrospective analysis of single nucleotide polymorphisms (SNPs) and traditional cardiovascular risk factors (TCRFs) in 131 PCAD patients and 187 controls. The data were used to construct classifiers for the prediction of PCAD risk with the ML algorithms LogisticRegression (LRC), RandomForestClassifier (RFC) and GradientBoostingClassifier (GBC) in scikit-learn. Three quarters of the participants were randomly assigned to a training dataset and the rest to a test dataset. Classifier performance was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity and the concordance index. R packages were used to construct nomograms.

Results: Three optimized feature combinations (FCs) were identified: RS-DT-FC1 (rs2259816, rs1378577, rs10757274, rs4961, smoking, hyperlipidemia, glucose, triglycerides), RS-DT-FC2 (rs1378577, rs10757274, smoking, diabetes, hyperlipidemia, glucose, triglycerides) and RS-DT-FC3 (rs1169313, rs5082, rs9340799, rs10757274, rs1152002, smoking, hyperlipidemia, high-density lipoprotein cholesterol). Classifiers built with these FCs achieved an AUC >0.90 and a sensitivity >0.90. The nomograms built with RS-DT-FC1, RS-DT-FC2 and RS-DT-FC3 had a concordance index of 0.94, 0.94 and 0.90, respectively, when validated with the test dataset, and 0.79, 0.82 and 0.79 when validated with the training dataset. Manual prediction of the test data with the three nomograms resulted in an AUC of 0.89, 0.92 and 0.83, respectively, and a sensitivity of 0.92, 0.96 and 0.86, respectively.

Conclusions: The selection of suitable features determines the performance of ML models. RS-DT-FC2 may be a suitable FC for building a high-performance prediction model of PCAD with good sensitivity and accuracy. The nomograms allow practical scoring and interpretation of each predictor and may be useful for clinicians in determining the risk of PCAD.
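
The workflow described in the Methods (a 75/25 random split, three scikit-learn classifiers, and evaluation by AUC and sensitivity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The data are synthetic stand-ins generated with make_classification, sized to match the 318 participants and the eight features of RS-DT-FC1. The stratified split, the random seeds, and all hyperparameters are assumptions that the abstract does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# SYNTHETIC stand-in data (assumption): 318 rows mirror the 131 + 187
# participants and 8 columns mirror the size of RS-DT-FC1.
# No real patient data are used here.
X, y = make_classification(
    n_samples=318,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    weights=[187 / 318],  # approximate share of controls (class 0)
    random_state=0,
)

# Three quarters for training, the rest for testing.
# Stratifying by class label is an assumption, not stated in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three classifier families named in the abstract; settings are assumed.
classifiers = {
    "LRC": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
    pred = clf.predict(X_test)               # hard labels at the default threshold
    results[name] = {
        "auc": roc_auc_score(y_test, proba),
        # Sensitivity is the recall of the positive class.
        "sensitivity": recall_score(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: AUC={metrics['auc']:.2f}, "
          f"sensitivity={metrics['sensitivity']:.2f}")
```

The printed values depend entirely on the synthetic data and have no relation to the results reported above. With real data, the feature matrix would hold the encoded SNP genotypes and TCRFs of one feature combination.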

Keywords: premature coronary artery disease; machine learning; single nucleotide polymorphisms; traditional cardiovascular risk factors
Fig. 1.