Background: The severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2) is responsible for the COVID-19 pandemic, so accurate evaluation of
viral infection is crucial. According to the Centers for Disease Control
and Prevention (CDC), Real-Time Reverse Transcription PCR (RT-PCR) on
respiratory samples is the gold standard for confirming the disease. However, it
has practical limitations, such as time-consuming procedures and a high rate of
false-negative results. We aim to assess the accuracy of COVID-19 classifiers
based on Artificial Intelligence (AI) and statistical classification methods
applied to blood tests and other information routinely collected at Emergency
Departments (EDs). Methods: Patients admitted to the ED of Careggi
Hospital from April 7th to April 30th 2020 with pre-specified features of suspected
COVID-19 were enrolled. Physicians prospectively dichotomized them as COVID-19
likely/unlikely cases, based on clinical features and bedside imaging support.
Considering the limits of each method in identifying a case of COVID-19, further
evaluation was performed after an independent clinical review of 30-day follow-up
data. Using this as a gold standard, several classifiers were implemented:
Logistic Regression (LR), Quadratic Discriminant Analysis (QDA), Random Forest
(RF), Support Vector Machine (SVM), Neural Networks (NN), K-nearest neighbor
(K-NN), Naive Bayes (NB). Results: Most of the classifiers show a ROC
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the COVID-19 pandemic. Since its first report in December 2019 [1], despite great efforts made in almost every country worldwide, this disease continues to spread globally.
According to Centers for Disease Control and Prevention (CDC) guidance, upper respiratory samples, and in particular nasopharyngeal specimens, should be collected for RT-PCR-based testing for COVID-19 [2].
Unfortunately, RT-PCR has several practical limitations: the procedure is time-consuming, and the detection rate of viral nucleic acid depends closely on the course of the viral infection. The optimal sampling time is uncertain, so the period of high viral load can be missed, resulting in a high rate of false negatives [3]. Meanwhile, there is growing interest in the role of biomarkers in screening for, and especially in the early detection of, SARS-CoV-2 infection at emergency departments (EDs) [4, 5].
The aim of our study was to assess, through classification tools based on machine learning (ML) and standard statistical models, how well patients from the first wave (April 2020) can be classified as COVID-19 positive cases on the basis of blood tests and other information routinely collected at the emergency department. With a rapidly changing virus, we believe that automated data-processing procedures may provide valid support to physicians facing the decision to classify a patient as COVID-19 positive or not. We tested different classifiers to find the best way to combine ML and statistical methods in the COVID-19 context, in order to improve their effectiveness while preserving the interpretability of the results.
Given the limitations of the RT-PCR test recalled above, and because the models were developed in early 2020, when the first wave of COVID-19 occurred and the optimal timing of the swab was still not well known, the benchmark used to train the classifiers is the so-called physician’s gestalt, which stems from a review of clinical, laboratory (including RT-PCR) and imaging parameters and a clinical review of 30-day follow-up data. This has been shown to play a key role in identifying infected patients in the first wave [6].
In addition, the classifiers were tested on data from second-wave patients (October 2020), using as the benchmark in this case positivity on the specific RT-PCR within one week of access to the ED (not available in the first dataset). This is in line with the growing body of evidence suggesting that initially missed positive patients may test positive within 7 days of hospital admission [7, 8].
This study was conducted after approval by the local ethical committee, and informed consent was submitted to and signed by the patients involved.
A total of 971 consecutive patients were enrolled at their admission to the ED of the hospital
AOU-Careggi from April 7th to April 30th 2020. Inclusion
criteria were: age
Of these, 118 were later classified as cases according to the physician’s gestalt as described in Nazerian et al. [6], while the COVID-19 negative patients received a range of other diagnoses, with no noticeable repetition among them.
Serum, plasma and blood from peripheral venous sampling were evaluated upon admission to the ED. Hematochemical parameters were evaluated on specific analytical platforms. In particular, total blood cell count was performed on a Sysmex XN analyzer (Dasit, Milan, Italy), while coagulation and serum biomarkers were evaluated after centrifugation: plasma samples were tested on an ACL-TOP analyzer (Werfen, Barcelona, Spain) and sera on a Cobas 8000 analyzer (Roche, Basel, Switzerland).
The laboratory variables considered for our analysis are listed in Table 1:
No. | Variable |
---|---|
1. | White Blood Cells (WBC) |
2. | Neutrophils (Ne) |
3. | Lymphocytes (Ly) |
4. | Platelets (Plt) |
5. | Hemoglobin (Hb) |
6. | Lactate Dehydrogenase (LDH) |
7. | Alanine Transaminase (ALT) |
8. | Aspartate Aminotransferase (AST) |
9. | Sodium (Na) |
10. | Potassium (K) |
11. | Glucose (Glc) |
12. | Bilirubin (Bil) |
13. | Creatinine (Crea) |
14. | International Normalized Ratio (INR) |
15. | C-Reactive Protein (CRP) |
16. | Activated Partial Thromboplastin Time (aPTT) |
17. | Fibrinogen (Fib) |
18. | Procalcitonin (PCT) |
19. | D-Dimer (Dim) |
20. | Interleukin-6 (IL-6) |
21. | Troponin-T (TnT) |
We also included the following variables (Table 2), which are not strictly laboratory related but are easily assessed at first emergency evaluation:
No. | Variable |
---|---|
1. | Systolic Blood Pressure (SBP) |
2. | Diastolic Blood Pressure (DBP) |
3. | Heart Rate (HR) |
4. | Oxygen Saturation (sO2) |
5. | Body Temperature |
6. | Fever (yes/no) |
7. | Cough (yes/no) |
8. | Dyspnea (yes/no) |
Notably, both fever and body temperature were considered: in particular, fever
referred to an at-home self-assessment of body temperature higher than 37.4
Given the characteristics of the first wave, which mainly affected male and geriatric patients, we deliberately excluded age and sex from our sets of covariates. Including these covariates would have improved the internal validation based on the first wave but would have worsened classification power in subsequent waves.
We decided to use a test population of patients from the second wave to guard against overfitting, as reported in the literature [9]. Patients were selected by consecutive admission to the ED from October 1st to 24th 2020. In particular, we used the data from the first 30 admitted patients who had a positive molecular test for COVID-19 within one week and from 30 patients with any other diagnosis (no specific repetition among diagnoses was noticed).
Recent developments in ML concern automated procedures to classify units. Here we compare seven different approaches that, according to the literature, provide optimal results [10].
The analyses were performed using R version 3.6.3 (Vienna, Austria), via the caret package (Classification And REgression Training).
We used a wrapper method for feature selection, combined with cross-validation steps to improve the selection [11, 12]. The proposed recursive feature elimination (RFE), a backward selection algorithm, was combined with an RF model and 10-fold cross-validation repeated five times [11, 12]. The set of hyperparameters for each algorithm was determined through hyperparameter optimization (grid search). To tackle class imbalance, we used the Synthetic Minority Oversampling Technique (SMOTE), a computational method that artificially generates units from the minority class via the K-nearest neighbor (K-NN) method [13].
The pattern of missingness of the laboratory variables is displayed in Fig. 1. The variables AST, Dim and TnT have more than 75% missing values and were therefore removed from further analysis. Other variables (namely Fib, PCT and IL-6) also have a rather high percentage of missing values. However, given their importance, we decided to keep them for further analysis.

Fig. 1. Plot of missingness in the laboratory variables, suggesting removal of AST, Dim and TnT.
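The missingness rule above can be illustrated with a minimal sketch. It is written in Python rather than the R/caret environment used for the study, and the records, values and function names are hypothetical; only the 75% threshold comes from the text.

```python
def missing_fraction(records, variable):
    """Share of records in which `variable` is missing (None)."""
    missing = sum(1 for r in records if r.get(variable) is None)
    return missing / len(records)


def split_by_missingness(records, variables, threshold=0.75):
    """Return (kept, dropped): a variable is dropped when its missing share exceeds threshold."""
    kept, dropped = [], []
    for v in variables:
        if missing_fraction(records, v) > threshold:
            dropped.append(v)
        else:
            kept.append(v)
    return kept, dropped


# Hypothetical toy records, not study data; None marks a missing value.
records = [
    {"WBC": 7.1, "AST": None, "Fib": 410.0},
    {"WBC": 9.4, "AST": None, "Fib": None},
    {"WBC": 5.2, "AST": None, "Fib": None},
    {"WBC": None, "AST": None, "Fib": 380.0},
    {"WBC": 6.6, "AST": 31.0, "Fib": None},
]
kept, dropped = split_by_missingness(records, ["WBC", "AST", "Fib"])
```

In this toy example AST is missing in 4 of 5 records (80%) and is dropped, while Fib (60% missing) is retained, mirroring the decision to keep variables with high but sub-threshold missingness.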
Prior to applying the various ML techniques, we investigated, for each of the laboratory variables, whether some transformation (such as a log transform, a quadratic power, etc.) may improve its discriminatory power, i.e., its capacity to separate positive from negative patients. This analysis led us to conclude that the variables Ly, LDH, K, Bil, Crea, PCT, INR, IL-6 and ALT may be better entered in logarithmic form (after adding 0.01 when necessary).
Additional preliminary analyses were performed to understand the influence of each covariate on the conditional log odds of a patient being COVID-19 positive. These consist of a semi-parametric Logistic Regression (LR) against each variable and a parametric LR against each variable binned into quartiles. The former uses smoothing techniques based on cubic splines (i.e., third-order polynomials constructed to achieve smoothness of the interpolating function) [14], while the latter is performed by grouping patients into quartiles and then fitting a parametric logistic model against the binned variable treated as a factor. As an illustrative and representative example, Fig. 2 presents the resulting plot for the sO2 variable. Both graphs show that, for this covariate, above the level of 85 there is a dramatic decrease in the log odds that the patient is a case.
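A minimal Python sketch of the log transformation follows. Two points are assumptions of this sketch and are not stated in the text: the natural logarithm is used, and "when necessary" is read as "when the variable contains a zero", in which case the 0.01 offset is added to every value of that variable so that ordering is preserved.

```python
import math

# Variables the text lists as better entered in logarithmic form.
LOG_VARIABLES = ("Ly", "LDH", "K", "Bil", "Crea", "PCT", "INR", "IL-6", "ALT")
OFFSET = 0.01  # offset quoted in the text


def log_transform_column(values, offset=OFFSET):
    """Natural-log transform of one variable; None (missing) is passed through.

    Assumption: the offset is applied to the whole column only when at least
    one observed value is zero.
    """
    observed = [v for v in values if v is not None]
    shift = offset if any(v == 0 for v in observed) else 0.0
    return [None if v is None else math.log(v + shift) for v in values]


# Hypothetical values, not study data.
with_zero = log_transform_column([0.0, 1.0, None])
without_zero = log_transform_column([1.0, 10.0])
```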
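The quartile-based analysis can be sketched as follows. When a logistic model is fitted against a single binned factor, the fitted log odds in each bin coincide with the empirical log odds of that bin, so the sketch computes the latter directly. The 0.5 continuity correction (to avoid infinite values in bins with no cases or no non-cases) and the toy data are assumptions of this sketch, not part of the study.

```python
import math
from bisect import bisect_right
from statistics import quantiles


def quartile_log_odds(x, y):
    """Empirical log odds of being a case in each quartile bin of x.

    x: numeric covariate; y: 1 for a case, 0 for a non-case.
    """
    cuts = quantiles(x, n=4)  # three cut points: Q1, median, Q3
    cases = [0, 0, 0, 0]
    totals = [0, 0, 0, 0]
    for xi, yi in zip(x, y):
        b = bisect_right(cuts, xi)  # bin index 0..3
        cases[b] += yi
        totals[b] += 1
    # 0.5 is added to both counts (assumed correction) to keep the log finite.
    return [math.log((c + 0.5) / (t - c + 0.5)) for c, t in zip(cases, totals)]


# Hypothetical sO2-like values and outcomes, not study data.
so2 = [80, 82, 84, 86, 88, 90, 92, 94, 95, 96, 97, 98]
case = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
log_odds = quartile_log_odds(so2, case)
```

On this toy input the log odds fall monotonically from the lowest to the highest quartile, which is the qualitative pattern described for sO2.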

An example of the preliminary analyses performed on all
covariates. Here the effect sO
To apply the ML and statistical methods, the dataset was randomly divided into a training set (80%) and a validation set (20%). The training set comprises 778 patients, of whom 95 are cases. Data from the early second wave (October 1st to 24th, 2020) are also used to verify the accuracy of the classifiers. We refer to these as the internal (first wave) and external (second wave) validation samples. Further preliminary analyses on the training dataset were performed by adapting the method of Perez-Riverol et al. [15]. In detail, after using K-NN (k = 5) to impute missing values and a capping procedure for outliers, the Ne variable was removed because it is highly correlated with WBC (cutoff 0.7) and has a higher rate of missing values in the original dataset. A near-zero variance analysis did not point to further removals.
On the training set, we then used a wrapper method for feature selection, with cross-validation steps to improve the selection [5, 15]. The proposed recursive feature elimination (RFE), a backward selection algorithm, was combined with an RF model and 10-fold cross-validation repeated five times [10, 11]. This selection algorithm ranks the variables iteratively according to their importance (determined using RF) and, at each stage, eliminates the least important predictors. The set of hyperparameters for each algorithm was determined through hyperparameter optimization (grid search). To tackle class imbalance, we used the Synthetic Minority Oversampling Technique (SMOTE), a computational method that artificially generates units from the minority class via the K-nearest neighbor (K-NN) method [13].
The repeated use of RFE indicated 16 covariates as relevant: Fever, Cough, LDH(log), WBC, CRP, Plt, Ly(log), Dyspnea, Fib, Bil(log), sO2, Na, Hb, PCT(log), Body Temperature, IL-6(log). On these 16 variables, several classifiers were implemented, both parametric (namely LR and QDA) and non-parametric (namely RF, SVM, NN, K-NN, NB). In implementing these classifiers, a spatial sign transform was applied to the data, as this transform is known to achieve better discriminatory power in some settings [16], a finding that is confirmed in our analysis. The imputation method was used only on the training set in order to preserve the validity of the cross-validation.
Almost all classifiers exhibit a high area under the ROC curve (AUC), but the best results are obtained when the classifiers are fitted to the log-transformed data with SMOTE rebalancing. In this case, with the sole exceptions of SVM and K-NN, all methods applied to the internal validation dataset have an AUC above 80% (see Table 3). The external validation also shows a high level of accuracy (see Table 4) when the classifiers estimated with SMOTE are applied to the log-transformed data. In addition, all classifiers exhibit a good level of specificity and sensitivity in all the tested scenarios, confirming the efficacy of the proposed approaches. Notice that, depending on the aim of the classification, sensitivity could be increased, but at the expense of specificity. We found that the classifiers based on RF, NN and LR are particularly successful. Since the latter has a parametric form that lends itself naturally to scientific interpretation, the next section focuses on the estimated parameters of this model.
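The correlation-based removal step can be sketched in Python as below. The rule used here (for each pair with absolute Pearson correlation above the cutoff, drop the variable with the higher missing rate) is an assumed reading of the description and may differ from what the R/caret tooling does internally; all names, values and missing rates are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, missing_rate, cutoff=0.7):
    """Keep one variable from every pair whose |r| exceeds cutoff.

    columns: dict name -> list of (already imputed) values.
    missing_rate: dict name -> share of missing values in the original data.
    """
    dropped = set()
    names = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > cutoff:
                dropped.add(a if missing_rate[a] > missing_rate[b] else b)
    return [n for n in names if n not in dropped]


# Hypothetical toy data, not study data.
columns = {
    "WBC": [5.0, 7.0, 9.0, 12.0, 15.0],
    "Ne": [3.0, 4.5, 6.5, 9.0, 12.0],
    "Na": [140.0, 137.0, 141.0, 139.0, 138.0],
}
missing_rate = {"WBC": 0.01, "Ne": 0.05, "Na": 0.02}
selected = drop_correlated(columns, missing_rate)
```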
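The idea behind SMOTE can be illustrated with a simplified Python sketch: each synthetic unit is placed at a random position on the segment joining a minority-class point and one of its k nearest minority-class neighbors. This is not the implementation used in the study; the choice k = 5, the random selection of the base point and the toy coordinates are assumptions of the sketch.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority points
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical two-feature minority class, not study data.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
new_points = smote(minority, n_new=10)
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region already occupied by the minority class.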
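As an illustration of the spatial sign transform mentioned above, the following Python sketch projects each observation onto the unit sphere by dividing it by its Euclidean norm. It assumes the predictors have already been centered and scaled, which the text does not state explicitly; the function name and values are hypothetical.

```python
import math


def spatial_sign(row):
    """Project one observation onto the unit sphere: x / ||x||.

    A zero vector is returned unchanged.
    """
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)
    return [v / norm for v in row]


# Hypothetical centered and scaled observations; the second is an outlier
# lying in the same direction as the first.
typical = spatial_sign([3.0, 4.0])
outlier = spatial_sign([30.0, 40.0])
```

Both observations map to the same point on the unit circle, which shows how the transform removes the influence of extreme magnitudes while keeping direction.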
No. | Model | ROC (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|---|
1 | RF | 82.35 | 60.87 | 90.59 |
2 | NN | 83.38 | 52.17 | 88.82 |
3 | SVM (radial) | 77.95 | 52.17 | 86.47 |
4 | K-NN | 64.72 | 47.83 | 84.12 |
5 | NB | 80.43 | 56.52 | 82.94 |
6 | LR | 86.75 | 73.91 | 85.88 |
7 | QDA | 82.10 | 56.52 | 84.12 |
RF, Random Forest; NN, Neural Network; SVM, Support Vector Machine; K-NN, K-nearest neighbor; NB, Naive Bayes; LR, Logistic Regression; QDA, Quadratic Discriminant Analysis.
No. | Model | ROC (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|---|
1 | RF | 89.70 | 83.33 | 100.00 |
2 | NN | 83.45 | 75.00 | 88.89 |
3 | SVM (radial) | 83.45 | 70.83 | 86.11 |
4 | K-NN | 83.85 | 75.00 | 91.67 |
5 | NB | 79.75 | 75.00 | 77.78 |
6 | LR | 85.65 | 70.83 | 100.00 |
7 | QDA | 80.21 | 66.67 | 94.44 |
RF, Random Forest; NN, Neural Network; SVM, Support Vector Machine; K-NN, K-nearest neighbor; NB, Naive Bayes; LR, Logistic Regression; QDA, Quadratic Discriminant Analysis.
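For readers less familiar with the metrics in the two tables above, the Python sketch below shows how sensitivity, specificity and the area under the ROC curve can be computed. The confusion counts in the example are inferred, not reported: the text implies an internal validation set of 971 - 778 = 193 patients with 118 - 95 = 23 cases and 170 non-cases, and the LR row for the internal validation is consistent with 17 of 23 cases and 146 of 170 non-cases being classified correctly. The AUC scores are purely hypothetical.

```python
def sensitivity(tp, fn):
    """Share of true cases that the classifier flags as cases."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of non-cases that the classifier flags as non-cases."""
    return tn / (tn + fp)


def auc(scores_pos, scores_neg):
    """Probability that a random case scores above a random non-case.

    Ties count one half; this equals the area under the ROC curve.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Inferred counts for LR on the internal validation set (assumption).
lr_sensitivity = round(100 * sensitivity(tp=17, fn=6), 2)
lr_specificity = round(100 * specificity(tn=146, fp=24), 2)

# Hypothetical predicted probabilities, not study data.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```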
Table 5 reports the parameter estimates together with their standard errors and p-values (to preserve interpretability, all variables are on their original scale, except for the log transformation of LDH, Ly, Bil, PCT and IL-6). Point estimates and 95% confidence intervals of the odds ratios are also reported. All variables that were non-significant on the SMOTE data appeared to improve the precision of the classifier in the external validation and were therefore kept in the classifier (the only exception is Body Temperature). All variables acted in the expected direction, with the exception of Bil and PCT, which exhibited a significant negative effect. The log odds of being a case were 0.364 higher for a patient with Dyspnea than for a patient without (corresponding to an odds ratio of 1.4391), when both had the same values of the other covariates. Similarly, at any level of the other covariates, a one-point increase in sO2 decreased the log odds of being a case by 0.539 (corresponding to an odds ratio of 0.583).
Variable | Estimate | Std. Error | z value | p-value | Odds Ratio | 95% CI Lower Limit | 95% CI Upper Limit |
---|---|---|---|---|---|---|---|
(Intercept) | –1.0211 | 0.1667 | –6.13 | 0.0000 | 0.3602 | 0.2598 | 0.4994 |
Fever | 1.2284 | 0.1375 | 8.93 | 0.0000 | 3.4158 | 2.6088 | 4.4723 |
Cough | 0.1291 | 0.1281 | 1.01 | 0.3136 | 1.1378 | 0.8852 | 1.4625 |
LDH(log) | 0.6081 | 0.1989 | 3.06 | 0.0022 | 1.8369 | 1.2439 | 2.7127 |
WBC | –0.5338 | 0.1684 | –3.17 | 0.0015 | 0.5864 | 0.4215 | 0.8157 |
CRP | 0.1849 | 0.1747 | 1.06 | 0.2899 | 1.2031 | 0.8543 | 1.6944 |
Plt | –0.6788 | 0.1675 | –4.05 | 0.0001 | 0.5072 | 0.3653 | 0.7043 |
Ly(log) | –0.3368 | 0.1438 | –2.34 | 0.0192 | 0.7141 | 0.5387 | 0.9465 |
Dyspnea | 0.3640 | 0.1202 | 3.03 | 0.0025 | 1.4391 | 1.1370 | 1.8214 |
Fib | 0.5726 | 0.2079 | 2.75 | 0.0059 | 1.7729 | 1.1795 | 2.6647 |
Bil(log) | –0.3614 | 0.1369 | –2.64 | 0.0083 | 0.6967 | 0.5327 | 0.9111 |
sO2 | –0.5392 | 0.1371 | –3.93 | 0.0001 | 0.5832 | 0.4458 | 0.7630 |
Na | 0.5167 | 0.1338 | 3.86 | 0.0001 | 1.6765 | 1.2898 | 2.1792 |
Hb | 0.1235 | 0.1208 | 1.02 | 0.3066 | 1.1315 | 0.8929 | 1.4337 |
PCT(log) | –1.9303 | 0.3936 | –4.90 | 0.0000 | 0.1451 | 0.0671 | 0.3138 |
IL-6(log) | 0.3395 | 0.2436 | 1.39 | 0.1634 | 1.4042 | 0.8711 | 2.2636 |
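The link between the estimates and the odds ratios in the table can be checked with a short Python sketch. The odds ratio is the exponential of the coefficient; the confidence limits in the table appear consistent with a Wald interval, exp(estimate ± 1.96 × standard error), although the text does not name the interval method, so this is an assumption of the sketch.

```python
import math


def odds_ratio_ci(estimate, std_error, z=1.96):
    """Odds ratio and Wald-type 95% interval from a logistic coefficient."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


# Estimate and standard error taken from the Dyspnea and sO2 rows of the table.
dyspnea_or, dyspnea_low, dyspnea_high = odds_ratio_ci(0.3640, 0.1202)
so2_or, so2_low, so2_high = odds_ratio_ci(-0.5392, 0.1371)
```

For Dyspnea this gives an odds ratio of about 1.439 with limits of about 1.137 and 1.821, in agreement with the tabulated values up to rounding.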
In the last few years, several research groups have been developing ML methods to support daily clinical practice, and the worldwide pandemic has intensified efforts in that direction [5]. Our work applied ML methodology and classical statistical classification models to data from routine blood exams, which are commonly requested at admission to the ED. Such exams are ready in a short time and are much cheaper than molecular tests or radiological examinations.
Furthermore, although the RT-PCR test is still the gold standard for conclusively diagnosing COVID-19 infection, there are some concerns about its clinical performance, which is affected by a number of difficult-to-measure factors such as low levels of shedding during incubation. Despite its known high specificity, its sensitivity is still debated [17]. As a matter of fact, among initially negative patients subjected to repeat SARS-CoV-2 RT-PCR testing, newly positive results may occur within 7 days of initial presentation [7].
Notice that, while during the first wave physicians were facing an unknown disease, in the second wave several limitations (such as the high rate of false-negative RT-PCR results) were already well documented. As a result, whereas the 30-day review was absolutely necessary for the first wave, it was not required for the second wave because new molecular approaches had been created. This supports our decision to use the second molecular test for the second wave data and the physician’s gestalt when training the model on the first wave data. Since the two COVID-19 waves were markedly different in terms of demographic characteristics, we deliberately excluded age and sex from the training dataset. Keeping them in the models would have worsened performance in the external validation sample.
LR can be seen both as a good classifier and as a tool that gives scientific insight into the subject matter. A thorough analysis of the estimated parameters reveals that all 16 factors affect a patient’s likelihood of being a case in the expected direction, with the exception of Bil and PCT, which have a significant negative effect. This aspect requires further investigation. A possible explanation is that the two variables do not influence the probability that a patient is a COVID-19 case. If so, the negative association may be an artifact of the selection procedure of the sampled population. In fact, if the assumption of no effect is correct, patients who present at the ED with high values of these variables are affected by diseases other than COVID-19. Since it is rather unlikely that they are affected by two diseases at the same time, this may induce the negative association.
Limitations of our study come from the combined effect of class imbalance and missing data. This surely affects mostly the sensitivity of the classifiers, but our results are nevertheless in agreement with previous and recent studies [5, 18]. A further limitation may come from the fact that, at present, blood tests are not needed, except for patients with severe COVID-19 symptoms or with complications due to co-morbidities, which, however, may affect blood test results. Recently, other tools that consider clinical and epidemiological features to build risk scores for predicting SARS-CoV-2 infection have been developed and are well described in the literature [19]. We claim that training of the model could be greatly improved if all the covariates that proved relevant according to this and other studies were measured for all patients. By doing so, the problem of class imbalance could be addressed in a more robust way via the SMOTE rebalancing approach, and the ML and statistical methods described here would then constitute a valid instrument for rapid assessment of a potential COVID-19 positive patient.
In case of another pandemic, new data should be collected in order to train the classifiers and this may constitute another limitation.
Nevertheless, our study is innovative in that we tested for positivity within a week, in line with the majority of recently published research studies [7, 8]. We want to underline that the proposed algorithms do not claim to be an alternative to the gold standard RT-PCR, but rather to provide additional, useful complementary information. This can be used either when the RT-PCR is missing or to identify, among patients with a negative SARS-CoV-2 RT-PCR, which ones are more likely to become cases within the 7-day window, although more recent studies demonstrated that only patients with a very high clinical probability and an initially negative RT-PCR result justify the need for retesting [20].
In this paper we demonstrated that ML and classical statistical methods may be applied to common blood tests upon admission to the emergency department when there is clinical suspicion of COVID-19, given the affordability and speed of these tests.
The focus on the logistic model allows a better understanding of the data and their trends, whereas ML models are largely black boxes in which the transformation applied to the data cannot be fully understood.
We underline that the use of automated processes to classify cases may significantly aid clinicians in dealing with the constantly evolving COVID-19 virus. In this sense, our paper serves more as a manual on how to apply these techniques and what benefits and drawbacks they might have. This supports the idea that well trained, properly tested and routinely updated automated classifiers may help clinicians’ decisions if implemented in the Laboratory Information System, with the outcome of the classifiers made available to the ED.
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
LL—conceptualization and interpretation of results, proof outline and manuscript writing; AM—data collection and database filling for validation; BP—data analysis; AAld—laboratory data collection; PN—data collection from the ED and database creation; AP—data collection from the ED and database creation; SG—data collection from the ED supervision and conceptualization; BT—general Management and Ethical committee bureaucratic acts; CN—general Management and Ethical committee bureaucratic acts; LT—general management and Ethical committee bureaucratic acts; AF—critical revision of conceptualization and curation of the Ethical committee submission and approval; AAme—critical revision of conceptualization, and supervision of the version to be published; ES—data analysis, data analysis supervision, conceptualization and interpretation of results, manuscript writing.
This study was conducted after approval by the local ethical committee (17104_oss), and informed consent was submitted to and signed by the patients involved.
Not applicable.
This research received no external funding.
The authors declare no conflict of interest. AA is serving as one of the Editorial Board members and Guest Editors of this journal. AF was the Guest Editor of this journal. We declare that AA and AF had no involvement in the peer review of this article and have no access to information regarding its peer review. Full responsibility for the editorial process for this article was delegated to RJPA.
Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.