Academic Editor: Sang Heui Seo
Background: In the current COVID-19 pandemic, with an absence of approved drugs and widely accessible vaccines, repurposing existing drugs is vital to quickly developing a treatment for the disease. Methods: In this study, we used a dataset consisting of sequences of viral proteins and chemical structures of pharmaceutical drugs for known drug–target interactions (DTIs) and artificially generated non-interacting DTIs to train a binary classifier with the ability to predict new DTIs. Random Forest (RF), deep neural network (DNN), and convolutional neural networks (CNN) were tested. The CNN and RF models were selected for the classification task. Results: The models generalized well to the given DTI data and were used to predict DTIs involving SARS-CoV-2 nonstructural proteins (NSPs). We elucidated (with the CNN) 29 drugs involved in 82 DTIs with a 97% probability of interaction, 44 DTIs of which had a 99% probability of interaction, to treat COVID-19. The RF elucidated 6 drugs involved in 17 DTIs with a 90% probability of interacting. Conclusions: These results give new insight into possible inhibitors of the viral proteins beyond pharmacophore models and molecular docking procedures used in recent studies.
Since December 2019, COVID-19 has caused a global pandemic, affecting millions of lives in over 210 countries and territories. Several vaccines are currently available, but other treatments for the virus are lacking. Because SARS-CoV-2 is structurally similar to other betacoronaviruses such as MERS-CoV and, more closely, SARS-CoV [1], many previously established drugs are being investigated for repurposing against the current pandemic [2]. Repurposing allows for more rapid drug discovery and approval, which is vital in the current emergency.
Many different in-silico methods can be used for this purpose. Docking and molecular screening/modeling have been widely used to discover potential treatments for the novel coronavirus as well as for other diseases [3, 4, 5, 6, 7]. Additionally, several studies [8, 9, 10, 11, 12, 13, 14, 15] have used machine learning and artificial intelligence to predict drug–target interactions (DTIs) for various viruses, including SARS-CoV-2, with deep neural networks (DNN), support vector machines (SVM), and random forest (RF) classifiers, among others, as detailed by [16]. Studies such as [10] have employed methods like ours, using a convolutional neural network to predict drug–target interactions, while others have used machine-learning methods such as Naïve Bayes to carry out the classification task [15]. In contrast, studies such as [13] employed a regression model, rather than a binary classification model, to predict the binding scores of ligands against the SARS-CoV-2 viral proteins. Similarity-based methods such as network-based inference and k-nearest neighbors have also been used for this task, as they are often less computationally intensive [17, 18].
There have been efforts to repurpose currently approved drugs to inhibit the
virus’s structural and nonstructural proteins by preventing the virus from
entering the cell, preventing it from activating, or preventing it from
replicating itself (these are the preferred drugs) [19]. SARS-CoV-2 has four
structural proteins and sixteen nonstructural proteins (NSP) that carry out
various tasks essential to the virus’s ability to infect individuals. In theory,
all the NSPs can be exploited as drug targets, impeding the virus’s ability to
carry out its harmful functions in the host cell; however, some are more viable
targets than others due to the availability of their crystal structures or their
importance in the life cycle of the virus [1]. It is also possible to inhibit
host-based targets that facilitate the virus’s entry into the host cell, such as
the angiotensin-converting enzyme 2 (ACE2). As a result, studies are emerging that
consider this a potential way to treat the disease [1]. Recently it has been
found that the TMPRSS2 enzyme in the host cell allows the virus to enter the cell
by priming the spike proteins, which is a promising target in
developing/repurposing a drug [20, 21]. Drugs that inhibit the RNA-dependent RNA
polymerase (RdRp, NSP12) are also being considered as possible treatments of the
virus; these include ribavirin, remdesivir, sofosbuvir, and IDX-184 [2, 19, 21].
NSP12 is an attractive target due to its role in RNA replication in the life
cycle of the virus and the availability of its crystal structure [21]. However,
it is also possible to inhibit the virus before the NSPs (including NSP12) have
been cleaved from the polyproteins 1a and 1ab (pp1a and pp1ab). The
3-chymotrypsin-like protease (3CL
Considering this, in our study we look to repurpose approved drugs to inhibit all SARS-CoV-2 NSPs using a machine-learning approach that takes advantage of structural similarities between viral proteins and similarities between pharmaceutical drugs. This method allows for high-throughput DTI prediction, greatly aiding the fight against the virus.
The methods used in this project are presented in Fig. 1. The programs developed are presented at https://github.com/Shkev/Sars-CoV-2-NSP-Predictions.
Methods flowchart used in this study.
Experimentally verified DTIs for approved drugs, along with the drug SMILES and viral protein sequences, were downloaded from the DrugBank website [26] (release 5.1.7). In total, 19,242 DTIs were collected, involving 2468 drugs and 5177 proteins including those from various viruses (influenza, HIV, SARS-CoV, etc.) as well as human proteins.
All SMILES strings were converted to canonical SMILES using Open Babel [27] to standardize them and allow easier handling of the data.
SARS-CoV-2 NSP sequences were downloaded from the NCBI Protein database.
The dataset contained the chemical structures of the drugs in the form of SMILES strings. Two-dimensional (2D) drug descriptors were calculated using the Online Chemical Database [28]. These descriptors capture the relationship between the structures of the drugs and their properties, providing sufficient information to train the machine-learning model to recognize patterns in the data [29]. Drugs for which descriptors could not be calculated were removed along with any DTIs they were involved in, leaving 2444 drugs and 16,640 DTIs.
It is known that sufficient information about proteins is contained in the
amino-acid sequences. Hence, we used common sequence descriptors and domain
information to represent the proteins in our dataset [30]. The protein-sequence
descriptors consisted of the amino-acid composition (AAC), dipeptide composition
(DC), and tripeptide composition (TC). AAC is the frequency of each amino acid in
the sequences. DC is the frequency of each possible pair of two amino acids in
the sequences. TC is the frequency of each possible triplet of amino acids in the
sequences. In addition to these, the domain information for each protein was
obtained from the NCBI Batch Conserved Domain search and was used to construct an
adjacency matrix. Each column and row represented one of the
Example
The respective protein and drug descriptors were combined to form one numerical vector for each DTI. Thus, each DTI was represented as an array of approximately 25,000 values, as in Eqn. 2.
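The composition descriptors and their concatenation with the drug descriptors can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "frequency" means the relative frequency over all sliding windows, it omits the domain-based adjacency features, and the function names, the toy sequence, and the toy drug descriptors are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Relative frequency of every possible k-mer (k=1: AAC, k=2: DC, k=3: TC)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing non-standard residues
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def dti_vector(protein_sequence, drug_descriptors):
    """Concatenate AAC (20), DC (400), and TC (8000) with the drug descriptors."""
    protein_part = []
    for k in (1, 2, 3):
        protein_part.extend(kmer_composition(protein_sequence, k))
    return protein_part + list(drug_descriptors)


toy_sequence = "MKVLAAGIVKV"   # hypothetical protein sequence
toy_drug = [0.12, 3.4, 1.0]    # hypothetical 2D drug descriptors
vector = dti_vector(toy_sequence, toy_drug)
print(len(vector))             # 20 + 400 + 8000 + 3 values
```

The three composition blocks alone contribute 8420 protein features per DTI, which illustrates why the feature reduction steps described later are needed.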
Data from DrugBank supplied experimentally verified DTIs; however, to train a
machine-learning model, we also need a set of false DTIs. To achieve this, all
possible combinations between the
The artificially created set of negative DTIs was combined with the verified set. Positive DTIs were assigned a label of 1 and the artificial negative DTIs were assigned a label of 0, which allowed for binary classification. The data was randomly split into a training (70% of data) and a testing dataset (30%), ensuring that both datasets contained an approximately balanced number of both classes. The testing dataset was set aside and not considered in the development of the model as it was intended to represent a set of independent outside data.
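The labeling and splitting steps can be sketched as below. This is an illustrative reconstruction under stated assumptions: negatives are taken to be drug–protein pairs absent from the verified set, sub-sampled to match the number of positives, and the balance of the two classes is enforced with a per-class (stratified) split. The function names and toy identifiers are hypothetical.

```python
import random


def build_dataset(positive_pairs, drugs, proteins, seed=0):
    """Label verified pairs 1 and an equal-sized sample of unverified pairs 0."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    # Assumption: every drug-protein combination not in the verified set is a
    # candidate negative.
    candidates = [(d, p) for d in drugs for p in proteins
                  if (d, p) not in positives]
    negatives = rng.sample(candidates, len(positives))
    return ([(pair, 1) for pair in sorted(positives)]
            + [(pair, 0) for pair in negatives])


def stratified_split(labeled, test_fraction=0.3, seed=0):
    """Split each class separately so both subsets stay approximately balanced."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [item for item in labeled if item[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


drugs = [f"D{i}" for i in range(10)]       # hypothetical drug IDs
proteins = [f"P{i}" for i in range(10)]    # hypothetical protein IDs
verified = [(f"D{i}", f"P{i}") for i in range(10)]

labeled = build_dataset(verified, drugs, proteins)
train, test = stratified_split(labeled)
print(len(train), len(test))
```

Unverified pairs are not guaranteed to be true non-interactions, so labels of 0 produced this way should be read as "not known to interact".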
The DTI data vectors were preprocessed to shift the mean of each feature to 0
and remove those that provide little information about patterns in the data.
Doing so reduces bias in giving more importance to some features over others when
training a neural network. Note that the protein features were not adjusted, as
this would remove the value in the frequency counts and adjacency matrix since
the units for these values are already standardized. The mean-adjusted value for
the
Preliminary feature reduction was performed on the mean-adjusted drug data and the protein sequence descriptors using a variance threshold. Namely, all features with variance less than the chosen threshold of 0.01 were removed. Domain features were not reduced in this way, as they were sparse (more than half the values per feature are 0) and thus would result in a near 0 variance for all such features.
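A minimal sketch of the mean adjustment and the variance filter is given below. It assumes the population variance is used and that a feature is kept when its variance is at least 0.01; the function names and toy feature columns are hypothetical.

```python
from statistics import fmean, pvariance


def mean_center(column):
    """Shift a feature column so that its mean is 0."""
    mu = fmean(column)
    return [v - mu for v in column]


def variance_filter(columns, threshold=0.01):
    """Return the indices of columns whose variance is at least the threshold."""
    return [j for j, col in enumerate(columns) if pvariance(col) >= threshold]


# Hypothetical drug-descriptor columns (one list per feature).
columns = [
    [1.0, 2.0, 3.0, 4.0],      # informative
    [0.5, 0.5, 0.5, 0.5],      # constant, variance 0
    [0.0, 0.1, 0.0, 0.1],      # variance 0.0025, below the threshold
    [10.0, 12.0, 11.0, 13.0],  # informative
]

centered = [mean_center(col) for col in columns]
kept = variance_filter(centered)
print(kept)
```

Mean adjustment does not change the variance of a feature, so the order of the two steps does not affect which features survive the filter.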
Feature reduction is a key step in training a machine-learning model effectively, as it removes the features with the least influence on the model. This improves the efficiency and effectiveness of the model, since processing high-dimensional data is computationally expensive [11]. In this study, we implemented a Lasso linear model using the Scikit-Learn Python library [31] to select the most informative features [30]. This algorithm is a regularized linear regression that shrinks the coefficients of the least informative terms toward 0. Features whose coefficients in the trained model have a magnitude below a chosen threshold are therefore removed, leaving the most prominent features.
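For reference, the Lasso objective can be written in the form minimized by Scikit-Learn's `Lasso` estimator. The symbols are assumptions made for illustration: $X$ is the $n \times p$ feature matrix, $y$ the vector of labels, $\beta$ the coefficient vector, $\alpha$ the regularization strength, and $\tau$ a hypothetical name for the coefficient threshold used to select the feature set $S$.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1,
\qquad
S = \{\, j : \lvert \hat{\beta}_j \rvert \ge \tau \,\}
```

The $\ell_1$ penalty drives many coefficients exactly to zero as $\alpha$ grows, which is what makes the fitted model usable as a feature selector.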
Drug features and protein features were considered separately using two Lasso
models. Models were trained and validated using the training dataset. The Lasso
model contains a regularization parameter,
Classification models were trained for the DTI classification problem. All data handling in the process was done with the Pandas Python library [32]. The models were trained on the approximately 32,000 data points, half of which were positive DTIs extracted from DrugBank and half of which were sub-sampled negative DTIs. Each data point was a vector of 3048 values along with a label, 0 or 1, distinguishing negative from positive DTIs. The output of each model is the probability that the input DTI is positive.
Three machine-learning models were trained and tested: a deep neural network (DNN) [30], a random forest classifier (RF) [33], and a convolutional neural network (CNN) [34]. The best-performing one was used in the final SARS-CoV-2 NSP DTI predictions. A part of the training dataset (30%) was randomly separated from the rest of the data to create a validation dataset that was used to tune our models’ hyperparameters and optimize metrics. This tuning was done manually, adjusting the hyperparameters of the models individually (using guidance from [35]) until the desired training/validation metrics were obtained. The AUROC and the binary accuracy of the models were used to compare them. The binary accuracy was calculated as the percentage of predictions that were consistent with their corresponding known value in the testing/validation datasets using a threshold of 0.5 (model predictions greater than or equal to 0.5 were considered as 1 and all others as 0). Since accuracy only measures the performance of the model at a single threshold, we also used the AUROC score, which measures performance across thresholds, to better judge the viability of the models.
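The two comparison metrics can be illustrated with a small self-contained sketch. The accuracy follows the 0.5 threshold rule stated above; the AUROC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is a standard equivalent of the area under the ROC curve but not necessarily how the authors computed it. The toy labels and scores are hypothetical.

```python
def binary_accuracy(y_true, y_score, threshold=0.5):
    """Fraction of predictions matching the labels; scores >= threshold count as 1."""
    predictions = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]                 # hypothetical labels
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]    # hypothetical model outputs

print(round(binary_accuracy(y_true, y_score), 3))  # 4 of 6 correct -> 0.667
print(round(auroc(y_true, y_score), 3))            # 7 of 9 pairs ordered -> 0.778
```

The pairwise AUROC above is quadratic in the number of samples; it is meant to show what the score measures, and a sorting-based implementation would be preferable for datasets of the size used here.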
Random Forest was implemented using the Scikit-Learn Python library (version 0.24) [31]. The model was trained with 100 trees with no maximum depth, a minimum of 2 samples to split an internal node, a minimum of 1 sample required to be at a leaf node, and all other default parameters (which can be seen in the documentation).
We implemented a DNN and CNN using the TensorFlow Python library [36].
The DNN architecture consisted of an input layer, two hidden dense layers with 4096 nodes each, and an output layer. The rectified linear activation function (ReLU) was applied to each hidden layer, and a sigmoid activation was used on the output layer to yield a value between 0 and 1. Additionally, dropout layers of 50% and Ridge regression (L2) regularization were used in the hidden layers to reduce overfitting [30]. Two hidden layers of equal size were used as recommended by [35]. Different numbers of nodes were experimented with, and the value that resulted in the least overfitting (as determined by comparing training and validation metrics) was used, namely 4096 nodes per layer. In general, as the number of nodes increased, the training accuracy and AUROC increased while these validation metrics suffered. Likewise, as the number of nodes decreased, the training metrics fell, but there was less overfitting. The model was trained with a learning rate of 0.00001 and a batch size of 32.
The CNN model resembles the DNN architecture with the addition of 1D convolutional (Conv1D) layers, which allow the model to extract local patterns in the data that the dense layers would otherwise not recognize. We implemented a three-layer convolutional network that outputs to a fully connected dense layer (with 2048 nodes) that feeds into the output layer. The number of filters in the Conv1D layers increased from 16 to 32 to 64 (each with kernel size 3) to progressively learn more features from the data. Each Conv1D layer fed into a Batch Normalization layer and a Max Pooling layer with pool size 3 to normalize the layer activations and help prevent the model from overfitting. The ReLU activation function was used on all Conv1D and hidden dense layers, and the sigmoid activation function was applied to the output layer. A dropout of 50% was added to the flattened output of the final convolutional layer and to the dense layers, while L2 regularization was applied to all convolutional and dense layers. This model was also trained with a learning rate of 0.00001 and a batch size of 32.
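To make the architecture concrete, the following sketch traces how the length of a 3048-value input would change through the three Conv1D and Max Pooling stages. It assumes settings the text does not state: no padding ("valid"), a convolution stride of 1, a pooling stride equal to the pool size, and a single input channel. The helper names are hypothetical.

```python
def conv1d_length(length, kernel_size, stride=1):
    """Output length of an unpadded 1D convolution."""
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size):
    """Output length of unpadded max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


length = 3048                 # values per DTI vector after feature selection
lengths = []
for filters in (16, 32, 64):  # filters per Conv1D layer
    length = conv1d_length(length, kernel_size=3)
    length = maxpool1d_length(length, pool_size=3)
    lengths.append(length)
    print(f"{filters} filters -> sequence length {length}")

flattened = lengths[-1] * 64  # values fed to the 2048-node dense layer
print(flattened)              # 111 * 64 = 7104
```

Under these assumptions the sequence shrinks from 3048 to 1015, 337, and 111 positions, so the dense layer receives 7104 values. Different padding or stride choices would change these numbers.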
As can be seen in Table 1, the CNN model performed best in both metrics; hence, it was used in predicting DTIs involving SARS-CoV-2 nonstructural proteins.
Machine-learning model | Validation AUROC | Validation accuracy |
Random Forest | 0.955 | 0.889 |
DNN | 0.930 | 0.856 |
CNN | 0.991 | 0.969 |
The testing dataset was used to test the best-performing model (CNN) on a partition of the DrugBank data that the model has not seen before (the labels for the testing input are known, so the accuracy and AUC of the model can be calculated). There was an approximately equal number of each class (positive and negative DTIs) in the test dataset, with 4992 negative DTIs and 4989 positive DTIs. The accuracy and AUC from these predictions give an accurate representation of how the model will perform when making predictions from the SARS-CoV-2 NSP data. The CNN scored very highly on this dataset, showing that it generalized well from the training data, which makes it viable to use in predicting new DTIs. The Random Forest classifier performed slightly worse in these metrics but outperformed the CNN in recall/true positive rate and F-measure, which are valuable metrics in this use-case as it is important that the predicted positive DTIs are predicted correctly (true positives).
The CNN model achieved an AUROC of 0.954 (Fig. 3A), a Precision-Recall AUC of 0.951 (Fig. 3B), and an accuracy of 0.895. The Random Forest classifier achieved an AUROC of 0.950 (Fig. 4A), a Precision-Recall AUC of 0.950 (Fig. 4B), and an accuracy of 0.888. The test metrics of the CNN, along with those of the other models (Random Forest and DNN), are shown in Table 2. The CNN confusion matrix for the predictions at a truth threshold of 0.97 can be seen in Fig. 5A. The confusion matrix for the RF classifier can be seen in Fig. 5B.
Performance of the Convolutional Neural Network (CNN) classification model. (A) ROC curve of the CNN on the testing data. The area under the ROC curve is 0.95, which shows that our model generalized well from the training data and did not overfit. (B) Precision-Recall plot of the CNN on the testing data. The area under the curve is 0.95, which shows that our model generalized well from the training data and performs well at different truth thresholds.
Performance of the Random Forest (RF) classification model. (A) ROC curve of the RF on the testing data. The area under the ROC curve is 0.95, which shows that our model generalized well from the training data and did not overfit. (B) Precision–Recall curve of the RF on the testing data. The area under the curve is 0.95, which shows that our model generalized well from the training data and performs well at different truth thresholds.
Confusion matrices for machine-learning models test predictions with a truth threshold of 0.97. True negatives are represented by the top left square and true positives are represented by the bottom right square. False positives are seen in the top right square and false negatives are seen in the bottom left square. (A) Confusion matrix for CNN model test predictions. (B) Confusion matrix for the RF classifier test predictions.
Model | AUROC | Accuracy | Precision | Recall | F-measure |
Random Forest | 0.950 | 0.888 | 0.921 | 0.848 | 0.883 |
DNN | 0.920 | 0.846 | 0.971 | 0.406 | 0.573 |
CNN | 0.954 | 0.895 | 0.965 | 0.704 | 0.814 |
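The F-measure column of Table 2 can be reproduced from the precision and recall columns as their harmonic mean. The short check below uses only the values reported in the table; the function and variable names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) taken from Table 2
table2 = {
    "Random Forest": (0.921, 0.848, 0.883),
    "DNN": (0.971, 0.406, 0.573),
    "CNN": (0.965, 0.704, 0.814),
}

computed = {}
for model, (precision, recall, reported) in table2.items():
    computed[model] = round(f_measure(precision, recall), 3)
    print(f"{model}: computed {computed[model]}, reported {reported}")
```

All three computed values match the reported ones to three decimal places. The comparison also shows why the Random Forest leads in F-measure: its recall (0.848) is much closer to its precision than is the case for the CNN or DNN.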
The CNN model trained on DTI data from DrugBank was used to predict potential interactions between the drugs in the dataset and the 16 SARS-CoV-2 NSPs, whose sequences were obtained from the NCBI Protein database. Each NSP was paired with every drug in the dataset, and the procedure presented above was followed to extract and reduce features from the protein sequences and drug structures and to create DTI vectors (Eqn. 2). The features retained were the same as those selected by the Lasso models. All possible DTIs were input into the model, which calculated the probability that each corresponds to a true DTI. DTIs with an output score greater than or equal to the thresholds of 0.97 and 0.99 were selected as potential DTIs between the DrugBank FDA-approved drugs and the viral proteins.
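The pairing and thresholding step can be sketched as follows. The sketch assumes that all 2444 drugs retained after descriptor calculation were paired with each of the 16 NSPs; the scores, drug names, and function name are hypothetical placeholders for the model outputs.

```python
def select_candidates(scores, threshold):
    """Keep (drug, NSP) pairs whose predicted probability is >= threshold."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)


n_nsps, n_drugs = 16, 2444
n_candidates = n_nsps * n_drugs
print(n_candidates)  # 39104 candidate DTIs

# Hypothetical model outputs for a handful of candidate pairs.
toy_scores = {
    ("drugA", "NSP12"): 0.995,
    ("drugA", "NSP5"): 0.975,
    ("drugB", "NSP12"): 0.91,
    ("drugB", "NSP13"): 0.97,
}

at_097 = select_candidates(toy_scores, 0.97)
at_099 = select_candidates(toy_scores, 0.99)
print(at_097)
print(at_099)
```

Under this assumption the number of candidate pairs is 39,104, consistent with the roughly 39,000 possible DTIs reported in the results. Every pair selected at 0.99 is also selected at 0.97, so the stricter result set is a subset of the looser one.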
We trained a convolutional deep-learning model and a random forest classifier to predict drugs that may inhibit SARS-CoV-2 viral proteins. See Table 2 for the performance metrics of these models.
The convolutional model reduced the approximately 39,000 possible input DTIs to the 82 most viable ones: 82 drug–target interactions between FDA-approved (DrugBank) drugs and the SARS-CoV-2 NSPs were predicted with an interaction probability of 97% or greater. Table 3 (Ref. [5, 6, 8, 16, 22, 37, 38, 39, 40, 41, 42, 43, 44, 45]) shows the 29 unique drugs involved in these interactions. The subset of these results that met the 0.99 threshold, shown in Table 4 (Ref. [5, 6, 8, 22, 34, 38, 41, 44]), comprises 44 DTIs involving 13 unique drugs with a 99% or greater probability of interaction with their respective proteins.
DB ID | Name | NSPs | Theoretical studies | Clinical studies (CT ID) |
DB12010 | Fostamatinib | 1–16 | [22] | NCT04352465 |
DB01110 | Miconazole | 2, 3, 5, 6, 7, 9–16 | [8] | NA |
DB03147 | Flavin adenine dinucleotide | 2, 3, 6, 12, 13, 14 | [6, 37] | NA |
DB00114 | Pyridoxal phosphate | 3, 6, 12, 13 | NA | NA |
DB01987 | Cocarboxylase | 3, 12, 13 | [38] | NA |
DB06287 | Temsirolimus | 3, 12 | NA | NA |
DB09237 | Levamlodipine | 3 | NA | NA |
DB00132 | Alpha-linolenic acid | 6 | NA | NCT04647604 |
DB00157 | NADH | 6, 9, 12, 13 | NA | NA |
DB00162 | Vitamin A | 6 | NA | NA |
DB00755 | Tretinoin | 6 | [39] | NA |
DB02659 | Cholic acid | 6, 9, 12, 13 | NA | NA |
DB03247 | Flavin mononucleotide | 6, 9, 11, 12, 13 | [5] | NA |
DB03796 | Palmitic acid | 6 | [40] | NA |
DB09061 | Cannabidiol | 6, 12, 13 | [41] | NCT04647604 |
DB00143 | Glutathione | 9 | NA | NCT04703036 |
DB03619 | Deoxycholic acid | 9 | NA | NA |
DB05154 | Pretomanid | 9 | NA | NA |
DB00144 | Phosphatidyl serine | 12 | NA | NA |
DB00563 | Methotrexate | 12 | [42] | NCT04352465 |
DB01017 | Minocycline | 12 | [16] | NA |
DB01051 | Novobiocin | 12 | [5] | NA |
DB01329 | Cefoperazone | 12, 13 | [43] | NA |
DB08872 | Gabapentin enacarbil | 12 | NA | NA |
DB11901 | Apalutamide | 12, 13 | NA | NA |
DB14879 | Cefiderocol | 12, 13 | [44] | NA |
DB01117 | Atovaquone | 13 | [45] | NCT04456153 |
DB01212 | Ceftriaxone | 13 | NA | NA |
DB08943 | Isoconazole | 13 | NA | NA |
DB ID | Name | NSPs | Theoretical studies | Clinical studies (CT ID) |
DB12010 | Fostamatinib | 1–16 | [22] | NCT04352465 |
DB03147 | Flavin adenine dinucleotide | 3, 12, 13, 14 | [6, 34] | NA |
DB06287 | Temsirolimus | 3 | NA | NA |
DB01110 | Miconazole | 5, 6, 7, 9, 11–16 | [8] | NA |
DB00114 | Pyridoxal phosphate | 6, 12 | NA | NA |
DB00162 | Vitamin A | 6 | NA | NA |
DB03247 | Flavin mononucleotide | 6, 12, 13 | [5] | NA |
DB09061 | Cannabidiol | 6, 12 | [41] | NA |
DB02659 | Cholic acid | 9 | NA | NA |
DB00157 | NADH | 12 | NA | NA |
DB01987 | Cocarboxylase | 12 | [38] | NA |
DB08872 | Gabapentin enacarbil | 12 | NA | NA |
DB14879 | Cefiderocol | 12 | [44] | NA |
Similarly, the Random Forest classifier reduced the inputted DTIs down to 17 DTIs with probability greater than or equal to 90% of interacting, involving 6 unique drugs (Supplementary Table 1).
These results are summarized visually in Fig. 6 (Ref. [46]) and Fig. 7 (Ref. [46]). Figs. 8 and 9 also display the number of interactions for each drug and each target. NSP12, the RNA-dependent RNA polymerase (RdRp), and NSP13, the helicase, were the most targeted proteins in the 0.97 threshold result set, with NSP12 being the most targeted protein overall. NSP12 was also the most targeted protein in the more restricted 0.99 threshold group, followed by NSP6 and NSP13, with NSP6 having two more inhibitors than NSP13 and four fewer inhibitors than NSP12. We also note that fostamatinib, a tyrosine kinase inhibitor, and miconazole, an antifungal, were the drugs predicted to inhibit the most viral proteins overall, followed by flavin adenine dinucleotide (in both result groups). Fostamatinib was also predicted as a potential inhibitor by the Random Forest classifier. The results for the inhibitors were consistent, as the drugs predicted in the 0.99 threshold group (excluding gabapentin enacarbil and vitamin A) were in the top half of those in the 0.97 group based on the number of NSPs they were found to inhibit. In addition, all but one of the drugs predicted by the Random Forest were in the 0.97 threshold group predicted by the CNN.
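The counts quoted for the 0.99 threshold group can be reproduced directly from Table 4. The sketch below transcribes the NSP column of that table (with ranges written using a plain hyphen) and tallies interactions per drug and per NSP; the variable and function names are hypothetical.

```python
from collections import Counter

# Drug -> NSPs, transcribed from Table 4 (0.99 threshold group).
TABLE4 = {
    "Fostamatinib": "1-16",
    "Flavin adenine dinucleotide": "3, 12, 13, 14",
    "Temsirolimus": "3",
    "Miconazole": "5, 6, 7, 9, 11-16",
    "Pyridoxal phosphate": "6, 12",
    "Vitamin A": "6",
    "Flavin mononucleotide": "6, 12, 13",
    "Cannabidiol": "6, 12",
    "Cholic acid": "9",
    "NADH": "12",
    "Cocarboxylase": "12",
    "Gabapentin enacarbil": "12",
    "Cefiderocol": "12",
}


def expand(spec):
    """Turn a string such as '5, 6, 11-16' into a list of NSP numbers."""
    nsps = []
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        nsps.extend(range(int(low), int(high or low) + 1))
    return nsps


per_drug = {drug: len(expand(spec)) for drug, spec in TABLE4.items()}
per_nsp = Counter(n for spec in TABLE4.values() for n in expand(spec))

print(sum(per_drug.values()))   # 44 DTIs in total
print(per_nsp.most_common(3))   # [(12, 10), (6, 6), (13, 4)]
```

The tally gives 44 DTIs across 13 drugs, with NSP12, NSP6, and NSP13 having 10, 6, and 4 predicted inhibitors respectively, in agreement with the differences stated above.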
Network of DTIs scoring above 0.97 from the CNN. Each edge between a drug node (blue) and a NSP node (red) represents a DTI. The intensity of the color of each node is directly proportional to its degree. Drawn with Cytoscape [46].
Network of DTIs scoring above 0.99 from the CNN. Each edge between a drug node (blue) and a NSP node (red) represents a DTI. The intensity of the color of each node is directly proportional to its degree. Drawn with Cytoscape [46].
Charts displaying the number of inhibitors and NSP targets in the 0.99 threshold group obtained from CNN. (A) Number of NSPs that each drug was found to inhibit. (B) Number of inhibitors that each NSP was found to have.
Charts displaying the number of inhibitors and NSP targets in the 0.97 threshold group obtained from CNN. (A) Number of NSPs that each drug was found to inhibit. (B) Number of inhibitors that each NSP was found to have.
The majority of inhibitors were found to target four or fewer NSPs. Moreover, at least one drug was predicted to interact with all NSPs in both sets of results. However, not all the drugs in our dataset were found to interact with a SARS-CoV-2 viral protein.
Many common compounds appear in our results, which suggests that these drugs could be tested and administered quickly. These include the vitamins and vitamin derivatives vitamin A, pyridoxal phosphate (a vitamin B6 derivative), cocarboxylase (a vitamin B1 derivative), and tretinoin (a vitamin A derivative). The bile acids, cholic and deoxycholic acids, were also included in our results. Many antimicrobial drugs were also predicted to be effective against SARS-CoV-2, such as isoconazole, atovaquone, cefoperazone, novobiocin, and ceftriaxone. Supplementary Table 2 in Supplementary Materials shows all the compounds we elucidated and their current pharmaceutical applications.
Based on the amino-acid sequences of viral proteins and chemical descriptors for various drugs, we trained a convolutional deep neural network and a Random Forest Classifier to predict new drug-target interactions. The results obtained give a starting point for selecting currently approved drugs that can be repurposed to inhibit the SARS-CoV-2 virus. The use of machine learning to make these predictions accelerates the search for a treatment and allows for high volume DTI classification that would not be possible with other techniques. Furthermore, the methods used are not specific to the SARS-CoV-2 virus and can be applied to predict DTIs in general, facilitating rapid drug discovery for other diseases as well.
As shown in Table 2, the CNN outperformed the Random Forest and DNN models in both AUROC and accuracy. This is most likely because its 1D convolutional layers allow the CNN to extract obscure relationships between the various features in the data in a way that the other models cannot. This property allows it to generalize better from the training data without overfitting. However, the Random Forest model had the highest F-measure and recall, indicating a high true-positive rate, which is valuable in predicting DTIs. Thus, we present the results from both models, as each provides unique information about potential inhibitors of the SARS-CoV-2 virus. The CNN results, however, are analyzed more thoroughly, as they were predicted with higher confidence (97% for the CNN as opposed to 90% for the RF; there are no RF predictions with a probability of interacting greater than 97%) and contained almost all the RF model predictions.
Our CNN model achieved accuracy, AUC, and F-measure scores similar to those of other recent machine-learning-based DTI prediction studies such as [14]. The performance of the models used in this study, along with those of other studies, is presented in Table 5 (Ref. [14, 15, 30, 34]). Both of our models outscored all but one of the other models in AUROC and had a much higher precision score than [14]. The lower AUROC compared to [14] may be due to the restricted pool of drugs used in that study, which considered only herbal drugs. The Random Forest also outperformed [14] in precision, recall, and F-measure (precision and recall scores were not available for the other studies). Note that we used a method similar to that of [30], employing a Lasso model for feature selection and the same protein features in the dataset; however, we used a CNN as opposed to a DNN, which gave us more favorable metrics. This difference in performance can most likely be explained in the same way as the difference in performance between the CNN and RF models. Overall, the relatively high performance of our models compared to other studies can most likely be attributed to the DTI features used in this study, particularly the protein domain features, which are not widely used in DTI prediction studies. Additionally, the Lasso method for feature selection allows for highly effective dimensionality reduction. Thus, the models can learn relations among the data that would otherwise have been obscured or lost with other, less robust feature selection methods.
Study | Accuracy | AUROC | Precision | Recall | F-measure |
This Study (CNN) | 0.895 | 0.954 | 0.965 | 0.704 | 0.814 |
This Study (RF) | 0.888 | 0.950 | 0.921 | 0.848 | 0.883 |
Semi supervised model [14] | 0.940 | 0.970 | 0.817 | 0.830 | 0.822 |
CNN [34] | 0.923 | NA | NA | NA | 0.895 |
Lasso-DNN [30] | 0.81 | 0.89 | NA | NA | NA |
Naïve Bayes [15] | 0.730 | 0.666 | NA | NA | 0.768 |
Eighty-two DTIs involving SARS-CoV-2 viral proteins were predicted using this model, forty-four of which had a 99% probability of interaction. There were 29 unique drugs within these DTIs, including fostamatinib, a tyrosine kinase inhibitor; miconazole, an antifungal; and ceftriaxone, an antibacterial.
We trained the model on data from known DTIs involving various proteins (including those of influenza and Ebola viruses) and FDA-approved drugs. The model generalized well to held-out data, as it learned the patterns that distinguish true from false DTIs, making it a strong choice for predicting new relationships. A Lasso model was applied to the data before it was fed to the CNN to select the most informative features and improve the efficiency of our final model. The scores assigned to the COVID-19 DTIs by the model were compared, and all pairs scoring 0.97 or above were extracted as possible candidates.
Given that RdRp (NSP12), helicase (NSP13), and the main protease (NSP5) are viable and highly researched viral proteins for inhibition, it is of high interest that NSP12 and NSP13 are among the three most targeted proteins in both threshold sets resulting from our model’s predictions. Given the significant role these proteins play in the life cycle of the virus, the drugs targeting them should be given the highest priority in testing.
It is interesting to consider why NSP12 is the most targeted protein. This may be because the enzyme is conserved in structure among all RNA viruses [1]. Given that our model exploits similarities in structure between various proteins and drugs to make predictions, it likely took advantage of the recurring structure of the RdRp enzyme across various viruses to predict inhibitors for SARS-CoV-2. This pattern in the data most likely explains the large number of inhibitors predicted for NSP12, as its familiar structure links many other proteins, and thus their inhibitors, to it.
It is interesting to note that among our results were common compounds such as
vitamin A and the cholic and deoxycholic bile acids. In addition, clinical trials
are currently in progress to test the efficacy of fostamatinib (CT ID:
NCT04352465), cannabidiol (CT ID: NCT04647604), alpha-linolenic acid (omega-3
polyunsaturated fatty acid; CT ID: NCT04647604), glutathione (CT ID:
NCT04703036), methotrexate (CT ID: NCT04352465), and atovaquone (CT ID:
NCT04456153) in treating COVID-19. Furthermore, fostamatinib has been predicted
as a potential inhibitor of NSP5 (3CL
Drugs among those involved in the predicted DTIs that have been tested against other viruses. Obtained from the DrugVirus.info database [48].
We note that although the results of this study partially overlap with those of other theoretical studies, the results should be further validated using other methods before administering the drugs in clinical trials. Further analysis of these results using docking simulations (which have already been used in some of the studies cited above) and pharmacophore models may be useful in determining which of the predicted DTIs are most likely to give positive results in a clinical environment.
We developed machine-learning models to predict possible inhibitors of the 16 SARS-CoV-2 nonstructural proteins. A convolutional neural network with three convolutional layers and a Random Forest model were used. The CNN model, trained on 2444 drugs and 16,640 known drug–target interactions (DTIs) from DrugBank, was developed using the TensorFlow Python library and was the best algorithm for the classification task. A part of the training dataset (30%) was randomly separated from the rest of the data to create a validation dataset that was used to tune our models’ hyperparameters and optimize metrics. The model predicted 29 candidate drugs for COVID-19 involved in 82 DTIs with a probability of at least 97%.
IFT and SKA contributed to conception and study design, original manuscript preparation. SKA and VLK constructed the model and received the prediction data. SKA and VLK contributed to original manuscript preparation and final draft reviewing and editing.
Not applicable.
Not applicable.
This research received no external funding.
The authors declare no conflict of interest.