Potential SARS-CoV-2 nonstructural proteins inhibitors: drugs repurposing with drug-target networks and deep learning

Background: In the current COVID-19 pandemic, with an absence of approved drugs and widely accessible vaccines, repurposing existing drugs is vital to quickly developing a treatment for the disease. Methods: In this study, we used a dataset consisting of sequences of viral proteins and chemical structures of pharmaceutical drugs for known drug–target interactions (DTIs) and artificially generated non-interacting DTIs to train a binary classifier with the ability to predict new DTIs. Random Forest (RF), deep neural network (DNN), and convolutional neural network (CNN) models were tested. The CNN and RF models were selected for the classification task. Results: The models generalized well to the given DTI data and were used to predict DTIs involving SARS-CoV-2 nonstructural proteins (NSPs). With the CNN, we identified 29 drugs involved in 82 DTIs with at least a 97% predicted probability of interaction, 44 of which had at least a 99% probability, as candidates to treat COVID-19. The RF identified 6 drugs involved in 17 DTIs with at least a 90% predicted probability of interaction. Conclusions: These results give new insight into possible inhibitors of the viral proteins beyond pharmacophore models and molecular docking procedures used in recent studies.


Introduction
Since December 2019, COVID-19 has caused a global pandemic, affecting millions of lives in over 210 countries and territories. There are currently several vaccines available, but other treatments for this virus are lacking. Due to the structural similarity between SARS-CoV-2 and other betacoronaviruses, such as MERS-CoV and SARS-CoV (although it is much more similar in structure to SARS-CoV [1]), many previously established drugs are being researched to repurpose them for the current pandemic [2]. This allows for more rapid drug discovery and approval, which is vital in the current emergency.
There are many different in-silico methods by which this could be done. Docking and molecular screening/modeling have been widely used to discover potential treatments for the novel coronavirus as well as for other diseases [3][4][5][6][7]. Additionally, several studies [8][9][10][11][12][13][14][15] have used machine learning and artificial intelligence to predict drug-target interactions (DTIs) for various viruses, including SARS-CoV-2, with deep neural networks (DNN), support vector machines (SVM), and random forest (RF) classifiers, among others, as detailed by [16]. Studies such as [10] have employed methods similar to ours, using a convolutional neural network to predict drug-target interactions; other studies have used machine-learning methods such as Naïve Bayes to carry out the classification task [15]. In contrast, studies such as [13] employed a regression model, as opposed to a binary classification model, to predict the binding scores of ligands against the SARS-CoV-2 viral proteins. Similarity-based methods such as network-based inference and k-nearest neighbors have also been utilized for this task, as they are often less computationally intensive [17,18].
There have been efforts to repurpose currently approved drugs to inhibit the virus's structural and nonstructural proteins by preventing the virus from entering the cell, preventing it from activating, or preventing it from replicating itself (these are the preferred drugs) [19]. SARS-CoV-2 has four structural proteins and sixteen nonstructural proteins (NSPs) that carry out various tasks essential to the virus's ability to infect individuals. In theory, all the NSPs can be exploited as drug targets, impeding the virus's ability to carry out its harmful functions in the host cell; however, some are more viable targets than others due to the availability of their crystal structures or their importance in the life cycle of the virus [1]. It is also possible to inhibit host-based targets that facilitate the virus's entry into the host cell, such as angiotensin-converting enzyme 2 (ACE2). As a result, studies are emerging that consider this a potential way to treat the disease [1]. Recently, it has been found that the TMPRSS2 enzyme in the host cell allows the virus to enter the cell by priming the spike proteins, making it a promising target in developing/repurposing a drug [20,21]. Drugs that inhibit the RNA-dependent RNA polymerase (RdRp, NSP12) are also being considered as possible treatments; these include ribavirin, remdesivir, sofosbuvir, and IDX-184 [2,19,21]. NSP12 is an attractive target due to its role in RNA replication in the life cycle of the virus and the availability of its crystal structure [21]. However, it is also possible to inhibit the virus before the NSPs (including NSP12) have been cleaved from the polyproteins 1a and 1ab (pp1a and pp1ab).
The 3-chymotrypsin-like protease (3CLpro), also called the main protease (Mpro) or NSP5, and the papain-like protease (PLpro, NSP3) of SARS-CoV-2, which are both encoded by the open reading frames (ORF 1a/b), have attracted considerable attention as possible inhibitor targets due to their leading role in the replication and growth of the virus, as they cleave the viral polyproteins to form the other nonstructural proteins [1,4,20]. Disrupting this process would interrupt the virus's life cycle, preventing it from further infecting the host. Drugs such as fostamatinib, oberadilol, ribavirin, remdesivir, and itacitinib have been proposed as potential 3CLpro inhibitors through computational methods [6,22,23]. Similarly, nilotinib, levomefolic acid, and glecaprevir have been predicted as possible inhibitors of PLpro [3,24]. Interestingly, the drug ziprasidone, originally used to treat schizophrenia, has shown effectiveness against both of these proteases [3,22,25].
Considering this, in our study we look to repurpose approved drugs to inhibit all SARS-CoV-2 NSPs using a machine-learning approach that takes advantage of structural similarities between viral proteins and similarities between pharmaceutical drugs. This method allows for high-throughput DTI prediction, greatly aiding the fight against the virus.

Methods
The methods used in this project are presented in Fig. 1. The programs developed are available at https://github.com/Shkev/Sars-CoV-2-NSP-Predictions. All SMILES strings were converted to canonical SMILES using Open Babel [27] to standardize them and allow easier handling of the data.

Datasets
SARS-CoV-2 NSPs sequences were downloaded from the NCBI protein databank.

Extracting drug and protein features
The dataset contained chemical structures of the drugs in the form of SMILES strings. Two-dimensional (2D) drug descriptors were calculated using the Online Chemical Database [28]. These descriptors capture the direct connections between the structures of the drugs and their properties, providing sufficient information to train the machine-learning model to recognize patterns in the data [29]. Drugs for which descriptors could not be calculated were removed along with any DTIs they were involved in, leaving 2444 drugs and 16,640 DTIs.
It is known that sufficient information about proteins is contained in the amino-acid sequences. Hence, we used common sequence descriptors and domain information to represent the proteins in our dataset [30]. The protein-sequence descriptors consisted of the amino-acid composition (AAC), dipeptide composition (DC), and tripeptide composition (TC). AAC is the frequency of each amino acid in the sequence. DC is the frequency of each possible pair of two amino acids in the sequence. TC is the frequency of each possible triplet of amino acids in the sequence. In addition to these, the domain information for each protein was obtained from the NCBI Batch Conserved Domain search and was used to construct an adjacency matrix. Each column and row represented one of the T target proteins, creating a T × T matrix of values (Fig. 2). Protein pairs that shared at least one domain were assigned a value of 1, and all other values were set to 0. Proteins that did not have any domain data were removed. Each row in the matrix, corresponding to the i-th target protein, is the domain data portion of that protein's feature vector. In total, we ex-
Fig. 2 caption: The row for target t_i is part of the feature vector for the i-th target.
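The composition descriptors can be illustrated with a short Python sketch. It assumes the 20 standard amino acids and normalizes each count by the number of sliding windows in the sequence; the function names and the example sequence are hypothetical and are not taken from the published code.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumption)


def kmer_composition(sequence, k):
    """Frequency of every possible k-mer over the 20 standard amino acids."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Overlapping windows of length k along the sequence.
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    # Windows containing non-standard residues count toward the total only.
    return [counts[m] / total if total else 0.0 for m in kmers]


def sequence_descriptors(sequence):
    """Concatenate AAC (k=1), DC (k=2), and TC (k=3) into one vector."""
    return (kmer_composition(sequence, 1)
            + kmer_composition(sequence, 2)
            + kmer_composition(sequence, 3))


features = sequence_descriptors("MKVLAAGIVGLLLA")  # hypothetical sequence
print(len(features))  # 20 + 400 + 8000 = 8420
```

Under these assumptions the sequence descriptors alone contribute 8420 values per protein, most of them tripeptide frequencies.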
The respective protein and drug descriptors were combined to form one numerical vector for each DTI. Thus, each DTI was represented as an array of approximately 25,000 values (Eqn. 2).
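As a minimal sketch of how a domain adjacency row and the other descriptors might be joined into one DTI vector, consider the toy example below. The protein names, domain identifiers, and descriptor values are hypothetical, and both the concatenation order and the value of 1 on the diagonal are assumptions rather than details stated in the text.

```python
def domain_adjacency(domains_by_protein):
    """Build a T x T 0/1 matrix: 1 where two proteins share at least one domain."""
    names = list(domains_by_protein)
    matrix = [
        [1 if domains_by_protein[a] & domains_by_protein[b] else 0 for b in names]
        for a in names
    ]
    return names, matrix


def dti_vector(drug_descriptors, sequence_descriptors, domain_row):
    """Concatenate drug, sequence, and domain features (order is an assumption)."""
    return list(drug_descriptors) + list(sequence_descriptors) + list(domain_row)


# Hypothetical toy data: three proteins annotated with sets of domain identifiers.
domains = {"P1": {"d1", "d2"}, "P2": {"d2"}, "P3": {"d3"}}
names, adjacency = domain_adjacency(domains)

vector = dti_vector([0.5, 1.2], [0.1, 0.9], adjacency[names.index("P1")])
print(vector)
```

Because every protein shares its domains with itself, the diagonal is 1 in this sketch; the original matrix may treat the diagonal differently.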

Negative DTI creation
Data from DrugBank supplied experimentally verified DTIs; however, to train a machine-learning model, we also need a set of false DTIs. To achieve this, all possible combinations between the D drugs and the T targets in the dataset were created, yielding D×T DTIs (approximately 12.7 million DTIs). This list was filtered to remove all combinations that were contained in the true DTIs. These artificial false DTIs were randomly sub-sampled so that their number matched the number of true DTIs, giving approximately 32,000 data points in total. This sub-sampling also reduces the probability that the negative DTIs selected for our dataset are unidentified positive DTIs (that are yet to be tested experimentally). The approximately 16,000 selected negative DTIs represent 0.126% of the total number of negative DTIs; therefore, there is a high degree of randomness in this sub-sampling process, making it unlikely that the selected negative DTIs are unidentified positive DTIs.
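The negative-sampling procedure can be sketched as follows. This is an illustration under simplifying assumptions (all candidate pairs are held in memory and a fixed random seed is used); the identifiers are hypothetical placeholders, not DrugBank entries.

```python
import random


def sample_negative_dtis(drugs, targets, positive_pairs, seed=0):
    """Sample as many non-interacting (drug, target) pairs as there are positives."""
    positives = set(positive_pairs)
    # All D x T combinations, minus the experimentally verified pairs.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positives]
    rng = random.Random(seed)
    # Sampling without replacement keeps the two classes balanced.
    return rng.sample(candidates, len(positives))


# Hypothetical toy data.
drugs = ["D1", "D2", "D3", "D4"]
targets = ["T1", "T2", "T3"]
positives = [("D1", "T1"), ("D2", "T3"), ("D4", "T2")]

negatives = sample_negative_dtis(drugs, targets, positives)
print(negatives)
```

For the full dataset the candidate list would hold roughly 12.7 million pairs, so an index-based or streaming sampler may be preferable in practice.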
The artificially created set of negative DTIs was combined with the verified set. Positive DTIs were assigned a label of 1 and the artificial negative DTIs were assigned a label of 0, which allowed for binary classification. The data was randomly split into a training (70% of data) and a testing dataset (30%), ensuring that both datasets contained an approximately balanced number of both classes. The testing dataset was set aside and not considered in the development of the model as it was intended to represent a set of independent outside data.
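One way to obtain a 70/30 split that keeps both classes approximately balanced is to split each class separately. The sketch below assumes such a stratified split; the text only states that the split was random and approximately balanced, so the stratification and the function name are assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), splitting each class in the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# Hypothetical toy labels: 10 positive and 10 negative DTIs.
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```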

Data preprocessing
The DTI data vectors were preprocessed to shift the mean of each feature to 0 and remove those that provide little information about patterns in the data. Doing so reduces bias toward some features over others when training a neural network. Note that the protein features were not adjusted, as this would remove the value in the frequency counts and adjacency matrix since the units for these values are already standardized. The mean-adjusted value for the j-th value of the i-th descriptor was calculated by subtracting µ_i, the mean of the i-th descriptor across all data points (Eqn. 3); standard deviation scaling was not used in this process, as this would remove the ability to reduce features using variance.
Preliminary feature reduction was performed on the mean-adjusted drug data and the protein sequence descriptors using a variance threshold. Namely, all features with variance less than the chosen threshold of 0.01 were removed. Domain features were not reduced in this way, as they were sparse (more than half the values per feature are 0) and thus would result in a near 0 variance for all such features.
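The mean adjustment and variance threshold can be sketched in plain Python as below. The use of the population variance and the toy values are assumptions made for illustration.

```python
from statistics import fmean, pvariance


def mean_center(rows):
    """Subtract each feature's mean so that every column has mean 0."""
    n_features = len(rows[0])
    means = [fmean([row[j] for row in rows]) for j in range(n_features)]
    return [[row[j] - means[j] for j in range(n_features)] for row in rows]


def variance_filter(rows, threshold=0.01):
    """Keep only the columns whose variance is at least the threshold."""
    n_features = len(rows[0])
    keep = [
        j for j in range(n_features)
        if pvariance([row[j] for row in rows]) >= threshold
    ]
    return keep, [[row[j] for j in keep] for row in rows]


# Hypothetical toy descriptors: the middle column is nearly constant.
rows = [[1.0, 5.00, 10.0], [2.0, 5.01, 20.0], [3.0, 4.99, 30.0]]
centered = mean_center(rows)
kept_columns, reduced = variance_filter(centered)
print(kept_columns)
```

Mean adjustment leaves the variance of each feature unchanged, which is why the threshold can be applied after it, whereas scaling by the standard deviation would set every variance to 1.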

Lasso-based feature reduction
Feature reduction is a key step in training a machine-learning model effectively, as it removes the features with the least influence on the model. This improves the efficiency and effectiveness of the model, as processing high-dimensional data is computationally expensive [11]. In this study, we implemented a Lasso linear model using the Scikit-Learn Python library [31] to select the most informative features [30]. This algorithm is a modified linear regression that shrinks the coefficients of the terms that are least informative to the model toward 0. Thus, features with coefficients lower than a chosen threshold in the trained model are removed, leaving the most prominent features.
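For reference, the Lasso objective as documented for Scikit-Learn can be written as below. The symbols are notation introduced here for illustration: X is the n × p feature matrix, y the vector of DTI labels, w the coefficient vector, and α the regularization strength.

```latex
\hat{w} \;=\; \arg\min_{w} \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

The L1 penalty drives the coefficients of uninformative features exactly to 0 as α grows, which is what makes the fitted coefficients usable as a feature filter.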
Drug features and protein features were considered separately using two Lasso models. Models were trained and validated using the training dataset. The Lasso model contains a regularization parameter, α, which effectively controls how aggressively the model reduces feature coefficients to 0 (a higher value results in more aggressive feature selection). To find the optimal value for α, we iterated through values from 1 to 10^-8 using 5-fold cross validation to test the accuracy and area under the receiver operating characteristic curve (AUROC) of the models. The chosen value for this parameter was the point where, as α decreased, there was minimal or no further improvement in the AUROC of the model. For the drug model, α was chosen as 10^-4, and for the protein model it was chosen as 10^-3. Features in the trained protein model with a non-zero coefficient were selected, giving 1228 values. Similarly, features with coefficients greater than 10^-3 were selected from the trained drug model, giving 1820 values; this threshold was selected to ensure a balance between the number of protein and drug features in the final dataset. The selected features were concatenated and used as the input data for our models.
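The selection rule for α and the coefficient threshold can be sketched as follows. The cross-validated AUROC values, the minimum-gain cutoff of 0.001, and the use of the coefficient magnitude are all assumptions made for illustration; they are not values reported in this study.

```python
def choose_alpha(auroc_by_alpha, min_gain=0.001):
    """Walk alpha from largest to smallest and stop once AUROC stops improving."""
    alphas = sorted(auroc_by_alpha, reverse=True)
    chosen = alphas[0]
    for previous, current in zip(alphas, alphas[1:]):
        if auroc_by_alpha[current] - auroc_by_alpha[previous] < min_gain:
            break  # a smaller alpha no longer buys a meaningful improvement
        chosen = current
    return chosen


def select_features(coefficients, threshold=0.0):
    """Indices of features whose coefficient magnitude exceeds the threshold."""
    return [j for j, w in enumerate(coefficients) if abs(w) > threshold]


# Hypothetical cross-validated AUROC for each candidate alpha.
auroc_by_alpha = {1: 0.50, 1e-1: 0.62, 1e-2: 0.80, 1e-3: 0.90, 1e-4: 0.9003, 1e-5: 0.9004}
best_alpha = choose_alpha(auroc_by_alpha)

# Hypothetical fitted coefficients, filtered at the 10^-3 threshold.
selected = select_features([0.0, 0.5, -0.002, 0.0004], threshold=1e-3)
print(best_alpha, selected)
```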

Classification model
Classification models were trained for the DTI classification problem. All data handling in the process was done with the Pandas Python library [32]. The models were trained on the approximately 32,000 data points, half of which are the positive DTIs extracted from DrugBank and the other half are sub-sampled negative DTIs, each a vector with 3048 values along with a label, 0 or 1, to distinguish between positive and negative DTIs. The output of the model is a probability that the inputted DTI is positive.
Three machine-learning models were trained and tested. The best performing one was used in the final SARS-CoV-2 NSP DTI predictions. A deep neural network (DNN) [30], random forest classifier (RF) [33], and convolutional neural network (CNN) [34] were tested. A part of the training dataset (30%) was randomly separated from the rest of the data to create a validation dataset that was used to tune our models' hyperparameters and optimize metrics. This tuning was done manually, individually adjusting the various hyperparameters of the models (using guidance from [35]), until the desired training/validation metrics were obtained. The AUROC and the binary accuracy of the models were used to compare them. The binary accuracy was calculated as the percentage of predictions that were consistent with their corresponding known value in the testing/validation datasets using a threshold of 0.5 (model predictions greater than or equal to 0.5 were considered as 1 and all others as 0). Since accuracy only measures the performance of the model at a single threshold, we also used the AUROC score, which measures the performance of the model across thresholds, to better judge the viability of the models.
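The two comparison metrics can be illustrated with a small Python sketch. The AUROC is computed here through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half), which is an equivalent formulation chosen for brevity; the labels and scores are hypothetical.

```python
def binary_accuracy(labels, scores, threshold=0.5):
    """Fraction of predictions that match the label after thresholding the score."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auroc(labels, scores):
    """Probability that a positive example outranks a negative one (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

accuracy = binary_accuracy(labels, scores)
auc = auroc(labels, scores)
print(accuracy, auc)
```

In this toy case the accuracy depends on the 0.5 cutoff, while the AUROC depends only on how the scores are ordered, which is why the two metrics are complementary.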
Random Forest was implemented using the Scikit-Learn Python library (version 0.24) [31]. The model was trained with 100 trees with no maximum depth, a minimum of 2 samples to split an internal node, a minimum of 1 sample required to be at a leaf node, and all other default parameters (which can be seen in the documentation).

Deep-learning models
We implemented a DNN and CNN using the TensorFlow Python library [36].
The DNN architecture consisted of an input layer, two hidden dense layers with 4096 nodes each, and an output layer. The rectified linear activation function (ReLU) was applied to each hidden layer, and a sigmoid activation was used on the output layer to yield a value between 0 and 1. Additionally, dropout layers of 50% and Ridge regression (L2) regularization were used in the hidden layers to reduce overfitting [30]. Two hidden layers of equal size were used as recommended by [35]. Different numbers of nodes were experimented with, and the value that resulted in the least overfitting (as determined by comparing training and validation metrics) was used, namely 4096 nodes per layer. In general, as the number of nodes increased, the training accuracy and AUROC increased while these validation metrics suffered. Likewise, as the number of nodes decreased, the training metrics fell, but there was less overfitting. The model was trained with a learning rate of 0.00001 and a batch size of 32.
The CNN model is similar to the DNN architecture with the addition of 1D convolutional (Conv1D) layers. This allows the model to extract hidden patterns in the data that would otherwise not be recognized by the dense layers. We implemented a three-layer convolutional network that outputs to a fully connected dense layer (with 2048 nodes) that feeds into the output layer. The number of filters in the Conv1D layers increased from 16 to 32 to 64 (each with kernel size 3) to progressively learn more features from the data. Each Conv1D layer fed into a Batch Normalization layer and a Max Pooling layer with pool size 3 to normalize weight values and prevent the model from overfitting. The ReLU activation function was also used on all Conv1D layers and hidden layers. The sigmoid activation function was applied to the output layer. A dropout of 50% was added to the flattened output of the final convolutional layer and dense layers, while L2 regularization was applied to all convolutional and dense layers. This model was also trained with a learning rate of 0.00001 and a batch size of 32.
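To make the dimensions of the convolutional stack concrete, the sketch below traces the sequence length through three Conv1D and Max Pooling stages for the 3048-value input. It assumes the Keras defaults of no padding, a convolution stride of 1, and a pooling stride equal to the pool size; the padding and strides actually used are not stated in the text.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_output_length(length, pool_size, stride=None):
    """Output length of 1D max pooling without padding (stride defaults to pool size)."""
    stride = stride or pool_size
    return (length - pool_size) // stride + 1


length = 3048  # number of selected features per DTI vector
for filters in (16, 32, 64):
    length = conv1d_output_length(length, kernel_size=3)
    length = maxpool1d_output_length(length, pool_size=3)  # BatchNorm keeps the length

flattened = length * 64  # final sequence length times the last filter count
print(length, flattened)  # 111 7104
```

Under these assumptions the flattened output that feeds the 2048-node dense layer has 7104 values.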
As can be seen in Table 1, the CNN model performed best in both metrics; hence, it was used in predicting DTIs involving SARS-CoV-2 nonstructural proteins. The testing dataset was used to test the best-performing model (CNN) on a partition of the DrugBank data that the model had not seen before (the labels for the testing input are known, so the accuracy and AUC of the model can be calculated). There was an approximately equal number of each class (positive and negative DTIs) in the test dataset, with 4992 negative DTIs and 4989 positive DTIs. The accuracy and AUC from these predictions estimate how the model will perform when making predictions from the SARS-CoV-2 NSP data. The CNN scored highly on this dataset, showing that it generalized well from the training data, which makes it viable to use in predicting new DTIs. The Random Forest classifier performed slightly worse in these metrics but outperformed the CNN in recall (true-positive rate) and F-measure, which are valuable metrics in this use case, as it is important that positive DTIs are predicted correctly (true positives). The CNN model achieved an AUROC of 0.954 (Fig. 3A), a Precision-Recall AUC of 0.951 (Fig. 3B), and an accuracy of 0.895. The Random Forest classifier achieved an AUROC of 0.950 (Fig. 4A), a Precision-Recall AUC of 0.950 (Fig. 4B), and an accuracy of 0.888. The test metrics of the CNN, along with those of the other models used (Random Forest and DNN), are shown in Table 2. The CNN confusion matrix for the predictions at a decision threshold of 0.97 can be seen in Fig. 5A. The confusion matrix for the RF classifier can be seen in Fig. 5B.
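The confusion-matrix metrics discussed here (precision, recall, and F-measure at a chosen threshold) can be sketched as follows. The labels and scores are hypothetical and do not reproduce the values in Table 2 or Fig. 5.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn) after thresholding the scores."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels and model outputs evaluated at the 0.97 threshold.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.98, 0.60, 0.97, 0.98, 0.20, 0.50, 0.10]

tp, fp, tn, fn = confusion_counts(labels, scores, threshold=0.97)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn, precision, recall, f1)
```

Raising the threshold generally trades recall for precision, so a model's ranking on these metrics can differ from its ranking on accuracy and AUROC.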

Predictions
The CNN model trained on DTI data from the DrugBank website was used to predict potential interactions between drugs in the dataset and the 16 SARS-CoV-2 NSPs whose sequences were obtained from the NCBI protein databank. Each NSP was paired with all the drugs in the dataset, and the same procedure presented above was followed to extract and reduce features from the protein sequences and drug structures and create DTI vectors (Eqn. 2). The same features were chosen from the vectors as those selected by the Lasso models. All the possible DTIs were inputted into the model, which calculated the probability that the input data correspond to a true DTI. DTIs with an output score greater than or equal to the thresholds of 0.97 and 0.99 were selected as potential DTIs between the repurposed DrugBank FDA-approved drugs and the viral proteins.
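The screening step can be sketched as below. The scoring function is a hypothetical stand-in for the trained model, and the drug names and scores are placeholders rather than results from this study.

```python
from collections import Counter


def screen_pairs(drugs, targets, score_fn, threshold):
    """Score every (drug, target) pair and keep those at or above the threshold."""
    hits = []
    for target in targets:
        for drug in drugs:
            probability = score_fn(drug, target)
            if probability >= threshold:
                hits.append((drug, target, probability))
    return hits


# Hypothetical scores standing in for the trained model's output.
toy_scores = {
    ("drug_A", "NSP12"): 0.995,
    ("drug_A", "NSP13"): 0.975,
    ("drug_B", "NSP12"): 0.980,
    ("drug_B", "NSP5"): 0.400,
}


def toy_model(drug, target):
    """Placeholder for model inference on the DTI vector of this pair."""
    return toy_scores.get((drug, target), 0.0)


drugs = ["drug_A", "drug_B"]
targets = ["NSP5", "NSP12", "NSP13"]

hits_97 = screen_pairs(drugs, targets, toy_model, threshold=0.97)
hits_99 = screen_pairs(drugs, targets, toy_model, threshold=0.99)

# Number of predicted inhibitors per target at the 0.97 threshold.
target_degree = Counter(target for _, target, _ in hits_97)
print(len(hits_97), len(hits_99), target_degree)
```

Counting hits per drug or per target in this way gives the node degrees used to rank the most targeted proteins.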

Results
We trained a convolutional deep-learning model and a random forest classifier to predict drugs that may inhibit SARS-CoV-2 viral proteins. See Table 2 for the performance metrics of these models.
Similarly, the Random Forest classifier reduced the inputted DTIs down to 17 DTIs with probability greater than or equal to 90% of interacting, involving 6 unique drugs (Supplementary Table 1).
These results are summarized visually in Fig. 6 (Ref. [46]) and Fig. 7 (Ref. [46]). Figs. 8 and 9 also display the number of interactions for each drug and each target. NSP12, the RNA-dependent RNA polymerase (RdRp), and NSP13, the helicase, were the most targeted proteins in the 0.97 threshold result set, with NSP12 being the most highly targeted protein overall. NSP12 was also the most targeted protein in the more restricted 0.99 threshold group, followed by NSP6 and NSP13, with NSP6 having two more inhibitors than NSP13 and four fewer inhibitors than NSP12. We also note that fostamatinib, a tyrosine kinase inhibitor, and miconazole, an antifungal, were the drugs predicted to inhibit the most viral proteins overall, followed by flavin adenine dinucleotide (in both result groups). Fostamatinib was also predicted as a potential inhibitor by the Random Forest classifier. The results for the inhibitors were consistent, as the drugs predicted in the 0.99 threshold group (excluding gabapentin enacarbil and vitamin A) were in the top half of those in the 0.97 group based on the number of NSPs they were found to inhibit. Also, all but one of the drugs predicted by the Random Forest were in the 0.97 threshold group predicted by the CNN.
The majority of inhibitors were found to target four or fewer NSPs. Moreover, at least one drug was predicted to interact with all NSPs in both sets of results. However, not all the drugs in our dataset were found to interact with a SARS-CoV-2 viral protein.
Our results include many common compounds, which suggests that these drugs could be tested and administered quickly. These include the vitamins: vitamin A, pyridoxal phosphate (a vitamin B6 derivative), cocarboxylase (vitamin B1), and tretinoin (a vitamin A derivative). The bile acids, cholic and deoxycholic acids, were also included in our results. Many antimicrobial drugs were also predicted to be effective against SARS-CoV-2, such as isoconazole, atovaquone, cefoperazone, novobiocin, and ceftriaxone. Supplementary Table 2 in Supplementary Materials shows all the compounds we identified and their current pharmaceutical applications.
Figure caption: Each edge between a drug node (blue) and an NSP node (red) represents a DTI. The intensity of the color of each node is directly proportional to its degree. Drawn with Cytoscape [46].

Discussion
Based on the amino-acid sequences of viral proteins and chemical descriptors for various drugs, we trained a convolutional deep neural network and a Random Forest Classifier to predict new drug-target interactions. The results obtained give a starting point for selecting currently approved drugs that can be repurposed to inhibit the SARS-CoV-2 virus. The use of machine learning to make these predictions accelerates the search for a treatment and allows for high-volume DTI classification that would not be possible with other techniques. Furthermore, the methods used are not specific to the SARS-CoV-2 virus and can be applied to predict DTIs in general, facilitating rapid drug discovery for other diseases as well.
As shown in Table 2, the CNN outperformed the Random Forest and DNN models in both AUROC and accuracy. This is most likely because the CNN can extract obscure relationships between the various features in the data due to its 1D convolutional layers in a way that the other models cannot. This property allows it to better generalize from the training data without overfitting. However, the Random Forest model had the highest F-measure and recall, indicating a high true-positive rate, which is valuable in predicting DTIs. Thus, we present the results from both models, as both provide unique information about potential inhibitors of the SARS-CoV-2 virus. The CNN results, however, are more thoroughly analyzed, as they were predicted with higher confidence (97% for the CNN as opposed to 90% for the RF; there are no RF predictions with a probability of interacting greater than 97%) and contained almost all the RF model predictions.
Our CNN model achieved similar accuracy, AUC, and F-measure scores to other recent machine-learning based DTI prediction studies such as [14]. The performance of the models used in this study along with those of other studies is presented in Table 5 (Ref. [14,15,30,34]). Both of our models outscored all but one of the other models in AUROC and had a much higher precision score than [14]. The lower AUROC compared to [14] may be due to the restricted pool of drugs the study used, as they only considered herbal drugs. The Random Forest also outperformed [14] in precision, recall, and F-measure (precision and recall scores were not available for the other studies). Note that we used a similar method to [30] in employing a Lasso model for feature selection as well as using the same protein features in the dataset; however, we used a CNN as opposed to a DNN, giving us more favorable metrics. This difference in performance can most likely be explained similarly to the difference in performance between the CNN and RF models. Overall, the relatively high performance of our models as compared to other studies can most likely be attributed to the DTI features used in this study, particularly the protein domain features, which are not widely used in DTI prediction studies. Additionally, the Lasso method for feature selection allows for highly effective dimensionality reduction. Thus, the models can learn relations among the data that would have otherwise been obscured or lost using other, less robust feature selection methods.
Eighty-two DTIs involving SARS-CoV-2 viral proteins were predicted using this model, forty-four of which had a 99% probability of interaction. There were 26 unique drugs within these DTIs, including fostamatinib, a tyrosine kinase inhibitor; miconazole, an antifungal; and ceftriaxone, an antibacterial. We trained the model on data from known DTIs involving various proteins (including those of influenza and Ebola viruses) and FDA-approved drugs. This model generalized well to this data, as it learned the important patterns in the data to distinguish true and false DTIs, making it a strong choice to use in predicting new relationships. A Lasso model was applied to the data before it was fed to the CNN to select the most informative features and improve the efficiency of our final model. The scores assigned to the COVID-19 DTIs by the model were compared, and all pairs scoring above 0.97 were extracted as possible candidates. Given that RdRp (NSP12), helicase (NSP13), and the main protease (NSP5) are viable and highly researched viral proteins for inhibition, it is of high interest that NSP12 and NSP13 are among the three most targeted proteins in both threshold sets resulting from our model's predictions. Given the significant role these proteins play in the life cycle of the virus, the drugs targeting them should be given the highest priority in testing.
It is interesting to consider why NSP12 is the most targeted protein. This may be due to the fact that the enzyme is conserved in structure among all RNA viruses [1]. Given that our model exploits similarities in structure between various proteins and drugs to make predictions, it is very likely that it took advantage of the recurring structure of the RdRp enzyme across various viruses to predict inhibitors for SARS-CoV-2. This pattern in the data most likely explains the large number of inhibitors predicted for NSP12, as its familiar structure links many other proteins, and thus inhibitors, to it.
It is interesting to note that among our results were common compounds such as vitamin A and the cholic and deoxycholic bile acids. In addition, clinical trials are currently in progress to test the efficacy of fostamatinib (CT ID: NCT04352465), cannabidiol (CT ID: NCT04647604), alpha-linolenic acid (omega-3 polyunsaturated fatty acid; CT ID: NCT04647604), glutathione (CT ID: NCT04703036), methotrexate (CT ID: NCT04352465), and atovaquone (CT ID: NCT04456153) in treating COVID-19. Furthermore, fostamatinib has been predicted as a potential inhibitor of NSP5 (3CLpro) by [22] and, although less promising, has also been predicted to target NSP16 [47]. Flavin adenine dinucleotide was predicted to bind to NSP12 (RdRp) with a docking score of -11.8 kcal/mol and to NSP13 (helicase) with a score of -11.2 kcal/mol [37]. Wu and co-authors [5] proposed the antibiotic novobiocin as a potential NSP12 inhibitor as well. Cefoperazone has also been found active in interaction with the apo-NSP13 ATP-binding sites (a Vina docking score of -10.2 kcal/mol [43]). All other predicted DTIs are unique to our study, although many of the drugs are not. Namely, we predicted 15 inhibitors that have not yet been considered. Studies that examined our predicted drugs are indicated in Tables 3 and 4. A similar table for the results of the Random Forest can be seen in Supplementary Table 1 in Supplementary Materials. We note that methotrexate has shown efficacy in inhibiting viral RNA replication, viral protein synthesis, and virus release in an in-vitro setting [42]. Fig. 10 (Ref. [48]) shows that atovaquone and minocycline, both antimicrobial pharmaceuticals, are the only drugs from our results that have been tested for other viruses, suggesting the novelty of our predictions.
We note that although the results of this study partially overlap with those of other theoretical studies, the results should be further validated using other methods before administering the drugs in clinical trials. Further analysis of these results using docking simulations (which have already been used in some of the studies cited above) and pharmacophore models may be useful in determining which of the predicted DTIs are most likely to give positive results in a clinical environment.

Conclusions
We developed a machine-learning model to predict possible inhibitors of the 16 SARS-CoV-2 nonstructural proteins. A convolutional neural network with three convolutional layers and a Random Forest model were used. The CNN model, trained on 2444 drugs and 16,640 known drug-target interactions (DTIs) from DrugBank, was developed using the TensorFlow Python library. The best algorithm for the classification task was the CNN. A part of the training dataset (30%) was randomly separated from the rest of the data to create a validation dataset that was used to tune our models' hyperparameters and optimize metrics. The model predicted 29 COVID-19 drugs involved in 82 DTIs with 97% probability.

Author contributions
IFT and SKA contributed to conception and study design, original manuscript preparation. SKA and VLK constructed the model and received the prediction data. SKA and VLK contributed to original manuscript preparation and final draft reviewing and editing.

Ethics approval and consent to participate
Not applicable.