Introduction: The prediction of interacting drug-target pairs plays an essential role in drug repurposing and drug discovery. Although biotechnology and chemical technology have made extraordinary progress, dose-response experiments and clinical trials remain extremely complex, laborious, and costly. As a result, a robust computer-aided model is urgently needed to predict drug-target interactions (DTIs). Methods: In this paper, we report a novel computational approach combining fuzzy local ternary pattern (FLTP), Position-Specific Scoring Matrix (PSSM), and rotation forest (RF) to identify DTIs. More specifically, the target primary sequence is first numerically characterized as a PSSM, which records biological evolutionary information. Afterward, the FLTP method is applied to extract highly representative descriptors from the PSSM, and the combination of FLTP descriptors and drug molecular fingerprints is regarded as the complete feature vector of a drug-target pair. Results: Finally, the complete features are fed into rotation forests for inferring potential DTIs. Experiments with 5-fold cross-validation (CV) achieve mean accuracies of 89.08%, 86.14%, 82.41%, and 78.40% on the Enzyme, Ion Channel, GPCRs, and Nuclear Receptor datasets, respectively. Discussion: To further validate the model performance, we performed experiments with the state-of-the-art support vector machine (SVM) and light gradient boosting machine (LGBM). The experimental results indicate the superiority of the proposed model in effectively and reliably detecting potential DTIs. We anticipate that the proposed model can serve as a feasible and convenient tool for high-throughput identification of DTIs.
The identification of DTIs has become a focal point of pharmaceutical science, supporting the screening of drug candidates and the study of disease etiology. Strikingly improved biochemical technologies have dramatically accelerated therapeutic drug discovery. Nevertheless, in the last few years the Food and Drug Administration (FDA) has approved only a limited number of medicines because of efficiency issues and harmful side effects. Detecting interacting drug-target pairs therefore remains of great significance for selecting promising drug molecules. Researchers have put much effort into exploring DTIs through traditional experiments. However, these biochemical methods remain expensive and cumbersome. Furthermore, their results are subject to chance variation across successive experiments. Hence, novel computer-aided drug development (CADD) models need to be constructed for stably and reliably inferring DTIs.
With breakthroughs in protein sequencing and drug molecular structure determination, databases such as PubChem, ChEMBL, Therapeutic Target Database (TTD), Kyoto Encyclopedia of Genes and Genomes (KEGG), and DrugBank are continuously enriching the public data on target proteins and drug substructures. Previously, computational prediction models mainly relied on molecular docking, ligand-based methods, and data mining. However, these traditional methods have limitations. Molecular docking predicts binding sites by energy and geometry matching and estimates binding affinity by computational simulation. It plays a critical role in determining the mode of drug action, but it requires a complete 3D structure for every protein in the model, which seriously limits its versatility. The ligand-based method links the chemical structure and pharmacological activity of a specific object through quantitative structure-activity relationships (QSAR), so each model can only predict interactions for one target. This poor transferability of a single model makes the method hard to use in large-scale cross prediction. The data mining method collects DTIs by text mining and data matching. It is limited by the mining algorithm and the authority of the databases, which hinders its wider application in DTI prediction. In conclusion, effective and robust models have become an essential requirement for DTI prediction.
Bolgár et al. proposed Variational Bayesian Multiple Kernel Logistic Matrix Factorization, which embeds multiple kernel learning, weighted observations, and graph Laplacian regularization to model DTIs. Shi et al. developed a two-layer multiple classifier system (TLMCS) that focuses on fully utilizing heterogeneous features to better predict DTIs. Xia et al. proposed a novel model, Self-Paced Learning with Collaborative Matrix Factorization based on weighted low-rank approximation (SPLCMF), to predict DTIs. Specifically, this framework employs regularized least squares to fuse the related networks and reduces the complexity of samples by soft weighting. Yan et al. developed the substructure-drug-target Kronecker product kernel regularized least squares (SDTRLS) model, which integrates the RLS-Kron model, chemical substructure similarity fusion, and Gaussian Interaction Profile (GIP) kernels to detect interacting drug-target pairs. Cui et al. proposed L2,1-GRMF, an improved GRMF method that identifies DTIs by incorporating the L2,1-norm. Hao et al. constructed dual network integrated logistic matrix factorization (DNILMF), which uses a drug structure matrix and a target sequence kernel matrix to predict DTIs.
In this paper, we establish a novel in silico method to infer DTIs that integrates PSSM, FLTP, and the RF classifier. Specifically, the target primary sequences are first converted into numerical PSSM matrices, which record the frequencies of the amino acids that appear at different positions. Then, we employ the FLTP approach to extract the latent characteristics of the PSSMs. Subsequently, we merge these descriptors with drug fingerprints to form the complete feature vectors of drug-target pairs. Finally, the complete feature descriptors are fed into the rotation forest to detect DTIs. We verified our model on the benchmark data sets, viz. Enzymes, Ion Channels, GPCRs, and Nuclear Receptors, using 5-fold cross-validation. Furthermore, we compared the established model with another advanced feature descriptor and with other classifiers, including LGBM and SVM. The experimental results illustrate that the proposed model performs well in predicting DTIs and can reliably screen candidates for clinical trials. The flowchart of the established model is depicted in Fig. 1.
Workflow of our model. (a) Numerically convert the proteins to PSSMs. (b) Characterize the PSSMs by FLTP. (c) Extract the molecular fingerprints of drugs. (d) Feed the complete features into the rotation forest. (e) Predict DTIs.
In this paper, four benchmark datasets (Enzyme, Ion Channel, GPCRs, and Nuclear Receptor) were collected from the DrugBank, SuperTarget, BRENDA, and KEGG BRITE databases to evaluate the established model. The Enzyme data set contains 445 drugs, 664 proteins, and 2926 DTIs. The Ion Channel data set contains 210 drugs, 204 proteins, and 1467 DTIs. The GPCRs data set contains 223 drugs, 95 proteins, and 635 DTIs. The Nuclear Receptor data set contains 54 drugs, 26 proteins, and 90 DTIs. Table 1 lists the statistics of these benchmark datasets.
|Statistics||Enzyme||Ion channel||GPCRs||Nuclear receptor|
|Drugs||445||210||223||54|
|Proteins||664||204||95||26|
|Interactions||2926||1467||635||90|
Drug and protein interactions were represented as a bipartite graph: drugs and
proteins form the nodes of the graph, and the verified interactions between
them are denoted by edges. In the experiments, all drug-target pairs connected
by edges are assigned to the positive dataset, and the remaining pairs are
treated as negative samples. Given the number of nodes, the known interactions
account for only a small fraction of all possible drug-target pairs. Taking the
Ion Channel dataset as an example, there are 42,840 (210 × 204) possible
drug-target pairs, of which only 1467 are known interactions.
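The positive/negative labelling described above can be sketched as follows. This is a minimal illustration on a toy bipartite graph; the identifiers `D1`, `T1`, and the helper name `split_pairs` are hypothetical and not taken from the paper.

```python
from itertools import product


def split_pairs(drugs, targets, known_interactions):
    """Label every drug-target pair as positive (known edge) or negative."""
    known = set(known_interactions)
    positives, negatives = [], []
    for pair in product(drugs, targets):
        # A pair is positive only if it is a verified edge of the bipartite graph
        (positives if pair in known else negatives).append(pair)
    return positives, negatives


# Toy bipartite graph with hypothetical identifiers
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2"]
edges = [("D1", "T1"), ("D3", "T2")]

positives, negatives = split_pairs(drugs, targets, edges)
print(len(positives), len(negatives))

# Size of the full pair space for 210 drugs and 204 proteins
print(210 * 204)
```

Because the number of candidate pairs grows as the product of the two node sets, the unlabelled pairs vastly outnumber the verified ones on every benchmark.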
Molecular fingerprints, which encode chemical substructure information, can effectively reflect drug structure. This representation transforms a molecular structure into a binary fingerprint sequence by detecting specific fragments in the structure. Although the molecule is divided into several independent parts, the fingerprint still preserves the integrity of the overall structural information of the drug. Studies substantiate that molecular fingerprints reduce the information loss and accumulated error of screening procedures while also lowering the computational complexity of the description process. Specifically, when a fragment matches a molecular substructure, the corresponding position of the fingerprint vector is set to 1. Mature fingerprint databases provide reliable tools for generating molecular fingerprints. We selected the fingerprint map containing 881 substructures from the PubChem system (https://pubchem.ncbi.nlm.nih.gov/). Therefore, each drug molecule is described by an 881-dimensional Boolean vector. Fig. 2 shows the transformation of Zanamivir into a fingerprint.
The transformation of Zanamivir into a fingerprint.
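As a rough illustration of the encoding, the sketch below builds an 881-dimensional Boolean vector from a list of matched substructure positions. It assumes the substructure matching itself has already been done by an external tool; the function name `fingerprint_from_matches` and the example positions are hypothetical.

```python
import numpy as np

N_BITS = 881  # number of substructures in the fingerprint map used in the paper


def fingerprint_from_matches(matched_positions, n_bits=N_BITS):
    """Return a binary vector with 1 at every matched substructure position."""
    fp = np.zeros(n_bits, dtype=np.uint8)
    for pos in matched_positions:
        if not 0 <= pos < n_bits:
            raise ValueError(f"position {pos} is outside 0..{n_bits - 1}")
        fp[pos] = 1  # the fragment at this index is present in the molecule
    return fp


# Hypothetical set of matched substructure indices for one drug
fp = fingerprint_from_matches([0, 9, 880])
print(fp.shape, int(fp.sum()))
```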
In recent years, various physicochemical methods have been applied to numerically characterize proteins, which are composed of 20 types of amino acids. Position-Specific Scoring Matrix (PSSM) is extensively utilized in protein binding site prediction, protein secondary structure prediction, and protein subcellular localization. In this section, PSSM is employed to extract evolutionary information by calculating the probability that an amino acid occurs at a specific position of the protein primary sequence. The PSSM is shown as follows.
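$$
\mathrm{PSSM} =
\begin{bmatrix}
P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\
P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\
\vdots & \vdots & \ddots & \vdots \\
P_{L,1} & P_{L,2} & \cdots & P_{L,20}
\end{bmatrix}
$$

where PSSM is an $L \times 20$ matrix, $L$ is the length of the protein sequence, and 20 is the number of amino acid types; $P_{i,j}$ is the score of the $j$-th amino acid type at the $i$-th position of the sequence.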
An example of converting Lipoprotein Lipase into a PSSM.
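To make the shape and meaning of the matrix concrete, the following sketch derives a toy position-specific log-odds matrix from a handful of aligned sequences. It is a deliberate simplification and not the procedure used in the paper: real PSSMs are produced by a profile search against a sequence database, whereas this example assumes a ready-made alignment, a uniform background distribution, and an add-one pseudocount. The function name `toy_pssm` and the sequences are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 standard residues, one column each


def toy_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds matrix of shape (L, 20) from equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    index = {aa: j for j, aa in enumerate(AMINO_ACIDS)}
    counts = np.full((length, 20), pseudocount)
    for seq in aligned_seqs:
        if len(seq) != length:
            raise ValueError("all aligned sequences must have the same length")
        for i, aa in enumerate(seq):
            if aa in index:  # gaps and non-standard residues are skipped
                counts[i, index[aa]] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    background = 1.0 / 20  # assumption: uniform background frequencies
    return np.log2(freqs / background)


pssm = toy_pssm(["MKV", "MRV", "MKI"])
print(pssm.shape)
```

Each row corresponds to one sequence position and each column to one amino acid type, so a protein of length L always yields an L × 20 matrix regardless of how the scores are estimated.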
Fuzzy Local Ternary Pattern (FLTP) can be utilized to precisely describe
texture features, and it is widely applied in face anti-spoofing and image
tampering detection. In addition to being rotation invariant, FLTP is robust to
image noise. The method dynamically calculates its threshold based on Weber's
law to extract multiple features, and it can be extended to circular
neighborhoods with different radii. In this paper, FLTP is employed to describe
the characteristics of PSSMs. The algorithm converts the difference between each
neighborhood pixel and the center pixel into upper and lower binary codes. The
upper binary code can be expressed as
Finally, the FLTP feature vector can be obtained as follows.
In this experiment, the radius of the circular domain is R = 1 and the number
of pixels in the circular domain is P = 8. The upper and lower binary codes are
each transformed into a 256-dimensional vector. Hence, the entire descriptor of
a PSSM is a matrix of 1 × 512.
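The coding scheme can be sketched with a plain (non-fuzzy) local ternary pattern using P = 8 neighbors at radius R = 1. This is only an approximation of the descriptor described above: the fuzzy membership step is omitted, and the Weber-style threshold (a fixed fraction `weber_k` of the absolute center value) is an assumption made for illustration. The function name `ltp_descriptor` is hypothetical.

```python
import numpy as np

# Offsets of the P = 8 neighbors on the R = 1 ring, clockwise from top-left
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def ltp_descriptor(matrix, weber_k=0.1):
    """Concatenate 256-bin histograms of the upper and lower LTP codes."""
    m = np.asarray(matrix, dtype=float)
    rows, cols = m.shape
    center = m[1:-1, 1:-1]                # interior cells have all 8 neighbors
    threshold = weber_k * np.abs(center)  # assumption: threshold scales with center
    upper = np.zeros(center.shape, dtype=np.int64)
    lower = np.zeros(center.shape, dtype=np.int64)
    for bit, (dr, dc) in enumerate(NEIGHBORS):
        neighbor = m[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        diff = neighbor - center
        # Upper code: neighbor clearly above the center; lower code: clearly below
        upper |= (diff > threshold).astype(np.int64) << bit
        lower |= (diff < -threshold).astype(np.int64) << bit
    hist_upper = np.bincount(upper.ravel(), minlength=256)
    hist_lower = np.bincount(lower.ravel(), minlength=256)
    return np.concatenate([hist_upper, hist_lower])


rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(30, 20))  # stand-in for an L x 20 PSSM
descriptor = ltp_descriptor(toy_matrix)
print(descriptor.shape)
```

Because the histograms have a fixed number of bins, proteins of different lengths map to descriptors of the same size, which is what allows them to be concatenated with the fixed-length drug fingerprint.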
Rodriguez et al. proposed rotation forest (RF) based on ensemble
forests [27, 28]. This ensemble classifier performs well on small data sets.
Notably, RF is also effective at enhancing the diversity of the samples
presented to the base classifiers. In the experiments, we utilized rotation
forest to detect DTIs. Firstly, RF stochastically separates the sample set into
L disjoint subsets. Subsequently, Principal Component Analysis (PCA) is applied
to transform the subsets and generate the rotation forest. Finally, the
transformed subsets are sent to different base classifiers for scoring each
subtree. The matrix
(I) After obtaining the optimized parameter L, dataset P is
randomly separated into L disjoint subsets; each subset has
(III) Execute PCA on
(IV) These coefficients make up the sparse rotation matrix
In the process of classification, the possibility that sample x belongs to
Finally, the sample x will be classified in accordance with the degree.
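The rotation step can be illustrated with a short sketch that builds one rotation matrix: the features are split into disjoint subsets, PCA is run on a bootstrap sample of each subset, and the resulting coefficients are placed into a sparse block matrix. This is a simplified reading of the general rotation forest idea, not the authors' implementation; the function name `rotation_matrix`, the argument names, and the 75% bootstrap fraction are assumptions, and the per-class sampling of the original algorithm is left out.

```python
import numpy as np


def rotation_matrix(X, n_subsets, sample_fraction=0.75, seed=None):
    """Build one sparse rotation matrix from per-subset PCA coefficients."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Randomly split the feature indices into disjoint subsets
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    n_boot = max(1, int(sample_fraction * n_samples))
    for cols in subsets:
        # Bootstrap sample of the rows, restricted to this feature subset
        rows = rng.choice(n_samples, size=n_boot, replace=True)
        block = X[np.ix_(rows, cols)]
        block = block - block.mean(axis=0)
        # PCA via SVD: the rows of vt are the principal axes of the block
        _, _, vt = np.linalg.svd(block, full_matrices=True)
        R[np.ix_(cols, cols)] = vt.T
    return R


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
R = rotation_matrix(X, n_subsets=3, seed=0)
X_rotated = X @ R  # each base tree would be trained on its own rotated copy
print(R.shape, X_rotated.shape)
```

In a full ensemble, a different rotation matrix would be drawn for every decision tree, and the class of a sample would be decided by averaging the trees' predicted probabilities.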
To evaluate the experimental performance reliably, the evaluation indices, viz. accuracy (Acc.), precision (Prec.), sensitivity (Sen.), specificity (Spec.), and Matthews correlation coefficient (MCC), are utilized to analyze the results of 5-fold CV.
$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Spec.} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where true positive (TP) is the number of interacting drug-target pairs assigned to the positive set; true negative (TN) is the number of non-interacting drug-target pairs assigned to the negative set; false positive (FP) is the number of non-interacting drug-target pairs assigned to the positive set; and false negative (FN) is the number of interacting drug-target pairs assigned to the negative set. In addition, receiver operating characteristic (ROC) curves were plotted to visualize the prediction results, and the area under the curve (AUC) was reported with each ROC curve to assess the established model. We also utilized PR curves and AUPR values to indicate the sample balance and model performance.
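The five indices follow directly from the four confusion-matrix counts, as the sketch below shows. The counts in the example are made up for illustration, the function name `classification_metrics` is hypothetical, and no guard against empty classes (zero denominators) is included.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Acc., Prec., Sen., Spec., and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    sen = tp / (tp + fn)   # also called recall or true positive rate
    spec = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"acc": acc, "prec": prec, "sen": sen, "spec": spec, "mcc": mcc}


# Made-up counts for a balanced test fold of 200 pairs
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```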
In the RF classifier, the main parameters K and L denote the number of feature subsets and the number of decision trees, respectively, both of which affect classification accuracy. To obtain the optimal parameters, we employed a grid search to study the influence of the parameters on the prediction results. As the L-value increased from 0 to 38, the accuracy rose, after which it decreased sharply. Meanwhile, the accuracy grew with increasing K-value. Taking model efficiency into account, the optimal parameters K and L were set to 18 and 38, respectively. Fig. 4 depicts the prediction accuracy surface as a function of the K-value and L-value.
Accuracy surface of the optimization over the K-value and L-value.
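A grid search of this kind simply evaluates every (K, L) combination and keeps the best one. The sketch below shows the pattern with a hypothetical scoring function `toy_accuracy` standing in for the cross-validated accuracy of the classifier; the candidate ranges are also assumptions, since the grid actually searched is not stated here.

```python
from itertools import product


def grid_search(evaluate, k_values, l_values):
    """Return (best_score, best_k, best_l) over all parameter combinations."""
    best = None
    for n_subsets, n_trees in product(k_values, l_values):
        score = evaluate(n_subsets, n_trees)
        if best is None or score > best[0]:
            best = (score, n_subsets, n_trees)
    return best


def toy_accuracy(n_subsets, n_trees):
    # Hypothetical stand-in for cross-validated accuracy, peaked at (18, 38)
    return 0.9 - 0.001 * (n_subsets - 18) ** 2 - 0.0005 * (n_trees - 38) ** 2


best_score, best_k, best_l = grid_search(
    toy_accuracy, k_values=range(2, 21, 2), l_values=range(2, 41, 2)
)
print(best_k, best_l)
```

In practice `evaluate` would train the classifier with the given parameters and return its mean validation accuracy, which makes the cost of the search proportional to the number of grid points.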
To verify the feasibility of the established model and avoid over-fitting, we executed 5-fold CV on the four benchmark data sets with the same parameters. Specifically, each data set is separated into 5 equal-sized, disjoint fractions. Each fraction takes a turn as the test set while the other fractions serve as the training set. Tables 2,3,4,5 display the experimental results of our method on the four standard data sets.
|Test set||Acc. (%)||Pre. (%)||Sen. (%)||Spec. (%)||MCC (%)|
|Test set||Acc. (%)||Pre. (%)||Sen. (%)||Spec. (%)||MCC (%)|
|Test set||Acc. (%)||Pre. (%)||Sen. (%)||Spec. (%)||MCC (%)|
|Test set||Acc. (%)||Pre. (%)||Sen. (%)||Spec. (%)||MCC (%)|
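The 5-fold protocol behind these tables can be sketched as an index generator: the samples are shuffled once, cut into five nearly equal parts, and each part serves once as the test set. The function name `five_fold_indices`, the fixed seed, and the sample count in the example are assumptions for illustration.

```python
import numpy as np


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)  # fold sizes differ by at most one
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx


splits = list(five_fold_indices(103))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold, len(train_idx), len(test_idx))
```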
The summary statistics of the results are shown in Table 6. On the Enzyme data set, the average accuracy, sensitivity, precision, specificity, and Matthews correlation coefficient are 89.08%, 90.32%, 87.52%, 90.62%, and 78.17%, with standard deviations of 0.68%, 0.59%, 1.21%, 0.43%, and 1.32%. On the Ion Channel data set, the averages are 86.14%, 86.46%, 85.69%, 86.60%, and 72.28%, with standard deviations of 1.67%, 2.61%, 1.18%, 2.42%, and 3.37%. On the GPCRs data set, our model achieved averages of 82.41%, 82.10%, 81.97%, 82.96%, and 64.85%, with standard deviations of 2.20%, 3.48%, 3.12%, 1.60%, and 4.43%. On the Nuclear Receptor data set, the averages are 78.40%, 76.33%, 77.78%, 76.43%, and 56.02%, with standard deviations of 5.07%, 7.02%, 14.65%, 5.99%, and 12.21%. As can be noted, the small size of the Nuclear Receptor data set leads to a higher standard deviation. Figs. 5,6,7,8 show the ROC curves of our model on the four benchmark datasets, with average AUC values of 0.9535, 0.9292, 0.8901, and 0.8534. Figs. 9,10,11,12 plot the PR curves of our model on the four gold standard datasets, with average AUPR values of 0.9608, 0.9345, 0.8941, and 0.8636.
The ROC curves generated by 5-fold CV on Enzyme dataset.
The ROC curves generated by 5-fold CV on Ion Channel dataset.
The ROC curves generated by 5-fold CV on GPCRs dataset.
The ROC curves generated by 5-fold CV on Nuclear Receptors dataset.
The PR curves generated by 5-fold CV on Enzyme dataset.
The PR curves generated by 5-fold CV on Ion Channel dataset.
The PR curves generated by 5-fold CV on GPCRs dataset.
The PR curves generated by 5-fold CV on Nuclear Receptors dataset.
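For readers who want to reproduce the AUC values attached to the ROC curves, the area can be computed without drawing the curve, as the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The sketch below uses this pairwise formulation, counting ties as one half; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The pairwise form needs memory proportional to the number of positive-negative pairs, so a rank-based formula is preferable for large test sets.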
To strictly validate the feature description ability of the fuzzy local ternary pattern (FLTP) method, we constructed a comparative experiment in which the FLTP descriptors were replaced with Zernike Moments (ZMs) descriptors, which have strong rotational invariance [33, 34]. The ZMs method is widely utilized in edge detection, where it extracts global feature information at different scales. Table 7 shows the comparison of ZMs and FLTP with the same classifier. These statistics show that the FLTP method achieves a clear performance improvement over Zernike Moments on the benchmarks. All criteria values improve on the Enzyme, Ion Channel, and GPCRs datasets. Fig. 13 displays the mean ROC curves of the FLTP model and the ZMs model obtained by an interpolation method. It is noteworthy that the AUC values of the FLTP-embedded model are consistently greater than those of the ZMs model, with mean gaps of 2.55%, 0.89%, 1.17%, and 4.09%, respectively. The results indicate that our model provides an effective way to characterize PSSMs for detecting potential DTIs.
Comparison of average AUC values on FLTP and ZMs.
|Dataset||Model||Acc. (%)||Prec. (%)||Sen. (%)||Spec. (%)||MCC (%)||AUPR (%)|
|Enzyme||FLTP + RF||89.08
|ZMs + RF||86.13
|Ion Channel||FLTP + RF||86.14
|ZMs + RF||84.00
|GPCRs||FLTP + RF||82.41
|ZMs + RF||81.50
|Nuclear Receptor||FLTP + RF||78.40
|ZMs + RF||75.15
Thus far, several machine learning-based classifiers have been utilized to identify DTIs. To fairly verify the performance of the proposed model, we embedded the state-of-the-art support vector machine (SVM) and light gradient boosting machine (LGBM) algorithms into our model with fuzzy local ternary pattern features. For the RF classifier, we set the parameters K = 18 and L = 38, as discussed above. The SVM uses an inner-product kernel function in place of an explicit nonlinear mapping to a high-dimensional space, and it adopts a small-sample learning approach that greatly simplifies classification and regression. A total of 400 experiments with different combinations of the parameters c and g were carried out to find the highest accuracy, and we set the c-value and g-value to 0.7 and 40, respectively. The radial basis function (RBF) was selected as the SVM kernel, using the LIBSVM tool. The LGBM method is an improved gradient boosting decision tree (GBDT) algorithm designed to reduce time cost and power consumption in industrial applications. After parameter optimization, the number of leaves, the learning rate, and the number of training rounds were set to 55, 0.05, and 37, respectively.
Fig. 14 shows the comparison between RF, LGBM, and SVM on the Enzyme, Ion Channel, GPCRs, and Nuclear Receptor data sets. The results indicate that the model embedding the RF classifier has the highest prediction accuracy. Compared with the SVM classifier, the average accuracy gains of RF are 10.49%, 10.57%, 8.40%, and 15.20%, and the accuracy gaps between RF and LGBM are 3.93%, 3.24%, 3.21%, and 6.77% on the four benchmark datasets. Figs. 15,16 plot the ROC curves on the gold standard datasets, with 1-specificity plotted against sensitivity. A model with a higher AUC value predicts more accurately. As shown in Figs. 15,16, the AUC gaps between RF and SVM on the four data sets reach 0.1051, 0.1162, 0.0944, and 0.2232, and the gaps between RF and LGBM reach 0.1013, 0.1013, 0.0910, and 0.1329, respectively. Therefore, the proposed model is considered more effective at predicting DTIs.
Comparison of advanced classifiers on gold standard data sets. (a) 5-fold CV results on Enzyme data set. (b) 5-fold CV results on Ion Channel data set. (c) 5-fold CV results on GPCRs data set. (d) 5-fold CV results on Nuclear Receptors data set.
ROC curves obtained by different classifiers on Enzyme and GPCRs datasets.
ROC curves obtained by different classifiers on Ion Channel and Nuclear Receptor datasets.
So far, numerous advanced models have been established to predict DTIs and assist drug design. In this section, we compared our model with several state-of-the-art models under 5-fold CV to fully evaluate its performance. The previous methods SIMCOMP, DCT, Bigram-PSSM, and LOOP were evaluated on the benchmark datasets, and Table 8 gives the comparison of AUC and AUPR values. It is clear that the performance of the established model has improved. Although the AUC value of our model is 0.006 lower than that of LOOP on the Ion Channel dataset, the AUC values on Enzyme, GPCRs, and Nuclear Receptors are higher by 0.003, 0.004, and 0.034, respectively, and the AUPR values on the four benchmark datasets are higher by 0.028, 0.014, 0.029, and 0.042, respectively. As a result, the experiments substantiate that a model combining FLTP descriptors and rotation forest can enhance the performance of DTI prediction.
In summary, this paper integrates Position-Specific Scoring Matrix, fuzzy local ternary pattern, and rotation forest into a novel prediction algorithm for identifying the relationships between drugs and targets. Specifically, feature vectors that combine the FLTP descriptors of PSSMs with drug molecular fingerprints are fed into RF for inferring DTIs. The mean accuracies of our model were 89.08%, 86.14%, 82.41%, and 78.40% on the standard data sets. We also made systematic comparisons to confirm the superiority of our model. First, the Zernike Moments (ZMs) method was used in place of the FLTP method to validate the feature description ability. Second, the state-of-the-art SVM and LGBM classifiers were run with FLTP features to assess the performance of RF. The results indicate that this computational method can be regarded as a reliable tool for screening feasible candidates for medical trials.
Although our model achieves more accurate prediction results than previous models, it also has limitations, which this section analyzes from two aspects. On the one hand, the fuzzy local ternary pattern only describes local texture characteristics. This feature descriptor can hardly capture the global information of the sample, which leaves the PSSM features one-sided. To extract better feature vectors, future work will focus on fusion features: we will study a variety of local and global feature extraction methods and combine them to build a prediction model. On the other hand, missing and noisy data samples have a great effect on the accuracy of the model. We will explore two-dimensional data sample filtering algorithms to reduce data noise and improve data robustness. Meanwhile, we will further optimize the parameters to keep the integrity of the samples for accurate prediction. In general, subsequent work will concentrate on developing more accurate supervised classifiers and richer fusion features that integrate the texture features and contour features of PSSMs. The growth of high-throughput data sets will create both favorable circumstances and challenges for constructing auxiliary tools that enhance the accuracy of identification.
ZYZ handled the conceptualization. ZYZ and XKZ performed the methodology, software, and validation. YAH curated the data. WZH, SWZ, and CQY administered the project. WZH handled the funding acquisition.
We thank Zhu-Hong You for technical assistance. Thanks to all the peer reviewers for their opinions and suggestions.
This research was supported by the National Natural Science Foundation of China under Grant No. 62072378.
The authors declare no conflict of interest.
DTIs, drug-target interactions; FLTP, fuzzy local ternary pattern; PSSM, Position-Specific Scoring Matrix; RF, rotation forest; CV, cross-validation; SVM, support vector machine; LGBM, light gradient boosting machine; FDA, food and drug administration; CADD, computer-aided drug development; TTD, therapeutic target database; KEGG, Kyoto encyclopedia of genes and genomes; QSAR, quantitative-structure activity relationships; TLMCS, two-layer multiple classifier system; SDTRLS, substructure-drug-target Kronecker product kernel regularized least squares; DNILMF, dual network integrated logistic matrix factorization; PSI-BLAST, position-specific iterated basic local alignment search tool; PCA, principal component analysis; TP, true positive; TN, true negative; FP, false positive; FN, false negative; ROC, receiver operating characteristic; AUC, area under the curves; ZMs, Zernike moments; RBF, radial basis function; GBDT, gradient boosting decision trees.