^{1}, Wen-Zhun Huang^{1,*}, Xin-Ke Zhan^{1}, Yu-An Huang^{1}, Shan-Wen Zhang^{1}, Chang-Qing Yu^{1}

^{1}School of Information Engineering, Xijing University, 710123 Xi’an, Shaanxi, China

^{*}Correspondence: huangwenzhun@xijing.edu.cn (Wen-Zhun Huang)

**Submitted: 25 May 2021 | Revised: 21 June 2021 | Accepted: 2 July 2021 | Published: 30 July 2021**

**Introduction**: The prediction of interacting drug-target pairs plays an essential role in
drug repurposing and drug discovery. Although biotechnology and
chemical technology have made extraordinary progress, dose-response
experiments and clinical trials remain extremely complex,
laborious, and costly. As a result, a robust computer-aided model for predicting
drug-target interactions (DTIs) is urgently needed. **Methods**: In this paper, we report a novel computational approach that combines the fuzzy local
ternary pattern (FLTP), the Position-Specific Scoring Matrix (PSSM), and rotation
forest (RF) to identify DTIs. More specifically, the target primary sequence is
first numerically characterized as a PSSM, which records biological evolutionary
information. Afterward, the FLTP method is applied to extract highly
representative descriptors from the PSSM, and the combination of FLTP descriptors and
drug molecular fingerprints is regarded as the complete feature vector of a drug-target
pair. **Results**: Finally, the complete features are fed into rotation forests to infer
potential DTIs. Under 5-fold cross-validation (CV), the model achieves mean
accuracies of 89.08%, 86.14%, 82.41%, and 78.40% on the Enzyme, Ion Channel,
GPCRs, and Nuclear Receptor datasets, respectively. **Discussion**: To further validate the model performance, we performed experiments with the
state-of-the-art support vector machine (SVM) and light gradient boosting machine
(LGBM). The experimental results indicate the superiority of the proposed model
in effectively and reliably detecting potential DTIs. We anticipate that
the proposed model can serve as a feasible and convenient tool for
high-throughput identification of DTIs.

The identification of DTIs has become a focal point of pharmaceutical science because it supports the screening of drug candidates and the investigation of disease etiology. Markedly improved biochemical technologies have accelerated therapeutic drug discovery. Nevertheless, in the last few years the Food and Drug Administration (FDA) has approved only a limited number of medicines because of efficiency issues and harmful side effects [1]. Detecting interacting drug-target pairs therefore remains of great significance for selecting promising drug molecules. Researchers have put much effort into exploring DTIs with traditional experiments, but these biochemical methods remain expensive and cumbersome, and their results are subject to chance variation across a series of experiments. Hence, novel computer-aided drug development (CADD) models need to be constructed to infer DTIs stably and reliably [2].

With breakthroughs in protein sequencing and drug molecular structure determination, databases such as PubChem [3], ChEMBL [4], the Therapeutic Target Database (TTD) [5], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], and DrugBank [7] are continuously enriching the public data on target proteins and drug substructures. Previously, computational prediction models mainly relied on molecular docking, ligand-based methods, and data mining [8]. However, these traditional methods have limitations. The molecular docking method predicts binding sites by energy and geometry matching and estimates their affinity by computational simulation [9]. This method plays a critical role in determining the mode of drug action, but it requires a complete 3D structure for every protein in the model, which seriously limits its versatility. The ligand-based method relates the chemical structure of a specific object to its pharmacological activity through quantitative-structure activity relationships (QSAR), so each model can only predict the relationships of one target [10]. This poor interoperability of single models makes the method hard to apply to large-scale cross prediction. The data mining method collects DTIs by text mining and data matching [11]. It is limited by the mining algorithm and the authority of the databases, which hinders its wider application in DTI prediction. In conclusion, the development of effective and robust models has become an essential requirement for DTI prediction.

Bolgár *et al*. [12] proposed Variational Bayesian Multiple Kernel
Logistic Matrix Factorization, which embeds multiple kernel learning, weighted
observation, and graph Laplacian regularization to model DTIs. Shi *et
al*. [13] developed a two-layer multiple classifier system (TLMCS) that focuses on
fully utilizing heterogeneous features to better predict DTIs. Xia *et
al*. [14] proposed a novel model, Self-Paced Learning with Collaborative
Matrix Factorization based on weighted low-rank approximation (SPLCMF), to predict
DTIs. Specifically, this framework employs regularized least squares to fuse the
related networks and reduces the complexity of samples by soft weighting. Yan
*et al*. [15] developed the substructure-drug-target Kronecker product kernel
regularized least squares (SDTRLS) model, which integrates the RLS-Kron model, chemical
substructure similarity fusion, and Gaussian Interaction Profile (GIP) kernels to
detect interacting drug-target pairs. Cui *et al*. [16] proposed L2,1-GRMF,
an extension of the GRMF method that identifies DTIs by incorporating the L2,1-norm. Hao
*et al*. [17] constructed dual network integrated logistic matrix
factorization (DNILMF), which uses a drug structure matrix and a target sequence kernel
matrix to predict DTIs.

In this paper, we established a novel *in silico* method to infer DTIs,
which integrates PSSM, FLTP, and the RF classifier. Specifically,
the target primary sequences are first converted into numerical PSSM matrices,
which record the frequencies of the amino acids that appear at different positions.
Then, we employed the FLTP approach to extract the latent characteristics of
the PSSMs. Subsequently, we merged these descriptors with the drug fingerprints to form
the complete feature vectors of drug-target pairs. Finally, the complete feature descriptors are fed into
rotation forest to detect DTIs. We verified our model on the benchmark datasets,
viz. Enzyme, Ion Channel, GPCRs, and Nuclear Receptor, using 5-fold
cross-validation. Furthermore, we compared the established model with another
advanced feature descriptor and with other classifiers, including SVM and LGBM. The
experimental results illustrate that the proposed model performs well
in predicting DTIs and can reliably screen candidates
for clinical trials. The flowchart of the established model is depicted in Fig. 1.

**Workflow of our model.** (a) Numerically convert the proteins to
PSSMs. (b) Characterize the PSSMs by FLTP. (c) Extract the molecular fingerprints of
the drugs. (d) Feed the complete features into rotation forest. (e) Predict DTIs.

In this paper, four benchmark datasets, viz. Enzyme, Ion Channel, GPCRs, and Nuclear Receptor, were collected from the DrugBank [7], SuperTarget [18], BRENDA [19], and KEGG BRITE [6] databases to evaluate the established model. The Enzyme dataset contains 445 drugs, 664 proteins, and 2926 DTIs. The Ion Channel dataset contains 210 drugs, 204 proteins, and 1467 DTIs. The GPCRs dataset contains 223 drugs, 95 proteins, and 635 DTIs. The Nuclear Receptor dataset contains 54 drugs, 26 proteins, and 90 DTIs. Table 1 lists the statistics of these benchmark datasets.

**Statistical description of benchmark dataset.**

Statistics | Enzyme | Ion channel | GPCRs | Nuclear receptor |
---|---|---|---|---|
Drugs | 445 | 210 | 223 | 54 |
Target proteins | 664 | 204 | 95 | 26 |
Interactions | 2926 | 1467 | 635 | 90 |

Drug-protein interactions were represented as a bipartite graph: drugs and
proteins form the nodes of the graph, and the verified interactions between
them are denoted by edges. In the experiments, all drug-target
pairs connected by edges are assigned to the positive dataset, and the other
pairs are treated as negative samples. Given the number of nodes, the
known interactions account for only a small fraction of all possible drug-target
pairs. Taking the Ion Channel dataset as an example, there are 42,840 (210 × 204)
possible drug-target pairs, of which only 1467 are known interactions.
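The sparsity of known interactions can be checked directly from the counts in Table 1. The following is a minimal sketch; the function name `pair_statistics` and the dictionary layout are illustrative choices, not part of the original work.

```python
# Dataset sizes copied from Table 1: (drugs, target proteins, known interactions).
DATASETS = {
    "Enzyme": (445, 664, 2926),
    "Ion Channel": (210, 204, 1467),
    "GPCRs": (223, 95, 635),
    "Nuclear Receptor": (54, 26, 90),
}


def pair_statistics(n_drugs, n_targets, n_known):
    """Return (all candidate pairs, unlabeled pairs, positive fraction)."""
    total = n_drugs * n_targets      # every possible edge of the bipartite graph
    unlabeled = total - n_known      # pairs without a verified interaction
    return total, unlabeled, n_known / total


for name, sizes in DATASETS.items():
    total, unlabeled, fraction = pair_statistics(*sizes)
    print(f"{name}: {total} pairs, {unlabeled} unlabeled, {fraction:.2%} positive")
```

In every dataset the verified interactions are a small minority of the candidate pairs, which is why the choice of negative samples matters for training.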

Molecular fingerprints, which encode chemical substructure information, can effectively reflect drug structure [20]. A fingerprint transforms a molecular structure into a binary sequence by detecting specific fragments in the molecule [21]. Although the molecule is divided into several independent parts, the fingerprint still preserves the integrity of the structural information of the whole drug [22]. Studies substantiate that molecular fingerprints reduce the information loss and accumulated error of screening procedures, and they also reduce the computational complexity of the description process. Specifically, when a fragment matches a molecular substructure, the corresponding position of the vector is set to 1; otherwise it is 0. Mature fingerprint databases provide reliable tools for generating molecular fingerprints. We selected the fingerprint map containing 881 substructures from the PubChem system (https://pubchem.ncbi.nlm.nih.gov/) [23]. Therefore, each drug molecule is described by an 881-dimensional Boolean vector. Fig. 2 illustrates the transformation of Zanamivir into a fingerprint.

**The transformation of Zanamivir into a fingerprint.**
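The encoding step can be illustrated with a short sketch. It assumes that a substructure search has already produced the list of matched positions; the function name `to_fingerprint` and the example positions are hypothetical and do not correspond to the actual bits of any real drug.

```python
N_BITS = 881  # length of the PubChem substructure fingerprint used in the text


def to_fingerprint(matched_positions, n_bits=N_BITS):
    """Build a Boolean vector with 1 at each matched substructure position."""
    vector = [0] * n_bits
    for pos in matched_positions:
        if not 0 <= pos < n_bits:
            raise ValueError(f"position {pos} outside 0..{n_bits - 1}")
        vector[pos] = 1
    return vector


# Hypothetical positions; real values come from a substructure search.
fp = to_fingerprint([0, 1, 9, 880])
print(len(fp), sum(fp))
```

Every drug is mapped to a vector of the same length regardless of its size, so the fingerprints can be concatenated directly with fixed-length protein descriptors.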

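In recent years, various physicochemical methods have been applied to numerically characterize proteins, which are composed of 20 types of amino acids, each denoted by a letter [24]. The Position-Specific Scoring Matrix (PSSM) is extensively utilized in protein binding site prediction, protein secondary structure prediction, and protein subcellular localization [25]. In this section, the PSSM is employed to extract evolutionary information by calculating the probability that an amino acid emerges at a specific position of the protein primary sequence. The PSSM is written as follows.

$$PSSM = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ P_{L,1} & P_{L,2} & \cdots & P_{L,20} \end{bmatrix}$$

where PSSM is an *L* × 20 matrix, *L* is the length of the protein sequence, and the 20 columns correspond to the 20 amino acids. In PSI-BLAST, the *e*-value is set to 0.001 and the number of iterations is set to 3.
Fig. 3 gives an example of converting Lipoprotein Lipase into a PSSM.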

**An example of converting Lipoprotein Lipase into a PSSM.**

Fuzzy Local Ternary Pattern (FLTP) can precisely describe texture
features, and it is widely applied in face anti-spoofing and
image tampering detection [26]. In addition to its rotation invariance, FLTP is
robust to image noise. The method dynamically calculates the
threshold based on Weber’s law to extract multiple features, and it can be
extended to circular neighborhoods with different radii. In this paper, FLTP
is employed to describe the characteristics of PSSMs. The algorithm converts the
differences between the neighborhood pixels and the center pixel into upper and lower
binary codes. The upper binary code can be expressed as

where

where

where

Finally, the FLTP feature vector can be obtained as follows.

In this experiment, the radius of the circular neighborhood is R = 1 and the number of
pixels in the neighborhood is P = 8. The upper and lower binary codes are each transformed
into a 256-dimensional vector. Hence, the entire descriptor of a PSSM
is a 1 × 512 vector.
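To make the upper and lower codes concrete, the sketch below computes a plain local ternary pattern with R = 1 and P = 8 and returns the two concatenated 256-bin histograms. It is a simplification under stated assumptions: the threshold is taken as t = k·|centre| (a Weber-style relative threshold with a hypothetical constant `k`), strict inequalities are used, and the fuzzy membership weighting that distinguishes FLTP is not modelled. The function name and the random toy matrix are illustrative only.

```python
import numpy as np

# Offsets of the P = 8 neighbours on the R = 1 ring, in clockwise order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def ltp_histograms(matrix, k=0.1):
    """Return the concatenated 256-bin upper and lower LTP histograms.

    Assumption: the threshold is t = k * |centre|, a Weber-style relative
    threshold; the fuzzy membership weighting of FLTP is NOT modelled here.
    """
    m = np.asarray(matrix, dtype=float)
    upper_hist = np.zeros(256, dtype=int)
    lower_hist = np.zeros(256, dtype=int)
    rows, cols = m.shape
    for i in range(1, rows - 1):          # skip the border: it lacks 8 neighbours
        for j in range(1, cols - 1):
            centre = m[i, j]
            t = k * abs(centre)
            upper = lower = 0
            for bit, (di, dj) in enumerate(OFFSETS):
                diff = m[i + di, j + dj] - centre
                if diff > t:              # ternary value +1 sets a bit of the upper code
                    upper |= 1 << bit
                elif diff < -t:           # ternary value -1 sets a bit of the lower code
                    lower |= 1 << bit
            upper_hist[upper] += 1
            lower_hist[lower] += 1
    return np.concatenate([upper_hist, lower_hist])


rng = np.random.default_rng(0)
toy_pssm = rng.integers(-5, 6, size=(30, 20))   # stand-in for a real L x 20 PSSM
features = ltp_histograms(toy_pssm)
print(features.shape)
```

Because the histograms have a fixed number of bins, proteins of different lengths are mapped to descriptors of the same size.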

Rodriguez *et al*. [27] proposed rotation forest (RF) based on ensemble
forests [27, 28]. This ensemble classifier performs well in the classification of
small-sized datasets. Notably, RF is also effective at increasing sample
diversity [29]. In the experiments, we utilized rotation forest to detect
DTIs. Firstly, RF randomly separates the sample set into *L* disjoint
subsets. Subsequently, Principal Component Analysis (PCA) is applied to
transform the subsets and generate the rotation forest. Finally, the transformed subsets are sent to different base
classifiers to score each subtree. The matrix

(I) After the optimized parameter *L* is obtained, dataset *P* is
randomly separated into *L* disjoint subsets; each subset has

(II) Let

(III) Execute PCA on

(IV) These coefficients make up the sparse rotation matrix

In the process of classification, the possibility that sample x belongs to
category

Finally, the sample x is assigned to the category with the largest confidence degree.
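The rotation step can be sketched as follows. This is an illustrative reading of the rotation forest construction of Rodriguez *et al*., not the authors' implementation: the function name `rotation_matrix`, the 75% bootstrap fraction, and the toy data are assumptions, the class-subset sampling of the original algorithm is omitted, and the training of the base decision trees is left out.

```python
import numpy as np


def rotation_matrix(X, n_subsets, sample_fraction=0.75, seed=0):
    """Build one rotation matrix in the style of rotation forest.

    The features are randomly split into `n_subsets` disjoint groups, PCA is
    run on a bootstrap sample restricted to each group, and the principal
    axes are placed in the rows and columns that belong to that group.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    groups = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in groups:
        size = max(2, int(sample_fraction * n_samples))
        rows = rng.choice(n_samples, size=size, replace=True)
        cov = np.atleast_2d(np.cov(X[np.ix_(rows, cols)], rowvar=False))
        _, axes = np.linalg.eigh(cov)     # columns are orthonormal principal axes
        R[np.ix_(cols, cols)] = axes
    return R


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 12))             # toy data: 40 samples, 12 features
R = rotation_matrix(X, n_subsets=3)
X_rotated = X @ R                          # training input for one base tree
print(np.allclose(R.T @ R, np.eye(12)))
```

Each tree in the ensemble would receive its own rotation built from a different random split, which is the source of diversity among the base classifiers. Because all principal components are kept, the rotation is orthogonal and no information is discarded.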

To improve the reliability of the experimental evaluation, the indices accuracy (Acc.), precision (Prec.), sensitivity (Sen.), specificity (Spec.), and Matthews correlation coefficient (MCC) are utilized to analyze the results of 5-fold CV.

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Spec.} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where true positive (TP) is the number of interacting drug-target pairs assigned to the positive set; true negative (TN) is the number of non-interacting drug-target pairs assigned to the negative set; false positive (FP) is the number of non-interacting drug-target pairs assigned to the positive set; and false negative (FN) is the number of interacting drug-target pairs assigned to the negative set. In addition, receiver operating characteristic (ROC) curves were plotted to visualize the prediction results [30], and the area under the curve (AUC) was reported with each ROC curve to assess the established model [31]. We also utilized PR curves and AUPR values to reflect the sample balance and model performance.
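The five indices can be computed from the four confusion counts as in the following sketch, which assumes the standard definitions of these measures. The function name `evaluate` and the example counts are hypothetical and are not taken from the experiments.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Return Acc., Prec., Sen., Spec. and MCC from the confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": _ratio(tp + tn, tp + tn + fp + fn),
        "Prec": _ratio(tp, tp + fp),
        "Sen": _ratio(tp, tp + fn),
        "Spec": _ratio(tn, tn + fp),
        "MCC": _ratio(tp * tn - fp * fn, denom),
    }


# Hypothetical counts for one balanced test fold (not taken from the paper).
scores = evaluate(tp=90, tn=85, fp=15, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative sets differ in size.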

In the RF classifier, the main parameters *K* and *L* denote the
number of feature subsets and the number of decision trees, respectively, and both affect the classification
accuracy. To obtain the optimal parameters, this paper employs a grid-search algorithm
to study the influence of the parameters on the prediction results [32]. The
experimental results show that the accuracy rose as the *L*-value increased
to 38 and then decreased sharply. Meanwhile, the accuracy
grew with increasing *K*-value. In consideration of model
efficiency, the optimal parameters *K* and *L* are set to 18 and
38, respectively. Fig. 4 depicts the prediction accuracy surface as a function of the
*K*-value and *L*-value.

**Accuracy surface for the optimization of the K-value and L-value.**
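A grid search of this kind can be sketched in a few lines. The scoring function below is a hypothetical stand-in constructed to peak at the reported optimum (K = 18, L = 38); in practice it would be replaced by the cross-validated accuracy of the classifier, and the grid ranges shown are assumptions rather than the ranges used in the experiments.

```python
from itertools import product


def grid_search(score, k_values, l_values):
    """Return (best_score, best_K, best_L) over the full parameter grid."""
    best = None
    for k, l in product(k_values, l_values):
        s = score(k, l)
        if best is None or s > best[0]:
            best = (s, k, l)
    return best


def toy_score(k, l):
    """Hypothetical stand-in for cross-validated accuracy; peaks at K=18, L=38."""
    return 1.0 - 0.001 * (k - 18) ** 2 - 0.0005 * (l - 38) ** 2


best_score, best_k, best_l = grid_search(toy_score, range(2, 31, 2), range(2, 51, 2))
print(best_k, best_l)
```

The exhaustive loop evaluates every (K, L) combination, which is affordable here because the grid has only two dimensions.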

To verify the feasibility of the established model and avoid over-fitting, we executed 5-fold CV on the four benchmark datasets with the same parameters. Specifically, each dataset is separated into 5 equal-sized, disjoint fractions. Each fraction in turn is treated as the test set, while the other four fractions serve as the training set. Tables 2, 3, 4, and 5 display the experimental results of our method on the four standard datasets.
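The splitting scheme can be sketched as follows. The function name `k_fold_indices`, the random seed, and the sample count of 180 are illustrative assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k disjoint, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=180, k=5))
for fold_id, (train, test) in enumerate(splits, start=1):
    print(fold_id, len(train), len(test))
```

Every sample appears in exactly one test fold, so the five test-set scores together cover the whole dataset once.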

**Experimental results yielded by 5-fold CV on the *Enzyme* dataset.**

Test set | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) |
---|---|---|---|---|---|
1 | 88.46 | 90.51 | 86.05 | 90.89 | 77.02 |
2 | 89.15 | 89.61 | 87.18 | 90.91 | 78.23 |
3 | 88.55 | 89.88 | 87.14 | 89.98 | 77.14 |
4 | 89.06 | 91.12 | 87.88 | 90.38 | 78.15 |
5 | 90.17 | 90.46 | 89.35 | 90.95 | 80.33 |
Average | 89.08 | 90.32 | 87.52 | 90.62 | 78.17 |

**Experimental results yielded by 5-fold CV on the *Ion Channel* dataset.**

Test set | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) |
---|---|---|---|---|---|
1 | 88.31 | 90.21 | 86.29 | 90.38 | 76.69 |
2 | 84.58 | 85.33 | 84.49 | 84.67 | 69.14 |
3 | 87.12 | 87.46 | 86.87 | 87.37 | 74.24 |
4 | 84.41 | 83.16 | 84.34 | 84.47 | 68.77 |
5 | 86.27 | 86.15 | 86.44 | 86.10 | 72.54 |
Average | 86.14 | 86.46 | 85.69 | 86.60 | 72.28 |

**Experimental results yielded by 5-fold CV on the *GPCRs* dataset.**

Test set | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) |
---|---|---|---|---|---|
1 | 79.92 | 77.77 | 80.99 | 78.95 | 59.87 |
2 | 84.65 | 84.92 | 84.25 | 85.04 | 69.29 |
3 | 84.65 | 86.29 | 82.95 | 86.40 | 69.36 |
4 | 80.63 | 81.20 | 81.82 | 79.34 | 61.18 |
5 | 82.21 | 80.30 | 84.80 | 79.69 | 64.54 |
Average | 82.41 | 82.10 | 81.97 | 82.96 | 64.85 |

**Experimental results yielded by 5-fold CV on the *Nuclear Receptors* dataset.**

Test set | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) |
---|---|---|---|---|---|
1 | 73.88 | 81.25 | 56.52 | 76.92 | 42.32 |
2 | 86.11 | 78.95 | 93.75 | 80.00 | 73.41 |
3 | 80.56 | 83.33 | 78.95 | 82.35 | 61.21 |
4 | 74.29 | 66.67 | 71.43 | 76.19 | 47.14 |
5 | 77.14 | 71.43 | 88.24 | 66.67 | 56.01 |
Average | 78.40 | 76.33 | 77.78 | 76.43 | 56.02 |

The statistics of the results are shown in Table 6. The averages of
accuracy, precision, sensitivity, specificity, and Matthews correlation
coefficient are 89.08%, 90.32%, 87.52%, 90.62%, and 78.17% on the
*Enzyme* dataset, with standard deviations of 0.68%, 0.59%, 1.21%,
0.43%, and 1.32%. On the *Ion Channel* dataset, we obtained averages of 86.14%, 86.46%,
85.69%, 86.60%, and 72.28%, with standard
deviations of 1.67%, 2.61%, 1.18%, 2.42%, and 3.37%. On the *GPCRs*
dataset, our model generated averages of 82.41%, 82.10%, 81.97%,
82.96%, and 64.85%, with standard deviations of 2.20%, 3.48%, 3.12%, 1.60%,
and 4.43%. On the *Nuclear Receptor* dataset, the averages
are 78.40%, 76.33%, 77.78%, 76.43%, and 56.02%, with standard
deviations of 5.07%, 7.02%, 14.65%, 5.99%, and 12.21%. As can be noted, the
small size of the *Nuclear Receptor* dataset leads to a higher standard
deviation. Figs. 5,6,7,8 show the ROC curves of our model on the four benchmark
datasets, with average AUC values of 0.9535, 0.9292, 0.8901, and 0.8534,
respectively. Figs. 9,10,11,12 plot the PR curves of our model on the four
gold standard datasets, with average AUPR values of 0.9608, 0.9345,
0.8941, and 0.8636, respectively.
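As a worked check of how the averages and standard deviations relate to the per-fold results, the sketch below recomputes them for the *Enzyme* accuracies. It assumes the sample standard deviation (n − 1 denominator), which reproduces the reported 89.08 and 0.68; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold accuracies (%) on the Enzyme dataset, copied from the 5-fold CV table.
enzyme_acc = [88.46, 89.15, 88.55, 89.06, 90.17]

avg = mean(enzyme_acc)
sd = stdev(enzyme_acc)      # sample standard deviation (n - 1 denominator)
print(f"{avg:.2f} +/- {sd:.2f}")
```

With only five folds, the sample and population standard deviations differ noticeably, so stating which one is reported helps when comparing against other studies.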

**The ROC curves generated by 5-fold CV on Enzyme dataset.**

**The ROC curves generated by 5-fold CV on Ion Channel dataset.**

**The ROC curves generated by 5-fold CV on GPCRs dataset.**

**The ROC curves generated by 5-fold CV on Nuclear Receptors dataset.**

**The PR curves generated by 5-fold CV on Enzyme dataset.**

**The PR curves generated by 5-fold CV on Ion Channel dataset.**

**The PR curves generated by 5-fold CV on GPCRs dataset.**

**The PR curves generated by 5-fold CV on Nuclear Receptors dataset.**

**Statistics of the results yielded by 5-fold CV on the four benchmark datasets.**

Dataset | Statistic | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) |
---|---|---|---|---|---|---|
Enzyme | Average | 89.08 | 90.32 | 87.52 | 90.62 | 78.17 |
Enzyme | Standard deviation | 0.68 | 0.59 | 1.21 | 0.43 | 1.32 |
Ion Channel | Average | 86.14 | 86.46 | 85.69 | 86.60 | 72.28 |
Ion Channel | Standard deviation | 1.67 | 2.61 | 1.18 | 2.42 | 3.37 |
GPCRs | Average | 82.41 | 82.10 | 81.97 | 82.96 | 64.85 |
GPCRs | Standard deviation | 2.20 | 3.48 | 3.12 | 1.60 | 4.43 |
Nuclear Receptors | Average | 78.40 | 76.33 | 77.78 | 76.43 | 56.02 |
Nuclear Receptors | Standard deviation | 5.07 | 7.02 | 14.65 | 5.99 | 12.21 |

To rigorously validate the feature description ability of the fuzzy local ternary
pattern (FLTP) method, we constructed a comparative experiment by replacing the
FLTP descriptors with Zernike Moments (ZMs) descriptors, which have strong
rotational invariance [33, 34]. The ZMs method is widely utilized in
edge detection, where it extracts global feature information at different scales [35].
Table 7 shows the comparison of ZMs and FLTP with the same classifier. These
experimental statistics show that the FLTP method yields a clear performance
improvement over Zernike Moments on the benchmarks. All criteria
improve on the *Enzyme*, *Ion Channel*, and
*GPCRs* datasets. Fig. 13 displays the mean ROC curves of the FLTP model and
the ZMs model obtained by an interpolation method. It is noteworthy that the AUC values of the
FLTP-embedded model are consistently greater than those of the ZMs model, with mean value
gaps of 2.55%, 0.89%, 1.17%, and 4.09%, respectively. The results
indicate that our model provides an effective way to characterize PSSMs for
detecting potential DTIs.

**Comparison of average AUC values on FLTP and ZMs.**

**Performance comparison of fuzzy local ternary pattern with Zernike Moments.**

Dataset | Model | Acc. (%) | Prec. (%) | Sen. (%) | Spec. (%) | MCC (%) | AUPR (%) |
---|---|---|---|---|---|---|---|
Enzyme | FLTP + RF | 89.08 | 90.32 | 87.52 | 90.62 | 78.17 | 96.08 |
Enzyme | ZMs + RF | 86.13 | 87.15 | 85.25 | 87.40 | 72.69 | 93.97 |
Ion Channel | FLTP + RF | 86.14 | 86.46 | 85.69 | 86.60 | 72.28 | 93.45 |
Ion Channel | ZMs + RF | 84.00 | 84.14 | 84.00 | 84.11 | 68.13 | 91.88 |
GPCRs | FLTP + RF | 82.41 | 82.10 | 81.97 | 82.96 | 64.85 | 89.41 |
GPCRs | ZMs + RF | 81.50 | 81.27 | 81.46 | 81.62 | 63.06 | 88.21 |
Nuclear Receptor | FLTP + RF | 78.40 | 76.33 | 77.78 | 76.43 | 56.02 | 86.36 |
Nuclear Receptor | ZMs + RF | 75.15 | 76.19 | 75.90 | 75.78 | 51.83 | 81.09 |

Thus far, several machine learning-based classifiers have been utilized to identify DTIs.
To fairly verify the performance of the proposed model, we embedded the state-of-the-art
support vector machine (SVM) and light gradient boosting machine (LGBM) algorithms
into our model with fuzzy local ternary pattern features. For the RF classifier, we set the
parameters *K* = 18 and *L* = 38, as discussed above. The SVM
uses an inner-product kernel function in place of an explicit nonlinear mapping to a
high-dimensional space, and it adopts a small-sample learning method that greatly
simplifies classification and regression. A total of 400
experiments with different combinations of the parameters *c* and *g*
were carried out to find the highest accuracy, and we set the *c*-value and
*g*-value to 0.7 and 40, respectively. The SVM kernel was the
radial basis function (RBF), implemented with the LIBSVM tool. The LGBM method is an improved
gradient boosting decision trees (GBDT) algorithm that reduces the time cost and
power consumption in industrial applications. After parameter optimization, the
number of leaves, the learning rate, and the number of training rounds were set to 55, 0.05,
and 37, respectively.

Fig. 14 shows the comparison between RF, LGBM, and SVM on the *Enzyme*,
*Ion Channel*, *GPCRs*, and *Nuclear Receptor* datasets.
The results indicate that the model with the RF classifier has the highest prediction
accuracy. Compared with the SVM classifier, the average accuracy gains of RF are
10.49%, 10.57%, 8.40%, and 15.20%, and the accuracy gaps between RF and LGBM are
3.93%, 3.24%, 3.21%, and 6.77% on the four benchmark datasets. Figs. 15,16 plot the
ROC curves for the gold standard datasets, with sensitivity plotted against
1-specificity. A model with a higher AUC value predicts more accurately.
As shown in Figs. 15,16, the AUC gaps between RF and SVM on the four datasets are 0.1051,
0.1162, 0.0944, and 0.2232, and the gaps between RF and LGBM
are 0.1013, 0.1013, 0.0910, and 0.1329, respectively. Therefore, the
proposed model is considered more effective at predicting DTIs.

**Comparison of advanced classifiers on gold standard data sets.**
(a) 5-fold CV results on *Enzyme* data set. (b) 5-fold CV results on
*Ion Channel* data set. (c) 5-fold CV results on *GPCRs* data set.
(d) 5-fold CV results on *Nuclear Receptors* data set.

**ROC curves obtained by different classifiers on Enzyme and GPCRs datasets.**

**ROC curves obtained by different classifiers on Ion Channel and Nuclear Receptor datasets.**

So far, numerous advanced models have been established to predict DTIs and
assist drug design. In this section, we compared our model with several
state-of-the-art models under 5-fold CV to fully evaluate its performance.
The previous methods SIMCOMP [36], DCT [37],
Bigram-PSSM [38], and LOOP [39] were evaluated on the benchmark datasets, and Table 8 gives the comparison
of AUC and AUPR values. The performance of the
established model is clearly improved in most cases. Although the AUC value of our model is
0.006 lower than that of LOOP on the *Ion Channel* dataset, its AUC values on
*Enzyme*, *GPCRs*, and *Nuclear Receptors* are higher than those of LOOP by 0.003,
0.004, and 0.034, respectively, although SIMCOMP (0.856) and Bigram-PSSM (0.869)
retain a higher AUC on *Nuclear Receptors*. The AUPR values on the four benchmark
datasets are higher than those of LOOP by 0.028, 0.014, 0.029, and 0.042, respectively. As a
result, the experiments substantiate that the model combining FLTP
descriptors and rotation forest can enhance the performance of
DTI prediction.

**Comparison between our model and state-of-the-art methods on the benchmark data sets.**

Dataset | Method | AUC | AUPR |
---|---|---|---|
Enzyme | SIMCOMP | 0.876 | 0.358 |
Enzyme | DCT | 0.909 | 0.873 |
Enzyme | Bigram-PSSM | 0.948 | 0.546 |
Enzyme | LOOP | 0.951 | 0.933 |
Enzyme | Our method | 0.954 | 0.961 |
Ion Channel | SIMCOMP | 0.767 | 0.274 |
Ion Channel | DCT | 0.893 | 0.812 |
Ion Channel | Bigram-PSSM | 0.889 | 0.39 |
Ion Channel | LOOP | 0.935 | 0.921 |
Ion Channel | Our method | 0.929 | 0.935 |
GPCRs | SIMCOMP | 0.867 | 0.452 |
GPCRs | DCT | 0.867 | 0.793 |
GPCRs | Bigram-PSSM | 0.872 | 0.282 |
GPCRs | LOOP | 0.886 | 0.865 |
GPCRs | Our method | 0.890 | 0.894 |
Nuclear Receptor | SIMCOMP | 0.856 | 0.435 |
Nuclear Receptor | DCT | 0.799 | 0.628 |
Nuclear Receptor | Bigram-PSSM | 0.869 | 0.411 |
Nuclear Receptor | LOOP | 0.819 | 0.822 |
Nuclear Receptor | Our method | 0.853 | 0.864 |
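The gaps quoted in the text can be recomputed from the tabulated values. The sketch below does so for LOOP versus the proposed method; the dictionary layout and the function name `gaps` are illustrative.

```python
# AUC and AUPR values copied from the comparison table: (AUC, AUPR).
LOOP = {
    "Enzyme": (0.951, 0.933),
    "Ion Channel": (0.935, 0.921),
    "GPCRs": (0.886, 0.865),
    "Nuclear Receptor": (0.819, 0.822),
}
OURS = {
    "Enzyme": (0.954, 0.961),
    "Ion Channel": (0.929, 0.935),
    "GPCRs": (0.890, 0.894),
    "Nuclear Receptor": (0.853, 0.864),
}


def gaps(ours, baseline):
    """Return {dataset: (AUC gap, AUPR gap)}, rounded to three decimals."""
    return {
        name: (
            round(ours[name][0] - baseline[name][0], 3),
            round(ours[name][1] - baseline[name][1], 3),
        )
        for name in ours
    }


for name, (auc_gap, aupr_gap) in gaps(OURS, LOOP).items():
    print(f"{name}: AUC {auc_gap:+.3f}, AUPR {aupr_gap:+.3f}")
```

A negative gap marks a dataset on which the baseline scores higher, which makes the single *Ion Channel* AUC deficit easy to spot alongside the gains.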

In summary, this paper integrates the Position-Specific Scoring Matrix, fuzzy local ternary pattern, and rotation forest into a novel prediction algorithm for identifying the relationships between drugs and targets. Specifically, the FLTP descriptors of the PSSMs are combined with the drug molecular fingerprints and fed into RF to infer DTIs. The mean accuracies of our model were 89.08%, 86.14%, 82.41%, and 78.40% on the standard datasets. We also made systematic comparisons to examine the superiority of our model. First, the Zernike Moments (ZMs) method was substituted for the FLTP method to validate the feature description ability. Second, the state-of-the-art SVM and LGBM classifiers were tested with FLTP features to assess the performance of RF. The results indicate that this computational method can be regarded as a reliable tool for screening feasible candidates for medical trials.

Besides achieving more accurate prediction results than previous models, we also noticed the limitations of our model, which we analyze from two aspects. On the one hand, the fuzzy local ternary pattern only describes local texture characteristics. This feature descriptor can hardly capture the global information of the sample, so the features extracted from a PSSM are of a single type. To extract better feature vectors, future work will focus on fusion features: we will study a variety of local and global feature extraction methods and combine them to build a prediction model. On the other hand, missing and noisy data samples have a great effect on the accuracy of the model. We will explore two-dimensional sample filtering algorithms to reduce data noise and improve data robustness. Meanwhile, we will further optimize the parameters to keep the integrity of the samples for accurate prediction. In general, subsequent work will concentrate on developing more accurate supervised classifiers and richer fusion features that integrate the texture features and contour features of PSSMs. The growth of high-throughput datasets will create both favorable circumstances and challenges for constructing auxiliary tools that enhance the accuracy of identification.

ZYZ handled the conceptualization. ZYZ and XKZ performed the methodology, software, and validation. YAH curated the data. WZH, SWZ, and CQY administered the project. WZH handled the funding acquisition.

Not applicable.

We thank Zhu-Hong You for technical assistance. Thanks to all the peer reviewers for their opinions and suggestions.

This research was supported by the National Natural Science Foundation of China under Grant No. 62072378.

The authors declare no conflict of interest.

https://github.com/zhaozhiya-20/Predict-the-interaction-of-DTIs-combining-FLTP-and-RF.

DTIs, drug-target interactions; FLTP, fuzzy local ternary pattern; PSSM, Position-Specific Scoring Matrix; RF, rotation forest; CV, cross-validation; SVM, support vector machine; LGBM, light gradient boosting machine; FDA, food and drug administration; CADD, computer-aided drug development; TTD, therapeutic target database; KEGG, Kyoto encyclopedia of genes and genomes; QSAR, quantitative-structure activity relationships; TLMCS, two-layer multiple classifier system; SDTRLS, substructure-drug-target Kronecker product kernel regularized least squares; DNILMF, dual network integrated logistic matrix factorization; PSI-BLAST, position-specific iterated basic local alignment search tool; PCA, principal component analysis; TP, true positive; TN, true negative; FP, false positive; FN, false negative; ROC, receiver operating characteristic; AUC, area under the curves; ZMs, Zernike moments; RBF, radial basis function; GBDT, gradient boosting decision trees.