NULL
Countries | Regions
Countries | Regions
Article Types
Article Types
Year
Volume
Issue
Pages
IMR Press / FBL / Volume 27 / Issue 3 / DOI: 10.31083/j.fbl2703084
Open Access Original Research
Comprehensive Prediction of Lipocalin Proteins Using Artificial Intelligence Strategy
Show Less
1 School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China
2 School of Materials Science and Engineering, Hainan University, 570228 Haikou, Hainan, China
*Correspondence: yuxiaolong@hainanu.edu.cn (Xiao-Long Yu); zyzhang@uestc.edu.cn (Zhao-Yue Zhang)
Front. Biosci. (Landmark Ed) 2022, 27(3), 84; https://doi.org/10.31083/j.fbl2703084
Submitted: 2 December 2021 | Revised: 17 January 2022 | Accepted: 20 January 2022 | Published: 5 March 2022
(This article belongs to the Special Issue Computational biomarker detection and analysis)
This is an open access article under the CC BY 4.0 license.
Abstract

Background: Lipocalin belongs to the calcyin family, and its sequence length is generally between 165 and 200 residues. They are mainly stable and multifunctional extracellular proteins. Lipocalin plays an important role in several stress responses and allergic inflammations. Because the accurate identification of lipocalins could provide significant evidences for the study of their function, it is necessary to develop a machine learning-based model to recognize lipocalin. Methods: In this study, we constructed a prediction model to identify lipocalin. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized by using feature selection techniques. A classifier based on random forest was trained according to the optimal features. Results: The results of 10-fold cross-validation showed that our computational model would classify lipocalins with accuracy of 95.03% and area under the curve of 0.987. On the independent dataset, our computational model could produce the accuracy of 89.90% which was 4.17% higher than the existing model. Conclusions: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the best features which could produce the maximum accuracy. On the basis of the best feature subset, the RF-based classifier can obtained the best prediction results.

Keywords
lipocalins
bioinformatics
feature extraction
optimization
validation
1. Introduction

Lipocalin belongs to the calcyin family and is usually composed of 165–200 residues. It is mainly a stable and multifunctional extracellular protein. Lipocalin proteins can carry aquaphobic molecules such as lipids, steroids and retinoids [1, 2, 3]. Lipocalins have important applications in several stress responses, homeostasis, candidate markers of renal functions and allergic inflammations [4, 5, 6]. The biological role of lipocalin in human body is shown in Fig. 1.

Fig. 1.

The biological role of lipocalin in human body.

After the Human Genome Project (HGP), biological sequence data increased significantly [7, 8, 9]. The traditional research technology based on biochemistry is time-consuming, expensive and inefficient. Therefore, it is necessary to develop computational methods that can accurately recognize biomolecular functions in a short time [10, 11, 12]. Existing computating tools, such as FASTA [13], HAlign [14, 15] and BLAST [16], can search sequences with the help of known protein databases. However, these tools cannot correctly distinguish lipocalins when there is no homologous sequence in benchmark dataset. Therefore, it is urgent to establish a machine learning-based model to identify lipocalins. In previous methods, a model called lipocalin-pred [17] was established to recognize lipocalins by using amino acid composition (AAC), reduced AAC [18, 19], di-peptide composition (DPC), secondary structure composition (SSC) and position-specific scoring method (PSSM). It could yield an accuracy of 90.72%. Pugalenthi et al. [20] proposed a predictor based on support vector machine (SVM). Several features, such as AAC, SSC and physiochemical properties, were used. As a result, they achieved an accuracy of 84.37%. Although the two models can produce encouraging outcomes, there is still room for further improvement.

To further improve the prediction accuracy, we proposed a random forest (RF)-based model to recognize lipocalins. The flowchart of the proposed model was shown in Fig. 2. Initially, the sequences were encoded by six types of features, namely AAC, CKSAAP, GD [21], NMBroto [22], CTD [23] and PseAAC [24]. Subsequently, these features were optimized by using analysis of variance (ANOVA) [25], Maximum Relevance Maximum Distance (MRMD) [26] and minimum Redundancy Maximum Relevance (mRMR) [27] with incremental feature selection (IFS) [28]. The RF-based model was trained using the optimal features. The performance of the proposed model was evaluated by 10-fold cross-validation.

Fig. 2.

The flowchart of the whole study. (A) Dataset was constructed and then divided it in to training and testing data. (B) Extracted features by utilizing six types of feature descriptors and then optimized the features by using ANOVA and mRMR. (C) Constructed the models by utilizing different classifiers on 5/10-fold CV. (D) Evaluated the models on independent dataset on the basis of accuracy, specificity, sensitivity, MCC and AUC.

2. Materials and Methods

Reliable and accurate dataset is essential for the establishment of prediction model [29, 30, 31, 32, 33, 34]. Therefore, we firstly collected 307 positive and 307 negative samples from the open-source database UniProt, and then excluded highly similar sequences using CD-HIT with the cutoff of 40% [35]. Finally, we obtained 211 lipocalins and 211 non-lipocalin proteins. In addition, we also built an independent dataset, including 42 lipocalin protein sequences and 42 non-lipocalin protein sequences, to test the prediction model.

2.1 Feature Descriptors

Choosing informative and autonomous feature is a significant step in generating machine learning-based models [36, 37, 38, 39, 40, 41]. Formulating sequence with mathematical expression is also a crucial step in protein function prediction [42, 43, 44, 45, 46, 47, 48, 49]. Hence, six types of features were utilized to describe the residues sequences of proteins.

2.1.1 Amino Acid Composition Descriptor

AAC is the frequency of amino acid residues in a protein sequence [50, 51, 52, 53, 54]. The frequencies $f\left(x\right)$of 20 residues can be calculated by

(1)$f\left(x\right)=\frac{N\left(x\right)}{N}x\in\{\textit{ACDEFGHIKLMNPQRSTVWY}\}$

where $N\left(x\right)$ is the x-th residue in a protein sequence with N residues.

2.1.2 Composition of k-spaced Amino Acid Pairs Descriptor

CKSAAP describes the occurrence of amino acid pairs disengaged by any K amino acid (K = 0, 1, 2, 3, 4, 5). It [50] is demarcated as k-spaced residual pairs $Q_{xy}$ which is formulated as

(2)$Q_{xy}=\frac{N_{xy}}{N-k}\quad(k=0,1,2,3,4,5\text{ and }xy=\text{ type of }AA)$

where $N_{xy}$ is the number of residue pairs and ‘k’ denotes the number of residues. In this work, for saving calculation time, the value of ‘k’ was set to 3 and the dimension of the features is 1600.

2.1.3 Pseudo Amino Acid Composition Descriptor

It contains the occurrence frequency of amino acids and the correlation of physiochemical properties between two amino acid residues [55]. It comprises of ${Ac}_{i}$ and ${Ac}_{\partial{}i}$ which can be formulated as

(3)$\displaystyle Ac_{i}=\frac{N_{i}}{1+\omega\times\sum_{i=1}^{20}\theta_{i}}$ $\displaystyle\left(\text{here}\theta_{i}=\frac{\sum_{i=1}^{N-d}\left(Q_{i}-Q_{% i+d}\right)^{2}}{N_{Q}},(i=1,2,3\ldots,20)\right)$

(4)$Ac_{\partial i}=\frac{\omega\times\theta_{i}}{1+\omega\times\sum_{i=1}^{20}% \theta_{i}},(\text{ here },\omega=0.05)$

where $N_{Q}$ is properties number and $N_{i}$ is the i-th frequency of amino acid. $Q_{i}$ is the i-th physicochemical property value and ${\theta{}}_{i}$ is order factor of protein sequence.

2.1.4 Composition, Transition and Distribution Descriptor

CTD describes the composition, transition and distribution of AAC in a protein sequence [56]. Amino acids are separated into three different classes on the basis of their physiochemical properties [18, 57, 58]. It is calculated as follows:

(5)$C_{a}=\frac{N_{a}}{N}\quad(a=1,2,3\ldots)$

(6)$T_{b}=\frac{N_{b,c}+N_{c,b}}{N-1}\quad(b=1,2,3\ldots,c\neq b)$

(7)$D_{b,z}=\frac{N_{b,z}}{N}\quad(b=1,2\ldots,\text{ and }z=1,0.15N\ldots,N)$

where $N_{a}$ is number of classes, $N_{b,c}$ is the contiguous number of classes b and c and $N_{b,z}$ is the amino acids number which is in the z-th of b-th class.

2.1.5 Geary Descriptor

It is a type of association descriptor and has an extreme resemblance with M-descriptor [59]. The mathematical manifestation is shown as $Q(m)$:

(8)$Q(m)=\frac{N-1}{2\times(N-m)}\times\frac{\sum_{i=1}^{N-m}\left(P_{i}-P_{i+m}% \right)^{2}}{\sum_{i=1}^{N}\left(P_{i}-m\right)^{2}}\quad(m=1,2,\ldots,20)$

where $P_{i}$ is i-th amino acids property value in the amino-acid index.

2.1.6 Normalized Moreau-broto Autocorrelation Descriptor

It is a kind of autocorrelation function [22] and has a resemblance with Moran-descriptor. The mathematical manifestation is shown as

(9)$Q(m)=\frac{\sum_{i=1}^{N-m}\left(P_{i}\times P_{i+m}\right)}{N-m}\quad(m=1,2,3% ,\ldots,20)$

where $P_{i}$ is i-th amino acids property value in the amino-acid index.

2.2 Feature Selection

Redundancy and noise in feature set may lead to disappointing performance of prediction model [60, 61, 62]. Thus, feature selection is a key step to eliminate unimportant features and improve efficiency of prediction model [63, 64, 65, 66]. There are several feature selection techniques, such as ANOVA [25], F-score [67], mRMR [27], chi-square [68] and LGBM [69, 70]. A feature set with high dimension may produce redundancy, overfitting and yield low accuracy in cross-validation prediction. Hence, ANOVA is a good choice to deal with these problems, because it takes less time and produce effective outcomes. The fusion of feature does not mean that good results can be achieved. These features may be highly correlated, which will lead to the emergence of redundant information in the feature set. Therefore, mRMR is an ideal choice to overcome these problems, because it is able to find the correlation between features. In this work, ANOVA and mRMR [27] were used to rank features. By combining with IFS [67], the optimal feature subset could be obtained. The details about ANOVA, mRMR and IFS can be found in our previous study [24]. The comparison with other different kind of feature selection techniques and the contribution of feature descriptors have been shown in Fig. 3A–D.

Fig. 3.

Plot showing the IFS procedure for identifying lipocalins in 10-fold cross-validation. (A) Firstly, 890 features were selected from a total of 2990 features by ANOVA. (B) 117 optimal features were further obtained from 890 features by mRMR. The accuracy increases from 88.20% to 95.03%. (C) The contribution of different feature descriptors in the model based on fusion features. (D) Compare different feature selection algorithms. (E) The performance of single-encoded features and their fusion on different classifiers before feature selection. (F) The performance of optimal single-encoded features and their fusion on different classifiers. (G) AUC value of Lipo-RF on 10-fold cross validation. (H) Comparison of proposed model with Lipocalin-pred on independent dataset.

2.3 Machine Learning Classifiers

Classification is a form of supervised learning and plays an important role in decision making [60, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]. In this work, we chose RF to establish a model for recognizing lipocalin. Three machine learning methods namely Naïve Bayes (NB), support vector machine (SVM) [81, 82], and Ada boost (AB) [83, 84] were compared. RF is a comprehensive knowledge technology, which has been widely used in bioinformatics [85, 86, 87]. The principle is to combine multiple weak classifiers and get the results through the voting process, so that the results of the prediction model have the greatest improvement and generalization. The complete procedure has been clearly described in reference [88]. Weka version 3.8.4 (University of Waikato, Hamilton, New Zealand) [89] was utilized to implement the RF-based classifiers. The best parameters were shown in Table 1.

Table 1.Best parameters of the proposed model.
 Best Parameters ‘N-estimators’ 100 ‘Learning-rate’ 0.001 ‘Mean absolute error’ 0.143 ‘Kappa statistics’ 0.900 ‘Mean square error’ 0.220
2.4 Evaluation Metrics

Accuracy (Acc), sensitivity (Sn), Matthews correlation coefficient (MCC) and specificity (Sp) [90, 91] were utilized to measure the performance of proposed model, which is expressed by the following formula:

(10)$\begin{gathered}\displaystyle Sn=\frac{TP}{TP+FN}\\ \displaystyle Sp=\frac{TN}{TN+FP}\\ \displaystyle ACC=\frac{TP+TN}{TP+FP+TN+FN}\\ \displaystyle MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times(TN+FN)% \times(TP+FP)\times(TN+FP)}}\end{gathered}$

where true positive indicates lipocalin, and false positive indicates the non-lipocalin classified as lipocalin. On the other hand, true negative represents non-lipocalin, and false negative represents lipocalin classified as non-lipocalin. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to measure the effectiveness of the prediction model. AUC of perfect classifier is 1 and AUC of random behavior is 0.5.

3. Results and Discussion
3.1 Performance Evaluation

Firstly, six different feature descriptors were used to transform the training data into feature vectors. Then, through 10-fold cross-validation, RF-based classifier was used to evaluate each feature descriptor. Subsequently, in order to improve the prediction accuracy, ANOVA and mRMR combined with IFS were used to select the optimal feature subset. Fig. 3A,B show the incremental feature selection curves. Fig. 3E,F show the AUC difference of single-encodings and their fusion on different ML-based classifiers before and after the feature selection. Table 2 shows the effectiveness of the improved prediction models based on single-encoding and feature fusion on several ML-based methods. The results of the model based on single-encoding and their fusion on several ML-based classifiers before feature selection have been shown in Supplementary Table 1 in Supplementary file 1. Supplementary Fig. 1 in Supplementary file 1 and Fig. 4 show the feature distribution of single-encoding features and fusion features before and after feature screening using t-SNE (t-distributed stochastic neighbor embedding) technique. The AUCs of single-encoding models are 0.928, 0.938, 0.886, 0.889, 0.876 and 0.939 for AAC, CKSAAP, CTD, GD, NMBroto and PseAAC, respectively. The AUC of PseAAC is 0.1%–6.3% higher than that of other encoding schemes. On the other hand, the Acc, Sp, Sn, MCC, and AUC of the feature fusion-based model are 95.03%, 94.00%, 96.20%, 0.901% and 0.987, respectively. The Acc, Sp, Sn, MCC, and AUC on independent data set are 89.90%, 92.66%, 91.73%, 0.868% and 0.956. The AUCs of the feature fusion-based model on training and independent datasets have been shown in Fig. 3G,H.

Table 2.Performance of optimized single-encodings and fusion models on different machine learning classifiers.
 Classifiers Support Vector Machine Random Forest Naïve Bayes Ada Boost Method Acc Sp Sn MCC AUC Acc Sp Sn MCC AUC Acc Sp Sn MCC AUC Acc Sp Sn MCC AUC AAC 62.11 74.70 79.60 0.520 0.827 84.25 84.68 87.36 0.703 0.928 76.54 77.00 74.10 0.681 0.894 73.43 74.70 79.60 0.672 0.871 CKSAAP 66.77 65.56 84.55 0.245 0.743 86.79 84.82 90.77 0.723 0.938 87.46 88.11 89.55 0.743 0.922 88.73 89.20 90.30 0.867 0.948 CTD 63.13 78.10 74.55 0.541 0.778 81.32 77.55 88.78 0.712 0.886 73.87 68.00 83.80 0.587 0.843 74.14 73.50 79.50 0.611 0.873 Geary 74.02 77.30 81.40 0.576 0.854 85.93 83.14 88.28 0.745 0.889 86.09 84.52 89.33 0.756 0.889 73.49 70.10 68.90 0.573 0.791 NMBroto 79.38 80.60 82.00 0.623 0.890 85.38 80.50 76.60 0.735 0.876 74.48 81.60 77.20 0.600 0.877 82.81 80.20 79.88 0.705 0.862 PseAAC 86.81 83.70 88.88 0.725 0.862 88.90 83.10 91.80 0.741 0.939 78.12 78.20 88.00 0.621 0.884 89.64 84.55 87.11 0.762 0.905 Fusion 88.71 94.00 92.00 0.862 0.971 95.03 94.00 96.20 0.901 0.987 90.85 88.90 89.70 0.772 0.936 92.78 90.10 90.50 0.873 0.968
Fig. 4.

The visualization of single encoding features and fusion feature through t-SNE. (A) AAC, (B) CKSAAP, (C) PseAAC, (D) CTD, (E) GD, (F) NMBroto and (G) fusion feature. Orange and blue represent lipocalins and non-lipocalins, respectively.

3.2 Performance Evaluation of Different ML Algorithms

In order to compare a variety of machine learning models, we input single-encoding features and their fusion into other machine learning methods, such as AB, NB and SVM. The 10-fold cross-validation test was used to estimate the efficiency of these models. The comparison results have been shown in Table 2. We noticed that the accuracies of feature-fusion models were higher than those of single-encoding models, demonstrating that a large amount of information can achieve better results. Fig. 3C shows the contribution of the feature descriptors in RF-based fusion model. The model based on the optimal fusion features consists of 117 features from six descriptors, AAC, CKSAAP, CTD, GD, NMBroto and PseAAC, contributed 9 %, 11%, 13%, 15% and 27% in the final optimized-fusion model, respectively. Fig. 3F displays that the RF-based prediction model performs best among all classifiers. The AUC of the RF-based model is 1.6%–5.1% higher than that of other classifiers, demonstrating that the RF-based model is suitable for lipocalin proteins prediction.

3.3 Comparison with Existing Model on Independent Dataset

We also compared our model with the existing model on independent dataset to examine the efficiency and performance of the models. The results on independent dataset show that our model outperformed the existing model by 4.17%. The comparison between our model and the existed model has been shown in Table 3 (Ref. [17]).

Table 3.Comparison between proposed model and the existing method on independent dataset.
 Method Acc MCC Sn Sp AUC Reference Lipocalin-Pred 85.73 0.776 88.41 90.11 0.922 [17] Lipo-RF 89.90 0.868 91.73 92.66 0.956 Our Study
4. Conclusions

Lipocalin are responsible for transporting small hydrophobic molecules such as steroids, retinoids, and lipids. They have sequence homology region and common tertiary structure [92, 93]. Lipocalins have been applied in several fields, such as stress responses, homeostasis, candidate markers for kidney functions and allergic infections. So for, some models have been developed to identify lipocalins [17, 20]. In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the best features which could produce the maximum accuracy. On the basis of the best feature subset, the RF-based classifier can obtained the best prediction results. Further studies will focus on creating a user-friendly web server for the prediction model, and will adopt additional feature selection methods and algorithms to further improve the efficiency of lipocalin recognition.

Data Availability Statement

The codes and data can be found in https://zenodo.org/record/5844993#.YeAL7fgRVPZ.

Author Contributions

HZ—Conceptualization, Methodology, Coding, Data curation, Visualization, Writing-Original draft preparation. RSK—Data curation, Methodology. CYM—Data curation, Methodology. ZA—Data curation. BKGM—Data curation. XLY—Reviewing and Editing. ZYZ—Reviewing, Editing and Supervision.

Ethics Approval and Consent to Participate

Not applicable.

Acknowledgment

We are very thankful to Hao Lin Center for Informational Biology, University of Electronic Science and Technology of China for their constructive suggestions and support on this work.

Funding

This work has been supported by the grant from National Natural Science Foundation of China (62102067).

Conflict of Interest

The authors declare no conflict of interest.

Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Share