Comprehensive Prediction of Lipocalin Proteins Using Artificial Intelligence Strategy

¹ School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China

² School of Materials Science and Engineering, Hainan University, 570228 Haikou, Hainan, China

*Correspondence: yuxiaolong@hainanu.edu.cn (Xiao-Long Yu); zyzhang@uestc.edu.cn (Zhao-Yue Zhang)
Academic Editor: Graham Pawelec

Front. Biosci. (Landmark Ed) 2022, 27(3), 84; https://doi.org/10.31083/j.fbl2703084

Submitted: 2 December 2021 | Revised: 17 January 2022 | Accepted: 20 January 2022 | Published: 5 March 2022

(This article belongs to the Special Issue Computational biomarker detection and analysis)

This is an open access article under the CC BY 4.0 license.

Abstract

Background: Lipocalins belong to the calycin family, and their sequences are generally between 165 and 200 residues long. They are mainly stable, multifunctional extracellular proteins that play an important role in several stress responses and allergic inflammations. Because accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize them. Methods: In this study, we constructed a prediction model to identify lipocalins. Protein sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized using feature selection techniques, and a random forest (RF) classifier was trained on the optimal features. Results: In 10-fold cross-validation, our computational model classified lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, the model achieved an accuracy of 89.90%, which was 4.17% higher than that of the existing model. Conclusions: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. Protein sequences were encoded by six descriptors, and feature selection was then performed to pick out the feature subset that produced the maximum accuracy. On the basis of this best feature subset, the RF-based classifier obtained the best prediction results.
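To make the sequence-encoding step concrete, the following is a minimal Python sketch of two of the six descriptors named in the abstract, AAC and CKSAAP. It is an illustration only, not the authors' implementation: the function names, the maximum gap `k_max=2`, and the choice to drop non-standard residues before counting are all assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def cksaap(sequence, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max.

    For each gap k, residue pairs (seq[i], seq[i + k + 1]) are counted and
    normalised by the number of such pairs, giving 400 values per gap.
    k_max=2 is an assumed default, not a value taken from the paper.
    """
    # Assumption: non-standard residues are removed before pairing.
    seq = "".join(r for r in sequence.upper() if r in AMINO_ACIDS)
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = len(seq) - k - 1  # number of pairs separated by k residues
        for i in range(max(total, 0)):
            counts[seq[i] + seq[i + k + 1]] += 1
        features.extend(counts[p] / total if total > 0 else 0.0 for p in pairs)
    return features


def encode(sequence, k_max=2):
    """Concatenate AAC (20 values) and CKSAAP (400 per gap) into one vector."""
    return aac(sequence) + cksaap(sequence, k_max)
```

Vectors produced this way could then be concatenated with the remaining descriptors, reduced by feature selection, and passed to a random forest classifier, as the abstract describes.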

Keywords

lipocalins; bioinformatics; feature extraction; optimization; validation

Figures