ET-MSF: a model stacking framework to identify electron transport proteins

Yizheng Wang; Qingfeng Pan; Xiaobin Liu; Yijie Ding

doi:10.31083/j.fbl2701012

References

[1]Chance B, Williams GR. The respiratory chain and oxidative phosphorylation. Advances in Enzymology and Related Subjects of Biochemistry. 1956; 17: 65–134.
[2]Foyer CH, Harbinson J. Oxygen metabolism and the regulation of photosynthetic electron transport. In: Causes of photooxidative stress and amelioration of defense systems in plants (pp. 1–42). CRC Press: Boca Raton. 2019.
[3]Mrozek D, Malysiak B, Kozielski S. ‘An optimal alignment of proteins energy characteristics with crisp and fuzzy similarity awards’, 2007 IEEE International Fuzzy Systems Conference. London, UK. 2007.
[4]Hu Y, Qiu S, Cheng L. Integration of Multiple-Omics Data to Analyze the Population-Specific Differences for Coronary Artery Disease. Computational and Mathematical Methods in Medicine. 2021; 2021: 7036592.
[5]Ritov VB, Menshikova EV, Azuma K, Wood R, Toledo FGS, Goodpaster BH, et al. Deficiency of electron transport chain in human skeletal muscle mitochondria in type 2 diabetes mellitus and obesity. American Journal of Physiology-Endocrinology and Metabolism. 2010; 298: E49–E58.
[6]Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting Diabetes Mellitus with Machine Learning Techniques. Frontiers in Genetics. 2018; 9: 515.
[7]Qu K, Zou Q, Shi H. Prediction of diabetic protein markers based on an ensemble method. Frontiers in Bioscience-Landmark. 2021; 26: 207–221.
[8]Parker WD, Boyson SJ, Parks JK. Abnormalities of the electron transport chain in idiopathic parkinson’s disease. Annals of Neurology. 1989; 26: 719–723.
[9]Parker WD, Filley CM, Parks JK. Cytochrome oxidase deficiency in Alzheimer’s disease. Neurology. 1990; 40: 1302–1303.
[10]Xu L, Liang G, Liao C, Chen G, Chang C. K-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer’s Disease Protein Identification. Frontiers in Genetics. 2020; 10: 33.
[11]Hu Y, Sun J, Zhang Y, Zhang H, Gao S, Wang T, et al. Rs1990622 variant associates with Alzheimer’s disease and regulates TMEM106B expression in human brain tissues. BMC Medicine. 2021; 19: 11.
[12]Hu Y, Zhang H, Liu B, Gao S, Wang T, Han Z, et al. Rs34331204 regulates TSPAN13 expression and contributes to Alzheimer’s disease with sex differences. Brain. 2020; 143: e95–e95.
[13]Le N, Nguyen T, Ou Y. Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties. Journal of Molecular Graphics and Modelling. 2017; 73: 166–178.
[14]Khatun M, Hasan M, Kurata H. PreAIP: computational prediction of anti-inflammatory peptides by integrating multiple complementary features. Frontiers in Genetics. 2019; 10: 129.
[15]Hasan MM, Guo D, Kurata H. Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information. Molecular BioSystems. 2017; 13: 2545–2550.
[16]Hasan MM, Zhou Y, Lu X, Li J, Song J, Zhang Z. Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs. PLoS ONE. 2015; 10: e0129635.
[17]Le N, Ho Q, Ou Y. Incorporating deep learning with convolutional neural networks and position specific scoring matrices for identifying electron transport proteins. Journal of Computational Chemistry. 2017; 38: 2000–2006.
[18]Chen S, Ou Y, Lee T, Gromiha MM. Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics. 2011; 27: 2062–2067.
[19]Mishra NK, Chang J, Zhao PX. Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS ONE. 2014; 9: e100278.
[20]Le NQK, Yapp EKY, Yeh H. ET-GRU: using multi-layer gated recurrent units to identify electron transport proteins. BMC Bioinformatics. 2019; 20: 377.
[21]Gromiha MM, Yabuki Y. Functional discrimination of membrane proteins using machine learning techniques. BMC Bioinformatics. 2008; 9: 135.
[22]Ru X, Li L, Zou Q. Incorporating Distance-Based top-n-gram and Random Forest to Identify Electron Transport Proteins. Journal of Proteome Research. 2019; 18: 2931–2939.
[23]Bateman A, Martin MJ, O’Donovan C, Magrane M, Apweiler R, Alpi E, et al. UniProt: a hub for protein information. Nucleic Acids Research. 2014; 43: D204–D212.
[24]Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000; 25: 25–29.
[25]Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology. 1999; 292: 195–202.
[26]Su C, Chen C, Ou Y. Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics. 2006; 7: 319.
[27]Le N-Q-K, Ou Y-Y. Prediction of FAD binding sites in electron transport proteins according to efficient radial basis function networks and significant amino acid pairs. BMC Bioinformatics. 2016; 17: 298.
[28]Li Z, Zhao Y, Pan G, Tang J, Guo F. A Novel Peptide Binding Prediction Approach for HLA-DR Molecule Based on Sequence and Structural Information. BioMed Research International. 2016; 2016: 3832176.
[29]Mrozek D, Malysiak-Mrozek B, Kozielski S. Alignment of Protein Structure Energy Patterns Represented as Sequences of Fuzzy Numbers. 2009 Annual Meeting of the North American Fuzzy Information Processing Society 2009; 35–40.
[30]Hong Z, Zeng X, Wei L, Liu X. Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics. 2019; 36: 1037–1043.
[31]Zeng X, Liao Y, Liu Y, Zou Q. Prediction and Validation of Disease Genes Using HeteSim Scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017; 14: 687–695.
[32]Zeng X, Lin W, Guo M, Zou Q. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Computational Biology. 2017; 13: e1005420.
[33]Cai L, Wang L, Fu X, Xia C, Zeng X, Zou Q. ITP-Pred: an interpretable method for predicting, therapeutic peptides with fused features low-dimension representation. Briefings in Bioinformatics. 2021; 22: bbaa367.
[34]Cheng L, Qi C, Zhuang H, Fu T, Zhang X. GutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Research. 2020; 48: D554–D560.
[35]Cheng L, Hu Y, Sun J, Zhou M, Jiang Q. DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics. 2018; 34: 1953–1956.
[36]Wu Y, Lu X, Shen B, Zeng Y. The Therapeutic Potential and Role of miRNA, lncRNA, and circRNA in Osteoarthritis. Current Gene Therapy. 2019; 19: 255–263.
[37]Yu L, Wang M, Yang Y, Xu F, Zhang X, Xie F, et al. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLOS Computational Biology. 2021; 17: e1008696.
[38]Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997; 25: 3389–3402.
[39]Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10: 421.
[40]Chou K, Shen H. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications. 2007; 360: 339–345.
[41]Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; 20: 273–297.
[42]Tao Z, Li Y, Teng Z, Zhao Y. A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD. Computational and Mathematical Methods in Medicine. 2020; 2020: 1–9.
[43]Jiang Q, Wang G, Jin S, Li Y, Wang Y. Predicting human microRNA-disease associations based on support vector machine. International Journal of Data Mining and Bioinformatics. 2013; 8: 282–293.
[44]Su R, Liu X, Jin Q, Liu X, Wei L. Identification of glioblastoma molecular subtype and prognosis based on deep MRI features. Knowledge-Based Systems. 2021; 232: 107490.
[45]Su R, Wu H, Xu B, Liu X, Wei L. Developing a Multi-Dose Computational Model for Drug-Induced Hepatotoxicity Prediction Based on Toxicogenomics Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019; 16: 1231–1239.
[46]Liu J, Su R, Zhang J, Wei L. Classification and gene selection of triple-negative breast cancer subtype embedding gene connectivity matrix in deep neural network. Briefings in Bioinformatics. 2021. (in press)
[47]Cheng L, Yang H, Zhao H, Pei X, Shi H, Sun J, et al. MetSigDis: a manually curated resource for the metabolic signatures of diseases. Briefings in Bioinformatics. 2019; 20: 203–209.
[48]Lu X, Zhao S. Gene-based Therapeutic Tools in the Treatment of Cornea Disease. Current Gene Therapy. 2019; 19: 7–19.
[49]Tahir M, Idris A. MD-LBP: An Efficient Computational Model for Protein Subcellular Localization from HeLa Cell Lines Using SVM. Current Bioinformatics. 2020; 15: 204–211.
[50]Meng C, Guo F, Zou Q. CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes. Computational Biology and Chemistry. 2020; 87: 107304.
[51]Kuo J, Chang C, Chen C, Liang H, Chang C, Chu Y. Sequence-based Structural B-cell Epitope Prediction by Using Two Layer SVM Model and Association Rule Features. Current Bioinformatics. 2020; 15: 246–252.
[52]Hasan MM, Basith S, Khatun MS, Lee G, Manavalan B, Kurata H. Meta-i6mA: an interspecies predictor for identifying DNA N6-methyladenine sites of plant genomes by exploiting informative features in an integrative machine-learning framework. Briefings in Bioinformatics. 2021; 22: bbaa202.
[53]Breiman L. Random Forests. Machine Learning. 2001; 45: 5–32.
[54]Qi Y. Random forest for bioinformatics. In: Ensemble machine learning (pp. 307–323). Springer: Berlin/Heidelberg, Germany. 2012.
[55]Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019; 16: 1264–1273.
[56]Wei L, Su R, Wang B, Li X, Zou Q, Gao X. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing. 2019; 324: 3–9.
[57]Su R, Liu X, Wei L, Zou Q. Deep-Resp-Forest: a deep forest model to predict anti-cancer drug response. Methods. 2019; 166: 91–102.
[58]Cheng L, Han X, Zhu Z, Qi C, Wang P, Zhang X. Functional alterations caused by mutations reflect evolutionary trends of SARS-CoV-2. Briefings in Bioinformatics. 2021; 22: 1442–1450.
[59]Chen X, Shi W, Deng L. Prediction of Disease Comorbidity Using HeteSim Scores based on Multiple Heterogeneous Networks. Current Gene Therapy. 2019; 19: 232–241.
[60]Ao C, Yu L, Zou Q. RFhy-m2G: Identification of RNA N2-methylguanosine Modification Sites Based on Random Forest and Hybrid Features. Methods. 2021. (in press)
[61]Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Briefings in Functional Genomics. 2021; 20: 1–18.
[62]Hasan MM, Alam MA, Shoombuatong W, Deng H, Manavalan B, Kurata H. NeuroPred-FRL: an interpretable prediction model for identifying neuropeptide using feature representation learning. Briefings in Bioinformatics. 2021. (in press)
[63]Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery: New York, NY, United States. 2016.
[64]Tianqi C, Tong H. Higgs Boson Discovery with Boosted Trees. 2014 International Conference on High-Energy Physics and Machine Learning. Valencia, Spain, 2-9 July 2014.
[65]Torlay L, Perrone-Bertolotti M, Thomas E, Baciu M. Machine learning-XGBoost analysis of language networks to classify patients with epilepsy. Brain Informatics. 2017; 4: 159–169.
[66]Cai L, Ren X, Fu X, Peng L, Gao M, Zeng X. IEnhancer-XG: interpretable sequence-based enhancers and their strength predictor. Bioinformatics. 2021; 37: 1060–1067.
[67]Yu X, Zhou J, Zhao M, Yi C, Duan Q, Zhou W, et al. Exploiting XG Boost for Predicting Enhancer-promoter Interactions. Current Bioinformatics. 2020; 15: 1036–1045.
[68]Hasan MM, Schaduangrat N, Basith S, Lee G, Shoombuatong W, Manavalan B. HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics. 2020; 36: 3350–3356.
[69]Breiman L. Bagging predictors. Machine Learning. 1996; 24: 123–140.
[70]Małysiak-Mrozek B, Baron T, Mrozek D. Spark-IDPP: high-throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud. Cluster Computing. 2019; 22: 487–508.
[71]Wei L, Xing P, Zeng J, Chen J, Su R, Guo F. Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine. 2017; 83: 67–74.
[72]Wei L, Wan S, Guo J, Wong KK. A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine. 2017; 83: 82–90.
[73]Wei L, Tang J, Zou Q. Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information. Information Sciences. 2017; 384: 135–144.
[74]Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. Journal of Chemical Information and Modeling. 2020; 60: 1876–1883.
[75]Ding Y, Tang J, Guo F. Identification of Protein-Ligand Binding Sites by Sequence Information and Ensemble Classifier. Journal of Chemical Information and Modeling. 2017; 57: 3149–3161.
[76]Fu X, Cai L, Zeng X, Zou Q. StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics. 2020; 36: 3028–3034.
[77]Yu L, Xia M, An Q. A network embedding framework based on integrating multiplex network for drug combination prediction. Briefings in Bioinformatics. 2021. (in press)
[78]Ru X, Cao P, Li L, Zou Q. Selecting Essential MicroRNAs Using a Novel Voting Method. Molecular Therapy - Nucleic Acids. 2019; 18: 16–23.
[79]Zhu H, Du X, Yao Y. ConvsPPIS: Identifying Protein-protein Interaction Sites by an Ensemble Convolutional Neural Network with Feature Graph. Current Bioinformatics. 2020; 15: 368–378.
[80]Sultana N, Sharma N, Sharma KP, Verma S. A Sequential Ensemble Model for Communicable Disease Forecasting. Current Bioinformatics. 2020; 15: 309–317.
[81]Xu Z, Luo M, Lin W, Xue G, Wang P, Jin X, et al. DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Briefings in Bioinformatics. 2021; 22: bbab335.
[82]Huang Y, Zhou D, Wang Y, Zhang X, Su M, Wang C, et al. Prediction of transcription factors binding events based on epigenetic modifications in different human cells. Epigenomics. 2020; 12: 1443–1456.
[83]Zhang L, Xiao X, Xu ZC. iPromoter-5mC: A Novel Fusion Decision Predictor for the Identification of 5-Methylcytosine Sites in Genome-Wide DNA Promoters. Frontiers in Cell and Developmental Biology. 2020; 8: 614.
[84]Wang H, Tang J, Ding Y, Guo F. Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Briefings in Bioinformatics. 2021; 22: bbaa409.
[85]Ding Y, Tang J, Guo F. Identification of Drug-Target Interactions via Dual Laplacian Regularized least Squares with Multiple Kernel Fusion. Knowledge-Based Systems. 2020; 204: 106254.
[86]Ding Y, Tang J, Guo F. Identification of drug-target interactions via fuzzy bipartite local model. Neural Computing and Applications. 2020; 32: 10303–10319.
[87]Zeng X, Zhu S, Liu X, Zhou Y, Nussinov R, Cheng F. DeepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics. 2019; 35: 5191–5198.
[88]Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chemical Science. 2020; 11: 1775–1797.
[89]Zhai Y, Chen Y, Teng Z, Zhao Y. Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions. Frontiers in Cell and Developmental Biology. 2020; 8: 591487.
[90]Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Frontiers in Bioengineering and Biotechnology. 2020; 8: 584807.
[91]Jin Q, Cui H, Sun C, Meng Z, Su R. Free-form tumor synthesis in computed tomography images via richer generative adversarial network. Knowledge-Based Systems. 2021; 218: 106753.
[92]Wu X, Yu L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics. 2021; 37: 4314–4320.

Open Access | 11 Jan 2022 | Original Research

Yizheng Wang ¹,², Qingfeng Pan ³,*, Xiaobin Liu ⁴,*, Yijie Ding ²,*

Affiliations

¹ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China

² Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 324000 Quzhou, Zhejiang, China

³ Department of Oncology Radiology, Beidahuang Industry Group General Hospital, 150000 Harbin, Heilongjiang, China

⁴ Department of Nephrology, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214023 Wuxi, Jiangsu, China

* Correspondence: qingfengpan1074@163.com (Qingfeng Pan); viplxb163@163.com (Xiaobin Liu); wuxi_dyj@163.com (Yijie Ding)
Academic Editor: Graham Pawelec

Abstract

Introduction: The electron transport chain is closely related to cellular respiration and has been implicated in various human diseases. However, traditional “wet” experimental methods are time consuming, so identifying electron transport proteins by computational methods is important. Many approaches have been proposed, but their performance still leaves room for improvement. Methodological issues: In our study, we propose a model stacking framework that combines multiple base models. Protein features are extracted from protein sequences via PsePSSM and fed into base models including support vector machines (SVM), random forest (RF), and XGBoost. The outputs of the base models are then passed to a logistic regression model for the final prediction. Results: On the independent dataset, the accuracy and Matthews correlation coefficient (MCC) of the proposed method are 95.70% and 0.8756, respectively. Furthermore, we show that the model stacking framework statistically outperforms single machine learning classifiers. Conclusion: Our model performs better than most known strategies for identifying electron transport proteins and can be used to identify them more precisely.

Keywords

Electron transport chain
Ensemble learning
Model stacking
Logistic regression
Transport protein

1. Introduction

Protein is a vital component of all human cells and tissues and is intimately linked to life and to many forms of biological activity, such as cellular respiration. Cellular respiration is the process by which organic matter is broken down inside cells through a series of oxidation steps into inorganic matter or small organic molecules, releasing energy and producing adenosine triphosphate (ATP), the most direct source of energy for most cellular reactions [1]. In this process, the electron transport chain is critical for storing and transferring electrons. Five protein complexes, named complex I, II, III, IV, and V, make up the electron transport chain. Electron transport proteins are made up of many electron carriers and serve a variety of molecular functions [2, 3, 4]. Electron transporter abnormalities have been found to be associated with diseases such as diabetes [5, 6, 7], Parkinson’s disease [8], and Alzheimer’s disease [9, 10, 11, 12]. Therefore, identifying electron transport proteins is helpful in exploring the causes of human diseases and may help prevent and treat them. Because identifying protein functions with traditional experimental techniques is slow and costly, computational approaches are needed. Constructing meaningful feature sets and selecting appropriate classification algorithms are considered the two most important steps in protein classification. When constructing feature sets, some studies have taken advantage of the biochemical properties of proteins. Le et al. [13] extracted protein features from biochemical properties, which improved the accuracy of identifying electron transporters. Khatun et al. [14] developed a model for predicting anti-inflammatory peptides that uses binary encoding as the feature representation of proteins.
Others employed the position-specific scoring matrix (PSSM) to extract the evolutionary information of proteins. In SulCysSite [15], binary encoding, PSSM profiles, and pCKSAAP are combined as feature information to predict protein S-sulfenylation sites. Hasan et al. [16] used PSSM profiles to preserve the evolutionary information in proteins and developed pbPUP, a model for identifying protein pupylation sites that performed well. Le et al. [17] proposed ET-CNN, which feeds PSSM profiles into a convolutional neural network (CNN) for electron transport protein classification. Chen et al. [18] used PSSM profiles and amino acid composition (AAC) to extract protein features, which were fed into radial basis function networks. Mishra et al. [19] fed features including amino acid composition, biochemical properties, and PSSM profiles into a support vector machine (SVM) to predict transporters, including electron transporters.

ET-GRU is a model proposed by Le et al. [20] to identify electron transport proteins. A CNN extracts features from the PSSM before deep gated recurrent units (GRU) perform the classification. In addition, traditional machine learning algorithms have been widely used. Gromiha et al. [21] and Ru et al. [22] used machine learning methods, including support vector machine, logistic regression, decision tree, random forest, and naive Bayes, for the functional recognition of electron transport proteins. Each single machine learning classifier has both advantages and disadvantages. For example, random forest performs well on unbalanced datasets but places high demands on the features. XGBoost can achieve excellent performance, but only after a large number of complex parameters have been tuned, which is difficult. Ensemble learning can flexibly combine various classifiers: different models are trained as base classifiers and then combined with an ensemble strategy for the final prediction. Common integration strategies include stacking, majority voting, bagging, and boosting.

In our study, we propose a model stacking framework (MSF) that combines multiple base models. We first generate a PSSM for each protein with PSI-BLAST and then use Pseudo-PSSM (PsePSSM) to extract evolutionary information from it. These features are fed into the base models, which are different classification algorithms such as random forest, XGBoost, k-nearest neighbor (KNN), and SVM. The base models are loosely coupled. After the results of the classifiers are combined, logistic regression makes the final prediction. The experimental results show that the MSF built from SVM, XGBoost, and KNN performs best. We compare the MSF with single classifiers and with majority voting on the same independent dataset and perform a t test, which shows that the MSF is significantly superior to the single classifiers. In addition, the MSF also performs better than most existing methods for identifying electron transport proteins.

2. Materials and methods

2.1 Datasets

In our study, we utilize the benchmark dataset released by Le et al. [20], which contains 1324 electron transport proteins and 4569 general transport proteins. The data were initially taken from a previous study [17] and supplemented with data from UniProt release 2018_05 (23 May 2018) [23] and Gene Ontology (GO) release 2018-05-01 [24]. To avoid model overfitting, redundant sequences with more than 30% similarity were removed. For this binary classification problem, the dataset is randomly divided into a cross-validation (CV) dataset and an independent (IND) dataset in a ratio of 0.85:0.15. Table 1 shows the details of the datasets.

Table 1. Statistics of all retrieved electron transport proteins and general transport proteins.

Protein class	Original	Cross-validation (CV)	Independent (IND)
Electron transport	1324	1125	199
General transport	4569	3884	685
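
As a rough illustration of the 0.85:0.15 division, the sketch below shuffles the sample indices of each class and cuts them at 85%. Splitting each class separately is an assumption made here because it reproduces the counts in Table 1; the text only states that the division is random. The function name, class labels, and seed are hypothetical.

```python
import random

def stratified_split(n_per_class, cv_fraction=0.85, seed=0):
    """Split the sample indices of each class into CV and independent parts."""
    rng = random.Random(seed)
    split = {}
    for name, n in n_per_class.items():
        indices = list(range(n))
        rng.shuffle(indices)
        n_cv = round(n * cv_fraction)  # 85% of each class goes to cross-validation
        split[name] = (indices[:n_cv], indices[n_cv:])
    return split

# Class sizes from the "Original" column of Table 1.
counts = {"electron": 1324, "general": 4569}
split = stratified_split(counts)
sizes = {name: (len(cv), len(ind)) for name, (cv, ind) in split.items()}
print(sizes)
```

The resulting sizes can be compared directly with the CV and IND columns of the table.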

2.2 Feature extraction

Extracting protein features with an appropriate method is an important step in a classification task. In this work we use the PSSM, which stores the evolutionary information of a protein in a matrix of L rows and 20 columns, where L is the length of the sequence. The PSSM was first introduced by Jones [25] and is now commonly used in bioinformatics, where it has produced good results in a number of studies [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. A PSSM profile is derived from a multiple sequence alignment and contains the evolutionary information of each residue in the protein sequence. Electron transport proteins form a class of proteins with a specific function, so the PSSM and the feature extraction methods built on it can capture the evolutionary information that distinguishes them from other protein families, and a classifier can then use this information to identify them effectively.

Each FASTA sequence is searched against the UniProt database with Position-Specific Iterative BLAST (PSI-BLAST) [38] for two iterations to build its position-specific scoring matrix (PSSM). The BLAST+ [39] options are as follows:

-num_iterations 2 -db uniprot
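
A minimal sketch of how such a search might be scripted is given below. It only assembles the argument list for the BLAST+ `psiblast` program. The `-query` and `-out_ascii_pssm` options and the file names are assumptions added for illustration, since the text specifies only the iteration count and the database.

```python
def psiblast_command(fasta_path, pssm_path, db="uniprot", iterations=2):
    """Build the argument list for one PSI-BLAST run (nothing is executed here)."""
    return [
        "psiblast",
        "-query", fasta_path,                # hypothetical input file
        "-db", db,                           # database name from the text
        "-num_iterations", str(iterations),  # two iterations, as in the text
        "-out_ascii_pssm", pssm_path,        # assumed way of saving the PSSM
    ]

cmd = psiblast_command("protein.fasta", "protein.pssm")
print(" ".join(cmd))
# To run it: subprocess.run(cmd, check=True), with BLAST+ installed
# and a formatted database available under the given name.
```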

In our study, Pseudo-PSSM (PsePSSM) is employed to retain the information in the PSSM; its basic idea is to apply the pseudo-amino-acid composition concept to the PSSM [40]. Because the number of rows of a PSSM varies with sequence length, this operation converts it into a fixed-length vector that the classification algorithms can accept. It involves two steps.

The first step standardizes each row of the PSSM:

(1) $p_{i,j}^{\prime}=\frac{p_{i,j}-\frac{1}{20}\sum_{m=1}^{20}p_{i,m}}{\sqrt{\frac{1}{20}\sum_{n=1}^{20}\left(p_{i,n}-\frac{1}{20}\sum_{m=1}^{20}p_{i,m}\right)^{2}}}$
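
Eq. (1) is a row-wise z-score over the 20 amino acid columns. The sketch below implements it in plain Python on a list-of-lists matrix. The handling of a constant row (zero standard deviation) is an assumption, because the formula is undefined in that case.

```python
import math

def standardize_pssm(pssm):
    """Apply Eq. (1): z-score each row over its 20 amino acid columns."""
    standardized = []
    for row in pssm:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        if std == 0.0:
            # Assumption: a constant row carries no signal, so map it to zeros.
            standardized.append([0.0] * len(row))
        else:
            standardized.append([(v - mean) / std for v in row])
    return standardized

# Two toy rows: one with varying scores, one constant.
row = [float(v) for v in range(20)]
z = standardize_pssm([row, [3.0] * 20])
```

After standardization, every non-constant row has zero mean and unit variance, so rows with very different raw score ranges become comparable.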

The second step is to use the standardized matrix to generate Pseudo-PSSM (PsePSSM).

(2) $F_{PsePssm}=\begin{cases}\frac{1}{N}\sum_{i=1}^{N}p_{i,j}^{\prime}, & j=1,\cdots,20\\ \frac{1}{N-lag}\sum_{i=1}^{N-lag}\left(p_{i,j}^{\prime}-p_{i+lag,j}^{\prime}\right)^{2}, & lag=1,\cdots,8;\ j=1,\cdots,20\end{cases}$

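where $N$ is the length of the protein sequence (the number of rows of the PSSM), $p_{i,j}^{\prime}$ is the standardized score of residue $i$ for amino acid type $j$, and lag denotes the distance along the sequence between one residue and its neighbors.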

Finally, the standardized PSSM is transformed into a 180-dimensional feature vector (20 + 20 $\times$ 8 = 180) that characterizes the protein.
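
The following sketch applies Eq. (2) to an already standardized matrix and checks the length of the result. The ordering of the lag features (all 20 columns for lag 1, then lag 2, and so on) is an assumption, as the text does not fix it, and the random toy matrix merely stands in for a real standardized PSSM.

```python
import random

def pse_pssm(p, max_lag=8):
    """Apply Eq. (2) to a standardized PSSM `p` (N rows, 20 columns)."""
    n_rows, n_cols = len(p), len(p[0])
    if n_rows <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # First 20 features: column means of the standardized PSSM.
    features = [sum(p[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    # Next 20 * max_lag features: mean squared difference between
    # residues that are `lag` positions apart, per column.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((p[i][j] - p[i + lag][j]) ** 2 for i in range(n_rows - lag))
            features.append(total / (n_rows - lag))
    return features

rng = random.Random(0)
toy = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(50)]
vector = pse_pssm(toy)
print(len(vector))
```

Whatever the sequence length N, the output has the same length, which is what allows proteins of different sizes to share one classifier.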

2.3 Classification algorithms

2.3.1 Support Vector Machine

Support Vector Machine (SVM) [41] is a supervised learning algorithm for classification. As a generalized linear classifier, it seeks the maximum-margin hyperplane and uses it as the decision boundary. SVM is widely used in classification, regression, and other tasks [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52]. Since the dataset is not linearly separable, an SVM with a Gaussian (RBF) kernel is employed. C, the penalty factor, is a crucial parameter: the larger C is, the more heavily the SVM is penalized for each misclassified training sample, so the less tolerant it is of errors. Gamma ($\gamma$) is also important because it implicitly determines the distribution of the data after mapping to the new feature space: the larger $\gamma$ is, the narrower the kernel and the smaller the region each training sample influences. Grid search is used to find the combination of C and $\gamma$ that gives the best performance.
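
To make the role of $\gamma$ concrete, the sketch below evaluates the Gaussian kernel $\exp(-\gamma\lVert x-y\rVert^{2})$ for one pair of points at several values of $\gamma$. The points and candidate values are hypothetical and are not the grid used in the study.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.01, 0.1, 1.0, 10.0):  # hypothetical candidate values
    print(gamma, rbf_kernel(x, y, gamma))
```

The kernel value for two distinct points falls toward zero as $\gamma$ grows, so a large $\gamma$ makes the decision boundary depend only on very close training samples.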

2.3.2 Random forest

Random forest (RF) [53] is an ensemble machine learning technique that trains several decision trees and combines their predictions; it has found many successful applications in bioinformatics [54, 55, 56, 57, 58, 59, 60, 61, 62]. The trees in the forest are built independently of each other. When a sample is input into the RF for classification, each decision tree classifies it, and the final output of the RF combines the results of all trees. Many parameters affect the RF, the most influential being the number of trees and their maximum depth. As with the SVM, grid search is used to find suitable values for these parameters.

2.3.3 K-Nearest Neighbor

K-Nearest Neighbor (KNN), one of the most commonly used machine learning algorithms, assigns an input sample to a category based on the most similar samples in the feature space. K is the only parameter that the KNN algorithm needs to specify, so it has a significant effect on the result. In the KNN algorithm, each sample is represented by its K nearest neighbors. The smaller K is, the smaller the approximation error of learning, because the prediction is affected only by training instances that are close to the input instance; however, a small K also makes the prediction more sensitive to noise. Again, we use grid search to find an appropriate value of K.
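The effect of K can be seen in a small experiment. The sketch below assumes scikit-learn and synthetic data; the candidate values of K are hypothetical. With K = 1 every training sample is its own nearest neighbor, so the training accuracy is perfect, which illustrates the low approximation error of a small K but says nothing about accuracy on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Record (training accuracy, test accuracy) for several values of K.
scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"K={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```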

2.3.4 Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) is another advanced machine learning model [63]. XGBoost has been successful in a variety of machine learning competitions [64] as well as in other fields [65, 66, 67, 68]. It is a tool for large-scale parallel tree boosting, with the advantages of high accuracy and flexibility. The parameters that most affect XGBoost include the learning rate, the minimum loss reduction required to split a leaf node, the maximum depth of a tree, and the minimum weight of a leaf node. Again, we use grid search to tune these parameters.

2.3.5 Model stacking

Model stacking is a strategy of ensemble learning. The basic idea of ensemble learning is to combine multiple classifiers so that the errors made by one weak classifier are likely to be corrected by the others. Combining multiple models can produce a model with better performance and stronger generalization ability. Typical integration strategies are bagging, boosting, stacking, and voting [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]. Bagging mainly aims to reduce the generalization error by combining multiple models, each trained on a different bootstrap sample. For a test sample, the outputs of the models are voted on to obtain the final result. Boosting combines a series of weak models using particular cost functions. Majority voting includes both soft and hard voting. Hard voting counts the labels predicted by the base models and takes the most frequent label as the final result, while soft voting uses the predicted probabilities of the base models instead of the predicted labels. In our study, the integration strategy we choose is stacking, in which the outputs of the base classifiers are used as the input of a final model. To prevent overfitting, we use a simple model, logistic regression (LR), to make the final prediction. The workflow of this method is shown in Fig. 1.

Fig. 1.

The MSF workflow.
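A stacking ensemble with a logistic regression final layer can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: it uses scikit-learn's `StackingClassifier`, which trains the final model on out-of-fold predictions of the base models; `GradientBoostingClassifier` stands in for XGBoost so that the example depends only on scikit-learn; and the data and hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

base_models = [
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    # Stand-in for XGBoost (assumption made to keep the example self-contained).
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The base-model probabilities become the features of the final LR model.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
meta_features = stack.transform(X_test)  # one probability column per base model
print(round(test_accuracy, 3), meta_features.shape)
```

Because the final model sees only one probability per base classifier, it has very few parameters to fit, which is the sense in which a simple final layer limits overfitting.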

2.4 Performance measures

The results are evaluated with four standard measures: accuracy (ACC), Matthews correlation coefficient (MCC) [43, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92], sensitivity (SN), and specificity (SP). Their formulas are as follows:

(3) $SN=\frac{TP}{FN+TP}$

(4) $SP=\frac{TN}{FP+TN}$

(5) $MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times(TP+FP)\times(TN+FP)\times(TN+FN)}}$

(6) $ACC=\frac{TP+TN}{TP+TN+FP+FN}$

where TP, TN, FN, and FP denote the numbers of true positives, true negatives, false negatives, and false positives, respectively.
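Equations (3) to (6) translate directly into code. The sketch below is a plain-Python illustration; the function name `classification_metrics` and the confusion-matrix counts in the example call are hypothetical, and returning 0 for MCC when the denominator is zero is a common convention rather than part of Eq. (5).

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return SN, SP, ACC and MCC computed from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # Eq. (3)
    sp = tn / (tn + fp)                      # Eq. (4)
    acc = (tp + tn) / (tp + tn + fp + fn)    # Eq. (6)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Eq. (5); a zero denominator is mapped to 0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 100 positives and 400 negatives.
metrics = classification_metrics(tp=90, tn=380, fp=20, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

With these counts, SN = 0.90, SP = 0.95, and ACC = 0.94, while MCC is about 0.82; MCC is lower than ACC because it also accounts for the imbalance between the two classes.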

3. Results and discussion

3.1 Performance comparisons with the single classifiers and using different base classifiers

We conduct two types of experiments to test the model stacking framework. First, four single-classifier predictors are built with RF, XGBoost, KNN, and SVM, respectively. We compare their performance on the same independent dataset, and the results are shown in Table 2. Then, three classifiers using different combinations of base models are constructed: MSF (SVM, XGBoost), MSF (SVM, XGBoost, KNN), and MSF (SVM, XGBoost, KNN, RF). They employ SVM + XGBoost, SVM + XGBoost + KNN, and SVM + XGBoost + KNN + RF as the base classifiers, respectively. Next, LR is employed as the last layer of the model, and the outputs of the base classifiers are fed into it to make the final prediction. The results are shown in Table 3. The ROC curves with AUC values and the PR curves are shown in Fig. 2.

Table 2. The performances of single classifiers on the independent dataset.

	SN (%)	SP (%)	ACC (%)	MCC
RF	93.20	91.58	91.86	0.7558
XGBoost	91.23	93.97	93.44	0.8057
KNN	90.50	94.75	93.89	0.8203
SVM	89.79	96.66	95.14	0.8599

Table 3. The performances using different base classifiers on the independent dataset.

	SN (%)	SP (%)	ACC (%)	MCC
MSF (SVM, XGBoost)	91.44	95.98	95.02	0.8549
MSF (SVM, XGBoost, KNN)	91.71	96.82	95.70	0.8756
MSF (SVM, XGBoost, KNN, RF)	91.24	96.81	95.59	0.8725

Fig. 2.

The ROC and PR curves of different models. (A) The ROC curves of single classifiers. (B) The PR curves of single classifiers. (C) The ROC curves of MSF using different base classifiers. (D) The PR curves of MSF using different base classifiers.

Table 2 shows that the SVM predictor achieves the best overall performance, with MCC = 0.8599, ACC = 95.14%, SP = 96.66%, and SN = 89.79%. RF has the lowest MCC, ACC, and SP, with MCC = 0.7558, ACC = 91.86%, SP = 91.58%, and SN = 93.20%. The gaps in MCC, ACC, SP, and SN between SVM and RF are 0.1041, 3.28%, 5.08%, and –3.41%, respectively; SN is the only measure on which RF exceeds SVM. Fig. 2 also shows that SVM has the best performance and the highest AUC values.
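The gaps between SVM and RF can be recomputed from the values in Table 2. The short check below uses plain Python; the dictionary layout is only an illustrative choice.

```python
# Values copied from Table 2 (SN, SP, ACC in %, MCC unitless).
table2 = {
    "RF": {"SN": 93.20, "SP": 91.58, "ACC": 91.86, "MCC": 0.7558},
    "SVM": {"SN": 89.79, "SP": 96.66, "ACC": 95.14, "MCC": 0.8599},
}

# Gap = SVM minus RF for each measure, rounded to four decimals.
gaps = {
    measure: round(table2["SVM"][measure] - table2["RF"][measure], 4)
    for measure in ("MCC", "ACC", "SP", "SN")
}
print(gaps)
```

The result is a gap of 0.1041 in MCC, 3.28 percentage points in ACC, 5.08 in SP, and a negative gap of 3.41 in SN, the one measure on which RF is ahead.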

Table 3 shows that the second predictor, MSF (SVM, XGBoost, KNN), is ahead of the others on all four measures. Compared with the single classifiers in Table 2, this ensemble model achieves higher SP, ACC, and MCC than every single predictor; only RF attains a higher SN. In Fig. 2, comparing the AUC values of MSF and the single classifiers, we also find that MSF has better performance. We believe that ensemble learning performs better than single classifiers because it can combine different classifiers flexibly and make better use of their respective advantages. In addition, different base classifiers may form different feature representations of the same data, so that they correct each other's errors and obtain better performance. Compared to the first predictor, the second predictor obtains better performance by adding KNN, which has the second-best ACC and MCC among the four single classifiers. Adding RF, in contrast, slightly reduces the overall prediction performance of the ensemble model. The results show that a weak base classifier may degrade the effect of model stacking.

3.2 T-test analysis of model stacking framework and single classifiers

To further demonstrate the statistical significance of MSF, we perform t-test analysis on MSF and single classifiers including RF, XGBoost, KNN, and SVM.

The p-values of the t-tests are shown in Table 4. In addition, log (0.05/p-value), computed with the base-10 logarithm, shows how far each p-value lies below 0.05. The larger this value is, the more significant the difference is.

Table 4. T-test analysis results of model stacking framework and single classifiers.

	p-value	log (0.05/p-value)
RF	$3.8458\times 10^{-16}$	14.12
XGBoost	$6.1513\times 10^{-7}$	4.91
KNN	$9.6691\times 10^{-4}$	1.71
SVM	$7.7435\times 10^{-3}$	0.81

The results show that all p-values are less than 0.05, indicating that the differences between the stacked model framework and each single classifier are statistically significant. In addition, RF is the algorithm that differs most significantly from the stacked model.
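The second column of Table 4 can be reproduced from the p-values, assuming the logarithm is base 10 (an assumption that the tabulated values support). The sketch uses only the Python standard library.

```python
import math

# p-values copied from Table 4.
p_values = {
    "RF": 3.8458e-16,
    "XGBoost": 6.1513e-7,
    "KNN": 9.6691e-4,
    "SVM": 7.7435e-3,
}
alpha = 0.05  # significance threshold

# A positive value means the p-value is below the threshold.
log_ratios = {name: math.log10(alpha / p) for name, p in p_values.items()}

for name, value in log_ratios.items():
    print(f"{name:8s} {value:.2f}")
```

All four values are positive, so every p-value is below 0.05, and the ordering RF > XGBoost > KNN > SVM matches the table.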

3.3 Performance comparisons with majority voting-based methods

To further verify that model stacking is an appropriate integration strategy for protein classification problems, the majority voting-based method using SVM, XGBoost, and KNN is tested on the independent dataset. Two variants are tested: hard voting and soft voting. The results can be found in Table 5.

Table 5. The performances using majority voting-based methods on the independent dataset.

	SN (%)	SP (%)	ACC (%)	MCC
Hard voting (SVM, XGBoost, KNN)	91.89	95.85	95.02	0.8546
Soft voting (SVM, XGBoost, KNN)	92.86	95.73	95.14	0.8576
MSF (SVM, XGBoost, KNN)	91.71	96.82	95.70	0.8756

The gaps in MCC, ACC, SP, and SN between hard voting and soft voting (hard minus soft) are –0.0030, –0.12%, 0.12%, and –0.97%, respectively. On the same independent dataset, MSF ranks first on every measure except SN.
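The difference between hard and soft voting can be illustrated with a few hand-made probabilities. The sketch below assumes NumPy, three base models, and a 0.5 decision threshold; the probability values are hypothetical and are not outputs of the models in this study.

```python
import numpy as np

# Rows: samples; columns: positive-class probability from three base models.
probs = np.array([
    [0.90, 0.40, 0.45],  # one confident "yes", two weak "no"
    [0.55, 0.60, 0.10],  # two weak "yes", one confident "no"
    [0.80, 0.70, 0.90],  # unanimous "yes"
    [0.20, 0.30, 0.10],  # unanimous "no"
])

# Hard voting: each model casts a label; the majority label wins.
hard = (probs >= 0.5).sum(axis=1) >= 2

# Soft voting: average the probabilities, then threshold.
soft = probs.mean(axis=1) >= 0.5

print(hard.tolist())  # [False, True, True, False]
print(soft.tolist())  # [True, False, True, False]
```

The first two samples show how the two rules can disagree: soft voting lets one confident model outweigh two uncertain ones, whereas hard voting ignores confidence.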

3.4 Stability comparisons with the single classifiers and using different base classifiers

When the model is applied, its results may fluctuate due to changes in the data. To measure how much the results vary, we conduct stability tests. We perform five repetitions of 5-fold cross-validation and calculate the mean and standard deviation of four metrics (SN, SP, ACC, and MCC), as shown in Table 6.

Table 6. The means and standard deviations for the 5 $\times$ 5-fold cross-validations.

	SN	SP	ACC	MCC
RF	92.96 $\pm$ 0.036046	91.48 $\pm$ 0.002203	91.70 $\pm$ 0.005477	0.7514 $\pm$ 0.017186
XGBoost	90.65 $\pm$ 0.016817	94.25 $\pm$ 0.004943	93.53 $\pm$ 0.004603	0.8090 $\pm$ 0.014010
KNN	89.06 $\pm$ 0.012900	95.22 $\pm$ 0.004504	93.91 $\pm$ 0.003070	0.8223 $\pm$ 0.009366
SVM	89.41 $\pm$ 0.008326	96.57 $\pm$ 0.003200	94.98 $\pm$ 0.001693	0.8554 $\pm$ 0.005167
MSF (SVM, XGBoost, KNN)	91.50 $\pm$ 0.010614	96.63 $\pm$ 0.008257	95.50 $\pm$ 0.004375	0.8697 $\pm$ 0.013990

The results show that MSF achieves the best mean performance on every metric except SN, where RF is higher. In terms of stability (ranked by standard deviation, smallest first), MSF ranks 2nd, 5th, 3rd, and 3rd among all classifiers on SN, SP, ACC, and MCC, respectively.
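A stability test of this kind can be sketched with repeated cross-validation. The example assumes scikit-learn, stratified folds, and synthetic data, and it computes the standard deviation over all 25 fold scores; whether the study aggregated per fold or per repetition is not stated, so this is only one possible reading.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Five repetitions of 5-fold cross-validation: 25 train/test splits in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")

mean_acc = scores.mean()
std_acc = scores.std()
print(f"ACC = {mean_acc:.4f} +/- {std_acc:.4f} over {len(scores)} folds")
```

A smaller standard deviation indicates a more stable classifier; the same loop can be repeated with other `scoring` values to obtain the remaining metrics.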

3.5 Comparisons of existing methods

To further assess the performance of our method, we compare it with existing methods on the same dataset. These models include ET-CNN [17] and ET-GRU [20].

Table 7 shows the results of the comparison between our model and the other approaches. MSF (SVM, XGBoost, KNN) ranks first on all four measures, with ACC = 95.7%, MCC = 0.88, SN = 91.7%, and SP = 96.8%.

Table 7. Performance comparisons to existing methods.

	SN (%)	SP (%)	ACC (%)	MCC
ET-CNN	80.3	94.4	92.3	0.71
ET-GRU	79.8	95.9	92.3	0.77
MSF (SVM, XGBoost, KNN)	91.7	96.8	95.7	0.88

We believe that the main reason ET-MSF performs better than ET-CNN and ET-GRU may be that our algorithm is better suited to the task. Neural networks generally need a large number of samples to fit well. However, the dataset of electron transport proteins is relatively small, which limits the size of the neural network model and leads to poorer performance. Therefore, using an ensemble strategy to combine machine learning classifiers can achieve better results.

3.6 Web server development

We provide a simple, freely accessible web server at http://82.156.89.65/ so that readers can evaluate and use our approach online. The online version of ET-MSF is implemented in Java with the Spring Boot framework. Biologists can use ET-MSF to identify electron transport proteins online and obtain the model's prediction probability by simply entering one or more protein amino acid sequences in standard FASTA format. They can also download our public datasets and models at https://github.com/Kinkou626/ET-MSF and run them on their own computers.

4. Conclusions

In our study, a model stacking framework (MSF) is proposed that extracts features from protein sequences and identifies electron transport proteins by integrating multiple classifiers. The model stacking framework has not previously been applied to the electron transport protein recognition task. We use an independent dataset to evaluate the model, and we experimentally compare model stacking frameworks built from different combinations of base models. Comprehensive experiments show that the framework based on SVM, XGBoost, and KNN is the best model, achieving ACC and MCC values of 95.70% and 0.8756, respectively. Compared with existing methods, MSF also achieves significant improvements in all measurements. We expect our method to be an effective bioinformatics tool, and it may also be used to recognize the functions of other types of proteins.

Author contributions

YW did the experiments and wrote the manuscript. YW, QP, XL, and YD designed the method. YW, QP, XL, and YD revised the manuscript. All authors have read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Acknowledgment

Not applicable.

Funding

The work was supported by the National Natural Science Foundation of China (No. 61922020), the Sichuan Provincial Science Fund for Distinguished Young Scholars (2021JDJQ0025), and the Special Science Foundation of Quzhou (2021D004).

Conflict of interest

The authors declare no conflict of interest.

References

[1] Chance B, Williams GR. The respiratory chain and oxidative phosphorylation. Advances in Enzymology and Related Subjects of Biochemistry. 1956; 17: 65–134.
[2] Foyer CH, Harbinson J. Oxygen metabolism and the regulation of photosynthetic electron transport. In: Causes of photooxidative stress and amelioration of defense systems in plants (pp. 1–42). CRC Press: Boca Raton. 2019.
[3] Mrozek D, Malysiak B, Kozielski S. ‘An optimal alignment of proteins energy characteristics with crisp and fuzzy similarity awards’, 2007 IEEE International Fuzzy Systems Conference. London, UK. 2007.
[4] Hu Y, Qiu S, Cheng L. Integration of Multiple-Omics Data to Analyze the Population-Specific Differences for Coronary Artery Disease. Computational and Mathematical Methods in Medicine. 2021; 2021: 7036592.
[5] Ritov VB, Menshikova EV, Azuma K, Wood R, Toledo FGS, Goodpaster BH, et al. Deficiency of electron transport chain in human skeletal muscle mitochondria in type 2 diabetes mellitus and obesity. American Journal of Physiology-Endocrinology and Metabolism. 2010; 298: E49–E58.
[6] Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting Diabetes Mellitus with Machine Learning Techniques. Frontiers in Genetics. 2018; 9: 515.
[7] Qu K, Zou Q, Shi H. Prediction of diabetic protein markers based on an ensemble method. Frontiers in Bioscience-Landmark. 2021; 26: 207–221.
[8] Parker WD, Boyson SJ, Parks JK. Abnormalities of the electron transport chain in idiopathic parkinson’s disease. Annals of Neurology. 1989; 26: 719–723.
[9] Parker WD, Filley CM, Parks JK. Cytochrome oxidase deficiency in Alzheimer’s disease. Neurology. 1990; 40: 1302–1303.
[10] Xu L, Liang G, Liao C, Chen G, Chang C. K-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer’s Disease Protein Identification. Frontiers in Genetics. 2020; 10: 33.
[11] Hu Y, Sun J, Zhang Y, Zhang H, Gao S, Wang T, et al. Rs1990622 variant associates with Alzheimer’s disease and regulates TMEM106B expression in human brain tissues. BMC Medicine. 2021; 19: 11.
[12] Hu Y, Zhang H, Liu B, Gao S, Wang T, Han Z, et al. Rs34331204 regulates TSPAN13 expression and contributes to Alzheimer’s disease with sex differences. Brain. 2020; 143: e95–e95.
[13] Le N, Nguyen T, Ou Y. Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties. Journal of Molecular Graphics and Modelling. 2017; 73: 166–178.
[14] Khatun M, Hasan M, Kurata H. PreAIP: computational prediction of anti-inflammatory peptides by integrating multiple complementary features. Frontiers in Genetics. 2019; 10: 129.
[15] Hasan MM, Guo D, Kurata H. Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information. Molecular BioSystems. 2017; 13: 2545–2550.
[16] Hasan MM, Zhou Y, Lu X, Li J, Song J, Zhang Z. Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs. PLoS ONE. 2015; 10: e0129635.
[17] Le N, Ho Q, Ou Y. Incorporating deep learning with convolutional neural networks and position specific scoring matrices for identifying electron transport proteins. Journal of Computational Chemistry. 2017; 38: 2000–2006.
[18] Chen S, Ou Y, Lee T, Gromiha MM. Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics. 2011; 27: 2062–2067.
[19] Mishra NK, Chang J, Zhao PX. Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS ONE. 2014; 9: e100278.
[20] Le NQK, Yapp EKY, Yeh H. ET-GRU: using multi-layer gated recurrent units to identify electron transport proteins. BMC Bioinformatics. 2019; 20: 377.
[21] Gromiha MM, Yabuki Y. Functional discrimination of membrane proteins using machine learning techniques. BMC Bioinformatics. 2008; 9: 135.
[22] Ru X, Li L, Zou Q. Incorporating Distance-Based top-n-gram and Random Forest to Identify Electron Transport Proteins. Journal of Proteome Research. 2019; 18: 2931–2939.
[23] Bateman A, Martin MJ, O’Donovan C, Magrane M, Apweiler R, Alpi E, et al. UniProt: a hub for protein information. Nucleic Acids Research. 2014; 43: D204–D212.
[24] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000; 25: 25–29.
[25] Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology. 1999; 292: 195–202.
[26] Su C, Chen C, Ou Y. Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics. 2006; 7: 319.
[27] Le N-Q-K, Ou Y-Y. Prediction of FAD binding sites in electron transport proteins according to efficient radial basis function networks and significant amino acid pairs. BMC Bioinformatics. 2016; 17: 298.
[28] Li Z, Zhao Y, Pan G, Tang J, Guo F. A Novel Peptide Binding Prediction Approach for HLA-DR Molecule Based on Sequence and Structural Information. BioMed Research International. 2016; 2016: 3832176.
[29] Mrozek D, Malysiak-Mrozek B, Kozielski S. Alignment of Protein Structure Energy Patterns Represented as Sequences of Fuzzy Numbers. 2009 Annual Meeting of the North American Fuzzy Information Processing Society 2009; 35–40.
[30] Hong Z, Zeng X, Wei L, Liu X. Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics. 2019; 36: 1037–1043.
[31] Zeng X, Liao Y, Liu Y, Zou Q. Prediction and Validation of Disease Genes Using HeteSim Scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017; 14: 687–695.
[32] Zeng X, Lin W, Guo M, Zou Q. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Computational Biology. 2017; 13: e1005420.
[33] Cai L, Wang L, Fu X, Xia C, Zeng X, Zou Q. ITP-Pred: an interpretable method for predicting, therapeutic peptides with fused features low-dimension representation. Briefings in Bioinformatics. 2021; 22: bbaa367.
[34] Cheng L, Qi C, Zhuang H, Fu T, Zhang X. GutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Research. 2020; 48: D554–D560.
[35] Cheng L, Hu Y, Sun J, Zhou M, Jiang Q. DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics. 2018; 34: 1953–1956.
[36] Wu Y, Lu X, Shen B, Zeng Y. The Therapeutic Potential and Role of miRNA, lncRNA, and circRNA in Osteoarthritis. Current Gene Therapy. 2019; 19: 255–263.
[37] Yu L, Wang M, Yang Y, Xu F, Zhang X, Xie F, et al. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLOS Computational Biology. 2021; 17: e1008696.
[38] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997; 25: 3389–3402.
[39] Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10: 421.
[40] Chou K, Shen H. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications. 2007; 360: 339–345.
[41] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; 20: 273–297.
[42] Tao Z, Li Y, Teng Z, Zhao Y. A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD. Computational and Mathematical Methods in Medicine. 2020; 2020: 1–9.
[43] Jiang Q, Wang G, Jin S, Li Y, Wang Y. Predicting human microRNA-disease associations based on support vector machine. International Journal of Data Mining and Bioinformatics. 2013; 8: 282–293.
[44] Su R, Liu X, Jin Q, Liu X, Wei L. Identification of glioblastoma molecular subtype and prognosis based on deep MRI features. Knowledge-Based Systems. 2021; 232: 107490.
[45] Su R, Wu H, Xu B, Liu X, Wei L. Developing a Multi-Dose Computational Model for Drug-Induced Hepatotoxicity Prediction Based on Toxicogenomics Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019; 16: 1231–1239.
[46] Liu J, Su R, Zhang J, Wei L. Classification and gene selection of triple-negative breast cancer subtype embedding gene connectivity matrix in deep neural network. Briefings in Bioinformatics. 2021. (in press)
[47] Cheng L, Yang H, Zhao H, Pei X, Shi H, Sun J, et al. MetSigDis: a manually curated resource for the metabolic signatures of diseases. Briefings in Bioinformatics. 2019; 20: 203–209.
[48] Lu X, Zhao S. Gene-based Therapeutic Tools in the Treatment of Cornea Disease. Current Gene Therapy. 2019; 19: 7–19.
[49] Tahir M, Idris A. MD-LBP: An Efficient Computational Model for Protein Subcellular Localization from HeLa Cell Lines Using SVM. Current Bioinformatics. 2020; 15: 204–211.
[50] Meng C, Guo F, Zou Q. CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes. Computational Biology and Chemistry. 2020; 87: 107304.
[51] Kuo J, Chang C, Chen C, Liang H, Chang C, Chu Y. Sequence-based Structural B-cell Epitope Prediction by Using Two Layer SVM Model and Association Rule Features. Current Bioinformatics. 2020; 15: 246–252.
[52] Hasan MM, Basith S, Khatun MS, Lee G, Manavalan B, Kurata H. Meta-i6mA: an interspecies predictor for identifying DNA N6-methyladenine sites of plant genomes by exploiting informative features in an integrative machine-learning framework. Briefings in Bioinformatics. 2021; 22: bbaa202.
[53] Breiman L. Random Forests. Machine Learning. 2001; 45: 5–32.
[54] Qi Y. Random forest for bioinformatics. In: Ensemble machine learning (pp. 307–323). Springer: Berlin/Heidelberg, Germany. 2012.
[55] Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019; 16: 1264–1273.
[56] Wei L, Su R, Wang B, Li X, Zou Q, Gao X. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing. 2019; 324: 3–9.
[57] Su R, Liu X, Wei L, Zou Q. Deep-Resp-Forest: a deep forest model to predict anti-cancer drug response. Methods. 2019; 166: 91–102.
[58] Cheng L, Han X, Zhu Z, Qi C, Wang P, Zhang X. Functional alterations caused by mutations reflect evolutionary trends of SARS-CoV-2. Briefings in Bioinformatics. 2021; 22: 1442–1450.
[59] Chen X, Shi W, Deng L. Prediction of Disease Comorbidity Using HeteSim Scores based on Multiple Heterogeneous Networks. Current Gene Therapy. 2019; 19: 232–241.
[60] Ao C, Yu L, Zou Q. RFhy-m2G: Identification of RNA N2-methylguanosine Modification Sites Based on Random Forest and Hybrid Features. Methods. 2021. (in press)
[61] Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Briefings in Functional Genomics. 2021; 20: 1–18.
[62] Hasan MM, Alam MA, Shoombuatong W, Deng H, Manavalan B, Kurata H. NeuroPred-FRL: an interpretable prediction model for identifying neuropeptide using feature representation learning. Briefings in Bioinformatics. 2021. (in press)
[63] Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery: New York, NY, United States. 2016.
[64] Tianqi C, Tong H. Higgs Boson Discovery with Boosted Trees. 2014 International Conference on High-Energy Physics and Machine Learning. Valencia, Spain, 2-9 July 2014.
[65] Torlay L, Perrone-Bertolotti M, Thomas E, Baciu M. Machine learning-XGBoost analysis of language networks to classify patients with epilepsy. Brain Informatics. 2017; 4: 159–169.
[66] Cai L, Ren X, Fu X, Peng L, Gao M, Zeng X. IEnhancer-XG: interpretable sequence-based enhancers and their strength predictor. Bioinformatics. 2021; 37: 1060–1067.
[67] Yu X, Zhou J, Zhao M, Yi C, Duan Q, Zhou W, et al. Exploiting XG Boost for Predicting Enhancer-promoter Interactions. Current Bioinformatics. 2020; 15: 1036–1045.
[68] Hasan MM, Schaduangrat N, Basith S, Lee G, Shoombuatong W, Manavalan B. HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics. 2020; 36: 3350–3356.
[69] Breiman L. Bagging predictors. Machine Learning. 1996; 24: 123–140.
[70] Małysiak-Mrozek B, Baron T, Mrozek D. Spark-IDPP: high-throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud. Cluster Computing. 2019; 22: 487–508.
[71] Wei L, Xing P, Zeng J, Chen J, Su R, Guo F. Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine. 2017; 83: 67–74.
[72] Wei L, Wan S, Guo J, Wong KK. A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine. 2017; 83: 82–90.
[73] Wei L, Tang J, Zou Q. Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information. Information Sciences. 2017; 384: 135–144.
[74] Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. Journal of Chemical Information and Modeling. 2020; 60: 1876–1883.
[75] Ding Y, Tang J, Guo F. Identification of Protein-Ligand Binding Sites by Sequence Information and Ensemble Classifier. Journal of Chemical Information and Modeling. 2017; 57: 3149–3161.
[76] Fu X, Cai L, Zeng X, Zou Q. StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics. 2020; 36: 3028–3034.
[77] Yu L, Xia M, An Q. A network embedding framework based on integrating multiplex network for drug combination prediction. Briefings in Bioinformatics. 2021. (in press)
[78] Ru X, Cao P, Li L, Zou Q. Selecting Essential MicroRNAs Using a Novel Voting Method. Molecular Therapy - Nucleic Acids. 2019; 18: 16–23.
[79] Zhu H, Du X, Yao Y. ConvsPPIS: Identifying Protein-protein Interaction Sites by an Ensemble Convolutional Neural Network with Feature Graph. Current Bioinformatics. 2020; 15: 368–378.
[80] Sultana N, Sharma N, Sharma KP, Verma S. A Sequential Ensemble Model for Communicable Disease Forecasting. Current Bioinformatics. 2020; 15: 309–317.
[81] Xu Z, Luo M, Lin W, Xue G, Wang P, Jin X, et al. DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Briefings in Bioinformatics. 2021; 22: bbab335.
[82] Huang Y, Zhou D, Wang Y, Zhang X, Su M, Wang C, et al. Prediction of transcription factors binding events based on epigenetic modifications in different human cells. Epigenomics. 2020; 12: 1443–1456.
[83] Zhang L, Xiao X, Xu ZC. iPromoter-5mC: A Novel Fusion Decision Predictor for the Identification of 5-Methylcytosine Sites in Genome-Wide DNA Promoters. Frontiers in Cell and Developmental Biology. 2020; 8: 614.
[84] Wang H, Tang J, Ding Y, Guo F. Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Briefings in Bioinformatics. 2021; 22: bbaa409.
[85] Ding Y, Tang J, Guo F. Identification of Drug-Target Interactions via Dual Laplacian Regularized least Squares with Multiple Kernel Fusion. Knowledge-Based Systems. 2020; 204: 106254.
[86] Ding Y, Tang J, Guo F. Identification of drug-target interactions via fuzzy bipartite local model. Neural Computing and Applications. 2020; 32: 10303–10319.
[87] Zeng X, Zhu S, Liu X, Zhou Y, Nussinov R, Cheng F. DeepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics. 2019; 35: 5191–5198.
[88] Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chemical Science. 2020; 11: 1775–1797.
[89] Zhai Y, Chen Y, Teng Z, Zhao Y. Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions. Frontiers in Cell and Developmental Biology. 2020; 8: 591487.
[90] Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Frontiers in Bioengineering and Biotechnology. 2020; 8: 584807.
[91] Jin Q, Cui H, Sun C, Meng Z, Su R. Free-form tumor synthesis in computed tomography images via richer generative adversarial network. Knowledge-Based Systems. 2021; 218: 106753.
[92] Wu X, Yu L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics. 2021; 37: 4314–4320.
