IMR Press / FBL / Volume 26 / Issue 7 / DOI: 10.52586/4935
Open Access Article
Prediction of diabetic protein markers based on an ensemble method
Show Less
1 School of Computer and Software, Nanyang Institute of Technology, 473004 Nanyang, Henan, China
2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China
3 Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 324000 Quzhou, Zhejiang, China
4 School of Opto-electronic and Communication Engineering, Xiamen University of Technology, 361024 Xiamen, Fujian, China
Front. Biosci. (Landmark Ed) 2021, 26(7), 207–221;
Submitted: 26 May 2021 | Revised: 10 June 2021 | Accepted: 25 June 2021 | Published: 30 July 2021
Copyright: © 2021 The Author(s). Published by BRI.
This is an open access article under the CC BY 4.0 license (

Introduction: A diabetic protein marker is a type of protein that is closely related to diabetes. This kind of protein plays an important role in the prevention and diagnosis of diabetes. Therefore, it is necessary to identify an effective method for predicting diabetic protein markers. In this study, we propose using ensemble methods to predict diabetic protein markers. Methodological issues: The ensemble method consists of two aspects. First, we combine a feature extraction method to obtain mixed features. Next, we classify the protein using ensemble classifiers. We use three feature extraction methods in the ensemble method, including composition and physicochemical features (abbreviated as 188D), adaptive skip gram features (abbreviated as 400D) and g-gap (abbreviated as 670D). There are six traditional classifiers in this study: decision tree, Naive Bayes, logistic regression, part, k-nearest neighbor, and kernel logistic regression. The ensemble classifiers are random forest and vote. First, we used feature extraction methods and traditional classifiers to classify protein sequences. Then, we compared the combined feature extraction methods with single methods. Next, we compared ensemble classifiers to traditional classifiers. Finally, we used ensemble classifiers and combined feature extraction methods to predict samples. Results: The results indicated that ensemble methods outperform single methods with respect to either ensemble classifiers or combined feature extraction methods. When the classifier is a random forest and the feature extraction method is 588D (combined 188D and 400D), the performance is best among all methods. The second best ensemble feature extraction method is 1285D (combining the three methods) with random forest. The best single feature extraction method is 188D, and the worst one is g-gap. Conclusion: According to the results, the ensemble method, either the combined feature extraction method or the ensemble classifier, was better than the single method. We anticipate that ensemble methods will be a useful tool for identifying diabetic protein markers in a cost-effective manner.

Diabetic protein marker
Machine learning
Feature extraction method
Ensemble classifiers
Dimensionality reduction
Fig. 1.
Back to top