Prediction of diabetic protein markers based on an ensemble method

Introduction: A diabetic protein marker is a protein that is closely related to diabetes. Such proteins play an important role in the prevention and diagnosis of diabetes, so an effective method for predicting them is needed. In this study, we propose using ensemble methods to predict diabetic protein markers. Methodological issues: The ensemble method has two aspects. First, we combine feature extraction methods to obtain mixed features. Next, we classify the proteins using ensemble classifiers. We use three feature extraction methods: composition and physicochemical features (abbreviated as 188D), adaptive skip gram features (abbreviated as 400D) and g-gap (abbreviated as 670D). There are six traditional classifiers in this study: decision tree, Naive Bayes, logistic regression, PART, k-nearest neighbor, and kernel logistic regression. The ensemble classifiers are random forest and vote. First, we used single feature extraction methods and traditional classifiers to classify protein sequences. Then, we compared the combined feature extraction methods with the single methods. Next, we compared ensemble classifiers with traditional classifiers. Finally, we used ensemble classifiers and combined feature extraction methods to predict samples. Results: Ensemble methods outperformed single methods, whether the ensemble was applied to the classifiers or to the feature extraction methods. The best performance among all methods was obtained with random forest on 588D (188D combined with 400D). The second-best combined feature extraction method was 1258D (all three methods combined), also with random forest. The best single feature extraction method was 188D, and the worst was g-gap.
Conclusion: According to the results, the ensemble method, either the combined feature extraction method or the ensemble classifier, was better than the single method. We anticipate that ensemble methods will be a useful tool for identifying diabetic protein markers in a cost-effective manner.

Keywords

Diabetic protein marker

Machine learning

Feature extraction method

Ensemble classifiers

Dimensionality reduction

2. Introduction

Owing to continuous changes in people’s lifestyles, an increasing number of people are suffering from diabetes mellitus [1]. At present, diabetes is one of the most prevalent diseases in many countries. According to clinical diagnoses, people who suffer from diabetes tend to be younger than before, and the incidence of diabetes is rising [2]. Therefore, improving the diagnostic efficiency of diabetes and identifying diabetic protein markers are currently hot topics. The continuous development of machine learning has resulted in its increasing use for disease prediction [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Machine learning methods for predicting diabetes mellitus (DM) have been studied for some time and have attracted considerable debate.

With the development of sequencing technology, the functions of proteins have gradually been uncovered, and proteomics has become a research hotspot. The essence of proteomics is to study proteins on a large scale, including protein expression, post-translational modification and protein-protein interactions [13]. Proteome research can not only provide a material basis for understanding the laws of life activities, but also provide a theoretical basis and solutions for elucidating and overcoming the mechanisms of many diseases [14]. By comparing the proteomes of normal and pathological individuals, we can find certain “disease-specific protein molecules”, which can become molecular targets for new drug design or provide molecular markers for the early diagnosis of diseases. Therefore, proteomics research is not only necessary for exploring the mysteries of life, but can also bring substantial benefits to human health. Proteomics research is a sign that life science has entered the post-genomic era [15]. Currently, relevant predictions can be made based on machine learning and protein markers. For example, machine learning has been used for age prediction using protein markers [16] as well as for the detection of other prevalent age-related diseases. Fleischer et al. [16] developed a computational method that uses ensemble machine learning to predict biological age from gene expression data of skin fibroblasts. Reboucas et al. [17] used biomarkers to detect the potential for recurrence of lung adenocarcinoma after surgical resection. These studies illustrate the importance of predicting protein markers. Therefore, in this study, we use machine learning methods to predict diabetic protein markers.

Protein is the material basis of all life, an important component of cells, and the primary raw material for the regeneration and repair of human tissue. Changes in protein morphology and quantity may lead to a variety of diseases, and some diseases may affect protein synthesis. Diabetic marker proteins are linked to diabetes, and these proteins directly or indirectly affect the diagnosis of diabetes [18]. Huth et al. [19] used proteomics to predict type 2 diabetes (T2D). In this experiment, the authors selected 892 people aged 42 to 81 years. They found that the level of mannan-binding lectin serine peptidase (MASP) was positively correlated with T2D and prediabetes, whereas adiponectin was negatively correlated with T2D. MASP, adiponectin, apolipoprotein A-IV, apolipoprotein C-II and C-reactive protein were related to prognosis. These results show that diabetes can be predicted from protein levels in the body. Diabetes can also cause a series of complications, such as diabetic nephropathy (DN). Hirao et al. [20] conducted a comprehensive analysis of diabetic patients and healthy people using label-free semi-quantitative methods. Protein identification analysis showed that 327 proteins were unique to healthy people and 30 proteins were unique to diabetic patients. In total, 615 proteins were identified in the two groups. Gestational diabetes mellitus (GDM) refers to abnormal blood sugar that occurs during pregnancy. It is one of the common complications of pregnancy, and it has serious adverse effects on the health of mothers and babies [19]. Kim et al. [21] showed experimentally that the level of apolipoprotein C-III in women with GDM was significantly increased. According to the experimental data, biomarkers are present in patients with gestational diabetes at 16–20 weeks of pregnancy. Therefore, it is feasible to determine protein biomarkers and predict the later development of GDM.

As described above, many proteins are related to diabetes. Their presence or abundance can usually be used as a criterion for diagnosing diabetes, so it is important to identify the proteins that are associated with diabetes. Establishing a good protein classification model and identifying diabetic marker proteins are important steps for understanding and predicting diabetes. The main methods currently used in proteomics include two-dimensional gel electrophoresis (2-DE), time-of-flight mass spectrometry (TOFMS), semi-quantitative multiple reaction monitoring (SQMRM) and bioinformatics technology [22]. These methods mainly rely on biological experiments to analyze proteins. They can perform accurate qualitative analysis, but they are costly. Moreover, when faced with an unknown protein, they cannot rapidly judge its function based on its structural characteristics. Therefore, we use machine learning methods to predict diabetic protein markers.

Machine learning has been widely used in protein classification. Machine learning methods build models from known proteins and can therefore predict the functions of unknown proteins quickly. For example, Feng et al. [23] used the amino acid composition and the frequency of occurrence of each dipeptide as features and used Naive Bayes as the classifier. Ding et al. [24] used the g-gap method to extract features from protein sequences and classified them with a support vector machine (SVM). Song et al. [25] converted protein sequences into 188-dimensional features according to their composition, physicochemical properties and distribution. Yuan et al. [26] extracted features according to peptides; this method considered the structural information between amino acids and obtained comprehensive and informative features. The pseudo-amino acid information proposed by Chou et al. [27] contained the sequence information of two amino acids separated by one or more amino acid residues, and this method obtained good accuracy. Liu et al. [28] proposed an enhanced method called pseudo amino acid composition (Pse-ACC). In this method, the authors reduced the amino acid alphabet profile, proposing physicochemical distance transformation (PDT), which is similar to Pse-ACC. Zhou et al. [29] and Tian et al. [30] enhanced feature extraction of protein sequences by fusing Pse-ACC with the dipeptide composition and auto-covariance, encoding proteins based on their grouped weight. In addition to research on feature extraction methods, the construction of classifiers is also a research hotspot, particularly ensemble learning. Several previous studies established a series of random forest models based on different features [31, 32, 33, 34, 35, 36] and obtained the final result by voting on the output of each classifier. Han et al. [37] constructed a two-layer multi-class support vector machine model to predict subcellular localization, in which the output of the first layer is the input of the second layer. Bahri et al. [38] built an ensemble method called Greedy-Boost, which not only improves stability but also increases the speed of classification. Lin et al. [39] used 18 classifiers to predict protein sequences, using K-means to cluster the results. Wang et al. [40] predicted protein-protein interaction sites using an ensemble of random forests with the synthetic minority oversampling technique. Table 1 (Ref. [23, 24, 25, 26, 27, 28, 29, 30, 37, 38, 39, 40]) summarizes the above methods.

Table 1. Using machine learning methods to predict proteins.

Authors	Feature extraction method	Classifier
Feng et al. [23]	amino acid composition and frequency of occurrence of each dipeptide	Naive Bayes
Ding et al. [24]	g-gap	SVM
Song et al. [25]	188-dimensional features	Ensemble classifier
Yuan et al. [26]	the structural information between amino acids	RBF network
Chou et al. [27]	pseudo-amino acid information	augmented covariant-discriminant algorithm
Liu et al. [28]	physicochemical distance transformation (PDT)	SVM
Zhou et al. [29] and Tian et al. [30]	fusing Pse-ACC with the dipeptide composition and auto-covariance	SMOTE
Han et al. [37]	Physicochemical Properties	two-stage multiclass support vector machine
Bahri et al. [38]	-	Greedy-Boost
Chen et al. [39]	188-dimensional features	Multi-classifiers and k-means
Wang et al. [40]	PSSM-SPF and RER features	ensemble random forests

We aim to build a machine-learning-based predictive model for diabetic protein markers. With such a model, it is possible to determine whether a protein found in the human body is a diabetic protein marker and to label the function of unknown proteins.

3. Materials and methods

To build an efficient classification model, the following steps are required. First, the protein sequences must be converted into feature vectors. Then, the dimensionality of the feature vectors is reduced, if necessary. Finally, the classification model is obtained by training the classifier. In this study, we developed a feature extraction method and a classifier, and used an ensemble method to predict diabetic protein markers. First, we obtained positive data from UniProt and then derived the negative data from the positive data. Next, we used three methods to extract features and six classifiers to predict proteins. Then, we obtained four new feature sets by combining the three feature extraction methods. Finally, we used a dimensionality reduction method to reduce the features obtained in the previous step. In each classification experiment, we used both ensemble classifiers and traditional classifiers. The process flow chart is shown in Fig. 1.

Fig. 1.

Overall process of the method described in this paper.

3.1 Dataset

Due to the low number of diabetic marker proteins currently available, it is very important to build a representative and non-redundant negative dataset. In this study, we used the protein family database (PFAM) based on structural information to build a negative dataset according to two principles [41, 42]: (1) extract all positive PFAM information and then choose the longest sequence as the negative sequence among the rest of all positive PFAM members; (2) the positive sequence is derived from the Uniprot Database.

Using ‘diabetes’ as the key word, 574 sequences were extracted from UniProt (Universal Protein, http://www.uniprot.org/uniprot/), containing human, mouse, cow and other species’ protein data. After screening, we obtained 310 human protein sequences, and then, we used CD-HIT [43] to reduce redundant data and removed sequences that contained illegal letters. We were left with 309 diabetic protein markers and 9695 negative protein sequences. Because the data set was unbalanced, we randomly selected a negative dataset according to the positive samples’ length and proportion. We randomly selected 5 sets of negative samples and averaged the results of 5 experiments using these 5 sets.

3.2 Feature extraction method

Since a computer cannot directly process a protein sequence, the sequence must first be converted into a numerical vector; this step is called feature extraction [44, 45, 46, 47, 48, 49, 50, 51, 52]. A good feature extraction method comprehensively considers the information contained in the protein sequence. Proteins are composed of 20 kinds of amino acids arranged in different orders, so the properties of a protein can be described through the positions of its amino acids, their physicochemical properties, etc. At present, feature extraction methods mainly take into account the following aspects: (1) amino acid composition, (2) amino acid physicochemical information, (3) intrinsic correlation information of amino acid sequences and (4) structural information of proteins. Feature extraction methods have a strong influence on experimental results. Therefore, how to improve feature extraction is a problem worth studying.

3.2.1 Composition and physicochemical features

Composition and physicochemical features (188D) can extract 188 features containing composition, transform and distribution information [53]. This method includes three parts: composition, transform and distribution [54, 55]. The first section is the amino acid (AA) composition. By calculating the frequencies of amino acids in a protein sequence, sequences are converted into 20D vectors [53]:

(1) $\left(v_{1},v_{2},v_{3},\cdots,v_{20}\right)^{T}=\left(\frac{n_{1}}{L},\frac{n_{2}}{L},\cdots,\frac{n_{20}}{L}\right)^{T}$

where $n_{i}$ is the number of occurrences of the i-th amino acid in the protein sequence and L is the length of the sequence.

The second section is transform. According to the physicochemical properties of a protein, 20 AAs can be divided into 3 different groups. The proportion of each group in the protein sequences is calculated. We use the secondary structure as an example:

(2) $C_{j}=\frac{\operatorname{count}_{D_{i}}}{L}(i=1,2,3;\mathrm{j}=1,2,\cdots,8)$

where $\operatorname{count}_{D_{i}}$ is the number of amino acids in the sequence that belong to group $D_{i}$ and L is the length of the sequence. In this study, we considered eight physicochemical properties [56]. For each physicochemical property, the amino acids are divided into three groups. Therefore, we can obtain 24D features.

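The third section is distribution [55, 57]. We calculate the distribution at five positions: at the beginning, 25%, 50%, 75% and the end. We obtained 120D features in this section (8 properties × 3 groups × 5 positions). Then, we considered the number of dipeptides whose two amino acids belong to different groups. From this step, we obtained another 24D features (8 properties × 3 pairs of groups). Together with the 20D composition and the 24D group proportions, this gives 20 + 24 + 120 + 24 = 188 features. The formulas are as follows.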

(3) $T_{i,j}=\frac{D_{i}D_{j}\ \text{or}\ D_{j}D_{i}}{L-1},\quad(i,j)\in\{(1,2),(2,3),(3,1)\}$

(4) $D=\frac{H_{ij}}{L},\quad j\in\{\text{beginning},25\%,50\%,75\%,\text{end}\};\ i=1,2,3$

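In Eq. (3), the numerator denotes the number of dipeptides formed by one amino acid from group $D_{i}$ and one from group $D_{j}$, in either order. In Eq. (4), $H_{ij}$ is the chain length at which the first, 25%, 50%, 75% and last amino acid of group i is located.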

This method divides amino acids into three types according to different physical and chemical properties, and considers the position information of three different types of amino acids under different physical and chemical properties. This method comprehensively considers the location information and physical and chemical properties of amino acids. The method is simple and easy to understand. Although the method uses physical and chemical properties, it still mainly focuses on the position information of amino acids, and the physical and chemical properties are not further reflected.
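The three parts described above can be sketched in a few lines of Python for a single physicochemical property. This is only an illustrative sketch: the three-way grouping in `GROUPS` is an assumption made for the example (the groupings actually used for the eight properties are not listed here), and the way the 25%, 50% and 75% positions are rounded is also an assumed convention.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed three-way grouping for ONE property (illustrative only).
GROUPS = {1: set("RKEDQN"), 2: set("GASTPHY"), 3: set("CLVIMFW")}


def aa_composition(seq):
    """Eq. (1): frequency n_i / L of each of the 20 amino acids."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def ctd_one_property(seq, groups=GROUPS):
    """Return 3 group proportions, 3 transition values and 15 distribution values."""
    L = len(seq)
    # Map every residue to its group label (fails on non-standard letters).
    labels = [next(g for g, members in groups.items() if aa in members) for aa in seq]

    # Eq. (2): proportion of residues in each group.
    composition = [labels.count(g) / L for g in (1, 2, 3)]

    # Eq. (3): adjacent pairs that join two different groups, in either order.
    pairs = list(zip(labels, labels[1:]))
    transition = [
        sum(1 for a, b in pairs if {a, b} == {i, j}) / (L - 1)
        for i, j in ((1, 2), (2, 3), (3, 1))
    ]

    # Eq. (4): position of the first, 25%, 50%, 75% and last residue of each
    # group, divided by L (0.0 if the group does not occur; assumed convention).
    distribution = []
    for g in (1, 2, 3):
        positions = [k + 1 for k, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if positions:
                idx = max(0, math.ceil(frac * len(positions)) - 1)
                distribution.append(positions[idx] / L)
            else:
                distribution.append(0.0)
    return composition + transition + distribution
```

Under these assumptions, one property contributes 3 + 3 + 15 = 21 values, so eight properties plus the 20 composition values give 20 + 8 × 21 = 188 features.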

3.2.2 G-gap

G-gap dipeptide composition is a method used to describe the composition of dipeptides in protein sequences. In this study, we used an enhanced method proposed by Huan et al. [56], who added a pseudo amino acid composition to the g-gap. Thus, each protein sequence is converted into a $(400+n\lambda)$-dimensional vector [56, 58, 59].

(5) $FV_{g-gap}=\left[x_{1},x_{2},\cdots,x_{400},x_{400+1},\cdots,x_{400+n\lambda}\right]^{T}$

(6) $x_{u}=\left\{\begin{array}{ll}\frac{f_{u}}{\sum_{i=1}^{400}f_{i}+\omega\sum_{j=1}^{n\lambda}\tau_{j}}, & 1\leq u\leq 400\\ \frac{\omega\tau_{u-400}}{\sum_{i=1}^{400}f_{i}+\omega\sum_{j=1}^{n\lambda}\tau_{j}}, & 400+1\leq u\leq 400+n\lambda\end{array}\right.$

(7) $f_{u}=\frac{n_{u}^{g}}{L-g-1}$

where $n_{u}^{g}$ is the number of occurrences of the u-th g-gap dipeptide, L is the length of the sequence and $\omega$ is the weight. $\lambda$ is the total number of counted ranks (tiers) of the correlations along a protein sequence, and n is the number of physicochemical properties used in this study. $\tau_{i}$ is the i-th tier correlation factor, which reflects the sequence-order correlation between all the i-th most contiguous dipeptides along a protein sequence. In this study, n is 9 and g is 2. Finally, we obtained 670D features, which by Eq. (5) corresponds to $n\lambda=270$, i.e., $\lambda=30$.

This feature extraction method is based on two amino acids. This method first considers the position of each amino acid in the sequence, and fuses the physical and chemical properties of the amino acid through the pseudo amino acid composition.
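As a minimal sketch of the dipeptide part of this representation, the following Python function computes the 400 frequencies $f_{u}$ of Eq. (7). It deliberately omits the pseudo amino acid terms $\tau$ and the normalization of Eq. (6), and the function name and the alphabetical ordering of the dipeptides are choices made for this example only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of amino acids, in a fixed (alphabetical) order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(seq, g=2):
    """Return the 400 frequencies f_u = n_u^g / (L - g - 1) of Eq. (7)."""
    n_pairs = len(seq) - g - 1  # number of residue pairs separated by g residues
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this gap")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        # Pair residue i with the residue g positions further along.
        counts[seq[i] + seq[i + g + 1]] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]
```

For example, with g = 2 the sequence `ACDEFGHIK` contains the six gapped pairs AE, CF, DG, EH, FI and GK, so each of them receives a frequency of 1/6. Sequences with letters outside the 20 standard amino acids raise a `KeyError` in this sketch, which is in line with removing such sequences beforehand.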

3.2.3 Adaptive Skip Gram Features

Adaptive Skip Gram Features (hereafter referred to as 400D) can extract 400 features. This method extracts features according to the distance between amino acids [55, 60, 61]. We assume a given protein sequence P. P is expressed as follows.

(8) $P=A_{1}A_{2}A_{3}\cdots A_{L}$

$\mathrm{DT}\left(A_{i},A_{j}\right)$ is the distance between amino acids.

(9) $\mathrm{DT}\left(A_{i},A_{j}\right)=j-i-1$

where i and j (i < j) are the positions of the two amino acids. According to this definition, if two amino acids are adjacent, the distance is 0, and the maximum distance between amino acids is L-2, where L is the length of the sequence. The k-skip-n-gram algorithm counts the frequency of occurrence of every subsequence of n amino acids whose neighboring residues are at most a distance k apart, expressed as follows:

(10) $FV_{\text{skipgram}}=\left\{\frac{N\left(a_{m_{1}}a_{m_{2}}\cdots a_{m_{n}}\right)}{N\left(T_{\text{skipgram}}\right)}\;\middle|\;1\leq m_{1}\leq 20,\cdots,1\leq m_{n}\leq 20\right\}$

(11) $T_{\text{skipgram}}=\left\{\bigcup\nolimits_{a=0}^{k}Skip(DT=a)\;\middle|\;a=0,1,2,\cdots,k;\ k\leq\frac{L^{\min}}{n-1}\right\}$

$T_{\text{skipgram}}$ represents the set of subsequences composed of n amino acids in the sequence. $N\left(T_{\text{skipgram}}\right)$ represents the number of elements in $T_{\text{skipgram}}$. $N\left(a_{m_{1}}a_{m_{2}}\cdots a_{m_{n}}\right)$ represents the number of occurrences of the subsequence $a_{m_{1}}a_{m_{2}}\cdots a_{m_{n}}$ in $T_{\text{skipgram}}$. The dimension of $FV_{\text{skipgram}}$ is $20^{n}$. Because a suitable value of k differs from sequence to sequence, Wei et al. [61] proposed the adaptive skip-n-gram model, which removes the need to fix k in advance. The value of k adapts to the length of the sample sequence, so the features contain more distance information and the model has no free parameter, which helps to avoid overfitting. In this study, n is 2, so we obtained 400D features using this method.

This method is derived from the n-gram model in natural language processing. It mainly considers whether each amino acid or each polypeptide appears in the sequence and how often it appears. In addition, this method takes into account the distance between amino acids.
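For n = 2, Eqs. (9) to (11) can be illustrated with the short Python sketch below. The function name and the `k=None` option are assumptions for this example: `k=None` simply lets the maximum distance grow to L-2, which is one possible reading of "adaptive" and not necessarily the exact rule of Wei et al. [61].

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def skip_bigram_features(seq, k=None):
    """400 frequencies of ordered residue pairs whose distance DT is at most k.

    DT(A_i, A_j) = j - i - 1 as in Eq. (9). With k=None the maximum distance
    is L - 2, i.e. every ordered pair in the sequence is counted (assumption).
    """
    L = len(seq)
    max_gap = L - 2 if k is None else min(k, L - 2)
    counts = dict.fromkeys(PAIRS, 0)
    total = 0  # plays the role of N(T_skipgram) in Eq. (10)
    for gap in range(max_gap + 1):      # DT = 0, 1, ..., max_gap
        for i in range(L - gap - 1):    # second residue at i + gap + 1
            counts[seq[i] + seq[i + gap + 1]] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence must contain at least two residues")
    return [counts[p] / total for p in PAIRS]
```

For the toy sequence `ACD`, `k=0` counts only the adjacent pairs AC and CD, whereas `k=None` additionally counts AD, so each of the three pairs receives a frequency of 1/3.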

3.3 Classifier

3.3.1 Logistic regression

Logistic regression (LR) [62, 63] is a log-linear model in which the conditional probability distribution $P(Y|X)$ takes the form of a parametric logistic distribution.

(12) $\mathrm{P}(Y=1\mid x)=\frac{\exp(\omega\cdot x)}{1+\exp(\omega\cdot x)}$

(13) $\mathrm{P}(Y=0\mid x)=\frac{1}{1+\exp(\omega\cdot x)}$

where $x\in R^{n}$ is the feature vector, $Y\in\{0,1\}$ is the class, and $\omega$ is the weight vector. For a given sample, both probabilities are calculated, and the sample is assigned to the class with the larger probability.

In this study, we used two kinds of logistic regression models. One is the model shown above. Because LR is a linear classifier, kernel logistic regression (KLR) [64] was proposed to classify nonlinear data; this is the second model. KLR uses kernel functions to project the features into a high-dimensional space.
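Eqs. (12) and (13) can be evaluated directly once a weight vector is available. The sketch below assumes that the weights have already been learned (training is not shown) and that any bias term is folded into the weight vector; the function names are illustrative.

```python
import math


def lr_probabilities(w, x):
    """Return (P(Y=1|x), P(Y=0|x)) following Eqs. (12) and (13)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Two algebraically equal forms are used to avoid overflow in exp().
    if z >= 0:
        p1 = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        p1 = e / (1.0 + e)
    return p1, 1.0 - p1


def lr_predict(w, x):
    """Assign the class with the larger probability."""
    p1, p0 = lr_probabilities(w, x)
    return 1 if p1 >= p0 else 0
```

A kernel version would replace the inner product with a kernel expansion over the training samples; that variant is not sketched here.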

3.3.2 Decision tree

Two kinds of decision trees are used in this study. One is C4.5 [12, 65], and the other is PART [66], which is based on C4.5.

(1) C4.5. This model uses the tree structure to describe the process of classification. C4.5 contains nodes and directed edges. There are two kinds of nodes, internal nodes and leaf nodes. Internal nodes indicate attributes, which are the conditional basis of classification. Leaf nodes are classes, which are the labels of samples. The process of C4.5 classifying instances is that C4.5 arranges samples from the root node to a leaf node. Feature selection is based on the information gain ratio.

(2) PART. PART, proposed by Eibe Frank et al. [67] in 1998, is based on C4.5 and partial decision trees. PART extracts rules from a dataset according to an incomplete decision tree. The algorithm follows the separate-and-conquer strategy: it creates a rule, removes the instances covered by that rule, and then repeats the process on the remaining instances until none are left.

3.3.3 Naïve Bayes

Naïve Bayes (NB) [23, 53] is a classifier based on Bayes’ theorem and the assumption that the features are conditionally independent given the class. The algorithm first learns the joint probability distribution of the input and output under this independence assumption; then, for a given input x, it uses Bayes’ theorem to find the output y with the maximum posterior probability. Naïve Bayes is a common classification method that is easy to implement [23, 66].

3.3.4 K-Nearest Neighbor

K-Nearest Neighbor (KNN) [68, 69] is a basic classification and regression method. In this study, we only discuss the application of KNN in classification problems. KNN contains three important factors: the selection of k values, distance functions and decision rules. The algorithm for KNN is as follows:

(1) According to the chosen distance function, we find the k points in the training set that are closest to instance x. The neighborhood of x that covers these k points is called $N_{k}(x)$;

(2) In $N_{k}(x)$ , the class of x is determined according to the classification decision rule.

(14) $y=\arg\max_{c_{j}}\sum_{x_{i}\in N_{k}(x)}I\left(y_{i}=c_{j}\right),\quad i=1,2,3,\ldots,N;\ j=1,2,3,\ldots,K$

where $x_{i}$ is the feature vector of the i-th training sample, $y_{i}\in\{c_{1},c_{2},\ldots,c_{K}\}$ is its label, and $I(\cdot)$ is the indicator function.
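The two steps and Eq. (14) can be written compactly in Python. This sketch assumes the Euclidean distance and simple majority voting; the distance function and the value of k used in the experiments are not specified here, so both are illustrative defaults.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Eq. (14))."""
    # Step (1): indices of the k nearest neighbours, N_k(x).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    neighbours = order[:k]
    # Step (2): the label with the most votes in N_k(x).
    votes = Counter(train_y[i] for i in neighbours)
    return votes.most_common(1)[0][0]
```

With training points `(0, 0)` and `(0, 1)` labelled 0 and `(5, 5)` and `(6, 5)` labelled 1, a query at `(0.2, 0.1)` with k = 3 has two neighbours of class 0 and one of class 1, so it is assigned to class 0.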

3.4 Ensemble method

The ensemble method uses many kinds of methods to process a dataset, which may obtain superior results [7, 70, 71, 72, 73, 74, 75, 76, 77]. Different methods have different emphases on data processing. We combined different methods to improve the classification efficiency. In this study, we mixed feature extraction methods and used ensemble classifiers to improve their performances.

3.4.1 Ensemble feature extraction methods

We used the 188D, g-gap and 400D to extract features from sequences. These methods have different emphases. 188D contains the amino acid composition and physical and chemical properties. G-gap contains the dipeptide composition, which can indicate the importance of the peptide chain. Since proteins are produced by the distortion and folding of peptide chains, dipeptides can better recognize proteins. G-gap adds nine kinds of physical and chemical properties to improve its accuracy. 400D features take into account the distance between amino acids. Therefore, it is meaningful to combine these three methods to more comprehensively extract features and improve accuracy.

In this study, we used four combination methods. First, we combined 188D and g-gap and obtained 858D features. 188D divides amino acids into three groups and studies the physical and chemical properties of each type of amino acid. However, g-gap considers the properties of each amino acid.

Second, we combined 188D and 400D and obtained 588D features. Since 400D does not consider the physicochemical properties of amino acids, the combination of 188D and 400D extracts features based on the AA composition and physicochemical properties.

Third, we combined g-gap and 400D and obtained 1070D features. 400D is different from g-gap. The dipeptides in g-gap are adjacent, while in 400D, two amino acids can be separated by several amino acids, that is, the two amino acids considered are not adjacent. G-gap focuses on the composition of dipeptides in the protein sequence, and 400D focuses on the relative position of the amino acids.

Fourth, we combined 188D, g-gap and 400D and obtained 1258D features. 188D considers the frequency of occurrence of a single amino acid and the amino acid’s position according to its physical and chemical properties, which are different from the other two methods. G-gap focuses on the composition of dipeptides in the protein sequence, which is different from the other two methods. 400D focuses on the distance between two amino acids, which is different from the other methods.

Ensemble feature extraction methods allow the different methods to compensate for one another’s shortcomings and thus yield a more comprehensive set of features. By combining the feature extraction methods, we obtained four different sets of feature vectors.
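Combining feature extraction methods amounts to concatenating the per-sequence feature vectors. The sketch below uses zero-filled placeholder vectors of the stated sizes (188, 670 and 400) only to show how the four combined dimensionalities arise; the helper name `combine` is hypothetical.

```python
def combine(*blocks):
    """Concatenate the feature vectors that several methods give for one sequence."""
    return [value for block in blocks for value in block]


# Placeholder vectors with the dimensionalities of the three single methods.
f188 = [0.0] * 188   # composition and physicochemical features
ggap = [0.0] * 670   # g-gap with pseudo amino acid composition
f400 = [0.0] * 400   # adaptive skip gram features

combined = {
    "858D": combine(f188, ggap),
    "588D": combine(f188, f400),
    "1070D": combine(ggap, f400),
    "1258D": combine(f188, ggap, f400),
}
```

The order of concatenation does not affect the dimensionality, but it should be kept the same for every sequence so that each column always refers to the same feature.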

3.4.2 Ensemble classifiers

Ensemble classifiers can organically combine multiple traditional learning models to obtain more stable and accurate results. There are three common ensemble learning algorithms, including bagging, boosting and stacking.

In this study, we used random forest and vote as the ensemble learning methods to classify proteins.

(1) Random Forest (RF) [78, 79, 80, 81, 82]. Random forest is an extension of bagging. Leo Breiman proposed RF. RF is composed of many decision trees, with no correlation between different decision trees. When we need to classify a sample, each decision tree in the forest makes a judgment and classification. The final result is the class with the highest number of votes.

(2) Vote. This method classifies the sample according to the seven classifiers. First, we use LR, KLR, NB, DT, PART, RF and KNN to classify protein sequences. We obtain seven results, and we use majority voting to obtain the final result. If a label receives more than half the votes, the prediction is that label; otherwise, the prediction is rejected.

(15) $\mathrm{H}(\boldsymbol{x})=\left\{\begin{array}{ll}c_{j}, & \text{if } \sum\limits_{i=1}^{T}h_{i}^{j}(\boldsymbol{x})>0.5\sum\limits_{k=1}^{N}\sum\limits_{i=1}^{T}h_{i}^{k}(\boldsymbol{x})\\ \text{reject}, & \text{otherwise}\end{array}\right.$

where $h_{i}^{j}(x)$ is the output of classifier $h_{i}$ on the label $c_{j}$ . T represents the number of base classifiers, and N represents the number of labels.
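The voting rule of Eq. (15) can be sketched as follows, assuming that every base classifier outputs a single crisp label, so that $h_{i}^{j}(x)$ is 1 for the predicted label and 0 otherwise and the double sum equals T. Returning `None` to signal a rejected prediction is a convention chosen for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label with more than half of the votes, or None (reject)."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes > 0.5 * len(predictions) else None
```

With seven base classifiers and two classes, a label needs at least four votes; for instance, the votes `[1, 1, 1, 0, 0, 1, 0]` yield label 1. A rejection can only occur when no label exceeds half of the votes, for example with an even number of voters or more than two classes.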

3.5 Measurement

In this study, we used accuracy (ACC), the Matthews correlation coefficient (MCC), F-Measure and the area under the receiver operating characteristic curve (AUC) to measure classifier efficacy [83, 84]. The formulas are as follows:

(16) $\mathrm{ACC}=\frac{TP+TN}{TN+TP+FN+FP}$

(17) $\mathrm{MCC}=\frac{(TP\times TN)-(FN\times FP)}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}}$

(18) $\mathrm{F}1=\frac{2\times\frac{TP}{TP+FP}\times\frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}$

where TP represents the number of correct classifications in the positive dataset. TN is the number of correct classifications in the negative dataset. FN is the number of false negatives. FP is the number of false positives.
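Eqs. (16) to (18) can be computed from the four confusion-matrix counts as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed for this example; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """ACC, MCC and F1 from confusion-matrix counts (Eqs. (16)-(18))."""
    acc = (tp + tn) / (tp + tn + fp + fn)

    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"ACC": acc, "MCC": mcc, "F1": f1}
```

For example, with TP = 40, TN = 45, FP = 5 and FN = 10, the accuracy is 85/100 = 0.85 and F1 simplifies to 2TP/(2TP + FP + FN) = 80/95.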

4. Results and discussion

Because the dataset was imbalanced, we randomly extracted 5 sets of negative samples and averaged the results of the 5 experiments using these sets. Each experiment was subjected to 10-fold cross-validation: the dataset was divided into 10 groups, and in each round nine groups were used to train the model and the remaining group was used to test it.
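The evaluation protocol can be outlined in Python as below. This is a sketch under simplifying assumptions: negatives are sampled uniformly at random in equal number to the positives (the sampling by sequence length described in Section 3.1 is not reproduced), and `evaluate` is a hypothetical callback that trains a model on the training part and returns one score for the test part.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def repeated_balanced_cv(positives, negatives, evaluate, n_repeats=5, k=10, seed=0):
    """Average a k-fold score over several randomly drawn negative sets."""
    rng = random.Random(seed)
    repeat_scores = []
    for _ in range(n_repeats):
        sampled = rng.sample(negatives, len(positives))  # balanced negative set
        data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
        fold_scores = [
            evaluate([data[i] for i in train], [data[i] for i in test])
            for train, test in k_fold_indices(len(data), k, seed)
        ]
        repeat_scores.append(sum(fold_scores) / len(fold_scores))
    return sum(repeat_scores) / len(repeat_scores)
```

Any of the classifiers and measures described above could be plugged in through `evaluate`, for example a function that fits a model on the training pairs and returns the AUC on the test pairs.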

4.1 Using the single feature extraction method and a single classifier

To evaluate the ensemble methods, we first used each single feature extraction method with each traditional classifier to predict proteins. When 188D was used to extract features from the protein sequences, the differences between the six classifiers were relatively small: the best AUC was 0.81 (KLR) and the worst was 0.70 (KNN). When 400D was used, the best classifier was NB, with an AUC of 0.77, and the worst was DT, with an AUC of 0.66. For g-gap, the best AUC was 0.74 (KLR) and the worst was 0.52 (NB). The detailed classification results are shown in Fig. 2. Overall, KLR performed best. By using kernel functions, KLR maps the features into a high-dimensional space, which allows it to handle nonlinear problems. NB, which is the best classifier for 400D but the worst for g-gap, assumes that the features are independent and identically distributed, so its performance fluctuates greatly across feature extraction methods.

Above, we evaluated the classifiers according to the ROC curve and AUC. Next, we evaluated the feature extraction methods, as shown in Fig. 3, which allows a more direct comparison. According to Fig. 3, 188D has the best performance with five of the six classifiers, the exception being NB. With NB, 400D performs best among the three feature extraction methods. The performance of NB suggests that the features obtained using 400D are the most independent among the three feature extraction methods. With DT and PART, 188D has the best performance, which may indicate that the features extracted by 188D contain more effective information for classifying diabetic protein markers. All of the experimental results using the single methods are shown in Appendix Table 5.

Fig. 2.

Performance comparison of the single feature extraction methods. (A) ROC curves of 188D with the six classifiers. (B) ROC curves of g-gap with the six classifiers. (C) ROC curves of the adaptive skip gram features with the six classifiers.

Fig. 3.

Results of using a single feature extraction method with a traditional classifier. The feature extraction methods are compared with the classifier held fixed.

4.2 Comparison of the ensemble methods with single methods

4.2.1 Ensemble feature extraction methods outperform the single methods

In this section, we compare the joint features with the single ones. We combined the three feature extraction methods to obtain four joint feature sets: 588D (188D with 400D), 858D (188D with g-gap), 1070D (400D with g-gap) and 1258D (188D with 1070D, i.e., all three methods). As before, we used six classifiers for prediction.
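The dimensions of the joint feature sets follow directly from the sizes of the single ones (188, 400 and 670 for g-gap). The sketch below assumes that "combining" means concatenating the per-protein feature vectors, which is consistent with the dimensions adding up but is not stated explicitly; `combine_features` is a hypothetical helper, and the zero vectors are placeholders rather than real features.

```python
def combine_features(*blocks):
    """Concatenate per-protein feature vectors from several extractors."""
    n_proteins = len(blocks[0])
    if any(len(block) != n_proteins for block in blocks):
        raise ValueError("every block must describe the same proteins")
    combined = []
    for i in range(n_proteins):
        row = []
        for block in blocks:
            row.extend(block[i])
        combined.append(row)
    return combined

# Placeholder features for two proteins (zeros stand in for real values).
f188 = [[0.0] * 188 for _ in range(2)]
f400 = [[0.0] * 400 for _ in range(2)]
fgap = [[0.0] * 670 for _ in range(2)]

for name, parts in [("188D + 400D", (f188, f400)),
                    ("188D + g-gap", (f188, fgap)),
                    ("400D + g-gap", (f400, fgap)),
                    ("188D + 400D + g-gap", (f188, f400, fgap))]:
    print(name, len(combine_features(*parts)[0]))
```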

First, we conducted the classification experiment on 588D. To make the comparison clearer, we plotted the DT, NB and PART results as histograms, shown in Fig. 4. We selected these three classifiers because each gives the best result for one of the feature sets: DT for 188D, NB for 400D, and PART for 588D. The experimental results of the other classifiers are shown in Appendix Table 6. According to Fig. 4, 588D is the best of the three feature extraction methods with NB and PART, whereas 188D is best with DT. 588D improves accuracy for most classifiers, but it is worse than 188D when the classifier is DT (ACC 0.7476 vs. 0.7605) or LR (ACC 0.6440 vs. 0.7443).

Fig. 4.

Results of combining 188D and 400D. Three classifiers are selected for comparison. (A) With DT, 188D has the best performance. (B) NB is the best classifier for 400D, but 588D still outperforms 400D with NB. (C) With PART, 588D has the best performance.

Similarly, we created histograms for the remaining three combined methods, shown in Fig. 5. According to Fig. 5, accuracy generally improved, but only by a small margin. In particular, 1258D showed little improvement over 588D and 1070D. However, with NB, 1070D performed worse than 400D, potentially because combining features increases the correlation between them and reduces feature independence. We therefore applied Max-Relevance-Max-Distance (MRMD) to reduce the dimensionality, which may improve accuracy. MRMD uses the Pearson correlation coefficient (PCC) to measure the relevance between each feature and the class label, and the Euclidean distance to identify redundancy between features. The results are shown in Appendix Table 7. Compared to the results without dimensionality reduction, classifier performance improved in some cases and declined in others. When the classifier was LR, DT or NB, the results improved. Compared to the single methods, overall accuracy improved. Therefore, the ensemble feature extraction method is better than the single feature extraction method.
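The MRMD idea described above can be illustrated with a simplified sketch: each feature is scored by the absolute PCC between the feature and the class label (relevance) plus its mean Euclidean distance to the other features (a proxy for low redundancy), and features are then ranked by that score. The equal weighting of the two terms, the function names, and the assumption that features are already scaled to comparable ranges are all simplifications made here; this is not the MRMD implementation used in the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def euclidean(x, y):
    """Euclidean distance between two feature columns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mrmd_rank(columns, labels):
    """Rank feature indices by relevance + mean distance, highest score first."""
    m = len(columns)
    scores = []
    for i, col in enumerate(columns):
        relevance = abs(pearson(col, labels))
        others = [euclidean(col, columns[k]) for k in range(m) if k != i]
        distance = sum(others) / len(others) if others else 0.0
        scores.append(relevance + distance)
    return sorted(range(m), key=lambda i: scores[i], reverse=True)
```

Dimensionality reduction then amounts to keeping the top-ranked features, with the cut-off chosen by validation performance.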

Fig. 5.

Results of using the ensemble feature extraction methods with traditional classifiers. (A), (B) and (C) show the results of 188D, g-gap and 858D. (D), (E) and (F) show the results of 400D, g-gap and 1070D. (G) and (H) show the results of 588D, 858D, 1070D and 1258D.

4.2.2 Ensemble classifiers outperform single classifiers

In this section, we compare the ensemble classifiers with the traditional classifiers. We use RF and VOTE as the ensemble classifiers. RF is an ensemble classifier based on the bagging algorithm that uses DT as the base learner; it collects the predictions of all its trees and designates the label with the most votes as the final result. We also propose the VOTE method, in which seven classifiers classify each sample and the final result is obtained by majority voting. The seven classifiers are the six traditional classifiers used in the previous sections (DT, PART, KLR, LR, KNN and NB) plus RF. The results are shown in Fig. 6. Each subgraph contains three ROC curves: two for RF and VOTE, and one for the best-performing traditional classifier. Fig. 6 shows that the ensemble classifiers outperform the traditional classifiers. The best AUC was 0.90, obtained with RF on 188D. The worst AUC in this section was 0.70, obtained with PART on g-gap. According to Fig. 6, RF is superior to VOTE. The ensemble method outperforms the traditional classification methods because, by considering multiple classifiers together, it avoids the accidental errors of any single method. All experimental results are shown in Table 2.
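The majority-voting rule behind VOTE can be sketched as below. The function takes the label predictions of each base classifier and returns the most frequent label per sample. The name `majority_vote` and the toy predictions are illustrative assumptions; with seven voters and two classes a tie cannot occur, so no tie-breaking rule is needed in that setting.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every classifier must predict every sample")
    combined = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Toy predictions from seven hypothetical classifiers on three samples.
toy_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(majority_vote(toy_predictions))
```

The same counting step is what RF applies across its trees, the difference being that RF's voters are decision trees trained on bootstrap samples rather than heterogeneous classifiers.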

Fig. 6.

Performance comparison of ensemble classifiers and traditional classifier. (A) The ROC curve of the ensemble classifiers and DT on 188D. (B) The ROC curve of the ensemble classifiers and NB on g-gap. (C) The ROC curve of the ensemble classifiers and PART on 400D.

Table 2. The results of using ensemble classifiers with a single feature extraction method.

Method	Classifier	ACC	F_measure	MCC
188D	RF	0.8139	0.8255	0.6334
188D	VOTE	0.8042	0.8013	0.6087
400D	RF	0.7718	0.7800	0.5452
400D	VOTE	0.7573	0.7541	0.5147
g-gap	RF	0.7977	0.8109	0.6013
g-gap	VOTE	0.7686	0.7682	0.5372
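As a reminder of how the three columns in these tables relate to a confusion matrix, the sketch below computes accuracy, F-measure and the Matthews correlation coefficient from the counts of true/false positives and negatives. It assumes the standard definitions, with F-measure taken as the harmonic mean of precision and recall; the function name `binary_metrics` is illustrative.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (ACC, F-measure, MCC) from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, f_measure, mcc
```

Unlike ACC, MCC ranges from -1 to 1 and is 0 for random predictions, which explains why the MCC values in the tables are much lower than the corresponding ACC values.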

According to the results, ensemble classifiers are better than single classifiers. Ensemble classifiers are beneficial in three ways. First, because the learning task has a large hypothesis space, many hypotheses may achieve the same performance on the training set. A single classifier may therefore select the wrong hypothesis, resulting in poor generalization; ensemble classifiers reduce this risk. Second, ensemble classifiers reduce the risk of becoming trapped in a poor local minimum. Third, combining multiple classifiers expands the corresponding hypothesis space, making it possible to learn a better approximation.

4.3 Ensemble classifiers with a combined feature extraction method have the best performance

The above results show that, with traditional classifiers, the ensemble feature extraction method is better than a single method, and that, with a single feature extraction method, ensemble classifiers are better than traditional classifiers. Therefore, in this section, we discuss the performance obtained when ensemble classifiers are used together with ensemble feature extraction methods.

We created a histogram from the results of RF, VOTE and the traditional classifiers, as shown in Fig. 7 and Table 3. The traditional classifier shown is the one with the best performance among the six classifiers. According to Fig. 7, the ensemble classifiers were better than the traditional classifiers when combined feature extraction methods were used. RF was again better than VOTE, which agrees with the conclusion of Section 4.2.2. When we used MRMD, performance generally did not improve. The results are shown in Table 4.

Fig. 7.

Performance comparison of ensemble classifiers and traditional classifier. In this section, we used combined feature extraction methods.

Table 3. The results of using ensemble classifiers with combined feature extraction methods.

Method	Classifier	ACC	F_measure	MCC
188 + 400D	RF	0.8366	0.8463	0.6786
188 + 400D	VOTE	0.8269	0.8277	0.6538
188 + g-gap	RF	0.8172	0.8296	0.6411
188 + g-gap	VOTE	0.8074	0.8120	0.6156
400 + g-gap	RF	0.8269	0.8366	0.6585
400 + g-gap	VOTE	0.7864	0.7871	0.5728
188 + 400 + g-gap	RF	0.8317	0.8448	0.6730
188 + 400 + g-gap	VOTE	0.8285	0.8323	0.6576

Table 4. The results of using ensemble classifiers with combined feature extraction methods after dimensionality reduction.

Method	Classifier	ACC	F_measure	MCC
188 + 400D (reduced)	RF	0.8285	0.8374	0.6610
188 + 400D (reduced)	VOTE	0.8123	0.8135	0.6246
188 + g-gap (reduced)	RF	0.8155	0.8262	0.6359
188 + g-gap (reduced)	VOTE	0.7961	0.7886	0.5937
400 + g-gap (reduced)	RF	0.8139	0.8250	0.6329
400 + g-gap (reduced)	VOTE	0.7913	0.7923	0.5826
188 + 400 + g-gap (reduced)	RF	0.8333	0.8451	0.6745
188 + 400 + g-gap (reduced)	VOTE	0.8220	0.8308	0.6475

In terms of performance, the ensemble method is better than the single method. When the classifier is RF and the feature extraction method is 588D, the performance is the best among all the methods. The second best ensemble method is 1258D with RF. Therefore, we can use 588D and RF to build the prediction model. All of the experimental results are shown in Appendix Table 6.

5. Conclusions

Diabetes is a common chronic disease. If diabetes is not detected and treated in time, it can lead to serious complications. In this study, we conducted research on diabetic protein markers. By classifying proteins, we can determine whether there are diabetic protein markers in the human body that can be used to better diagnose diabetes.

In this study, we proposed using ensemble methods, comprising ensemble feature extraction methods and ensemble classifiers, to predict diabetic protein markers. We used three feature extraction methods and six traditional classifiers. We combined the three feature extraction methods to obtain four combined methods, and we used seven classifiers to form an ensemble learning method. To validate the performance of our ensemble classifiers, we evaluated them and compared them with the traditional classifiers using 10-fold cross-validation.

According to the results, the ensemble method is better than the single method. We compared the combined features with the existing single features, and the results showed that the combined feature extraction method was more effective. Performance was best when the feature set was 588D and the classifier was random forest. Therefore, 588D features and random forest can be used to construct a model for predicting diabetic protein markers, and such machine learning methods can predict protein function quickly. 188D divides amino acids into three groups and describes the physical and chemical properties of each type of amino acid. Because 400D does not consider the physicochemical properties of amino acids, combining it with 188D yields features that cover both amino acid composition and physicochemical properties. With these two feature extraction methods combined, protein sequences can be analyzed in terms of physical and chemical properties, amino acid positions, amino acid fragments, etc. The ensemble feature extraction methods can analyze the sequence comprehensively, and the ensemble machine learning method can avoid many problems, e.g., poor generalization ability. According to the results, the ensemble method, either the combined feature extraction method or the ensemble classifier, was better than the single method. We anticipate that ensemble methods will be a useful tool for identifying diabetic protein markers in a cost-effective manner.

In this study, we only predicted diabetic marker proteins; that is, the model predicts whether a protein is a diabetic marker. Due to a lack of relevant data, we cannot yet predict diabetes itself, and the marker proteins obtained here have not been used for diabetes prediction. Therefore, in the next step of our research, we will focus on using diabetic marker proteins to predict diabetes, which is more valuable in clinical applications.

6. Author contributions

KQ and HS implemented the experiments and drafted the manuscript. QZ and KQ initiated the idea, conceived the whole process, and finalized the paper. HS and QZ helped with data analysis and revised the manuscript. All authors have read and approved the final manuscript.

7. Ethics approval and consent to participate

Not applicable.

8. Acknowledgment

Not applicable.

9. Funding

The work was supported by the National Natural Science Foundation of China (No. 61922020, No. 61771331), and the Special Science Foundation of Quzhou (2020D003).

10. Conflict of interest

The authors declare no conflict of interest.

11. Appendix

See Tables 5, 6 and 7.

Table 5. The results of using three feature extraction methods.

Method	Classifier	ACC	F_measure	MCC
188D	DT	0.7605	0.7574	0.5212
188D	NB	0.6634	0.5873	0.3517
188D	LR	0.7443	0.7367	0.4895
188D	KLR	0.7362	0.7279	0.4734
188D	KNN	0.6957	0.6459	0.3955
188D	PART	0.7492	0.7488	0.4984
400D	LR	0.6019	0.5759	0.2054
400D	NB	0.7249	0.7543	0.4633
400D	DT	0.6667	0.6634	0.3334
400D	KLR	0.6845	0.6620	0.3722
400D	PART	0.6828	0.6689	0.3670
400D	KNN	0.6796	0.6387	0.3688
g-gap	DT	0.6812	0.6755	0.3627
g-gap	LR	0.6424	0.6061	0.2898
g-gap	NB	0.5259	0.6659	0.0949
g-gap	KNN	0.5793	0.3467	0.2258
g-gap	PART	0.7023	0.7070	0.4047
g-gap	KLR	0.7023	0.6761	0.4099

Table 6. The results of using ensemble feature extraction methods.

Method	Classifier	ACC	F_measure	MCC
188 + 400D	NB	0.7492	0.7496	0.4984
188 + 400D	LR	0.6440	0.6393	0.2881
188 + 400D	DT	0.7476	0.7383	0.4964
188 + 400D	KLR	0.7621	0.7570	0.5247
188 + 400D	PART	0.7864	0.7836	0.5730
188 + 400D	KNN	0.7136	0.6788	0.4376
188 + g-gap	LR	0.6197	0.6010	0.2405
188 + g-gap	NB	0.5340	0.6705	0.1214
188 + g-gap	KNN	0.6327	0.4829	0.3256
188 + g-gap	PART	0.7427	0.7381	0.4857
188 + g-gap	KLR	0.7654	0.7563	0.5322
188 + g-gap	DT	0.7621	0.7586	0.5245
400 + g-gap	NB	0.5858	0.6974	0.2541
400 + g-gap	LR	0.6068	0.5714	0.2166
400 + g-gap	DT	0.7071	0.6917	0.4163
400 + g-gap	KLR	0.7184	0.6990	0.4406
400 + g-gap	KNN	0.6489	0.5373	0.3399
400 + g-gap	PART	0.7104	0.7136	0.4208
188 + 400 + g-gap	DT	0.7654	0.7619	0.5310
188 + 400 + g-gap	LR	0.6036	0.5769	0.2088
188 + 400 + g-gap	NB	0.6133	0.7131	0.3154
188 + 400 + g-gap	PART	0.7945	0.7955	0.5890
188 + 400 + g-gap	KLR	0.7670	0.7592	0.5351
188 + 400 + g-gap	KNN	0.6570	0.5583	0.3508

Table 7. The results of using MRMD.

Method	Classifier	ACC	F_measure	MCC
188 + 400D (reduced)	NB	0.7573	0.7440	0.5173
188 + 400D (reduced)	LR	0.6780	0.6700	0.3564
188 + 400D (reduced)	KLR	0.7540	0.7556	0.5081
188 + 400D (reduced)	PART	0.7751	0.7702	0.5506
188 + 400D (reduced)	KNN	0.7136	0.6811	0.4363
188 + 400D (reduced)	DT	0.7621	0.7648	0.5244
188 + g-gap (reduced)	DT	0.7346	0.7320	0.4693
188 + g-gap (reduced)	LR	0.7282	0.7143	0.4585
188 + g-gap (reduced)	NB	0.6586	0.5772	0.3437
188 + g-gap (reduced)	KNN	0.6926	0.6520	0.3960
188 + g-gap (reduced)	PART	0.7249	0.7231	0.4499
188 + g-gap (reduced)	KLR	0.7330	0.7236	0.4671
400 + g-gap (reduced)	NB	0.5922	0.7000	0.2652
400 + g-gap (reduced)	LR	0.6133	0.5770	0.2299
400 + g-gap (reduced)	DT	0.7282	0.7191	0.4573
400 + g-gap (reduced)	KLR	0.7152	0.6966	0.4337
400 + g-gap (reduced)	KNN	0.6505	0.5365	0.3457
400 + g-gap (reduced)	PART	0.7071	0.7076	0.4142
188 + 400 + g-gap (reduced)	DT	0.7605	0.7613	0.5210
188 + 400 + g-gap (reduced)	LR	0.6683	0.6623	0.3368
188 + 400 + g-gap (reduced)	NB	0.7282	0.7742	0.4997
188 + 400 + g-gap (reduced)	PART	0.7913	0.7875	0.5829
188 + 400 + g-gap (reduced)	KLR	0.7443	0.7492	0.4890
188 + 400 + g-gap (reduced)	KNN	0.7168	0.6858	0.4424
