Sequence-Based Prediction with Feature Representation Learning and Biological Function Analysis of Channel Proteins

Background : Channel proteins are proteins that can transport molecules past the plasma membrane through free diffusion movement. Due to the cost of labor and experimental methods, developing a tool to identify channel proteins is necessary for biological research on channel proteins. Methods : 17 feature coding methods and four machine learning classifiers to generate 68-dimensional data probability features. Then, the two-step feature selection strategy was used to optimize the features


Introduction
Channel proteins are a type of cross plasma membrane that can transport molecules of appropriate size and charged molecules from one side of the plasma membrane to the other through free diffusion motion.Channel proteins can be monomer proteins or proteins composed of multiple subunits.They are rearranged through hydrophobic amino acid chains to form aqueous channels.They do not directly interact with small charged molecules, which can diffuse freely through the aqueous channels formed by the charged hydrophilic regions of membrane proteins in lipid bilayers.The transport of channel proteins possesses a selective function, so there are various channel proteins in the cell membrane.
With high mortality, cancer is one of the most catastrophic diseases causing millions of deaths worldwide every year [1][2][3].Therefore, research on the mechanism of cancer occurrence and development is still a research hotspot.However, although significant progress has been made in cancer research, there are still no good treatment strategies for cancer because the mechanism of cancer occurrence and development is too complex.Previous studies have suggested that abnormalities in channel proteins in some signaling pathways can promote the occurrence and development of cancer.For instance, chloride intracellular channel 1 (CLIC1) is a chloride channel protein.The upregulated expression of CLIC1 is positively related to cell proliferation, invasion, migration, and angiogenesis.Chloride intracellular channel 1 promotes the progression of oral squamous cell carcinoma, and its potential mechanism may be correlated with ITGαV and ITGβ1 regulation, resulting in the activation of MAPK/ERK and MAPK/p38 signaling pathways [4].Aquaporin-4 (AQP4) forms a heterotetramer composed of m23-AQP4 and m1-AQP4 subtypes on the plasma membrane.The isoform ratio controls the aggregation of AQP4 into the supramolecular structure, which is called the orthogonal particle array.Studies have shown that the aggregation/decomposition of AQP4 into OAP affects the biological characteristics of glioma cells, and the aggregation state of AQP4 may be an important determinant of glioma cell survival or death.The decomposition of AQP4 may enhance invasiveness, and the aggregation of AQP4 may activate the apoptotic pathway [5].HERG (human ether -à -go related gene) K + current realizes essential ionic functions in the heart.HERG channels affect the migration and growth of various types of tumor cells.Studies have shown that HERG1 channel proteins take part in the growth of small cell lung cancer (SCLC) cells [6].In rats and humans, voltage-gated Na + channels (VGSCs) are upregulated in prostate cancer (PCA), in which channel activity promotes cell invasiveness in vitro and metastasis in vivo [7].
Recent studies have shown that machine learningbased methods have been well developed, especially those related to effective feature representation algorithms [8][9][10].At present, various sequences based on feature descriptors are obtained from many studies [11][12][13].Taking various types of features to train classifiers is a simple method to build predictive models.A high feature dimension of integration will lead to dimension disaster, and simple integration will lead to information redundancy.One efficient way to reduce the dimension is feature selection [14].These problems will affect the prediction performance of the model.A more effective method is necessary to use feature information.In addition, most existing feature descriptors only use sequence information to build prediction models.This may not be enough to provide enough information to accurately distinguish between real CAPs and no-CAPs.Efficient computational identification tools are good choices; however, current research efforts in this area are lacking.Due to their efficiency and convenience, machine learning-based methods have been widely used in protein function prediction [12,[15][16][17][18][19][20][21].Therefore, it is desirable to research reliable and effective machine learning tools for CAP identification.
In this research, Metascape was applied for the enrichment and network analysis of the biological functions of channel proteins.CAPs-LGBM is the first software model that can classify proteins as CAPs or no-CAPs.We establish the first benchmark dataset composed of 914 CAPs and 914 non-CAPs, which is publicly available to ensure the reproducibility of the proposed predictor.On this basis, we studied the feature representation learning strategy that integrates the prediction probability information into the newly derived features.To improve the prediction performance, a two-step feature optimization protocol was adopted to manually select the optimal feature subset containing 16 information features.Based on the optimal 16-dimensional probability feature, a sequence-based CAP predictor CAPs-LGBM was constructed.The results suggest that the proposed prediction model has good recognition performance.The overall framework of CAPs-LGBM is shown in Fig. 1.The prediction of channel proteins is a novel work, and there is no previous research on it.The prediction accuracy of the model has a high accuracy in our research.Meanwhile, we also optimized the prediction algorithm and developed a user-friendly online server.The model is a quick and effective way to predict whether a protein is a channel protein or not.The website is http: //lab.malab.cn/~acy/CAPs-LGBM.It has the potential to promote future computing work in this field.

Datasets
To establish a reliable and robust CAP-LGBM predictor, a well-prepared dataset is essential.CAP and non-CAP protein sequences are composed of a positive and negative dataset for binary prediction model construction.CAP sequences were downloaded from the UniProt database (UniProtKB version 2021_03, https://www.UniProt.org/)[22]."channel protein and reviews: Yes" was applied as the keyword to search the protein sequences.A total of 18,375 protein sequences were obtained from the UniProt database, and 2105 channel proteins were selected according to the functional annotation as the selection criteria for positive samples.Negative samples were selected from the protein family database (Pfam, version 35.0, http://pfam.xfam.org/)[23].There are two principles for negative dataset selection: (Ⅰ) each negative sample is the longest sequence from different protein families; (II) samples from positive families will be removed.Finally, a negative dataset containing CAP protein sequences was established.CD-Hit (V4.8.1) [24] was applied to remove the sequence redundancy for the positive and negative samples to avoid pro-tein homology deviation with the threshold set to 0.8 [25].Then, 914 CAP protein sequences were selected for machine learning.There is enough protein sequence information for the machine learning algorithm to construct an optimal model for channel proteins.Then, CAP-positive and -negative samples were retained as the channel protein dataset.The original dataset is randomly divided into two subsets at a ratio of 8:2 (Table 1), of which 80% are training sets and the rest are test sets to verify the performance of CAP.The datasets described above are freely available at http://lab.malab.cn/~acy/CAPs-LGBM.

Feature Representation Learning
Sufficient feature information from the channel protein sequences is essential for accurate and reliable bioinformatics model construction [10,[26][27][28][29][30][31][32].Following previous research [9,11,[33][34][35][36][37][38], we used a feature representation learning protocol to predict and identify the CAPs.First, 17 feature coding algorithms were used to construct the initial feature pool to represent the protein sequence.There were three categories divided by the coding methods: (1) amino acid composition characteristics features, (2) based on the characteristics of physical and chemical properties, and (3) features based on sequence order.All of the above feature descriptors are defined by ilearn tools [39].In the second step, four common classifiers, namely, random forest (RF), XGBoost (XGBT) and SVM, were employed to train on the 17 descriptors to build the baseline prediction models.Each prediction model will provide both class information (predicted label) and probabilistic information (predicted confidence).In this work, we utilized the probabilistic information predicted by each model as a "feature".Probability information will be used as the "feature" of each model in our study, and a 68-dimensional probability feature vector will generate the prediction models (68 feature descriptors × 4 machine learning classifiers) [9].

Classifiers
Classifier choice plays an essential role in machine learning [40][41][42].Various machine-learning algorithms have been applied in machine learning methods [8,[43][44][45].Seventeen descriptors were trained by four common classifiers: random forest (RF), XGBoost (XGBT), and SVM to construct the model.As described in previous research [46][47][48], all four classifiers were derived from the scikitlearn package (version 0.24).Finally, a grid search was used to adjust hyperparameters for the classifiers, and the search range is provided in Supplementary Table 1.

Feature Selection
In this study, a two-step feature selection method was used to improve the feature representation ability and prediction performance of the models [49].First, the original feature set was ranked according to the classification importance score.Second, the SFS (sequential forward search) strategy was used to search the optimal feature subset from the feature list in the first step.Generally, feature selection methods are divided into packaging, filtering, and embedding methods [50,51].The light gradient boosting machine (LGBM) is a packaging method, and the LGBM model is obtained by inputting the training data, which are sorted according to the importance score of the features.In the SFS step, additional features are obtained in the first step according to the lower to higher rank, and reconstruct the prediction model with various features.The subset with the highest accuracy of the prediction model was determined as the optimal feature set.

Performance Measurement
The validation strategies of 10-fold cross validation (CV) and testing were used to evaluate the performance of the involved models [13,[52][53][54][55].The training dataset was randomly divided into 10 subsets of approximately the same size for 10-fold CV validation.The ratio of the training data and the validation dataset was 9:1.The performance of 10 test subsets was averaged, and the result is the overall performance of the 10-fold CV test.Thus, the proposed model is verified more strictly and compared fairly with other methods.
Furthermore, accuracy (Acc), sensitivity (SE), specificity (SP), and Matthew correlation coefficient (MCC) [8,56,57] are four common metrics in binary classification tasks.Receiver operating characteristic (ROC) curves were also used to provide intuitive performance comparisons.The area under the ROC curve (AUC) was calculated and used as the main quantitative index of overall performance.

Construction of 3D Structure and Phylogenetic Tree for CAPs
A phylogenetic tree of CAPs was constructed to analyze the evolutionary diversity of the proteins.CAP sequence alignment results were analyzed by muscle software (V3.8.1551) and used to construct a phylogenetic tree using IQ-TREE software (multicore version 1.6.12).The best fitting model for the phylogenetic tree was VT+R6 [58].The ultrafast bootstrap method was used for phylogenetic assessment, and 1000 replicates per method were chosen in this work [59][60][61].Tree file was visualized by the iTOL website (version 6, https://itol.embl.de/).Phyre2 software (version 2.0, http://www.sbg.bio.ic.ac.uk/phyre2/html/pag e.cgi?id=index) was applied for the 3D structure prediction of CAPs.The prediction results were visualized by PyMOL (version 2.5.1,DeLano Scientific LLC) software (https://pymol.org/2/).

Metascape Analysis
Metascape (version 3.5, https://metascape.org/gp/index.html#/main/step1), as an effective tool, can analyze multiple groups of data on multiple platforms [62].In our research, Metascape was used for enrichment analysis of CAPs.CAPs were carried out based on pathway enrichment and analysis of molecular function (MF), biological process (BP) and cell composition (CC) by the Metascape tool based on the two databases of Kyoto Encyclopedia of Genes and Genomes (KEGG, version 100.0) and Gene Ontology (GO).A network plot was constructed by a subset of enriched terms to capture the relationships between the terms.Similarity >0.3 and p values < 0.05 of the terms were connected for the edges and the 20 clusters with less than 15 terms per cluster.Based on the BioGRID, In-Web_IM, and OmniPath databases, a PPI (protein-protein interaction) network was constructed and visualized by Metascape.Cluster analysis was carried out by MCODE (minimum common oncology data elements) to identify the key clusters with default parameters in the PPI network.In addition, the significant function module selected was predicted with p < 0.05 significance using Metascape.

Phylogenetic Analysis and Structural Features of CAP Proteins
The results of the phylogenetic tree (Fig. 2) showed that 134 CAP genes were divided into five groups, and the length of branches indicated the genetic relationship of CAP sequences.Among the five groups, Groups I, III, and IV contained two subfamilies.The function of AQPs is as a water molecule transport protein, which belongs to the branch of the fifth family.Calcium channel protein belongs to group IVa, potassium voltage gated channel protein belongs to groups I and III, calcium activated potassium channel subunit belongs to IVa and IIIa, sodium channel protein belongs to Ia and IVa, chloride intracellular channel protein belongs to groups Ia and IIIa, potassium/sodium channel protein belongs to IIIb and IVa.The secondary structure of CAPs includes α-helices, β-pleated sheets, and random coils.Usually, an α-helix exists in the plasma membrane and belongs to the secondary structure of transmembrane proteins.In this study, a protein was selected from each subfamily, and the 3D structure of these proteins was constructed.The results show that all proteins had multiple α-helix proteins, indicating that the subcellular localization of most channel proteins belongs to the plasma membrane.The structure of channel proteins in Groups I and V was relatively simple, while the structure of channel proteins in groups II, III, and IV was relatively complex (Fig. 2).The composition of these structures is necessary for determining the function of channel proteins.

Detection and Enrichment Analysis of Related Genes
To identify the interaction and internal mechanism of CAP coexpressed genes, Metascape was applied to analyze the overlapping genes, enrichment, PPI network, and MCODE of CAPs and their coexpressed genes.First, the overlap analysis of coexpressed genes was divided into two methods.Specific overlapping genes were found in the coexpressed genes of CAPs of different species (Fig. 3A,B).According to the classification of channel proteins, only potassium channels and sodium channels have overlapping genes.Aquaporins and chloride channel proteins are specific molecules and ion channels, and there were no overlapping genes (Fig. 3C,D).
A total of 134 human channel proteins were screened from the positive case set for Metascape analysis.These genes were enriched based on the David database and Gene Ontology (GO).As shown in Fig. 3E, CAPs and their coexpressed genes were primarily enriched in potassium channels (R-HSA-1296071), stimuli-sensing channels (R-HSA-2672351), and regulation of membrane potential (GO:0042391).

Network of Enriched Terms and Screening of Hub Genes
To better understand the relationships between the terms, a network plot of enriched terms was constructed.As shown in Fig. 4A, there were 111 GO BP, 13 KEGG pathways, and 27 reaction gene sets.The data show that CAPs are enriched in potassium channels, regulation of membrane potential, stimuli sensing channels, protein complex oligomerization, and sodium ion transport.

Feature Selection Results
To establish an effective prediction model, the probability characteristics of possible redundant information need to be processed to avoid the waste of computing resources and affect the final classification effect.In this study, a two-step feature selection method was used to identify the optimal feature representation from the original feature set.Sixty-eight features were sorted by LGBM according to classification importance.Fig. 5A shows the ranking of the top 20 features, in which the importance decreases along the x-axis.The results show that the most important feature is APAAC-SVM, indicating that it is the most effective classification among all features, followed by CTDD_LGBM and CTDD-SVM.APAAC-SVM represents the probability features derived from the APAAC descriptor on the SVM classifier.To evaluate the prediction performance of the four classifiers (SVM, RF, LGBM, and XGBoost) used in the feature representation learning scheme, 10 CV experiments and 68-dimensional classification probability features were implemented, and 68 prediction models were obtained.See Supplementary Tables 2-5 for the results.The overall ACC performance is shown in Fig. 5B.In fact, the ACC curves of the four classifiers seem to have similar patterns.To clarify the discussion of the SFS method on the LGBM classifier, we also refer to the ACC results of the SVM, SFS, and LGBM classifiers in Fig. 5B.The results show that the SVM and LGBM classifiers are generally better than the SFS classifier.The ACC curves of the LGBM classifier increased steadily with the increase of the number of features, until the number of features was close to 16 with the Acc value of 92.34, and gradually tended towards a fluctuating plateau, this model was named M16-LGBM (Fig. 5B).The Acc curve of the SVM classifier increased steadily until it reached the maximum value of 92.34 when the feature number was 34, and the model was named M34-SVM (Fig. 5B).Considering other indicators and all the above findings, we choose the LGBM classifier to construct the predictor.We believe that only the subset of the first 16 features is optimal.To better understand the selected features, we further analyzed their composition.In fact, these 16 features are: APAAC_SVM, CTDD_LGBM, CTDD_SVM, PAAC_RF, CTDD_RF, CKSAAGP_SVM, CKSAAGP_XGboost, DDE_SVM, APAAC_LGBM, QSOrder_SVM, CTDD_XGboost, GAAC_SVM, APAAC_RF, Moran_XGboost, SOCN_SVM, and CTriad_SVM.CTDD and APAAC generate four and three features, respectively.The physicochemical information and amino acid composition information of CAPs have the strongest feature representation ability.

Comparison of the Individual Feature Descriptors and Class Features with Probability Features
To verify the effective representation ability of the probability features generated by the feature representation learning scheme, we compared the probability features M16-LGBM with class features M34-SVM and the sequence-based feature descriptors related to M16-LGBM.The 10-fold CV results of 10 sequence-based feature descriptors are summarized in Supplementary Table 6.We selected the M16-LGBM subset and compared them with the first three feature descriptors with the best performance, namely, DDE, QSOrder, and APAAC.Table 2 shows the results of all training sets with comparative characteristics.The overall performance of probability features was significantly improved compared with original individual features.For example, the ACC values of DDE, QSOrder, and APAAC were 88.09%, 87.34% and 88.30%, respectively, while the probability features M16-LGBM and M34-SVM had the same ACC values of 92.33%.In addition, the AUC, Sn, Sp, and MCC values of the probability features were all higher than those of the three feature descriptors.This shows that the probability information generated by the model can improve the prediction model performance.The ROC curve of comparative features is shown in Fig. 6F.We can clearly see that the AUC of probability features is the highest of all features.In conclusion, the observation results showed that the probability feature with a small dimension is more suitable for constructing our final predictor.The best performance value is highlighted in bold for clarification.
The t-distributed stochastic neighborhood embedding (t-SNE) algorithm [63] was applied to explain why our model-based probability features can effectively improve the prediction performance.As shown in Fig. 6A-C, in the feature space of APAAC, DDE, and QSOrder descriptors, many positive and negative samples are mixed together, indicating that their expression ability may not be enough to fully distinguish between real CAP and non-CAP samples.In contrast, in the two model-based feature spaces, most positive and negative samples are distributed in two significantly different clusters (Fig. 6D,E).This shows that compared with the descriptor based on individual sequences, CAPs and non-CAPs are easy to distinguish in the vector space of the probability feature model.Using our probability feature, most of the positive and negative samples are clearly separated, and only a few samples overlap in the middle region.This once again confirms that the probability feature is more effective and can reveal the potential difference pattern between CAPs and non-CAPs to improve the prediction performance.

Comparison Results of Different Ensemble Learning Methods
To verify the effectiveness of the feature representation learning method used in this paper, we explored different integrated learning schemes, including hard voting, soft voting, and stacking.We compared our CAPs-LGBM model building method with three traditional ensemble methods and evaluated their performance with 10-fold CV.Table 3 summarizes the performance results.Our CAPs-LGBM model building method was optimal in the training dataset, and similar results were found in the test dataset except for the Sn parameter.We observed that the proposed feature representation learning method CAPs-LGBM was significantly better than the traditional ensemble method.The results in Table 3 indicate that the prediction performance is significantly improved compared with other models (Supplementary Table 6).The feature representation learning method has the best overall performance among all comparison strategies.The ACC value on our CAP training set was 92.34%, which is 2.32%, 2.25%, and 3.69% higher than that of hard voting, soft voting, and stacking, respectively (Fig. 7A).To verify the robustness and practical applicability of CAPs-LGBM, we further compared these methods through independent testing.Similar to the training dataset, performance improvement was also observed on the test dataset.Compared with the other three ensemble predictors, the average performance of CAPs-LGBM on ACC and MCC is improved by 1.55% and 3.02%, respectively (Fig. 7B).In conclusion, our method provides satisfactory prediction results.This result shows that compared with other ensemble learning methods, feature representation learning in CAPs can make more effective use of the output of a single baseline model and help to distinguish CAPs from non-CAPs.

Case Study
In order to test whether our CAPs-LGBM toolkit can predict the protein sequence with unknown function in practical scenarios, two common proteins from UniProt database are downloaded and are using our CAPs-LGBM method for practical application prediction.The two common proteins are ubiquitin and actin, none of which are channel proteins.The sequences of these two proteins are input into our prediction toolkit CAPs-LGBM.The results showed that there are 7668 sequences of actin protein, and 701 sequences were predicted incorrectly, with an error rate of 9.1%, while among 19,757 ubiquitin sequences, 2005 The best performance value is highlighted in bold for clarification.sequences were predicted incorrectly, with a prediction error rate of 10.1%.Therefore, our prediction toolkit CAPs- LGBM can obtain accurate results relatively in channel protein prediction.The results of these sequences are particularly meaningful for further experimental validation.

Web Server Implementation
To facilitate the identification of CAPs by researchers, we built a user-friendly online web server named CAPs- LGBM, which is freely available at http://lab.malab.cn/~acy/CAPs-LGBM.To validate our findings, the benchmark dataset has been applied on the online server.A simple guideline was obtained to provide researchers on the use method for the CAPs-LGBM webserver.First, users need to put the query sequence in fasta format in the left input box and click the submit button.Finally, the results are displayed on the right result box.To restart a new task, a clear button or the resubmit button should be selected to clear the sequences in the input box.Finally, new query protein sequences were allowed to enter the input box.In addition, detailed instructions and example sequences in fasta format can be found on the web server interface.The home page provides links of the contact information of authors and relevant data to download.

Conclusions
Channel proteins are proteins that can transport molecules past the plasma membrane through free diffusion movement.They can be divided into ion channel proteins and aquaporins.Because it is generally located in the membrane system, the protein structure usually contains an αhelix.In this study, the samples were subjected to enrichment and network analysis.The subnetwork detected by the MCODE tool usually consisted of several single channel proteins.For example, MCODE 1 is a subnetwork construct with several potassium channel proteins, which are tetrameric ion channels existing in all cell types.Potassium channels control the resting membrane potential of neurons, help to regulate the action potential of the myocardium and release insulin by pancreatic cells.Cluster results suggested that passive transport by aquaporins (R-HSA-432047) is significant in humans and mice.Aquaporin is a six-channel transmembrane protein that forms channels on the cell membrane.Each monomer contains a central channel composed of two asparagine-prolinealanine motifs (NPA boxes), which determine water and/or solute selectivity.AQP0/MIP, AQP1/2/3/4/5/7/8/9, and AQP10 transport water inside and outside the cell through the osmotic gradient of the cell membrane.Four aquaporins (AQP7/9 and AQP10) conduct urea, four aquaporins (AQP3/7/9 and AQP10) conduct glycerol, and aquaporin AQP6 conducts anions, especially nitrates.AQP8 conducts water and ammonia.Knockout mice lacking AQP11 formed fatal cysts in the proximal renal tubules.Exogenous AQP12 was localized intracellularly and expressed only in pancreatic acinar cells.Therefore, the biological function of channel proteins is very important.Developing a tool to identify channel proteins is necessary for biological research on channel proteins.
We screened reliable and experimentally verified channel protein sequences as the dataset and established an effective prediction classifier on this basis.First, we selected 17 feature coding methods and four machine learning classifiers to generate 68-dimensional data probability features.Then, the two-step feature selection strategy was used to optimize the features, and the final prediction Model M16-LGBM was obtained on the 16-dimensional optimal feature vector.We made a comprehensive comparison between the proposed probability feature and the existing sequence-based feature description.The results showed that the probability features generated by the feature representation learning method had strong resolution and that the CAPs and non-CAPs were easier to separate.In addition, the proposed CAPs-LGBM was compared with three common ensemble learning strategies to verify the feature representation learning scheme.The 10-fold CV and the independent test showed that CAPs-LGBM was superior to other methods in CAP prediction.Finally, we also established a user-friendly online predictor to promote the use of relevant research communities.We expect that CAP-LGBM will contribute to the identification of CAPs, reveal their biological function mechanism, and accelerate pathological research related to CAPs.sition; DPC, dipeptide composition; PAAC, pseudo-amino acid composition; APAAC, amphiphilic pseudo-amino acid composition; DDE, dipeptide deviation from expected mean; ASDC, adaptive skip dinucleotide composition; CKSAAGP, composition of k-spaced amino acid group pairs; CTD, composition (C)-transition (T)-distribution (D) model; GAAPC, grouped amino acid peptide composition; CTriad, conjoint triad; QSOrder, quasi-sequence order.
Fig. 3C also indicted the potassium/sodium channel protein involved the transportation of potassium and sodium ions.The volume-regulated anion channel subunit (leucine rich repeat containing protein) belongs to the IIa group.

Fig. 3 .
Fig. 3. Overlaps and heatmap enrichment analysis of CAPs and their coexpressed genes.(A,B) Overlap circus plot among CAPs of different species.(C,D) Overlap circus plot among CAPs of different functions.(E) Heatmap of enriched terms among CAPs.

Fig. 4 .
Fig. 4. PPI network, enriched terms network, and MCODE analysis of CAPs.(A) Network of the enriched terms colored by cluster ID. (B) The protein-protein interaction (PPI) network for enrichment analysis of functions and pathways.(C) MCODE components identified from the PPI network among CAPs.

Fig. 5 .
Fig. 5. Feature selection of the 68-dimensional probability feature.(A) Classification importance score of the top 20 features.(B) Tenfold cross validation accuracy on the SVM, RF, XGBoost, and LGBM classifiers with feature numbers.