ProSE-Pero: Peroxisomal Protein Localization Identification Model Based on Self-Supervised Multi-Task Language Pre-Training Model

Background: Peroxisomes are membrane-bound organelles that contain one or more types of oxidative enzymes. Aberrant localization of peroxisomal proteins can contribute to the development of various diseases. To identify and localize peroxisomal proteins more accurately, we developed the ProSE-Pero model. Methods: We used three deep representation learning models to extract features from peroxisomal protein sequences and compared their performance. We balanced the dataset with SVMSMOTE, used the SHAP interpretation model to assess feature importance, and compared analysis of variance (ANOVA) and the light gradient boosting machine (LightGBM) as feature selection methods. We then trained and tested several traditional machine learning models and four deep learning models on a dataset of 160 peroxisomal proteins using tenfold cross-validation. Results: Our proposed ProSE-Pero model achieves high performance with a specificity (Sp) of 93.37%, a sensitivity (Sn) of 82.41%, an accuracy (Acc) of 95.77%, a Matthews correlation coefficient (MCC) of 0.8241, an F1 score of 0.8996, and an area under the curve (AUC) of 0.9818. Additionally, we extended our method to identify plant vacuole proteins and achieved an accuracy of 91.90% on the independent test set, which is approximately 5% higher than the latest iPVP-DRLF model. Conclusions: Our model surpasses the existing In-Pero model in peroxisomal protein localization and identification. Our study also shows that the pre-trained multi-task language model ProSE performs well at extracting features from protein sequences. Given its demonstrated validity and generalization, our model could be extended in future work to the localization and identification of proteins in other organelles, such as mitochondrial and Golgi proteins.


Introduction
Organelle proteins are a diverse group of proteins that are either bound to or distributed throughout different regions of the organelle [1]. Their presence is essential for the organelle to carry out a range of life-sustaining activities. Each organelle protein has a specific biological function that contributes to the overall functionality of the organelle [2]. Accurate identification of organelle protein types is crucial for researchers to gain a deeper understanding of their roles and to develop effective treatment strategies for diseases. Moreover, precise knowledge of the spatial distribution of organelle proteins is essential for their functional characterization. This knowledge has far-reaching implications for advancing our understanding of cell biology and developing targeted therapeutic interventions.
Most studies on identifying the localization of organelle proteins rely on machine-learning approaches. For instance, Zhou et al. [3] introduced a method for predicting Golgi protein types that integrates pseudo amino acid composition (PseAAC), dipeptide composition (DC), pseudo-position specific scoring matrix (PsePSSM), and encoding based on grouped weight (EBGW) to extract feature vectors. The authors employed the extreme gradient boosting (XGBoost) algorithm as a classifier and achieved an overall prediction accuracy of 92.1% in internal validation on the training set, surpassing existing state-of-the-art methods. However, when the model's generalization ability was evaluated on an independent test set, the accuracy dropped to 86.5%. This discrepancy suggests that the method's performance needs further improvement. Lv et al. [4] developed a Golgi protein classifier called rfGPT, which employs 2-gap dipeptide and split amino acid composition as feature vectors. The authors used the SMOTE technique to balance the dataset and analysis of variance (ANOVA) as the feature selection method, and then input the selected features into a random forest (RF) model. The independent test accuracy of rfGPT was 90.6%. While rfGPT is a practical tool that eliminates the need for position-specific scoring matrices and their derived features, its lower accuracy on the independent test set suggests that its feature fusion methodology needs further enhancement. In another study, Yu et al. [5] proposed SubMito-XGBoost, an XGBoost-based method for predicting protein submitochondrial type, using two training datasets, M317 and M983. SubMito-XGBoost achieved prediction accuracies of 97.7% and 98.9%, respectively, on these datasets, and 94.8% on an independent test set, M495. While SubMito-XGBoost improves the accuracy of protein submitochondrial prediction to some extent, there remains significant room for improvement in both prediction accuracy and algorithm efficiency. Numerous other studies have also investigated the identification of organelle proteins [6][7][8].
In this paper, we studied the localization and identification of peroxisomal proteins. Peroxisomes, also known as microbodies, are important organelles that are surrounded by a single membrane and contain one or more oxidases. Peroxisomes play an important role in regulating cellular immunity and in cancers characterized by metabolic abnormalities [9], including prostate cancer [10,11] and bladder cancer [12]. Human peroxisomal malfunction can result in diseases such as Alzheimer's disease and X-linked adrenoleukodystrophy (X-ALD) [13]. At present, these diseases are mainly treated with chemical drugs, such as anti-inflammatory and neuroprotective therapy, but in most cases these treatments cannot provide a permanent cure [14][15][16][17]. It is therefore very important to detect abnormalities and injuries in time, and accurate identification and localization of peroxisomal proteins plays an important role in the treatment of the corresponding diseases. However, the problem of localizing and recognizing peroxisomal proteins has received too little attention. At present, the only tool for the localization and identification of peroxisomal proteins is In-Pero, constructed by Anteghini et al. [18] in 2021. They used the deep learning embedding methods UniRep [19] and SeqVec [20] to extract features from peroxisomal protein sequences and compared four machine learning methods, namely logistic regression (LR), random forest (RF), support vector machine (SVM), and partial least squares discriminant analysis (PLS-DA). By combining five protein embedding methods, a cross-validation classification accuracy of 0.92 was ultimately achieved. This was the first work on this topic, and it provided a complete method and benchmark.
In this work, we proposed the ProSE-Pero model, which uses deep learning methods to locate and identify peroxisomal proteins for the first time. We used three deep representation learning models to extract features from peroxisomal protein sequences: SeqVec [20], which is based on the ELMo model; TAPE [21], which is based on the BERT model; and ProSE, which is based on a pre-trained multi-task language model [22]. To address the issue of imbalanced data, the SVMSMOTE technique was employed to balance the dataset. Furthermore, analysis of variance (ANOVA) [23] and a light gradient boosting machine (LightGBM) were used to select the most informative features from the extracted feature set, and these feature extraction and feature selection methods were compared. Finally, the selected features were applied to nine traditional machine learning methods and four deep learning methods. The overall flowchart of the ProSE-Pero model is shown in Fig. 1.

Peroxisomal Datasets
The selection of appropriate datasets is a crucial step in building a classification model and has a significant impact on the model's performance. In this study, we used the peroxisomal protein dataset constructed by Anteghini et al. [18] in 2021, which was obtained from the UniProtKB/SwissProt database (https://www.uniprot.org/) [24]. After filtering the data, CD-HIT [25] was applied for clustering with a sequence similarity threshold of 40%. The final dataset comprised 132 peroxisomal membrane protein sequences and 28 peroxisomal matrix protein sequences, resulting in an imbalanced dataset with a ratio of approximately 5:1 between the two classes. This observation underscores the importance of addressing class imbalance when training classification models.

Vacuole Datasets
For plant vacuole protein identification, we used the dataset collected by Yadav et al. [26] to train and test the model. Both the plant vacuole proteins (PVPs) and the non-PVPs come from the UniProtKB/SwissProt database [24]. The authors used the CD-HIT software to remove redundant samples by setting the sequence identity threshold to 60%. A total of 274 positive and 274 negative samples were initially obtained. Subsequently, a sequence identity threshold of 40% was applied, and 200 of the 274 PVPs were retained as positive samples for the training set, while the remaining PVPs were assigned as positive samples for the test set. The same numbers of negative samples, filtered at the same 40% identity threshold, were collected to construct balanced training and independent test datasets, as shown in Fig. 2.

Feature Extraction
In previous models, feature extraction was mainly based on composition features, position features, physicochemical properties, and similar descriptors. In recent years, as deep learning methods have matured, they have begun to be applied to sequence-based protein characterization tasks [27][28][29][30][31][32]. Natural language processing (NLP) has received increasing attention in the field of protein sequence analysis in bioinformatics [33]. To obtain a vector representation of a protein sequence, the sequence is treated as a sentence in which each amino acid or k-mer is treated as a word [34,35].
In this work, we followed the idea of transfer learning and used three feature extraction methods based on pre-trained NLP models: SeqVec, ProSE, and TAPE. These three methods are introduced below.
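The "sequence as sentence" idea described above can be illustrated with a minimal sketch. The function name `tokenize`, its parameters, and the example sequence are hypothetical and are not taken from the ProSE-Pero code; the sketch only shows how a sequence might be split into amino-acid or k-mer "words" before being passed to a language model.

```python
def tokenize(sequence, k=1, stride=1):
    """Split a protein sequence into k-mer "words" (hypothetical helper).

    With k=1 every amino acid is a word; with k>1 the words are
    overlapping k-mers taken every `stride` residues.
    """
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive")
    sequence = sequence.strip().upper()
    # Slide a window of width k over the sequence.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


example = "MKTAYIAK"  # made-up sequence, for illustration only
print(tokenize(example))       # single amino acids as words
print(tokenize(example, k=3))  # overlapping 3-mers as words
```

Whether single residues or k-mers are the better vocabulary depends on the language model; the three models used in this work operate on their own tokenization schemes.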

SeqVec
This feature extraction method utilizes the deep bidirectional model ELMo, commonly used in natural language processing (NLP), to represent protein sequences as continuous vectors known as embeddings. ELMo effectively captures the biophysical properties of protein sequences by leveraging unlabeled large-scale data. It employs a probability distribution model to generate embeddings that incorporate evolutionary information. The trained model captures important biophysical properties from the unlabeled database (UniRef50) and transfers this knowledge to individual protein sequences by predicting relevant sequence characteristics [20].

ProSE
This feature extraction method uses three learning tasks to simultaneously train a three-layer bidirectional LSTM with skip connections: (a) masked language modeling; (b) contact prediction between residues in the protein structure; and (c) structural similarity prediction. The protein language model is thus trained by self-supervised learning on large amounts of natural sequence data combined with structural supervision on a smaller set of sequences [22]. The authors argued that prior knowledge of protein function and structure could be encoded into the learned representation through supervised training on the structural similarity task.

TAPE
As protein representation learning has developed within machine learning research, the authors of TAPE introduced Tasks Assessing Protein Embeddings (TAPE) [21]. They selected supervised tasks from three areas of protein biology in which self-supervised learning can lead to improvements (structure prediction, remote homology detection, and protein engineering). In this paper, we chose the BERT-based TAPE model.
Each organelle protein sequence is first converted to an integer sequence by a mapping function f, where m_j is the j-th amino acid of the sequence. The integer sequence f(m_j), j = 1, 2, ..., L (where L is the length of the protein sequence), was then embedded into a 1024-dimensional feature vector by the SeqVec method, a 6165-dimensional feature vector by the ProSE method, and a 768-dimensional feature vector by the TAPE method.
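As an illustration of the integer-encoding step, the following sketch maps each residue to an index. The alphabet ordering, the use of 0 for unknown residues, and the names `AMINO_ACIDS`, `AA_TO_INT`, and `encode` are assumptions made for this example; the actual mapping function used by each embedding model may differ.

```python
# Assumed ordering of the 20 standard amino acids (not taken from the paper).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Indices start at 1 so that 0 can stand for any non-standard residue.
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode(sequence):
    """Return the integer sequence f(m_1), ..., f(m_L) for a protein sequence."""
    return [AA_TO_INT.get(aa, 0) for aa in sequence.upper()]


print(encode("MKTAYIAK"))  # one integer per residue, same length as the input
```

The pre-trained model then turns this variable-length integer sequence into a fixed-length feature vector of the dimension listed above.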

Feature Selection
The extracted features may contain redundant information, which can make the prediction results inaccurate and lead to overfitting. We therefore used the SHAP interpretation model as a visualization technique to identify the feature dimensions that most strongly influence the prediction results. Subsequently, we used ANOVA [23] and LightGBM to select the relevant features within these dimensions and compared their performance by feeding the selected features into the classifiers. The better of the two was adopted as the feature selection method of the model.
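To make the ANOVA-based selection step concrete, here is a minimal pure-Python sketch that scores each feature column with the one-way ANOVA F statistic and keeps the highest-scoring columns. The function names and the toy data are hypothetical; in practice a library routine would normally be used, and this is not the authors' implementation.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(X, y, k):
    """Return (sorted indices of the k best columns, F score of every column)."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data: column 0 separates the two classes, columns 1 and 2 barely do.
X = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [3.0, 5.0, 0.1], [3.1, 4.9, 0.8]]
y = [0, 0, 1, 1]
kept, scores = select_top_k(X, y, k=2)
print(kept, scores)
```

A feature with a large F score has class means that differ strongly relative to its within-class spread, which is why it is retained.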

Balanced Dataset
The peroxisomal protein dataset constructed by Anteghini et al. [18] in 2021 contains 132 membrane protein sequences and 28 matrix protein sequences, a ratio of about 5:1. The dataset is therefore imbalanced, and an imbalanced dataset can degrade model performance. The SMOTE algorithm oversamples the minority class by synthesizing new samples and is a common method for handling imbalanced data. In this work, we used the SVMSMOTE algorithm, which focuses on generating new minority-class samples along the decision boundary [36].
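The interpolation idea behind SMOTE-style oversampling can be sketched as follows. Note that this is plain SMOTE, not SVMSMOTE: the SVM step that restricts generation to samples near the decision boundary is omitted for brevity. The function name, parameters, and toy points are assumptions for illustration only.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    Plain SMOTE sketch: SVMSMOTE would additionally use an SVM to choose
    which minority samples (those near the boundary) act as base points.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (squared distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy minority class; 132 - 28 = 104 new samples would equalise the two classes.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=104)
print(len(new_points))
```

To avoid leaking synthetic samples into evaluation, oversampling is usually applied only to the training portion of each cross-validation fold.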

Evaluation Metrics and Methods
Accuracy (Acc), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and F1-score were used to evaluate the performance of the prediction system [42][43][44][45][46][47][48]. These metrics are calculated from the four possible outcomes of a binary classification problem, in which each prediction takes one of two values, 0 or 1. An instance is a true positive (TP) if it is positive and predicted to be positive, a false positive (FP) if it is negative but predicted to be positive, a true negative (TN) if it is negative and predicted to be negative, and a false negative (FN) if it is positive but predicted to be negative. Sn and Sp are the proportions of correct predictions among the positive and negative samples, respectively. The F1 score is the harmonic mean of precision and recall and reflects the robustness of the model; the higher the score, the more robust the model. Acc reflects the overall accuracy of the predictor. When the dataset is unbalanced, Acc alone cannot reliably assess the quality of the classification results; in this case, MCC can be used instead. The horizontal axis of the receiver operating characteristic (ROC) curve is the false positive rate (FPR), i.e., the proportion of negative samples judged to be positive, and the vertical axis is the true positive rate (TPR), i.e., the proportion of positive samples judged to be positive. In addition, we plotted the precision-recall (PR) curve, whose vertical axis is precision and whose horizontal axis is recall. In this paper, area under the curve (AUC) refers to ROC-AUC by default. ROC-AUC is the area under the ROC curve; the higher the value, the better the model. Similarly, the area under the PR curve (PR-AUC) can be calculated to describe the performance of the model, and it can be thought of as the average precision over all recall thresholds. The model code was written in PyCharm version 2020.3.2 (JetBrains, Prague, Czech Republic).
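The metrics above follow the standard confusion-matrix definitions, which can be written out as a short sketch. The function names and the example counts are hypothetical; the formulas themselves are the usual textbook ones, and ROC-AUC is computed here through its rank interpretation rather than by integrating the curve.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall, TPR)
    sp = tn / (tn + fp)                    # specificity (1 - FPR)
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}


def roc_auc(scores, labels):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


m = binary_metrics(tp=8, fp=1, tn=9, fn=2)  # made-up counts
print(m)
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because MCC uses all four counts, it stays informative when one class is much larger than the other, which is the situation in the peroxisomal dataset.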

Performance of Features Extracted by Different Methods on Different Classification Models after Balancing the Dataset
In this study, three feature extraction methods, namely SeqVec [20] based on the ELMo model, TAPE [21] based on the BERT model, and ProSE [22] based on a pre-trained multi-task language model, were employed to extract features from peroxisomal protein sequences. To address class imbalance, the SVMSMOTE algorithm was used to balance the dataset. Subsequently, the extracted features were input into nine traditional machine learning models and four deep learning models.

Performance of Features Extracted by Different Methods on Different Classification Models after Feature Selection
Next, we took the features extracted by the ProSE method and used the SHAP interpretation model to plot all instances, which shows how strongly each feature influences the prediction (Fig. 4). Each row in the figure represents a feature, and the abscissa is the SHAP value. Features are ranked by their mean absolute SHAP value, so the plot can be read as a feature-importance ranking. The 4614th-dimension feature shown in the figure is the most important feature of the model and has the greatest impact on the results. The top N most influential features are generally obtained from the mean of the absolute SHAP values of each feature (abs → mean()). Taking the absolute value prevents positive and negative contributions from cancelling, since it is the magnitude of the contribution that matters here. As shown in Fig. 5, the first 4614 feature dimensions have the greatest effect on the model. Combining the results of the two plots, we selected the first 4616 features.
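The ranking by mean absolute SHAP value can be sketched in a few lines. The sketch assumes that a matrix of per-sample SHAP values (rows are samples, columns are features) has already been computed by a SHAP explainer; the function name and the toy values are hypothetical.

```python
def rank_by_mean_abs(shap_values):
    """Rank features by the mean of their absolute SHAP values (abs -> mean)."""
    n_samples = len(shap_values)
    n_features = len(shap_values[0])
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: importance[j], reverse=True)
    return order, importance


# Toy SHAP matrix: feature 0 has large contributions of opposite sign, so its
# signed mean is 0 even though it influences the predictions the most.
toy = [
    [0.5, 0.1, 0.0],
    [-0.5, 0.1, 0.2],
    [0.0, 0.1, -0.4],
]
order, importance = rank_by_mean_abs(toy)
print(order, importance)
```

Keeping the first N entries of `order` gives the N most influential feature dimensions, which is the kind of cut-off applied to the ProSE features in this section.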
It is evident that the 4614-dimensional features have a significant impact on the prediction results. Therefore, the features extracted using the ProSE method were subjected to feature selection using ANOVA and LightGBM, resulting in a feature dimension of 4614. These selected features were then fed into nine traditional machine learning models and four deep learning models, and the results are presented in Table 4 and Table 5. The ANOVA and LightGBM feature selection methods perform differently across models. We also plotted the ROC and PR curves of the deep learning models after ANOVA and LightGBM feature selection, as shown in Fig. 6, Fig. 7, Fig. 8, and Fig. 9.
To evaluate the impact of SVMSMOTE on dataset balancing, we also ran the FastText model without the SVMSMOTE-generated samples. The results, depicted in Fig. 10, show that model performance improves significantly after applying the SVMSMOTE method. This observation confirms the important role of dataset balancing in improving the model's metrics.

Comparison with Previous Models
Finally, our ProSE-Pero model was compared with the In-Pero model developed by Anteghini et al. [18] in 2021, as depicted in Fig. 11. The comparison shows that our proposed ProSE-Pero model achieves an approximately 4% higher accuracy than the In-Pero model. This notable improvement underscores the effectiveness of our model. The detailed parameters of our ProSE-Pero model are provided in Table 6.

Comparison of the Performance of Different Classification Models on the Plant Vacuole Protein Independent Data Set
Vacuoles are unique organelles in plant cells and play a key role in plant growth and development. Vacuoles perform cell functions such as degradation, autolysis, and regulation. Understanding the biochemical and physiological functions of vacuole proteins is the basis for studying the mechanisms of vacuole biogenesis and maintenance [49][50][51]. Accurate identification of vacuolar proteins plays an important role in understanding their biological properties. At present, however, there are few tools for identifying vacuolar proteins [52][53][54].
To verify the generalization performance of our model and to find an effective way to identify plant vacuole proteins, we extended our method to vacuole protein identification and used the ProSE method, based on the self-supervised multi-task language pre-training model, to extract features from vacuole protein sequences. We then used the SHAP interpretable model and the ANOVA method to select among the extracted features. Fig. 12 shows how strongly each feature influences the prediction; based on it, we selected 606 feature dimensions.
Subsequently, we conducted a comparative analysis of the performance of nine traditional machine learning models and the deep learning model FastText on the independent test set. As shown in Table 7, FastText exhibited superior performance, achieving an accuracy of 91.90%, F1-score of 0.9122, specificity of 86.64%, sensitivity of 97.05%, MCC of 0.8379, and AUC of 0.9626 on the independent dataset. Notably, among the nine traditional machine learning models, LightGBM demonstrated the highest accuracy of 89.19%, along with an F1-score of 0.9000, specificity of 81.08%, sensitivity of 97.30%, MCC of 0.7943, and AUC of 0.9573 on the independent dataset.

Comparison with Previous Models
Finally, we compared our method with previous vacuole protein identification models. As shown in Fig. 13, our method outperforms the iPVP-DRLF model [48] and the earlier model in Acc, Sp, Sn, MCC, and AUC, exceeding the iPVP-DRLF model by about 5%, 2%, 8%, 0.1, and 0.05, respectively.

Discussion
The experimental results of our study demonstrate the effectiveness of our approach, which uses the multi-task pre-trained model ProSE to extract features of peroxisomal and plant vacuole proteins. These findings have biomedical implications, as they provide insights into protein localization and function within specific organelles. Moreover, the success of our approach opens up avenues for applying it to feature extraction for proteins from other organelles.
The accurate identification and localization of organelle proteins play a crucial role in unraveling the biological functions of organelles. For instance, dysregulation of the Golgi apparatus has been implicated in various genetic and neurodegenerative disorders, including diabetes [55], cancer [56], Alzheimer's disease [57], and Parkinson's disease [58]. Although current therapeutic strategies primarily rely on pharmacological interventions such as anti-inflammatory and neuroprotective treatments, they often fall short of providing a definitive cure [3]. To gain deeper insights into Golgi dysfunction, timely detection of abnormalities and damage is of utmost importance. Hence, precise identification of Golgi-resident protein types holds significant potential for advancing our understanding of the roles played by Golgi proteins in the aforementioned pathologies.
Mitochondria, essential organelles in eukaryotic cells, play critical roles in various physiological processes, including cell differentiation, cellular signaling, apoptosis, and growth [5]. Impaired mitochondrial function disrupts energy metabolism and ultimately leads to cell death [59]. Aberrant localization of submitochondrial proteins can lead to detrimental interactions, thereby contributing to the onset and progression of various disorders, including Parkinson's disease [60], multifactorial diseases [61], and type II diabetes [62], among others. Therefore, investigating the subcellular localization of mitochondrial proteins is important for elucidating the molecular mechanisms underlying these diseases, facilitating their diagnosis, and fostering the development of novel therapeutic interventions.
Vacuoles, the largest organelles in plants, play a pivotal role in diverse cellular functions such as the storage of inorganic ions and metabolites, protein degradation, detoxification, and the regulation of cytoplasmic ionic homeostasis [63]. These vital functions contribute to the overall cellular integrity and homeostasis in plants. Accurate identification of plant vacuole proteins and subsequent exploration of their biochemical properties and physiological functions are fundamental steps toward understanding the mechanisms underlying vacuole biogenesis and maintenance [53].
In this study, we have demonstrated the validity and broad generalizability of our proposed ProSE-Pero model. The model holds significant potential for accurately identifying and localizing the organelle proteins mentioned above, including submitochondrial proteins and Golgi proteins. It offers promising prospects for future studies in this field, allowing for an improved understanding of the roles and functions of these organelle proteins in various cellular processes. However, it is important to acknowledge the limitations of our research. Currently, our focus is primarily on the identification of organelle proteins, and our methods may not be directly applicable to other protein prediction tasks. Further research and refinement are needed to expand the scope of our methods to other protein-related analyses, such as protein function prediction, protein folding studies, solubility prediction, and drug design.
By addressing these limitations and advancing our methods, we aim to contribute to the broader field of proteomics and facilitate advancements in protein analysis and prediction. Ultimately, our research has the potential to enable more accurate and comprehensive investigations into protein structure, function, and roles in biological processes, benefiting biomedical research and applications.

Conclusions
Through this study, we found that the ProSE method, which is based on a self-supervised multi-task language pre-training model, is highly effective in identifying peroxisomal protein localization. In addition to traditional machine learning methods, we also used deep learning methods such as FastText, TextCNN, CNNBiLSTM, and CNNBiLSTM with an attention mechanism. Our deep learning methods achieved accuracy rates of over 94% in peroxisomal protein localization and identification. After balancing the dataset with SVMSMOTE and comparing the ANOVA and LightGBM feature selection methods, our approach achieved 95.77% in Acc, 0.8996 in F1-score, 93.37% in Sp, 82.41% in Sn, 0.8241 in MCC, and 0.9818 in AUC on the FastText model using tenfold cross-validation. These results represent a 4% improvement over the In-Pero model proposed by Anteghini et al. [18] in 2021, placing our approach at the forefront of peroxisomal protein localization and identification research. This study highlights the importance of balancing imbalanced datasets and using feature selection methods to enhance model performance. Moreover, whereas the In-Pero model combines the SeqVec and UniRep methods, our approach uses only ProSE as the feature extraction method, demonstrating the superior performance of the ProSE method in peroxisomal protein localization and identification.
Furthermore, our approach has also been extended to identify plant vacuole proteins. On the independent test set, our method with the FastText model achieved an accuracy of 91.90%, F1-score of 0.9122, specificity of 86.64%, sensitivity of 97.05%, MCC of 0.8379, and AUC of 0.9626; the accuracy is approximately four percentage points higher than that of the iPVP-DRLF model proposed by Jiao et al. [54] in 2022. Moreover, the method used in the ProSE-Pero model has demonstrated excellent effectiveness and generalization, as evidenced by the leading performance achieved on the independent test set for plant vacuole proteins.
The above results show that the ProSE method, based on a self-supervised multi-task language pre-training model, is effective at extracting features from organelle protein sequences. They also show the benefit of enriching the model with biological prior knowledge and integrating protein structure knowledge into the encoding. We believe that our method can be extended to the localization and recognition of other organelle proteins, such as mitochondrial and Golgi proteins. In future work, we will put this into practice and build on the present study.

Availability of Data and Materials
The pre-trained ELMo-based SeqVec model and a description of how to implement the embeddings can be found here: https://github.com/Rostlab/SeqVec. The ProSE model and a description of how to implement the embeddings can be found here: https://github.com/tbepler/prose. The ProSE-Pero model and datasets can be found here: https://github.com/SJNNNN/ProSE-Pero.

Fig. 4. Size of influence of features on prediction.

Fig. 5. Size of influence of mean prediction of feature absolute values.

Fig. 12. Size of influence of features on prediction.

Fig. 13. Performance of our method and previous models on the plant vacuole protein independent data set.