Open Access Original Research
ProSE-Pero: Peroxisomal Protein Localization Identification Model Based on Self-Supervised Multi-Task Language Pre-Training Model
1 School of Information Science and Engineering, University of Jinan, 250022 Jinan, Shandong, China
2 Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, 819-0395 Fukuoka, Japan
3 School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, 250022 Jinan, Shandong, China
4 School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731 Chengdu, Sichuan, China
*Correspondence: yhchen@ujn.edu.cn (Yuehui Chen); iwamori@agr.kyushu-u.ac.jp (Naoki Iwamori)
These authors contributed equally.
Front. Biosci. (Landmark Ed) 2023, 28(12), 322; https://doi.org/10.31083/j.fbl2812322
Submitted: 27 April 2023 | Revised: 17 July 2023 | Accepted: 24 July 2023 | Published: 1 December 2023
Copyright: © 2023 The Author(s). Published by IMR Press.
This is an open access article under the CC BY 4.0 license.
Abstract

Background: Peroxisomes are membrane-bound organelles that contain one or more types of oxidative enzymes. Aberrant localization of peroxisomal proteins can contribute to the development of various diseases. To identify and localize peroxisomal proteins more accurately, we developed the ProSE-Pero model.

Methods: We employed three methods based on deep representation learning models to extract features from peroxisomal proteins and compared their performance. We then balanced the dataset with SVMSMOTE and used SHAP model interpretation, analysis of variance (ANOVA), and the light gradient boosting machine (LightGBM) to select and compare the extracted features. We also built several traditional machine learning classifiers and four deep learning models, which were trained and tested on a dataset of 160 peroxisomal proteins using tenfold cross-validation.

Results: Our proposed ProSE-Pero model achieves a specificity (Sp) of 93.37%, a sensitivity (Sn) of 82.41%, an accuracy (Acc) of 95.77%, a Matthews correlation coefficient (MCC) of 0.8241, an F1 score of 0.8996, and an area under the curve (AUC) of 0.9818. We also extended our method to identify plant vacuole proteins and achieved an accuracy of 91.90% on the independent test set, which is approximately 5% higher than the latest iPVP-DRLF model.

Conclusions: Our model surpasses the existing In-Pero model in peroxisomal protein localization and identification. Our study also shows that the pre-trained multitasking language model ProSE performs well at extracting features from protein sequences. Given its validated performance and its generalization to vacuole proteins, our model could be extended in future work to the localization and identification of proteins in other organelles, such as mitochondrial and Golgi proteins.
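For readers unfamiliar with the evaluation metrics named in the Results, the following is a minimal sketch of how Sn, Sp, Acc, MCC, and F1 are conventionally derived from binary confusion-matrix counts. The function name `binary_metrics` and the counts used in the example are hypothetical and chosen only for illustration; they are not taken from the study, and this is not the authors' evaluation code.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Ratios with a zero denominator
    are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)                    # sensitivity (recall)
    sp = ratio(tn, tn + fp)                    # specificity
    acc = ratio(tp + tn, tp + fp + tn + fn)    # accuracy
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sn, precision + sn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)    # Matthews correlation coefficient
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts for illustration only (not data from the paper).
metrics = binary_metrics(tp=8, fp=2, tn=18, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is not included here because it depends on the ranking of predicted scores rather than on a single set of confusion counts.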

Keywords
peroxisomal localization identification
SVMSMOTE
multitasking language model
feature selection
deep learning
vacuole proteins identification
Funding
ZR2021MF036/Shandong Provincial Natural Science Foundation
31872415/National Natural Science Foundation of China