IMR Press / FBL / Volume 26 / Issue 12 / DOI: 10.52586/5036
Open Access Original Research
Incorporating structural features to improve the prediction and understanding of pathogenic amino acid substitutions
1 State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, 518055 Shenzhen, Guangdong, China
2 Assisted Reproduction Center, Northwest Women’s and Children’s Hospital, 710003 Xi’an, Shaanxi, China
3 Shenzhen Bay Laboratory, 518055 Shenzhen, Guangdong, China
4 College of Chemistry and Molecular Engineering, Peking University, 100871 Beijing, China
*Correspondence: yezq@pku.org.cn (Zhi-Qiang Ye); ydwu@pku.edu.cn (Yun-Dong Wu)
Academic Editor: Alexandros G. Georgakilas
Front. Biosci. (Landmark Ed) 2021, 26(12), 1422–1433; https://doi.org/10.52586/5036
Submitted: 6 July 2021 | Revised: 9 October 2021 | Accepted: 21 October 2021 | Published: 30 December 2021
Copyright: © 2021 The Author(s). Published by BRI.
This is an open access article under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Abstract

Background: The wide application of gene sequencing has produced a large number of amino acid substitutions (AASs) of unknown significance, and predicting and understanding their pathogenicity remains a major challenge. Various prediction methods have been proposed, but most are sequence-based and offer little insight into molecular mechanisms from the perspective of protein structure. Moreover, their prediction performance leaves room for improvement. Methods: We trained a random forest (RF) prediction model, AAS3D-RF, that combines sequence-based and three-dimensional (3D) structure-based features to explore the relationship between diseases and AASs. Results: AAS3D-RF was trained on more than 14,000 AASs with 21 selected features. On two independent testing datasets it achieved an accuracy (ACC) between 0.811 and 0.839 and a Matthews correlation coefficient (MCC) between 0.591 and 0.684, outperforming seven existing tools. In addition, AAS3D-RF includes two unique structure-based features, the context-dependent substitution score (CDSS) and the environment-dependent residue contact energy (ERCE), which can be used to interpret whether a pathogenic AAS introduces incompatibilities into the protein structural microenvironment. Conclusion: AAS3D-RF serves as a valuable tool for both predicting and understanding pathogenic AASs.
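
The two metrics reported in the abstract, ACC and MCC, can be illustrated with a minimal sketch. It uses the standard textbook definitions of both quantities for a binary classifier. The function names and the toy labels are hypothetical and chosen only for illustration; they are not data or code from the AAS3D-RF study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels: 1 = pathogenic, 0 = neutral."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """ACC: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy labels (hypothetical, not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("ACC =", accuracy(*counts))  # 0.75
print("MCC =", mcc(*counts))       # 0.5
```

MCC is commonly reported alongside ACC because it accounts for all four cells of the confusion matrix and is therefore less sensitive to class imbalance between pathogenic and neutral substitutions.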

Keywords
Amino acid substitution
Single-nucleotide variant
Pathogenic
Protein structure
Machine learning