Academic Editor: Alexandros G. Georgakilas
Background: The wide application of gene sequencing has accumulated numerous amino acid substitutions (AAS) with unknown significance, posing significant challenges to predicting and understanding their pathogenicity. While various prediction methods have been proposed, most are sequence-based and lack insights for molecular mechanisms from the perspective of protein structures. Moreover, prediction performance must be improved. Methods: Herein, we trained a random forest (RF) prediction model, namely AAS3D-RF, underscoring sequence and three-dimensional (3D) structure-based features to explore the relationship between diseases and AASs. Results: AAS3D-RF was trained on more than 14,000 AASs with 21 selected features, and obtained accuracy (ACC) between 0.811 and 0.839 and Matthews correlation coefficient (MCC) between 0.591 and 0.684 on two independent testing datasets, superior to seven existing tools. In addition, AAS3D-RF possesses unique structure-based features, context-dependent substitution score (CDSS) and environment-dependent residue contact energy (ERCE), which could be applied to interpret whether pathogenic AASs would introduce incompatibilities to the protein structural microenvironments. Conclusion: AAS3D-RF serves as a valuable tool for both predicting and understanding pathogenic AASs.
The development of high-throughput sequencing technologies continuously accelerates human genome re-sequencing and the identification of variants [1]; a vast quantity of single-nucleotide variants (SNV) have been registered in the dbSNP database to date [2]. Among them, non-synonymous single-nucleotide variants (nsSNV), which lead to amino acid substitutions (AAS) in their protein products, account for more than half of disease-associated genetic lesions [3, 4], and are thus of great interest. While only a small part of nsSNVs has been confirmed to be related to disease or not, the vast majority remain with uncertain significance (variants of uncertain significance, or VUS). In detail, the dbSNP contains about 8 million nsSNVs or AASs currently, but only about 0.1 million (~1.25%) have explicit associations with clinical phenotypic effects according to annotations in ClinVar [5], UniProt [6], and HGMD [7]. As it is impractical to characterize each VUS through experimental approaches, in silico prediction of their disease-association and understanding of the molecular basis of their pathogenicity have become in high demand.
The past decades have witnessed the development of various computational predictors aimed at screening disease-associated AASs (daAASs) from neutral ones (nAASs) [8, 9, 10, 11, 12, 13, 14, 15]. Many of them adopted machine learning strategies to train predictors based on various features, especially sequence conservation-related features [10, 11, 12, 14]. However, the performance of available predictors still leaves room for improvement, and these predictors often fail to provide insights into the molecular basis of daAASs. To advance in these aspects, it will be promising to explore more informative and interpretable features.
Sequence conservation-related features have been demonstrated as a group of powerful features, but they cannot provide in-depth mechanistic hints since the conservation actually serves as a result (but not cause) of natural selection for structural and functional importance. It is the folded three-dimensional (3D) structure that directly fulfills the function. Hence, it is hoped that exploring structural features can provide more interpretable insights to understand the underlying mechanisms of the pathogenicity of daAASs. With a large collection of protein 3D structures that has accumulated in recent years [16], it has proven to be promising to further explore features based on them, and to develop predictors by combining both sequence and the mined structural features. However, it is challenging to put this idea into practice as the complexity of protein structure data makes their high-throughput processing much more difficult than sequence data.
Several recent studies have put massive effort in this direction. PhyreRisk can map AASs to experimental or homology-modeled structures, and the associated Missense3D tool can then list their potential structural impacts according to a set of knowledge-based rules, such as breaking the disulfide bond, burying the hydrogen bond, introducing clash, altering secondary structures, etc. [17, 18]. VarSite, a protein-centered, experimental structure-focused resource, has integrated various features, including tissue-specific expression, disease type, conservation, protein domain, secondary structure, and interaction site [19]. Users can inspect AASs in the contexts of these features to understand their structural basis [19]. The mutfunc tool provides pre-calculated properties associated with datasets curated from ExAC [20] and ClinVar [5], including stability, interaction, post-translational modification, linear motif, and transcription factor binding site [21]. Moreover, it has covered non-coding variants and extended from human to yeast and E. coli [21]. MISCAST has analyzed 40 properties associated with the AAS position for all protein classes jointly and for each of 24 protein functional classes separately, and have identified many properties that are significantly associated with pathogenic or neutral AASs, including protein-protein interactions, residue exposure levels, secondary structures, etc. [22]. Further, MISCAST defined the P3DFi scores for each AAS position on the basis of the analyses of these properties, and the trained machine-learning model has shown that the P3DFi scores can offer orthogonal information to improve the prediction of pathogenic AAS compared with the combination of SIFT [8], PolyPhen-2 [11], and CADD [23]. These studies have largely enhanced the interpretation of AASs from the perspective of structures and functions.
Herein, we explored a more comprehensive set of structural and sequence features, selected an optimal feature subset using an automatic pipeline, and trained a machine-learning model for predicting the pathogenicity of AASs. First, we curated a high-quality collection of experimental structures and homology structure models for 5278 and 3682 human proteins, respectively. Second, one training and two testing datasets were constructed with 14,117, 5485, and 5249 AASs that can be mapped to structures, respectively. Third, a total of 212 candidate features for each AAS were extracted based on sequence or structure, and 21 of them were selected by automatic feature selection to train a random forest predictor [24], namely, AAS3D-RF. Fourth, the evaluations demonstrated that AAS3D-RF outperformed seven other popular tools. The structural features selected in this work improve the prediction performance to different degrees in different scenarios, and the nature of their interpretability also provides more understanding of the molecular basis of the daAASs.
For all human proteins in the Swiss-Prot database (UniProtKB/Swiss-Prot, Release 2018, 04) [6], we downloaded and curated their available experimental structures from Protein Data Bank (PDB) [16]. For those proteins without available experimental structures, their homology models from ModBase [25] were adopted. The processing details are described in Supplementary Fig. 1 and Supplementary Methods.
The humsavar.txt (2018, 04 of 25 Apr 2018) file provided by the UniProt FTP server contained manually annotated AAS, and is the main source for preparing the AAS datasets [6]. According to its documentation, AASs labeled with “Disease” serve as positive samples (i.e., daAAS), while those labeled with “Polymorphism” (similar to “neutral” or “benign”) are negative samples (i.e., nAAS). Another two data sources of AASs are the 1000 Genomes Project (1000G, Release 2 May 2013) [26] and the VariSNP (Release 16 Feb 2017) [27]. After removing those whose mutant allele frequency is less than 1%, the remaining AASs from 1000G and VariSNP were regarded as nAASs. After mapping to experimental structures and keeping only one instance for duplicates, the AASs from these three sources constituted the TotalDataset.
The TotalDataset was then split into TrainDataset and
TestDataset 1 according to the following procedure: After an all-to-all
BLASTP (E-value
A series of subsets were further prepared from these three datasets. In detail, the subsets were organized according to the proportion of daAASs on each protein: For proteins containing only daAASs or nAASs, their AASs constitute the “Pure” subsets; other subsets, namely, “Mixed” subsets at different mixing levels, were constructed by selecting proteins within specific ranges of daAASs proportion, including open interval between 0.0 and 1.0 (denoted as ]0.0, 1.0[) and close intervals of [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], and [0.4, 0.6].
Based on protein structures, sequences, and sequence alignments, a total of 212 candidate features were calculated to characterize each AAS from various perspectives (gene/protein, AAS site, substitution itself, etc.). The full list of features is described in Supplementary Methods.
Each dataset was organized as a feature matrix with each row representing an AAS and each column corresponding to a feature. The missing data were filled with the mean values derived from the TrainDataset. Furthermore, each feature column was transformed into Z-score, i.e., all values in the same feature column were standardized with the mean and standard deviation derived from the TrainDataset.
Random forest (RF) is a machine learning framework consisting of an ensemble of decision trees, and has been widely applied in classification problems [24]. In the training stage, each tree is trained with only a part of the training samples to decrease over-fitting. In the stage of prediction, the consensus output from the majority of the trees was taken as the final result. Considering its successful application in many bioinformatics studies [29], we chose the RF technique in this work for automatic feature selection and model training.
We adopted a wrapper strategy in the feature selection procedure. That is, we iteratively generated feature subsets, and evaluated them by using the 10-fold cross-validation performance of RF classifiers trained on them accordingly. The brief logic was that we added two features with the most contribution and removed one feature with the least feature importance in each iteration, and this operation would be terminated to deduce the feature redundancy if the adding of new features would not improve the performance of the classifier any more. This procedure has been demonstrated effective in one previous study [30]. Notably, we conducted the cross-validation at the protein level, where AASs from the same protein were required to reside in the same fold. That is, AASs were grouped at the protein level, namely, Group10Fold cross-validation (G10F-CV). The purpose of this operation was to reduce the level of type 2 circularity, which has been discussed in-depth previously [31]: If protein/gene-level features were used, there would exist over-fitting in predicting the disease-association of those AASs whose proteins contain AASs used in the training procedure.
After obtaining the final feature subset, we determined the optimal RF hyper-parameters by finding the maximum area under the ROC curve (AUC) of G10F-CV in a random search of 1000 trials (Supplementary Table 1). Then, by specifying the hyper-parameters with the optimal values, the final classifier was re-trained on the entire TrainDataset. All these processes were conducted by using the Scikit-learn toolkit v0.19.0 [32].
The two independent testing datasets, i.e., TestDataset 1 and TestDataset 2, were utilized to evaluate performance and to compare with other predictors by adopting standard performance measures, including accuracy (ACC), AUC, Matthews correlation coefficient (MCC), sensitivity (Sen), specificity (Spe), positive predictive value (PPV), and negative predictive value (NPV) (definitions in Supplementary Methods).
In addition to the evaluations on the full datasets of TestDataset 1 and TestDataset 2, more in-depth evaluations were performed on the ‘Pure’ and ‘Mixed’ testing subsets, which are designed to provide an examination of the extent of type 2 circularity [31].
Our predictor was compared with seven popular tools, including SIFT (version 5.2.2) [8], HumDiv-trained PolyPhen-2 (PPH2_HD) (version 2.2.2r405c) and HumVar-trained PolyPhen-2 (PPH2_HV) (version 2.2.2r405c) [11], PROVEAN (version 1.1.5) [13], FATHMM’s weighted method (FATHMM-W) and unweighted method (FATHMM-U) (version 2.3) [14], and PANTHER-PSEP (version 1.01) [15]. The settings of these tools are described in detail in Supplementary Methods.
In summary, 6430 experimental structures of 5278 proteins and 4238 homology models of 3682 proteins were obtained for preparing the AAS datasets (Supplementary Fig. 1).
After initial data cleaning, the humsavar dataset retained 29,328 daAASs and 39,679 nAASs. Among them, 11,157 daAASs and 7072 nAASs were mapped to experimental structures, and 2471 daAASs and 2778 nAASs were mapped to ModBase [25] homology models. Overall, the mapped percentages were 46.5% and 24.8% for daAASs and nAASs, respectively. For the VariSNP dataset, the initial cleaning, keeping those with MAF at no less than 1%, and mapping to UniProt canonical sequences, resulted in 18,233 nAASs. Among them, 1281 (7.0%) were mapped to experimental structures. For the 1000G dataset, 3296 of 34,791 (9.5%) nAASs were mapped to experimental structures. After integration and proper data partition described in section 3.1, these mapped AASs were separated into TrainDataset, TestDataset 1, and TestDataset 2 (Supplementary Fig. 2).
The TrainDataset consists of 14,117 AASs mapped on the experimental structures of 1979 proteins, with 7385 daAASs on 611 proteins and 6732 nAASs on 1750 proteins (Table 1). The ratio of daAAS to nAAS is highly balanced (7385:6732), which will ensure that the trained classifier would not suffer from data bias [33, 34].
Dataset | Class | # of AAS | # of Proteins |
# of Structures |
TrainDataset | Disease | 7385 | 611 | 708 |
Neutral | 6732 | 1750 | 1938 | |
Total | 14,117 | 1979 | 2239 | |
TestDataset 1 | Disease | 3772 | 321 | 338 |
Neutral | 1713 | 733 | 757 | |
Total | 5485 | 834 | 873 | |
TestDataset 2 | Disease | 2471 | 383 | 394 |
Neutral | 2778 | 1208 | 1265 | |
Total | 5249 | 1418 | 1482 | |
The TestDataset 1 contains 5485 AASs mapped on the experimental
structures from 834 proteins, with 3772 daAASs on 321 proteins and 1713 nAASs on
733 proteins (Table 1). The TestDataset 2 contains 5249 AASs mapped on
homology models of 1418 proteins, with 2471 daAASs on 383 proteins and 2778 nAASs
on 1208 proteins (Table 1). The TestDataset 1 and TestDataset 2
were utilized to evaluate the performance of the trained predictor on AASs with
features extracted from experimental structures and reliable homology models,
respectively. Notably, both the TestDataset 1 and TestDataset 2
have no overlap with TrainDataset at either the substitution level or
the protein level, which would presumably avoid the two types of circularity that
may cause overly optimistic estimation of performance [31]. Moreover, the
sequence identity of any pairwise comparison between TestDataset 1 and
TrainDataset is less than 30%, but 947 of 1418 proteins in
TestDataset 2 have high scoring pairs with sequence identity
We inspected the datasets by examining the “Pure” and “Mixed” subsets (Fig. 1A,B). A substantial portion of AASs came from the “Pure” subsets (32.6% in TestDataset 1 and 65.1% in TestDataset 2), i.e., many proteins contributed only daAASs or only nAASs. These “Pure” or “Mixed” testing subsets can provide more detailed comparisons of the predictors’ performance, as some predictors confounded by type 2 circularity could not be evaluated properly on the entire testing dataset [31].
The composition of different AAS subsets and the performance of eight predictors on them. Each subset was prepared by incorporating the AASs whose proteins harbor a specific range of daAASs proportion (interval labels under the X axis) for TestDataset 1 (A) and TestDataset 2 (B). Among them, “All” represents the whole testing dataset, while “Pure” contains the AASs whose proteins harbor either daAASs or nAASs only. The numbers of daAASs and nAASs of each subset are given on top of each bar. The percentage of the AAS number in each subset is also indicated. The AUC scores on different AAS subsets are plotted for TestDataset 1 (C) and TestDataset 2 (D).
A similar pattern can be found in the TrainDataset, where a large part of AASs (44.3%) are from the “Pure” subset (Supplementary Fig. 3). In addition, considering that some features adopted in the classifier describe the whole gene or protein but not the AAS itself, the trained classifier may be skewed to classify protein but not the AAS (type 2 circularity). As we had adopted protein level cross-validation (G10F-CV in this work) in the training procedure [31], this skewness could presumably be removed.
A total of 21 features (Supplementary Table 2) were finally selected after the iterative feature selection procedure. After randomly searching 1000 hyper-parameter combinations of RF, we obtained the optimal one (n estimators: 338, max depth: 10, max features: 0.5233959) corresponding to the highest AUC (0.873) in the G10F-CV. By specifying the optimal hyper-parameter combination, we re-trained the final RF classifier, namely AAS3D-RF, on TrainDataset. We also implemented an automatic pipeline that can calculate the required features for a given AAS, and can then load AAS3D-RF to make predictions. The codes were implemented in Python 2.7 and are available at http://www.wdspdb.com/AAS3D-RF/ and https://github.com/PKU-XiongYao/AAS3D-RF.
The measures of the performance on TestDataset 1 and
TestDataset 2 are listed in Table 2 and Supplementary Table 3. The ACC
and MCC values are 0.811 and 0.591 for TestDataset 1, and 0.839 and
0.684 for TestDataset 2, demonstrating its superior overall performance.
The better performance on TestDataset 2 may partially stem from the
similar proteins between TestDataset 2 and TrainDataset, since
we did not require that the proteins having HSP with sequence identity
Predictor | Dataset | ACC | MCC |
AAS3D-RF | TestDataset 1 | 0.811 | 0.591 |
TestDataset 2 | 0.839 | 0.684 | |
structure-removed | TestDataset 1 | 0.783 | 0.542 |
TestDataset 2 | 0.835 | 0.676 |
We compared the performance of AAS3D-RF with seven other popular tools. Except FATHMM-W (discussed below), AAS3D-RF outperforms all the other predictors on TestDataset 1 and TestDataset 2 in terms of ACC, AUC, MCC, and PPV (Fig. 2A,B, Supplementary Table 4). In particular, the MCC values are 8.7 and 11.3 percentage points higher than the second-best predictors (except FATHMM-W) on TestDataset 1 and TestDataset 2, respectively (Supplementary Table 4). Notably, many other tools, including SIFT, PPH2_HD, PPH2_HV, PROVEAN, and PANTHER_PSEP, have achieved much higher Sen scores, but their Spe scores are very low, indicating that they are biased to positive predictions and prone to higher false positive rates (1-Spe). As MCC is a performance metric considering both positive and negative predictions in balance, we can conclude that AAS3D-RF achieved superior balanced performance without bias to a specific class. The detailed ROC curves demonstrate a similar conclusion: The curve of AAS3D-RF covers the largest area under the curve except FATHMM-W (Fig. 2C,D).
Comprehensive performance comparison of eight predictors on the two independent testing datasets. (A) and (B) show the performance in terms of seven metrics for TestDataset 1 and TestDataset 2, respectively. (C) and (D) display the ROC curves for TestDataset 1 and TestDataset 2, respectively.
Several previous studies have reported that the prediction performance of FATHMM-W was significantly confounded by type 2 circularity, i.e., it intended to predict all the variants from the same protein as pathogenic or neutral as a whole [31, 35, 36]. In other words, FATHMM-W would perform worse on a dataset with proteins containing a nearly equal number of daAASs and nAASs on each of them (e.g., the subsets with daAAS proportion in the range of [0.4, 0.6]) than on a dataset with proteins containing only daAASs or nAASs (e.g., the “Pure” subsets). Prediction methods sensitive to type 2 circularity would perform poorly in discriminating daAASs from nAASs within a given protein. In our work, the evaluations on “Pure” and different levels of “Mixed” subsets have demonstrated this: FATHMM-W performs the worst on the subsets with daAAS proportion in [0.4, 0.6], but obtains an impressively high AUC on the “Pure” subsets (Fig. 1C,D). As shown, AAS3D-RF consistently ranks at the top tier of all methods among “Mixed” subsets, indicating that the type 2 circularity in AAS3D-RF has been removed at the highest level. As their values are in fact the same for all variants from the same protein, the gene/protein-level features should be the source of type 2 circularity. In our cross-validation and independent testing, the variants from the same protein were either all in the training set or all in the testing sets, hence this grouped cross-validation at the protein level here may have served as an effective strategy to decrease type 2 circularity.
As published tools were trained on different datasets, it is often difficult to obtain a proper and fair testing dataset that contains enough data and has no overlap with any of their training datasets. Here, TestDataset 1 and TestDataset 2 are sufficiently rigorous to our predictor, since there are no overlapped AASs or proteins between them and TrainDataset. However, this scenario does not apply to other tools. Hence, the performance comparisons based on TestDataset 1 and TestDataset 2 will presumably provide a realistic performance estimate for AAS3D-RF but may supply overly optimistic estimates for other tools. In a truly fair comparison, the improvement of AAS3D-RF over other tools would presumably be larger.
To interrogate the interpretability of features, which may offer clues for understanding the molecular basis of the daAASs, we plotted the distributions of some features in daAAS and nAAS separately (Fig. 3). As shown, these features can be grouped within four categories: physicochemical properties of residues, conservation-related properties, gene/protein-level features, and structure-based features. While many of them have been analyzed in previous studies, we here mainly focus on several structure-based features.
Z-score distributions of a part of selected features for the daAASs (red) and nAASs (blue) in TrainDataset. The p-values of two-tailed Mann-Whitney test for each feature comparison are shown.
Five features in the conservation-related group show the most evident contrast between daAASs and nAASs (Fig. 3). Conservation often indicates structural stability or functional importance. As a comprehensive result of many underlying mechanisms, the conservation-related features have superior ability in separating daAASs from nAASs. Many previous studies have also repeatedly revealed the power of this type of features [34, 37, 38, 39, 40, 41, 42, 43]. However, these features cannot offer further molecular mechanistic clues of the disease-associated AASs, as mentioned in the Introduction.
Relative solvent accessibility (RSA) and folding free energy change
(
The context-dependent substitution score (CDSS) is a substitution score between different amino acids under specific structural environments or contexts. Three features, including Miyata, Abs_dF1, and Abs_dpI, describe general physicochemical differences independent of residues’ structural environments. In contrast, CDSS provides a more specific residue substitution score in given RSA levels and secondary structure states [51]. In Fig. 3, the CDSS values of daAASs are smaller than those of nAASs, indicating that unfavorable substitutions with respect to CDSS tend to be associated with diseases.
Environment-dependent residue contact energy (ERCE) is another selected structural environment-related feature, accounting for both the secondary structure and residue contacts in the 3D structure [52]. The selected ERCE feature for an AAS describes the summation of contacting energy values between the wild-type residue and all its neighboring residues within a distance of 6.5 Å. The lower the contact energy, the more stability it contributes. In Fig. 3, the ERCE values of daAASs are smaller than those of nAASs, suggesting that the residues with higher stability contribution, if substituted, tend to become pathogenic.
To our knowledge, it is the first time that CDSS and ERCE have been utilized in developing methods for predicting the disease-association of AAS. The effectiveness of CDSS and ERCE indicates an understanding that substitutions introducing incompatible residues into a specific structural context will tend to be deleterious to the protein and thus be pathogenic.
In addition to better interpretability of the structural features, to what extent they contribute to the prediction performance is another aspect that must be interrogated. Toward this end, we re-trained an additional predictor with all seven structural features removed according to the same procedure adopted in training AAS3D-RF, namely, the structure-removed predictor. Then, we compared its performance with AAS3D-RF on TestDataset 1 and TestDataset 2 (Table 2 and Supplementary Table 3).
First, the ACC and MCC of structure-removed predictor were 2.8 and 4.9 percentage points less than the AAS3D-RF on the TestDataset 1, respectively (Table 2 and Supplementary Table 3), demonstrating that the structure features evidently contributed to the performance.
Second, when evaluated on TestDataset 2, the observed improvement of
ACC and MCC after adding structure-related features was not as large as that
observed on TestDataset 1 (
What factors have affected the extent of performance improvement of structural
features? As several previous studies have proposed that the contribution of
structural features is more evident when reliable conservation-related features
are unavailable [53, 54], we checked whether this applies in our case. On
TestDataset 2, by using the number of aligned sequences (the feature of
nal_1e-45) as a proxy of reliability of conservation-related features, we
observed that the improvement of MCC was indeed more evident in AAS data with
nal_1e-45
In summary, incorporation of structural features further improves prediction performance, especially in scenarios wherein conservation-related features are of low reliability.
Developing tools for predicting daAASs has been a challenge for over a decade. Although a plethora of studies have devoted significant effort and achieved much progress in this field, their performance requires further improvement [37, 58]. As more and more AASs with phenotypic effects are determined, available methods can be re-evaluated and previous machine learning-based tools could be upgraded with new data. With other related data accumulated, such as homologous sequences and structures in public databases, novel predictors could be developed with more accurate feature descriptors, more complete feature space, or brand-new features. Given this background, our work has been carried out to explore new structural features and to combine them with sequence features aiming at improving the prediction and understanding of disease-associated AASs.
Many studies have adopted sequence features and the predicted structural features based on sequence, but only a few directly extract features from protein 3D structures, whether experimental structures or homology models [45, 57, 59, 60]. According to the paradigm of sequence-structure-function, protein structures are more directly related to function than sequence, so the 3D structures should be able to provide greater understanding of pathogenic AASs. However, the structure feature extraction procedure is much more complicated, and only a small part of AASs are covered by structures, which may have restricted the extensive exploration of structural features. Several recent studies have integrated or explored AASs in the structural contexts with much larger datasets. By adopting both experimental and homology-modeled structures, PhyreRisk and Missense3D have increased the number of AASs that can be mapped to 3D structures and can calculate the potential structural impacts based on knowledge-based rules [17, 18]. Similarly, the mutfunc tool has also incorporated homology-modeled structures, and has provided pre-calculated properties of stability, interaction, post-translational modification, linear motif, and transcription factor binding site [21] for datasets curated from ExAC [20] and ClinVar [5]. By focusing only on experimental structures, VarSite has offered a graphical platform to inspect AASs in the contexts of conservation levels, protein domains, secondary structures, and interaction sites [19]. Through analyzing 40 properties associated with the variant position, MISCAST has identified properties significantly associated with daAASs or nAASs [22]. More specifically, it has also identified significant properties within each of the 24 protein functional groups separately. Accordingly, MISCAST has defined P3DFi scores for each residue position based on either the joint analyses of all protein classes or the separate analyses of each functional class, and these scores can improve the prediction of pathogenic AASs. These studies have largely advanced the interpretation of AASs in the contexts of 3D structures. In this work, we have constructed a more comprehensive set of 212 candidate features, and have adopted an automatic feature selection pipeline considering the redundancy between features [30]. Unlike previous studies that have chosen interpretable features based on knowledge or statistical analyses, our feature selection is more intended to maximize prediction performance and to ignore redundant features.
The main new structural features in this work include ERCE and CDSS. ERCE relies
on the secondary structure and structural interacting residues, while CDSS is
dependent on the secondary structure and RSA (Fig. 3). The analysis of ERCE and
CDSS herein hints that AASs incorporating residues incompatible to its structural
environment may be potential reasons for certain daAASs. Another two selected
structure features, RSA and
As for physicochemical properties irrelevant to structures, MISCAST has emphasized the properties of the wild-type residues, while AAS3D-RF has mainly focused on the difference between wild-type and mutant residues (Fig. 3). Disulfide bond-related features have also been highlighted in both MISCAST and AAS3D-RF but are slightly different: The former represents the structural neighboring disulfide bonds, while the latter represents the sequence neighboring ones. The post-translational features have been absent in our work, but most of them have been demonstrated as significant in the work of MISCAST. Hence, they should be considered and utilized in future studies of developing predictors. Notably, we have extracted many more features as the candidates (212 in total), including functional regions (calcium binding, DNA-binding, nucleotide phosphate, membrane-spanning, and zinc finger regions), sequence and structural neighboring functional sites (active pocket, chemical group binding, metal ion binding, and disulfide bonded cysteine sites), hydrogen bonds, and the 8-state secondary structures. However, our automatic feature selection pipeline has not incorporated most of them into the final feature subset, possibly because they could not contribute further to maximize the prediction performance. Nevertheless, many of these dropped features are informative in providing clues to understand the molecular basis of pathogenic AASs, as demonstrated in the work of MISCAST, VarSite, mutfunc, PhyreRisk and Missense3D [17, 18, 19, 21, 22]. Hence, one may consider strategies to combine manual retaining of certain biologically meaningful features with automatic feature selection together aiming at providing better interpretability in the future.
When developing machine learning-based tools for predicting daAASs, type 2 circularity is an important issue that is worth noting. In our work, Residual Variation Intolerance Score (RVIS), a gene-level metric measuring the relative ability of a human gene tolerating common functional genetic variation in healthy individuals [61], has been chosen during the feature selection. The RVIS scores of AASs from the same gene/protein are identical. In the cross-validation, if AASs from the same gene/protein are separated into the training and the validation data, their identical gene/protein-level feature value will lead to over-fitting. To avoid this, we conducted cross-validation at the protein-level, i.e., AASs were first grouped according to their source proteins, and then the training and validation data separation was carried out without splitting any group. The performance evaluation on the ‘Mixed’ datasets has demonstrated that this strategy is effective. In the future, if training data accumulate sufficiently, one can train better predictors based on only ‘Mixed’ datasets with a balanced number of daAASs and nAASs in each protein.
Currently, most predictors and AAS annotators, including AAS3D-RF, are developed for human proteins, and only few can be applied to other organisms. The underlying reasons may be the lack of training/testing samples or proper features suitable for non-human species. For those that can be applied to other organisms, they mainly adopted universal features that are not restricted within specific organisms, such as conservation scores and structural stability scores. Such examples include mutfunc [21], Fido-SNP [62], and Envision [63]. As for AAS3D-RF, though it cannot be applied to other organisms due to the human-specific RVIS feature currently, its CDSS and ERCE features will be promising when extrapolating to other organisms.
For this study, we exclusively adopted homology models from ModBase and AASs from humsavar, 1000G, and VariSNP datasets. With the unprecedented progress in protein structure prediction such as the recent AlphaFold, 58% of human proteome residues now have been confidently mapped with 3D conformations [64]. Moreover, ClinVar contains more annotated variants than humsavar and is growing rapidly as well. A study showed that only 8% of ClinVar daAASs and 32% of humsavar daAASs overlap [65]. In addition to public data, the commercial HGMD database holds even more variants with disease annotations [7]. Integrating these larger datasets will be beneficial for improving AAS3D-RF and other related predictors in the future.
In this work, we curated a training dataset containing about 15 thousand AASs with known phenotypic effects and mappable to reliable 3D structures. Based on 21 automatically selected sequence and structural features, the RF-based machine-learning model AAS3D-RF was trained using G10F-CV. Evaluation on several independent testing datasets showed that AAS3D-RF achieved ACC of 0.811
YX, ZQY, and YDW conceived and designed this study; YX, JBZ, and KA performed the calculations; YX, ZQY, YDW, WH, and TW analyzed the data; YX and ZQY drafted the manuscript; all the authors revised and approved the manuscript.
This is a computational study and no subjects of humans or animals were involved in.
We would like to thank Xinhao Zhang, Fan Jiang, and Xudong Zou for their helpful suggestions. We also thank the reviewers’ valuable comments for improving the manuscript. We gratefully acknowledge the support of Shenzhen Bay Laboratory Supercomputing Center and the National Supercomputer Center in Guangzhou (NSCC-GZ) for computing resources.
This work was supported by the Key-Area Research and Development Program of Guangdong Province [2020B0101350001]; the National Natural Science Foundation of China [32070664, 21933004, 31471243]; the Shenzhen Science and Technology Innovation Commission [JCYJ20170818085409785, GXWD20201231165807007-20200812124825001]; and the Shenzhen Municipal Health Commission [SZSM201809085].
The authors declare no conflict of interest.
During the peer-review of this manuscript, the reviewers suggested further evaluation of AAS3D-RF on new AASs and on previously unmapped AASs that can be mapped now due to the increased coverage of 3D structures since April, 2018. We undertook several steps to fulfil this. First, we compared the human protein entries’ annotations in UniProtKB/Swiss-Prot (2021, 03) and UniProtKB/Swiss-Prot (2018, 04), and obtained 943 new experimental structures of 907 proteins. The quality-check and the removal of those that are overlapped with the ModBase structures mentioned in section 3.1 resulted in 631 new structures of 606 proteins. Second, we mapped the humsavar (2021, 03) AASs to all the structures including these newly curated 631 structures, previously curated experimental structures, and previously curated homology-modelled structures (Supplementary Fig. 1), and then removed those that have occurred in TrainDataset, TestDataset 1, or TestDataset 2, obtaining a new testing dataset, namely TestDataset 3, which contains 2,614 new AASs (1,675 daASs and 939 nAASs). The AASs in TestDataset 3 stem from 616 structures of 582 proteins (Supplementary Table 6).
Third, we ran AAS3D-RF and the other seven tools against TestDataset 3. The settings for running these seven tools were the same as those used on TestDataset 1 and TestDataset 2, and are described in Supplementary Methods. Lastly, we evaluated the performance of AAS3D-RF on TestDataset 3. The ACC and MCC are respectively 0.821 and 0.619 similar to those based on TestDataset 1 and TestDataset 2 (Supplementary Tables 4,7). In addition, AAS3D-RF has also evidently outperformed other tools with respect to ACC and MCC (Supplementary Fig. 4A). The ROC plot shows that AAS3D-RF covers a much larger area (0.888) than others as well (Supplementary Fig. 4B). Details of TestDataset 3 are also provided at http://www.wdspdb.com/AAS3D-RF/. In summary, AAS3D-RF has stably superior performance to many other tools by combining new interpretable structural features and sequence features.
AAS, Amino acid substitution; SNV, Single-nucleotide variant; VUS, Variants of uncertain significance; RF, Random forest; G10F-CV, Group10Fold cross-validation; AUC, Area under the ROC curve; ACC, Accuracy; MCC, Matthews correlation coefficient; PPV, Positive predictive value; NPV, Negative predictive value; CDSS, Context-dependent substitution score; ERCE, Environment-dependent residue contact energy; RVIS, Residual variation intolerance score.