Background: Alzheimer’s disease (AD) is an irreversible primary brain disease with insidious onset. The rise of imaging genetics research has led numerous researchers to examine the complex association between genes and brain phenotypes from the perspective of computational biology. Methods: Most previous studies have assumed that imaging data and genetic data are linearly related and are therefore unable to explore their nonlinear relationship. To address this problem, our study applied a joint depth semi-supervised nonnegative matrix decomposition (JDSNMF) algorithm. The JDSNMF algorithm jointly decomposed multimodal imaging genetics data into a standard basis matrix and multiple feature matrices. During the decomposition process, a multilayer nonlinear transformation was applied to the coefficient matrix using a neural network to capture nonlinear features. Results: The results on a real dataset demonstrated that the algorithm can fully exploit the association between strongly correlated imaging genetics data and effectively detect biomarkers of AD. Our results might provide a reference for identifying biologically significant imaging genetic correlations and help to elucidate disease-related mechanisms. Conclusions: The diagnostic model constructed from the top features of the three data modalities mined by the algorithm has high accuracy, and these features are expected to become new therapeutic targets for AD.
Alzheimer’s disease (AD) is a chronic neurodegenerative disease with an
incidence that is increasing yearly. The major clinical finding of AD is the
accumulation of amyloid
Machine learning methods have been widely used in AD data analysis, although AD datasets often lack data for some modalities. Hu et al. [3] proposed an effective data augmentation method using generative adversarial networks to reconstruct missing positron emission tomography (PET) images in order to address the class imbalance challenge. Yu et al. [4] proposed a new multi-directional perceptual generative adversarial network (MP-GAN). This method delineates subtle lesions through magnetic resonance (MR) image transformation between source and predefined target domains, and is used to visualize morphological features indicative of AD severity in patients at different stages of the disease [4]. Building on the existing sliding window correlation test, Jo et al. [5] proposed a cyclic sliding window correlation test method using a three-step approach (feature correlation analysis, feature selection, and classification) in order to improve the prediction accuracy of AD using serum-based metabolomics classification. Lee et al. [6] proposed a novel convolutional neural network model for interpolating tau PET images from more widely available cross-modal imaging inputs. This model can effectively improve the accuracy of AD classification [6].
A previous study has shown that joint nonnegative matrix decomposition is a robust algorithm for association analysis. Wang et al. [7] proposed a group sparse joint nonnegative matrix decomposition (GSJNMF) algorithm integrating single nucleotide polymorphism (SNP), functional magnetic resonance imaging (fMRI), and DNA methylation data for schizophrenia (SZ). The method incorporated the structural information of the three data types into the joint nonnegative matrix decomposition. The genetic data in the modules obtained by the algorithm were significantly correlated with the activity of several at-risk brain regions (including the insula, lingual gyrus, fusiform gyrus, postcentral gyrus, supramarginal gyrus, superior temporal gyrus, superior temporal pole, and lobule VI of the cerebellar hemisphere). Peng et al. [8] extended the GSJNMF algorithm by imposing orthogonal constraints on the basis matrix to discard insignificant features in the rows of the coefficient matrix, which improved the results. Wei et al. [9] proposed a joint connectivity-based nonnegative matrix decomposition (JCB-SNMF) algorithm and applied it to AD imaging genetic data. The algorithm added connectivity constraints on the coefficient matrix based on joint nonnegative matrix decomposition (JNMF) to incorporate the connectivity information between brain regions and genetic data of the brain. With this algorithm, some essential pairs of imaging genetic relations in AD were found. All AD samples were used as input to the proposed algorithm during the experiment. The Pearson correlation coefficient between the original matrix and the reconstructed matrix was used as an indicator of the algorithm’s performance for parameter selection. Specifically, all parameters were selected from the values [0.0001, 0.001, 0.01, 0.1, 1, 10] to evaluate the changes in the Pearson correlation coefficient under different parameter combinations. Finally, the parameter combination that maximized the Pearson correlation coefficient was selected as the final parameter setting.
Although the above algorithms incorporate a variety of prior information, they only consider the linear features of the feature matrix and cannot capture its nonlinear features. To this end, our study applied a joint depth semi-supervised nonnegative matrix decomposition (JDSNMF) algorithm to integrate structural magnetic resonance imaging (sMRI), gene expression, and SNP data of AD. The top biomarkers mined by the algorithm are expected to provide a reference for the diagnosis and treatment of AD.
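For orientation, the linear joint decomposition that the methods above build on can be sketched as follows. This is a minimal illustration of plain JNMF with standard multiplicative updates, assuming samples are stored in rows and that one basis matrix is shared across modalities; it is not the JDSNMF algorithm itself, and it omits the neural-network transformation and all constraint terms. The function names `joint_nmf` and `total_error`, the matrix sizes, and the random input data are hypothetical.

```python
import numpy as np

def joint_nmf(matrices, rank, n_iter=200, seed=0, eps=1e-10):
    """Factor each nonnegative X_i (samples x features_i) as W @ H_i.

    W (samples x rank) is shared by all modalities; each H_i is specific
    to one modality. Plain multiplicative updates, no regularization.
    """
    rng = np.random.default_rng(seed)
    n_samples = matrices[0].shape[0]
    W = rng.random((n_samples, rank))
    Hs = [rng.random((rank, X.shape[1])) for X in matrices]
    for _ in range(n_iter):
        # Update the shared basis using every modality at once.
        numer = sum(X @ H.T for X, H in zip(matrices, Hs))
        denom = W @ sum(H @ H.T for H in Hs) + eps
        W *= numer / denom
        # Update each modality-specific coefficient matrix.
        for i, X in enumerate(matrices):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
    return W, Hs

def total_error(matrices, W, Hs):
    """Sum of squared Frobenius reconstruction errors over all modalities."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(matrices, Hs))

# Synthetic stand-ins for the three modalities (sizes are assumptions).
rng = np.random.default_rng(1)
roi = rng.random((40, 90))
snp = rng.random((40, 120))
gene = rng.random((40, 150))

W, Hs = joint_nmf([roi, snp, gene], rank=5)
print("total reconstruction error:", total_error([roi, snp, gene], W, Hs))
```

Because every modality is reconstructed from the same `W`, features from different modalities that load on the same component can be grouped into one module, which is the property the association analysis relies on.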
We used
If
If
According to the value of
In the feature selection, we used the scikit-learn package for Python (v3.7, Python Software Foundation, Portland, OR, USA) to assign weights to the ROIs, SNPs, and genes in the co-expression module using the random forest (RF) algorithm. The parameter search space was set explicitly: ‘n_estimators’ was selected between 100 and 600, and ‘criterion’ was chosen between ‘gini’ and ‘entropy’.
We performed a five-fold cross-validation on the training set to construct a diagnostic model using the GridSearchCV function. Finally, the optimal parameters were ‘entropy’ for ‘criterion’ and 500 for ‘n_estimators’. In addition, we constructed diagnostic models for ROI, SNP, and genes based on the logistic regression (LR) algorithm utilizing IBM SPSS Statistics 26 software (IBM SPSS Statistics, Chicago, IL, USA).
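The RF parameter search described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' script: the data are synthetic (`make_classification`), the three candidate values for `n_estimators` are an assumed sampling of the reported 100 to 600 range, and the train/test split ratio is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the module features (assumption: 130 samples, 30 features).
X, y = make_classification(n_samples=130, n_features=30, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Search space: criterion as reported; n_estimators values are an assumed grid.
param_grid = {
    "n_estimators": [100, 350, 600],
    "criterion": ["gini", "entropy"],
}

# Five-fold cross-validation on the training set only.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Feature weights from the refitted best model, ranked from high to low.
importances = search.best_estimator_.feature_importances_
top10 = np.argsort(importances)[::-1][:10]

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
print("indices of the 10 highest-weight features:", top10)
```

Keeping the test split outside the grid search means the reported held-out score is not influenced by the parameter selection.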
The BrainNet Viewer package of Matlab 2018a software (https://www.nitrc.org/projects/bnv/) was used to visualize the important brain regions selected by the JDSNMF algorithm.
The R package “clusterProfiler” (https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) was applied to perform the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) enrichment analysis of the important genes in the module. Bubble plots were then drawn using the R package “ggplot2” (v3.4.4, https://cran.r-project.org/web/packages/ggplot2/index.html).
AD, mild cognitive impairment (MCI), and healthy control (HC) samples were
downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI)
database. Sample information is shown in Table 1. For sMRI, we aligned the data to the Montreal
Neurological Institute (MNI) standard space using the statistical parametric mapping (SPM) toolkit of Matlab
software. Correction, segmentation, and alignment were then performed and,
finally, the gray matter density of 90 brain regions (cerebellar regions
excluded) was extracted as the sMRI ROIs. Regarding the SNP data, we used the
PLINK tool (https://www.cog-genomics.org/plink2/) to remove SNPs that failed the sex check or the
Hardy-Weinberg equilibrium test, or that had a minor allele frequency below 0.05. The SNPs
were then genetically annotated using ANNOVAR (http://annovar.openbioinformatics.org/), and 2378 SNPs within
Category | Quantity | Age (mean ± SD) | Sex (M/F) |
AD | 30 | 70.80 | 17/13 |
MCI | 100 | 71.66 | 40/60 |
HC | 50 | 70.96 | 30/20 |
SD, Standard Deviation; M, male; F, female; AD, Alzheimer’s disease; MCI, mild cognitive impairment; HC, healthy control.
Category | AD vs MCI | AD vs HC | MCI vs HC |
Age | 0.032 | 0.034 | 0.23 |
Sex | 0.107 | 0.769 | 0.021 |
Since the JDSNMF algorithm was unsupervised, only the AD and MCI samples were
used as input to the algorithm. The hyperparameters to be selected for the JDSNMF
algorithm included the activation function, the learning rate, and the
reduced dimensionality. We set the number of iterations to 10,000. After fixing
the other parameters, we first selected the activation function. The losses using
the tanh, sigmoid, and rectified linear unit (ReLU) functions were 21,679.373, 4781.157, and 5563.1367,
respectively. Therefore, we chose the sigmoid function for the subsequent
analysis. Next, we selected the learning rate from the range of [0.1, 0.01, 0.001],
with losses of 30,049, 4835, and 4781, respectively. Consequently, we set the
learning rate to 0.001. Finally, we selected
The process of parameter selection. (A) Histogram showing the loss at values between 40 and 60. (B) Line graph showing the variation of the loss for the best combination of parameters.
Since we set
Histogram showing the reconstruction error of co-expression modules.
In this section, we ranked the feature importance of ROIs, genes, and SNPs in module 35 based on the RF algorithm. The histograms of the top 10 feature weights are given in Fig. 3A–C, respectively. The diagnostic performance of these top markers was then explored, and diagnostic models were constructed using these ROIs, genes, and SNPs, respectively.
Histograms showing the weights of top features. (A–C) Weight histograms of the top 10 regions of interest, top 10 single nucleotide polymorphisms, and top 10 genes, respectively.
We analyzed the correlation of the top 10 ROIs as shown in Fig. 4. Fig. 4A is a heat map drawn from the Pearson correlation coefficients among the 10 ROIs. Fig. 4B shows the connectivity of these ROIs in the brain template. Excluding the correlation of each brain region with itself, which is 1, the maximum positive correlation was between rInfPar and rAng (corr = 0.9533). The maximum negative correlation can be found between rPal and rPal (corr = –0.9679). For the top 10 genes, Fig. 5 displays the results of their GO and KEGG enrichment analysis. We aimed to explore the biological significance of the top 10 ROIs, the top 10 genes, and the biological pathways in which they are involved. In addition, to confirm the algorithm’s performance for association analysis, we plotted the correlation heat map of the top ROIs and top SNPs (Fig. 6A) and the correlation heat map of the top ROIs and top genes (Fig. 6B). In Fig. 6A, rs4844384 and lPal had the highest positive correlation (corr = 0.5754), and rs957191 and lPut had the highest negative correlation (corr = –0.4794). As shown in Fig. 6B, sulfiredoxin 1 (SRXN1) and lTha had the highest positive correlation (corr = 0.4962), and collagen type III alpha 1 chain (COL3A1) and lPut had the highest negative correlation.
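The cross-modal correlation values behind heat maps of this kind can be computed as in the following sketch. It is an illustration on random data, not the study's data: the helper name `cross_correlation`, the sample count, and the 0/1/2 genotype coding are assumptions.

```python
import numpy as np

def cross_correlation(A, B):
    """Pearson correlation between every column of A and every column of B.

    A is (samples x p) and B is (samples x q); the result is (p x q).
    """
    A_z = (A - A.mean(axis=0)) / A.std(axis=0)
    B_z = (B - B.mean(axis=0)) / B.std(axis=0)
    return A_z.T @ B_z / A.shape[0]

rng = np.random.default_rng(0)
roi = rng.normal(size=(130, 10))                         # 10 hypothetical ROI features
snp = rng.integers(0, 3, size=(130, 10)).astype(float)   # 10 hypothetical SNPs coded 0/1/2

corr = cross_correlation(roi, snp)

# Locate the most positively and most negatively correlated ROI-SNP pairs.
pos = np.unravel_index(np.argmax(corr), corr.shape)
neg = np.unravel_index(np.argmin(corr), corr.shape)
print("highest positive pair (ROI index, SNP index):", pos, corr[pos])
print("highest negative pair (ROI index, SNP index):", neg, corr[neg])

# Within-modality correlations (as in an ROI-by-ROI heat map) have a unit diagonal.
roi_corr = np.corrcoef(roi, rowvar=False)
```

The within-modality matrix `roi_corr` always has ones on its diagonal, which is why self-correlations are excluded when reporting the strongest ROI pair.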
Correlation analysis and visualization of the top 10 ROIs. (A) Heat map showing the correlation between the top 10 ROIs. (B) Visualization of the top 10 ROIs and the relationship between them on the brain template. ROIs, regions of interest.
Results of Gene Ontology and Kyoto Encyclopedia of Genes and Genomes enrichment analysis of the top 10 genes.
Heat maps showing the correlation between the top 10 ROIs and the top 10 SNPs and top 10 genes. (A) Heat map showing the correlation between the top 10 ROIs and the top 10 SNPs. (B) Heat map showing the correlation between the top 10 ROIs and the top 10 genes. SNPs, single nucleotide polymorphisms.
To evaluate the diagnostic value of the top markers, we constructed three diagnostic models for AD based on the LR algorithm, using the top 10 ROIs, the top 10 SNPs, and the top 10 genes, respectively. Fig. 7A–C shows the receiver operating characteristic (ROC) curves of the three diagnostic models. On the test set, the model based on the top 10 genes achieved the highest area under the curve (AUC = 0.947). The individual ROC curves of the top 10 ROIs, SNPs, and genes are presented in Figs. 8–10. As can be seen from the figures, the majority of top markers have AUCs greater than 0.5.
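The following sketch shows how an LR diagnostic model and its ROC curve can be evaluated. The models in this study were built in IBM SPSS Statistics; scikit-learn is used here only as a stand-in, and the synthetic data, split ratio, and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 10 top markers measured on 130 subjects.
X, y = make_classification(n_samples=130, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Joint model on all 10 markers.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)   # points of the ROC curve
auc = roc_auc_score(y_test, scores)
print("AUC of the 10-marker model:", round(auc, 3))

# One single-marker model per feature, as in per-marker ROC panels.
single_aucs = []
for k in range(X.shape[1]):
    m = LogisticRegression(max_iter=1000).fit(X_train[:, [k]], y_train)
    single_aucs.append(roc_auc_score(y_test, m.predict_proba(X_test[:, [k]])[:, 1]))
print("single-marker AUCs:", np.round(single_aucs, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination, which is the reference point for judging the single-marker curves.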
Diagnostic model construction based on top markers. (A–C) ROC curves of the top 10 ROIs, top 10 SNPs, and top 10 genes, respectively. ROC, receiver operating characteristic; TPR, true positive rate; FPR, false positive rate.
Diagnostic performance validation of the top 10 ROIs. (A–J) ROC curves of the top 10 ROIs, respectively.
Diagnostic performance validation of the top 10 SNPs. (A–J) ROC curves of the top 10 SNPs, respectively.
Diagnostic performance validation of the top 10 genes. (A–J) ROC curves of the top 10 genes, respectively.
We compared the performance of the JDSNMF algorithm with that of the JNMF and JCB-SNMF algorithms for correlation analysis (Table 3). Specifically, we used the Pearson correlation coefficient between the original and reconstructed matrices to measure how closely each decomposition reproduced the original data.
Algorithm | |||
JNMF | 0.7729 | 0.8387 | 0.7227 |
JCB-SNMF | 0.7740 | 0.8385 | 0.7218 |
JDSNMF | 0.9012 | 0.8391 | 0.9676 |
JNMF, joint nonnegative matrix decomposition; JCB-SNMF, joint connectivity-based nonnegative matrix decomposition; JDSNMF, joint depth semi-supervised nonnegative matrix decomposition.
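The similarity measure used in this comparison can be illustrated with a short sketch. Flattening both matrices before computing the Pearson coefficient is an assumption about how the score was obtained, and the function name `reconstruction_correlation` and the noisy "reconstructions" are hypothetical stand-ins for the output of an actual decomposition.

```python
import numpy as np

def reconstruction_correlation(X, X_hat):
    """Pearson correlation between the entries of X and its reconstruction."""
    return np.corrcoef(X.ravel(), X_hat.ravel())[0, 1]

rng = np.random.default_rng(0)
X = rng.random((50, 90))                       # hypothetical original matrix

# Two stand-in reconstructions: one close to X, one heavily perturbed.
X_hat_close = X + rng.normal(scale=0.05, size=X.shape)
X_hat_far = X + rng.normal(scale=1.0, size=X.shape)

r_close = reconstruction_correlation(X, X_hat_close)
r_far = reconstruction_correlation(X, X_hat_far)
print("close reconstruction:", r_close)
print("distant reconstruction:", r_far)
```

A coefficient near 1 indicates that the factor matrices retain most of the structure of the original data, so higher values in Table 3 correspond to better reconstructions.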
AD is a severe neurodegenerative disease that imposes a heavy burden on both families and society. Biomarker mining of AD can assist in relevant drug development and therapeutic target discovery. To this end, this study explored imaging genetic biomarkers of AD using the JDSNMF algorithm. Specifically, we integrated sMRI, SNP, and gene expression data of AD using the JDSNMF algorithm. The algorithm adequately captured the nonlinear features of the three sets of data. The selected module (module 35) contained 15 ROIs, 42 SNP loci, and 49 genes. We ranked the feature importance of each of the three sets of data based on the RF algorithm and finally retained the top 10 ROIs, SNPs, and genes, respectively.
This study determined the top 10 brain regions using the JDSNMF algorithm, including the Left Caudate nucleus (lCau), Right Angular gyrus (RANG), Right Inferior parietal lobule (RINFPAR), Left Inferior Frontal Gyrus (Linffroope), Right Pars orbitalis (RPAL), Left Pars orbitalis (LPAL), Left Putamen (LPUT), Language area (Lang), and Right Superior occipital gyrus (RSUPOCC). Apathy is a common neuropsychiatric symptom in AD patients. David et al. [10] indicated that dopaminergic dysfunction in the left caudate nucleus was related to atrophy of the left caudate nucleus. In a study exploring whether dopaminergic activity was related to the development of apathy in AD, Udo et al. [11] found that the blood pressure in the left caudate nucleus was negatively correlated with the Apathy Evaluation Scale-Japanese Version (AES-I-J) score. The angular gyrus is the visual language center (reading center), and its activities are related to memory retrieval and formation, perceptual attention, decision-making, and manipulation [12]. Gaubert et al. [13] stated that the angular gyrus showed a significant metabolic decline in AD patients, and its dysfunction was related to cognitive impairment. The globus pallidus is more closely related to motor symptoms than to cognitive impairment. Surveying the research on AD and normal aging over the last 20 years, Pini et al. [14] found that only one study mentioned slight morphological changes in the globus pallidus in AD.
We also determined that activity regulated cytoskeleton associated protein (ARC), golgi phosphoprotein 3 like (GOLPH3L), cytochrome P450 family 46 subfamily A member 1 (CYP46A1), NPC1 like intracellular cholesterol transporter 1 (NPC1L1), sulfiredoxin 1 (SRXN1), and interleukin 1 receptor associated kinase 3 (IRAK3) of the top 10 genes directly
or indirectly participate in the pathological process of AD. ARC
is a protein-coding gene that
plays a vital role in synaptic plasticity, learning, memory, and A
Among the pathways enriched by the top 10 genes, some genes have been confirmed
to be risk genes for AD. Corsi et al. [26] performed functional
characterization and pathway analysis of two familial AD (fAD)
mutations in the presenilin 1 (PSEN1) gene, revealing profound
expression changes in extracellular matrix components that help
elucidate the affected cellular mechanisms in AD neurons. Studies have shown that
AD may be the pathological consequence of an aging immune system [27]. In
addition, inflammation is a significant physiological immune response, and some
essential proteins can promote the clearance of inflammatory mediators to
participate in the immune response and play a role in the release of A
No published data currently support an association between AD and the top 10 SNPs identified in this study. The role these SNPs play in the pathological process of AD requires exploration in future studies. Finally, we built a diagnostic model based on these top biomarkers and explored their diagnostic performance. These biomarkers might be useful for the future diagnosis and treatment of AD.
In this study, we analyzed the imaging genetic data of AD in detail using the JDSNMF algorithm and mined several biologically significant pathways for diagnosing AD. Additionally, multiple strongly correlated ROI-SNP pairs, as well as ROI-gene pairs, were identified. However, because the AD samples are imbalanced, future studies will introduce sampling strategies to mitigate the estimation bias caused by this imbalance. The potential AD-related markers and association patterns identified remain to be validated by further experimental work.
lCau, Left Caudate nucleus; RANG, Right Angular gyrus; RINFPAR, Right Inferior parietal lobule; Linffroope, Left Inferior Frontal Gyrus; RPAL, Right Pars orbitalis; LPAL, Left Pars orbitalis; LPUT, Left Putamen; Lang, Language area; RSUPOCC, Right Superior occipital gyrus.
The data used in this paper came from the ADNI database (https://adni.loni.usc.edu/).
YW: Conceptualization, Data curation, Software, Visualization, Writing—original draft. XW: Methodology, Writing—review & editing. Both authors contributed to editorial changes in the manuscript. Both authors read and approved the final manuscript. Both authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.
Not applicable.
We thank the anonymous reviewers for their constructive comments which have helped improve the manuscript.
This research received no external funding.
The authors declare no conflict of interest.
Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.