A split-and-merge deep learning approach for phenotype prediction

Background : Phenotype prediction with genome-wide markers is a critical but difficult problem in biomedical research due to many issues such as nonlinearity of the underlying genetic mapping and high-dimensionality of marker data. When using the deep learning method in the small-n -large- p data, some serious issues occur such as over-fitting, over-parameterization, and biased prediction. Methods : In this study, we propose a split-and-merge deep learning method, named SM-DL method, to learn a neural network on the dimension reduce data by using the split-and-merge technique. Conclusions : Numerically, the proposed method has significant performance in phenotype prediction for a simulated example. A real example is used to demonstrate how the proposed method can be applied in practice.


Introduction
During the past two decades, the dramatic improvement in data collection and acquisition technologies has enabled scientists to collect a great amount of highdimensional data, for which the dimension p can be much larger than the sample size n, says small-n-large-p. For example, high-throughput genomic data are highly dimensional relative to their sample size. These data often make prediction models sensitive to noise and false positive associations, which consequently make predicting accurate prognoses difficult. The ability to predict complex traits from marker data is becoming increasingly important in plant breeding and association study [1,2].
There have been numerous studies in the genomic prediction models developed with conventional statistical algorithms, including the traditional hypothesis testing approaches [3,4], hidden Markov models [5,6], regressionbased methods [7,8], and some Bayesian algorithms [9,10]. Some approaches used regularization methods to reduce the high-dimensional feature sizes, such as ridge regression best linear unbiased prediction (rrBLUP) [7], genomic relationship best linear unbiased prediction (GBLUP) [8], Bayes-A, Bayes-B, and Bayes LASSO [9,10]. However, among these different genomic prediction models, there was not frequently observed in variation of prediction accuracy. In addition, these prediction models typically make strong assumptions and perform linear regression analysis. For instance, in the rrBLUP model, the assumption is all the marker effects are normally distributed with a small but non-zero variance and it predicts phenotypes from a linear function of genotype markers [7]. Therefore, there has difficulty capturing complex relationship within genotype, and between genotypes and phenotypes in these genomic prediction models for the highly dimensional marker data.
Deep learning is a recently developed machine learning technique that builds multi-layered neural networks containing a large number of neurons to model complex relationship in big data [11]. Deep learning has emerged as a powerful tool to improve prediction performance over traditional models for speech recognition, image identification and natural language processing [11]. This advanced model has also been adopted in bioinformatics and genomics issues recently [12][13][14][15]. Many biologists have successfully applied it to several prediction problems including the gene expression inference [16,17], the functional annotation of genetic variants [18,19], phenotype identification from genetic variations [18][19][20][21][22], the recognition of protein folds [23,24], and the prediction of genome accessibility [25]. Pérez-Enciso and Zingaretti [26] provided a guide for using deep learning for complex trait genomic prediction.
Recently, some phenotype prediction and genomic selection methods adopted the deep learning model, such as DualCNN [27], G2PDeep [28], GenNet [29], and the comparative approach [30]. Ma, et al. [31] proposed a deep learning method, called DeepGS, to predict phenotypes from genotypes using a deep convolutional neural network. Unlike conventional statistical models, DeepGS automatically learns complex relationships between genotypes and phenotypes from training data, without pre-defined rules (e.g., normal distribution, non-zero variance) for the variables in the neural network. However, when using the DeepGS method in the small-n-large-p data, it can lead to over-fitting, over-parameterization, and biased prediction. Hence, a large number of training dataset and a lowdimensional subset data are required in order to overcome these issues. In this study, a split-and-merge deep learning method, named SM-DL method, is proposed to learn a neural network on the dimension reduced subset data by using the split-and-merge technique on deep learning to obtain nonlinear dimension reduction of the original data.
The rest of the paper is organized as follows. The splitand-merge deep learning method for high dimensional data is discussed in Section 2. Section 3 presents a simulated example. A real data example is given in Section 4 to demonstrate how the proposed method can be applied in practice. Section 5 concludes this paper with brief discussion. Finally, a conclusion is given in Section 6.

Split-and-merge deep learning method
In this section, we present a split-and-merge deep learning method, referred to as the SM-DL method, to address the previously mentioned issue, such as over-fitting, over-parameterization, and biased prediction. Based on the SM-DL strategy, the information of the original input features is reduced through deep learning. And then the nonlinear sufficient dimension reduced data is used for the following network algorithms.
To be more precise, the high-dimensional input features are split into several low-dimensional subsets dependent on the dimension of feature. Here, the ways of partition feature are not restricted, which means one can partition randomly or depending on the informative rules, even if the overlapped are also allowed. For simplicity, in this study, we partition features into the non-overlapping subsets. For each low-dimensional subset, a "local" neural network is fitted. The neurons of the last hidden layer for each local network are extracted, named dimension reduced subset data. Finally, all the dimension reduced subset data are merged as the input of the "global" network and then construct a global neural network. Both the local neural network and the global neural network are trained on the training set and validated on the testing set during each fold of cross-validation. We use a ten-fold cross-validations with ten replicates to evaluate the prediction performance of the SM-DL method. If it is needed, the split-and-merge procedure can be repeated in the local network until the dimension reduce sufficiently. Fig. 1 presents the structure of the SM-DL method. The method is summarized in Table 1. 2: Partition the high dimensional dataset to many low-dimensional subsets. 3: For each low dimensional subset, do • Fit a "local" neural network.
• Get the dimension reduced subset data by extracting the outputs of the last hidden layer. 4: Combine the dimension reduced subset datasets to form a new dataset, called a dimension reduced subset data.
If the dimension of the dimension reduced subset data is still higher than n, go to step 2. 5: Learn a "global" neural network on the combined dimension reduced dataset.
The structure of both "local" and "global" neural networks via SM-DL strategy are not any strict constraints. We just note that the number of neurons in the last hidden layer of the "local" network is generally set to be smaller than the number of input features in order to serve the purpose of dimension reduction. The source codes of the SM-DL method are available at GitHub (https://github.com/WeiHeng86/SM-DL).

A simulated example
In this section, we illustrate the use of the SM-DL method via a simulated example. We mimic the real data which included the markers with biallelic genotype as the input and the continuous phenotype as the output. The framework of the relationship between the input and output was referred to in the previous study [32]. Let Y be a continuous output to mimic the phenotype of an individual and X 1 , . . . , X p be the discrete variants with values {0, 1, 2} to mimic genotypes of markers. The true dense feedforward neural network (FNN) with a 2000-500-300-100-1 structure, which includes one input layer, three hidden layers, and one output layer, is determined by is an activation function, h (£) denotes an output vector on the £-th layer, and W (£) and b (£) are a weight matrix and a bias vector, respectively. Note that the first layer h (0) receives the variants (x's) as input and h (4) is the output layer.
We first generated the 2000 biallelic markers with three genotypes represented as values {0, 1, 2}, and the minor allele frequency of each marker sets as 0.3. For the FNN, the activation function is set to a hyperbolic tangent function, the weights in W (£) 's are generated from a Gaussian mixture model, which is p(w) = φ 1 N (w:µ 1 , σ 1 ) + φ 2 N (w:µ 2 , σ 2 ) with the parameters φ 1 = φ 2 = 0.5, µ 1 = -2, µ 2 = 2, and σ 1 = σ 2 = 1, and the bias in b (£) 's follows a normal distribution with mean 0 and standard deviation 10 −4 . There are 4000 subjects in the training dataset and 1000 subjects in the test dataset, and there has 2000 variants in each subject.
For SM-DL method, we first split the variants of the training dataset into two non-overlapping subset each with size 1000, and a local network with a 1000-800-700-600-1 structure is fitted to each subset. After merging the output of the last hidden layer from the local network, which consists of 1200 units as the input of a global network, we fit a global network with a 1200-600-500-400-1 structure on the dimension reduced subset data. For all the local and global networks, the Adam method was used as the optimizer by setting the learning rate to 10 −4 and the L 2 -regularization parameter to 10 −4 for the connection weights. The loss function is set to use mean square error (MSE).
For comparison, an FNN with the true structure 2000-500-300-100-1, referred as True FNN, identical to the data generated structure was used. Moreover, a more complex large network with the structure 2000-1000-500-300-1, referred as Large FNN, was fitted the simulation data. We applied the same learning rate and regularization parameter used in the SM-DL method in the FNNs.
Both Pearson correlation coefficients (PCC) and mean squared error (MSE) were measured the performance of these methods. A ten-fold cross-validation was implemented to evaluate the train performance of each model, and the results were shown in the "Train" columns of Table 2. After getting the optimal model, the predictive performance of testing dataset was shown in the "Test" columns of Table 2. We repeated the same process 10 times, and the average PCC and MSE from the 10 calculations was reported to measure model performance. The results summarized in Table 2. The PCC of the SM-DL method is slightly lower than those of the true FNN and large FNN method; however, the MSE of the SM-DL is lower than those of the other two methods, which shows the SM-DL method has the better prediction performance. This example supports that the last hidden layer of the network retains all the response information contained in the input data.

A real data example
In this section, we present a real data example to illustrate how the SM-DL method can be applied to predict phenotype using genome-wide biomarkers. The dataset used consists of 2000 Iranian bread wheat (Triticum aestivum) landrace accessions, each of which was genotyped with 33,709 Diversity Array technology (DArt) markers. For The average Pearson correlation coefficients (PCC) and average mean squared error (MSE) with standard deviation of the estimates listed in parentheses for the SM-DL, True FNN and Large FNN methods. The "Train" presents the train performance of the training dataset and the "Test" indicates the predictive performance of the testing data set by using a ten-fold cross-validation. The average Pearson correlation coefficients (PCC) and average mean squared error (MSE) with standard deviation of the estimates listed in parentheses for the different methods. The "Train" presents the train performance of the training dataset and the "Test" indicates the predictive performance of the testing data set by using a ten-fold cross-validation.
each DArT marker, the allele was encoded by either 1 or 0, to indicate its presence or absence, respectively. The phenotype used as the response variable is grain length (GL). These genotypic and phenotypic data can be downloaded from the International Maize and Wheat Improvement Center (CIMMYT) wheat gene bank (https://www.cimmyt.org /resources/data/). Detailed descriptions for this dataset can be found in [33]. For this dataset, 2000 Iranian bread wheat were divided into a training set with 1600 subjects and a testing set with 400 subjects.
To assess the effect of the data partition, four strategies for the SM-DL method with different split sizes were compared including subset size 1000, 2000, 3000, 4000. Here, the markers were sorted by the location on the genome and then divided into the non-overlapping subset with equal size. The additional strategy without split for total of 33,709 variants was also compared. For the local network of the SM-DL method, two types of neural networks, CNN and FNN, following the DeepGS structures were adopted [31]. The CNN model was trained using the Adam method as the optimizer with the number of epochs of 6000, the learning rate of 0.01, the momentum for moving average of 0.5, and the weight of 10 −5 . For the FNN model, the parameters were optimized using the stochastic gradient descent (SGD) with the number of epochs of 10,000 and the learning rate of 10 −4 . For simplicity, the activation function and all hyper-parameters for the local networks were set to the default values given in the R package "DeepGS" (https://github.com/cma2015/DeepGS). For the global network, we use a basic FNN with one hidden layer with size 500, and use the tanh function as the activation and the MSE as the loss function. For training the global network, we use the learning rate of 0.01 and the L2-regularization parameter of 0.02. Also, two BLUP (best linear unbiased prediction)based models, which included using ridge regression BLUP (rrBLUP) and genomic relationship BLUP (GBLUP), were constructed by using the R package "rrBLUP" (https://cran .r-project.org/web/packages/rrBLUP) and "BGLR" (https: //cran.r-project.org/web/packages/BGLR), respectively.  A ten-fold cross-validation with 10 times was implemented to evaluate the train performance of each model, and the results were shown in the "Train" columns of Table 3. After getting the optimal model, the predictive performance of testing dataset was shown in the "Test" columns of Table 3. The average PCC and average MSE were used as metrics for measuring predictive performance of different models and the results were summarized in Table 3. Fig. 2A and Fig. 2B present the MSE of the testing data set for the CNN and FNN model for the ten-fold crossvalidation, respectively. CNN model has a lower mean but a higher variance of MSE than the FNN model. For the model with the local CNN network, the SM-DL method significantly outperforms using all dataset without split in both PCC and MSE. Similarly, for the model with the local FNN network, the SM-DL method has better performance than the FNN model without split structure. However, the rrBLUP and GBLUP model has higher PCC than the SM-DL method with local FNN network. It indicates the architecture of the local networks plays an important role. Overall, the SM-DL method with local CNN network has the best performance among these methods in both PCC and MSE. We can have great improvement with local CNN networks, but not with local FNN networks. For the local FNN networks, we cannot achieve sufficient dimension reduction because the subset data cannot be well approximated. Therefore, it is important for SM-DL method that the local networks should be chosen such that sufficient dimension reduction can be achieved.
Wilcoxon signed-rank test was conducted to statistically assess the performance of the SM-DL method as com-pared to two other regression-based methods. The SM-DL method with the local CNN network for comparison was the one with the highest PCC value. Table 4 shows the performance of the SM-DL method with the local CNN network is significantly better than other methods based on the 5% significance level (p-value < 0.05). Hence, the outperformance of the SM-DL method with the local CNN network was statistically significant compared to two other regression-based methods.

Discussion
A split-and-merge deep learning method to learn a neural network on the dimension reduced subset data has been developed. Two neural networks, CNN and FNN, are applied to the local network. The CNN is regarded as a local connectivity algorithm that can integrate the information of adjacent features. This structure is helpful to take the relative location information of features (genetic markers) into consideration. For this reason, we adopted CNN and FNN models in the local neural network, and the performance of CNN is much better than FNN. However, the input (dimension reduced subset data) of the global network was combined from the local model, and the relative location of these neurons was not meaningful. For this reason, we adopted FNN rather than CNN in the global neural network. It would be worthwhile to apply different structures of deep learning algorithm to the local and global neural network. Moreover, the convolutional layer in CNN is regarded as a kind of dimension reduction strategy. The effect from the split-and-merge algorithm may be partially covered by the convolutional layer. On the contrary, there is no efficient dimension reduction mechanism in the FNN model, and the advantage of the split-and-merge algorithm is shown apparently. For CNN, the optimal subset size is 3000 that based on the testing results in Table 3. For FNN, using the smaller subset size (says 1000) could help to efficiently reduce the dimension.
Here are few noteworthy perspectives for the SM-DL algorithm. First, the split and-merge procedure was efficient and effective to integrate the information of a great number of inputs such as genomic data. Considered all potential genetic variants into one algorithm to predict phenotype was the optimal circumstance. However, it is almost impossible and time-consuming because of too many genetic variants such as over 88 million identified genetic variants in the human genome [34]. In general procedures, the variants in one chromosome or a small genome fragment were considered at a time, and the variants in the different fragments were seemed independent without further integrated process. It is inappropriate because the mechanism of phenotype general resulted from the variants across several genome regions. The SM-DL algorithm used the split step to partition the genome into small fragments to construct the local network that make the computing more efficient. Also, the merge step effectively integrated the important information retrieved from the last hidden layer in each local network of fragments together to predict phenotype. The information of each fragment transferred from other research could also be adopted as the input in the merge step to make the procedure more efficient. For the split-and-merge deep learning strategy approach, the input variable as a number represents the information of a genetic molecular from a sample. We believe this approach could be applied to other types of genetic variants such as intensity from single nucleotide polymorphism microarray and read depth for copy number variation from sequencing with appropriate data preprocessing.
Second, the SM-DL algorithm could be taken as the parallel ensemble learning algorithm. Two steps for ensemble learning are essential, including several model construction parallelly and combination results of constructed models. For the split step of SM-DL, several local neural networks were built parallelly. Here, the methods of partition features are not restricted even if the overlapped features among local networks. These features have a similar mechanism but in different regions could be overlapped in several local networks. For example, the features/variants were partitioned based on the genetic location first. Then, these variants that belonged to the same gene or pathway were copied to all the corresponding networks as inputs. Alternatively, the randomly selected overlapping features were allowed to decorrelate the features in each local model such as random forest algorithm. In this study, we adopted a simple way to test the proposed SM-DL algorithm in order to focus on the performance of the split-and-merge strategy. It would be worthwhile to study the performance of the SM-DL method with the overlapping features among local networks. Moreover, a systematically integrated method via neural network model was adopted in the merge step of SM-DL. Different weights of the results from each local network can be considered via the network automatically rather than the equal weight such as generally used arithmetic mean and majority vote for ensemble learning algorithms.
Third, SM-DL can be generalized easily to different kinds of neural networks, such as FNN, CNN, and more complex structures. In the SM-DL algorithm, the neurons from the last hidden layer of each network were the only information kept to the next merge-split loop. It mentioned that SM-DL can be applied to any neural network in which information from the last hidden layer can be extracted. It is flexible to use the appropriate network and hyperparameter settings based on your data structure, application issue, and prior information. Through the SM-DL procedure with accurate prediction, the results could be further applied to select top-ranked individuals for animal breeding or find the important genetic variants for medical diagnosis.

Conclusions
In this research, we proposed a split-and-merge strategy for deep learning to treat the high-dimensional features problem. A large number of features were reduced to the lower-dimensional data while keeping the information on response contained in the features. In the simulated and real data example, the non-overlapping features among local networks were adopted and the results show the SM-DL method has the better performance. This strategy enhances the predictive performance of deep learning and can be applied to different structures of deep learning algorithms.

Author contributions
W-HH and Y-CW supervised each individual project and performed the entire research together. W-HH analyzed the data and Y-CW presented the results. Both W-HH and Y-CW contributed to write, read, and approved the submitted version of the manuscript.

Ethics approval and consent to participate
Not applicable.