Identification of Gene-Environment Interactions by Non-Parametric Kendall’s Partial Correlation with Application to TCGA Ultrahigh-Dimensional Survival Genomic Data

Background : In biomedical and epidemiological studies, gene-environment (G-E) interactions play an important role in the etiology and progression of many complex diseases. In ultra-high-dimensional survival genomic data, two common approaches (marginal and joint models) are proposed to determine important interaction biomarkers. Most existing methods for detecting G-E interactions (marginal Cox model and marginal accelerated failure time model) are limited by a lack of robustness to contamination/outliers in response outcome and prediction biomarkers. In particular, right-censored survival outcomes and ultra-high-dimensional feature space make relevant feature screening even more challenging. Methods : In this paper, we utilize the non-parametric Kendall’s partial correlation method to obtain pure correlation to determine the importance of G-E interactions concerning clinical survival data under a marginal modeling framework. Results : A series of simulated scenarios are conducted to compare the performance of our proposed method (Kendall’s partial correlation) with some commonly used methods (marginal Cox’s model, marginal accelerated failure time model, and censoring quantile partial correlation approach). In real data applications, we utilize Kendall’s partial correlation method to identify G-E interactions related to the clinical survival results of patients with esophageal, pancreatic, and lung carcinomas using The Cancer Genome Atlas clinical survival genetic data, and further establish survival prediction models. Conclusions : Overall, both simulation with medium censoring level and real data studies show that our method performs well and outperforms existing methods in the selection, estimation, and prediction accuracy of main and interacting biomarkers. These applications reveal the advantages of the non-parametric Kendall’s partial correlation approach over alternative semi-parametric marginal modeling methods. We also identified the cancer-related G-E interactions biomarkers and reported the corresponding coefficients with p -values.


Introduction
In order to understand, model, and treat complex diseases such as diabetes, cancer and so on, gene-environment (G-E) interaction has been shown to be a significant role beyond the main genetic (G) or environmental (E) effects [1,2]. G-E interaction has important implications for the etiology and progression of many complex diseases. For example, Batchelor, et al. [3] demonstrated that the interaction between gene TP53 and environmental factor age to affected the prognosis of glioblastoma. To this end, we would like to identify significant interaction biomarkers that are associated with clinical survival outcomes, which is a crucial task for establishing survival prediction models.
According to Zhou, et al. [4], several statistical methods have been developed to identify significant G-E interactions biomarkers. In high-dimensional genetic data, two general approaches are proposed to identify important interacting biomarkers and estimate their corresponding effects. One performs a marginal analysis, considering only one gene at a time; the other performs a joint analysis and considers all genes in a single model.
In the framework of marginal analysis, for each gene, to fit a model consisting of the single gene itself, consider a few E factors, and its interaction with E factors. The conceptual marginal model is "Outcome~Es + G + G * (Es)", where the response outcome variable can be continuous phenotypes, categorical disease status or patients' survival time, Es represents a set of environmental factors include environmental exposures as well as demographic, clinical and socioeconomic variables, and G*(Es) represents the interaction between the G factor and all E factors. The significant G-E interaction can be identified based on the correspondence of the marginal p-value. Since the marginal model is low-dimensional, the main advantage of the marginal model is its computational stability and conceptual simplicity; accordingly, marginal programs are still popular in the fields of bioinformatics and biomedicine. In particular, marginal popular models include the accelerated failure time (AFT) model and Cox's model.
However, a common limitation of traditional marginal analysis methods is their lack of robustness. In actual genetic studies, Xu, et al. [5] noted that long-tailed distributions and contamination in prognostic response outcomes and predictor biomarkers are not uncommon. In addition, human input errors can also lead to long-tailed distributions and contamination. The long-tailed distribution of the survival time means the survival data has a higher rate of censoring. Due to the censoring rate being higher, the proportion of missing survival time is higher. Therefore, the longtailed distribution of the survival time can lead to the poor performance of the statistical methods.
In Fig. 1, we analyze The Cancer Genome Atlas (TCGA) clinical survival data for esophageal carcinoma (ESCA), pancreatic adenocarcinoma (PAAD) and head and lung adenocarcinoma (LUAD) to show the long-tailed distribution phenomenon. In the top three panels of Fig. 1, the dashed line is the sample mean of logarithm of survival time (months) minus three times standard deviations of survival time (months), which is the 99.73% confidence interval, but we observe that there are still some cancer patients outside the 99.73% confidence interval. Looking at the bottom three panels, the dashed line is the density of fitted normal distribution for logarithm of survival time (months) and the solid line is the empirical density function for logarithm of survival time (months), so it can be seen that the empirical distribution of the actual survival data of cancer patients is different from the fitted normal distribution. And, we also performed an Anderson-Darling test for normality, with the test statistic focusing on a good fit with more emphasis on the tails. The corresponding p-values for ESCA, PAAD, and LUAD are 6e-04, 3e-03, and 3.5e-05, respectively. We have strong evidence to claim that the logarithm of survival times for TCGA data does not follow a normal distribution. From these two viewpoints, it can be inferred that the survival data is contaminated. Moreover, censored survival outcomes make the relevant feature screening difficult, so several robust methods are proposed to overcome the problem based on the marginal analysis framework [5][6][7].
Xu, et al. [5] developed the censored quantile partial correlation approach (CQPCorr) to identify G-E interactions. The CQPCorr approach is built on the quantile regression technique, utilizes weights to accommodate censoring, and adopts a partial correlation to obtain pure correlation for response and interaction biomarker by controlling the main genetic and environmental effects. Shi, et al. [6] developed a robust rank-based estimation approach for the identification of G-E interaction, which is less sensitive to model specification, but computation-demanding.
Furthermore, Wang and Chen [8] proposed an inverse probability-of-censoring weighted (IPCW) Kendall's tau statistic to measure the association of a right-censored survival trait with biomarkers, and the associated Kendall's partial correlation reflect the relationship of the survival trait with second-order variables containing quadratic and two-way interactions conditional on the main effects. In simulation studies and real data applications, they demonstrate that the newly proposed method can provide substantially higher accuracy of gene-gene interaction selection hence leading to more accurate survival prediction than existing methods, as the Kendall's tau measure is not influenced by outliers, which is a major concern in gene expression data where contaminated data are common. Furthermore, as it is a model-free measure, it can work for a wide class of survival models while being easy and fast to calculate big data with ultrahigh-dimensional features space. Consequently, we extended the application of the nonparametric IPCW Kendall's partial correlation approach to the G-E interaction content.
In this article, we perform a series of simulated scenarios to compare marginal AFT, marginal Cox and CQPCorr methods with our proposed IPCW Kendall's partial correlation method concering the accuracy of G-E interaction selection under a marginal modeling framework, while in the application of real data, we also aim at selecting several important G-E interactions associated with clinical survival outcomes of patients with ESCA, PAAD and LUAD using TCGA clinical survival genetic data [9].

The Marginal Models Review
In this section, we first introduce the common and robust marginal modeling methods for G-E interactions. Consider a study with N independent subjects. For a subject i, suppose that there are q environmental/clinical variables e i = (e i1 , e i2 , …, e iq ) ′ 1×q , and p genes . Assume the survival outcome T i is related to the environmental/clinical variables e i , gene ex-pression covariates x i , and their component-wise interactions Since the survival time may be right-censored and incompletely observed. Define observe survival time V = min(T, C), C is censoring time, and we use δ = I(T ≤ C) is the indicator of whether the survival time of subject is censored.
In a marginal Cox's regression framework for G-E selection, the hazard function at time t for subject i 's survival given all environmental factors, one gene factor k and their corresponding interactions in a single model is modeled as where λ 0 (t) is a non-negative deterministic baseline hazard function, and (α, β k , γ ·k ) are corresponding parameters for these considering biomarkers.
as full parameters of the full model. In a marginal accelerated failure time model framework for G-E selection, the log of survival time for subject i 's given all environmental factors, one gene factor k and their corresponding interactions in a single model is modeled as Where α 0 is an intercept term and ϵ is the error term. Note that the significant G-E interaction can be selected based on the correspondence of the marginal p-value.
Although inverse probability-of-censoring weighted (IPCW) Kendall's partial correlation [8] was originally developed for G-G interaction selection, this method can naturally be applied to the G-E interaction selection issue, as the concept is the same. Kendall [10] defined the partial rank correlation in the context of Kendall's correlation, and showed that Pearson's partial correlation formula still applies to Kendall's correlation. For example, for four random variables K1; K2; K3; K4, Kendall's partial correlation is calculated by the following formula gives the Kendall's partial correlation between K1 and K2 conditional on K3 and K4. Therefore, the Kendall's partial correlation of the survival trait with the G-E interaction terms can be obtained as follows, Ej G k ,G k ·Ej , j = 1, . . . , q; k = 1, . . . , p.
To accommodate right-censored survival time data, we utilize the IPCW Kendall's tau statistic proposed by Wang and Chen [8] and consider the resulting partial correlation statistics: The full computation details of (τ T,Ej G k ·Ej ,τ T,G k ·Ej , τ Ej G k ,G k ·Ej ) can be seen as follows, Note that the IPCW Kendall's tau statistic [8] can be computed as follows The CQPCorr method [5] consists of three steps, where quantile regression is used to accommodate longtailed or contaminated responses, partial correlation is used to determine important interactions biomarkers, and the main G and E effects are appropriately controlled. The full detail can be seen in Xu, et al. [5]. Note that there is a tuning parameter in the CQPCorr approach, which is a quantile used in the first and third steps, with quantile τ = 0.5 being the most popular choice [5]. The CQPCorr approach can be performed by "QPCorr.matrix" R function of the R package "GEInter" (https://cran.r-project.org/web/package s/GEInter/) [11].

Evaluation Performance in the Simulation Study
To evaluate the performance of G-E interaction selection, we considered seven measures, such as accuracy rate (ACC), true positive rate (TPR), precision rate (PRE), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR) and minimum of model size (MMS), where the definitions of the first six metrics are displayed in Supplementary Table 1 and MMS is the minimum of model size of a selected set of associated interaction variables, including all potential valid interaction biomarkers. MMS measures the complexity of the selected model and reflects the accuracy of the screening process; larger ACC, TPR, PRE, TNR/smaller FPR, FNR, MMS values indicate higher accuracy of feature screening. We adopt the hard thresholding rule proposed by Fan and Lv [12] to select the candidate set of G-E interactions; that is, after ranking the G-E interaction predictors using a certain correlation measure, we select a prefixed number ( N 2 * log(N ) ) of top-ranked G-E interaction predictors as our candidate model. We then report the average number of these seven measures for each method in 200 replications.
In order to respect the "main effect, interaction" hierarchical constraint, if we choose a w ·jk interaction biomarker in the model, then we assume the main factors e ·j are x ·k are related to the response, and all environmental factors are considered into the model, we then estimate the corresponding parameters using a maximum-likelihood estimation approach based on the AFT regression model. We report the root mean square error (RMSE) to measure the accuracy of the estimation, which is defined as where S is the full model size including all main and interaction covariates. In order to evaluate the estimation of the performance of the selected biomarkers, we report RMSE.M, the average of the root mean square error of 200 replications.
To evaluate the performance of survival prediction, letθ be an estimator of the AFT regression parameter in a prediction model obtained from the training dataset and θ as the prognosis index (PI) value for subject i. The AFT test is defined as the p-value of PI, where PI is used as the covariate in the univariate AFT model of survival outcomes in the test data. Similarly, LR test is the p-value of the log-rank test of the null hypothesis of equal survival between the "poor" and "good" prognostic groups in the test data, depending on whether the PI is higher or lower than the median PI value. Also, the c-index metric is considered to investigate survival prediction accuracy. Smaller AFT test and LR test values/larger c-index correspond to better predictive power.

Simulation Scenarios
In the following simulations, a series of simulation studies were conducted to compare the existing marginal modeling methods with IPCW Kendall's partial correlation approach (IPCW-pcorr) in the selection of G-E interactions, the estimation of the selected biomarker effects, and prediction of the final survival prediction model. We also considered a simple measure, IPCW Kendall's tau correlation (IPCW-tau). IPCW Kendall's tau correlation [8] measures the association between survival characteristics and G-E interaction biomarkers without conditional main effects.
In order to compare the CQPCorr and IPCW Kendall's partial correlation approaches clearly, we follow the simulation settings of Xu, et al. [5] by generating a cohort of 200 subjects and 100 subjects for training and testing data respectively. Each subject's survival time follows the accelerated failure time model, where the covariates e jointly follow a 5-dimensional multivariate standard normal distribution with the first-order autoregressive (AR(1)) structure that is corr (e .j , e .k ) = 0.3 |j−k| , and the covariates jointly follow a 1000dimensional multivariate standard normal distribution with the AR(1) structure that is corr (x .j , x .k ) = 0.5 |j−k| . Moreover, we assume that gene expression may be contaminated by outliers generated from a t-distribution with two degrees of freedom with a probability of 0.1. The outlier generation setting is the same as that of Wang and Chen [8]. Consider three error distributions: (Error 1) N(0, 1), (Error 2) 90% N(0, 1)+10% N(±50, 1) and (Error 3) 80% N(0, 1)+20% N(0, 50). The last two error distributions lead to long-tailed distributions/contamination. The censoring time distribution follows a uniform distribution U(a,b) , which is chosen to control the censoring rate at about 30% (light censoring) and 60% (heavy censoring) respectively.
C2 setting is the same as C1 setting except that the first type interactions and the corresponding main effects are at the same level. Specifically, γ jk = α j = β k = 1.5 for j = 1, 2 and k = 1, 2, …, 5.
C3 setting is the same as C1 setting except that the magnitudes of the main effects are larger. Specifically, C4 setting is the same as C1 setting except that the magnitudes of the interactions are smaller. Specifically, γ jk = 0.5 for j = 1, 2 and k = 1, 2, …, 5, and j = 3, 4, 5 and k = 6, 7.
Each scenario has sixteen important G-E interactions together with two main E effects and five main G effects. There are two types of important interactions. The first type includes ten interactions γ jk , j = 1, 2 and k = 1, 2, …, 5 with both main E(α 1 , α 2 ) and G(β 1 , …, β 5 ) effects. The second type includes six interactions γ jk , j = 3, 4, 5 and k = 6, 7 without main effects, which violates the "main effects, interactions" hierarchy. These simulated settings are close to the actual data. Note that the above simulated survival clinical genomic data can be generated by "simulated_data" function of the R package "GEInter" [11].

Simulation Studies
The numerical results are summarized in Tables 1,2,3,4,5 for scenarios C1 to C5 with a censoring rate of 30%. In each cell, mean (standard deviation, SD) values are based on 200 replicates. We observe that the performances of the IPCW Kendall's partial correlation approach are always better than the CQPCorr, marginal Cox, marginal AFT approaches in all evaluation metrics, but the marginal Cox and marginal AFT approaches have smaller variability compared to the IPCW Kendall's partial correlation approach in C2, C3, and C4 scenarios.
The numerical results are summarized in Supplementary Tables 2-6 for scenarios C1 to C5 with a censoring rate of 60%. In each cell, mean (SD) values are based on 200 replicates. We observe that the performances of the CQPCorr approach are always better than the alternative approaches in all evaluation metrics for scenarios C2, C3 and C4, and the IPCW Kendall's partial or IPCW-tau correlation approach performs better than others for scenarios C1 and C5. In addition, we observe the marginal Cox and marginal AFT approaches have smaller variability compared to the IPCW Kendall's partial correlation and CQPCorr approaches. Table 6 presents simulation findings in tabular form based on censoring rates (30% and 60%) and five parameter scenarios to give readers a better understanding of how these approaches stack up.

Real Data Application with TCGA ESCA Data
After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA ESCA data with 368 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%.
The top 1000 genes with the smallest p-values based on the marginal (univariate) COX model are selected for downstream analysis, since the number of cancer-related genes is expected to be limited. The seven clinical variables whose E effects are analyzed include age, gender, esophageal tumor central location, person neoplasm cancer status, race, BMI and AJCC pathologic stage, and their            summary information is reported in the Supplementary Table 7, with its source derived from Wang and Chen [13]. Some of the clinical variables contain missing values, and we use the sparse boosting method [14] in the R package "GEInter" to perform multiple imputation for the missing values in the clinical variables. We take fifty random splits of the whole data into 258:110 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA ESCA data. Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7 we see that the performance of the IPCW Kendall's partial correlation approach is better than the marginal Cox, marginal AFT and CQPCorr methods.
In addition, we apply the proposed IPCW Kendall's partial correlation approach for whole data to identify several important G-E interaction biomarkers and estimate the corresponding parameters by AFT regression model. Table 8 provides the list of selected associated predictors with their correspondence weights, with the "*" notation meaning statistical significance (p-value < 0.05). One candidate gene (GADD45B) has been shown to be related to ESCA [15] while Weygant, et al. [16] indicated that the gender factor can be used as prognosis biomarker for ESCA. In addition, we found the GADD45B-gender interaction biomarker to be significant from Table 8, and therefore we consider the GADD45B-gender biomarker to be a potential prognostic biomarker for ESCA, as the major gene GADD45B and gender factor have been documented to be associated with ESCA.

Real Data Application with TCGA PAAD Data
After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA PAAD data with 170 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%.
The top 1000 genes with the smallest p-values based on the marginal (univariate) COX model are selected for downstream analysis, since the number of cancer-related genes is expected to be limited. The seven clinical variables whose E effects are ethnicity, race, lymph node examined count, maximum tumor dimension, anatomic neoplasm subdivision, gender and age, and their summary information is reported in Supplementary Table 8, with its source being from Wang, et al. [13]. Some of the clinical variables contain missing values, and we use the sparse boosting method [14] in the R package "GEInter" to perform multiple imputation for the missing values in the clinical variables.
We take fifty random splits of the whole data into 119:51 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA PAAD data.  Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7, we see that the performance of the IPCW Kendall's partial correlation approach is better than the marginal Cox, marginal AFT and CQPCorr methods.
In addition, we apply the proposed IPCW Kendall's partial correlation approach for whole data to identify several important G-E interaction biomarkers and estimate the corresponding parameters by AFT regression model. Table 9 provides the list of selected associated predictors with their correspondence weights, with the "*" notation meaning statistical significance (p-value < 0.05). Two candidate genes (NRSN2 and TRIM59) have been shown to be related to PAAD [17,18], while Chakladar, et al. [19] indicated that the gender factor can be used as prognosis biomarker for PAAD. In addition, we found the TRIM59-gender interaction biomarker to be significant from Table 9, and therefore we consider the TRIM59-gender biomarker to be a potential prognostic biomarker for PAAD, as the major gene TRIM59 and gender factor have been documented to be associated with PAAD.

Real Data Application with TCGA LUAD Data
After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA LUAD data with 505 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 64%.
The top 1000 genes with the smallest p-values based on the marginal (univariate) COX model are selected for downstream analysis, since the number of cancer-related genes is expected to be limited. The eight clinical variables whose E effects are analyzed include age at initial pathologic diagnosis, number pack years smoked, AJCC pathologic metastasis, AJCC pathologic nodes, AJCC pathologic stage, race, gender and AJCC pathologic tumor, with their summary information reported in Supplementary Table 9, sourced from Wang, et al. [13]. Some of the clinical variables contain missing values, and we use the sparse boosting method [14] in the R package "GEInter" to perform multiple imputation for the missing values in the clinical variables.
We take fifty random splits of the whole data into 354:151 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA LUAD data. Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7, we see that the performance of the IPCW Kendall's partial correlation ap-proach is better than the marginal Cox, marginal AFT and CQPCorr methods.
In addition, we apply the proposed IPCW Kendall's partial correlation approach for whole data to identify several important G-E interaction biomarkers and estimate the corresponding parameters by AFT regression model. Table 10 provides the list of selected associated predictors with their correspondence weights, with the "*" notation meaning statistical significance (p-value < 0.05). There are six candidate genes (ATP13A4, GPR116, HABP2, MBIP, SFTA3, and ZSCAN16) have been shown to be related to LUAD [20][21][22][23][24].

Discussion
In our work, we have several motivations to adopt the sparse boosting method to impute the missing clinical variables. The first is that the sparse boosting method [14] was proposed initially to assign TCGA clinical data, and we also use TCGA clinical data in our actual data application. The second is that Wu, et al. [11] provides the friendly R package "GEInter" to perform the sparse boosting method. Therefore, we adopt the sparse boosting method to impute the missing clinical variables. We agree with the comment that using the different impute techniques to process missing values will lead to different numerical results, it deserves further study and will be investigated in our future work.
In the final step of our analysis, we utilized the maximum likelihood estimation approach to estimate the corresponding parameters of the selected candidate biomarkers. Park and Ha [25] performed variable selection and parameter estimation procedures for fixed effects in parametric AFT models using penalized likelihood procedures. We agree with the comment that the sure independence screening method [12] is just a screening process, and penalized regression should be applied to reduce the irrelevant predictors for the prediction model. Although we found some useful references about penalized AFT model, we did not find suitable software or code to perform the regularization methods. Implementing regularization methods in our analysis is worthy of further research and will be studied in our future work.
In real data application, we adopt the hard thresholding rule proposed by Fan and Lv [12] to select the candidate set of G-E interaction biomarkers; that is, after ranking the G-E interactions biomarkers using some correlation measure, we select a prefixed number of top-ranked G-E interaction predictors as our candidate model. Several alternative strategies for thresholding rule can be considered such as the soft thresholding rule proposed by Zhu, et al. [26], a method based on the control of the false-positive rate or false discovery rate by Zhao and Li [27], and a method based on multiple testing procedure by Song, et al. [28]. Furthermore, we assumed that the number of cancer-related biomarkers would be limited, so we selected the top 1000 Voorman, et al. [29] and Ueki, et al. [30] give deep insight into genome-wide environment interaction studies (GWEIS), in which gene-environment interaction analysis is common known to be susceptible to model specification. Problematic behavior may occur due to insufficient specification of the null model for models with no genetic effects. In the framework of marginal analysis, Kendall's par-tial correlation method is a pure correlation used to reflect the relationship of the survival trait with the G-E interaction biomarker conditional on the main effects. The proposed measure can mitigate errors due to unspecified null models. However, for simplicity, we specifically focus on two-way pairwise G-E interactions. Although the same idea of the proposed IPCW Kendall's partial correlation method might apply to the problem of evaluating higher-order interactions, the associated computational complexity seems challenging and will be investigated in detail in our future work.

Conclusions
In this article, we extend the non-parametric IPCW Kendall's partial correlation [8] approach to G-E interac-tions to measure the association of a right-censored survival trait with G-E interactions biomarkers, and the associated Kendall's partial correlation to reflect the relationship of the survival trait with G-E interactions predictors conditional on the genetic and environmental effects. We agree with the comment that there are no methodological advances of this paper. However, the paper provides a useful contribution to real genomic data applications. In simulations (medium censoring level) and real data applications, we show that the proposed IPCW Kendall's partial correlation method can provide substantially more powerful and accurate predictor selection, and can lead to more accurate survival prediction than alternative methods (marginal COX, marginal AFT and CQPCorr).
In the real data analysis for informative G-E interaction biomarker selection, we first performed a G-E interaction screening procedure on the entire data based on hard threshold rules using the non-parametric Kendall's partial correlation method, then incorporated all environmental factors and some genes selected based on the hierarchical principle into the survival prediction model, we then estimated the corresponding parameters of the selected candidate biomarkers using a maximum-likelihood estimation approach based on the AFT regression model. To this end, we not only defined survival prediction models, but also provided their corresponding weights for selected biomarkers, which have implications for clinical significance.

Author Contributions
J-HW conceived and designed the research study. J-HW and C-TY performed the research, analyzed the data, and prepared figures and/or tables. J-HW wrote the manuscript. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript.

Ethics Approval and Consent to Participate
Not applicable.