^{1,*}, Chun-Ting Yang

^{1}

^{1}Department of Statistics, Feng Chia University, 40724 Taichung, Taiwan

^{*}Correspondence: jhwang@mail.fcu.edu.tw (Jie-Huei Wang)

Academic Editor: Alexandros G. Georgakilas

**Submitted: 19 April 2022 | Revised: 29 June 2022 | Accepted: 7 July 2022 | Published: 18 July 2022**

**Background**: In biomedical and epidemiological studies, gene-environment
(G-E) interactions play an important role in the etiology and progression of many
complex diseases. In ultra-high-dimensional survival genomic data, two common
approaches (marginal and joint models) are proposed to determine important
interaction biomarkers. Most existing methods for detecting G-E interactions
(marginal Cox model and marginal accelerated failure time model) are limited by a
lack of robustness to contamination/outliers in response outcome and prediction
biomarkers. In particular, right-censored survival outcomes and
ultra-high-dimensional feature space make relevant feature screening even more
challenging. **Methods**: In this paper, we utilize the non-parametric
Kendall’s partial correlation method to obtain pure correlation to determine the
importance of G-E interactions concerning clinical survival data under a marginal
modeling framework. **Results**: A series of simulated scenarios are
conducted to compare the performance of our proposed method (Kendall’s partial
correlation) with some commonly used methods (marginal Cox’s model, marginal
accelerated failure time model, and censoring quantile partial correlation
approach). In real data applications, we utilize Kendall’s partial correlation
method to identify G-E interactions related to the clinical survival results of
patients with esophageal, pancreatic, and lung carcinomas using The Cancer Genome
Atlas clinical survival genetic data, and further establish survival prediction
models. **Conclusions**: Overall, both simulation with medium
censoring level and real data studies show that our method performs well and
outperforms existing methods in the selection, estimation, and prediction
accuracy of main and interacting biomarkers. These applications reveal the
advantages of the non-parametric Kendall’s partial correlation approach over
alternative semi-parametric marginal modeling methods. We also identified the
cancer-related G-E interactions biomarkers and reported the corresponding
coefficients with *p*-values.

In order to understand, model, and treat complex diseases such as diabetes,
cancer and so on, gene-environment (G-E) interaction has been shown to be a
significant role beyond the main genetic (G) or environmental (E) effects [1, 2].
G-E interaction has important implications for the etiology and progression of
many complex diseases. For example, Batchelor, *et al*. [3] demonstrated that the interaction between
gene *TP53* and environmental factor age to affected the prognosis of
glioblastoma. To this end, we would like to identify significant interaction
biomarkers that are associated with clinical survival outcomes, which is a
crucial task for establishing survival prediction models.

According to Zhou, *et al*. [4], several statistical methods have been developed to identify
significant G-E interactions biomarkers. In high-dimensional genetic data, two
general approaches are proposed to identify important interacting biomarkers and
estimate their corresponding effects. One performs a marginal analysis,
considering only one gene at a time; the other performs a joint analysis and
considers all genes in a single model.

In the framework of marginal analysis, for each gene, to fit a model consisting
of the single gene itself, consider a few E factors, and its interaction with E
factors. The conceptual marginal model is “Outcome ~ Es + G + G
* (Es)”, where the response outcome variable can be continuous phenotypes,
categorical disease status or patients’ survival time, Es represents a set of
environmental factors include environmental exposures as well as demographic,
clinical and socioeconomic variables, and G*(Es) represents the interaction
between the G factor and all E factors. The significant G-E interaction can be
identified based on the correspondence of the marginal *p*-value. Since
the marginal model is low-dimensional, the main advantage of the marginal model
is its computational stability and conceptual simplicity; accordingly, marginal
programs are still popular in the fields of bioinformatics and biomedicine. In
particular, marginal popular models include the accelerated failure time (AFT)
model and Cox’s model.

However, a common limitation of traditional marginal analysis methods is their
lack of robustness. In actual genetic studies, Xu, *et al*. [5] noted that long-tailed
distributions and contamination in prognostic response outcomes and predictor
biomarkers are not uncommon. In addition, human input errors can also lead to
long-tailed distributions and contamination. The long-tailed distribution of the
survival time means the survival data has a higher rate of censoring. Due to the
censoring rate being higher, the proportion of missing survival time is higher.
Therefore, the long-tailed distribution of the survival time can lead to the poor
performance of the statistical methods.

In Fig. 1, we analyze The Cancer Genome Atlas (TCGA) clinical survival data for
esophageal carcinoma (ESCA), pancreatic adenocarcinoma (PAAD) and head and lung
adenocarcinoma (LUAD) to show the long-tailed distribution phenomenon. In the top
three panels of Fig. 1, the dashed line is the sample mean of logarithm of
survival time (months) minus three times standard deviations of survival time
(months), which is the 99.73% confidence interval, but we observe that there are
still some cancer patients outside the 99.73% confidence interval. Looking at
the bottom three panels, the dashed line is the density of fitted normal
distribution for logarithm of survival time (months) and the solid line is the
empirical density function for logarithm of survival time (months), so it can be
seen that the empirical distribution of the actual survival data of cancer
patients is different from the fitted normal distribution. And, we also performed
an Anderson-Darling test for normality, with the test statistic focusing on a
good fit with more emphasis on the tails. The corresponding *p*-values for
ESCA, PAAD, and LUAD are 6e-04, 3e-03, and 3.5e-05, respectively. We have strong
evidence to claim that the logarithm of survival times for TCGA data does not
follow a normal distribution.

**The scatter plot and long-tailed distribution of clinical
survival time data for the TCGA ESCA, PAAD, and LUAD.**

From these two viewpoints, it can be inferred that the survival data is contaminated. Moreover, censored survival outcomes make the relevant feature screening difficult, so several robust methods are proposed to overcome the problem based on the marginal analysis framework [5, 6, 7].

Xu, *et al*. [5] developed the censored quantile partial correlation approach (CQPCorr) to
identify G-E interactions. The CQPCorr approach is built on the quantile
regression technique, utilizes weights to accommodate censoring, and adopts a
partial correlation to obtain pure correlation for response and interaction
biomarker by controlling the main genetic and environmental effects. Shi, *et al*.
[6] developed a robust rank-based estimation approach for the identification of
G-E interaction, which is less sensitive to model specification, but
computation-demanding.

Furthermore, Wang and Chen [8] proposed an inverse probability-of-censoring weighted (IPCW) Kendall’s tau statistic to measure the association of a right-censored survival trait with biomarkers, and the associated Kendall’s partial correlation reflect the relationship of the survival trait with second-order variables containing quadratic and two-way interactions conditional on the main effects. In simulation studies and real data applications, they demonstrate that the newly proposed method can provide substantially higher accuracy of gene-gene interaction selection hence leading to more accurate survival prediction than existing methods, as the Kendall’s tau measure is not influenced by outliers, which is a major concern in gene expression data where contaminated data are common. Furthermore, as it is a model-free measure, it can work for a wide class of survival models while being easy and fast to calculate big data with ultrahigh-dimensional features space. Consequently, we extended the application of the non-parametric IPCW Kendall’s partial correlation approach to the G-E interaction content.

In this article, we perform a series of simulated scenarios to compare marginal AFT, marginal Cox and CQPCorr methods with our proposed IPCW Kendall’s partial correlation method concering the accuracy of G-E interaction selection under a marginal modeling framework, while in the application of real data, we also aim at selecting several important G-E interactions associated with clinical survival outcomes of patients with ESCA, PAAD and LUAD using TCGA clinical survival genetic data [9].

In this section, we first introduce the common and robust marginal modeling
methods for G-E interactions. Consider a study with *N* independent
subjects. For a subject *i*, suppose that there are *q
*environmental/clinical variables

Since the survival time may be right-censored and incompletely observed. Define
observe survival time *C* is censoring time, and we use

In a marginal Cox’s regression framework for G-E selection, the hazard function
at time *t* for subject *i* ‘s survival given all environmental
factors, one gene factor k and their corresponding interactions in a single model
is modeled as

where

In a marginal accelerated failure time model framework for G-E selection, the
log of survival time for subject *i* ‘s given all environmental factors,
one gene factor k and their corresponding interactions in a single model is
modeled as

Where *p*-value.

Although inverse probability-of-censoring weighted (IPCW) Kendall’s partial correlation [8] was originally developed for G-G interaction selection, this method can naturally be applied to the G-E interaction selection issue, as the concept is the same. Kendall [10] defined the partial rank correlation in the context of Kendall’s correlation, and showed that Pearson’s partial correlation formula still applies to Kendall’s correlation. For example, for four random variables K1; K2; K3; K4, Kendall’s partial correlation is calculated by the following formula

gives the Kendall’s partial correlation between K1 and K2 conditional on K3 and K4.

Therefore, the Kendall’s partial correlation of the survival trait with the G-E interaction terms can be obtained as follows,

To accommodate right-censored survival time data, we utilize the IPCW Kendall’s tau statistic proposed by Wang and Chen [8] and consider the resulting partial correlation statistics:

The full computation details of
(

Note that the IPCW Kendall’s tau statistic [8] can be computed as follows

The CQPCorr method [5] consists of three steps, where quantile regression is
used to accommodate long-tailed or contaminated responses, partial correlation is
used to determine important interactions biomarkers, and the main G and E effects
are appropriately controlled. The full detail can be seen in Xu, *et al*. [5]. Note that there
is a tuning parameter in the CQPCorr approach, which is a quantile used in the
first and third steps, with quantile

To evaluate the performance of G-E interaction selection, we considered seven
measures, such as accuracy rate (ACC), true positive rate (TPR), precision rate
(PRE), true negative rate (TNR), false positive rate (FPR), false negative rate
(FNR) and minimum of model size (MMS), where the definitions of the first six
metrics are displayed in **Supplementary Table 1** and MMS is the minimum of
model size of a selected set of associated interaction variables, including all
potential valid interaction biomarkers. MMS measures the complexity of the
selected model and reflects the accuracy of the screening process; larger ACC,
TPR, PRE, TNR/smaller FPR, FNR, MMS values indicate higher accuracy of feature
screening. We adopt the hard thresholding rule proposed by Fan and Lv [12] to select the
candidate set of G-E interactions; that is, after ranking the G-E interaction
predictors using a certain correlation measure, we select a prefixed number
(

In order to respect the “main effect, interaction” hierarchical constraint, if
we choose a

where *S* is the full model size including all main and interaction
covariates. In order to evaluate the estimation of the performance of the
selected biomarkers, we report RMSE.M, the average of the root mean square error
of 200 replications.

To evaluate the performance of survival prediction, let *, ${\delta}_{i}^{\mathrm{*}}$ ,${\mathbf{e}}_{i}^{{\mathrm{*}}^{\mathrm{\prime}}}$ ,
${\mathbf{x}}_{i}^{{\mathrm{*}}^{\mathrm{\prime}}}$, ${\mathbf{w}}_{i}^{{\mathrm{*}}^{\mathrm{\prime}}}$*) the survival and covariate data of subject

*i*in the test data. Define (

*i*. The AFT test is defined as the

*p*-value of PI, where PI is used as the covariate in the univariate AFT model of survival outcomes in the test data. Similarly, LR test is the

*p*-value of the log-rank test of the null hypothesis of equal survival between the “poor” and “good” prognostic groups in the test data, depending on whether the PI is higher or lower than the median PI value. Also, the c-index metric is considered to investigate survival prediction accuracy. Smaller AFT test and LR test values/larger c-index correspond to better predictive power.

In the following simulations, a series of simulation studies were conducted to compare the existing marginal modeling methods with IPCW Kendall’s partial correlation approach (IPCW-pcorr) in the selection of G-E interactions, the estimation of the selected biomarker effects, and prediction of the final survival prediction model. We also considered a simple measure, IPCW Kendall’s tau correlation (IPCW-tau). IPCW Kendall’s tau correlation [8] measures the association between survival characteristics and G-E interaction biomarkers without conditional main effects.

In order to compare the CQPCorr and IPCW Kendall’s partial correlation
approaches clearly, we follow the simulation settings of Xu, *et al*. [5] by generating a
cohort of 200 subjects and 100 subjects for training and testing data
respectively. Each subject’s survival time follows the accelerated failure time
model,

where the covariates ** e** jointly follow a 5-dimensional multivariate standard
normal distribution with the first-order autoregressive (AR(1)) structure that is

Five parameters scenarios are considered:

C1 setting has

C2 setting is the same as C1 setting except that the first type interactions and
the corresponding main effects are at the same level. Specifically,

C3 setting is the same as C1 setting except that the magnitudes of the main
effects are larger. Specifically,

C4 setting is the same as C1 setting except that the magnitudes of the
interactions are smaller. Specifically,

C5 setting is the same as C1 setting except that the first type interactions
have negative effects. Specifically,

Each scenario has sixteen important G-E interactions together with two main E
effects and five main G effects. There are two types of important interactions.
The first type includes ten interactions

The numerical results are summarized in Tables 1,2,3,4,5 for scenarios C1 to C5 with a censoring rate of 30%. In each cell, mean (standard deviation, SD) values are based on 200 replicates. We observe that the performances of the IPCW Kendall’s partial correlation approach are always better than the CQPCorr, marginal Cox, marginal AFT approaches in all evaluation metrics, but the marginal Cox and marginal AFT approaches have smaller variability compared to the IPCW Kendall’s partial correlation approach in C2, C3, and C4 scenarios.

**Simulation results for Scenario C1 with a censoring rate of 30%.**

Error | Approach | Acc | TPR | Pre | TNR | FPR | FNR | MMS | MSE | PI | LR | c-index |

1 | IPCW-tau | 0.9974 (0.0005) | 0.6981 (0.0841) | 0.5852 (0.0711) | 0.9984 (0.0003) | 0.0016 (0.0003) | 0.3019 (0.0841) | 1100.425 (1113.88) | 0.0457 (0.0111) | 0 (0) | 0 (0) | 0.8805 (0.0364) |

IPCW-pcorr | 0.9974 (0.0006) | 0.6991 (0.0852) | 0.5845 (0.073) | 0.9984 (0.0003) | 0.0016 (0.0003) | 0.3009 (0.0852) | 1118.815 (1130.12) | 0.0458 (0.0110) | 0 (0) | 0 (0) | 0.8805 (0.0362) | |

CQPCorr | 0.9965 (0.0007) | 0.5494 (0.1148) | 0.4626 (0.0967) | 0.9980 (0.0004) | 0.0020 (0.0004) | 0.4506 (0.1148) | 1532.490 (1274.86) | 0.0688 (0.0147) | 0 (0) | 0 (0.0001) | 0.8277 (0.0524) | |

COX | 0.9962 (0.0007) | 0.5094 (0.0986) | 0.4276 (0.0838) | 0.9978 (0.0004) | 0.0022 (0.0004) | 0.4906 (0.0986) | 2769.280 (1337.15) | 0.0708 (0.0164) | 0 (0.0001) | 0.0003 (0.0021) | 0.8125 (0.0546) | |

AFT | 0.9961 (0.0011) | 0.5116 (0.1267) | 0.4229 (0.1118) | 0.9977 (0.0008) | 0.0023 (0.0008) | 0.4884 (0.1267) | 3031.525 (1253.05) | 0.0655 (0.0178) | 0.0038 (0.054) | 0.0045 (0.055) | 0.8122 (0.0592) | |

2 | IPCW-tau | 0.9967 (0.0006) | 0.5750 (0.091) | 0.4834 (0.0767) | 0.9980 (0.0003) | 0.0020 (0.0003) | 0.4250 (0.091) | 1628.050 (1258.73) | 0.1248 (0.0151) | 0.0057 (0.0512) | 0.0058 (0.034) | 0.7085 (0.0468) |

IPCW-pcorr | 0.9967 (0.0006) | 0.5775 (0.0919) | 0.4853 (0.0761) | 0.9980 (0.0003) | 0.0020 (0.0003) | 0.4225 (0.0919) | 1649.265 (1275.03) | 0.1250 (0.0151) | 0.0035 (0.0217) | 0.0037 (0.0157) | 0.7072 (0.0465) | |

CQPCorr | 0.9960 (0.0008) | 0.4625 (0.1241) | 0.3895 (0.1045) | 0.9977 (0.0004) | 0.0023 (0.0004) | 0.5375 (0.1241) | 1898.270 (1342.61) | 0.1326 (0.0167) | 0.0174 (0.1048) | 0.0193 (0.0768) | 0.6863 (0.0574) | |

COX | 0.9954 (0.0007) | 0.3778 (0.1036) | 0.3182 (0.0872) | 0.9974 (0.0003) | 0.0026 (0.0003) | 0.6222 (0.1036) | 3311.220 (1136.78) | 0.1409 (0.0173) | 0.0237 (0.0881) | 0.0323 (0.0938) | 0.6599 (0.0525) | |

AFT | 0.9946 (0.0007) | 0.2550 (0.1097) | 0.2134 (0.0915) | 0.9970 (0.0004) | 0.0030 (0.0004) | 0.7450 (0.1097) | 3912.250 (887.96) | 0.1490 (0.0167) | 0.0804 (0.187) | 0.1092 (0.2179) | 0.6259 (0.0551) | |

3 | IPCW-tau | 0.9964 (0.0007) | 0.5269 (0.1099) | 0.4429 (0.0918) | 0.9979 (0.0004) | 0.0021 (0.0004) | 0.4731 (0.1099) | 2209.155 (1432.65) | 0.1669 (0.0312) | 0.0253 (0.0764) | 0.0393 (0.1192) | 0.6560 (0.0508) |

IPCW-pcorr | 0.9964 (0.0007) | 0.5253 (0.1084) | 0.4416 (0.0906) | 0.9979 (0.0003) | 0.0021 (0.0003) | 0.4747 (0.1084) | 2222.295 (1444.16) | 0.1676 (0.0306) | 0.0362 (0.1200) | 0.0376 (0.1364) | 0.6552 (0.0486) | |

CQPCorr | 0.9957 (0.0008) | 0.4212 (0.1258) | 0.3547 (0.106) | 0.9975 (0.0004) | 0.0025 (0.0004) | 0.5788 (0.1258) | 2170.735 (1289.56) | 0.1661 (0.0287) | 0.0404 (0.1209) | 0.0606 (0.1651) | 0.6461 (0.054) | |

COX | 0.9951 (0.0008) | 0.3294 (0.119) | 0.2774 (0.1002) | 0.9972 (0.0004) | 0.0028 (0.0004) | 0.6706 (0.1190) | 3570.570 (1054.19) | 0.1815 (0.0306) | 0.0722 (0.1663) | 0.0985 (0.1844) | 0.6213 (0.0503) | |

AFT | 0.9941 (0.0007) | 0.1691 (0.109) | 0.1414 (0.0907) | 0.9967 (0.0004) | 0.0033 (0.0004) | 0.8309 (0.1090) | 4029.910 (796.717) | 0.1859 (0.0307) | 0.2017 (0.2699) | 0.2482 (0.291) | 0.5787 (0.0512) | |

In each cell, mean (SD) is based on 200 replicates. |

**Simulation results for Scenario C2 with a censoring rate of 30%.**

Error | Approach | Acc | TPR | Pre | TNR | FPR | FNR | MMS | MSE | PI | LR | c-index |

1 | IPCW-tau | 0.9970 (0.0007) | 0.6206 (0.1093) | 0.5217 (0.0926) | 0.9982 (0.0004) | 0.0018 (0.0004) | 0.3794 (0.1093) | 994.4750 (1023.24) | 0.0546 (0.0136) | 0 (0) | 0 (0.0003) | 0.8451 (0.047) |

IPCW-pcorr | 0.9971 (0.0007) | 0.6362 (0.1112) | 0.5354 (0.0934) | 0.9982 (0.0004) | 0.0018 (0.0004) | 0.3638 (0.1112) | 956.3250 (1034.12) | 0.0521 (0.0133) | 0 (0) | 0 (0.0001) | 0.8513 (0.0445) | |

CQPCorr | 0.9957 (0.0009) | 0.4219 (0.1364) | 0.3553 (0.1148) | 0.9975 (0.0004) | 0.0025 (0.0004) | 0.5781 (0.1364) | 1924.255 (1380.74) | 0.0747 (0.0147) | 0.0009 (0.0074) | 0.0046 (0.0539) | 0.7764 (0.0705) | |

COX | 0.9961 (0.0007) | 0.4875 (0.1048) | 0.4098 (0.0894) | 0.9977 (0.0004) | 0.0023 (0.0004) | 0.5125 (0.1048) | 2533.405 (1448.55) | 0.0652 (0.0141) | 0.0002 (0.0024) | 0.0006 (0.0054) | 0.8026 (0.0564) | |

AFT | 0.9958 (0.001) | 0.4562 (0.1347) | 0.3796 (0.1147) | 0.9976 (0.0007) | 0.0024 (0.0007) | 0.5438 (0.1347) | 2702.195 (1333.10) | 0.0649 (0.0161) | 0.0099 (0.079) | 0.0123 (0.0899) | 0.7879 (0.0668) | |

2 | IPCW-tau | 0.9959 (0.0008) | 0.4600 (0.1218) | 0.3871 (0.1027) | 0.9977 (0.0004) | 0.0023 (0.0004) | 0.5400 (0.1218) | 1707.035 (1307.47) | 0.1284 (0.0157) | 0.0332 (0.1285) | 0.0294 (0.1134) | 0.6756 (0.0611) |

IPCW-pcorr | 0.9961 (0.0008) | 0.4806 (0.1224) | 0.4046 (0.103) | 0.9977 (0.0004) | 0.0023 (0.0004) | 0.5194 (0.1224) | 1661.200 (1343.69) | 0.1274 (0.0158) | 0.0167 (0.0647) | 0.0200 (0.0773) | 0.6810 (0.0557) | |

CQPCorr | 0.9952 (0.0008) | 0.3406 (0.1317) | 0.2868 (0.1109) | 0.9973 (0.0004) | 0.0027 (0.0004) | 0.6594 (0.1317) | 2329.945 (1402.15) | 0.1336 (0.0155) | 0.0645 (0.1667) | 0.0636 (0.1430) | 0.6457 (0.0594) | |

COX | 0.9951 (0.0007) | 0.3353 (0.1114) | 0.2824 (0.0938) | 0.9973 (0.0004) | 0.0027 (0.0004) | 0.6647 (0.1114) | 3208.765 (1183.70) | 0.1371 (0.0148) | 0.0541 (0.1537) | 0.0465 (0.1375) | 0.6458 (0.0531) | |

AFT | 0.9940 (0.0008) | 0.1659 (0.0992) | 0.1389 (0.0833) | 0.9967 (0.0005) | 0.0033 (0.0005) | 0.8341 (0.0992) | 4005.95 (836.40) | 0.1468 (0.015) | 0.1490 (0.2301) | 0.1548 (0.2455) | 0.6016 (0.0556) | |

3 | IPCW-tau | 0.9956 (0.0008) | 0.3991 (0.1277) | 0.3361 (0.1076) | 0.9975 (0.0004) | 0.0025 (0.0004) | 0.6009 (0.1277) | 2194.13 (1335.13) | 0.1705 (0.0317) | 0.0729 (0.1595) | 0.0799 (0.1651) | 0.6282 (0.0543) |

IPCW-pcorr | 0.9956 (0.0008) | 0.4034 (0.1293) | 0.3395 (0.1088) | 0.9975 (0.0004) | 0.0025 (0.0004) | 0.5966 (0.1293) | 2136.10 (1372.06) | 0.1702 (0.0311) | 0.0710 (0.1518) | 0.0676 (0.1521) | 0.6318 (0.0543) | |

CQPCorr | 0.9949 (0.0008) | 0.2928 (0.129) | 0.2466 (0.1086) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7072 (0.129) | 2585.69 (1313.51) | 0.1683 (0.0294) | 0.1365 (0.2376) | 0.1236 (0.2303) | 0.6095 (0.0541) | |

COX | 0.9948 (0.0008) | 0.2734 (0.1181) | 0.2303 (0.0995) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7266 (0.1181) | 3609.92 (1065.72) | 0.1784 (0.0316) | 0.1521 (0.2345) | 0.1394 (0.2164) | 0.6022 (0.0532) | |

AFT | 0.9936 (0.0006) | 0.0991 (0.0841) | 0.0830 (0.0707) | 0.9965 (0.0003) | 0.0035 (0.0003) | 0.9009 (0.0841) | 4140.23 (792.79) | 0.1836 (0.0297) | 0.3254 (0.3175) | 0.3215 (0.3102) | 0.5580 (0.0496) | |

In each cell, mean (SD) is based on 200 replicates. |

**Simulation results for Scenario C3 with a censoring rate of 30%.**

Error | Approach | Acc | TPR | Pre | TNR | FPR | FNR | MMS | MSE | PI | LR | c-index |

1 | IPCW-tau | 0.9955 (0.0008) | 0.3956 (0.1263) | 0.3332 (0.1063) | 0.9975 (0.0004) | 0.0025 (0.0004) | 0.6044 (0.1263) | 2366.805 (1391.63) | 0.0995 (0.026) | 0.0005 (0.0066) | 0.0018 (0.0165) | 0.8014 (0.0676) |

IPCW-pcorr | 0.9958 (0.0008) | 0.4350 (0.1262) | 0.3663 (0.1063) | 0.9976 (0.0004) | 0.0024 (0.0004) | 0.5650 (0.1262) | 2190.735 (1442.02) | 0.0925 (0.0261) | 0 (0.0001) | 0.0002 (0.0025) | 0.8187 (0.0601) | |

CQPCorr | 0.9945 (0.0008) | 0.2397 (0.1256) | 0.2018 (0.1058) | 0.9970 (0.0004) | 0.0030 (0.0004) | 0.7603 (0.1256) | 3086.990 (1276.07) | 0.1268 (0.0242) | 0.0376 (0.1495) | 0.0279 (0.0899) | 0.7244 (0.0933) | |

COX | 0.9955 (0.0007) | 0.3900 (0.1096) | 0.3284 (0.0923) | 0.9974 (0.0004) | 0.0026 (0.0004) | 0.6100 (0.1096) | 3335.320 (1154.68) | 0.0971 (0.0268) | 0.0003 (0.0043) | 0.0008 (0.0074) | 0.8104 (0.0599) | |

AFT | 0.9955 (0.0009) | 0.3928 (0.1309) | 0.3272 (0.1103) | 0.9974 (0.0005) | 0.0026 (0.0005) | 0.6072 (0.1309) | 3675.175 (1078.55) | 0.0895 (0.0259) | 0.0059 (0.0563) | 0.0068 (0.0724) | 0.8124 (0.069) | |

2 | IPCW-tau | 0.9948 (0.0008) | 0.2875 (0.1258) | 0.2421 (0.1059) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7125 (0.1258) | 2893.300 (1361.82) | 0.1612 (0.0237) | 0.0275 (0.114) | 0.0271 (0.1075) | 0.6761 (0.0627) |

IPCW-pcorr | 0.9950 (0.0008) | 0.3172 (0.1298) | 0.2671 (0.1093) | 0.9972 (0.0004) | 0.0028 (0.0004) | 0.6828 (0.1298) | 2755.805 (1395.49) | 0.1578 (0.0245) | 0.0276 (0.1157) | 0.0388 (0.1587) | 0.6877 (0.0635) | |

CQPCorr | 0.9943 (0.0007) | 0.1991 (0.1162) | 0.1676 (0.0979) | 0.9968 (0.0004) | 0.0032 (0.0004) | 0.8009 (0.1162) | 3260.400 (1221.02) | 0.1704 (0.0216) | 0.0772 (0.1827) | 0.0817 (0.185) | 0.6405 (0.07) | |

COX | 0.9947 (0.0007) | 0.2581 (0.1098) | 0.2174 (0.0925) | 0.9970 (0.0004) | 0.0030 (0.0004) | 0.7419 (0.1098) | 3802.255 (957.41) | 0.1626 (0.0223) | 0.0241 (0.1118) | 0.0238 (0.1006) | 0.6790 (0.0589) | |

AFT | 0.9941 (0.0007) | 0.1822 (0.1065) | 0.1519 (0.0883) | 0.9967 (0.0005) | 0.0033 (0.0005) | 0.8178 (0.1065) | 4055.030 (758.31) | 0.1709 (0.0218) | 0.0521 (0.1645) | 0.0522 (0.1447) | 0.6501 (0.065) | |

3 | IPCW-tau | 0.9947 (0.0008) | 0.2731 (0.1283) | 0.2300 (0.108) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7269 (0.1283) | 3334.465 (1228.05) | 0.1971 (0.0365) | 0.0450 (0.1346) | 0.0536 (0.1553) | 0.6438 (0.057) |

IPCW-pcorr | 0.9949 (0.0008) | 0.2912 (0.126) | 0.2453 (0.1061) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7088 (0.126) | 3144.285 (1244.02) | 0.1954 (0.0355) | 0.0431 (0.1242) | 0.0516 (0.1527) | 0.6469 (0.0594) | |

CQPCorr | 0.9942 (0.0007) | 0.1841 (0.113) | 0.1550 (0.0951) | 0.9968 (0.0004) | 0.0032 (0.0004) | 0.8159 (0.113) | 3459.235 (1118.06) | 0.2030 (0.0313) | 0.1194 (0.2277) | 0.1081 (0.2135) | 0.6150 (0.062) | |

COX | 0.9944 (0.0007) | 0.2109 (0.1071) | 0.1776 (0.0902) | 0.9969 (0.0003) | 0.0031 (0.0003) | 0.7891 (0.1071) | 3985.405 (840.16) | 0.2033 (0.0295) | 0.0896 (0.2157) | 0.0815 (0.1915) | 0.6343 (0.0582) | |

AFT | 0.9937 (0.0006) | 0.1175 (0.0937) | 0.0986 (0.0787) | 0.9966 (0.0003) | 0.0034 (0.0003) | 0.8825 (0.0937) | 4215.990 (707.02) | 0.2100 (0.0334) | 0.1610 (0.258) | 0.1593 (0.267) | 0.5972 (0.0601) | |

In each cell, mean (SD) is based on 200 replicates. |

**Simulation results for Scenario C4 with a censoring rate of 30%.**

Error | Approach | Acc | TPR | Pre | TNR | FPR | FNR | MMS | MSE | PI | LR | c-index |

1 | IPCW-tau | 0.9950 (0.0008) | 0.3153 (0.1236) | 0.2655 (0.1041) | 0.9972 (0.0004) | 0.0028 (0.0004) | 0.6847 (0.1236) | 2563.640 (1320.47) | 0.0409 (0.0095) | 0.0088 (0.0580) | 0.0212 (0.1062) | 0.7293 (0.0759) |

IPCW-pcorr | 0.9951 (0.0008) | 0.3328 (0.1240) | 0.2803 (0.1045) | 0.9973 (0.0004) | 0.0027 (0.0004) | 0.6672 (0.1240) | 2197.875 (1271.31) | 0.0401 (0.0094) | 0.0053 (0.0439) | 0.0135 (0.0559) | 0.7382 (0.0765) | |

CQPCorr | 0.9942 (0.0007) | 0.1816 (0.1146) | 0.1529 (0.0965) | 0.9968 (0.0004) | 0.0032 (0.0004) | 0.8184 (0.1146) | 3632.740 (1118.21) | 0.0470 (0.006) | 0.0638 (0.1655) | 0.0747 (0.1868) | 0.6514 (0.0806) | |

COX | 0.9951 (0.0007) | 0.3253 (0.1144) | 0.2739 (0.0963) | 0.9972 (0.0004) | 0.0028 (0.0004) | 0.6747 (0.1144) | 2823.735 (1282.34) | 0.0370 (0.0066) | 0.0010 (0.0106) | 0.0016 (0.0114) | 0.7615 (0.064) | |

AFT | 0.9947 (0.0008) | 0.2744 (0.1308) | 0.2304 (0.1100) | 0.9971 (0.0004) | 0.0029 (0.0004) | 0.7256 (0.1308) | 2932.025 (1271.79) | 0.0388 (0.0089) | 0.0210 (0.1205) | 0.0288 (0.1397) | 0.7342 (0.0794) | |

2 | IPCW-tau | 0.9943 (0.0007) | 0.2037 (0.1171) | 0.1716 (0.0986) | 0.9968 (0.0004) | 0.0032 (0.0004) | 0.7962 (0.1171) | 3247.700 (1167.82) | 0.1184 (0.0141) | 0.2933 (0.3132) | 0.1963 (0.262) | 0.5812 (0.0607) |

IPCW-pcorr | 0.9944 (0.0008) | 0.2131 (0.1212) | 0.1795 (0.1021) | 0.9969 (0.0004) | 0.0031 (0.0004) | 0.7869 (0.1212) | 3004.435 (1192.98) | 0.1186 (0.0149) | 0.2565 (0.2773) | 0.1996 (0.2662) | 0.5864 (0.0613) | |

CQPCorr | 0.9939 (0.0007) | 0.1416 (0.1032) | 0.1192 (0.0869) | 0.9966 (0.0003) | 0.0034 (0.0003) | 0.8584 (0.1032) | 3734.380 (1045.65) | 0.1149 (0.0131) | 0.3258 (0.2996) | 0.2536 (0.29) | 0.5652 (0.0577) | |

COX | 0.9942 (0.0006) | 0.1803 (0.0979) | 0.1518 (0.0824) | 0.9968 (0.0003) | 0.0032 (0.0003) | 0.8197 (0.0979) | 3677.595 (1054.77) | 0.1208 (0.0132) | 0.2336 (0.2641) | 0.1274 (0.2101) | 0.5962 (0.0563) | |

AFT | 0.9932 (0.0005) | 0.0303 (0.0469) | 0.0255 (0.0395) | 0.9963 (0.0004) | 0.0037 (0.0004) | 0.9697 (0.0469) | 4463.435 (446.62) | 0.1240 (0.0129) | 0.4211 (0.3052) | 0.3092 (0.2942) | 0.5477 (0.0462) | |

3 | IPCW-tau | 0.9940 (0.0007) | 0.1512 (0.103) | 0.1274 (0.0868) | 0.9967 (0.0003) | 0.0033 (0.0003) | 0.8488 (0.103) | 3694.655 (1072.43) | 0.1622 (0.0303) | 0.3868 (0.3014) | 0.3065 (0.2821) | 0.5511 (0.0486) |

IPCW-pcorr | 0.9940 (0.0007) | 0.1509 (0.1064) | 0.1271 (0.0896) | 0.9967 (0.0003) | 0.0033 (0.0003) | 0.8491 (0.1064) | 3590.680 (1041.92) | 0.1620 (0.0296) | 0.3894 (0.3063) | 0.3088 (0.2861) | 0.5534 (0.0506) | |

CQPCorr | 0.9937 (0.0006) | 0.1128 (0.0872) | 0.0950 (0.0734) | 0.9965 (0.0003) | 0.0035 (0.0003) | 0.8872 (0.0872) | 3949.735 (960.11) | 0.1529 (0.0289) | 0.3934 (0.3077) | 0.3464 (0.317) | 0.5487 (0.0456) | |

COX | 0.9938 (0.0005) | 0.1181 (0.0856) | 0.0995 (0.0721) | 0.9966 (0.0003) | 0.0034 (0.0003) | 0.8819 (0.0856) | 4090.200 (802.156) | 0.1648 (0.0321) | 0.3606 (0.2951) | 0.2857 (0.3036) | 0.5586 (0.0491) | |

AFT | 0.9931 (0.0003) | 0.0163 (0.0382) | 0.0136 (0.0322) | 0.9962 (0.0002) | 0.0038 (0.0002) | 0.9838 (0.0382) | 4525.750 (444.49) | 0.1594 (0.0303) | 0.4492 (0.2978) | 0.4227 (0.2807) | 0.5314 (0.0399) | |

In each cell, mean (SD) is based on 200 replicates. |

**Simulation results for Scenario C5 with a censoring rate of 30%.**

Error | Approach | Acc | TPR | Pre | TNR | FPR | FNR | MMS | MSE | PI | LR | c-index |

1 | IPCW-tau | 0.9972 (0.0006) | 0.6506 (0.0900) | 0.5474 (0.0752) | 0.9983 (0.0003) | 0.0017 (0.0003) | 0.3494 (0.0900) | 1490.900 (1307.73) | 0.0496 (0.013) | 0 (0) | 0 (0) | 0.8683 (0.041) |

IPCW-pcorr | 0.9972 (0.0006) | 0.6541 (0.0868) | 0.5501 (0.0728) | 0.9983 (0.0003) | 0.0017 (0.0003) | 0.3459 (0.0868) | 1506.345 (1300.66) | 0.0488 (0.0125) | 0 (0) | 0 (0) | 0.8696 (0.0393) | |

CQPCorr | 0.9967 (0.0006) | 0.5769 (0.0966) | 0.4858 (0.0814) | 0.9980 (0.0003) | 0.0020 (0.0003) | 0.4231 (0.0966) | 1814.535 (1379.89) | 0.0606 (0.0131) | 0 (0) | 0 (0) | 0.8457 (0.0449) | |

COX | 0.9962 (0.0006) | 0.4950 (0.0957) | 0.4152 (0.0799) | 0.9978 (0.0003) | 0.0022 (0.0003) | 0.5050 (0.0957) | 2728.400 (1418.65) | 0.0712 (0.0171) | 0 (0) | 0.0001 (0.0008) | 0.8169 (0.0471) | |

AFT | 0.9959 (0.0024) | 0.4997 (0.1165) | 0.4119 (0.1038) | 0.9975 (0.0024) | 0.0025 (0.0024) | 0.5003 (0.1165) | 2923.145 (1373.36) | 0.1277 (0.6735) | 0.0096 (0.0775) | 0.0117 (0.0812) | 0.8112 (0.0662) | |

2 | IPCW-tau | 0.9966 (0.0007) | 0.5584 (0.1035) | 0.4701 (0.0870) | 0.9980 (0.0003) | 0.0020 (0.0003) | 0.4416 (0.1035) | 2219.715 (1467.74) | 0.1240 (0.0139) | 0.0069 (0.0693) | 0.0116 (0.0514) | 0.6998 (0.0512) |

IPCW-pcorr | 0.9966 (0.0006) | 0.5603 (0.1003) | 0.4717 (0.0843) | 0.9980 (0.0003) | 0.0020 (0.0003) | 0.4397 (0.1003) | 2265.580 (1492.43) | 0.1247 (0.015) | 0.0047 (0.0377) | 0.0119 (0.0485) | 0.7010 (0.0506) | |

CQPCorr | 0.9962 (0.0007) | 0.4981 (0.1059) | 0.4195 (0.0892) | 0.9978 (0.0003) | 0.0022 (0.0003) | 0.5019 (0.1059) | 2325.780 (1404.27) | 0.1267 (0.0149) | 0.0061 (0.0646) | 0.0113 (0.0503) | 0.6968 (0.0479) | |

COX | 0.9955 (0.0007) | 0.3859 (0.1001) | 0.3247 (0.0847) | 0.9974 (0.0003) | 0.0026 (0.0003) | 0.6141 (0.1001) | 3383.00 (1176.03) | 0.1416 (0.0178) | 0.0139 (0.0461) | 0.0318 (0.0816) | 0.6659 (0.0505) | |

AFT | 0.9947 (0.0008) | 0.2666 (0.1115) | 0.2233 (0.0942) | 0.9970 (0.0005) | 0.0030 (0.0005) | 0.7334 (0.1115) | 3765.250 (955.26) | 0.1467 (0.0158) | 0.0570 (0.1472) | 0.1101 (0.2090) | 0.6282 (0.0526) | |

3 | IPCW-tau | 0.9961 (0.0007) | 0.4909 (0.1095) | 0.4133 (0.0921) | 0.9978 (0.0004) | 0.0022 (0.0004) | 0.5091 (0.1095) | 2567.115 (1391.50) | 0.1674 (0.0287) | 0.0503 (0.1478) | 0.0661 (0.1640) | 0.6459 (0.0530) |

IPCW-pcorr | 0.9961 (0.0007) | 0.4897 (0.1105) | 0.4123 (0.093) | 0.9978 (0.0004) | 0.0022 (0.0004) | 0.5103 (0.1105) | 2600.685 (1379.32) | 0.1684 (0.0281) | 0.0456 (0.1287) | 0.0682 (0.1650) | 0.6472 (0.0530) | |

CQPCorr | 0.9958 (0.0007) | 0.4409 (0.1135) | 0.3712 (0.0955) | 0.9976 (0.0004) | 0.0024 (0.0004) | 0.5591 (0.1135) | 2567.785 (1298.51) | 0.1620 (0.0288) | 0.0488 (0.1378) | 0.0645 (0.1636) | 0.6461 (0.0531) | |

COX | 0.9951 (0.0007) | 0.3228 (0.1061) | 0.2718 (0.0894) | 0.9972 (0.0003) | 0.0028 (0.0003) | 0.6772 (0.1061) | 3514.235 (1103.94) | 0.1833 (0.0331) | 0.1012 (0.2113) | 0.1433 (0.2262) | 0.6143 (0.0552) | |

AFT | 0.9940 (0.0007) | 0.1609 (0.1079) | 0.1355 (0.0909) | 0.9967 (0.0003) | 0.0033 (0.0003) | 0.8391 (0.1079) | 4025.405 (816.489) | 0.1897 (0.0307) | 0.2303 (0.2805) | 0.3021 (0.3110) | 0.5756 (0.0534) | |

In each cell, mean (SD) is based on 200 replicates. |

The numerical results are summarized in **Supplementary Tables 2–6** for
scenarios C1 to C5 with a censoring rate of 60%. In each cell, mean (SD) values
are based on 200 replicates. We observe that the performances of the CQPCorr
approach are always better than the alternative approaches in all evaluation
metrics for scenarios C2, C3 and C4, and the IPCW Kendall’s partial or IPCW-tau
correlation approach performs better than others for scenarios C1 and C5. In
addition, we observe the marginal Cox and marginal AFT approaches have smaller
variability compared to the IPCW Kendall’s partial correlation and CQPCorr
approaches.

Table 6 presents simulation findings in tabular form based on censoring rates (30% and 60%) and five parameter scenarios to give readers a better understanding of how these approaches stack up.

**Simulation findings in tabular form based on censoring rates (30% and 60%) and five parameter scenarios.**

Censoring rate | C1 setting | C2 setting | C3 setting | C4 setting | C5 setting |

30% | IPCW-pcorr | IPCW-pcorr | IPCW-pcorr | IPCW-pcorr | IPCW-pcorr |

60% | IPCW-tau | CQPCorr | CQPCorr | CQPCorr | IPCW-pcorr |

After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA ESCA data with 368 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%.

The top 1000 genes with the smallest *p*-values based on the marginal
(univariate) COX model are selected for downstream analysis, since the number of
cancer-related genes is expected to be limited. The seven clinical variables
whose E effects are analyzed include age, gender, esophageal tumor central
location, person neoplasm cancer status, race, BMI and AJCC pathologic stage, and
their summary information is reported in the **Supplementary Table 7**, with its
source derived from Wang and Chen [13]. Some of the
clinical variables contain missing values,
and we use the sparse boosting method [14] in the R package “GEInter” to
perform multiple imputation for the missing values in the clinical variables.

We take fifty random splits of the whole data into 258:110 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA ESCA data. Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7 we see that the performance of the IPCW Kendall’s partial correlation approach is better than the marginal Cox, marginal AFT and CQPCorr methods.

**Proportions of significant LR-test (LR-test**$$ 0.05) over 50 random splits for different interaction screening methods under
TCGA data.

*p*-valueTCGA data | IPCW-tau | IPCW-pcorr | CQPCorr | COX | AFT |

ESCA | 0.96 | 1 | 1 | 0.84 | 1 |

PAAD | 0.34 | 0.46 | 0.38 | 0.42 | 0.26 |

LUAD | 0.5 | 0.58 | 0.46 | 0.32 | 0.3 |

In addition, we apply the proposed IPCW Kendall’s partial correlation approach
for whole data to identify several important G-E interaction biomarkers and
estimate the corresponding parameters by AFT regression model. Table 8 provides
the list of selected associated predictors with their correspondence weights,
with the “*” notation meaning statistical significance (*p*-value *GADD45B*) has been shown to be related to ESCA
[15] while Weygant, *et al*. [16] indicated that the* gender* factor can be used as
prognosis biomarker for ESCA. In addition, we found the *GADD45B-gender*
interaction biomarker to be significant from Table 8, and therefore we consider
the *GADD45B-gender* biomarker to be a potential prognostic biomarker for
ESCA, as the major gene *GADD45B* and *gender* factor have been
documented to be associated with ESCA.

**Selected G-E interaction biomarkers with their corresponding estimates (**

*p*-values) by IPCW Kendall’s partial correlation approach for the whole TCGA ESCA data.Gene | Person neoplasm cancer status | Gender |

C16orf87 |
–0.0829 (0.9182) | |

C22orf29 |
–0.3488 (0.2335) | |

C8orf58 |
–0.6543 (0.0054)* | 1.6584 (0.0265)* |

DLL1 |
0.1923 (0.5715) | |

DUSP1 |
0.0855 (0.7145) | |

EIF2S3 |
0.7882 (0.0254)* | |

ELAC1 |
–0.7617 (0.0044)* | |

ELK1 |
–0.8351 (0.0470)* | |

EPHB4 |
–0.5758 (0.0045)* | |

EPO |
0.1598 (0.3890) | |

GADD45B |
–2.1710 (0)* | |

IL8 |
–0.0228 (0.7388) | |

JUN |
0.2225 (0.6218) | |

KDM4D |
–0.2091 (0.2257) | |

LZTS2 |
1.0687 (0.1153) | |

MAPK7 |
–0.9098 (0.3922) | |

NFIX |
0.8423 (0)* | |

NT5C3L |
–0.3800 (0.0026)* | |

PATL1 |
1.2257 (0.0014)* | |

TAB1 |
0.9494 (0.1037) | |

TRIB2 |
0.1155 (0.4378) | |

TRMT2A |
–1.0338 (0.0319)* | |

TXLNA |
0.9600 (0.0223)* | |

UBE2J2 |
–1.2922 (0.1182) | |

WDR24 |
1.5244 (0.0002)* | |

ZC3H7B |
–1.9506 (0.0004)* | |

ZDHHC8 |
–0.2015 (0.4832) | |

ZNF74 |
1.01682 (0.0003)* | –2.2506 (0.0361)* |

ZNRF3 |
–0.1136 (0.5441) |

After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA PAAD data with 170 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%.

The top 1000 genes with the smallest *p*-values based on the marginal
(univariate) COX model are selected for downstream analysis, since the number of
cancer-related genes is expected to be limited. The seven clinical variables
whose E effects are ethnicity, race, lymph node examined count, maximum tumor
dimension, anatomic neoplasm subdivision, gender and age, and their summary
information is reported in **Supplementary Table 8**, with its source being from Wang, *et al*.
[13]. Some of the clinical variables contain missing values, and we use the
sparse boosting method [14] in the R package “GEInter” to perform multiple
imputation for the missing values in the clinical variables.

We take fifty random splits of the whole data into 119:51 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA PAAD data. Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7, we see that the performance of the IPCW Kendall’s partial correlation approach is better than the marginal Cox, marginal AFT and CQPCorr methods.

In addition, we apply the proposed IPCW Kendall’s partial correlation approach
for whole data to identify several important G-E interaction biomarkers and
estimate the corresponding parameters by AFT regression model. Table 9 provides
the list of selected associated predictors with their correspondence weights,
with the “*” notation meaning statistical significance (*p*-value *NRSN2* and *TRIM59*) have been shown
to be related to PAAD [17, 18], while Chakladar, *et al*. [19] indicated that the gender factor can
be used as prognosis biomarker for PAAD. In addition, we found the
*TRIM59-gender* interaction biomarker to be significant from Table 9, and
therefore we consider the *TRIM59-gender* biomarker to be a potential
prognostic biomarker for PAAD, as the major gene *TRIM59* and
*gender* factor have been documented to be associated with PAAD.

**Selected G-E interaction biomarkers with their corresponding estimates (**

*p*-values) by IPCW Kendall’s partial correlation approach for the whole TCGA PAAD data.Gene | AJCC pathologic nodes | Gender |

BCAS3 |
–0.3774 (0.6215) | |

COQ10A |
–0.5120 (0.2800) | |

FAM110B |
0.3987 (0.3284) | |

KCTD2 |
0.0676 (0.9442) | |

LMBR1L |
1.0804 (0.1328) | |

LRRC37A3 |
0.0146 (0.9706) | |

NRSN2 |
–3.3384 (0.0013)* | |

PPFIBP1 |
0.1262 (0.6605) | |

RYK |
0.7033 (0.1168) | |

SLC22A17 |
0.3416 (0.5455) | |

TCEAL3 |
3.8602 (0.0034)* | |

TMEM43 |
–0.6927 (0.0955) | |

TRIM59 |
–0.6336 (0.0101)* | |

TTC21A |
–1.524 (0.0115)* | |

ZNF324 |
1.1587 (0.2190) | |

ZNF446 |
–2.5858 (0.0008)* | |

ZNF547 |
1.5870 (0.0258)* |

After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA LUAD data with 505 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 64%.

The top 1000 genes with the smallest *p*-values based on the marginal
(univariate) COX model are selected for downstream analysis, since the number of
cancer-related genes is expected to be limited. The eight clinical variables
whose E effects are analyzed include age at initial pathologic diagnosis, number
pack years smoked, AJCC pathologic metastasis, AJCC pathologic nodes, AJCC
pathologic stage, race, gender and AJCC pathologic tumor, with their summary
information reported in **Supplementary Table 9**, sourced from Wang, *et al*. [13]. Some of the
clinical variables contain missing values, and we use the sparse boosting method
[14] in the R package “GEInter” to perform multiple imputation for the missing
values in the clinical variables.

We take fifty random splits of the whole data into 354:151 training/test sets of the data to evaluate the performance of all methods for survival prediction via the significant proportion of the LR-test in the TCGA LUAD data. Higher significant proportion of LR-test corresponds to better prediction accuracy. From Table 7, we see that the performance of the IPCW Kendall’s partial correlation approach is better than the marginal Cox, marginal AFT and CQPCorr methods.

In addition, we apply the proposed IPCW Kendall’s partial correlation approach
for whole data to identify several important G-E interaction biomarkers and
estimate the corresponding parameters by AFT regression model. Table 10 provides
the list of selected associated predictors with their correspondence weights,
with the “*” notation meaning statistical significance (*p*-value *ATP13A4*, *GPR116*,
*HABP2*, *MBIP, SFTA3*, and *ZSCAN16*) have been shown to be
related to LUAD [20, 21, 22, 23, 24].

**Selected G-E interaction biomarkers with their corresponding estimates (**

*p*-values) by IPCW Kendall’s partial correlation approach for the whole TCGA LUAD data.Gene | AJCC pathologic nodes | AJCC pathologic stage | Race |

ATP13A4 |
1.8356 (0.0056)* | ||

B3GNT8 |
0.7140 (0.0088)* | ||

C16orf89 |
–0.2306 (0.2214) | ||

C17orf44 |
–2.1237 (0.0114)* | ||

CD302 |
0.3977 (0.1098) | ||

ENPP4 |
–0.0032 (0.9809) | ||

FUCA1 |
–0.4834 (0.3353) | ||

FUT1 |
0.2615 (0.0440)* | ||

GALNT11 |
2.6784 (0.0037)* | ||

GNA14 |
0.7127 (0.0684) | ||

GPC4 |
–3.1234 (0.0036)* | ||

GPD1L |
–0.2679 (0.4329) | ||

GPR116 |
0.9095 (0.0152)* | ||

HABP2 |
–1.2867 (0.0063)* | ||

HLF |
–0.3133 (0.2451) | ||

LDB1 |
–0.1730 (0.4393) | ||

LGR4 |
–0.0776 (0.8053) | ||

LIFR |
–0.1512 (0.3790) | 0.0101 (0.9462) | |

LMO3 |
0.2531 (0.1666) | ||

MBIP |
–1.4990 (0.0005)* | ||

MYLIP |
–0.2072 (0.2117) | ||

NAPSA |
–0.5345 (0.0644) | ||

ORMDL3 |
–1.2296 (0.0306)* | ||

PDIK1L |
–1.2479 (0.0646) | ||

PLA2G4F |
0.0321 (0.7772) | ||

PNMA2 |
1.7741 (0.0150)* | ||

SEMA4A |
1.9070 (0.0026)* | ||

SFTA3 |
0.5666 (0.0294)* | ||

SLC34A2 |
–0.0494 (0.8077) | ||

SNX30 |
1.6043 (0.0022)* | ||

TMEM170B |
0.7158 (0.0049)* | –0.0254 (0.8792) | |

TMEM57 |
0.0511 (0.9395) | ||

TRIM24 |
–0.0804 (0.7317) | 0.0944 (0.6107) | |

UNC13B |
–0.1529 (0.2606) | ||

VWA2 |
0.5239 (0.0641) | ||

ZNF750 |
–1.4333 (0.0055)* | ||

ZSCAN16 |
0.3507 (0.0348)* |

In our work, we have several motivations to adopt the sparse boosting method to
impute the missing clinical variables. The first is that the sparse boosting
method [14] was proposed initially to assign TCGA clinical data, and we also use
TCGA clinical data in our actual data application. The second is that Wu, *et al*. [11]
provides the friendly R package “GEInter” to perform the sparse boosting
method. Therefore, we adopt the sparse boosting method to impute the missing
clinical variables. We agree with the comment that using the different impute
techniques to process missing values will lead to different numerical results, it
deserves further study and will be investigated in our future work.

In the final step of our analysis, we utilized the maximum likelihood estimation approach to estimate the corresponding parameters of the selected candidate biomarkers. Park and Ha [25] performed variable selection and parameter estimation procedures for fixed effects in parametric AFT models using penalized likelihood procedures. We agree with the comment that the sure independence screening method [12] is just a screening process, and penalized regression should be applied to reduce the irrelevant predictors for the prediction model. Although we found some useful references about penalized AFT model, we did not find suitable software or code to perform the regularization methods. Implementing regularization methods in our analysis is worthy of further research and will be studied in our future work.

In real data application, we adopt the hard thresholding rule proposed by Fan and Lv [12]
to select the candidate set of G-E interaction biomarkers; that is, after ranking
the G-E interactions biomarkers using some correlation measure, we select a
prefixed number of top-ranked G-E interaction predictors as our candidate model.
Several alternative strategies for thresholding rule can be considered such as
the soft thresholding rule proposed by Zhu, *et al*. [26], a method based on the control of the
false-positive rate or false discovery rate by Zhao and Li [27], and a method based on
multiple testing procedure by Song, *et al*. [28]. Furthermore, we assumed that the number of
cancer-related biomarkers would be limited, so we selected the top 1000 G-E
interaction biomarkers with the smallest *p*-values for downstream
analysis based on the marginal COX model. Xu, *et al*. [5] and Wu and Ma [14] and so on have a similar
screening process for downstream analysis. Different candidate models choices
lead to different survival prediction models. How to define the number of
cancer-related biomarkers for downstream analysis is a critical and interesting
open question?

Voorman, *et al*. [29] and Ueki, *et al*. [30] give deep insight into genome-wide environment interaction studies
(GWEIS), in which gene-environment interaction analysis is common known to be
susceptible to model specification. Problematic behavior may occur due to
insufficient specification of the null model for models with no genetic effects.
In the framework of marginal analysis, Kendall’s partial correlation method is a
pure correlation used to reflect the relationship of the survival trait with the
G-E interaction biomarker conditional on the main effects. The proposed measure
can mitigate errors due to unspecified null models. However, for simplicity, we
specifically focus on two-way pairwise G-E interactions. Although the same idea
of the proposed IPCW Kendall’s partial correlation method might apply to the
problem of evaluating higher-order interactions, the associated computational
complexity seems challenging and will be investigated in detail in our future
work.

In this article, we extend the non-parametric IPCW Kendall’s partial correlation [8] approach to G-E interactions to measure the association of a right-censored survival trait with G-E interactions biomarkers, and the associated Kendall’s partial correlation to reflect the relationship of the survival trait with G-E interactions predictors conditional on the genetic and environmental effects. We agree with the comment that there are no methodological advances of this paper. However, the paper provides a useful contribution to real genomic data applications. In simulations (medium censoring level) and real data applications, we show that the proposed IPCW Kendall’s partial correlation method can provide substantially more powerful and accurate predictor selection, and can lead to more accurate survival prediction than alternative methods (marginal COX, marginal AFT and CQPCorr).

In the real data analysis for informative G-E interaction biomarker selection, we first performed a G-E interaction screening procedure on the entire data based on hard threshold rules using the non-parametric Kendall’s partial correlation method, then incorporated all environmental factors and some genes selected based on the hierarchical principle into the survival prediction model, we then estimated the corresponding parameters of the selected candidate biomarkers using a maximum-likelihood estimation approach based on the AFT regression model. To this end, we not only defined survival prediction models, but also provided their corresponding weights for selected biomarkers, which have implications for clinical significance.

R codes for the simulation studies and real data are available at the
*figshare* website
https://doi.org/10.6084/m9.figshare.19306967.v3.
The TCGA ESCA, PAAD and LUAD genomic data with survival traits analyzed during
this study are all available at the figshare website
https://doi.org/10.6084/m9.figshare.16816654.v5.

G-E, gene-environment; AFT, accelerated failure time; TCGA, the cancer genome atlas; ESCA, esophageal carcinoma; LUAD, lung adenocarcinoma; PAAD, pancreatic adenocarcinoma; CQPCorr, censored quantile partial correlation; IPCW, inverse probability-of-censoring weighted; IPCW-pcorr, IPCW Kendall’s partial correlation approach; IPCW-tau, IPCW Kendall’s correlation approach; PI, prognosis index; LR, log-rank; SD, standard deviation; GWEIS, genome-wide environment interaction studies.

J-HW conceived and designed the research study. J-HW and C-TY performed the research, analyzed the data, and prepared figures and/or tables. J-HW wrote the manuscript. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript.

Not applicable.

We thank the reviewers for their valuable comments and suggestions that helped us improve the paper.

This research was supported by the grant MOST 110-2118-M-035-001-MY2 from the Ministry of Science and Technology of Republic of China (Taiwan). The funding body did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

The authors declare no conflict of interest.

**Publisher’s Note:** IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Publisher’s Note:**IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.