Characterization of DNA Damage Repair Related Signature and Molecular Feature in Low-Grade Gliomas to Aid Chemotherapy and Drug Discovery

Background : DNA damage repair (DDR) related genes are associated with the development, progression, aggressiveness, and heterogeneity of low-grade gliomas (LGG). However, the precise role of DDR in LGG prognosis and molecular subtypes remains to be elucidated. Methods : We analyzed 477 and 594 LGG samples from the Cancer Genome Atlas (TCGA) and Chinese Glioma Genome Atlas (CGGA) to develop a prognostic model using the random forest algorithm and Cox regression. Independent prognostic factors were incorporated into a nomogram, and its performance was assessed using receiver operating characteristic and calibration curves. We also used Connectivity Map analysis to identify potential small molecule drugs targeting DDR. Molecular subtypes based on DDR were identified by consensus cluster analysis, and the clinical characteristics, mutation landscape, immune tumor microenvironment, and drug sensitivity of patients with different subtypes in the TCGA and CGGA datasets were further compared. The Boruta algorithm was used to select features from the differentially expressed genes between clusters to generate DDR scores. Results were further validated in the Glioma Longitudinal AnalySiS consortium dataset. Statistical analysis and tests were implemented using R software version 4.0.2. Results : We developed a prognostic model containing six DDR-related genes, which served as a potential independent prognostic indicator in LGG across three datasets. The area under the curve (AUC) values for 1, 3-, and 5-year survival in the TCGA dataset were 0.901, 0.832, and 0.771, respectively. The nomogram demonstrated high accuracy in predicting 1, 3-, and 5-year survival, with AUC values greater than 0.8. Additionally, we identified and validated two molecular subtypes based on DDR genes. These subtypes exhibited significant differences in somatic mutations, clinical prognosis, and immune cell infiltration. One subtype showed higher immune and stromal scores, worse prognosis, and increased sensitivity to common chemotherapeutic agents. Finally, we established a DDR score which served as another promising prognostic predictor for LGG. Conclusions : The prognostic model and molecular subtypes based on DDR genes can help in more detailed classification and provide insights for personalized management of LGG and clinical drug development.


Introduction
Low-grade gliomas (LGG) are WHO grade II and III gliomas and include oligodendroglioma and anaplastic astrocytoma [1].Currently, even with the standard treatment strategies, such as maximum safe resection, radiotherapy and chemotherapy, the mean survival varies significantly from 2 to 10 years [2,3].Histopathological and clinical results indicate that high heterogeneity and aggressiveness contribute to the great variability in survival of LGG [4].In fact, residual tumor cells may recur and develop malignant progression soon after surgery, resulting in resistance to treatment and making it more difficult [5].Based on the above characteristics of LGG, it is crucial to identify biomarkers that may affect the prognosis of LGG and improve the risk stratification of LGG.
Accumulated evidences revealed that molecular diagnosis is more important than tumor grade in the classification and prognosis of LGG [6][7][8].Some biomark-ers, including isocitrate dehydrogenase 1 (IDH1) mutation, 1p/19q combined deletion (1p/19q co-deletion) and O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation, revealed the histological characteristics of LGG and could be used to guide LGG treatment regimen [7].However, in the genetically heterogeneous population, the heterogeneity of LGG determined the low predictive value of these widely used biomarkers, which can neither fully elucidate individual variation nor correctly addressing the problem of LGG risk stratification.With the advanced molecular platforms, this enables the development of new prognostic signatures, accurately identify LGG novel categorized methods and stratify risk based on molecular profiles [9].
In order to maintain genome integrity and stability, cells develop DNA damage repair (DDR) mechanisms that identify and repair damage to the DNA molecules encoded by the genome.Previous studies have shown that DDR, as a widely concerned hallmark in tumor biology, plays a crucial role in tumor genesis, progression and therapeutic response [10][11][12][13].Defects in DDR prevent cells from properly repairing damaged DNA, leading to the accumulation of genetic changes in normal cells that turn into cancer cells [14].Consequently, cancer cells generally exhibit a decreased ability to repair DNA compared with normal cells.As a result, DNA replication stress accumulates and DNA damage is superimposed due to dependence on a subset of other repair pathways.Studies have reported the abnormal expression of DNA repair genes contributed to tumorigenesis [15].Indeed, cells lacking DDR are often associated with sensitivities to DNA damaging agents, so DDR defects associated with cancer may also lead to vulnerabilities that can be exploited therapeutically [16].This evidence highlights the importance of DDR regulatory molecules in targeted anticancer therapy [17].The current findings suggest that DDR may be the underlying mechanism of gliomagenesis, rather than a random secondary event [18].Alterations in DDR pathways have been considered to play essential roles in many aspects of glioma biology such as genesis, progression, therapy resistance, and recurrence [19][20][21][22].Gliomas usually rely on intrinsic DNA repair pathways to handle the adverse effects of DNA damage such as single-and doublestrand DNA breaks caused by radiotherapy and chemotherapy.MGMT, for instance, repairs the cytotoxic DNA damage caused by temozolomide (TMZ), which is considered to be the common mechanism of chemotherapy resistance in glioma [20,22].Based on the above knowledge of DDR, a recent glioma treatment strategy is to induce overwhelming DNA damage and inhibit glioma repair mechanisms, and cause fatal cytotoxicity to gliomas, resulting in cell cycle arrest and reduced drug resistance [21,22].In addition, targeting DDR with novel agents or therapeutic combinations provides a promising adjuvant therapy for improving the clinical prognosis of glioma [21,23].Furthermore, a DDR-related signature had been identified in glioblastoma to predict prognosis [24].Therefore, the inclusion of DDRrelated genes in the study of risk stratification, subtypes, prognosis and treatment response of LGG is promising to provide a new perspective for the clinical treatment of patients.
Based on the Cancer Genome Atlas (TCGA, training set, n = 477) and Chinese Glioma Genome Atlas (CGGA, validation set, n = 594) RNA sequencing datasets, Cox regression and random forest analysis were performed to identify a DDR-related signature and to construct a nomogram for survival prediction of LGG patients.We also demonstrated the underlying mechanism of risk characteristics and screened small molecule drugs targeting for DDR.Moreover, we systematically analyzed DDR-related genes in LGG, and classified LGG subtypes according to DDRrelated genes.Mutations of each molecular subtype were further presented, and the relationship between the molecular subtype and the prognosis, immune tumor microen-vironment (TME), chemotherapy and clinical characteristics of patients were further explored.Finally, the DDR score which could be served as an independent and effective prognostic factor was constructed.The results of the analyses were validated in an additional validation dataset-Glioma Longitudinal AnalySiS (GLASS) [25] consortium dataset (https://www.glass-consortium.org).Robust prognostic model, DDR-score system and molecular subtype based on DDR-related genes will improve risk stratification in LGG patients and provide more accurate personalized clinical management assessments.

Data Acquisition
mRNA expression data with clinical information of 506 and 596 LGG patients were collected from TCGA (training dataset) and CGGA (validation dataset), respectively.The corresponding clinical data include: gender, age, histology, subtype, IDH1 mutational status, MGMT promoter status, 1p/19q co-deletion status and survival information.Somatic mutation data were obtained from the TCGA portal.Considering that the GLASS dataset (https: //www.glass-consortium.org)contained only the minimal clinical information (n = 330) [25], the GLASS dataset only served as an additional validation dataset to validate a portion of results.A total of 150 genes (Supplementary Table 1) associated with DDR were retrieved from Molecular Signature Database V7.0 (MSigDB) [26] (http://www.broad.mit.edu/gsea/msigdb/,HALLMARK_DNA_REPAIR).

Data Pre-Processing
The mRNA expression data for samples were preprocessed according to the following steps.Samples with missing clinical information and those with an overall survival (OS) of less than 30 days were excluded, and 477 LGG samples from TCGA were eventually included in this study for follow-up analysis (Supplementary Table 1).In addition, the GLASS dataset was used as an independent validation dataset, in which samples from TCGA were excluded.
The DDR-related genes shared among the TCGA, CGGA, and GLASS datasets were retained for further analysis.Consequently, the final dataset comprised expression profiles of these common DDR-related genes derived from the aforementioned three datasets.
The CGGA mRNA sequencing data included both mRNAseq_693 (Illumina HiSeq platform) and mR-NAseq_325 (Illumina HiSeq 2000 and 2500 platforms).Therefore, the batch effects between these two datasets were removed using the "sva" R package [27] and a total of 594 LGG patients in CGGA were retrieved (Supplementary Table 1).

Gene Signature Identification and Risk Score Construction
In order to identify gene signature and construct risk score, the R package (http://cran.r-project.org)"survival" and "survminer" were used to perform univariate Cox proportional hazard regression on 477 LGGs in TCGA dataset to select DDR-related genes that are highly correlated with prognosis (p < 0.05 was selected as the threshold).Then "randomForest" R package was performed to ranked the importance of the selected DDR-related genes related to the prognosis by Random Survival Forest algorithm (nrep and nstep respectively represent the number of Monte Carlo simulation iterations and the number of steps forward, with values of 100 and 5).Genes with the relatively importance value >0.2 were subsequently incorporated into the multivariate stepwise cox regression analysis to established a prognostic signature.In three datasets, the Cox regression coefficient and prognostic gene expression were used to calculate the risk score for each patient.Finally, the risk model containing 6 genes was established, and the final formula was as follows: Risk score= ∑ N i=1 (βi × Expi), where βi represented the coefficient of gene i, and Expi represented the expression value of gene i.
According to the risk score formula, samples from the training and validation datasets were divided into highrisk and low-risk groups based on the median risk scores.Kaplan-Meier (KM) curves were used to estimate the survival divergence between different risk groups.To determine the independent predictive power of each variable, we conducted multivariate and univariate regression analyses using R "survival" package and visualized by the "forestplot" package in R. Additionally, we retrieved the gene protein expression level from the Human Protein Atlas (HPA) database to evaluate the expression of prognostic genes.

Construction and Validation of a Nomogram Model
A nomogram is a graphical representation of a mathematical model that estimates the outcome or probability of an event based on multiple variables or inputs, serving as a visual tool to simplify complex calculations.Nomograms facilitate clinicians' decision-making processes, enabling them to tailor treatment strategies for individual patients according to their unique characteristics and risk factors.Univariate and multivariate Cox regression were conducted to identify the independent prognostic risk factors for nomogram construction.Using the "rms" packages, calibration curves were drawn to assess the OS probability for LGG patients at 1-, 3-, and 5-years.In addition, nomogram accuracy was also determined by concordance index (C index).

Gene Set Enrichment Analysis (GSEA)
Gene set enrichment analysis (GSEA) was performed using the GSEA V3 software (Broad Institute, Inc., Mas-sachusetts Institute of Technology, and Regents of the University of California) [28] to identify gene sets that were statistically different between different risk groups in LGG samples of TCGA and CGGA mRNA expression profiles.GSEA V3 software was used to select the MSigDB "c2.cp.kegg.v7.1.symbols.gmt"gene set as a reference to screen out significantly enriched pathways with NOM pvalue and false discovery rate (FDR) less than 0.05 and 0.25, respectively.

Connectivity Map (CMap) Analysis
We consulted the recently updated Connectivity Map (CMap, http://cmap.topcoder.com/)to screen for potential small molecule compounds targeting DDR-related genes in the hope of achieving new breakthroughs in LGG therapy."limma" R package [29] was applied to identify differential expression genes (DEGs) with FDR <0.05 and |Fold change (FC)| ≥2 between different risk score groups.The up-and down-regulated genes were uploaded to the CMap (https://clue.io)database [30].Small molecule compounds with an absolute value of enrichment greater than or equal to 0.7 and p < 0.05 were considered as candidates.

Consensus Clustering
"ConsensusClusterPlus" [31] R package was performed for consensus clustering of the prognostic DDRrelated genes.Similarity distance among samples was calculated using Euclidean distance and clustering was conducted using K-means.Resampling analysis was conducted to estimate the optimal cluster number based on cumulative distribution function (CDF) and consensus matrix.And the clustering results determined by the above methods were also validated in GLASS dataset.The principal component analysis (PCA) was applied to evaluate the distribution of the molecular subtypes in TCGA and GLASS datasets via the "princomp" package.The differences in OS among different clusters in TCGA and GLASS datasets were present using KM curves.We further estimated single nucleotide polymorphism (SNP) distribution of each cluster.The somatic mutation profiles of LGG were firstly obtained from TCGA data portal and were divided into distinct clusters for analysis according to consensus clustering.The somatic mutation data of each cluster were summarized, analyzed, and visualized based on the Mutation Annotation format (MAF) file using the "maftool" [32] R package, and the driver genes were identified.The copy number alterations (CNA) data in each cluster were also generated by GIS-TIC2.0from GDAC Firehose (https://gdac.broadinstitute.org) [33].

Estimation of STromal and Immune Cells in MAlignant Tumour Tissues using Expression Data (ESTIMATE) and ssGSEA
782 marker genes for predicting the abundance of 28 tumor-infiltrating immune cells (TIICs) were obtained from Broad Institute and Bindea et al. [34].According to the characteristics of expressed genes, gene set variation analysis (GSVA) software "gsva" [35] R package was used for single-sample gene set enrichment analysis (ssGSEA), and the enrichment scores of 28 different immune cell types in each LGG sample were quantitatively analyzed.The levels and functions of immune infiltration were compared using Kruskal-Wallis test, and heatmaps and boxplots were used to visualize the results.In addition, the immune score and matrix score were calculated through the ESTIMATE algorithm [36].

Chemotherapy Drugs Response Prediction
The chemotherapy response of LGG samples in TCGA and GLASS were predicted using the Genomics of Drug Sensitivity in Cancer (GDSC) database [37].GDSC is a large-scale research project that aims to identify genomic biomarkers associated with the response of cancer cells to various anti-cancer drugs.We assessed the sensitivity of eight chemotherapy drugs (Cisplatin, Erlotinib, Methotrexate, Vincristine, Carmustine, TMZ, Rapamycin and Doxorubicin) in gliomas.By using the "pRRophetic" [38] R package we estimated LGG's half maximal inhibitory concentration (IC50) by ridge regression, and 10-fold crossvalidation was performed to determine the accuracy.

Construction and Validation of DDR Score
To construct a DDR-related score, the "limma" [29] R package was applied to filter DEGs among DDR clusters (adjusted p < 0.05 and |log2FC| >1).Boruta algorithm was used for feature selection based on DEGs, and then feature genes were analyzed by univariate Cox regression.The Boruta algorithm is a feature selection method used in machine learning and data analysis to identify the most rele-vant and informative features in a dataset.It is based on the Random Forest algorithm, a powerful ensemble learning technique that constructs multiple decision trees to make predictions or classifications.Further analysis was performed on the genes with the significant prognosis.We then performed PCA to establish DDR related gene signature.Both the principal component 1 and 2 were selected to serve as signature scores.We then define the DDR score similar to genomic grade index [39,40]: DDR score = Ʃ (PC1i + PC2i), where i represent the expression of the prognostic genes.We subsequently validated the prognostic value of DDR score in the disease special survival (DSS), progression free survival (PFS) and GLASS dataset, respectively.In addition, we systematically searched the gene expression profiles of DDR therapy, and a advanced urothelial cancer with intervention of atezolizumab (IMvigor210 cohort) was included in our analysis [41].

Statistical Analysis
Statistical analysis and tests were implemented through R software version 4.0.2(R Foundation for Statistical Computing, Vienna, Austria).One-way ANOVA with Tukey's test (q-value 0.05, |log2FC| >1) was implemented to determine DEGs between different risk groups or molecular subgroups by employing the "limma" R package.Wilcoxon rank sum test was applied for continuous data of two groups, while Kruskal-Wallis test was used for continuous data of more than two groups.And the Chi-Squared Test was used to the categorical data.The optimal cut-off point of each subgroup in each dataset was determined by the "surv_cutpoint" function in the "survminer" R package.Multivariate and univariate Cox regression analyses were utilized to assess association between OS and

The Prognostic Risk Model Based on DDR-Related Genes
An LGG prognostic risk model was constructed based on DDR gene in training set.Firstly, the relationship between 150 DDR-related genes and survival was analyzed by univariate Cox regression and 87 DDR-related genes that were associated with LGG survival (p < 0.05) were identified (Supplementary Table 2).Further, random forest algorithm ("randomForestSRC" R software package) was used for feature selection and the importance of survivalrelated DDR genes was ranked.Finally, the genes with relative importance greater than 0.2 were selected as the final signature (as shown in Supplementary Fig. 1A,B).We subsequently performed multivariate stepwise cox regression analysis on the seven genes identified by the random forest to established the prognostic signature, and the importance and relative importance of the Hazard ratios (HRs), coefficients, confidence intervals and out-of-bag estimates of the six genes are shown in Table 1.The gene expression levels used for calculating the Risk score were absolute values, and the unit of each gene expression level was transcripts per million (TPM).The risk score formula was as follows: Risk Score = TARBP2 ×.512574212 As shown in Fig. 1, there were more deaths in TCGA cohort with a high-risk score than low-risk score, while the heatmap (Fig. 1A) showed that changes in TARBP2, POLR3GL, POLL, STX3, POLD3 and SDCBP gene expression increased the risk value in different samples.KM curve analysis results indicated that the prognosis of patients in different risk groups was significantly distinct (p < 0.00001), and patients in the low-risk group had a better prognosis than those in the high-risk group (Fig. 1B).We also evaluated the accuracy of risk score to predict prognosis.We performed the R software package "SurvivalROC" to conduct ROC analysis on the prognostic classification of risk score and evaluate the accuracy of risk signature.The AUC value of the 1-, 3-and 5-year in TCGA dataset were 0.901, 0.832 and 0.771 (Fig. 1C).
To validate the survival risk score, we calculated the risk score of patients in CGGA and GLASS validation datasets with the same regression coefficient and repeated the above analysis, and as expected, the validation datasets also achieved consistent results (Supplementary Fig. 2A-C and Supplementary Fig. 3A,B).We also compared our signature with other previously reported risk models [42][43][44][45], the result showed that our signature had excellent AUC value when compared to the other signature, indicating that our risk model is the optimal (Supplementary Fig. 4).In addition, we also evaluated the protein expression level through the HPA database and found that TARBP2 and POLD3 were highly expressed in the tumor tissue, while POLR3GL and SDCBP showed low expression in normal tissue (Supplementary Fig. 5).

The Risk Signature Shows Strong Prognostic Efficacy
The association between risk score and major clinical characteristics in LGG was investigated.We found a significant divergence of risk score between different IDH1, MGMT promoter methylation, 1p/19q deletion, and pathological grades in TCGA and CGGA datasets.Patients with IDH1-wildtype (WT), 1p/19q non-codel, G3 grade and unmethylated MGMT promoter in LGG had significantly higher risk scores than those with IDH1-Mutant (Fig. 2C), 1p/19q codel (Fig. 2D), G2 grade (Fig. 2F) and methylated MGMT promoter (Fig. 2E) in TCGA and CGGA datasets (Supplementary Fig. 6C-F).The above results were similar to our previous studies [46].However, differences in risk scores were not consistent across age stratification and gender in both datasets.In the TCGA dataset, patients over 50 years old had significantly higher risk scores than those aged 50 years or younger (Fig. 2A), and there was no difference between male and female patients (Fig. 2B).However, in CGGA dataset, patients of different ages (Supplementary Fig. 6A) and genders (Supplementary Fig. 6B) did not have significantly different risk scores.Possibly, this is due to differences in basic clinical characteristics between the two datasets, such as bias or ethnicity.
In addition, we validated the prognostic differences between high-and low-risk group in stratified cohorts of different clinical characteristics (e.g., IDH1-WT/-Mutant, MGMT promoter methylated/unmethylated and 1p/19q codel/non-codel cohorts).Results were consistent with expectations.In the stratified cohorts of TCGA (Fig. 3) and CGGA (Supplementary Fig. 7) datasets with different clinical characteristics, except for the IDH1-WT (Fig. 3F) group of TCGA, grade G2 (Supplementary Fig. 7K) and IDH1-WT (Supplementary Fig. 7F) of CGGA, the highrisk groups showed poor prognosis.A possible reason for this due to the small sample size of the IDH1-WT group in TCGA and CGGA.
Clinical characteristics (age, gender, risk score, grade, IDH1, MGMT promoter methylated, 1p/19q codeletion) in TCGA and CGGA datasets were analyzed by conducting univariate and multivariate Cox regression.As shown in the forest plot (Fig. 4), the risk score constructed by the 6 DDRrelated genes was an independent prognostic risk factor in TCGA (p < 0.05).The independence was also validated in CGGA dataset (Supplementary Fig. 8).

Construction and Validation of the Nomogram
To further improve the accuracy of prognosis, a nomogram integrating risk score and independent clinical characteristics was constructed to visualize the prognostic value of different survival-related variables in TCGA and CGGA datasets (Fig. 5A and Supplementary Fig. 9A).In addition, to assess the performance of nomogram, a calibration curve analysis was conducted ("rms" R package) and we observed that the prediction curve was close to the ideal curve (Fig. 5B-D and Supplementary Fig. 9B-D), indicating that the nomogram had excellent prediction function.We discovered that the nomogram showed a high accuracy in the prognosis of 1-, 3-and 5-year (AUC value >0.8) (Fig. 5E and Supplementary Fig. 9E).Interestingly, the prediction accuracy of nomogram was also higher than the other clinical factors (Fig. 5F and Supplementary Fig. 9F).

The Difference of Biological Processes between Highand Low-Risk Groups
GSEA provides valuable insights into the biological processes and pathways that are potentially dysregulated in the context of the studied conditions, helping researchers understand the underlying molecular mechanisms and guiding the development of targeted therapies or interventions.GSEA analysis of different risk groups in the TCGA cohort revealed different biological processes.Significant candidate pathways were selected according to the criteria with FDR <0.25 and nominal p-value < 0.05.GSEA showed that cancer-related signaling pathways including focal adhesion, glioma, oocyte meiosis and apoptosis were significantly enriched in high-risk patients, while none of these pathways were observed in low-risk patients (Supplementary Fig. 10).

Identification of Potential Compounds Targeting DDR Prognostic Genes
In order to identify candidate compounds that may potentially influence the expression of DDR prognostic genes for the treatment of LGG, DEGs between different risk groups (191 up-regulated genes and 81 downregulated genes, "limma" R package, Supplementary Table 1) were uploaded to the CMap database.As a result, 35 compounds with 25 mechanisms of action (MoA) were identified (Fig. 6).Results indicated that multiple compounds shared the same MOA.For example, six compounds (scriptaid, ISOX, vorinostat, belinostat, THM-I-94 and HC-toxin) shared the MOA of histone deacetylase (HDAC) inhibitor; four compounds including amsacrine, irinotecan, teniposide and SN-38 shared the MoA of Topoisomerase inhibitor; another two compounds (NU-7441and KU-0060648) shared the MoA of DNA dependent protein kinase inhibitor.In addition, a total of five compounds (TW-37, butein, everolimus, NVP-TAE684 and XMD-892) shared the following five tumor-related MoAs, respectively: B cell lymphoma 2 (BCL) inhibitor, epidermal growth factor receptor (EGFR) inhibitor, mammalian target of rapamycin (MTOR) inhibitor, anaplastic lymphoma kinase (ALK) inhibitor and mitogen-activated protein (MAP) kinase inhibitor.The present research identified potential small molecule compounds that target DDR.

Association between Gene Signature and TME
We calculated the enrichment levels of 28 immunerelated cells for each risk group using ssGSEA algorithm.We found that most immune-related cells have a signif-icant divergence between different risk groups in TCGA and CGGA dataset, such as Memory B cell, Regulatory T cell, Eosinophil, Macrophage, Natural killer T cell.Interestingly, the enrichment level was significantly higher in high-risk group.(Fig. 7 and Supplementary Fig. 11).

Molecular Subtypes of LGG Based on DDR-Related Genes
Consensus clustering was conducted for clustering analysis of 477 samples and 150 DDR-related genes expression profiles in TCGA RNA-seq dataset.All LGG samples were divided into K clusters (K = 2-9) using the "Cons-esusClusterPlus" R package.Based on the CDF curve, 477 samples were divided into two clusters after a thorough consideration (Fig. 8A-C).OS significantly differed between the two clusters based on the KM curve analysis (Fig. 8D).PCA results showed that the two clusters were independent of each other (Fig. 8E).
The above analyses were repeated in GLASS dataset and the results obtained were shown in Supplementary Fig. 12. Heatmap of the two clusters (the top 100 variably expressed genes) showed that a significant divergence existed in the IDH1 status, 1p/19q co-deletion status and MGMT methylation status (Fig. 9).Furthermore, we compared the risk score differences among the two different clusters in TCGA.Interestingly, as shown in Supplementary Fig. 13A, Cluster B with the worse prognosis had the higher risk scores.Conversely, Cluster A with the better prognosis had the lower risk scores.Therefore, we identified two LGG clusters based on DDR gene expression.Clustering results showed that there was a wide range of heterogeneity in LGG, and DDR-related genes were closely associated with LGG malignancy grade and prognosis.
We further showed the mutant landscape in different clusters of LGG patients.By analyzing the somatic mutation data in TCGA dataset, we evaluated the somatic variation distribution of driver genes among LGG subtypes, to clarify whether there are differences in the frequency of somatic mutation and observe the different patterns of mutations among LGG clusters.LGG driver genes were obtained by Maftools.The analysis of TCGA mutation annotation files showed that there were obvious mutation characteristics between Cluster A and Cluster B. Among them, IDH1, TP53 and ATRX were the top three mutation genes and Cluster A corresponded to more mutation frequency than Cluster B (Supplementary Fig. 14A,B).Our cluster-specific CNAs analysis revealed that chromosome deletions and amplifications, for example, deletion of 9p21.3 were significantly enriched in Cluster B. The 9p21.3 deletions is associated with the poor survival outcome in LGG [47] (Supplementary Fig. 15).

The Landscape of Subtypes and Immune TME
The scores of different clusters were compared, as shown in Fig. 10A-C, Cluster B with the worst prognosis in TCGA dataset had the higher immune and stromal scores compared to Cluster A, while tumor purity was lower than the other cluster (ANOVA, p < 0.001).Cluster A with the better prognosis had the opposite result, with the lower immune and stromal scores and the higher tumor purity.The correlation between LGG subtypes and TME features was further explored in our present study.Results of ssGSEA showed that in LGG patients of TCGA cohort, the enrich scores of activated B cells (Ba), eosinophils and monocytes in Cluster B patients with worse prognosis were highly expressed compared to Cluster A. Fig. 10D shows the high expression level of immune cells in Cluster A.

Sensitivity Prediction of Different Clusters to Chemotherapies
Currently, surgery and chemotherapy are the most common treatments for cancer.Therefore, we sought to evaluate the differences in estimated IC50 levels between the clusters for eight chemotherapeutic agents, including Cisplatin, Erlotinib, Methotrexate, Vincristine, Carmustine, TMZ, Rapamycin and Doxorubicin.Ridge regression was used to train the prediction model on the dataset of GDSC cell lines, and 10-fold cross-validation was performed to evaluate prediction accuracy.The IC50 value of each LGG patient was then calculated according to the prediction model of the eight chemotherapeutic drugs.Results showed that except for Erlotinib (Wilcoxon rank test, p = 0.483 (Fig. 11D), the IC50 sensitivity of the seven chemotherapeutic agents was statistically different in the two clusters.Interestingly, the IC50 estimates for the seven chemotherapeutic agents in Cluster B were significantly lower than those in the other cluster, including Carmustine (Wilcoxon rank test, p = 1.927 × 10 −5 ) (Fig. 11A), Cisplatin (Wilcoxon rank test, p = 1.985 × 10 −11 ) (Fig. 11B), Doxorubicin (Wilcoxon rank test, p = 1.857 × 10 −6 ) (Fig. 11C), TMZ (Wilcoxon rank test, p = 1.157 × 10 −6 ) (Fig. 11E), Methotrexate (Wilcoxon rank test, p = 0.0492) (Fig. 11F), Vincristine (Wilcoxon rank test, p = 5.504 × 10 −6 ) (Fig. 11G) and Rapamycin (Wilcoxon rank test, p = 1.587 × 10 −7 ) (Fig. 11H), indicating that Cluster B was more sensitive to the above seven chemotherapeutic agents.

Construction and Validation of DDR-Related Score
A total of 574 DEGs were identified between the two clusters (Supplementary Fig. 16A,B).The GO enrichment analysis exhibited that these DEGs mainly enriched in chromosome segregation, nuclear division, and DNA replica-tion (Supplementary Fig. 16C).The KEGG pathway enrichment analysis result was shown as Supplementary Fig. 16D.
We first performed the univariate cox regression analysis and selected the genes with p-value < 0.05.We subsequently performed feature selection for significant DEGs exploiting Boruta algorithm.95 genes from the feature selection were eventually applied to construct the DDR score based on the formula: DDRscore = Ʃ (PC1i + PC2i).The KM curve analysis result revealed that low DDR score corresponded to a better survival outcome, while high DDR score was associated with poor prognosis in TCGA dataset (Fig. 12A) and GLASS dataset (Supplementary Fig. 17).Moreover, the DDR score in Cluster B were significantly increased compared to Cluster A (Wilcoxon rank test p < 2.2 × 10 −16 ) (Supplementary Fig. 13B).The ROC anal- ysis revealed that the DDR score had high AUC value in 1-, 3-and 5-years, suggesting that the DDR score could be used as potential biomarker to the prognosis of LGG (Fig. 12B).In addition, we also retrieved the progression free survival (PFS) and disease special survival (DSS) clinical info of LGG to validate the prognostic value of DDR score.As showed in Fig. 12C,E, the high DDR score group had a significantly better survival outcome compared to the low DDR score group in the PFS and DSS.The high AUC value also demonstrated that DDR score had a good prognostic value in PFS and DSS, respectively (Fig. 12D,F).
We further evaluated the response to immune checkpoint blockade (ICB) of DDR score in the IMvigor210 cohort (bladder cancer).Of note, survival outcome was sig-nificantly different from LGG, with low DDR scores associated with poorer survival outcomes (Supplementary Fig. 18).

Discussion
In the present study, we conducted a comprehensive and systematic data mining analysis on LGG, focusing on DDR-related genes, using two large-scale cohort datasets.Our research involves the development of a prognostic model, molecular subtyping, immune landscape, chemotherapy response, and drug development based on DDR, as well as the establishment of a DDR scoring system that shows promising potential as an independent prog- nostic factor.The validity of these findings has been confirmed through verification using additional datasets and the GLASS consortium dataset.The prognostic model, DDR scoring system, and molecular subtyping framework based on DDR-related genes hold promising potential to improve risk stratification, prognostic assessment, and treatment response prediction for LGG patients.This research offers more precise personalized clinical management decisions for patients and clinicians and has the potential to advance the development of small-molecule drugs for LGG.
In recent years, the continuous development of highthroughput whole genome sequencing technology has made it possible to further explore the molecular pathogenesis of LGG.It has been shown that DDR-related genes are related to the genesis, development, and prognosis of gliomas by some researchers.For example, Jin et al. [24] identified 5 prognostic DDR-related genes in primary GBM (pGBM) patients according to the risk scoring model.However, Jin's study had a small sample size, with only 89 pGBM cases in CGGA dataset.Similarly, Zeng et al. [45] developed and verified DDR risk signatures (CRY2, HDAC1, DCLRE1B, KPNA2) related to the prognosis of LGG by Cox and Lasso regression in two independent datasets: CGGA (172 samples) and TCGA (451 samples).In order to quantify the prognosis of LGG patients, a nomogram prediction model was developed.Pang et al. [48] further suggested that IDH mutation may affect the prognosis of LGG by regulating the DDR pathway.The above studies all focused on the value of DDR in gene prognostic prediction of LGG, and some studies had insufficient sample size to prove the robustness and effectiveness of signature in population samples.In this study, in comparison with other studies, we further systematically and comprehensively analyzed the differences of DDR gene-based LGG molecular subtypes, the relationship between subtypes and immune microenvironment and their mutation patterns, to assist in targeting LGG chemotherapy.Finally, we also constructed a molecular subtype classifier.These differences and innovations set our study apart.
Here, we systematically identified 6 prognosis-related DDR-related genes including POLL, POLR3GL, SDCBP, POLD3, STX3 and TARBP2.Multiple studies underscored the tumor-driver roles of the hub genes identified in present study.These hub genes were potential to be considered as novel therapeutic targets and prognostic predictors in LGG.POLL, also known as Pol or Pol λ, is the DNA polymerase lambda.Studies have elucidated its crucial role in cancer development through Base Excision Repair (BER) and Nonhomologous End-joining (NHEJ) [49,50].It is worth mentioning that in another study that used comprehensive analysis to identify DDR signature to predict prognosis of LGG, Pang et al. [48] identified DDR signatures (PLK3, WEE1, POLL, and PARPBP) related to prognosis of LGG included POLL.The role of POLL in glioma development and treatment warrants further investigation.SDCBP (Syndecan binding protein) is also called Melanoma differentiation associated gene-9 (MDA-9) or Syntenin-1 [51].SDCBP is significantly increased in breast cancer [52], prostate cancer [53,54], metastatic melanoma [55], glioblastoma [56][57][58][59], small cell lung cancer [60] and other malignant tumors.SDCBP is now considered to be a unique gene promoting metastasis [61].In addition, preclinical studies have confirmed that genetic or pharmacological inhibition of SDCBP effectively inhibits cell metastasis [53,59], especially in GBM cells [59].In terms of immunology, SDCBP can increase the immune infiltration of B cells, CD8+ T cells, CD4+ T cells, macrophages, neutrophils and dendritic cells in different types of lung cancer [62].POLD3 is an accessory subunit of replicating Pol δ polymerase and plays a unique role in replication stress-induced DNA fracture repair [63].Oncogeneinduced DNA replication stress is considered to be the driv- ing factor of tumorigenesis.Thus, different but generally increased genomic instability is commonly found in cells deficient in POLD1 or POLD3, manifested by DNA disruption, S phase progression disorder, and chromosomal abnormalities [63,64].Therefore, targeting POLD3 potentially provides a priority opportunity to target tumor cells [65][66][67].TARBP2 plays multifaceted roles in many biological and pathological conditions, including HIV-1 virus expression, microsatellite instability, cancer stem cell characterization, and tumor progression [68].TARBP2 expression is inconsistent in many human cancers, such as high expression of TARBP2 in lung cancer [69], breast cancer [70] and liver cancer [71], but down-regulated expression in colorectal cancer [72] and gastric cancers [68].Therefore, TARBP2 as a tumor suppressor or tumor promoter remains controversial.Currently, there are few studies on STX3 in cancer.Further studies have shown that STX3 promotes the growth of breast cancer cells by regulating the PTEN-PI3K-Akt-mTOR signaling pathway.STX3 may be a potential target for treatment of human breast cancer [73].The role of STX3 in LGG warrants further exploration in the future.In addition, the role of POLR3GL in the genesis and progression of cancer and glioma has been largely unexplored.In conclusion, the functions of 6 genes identified in our signature in glioma remain to be further studied.By comparing changes in gene expression across diseases, CMap identifies potential small molecule compounds that may treat different diseases.CMap analysis helped us identify compounds that had proven or might have potential efficacy on LGG by comparing DEGs between different risk groups of LGG.These compounds include HDAC inhibitor such as scriptaid [74], vorinostat [75], belinostat [76,77] and HC-toxin [78]; topoisomerase inhibitor-amsacrine [79,80], irinotecan [81][82][83], SN-38 [84] and teniposide [85,86]; DNA dependent protein kinase inhibitor-NU-7441 [87,88] and KU-0060648 [89,90]; MTOR inhibitor-everolimus [91][92][93]; KIT inhibitor-cediranib [94,95] and aurora kinase inhibitor-reversine [96].The effects of HDAC inhibitor-THM-I-94, HDAC inhibitor-ISOX, BCL inhibitor-TW-37, EGFR inhibitor-butein, ALK inhibitor-NVP-TAE684, MAP kinase inhibitor-XMD-892, IKK inhibitor-IKK-16, casein kinase inhibitor-LY-303511, DNA alkylating agentmitomycin, acetylcholine receptor antagonist-oxybutynin, RNA polymerase inhibitor-rifampicin, serotonin receptor agonist-S-14506, FLT3 inhibitor-TG-101348 (also known as fedratinib), HIF mudulator-VU-0418947-2, metalloproteinase inhibitor-WAY-170532 and leucine rich repeat kinase inhibitor-XMD-1150 on LGG have not been elucidated.In the future, the effect of the small molecule drugs identified by CMap analysis on LGG remains to be further studied.
Because of the heterogeneity of tumors, patients with the same cancer may respond differently to specific drug treatments.The identification of individual predictive biomarkers of drug sensitivity is the key to improve the ac-curacy of cancer medicine [97,98].TMZ is the standard first-line chemotherapy for glioma.Despite the use of a uniform treatment regimen, the prognosis of glioma varies greatly.Some patients have excellent results, while others do not respond to TMZ [7,99].Cisplatin is a commonly used chemotherapy drug in clinical settings, and its main target is DNA.By forming platinum-DNA adduct in cells, it inhibits DNA replication and transcription, thus inducing DNA damage and apoptosis in tumor cells.Cisplatin is an important agent for clinical treatment of a variety of solid tumors.Due to its high sensitivity to glioma [100,101], it is also an effective chemotherapy agent for the treatment of LGG in children [101,102].In addition, TMZ combined with platinum drugs exhibits a better effect in the treatment of recurrent malignant tumors, which can effectively improve the clinical symptoms of patients, control cancer progression and prolong the survival time of patients [100].In addition, the final results of the phase 3 randomized clinical trial RTOG9802 [103] showed that compared with adjuvant radiotherapy(RT) and RT+ PCV (Procarbazine, Lomustine, and Vincristine) × 6 cycles of treatment for newly diagnosed high-risk low-grade gliomas, adjuvant radiotherapy was less effective than adjuvant radiotherapy and PCV.The median OS and PFS of LGG patients treated with PCV were significantly improved.Therefore, we evaluated the sensitivity of two chemotherapy agents in the PCV regimen (Procarbazine was not included in GDSC) to different clusters of LGG, including Carmustine (Lomustine was not included in GDSC, Wilcoxon rank test, p = 1.927 × 10 −5 ) and Vincristine (Wilcoxon rank test, p = 5.504 × 10 −6 ).Furthermore, we evaluated the sensitivity of several other promising chemotherapeutic agents in the treatment of glioma, including Doxorubicin (Wilcoxon rank test, p = 1.857 × 10 −6 ), Methotrexate (Wilcoxon rank test, p = 0.0492) and Rapamycin (Wilcoxon rank test, p = 1.587 × 10 −7 ).In addition to Erlotinib, the results of several other chemotherapeutic agents showed that Cluster B was more sensitive to chemotherapeutic agents than the other cluster.In this study, we used the GDSC dataset to predict the commonly used chemotherapy drugs in the treatment of LGG.The results showed that the sensitivity of DDR-based LGG molecular typing to chemotherapy drugs was different, thus identifying a cluster that was highly sensitive to TMZ, Cisplatin, PCV and other chemotherapy drugs.Interestingly, this cluster happened to be one of the subtypes with worse prognosis.The results of this study can provide guidance for LGG patients with different molecular types to choose the treatment regimens.
Owing to their high diversity and plasticity, immune cells can play a variety of roles in TME, such as anti-tumor or pro-tumor [104].The interaction among TIIc, malignant cells and stromal cells leads to the formation of TME [105].In tumor progression and response to immunotherapy, tumor infiltrating immune cells exhibit crucial effects [106].There is growing evidence that genetic perturba-tions in cancer cells determine the immune environment of a tumor [104].Tumor cells can circumvent anti-tumor immune responses or immune checkpoint blockade therapy by regulating TME, for example by down-regulating antigen-presentation features and up-regulating repressor proteins or cytokine release [107].DDR gene mutations are associated with increased tumor-infiltrating lymphocytes, genomic instability, and tumor mutation burden in cancer [108].Additionally, DDR and immune response systems work together by activating each other, leading to a bidirectional connection.Disruption of DDR-immune response cross talk compromises (multi)cellular integrity, leading to cell-cycle-related and immune defects [109].Therefore, a growing number of studies have revealed the potential therapeutic significance of the interaction between DDR deficiency and immune TME components [104].Here, we quantified immune and stromal cells via ESTIMATE and ssGSEA algorithm of each LGG samples in TCGA and CGGA datasets.Our data suggested that different LGG clusters were characterized by different immune and stromal scores as well as tumor purity.Furthermore, we found that the LGG cluster with a worse prognosis featured by increased infiltration levels of activated B cells (Ba), eosinophils and monocytes.And we have observed a higher enrichment of immune cells in the high-risk group of LGG patients.This finding is consistent with previous studies that have indicated the significance of immunobiology as a dominant factor in malignant processes, particularly in gliomas [110][111][112].Additionally, the immune infiltrating cell signature has been recognized as a prognostic marker in gliomas [113,114].These observations suggest that immune processes play a crucial role in the development and progression of LGG.
While it may seem counterintuitive that immune cells, which are typically associated with an anti-tumor response, are enriched in the high-risk group, it is important to note that immune cells have complex and diverse functions in TMEs.The specific roles of immune cells in cancer are context-dependent and can vary based on their subtypes and functional states [115,116].In our study, we identified specific immune cell subtypes that were positively correlated with a higher risk score, including Memory B cell, Regulatory T cell, Eosinophil, Macrophage, Natural killer T cell, myeloid-derived suppressive cell (MDSC).The enrichment of specific immune cell subtypes in the high-risk group of LGG patients suggests their potential involvement in LGG progression and aggressiveness.For instance, MDSCs have been reported to promote B-cell-mediated immunosuppression in gliomas [117].Infiltrated tumor-associated macrophages have been associated with poor survival in glioma patients [118].T cells have been found to mediate immunosuppression and resistance to treatment in gliomas [119].NK-cell-targeted therapies have also been highlighted in combating immune escape in IDHmut gliomas [120].By interactions with cancer cells, macrophages of-fer a cancer-favorable microenvironment, which induce immune escape and LGG proliferation and metastases [121].Therefore, targeting these immune cells could potentially benefit LGG patients with an unfavorable prognosis.However, the precise mechanisms through which immune cells influence LGG development and progression are still being actively investigated.
Although the results of our study contribute to the clinical treatment of gliomas, there are inevitably some limitations that need to be acknowledged.First, this study is a retrospective analysis based on open datasets and rigorous validation, but more external datasets, multi-center prospective studies even experimental studies are needed to verify the clinical value and prognostic robustness of the signature.It should be noted that while our findings are promising, the limited amount of data analyzed in a specific type of cancer (i.e., LGG) calls for cautious interpretation and further validation.Furthermore, basic experiments are necessary to verify our results and elucidate the molecular mechanism that underlies them, thereby helping us better understand DDR-related genes.

Conclusions
In summary, we identified the risk signatures of DDRrelated genes and combined risk score with clinical information to construct a nomogram to optimize OS risk assessment.And we further analyzed and compared enrichment pathways between different risk groups and identified potential small molecule drugs targeting DDR.We systematically analyzed DDR-related genes in LGG to provide a comprehensive understanding of the molecular subtypes, mutations, clinical prognosis, chemotherapy sensitivity, and patient characteristics of LGG.Our study suggests that the DDR score system developed in this research could potentially serve as an independent prognostic indicator, providing valuable insights into the prognosis of LGG patients.In conclusion, the signature model and molecular typing based on DDR genes are helpful for more detailed classification and provide a useful clue for promoting personalized management of LGG and clinical drug development.
content.All authors contributed to editorial changes in the manuscript, read and approved the final version, and agreed to be accountable for all aspects of the work, ensuring that questions related to any part of the work are appropriately investigated and resolved.

Fig. 1 .
Fig. 1. Construction of prognostic gene signature in the Cancer Genome Atlas (TCGA) dataset.(A) Risk score, survival time, survival state and expression of the 6 genes in TCGA training dataset.The panel consists of three rows: top row showed the risk score distribution for the high-and low-risk score group, color transition from green to red indicates the increasing risk score of corresponding expression level of DNA damage repair (DDR) genes from low to high; middle row represents the low-grade gliomas (LGG) patients' distribution and survival status; the bottom row shows that the heatmap of 6 prognostic DDR-related genes expression.(B) Kaplan-Meier curves of prognostic models for high-(n = 238) and low-risk (n = 239) subgroups of patients with LGG in TCGA datasets.(C) Receiver Operating Characteristic (ROC) curve and area under the curve (AUC) of the 6-gene signature classification.

Fig. 4 .
Fig. 4. DDR signature as an independent prognostic factor shows strong prognostic power.(A) Univariate and (B) multivariate Cox regression analysis of clinical characteristics and DDR signature for survival of LGG in TCGA training dataset.

Fig. 5 .
Fig. 5. Construction of nomogram.(A) A nomogram constructed based on independent prognostic factors.The calibration plot for internal validation of the nomogram in 1-(B), 3-(C) and 5-years (D) OS. (E) ROC curve and AUC of the Nomogram (F) Comparisons of the ROC curve among the clinical factors and nomogram.Abbreviations: OS, overall survival; ROC, receiver operating characteristic; AUC, area under the curve (AUC).* p < 0.05, *** p < 0.001.

Fig. 6 .
Fig. 6.Identification of potential molecular compounds and its MoAs in LGG.Heatmap showed each compound from the CMap that shares a MoA (rows), sorted by descending number of compounds with a shared MoA.The above compounds have an absolute value of enrichment score ≥0.7 and might be capable of targeting the DDR-related signature.Abbreviations: MoA, mechanisms of action; CMap, Connectivity Map; BCL, B cell lymphoma; HDAC, histone deacetylase; EGFR, epidermal growth factor receptor; MTOR, mammalian target of rapamycin; ALK, anaplastic lymphoma kinase; MAP, mitogen-activated protein; FLT3, Fms-like tyrosine kinase 3; HIF, Hypoxia-Inducible Factor; IKK, IkappaB kinase; KIT, tyrosine kinase.

Fig. 8 .
Fig. 8. DDR-related genes could distinguish LGG patients in TCGA with different clinical and molecular features.(A) Consensus clustering CDF for k = 2 to k = 9. (B) Delta area plot exhibits relative change in area under CDF curve when at a certain k value compared with k-1.(C) Consensus clustering matrix heatmap plots of 477 samples from TCGA datasets for k = 2. (D) KM analysis of patients among two clusters,the prognosis of Cluster A (n = 361) is better than Cluster B (n = 116).(E) PCA analysis of the DDR-related genes expression when k = 2. Abbreviations: CDF, cumulative distribution function; KM, Kaplan-Meier.

Fig. 9 .
Fig. 9.The landscape of DDR-related genes among the two subtypes (Cluster A, n = 361; Cluster B, n = 116).Heatmap of two clusters defined by the top 100 variable expression genes.

Fig. 10 .
Fig. 10.The landscape of immune infiltration and TME characteristics in two subtypes.Violin diagram showed the differences in stromal score (A), immune score (B) and tumor purity (C) among 2 clusters.(D) Unsupervised clustering of LGG patients from the TCGA cohort using ssGSEA scores which represent the 28 cell infiltration.The cluster, age, survival status, gender, grade, IDH1 status, 1p/19q status and MGMT status are shown as patient annotations in the lower panel.

Fig. 12 .
Fig. 12. Construction and validation of the DDR score.(A) KM survival curve analysis in the overall survival (A), progression free survival (C) and disease special survival (DSS) (E).ROC analysis was applied to evaluate the specificity and sensitivity of the DDR score in overall survival (B), progression free survival (D) and DSS (F).Abbreviations: KM, Kaplan-Meier; OS, overall survival; DSS, disease special survival.