Predictive Value of Machine Learning for Recurrence of Atrial Fibrillation after Catheter Ablation: A Systematic Review and Meta-Analysis

Background: Accurate prediction of atrial fibrillation (AF) recurrence after catheter ablation is crucial. In this study, we aimed to systematically review machine learning (ML)-based prediction of AF recurrence in the relevant literature. Methods: We conducted a comprehensive search of the PubMed, Embase, Cochrane, and Web of Science databases from 1980 to December 31, 2022 to identify studies on prediction models for AF recurrence risk after catheter ablation. We used the prediction model risk of bias assessment tool (PROBAST) to assess the risk of bias and R 4.2.0 for the meta-analysis, with subgroup analysis based on model type. Results: After screening, 40 papers were eligible for synthesis. The pooled concordance index (C-index) in the training set was 0.760 (95% confidence interval [CI] 0.739 to 0.781), the sensitivity was 0.74 (95% CI 0.69 to 0.77), and the specificity was 0.76 (95% CI 0.72 to 0.80). The pooled C-index in the validation set was 0.787 (95% CI 0.752 to 0.821), the sensitivity was 0.78 (95% CI 0.73 to 0.83), and the specificity was 0.75 (95% CI 0.65 to 0.82). The subgroup analysis revealed no significant difference in the pooled C-index between models based on radiomics features and those based on clinical characteristics. However, radiomics-based models showed a slightly higher sensitivity (training set: 0.82 vs. 0.71; validation set: 0.83 vs. 0.73). Logistic regression (LR), one of the most common ML methods, had a pooled C-index of 0.785 and 0.804 in the training and validation sets, respectively. The convolutional neural network (CNN) models outperformed these results, with a pooled C-index of 0.862 and 0.861, respectively. Age, radiomics features, left atrial diameter, AF type, and AF duration were identified as the key modeling variables. Conclusions: ML has demonstrated excellent performance in predicting AF recurrence after catheter ablation. LR, the most widely used ML algorithm for predicting AF recurrence, also showed high accuracy. The development of risk prediction nomograms for wide application is warranted.


Introduction
As the global population ages at an accelerated rate, atrial fibrillation (AF) has emerged as one of the cardiovascular diseases with the highest incidence in the 21st century [1]. In the United States alone, at least 3 to 6 million individuals currently have AF. Early rhythm control can significantly reduce the risk of adverse cardiovascular events among AF patients [2]. Two common rhythm control methods used in clinical practice are (1) catheter ablation and (2) antiarrhythmic drug therapy [3,4]. Catheter ablation has been shown to outperform drug therapy, as it helps patients restore sinus rhythm [3,5] and improves their quality of life during early disease progression [6]. However, AF recurs in approximately a third of patients undergoing catheter ablation [7]. Therefore, it is important to assess the risk of AF recurrence following ablation to develop primary prevention strategies. Although the CHADS2, CHA2DS2-VASc, and R2CHADS2 scores can be used to predict AF recurrence after catheter ablation, their predictive accuracy remains unsatisfactory [8]. Consequently, it remains to be proven whether such prediction models can truly improve patient prognosis.
Recent advances in artificial intelligence, statistics, and machine learning (ML) have gradually found new applications in clinical settings, including disease diagnosis and prognosis [9-11]. In this context, some investigators have used ML to identify risk factors related to the early recurrence of AF following catheter ablation and to construct prognostic models to improve clinical outcomes [12,13]. However, prediction accuracy remains controversial because ML covers many mathematical methods, variables, and models. Therefore, this study aimed to explore the predictive performance of ML for AF recurrence following catheter ablation and to comprehensively summarize the modeling variables, thus promoting the development of risk stratification tools in the field.

Study Registration
This systematic review was conducted following the requirements of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement (Supplementary Table 1) [14] and registered in PROSPERO (ID: CRD42023401497).

Inclusion Criteria
(1) Studies conducted in patients diagnosed with AF who underwent catheter ablation.
(2) The observed outcome event was AF recurrence, and an ML prediction model was constructed.
(3) Different studies may apply the same data set to different ML models, and these models may have different variables. Therefore, different studies on ML algorithms published based on the same data set were included in this systematic review.
(4) Studies without an independent validation set were also included in this systematic review.
(5) Eligible original study types included cohort studies, randomized controlled trials (RCTs), case-control studies, cross-sectional studies, case-cohort studies, and nested case-control studies.

Exclusion Criteria
(1) Studies with significant flaws in diagnosing AF or recurrence of AF.
(2) Only the risk factors were analyzed, and no complete ML model was constructed.
(3) Studies lacking the following outcome measures in assessing the accuracy of ML models: receiver operating characteristic (ROC) curve, C-statistics, concordance index (C-index), sensitivity, specificity, accuracy, recovery rate, accuracy rate, confusion matrix, diagnostic fourfold table, F1 score, and calibration curve.
(4) Studies that only validated an established scale.
(5) Studies on the accuracy of single-factor prediction.
(6) Meta-analyses, reviews, guidance, expert opinions, or articles of similar nature.

Data Sources and Search Strategy
PubMed, Embase, Web of Science, and Cochrane databases were searched from 1980 to December 31, 2022, by combining the subject terms and subheadings of "atrial fibrillation", "recurrence", and "machine learning". The complete search strategy is shown in Supplementary Table 2.

Study Selection and Data Extraction
All retrieved literature was imported into EndNote. After removing duplicates, titles and abstracts were reviewed to exclude irrelevant studies. Subsequently, the full texts of the studies selected in the initial screening were downloaded and read to select eligible original studies. A data extraction table was prepared in advance to record the following data: study types (e.g., cohort studies, cross-sectional studies), study characteristics (e.g., author, year, title, and author's country), study groups (e.g., total sample size, number of relapsed cases, total number of cases in the training set, number of recurrent cases in the training set, number of recurrent cases in the validation set, and the total number of cases in the validation set), ablation type, follow-up time, definition of blank period, definition of AF recurrence, method of generating the validation set, method for handling overfitting, missing value treatment method, variable screening method, model type, and modeling variables.
The literature screening and data extraction were independently conducted by two investigators (XF and XL), with a cross-check conducted following completion. In the event of any disagreements or uncertainties regarding the eligibility of a particular study, another reviewer (YL) was consulted for resolution.

Risk of Bias in the Included Studies
The prediction model risk of bias assessment tool (PROBAST) [15] was used to assess the risk of bias in the included original studies. This tool comprises a total of 20 questions organized across four domains (participants, predictors, outcomes, and statistical analysis). Each question can be answered as Yes/Probably Yes, No/Probably No, or No Information. A domain was considered to have a high risk of bias if at least one of its questions was answered No or Probably No, and a low risk if all of its questions were answered Yes or Probably Yes. The overall risk of bias was considered low if all domains were classified as low risk and high if at least one domain was classified as high risk.
To ensure accuracy, two investigators (XF and XL) independently conducted the risk of bias assessment based on PROBAST and cross-checked their results. In case of any disagreements, a third investigator (YL) was asked for assistance in reaching a judgment.
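The rating rule described above can be illustrated with a short Python sketch. It is not part of the original assessment workflow; the function names are hypothetical, and the "unclear" rating for domains that contain No Information answers but no negative answer is an assumption, since the text only defines the high-risk and low-risk cases.

```python
from typing import Dict, List

NEGATIVE = {"No", "Probably No"}
POSITIVE = {"Yes", "Probably Yes"}


def rate_domain(answers: List[str]) -> str:
    """Rate one PROBAST domain from the answers to its questions."""
    if any(a in NEGATIVE for a in answers):
        return "high"      # at least one No / Probably No
    if all(a in POSITIVE for a in answers):
        return "low"       # every answer is Yes / Probably Yes
    return "unclear"       # assumption: remaining case (some "No Information")


def rate_overall(domains: Dict[str, List[str]]) -> str:
    """Combine the domain ratings into an overall risk-of-bias rating."""
    ratings = [rate_domain(answers) for answers in domains.values()]
    if "high" in ratings:
        return "high"      # one high-risk domain makes the model high risk
    if all(r == "low" for r in ratings):
        return "low"
    return "unclear"       # assumption, as above


# Hypothetical answers for a single model (illustrative only).
example_model = {
    "participants": ["Yes", "Probably Yes"],
    "predictors": ["Yes", "Yes", "Probably Yes"],
    "outcomes": ["Yes", "No Information"],
    "statistical analysis": ["Yes", "Probably No"],
}
print(rate_overall(example_model))
```

Under this rule a single negative answer in any domain is enough to rate the whole model as high risk, which is why the statistical analysis domain tends to dominate the overall judgment.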

Outcomes
The C-index was utilized as the outcome measure to reflect the overall accuracy of the model. However, in case of severe imbalance between relapsed and non-relapsed cases, the C-index may not reflect the true prediction accuracy of models for the recurrence risk. Therefore, our main outcome measures also included sensitivity and specificity, and the secondary outcome measure was the frequency of occurrence of pooled modeling variables.
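To make the relationship between these measures concrete, the following Python sketch computes the C-index for a binary outcome (the proportion of recurrence/non-recurrence pairs that the model ranks correctly) together with sensitivity and specificity at a fixed threshold. The function names, the toy data, and the 0.5 threshold are assumptions chosen for illustration and do not come from the included studies.

```python
from typing import Sequence, Tuple


def c_index_binary(scores: Sequence[float], labels: Sequence[int]) -> float:
    """C-index for a binary outcome: share of (event, non-event) pairs in
    which the event has the higher predicted risk; ties count as 0.5."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive.
    Assumes both outcome classes are present."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy, imbalanced data: 2 recurrences among 10 patients (illustrative only).
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.45, 0.40, 0.35, 0.30, 0.30, 0.20, 0.20, 0.10, 0.10, 0.05]

print(c_index_binary(toy_scores, toy_labels))
print(sensitivity_specificity(toy_scores, toy_labels, threshold=0.5))
```

In this toy example the ranking is perfect, yet no recurrence is flagged at the chosen threshold, which illustrates why sensitivity and specificity are reported alongside the C-index.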

Statistical Analysis
If the C-index lacked a 95% confidence interval (CI) and standard error in the original study, the standard error was estimated using the calculation method of Debray et al. [16]. Given the differences in the variables included in each ML model and the inconsistency in the parameters, we utilized a random-effects model for the meta-analysis of the C-index.
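As a rough illustration of random-effects pooling, the Python sketch below converts a reported 95% CI into a standard error and pools several C-index estimates with the DerSimonian-Laird estimator. Several points are assumptions: the original analysis was run in R and does not state which between-study variance estimator was used, the pooling here is done on the untransformed C-index scale, and the approximation of Debray et al. for studies that report neither a CI nor a standard error is not reproduced. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z95)


def pool_random_effects(estimates: Sequence[float],
                        ses: Sequence[float]) -> Tuple[float, float, float, float]:
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    Returns (pooled estimate, lower 95% limit, upper 95% limit, tau^2)."""
    w = [1.0 / se ** 2 for se in ses]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0        # between-study variance
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - Z95 * se_pooled, pooled + Z95 * se_pooled, tau2


# Hypothetical study-level C-indices with 95% CIs (illustrative values only).
studies = [(0.72, 0.66, 0.78), (0.80, 0.75, 0.85), (0.77, 0.70, 0.84)]
c_values = [c for c, _, _ in studies]
se_values = [se_from_ci(lo, hi) for _, lo, hi in studies]
print(pool_random_effects(c_values, se_values))
```

When the between-study variance is estimated as zero, the random-effects weights reduce to the fixed-effect weights, so the two models give the same pooled estimate.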
In addition, a bivariate mixed-effects model was employed for the meta-analysis of sensitivity and specificity. Functioning as a random-effects model, it accounts for the correlation between sensitivity and specificity. During the meta-analysis process, sensitivity and specificity were analyzed based on the diagnostic fourfold table, which unfortunately was not reported in most of the original studies. To address this, we utilized the following two methods to calculate the diagnostic fourfold table: (1) calculate the fourfold table using sensitivity, specificity, and precision in combination with the number of cases; (2) extract the sensitivity and specificity according to the best Youden's index, and then calculate the fourfold table using the number of cases. The meta-analysis was conducted using R 4.2.0 (R Development Core Team, Vienna, Austria, http://www.R-project.org).
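The reconstruction of a fourfold table can be sketched in Python as follows. This is a simplified illustration that assumes the numbers of recurrent and non-recurrent cases are both known; the case in which precision is needed to recover those numbers is not covered. The function names and example values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def fourfold_from_rates(sensitivity: float, specificity: float,
                        n_recurrent: int, n_non_recurrent: int) -> Dict[str, int]:
    """Rebuild TP/FN/FP/TN counts from sensitivity, specificity, and case numbers.
    Counts are rounded to the nearest integer."""
    tp = round(sensitivity * n_recurrent)
    tn = round(specificity * n_non_recurrent)
    return {"TP": tp, "FN": n_recurrent - tp,
            "FP": n_non_recurrent - tn, "TN": tn}


def best_youden(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the (sensitivity, specificity) pair that maximizes
    Youden's index J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[0] + p[1] - 1.0)


# Hypothetical operating points read from a ROC curve (illustrative only).
roc_points = [(0.90, 0.50), (0.80, 0.75), (0.60, 0.90)]
sens, spec = best_youden(roc_points)
print(fourfold_from_rates(sens, spec, n_recurrent=50, n_non_recurrent=100))
```

Because the counts are rounded, the sensitivity and specificity recomputed from the reconstructed table may differ slightly from the reported values, especially in small validation sets.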

Study Selection
In total, 770 articles were identified from multiple databases. Of these, 220 were duplicates and were removed. After reviewing the titles and abstracts of the remaining 550 articles, 48 were selected for full-text assessment and downloaded.
Among them, one article was unavailable in full text, 6 articles were excluded for other reasons, and one article was excluded because it duplicated an identical cohort. Finally, 40 studies were included in this systematic review and meta-analysis [12,. Fig. 1 displays the PRISMA flow chart outlining the study selection process.
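The counts reported in the selection process can be cross-checked with a few lines of Python. The variable names are hypothetical; the numbers are those given in the text.

```python
# Counts as reported in the study selection paragraph.
identified = 770
duplicates = 220
screened = identified - duplicates            # titles and abstracts reviewed

full_text_assessed = 48
excluded_full_text = {
    "full text unavailable": 1,
    "other reasons": 6,
    "duplicate of an identical cohort": 1,
}
included = full_text_assessed - sum(excluded_full_text.values())

# Derived value, not stated in the text: records excluded at screening.
excluded_at_screening = screened - full_text_assessed

print(screened, excluded_at_screening, included)
```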

Modeling Variables
This study involved 93 predictors, with the top 5 being age, radiomics features, left atrial diameter, type of AF, and AF duration. Other predictors included body mass index (BMI), sex, left ventricular ejection fraction (LVEF), hypertension, diabetes, and estimated glomerular filtration rate (eGFR) (see Supplementary Table 3 for details of the modeling variables).

Risk of Bias in the Included Studies
The risk of bias and the overall applicability were assessed using the PROBAST checklist, which is provided in Supplementary Table 1. Details of the risk of bias and applicability for each model included in the study can be found in online Supplementary Table 4, and a summary of the bias risk is presented in Fig. 2.
Of the 54 models identified from the 40 eligible studies, two models (3.7%) had high and moderate risks of bias in terms of participants and predictors, possibly because their case-control design made it impossible to determine whether the source of participants was appropriate or whether the predictors were evaluated without knowledge of the outcome data. The risk of bias in outcome was moderate in 42 models (77.8%). Regarding the statistical analysis, an insufficient sample size or failure to account for overfitting of the prediction model led to a high risk of bias in 43 models.
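The proportions quoted above follow directly from the model counts, as the short Python check below shows. The helper name is hypothetical, and the share for the 43 models is derived here rather than stated in the text.

```python
TOTAL_MODELS = 54  # models identified from the 40 eligible studies


def share(n_models: int, total: int = TOTAL_MODELS) -> float:
    """Percentage of models, rounded to one decimal place."""
    return round(100 * n_models / total, 1)


print(share(2))    # participants and predictors domains
print(share(42))   # outcome domain
print(share(43))   # statistical analysis domain (derived, not in the text)
```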

Meta-Analysis

Synthesized Results
The C-index values of the prediction models for recurrent AF following catheter ablation treatment are shown in Table 1. Among the 40 included studies, the training set comprised a total of 48 models, with a pooled C-index of 0.760 (95% CI 0.739 to 0.781) calculated using the random-effects model. The validation set consisted of 19 models, with a pooled C-index of 0.787 (95% CI 0.752 to 0.821). In the training set, fourfold tables were directly or indirectly reported for 40 models, and the bivariate mixed model was utilized for the meta-analysis of sensitivity and specificity. The pooled sensitivity and specificity were 0.74 (95% CI 0.69 to 0.77) and 0.76 (95% CI 0.72 to 0.80), respectively. In the validation set, 15 models reported fourfold tables, and the bivariate mixed model was utilized for the meta-analysis of sensitivity and specificity. The pooled sensitivity and specificity were 0.78 (95% CI 0.73 to 0.83) and 0.75 (95% CI 0.65 to 0.82), respectively (Table 2).

Modeling Variables
The modeling variables were categorized into clinical characteristics or radiomics features for subgroup analysis.

Model Integrity
In this study, LR was the most commonly used ML algorithm, with 22 and 7 LR models included in the training set and validation set, respectively. The pooled C-index for LR models was 0.785 (95% CI 0.737 to 0.833) in the training set and 0.804 (95% CI 0.735 to 0.872) in the validation set. Results for the non-LR models are also summarized in Table 2.

Summary of the Main Results/Findings
This meta-analysis aimed to assess the performance of ML models in predicting AF recurrence following ablation. The pooled C-index results of 54 models demonstrated the high accuracy of ML in predicting and recognizing AF recurrence. As a data-driven method, ML allows continuous learning from data to refine the model using various statistical probability and optimization techniques. This feature presents significant opportunities for developing risk prediction models in cardiovascular research similar to the well-known Framingham Heart Study [56]. By developing risk models using ML, it becomes possible to classify ablation-treated AF patients into different risk groups, which in turn allows the formulation of personalized follow-up protocols based on the specific timing and populations. This approach can minimize overtreatment in low-risk populations and strike a better balance between the risk-benefit and cost-benefit in the screening of AF recurrence. Overall, ML holds promising potential in advancing the field of cardiovascular risk prediction and improving patient care.
We performed subgroup analyses across the many methods used to predict AF recurrence in patients after catheter ablation treatment. The traditional methods included logistic regression and Cox regression. Additionally, we examined the application of support vector machines, ensemble learning, artificial neural networks, deep learning, and other ML methods. Deep learning is advantageous in image recognition and data processing, as it converts low-level feature data into more abstract high-level feature data layer by layer. Based on the subgroup analysis results, the model constructed using the CNN algorithm by Yi-Ting Hwang et al. [52] demonstrated the highest C-index, specificity, and sensitivity. However, because of the limited number of models, it is essential to increase the sample size and conduct external validation to gather more robust risk assessment evidence. Among the models constructed based on clinical characteristics and radiomics features, logistic regression emerged as the most commonly used method for predicting AF recurrence in patients after ablation treatment. It had the second-highest predictive performance, after the CNN model, in the training set and displayed the best specificity and sensitivity in the validation set. Given these advantages, logistic regression is expected to be effectively applied in developing nomograms based on clinical characteristics for predicting AF recurrence after ablation treatment.
The selection of variables in prediction models plays a critical role in their performance. Among the 54 models, 29 included the age of AF patients receiving ablation treatment as a modeling variable. Age has been identified as the most likely risk factor for AF, more so than sex, BMI, hypertension, and cardiac failure [57]. However, among AF patients receiving ablation treatment at different ages, there were no statistical differences in the AF recurrence rate [58,59]. Another important modeling variable is radiomics features, formally proposed by Lambin in 2012 [60]. These are high-dimensional features that are not visible to the naked eye in digital medical images such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI), but they can be analyzed using high-throughput programs. By transforming the image data of the region of interest (ROI) into high-resolution, exploitable spatial data using fully automatic or semi-automatic analysis methods, the accuracy of disease prediction, diagnosis, and prognosis estimation can be improved. The subgroup analysis results showed no significant differences in the pooled C-index between the models constructed based on clinical characteristics and those based on radiomics features in either the training set or the validation set. This lack of difference may be due to data overfitting caused by excessive data extraction and decreased prediction performance resulting from inaccurate image segmentation [17,30]. Nonetheless, prediction models constructed based on radiomics features exhibited higher sensitivity, which is clinically significant for predicting AF recurrence after ablation.
While several studies have highlighted the significance of genetic variation in AF within the context of genomics [61,62], none of the studies included in this review used alleles related to AF recurrence after ablation as predictors for model development.
Moreover, most of the predictors in these models came from the baseline data of AF patients before admission, such as BMI, eGFR, and left atrial diameter. However, it is important to note that these short-term risk factors are subject to change, and AF recurrence may be influenced by health habits after discharge. Unfortunately, these factors are rarely considered in the analysis of prediction models. A recent single-center, randomized controlled trial of symptomatic AF in obesity [63] demonstrated that weight control and enhanced management of risk factors in AF patients after discharge improved the long-term success rate of AF ablation.

Clinical Feasibility
As cross-disciplinary research in AI-medicine progresses, there is a growing focus on developing and validating prediction models based on ML algorithms for cardiovascular diseases [64,65]. In this systematic review and meta-analysis, we combined the training set and the validation set (including both randomly acquired internal sampling results and a small number of external validation results) to assess the performance of ML in predicting AF recurrence in patients after ablation. The C-index results demonstrated high accuracy in both the training set (0.760 [0.739-0.781]) and the validation set (0.787 [0.752-0.821]), with similar prediction performance and no evidence of overfitting. Among the top 5 risk predictors for AF recurrence after ablation, age, type of AF, and duration of AF are relatively easy to obtain, vary little between populations, and are highly reproducible, making them suitable for clinical use and popularization to a certain degree.

Strengths and Limitations
This systematic review represents the first attempt to assess the predictive accuracy of ML for AF recurrence after ablation, providing evidence for the promising prediction capacity of ML models in these patients.However, our study does have some limitations.
First, the ML models included in the review were rated as having a high risk of bias owing to the strict assessment criteria of PROBAST. In terms of statistical methods, a model is considered to have a low risk of bias only if the events per variable (EPV) is larger than 20 and it has an independent validation set with more than 100 cases. However, this rule ignores certain rare diseases or particular research fields (e.g., radiomics). Therefore, we focused on prediction factors and results for studies with high bias.
Second, an essential aspect of ML is selecting effective modeling variables.To minimize the discrepancy in modeling variables, we conducted subgroup analysis based on clinical characteristics and radiomics features, which reduced the number of models in the analysis process.
Third, radiomics lacks a standardized operating procedure, resulting in multiple approaches for segmenting regions of interest, extracting texture features, screening modeling features, and constructing models. Despite this variability, it is important to acknowledge and recognize its clinical application value.
Finally, it is worth noting that some models in the included studies lacked valid independent validation sets [25,38-40]. Overcoming this limitation in systematic reviews of ML can be challenging. To address this issue, we combined the results of both the training set and the validation set to assess the value of ML by comparing their accuracy levels.

Conclusions
In conclusion, the ML method has shown high performance in predicting AF recurrence, making it a competitive and cost-effective approach to screening for AF recurrence after ablation. In the future, multi-center, large-sample clinical data sets can be established to develop a corresponding LR-based nomogram for predicting AF recurrence after ablation. Additionally, to enhance the efficiency and feasibility of the model, future predictors should not only focus on the baseline data indicators of AF patients after ablation but also include radiomics features and post-discharge health habits of AF patients.

Fig. 1. PRISMA (preferred reporting items for systematic reviews and meta-analyses) flow diagram for study selection.
In the subgroup analysis by type of modeling variable, there was no significant difference in the pooled C-index between the two groups in either the training set or the validation set (training set: 0.751 vs. 0.793; validation set: 0.794 vs. 0.779). However, the prediction models constructed from radiomics features showed a higher sensitivity (training set: 0.82 [95% CI 0.75 to 0.87]; validation set: 0.83 [95% CI 0.77 to 0.88]) than those constructed from clinical characteristics (training set: 0.71 [95% CI 0.67 to 0.76]; validation set: 0.73 [95% CI 0.66 to 0.79]) in both the training and validation sets.
Among non-LR models, the prediction models constructed based on the CNN algorithm showed the highest C-index, specificity, and sensitivity in both the training set and the validation set. Additionally, in the subgroup analysis by model type, two survival models (Cox and DeepSurv) were also reported. The training set C-index values of Cox and DeepSurv were 0.735 (95% CI 0.697 to 0.773) and 0.730 (95% CI 0.710 to 0.750), respectively (Table 2).

Fig. 2. Risk of bias assessment results for the included machine learning models.