Colorectal cancer (CRC) is the third most common cancer, accounting for about 10% of tumors diagnosed annually worldwide, and it is the second leading cause of cancer-related death [1, 2]. Because quality of life is impaired not only by CRC itself but also by adverse effects of treatment, such as a stoma, predicting a patient’s overall survival (OS) is pivotal.

The American Joint Committee on Cancer (AJCC) TNM stage is a typical and extensively used reference for cancer prognosis. However, many studies have shown that survival varies among CRC patients of the same stage, so a more precise staging system is needed [3–7]. Another option is the Cox proportional hazards model (CPH). However, the CPH is a semiparametric model that assumes a patient’s log-risk of an event (e.g., death) is a linear combination of the patient’s covariates, which may be too simplistic for time-to-event prediction in the real world [8, 9]. In this regard, some researchers have turned to machine learning algorithms and deep learning neural networks (NNs). NNs can improve prediction accuracy by discovering relevant features of high complexity [8, 9]. There are 8 popular NN survival algorithms, such as DeepSurv and CoxCC (Cox case-control corresponding methods), but no study has compared them yet. At the same time, although some predictive models for CRC exist, they have mainly been based on the CPH or traditional machine learning methods, or built on American clinical data such as the Surveillance, Epidemiology, and End Results (SEER) database [10–12].

We aimed to compare several NN-based survival algorithms and to develop a deep learning survival model for colorectal cancer patients (DeepCRC) using Asian clinical data. Such a model might help Asian clinicians make therapeutic decisions and avoid unnecessary treatment and complications such as a stoma.

Methods

Study design and data source

This study was designed as a retrospective cohort study. Patients diagnosed with colorectal cancer between 2006 and 2014 were included, and the last follow-up was in 2018. Raw clinical information was obtained from the biobank of Shanghai Outdo Biotech Company. Multivariate Imputation by Chained Equations was employed to fill in missing values (Supplementary Figure S1). All data were then randomly divided into two cohorts (Figure 1): the training cohort (80% of all patients) and the test cohort (20%). Survival models were trained on the training cohort and validated internally on the training cohort and externally on the test cohort.
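
The workflow up to the cohort split can be sketched as follows. The authors performed the imputation with R’s mice and VIM packages; the snippet below is only a rough Python analogue using scikit-learn’s IterativeImputer, and the file name, outcome column names (“time”, “event”) and random seed are illustrative assumptions.

```python
# Sketch of preprocessing: chained-equations-style imputation and a random
# 80/20 split. The original analysis used R (mice/VIM); this Python analogue
# uses scikit-learn, and all file/column names are hypothetical.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

df = pd.read_csv("crc_cohort.csv")               # hypothetical export of the biobank data
covariates = df.drop(columns=["time", "event"])  # assumed outcome column names

# Iterative (chained-equations-style) imputation of missing covariate values;
# categorical features would need numeric encoding before this step
imputer = IterativeImputer(max_iter=10, random_state=0)
covariates = pd.DataFrame(imputer.fit_transform(covariates), columns=covariates.columns)

# Random split into training (80%) and test (20%) cohorts
data = pd.concat([covariates, df[["time", "event"]]], axis=1)
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)
```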

Figure 1

Schematic diagram of this study

LASSO – Least Absolute Shrinkage and Selection Operator.


This study has been approved by the Ethics Committee (No. LW-2022-007) and individual consent for this retrospective analysis was waived.

Model training

Sex, age, tumor size, site, grade, number of lymph nodes examined, number of positive lymph nodes, T, N, M and stage comprised the full set of clinical features (abbreviated as ALL variables). The classical TNM variables (T, N, M and stage) were also used on their own as input features (TNM variables). The Least Absolute Shrinkage and Selection Operator (LASSO) was used to refine the variables; features with non-zero coefficients were retained as LASSO variables (age, size, site, grade, lymph nodes examined, lymph nodes positive, T, N, M and stage) (Supplementary Figure S2, Supplementary Table SI). Each of the three variable groups was then combined with 8 NN survival algorithms to identify the best combination, with traditional Cox models fitted for comparison. Before building the models, categorical clinical features were recoded as dummy variables. The Adam algorithm was chosen as the optimizer. Batch training and batch normalization were used to avoid underfitting, while dropout layers and an early stopping callback were applied to avoid overfitting when necessary. Dropout layers randomly silence some neural nodes, and the early stopping callback ends training when performance does not improve over several epochs. Training curves are shown in Supplementary Figure S3.
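
A minimal training sketch with pycox and torchtuples is shown below, assuming the preprocessed training cohort `train_df` from the preprocessing sketch above. The network size, dropout rate, learning rate, batch size and validation split are illustrative assumptions rather than the authors’ exact settings; pycox’s CoxPH class implements the DeepSurv approach.

```python
# Sketch of training a DeepSurv-style model (pycox's CoxPH) on the training
# cohort, with dummy encoding, Adam, batch normalization, dropout and early
# stopping. Hyperparameters are illustrative assumptions.
import pandas as pd
import torchtuples as tt
from pycox.models import CoxPH  # pycox's implementation of DeepSurv

# Dummy-encode categorical features once, then hold out part of the training
# cohort to drive the early-stopping criterion
encoded = pd.get_dummies(train_df.drop(columns=["time", "event"])).astype("float32")
val_idx = encoded.sample(frac=0.2, random_state=0).index
fit_idx = encoded.index.difference(val_idx)

def to_xy(idx):
    """Return (covariates, (durations, events)) as float32 arrays for pycox."""
    x = encoded.loc[idx].values
    y = (train_df.loc[idx, "time"].values.astype("float32"),
         train_df.loc[idx, "event"].values.astype("float32"))
    return x, y

x_fit, y_fit = to_xy(fit_idx)
x_val, y_val = to_xy(val_idx)

# MLP with batch normalization and dropout, optimized with Adam
net = tt.practical.MLPVanilla(in_features=x_fit.shape[1], num_nodes=[32, 32],
                              out_features=1, batch_norm=True, dropout=0.1,
                              output_bias=False)
model = CoxPH(net, tt.optim.Adam(lr=0.01))

# Batch training; stop early when the validation loss stops improving
callbacks = [tt.callbacks.EarlyStopping(patience=10)]
model.fit(x_fit, y_fit, batch_size=64, epochs=512,
          callbacks=callbacks, val_data=(x_val, y_val), verbose=False)
```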

Model evaluation

The concordance index (C-index), which generalizes the area under the receiver operating characteristic curve (AUC) to censored survival data, was the main criterion. A C-index close to 1.0 indicates near-perfect prediction, while a value of 0.5 or lower is no better than random guessing. The second indicator was the integrated Brier score, which ranges from 0 to 1, with values closer to 0 indicating better performance. Each model was evaluated on the training cohort and the test cohort. Bootstrapping (resampling 1000 times from the training or test cohort) was used to obtain 95% confidence intervals (CIs) for the C-index.
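
A minimal evaluation sketch with pycox’s EvalSurv follows, continuing from the training sketch above. `x_test`, `durations_test` and `events_test` are assumed to be prepared from the test cohort exactly as the training data were, and the percentile bootstrap and 100-point time grid are illustrative assumptions, since the exact CI procedure is not specified in the text.

```python
# Sketch of model evaluation: C-index and integrated Brier score via pycox's
# EvalSurv, plus a percentile bootstrap (an assumed CI method) for the C-index.
import numpy as np
from pycox.evaluation import EvalSurv

# x_test, durations_test, events_test: test-cohort arrays prepared like the training data
model.compute_baseline_hazards()       # required for CoxPH before survival prediction
surv = model.predict_surv_df(x_test)   # survival curves, one column per patient

ev = EvalSurv(surv, durations_test, events_test, censor_surv="km")
time_grid = np.linspace(durations_test.min(), durations_test.max(), 100)
print("C-index:", ev.concordance_td())
print("Integrated Brier score:", ev.integrated_brier_score(time_grid))

# 1000-times bootstrap over test-cohort patients for a 95% CI of the C-index
rng = np.random.default_rng(0)
boot = []
for _ in range(1000):
    idx = rng.choice(len(durations_test), size=len(durations_test), replace=True)
    ev_b = EvalSurv(surv.iloc[:, idx], durations_test[idx], events_test[idx],
                    censor_surv="km")
    boot.append(ev_b.concordance_td())
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Bootstrap 95% CI: {lo:.4f}-{hi:.4f}")
```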

Data processing and statistical analysis

Missing values were visualized and imputed in R 4.1.2 with the mice and VIM packages. LASSO regression was performed with the R package glmnet. NNs were constructed with Python 3.9.7, PyTorch and pycox. The R packages fmsb, RColorBrewer and ggplot2 were used for visualization. A two-sided p < 0.05 was considered statistically significant.
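
The LASSO step was done with glmnet in R; for completeness, the sketch below shows a rough Python analogue using scikit-survival’s L1-penalized Cox model. The penalty value is arbitrary, and because features are dummy-encoded here, selection operates at the dummy level rather than on whole clinical variables, so this is an approximation of the authors’ procedure, not a reproduction.

```python
# Rough Python analogue of the LASSO variable selection (the authors used R's
# glmnet). CoxnetSurvivalAnalysis with l1_ratio=1.0 fits an L1-penalized Cox
# model; the alpha value below is an arbitrary illustrative choice.
import pandas as pd
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

X = pd.get_dummies(train_df.drop(columns=["time", "event"])).astype(float)
y = Surv.from_arrays(event=train_df["event"].values.astype(bool),
                     time=train_df["time"].values)

lasso_cox = CoxnetSurvivalAnalysis(l1_ratio=1.0, alphas=[0.05])
lasso_cox.fit(X.values, y)

# Keep features with non-zero coefficients ("LASSO variables")
selected = X.columns[lasso_cox.coef_[:, 0] != 0]
print(list(selected))
```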

Results

Patient characteristics

Patients diagnosed with CRC in 2006–2014 (n = 416) were randomly split into two groups, the training cohort (80% of all patients, n = 333) and the test cohort (20%, n = 83) (Figure 1). Table I shows the clinical characteristics of the two cohorts. The median follow-up time was 62 months in the training cohort and 65 months in the test cohort. There were 156 events observed in the training cohort and 30 in the test cohort.

Table I

Demographic and clinical characteristics of the two cohorts

Variable | Training cohort (n = 333), N (%) | Test cohort (n = 83), N (%)
Sex:
 Female | 136 (40.84) | 37 (44.58)
 Male | 197 (59.16) | 46 (55.42)
Age [years]:
 Median (IQR) | 65 (57, 73) | 66 (55.5, 75.5)
Size [mm]:
 Median (IQR) | 50 (40, 70) | 50 (42.5, 67.5)
Site:
 Ascending colon | 53 (15.92) | 9 (10.84)
 Descending colon | 22 (6.61) | 2 (2.41)
 Hepatic flexure | 1 (0.3) | 1 (1.2)
 Ileocecal junction | 7 (2.1) | 2 (2.41)
 Rectosigmoid junction | 11 (3.3) | 2 (2.41)
 Rectum | 186 (55.86) | 53 (63.86)
 Sigmoid colon | 39 (11.71) | 8 (9.64)
 Transverse colon | 4 (1.2) | 1 (1.2)
 Others | 10 (3) | 5 (6.02)
Grade:
 I | 11 (3.3) | 0 (0)
 II | 228 (68.47) | 60 (72.29)
 III | 94 (28.23) | 23 (27.71)
Lymph nodes examined:
 Median (IQR) | 8 (5, 15) | 7 (4, 15)
Lymph nodes positive:
 Median (IQR) | 0 (0, 2) | 0 (0, 1)
T:
 T1 | 3 (0.9) | 1 (1.2)
 T2 | 49 (14.71) | 6 (7.23)
 T3 | 181 (54.35) | 57 (68.67)
 T4 | 2 (0.6) | 0 (0)
 T4a | 72 (21.62) | 14 (16.87)
 T4b | 26 (7.81) | 5 (6.02)
N:
 N0 | 189 (56.76) | 51 (61.45)
 N1 | 12 (3.6) | 6 (7.23)
 N1a | 37 (11.11) | 10 (12.05)
 N1b | 36 (10.81) | 9 (10.84)
 N1c | 1 (0.3) | 1 (1.2)
 N2 | 15 (4.5) | 2 (2.41)
 N2a | 29 (8.71) | 3 (3.61)
 N2b | 14 (4.2) | 1 (1.2)
M:
 M0 | 323 (97) | 81 (97.59)
 M1 | 6 (1.8) | 1 (1.2)
 M1a | 3 (0.9) | 1 (1.2)
 M1b | 1 (0.3) | 0 (0)
Stage:
 I | 44 (13.21) | 7 (8.43)
 II | 35 (10.51) | 12 (14.46)
 IIA | 85 (25.53) | 27 (32.53)
 IIB | 15 (4.5) | 3 (3.61)
 IIC | 8 (2.4) | 2 (2.41)
 III | 26 (7.81) | 7 (8.43)
 IIIA | 6 (1.8) | 0 (0)
 IIIB | 77 (23.12) | 21 (25.3)
 IIIC | 27 (8.11) | 2 (2.41)
 IV | 6 (1.8) | 1 (1.2)
 IVA | 3 (0.9) | 1 (1.2)
 IVB | 1 (0.3) | 0 (0)
Follow-up time [months]:
 Median (IQR) | 62 (28, 88) | 65 (39.5, 90)

[i] IQR – interquartile range.

Model performance

As illustrated in Figure 2 and Table II, TNM variables alone could not adequately reflect patient prognosis even with NN algorithms, with C-indexes between 0.4756 and 0.6957, among which DeepSurv performed best. When LASSO variables were used as input, performance improved markedly, with the top C-index reaching 0.8224 in the training cohort and 0.7491 in the test cohort, again from DeepSurv. Finally, models built on all variables achieved a further improvement, with DeepSurv reaching a C-index of 0.8300 in the training cohort and 0.7681 in the test cohort. Of the three variable groups, ALL variables were the best input, while DeepSurv showed the greatest potency in predicting patient OS.

Table II

C-index and integrated Brier score of different deep learning survival models

Model | TNM variables, internal | TNM variables, external | LASSO variables, internal | LASSO variables, external | All variables, internal | All variables, external
(Each cell shows C-index / integrated Brier score; internal validation = training cohort, external validation = test cohort)
DeepSurv | 0.6957 / 0.1763 | 0.6593 / 0.1734 | 0.8224 / 0.1174 | 0.7491 / 0.1554 | 0.8300 / 0.1118 | 0.7681 / 0.1517
CoxCC | 0.6664 / 0.1857 | 0.6755 / 0.1768 | 0.7080 / 0.1749 | 0.6814 / 0.1738 | 0.7537 / 0.1566 | 0.6947 / 0.1629
CoxTime | 0.6686 / 0.3392 | 0.6683 / 0.2373 | 0.7313 / 0.3375 | 0.6952 / 0.2655 | 0.7412 / 0.3379 | 0.7147 / 0.2667
LogisticHazard | 0.6729 / 0.3764 | 0.6739 / 0.1780 | 0.7209 / 0.3695 | 0.6803 / 0.1675 | 0.7301 / 0.3675 | 0.6993 / 0.1656
PCHazard | 0.6287 / 0.1874 | 0.6016 / 0.1754 | 0.6660 / 0.1611 | 0.5913 / 0.1680 | 0.6662 / 0.1685 | 0.6529 / 0.1615
N-MTLR | 0.4975 / 0.3835 | 0.6012 / 0.1832 | 0.5164 / 0.3819 | 0.6947 / 0.1599 | 0.5179 / 0.3695 | 0.7299 / 0.1712
DeepHit | 0.4756 / 0.3797 | 0.6778 / 0.1859 | 0.6311 / 0.3628 | 0.6850 / 0.1803 | 0.6851 / 0.3701 | 0.7281 / 0.1866
PMF | 0.5164 / 0.3834 | 0.6606 / 0.1752 | 0.5425 / 0.3834 | 0.6803 / 0.1653 | 0.6219 / 0.3776 | 0.6803 / 0.1753
Cox | 0.6687 / – | 0.6707 / – | 0.7347 / – | 0.6637 / – | 0.7343 / – | 0.6643 / –

[i] C-index – concordance index. TNM variables: T + N + M + Stage. LASSO variables: Age + Size + Site + Grade + Lymph nodes examined + Lymph nodes positive + T + N + M + Stage. All variables: Sex + Age + Size + Site + Grade + Lymph nodes examined + Lymph nodes positive + T + N + M + Stage. LASSO – Least Absolute Shrinkage and Selection Operator. CoxCC – Cox Case-control Corresponding methods. PCHazard – Piecewise Constant Hazard. N-MTLR – Neural Multi-Task Logistic Regression. PMF – Probability Mass Function.

Figure 2

Performance of 8 neural network algorithms combined with 3 variable groups, in both internal and external validation. A – Brier scores. B – Concordance indexes. C – Radar plot comparing the concordance index across these combinations

C-index – concordance index. TNM vars – T + N + M + Stage. LASSO vars – Age + Size + Site + Grade + Lymph nodes examined + Lymph nodes positive + T + N + M + Stage. All vars – Sex + Age + Size + Site + Grade + Lymph nodes examined + Lymph nodes positive + T + N + M + Stage. CoxCC – Cox Case-control Corresponding methods. PCHazard – Piecewise Constant Hazard. N-MTLR – Neural Multi-Task Logistic Regression. PMF – Probability Mass Function. LASSO – Least Absolute Shrinkage and Selection Operator.


After 1000 bootstrap resamples, DeepSurv still exhibited the best performance, with a C-index of 0.8315 (95% CI: 0.8297–0.8332) in the training cohort and 0.7719 (95% CI: 0.7693–0.7745) in the test cohort (Supplementary Table SII).

Discussion

As a semiparametric model with a linearity assumption, the CPH has inherent limitations in forecasting real-world data. As one of the most prominent machine learning approaches, NNs have become increasingly popular in the medical domain; typical examples are applications in tumor pathology and X-ray computed tomography (CT). Reasonably, researchers have hoped to use NNs to improve the accuracy of predicting cancer patients’ OS, and NN survival models have indeed shown great potential. For example, to predict urinary continence recovery after robot-assisted radical prostatectomy, Loc Trinh and colleagues compared the Cox model and the NN survival model DeepSurv (C-index: CPH 0.695, DeepSurv 0.708) [13]. However, although several NN survival algorithms exist, they have not yet been compared.

Although survival models for CRC already exist, an NN model based on Asian data has not been reported and is needed. We therefore compared 8 common NN survival algorithms to identify the best one for our collected clinical features. DeepSurv had the highest C-index of all 8 algorithms in both cohorts (0.8300 in the training cohort and 0.7681 in the test cohort). The code we used has been uploaded to GitHub, and we hope it will be helpful for doctors treating not only CRC but also other cancers.

There were some limitations in this study. Family history, lifestyle and some biomarkers are important factors in colorectal carcinogenesis and may influence prognosis, but they were not considered here [14, 15]. The sample size was moderate. Validation of DeepCRC with prospective data would be preferable.

Collectively, this study pioneered the use of 8 NN survival models with real Asian data for predicting CRC patients’ OS. The prediction of OS might offer a reference for doctors on treatment options.

In conclusion, we applied and compared 8 deep learning survival models on Asian data to predict CRC patients’ survival (DeepCRC). The DeepCRC model performed well in predicting CRC patients’ overall survival.