Survival analysis using survstan

Author

Luis Otavio, Andre Tcherniacovski, Miguel Gonzaga, Joaquim Neto and Diego Braga

Published

June 1, 2023

Introduction

A randomized clinical trial was carried out to compare 3 different types of treatments used to treat patients diagnosed with lung cancer. The variables obtained after carrying out the study are:

  • time: patient survival time, measured in years.
  • status: variable indicating failure (1) or censorship (0).
  • age: patient’s age, in years.
  • sex: patient’s sex.
  • karno: Karnofsky performance score, measured by the doctor.
  • treatment: treatment received by the patient (three types of treatments: 1, 2 and 3).

Brief Descriptive Analysis of the Database

Let’s start by getting to know a little about our database.

Lung Database Summary
time status idade sexo karno tratamento
Min. :0.002369 Min. :0.0000 Min. :33.00 F:160 Min. : 50.00 1: 83
1st Qu.:0.196642 1st Qu.:1.0000 1st Qu.:54.00 M:140 1st Qu.: 63.00 2: 93
Median :0.537423 Median :1.0000 Median :60.00 NA Median : 76.00 3:124
Mean :0.925945 Mean :0.7733 Mean :60.63 NA Mean : 76.04 NA
3rd Qu.:1.353466 3rd Qu.:1.0000 3rd Qu.:68.00 NA 3rd Qu.: 90.00 NA
Max. :4.970028 Max. :1.0000 Max. :84.00 NA Max. :100.00 NA

Histogram of the variables Age, Status, Time and Karno

We can see some important things, we have an average time of 0.92 (years), but with a median of 0.53, which indicates an asymmetric distribution. We can also see that age and karno have more symmetrical distributions and that treatment group 3 is the largest among the 3 researched, among other things.

Additionally, see the failure and censorship graph below.

Failure and Censorship of patients diagnosed with lung cancer

Lung Survival Analysis

Model Choice

Now let’s test the main survival analysis models seen during the Survival Analysis discipline. We will use as structures the Cox model as a non-parametric option and the Accelerate Failure Time, Proportional Hazards, Proportional Odds, Accelerated Hazard and Yang and Prentice models using the Exponential, Weibull, Log-Normal and Log-Logistic distributions as distribution possibilities.

It is worth noting that each adjustment was made stepwise in order to obtain the best model for that combination.

Now we will analyze which models achieved the best fit to the data. See Table.

Models tested for the lung database
Model AIC Formula Distribution
AFT 457.9999 Surv(time, status) ~ idade + sexo + tratamento exponential
PH 457.9999 Surv(time, status) ~ idade + sexo + tratamento exponential
PO 447.8186 Surv(time, status) ~ idade + tratamento exponential
YP 449.9697 Surv(time, status) ~ idade + tratamento exponential
AFT 452.5137 Surv(time, status) ~ idade + sexo + tratamento weibull
PH 452.5136 Surv(time, status) ~ idade + sexo + tratamento weibull
PO 447.0453 Surv(time, status) ~ idade + tratamento weibull
YP 451.6039 Surv(time, status) ~ idade + tratamento weibull
AFT 462.7182 Surv(time, status) ~ idade + tratamento lognormal
PH 461.6407 Surv(time, status) ~ idade + tratamento lognormal
PO 473.8168 Surv(time, status) ~ idade + tratamento lognormal
YP 461.8633 Surv(time, status) ~ idade + tratamento lognormal
AFT 451.0738 Surv(time, status) ~ idade + tratamento loglogistic
PH 452.6900 Surv(time, status) ~ idade + sexo + tratamento loglogistic
PO 451.0738 Surv(time, status) ~ idade + tratamento loglogistic
YP 453.3683 Surv(time, status) ~ idade + tratamento loglogistic

The models that achieved the best fit, taking into account the lowest AIC, were the PO models with weibull and exponential distributions, respectively. Let’s test the possibility of using the most parsimonious model, that is, the exponential model. See the adjustments obtained below:

# A tibble: 5 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0498 0.00953  0.0311  0.0685
2 coefficient tratamento2  -2.04   0.288   -2.61   -1.48  
3 coefficient tratamento3  -0.889  0.252   -1.38   -0.396 
4 baseline    alpha         1.13   0.0782   0.978   1.28  
5 baseline    gamma         3.90   1.50     0.967   6.83  
# A tibble: 4 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0380 0.00556  0.0271  0.0489
2 coefficient tratamento2  -1.97   0.281   -2.52   -1.42  
3 coefficient tratamento3  -0.861  0.246   -1.34   -0.378 
4 baseline    lambda        0.369  0.108    0.157   0.580 

As 1 is within the confidence interval for the alpha estimate in the model with a Weibull base distribution, we can use the exponential model for our analysis without losing much information. We will perform the likelihood ratio test to test whether the model can be simplified.


exponential(po) 1: Surv(time, status) ~ idade + tratamento 
weibull(po) 2: Surv(time, status) ~ idade + tratamento 
--- 
                      loglik        LR df Pr(>Chi)  
exponential(po) 1: -219.9093    2.7733  1  0.09585 .
weibull(po) 2:     -218.5226         -  -        -  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value obtained was greater than the 5% significance level, we will reduce the PO model that uses the Weibull model as its base distribution to a PO model that uses exponential distribution as its base, since we did not reject the null hypothesis of the test .

Therefore, the model chosen to follow the analysis was a Proportional Odds model whose base chance is exponential distribution with \(\lambda = 0.368\).

Residuals Analysis

Now we will analyze the residuals generated by the chosen model.

Cox Snell residuals for the chosen model

Analyzing the Cox Snell residual graph, it can be seen that the points fit well to the central line.

Martingale residuals for the chosen model

Analyzing the Martingale residual graph, it is noted that the points are randomly around zero.

Deviance residuals for the chosen model

Analyzing the Deviance residual graph, it is noted that the points are randomly and symmetrical around zero.

Analysis of the results of the chosen model

Now that we have chosen a model and carried out its residual analysis, let’s analyze the results obtained. Recalling the model output:

# A tibble: 4 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0380 0.00556  0.0271  0.0489
2 coefficient tratamento2  -1.97   0.281   -2.52   -1.42  
3 coefficient tratamento3  -0.861  0.246   -1.34   -0.378 
4 baseline    lambda        0.369  0.108    0.157   0.580 

The model is interpreted as follows:

The estimate of the \(\beta's\) obtained is exponentiated and affects the model’s base chance distribution. See the structure used by the Proportional Odds model:

\[ R(t|\Theta, \mathbf{x}) = R_{0}(t|\boldsymbol{\theta})\exp^{\mathbf{x} \boldsymbol{\beta}} \]

where \(R_{0}(t|\boldsymbol{\theta})\) is the base chance function that will have exponential distribution and \(\exp^{\mathbf{x} \boldsymbol{\beta}}\) are the covariates that will “disturb” the base chance function.

To measure this “disturbance” as a percentage, the following calculation is made:

\[ 100(e^{\beta}-1) \]

Therefore, we can make the following interpretations:

For the age variable:

  • An increase of one unit in the age variable has an impact increase of 3.87% in the base chance function.

For the treatment variable:

  • In relation to treatment 1, the use of treatment 2 has an 86.1% lower impact on the base chance function.

  • In relation to treatment 1, the use of treatment 3 has a 57.73% lower impact on the base chance function.

Since the function’s output uses treatment 1 as its base, let’s look at Tukey’s multiple comparison to see the last remaining comparison, between treatment 2 and 3.


     General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Linear Hypotheses:
           Estimate
2 - 1 == 0   -1.973
3 - 1 == 0   -0.861
3 - 2 == 0    1.112
  • In relation to treatment 2, treatment 3 has a 204.04% greater impact on the base chance function.

Survival curves

Let’s now test some possibilities for survival curve graphs for different ages in order to compare the groups.

Our minimum age in the lung database was 33 and the maximum was 84. Let’s look at the survival curves for different ages below.

Survival curves for ages 33 to 83 by treatment

Practical conclusions

Using Figure above as the main reference, we can note that at all ages, treatment 2 presented a survival curve superior to the other treatments. Furthermore, treatment 1 always presented the survival curve with the greatest decline at all ages. Consequently, treatment 3 obtained an intermediate survival curve compared to the others.

In section Analysis of the results of the chosen model it was highlighted that a one unit increase in the age variable has an impact increase of 3.87% on the base chance function. This fact was reinforced when the survival curves were made for different ages, since as age increased the curves declined more quickly.

Furthermore, when analyzing the \(\beta's\) of the treatment variable, treatments 2 and 3 had a smaller impact on the base chance function than treatment 1 and, when multiple comparisons were made to analyze the comparison between treatments 2 and 3 , treatment 3 had a 204.04% greater impact on the base chance function than treatment 2.

Therefore, the analysis carried out in the first paragraph in relation to Figure Survival curves for ages 33 to 83 by treatment was completely consistent with the analysis of \(\beta's\) and provided us with insights into the database of lung cancer cases.

References