Introduction

A randomized clinical trial was carried out to compare 3 different types of treatments used to treat patients diagnosed with lung cancer. The variables obtained after carrying out the study are:

time: patient survival time, measured in years.
status: variable indicating failure (1) or censorship (0).
age: patient’s age, in years.
sex: patient’s sex.
karno: Karnofsky performance score, measured by the doctor.
treatment: treatment received by the patient (three types of treatments: 1, 2 and 3).

Brief Descriptive Analysis of the Database

Let’s start by getting to know a little about our database.

Lung Database Summary
time	status	idade	sexo	karno	tratamento
Min. :0.002369	Min. :0.0000	Min. :33.00	F:160	Min. : 50.00	1: 83
1st Qu.:0.196642	1st Qu.:1.0000	1st Qu.:54.00	M:140	1st Qu.: 63.00	2: 93
Median :0.537423	Median :1.0000	Median :60.00	NA	Median : 76.00	3:124
Mean :0.925945	Mean :0.7733	Mean :60.63	NA	Mean : 76.04	NA
3rd Qu.:1.353466	3rd Qu.:1.0000	3rd Qu.:68.00	NA	3rd Qu.: 90.00	NA
Max. :4.970028	Max. :1.0000	Max. :84.00	NA	Max. :100.00	NA

Histogram of the variables Age, Status, Time and Karno

We can see some important things, we have an average time of 0.92 (years), but with a median of 0.53, which indicates an asymmetric distribution. We can also see that age and karno have more symmetrical distributions and that treatment group 3 is the largest among the 3 researched, among other things.

Additionally, see the failure and censorship graph below.

Failure and Censorship of patients diagnosed with lung cancer

Lung Survival Analysis

Model Choice

Now let’s test the main survival analysis models seen during the Survival Analysis discipline. We will use as structures the Cox model as a non-parametric option and the Accelerate Failure Time, Proportional Hazards, Proportional Odds, Accelerated Hazard and Yang and Prentice models using the Exponential, Weibull, Log-Normal and Log-Logistic distributions as distribution possibilities.

It is worth noting that each adjustment was made stepwise in order to obtain the best model for that combination.

Now we will analyze which models achieved the best fit to the data. See Table.

Models tested for the lung database
Model	AIC	Formula	Distribution
AFT	457.9999	Surv(time, status) ~ idade + sexo + tratamento	exponential
PH	457.9999	Surv(time, status) ~ idade + sexo + tratamento	exponential
PO	447.8186	Surv(time, status) ~ idade + tratamento	exponential
YP	449.9697	Surv(time, status) ~ idade + tratamento	exponential
AFT	452.5137	Surv(time, status) ~ idade + sexo + tratamento	weibull
PH	452.5136	Surv(time, status) ~ idade + sexo + tratamento	weibull
PO	447.0453	Surv(time, status) ~ idade + tratamento	weibull
YP	451.6039	Surv(time, status) ~ idade + tratamento	weibull
AFT	462.7182	Surv(time, status) ~ idade + tratamento	lognormal
PH	461.6407	Surv(time, status) ~ idade + tratamento	lognormal
PO	473.8168	Surv(time, status) ~ idade + tratamento	lognormal
YP	461.8633	Surv(time, status) ~ idade + tratamento	lognormal
AFT	451.0738	Surv(time, status) ~ idade + tratamento	loglogistic
PH	452.6900	Surv(time, status) ~ idade + sexo + tratamento	loglogistic
PO	451.0738	Surv(time, status) ~ idade + tratamento	loglogistic
YP	453.3683	Surv(time, status) ~ idade + tratamento	loglogistic

The models that achieved the best fit, taking into account the lowest AIC, were the PO models with weibull and exponential distributions, respectively. Let’s test the possibility of using the most parsimonious model, that is, the exponential model. See the adjustments obtained below:

# A tibble: 5 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0498 0.00953  0.0311  0.0685
2 coefficient tratamento2  -2.04   0.288   -2.61   -1.48  
3 coefficient tratamento3  -0.889  0.252   -1.38   -0.396 
4 baseline    alpha         1.13   0.0782   0.978   1.28  
5 baseline    gamma         3.90   1.50     0.967   6.83

# A tibble: 4 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0380 0.00556  0.0271  0.0489
2 coefficient tratamento2  -1.97   0.281   -2.52   -1.42  
3 coefficient tratamento3  -0.861  0.246   -1.34   -0.378 
4 baseline    lambda        0.369  0.108    0.157   0.580

As 1 is within the confidence interval for the alpha estimate in the model with a Weibull base distribution, we can use the exponential model for our analysis without losing much information. We will perform the likelihood ratio test to test whether the model can be simplified.


exponential(po) 1: Surv(time, status) ~ idade + tratamento 
weibull(po) 2: Surv(time, status) ~ idade + tratamento 
--- 
                      loglik        LR df Pr(>Chi)  
exponential(po) 1: -219.9093    2.7733  1  0.09585 .
weibull(po) 2:     -218.5226         -  -        -  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value obtained was greater than the 5% significance level, we will reduce the PO model that uses the Weibull model as its base distribution to a PO model that uses exponential distribution as its base, since we did not reject the null hypothesis of the test .

Therefore, the model chosen to follow the analysis was a Proportional Odds model whose base chance is exponential distribution with \(\lambda = 0.368\).

Residuals Analysis

Now we will analyze the residuals generated by the chosen model.

Cox Snell residuals for the chosen model

Analyzing the Cox Snell residual graph, it can be seen that the points fit well to the central line.

Martingale residuals for the chosen model

Analyzing the Martingale residual graph, it is noted that the points are randomly around zero.

Analyzing the Deviance residual graph, it is noted that the points are randomly and symmetrical around zero.

Analysis of the results of the chosen model

Now that we have chosen a model and carried out its residual analysis, let’s analyze the results obtained. Recalling the model output:

# A tibble: 4 × 6
  type        parameter   estimate      se  `2.5%` `97.5%`
  <chr>       <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 coefficient idade         0.0380 0.00556  0.0271  0.0489
2 coefficient tratamento2  -1.97   0.281   -2.52   -1.42  
3 coefficient tratamento3  -0.861  0.246   -1.34   -0.378 
4 baseline    lambda        0.369  0.108    0.157   0.580

The model is interpreted as follows:

The estimate of the \(\beta's\) obtained is exponentiated and affects the model’s base chance distribution. See the structure used by the Proportional Odds model:

\[ R(t|\Theta, \mathbf{x}) = R_{0}(t|\boldsymbol{\theta})\exp^{\mathbf{x} \boldsymbol{\beta}} \]

where \(R_{0}(t|\boldsymbol{\theta})\) is the base chance function that will have exponential distribution and \(\exp^{\mathbf{x} \boldsymbol{\beta}}\) are the covariates that will “disturb” the base chance function.

To measure this “disturbance” as a percentage, the following calculation is made:

\[ 100(e^{\beta}-1) \]

Therefore, we can make the following interpretations:

For the age variable:

An increase of one unit in the age variable has an impact increase of 3.87% in the base chance function.

For the treatment variable:

In relation to treatment 1, the use of treatment 2 has an 86.1% lower impact on the base chance function.
In relation to treatment 1, the use of treatment 3 has a 57.73% lower impact on the base chance function.

Since the function’s output uses treatment 1 as its base, let’s look at Tukey’s multiple comparison to see the last remaining comparison, between treatment 2 and 3.


     General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Linear Hypotheses:
           Estimate
2 - 1 == 0   -1.973
3 - 1 == 0   -0.861
3 - 2 == 0    1.112

In relation to treatment 2, treatment 3 has a 204.04% greater impact on the base chance function.

Survival curves

Let’s now test some possibilities for survival curve graphs for different ages in order to compare the groups.

Our minimum age in the lung database was 33 and the maximum was 84. Let’s look at the survival curves for different ages below.

Survival curves for ages 33 to 83 by treatment

Practical conclusions

Using Figure above as the main reference, we can note that at all ages, treatment 2 presented a survival curve superior to the other treatments. Furthermore, treatment 1 always presented the survival curve with the greatest decline at all ages. Consequently, treatment 3 obtained an intermediate survival curve compared to the others.

In section Analysis of the results of the chosen model it was highlighted that a one unit increase in the age variable has an impact increase of 3.87% on the base chance function. This fact was reinforced when the survival curves were made for different ages, since as age increased the curves declined more quickly.

Furthermore, when analyzing the \(\beta's\) of the treatment variable, treatments 2 and 3 had a smaller impact on the base chance function than treatment 1 and, when multiple comparisons were made to analyze the comparison between treatments 2 and 3 , treatment 3 had a 204.04% greater impact on the base chance function than treatment 2.

Therefore, the analysis carried out in the first paragraph in relation to Figure Survival curves for ages 33 to 83 by treatment was completely consistent with the analysis of \(\beta's\) and provided us with insights into the database of lung cancer cases.

References

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
survstan: Fitting Survival Regression Models via Stan. URL: https://cran.r-project.org/web/packages/survstan/index.html
Lawless (2002) ISBN:9780471372158;
Bennett (1982) doi:10.1002/sim.4780020223;
Chen and Wang(2000) doi:10.1080/01621459.2000.10474236;
Demarqui and Mayrink (2021) doi:10.1214/20-BJPS471.