This 3rd clinical trial design blog from Quantics Biostatistics takes an introductory look at how the various types of clinical endpoints (which we introduced in our previous blog) determine the recommended calculations for working out what can be inferred from the summary data.
Formal statistical testing for a clinical trial is traditionally based on the concept of ‘proof by contradiction’. This means we aim to disprove the assumption that there is no difference between the treatments (often a placebo treatment and the new treatment). Bayesian approaches are also used but these will be discussed in a future blog.
A general approach of ‘proof by contradiction’ is as follows. First, we calculate the difference, D, between the average values of the endpoint for the treatment groups (for example, average tumour size reduction). Second, we determine the probability of observing a difference at least as large as D in a trial, under the assumption that the true difference is zero. If this probability, known as the P value, is small, then D is too large to be consistent with the assumption of zero difference. In that case, we conclude that there is a statistically significant difference between the treatments tested.
To clarify, if the P value is less than a pre-specified threshold (usually 5% or 1%) then there is good evidence that the true difference is not zero (the result is statistically significant). This threshold is often called the ‘level’ or ‘alpha level’ of the test. It is the probability of concluding that there is a difference between the groups when in fact there is no difference.
If the treatment could conceivably make the condition better or worse, the P value is calculated to allow for a difference in either direction; this results in a ‘two sided test’. If the treatment can only have no effect or a positive effect (a difference in one direction only), a ‘one sided test’ can be used.
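To make the logic above concrete, here is a minimal sketch using a permutation test, which is one simple way to approximate "the probability of a difference at least as large as D if the true difference is zero". It is an illustration only, not the method used for the examples in this blog. The function name `permutation_p_values` and the data are hypothetical.

```python
import random

def permutation_p_values(treated, control, n_perm=10000, seed=0):
    """Return (D, one sided P, two sided P) for the difference in means."""
    rng = random.Random(seed)
    n_treated = len(treated)
    n_control = len(control)
    observed = sum(treated) / n_treated - sum(control) / n_control
    pooled = list(treated) + list(control)
    one_sided = 0  # permuted difference at least as large as D
    two_sided = 0  # permuted difference at least as large as D in either direction
    for _ in range(n_perm):
        # Under "no true difference" the group labels are interchangeable,
        # so we reshuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_treated]) / n_treated - sum(pooled[n_treated:]) / n_control
        if diff >= observed:
            one_sided += 1
        if abs(diff) >= abs(observed):
            two_sided += 1
    return observed, one_sided / n_perm, two_sided / n_perm

# Hypothetical tumour size reductions (mm); not data from any trial.
treated = [12, 15, 9, 14, 11, 16, 13, 10]
control = [8, 11, 7, 12, 9, 10, 6, 9]

d, p_one, p_two = permutation_p_values(treated, control)
print(f"D = {d:.2f}, one sided P = {p_one:.3f}, two sided P = {p_two:.3f}")
```

When D is positive, the two sided P value counts extreme differences in both directions, so it is never smaller than the one sided value.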
Calculation of the P value
Depending on the type of clinical trial endpoint, different methods are used to calculate the P value. These methods are referred to as ‘tests’, often named after the statistician who worked out how to calculate the P value.
Example 1: Binary endpoint
Suppose that in a trial of a new treatment for psoriasis, the primary endpoint was PASI-75 at 8 weeks: a reduction in the Psoriasis Area and Severity Index of at least 75% after 8 weeks’ treatment. The groups can be compared via the percentages of the groups who achieved PASI-75.
In the placebo treatment group of 63 patients, 31 (49%) achieved PASI-75; in the experimental treatment group of 64 patients, 42 (66%) achieved PASI-75.
In this example Fisher’s Exact test can be used to calculate the P value. Assuming we are only interested in an increase in the percentage achieving PASI-75 when the experimental treatment is used, a one sided P value is calculated.
The calculation for determining the actual P value isn’t one that can simply be written down in an easily understandable equation and specialist statistical software exists for this reason. The results for the example described above can be seen in the figure below.
Fisher’s Exact test gives:
P = 0.045
Therefore P < 0.05
The conclusion is that there is evidence to support a higher rate of PASI-75 with the experimental treatment, at the 5% level.
Had the alpha level of the test been set at 1%, the conclusion would have been that there is insufficient evidence to support a higher rate of PASI-75 with the experimental treatment, at the 1% level.
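For readers who want to see what the software is doing, the one sided Fisher’s Exact P value can be sketched with the hypergeometric distribution: with the row and column totals held fixed, it sums the probabilities of every table with at least as many responders in the experimental group as were observed. This is a minimal sketch using the counts from Example 1; the function name `fisher_exact_one_sided` is our own and not part of any statistical package.

```python
from math import comb

def fisher_exact_one_sided(resp_new, n_new, resp_ctrl, n_ctrl):
    """One sided P: chance of at least `resp_new` responders in the new
    treatment group, given fixed margins and no true difference."""
    total_resp = resp_new + resp_ctrl
    total = n_new + n_ctrl
    all_tables = comb(total, total_resp)
    upper = min(n_new, total_resp)
    # Number of ways to place k responders in the new group and the
    # remaining responders in the control group, for each k >= observed.
    tail = sum(
        comb(n_new, k) * comb(n_ctrl, total_resp - k)
        for k in range(resp_new, upper + 1)
    )
    return tail / all_tables

# Counts from Example 1: 42 of 64 (experimental) vs 31 of 63 (placebo).
p = fisher_exact_one_sided(42, 64, 31, 63)
print(f"One sided P = {p:.3f}")
print("Significant at 5%:", p < 0.05)
print("Significant at 1%:", p < 0.01)
```

The printed value can be compared with the P value reported above. The same number compared against two different alpha levels is what leads to the two different conclusions.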
Example 2: Continuous endpoint
Suppose that in a trial of a new treatment for back pain, the primary endpoint was pain reduction at 1 month, measured using a visual analogue scale (VAS). Respondents specify their level of pain by indicating a position along a 100 mm line. For each patient, the change between the baseline VAS score and the 1-month VAS score is calculated, and the changes are summarised for each treatment group using the median. The medians were -59 and -46 in the new treatment (N = 19) and placebo (N = 18) groups respectively.
For this continuous endpoint, the groups can be compared using the Wilcoxon rank sum (or Mann-Whitney U) test. Again, assuming we are only interested in a larger median reduction in VAS score when the experimental treatment is used, a one sided P value is calculated.
Wilcoxon’s test: P = 0.282
Therefore P > 0.05
The conclusion is that there is no evidence to support a greater reduction in VAS score with the experimental treatment, at the 5% level.
Alternatively, the Student t test can be used. For its calculation of the P value, however, assumptions are required about the distribution of the values within each group: firstly that the data are ‘normally distributed’ (bell-shaped curve) and secondly that the variance (spread) is the same in the two groups. The data in this case appear skewed and not bell-shaped. Therefore we prefer the Wilcoxon test.
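The idea behind the rank sum test can be sketched in a few lines: pool the two groups, rank the values, and ask whether the ranks in one group are lower than chance would suggest. The sketch below uses the large-sample normal approximation with a correction for ties and no continuity correction; statistical software may instead use an exact calculation for small groups, so its P values can differ. The function name `rank_sum_one_sided` and the data are hypothetical, as the individual patient values from Example 2 are not given here.

```python
from math import erf, sqrt

def rank_sum_one_sided(new, placebo):
    """Return (U, one sided P) for the alternative that values in `new`
    tend to be lower (a bigger reduction) than values in `placebo`."""
    pooled = sorted([(v, 0) for v in new] + [(v, 1) for v in placebo])
    n1, n2 = len(new), len(placebo)
    n = n1 + n2
    rank_sum_new = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_new += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_new = rank_sum_new - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_new - mean_u) / sqrt(var_u)
    p = 0.5 * (1 + erf(z / sqrt(2)))  # lower tail of the standard normal
    return u_new, p

# Hypothetical changes in VAS score (mm); negative means less pain.
new_treatment = [-70, -62, -59, -55, -48, -30, -66]
placebo = [-50, -46, -40, -61, -35, -20, -44]

u, p = rank_sum_one_sided(new_treatment, placebo)
print(f"U = {u:.1f}, one sided P = {p:.3f}")
```

Because only the ranks are used, the test does not depend on the data being bell-shaped, which is why it suits skewed data like those described above.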
Example 3: Time to event endpoint
In a trial of a new treatment for cancer, the primary endpoint was time to death. The groups can be compared via the overall survival in the groups. The figure below shows a Kaplan-Meier plot of the times to death, as described in our previous blog.
The log-rank test can be used to compare the two survival curves. The P value for this example is greater than 5%, so there is no evidence to support a longer time to death (improved survival) in the experimental group.
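As a rough illustration of how the log-rank test works: at each time a death occurs, it compares the number of deaths observed in one group with the number expected if both groups shared the same survival, then combines these comparisons into a single chi-square statistic. The sketch below returns a two sided P value. The function name `log_rank_test` and the data are hypothetical; they are not the data behind the plot in this example.

```python
from math import erfc, sqrt

def log_rank_test(times_a, events_a, times_b, events_b):
    """Return (chi-square, two sided P). Events: 1 = death, 0 = censored."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    death_times = sorted({t for t, e, _ in data if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in death_times:
        at_risk = sum(1 for s, _, _ in data if s >= t)
        at_risk_a = sum(1 for s, _, g in data if s >= t and g == 0)
        deaths = sum(1 for s, e, _ in data if s == t and e == 1)
        deaths_a = sum(1 for s, e, g in data if s == t and e == 1 and g == 0)
        share_a = at_risk_a / at_risk
        observed_a += deaths_a
        expected_a += deaths * share_a  # expected deaths in group A if no difference
        if at_risk > 1:
            # Hypergeometric variance of the deaths in group A at this time.
            variance += deaths * share_a * (1 - share_a) * (at_risk - deaths) / (at_risk - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 degree of freedom
    return chi2, p

# Hypothetical survival times in months; event 0 means the patient was
# still alive when last seen (censored).
times_new = [6, 7, 10, 15, 19, 25]
events_new = [1, 0, 1, 1, 0, 1]
times_placebo = [1, 3, 5, 8, 9, 12]
events_placebo = [1, 1, 1, 1, 0, 1]

chi2, p = log_rank_test(times_new, events_new, times_placebo, events_placebo)
print(f"chi-square = {chi2:.2f}, two sided P = {p:.3f}")
```

Censored patients still count as "at risk" up to the time they were last seen, which is how the test makes use of incomplete follow-up.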
We will focus on more complex methods to compare overall survival between groups in a future blog. Next, we will look at different study population types and outline some of the issues that arise, along with the best practices for dealing with them.