Clinical trial design – Comparing clinical trial endpoint summaries
This 3rd clinical trial design blog from Quantics Biostatistics takes an introductory look at how the various types of clinical endpoints (introduced in our previous blog) determine the recommended calculations for what can be inferred from summary data.
Formal statistical testing for a clinical trial is traditionally based on the concept of proof by contradiction. This means we aim to disprove the assumption that there is no difference between the treatments (often placebo versus a new treatment). Bayesian approaches are also used but are discussed elsewhere.
Key Takeaways
- Clinical trials commonly use hypothesis testing based on “proof by contradiction”.
- A small P value indicates evidence against the assumption of no treatment difference.
- Different endpoint types require different statistical tests.
- Choosing an inappropriate test can lead to misleading conclusions.
The general approach is as follows. First, we calculate the difference between the average endpoint values for the treatment groups. Secondly, we determine the probability of observing a difference at least this large under the assumption that the true difference is zero. This probability is known as the P value. If the P value is small, the observed difference is unlikely to be due to chance alone.
If the P value is less than a pre-specified threshold (commonly 5% or 1%), there is evidence that the true difference is not zero. This threshold is called the alpha level and represents the probability of concluding there is a difference when none exists.
If a treatment could plausibly have either a positive or negative effect, a two-sided test is used. If only improvement is possible, a one-sided test may be appropriate.
Calculation of the P value
The method used to calculate the P value depends on the type of clinical trial endpoint. These methods are known as statistical tests.
Example 1: Binary endpoint
Consider a psoriasis trial where the primary endpoint is PASI-75 at 8 weeks (a 75% reduction in severity). In the placebo group, 31 of 63 patients (49%) achieved PASI-75. In the experimental group, 42 of 64 patients (66%) achieved PASI-75.
Fisher’s Exact test can be used to compare these proportions. Assuming interest only in improvement with the experimental treatment, a one-sided test is appropriate.

Fisher’s Exact test gives P = 0.045. Since P < 0.05, there is evidence at the 5% level to support a higher PASI-75 response rate with the experimental treatment.
Example 2: Continuous endpoint
In a back pain trial, pain reduction at 1 month is measured using a visual analogue scale (VAS). Median reductions were −59 in the experimental group (N=19) and −46 in the placebo group (N=18).
The Wilcoxon rank sum (Mann–Whitney U) test is suitable here, particularly as the data appear skewed. A one-sided test is used to assess greater improvement with the experimental treatment.

The Wilcoxon test gives P = 0.282. Since P > 0.05, there is no evidence of a treatment difference at the 5% level.
A Student’s t-test could also be used, but it requires assumptions of normality and equal variances. Given the skewed data, the Wilcoxon test is preferred.
Example 3: Time-to-event endpoint
In a cancer trial where the endpoint is time to death, overall survival between treatment groups can be compared using Kaplan–Meier curves.

The log-rank test is used to compare survival curves. In this example, the P value exceeds 5%, indicating no evidence of improved survival with the experimental treatment.
More complex survival analysis methods will be discussed in a future blog.

