May 22

A Review of Statistical Hypothesis Testing

A clinical trial is often designed to demonstrate that a new treatment performs better than the currently available treatment. To do this, one must show that there is a positive difference between the new and old treatments, and that this difference is statistically significant. To determine statistical significance, we use statistical hypothesis testing procedures. This blog will serve as background information for a follow-up blog on sequential (adaptive) trial designs.

The purpose of this blog is not to consider how the test statistic or its p-value is computed; this will depend on how the data are collected (for example, see our blog on survival analysis). Instead, we will focus on the meaning of the significance level, how a conclusion is reached, and the implications for multiple hypothesis testing.

Hypothesis Testing

Regardless of how the endpoint is defined or which statistical test is performed, hypothesis testing methods follow the same general framework. A hypothesis test can be broken down into four steps:

  1. Stating the null and alternative hypotheses;
  2. Specifying the significance level;
  3. Computing the test statistic and p-value;
  4. Comparing the p-value to the significance level to make a decision.
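
The four steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from this blog: a one-sided two-sample z-test on a continuous outcome with a known common standard deviation, and made-up scores. The function name `one_sided_z_test` and its parameters are hypothetical; a real trial would use whatever test suits its endpoint.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sided_z_test(new, current, sigma, alpha=0.05):
    """Test H0: mean(new) <= mean(current) against H1: mean(new) > mean(current).

    Assumes, for illustration only, a known common standard deviation `sigma`.
    """
    # Step 1: the hypotheses are fixed by the direction of the test (see docstring).
    # Step 2: the significance level is the `alpha` argument.
    # Step 3: compute the test statistic and its one-sided p-value.
    diff = mean(new) - mean(current)
    se = sigma * sqrt(1 / len(new) + 1 / len(current))
    z = diff / se
    p_value = 1 - NormalDist().cdf(z)
    # Step 4: compare the p-value to the significance level.
    reject = p_value <= alpha
    return z, p_value, reject


# Made-up outcome scores (higher is better); not real trial data.
new_arm = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7, 6.4, 6.0]
current_arm = [5.0, 5.4, 4.9, 5.6, 5.2, 5.5, 4.8, 5.3]

z, p, reject = one_sided_z_test(new_arm, current_arm, sigma=0.6)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Note that the significance level is an input chosen before the data are seen, while the test statistic and p-value are the only quantities computed from the data.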

For the purpose of this blog, assume the following set of hypotheses:

Null hypothesis: The new treatment is equal to or worse than the current treatment;
Alternative hypothesis: The new treatment is better than (superior to) the current treatment.

In reality, only one of these two hypotheses is true, but the researcher does not know which. Hypothesis testing helps us decide which hypothesis is more consistent with the data observed in the trial, using a p-value.

Suppose the trial has concluded, all data have been collected and analysed using an appropriate method, and the p-value has been calculated. The p-value can then be compared to the significance level to reach a final conclusion.

A hypothesis test can have one of two possible outcomes:

  1. The p-value ≤ significance level → Reject the null hypothesis; conclude the alternative.
  2. The p-value > significance level → Fail to reject the null hypothesis.

Since only one of our two hypotheses is true, and our hypothesis test is based on only a sample (fraction) of the population (all disease sufferers), it is entirely possible that a wrong conclusion (an error) is made. Two types of error can be made, known as Type I and Type II errors (see the table below).

[Table: hypothesis testing decisions and the resulting Type I and Type II errors]

If we reject the null hypothesis when in fact it is true, we have committed a Type I error; this is equivalent to falsely claiming superiority of the new treatment. This is the primary error of concern from a regulatory standpoint. In hypothesis testing, we control the probability of committing a Type I error through the significance (alpha) level, which is typically set at 5%. In other words, when the null hypothesis is true, the probability of committing a Type I error is limited to 5%.
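
To see what "limited to 5%" means in practice, the following sketch simulates many trials in which the null hypothesis is true (both arms drawn from the same distribution) and counts how often a one-sided z-test rejects. The normal outcomes with unit variance, the arm size, and the function name `simulate_type_i_rate` are all assumptions made for illustration, not details from this blog.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_type_i_rate(n_trials=10000, n_per_arm=30, alpha=0.05, seed=1):
    """Fraction of simulated null trials in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    # Rejecting when p <= alpha is the same as rejecting when z >= this value.
    critical_value = NormalDist().inv_cdf(1 - alpha)
    se = sqrt(2 / n_per_arm)  # standard error of the difference, unit variance
    rejections = 0
    for _ in range(n_trials):
        # Both arms share the same true mean, so any rejection is a Type I error.
        new_mean = sum(rng.gauss(0, 1) for _ in range(n_per_arm)) / n_per_arm
        cur_mean = sum(rng.gauss(0, 1) for _ in range(n_per_arm)) / n_per_arm
        z = (new_mean - cur_mean) / se
        if z >= critical_value:
            rejections += 1
    return rejections / n_trials


type_i_rate = simulate_type_i_rate()
print(f"Simulated Type I error rate: {type_i_rate:.3f}")
```

The simulated rate should sit close to the chosen alpha, up to simulation noise.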

On the other hand, if we fail to reject a false null hypothesis, we have committed a Type II error; this means failing to detect a benefit of the new treatment that really exists. The probability of a Type II error is related to the power of the test (power is one minus the Type II error probability) and is accounted for in any sample size calculation. Committing a Type II error, or failing to identify a truly superior new treatment, is a risk to the producer.
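
The link between the Type II error, power, and sample size can be illustrated with a short calculation. The sketch below assumes a one-sided two-sample z-test with a known common standard deviation and equal arm sizes; the function name `power_one_sided_z` and the example effect size are hypothetical choices, not values from this blog.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n_per_arm, alpha=0.05):
    """Power of a one-sided two-sample z-test for a true mean difference `delta`.

    Assumes a known common standard deviation `sigma` and equal arm sizes.
    """
    critical_value = NormalDist().inv_cdf(1 - alpha)
    se = sigma * sqrt(2 / n_per_arm)
    # Under the alternative, the z statistic is normal with mean delta / se
    # and unit variance; power is the chance that it exceeds the critical value.
    return 1 - NormalDist().cdf(critical_value - delta / se)


for n in (25, 50, 100):
    power = power_one_sided_z(delta=0.5, sigma=1.0, n_per_arm=n)
    print(f"n per arm = {n:3d}: power = {power:.3f}, Type II error = {1 - power:.3f}")
```

Increasing the sample size raises the power and so lowers the Type II error probability, which is why this error is handled at the sample size calculation stage.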

Multiple Testing

For the moment, assume the null hypothesis is true and that the new treatment is ineffective. Suppose we were to conduct a clinical trial with an interim and a final analysis, both conducted at the 5% significance level. Then, considering each hypothesis test individually, there would be a 5% chance of rejecting the null hypothesis (in error). When considering the trial as a whole (across both analyses), the overall probability of rejecting the null hypothesis (i.e., the probability of rejecting the null in at least one analysis) must be greater than 5%, because the final analysis adds a further chance of a false rejection to the 5% already incurred at the interim analysis.
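
A small simulation can make this inflation concrete. The sketch below assumes, purely for illustration, that the interim analysis uses exactly the first half of the data, that the test statistics are standard normal under the null, and that both looks use a one-sided test at the same level; the function name `overall_type_i_rate` is hypothetical. Because the final analysis reuses the interim data, the two statistics are positively correlated, so the overall rate should fall between 5% and the 9.75% that two independent tests would give (1 - 0.95^2).

```python
import random
from math import sqrt
from statistics import NormalDist


def overall_type_i_rate(alpha_per_look=0.05, n_trials=200000, seed=7):
    """Chance of rejecting a true null at the interim OR the final analysis."""
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha_per_look)
    errors = 0
    for _ in range(n_trials):
        # Standardised evidence from each half of the data under the null.
        z_first_half = rng.gauss(0, 1)
        z_second_half = rng.gauss(0, 1)
        z_interim = z_first_half
        # The final statistic pools both halves (equal weights, unit variance).
        z_final = (z_first_half + z_second_half) / sqrt(2)
        if z_interim >= critical_value or z_final >= critical_value:
            errors += 1
    return errors / n_trials


overall_rate = overall_type_i_rate()
print(f"Overall Type I error rate across both analyses: {overall_rate:.3f}")
```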

In other words, if there is a 5% chance of making a Type I error on a single hypothesis test, then the more tests that are performed, the more likely it is that at least one error is made. From a regulatory standpoint, this is unacceptable; the probability of bringing an ineffective drug to market would be greater than 5%. Therefore, to control the overall probability of a Type I error, the significance level must be adjusted (reduced) for each individual test so as to preserve the overall alpha level.
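
As one simple illustration of such an adjustment, the sketch below applies a Bonferroni correction, which divides the overall alpha equally among the analyses. Bonferroni is chosen here only because it is easy to state; it is not necessarily the adjustment an adaptive design would use. The simulation rests on the same illustrative assumptions as before (interim analysis at half of the data, normal test statistics, one-sided tests), and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def bonferroni_level(overall_alpha, n_analyses):
    """Per-analysis significance level under a Bonferroni correction."""
    return overall_alpha / n_analyses


def overall_error_two_looks(alpha_per_look, n_trials=200000, seed=11):
    """Simulated chance of a false rejection at the interim or final analysis."""
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha_per_look)
    errors = 0
    for _ in range(n_trials):
        z_interim = rng.gauss(0, 1)
        # Final statistic pools the interim data with an equally sized second half.
        z_final = (z_interim + rng.gauss(0, 1)) / sqrt(2)
        if z_interim >= critical_value or z_final >= critical_value:
            errors += 1
    return errors / n_trials


adjusted = bonferroni_level(overall_alpha=0.05, n_analyses=2)
adjusted_rate = overall_error_two_looks(adjusted)
print(f"Per-analysis level: {adjusted:.3f}")
print(f"Overall Type I error rate with adjustment: {adjusted_rate:.3f}")
```

Under these assumptions the adjusted design keeps the overall error rate below 5%, at the cost of a stricter threshold at each analysis.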

There are many different ways to make this adjustment and ultimately, how this adjustment is made will have an impact on the probability of an early stop as well as the final sample size required.  We will discuss these decisions in a future follow-up blog which will take the concepts discussed into an adaptive trial context.

About The Author

Matthew Stephenson is Director of Statistics at Quantics Biostatistics. He completed his PhD in Statistics in 2019, and was awarded the 2020 Canadian Journal of Statistics Award for his research on leveraging the graphical structure among predictors to improve outcome prediction. Following a stint as Assistant Professor in Statistics at the University of New Brunswick from 2020-2022, he resumed a full-time role at Quantics in 2023.