In “In Vitro vs In Vivo Potency Assays: Modern Vaccine Batch Release”, we examined the pros and cons of two methods of testing vaccine potency: in vivo assays, which use animal models, and in vitro assays, which use laboratory experiments. As we established in that blog, the direction of travel is towards in vitro assays and their improved analytical capabilities, particularly as experimental techniques have advanced and the ethics of animal testing draw ever more intense questioning. A challenge which can appear for existing vaccines, therefore, is ensuring that a novel in vitro assay is able to provide results comparable to an established in vivo assay. Here, we’ll explore some of the statistical considerations which should be kept in mind when examining the capabilities of an in vitro assay intended to replace an existing in vivo process, and some key components of a bridging study.
In Vivo vs In Vitro
For several decades, a common approach to testing the potency of a batch of vaccine has used animal models. Whether through challenge or serological assays, the potency of the batch can be established by comparing the responses of subjects dosed with the test batch to those of subjects dosed with a reference standard.
Key Takeaways
- In vitro assays are increasingly preferred over in vivo assays due to ethical considerations, faster turnaround, lower costs, and greater precision, but they require rigorous validation to ensure comparable results.
- Bridging studies are essential for replacing established in vivo assays with in vitro alternatives, using statistical tools such as correlation analysis, equivalence testing (e.g., Geometric Mean Ratio within defined limits), and regression models to assess comparability and bias.
- Study design factors – especially assay variability, equivalence margins, and sample size – heavily influence statistical power, with higher variability and stricter equivalence limits potentially requiring very large sample sizes, making early statistical planning crucial.
While such methods are well established, they have several challenges. First among these are ethical concerns surrounding the use of animals in laboratory experiments, which faces increasing scrutiny as other approaches have become available. Practical issues also prove problematic: in vivo assays are often time and resource intensive, which means they are also expensive. And, from a statistical point of view, the results they produce are often highly variable, meaning they might not be able to provide the same assurances of quality as more precise techniques.
In vitro assays, by contrast, typically use cells or tissues which can be grown in a laboratory environment. The goal of these methods is the same as for an in vivo assay – to test the immunological response to the test batch compared to a reference. The difference is that an in vitro assay uses a simulacrum of an immune system rather than one in a living organism.
This means that any measurement made in an in vitro assay is inherently a proxy, compared to the direct measurement of the response generated by the vaccine from an in vivo assay. Arguably, this cost is outweighed by the benefit of removing the need for animal subjects which, even putting aside ethical considerations, eliminates the need to feed, water, and house subjects. In vitro experiments are typically faster, increasing testing throughput, and can provide a dramatic improvement in precision compared to in vivo methods.
Components of Bridging
It can, therefore, be desirable to develop alternative in vitro methods for vaccines with an established in vivo batch release assay, to access the benefits provided by in vitro analysis. To do this, one must demonstrate that the in vitro method can produce results which are comparable to those produced by the existing assay. The original assay will have been validated to ensure that its results were acceptably precise and accurate, so if the results from the in vitro assay are sufficiently similar then we can be confident that it performs comparably to the in vivo assay. This is known as bridging between the two assays.
A major component of a bridging study is demonstrating comparability between the in vivo and in vitro assays. Demonstrating acceptable comparability involves understanding the relationship between the potency results provided by the two assays, and showing that there is no bias between the two methods. In this context, bias refers to the difference between the relative potency found by the in vivo method and that found by the in vitro method, assessed over a series of measurements.
Correlation
If the in vitro assay provides results which are comparable to those provided by the in vivo assay, then we would expect there to be a relationship between the results from the two assays. This can be quantified using the correlation between the results. Correlation is a particularly useful metric if the in vivo and in vitro methods measure different outcomes.
Specifically, the correlation coefficient, r, indicates the strength and direction of any relationship. The correlation coefficient ranges between -1 and 1, with the magnitude of r giving information on the strength of the correlation. The sign of r shows whether the relationship is direct or inverse. A direct correlation means an increase in the in vivo result is associated with an increase in the in vitro result, while an inverse correlation means that an increase in the in vivo result is associated with a decrease in the in vitro result.
For example, if r=0.75, we expect a strong, direct relationship between the two assays. If, by contrast, r=-0.25, we expect a weak, inverse relationship. Ideally, we want the correlation between the in vivo and in vitro results to be as strong as possible.
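To make this concrete, here is a minimal sketch of how such a coefficient might be computed, assuming paired relative-potency results for the same samples; the data and values are purely illustrative.

```python
# A minimal sketch of quantifying the relationship between paired
# in vivo and in vitro results. The data are purely illustrative.
from scipy.stats import pearsonr

in_vivo = [0.92, 1.05, 1.10, 0.85, 1.20, 0.98, 1.15]   # relative potencies
in_vitro = [0.95, 1.08, 1.12, 0.88, 1.18, 1.01, 1.13]

r, p_value = pearsonr(in_vivo, in_vitro)
print(f"correlation coefficient r = {r:.2f}")
# r near +1 indicates a strong, direct relationship; near -1, a strong
# inverse one; near 0, little linear relationship at all.
```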
It is, however, arguable how strong a correlation must be before it is considered strong enough. For example, two different groups have examined correlations between in vivo and in vitro testing methods for HPV vaccines. One group, Shank-Retzlaff, Wang, Morley et al., found a correlation coefficient of -0.75 between vaccine antigenicity and mouse potency ED50, which they considered an indication that the two methods were “strongly correlated”. Meanwhile, Hu, Jia, Wang, et al. claimed a correlation coefficient of greater than 0.5 between in vitro and in vivo methods on all four strains of a quadrivalent HPV vaccine indicated a “good correlation”. These correlation coefficients are of noticeably different magnitude, but, without an agreed-upon minimum acceptable correlation strength, it is difficult to determine whether either, both, or neither can be considered to have sufficient correlation to demonstrate comparability.
Regardless, bridging between two methods often cannot be performed using correlation alone. This is because correlation does not account for any bias between the results from the two assays. Correlation only considers the pattern between the results, not whether they agree. For example, readings from two thermometers might both increase when the temperature in a room increases, but one might consistently read two degrees higher than the other. These measurements might be correlated, but we would observe a systematic bias.
This can also occur for assays: in vitro and in vivo potency may show high correlation, but also exhibit a bias which could lead to inappropriate outcomes from batch release testing. That means we not only need to assess correlation, but also agreement. One way to do this would be to examine a linear regression model fit to the paired results, as in the sketch below. We would expect results which agree to give a model with a slope of one and an intercept of zero. We could also employ equivalence testing, which is what we will examine next.
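As a sketch of that regression check (illustrative data again; a fuller analysis would put confidence intervals on the estimates rather than eyeballing them):

```python
# A sketch of the agreement check described above: fit a straight line to
# the paired results and compare the slope to one and the intercept to zero.
from scipy.stats import linregress

in_vivo = [0.92, 1.05, 1.10, 0.85, 1.20, 0.98, 1.15]   # relative potencies
in_vitro = [0.95, 1.08, 1.12, 0.88, 1.18, 1.01, 1.13]

fit = linregress(in_vivo, in_vitro)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
# A slope close to 1 and an intercept close to 0 suggest agreement,
# not merely correlation.
```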
Equivalence
To establish that the difference between the in vivo and in vitro results is acceptable, we can demonstrate that the measured relative potencies of a sample using each method are close enough that the difference is negligible. As such, we are proving that the results are equivalent, at least from a statistical perspective, meaning we can proceed with the in vitro assay confident that any bias will be small enough to be unlikely to affect the outcomes of testing.
Equivalence testing is a form of hypothesis test. We first define – and assume – a null hypothesis, $H_0$. We then look for evidence that we should reject the null, and instead adopt an alternative hypothesis, $H_1$. For our case here, our hypotheses are formed as follows:

$H_0$: The difference in results from the in vivo and in vitro methods is meaningful

$H_1$: The difference in results from the in vivo and in vitro methods is negligible
So, we assume that there is a meaningful difference between the two methods, and seek to prove that this difference is, in fact, negligible. This is one reason why equivalence testing is preferred to its counterpart – significance testing – for demonstrating comparability. In a significance test, we would assume that there is no difference, and seek to prove there is. If we fail to do so, however, this does not demonstrate that there is no difference, just that we have not yet found sufficient evidence. As such, we can never definitively show there is no difference using a significance test.
We require a metric by which we can compare results from the two methods. We will assume that both assays provide a relative potency as their response. In this case, we can use the Geometric Mean Ratio (GMR). Simply, this is the ratio of the relative potency of a sample measured using the in vivo method to that of the same sample measured using the in vitro method. That is:

$$\mathrm{GMR} = \frac{\mathrm{RP}_{in\ vivo}}{\mathrm{RP}_{in\ vitro}}$$

Over a series of samples, the GMR is the geometric mean of these per-sample ratios.
If the two assays produce the same result, the GMR will be one.
To define the equivalence test, we require equivalence limits. These define a region in which we consider the GMR to be close enough to one that the difference is negligible. These should have multiplicative symmetry about one – a possible choice is (0.80, 1.25), since $1/0.80 = 1.25$. So, our hypotheses for an equivalence test showing the two methods produce results with sufficient agreement might be:

$H_0$: $\mathrm{GMR} \leq 0.80$ or $\mathrm{GMR} \geq 1.25$

$H_1$: $0.80 < \mathrm{GMR} < 1.25$
The test itself is performed by calculating the GMR for a series of samples whose relative potency has been measured using both methods. A confidence interval (typically 90% confidence) is then calculated on the GMR. The key criterion is this: in order for the methods to be deemed equivalent, the 90% CI on the GMR must be entirely contained within the equivalence limits. In our case, the upper confidence limit must be less than 1.25, and the lower confidence limit must be greater than 0.80. The plot below shows examples of some possible scenarios in an equivalence test:

[Figure: 90% confidence intervals on the GMR for four scenarios (A–D), plotted against the equivalence limits of 0.80 and 1.25]
In scenarios A and B, the method would fail the equivalence test as the 90% CI is not entirely contained by the equivalence limits. Indeed, the entire 90% CI is outside the equivalence limits in scenario A. Conversely, the in vitro method would pass equivalence in scenarios C and D since the 90% CI is contained by the equivalence limits in these cases.
Note that the 90% CI does not need to contain one – the value which indicates perfect equivalence – in order to pass. Recall that the goal of an equivalence test is to demonstrate that any difference between the two methods is negligible, not that they produce identical results. Definitionally, the equivalence limits define a region in which this difference is considered negligible. Since we can consider a confidence interval to indicate a range of plausible values for the “true” GMR, if the 90% CI falls entirely within this region, we can conclude that any difference is highly likely to be negligible, regardless of whether it contains one.
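Putting the pieces together, here is a minimal sketch of the equivalence test as described above: a 90% CI on the GMR, computed on the log scale and checked for containment within (0.80, 1.25). The data are illustrative, not real assay results.

```python
# A minimal sketch of the equivalence test, assuming paired relative
# potencies for the same samples measured by both methods.
import numpy as np
from scipy.stats import t

in_vivo = np.array([0.92, 1.05, 1.10, 0.85, 1.20, 0.98, 1.15])
in_vitro = np.array([0.95, 1.08, 1.12, 0.88, 1.18, 1.01, 1.13])
lower_limit, upper_limit = 0.80, 1.25  # equivalence limits

# Work on the log scale: the GMR is the exponential of the mean log ratio.
log_ratios = np.log(in_vivo / in_vitro)
n = len(log_ratios)
mean_log = log_ratios.mean()
se = log_ratios.std(ddof=1) / np.sqrt(n)
half_width = t.ppf(0.95, df=n - 1) * se  # two-sided 90% CI

gmr = float(np.exp(mean_log))
ci_low, ci_high = np.exp(mean_log - half_width), np.exp(mean_log + half_width)
print(f"GMR = {gmr:.3f}, 90% CI = ({ci_low:.3f}, {ci_high:.3f})")

# Equivalence is demonstrated only if the whole CI sits inside the limits.
if lower_limit < ci_low and ci_high < upper_limit:
    print("Equivalence demonstrated")
else:
    print("Equivalence not demonstrated")
```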
Study design considerations
As with any study, it is important to design a bridging study such that it has sufficient statistical power – in simple terms, the probability that the study will be successful (here, that the equivalence test will pass). Some of the factors which influence the statistical power must be assumed in advance in order to assess it, such as the precision of both methods and the true GMR. These assumptions are often informed by historical information on the assays or by conducting feasibility experiments. Another factor is the width of the equivalence limits. These, however, require scientific justification and should consider safety and efficacy information about the vaccine, so are often constrained.
The dial in the study design which can be most easily turned to influence the statistical power is the sample size – the number of samples whose relative potency is measured by both methods for comparison. Generally, the greater the sample size, the greater the power of the study. Of course, a larger sample size means increased resource use and, therefore, cost. This means the appropriate sample size is often the minimum which gives a desired statistical power, such as 90%.
Let’s examine a couple of scenarios to see how these factors combine to determine an appropriate sample size.
In scenario 1, an appropriate equivalence margin is determined to be (0.80, 1.25). We assume that the true GMR is 1 – that is, the methods are expected to produce identical results. We also assume that the intermediate precision (IP) – the day-to-day variability of an assay operated in a single lab, expressed as a percentage geometric coefficient of variation (%GCV) – is 25% for the in vivo assay and 7% for the in vitro assay.
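For reference, one common convention (assumed in the sketches below) relates the %GCV to the standard deviation, $\sigma$, of the log-transformed results:

$$\%\mathrm{GCV} = 100\,(e^{\sigma} - 1), \qquad \sigma = \ln\!\left(1 + \frac{\%\mathrm{GCV}}{100}\right)$$

Under this convention, a 25% GCV corresponds to $\sigma = \ln(1.25) \approx 0.223$ on the log scale.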
The plot below shows how the statistical power behaves with sample size under these assumptions:

[Figure: statistical power versus sample size for scenario 1]
We see that the statistical power initially increases swiftly with sample size, but begins to flatten off after about N=10. If we require a minimum power of 90%, our minimum sample size is 14. It is often prudent to be conservative when making the assumptions which go into a sample size calculation. This comes at the cost of a slightly inflated sample size, but means that the study is less likely to end up being unintentionally underpowered if some of the assumptions which went into the sample size calculation were faulty.
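The exact power calculation behind a plot like this depends on the chosen model; the sketch below estimates power by Monte Carlo simulation under the assumptions above (each sample measured once by each method, independent log-normal errors, and the %GCV-to-log-SD conversion defined earlier). The function name and structure are illustrative, not a specific software package.

```python
# A Monte Carlo sketch of the power calculation for the equivalence test.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)

def estimate_power(n, gcv_vivo, gcv_vitro, true_gmr=1.0,
                   limits=(0.80, 1.25), n_sim=20_000):
    """Probability that the 90% CI on the GMR falls inside the limits."""
    # SD of a single log ratio: the sample's true potency cancels in the
    # ratio, leaving the two methods' measurement errors combined.
    sigma = np.hypot(np.log(1 + gcv_vivo / 100),
                     np.log(1 + gcv_vitro / 100))
    log_ratios = rng.normal(np.log(true_gmr), sigma, size=(n_sim, n))
    mean_log = log_ratios.mean(axis=1)
    se = log_ratios.std(axis=1, ddof=1) / np.sqrt(n)
    half_width = t.ppf(0.95, df=n - 1) * se
    lo = np.exp(mean_log - half_width)
    hi = np.exp(mean_log + half_width)
    return float(np.mean((lo > limits[0]) & (hi < limits[1])))

# Scenario 1: true GMR = 1, %GCVs of 25% (in vivo) and 7% (in vitro).
# Power should approach the 90% target around N = 14 under these assumptions.
for n in (6, 10, 14, 20):
    print(f"N = {n:2d}: power ~ {estimate_power(n, 25, 7):.2f}")
```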
Scenario 2 is less ideal: our equivalence region is narrowed to (0.90, 1.11), our true GMR is expected to be 1.03, and our assays have higher variability. Specifically, the IP of the in vivo assay is 50%, while that of the in vitro assay is 15%. All of these changes will decrease the power of the experiment, meaning the sample size will need to be greater to compensate.
A plot similar to the one we examined in scenario 1 is shown below:

[Figure: statistical power versus sample size for scenario 2]
Here, we see that the statistical power of the study is effectively zero until the sample size reaches about 50. After that, it increases swiftly until gradually flattening out between N=100 and N=300. In this case, the minimum sample size required for a 90% statistical power is 276. This demonstrates how influential the precision of the methods is on the required sample size for a bridging study: doubling the variability, in combination with the other changes, has led to a 19-fold(!) increase in the minimum sample size. Bear in mind that this is the required sample size per method. A bridging study in scenario 2 would require 552 total assays to be run between the two methods.
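Under the same Monte Carlo sketch, scenario 2 is just a change of inputs (reusing the hypothetical estimate_power function defined above):

```python
# Scenario 2: narrower limits, a true GMR of 1.03, and noisier assays.
# Power should be roughly 0.9 near N = 276 under these assumptions.
power = estimate_power(276, gcv_vivo=50, gcv_vitro=15,
                       true_gmr=1.03, limits=(0.90, 1.11))
print(f"N = 276: power ~ {power:.2f}")
```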
In such a scenario, it is often necessary to be realistic about what you’re trying to prove. If you have a highly variable assay, it may be advisable to relax what is considered comparable – e.g. to widen your equivalence limits – to avoid unmanageably large sample sizes.
Strategic bridging
Moving from an established in vivo potency assay to an in vitro alternative is not simply a matter of switching technologies: it requires a rigorous statistical exercise to demonstrate that the methods are comparable. Correlation analysis can show that results from the two assays move together, but only equivalence testing can show that any differences between them are small enough to be operationally irrelevant.
Careful consideration of study design, including realistic assumptions about variability, expected difference, and the setting of scientifically justified equivalence margins, is essential for generating convincing results. As we’ve seen, the interplay between assay precision and required sample size can be dramatic, and underestimating variability can result in an underpowered study.
As with all such endeavours, it’s vital to involve statistical help as early in the study design process as possible to ensure that your bridging study is as efficient and powerful as possible. Through strategic deployment of resources in well-designed bridging exercises, it is possible to adopt faster, more precise, and more ethical in vitro methods without compromising the integrity of vaccine batch release decisions.