
Testing Diagnostics: Qualifying In Vitro Devices


If you’ve ever taken a rapid test for COVID-19, then you’ve used an in vitro diagnostic device (IVD). At the start of the pandemic, the only reliable method for determining whether a patient was currently infected with COVID was a slow PCR test, which required processing in a lab to return a result. By contrast, a rapid test directly informs the user of their COVID status in as little as fifteen minutes. Rapid tests, while not as reliable as PCR, meant that the correct response – even if that was no response – could be made far faster. While many – but by no means all – IVDs are used in a hospital setting, their use often means testing can be performed faster and less invasively than with traditional methods. IVDs are now available for a wide range of conditions, from commonplace at-home pregnancy tests and blood glucose monitors for diabetics to cutting-edge cancer detection systems.

Here, we’re going to examine the process of testing an IVD, discussing how the statistics involved differ between a standard clinical trial and an IVD trial.

Key Takeaways

  • Unlike drug trials, which assess treatment efficacy and safety through patient outcomes, in vitro diagnostic device (IVD) trials primarily focus on evaluating the device’s accuracy, precision, and reliability in measuring specific biological markers.

  • Developers must carefully choose whether to prioritise the sensitivity or specificity of the device when considering its design. This will depend on the clinical use-case of the diagnostic test.

  • Techniques such as variance components analysis, reference interval determination, and method comparison studies are key for assessing device performance and ensuring reliable diagnoses.

Two Trials

IVD trials are different from a typical clinical study, such as the evaluation of a drug product. Key to a drug product being licensed for market is an evaluation of its efficacy at treating the target condition. We also need to check that the drug is safe, spot any side effects, and understand its pharmacological properties. These are tested by administering the new treatment to a group or groups of subjects and repeatedly assessing their response over the course of the trial. This usually takes place over a series of four phases:

  • Phase 1: Tests dosing and safety in a small group of healthy volunteers. Pharmacological information (e.g. pharmacokinetic (PK) and pharmacodynamic (PD) data) is also assessed.
  • Phase 2: Evaluates short-term safety and efficacy in a small population of patients with the targeted condition as a proof of concept before a larger trial.
  • Phase 3: The “pivotal” clinical trial. Here, the product is administered to a large number of patients with the targeted condition, whose response is evaluated over a longer period of time to assess long-term safety and efficacy. In most cases, the new drug is compared with the current standard of care or a placebo as a control arm.
  • Phase 4: Post-market monitoring to assess the safety and efficacy of the product over extended time periods.

The choice of endpoint for evaluating the efficacy will vary by the study, but common examples might be subject survival at a set date after administration, changes on an approved quality of life scale, or changes in growth of tumours. If the study has a control arm, then the eventual goal will be to determine the treatment effect by comparison to the control arm. Simultaneously, subjects will be assessed for any side effects of the treatment – known as adverse events – the number and severity of which serve as a measurement of the safety of the new drug.

Clinical trials are governed by Good Clinical Practice (GCP). These are guidelines which outline required approaches to ethics and quality on the part of all stakeholders and participants, as well as ensuring that the trial follows a scientifically rigorous process.

IVD trials require a somewhat different approach. The eventual goal is still to prove that the device provides a safe and effective way to diagnose the target condition, but this takes a different form. The focus is on testing the IVD’s ability to reliably assess one or more properties of collected samples, known as measurands. The scientific validity of using these measurands to diagnose the target condition should be demonstrated for each device. The process to take an IVD from a prototype to a marketable device looks like this:

  • Proof of concept: Test the basic feasibility of the device, including determining its key performance characteristics, how samples will be collected and stored, and the methods by which measurands will be evaluated.
  • Analytical Validity: Test the device’s ability to accurately and precisely measure measurands in relevant samples.
  • Clinical Validity: Test whether the measurements of the IVD are capable of reliably diagnosing the target condition in the target population.

In many ways, even though testing an IVD falls under the remit of GCP, this process looks more similar to validating a bioassay than to the usual clinical trial process for a drug product. Our goal is to assess the measurement capabilities of the device, rather than examining the outcomes of patients under a certain treatment. From a statistical perspective, this requires a different set of tools than a standard clinical trial.

Key quantities

In the early stages of developing and testing an IVD, it is crucial to establish the key metrics by which the performance of the device can be evaluated. These often include:

Accuracy: Whether the measurements produced by the device are close to the “true” values of the measurands. If there is evidence that the device systematically over- or under-estimates results over a number of measurements, then we say that it is biased in that direction.

Precision: Whether the measurements made by the device show low variability. If multiple samples ought to produce identical results for a certain measurand, then a high-precision device would produce results for those samples which are close together.

Almost all IVDs will need to be assessed for their accuracy and precision before they can be approved for market.

Analytical Sensitivity: The true positive rate of the device. A device with high sensitivity is able to detect small changes in the value of a measurand between samples, meaning that it is likely to correctly identify positive samples.

Analytical Specificity: The true negative rate of the device. A device with high specificity is able to differentiate signal from background noise, meaning it is likely that negative samples will be identified correctly.

A highly sensitive device gives few false negatives, but we would expect an increased false positive rate. By contrast, a highly specific device gives few false positives, but is more likely to give false negatives. In an ideal world, an IVD would be both highly sensitive and highly specific, but these two properties are often difficult to obtain simultaneously. For example, if a device is able to detect small changes in a measurand, then more of the background noise from the sample will also be detected, making the device less specific as it is harder to differentiate the signal from the noise.
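
To make these rates concrete, here is a minimal sketch in Python computing sensitivity and specificity from confusion-matrix counts. Every number is invented purely for illustration.

```python
# Hypothetical confusion-matrix counts for a diagnostic device;
# all numbers here are invented for illustration.
true_positives = 90    # diseased samples correctly flagged positive
false_negatives = 10   # diseased samples the device missed
true_negatives = 950   # healthy samples correctly flagged negative
false_positives = 50   # healthy samples incorrectly flagged positive

sensitivity = true_positives / (true_positives + false_negatives)  # true positive rate
specificity = true_negatives / (true_negatives + false_positives)  # true negative rate

print(f"Sensitivity: {sensitivity:.1%}")  # 90.0%
print(f"Specificity: {specificity:.1%}")  # 95.0%
```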

As a result, devices are often optimised for one or the other depending on their purpose. If an IVD is intended to be used for screening before further testing, then sensitivity is usually preferred. For these devices, a false positive can usually be corrected at a later stage, while a false negative might lead to significant harm to a patient if a condition is missed. Examples include a COVID-19 or pregnancy test.

Devices which are used for confirmatory tests, such as those used to identify the nature of specific cancers, are usually optimised for specificity. A false negative will likely lead to further investigation, meaning the error can be caught, but a false positive may lead to the incorrect course of treatment being undertaken causing unnecessary harm to the patient.

Statistical methodologies

Once the metrics by which the IVD will be evaluated have been chosen, the process of collecting data to perform the required tests can begin. The nature of the data collection process will vary depending on the type of IVD being tested. In many cases, for example, most of the ethical considerations at the core of clinical trials, such as informed consent, do not apply to IVD trials, as no direct intervention is made on patients. Trials for some devices, such as certain types of implants, will nevertheless need to consider such concerns in their planning.

Along similar lines, many IVD trials do not require the tracking of adverse events in the same way as a clinical trial. As, in many cases, there are no interactions with patients beyond the collection of samples, we do not expect participants to experience significant side effects. This means the evaluation of safety for IVDs takes on a different form.

As it simplifies the overall trial, sample collection is often performed in a manner which avoids the need for informed consent where possible. Such samples are often left over from routine clinical treatment: material which would otherwise have been discarded and whose provenance is unknown and unidentifiable by the analysts or trial coordinators. In other cases, particularly where the target condition has low prevalence in the population, samples may be spiked with known amounts of analyte for use in the study.

Once sample collection is complete, the testing of the device can begin. Let’s look at some of the key statistical studies which form part of the testing process.

Precision

As mentioned previously, the precision of an IVD is a key characteristic which is usually required to be assessed during testing. This is not as simple as running a bunch of samples and seeing how close the results are, however. It is important to set up a study plan which allows an investigator to understand the effects of a range of factors on the variability of results. These studies can encompass a single site or several: for simplicity, we’ll consider an example which takes place at a single site using a single device.

A standard study design for a single site is known as a 20x2x2 design: two replicates of each sample are analysed in each of two runs per day, over a period of 20 days (which need not be consecutive). This allows two major components of the variability of the results to be drawn out:

  • Repeatability: Also known as within-run precision, this is a measure of the natural underlying variability in the results produced by the device. It is the variability observed when multiple replicates of the same sample are measured in the same run, in quick succession, using the same device operated by the same analyst. It can be thought of as the “background” noise of the device.
  • Within-laboratory Precision: This is the variability which comes about due to the day-to-day operations which would be expected in a laboratory, such as due to different analysts operating the device or the device being used at different times of the day. This is important to understand for real-world use of the device.

These, and any factor-specific effects, can be extracted from the collected data using a statistical technique known as variance components analysis (VCA). This examines trends in the data and associates them with changes in different factors, such as the device operator or the time of day a sample is run. Several statistical methods are available to perform a VCA, one example being Analysis of Variance (ANOVA).

Using this process, an estimate of the overall precision for each of the measurements of the IVD can be found, along with the variance associated with any important factors. The precision estimate is often expressed as a percentage coefficient of variation, or %CV.
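
As an illustration of how these quantities might be computed, the sketch below simulates a 20x2x2 study and estimates the repeatability and within-laboratory precision using classical nested-ANOVA mean squares, reporting each as a %CV. The data and underlying variance values are invented; a real analysis would follow the method specified in the study protocol, such as a mixed-effects model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_runs, n_reps = 20, 2, 2

# Simulate a 20x2x2 study: a true value of 100 plus day, run and
# replicate effects (all SDs invented for illustration).
day_effect = rng.normal(0, 1.5, size=n_days)
run_effect = rng.normal(0, 1.0, size=(n_days, n_runs))
noise = rng.normal(0, 0.8, size=(n_days, n_runs, n_reps))
y = 100 + day_effect[:, None, None] + run_effect[:, :, None] + noise

# Mean squares for the balanced nested design.
grand_mean = y.mean()
day_means = y.mean(axis=(1, 2))
run_means = y.mean(axis=2)
ms_day = n_runs * n_reps * np.sum((day_means - grand_mean) ** 2) / (n_days - 1)
ms_run = n_reps * np.sum((run_means - day_means[:, None]) ** 2) / (n_days * (n_runs - 1))
ms_err = np.sum((y - run_means[:, :, None]) ** 2) / (n_days * n_runs * (n_reps - 1))

# Variance components, truncated at zero.
var_rep = ms_err                                           # repeatability
var_run = max((ms_run - ms_err) / n_reps, 0.0)             # between-run
var_day = max((ms_day - ms_run) / (n_runs * n_reps), 0.0)  # between-day

within_lab_sd = np.sqrt(var_rep + var_run + var_day)
print(f"Repeatability %CV:     {100 * np.sqrt(var_rep) / grand_mean:.2f}%")
print(f"Within-laboratory %CV: {100 * within_lab_sd / grand_mean:.2f}%")
```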

It is also possible to calculate a confidence interval for the precision estimate. The results of the precision analysis should then be compared against the validation criteria defined in the study protocol.

Reference intervals

The results produced by an IVD tell us something empirical about the physical state of a patient: the concentration of a certain substance in their blood, for example. This does not, however, in itself diagnose the patient with anything. We must first determine what is “normal” for that measurement, and what values indicate that something is amiss.

As a very basic example, imagine we suspect someone is running a fever, and we take their temperature using a thermometer. If our measurement falls somewhere roughly in the normal range of 36-38°C, we have no evidence of a fever. If, however, the measurement falls above this range, our diagnostic device has provided this evidence. Without the knowledge of the normal range of the key measurands, we would have no way of inferring a diagnosis from the results provided.

In the same way, it is important to understand the range of values we would expect to see for the measurands in a healthy subject. These are known as reference intervals. If a measurement falls outside of the reference interval for a certain property, this can be interpreted as evidence towards the diagnosis of the target condition.

To find appropriate reference intervals, a series of measurements are taken from a reference population of healthy individuals. The population is sometimes broken down to form reference intervals for certain important subpopulations – male and female subjects being a good example.

Once the measurements are taken, the results are ordered by magnitude, and the reference interval set to exclude chosen percentiles at the top and bottom of the range of values observed. For example, a typical choice is to include 95% of the values observed among the reference population. So, for a two-sided measurement, we would set the lower reference value at the 2.5th percentile and the upper reference value at the 97.5th percentile. Excluding some of the normal range of observed values increases the sensitivity of the device: we may get more false positives than if we had included the full range, but we can be more confident that the result for someone with the target condition will fall outside the reference interval.
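
As a minimal sketch, this nonparametric percentile approach takes only a few lines of Python. The “healthy” results below are simulated; in practice, guidelines typically call for a large reference population (often 120 subjects or more) for a nonparametric estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
healthy_results = rng.normal(5.0, 0.6, size=240)  # simulated healthy-population measurements

# Central 95% of the reference population: 2.5th to 97.5th percentile.
lower, upper = np.percentile(healthy_results, [2.5, 97.5])
print(f"95% reference interval: {lower:.2f} to {upper:.2f}")

# A new result is flagged as evidence towards a diagnosis if it
# falls outside the interval.
patient_result = 6.9
print(f"Outside reference interval: {patient_result < lower or patient_result > upper}")
```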

Method comparison

A further key stage in testing an IVD is to compare it to the current standard practice for evaluating the measurands of interest. The intention of such a method comparison is to estimate any bias in the measurement relative to the current best option. This is a measurement of the accuracy of the IVD.

Recall that accuracy describes how close a measurement is to the “true” value. In most cases, however, the “true” value of a measurand can never be accessed: it can only ever be inferred using a measurement device which itself has variability and bias. That means that the only way we can assess the accuracy of a new device is by comparison either with a standard measurement or the currently accepted best practice. Bias in this case is thus defined as the difference between the value measured by the IVD and the comparator method.

To perform a method comparison study, samples are analysed using both the IVD and the comparator method. The agreement between the two methods is then found using a Bland-Altman analysis. The difference and average of each pair of measurements from the two methods is found, and these are then plotted on a scatter graph. This shows how the bias in the IVD measurement varies across a range of measured values – often the clinically reportable range for the measurand. Figure 1 shows examples of Bland-Altman plots for artificial data. For an unbiased measurement, we expect the scatter to be centred on a difference of zero: there are roughly as many measurements where the IVD gave a result higher than the comparator as measurements where the reverse was true. If, by contrast, the scatter is centred above or below zero, this is an indication of a systematic bias in the IVD measurement with respect to the comparator.
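
A minimal Bland-Altman sketch on simulated paired measurements might look like the following. The constant bias of 0.8 is injected deliberately, and the 95% limits of agreement (mean difference ± 1.96 standard deviations) are plotted as a standard companion to the mean bias.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
comparator = rng.uniform(10, 50, size=60)             # hypothetical comparator results
ivd = comparator + 0.8 + rng.normal(0, 1.2, size=60)  # IVD with a constant bias of 0.8

means = (ivd + comparator) / 2  # average of each pair of measurements
diffs = ivd - comparator        # difference of each pair

bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)  # half-width of the 95% limits of agreement

plt.scatter(means, diffs)
plt.axhline(bias, label=f"mean bias = {bias:.2f}")
plt.axhline(bias + loa, linestyle="--", label="95% limits of agreement")
plt.axhline(bias - loa, linestyle="--")
plt.xlabel("Mean of the two methods")
plt.ylabel("IVD minus comparator")
plt.legend()
plt.show()
```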

An alternative or complementary approach to measuring the bias is to plot the measurements from the two methods against each other directly. A regression technique, such as Deming regression or Passing-Bablok regression, can then be used to fit a linear model to the data. If there is zero bias, we would expect the data to be well fitted by the line y = x. If a constant systematic bias exists, however, the y-intercept of the fitted line will be shifted away from zero. Further, if the bias changes with the magnitude of the measurement (a proportional bias), then we would observe a gradient different from one.
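
For illustration, Deming regression with an assumed error-variance ratio of one (equivalent to orthogonal regression) has a simple closed-form solution. The sketch below fits it to simulated data with a deliberate proportional and constant bias; Passing-Bablok regression would require a different, rank-based procedure.

```python
import numpy as np

def deming_fit(x, y, delta=1.0):
    """Closed-form Deming regression: returns (slope, intercept).

    delta is the assumed ratio of the error variances of y and x;
    delta = 1 corresponds to orthogonal regression.
    """
    x_mean, y_mean = x.mean(), y.mean()
    s_xx = np.sum((x - x_mean) ** 2)
    s_yy = np.sum((y - y_mean) ** 2)
    s_xy = np.sum((x - x_mean) * (y - y_mean))
    slope = (s_yy - delta * s_xx
             + np.sqrt((s_yy - delta * s_xx) ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    intercept = y_mean - slope * x_mean
    return slope, intercept

rng = np.random.default_rng(2)
comparator = rng.uniform(10, 50, size=60)                    # hypothetical comparator results
ivd = 1.05 * comparator + 0.5 + rng.normal(0, 1.0, size=60)  # proportional and constant bias

slope, intercept = deming_fit(comparator, ivd)
print(f"Fitted line: y = {slope:.3f}x + {intercept:.3f}")
# Zero bias would give a slope of ~1 and an intercept of ~0.
```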

In a similar way to the assessment of precision, a confidence interval can be found for the calculated bias of the IVD. In some cases, this requires resampling techniques such as bootstrapping, as the confidence interval cannot always be calculated analytically. Once confidence limits are found, the results can be compared against pre-determined acceptable limits for the bias of the IVD.
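
A percentile bootstrap for the mean bias can be sketched in a few lines: resample the paired differences with replacement many times, then take the 2.5th and 97.5th percentiles of the resampled means as an empirical 95% confidence interval. The data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
comparator = rng.uniform(10, 50, size=60)
ivd = comparator + 0.8 + rng.normal(0, 1.2, size=60)
diffs = ivd - comparator  # paired differences between the methods

# Resample the differences with replacement and record each mean.
n_boot = 10_000
boot_means = rng.choice(diffs, size=(n_boot, diffs.size), replace=True).mean(axis=1)

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Estimated bias: {diffs.mean():.2f} (95% bootstrap CI: {lower:.2f} to {upper:.2f})")
```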

A unique approach

The key metrics we’ve highlighted above demonstrate that the evaluation of an IVD is indeed often somewhat different from the usual clinical trial process. Testing the precision of the IVD and comparing it to the current standard diagnostic practice to check for bias are key to assessing its analytical validity, while setting robust reference intervals is a vital step towards establishing the device’s clinical validity.

As technology continues to advance, new diagnostic methods and devices are certain to arise, not least with the ever-increasing presence of AI across our lives. These innovations promise faster and less invasive methods of detecting the conditions which pose the greatest threat to our health and quality of life. As with any study of this kind, early statistical involvement can simplify the process of testing an IVD. By identifying and providing solutions to the inevitable hurdles in the path between prototype and patients, statistical support can be the key to a successful product.


About the Authors

  • Sandra

    Sandra joined Quantics in 2017. She has a PhD and a Masters, both in Mathematics, from the University of Bonn in Germany. Since joining Quantics, Sandra has been a key member of our Clinical, Bioassay and HTA teams and is the responsible statistician for many of our key client clinical trials for medical devices and pharmaceuticals.

  • Jason

    Jason joined the marketing team at Quantics in 2022. He holds master's degrees in Theoretical Physics and Science Communication, and has several years of experience in online science communication and blogging.
