Dec 01

Statistical Software Validation: A Risk-Based Approach

Validation of statistical software is required for use in regulatory work. Many auditors expect to see the traditional IQ, OQ, PQ approach applied to the base system (SAS, R, etc.). This approach, however, was never mandated by the FDA, or by anyone else, and was never well suited to software validation because it ignores the inevitability of undetectable bugs in the code. The new CSA guidelines from the FDA, released in September 2022 [1], seek to re-focus validation effort on the issues of greatest importance, and go some way towards exposing the fallacy of the traditional approach.

Much has been written about the guidance, and detailed summaries can be found elsewhere (see, for example, the webinar given by Cisco Vincenty [2]). Given the repeated publications from the FDA and other regulators along similar lines, it is surprising that any reputable QA supplier still expects the old IQ, OQ, PQ approach on statistical base systems. It is time all auditors moved away from this requirement.

Under the old approach, validation required (or, at least, was understood to require) documentation and testing of everything, even aspects largely inconsequential for end products and patients. The CSA guidance, by contrast, suggests that validation effort be focused in a risk-based way. The phrase “make the rigour match the risk” captures this nicely: identify the areas of your process whose failure poses the highest risk to performance, and spend most of the time, effort, and cost of validation there. Figure 1 outlines this approach.

Figure 1: Risk analysis in a risk-based validation approach. Image from https://www.youtube.com/watch?v=bwGLh5-VTqE

What the FDA are looking for

Validation is about assuring yourself that the system is fit for purpose. It is not (or should not be) about satisfying the regulators, although some degree of box ticking will always be required. Vincenty’s webinar makes it clear that to assure yourself that the software is fit for purpose you must [2]:

  • Understand the intended use of the software. This article is concerned with statistical software for which the intended use is the creation of bespoke programs for data analysis.
  • Understand where the risk is being introduced
  • Take appropriate steps to test / manage these critical risk areas

Vincenty confirms that the FDA will be satisfied if these steps have been taken. He also makes it clear that documentation does not have to be an exhaustive step-by-step list of everything that was done [2]. Nevertheless, he notes that there should be a record demonstrating a process of appropriate testing, and a conclusion statement saying that the system was acceptable to you as an organization [2].

Identifying risk in statistical software validation


1) Category of risk

If incorrect results produced from the analysis “may result in a quality problem that foreseeably compromises safety” [1], then the use is high risk. This covers clinical trial data analysis, GMP batch release calculations, GLP toxicology analysis, and many other areas.

2) Where does the risk actually lie?

Whilst it is certain that there will be errors in say, SAS base, or R base, it is very, very unlikely that any testing done by an end user—particularly with tests provided by a vendor—will find an error in the base system, or that such errors will actually cause a problem.

An error in code in the Therac-25 radiation therapy machine led to several patients being administered lethal doses of radiation [4].

However, the analyses we are discussing as high risk require bespoke programming in these software systems to create the analysis for a specific product or clinical trial. This programming is very likely to contain errors, and some may cause materially incorrect results.

In 2019, for example, a simple categorisation error led to a trial reporting a successful outcome when, in fact, the treatment resulted in more hospital admissions. The publication was withdrawn 10 months later [5]; arguably, it should instead have been re-published as a negative trial, but that is another issue.

Even the best software engineers make mistakes at a rate of about 15–50 observable errors per 1000 lines of code [6]. Statisticians are mainly self-taught programmers, and many “statistical programmers” have limited exposure to software engineering best practice, so it seems reasonable that the error rate in bespoke statistical analyses might be even higher.

For now, though, let’s be generous and say they can program with 40 errors per 1000 lines. In a typical analysis of perhaps 6000 lines of code, that gives 240 errors. Some, perhaps many, will be found in development and testing, but some won’t be, however hard you test.
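
The arithmetic is worth making explicit. A minimal sketch, where the 90% detection rate is an illustrative (and optimistic) assumption, not a measured figure:

```python
# Back-of-envelope estimate of latent errors in a bespoke analysis.
# All rates here are illustrative assumptions for the argument above.
error_rate_per_kloc = 40   # assumed errors per 1000 lines of code
lines_of_code = 6000       # typical size of a full trial analysis
detection_rate = 0.9       # optimistic: 90% of errors found in testing

introduced = lines_of_code / 1000 * error_rate_per_kloc
remaining = introduced * (1 - detection_rate)

print(f"errors introduced: {introduced:.0f}")        # 240
print(f"errors surviving testing: {remaining:.0f}")  # 24
```

Even on these generous assumptions, a couple of dozen errors survive into the delivered analysis.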

And these are only observable errors. In a complex program, many errors arise from branching logic and are not easily detectable.

Errors that generate no result or silly numbers are easily detected: nobody has a BMI of 234.5, and “p = -25.4” is clearly an error. Errors that create incorrect but plausible numbers, on the other hand, can easily slip by. Perhaps a p value of 0.02 is reported when the correct value is 0.08, or a categorisation error has occurred, as in the trial noted above.

So, what is the point in auditors requiring validation evidence for SAS or R as a base system when consequential errors are much, much more likely to be introduced when programming an analysis? The FDA CSA approach requires software validation teams to use “critical thinking”. Even simple thinking would conclude that validating these base systems by running a handful of tests is a complete waste of time. Worse, it draws attention away from the area of real risk: errors in the analysis code itself.

Fault-Tolerant Systems

The question is, therefore, how do we gain confidence that the analysis is fit for purpose?

There are several problems here.

  • It is not possible for an auditor or validation staff to check that the 6000–10,000 lines of code written to produce perhaps 200–300 tables, listings and figures for a clinical trial all work correctly.
  • Generally, auditors and validation experts cannot verify that the results produced by such programming are correct – unless they are themselves able to re-calculate the results in a different way.

The key point is to recognise that however good your programmers, there will be errors in the programmes. These errors may or may not be important, and may only show up with particular data values.

Other industries have been tackling this issue for decades. In particular, the aerospace industry recognises that there will be errors in critical flight-safety software and tries to build systems that are “fault tolerant”. That is, when an (inevitable) fault occurs, the critical systems do not crash. Rather, the error is detected and safely handled in real time by the software.

So, a validation expert or auditor displaying “critical thinking” and a risk-based CSA approach should be asking to see evidence of how the analysis programmes handle the inevitable errors. This is validation in real time. They should not be wading through IQ, OQ, PQ documentation of SAS or R base systems.

There is a huge amount of literature on fault tolerant computing and many ways of approaching the issue.  At Quantics Biostatistics – the biostatistics consultancy I co-founded in 2002 – we use Diverse Self-Checking Pair Programming (DSCPP) on all regulatory work. In essence, all analyses are carried out twice, in completely different systems, using different mathematical processes and staff, and the results must match to be acceptable.

DSCPP does not eliminate errors, but goes a long way to create a “fault tolerant” analysis so that the inevitable errors do not propagate through to end results and potential harm to patients. Instead, the errors are caught, in real time, on the data actually being used for the results, and can then be explored and corrected.
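
As a toy illustration of the self-checking idea: in real DSCPP the two routes use entirely different software systems, mathematical approaches and staff, whereas the hypothetical sketch below keeps both routes in Python purely to show the comparison step.

```python
import statistics

# Toy sketch of a diverse self-checking pair: one quantity computed
# by two independent routes, with the results required to agree.
def sample_variance_direct(xs):
    # Route A: two-pass textbook formula, written by hand
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_variance_library(xs):
    # Route B: an independent implementation (here, the stdlib)
    return statistics.variance(xs)

def cross_checked_variance(xs, tol=1e-9):
    """Return the variance only if both routes agree within tol."""
    a = sample_variance_direct(xs)
    b = sample_variance_library(xs)
    if abs(a - b) > tol:
        raise RuntimeError(f"self-check failed: {a} vs {b}")
    return a
```

A bug in either route shows up as a disagreement on the actual trial data, in real time, rather than propagating silently into the reported results.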

Quantics call this process Continuous real-time Validation™, and consider it perhaps the ultimate expression of the principles of Computer Software Assurance.

An example where such a system might have averted disaster is the Boeing 737 Max crashes [7]. These occurred, in part, because fault-tolerant computing principles were not followed, though in this case the “bug” was really a major design flaw.

There were two identical sensors measuring the aircraft’s angle of attack, and it was naturally expected that their readings would agree.

The software, however, used only the readings of the left-hand sensor (red in Figure 2), which was faulty and indicated that the angle of attack was too high. The software forced the nose of the aircraft down to correct this and blocked pilot inputs.

Had DSCPP concepts or another fault-tolerant system been used, the disagreement would have been caught and control returned to the pilots. (Note that other training issues also contributed to the sensor error leading to the crash.)
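
A sketch of the missing cross-check might look like the following. The function name, thresholds and correction logic are purely illustrative, not Boeing’s actual control law: the point is only that automation should disable itself when redundant inputs disagree.

```python
# Hypothetical sketch of cross-checking redundant sensors before
# allowing an automated correction. All values are illustrative.
def commanded_correction(aoa_left: float, aoa_right: float,
                         max_disagreement: float = 5.0):
    """Return an automated pitch correction (degrees), 0.0 if none
    is needed, or None when the sensors disagree and control should
    be handed back to the pilots."""
    if abs(aoa_left - aoa_right) > max_disagreement:
        return None  # sensors disagree: disable automation
    mean_aoa = (aoa_left + aoa_right) / 2
    # Illustrative rule: push the nose down only above a threshold
    return -0.5 * mean_aoa if mean_aoa > 15.0 else 0.0

# A wildly faulty left sensor is caught by the cross-check
assert commanded_correction(74.5, 15.0) is None
```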

Figure 2: Faulty angle of attack readings from a Boeing 737 Max. Image from [7]

Implementing fault-tolerant programming design is the best way to manage risk in statistical software validation. Even if the risk is high, trying to find every bug in thousands of lines of code is not realistically possible: the rigour required would be near infinite. Instead of trawling through reams of validation documentation, validation experts and auditors should be considering how the analysis will be robust to the inevitable bugs and errors. It is time for software validation in this industry to catch up with the expert world outside.

CrtV™ is included in QuBAS, Quantics’ powerful bioassay statistics package. Find out more about how CrtV™ can help streamline your workflow.

References

  1. US Food and Drug Administration. Computer Software Assurance for Production and Quality System Software: Draft Guidance for Industry and Food and Drug Administration Staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/computer-software-assurance-production-and-quality-system-software (2022).
  2. Vincenty, C. Computer Software Assurance (CSA): Understanding the FDA’s New Draft Guidance. (Greenlight Guru via YouTube, 2022).
  3. Harford, T. & McDonald, C. High-frequency trading and the $440m mistake. BBC News https://www.bbc.co.uk/news/magazine-19214294 (2012).
  4. Johnston, P. Historical Software Accidents and Errors. Embedded Artistry https://embeddedartistry.com/fieldatlas/historical-software-accidents-and-errors/ (2019).
  5. Aboumatar, H. & Wise, R. A. Notice of Retraction. Aboumatar et al. Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease: A Randomized Clinical Trial. JAMA. 2018;320(22):2335-2343. JAMA 322, 1417–1418 (2019).
  6. McConnell, S. Code Complete. (Microsoft Press, 2004).
  7. Sieker, B. Boeing 737MAX: Automated Crashes. media.ccc.de https://media.ccc.de/v/36c3-10961-boeing_737max_automated_crashes#t=780 (2019).

About The Author

Ian Yellowlees has an engineering degree and experience in software engineering, and is also fully medically qualified, with 20+ years’ experience as an NHS consultant. He developed Quantics’ unique ISO9001 and GXP quality management system and provides business management and medical support to Quantics.