Unless you’ve been hiding under a rock for the past few months (and, honestly, who could blame you), you will have interacted with artificial intelligence (AI) in some form. Whether it’s using ChatGPT to write an email to that one client you wish would go to your competitor or bemoaning the vagaries of the TikTok algorithm, AI has become a fact of modern life whether we like it or not.
It is no surprise, therefore, that AI is becoming ever more prevalent within fields such as medicine. This is far from a bad thing: AI can mean improved diagnoses, personalised care, and, ultimately, better outcomes for patients. From a regulatory perspective, however, AI-based medical devices can find themselves falling foul of regulations written for older, simpler technologies. We want to examine how a study might be designed to set an AI-based device on a path to regulatory approval, and to discuss how the regulatory environment might evolve to be more inclusive of AI-based medicine.
Machines which learn
The first thing to note here is that we are not talking about using Skynet on hospital wards. The type of AI which is seeing increasing use throughout pharma and medicine is known as Machine Learning (ML). The purpose of ML is to make sense of datasets which are either very large, very complex, or both. Unlike humans, computers are not easily overwhelmed by a large quantity of complex data, meaning they are often able to find patterns in that data which would be difficult, or at least time-consuming, for a person to locate.
Social media feeds work on this principle. To the Facebook algorithm, for example, you are but a list of datapoints detailing your age, nationality, and preferences, among a vast array of other factors (feel free to go change your privacy settings…). These are used as inputs to the algorithm, which then outputs to your feed the content it predicts is most likely to “engage” you.
Medical ML works in a similar way. Instead of data about your recent likes and political views, an ML-based device takes in data about your body to determine a defined endpoint. An example we recently worked on at Quantics involved a device which took in a stream of data about a patient’s normal breathing and used it to determine their maximum lung capacity. This has the advantage of being far less invasive than a normal test of lung capacity, while also not relying on being able to communicate instructions to the patient. And it’s a task only a computer could undertake: it would be very difficult for a human to consistently and quickly correlate such a vast array of interconnected datapoints with an accurate result.
Some quick definitions before we move on. We will refer frequently to the idea of an ML model. This is the summary of our beliefs about the relationship between the data we have collected and the endpoint we care about. An extremely simple model might be one you would see in a high school classroom, such as a linear or exponential model, but the problems solved by ML are likely to require far more complex models in reality.
All models include parameters—such as the coefficients of a linear or exponential model—which are tuned during training to maximise how well the model predicts the endpoint based on input data. We refer to this as performance: a better-performing model will predict the endpoint more accurately than a worse-performing model.
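To make this concrete, here is a minimal sketch in Python of what a model and its parameters might look like. Everything here is invented for illustration, the breathing features, the coefficients, and the data alike; a real device would use far more complex inputs and model classes.

```python
import numpy as np

# A toy "model": predicted lung capacity as a linear function of two
# invented breathing features. The coefficients a, b and intercept c
# are the model's parameters.
def predict_capacity(breath_depth, breath_rate, params):
    a, b, c = params
    return a * breath_depth + b * breath_rate + c

# Invented training data: input features plus the measured endpoint.
rng = np.random.default_rng(0)
depth = rng.uniform(0.3, 1.0, size=50)
rate = rng.uniform(10, 25, size=50)
capacity = 4.2 * depth - 0.05 * rate + 2.0 + rng.normal(0, 0.1, size=50)

# "Training" here is an ordinary least-squares fit of the parameters.
X = np.column_stack([depth, rate, np.ones_like(depth)])
params, *_ = np.linalg.lstsq(X, capacity, rcond=None)

print(predict_capacity(0.7, 18, params))  # prediction for a new patient
```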
Train hard, play hard
Before you can start thinking about designing an ML device, you first need to collect a large amount of information about the inputs and the endpoints you’re interested in predicting. In our lung capacity example, this would involve measuring both the breathing and the maximal lung capacity of a number of patients.
The next step is choosing a class of model: how will your process “learn”? There are many different types of model used in ML, perhaps the most famous being the neural network, loosely inspired by the interconnections of neurons in the brain. All, however, require a similar process to design and test. What follows here is a high-level overview of that process: the gory details can be found elsewhere!
An ML model is trained using the collected data. This is where the model is optimised, by tuning its parameters, such that, when new data is given as input, the output is predictive of the endpoint of interest. Importantly, the human operator is unlikely to know the form of this relationship in advance, or even to be able to interpret the fitted model after training.
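As a hedged illustration of what training looks like in code, the sketch below fits a scikit-learn random forest to invented data. The random forest simply stands in for whatever model class a real device might use; the point is that the fitted structure is tuned by the algorithm rather than specified by the operator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented data: 200 patients, 10 breathing-derived features each,
# with a measured lung capacity as the endpoint.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

# Training tunes the model's internal parameters to fit these data.
# The fitted structure (here, hundreds of decision trees) is chosen
# by the algorithm, not written down by the operator.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

print(model.predict(X[:1]))  # predicted endpoint for one patient's inputs
```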
During this process, however, we need to be wary of overfitting. The data used to train the model is, by necessity, a subset of all the possible data the model could ever encounter. There will be nuances present in that data, whether systematic or random, and the model could end up performing very well on those nuances but very poorly on data from the real world. This is especially likely if multiple iterations of the model are trained on the same dataset.
For example, imagine you were asked to create a model which can differentiate between cars and buses based on colour. If you trained that model using camera feeds from Oxford Street—containing only London’s black cabs and red buses—your model would likely predict that every black vehicle was a car and every red vehicle a bus when exposed to a more typical street. The model would be overfit to the Oxford Street training set.
One way to mitigate overfitting is to reserve some of your collected data so that it is never seen by the model during training. If the trained model still performs well on this testing dataset, then it is more likely to perform well in the real world too.
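The effect is easy to demonstrate. In the toy sketch below (all data invented), a wildly flexible model fits its small training set almost perfectly but falls apart on held-out data, while a simple model generalises:

```python
import numpy as np

# Invented data: a simple linear relationship plus noise.
rng = np.random.default_rng(2)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0, 1, 100)          # data the models never see
y_test = 2 * x_test + rng.normal(scale=0.2, size=100)

# A degree-9 polynomial can thread through all 10 training points;
# a straight line cannot. Compare both on the held-out test data.
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```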
This can, however, just push the overfitting back one stage, since the programmer can learn which classes of model tend to perform well on the test data. Other classes of model, which might be better when presented with real-world data, might never even make it to testing.
A further mitigation step, therefore, is to split the data up even further to include a set reserved for validation. This is used as infrequently as possible as a proxy for real-world data; only models which have passed all other stages of testing should be let loose on it. This can be difficult: there’s a chance that a model which performs excellently on the testing data performs terribly on the validation set, sending the whole process back to the drawing board.
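In code, the full split might look something like the sketch below. The 60/20/20 proportions are a common convention rather than a rule, and the data are invented:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data standing in for collected patient data.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)

# First carve off 40% of the data, then split that half-and-half:
# 60% training, 20% testing, 20% validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate models are trained on (X_train, y_train) and compared on
# (X_test, y_test); only the final chosen model is ever scored on
# (X_val, y_val), and only once.
```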
Regulating AI Devices
So, our process for ML device design goes something like train, test, validate. “Happy days”, says the ML engineer, who is probably already down the pub. “Not so fast…”, say the regulators. As in many fields, the regulations do not quite match up with the cutting edge of technological development. The standard ML train, test, and validate procedure is often not sufficient to gain regulatory approval.
There are two main clashes between ML design and the regulatory approval process:
- ML models can change through routine use. One of the powerful advantages of ML models is that they can use the data they encounter in routine use to refine themselves further. This, however, means that the parameters of the model necessarily change, which is a problem when seeking regulatory approval: the new versions of the model are different to the one which was approved.
- Validation ≠ validation. The validation step of the ML design process is often not sufficient to fulfil the validation requirements for regulatory approval. Specifically, it does not prove, to the regulators’ satisfaction, that the device will be effective when used with real-world patients, because it draws on the same dataset which was used to train and test the model. Never mind that it may have been carefully reserved until the very end of the design process: regulators still consider it insufficiently representative of real-world use.
So, how do regulators typically approve ML devices? In short: by making them look as much like normal devices as possible. The model is completely locked down (the parameters must remain unchanged from when the model was trained and tested) and the entire device is treated like a “black box”: only the input and output data are considered, and nothing in between matters. Regulators then expect the device to undergo a full validation study with new data collected from real patients.
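A minimal sketch of what this lock-down might look like in practice, assuming a Python-based device; the model, file names, and data are all stand-ins:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a stand-in model on invented development data.
rng = np.random.default_rng(4)
X_dev, y_dev = rng.normal(size=(100, 5)), rng.normal(size=100)
model = LinearRegression().fit(X_dev, y_dev)

# "Locking down": serialising the fitted model freezes its parameters.
with open("locked_model_v1.pkl", "wb") as f:
    pickle.dump(model, f)

# --- later, in the validation study ---
# The device is now a black box: only inputs and outputs matter.
with open("locked_model_v1.pkl", "rb") as f:
    locked_model = pickle.load(f)

X_new_patients = rng.normal(size=(20, 5))   # freshly collected data
predictions = locked_model.predict(X_new_patients)
# ...compare predictions against each patient's measured endpoint...
```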
Playing the Game
Given the regulators’ approach, one of the most important considerations to be made at the outset of ML device design is when to lock down the model to undergo validation with real-world patient data. One way to do this would be to write two separate study proposals: one to gather data for the training and testing of the model, and a second to gather validation data once that process is complete.
An alternative approach would be to design a single study with a built-in pause for model testing and training once a certain amount of data has been collected. The study would then enter a validation phase using a locked-down model following the pause. This has the cost-saving advantage of only needing to design and write a single study rather than two.
Indeed, while it would save resources to avoid additional data collection for validation altogether, having multiple data collection phases can be useful. In particular, the amount of data (the sample size) required for validation can be predicted more accurately once the model is better understood following testing. If all the data is collected before the model is trained and tested, the validation dataset can end up being too small, or indeed unnecessarily large.
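As one illustration of how that prediction might be made (our own back-of-the-envelope approach, not a regulatory prescription), the validation set could be sized so that a key performance proportion is estimated with a given precision:

```python
import math

def validation_sample_size(expected_prop, margin, z=1.96):
    """Normal-approximation sample size so that a proportion (e.g. the
    fraction of predictions within a clinically acceptable error) is
    estimated to within +/- margin with ~95% confidence."""
    return math.ceil(z**2 * expected_prop * (1 - expected_prop) / margin**2)

# If testing suggests ~90% of predictions land within tolerance, and we
# want to pin that down to within 3 percentage points:
print(validation_sample_size(0.90, 0.03))  # -> 385 patients
```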
It can also be a good idea to build a feasibility pause into the early stages of the ML development process. There is no guarantee that an ML model will be able to solve every problem, so it is wise to attempt to train a model on a small amount of data, to check the approach is viable, before collecting the amount of data required for full training and testing.
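A feasibility check might look something like the sketch below, where the pilot sample size and the performance bar are both invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Invented pilot data: 40 patients rather than the hundreds a full
# study might need. The 0.5 R-squared bar is a made-up threshold.
rng = np.random.default_rng(5)
X_pilot = rng.normal(size=(40, 10))
y_pilot = X_pilot @ rng.normal(size=10) + rng.normal(scale=0.5, size=40)

scores = cross_val_score(RandomForestRegressor(random_state=0),
                         X_pilot, y_pilot, cv=5, scoring="r2")
if scores.mean() > 0.5:
    print("Signal found: proceed to full data collection.")
else:
    print("Rethink the approach before funding a full study.")
```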
Moving Forward: Building regulations for AI-based devices
As we mentioned earlier, the current regulatory approval process is generally designed for static devices. This can limit the capabilities of ML devices. In particular, the requirement to lock down models before validation means that one of the key advantages of ML is lost: the ability to tune models to particular population subsets and even, potentially, to the individual.
Take our maximal lung capacity example. Imagine we wanted to use the device in hospitals both at sea level in Florida and at altitude in Colorado. We might expect the correlations between breathing and lung capacity to differ between Floridians and Coloradans, meaning the ML model would be less predictive for the latter. If the model were allowed to evolve with routine use, however, the model used in Colorado hospitals could account for these differences. This process could be replicated for every population in which the device was used, giving more effective predictions than a locked-down model could ever hope for.
There would, clearly, still need to be some form of regulation for such processes. One way this could be implemented would be a well-defined procedure by which the model could be retrained. For example, a model could be retrained by a pre-determined process every six months, or after a certain number of patients had been treated using the device. This would limit the speed at which the model evolves, making it easier to keep in check.
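Such a procedure could be as simple as a pre-registered trigger. The sketch below is purely illustrative; the interval, patient count, and surrounding process are our assumptions, not anything prescribed by a regulator:

```python
from datetime import date, timedelta

# Hypothetical trigger values for a pre-determined retraining schedule.
RETRAIN_INTERVAL = timedelta(days=182)   # roughly six months
RETRAIN_PATIENT_COUNT = 1000

def retraining_due(last_retrained: date, patients_since: int) -> bool:
    """Retrain either after six months or after a fixed number of new
    patients, whichever comes first."""
    return (date.today() - last_retrained >= RETRAIN_INTERVAL
            or patients_since >= RETRAIN_PATIENT_COUNT)

if retraining_due(date(2024, 1, 15), patients_since=640):
    # Retrain via the pre-agreed procedure, then re-run the acceptance
    # checks before releasing the updated model to the clinic.
    pass
```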
The details are a discussion for another time but, for now, the key takeaway is this: AI and ML are coming. They have, in fact, already arrived in many fields, where they are providing significant advantages over older technology. As more and more ML devices seek regulatory approval, it is vital that developers have a good understanding of regulatory requirements and how they interact with the ML development process. And it is similarly important that medicine as a field begins to consider ways of regulating ML devices which better allow the advantages of the technology to shine through.