Is NICE right on what needs to be undertaken to progress the use of AI for reviewing chest imaging in the NHS?
Chris Richmond - Principal Clinical Consultant
On 28 September 2023 NICE published a health technology evaluation (HTE12) titled Artificial intelligence-derived software to analyse chest X-rays for suspected lung cancer in primary care referral. In this early value assessment NICE makes a number of recommendations. But what do these mean? And will they delay the uptake of AI to support clinical decision making in chest imaging?
Recent policy promotes the use of AI and machine learning to support this clinical area, which has recently been pump-primed by dedicated NHSE funding to the tune of £21m through the AI Diagnostic Fund. Some of this is driven by acute radiologist shortages, as outlined in the Richards report. Given that this report is three years old and there are over 200 CE-marked AI algorithms, many of which are already in use, is NICE behind the curve?
If we look at the recommendations that NICE makes, we can broadly break them into three areas: process, AI and data.
Below are some of the areas where NICE recommends more research on outcomes. They broadly refer to the process aspects of imaging, although the third bullet also mentions accuracy, which will be picked up later.
- the impact of the software on clinical decision making and the number of people referred to have a chest CT scan.
- the impact on review and reporting time, and time to CT referral and diagnosis.
- the diagnostic accuracy of using AI‑derived software alongside clinician review for identifying normal X‑rays with high confidence and the impact of this on work prioritisation and patient flow.
- how using the software affects healthcare costs and resource use.
The first bullet is very broad. At present it is not clear what impact AI might have on overall workflows. It may mean that there is an increase in referrals from primary care to secondary care, or to a Community Diagnostic Centre (CDC). What NICE does not explore is the positive effect that AI might have by scanning existing images for people with high risk factors, or newly identified risk factors. Doing this may mean that those people are recalled for more up-to-date imaging, but a clinical decision might be made on the old image if it is deemed to be within an acceptable range. This may not just apply to CT or MRI images of the chest, but could also apply to existing X-ray images.
Logically this may have an impact on the numbers referred, but an alternative view is that it will improve early diagnosis in certain cases. Work undertaken by Cancer Research UK in Liverpool, and now being scaled nationally, has shown the impact of early diagnosis of lung cancer. Could applying AI to existing images improve this further?
There is significant evidence to support the second bullet. AI can review images much more quickly than a radiologist, with a comparable margin of error, as demonstrated by the number of products receiving CE or FDA approval. A limitation is that an AI model may only be looking for a specific anomaly or cluster of signs, whereas a clinician may identify other issues. This can be resolved by deploying multiple AI models to review the same image, which need not increase review time. Ultimately the image is still reviewed by a clinician, but that review is informed by multiple AI models. Using a platform such as AIDE allows a system or organisation to deploy multiple AI models easily and provides a single place from which to review the results. This could improve the time to diagnosis, identify issues that may not otherwise have been recognised, and decrease the pressure on reporting radiologists.
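To make the multi-model review concrete, the sketch below aggregates per-finding probabilities from several models into a single summary for the reporting clinician. All model names, findings, the threshold and the combination rule are illustrative assumptions, not features of AIDE or any real product.

```python
# Hypothetical sketch: combining findings from several independent AI models
# into one clinician-facing summary. Names and numbers are invented.

FLAG_THRESHOLD = 0.5  # findings scored above this are surfaced to the clinician

def aggregate_findings(model_outputs):
    """Combine per-model {finding: probability} dicts, keeping the highest
    score reported for each finding across all models."""
    combined = {}
    for model_name, findings in model_outputs.items():
        for finding, prob in findings.items():
            if finding not in combined or prob > combined[finding][0]:
                combined[finding] = (prob, model_name)
    # surface only findings at or above the threshold, highest-scoring first
    flagged = [(f, p, m) for f, (p, m) in combined.items() if p >= FLAG_THRESHOLD]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

outputs = {
    "nodule_model":  {"nodule": 0.82, "effusion": 0.10},
    "general_model": {"nodule": 0.64, "consolidation": 0.57},
}
for finding, prob, model in aggregate_findings(outputs):
    print(f"{finding}: {prob:.2f} (from {model})")
```

The point of the sketch is that the clinician sees one prioritised list rather than several separate reports, so adding models need not add review time.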
We will discuss the efficacy of available solutions later in this article, but the error margins for FDA- and CE-approved models are currently deemed acceptable by regulatory bodies. As demonstrated in the NHS and internationally, AI can have a positive impact on workflows and diagnosis. Over time, as the models improve, become quicker to deploy, and align with clinical workflows and processes, this will improve further.
A number of studies have looked at the economic impact. A study by Chernina et al, covering AI use across 10 pathologies, found that the AI-supported pathway delivered 3.6 times more benefit than the non-AI pathway when the AI was used as an “assistant”. This is supported by a MedTech Europe report from 2020 that showed significant impacts from the use of AI more widely, not just in chest imaging.
The three recommendations that NICE makes on data can broadly be answered by looking at the way in which models are trained. More research is requested on the outcomes of the following:
- the technical failure and rejection rates of the software.
- whether the software works in cases when it is hard to get high-quality images, for example, scoliosis and morbid obesity.
- whether the software works in groups that could particularly benefit, including people with multiple conditions, people from high-risk family backgrounds, and younger women who do not smoke.
If we expose models to larger data sets when we are training them, and do so continually, the outputs may theoretically improve, but only if we are training on good data. The adage ‘rubbish in, rubbish out’ stands in AI as much as anywhere else. Approved models are expected to achieve a certain sensitivity and specificity. Exposure to more data may improve this; alternatively, it may make specificity worse. This would need to be monitored, but given that each output is clinically reviewed, this should be possible. It may be that specificity drops but remains within the confidence range for one condition, while improving dramatically in another area. It would be advantageous to be able to utilise a range of AI models to look at each image. Likewise, having the ability to easily train existing models and test new models in a safe environment is key. This could be done by utilising a package such as FLIP, which the AI Centre for Value Based Healthcare developed in conjunction with Answer Digital and which is approved by the HRA to train AI models in a standardised and safe way.
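The monitoring described above can be as simple as recomputing sensitivity and specificity on a fixed validation set after each retraining cycle and checking them against agreed floors. The sketch below illustrates this; the threshold values and case counts are invented for illustration and are not regulatory figures.

```python
# Illustrative sketch: checking a retrained model's sensitivity and
# specificity against agreed minimums. All numbers are made up.

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

MIN_SENSITIVITY = 0.90  # illustrative floor, not a regulatory value
MIN_SPECIFICITY = 0.85  # illustrative floor, not a regulatory value

def within_approved_range(tp, fn, tn, fp):
    """True if the retrained model still meets both floors."""
    sens, spec = sensitivity_specificity(tp, fn, tn, fp)
    return sens >= MIN_SENSITIVITY and spec >= MIN_SPECIFICITY

# e.g. after retraining: 95 cancers flagged, 5 missed,
# 880 normals correctly cleared, 120 false alarms
sens, spec = sensitivity_specificity(95, 5, 880, 120)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
print(within_approved_range(95, 5, 880, 120))
```

The trade-off the article describes, where more data sharpens sensitivity for one condition while nudging specificity down for another, shows up directly in these two numbers, which is why both need tracking per condition rather than a single accuracy figure.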
If we train the models on the examples given by NICE (scoliosis, morbid obesity, people with multiple conditions, people from high-risk family backgrounds, and younger women who do not smoke) it will improve the outputs for those groups. This will only be possible if we can expose models to images from those groups. We would also need to ensure that the models are exposed to images from different ethnicities so as to maintain high-quality outputs. Undertaking ensemble training on the models would also help here.
In addition, if we adopt the principles of user-centred design when building and training AI models, we will be better placed to limit some of the data issues, as we will better understand the impacts on certain groups of people.
The final recommendations from NICE relate to the AI itself. Some of the points have been covered above.
- the diagnostic accuracy of AI‑derived software alongside clinician review in detecting nodules or other abnormal lung features that suggest lung cancer.
- patient perceptions of using AI‑derived software.
There is significant empirical evidence that AI can detect signs of lung cancer. These points have been covered above, and given that a significant number of the approved models are designed to detect just this, the NICE recommendation feels moot. As discussed in relation to data, if we are able to train the models on more data we will be able to improve the outcomes, and this will be reflected in the workflows. It is also important that new consultants in training are exposed to AI as part of pre-registration and post-registration education so that they feel confident in its use. Critically, NICE does not make this point, but it is just as important as the patient perspective.
There is a significant amount of public discussion around the use of AI beyond its application in healthcare. This could lead to misunderstandings in the general population about how it is used in healthcare. Once it becomes mainstream it will require careful, coordinated messaging from NHS England, the Royal Colleges and patient groups. As outlined in the research by Chernina et al, AI is very effective when used as an “assistant” to clinical decision making, and it should be clear that it is an “extension of the clinician” rather than a replacement for them.
There will also need to be clear messaging about access to data to train the models. This should be transparent and accessible to all. If not, there is a risk that those in the most deprived groups do not consent, and the models miss out on important variations from some of the most at-risk groups.
It is important to note that trust risks are not unique to AI; they arise with any major service redesign. However, AI is complicated because it can involve sensitive personal data that patients worry will be misused.
In conclusion, NICE has in parts made sensible recommendations, but there is nothing here that could not have been predicted. Coming three years after the Richards report, with the drive to use AI in healthcare well underway, the recommendations risk derailing the scaling and use of AI unless there is a timely response by AI vendors, academics, health providers and, most importantly, patients and citizens. We must continue to develop and implement AI while remembering the key tenets of do no harm and continual improvement. To do so, AI models must be tested in safe environments and be easy to implement once approved.