AI and ML devices fail in ways that traditional risk analysis techniques were not designed to catch. Distribution drift, dataset bias, subgroup underperformance, and adversarial inputs all route directly from software behaviour to patient harm without any component failing. Under MDR, these hazards belong in the EN ISO 14971:2019+A11:2021 risk management process, identified explicitly, controlled with the hierarchy the standard requires, and monitored through the lifecycle the regulation demands.

By Tibor Zechmeister and Felix Lenhard.

TL;DR

  • EN ISO 14971:2019+A11:2021 is the harmonised risk management standard the MDR expects, and it applies to AI and ML devices the same way it applies to any other SaMD.
  • ML-specific hazards include distribution drift, training-data bias, subgroup underperformance, adversarial inputs, label noise, and feedback-loop contamination. None of these show up in a traditional component FMEA.
  • Most AI and ML decision-support software lands in Class IIa or higher under MDR Annex VIII Rule 11, which means notified body review and deep risk-file scrutiny.
  • The EN ISO 14971:2019+A11:2021 risk control hierarchy still applies: inherent safety by design first, protective measures second, information for safety last.
  • Post-market performance monitoring is not optional. MDR Annex I GSPR 3 requires the risk management process to run through the entire lifecycle, and ML devices can drift faster than a fixed periodic review cycle will catch.
  • Tibor's observation from follow-up interviews: structured AI-assisted hazard discovery during the risk session helps teams find hazards they would otherwise miss. It is an emerging state of the art, not a replacement for multi-disciplinary human review.

Why algorithmic risk is different

Tibor has reviewed risk files for AI and ML devices that looked complete on paper and fell apart on the first clinical question. A team would list hazards such as "model returns incorrect output" with a generic mitigation such as "validate against test dataset." That is not risk management. That is a placeholder pretending to be risk management.

The failure mode under ML is almost never that the model stops working. The failure mode is that the model keeps working, confidently, on a population where its training data is no longer representative. A dermatology classifier trained on light skin performs worse on dark skin. A sepsis early-warning model trained on one hospital's lab normalisation ranges performs worse on another's. A speech-to-text clinical documentation tool trained on native speakers performs worse on accented speech. In each case, the software did exactly what it was trained to do. The hazard lived in the data, the assumptions, and the deployment context, not in the code.

Felix coaches teams on the gap between building a model and shipping one into a regulated market. The teams that hit trouble are the ones that copied the risk analysis format from a hardware file and never rewrote it for what their actual device does. The teams that ship cleanly rewrite the risk analysis around algorithmic hazards and then layer the traditional techniques underneath.

What MDR actually says about AI and ML risk

MDR does not use the words AI or machine learning anywhere in its articles or annexes. What it does require applies directly and unconditionally.

Annex I GSPR 1 requires devices to achieve the performance intended by the manufacturer while being safe and effective. GSPR 3 requires the manufacturer to establish, implement, document, and maintain a risk management system as a continuous iterative process throughout the entire lifecycle, with regular systematic updating. GSPR 4 requires risks to be reduced as far as possible without adversely affecting the benefit-risk ratio.

Annex I §17.1 requires electronic programmable systems to be designed to ensure repeatability, reliability, and performance in line with their intended use, and to implement measures to eliminate or reduce risks associated with a single fault condition. §17.2 requires software to be developed and manufactured in accordance with the state of the art, taking into account the principles of development lifecycle, risk management, verification, and validation.

Annex VIII Rule 11 classifies software intended to provide information used to take decisions with diagnosis or therapeutic purposes into Class IIa, Class IIb, or Class III depending on the severity of the outcome. MDCG 2019-11 Rev.1 confirms that almost all clinically meaningful AI and ML diagnostic or decision-support software falls under Rule 11 and lands in Class IIa at minimum.
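The sub-classification logic of Rule 11 can be sketched as a small decision function. This is an illustrative simplification only; the actual rule text and MDCG 2019-11 govern real classification decisions, and the two boolean inputs here compress clinical judgments that need documented rationale.

```python
from enum import Enum

class MdrClass(Enum):
    IIA = "Class IIa"
    IIB = "Class IIb"
    III = "Class III"

def rule_11_decision_class(may_cause_death_or_irreversible: bool,
                           may_cause_serious_deterioration_or_surgery: bool) -> MdrClass:
    """Simplified sketch of MDR Annex VIII Rule 11 for decision-support
    software. Illustration only: the regulation and MDCG 2019-11 guidance
    govern real classification, not this function."""
    if may_cause_death_or_irreversible:
        return MdrClass.III
    if may_cause_serious_deterioration_or_surgery:
        return MdrClass.IIB
    # Class IIa is the floor for software providing information
    # used to take diagnostic or therapeutic decisions.
    return MdrClass.IIA
```

The point of the sketch is the floor: once software informs diagnostic or therapeutic decisions, Class I is off the table.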

In parallel, the EU AI Act classifies most medical AI systems as high-risk under its own framework. The AI Act and MDR run in parallel, not instead of each other. A team shipping an AI medical device has to satisfy both. For this post, the focus stays on the MDR risk file.

ML-specific hazards to add to the risk analysis

A software risk analysis that covers ML properly adds at least the following hazard categories. Each one can be the starting point for several specific hazards in a real file.

Distribution drift. The real-world input distribution shifts away from training. Patient populations change, lab methods change, imaging devices update firmware. Performance degrades silently.

Dataset bias. The training dataset underrepresents or misrepresents parts of the target population. Subgroups defined by age, gender, ethnicity, or comorbidity may receive systematically worse predictions. Bias is frequently invisible in aggregate metrics.

Subgroup underperformance. Even on a representative dataset, a model can perform worse on specific subgroups because the task is inherently harder there, or because labels are noisier, or because the feature representation misses the relevant variation.

Adversarial inputs. Small perturbations cause the model to produce wrong outputs with high confidence. For imaging and text-based SaMD, adversarial vulnerability is a real attack surface.

Label noise. Training labels are incorrect for a non-trivial fraction of examples, which propagates into systematic errors at inference.

Feedback-loop contamination. The model's outputs influence the data it is later retrained on, reinforcing bias.

Automation bias. Clinicians trust the model more than they should and stop checking edge cases. The hazard originates in the model's confident presentation, not in its accuracy.

Out-of-scope input. The model is asked to classify inputs it was not designed to handle and returns a confident answer because it was not trained to abstain.

None of these hazards will appear in an FMEA spreadsheet that lists component failures. They have to be added deliberately during the hazard identification step of EN ISO 14971:2019+A11:2021 clause 5.
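Several of these hazards, drift in particular, are detectable with standard distribution-comparison statistics. As a minimal sketch, the population stability index (PSI) compares a training-time reference sample of an input feature against a deployment-time sample. The binning scheme and the 0.2 rule of thumb are conventions, not regulatory requirements; real thresholds belong in the risk management plan.

```python
import math
from typing import Sequence

def population_stability_index(expected: Sequence[float],
                               observed: Sequence[float],
                               bins: int = 10) -> float:
    """PSI between a training-time reference sample ('expected') and a
    deployment-time sample ('observed') of one scalar input feature.
    Common rule of thumb: PSI > 0.2 signals meaningful drift."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def fractions(sample: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        n = len(sample)
        # small epsilon keeps the log defined for empty bins
        return [max(c / n, 1e-6) for c in counts]

    e, o = fractions(expected), fractions(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))
```

A single scalar PSI per feature is the crudest useful signal; a real monitoring design would track it per deployment site and per subgroup, since drift concentrated in one clinic or one population can vanish in a pooled statistic.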

A worked example: diabetic retinopathy screening tool

Consider a startup shipping an ML tool that reads fundus images and flags possible diabetic retinopathy for referral to an ophthalmologist. Under Rule 11, this provides information used to take decisions with diagnostic purposes, with a risk of serious deterioration if missed. Class IIa at minimum, plausibly Class IIb depending on how the clinical pathway is framed.

A weak risk file for this product lists hazards like "model produces false negative", scores it with a probability and a severity, and adds a mitigation like "validate with clinical study". It is compliant-shaped without being compliant.

A strong risk file for the same product asks: what happens when the deployed fundus camera model differs from the one the training images came from? What happens when the population shifts, because the product ships into a country with different retinopathy prevalence and different comorbidity patterns? What happens when the tool performs worse on patients with cataracts, a subgroup the training set underrepresents? What happens when image compression artefacts systematically change pixel distributions? What happens when a clinician accepts the tool's low-risk output without re-reviewing, because the tool has been reliable for three weeks?

Each of these becomes a hazard in the risk file. Each gets a severity based on the clinical consequence of a missed referral. Each gets a control. Inherent safety by design where possible: refuse to output a decision when input image quality is below a validated threshold. Protective measures where that is not possible: route uncertain cases to mandatory human review. Information for safety as the last layer: tell clinicians explicitly where the tool is and is not validated.
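The three control layers above can be sketched as a single inference gate. Everything here is hypothetical: the threshold values, the field names, and the uncertainty rule are placeholders for values a real device would derive from validation studies.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScreeningResult:
    decision: Optional[str]   # "refer" / "no_refer" / None when withheld
    routed_to_human: bool
    message: str

# Hypothetical thresholds; real values must come from validation studies.
MIN_IMAGE_QUALITY = 0.7
UNCERTAINTY_CUTOFF = 0.15

def screen(image_quality: float, p_retinopathy: float) -> ScreeningResult:
    # Layer 1, inherent safety by design: refuse to decide on bad input.
    if image_quality < MIN_IMAGE_QUALITY:
        return ScreeningResult(None, False,
            "Image quality below validated threshold; retake image.")
    # Layer 2, protective measure: uncertain cases go to mandatory human review.
    if abs(p_retinopathy - 0.5) < UNCERTAINTY_CUTOFF:
        return ScreeningResult(None, True,
            "Model uncertain; case routed to ophthalmologist review.")
    # Layer 3, information for safety: decision ships with its scope statement.
    decision = "refer" if p_retinopathy >= 0.5 else "no_refer"
    return ScreeningResult(decision, False,
        "Validated for adult fundus images only; see IFU for limitations.")
```

Note the ordering mirrors the hierarchy: the design refuses before it warns, and the warning is the last layer, never the first.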

Tibor's audit expectation: the file should show evidence that these hazards were considered, not just evidence that a validation study was run. The validation study is one data point. The risk analysis is the framework the data point lives in.

The Subtract to Ship playbook for ML risk management

Felix coaches teams through a specific sequence to keep ML risk analysis defensible without drowning the startup.

Step 1. Before any ML-specific hazard work, produce a clear intended-use statement and a documented target population. Every later bias and subgroup discussion depends on knowing exactly who the device is for. MDR Article 2(12) defines intended purpose as the use for which a device is intended according to the data supplied by the manufacturer on the label, in the instructions for use, in promotional or sales materials or statements, and as specified by the manufacturer in the clinical evaluation. That definition anchors the subgroup analysis.

Step 2. Run a structured hazard identification session using the EN ISO 14971:2019+A11:2021 process and the ML hazard categories above as prompts. Multi-disciplinary: clinical, ML engineer, software engineer, product, and if possible an external reviewer. Tibor's follow-up interviews flagged AI-assisted hazard discovery as an emerging technique worth using as a second pass, not as a replacement for the human session.

Step 3. Document training data provenance, representativeness, and known gaps. This becomes the input to bias and subgroup hazards. A risk file that claims "our dataset is representative" without a provenance document is hollow.

Step 4. Apply the EN ISO 14971:2019+A11:2021 risk control hierarchy. Where the design can refuse to output a decision rather than output a wrong one, that is inherent safety by design. Where a mandatory human check can intercept a wrong output, that is a protective measure. Where only a warning can alert the clinician, that is information for safety, and it should be the last layer.

Step 5. Connect every ML hazard to an EN 62304:2006+A1:2015 software item and update the software safety class accordingly. The model inference component of an ML SaMD is almost always at least Class B, often Class C.

Step 6. Design post-market performance monitoring before release. MDR Annex I GSPR 3 requires the risk management process to continue through production and post-production. ML devices drift. The risk file has to be updated continuously, not every three years. Post-market metrics feed back into severity and probability estimates in the risk file, and trigger CAPA when thresholds are crossed.

Step 7. Plan the change-control pathway for model updates. A retrained model is a change to the device. Depending on the scope, it may require a new conformity assessment. The risk analysis has to track what counts as a significant change before the first update is shipped, not after.
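Step 6's requirement that thresholds be designed before release can be made concrete as a tiered action map. The tier names and threshold values below are invented for illustration; a real device's values come out of the risk analysis and live in the risk management plan and the PMS plan.

```python
def monitoring_action(rolling_sensitivity: float) -> str:
    """Map a rolling post-market sensitivity estimate to a predefined
    action tier. Thresholds are illustrative placeholders; real values
    are set in the risk management plan before release, not improvised
    after drift appears."""
    REVIEW_BELOW = 0.90   # triggers an internal risk-file review
    NOTICE_BELOW = 0.85   # triggers a field safety notice assessment
    RECALL_BELOW = 0.80   # triggers recall evaluation and CAPA
    if rolling_sensitivity < RECALL_BELOW:
        return "recall_evaluation"
    if rolling_sensitivity < NOTICE_BELOW:
        return "field_notice_assessment"
    if rolling_sensitivity < REVIEW_BELOW:
        return "risk_file_review"
    return "continue_monitoring"
```

The value of writing the map down before release is that the hard conversation about what performance level justifies a recall happens once, calmly, instead of during the incident.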

Reality Check

  1. Does your risk file name drift, bias, subgroup underperformance, and out-of-scope input as distinct hazards, or are they absent?
  2. Can you point to a documented target population and training-data provenance record that your subgroup analysis builds on?
  3. Did a multi-disciplinary team run your ML hazard identification session, including at least one clinical voice?
  4. Does your software safety class per EN 62304:2006+A1:2015 reflect the hazards the ML components contribute to, or was the class chosen before the hazards were listed?
  5. Is your post-market performance monitoring designed to detect drift at a clinically meaningful resolution, or only to track error reports?
  6. Do you have a documented rule for what counts as a significant change when you retrain the model?
  7. If a notified body auditor asked how you would notice that the model has stopped working as well for one subgroup, could you show the monitoring design that answers that?

Frequently Asked Questions

Does ISO 14971 cover AI and ML specifically? EN ISO 14971:2019+A11:2021 is technique-neutral. It does not call out ML by name, but its process of identifying hazards, estimating and evaluating risks, controlling them, and monitoring through the lifecycle applies to ML devices as fully as to any other device. Additional guidance such as AAMI CR34971 exists, but the harmonised standard remains EN ISO 14971:2019+A11:2021.

Is the EU AI Act replacing MDR for AI devices? No. The AI Act runs in parallel with MDR. An AI medical device has to satisfy both. The MDR risk file still has to conform to EN ISO 14971:2019+A11:2021 and the device still has to meet Annex I GSPRs.

How do we handle drift without blocking every release? Design monitoring thresholds before release, not after drift appears. Make the risk file explicit about what level of performance change triggers a review, what triggers a field notice, and what triggers a recall. The thresholds go in the risk management plan and the post-market surveillance plan, not in a Slack channel.

Can we use aggregate accuracy as evidence of no bias? No. Aggregate accuracy hides subgroup underperformance. The risk file needs disaggregated performance by the subgroups defined in the intended use. If you do not know which subgroups matter clinically, that is a gap to close before the risk analysis, not after.
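Disaggregation is mechanically trivial, which makes its absence from a risk file hard to defend. A minimal sketch, assuming records already carry a subgroup label defined by the intended use:

```python
from collections import defaultdict

def disaggregated_accuracy(records):
    """records: iterable of (subgroup_label, y_true, y_pred) tuples.
    Returns the aggregate accuracy alongside per-subgroup accuracy,
    making subgroup underperformance visible where the single
    aggregate number hides it."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        totals[subgroup] += 1
        hits[subgroup] += int(y_true == y_pred)
    per_group = {g: hits[g] / totals[g] for g in totals}
    aggregate = sum(hits.values()) / sum(totals.values())
    return aggregate, per_group
```

A model can report a comfortable aggregate while one small subgroup sits far below it; the per-group dictionary is what the risk file should show, with confidence intervals where subgroup counts are small.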

What counts as a significant change when we retrain? That question deserves a dedicated change-control decision tree. As a starting point, any change that could affect safety, performance, or intended purpose is potentially significant and needs assessment. The decision rule should be documented in the QMS before the first retraining.

Is AI-assisted hazard discovery allowed in a compliant process? Yes, and Tibor's follow-up interviews describe it as an emerging state of the art. The key is that it supplements, not replaces, multi-disciplinary human review, and that every AI-suggested hazard is evaluated by the same criteria as human-identified hazards before it enters the risk file.

Sources

  1. Regulation (EU) 2017/745 on medical devices, consolidated text. Annex I GSPR 1, 3, 4; Annex I §17.1, §17.2; Annex VIII Rule 11.
  2. EN ISO 14971:2019+A11:2021, Medical devices, Application of risk management to medical devices.
  3. EN 62304:2006+A1:2015, Medical device software, Software life cycle processes.
  4. MDCG 2019-11 Rev.1 (October 2019, Rev.1 June 2025), Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 and Regulation (EU) 2017/746.