---
title: Risk Management for AI Medical Devices: ML Failure Modes
description: How to fold AI-specific failure modes — drift, bias, adversarial input, data shift — into an EN ISO 14971 risk management file under MDR.
authors: Tibor Zechmeister, Felix Lenhard
category: AI, ML & Algorithmic Devices
primary_keyword: risk management AI medical devices MDR
canonical_url: https://zechmeister-solutions.com/en/blog/risk-management-ai-medical-devices
source: zechmeister-solutions.com
license: All rights reserved. Content may be cited with attribution and a link to the canonical URL.
---

# Risk Management for AI Medical Devices: ML Failure Modes

*By Tibor Zechmeister (EU MDR Expert, Notified Body Lead Auditor) and Felix Lenhard.*

> **Risk management for an AI medical device under MDR means applying EN ISO 14971 to a system whose failure modes are statistical, not deterministic. You must identify hazards that are specific to machine learning — data drift, distribution shift, adversarial input, bias amplification, silent degradation — integrate them into the standard hazard analysis, and design controls that work beyond the algorithm itself.**


## TL;DR
- MDR Annex I §3 requires risk management as a continuous, iterative process over the entire device lifecycle. For AI, "lifecycle" includes deployment environments that change after CE marking.
- EN ISO 14971:2019+A11:2021 is the standard every Notified Body will audit against. It does not mention AI, but it is flexible enough to cover it if you extend your hazard identification properly.
- AI introduces failure modes that traditional rule-based software does not have: dataset shift, concept drift, bias amplification, adversarial perturbation, out-of-distribution silent failure, over-reliance by clinicians.
- Risk controls for AI extend beyond the algorithm. Human factors, labeling, monitoring, and deployment constraints are often more effective than model tweaks.
- A risk file that only mentions "software errors" and "false positives" is not sufficient. A reviewer will expect AI-specific hazards explicitly named.
- Your risk file must remain live post-market. Drift monitoring feeds back into hazard re-assessment under Articles 83–86.

## Why this matters

The first AI device risk file I reviewed as a lead auditor was forty pages long and beautifully formatted. It mentioned "software failure" twice, "incorrect output" once, and did not contain a single entry for distribution shift, dataset bias, or adversarial input. The team had treated their neural network like a normal medical algorithm — deterministic, bounded, auditable. It was none of those things. I raised a major non-conformity and sent them back for a rewrite.

That is the central problem. EN ISO 14971 was written in a world where software was deterministic: if the input is X, the output is always Y, and the failure modes are bugs and interface errors. AI devices do not behave like that. A model that is bit-for-bit unchanged can become clinically wrong over months, because the patient population has shifted, or because the imaging protocol at the deploying hospital has changed. Nothing in the code broke. The device is silently wrong.

MDR Annex I §3 demands risk management as a continuous lifecycle activity. For AI devices, that means the risk file is not a document you finish before CE marking and file away. It is an instrument that has to keep detecting hazards the model encounters in the wild.

## What MDR actually says

Four parts of Annex I do the heavy lifting.

**Annex I §3** — Manufacturers shall establish, implement, document, and maintain a risk management system. The system shall be understood as a continuous iterative process throughout the entire lifecycle of a device, requiring regular systematic updating. Key word: continuous. For AI, this is not boilerplate.

**Annex I §4** — Manufacturers shall adopt risk control measures in the following order of priority: (a) eliminate or reduce risks as far as possible through safe design and manufacture; (b) where appropriate, take adequate protection measures, including alarms if necessary, in relation to risks that cannot be eliminated; (c) provide information for safety (warnings, precautions, contraindications) and, where appropriate, training to users. The hierarchy matters for AI. You cannot skip straight to "add a warning in the IFU" when a dataset improvement is feasible.

**Annex I §8** — All known and foreseeable risks, and any undesirable side-effects, shall be minimised and be acceptable when weighed against the evaluated benefits to the patient and/or user arising from the achieved performance during normal conditions of use. For AI, "known and foreseeable" now includes drift, bias, and adversarial robustness. Reviewers know these exist. Claiming ignorance is not a defense.

**Annex I §17.1 and §17.2** — Software-specific requirements. §17.1 requires repeatability, reliability, and performance in line with intended use. §17.2 requires software to be developed according to the state of the art considering the principles of development lifecycle, risk management, verification, and validation. The state of the art for AI risk management in 2026 explicitly includes the failure modes we cover below — MDCG guidance and published literature assume them.

The standard you build this on is **EN ISO 14971:2019+A11:2021**. Its process — hazard identification, risk estimation, risk evaluation, risk control, residual risk evaluation, benefit-risk analysis, production and post-production information — works for AI if you populate it with the right hazards.

## A worked example

A Class IIb AI device that flags suspected pulmonary embolism on contrast-enhanced CT scans. Intended purpose: assist radiologists in triage by prioritizing scans likely to contain PE.

A conventional risk file would list: false positive (delays other cases), false negative (missed PE), software crash, interface misread. Good enough for a rule-based system. Nowhere near sufficient for AI. Here is what a competent risk file looks like.

**AI-specific hazards and controls:**

*Hazard 1 — Dataset shift across hospitals.* The model was trained on scans from Hospitals A and B using GE scanners with iodinated contrast protocol X. When deployed at Hospital C on Siemens scanners with protocol Y, sensitivity could drop. Risk control: deployment-time calibration check on a local validation set before go-live; per-site performance monitoring; deployment restriction to specified scanner types in the IFU.

*Hazard 2 — Concept drift over time.* The distribution of PE presentations changes (e.g. COVID-era pulmonary pathology shifted imaging baselines). Risk control: quarterly drift detection on input distribution and output distribution; pre-defined thresholds that trigger re-validation or field action. See [drift detection for AI medical devices](/blog/drift-detection-ai-medical-devices-mdr) for the monitoring pattern.
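
A minimal sketch of what that quarterly check could look like, using a Population Stability Index (PSI) over a scalar per-scan statistic. The function names, the 0.2 threshold, and the choice of statistic are illustrative placeholders, not values prescribed by any standard — your own pre-defined thresholds belong in the risk file:

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of a scalar
    per-scan statistic (e.g. mean attenuation). PSI > 0.2 is a common
    heuristic for actionable drift; validate your own threshold."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = hi + 1e-9  # make the top edge inclusive

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(sample)
        # Laplace smoothing so empty bins do not blow up the log term
        return [(c + 0.5) / (n + 0.5 * bins) for c in counts]

    b, c = frac(baseline), frac(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

def drift_check(baseline, current, threshold=0.2):
    """Crossing the pre-defined threshold triggers the re-validation
    or field-action path documented in the risk file."""
    value = psi(baseline, current)
    return {"psi": value, "action_required": value > threshold}
```

The same pattern applies to the output distribution: compare the current quarter's score histogram against the validation-time baseline, with its own documented threshold.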

*Hazard 3 — Out-of-distribution silent failure.* A non-contrast scan accidentally sent to the model. It returns a confident answer that is meaningless. Risk control: input quality check — the device must detect contrast vs non-contrast and refuse to process non-contrast scans with an explicit "not applicable" output.
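
As a sketch of that input gate, assuming a simple heuristic: a contrast-enhanced CT should contain a meaningful fraction of voxels in an enhanced-vessel HU range. The HU window and the 2% fraction below are invented placeholders for illustration — real cut-offs must come from your validation data:

```python
def input_gate(scan_hu_values, min_enhanced_fraction=0.02):
    """Pre-inference eligibility check. Refuses non-contrast input with
    an explicit 'not applicable' output instead of letting the model
    return a confident but meaningless score. Thresholds illustrative."""
    enhanced = sum(1 for hu in scan_hu_values if 200 <= hu <= 500)
    if enhanced / len(scan_hu_values) < min_enhanced_fraction:
        return {"status": "not_applicable",
                "reason": "no contrast enhancement detected"}
    return {"status": "eligible"}
```

The design point is the refusal path: the gate's "not applicable" output is itself a device output and must be verified like any other risk control.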

*Hazard 4 — Bias amplification.* Training data underrepresented elderly female patients; model sensitivity drops for that subgroup. Risk control: subgroup performance monitoring during validation (documented in the risk file, not just the validation report); labeling that reflects known performance gaps; post-market subgroup tracking.
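
Subgroup monitoring is conceptually simple; the point is that it is computed and recorded per subgroup rather than in aggregate. A minimal sketch, with an invented record format:

```python
from collections import defaultdict

def subgroup_sensitivity(records):
    """records: iterable of (subgroup_label, y_true, y_pred) with 1 = PE
    present / flagged. Returns sensitivity per subgroup so performance
    gaps surface in the risk file instead of vanishing into an
    aggregate metric."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:          # sensitivity only looks at true positives
            if y_pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}
```

In practice each subgroup estimate needs a confidence interval and a minimum case count before it is acted on; a sensitivity of 0.5 on four cases is noise, not a finding.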

*Hazard 5 — Adversarial or corrupted input.* An image with compression artifacts or scanner noise could flip a borderline prediction. Risk control: robustness testing against perturbations in validation; output confidence thresholds; human-in-the-loop always required for positive findings.

*Hazard 6 — Over-reliance by clinicians.* The device is meant to triage, not diagnose. Radiologists may defer to the algorithm and miss findings it didn't flag. Risk control: labeling that explicitly states the device does not replace radiologist review; IFU training; display of model confidence rather than binary outputs to preserve clinician judgment.

*Hazard 7 — Silent model degradation after an OS update or library change.* A dependency update changes inference behavior. Risk control: version-locked inference stack; regression testing gated on every release; configuration management per EN 62304 §8.
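
One way to implement that control is a startup gate that compares the deployed artifacts against a manifest frozen at release. The manifest structure and function names below are an assumption for illustration, not a prescribed format:

```python
import hashlib

def verify_inference_stack(weights_bytes, installed_versions, manifest):
    """Refuse to serve predictions unless the deployment matches the
    release manifest frozen at validation time (illustrative sketch).
    manifest = {"weights_sha256": "...", "packages": {"numpy": "1.26.4"}}
    Returns a list of problems; an empty list means the stack is the
    validated one."""
    problems = []
    digest = hashlib.sha256(weights_bytes).hexdigest()
    if digest != manifest["weights_sha256"]:
        problems.append("model weights do not match released checksum")
    for pkg, pinned in manifest["packages"].items():
        found = installed_versions.get(pkg)
        if found != pinned:
            problems.append(f"{pkg}: expected {pinned}, found {found}")
    return problems
```

This complements, rather than replaces, the gated regression tests: the checksum proves you are running the validated artifact, the regression suite proves the artifact still behaves as validated.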

*Hazard 8 — Feedback loops.* The device's own outputs influence the data it later sees (e.g. flagged scans get re-imaged differently). Risk control: monitoring of input distribution changes; analysis of whether the model's outputs are influencing upstream workflows.

Each hazard gets a probability estimate, severity estimate, risk evaluation against acceptance criteria, control measures, verification that the controls work, and residual risk assessment. Every one of those fields must be filled. "Low" is not an acceptable entry for probability of dataset shift when the device is deployed internationally.
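
If the risk file lives in a tool rather than a spreadsheet, the "every field filled" rule can be enforced mechanically. A sketch of one hazard line item as a data structure — the field names mirror the list above but are illustrative, not prescribed by EN ISO 14971:

```python
from dataclasses import dataclass, fields

@dataclass
class HazardEntry:
    """One risk file line item. Empty strings and empty tuples count
    as unfilled, so incomplete entries are caught before review."""
    hazard: str
    hazardous_situation: str
    harm: str
    probability: str            # qualitative is fine, blank is not
    severity: str
    controls: tuple             # control measures, in §4 priority order
    verification_evidence: tuple  # test/report IDs proving controls work
    residual_risk: str

    def incomplete_fields(self):
        return [f.name for f in fields(self)
                if getattr(self, f.name) in ("", None, ())]
```

A pre-review script that rejects any entry with `incomplete_fields()` non-empty catches the "Low, TBD, see validation report" placeholders before an auditor does.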

## The Subtract to Ship playbook

Most teams I coach have either an over-engineered risk file nobody reads, or a minimal one that will fail audit. The middle path is to be exhaustive about hazard identification and ruthless about risk control simplicity.

**Step 1 — Build an AI-specific hazard checklist.** Use the list above as a starting point. Add hazards specific to your modality, deployment, and user population. Review it with a radiologist, a data scientist, a clinician, and a regulatory lead — four different perspectives surface different hazards.

**Step 2 — Integrate into the standard EN ISO 14971 process.** Do not maintain a "normal risk file" and a "separate AI risk file." One file, one process. The standard accommodates AI hazards if you actually populate them.

**Step 3 — Prefer non-algorithmic controls.** Annex I §4 prioritizes design controls, then protective measures, then information. For AI, that means: constrain the intended purpose first (design), add input validation and output confidence gates second (protection), warn in the IFU third (information). A contraindication that excludes a risky subgroup is more robust than a "use with caution" warning.
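
The "output confidence gate" in the protection layer can be as small as a routing function. The thresholds below are placeholders to be set and verified during validation, not universal values:

```python
def output_gate(score, lower=0.2, upper=0.8):
    """Protective measure ahead of any IFU warning: confident scores are
    acted on, borderline scores are routed to mandatory human review
    rather than silently flagged or dismissed. Thresholds illustrative."""
    if score >= upper:
        return "flag_for_priority_review"
    if score <= lower:
        return "no_flag"
    return "indeterminate_human_review_required"
```

Because this sits in the device design rather than the labeling, it ranks higher in the Annex I §4 hierarchy than an equivalent "interpret borderline results with caution" warning.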

**Step 4 — Make the risk file live.** Tie it to your post-market surveillance plan. Drift monitoring triggers feed into hazard re-estimation. Complaints feed into hazard identification. The file updates. Date every change. See our [post-market surveillance for AI devices](/blog/post-market-surveillance-ai-devices) post for the feedback loop.

**Step 5 — Verify every control.** A risk control with no verification evidence is an opinion. If you claim "input quality check rejects non-contrast scans," there must be a test that demonstrates it across the realistic input space. Reviewers will ask.

**Step 6 — Document the benefit-risk conclusion honestly.** Annex I §8 requires the benefit to outweigh the residual risk under normal conditions. For AI, "normal conditions" includes the subgroups where your model is weaker. If your residual risk for the elderly subgroup is above acceptance, either improve the model or exclude that subgroup from intended use.

**Step 7 — Keep the file readable.** An auditor should be able to trace from a hazard to a control to a verification to a residual risk in under five minutes. If they cannot, the file is failing its job regardless of how thorough it is.

Subtract the fluff. Add the AI-specific hazards most teams skip.

## Reality Check

1. Does your risk file contain explicit entries for dataset shift, concept drift, and bias amplification — by name?
2. Have you considered out-of-distribution inputs and how the device behaves when it receives them?
3. For every risk control, is there verification evidence that the control works?
4. Are your residual risks evaluated subgroup by subgroup, not just in aggregate?
5. Is your risk file connected to a live post-market monitoring process that can detect new hazards?
6. Have you prioritized design and protective controls over warnings in the IFU?
7. Would an auditor be able to trace any single hazard end-to-end in under five minutes using your risk file?
8. When the model is retrained, does your process automatically re-trigger hazard re-evaluation?

## Frequently Asked Questions

**Is EN ISO 14971 sufficient for AI medical devices, or do I need something else?**
EN ISO 14971:2019+A11:2021 remains the MDR-harmonized risk management standard and is sufficient as a process framework. What is insufficient is applying it with only traditional software hazards in mind. You extend the hazard identification — you do not replace the standard.

**Do I need to reference ISO/IEC 23894 or similar AI risk standards?**
Referencing them is reasonable and increasingly expected as state of the art. But EN ISO 14971 remains the primary harmonized standard under MDR. Additional AI-specific references support your state-of-the-art argument under Annex I §17.2.

**How often should the risk file be updated for an AI device?**
At least annually, after every significant change, after every retraining, and whenever post-market data surfaces a new hazard. For continuously deployed AI, monthly or quarterly review cycles are common.

**Can I use qualitative probability estimates for AI-specific hazards?**
Yes. EN ISO 14971 allows qualitative estimation. What matters is that the estimate is defensible and supported by data where available — subgroup performance, observed drift rates, complaint frequencies.

**What happens if a post-market finding reveals a previously unidentified hazard?**
You update the risk file, re-evaluate residual risk, and decide whether the device remains within acceptance criteria. If not, that may trigger a field safety corrective action under MDR Articles 87–92.

**Is "bias" a hazard or a harm?**
Bias is a failure mode that leads to hazards — differential performance across subgroups that can cause harm. Your risk file should capture the failure mode, the hazardous situation it creates, and the resulting harms. See [data quality and bias in AI medical devices](/blog/data-quality-bias-ai-medical-devices) for the framework.

## Related reading
- [Drift detection for AI medical devices under MDR](/blog/drift-detection-ai-medical-devices-mdr) — the monitoring layer of your risk controls.
- [Data quality and bias in AI medical devices](/blog/data-quality-bias-ai-medical-devices) — the upstream source of several hazards.
- [Performance validation for AI medical devices](/blog/performance-validation-ai-medical-devices) — where subgroup analysis lives.
- [Locked vs adaptive AI algorithms under MDR](/blog/locked-vs-adaptive-ai-algorithms-mdr) — adaptive models change your risk profile.
- [Post-market surveillance for AI devices](/blog/post-market-surveillance-ai-devices) — the live feedback into the risk file.

## Sources
1. Regulation (EU) 2017/745 on medical devices, consolidated text. Annex I §3, §4, §8, §17.1, §17.2.
2. EN ISO 14971:2019+A11:2021 — Application of risk management to medical devices.
3. EN 62304:2006+A1:2015 — Medical device software — Software lifecycle processes.

---

*This post is part of the [AI, ML & Algorithmic Devices](https://zechmeister-solutions.com/en/blog/category/ai-ml-devices) cluster in the [Subtract to Ship: MDR Blog](https://zechmeister-solutions.com/en/blog). For EU MDR certification consulting, see [zechmeister-solutions.com](https://zechmeister-solutions.com).*
