Across Notified Body audits of AI-based medical devices, the same seven mistakes keep appearing: Rule 11 misclassification, vague intended purpose, treating model retraining as a non-significant change, no drift monitoring in the PMS plan, weak clinical evidence, unclear human oversight, and undocumented training data governance. Each one is fixable before submission and almost impossible to fix after a non-conformity is raised. This is the auditor's view and the fix for each.
By Tibor Zechmeister and Felix Lenhard.
TL;DR
- MDR Annex VIII Rule 11 places most AI decision-support software in Class IIa or higher, and MDCG 2019-11 Rev.1 is the interpretive reference.
- An intended purpose written for marketing rather than regulatory precision is the upstream cause of most AI-device audit findings.
- Model retraining is very often a significant change under MDR. Treating it as a silent update is the single riskiest pattern Tibor sees.
- A PMS plan that does not include model drift monitoring does not satisfy MDR Articles 83 to 86 for an AI device.
- Clinical evidence for AI devices must address performance on the target population, not just internal validation metrics.
- Human oversight, training data provenance, and change management must all be documented before the Notified Body stage 2 audit.
Why this matters (Hook)
Tibor has audited AI-based medical devices from both sides: as a Notified Body lead auditor deciding whether to grant or withhold a CE certificate, and as a founder of four MedTech companies, including Flinn.ai, that had to pass those audits themselves. The patterns are consistent. AI MedTech startups are usually strong on the model and weak on the regulatory framing around the model. The result is that the device might work clinically but the file cannot defend it under MDR.
This post walks through the seven recurring mistakes, the MDR articles they touch, and the fix for each. None of them require rewriting the model. All of them require rewriting or strengthening the regulatory artefacts around the model before the Notified Body sees them.
What MDR actually says (Surface)
Three anchor points frame every AI-device file.
Annex VIII Rule 11. Software intended to provide information used to take decisions with diagnosis or therapeutic purposes is classified in Class IIa, unless such decisions have an impact that may cause death or an irreversible deterioration of a person's state of health (Class III), or a serious deterioration of a person's state of health or a surgical intervention (Class IIb). Software intended to monitor physiological processes is in Class IIa, unless it is intended for monitoring of vital physiological parameters where variations could result in immediate danger (Class IIb). All other software is in Class I. MDCG 2019-11 Rev.1 is the operational guidance on how to apply this rule.
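In code terms, the decision-support branch of Rule 11 is a short decision tree keyed on the worst-case impact of the clinical decision the software informs. A minimal sketch, purely to make the branching explicit (the string labels and function name are illustrative, not MDR terms of art; the actual classification is always defended in prose against the rule text and MDCG 2019-11 Rev.1, never by a script):

```python
from enum import Enum

class DeviceClass(Enum):
    I = "Class I"
    IIA = "Class IIa"
    IIB = "Class IIb"
    III = "Class III"

def classify_decision_support(worst_case_impact: str) -> DeviceClass:
    """Decision-support branch of Rule 11: classify on the worst-case
    impact of the decision the software informs (illustrative labels)."""
    if worst_case_impact == "death_or_irreversible_deterioration":
        return DeviceClass.III
    if worst_case_impact == "serious_deterioration_or_surgical_intervention":
        return DeviceClass.IIB
    # Default for software informing diagnostic or therapeutic decisions
    return DeviceClass.IIA
```

What the sketch makes obvious: the default for decision-support software is Class IIa, and the burden of argument sits entirely on the worst-case impact, which is exactly the question a thin classification rationale leaves unanswered.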
Article 2(12). Intended purpose. The intended purpose is "the use for which a device is intended according to the data supplied by the manufacturer on the label, in the instructions for use or in promotional or sales materials or statements and as specified by the manufacturer in the clinical evaluation." This definition is the anchor for classification, clinical evidence, risk management, and every subsequent decision.
Article 61 and Annex XIV Part A. Clinical evaluation. Clinical evidence must demonstrate safety and performance for the intended purpose in the target population. For an AI device, internal validation metrics alone are not clinical evidence.
Around these anchors sit: the significant change framework used by Notified Bodies to decide whether a modification triggers a new conformity assessment, the PMS obligations of Articles 83 to 86, the QMS obligations of Article 10(9) and EN ISO 13485:2016+A11:2021, and the software lifecycle obligations of EN 62304:2006+A1:2015.
A worked example (Test)
A dermatology startup submits technical documentation for a Class IIa AI image analysis device to a Notified Body. The classification rationale references Rule 11 but justifies Class IIa in three lines. The intended purpose reads like a marketing page: "helps dermatologists improve diagnostic decisions." The file contains a validation report showing 93% sensitivity and 89% specificity on a held-out test set. The PMS plan is copy-pasted from a generic template. The training data section describes "a large representative dataset."
The Notified Body raises seven non-conformities. All seven are fixable. But the fixes take four months and the audit gets postponed. Every single finding maps to one of the seven mistakes below. This is not a theoretical scenario. It is the modal first audit outcome for AI startups.
The seven mistakes (Ship)
Mistake 1. Treating model retraining as a non-significant change. The assumption: "we retrain monthly on new data, it's the same model architecture, nothing changed." The reality: retraining alters the device's outputs. Under the Notified Body significant change framework used across Europe for software changes, a change to the algorithm, performance claims, intended purpose, or the intended patient population triggers re-assessment. A model retrained on a new dataset with different performance characteristics is almost always a significant change. The fix: document a change management procedure that explicitly classifies retraining events, and only deploy retrained models after the change has been assessed against the significant change criteria. If your Notified Body has agreed a predetermined change control plan, that agreement and its scope must be in the file.
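One way to make that procedure concrete is to force every retraining event through an explicit checklist before deployment. A minimal sketch, assuming illustrative criteria names (the real list comes from the significant change framework and any predetermined change control plan agreed with your Notified Body):

```python
from dataclasses import dataclass

@dataclass
class RetrainingAssessment:
    new_data_outside_envelope: bool   # data sources not covered by the agreed scope
    verified_metric_shift: bool       # a performance metric moved beyond agreed limits
    population_affected: bool         # the intended patient population changes
    claims_or_purpose_touched: bool   # performance claims or intended purpose change

def is_significant(a: RetrainingAssessment) -> bool:
    # Conservative default: any triggered criterion is treated as significant,
    # blocking deployment until the conformity assessment question is resolved.
    return any([a.new_data_outside_envelope, a.verified_metric_shift,
                a.population_affected, a.claims_or_purpose_touched])
```

The design choice that matters is the conservative default: a retrained model deploys only after the assessment returns not-significant, never before.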
Mistake 2. A vague or marketing-driven intended purpose. "Helps clinicians make better decisions" is not an intended purpose. It is marketing. The intended purpose under Article 2(12) must be specific enough to anchor classification under Rule 11, scope the clinical evaluation under Article 61, and define the target population and use environment. The fix: write the intended purpose as one precise paragraph answering what, for whom, by whom, in what setting, for what clinical decision, and with what output. Every word in that paragraph carries regulatory weight.
Mistake 3. No drift monitoring in the PMS plan. An AI device experiences model drift because its inputs change over time: patient demographics shift, imaging hardware updates, clinical workflows evolve. A PMS plan that does not describe how drift will be detected, the thresholds for action, and the response pathway does not satisfy the PMS obligations of MDR Articles 83 to 86 for an AI device. The fix: add a dedicated section to the PMS plan on performance monitoring in the field, with defined metrics, thresholds, monitoring cadence, and a link to the CAPA process if thresholds are breached.
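A minimal sketch of the threshold check that section should define, with hypothetical metric floors (real values come from the clinical evaluation, and the CAPA hook is a stand-in for the QMS process the plan must name):

```python
# Hypothetical floors; real values come from the PMS plan and clinical evaluation.
PMS_THRESHOLDS = {"sensitivity": 0.90, "specificity": 0.85}

def open_capa(breached: list[str]) -> None:
    """Stub: in a real QMS this would create a CAPA record with the evidence trail."""
    print(f"CAPA opened for threshold breach: {', '.join(breached)}")

def evaluate_window(field_metrics: dict[str, float]) -> list[str]:
    """Compare one monitoring window's field metrics against the PMS floors.
    Any breach triggers the CAPA pathway named in the PMS plan."""
    breached = [m for m, floor in PMS_THRESHOLDS.items()
                if field_metrics.get(m, 0.0) < floor]
    if breached:
        open_capa(breached)
    return breached

# Example: a quarterly window where specificity has drifted below its floor
evaluate_window({"sensitivity": 0.92, "specificity": 0.83})
```

The point is not the code. It is that metrics, floors, cadence, and the CAPA trigger are all written down before the device ships, so an auditor can trace the pathway from drift signal to corrective action.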
Mistake 4. Clinical evidence limited to internal validation metrics. A held-out test set is model validation, not clinical evidence. MDR Article 61 and Annex XIV Part A require clinical evidence for the device in its intended use, on the intended population, in the intended environment. For an AI diagnostic, this means clinical performance studies or high-quality real-world evidence, not just AUROC on an internal dataset. The fix: the Clinical Evaluation Plan must explicitly bridge the gap from model validation to clinical performance, typically via a clinical investigation or a prospective real-world evaluation. The CER argues safety and performance, not accuracy.
Mistake 5. Unclear human oversight in the intended use. Notified Bodies now ask very directly: what is the human's role in the loop, and what happens when the AI is wrong? An intended purpose that does not specify whether the AI output is advisory, whether a clinician must review every result, and what the fallback is when the AI is unavailable or uncertain leaves both the classification rationale and the risk file undefended. The fix: document human oversight explicitly in the intended purpose, in the Instructions for Use, and in the risk management file. Describe the decision the human retains.
Mistake 6. Undocumented training data provenance and governance. "A large representative dataset" is not a training data description. Notified Bodies expect: where the data came from, how it was collected, consent and legal basis, inclusion and exclusion criteria, demographic composition, quality control steps, how bias was assessed, how the test set was kept independent, and how ongoing datasets are governed. This is a file, not a paragraph. The fix: create a data governance document that lives in the technical documentation and is referenced from the clinical evaluation and the risk management file. Update it every time new data is ingested.
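One way to keep that document complete as data accumulates is to treat every ingestion as a structured record with mandatory fields. A minimal sketch whose field names simply mirror the list above (the actual artefact is a controlled document in the technical file, not code):

```python
from dataclasses import dataclass

@dataclass
class DatasetIngestionRecord:
    source: str                          # originating institution or registry
    collection_method: str               # retrospective export, prospective capture, ...
    legal_basis: str                     # consent or other legal basis for use
    inclusion_criteria: list[str]
    exclusion_criteria: list[str]
    demographic_summary: dict[str, str]  # e.g. age bands, sex, skin phototype mix
    qc_steps: list[str]
    bias_assessment_ref: str             # pointer to the bias assessment for this batch
    test_set_isolation: str              # how independence from training data is enforced
```

Because no field has a default, an incomplete record fails loudly at construction, which is exactly the behaviour you want from a governance process.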
Mistake 7. No link between risk management and model behaviour. The risk file under EN ISO 14971:2019+A11:2021 often treats the AI as a black box. Hazards related to misclassification, false negatives on specific subpopulations, drift, adversarial inputs, or over-reliance by clinicians are not explicitly analysed. The fix: expand the hazard analysis to include AI-specific failure modes. Each hazard gets a risk control, each risk control gets verification, and the residual risk gets a benefit-risk justification in the CER. This is the single most effective way to demonstrate regulatory maturity to a Notified Body reviewing an AI device.
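The structure the auditor wants to see is a traceable chain: hazard, risk control, verification, residual-risk justification. A minimal sketch of that chain for three AI-specific hazards (the controls and verification references are illustrative placeholders, not prescriptions):

```python
from dataclasses import dataclass

@dataclass
class AIHazardRow:
    hazard: str
    risk_control: str
    verification: str
    residual_risk_ref: str  # where the CER justifies the residual risk

ai_hazards = [
    AIHazardRow("False negatives on an underrepresented subpopulation",
                "Subgroup performance requirement plus IFU limitation",
                "Stratified validation report",
                "CER benefit-risk section"),
    AIHazardRow("Silent post-market performance drift",
                "PMS drift monitoring with defined thresholds",
                "PMS plan review and monitoring records",
                "CER benefit-risk section"),
    AIHazardRow("Clinician over-reliance on advisory output",
                "Human oversight defined in intended purpose and IFU",
                "Usability evaluation records",
                "CER benefit-risk section"),
]
```

Every row closes the loop from model behaviour back to the CER, which is precisely the traceability a black-box risk file lacks.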
Reality Check
- Is your intended purpose statement one precise paragraph that answers what, for whom, by whom, in what setting, for what decision, with what output, and with what level of human oversight?
- Does your classification rationale cite Rule 11 directly, walk through the rule's conditions, and justify the specific class in multiple paragraphs, not three lines?
- Does your change management procedure name model retraining as a change type and specify how it is assessed against significant change criteria before deployment?
- Does your PMS plan describe drift monitoring with specific metrics, thresholds, monitoring cadence, and a CAPA trigger?
- Does your Clinical Evaluation Plan explicitly bridge from internal model validation to clinical performance on the target population?
- Is there a standalone training data governance document referenced from your technical file, CER, and risk file?
- Does your risk management file contain AI-specific hazards (misclassification, subpopulation failure, drift, over-reliance), each with risk controls and verification?
- Can a Notified Body auditor open your file and find, within thirty minutes, a defensible answer to "what happens when the AI is wrong"?
Frequently Asked Questions
Is every AI medical device a Class IIa device under Rule 11? No. Rule 11 has multiple branches. Decision-support AI is typically Class IIa, but escalates to IIb or III depending on the severity of the clinical decision it informs. Monitoring AI for vital parameters where variation could cause immediate danger is Class IIb. Always walk through the rule text and MDCG 2019-11 Rev.1 rather than defaulting.
Is model retraining always a significant change? No, but it usually is. A retrained model with different performance, a different intended population, or a changed risk profile is a significant change. A predetermined change control plan agreed with your Notified Body can define a controlled envelope within which certain retraining activities do not trigger a new assessment. Without that agreement, assume retraining is significant until proven otherwise.
What counts as clinical evidence for an AI diagnostic? Evidence that the device, in its intended use, on its intended population, in its intended setting, delivers safe and clinically acceptable performance. Internal validation metrics on a held-out test set are part of the verification evidence but do not by themselves satisfy Article 61. A clinical investigation or a well-designed prospective real-world evaluation is typically needed.
Where does EN 62304 fit in for AI devices? EN 62304:2006+A1:2015 governs the software lifecycle: planning, requirements, architecture, implementation, verification, integration, release, maintenance, problem resolution, and configuration management. It applies to AI devices exactly as it applies to any other medical device software. Training the model is not a substitute for the 62304 lifecycle.
Does MDCG 2019-11 address AI specifically? MDCG 2019-11 Rev.1 is the software qualification and classification guidance and is the operational reference for applying Rule 11 to any software, including AI. It does not contain AI-only rules, but its application to AI devices is how classification is currently defended in Notified Body review.
How does human oversight affect classification? It does not automatically down-classify the device. A decision-support AI that clinicians "review" is still a decision-support AI under Rule 11. What human oversight does is shape the risk profile, the Instructions for Use, and the clinical evaluation. Document it honestly.
Related reading
- Classification of AI/ML software under Rule 11 – the classification walk-through referenced above.
- Continuous learning AI under MDR – how retraining and adaptive algorithms are handled.
- Drift detection in AI medical devices under MDR – the PMS mechanics for mistake number three.
- AI/ML change management and retraining assessment – the significant change framework in operational detail.
- Post-market surveillance for AI devices – extending PMS Articles 83 to 86 to AI-specific signals.
Sources
- Regulation (EU) 2017/745 on medical devices, consolidated text. Article 2(12), Article 10, Article 61, Articles 83 to 86, Annex VIII Rule 11, Annex XIV Part A.
- MDCG 2019-11 Rev.1 (June 2025). Guidance on qualification and classification of software in Regulation (EU) 2017/745.
- EN ISO 13485:2016+A11:2021. Medical devices. Quality management systems.
- EN 62304:2006+A1:2015. Medical device software. Software lifecycle processes.
- EN ISO 14971:2019+A11:2021. Medical devices. Application of risk management to medical devices.