Clinical evaluation of an AI or ML medical device follows the same MDR Article 61 and Annex XIV process as any other device: a planned, ongoing generation and appraisal of clinical data to verify safety, performance, and clinical benefit for the intended purpose. The additions for AI are not conceptual — they are empirical. The clinical evidence has to include algorithm performance metrics on an independent test set, subgroup analysis for bias, external validation on data the model has never seen, and a post-market clinical follow-up plan that detects drift. A clinical evaluation that reports only technical accuracy on the training distribution is not a clinical evaluation under MDR. It is a model evaluation that has been mislabelled.
By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.
TL;DR
- MDR Article 61 and Annex XIV apply to AI medical devices exactly as they apply to any other device. There is no AI carve-out and no reduced clinical evidence pathway for AI.
- The AI-specific content of the clinical evaluation sits inside that same framework: performance metrics on an independent test set, bias analysis across relevant subgroups, external validation, and characterisation of failure modes.
- Equivalence under MDR Article 61(4) and Annex XIV Part A(3) is technically possible for AI devices but practically difficult, because two models with the same intended purpose can behave very differently on the same inputs. MDCG 2020-5 sets the bar.
- Clinical investigation under MDR Article 62 following EN ISO 14155:2020+A11:2024 is the default route when literature and equivalence cannot carry the evidence. For AI, a prospective or retrospective performance study on independent data is usually part of the evidence package.
- Post-market clinical follow-up under Annex XIV Part B is where the AI-specific work really lives. A PMCF plan without drift monitoring is incomplete for an AI medical device.
- The most common pitfall is conflating technical performance (AUC on a held-out slice of the training data) with clinical performance (benefit to the patient in the intended use population). The Notified Body will catch this, and should.
Why this is the hardest chapter for AI MedTech founders
Most AI MedTech founders come into MDR with a model that already works. It works in the sense that it produces sensible outputs on the data the team has collected, and the team has charts showing it. The ROC curve looks good. The confusion matrix looks good. The paper draft exists.
Then the Notified Body asks for the clinical evaluation, and the team discovers that none of that work is clinical evidence in the MDR sense. It is model evaluation — useful, necessary, but one layer below what Annex XIV actually asks for. Clinical evaluation is about whether the device, used as the manufacturer intends it to be used, in the population the manufacturer intends, produces a clinical benefit that outweighs the risks. Model performance is a necessary input to that question. It is not the answer.
This is the chapter where AI founders most often have to go back and do work they thought was already done. The good news is that if the team has been disciplined about dataset management and test set isolation from the start, most of the raw material is there. The bad news is that if the team has not been disciplined — if the test set has leaked into training runs, if the labels are inconsistent, if the intended use population was never defined — the clinical evaluation cannot be fixed with a better report. The underlying work has to be redone.
For a broader introduction to clinical evaluation itself, see our post on what clinical evaluation under MDR is. This post assumes that foundation and focuses on the AI-specific overlay.
Clinical evaluation applies to AI the same way — MDR Article 61
MDR Article 61 establishes that the manufacturer of every medical device shall plan, conduct, and document a clinical evaluation in accordance with Article 61 and Part A of Annex XIV. Clinical evaluation is defined as a systematic and planned process to continuously generate, collect, analyse, and assess clinical data pertaining to a device, in order to verify safety, performance, and the clinical benefits when used as intended by the manufacturer. (Regulation (EU) 2017/745, Article 61 and Article 2(44).)
The Regulation does not say "unless the device contains an AI model." It does not carve out software. It does not carve out algorithmic devices. Every medical device under MDR has to have a clinical evaluation, documented in a clinical evaluation report (CER), updated throughout the lifecycle of the device, and reviewed by a Notified Body for any device at Class IIa or above. MDCG 2019-11 Rev.1 (June 2025) confirms that AI and ML software fall under the same qualification and classification regime as any other software, which means the same clinical evaluation obligations flow through.
The conceptual framework — generate, collect, analyse, assess; clinical data; intended purpose; safety and performance and clinical benefit — is identical to the framework for a physical device. What differs is what goes into the "generate" and "collect" and "analyse" steps, because the failure modes of an AI device are different, and the clinical data has to be adequate to show safety and performance in the presence of those failure modes.
The additions for AI: what the clinical evidence package has to show
An AI clinical evaluation has to address several questions that a classical clinical evaluation does not have to address, or addresses less formally. These are not bolted on to the CER at the end. They are built into the clinical data strategy from the start.
Algorithm performance metrics on an independent test set. The evidence has to include quantitative performance — sensitivity, specificity, positive and negative predictive value, AUC, calibration, and whichever metrics are appropriate for the clinical task — measured on a test set that was never used during training, hyperparameter tuning, or model selection. Test set leakage is the single most common finding Notified Bodies raise on AI submissions, and it will fail a clinical evaluation regardless of how good the headline number looks. The test set has to be genuinely independent, its provenance has to be documented, and its representativeness has to be defensible.
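The headline metrics named above all fall out of the confusion matrix on the independent test set. A minimal sketch in Python, for illustration only: the metric set, confidence intervals, and acceptance thresholds for a real CER come from the pre-specified statistical analysis plan, not from code like this.

```python
def performance_metrics(y_true, y_pred):
    """Core binary classification metrics from ground-truth labels
    and model predictions on an independent test set.

    Illustrative sketch; a real evaluation would also report
    confidence intervals, AUC, and calibration."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical labels: 4 positives, 6 negatives in the test set
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(performance_metrics(y_true, y_pred))
```

Note that PPV and NPV depend on the prevalence in the test set, which is one more reason the test set has to reflect the intended use population rather than an enriched development sample.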
Test dataset independence and representativeness. Independence is a necessary condition, not a sufficient one; the test set also has to be representative. An independent test set that comes from the same hospital, the same scanner, the same patient demographic as the training set does not tell you how the device performs in the intended use population. The CER has to describe who the intended users are, where the device is intended to be used, on which patients, with which workflows, and then show that the test data reflects those conditions. This is where MDCG 2020-5 principles on clinical data relevance apply with full force.
Bias testing across subgroups. The clinical evaluation has to report performance broken down by the subgroups where bias is clinically plausible — age, sex, ethnicity, comorbidities, disease severity, scanner or device type, geographic region, whichever are relevant to the intended use. A model with 92% overall accuracy that drops to 68% for a specific subgroup is not a 92% device; it is a device with a known safety issue in a defined population, and the CER has to either show that the subgroup is outside the intended purpose or justify the residual risk under EN ISO 14971:2019+A11:2021. Bias analysis is a hazard analysis activity as much as a performance activity, and the CER and the risk management file have to tell a consistent story.
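As a sketch of what a subgroup breakdown looks like mechanically, the fragment below groups test-set results by a stratification key and reports per-group accuracy. The record structure and group names are hypothetical; a real analysis would stratify by the clinically justified subgroups and report per-group sensitivity and specificity with confidence intervals, not a single accuracy number.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Per-subgroup accuracy from (subgroup, y_true, y_pred) tuples.

    The subgroup key (age band, sex, scanner type, ...) comes from
    the intended-use and hazard analysis. Illustrative only."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical test-set results stratified by scanner type
records = [
    ("scanner_A", 1, 1), ("scanner_A", 0, 0), ("scanner_A", 1, 1),
    ("scanner_B", 1, 0), ("scanner_B", 0, 0),
]
print(subgroup_accuracy(records))  # scanner_A: 1.0, scanner_B: 0.5
```

The point of the exercise is the gap, not the averages: a per-group table like this is what lets the CER and the risk management file tell the consistent story the paragraph above demands.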
External validation. Internal test set performance is necessary but not sufficient. External validation — evaluating the model on data from a site, scanner, or population that was not part of the training data pipeline at all — is what tells you whether the model generalises beyond its development environment. For many AI clinical evaluations, the external validation set is the backbone of the clinical evidence, because it is the closest proxy to what will happen when the device is placed on the market. When external validation is not available at the time of initial CE marking, the PMCF plan has to collect it.
Equivalence for AI — can you claim it?
MDR Article 61(4) and Annex XIV Part A(3) allow a manufacturer to base clinical evaluation on data from an equivalent device under defined conditions. MDCG 2020-5 (April 2020) sets out the criteria: the two devices have to be equivalent in their technical, biological, and clinical characteristics to a degree that rules out any clinically significant difference in safety and performance. The manufacturer has to have sufficient levels of access to the data of the equivalent device — the principles from MDCG 2023-7 on "sufficient levels of access" apply where relevant.
For AI devices, equivalence is technically permitted and practically rare. The reason is that two AI models with the same intended purpose, trained on different data, with different architectures, can produce genuinely different clinical behaviour on the same inputs. Two convolutional networks reading the same chest X-ray will agree most of the time and disagree on the hard cases, which are precisely the cases where clinical evidence matters. Technical characteristics — the architecture, the training data, the input preprocessing, the output calibration — are usually different enough between any two AI products that a defensible equivalence claim is difficult to construct.
Notified Bodies in 2026 are increasingly sceptical of equivalence claims for AI devices, and founders should expect that scepticism. Equivalence can still have a role as supporting evidence, or as a framing for comparison with the current standard of care, but it is rarely the primary clinical evidence route for an AI device. The primary route is usually the manufacturer's own performance data combined with clinical investigation where needed. We cover equivalence in detail in our post on equivalence under MDCG 2020-5.
Clinical investigation considerations for AI
When literature and equivalence cannot carry the clinical evidence — which, for most AI devices at Class IIa and above, is the default — a clinical investigation under MDR Articles 62 to 82, following EN ISO 14155:2020+A11:2024, becomes part of the package. For AI devices the clinical investigation is often structured as a performance study rather than a traditional interventional trial. The device produces an output; the output is compared against a reference standard (histopathology, expert consensus, long-term outcome); the statistical analysis quantifies sensitivity, specificity, and whichever endpoint the intended purpose demands.
There are AI-specific design choices the clinical investigation protocol has to make. Is the study retrospective on archived data, or prospective on new patients? Retrospective studies are faster and cheaper but carry selection bias risk and depend entirely on the quality of the archived data and the ability to freeze the model before the data is looked at. Prospective studies are the cleaner evidence but take longer and cost more. For high-risk AI devices — Class IIb decision-support in critical care, Class III therapy control — a prospective clinical investigation is usually expected. For lower-risk AI devices where good archival data exists, a well-designed retrospective performance study with a pre-specified, locked model and a pre-specified analysis plan can carry the evidence.
Either way, EN ISO 14155:2020+A11:2024 applies. The investigation has to be planned, approved by ethics and competent authorities where required, monitored, and reported. The AI-specific elements — the locked model version, the input preprocessing pipeline, the test data isolation — have to be fixed in the protocol before the data is unblinded. A performance study that was "adjusted" after looking at the results is not a clinical investigation. It is a model tuning run with a clinical label.
PMCF with drift monitoring
Post-market clinical follow-up under Annex XIV Part B is the part of the clinical evaluation that runs after the device is on the market. MDR requires every manufacturer to plan PMCF as part of the clinical evaluation and to execute it throughout the lifecycle of the device. For classical devices, PMCF is mostly about confirming that the pre-market clinical data generalises to real-world use and detecting rare adverse events that a clinical investigation was too small to see.
For AI devices, PMCF carries an additional load: drift monitoring. A model that performed at 91% sensitivity at the time of CE marking does not necessarily still perform at 91% six months later, because the input distribution in the field can drift away from the distribution the model was trained and validated on. New imaging hardware changes pixel statistics. Changes in clinical guidelines shift which patients get referred. Seasonal disease patterns shift the prevalence. The model has not moved, but its effective performance in the field has.
A PMCF plan for an AI medical device has to include an operational mechanism to detect drift. The specific form depends on the device — input distribution monitoring, model output monitoring, periodic re-evaluation against held-out reference data, clinical outcome tracking where it is feasible — but the principle is the same. Drift has to be detected before it causes harm, not after. The PMCF plan has to specify the metrics, the thresholds at which action is triggered, and the response pathway when a threshold is crossed. A PMCF plan that says "we will review complaints quarterly" is not adequate for an AI medical device. Complaints are a lagging indicator; drift is the leading indicator, and the PMCF plan has to look at the leading indicator.
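One common way to operationalise input distribution monitoring is the Population Stability Index (PSI) between the binned input distribution at validation time and the distribution observed in the field. The sketch below assumes pre-defined bins and uses the widely cited (but device-specific, and not MDR-mandated) convention that PSI above 0.2 warrants review; the real metrics, bins, and thresholds belong in the PMCF plan.

```python
import math

def psi(baseline_props, field_props, eps=1e-6):
    """Population Stability Index between two binned distributions.

    baseline_props: bin proportions of an input feature at validation.
    field_props: proportions of the same bins observed post-market.
    Zero means no shift; larger values mean larger drift. Sketch only."""
    total = 0.0
    for b, f in zip(baseline_props, field_props):
        b = max(b, eps)  # guard against empty bins
        f = max(f, eps)
        total += (f - b) * math.log(f / b)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # binned input feature at validation
field = [0.10, 0.20, 0.30, 0.40]     # same bins observed in the field
score = psi(baseline, field)
if score > 0.2:  # hypothetical review threshold from the PMCF plan
    print(f"PSI {score:.3f} above threshold: trigger PMCF review")
```

This is the leading-indicator logic in miniature: the check runs on inputs the device sees every day, so drift surfaces before it shows up in complaints or outcomes.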
We cover the operational patterns in depth in our post on post-market clinical follow-up for AI medical devices.
The common AI CE pitfall: conflating technical performance with clinical performance
Here is the mistake we see most often on AI clinical evaluations, and it is the mistake that most often comes back from a Notified Body as a major finding.
A team has built a model. The model has a great AUC on a held-out slice of the training data. The CER reports the AUC, presents it as the clinical performance of the device, and submits. The Notified Body reads the CER and asks the questions the CER should have already answered: What is the intended use population? Does the test data reflect it? Was the test set truly independent of the training pipeline? What is the performance in the subgroups where bias is plausible? What is the clinical benefit to the patient, not the technical accuracy of the model? How does the device change the clinician's decision, and how does that change propagate into patient outcomes? What happens when the model is wrong — how does the clinician catch it, and is the workflow designed so that human oversight is effective?
None of those questions are answered by an AUC number. Technical performance is a necessary input to clinical performance. It is not the same thing. Clinical performance, in the MDR sense, is about what happens to the patient when the device is used in the real clinical workflow the manufacturer intends. The CER has to connect the technical metrics to that clinical reality, and it has to do so with data, not assertions.
The cleanest way to avoid this pitfall is to write the clinical evaluation plan before the model is finalised. The plan forces the team to define intended purpose, intended use population, clinical endpoints, and evidence sources up front. The model development then serves the evaluation plan, rather than the evaluation being reverse-engineered to fit a model that already exists. Teams that do it in this order rarely hit the Notified Body pitfall. Teams that do it in the other order almost always do.
The Subtract to Ship approach — clinical evaluation for AI without bloat
The Subtract to Ship framework applied to AI clinical evaluation looks like this.
Scope tightly. The intended purpose statement is the single most consequential sentence in the CER. A narrow, precise intended purpose — one clinical task, one user group, one care setting — is cheaper to evidence than a broad one. Broadening the intended purpose to future-proof the CE mark is the classic mistake; every new population you add is another subgroup you have to evidence. Ship the narrow purpose first, expand later through a change notification.
Freeze the model early. The clinical evaluation evidence only counts against the version of the model that gets CE marked. Every day the model keeps changing, the test data you collected becomes stale. Freeze the architecture, the training data, the preprocessing, and the model weights at a defined version. Collect clinical evidence against that frozen version. Treat any change to the model after that point as a change control event, with defined re-evaluation criteria.
Isolate the test set from day one. Test set contamination is impossible to fix after the fact. Build a locked, documented, access-controlled test set at the start of the project. Do not look at it during model development. When the time comes for clinical evaluation, the test set is there and it is credible. This is free if you do it from the start and impossibly expensive if you do not.
Build PMCF into the product, not just the paperwork. Drift monitoring is an engineering feature, not a document. The telemetry hooks, the reference data pipeline, the threshold logic, and the alerting have to be built into the product. A PMCF plan that depends on manual quarterly reviews by an overworked team is a PMCF plan that will fail in month six.
Do not duplicate model evaluation and clinical evaluation. A well-structured AI project has one test set, one evaluation protocol, one statistical analysis plan — and the outputs flow into both the technical documentation and the clinical evaluation report. Running two parallel evaluations, one for the data scientists and one for the regulatory team, wastes effort and creates inconsistency. Subtract the duplication.
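The test-set isolation principle above can be made auditable with very little machinery: fingerprint the locked set once at project start, and any later re-computation either matches or proves the set changed. The file names and helper below are hypothetical, a sketch of the idea rather than a prescribed mechanism.

```python
import hashlib
import json

def lock_manifest(file_hashes):
    """Tamper-evident fingerprint of a locked test set.

    file_hashes maps each test-set file name to its content digest.
    Recording this manifest hash at lock time lets an auditor verify
    later that no file was added, removed, or altered. Sketch only."""
    canonical = json.dumps(file_hashes, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Hypothetical manifest: file name -> per-file digest (placeholders)
manifest = {"case_0001.dcm": "ab12...", "case_0002.dcm": "cd34..."}
lock_hash = lock_manifest(manifest)

# Re-computing over the same manifest must reproduce the hash exactly;
# any difference means the locked test set was modified.
assert lock_manifest(dict(manifest)) == lock_hash
```

The same pattern serves the model-freeze principle: hashing the frozen weights and preprocessing configuration at CE-marking time gives the change-control process a concrete baseline to diff against.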
Reality Check — Where do you stand?
- Can you state the intended use population of your AI device in one precise sentence, including the clinical task, the user group, and the care setting?
- Is your test set isolated — never used for training, hyperparameter tuning, or model selection — and documented to a level where a Notified Body auditor could verify the isolation?
- Does your clinical evidence include performance broken down by the subgroups where bias is clinically plausible for your intended use population?
- Have you performed external validation on data from a site or population that was not part of the training data pipeline, or do you have a PMCF plan that commits to collecting it?
- If you are claiming equivalence to another AI device, can you defend that claim under MDCG 2020-5 on technical, biological, and clinical characteristics, with sufficient access to the other device's data?
- If you are running a clinical investigation, is the protocol compliant with EN ISO 14155:2020+A11:2024, with the model version and analysis plan locked before the data is unblinded?
- Does your PMCF plan include active drift monitoring with defined metrics, thresholds, and a response pathway — not passive complaint handling?
- Does your CER connect technical performance to clinical benefit in the intended use population, or does it stop at the AUC?
- Was the clinical evaluation plan written before the model was finalised, or is the CER reverse-engineered from a model that already exists?
Frequently Asked Questions
Does MDR require a separate clinical evaluation process for AI medical devices? No. MDR Article 61 and Annex XIV apply to AI medical devices with the same structure and the same obligations as for any other device. The AI-specific elements — independent test data, bias analysis, external validation, drift monitoring in PMCF — sit inside that framework, not outside it. There is no reduced-evidence pathway for AI under MDR in 2026.
Can I use equivalence to another AI device for my clinical evaluation? In principle yes, under MDR Article 61(4) and Annex XIV Part A(3), with MDCG 2020-5 as the guidance. In practice it is difficult, because two AI models with the same intended purpose can behave very differently on the same inputs, and the technical characteristics are usually different enough to block a defensible equivalence claim. Most AI clinical evaluations rely on the manufacturer's own performance data and clinical investigation rather than equivalence.
Is an AUC on a held-out test set enough clinical evidence for an AI device? No. An AUC on a held-out test set is model evaluation, not clinical evaluation. Clinical evaluation under MDR asks what the device does for the patient in the intended use population when used in the intended workflow, which the AUC alone cannot answer. The CER has to address intended use population, subgroup performance, external validation, failure modes, and clinical benefit, using the technical metrics as one input among several.
Do I need a clinical investigation for my AI device? It depends on the class, the intended purpose, and whether literature and the manufacturer's own performance data can carry the evidence. For Class IIb decision-support in critical care and Class III therapy control, a clinical investigation is usually expected. For lower-risk devices, a well-designed retrospective performance study with a locked model and a pre-specified analysis plan can sometimes suffice. EN ISO 14155:2020+A11:2024 applies whenever a clinical investigation is run.
How does PMCF differ for an AI device compared to a classical device? PMCF for an AI device has to include drift monitoring. The input distribution in the field can drift away from the training and validation distribution over time — new hardware, shifting patient mix, changing guidelines — and the model's effective performance can degrade silently. The PMCF plan has to define metrics, thresholds, and a response pathway for drift detection. Passive complaint handling is not adequate PMCF for an AI medical device.
What is the single most common finding Notified Bodies raise on AI clinical evaluations? Test set contamination, followed closely by missing subgroup analysis. A test set that has leaked into training, hyperparameter tuning, or model selection is not an independent test set, and the performance numbers derived from it are not credible. A clinical evaluation that reports only overall performance without subgroup breakdown does not address the bias risk that every AI device carries. Both findings are major and both can delay certification by months.
Related reading
- What Is Clinical Evaluation Under MDR? — the foundational post on the Article 61 and Annex XIV process that this post builds on.
- Clinical Evaluation Report (CER) Structure Under MDR — how the CER is structured and what each section has to contain.
- Clinical Investigation Under MDR: When You Need One — the decision logic for when a clinical investigation is required.
- Post-Market Clinical Follow-Up (PMCF) Under MDR — the PMCF framework that the AI drift-monitoring overlay sits on top of.
- Equivalence Under MDCG 2020-5 — the detailed walk-through of the equivalence criteria that AI devices rarely meet.
- AI Medical Devices Under MDR: The Regulatory Landscape — the pillar post that frames the full AI MedTech regulatory picture.
- Machine Learning Medical Devices Under MDR — the companion post on ML development discipline under MDR.
- Locked Versus Adaptive AI Algorithms Under MDR — the open question on continuous learning and why locked models are the default pathway in 2026.
- Post-Market Surveillance for AI Medical Devices — drift detection and operational PMS patterns that pair with the PMCF plan described here.
- AI/ML Medical Device Compliance Checklist 2027 — the consolidated checklist for AI MedTech founders preparing for certification.
- The Subtract to Ship Framework for MDR Compliance — the methodology that runs through every post in this blog.
Sources
- Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 61 (clinical evaluation), Annex XIV Part A (clinical evaluation), Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
- MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR, October 2019, Revision 1 June 2025.
- MDCG 2020-5 — Clinical Evaluation — Equivalence: A guide for manufacturers and notified bodies, April 2020.
- EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.
- EN 62304:2006 + A1:2015 — Medical device software — Software life-cycle processes.
- EN ISO 14155:2020 + A11:2024 — Clinical investigation of medical devices for human subjects — Good clinical practice.
This post is part of the AI, Machine Learning and Algorithmic Devices category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Clinical evaluation is the chapter where AI MedTech founders most often discover that work they thought was done has to be done again. If the distance between the model evaluation you already have and the clinical evaluation MDR actually asks for is bigger than this post can close, that is expected — the gap is where a partner who has walked other AI founders through the same CER review earns their keep.