Clinical evaluation of a continuously learning AI/ML device under MDR Article 61 and Annex XIV cannot be performed against a moving target. The only workable approach in 2026 is to scope the clinical evaluation against a locked version of the model — the exact weights, preprocessing pipeline, and inference logic that will be placed on the market — and then define, in the technical documentation, a predetermined boundary inside which future updates are pre-authorised. Clinical evidence is generated against the locked snapshot. Anything that falls outside the predetermined boundary is a significant change that triggers a new clinical evaluation. Post-market clinical follow-up under Annex XIV Part B carries the drift-monitoring load: it detects when real-world performance of the locked or lightly adapted model has drifted away from the clinical evidence base, and it triggers the response pathway before patients are harmed. A clinical evaluation that tries to evidence "the learning system" rather than a specific snapshot is not a clinical evaluation under MDR. It is a proposal the Notified Body cannot accept.

By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.


TL;DR

  • MDR Article 61 and Annex XIV apply to continuously learning AI devices with no carve-out. The clinical evaluation has to evidence a specific, identifiable device configuration — not a system that changes its own behaviour in the field.
  • The practical pattern is the locked-version snapshot: freeze the model, generate the clinical evidence against that frozen version, and treat any change that exits a predetermined boundary as a significant change requiring re-evaluation.
  • The predetermined boundary has to be specific — which parameters can change, on what trigger, within what performance bounds, with what revalidation protocol, with what documentation — and it has to be agreed with the Notified Body as part of the initial conformity assessment.
  • Post-market clinical follow-up under Annex XIV Part B is where drift monitoring lives. For a continuously learning device, PMCF has to detect both input drift and behaviour drift of the model itself, with defined metrics, thresholds, and a response pathway.
  • Some things cannot be evaluated upfront: clinical performance of model versions that do not yet exist, subgroup behaviour in populations not yet seen in the field, and long-horizon outcomes of drift that has not yet happened. The clinical evaluation has to state this honestly and route it to PMCF.
  • The most common mistake is trying to evidence "the learning capability" rather than a concrete snapshot. The second most common is a predetermined boundary so vague it functions as a blank cheque.

The clinical evaluation challenge for adaptive algorithms

The MDR defines clinical evaluation as a systematic and planned process to continuously generate, collect, analyse, and assess the clinical data pertaining to a device, in order to verify its safety and performance, including clinical benefits, when used as intended by the manufacturer (Regulation (EU) 2017/745, Article 2(44)). Article 61 sets out the requirements for that process; Annex XIV Part A sets out the content of the clinical evaluation; Annex XIV Part B governs post-market clinical follow-up.

That framework assumes a fixed referent. The clinical data is "pertaining to a device." The device has a defined intended purpose, a defined configuration, and defined technical characteristics. The clinical evidence tests whether that specific device produces the claimed benefit in the intended use population. Every element of Annex XIV — data identification, data appraisal, data analysis, demonstration of acceptable benefit-risk — is anchored to the device being evaluated.

A continuously learning algorithm detaches the referent. If the model in the field today is not the model that was evaluated, the clinical evidence is no longer evidence about the device on the market. It is evidence about a historical artefact that no longer exists. Every claim in the clinical evaluation report — sensitivity, specificity, subgroup performance, clinical benefit — becomes an assertion about a version of the device that is no longer in use. The Notified Body reading that report in 2026 has to ask a question the manufacturer cannot answer: which device is this clinical evidence actually about?

This is not a philosophical objection. It is an operational one. MDCG 2019-11 Rev.1 (June 2025) is clear that AI software sits inside the existing software qualification and classification framework without a special pathway. EN 62304:2006+A1:2015 lifecycle discipline applies to the development and maintenance of the software, and EN ISO 14971:2019+A11:2021 risk management applies to the hazards, including hazards that arise from adaptation. None of these standards contain a mechanism that lets a manufacturer evidence "a class of possible future models" and have that count as clinical evaluation of whatever the model becomes.

The challenge, then, is how to perform a clinical evaluation that is honest about the fact that the model may change, without writing a clinical evaluation that evidences nothing concrete.

The locked-version snapshot approach

The answer that has emerged in practice, and that Notified Bodies will accept in 2026, is the locked-version snapshot.

The manufacturer freezes the model at a specific version: weights, architecture, preprocessing pipeline, postprocessing logic, calibration parameters, the full set of deterministic behaviours that produce the output for a given input. That frozen version is given a version identifier. The technical documentation describes it precisely enough that an auditor could, in principle, reproduce its behaviour on a reference test set. The clinical evaluation is then conducted against that specific snapshot — the clinical data, the performance metrics, the subgroup analysis, the external validation, the demonstration of clinical benefit, all of it pertains to that one version of the model.
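
To make "precisely enough that an auditor could reproduce its behaviour" concrete, one lightweight way to pin the snapshot is a manifest of content hashes covering every artefact in the frozen version. Below is a minimal Python sketch, assuming the weights, preprocessing configuration, and calibration parameters exist as files on disk; the file names and the 1.0.0 identifier are illustrative, not a prescribed structure.

```python
# Minimal sketch of a locked-version manifest. Assumes the snapshot's
# artefacts (weights, preprocessing config, calibration file) exist as
# files on disk; file names and the "1.0.0" identifier are illustrative.
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash that lets an auditor confirm an artefact is unchanged."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_snapshot_manifest(version: str, artefacts: dict[str, Path]) -> dict:
    """Record exactly which artefacts make up the locked version."""
    return {
        "version": version,
        "artefacts": {
            name: {"path": str(path), "sha256": sha256_of(path)}
            for name, path in artefacts.items()
        },
    }


if __name__ == "__main__":
    manifest = build_snapshot_manifest(
        "1.0.0",
        {
            "weights": Path("model/weights.onnx"),
            "preprocessing": Path("model/preprocessing.yaml"),
            "calibration": Path("model/calibration.json"),
        },
    )
    Path("snapshot_manifest_1.0.0.json").write_text(json.dumps(manifest, indent=2))
```

One reasonable design choice is to keep this manifest with the technical documentation, so that "Version 1.0.0" in the CER always resolves to a verifiable set of artefacts.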

The clinical evaluation report does not claim to evidence a learning system. It claims to evidence Version 1.0.0, placed on the market on a specific date, with a specific intended purpose, for a specific intended use population. For a deeper dive into how this applies to software as a medical device more broadly, see our companion post on clinical evaluation of software as a medical device.

The locked-version snapshot solves the referent problem cleanly. The clinical evidence is about a device that exists, that is reproducible, and that is the device on the market. When the manufacturer wants to release Version 1.1.0 — retrained on new data, or with an updated preprocessing step, or with a refined calibration — that release goes through the change management process. Either the change is covered by the predetermined boundary the Notified Body already approved, or it is a significant change that requires its own clinical evaluation update and notification. There is no magic. There is no hand-waving. Each configuration that reaches patients has clinical evidence traceable to it.

The locked-version snapshot is not anti-learning. It is a discipline that makes learning compatible with a patient-safety framework. The learning happens in the development environment, the new version is tested against a frozen reference set, the clinical evidence is updated, the release goes out. The adaptation runs on a human timescale, not on a silent retraining loop, and every version that reaches patients is a version a Notified Body has seen.

The predetermined boundary concept

For teams that genuinely need bounded adaptation in the field — not a general learning capability, but a specific, defensible set of updates — the pattern is the predetermined boundary, submitted as part of the initial technical documentation.

The boundary is a document that pre-specifies exactly what can change about the model after release without triggering a new clinical evaluation. It defines which parameters are mutable, what triggers a permitted update, what performance bounds the updated model has to satisfy before it can go live, what revalidation protocol is run on every update, what documentation is produced, and how the fleet is kept in sync. The Notified Body assesses the boundary itself — not each future update, but the rules governing the updates — as part of the initial conformity assessment under MDR Article 61 and Annex XIV.
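
One way to keep the boundary assessable is to express it as structured data rather than free prose, so the Notified Body, the QMS, and the release tooling are all reading the same rules. The sketch below shows what that could look like; every parameter name, trigger, and threshold is an illustrative placeholder, not a recommended value.

```python
# Sketch of a predetermined boundary expressed as structured data.
# All parameter names, triggers, and bounds are illustrative placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceBounds:
    min_sensitivity: float
    min_specificity: float
    max_subgroup_drop_pct: float  # max drop vs. the locked snapshot, per subgroup


@dataclass(frozen=True)
class PredeterminedBoundary:
    mutable_parameters: tuple[str, ...]  # the only things allowed to change
    update_trigger: str                  # what initiates a permitted update
    bounds: PerformanceBounds            # gates checked on the frozen reference set
    revalidation_protocol: str           # document ID of the protocol run on every update
    required_records: tuple[str, ...]    # evidence produced and retained on every update


BOUNDARY_V1 = PredeterminedBoundary(
    mutable_parameters=("calibration.A", "calibration.B", "calibration.C"),
    update_trigger="monthly re-estimation on accumulated site data",
    bounds=PerformanceBounds(
        min_sensitivity=0.92, min_specificity=0.88, max_subgroup_drop_pct=2.0
    ),
    revalidation_protocol="VAL-PROTOCOL-014",
    required_records=("version", "data snapshot ID", "validation report", "sign-off"),
)
```

A boundary written this way can be reviewed line by line, and anything the structure cannot express is, by construction, outside the boundary.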

The detailed regulatory mechanics of change control under this pattern are covered in our post on locked versus adaptive AI algorithms under MDR. The clinical-evaluation angle is specific: for every permitted update inside the boundary, the clinical evidence generated against the original snapshot has to remain valid. That is only possible if the boundary is narrow enough that updates inside it cannot plausibly invalidate the clinical evidence.

Narrow looks like: calibration parameters A, B, and C may be re-estimated monthly, on condition that sensitivity on the frozen reference set remains within [x, y], specificity remains within [p, q], performance in every pre-specified subgroup drops by no more than z percentage points, and each update is logged with version, data snapshot, validation results, and sign-off. Vague looks like: the model may be improved as new data arrives. Vague boundaries do not survive Notified Body review, and they should not, because they are not assessable.
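
That narrow formulation translates directly into a go/no-go gate that runs on every candidate update before it can go live. A minimal sketch follows, assuming the metrics come from running the candidate model on the frozen reference test set; the thresholds and subgroup names are invented for illustration.

```python
# Sketch of the go/no-go gate a bounded update has to pass before release.
# Metrics are assumed to come from running the candidate model on the frozen
# reference test set; thresholds and subgroup names are illustrative only.
MIN_SENSITIVITY = 0.92
MIN_SPECIFICITY = 0.88
MAX_SUBGROUP_DROP_PCT = 2.0  # vs. the locked snapshot, in percentage points


def update_within_bounds(metrics: dict) -> bool:
    if metrics["sensitivity"] < MIN_SENSITIVITY:
        return False
    if metrics["specificity"] < MIN_SPECIFICITY:
        return False
    # Every pre-specified subgroup is checked, not just the aggregate.
    return all(
        drop <= MAX_SUBGROUP_DROP_PCT
        for drop in metrics["subgroup_drop_pct"].values()
    )


candidate = {
    "sensitivity": 0.93,
    "specificity": 0.90,
    "subgroup_drop_pct": {"age_over_75": 1.1, "portable_acquisition": 0.4},
}
print("release permitted" if update_within_bounds(candidate) else "outside boundary")
```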

The clinical evaluation report has to explicitly tie the boundary to the evidence. Sentence by sentence, the CER has to say: this performance claim holds for Version 1.0.0; updates inside the predetermined boundary are bounded to preserve this claim within these tolerances; updates outside the boundary are a significant change and will trigger a new clinical evaluation before they reach patients. That is the only way the clinical evidence remains anchored to a concrete device as the model evolves.

PMCF with drift monitoring

Post-market clinical follow-up under Annex XIV Part B is where the continuously learning question really earns its complexity. For any AI device, PMCF has to include drift detection — the pre-market clinical evidence is inevitably generated on a finite snapshot of the world, and the world moves. For a continuously learning device inside a predetermined boundary, the PMCF load is doubled: it has to detect input drift in the field and behaviour drift of the model itself as bounded updates accumulate.

The PMCF plan has to specify which metrics are monitored, what the alert thresholds are, how often the reference-set revalidation runs, and what happens when a threshold is crossed. For input drift: monitoring of input distribution statistics against the distribution the model was trained and validated on, with alerts when drift exceeds defined bounds. For behaviour drift: monitoring of model output statistics against expected ranges, periodic revalidation of the current model version against the frozen reference test set, and tracking of any subgroup where performance is known to be sensitive. For clinical outcomes: where feasible and ethical, tracking of downstream patient outcomes that the device is intended to influence.
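
As one concrete example of an input-drift metric, the Population Stability Index compares the binned distribution of a feature seen in the field against the same feature in the validation data. The sketch below is illustrative only; the choice of feature, the bin count, and the 0.2 alert threshold are assumptions that a real PMCF plan would have to justify and pre-specify.

```python
# Sketch of one input-drift check: Population Stability Index between a
# field feature and the same feature in the validation data. The feature,
# bin count, and 0.2 alert threshold are illustrative assumptions.
import numpy as np


def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    observed_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    expected_frac = np.clip(expected_frac, 1e-6, None)  # avoid log(0) on empty bins
    observed_frac = np.clip(observed_frac, 1e-6, None)
    return float(np.sum((observed_frac - expected_frac) * np.log(observed_frac / expected_frac)))


rng = np.random.default_rng(0)
validation_feature = rng.normal(0.0, 1.0, 5_000)  # distribution the model was validated on
field_feature = rng.normal(0.3, 1.2, 5_000)       # what the device is seeing in the field
drift_score = psi(validation_feature, field_feature)
if drift_score > 0.2:  # illustrative alert threshold
    print(f"Input drift alert: PSI = {drift_score:.2f}; trigger the PMCF response pathway.")
```

Behaviour drift gets the same treatment with different inputs: the statistic is computed over the model's outputs, and the periodic revalidation against the frozen reference set catches shifts that summary statistics miss.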

When a threshold is crossed, the PMCF plan has to specify the response. Options include immediate rollback to the previous model version, pausing further permitted updates inside the boundary, escalating to a full clinical evaluation update, or notifying the Notified Body of a significant change. A PMCF plan that defines metrics without defining the response pathway is incomplete.
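
One way to make that response pathway auditable is to write it down as an explicit mapping from monitored signal to pre-agreed action, so nothing has to be improvised when an alert fires. The signal names and actions below are illustrative, not a template.

```python
# Sketch of a pre-agreed response pathway: each monitored signal maps to an
# action decided before the alert ever fires. Signal names, actions, and
# timelines are illustrative.
RESPONSE_PATHWAY = {
    "input_drift_psi_exceeded": "pause bounded updates; investigate within 5 working days",
    "reference_set_revalidation_failed": "roll back to the previous model version immediately",
    "subgroup_performance_below_bound": "roll back and open a clinical evaluation update",
    "update_outside_predetermined_boundary": "notify the Notified Body of a significant change",
}


def respond(signal: str) -> str:
    # An unmapped signal is itself a finding: the PMCF plan missed a case.
    return RESPONSE_PATHWAY.get(signal, "escalate to PMCF review: unmapped signal")


print(respond("reference_set_revalidation_failed"))
```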

We cover the broader clinical-evidence framework for AI devices in our post on clinical evaluation of AI/ML medical devices and the operational drift-detection patterns in the post on post-market surveillance for AI medical devices.

What cannot be evaluated upfront

Honesty about the limits of pre-market clinical evaluation is itself a regulatory requirement. Several things about a continuously learning device cannot be evaluated upfront, and the clinical evaluation report has to state this directly and route the evidence to PMCF.

Clinical performance of model versions that do not yet exist cannot be evaluated in advance. The CER can describe the predetermined boundary, the revalidation protocol, and the bounds that future updates must satisfy, but it cannot claim evidence for specific future versions. Those versions will generate their own evidence inside the boundary rules when they are released.

Subgroup behaviour in populations that were not represented in the pre-market data cannot be evaluated upfront. The CER has to describe the intended use population, document the subgroups covered by the clinical evidence, and explicitly list the subgroups where evidence is thin. Those gaps become PMCF objectives with defined data-collection plans and timelines.

Long-horizon drift — the slow, cumulative change in the relationship between inputs and clinical reality that happens over years — cannot be evaluated in a pre-market clinical investigation. The CER has to acknowledge that the drift risk exists and that the PMCF plan is the mechanism that will detect it. Pretending otherwise does not make the Notified Body's concern go away; it makes it sharper.

The discipline is to be explicit. The clinical evaluation says what it evidences, what it does not evidence, and where the missing evidence will come from in the post-market phase. An honest CER with defined PMCF gap-closure is more credible than a CER that overclaims pre-market certainty.

Common mistakes

  • Evidencing "the learning system" instead of a snapshot. The CER describes the continuous learning architecture and presents aggregated performance across some notional envelope of possible models. The Notified Body cannot assess a notional envelope against Annex XIV. The fix is to re-scope the CER against a specific locked version and treat the learning architecture as a change control mechanism, not as the object of evaluation.
  • A predetermined boundary that is actually a blank cheque. The boundary is written loosely enough that almost any future update could be claimed to fall inside it. Notified Bodies in 2026 read these boundaries carefully and reject the vague ones. The fix is to narrow the boundary to the specific, small set of updates that the business case actually requires, and write it with concrete triggers, bounds, and revalidation protocols.
  • PMCF that is only passive complaint handling. The PMCF plan relies on waiting for complaints to surface a drift problem. Complaints are lagging indicators; by the time they accumulate to a statistical signal, patients have already been exposed. The fix is active monitoring with defined metrics, thresholds, and response pathways.
  • No reference test set for revalidation. The manufacturer has no frozen reference dataset against which every bounded update can be re-evaluated. Without it, there is no way to verify that an update inside the boundary has actually stayed inside the performance bounds. The fix is to build the reference set before the first release and lock it under access control; a minimal integrity check is sketched after this list.
  • Treating the CER as a one-time deliverable. The team ships the CER for Version 1.0.0 and moves on. Clinical evaluation under MDR Article 61 is a continuous process — the CER has to be updated throughout the lifecycle, especially for a device where the model is expected to evolve. The fix is to build CER updates into the QMS cadence from day one.
  • Confusing technical revalidation with clinical evaluation. A team runs the updated model through the reference test set, confirms the metrics are inside bounds, and assumes the clinical evaluation is done. Technical revalidation is necessary but not sufficient; the clinical evaluation also has to reason about whether the update affects clinical benefit, subgroup risk, and the failure modes the risk file identified. The fix is to run clinical evaluation impact assessment alongside technical revalidation on every update.
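
For the reference test set point above, a minimal integrity check looks like the sketch below: hashes recorded when the set was locked are compared to the files on disk, and revalidation refuses to run if anything has changed. The paths and manifest format are illustrative assumptions.

```python
# Sketch of verifying the frozen reference test set before a revalidation
# run. Hashes recorded at lock time are compared to the files on disk;
# paths and the manifest format are illustrative.
import hashlib
import json
from pathlib import Path


def current_hashes(data_dir: Path) -> dict[str, str]:
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        str(p.relative_to(data_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in files
    }


def verify_reference_set(data_dir: Path, manifest_path: Path) -> None:
    locked = json.loads(manifest_path.read_text())
    if current_hashes(data_dir) != locked:
        raise RuntimeError("Reference set differs from the locked manifest; revalidation aborted.")


# Example usage (hypothetical paths):
# verify_reference_set(Path("reference_set/"), Path("reference_set_manifest.json"))
```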

The Subtract to Ship angle

The Subtract to Ship framework applied to the clinical evaluation of continuously learning devices produces a sharp subtraction: in almost every case, cut the continuous learning ambition from the initial clinical evaluation scope. Ship the locked snapshot. Evidence it properly. If the business case genuinely requires post-market adaptation, add back exactly the narrowest predetermined boundary that moves the clinical needle — not a general learning capability, but a specific, bounded set of updates you can defend sentence by sentence to a Notified Body.

The subtraction is not anti-adaptation. It is anti-ambiguity. Every degree of vagueness in the predetermined boundary costs months in Notified Body review and creates a trap where updates that the team believes are covered turn out not to be. The clinical evaluation report that evidences one concrete version of the device, with a tight boundary and a serious PMCF plan, ships faster and lasts longer than a report that gestures at learning without committing to anything specific. Subtract the aspiration, ship the snapshot, expand inside a disciplined envelope.

Reality Check — Where do you stand?

  1. Can you point to a specific, version-identified snapshot of your model and say "this is the device the clinical evaluation evidences" — architecture, weights, preprocessing, calibration, all locked?
  2. Is your clinical evidence generated against that snapshot, or against an aggregated set of runs from models that have since changed?
  3. If you are relying on a predetermined boundary for field updates, is it written with concrete triggers, performance bounds, and revalidation protocols — not general language about model improvement?
  4. Does your CER explicitly state which clinical claims hold for the locked snapshot, which are bounded by the predetermined boundary, and which are routed to PMCF as gap-closure items?
  5. Does your PMCF plan include active drift monitoring with defined metrics, thresholds, and response pathways for both input drift and model behaviour drift?
  6. Do you have a frozen reference test set under access control that every bounded update is revalidated against before it reaches patients?
  7. When an update inside the boundary is released, does your QMS run a clinical evaluation impact assessment alongside the technical revalidation, or does it stop at the metrics?
  8. Have you engaged your Notified Body on the predetermined boundary before finalising the technical documentation, or are you planning to find out their position at audit?

Frequently Asked Questions

Can I perform a clinical evaluation on a continuously learning AI device under MDR? Only against a specific locked version of the model. MDR Article 61 and Annex XIV require clinical data that pertains to the device being evaluated, and that is only coherent when the device has a fixed configuration. The workable pattern is to freeze a version, evidence it, and define a predetermined boundary for bounded future updates — each update outside the boundary is a significant change that requires its own clinical evaluation.

What exactly is a locked-version snapshot for clinical evaluation purposes? It is a version of the model — weights, architecture, preprocessing pipeline, postprocessing logic, calibration — that is frozen, version-identified, and described in the technical documentation precisely enough to be reproducible. The clinical evidence in the CER is generated against that specific snapshot, and every claim in the CER pertains to that version of the device.

What is a predetermined boundary and how does it interact with clinical evaluation? The predetermined boundary is a pre-specification, in the initial technical documentation, of which model parameters can change in the field, on what trigger, within what performance bounds, with what revalidation protocol. The clinical evaluation report has to tie the existing clinical evidence to that boundary: updates inside the boundary are bounded to preserve the clinical claims, updates outside are significant changes that trigger a new clinical evaluation.

How does PMCF differ for a continuously learning AI device? PMCF under Annex XIV Part B has to detect both input drift — the distribution in the field moving away from the distribution the model was trained on — and behaviour drift of the model itself as bounded updates accumulate. The plan has to define metrics, thresholds, and a response pathway for each, with active monitoring rather than passive complaint handling.

What if my CER cannot evidence some aspects of the device upfront? State that explicitly and route the missing evidence to PMCF with a defined data-collection plan and timeline. Clinical performance of future model versions, subgroup behaviour in under-represented populations, and long-horizon drift cannot be evaluated in a pre-market investigation. An honest CER with defined PMCF gap-closure is stronger than an overclaiming CER.

Can I argue that the whole learning system, rather than a specific snapshot, is the device under evaluation? No Notified Body in 2026 will accept that framing. Annex XIV requires clinical evidence about the device being placed on the market, and "the learning system" is not a device a Notified Body can assess against Article 61. The assessable unit is a specific version plus a specific predetermined boundary.

Sources

  1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 61 (clinical evaluation), Annex XIV Part A (clinical evaluation), Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
  2. MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR, October 2019, Revision 1 June 2025.
  3. EN 62304:2006 + A1:2015 — Medical device software — Software life-cycle processes.
  4. EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.

This post is part of the Clinical Evaluation and Clinical Evidence category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. The continuously learning question is the one AI MedTech founders most often hope has a clever answer, and the honest answer is that the clever answer is discipline — lock the snapshot, narrow the boundary, instrument the drift detection, and be explicit in the CER about what is evidenced today and what is routed to PMCF tomorrow. If the shape of the right boundary and the right PMCF plan for your specific product is not obvious after reading this post, that is expected — it is exactly the kind of decision where a sparring partner who has walked other AI MedTech teams through the same Notified Body conversation earns their keep.