Drift detection in an AI medical device is the active, instrumented monitoring of whether the model's inputs, outputs, and performance in the field still match the configuration that was certified. Under MDR Article 83 and Annex XIV Part B, a post-market surveillance system has to be proportionate to the risk class and appropriate for the device — and for an AI device whose effective performance can degrade silently when the input distribution shifts, appropriate means drift detection with defined metrics, defined thresholds, and a defined response pathway. MDCG 2025-10 (December 2025) is the current operational guidance on PMS and PMCF. EN ISO 14971:2019+A11:2021 is where the drift hazards are identified and where the effectiveness of the drift controls is verified over time. A drift detection strategy that only exists on paper is not a strategy.
By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.
TL;DR
- Drift is the gap that opens over time between the world the model was certified against and the world the model is actually running in. The model does not have to change for drift to harm patients.
- MDR Article 83 requires a PMS system proportionate to the risk class and appropriate for the device. For an AI device, appropriate means active drift detection — not complaint intake alone.
- Annex XIV Part B requires post-market clinical follow-up that confirms clinical performance and safety in the intended use population, which is where drift shows up clinically.
- Data drift, concept drift, and label drift have different signals and different response pathways. A single generic drift monitor is rarely adequate.
- A locked reference dataset, held outside training and validation, evaluated on a defined cadence, is the backbone of any defensible drift detection strategy.
- Thresholds for escalation have to be numerical and set before drift happens, not improvised when it does.
- MDCG 2025-10 (December 2025) frames the operational PMS obligations that drift detection discharges for AI devices. EN ISO 14971:2019+A11:2021 is the risk management standard the drift findings feed back into.
What drift actually means for an AI medical device
Drift is a word that gets used loosely in machine learning and precisely in regulatory work. The precise version is the one worth carrying into a PMS plan.
Drift is the gap that opens over time between the world the model was certified against and the world the model is now running in. The model weights have not moved. The code has not changed. The version number in the technical file is still the version number in production. But the effective behaviour of the device — what it does for patients on a given day, in a given hospital, on a given scanner — has drifted away from the behaviour that was assessed, because something around the model has moved.
The thing that moved can be upstream. A CT scanner firmware update changes pixel statistics. A lab changes its assay calibration. A new imaging protocol becomes the default in radiology. The thing that moved can be in the patient population. A new referral pattern shifts the mix. A seasonal disease pattern changes the prevalence of the condition the model is trained to detect. A new clinical guideline changes who is scanned in the first place. The thing that moved can be in the clinical interpretation layer. A downstream clinician's use of the model's output changes, and the effective clinical benefit moves with it.
In every one of these cases, the model is innocent and the performance has still degraded. That is the failure mode drift detection exists to catch. For the broader framing of how this sits inside the PMS system, see post-market surveillance for AI devices.
Types of drift — data, concept, label, and prior
A drift detection strategy that treats drift as a single phenomenon will miss half of what it needs to catch. The useful distinction is between four kinds of drift.
Data drift is a shift in the distribution of inputs the model sees. The features of the data arriving at the model are different from the features in the training and validation sets. For an imaging model, this can be a shift in intensity histograms, resolution, acquisition parameters. For a structured-data model, it can be a shift in feature distributions or missingness patterns. Data drift is usually the earliest signal available because it does not require ground truth — the inputs alone tell the story.
Concept drift is a shift in the relationship between inputs and the correct output. The input distribution may look the same, but the mapping from inputs to the true clinical answer has changed. A new variant of a disease presents differently. A treatment change means that the same radiographic feature now means something different clinically. Concept drift is harder to detect because it requires ground truth to see clearly, and ground truth in medical devices often arrives late.
Label drift is a shift in the distribution of the ground-truth labels in the population. The prevalence of the condition the model is trained to detect changes. Seasonal, demographic, or guideline changes all do this. Label drift can look like output drift on the model's side but is driven by the population, not the model.
Prior drift is a shift in the base rates used implicitly in the model's calibration. A calibrated probability output that assumed a 12% positive rate in training is no longer well-calibrated when the field rate is 25%.
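The prior-drift example above can be made concrete. Under a pure label/prior shift (the class-conditional input distributions stay fixed and only the base rate moves), a calibrated probability can be re-adjusted in closed form by rescaling its odds. The function name and structure below are illustrative, not from any particular library; this is a sketch of the standard prior-correction identity, not a substitute for re-validating calibration in the field.

```python
def adjust_for_prevalence(p, train_prev, field_prev):
    """Re-calibrate a probability from a model calibrated at prevalence
    `train_prev` so it is calibrated at `field_prev`, assuming pure
    label/prior shift (class-conditional distributions unchanged)."""
    if p in (0.0, 1.0):
        return p
    # Ratio of field prior odds to training prior odds
    w = (field_prev / (1 - field_prev)) / (train_prev / (1 - train_prev))
    odds = p / (1 - p) * w
    return odds / (1 + odds)
```

For the article's example of a 12% training prevalence against a 25% field rate, a raw model score of 0.50 corresponds to roughly 0.71 after adjustment, which is exactly the miscalibration gap a prior-drift monitor exists to surface.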
A drift detection strategy names which of these matter most for the specific product, based on the risk analysis under EN ISO 14971:2019+A11:2021, and instruments each one separately. One generic monitor is rarely adequate for a Class IIa or higher AI device.
Detection methods that actually work in practice
Drift detection is a technical field, and the PMS plan does not need to implement every known method. It needs to implement methods that are sensitive to the drift modes the specific product is exposed to and specific enough not to drown the review team in false alarms.
The useful detection patterns for AI medical devices include:
- Statistical distribution comparison on inputs and outputs: tests like Kolmogorov-Smirnov, the population stability index, or simple moment comparisons against the reference distributions.
- Periodic evaluation on a locked reference set: running the current production model through the same frozen dataset on a defined cadence and tracking the metric trajectory.
- Shadow evaluation on held-out field data where ground truth arrives after the fact: comparing model predictions against confirmed outcomes on a rolling sample.
- Calibration monitoring: watching whether the model's confidence distributions still match observed outcomes.
- Subgroup performance tracking: breaking the metrics down by demographic or clinical subgroup where bias risk exists.
- Workflow signal monitoring: override rates, acceptance rates, time-on-case, which surface concept drift through clinician behaviour when ground truth is slow.
Each pattern has a cost. Each pattern has a failure mode. A strategy that leans only on input distribution tests will miss concept drift. A strategy that leans only on clinical outcome tracking will find drift months after it started. A strategy that combines a small number of complementary patterns, each cheap to run, is what holds up under real operating conditions.
The choice of methods belongs in the risk file under EN ISO 14971:2019+A11:2021 and the PMS plan under MDR Annex III, with the reasoning written down. A CT-based diagnostic model needs input distribution monitoring on acquisition parameters. A structured-data risk score needs feature distribution monitoring and missingness tracking. A clinical decision-support tool needs output acceptance tracking and clinician override analysis. Copying a generic drift detection template into a specific PMS plan is a common failure mode, and it produces a plan that passes a casual read and fails a Notified Body review.
The baseline performance reference
The single most important engineering decision in a drift detection strategy is the baseline performance reference. Without a baseline that does not itself drift, every measurement is relative to a moving target, and the signal is lost.
The baseline is a locked reference dataset that is representative of the intended use population, isolated from training and validation data, and stable over time. Each periodic evaluation runs the current production model through the same dataset and produces the same core metrics — accuracy, sensitivity, specificity, AUC, calibration, subgroup performance. The trajectory of those metrics over time is the primary signal of model integrity. Because the dataset does not change, any change in the metrics is a change in the model or in the way the model is being run, not in the yardstick.
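The periodic evaluation loop described above is mechanically simple, and that simplicity is the point. The sketch below shows one minimal shape it could take, assuming a `predict` callable that maps one input to a probability and a frozen list of `(input, true_label)` pairs; the record type, field names, and 0.5 decision threshold are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass
import datetime

@dataclass(frozen=True)
class EvalRecord:
    date: str
    model_version: str
    sensitivity: float
    specificity: float

def evaluate_on_reference(predict, reference, version, threshold=0.5):
    """Run the current production model over the frozen reference set
    and return the core metrics as an immutable, loggable record."""
    tp = fp = tn = fn = 0
    for x, y in reference:
        pred = 1 if predict(x) >= threshold else 0
        if y == 1:
            tp += pred; fn += 1 - pred
        else:
            fp += pred; tn += 1 - pred
    return EvalRecord(
        date=datetime.date.today().isoformat(),
        model_version=version,
        sensitivity=tp / max(tp + fn, 1),
        specificity=tn / max(tn + fp, 1),
    )
```

Appending each record to a version-controlled log gives exactly the metric trajectory the text describes: because the reference pairs never change, any movement in the records is movement in the model or its runtime environment.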
Building the reference set is real work, and the effort is not optional. The set has to be curated with clinical input so that the population it represents matches the intended use. It has to be large enough to give statistical power to the chosen metrics at the chosen thresholds. It has to be protected from leakage into the training pipeline. It has to be version-controlled so that the provenance of every evaluation is traceable. And it has to be kept separate from the drift monitoring data collected from the field, because the field data is where the drift appears and the reference set is where the yardstick lives.
A team that starts this work at PMS planning time is starting late. A team that builds the reference set as part of the validation strategy and then keeps it frozen for the life of the device is doing it the right way. The reference set is also the foundation of any change control envelope under a predetermined change control plan — see locked vs. adaptive AI algorithms under MDR — because an envelope is only defensible if there is a stable yardstick to verify updates against.
What triggers a regulatory action
Drift monitoring only delivers value when it is wired into a response pathway. Detection without action is decoration. The PMS plan has to specify, in numerical terms, which findings trigger which actions and on what timeline.
Useful thresholds are concrete. A drop of more than X percentage points in AUC on the reference set, sustained across two consecutive evaluation cycles, triggers formal review. A population stability index above a defined value on input distributions triggers investigation within a defined window. A drop of more than Y percentage points in subgroup performance triggers an accelerated review under the risk file. A sudden change in clinician override rates above a defined threshold triggers a workflow review and an engineering audit of the deployment context.
Each of these belongs in the plan with a named owner, a named escalation path, and a defined maximum time to initial response. Vague language — "we will investigate significant drops" — does not survive audit and does not produce consistent action when the signal fires. Numerical thresholds do both.
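The shape of such pre-specified triggers can be sketched directly. The constants and trigger names below are placeholders standing in for the X and Y values above; in a real plan each number comes from the risk file, and the owner and response deadline for each action live in the PMS plan, not in code.

```python
# Hypothetical thresholds; the real values are set in the risk file.
AUC_DROP_PP = 0.03       # max tolerated AUC drop vs baseline, sustained
PSI_LIMIT = 0.25         # input distribution shift investigation limit
SUBGROUP_DROP_PP = 0.05  # max tolerated subgroup performance gap

def escalations(baseline_auc, auc_history, psi_value, subgroup_gaps):
    """Map the latest drift measurements onto the pre-defined triggers
    and return the list of actions that fire."""
    actions = []
    # AUC drop sustained across the last two evaluation cycles
    recent = auc_history[-2:]
    if len(recent) == 2 and all(baseline_auc - a > AUC_DROP_PP for a in recent):
        actions.append("formal_review")
    if psi_value > PSI_LIMIT:
        actions.append("input_drift_investigation")
    if any(gap > SUBGROUP_DROP_PP for gap in subgroup_gaps.values()):
        actions.append("accelerated_subgroup_review")
    return actions
```

The value of writing the triggers this way is that they are testable before drift ever happens: a single degraded cycle does not fire the sustained-drop trigger, and every fired action is a named, auditable event rather than a judgment call made under pressure.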
Thresholds also need to be reviewable. A drift monitor with thresholds that fire constantly loses the team's trust and gets silenced. A drift monitor with thresholds that never fire is not actually watching anything. Pre-specified thresholds should be reviewed on a defined cadence — typically at the same time as the PSUR under MDR Article 86 — and adjusted based on observed false-positive and false-negative rates. The adjustment is itself a documented decision, tied back to the risk file.
Retraining versus notification versus restriction
When drift crosses a threshold, the PMS plan has to specify what happens next. The usual options are: continue monitoring at a higher cadence, adjust the clinical indications or restrict the intended use, retrain the model, or withdraw the device. Each is a different kind of corrective action with different regulatory consequences.
Retraining is the option teams reach for first, and it is also the option with the most regulatory friction. A retrained model is a changed model. The change is assessed under the Notified Body change notification framework. If it is significant, notification is required, and where the change affects safety or performance, re-assessment follows. Teams with a predetermined change control plan can handle certain retraining events inside the pre-authorised envelope; teams without one have to go through full change notification each time. For the mechanics, see continuous learning AI under MDR.
Restriction of intended use is often faster than retraining and sometimes safer. If drift is concentrated in a specific subgroup or use context, narrowing the indications can restore the conformity assessment to a configuration that is still defensible, while retraining is planned deliberately on a longer timeline. Restriction is itself a change and has to flow through the same framework, but for urgent findings it can be the right first move.
Withdrawal is the option teams hope not to use and must be willing to use. If drift has moved the device outside a defensible safety envelope and no restriction or retraining path is available on a timeline proportionate to the risk, the device comes off the market. That decision has to be made explicitly, with documentation and a rationale that survives review.
The PMS plan should pre-define the trigger conditions for each option and the decision process that runs when a trigger fires. Pre-defined paths prevent two failure modes. The first is retraining too eagerly, chasing noise and introducing new variance. The second is acting too late, after drift has persisted long enough to harm patients.
Documentation — what the PMS file has to contain for drift
The documentation set for drift detection has to satisfy Annex III and the class-specific reporting obligations under Article 85 or 86, with AI-specific content woven through the relevant sections. Concretely:
- The PMS plan under Annex III names the drift detection methods in use, the metrics each method produces, the cadence of measurement, the baseline reference dataset, the numerical thresholds for escalation, the named owner of the review, and the response pathway for each trigger.
- The PMCF plan under Annex XIV Part B names the clinical outcome signals that feed the drift detection loop where feasible.
- The risk management file under EN ISO 14971:2019+A11:2021 contains the drift-related hazards, the controls that drift detection implements, and the verification that those controls remain effective.
- The PSUR under Article 86 for Class IIa, IIb, and III devices reflects the actual drift data from the period and the actions taken.
The documentation has to describe an operational system that the team is actually running. A drift detection plan that describes a pipeline that does not exist is a finding waiting to happen. The audit question — and the honest self-audit question — is whether the drift monitor is on, reviewed, and acted on, not whether it is documented.
Common mistakes teams make with drift detection
The patterns that show up most often in early-stage AI MedTech drift work:
- No locked reference set. Without a stable yardstick, every measurement is relative to a moving target.
- Treating drift as one phenomenon. Data drift, concept drift, label drift, and prior drift need different instruments.
- Vague thresholds. "Investigate significant changes" is not a threshold; it is a placeholder.
- Dashboards without owners. A drift monitor nobody reviews is a monitor that does not exist.
- Copying a generic template into a specific product. The drift modes of a CT model are not the drift modes of a clinical decision-support tool.
- Treating drift detection as a future project. It belongs in the technical file at certification, not in year two.
- Confusing automatic retraining with permission to retrain. The pipeline is engineering; the permission is regulatory.
- No traceability from drift findings into the risk file. Drift that does not feed the risk file is drift the Regulation cannot see.
The Subtract to Ship angle
The Subtract to Ship framework applied to drift detection produces a sharp discipline: build the smallest set of drift instruments that catches every drift mode the specific product is actually exposed to, and review that set on a defined cadence. Everything more is decoration; everything less is a non-conformity.
For a Class IIa AI decision-support tool, the Subtract to Ship drift strategy looks something like this. One locked reference dataset held outside training and validation. One scheduled pipeline that runs the reference set through the current production model on a defined cadence and logs the metric trajectory. One input distribution monitor on the features or pixels that matter most for this product. One output distribution monitor that catches calibration shift. One clinical outcome signal where feasibility and ethics allow. One subgroup performance breakdown where bias risk exists. One named owner, one defined review cadence, one set of numerical escalation thresholds tied to the risk file. One documented response pathway for each trigger, tied to the change control process.
That is lean. Every instrument traces to a specific drift hazard in the risk file and to a specific MDR obligation under Article 83 or Annex XIV Part B. What the Subtract to Ship drift strategy does not include is the elaborate monitoring surface that looks impressive in a deck, the dashboards nobody opens, and the quarterly reports that repeat the previous quarter's copy. The discipline is the same as the rest of MDR work: cut the activity that does not trace to an obligation, keep the activity that does, and actually run it.
Reality Check — Where do you stand?
- Do you have a locked reference dataset that never changes, held outside training and validation, used for periodic re-evaluation of the production model?
- Does your PMS plan name the specific drift detection methods in use for your device, or does it say "we monitor performance" without detail?
- For each drift metric, is there a numerical escalation threshold written down, and has it been set deliberately rather than as a placeholder?
- Do you distinguish between data drift, concept drift, label drift, and prior drift in your instrumentation, or do you treat drift as a single phenomenon?
- Is there a named owner of the drift review, with a defined cadence that is actually followed?
- Does your PMCF plan under Annex XIV Part B specify the clinical outcome signals that feed the drift loop where feasibility allows?
- Does every drift finding have a documented path into your risk management file under EN ISO 14971:2019+A11:2021 and into your change control process?
- If a drift threshold were crossed tomorrow, would the response run on a defined pathway with a named owner, or would it improvise?
- Have you read MDCG 2025-10 (December 2025) end-to-end, or have you only skimmed it?
Frequently Asked Questions
Does MDR require drift detection for AI medical devices explicitly? Not in those words. MDR Article 83 requires a PMS system proportionate to the risk class and appropriate for the device. For an AI device whose effective performance can degrade silently when the input distribution shifts, "appropriate" means active drift detection — because a PMS system that cannot catch the primary failure mode of the device is not appropriate for the device. Notified Bodies in 2026 expect to see it.
What is the difference between data drift and concept drift in practice? Data drift is a shift in the distribution of the inputs the model sees — the features have moved. Concept drift is a shift in the relationship between inputs and the correct clinical answer — the mapping has moved. Data drift is usually easier to detect because it does not need ground truth; concept drift usually needs downstream clinical outcomes to see clearly. A drift strategy has to instrument both.
How often should drift evaluation run against the locked reference set? It depends on the risk class, the clinical domain, and the expected drift rate. Quarterly is a common cadence for many products. Monthly is reasonable for products with faster-changing upstream inputs. The cadence belongs in the PMS plan, is justified in the risk file, and has to actually be executed — a plan that says monthly and runs yearly is worse than a plan that says quarterly and runs quarterly.
Is drift detection a substitute for PMCF under Annex XIV Part B? No. Drift detection is the technical arm; PMCF is the clinical arm. Annex XIV Part B requires proactive collection and evaluation of clinical data from the use of the CE-marked device, and that obligation stands on its own. In a well-built system the two feed each other — PMCF data provides the clinical outcome signals that the drift monitor uses to see concept drift, and drift monitor findings feed the PMCF analysis.
What happens when a drift threshold is crossed? The response depends on the trigger. Options include continuing monitoring at a higher cadence, restricting the intended use, retraining the model through the change control process, or withdrawing the device. Each option is a different kind of corrective action with different regulatory consequences, and the PMS plan has to pre-define which trigger leads to which pathway. Improvising the response when a trigger fires is a failure mode.
Can I rely on clinician complaints to tell me about drift? No. Drift is usually distributed across thousands of cases and invisible from inside any one of them. By the time a clinician notices enough to file a complaint, the degradation has often been running for months. An AI PMS system that relies on complaint intake to catch drift is not proportionate to the risk of the device.
How does drift detection interact with a predetermined change control plan? The drift detection system provides the evidence base that the change control envelope needs. An envelope that pre-authorises bounded updates in the field is only defensible if there is a stable reference set and a drift monitor to verify that each update keeps the model inside the envelope. Drift detection is the operational yardstick that makes a predetermined change control plan usable. See continuous learning AI under MDR for the change control side.
Related reading
- AI Medical Devices Under MDR: The Regulatory Landscape in 2026 — the pillar post that frames the full AI MedTech regulatory picture.
- Machine Learning Medical Devices Under MDR — the companion post on ML development discipline under MDR.
- Locked vs. Adaptive AI Algorithms Under MDR — the change control context that shapes drift response pathways.
- Continuous Learning AI Under MDR: The Unsolved Regulatory Challenge in 2026 — why drift detection is the backbone of any defensible change envelope.
- Model Validation and Verification for AI Medical Devices — the reference-set discipline that underpins every drift measurement.
- Real-World Performance Monitoring for AI Medical Devices — the engineering companion to the regulatory framing in this post.
- Post-Market Surveillance for AI Medical Devices — the full PMS system that drift detection sits inside.
- Cybersecurity Monitoring for AI Medical Devices — the parallel monitoring loop on the security side.
- AI/ML Medical Device Compliance Checklist 2027 — the consolidated checklist for AI MedTech founders.
- The Subtract to Ship Framework for MDR Compliance — the methodology behind every post in this blog, applied here to drift detection.
Sources
- Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 83 (post-market surveillance system), Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
- MDCG 2025-10 — Guidance on post-market surveillance of medical devices and in vitro diagnostic medical devices. Medical Device Coordination Group, December 2025.
- EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.
This post is part of the AI, Machine Learning and Algorithmic Devices category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Drift detection is where the gap between what an AI model was certified to do and what it actually does in the field becomes visible, and it is where the hardest engineering-meets-regulation conversations usually land. If the shape of the right drift detection strategy for your specific AI device is not obvious after reading this post, that is expected — the strategy is bespoke work, and it is exactly the kind of decision where a sparring partner who has walked other AI MedTech founders through the same build earns their keep.