Post-market surveillance for an AI medical device has to do everything that PMS does for a classical device — collect complaints, analyse trends, feed risk management and clinical evaluation, escalate into vigilance when thresholds are crossed — and it has to do one more thing that classical PMS does not: actively monitor the algorithm's performance in the field, because an AI model can degrade silently when the input distribution drifts, even if the model itself never changes. Under MDR Articles 83 to 86 and MDCG 2025-10 (December 2025), the obligation is the same for AI and non-AI devices; the operational content is different. An AI PMS system that only handles complaints is not proportionate to the risk of the device, and it will not pass a serious Notified Body review in 2026.

By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.


TL;DR

  • MDR Articles 83 to 86 apply identically to AI and classical medical devices. The PMS system must be proactive, proportionate to risk class, and integrated into the QMS. What changes for AI is the operational content of the plan.
  • An AI medical device can degrade in the field without the algorithm changing at all. The input distribution drifts, the population shifts, the hardware upstream is updated, and the model's effective accuracy erodes silently.
  • Active drift detection is the defining PMS discipline for AI devices. Monitoring of input distributions, output distributions, calibration against a locked reference set, and clinical outcome signals where feasible.
  • PMCF under Annex XIV Part B is not optional for most AI medical devices. Real-world performance in the intended use population has to be tracked with a defined plan, defined metrics, and defined thresholds.
  • Triggers for model retraining should be pre-defined in the PMS plan — not improvised when drift is detected. Every retraining is a change control event under the device's change management process or inside a predetermined change control envelope.
  • MDCG 2025-10 (December 2025) is the current operational guidance. EN ISO 14971:2019+A11:2021 is the risk management standard the drift hazards feed back into.
  • The most common mistake in AI PMS is building elaborate dashboards that nobody watches. The second most common mistake is equating "we have a feedback form" with "we have PMS."

Why AI PMS is different — the degradation problem

Classical PMS is built around events. A complaint arrives. A service record shows a failure. An incident happens in the field. The PMS system logs it, assesses it, escalates it where needed, and feeds the finding back into the risk file and the clinical evaluation. The underlying assumption is that the device itself is stable — the version shipped on day one is the version running on day three hundred, and changes come through deliberate releases.

AI devices break that assumption in a specific way. The software is stable. The weights are locked. The version running in hospital A on Monday is identical to the version that was certified. But the effective performance of that identical software in the field is not stable, because AI systems are sensitive to the distribution of their inputs in ways that classical software is not.

A CT scanner upstream of the model gets a firmware update that changes pixel statistics. A hospital changes its referral pattern and the patient mix shifts. A new imaging protocol becomes standard and the contrast characteristics of the input data change. A seasonal disease pattern alters the prevalence of the condition the model is trained to detect. The model has not moved. The world around the model has. And the model's accuracy, sensitivity, specificity, and subgroup performance can degrade materially, without a single complaint being filed — because the degradation is silent, distributed across thousands of cases, and invisible from inside any one of them.

This is why an AI PMS system that only runs on complaint intake is insufficient. It will catch the failures that are loud. It will miss the failures that matter most.

For the foundational PMS framework that applies to every device class, see what is post-market surveillance under MDR. For the AI-specific regulatory context, see the pillar post on AI medical devices under MDR. The locked-versus-adaptive decision that shapes the PMS plan is covered in locked vs. adaptive AI algorithms under MDR.

What the Regulation actually requires for AI PMS

The text of MDR Articles 83 to 86 does not name AI. It does not need to. The articles require every manufacturer to plan, establish, document, implement, maintain, and update a PMS system that is proactive, proportionate to the risk class, and appropriate for the type of device. The plan is specified in detail in Annex III. The reports — PMS Report under Article 85 for Class I, PSUR under Article 86 for Class IIa, IIb, and III — are scaled to the class.

For AI devices, two phrases in this framework do the real work. The first is "proportionate to the risk class." Most AI medical devices sit at Class IIa or higher under Annex VIII Rule 11, because they are built precisely to support or drive clinical decisions. At those classes, the PMS obligations are not minimal. The second is "appropriate for the type of device." Appropriate for an AI device means the PMS has to surface the kinds of failures AI devices actually have — drift, distribution shift, subgroup performance degradation — not only the kinds of failures classical software has.

MDCG 2025-10 (December 2025) is the current operational guidance on PMS for medical devices and in vitro diagnostic medical devices. It describes the PMS system, the PMS plan, the main PMS activities, and how PMS interacts with clinical evaluation, risk management, and vigilance. It does not carve out AI as a separate category, but when read in light of the fact that an AI device's failure modes include silent drift, the MDCG 2025-10 requirements for data collection, assessment methods, and conclusions all point toward active performance monitoring as part of the "appropriate" PMS for an AI device.

MDCG 2019-11 Rev.1 (June 2025) on qualification and classification of software under MDR is the upstream document that determines which software counts as a medical device and at which class. It does not directly specify PMS content, but the classification it drives — typically Class IIa or higher for AI decision-support — is what makes rigorous PMS mandatory for these products.

EN ISO 14971:2019+A11:2021 is the harmonised standard for risk management. For AI devices, the relevant AI hazards — bias, distribution shift, adversarial robustness, explainability gaps — are identified and controlled inside the ISO 14971 process. The PMS system is where the effectiveness of those controls is verified over time. If a risk control assumed the model would maintain 92% sensitivity in the intended use population, and drift monitoring shows sensitivity has dropped to 86%, the risk control is no longer effective and the risk file has to be reassessed. This loop is what connects AI PMS to the standard risk management lifecycle the Regulation expects.

The performance metrics to monitor

An AI PMS plan has to specify, in concrete terms, what is being monitored. Vague language — "we monitor model performance" — does not survive audit and does not catch drift. Concrete monitoring specifies the metrics, the frequency, the reference sets, and the thresholds.

The useful metric categories for most AI medical devices include the following.

Input distribution metrics. Statistics on the distribution of inputs the model is seeing in the field, compared to the distribution of the training data and the distribution during validation. For imaging devices, this can include intensity histograms, resolution statistics, acquisition parameter distributions. For structured data, it can include feature distributions, missingness patterns, and value ranges. The question the metrics answer is: is the population the device is now seeing still the population the device was validated for?
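
A minimal sketch of what this can look like for structured input data, assuming the training reference and a recent field batch are available as pandas DataFrames with numeric feature columns; the function and column names are illustrative, not a prescribed implementation.

```python
# Compare field input distributions against the training reference,
# feature by feature. Numeric features only in this sketch.
import pandas as pd
from scipy.stats import ks_2samp

def input_distribution_report(reference: pd.DataFrame,
                              field_batch: pd.DataFrame,
                              alpha: float = 0.01) -> pd.DataFrame:
    rows = []
    for feature in reference.columns:
        ref = reference[feature].dropna()
        obs = field_batch[feature].dropna()
        # Two-sample Kolmogorov-Smirnov test on the feature's values.
        stat, p_value = ks_2samp(ref, obs)
        rows.append({
            "feature": feature,
            "ks_statistic": stat,
            "p_value": p_value,
            "shift_flagged": p_value < alpha,
            # Missingness drift is often an earlier signal than value drift.
            "ref_missing_rate": reference[feature].isna().mean(),
            "field_missing_rate": field_batch[feature].isna().mean(),
        })
    return pd.DataFrame(rows)
```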

Output distribution metrics. Statistics on the distribution of model outputs in the field: predicted class frequencies, score distributions, confidence scores. If a diagnostic model that produced a 12% positive rate during validation is now producing 25% positives in the field, something has moved. It might be a real shift in disease prevalence, or it might be drift in the model's calibration. Either way, the signal is worth investigating.
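
For the positive-rate example, a simple statistical check is enough to decide whether the shift deserves a formal look. The sketch below assumes a binary classifier and a baseline rate taken from validation; the numbers are illustrative.

```python
# Flag a field positive rate that is inconsistent with the validation
# baseline. The test cannot say whether the cause is prevalence shift or
# calibration drift; it only says the shift is unlikely to be noise.
from scipy.stats import binomtest

def positive_rate_signal(n_positive: int, n_total: int,
                         baseline_rate: float = 0.12,
                         alpha: float = 0.01) -> dict:
    result = binomtest(n_positive, n_total, p=baseline_rate)
    return {
        "field_positive_rate": n_positive / n_total,
        "baseline_rate": baseline_rate,
        "p_value": result.pvalue,
        "investigate": result.pvalue < alpha,
    }

# 250 positives in 1,000 field cases against a 12% baseline gets flagged.
print(positive_rate_signal(250, 1000))
```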

Performance against a locked reference set. A curated, isolated reference dataset that never changes and is used for periodic re-evaluation of the model. Each evaluation run produces the same core metrics — accuracy, sensitivity, specificity, AUC, calibration, subgroup performance — and the trajectory of those metrics over time is the primary signal of model integrity. The reference set does not change, so any change in the metrics is a change in the model or the way the model is being run.
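
A sketch of what a periodic reference-set run can look like for a binary classifier, assuming a scikit-learn-style model with predict_proba and an append-only metrics log; the names, the decision threshold, and the log format are all illustrative.

```python
# Evaluate the production model on the locked reference set and append
# one row to a metric trajectory log. The reference set never changes,
# so movement in these metrics is movement in the model or its runtime.
import json
from datetime import datetime, timezone
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_on_reference(model, X_ref, y_ref, threshold: float = 0.5,
                          log_path: str = "reference_metrics.jsonl") -> dict:
    scores = model.predict_proba(X_ref)[:, 1]
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_ref, preds).ravel()
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_ref, scores),
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```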

Subgroup performance. Performance broken down by demographic or clinical subgroup, where the groups matter for intended use. A model whose aggregate performance is stable but whose performance on a specific subgroup has degraded is producing a different kind of risk signal than a uniform degradation, and the PMS plan should distinguish between them.
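
Where subgroup labels and ground truth are available, for example on the locked reference set, the breakdown is straightforward; the sketch below assumes a DataFrame with binary labels and predictions plus a subgroup column, all names illustrative.

```python
# Sensitivity per subgroup: the share of true positives the model catches
# within each group. A stable aggregate can hide a degraded subgroup, so
# compare the worst subgroup against the aggregate, not only the aggregate
# against its baseline.
import pandas as pd

def subgroup_sensitivity(df: pd.DataFrame, group_col: str = "subgroup",
                         label_col: str = "y_true",
                         pred_col: str = "y_pred") -> pd.Series:
    positives = df[df[label_col] == 1]
    return positives.groupby(group_col)[pred_col].mean()
```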

Clinical outcome signals where feasible. For some AI devices, it is possible to connect model outputs to downstream clinical outcomes — confirmed diagnoses, treatment decisions, patient follow-up data. Where this is feasible and ethically permissible, it is the highest-quality signal available, because it measures the clinical benefit the device was certified to deliver, not only the technical metrics the device was trained on.

Usage and workflow signals. How clinicians interact with the model. Override rates, acceptance rates, time spent on each case, distribution of use contexts. These are not performance metrics in the strict sense, but they reveal when the product's role in the clinical workflow is shifting, which is itself a post-market safety signal.

Each of these categories belongs in the PMS plan with a specified metric, a specified frequency, a specified source, and a specified threshold. "Monitor quarterly" is a placeholder; "compute AUC on the locked reference set quarterly and flag any drop greater than two percentage points from the baseline for formal review" is a specification.
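
The two-percentage-point rule in that specification translates directly into a check against the trajectory log; the baseline value and log format below are illustrative and would be fixed in the PMS plan.

```python
# Apply the pre-specified escalation rule to the latest reference-set run.
import json

def auc_drop_flag(log_path: str = "reference_metrics.jsonl",
                  baseline_auc: float = 0.94,
                  max_drop: float = 0.02) -> bool:
    with open(log_path) as fh:
        runs = [json.loads(line) for line in fh]
    latest_auc = runs[-1]["auc"]
    # True means: open a formal review per the PMS plan. The code does not
    # decide what the review concludes; it only makes the trigger objective.
    return (baseline_auc - latest_auc) > max_drop
```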

Drift detection methods

Drift detection is a technical field of its own, and the PMS plan does not have to implement every known method. It has to implement methods that are sensitive enough to catch the drift modes the product is actually exposed to, and specific enough not to drown the team in false alarms.

The useful drift detection patterns for AI medical devices include the following.

  • Statistical distribution comparison. Comparing input and output distributions in the field against reference distributions using tests like Kolmogorov-Smirnov, the population stability index, or simple moment comparisons (a PSI sketch follows this list).
  • Periodic reference-set evaluation. Running the locked reference set through the current model on a defined cadence and tracking the metric trajectory.
  • Shadow evaluation on held-out field data. When ground truth becomes available after the fact (for example, from confirmed diagnoses recorded downstream), comparing model predictions against the ground truth on a rolling sample.
  • Clinical outcome tracking. Connecting model outputs to downstream clinical outcomes, as described in the metrics section above.
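
As a concrete instance of the first pattern, the sketch below computes a population stability index between a reference sample and a field sample of one numeric variable. The ten-bin layout and the conventional 0.1 / 0.25 interpretation bands are heuristics from general practice, not regulatory thresholds, and all names are illustrative.

```python
# Population stability index (PSI) between reference and field samples.
# Higher values mean a larger distribution shift; around 0.1 is often read
# as moderate and above 0.25 as significant.
import numpy as np

def population_stability_index(reference: np.ndarray, field: np.ndarray,
                               n_bins: int = 10) -> float:
    # Bin edges come from reference quantiles, so each reference bin holds
    # roughly the same share of data; field values outside the reference
    # range are clipped into the outer bins.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    field_clipped = np.clip(field, edges[0], edges[-1])
    ref_counts, _ = np.histogram(reference, bins=edges)
    fld_counts, _ = np.histogram(field_clipped, bins=edges)
    # Small floor avoids division by zero for empty bins.
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    fld_pct = np.clip(fld_counts / fld_counts.sum(), 1e-6, None)
    return float(np.sum((fld_pct - ref_pct) * np.log(fld_pct / ref_pct)))
```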

The choice of methods should be justified in the risk file and cross-referenced in the PMS plan. A CT-based diagnostic model needs input distribution monitoring for the acquisition parameters. A structured-data risk score needs feature distribution monitoring and missingness tracking. A clinical decision-support tool needs output acceptance tracking and clinician override analysis. One size does not fit all products, and copying a generic drift detection template into a specific PMS plan is a common failure mode.

Thresholds for escalation should be set before drift happens, not improvised when it does. A drop of X percentage points on metric Y triggers formal review. A distribution shift with population stability index above Z triggers investigation. A sudden change in override rates above a defined threshold triggers a workflow review. These thresholds are the conversion from passive monitoring into actionable PMS.
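
One way to keep these thresholds honest is to define them as data that both the PMS plan and the monitoring job read from, so the plan text and the running system cannot drift apart; the rules, numbers, and action labels below are illustrative.

```python
# Escalation rules as data: pre-defined thresholds mapped to pre-defined
# actions, evaluated against the latest observed monitoring values.
ESCALATION_RULES = [
    {"metric": "auc_drop_from_baseline", "threshold": 0.02,
     "action": "formal review"},
    {"metric": "population_stability_index", "threshold": 0.25,
     "action": "distribution shift investigation"},
    {"metric": "override_rate_change", "threshold": 0.10,
     "action": "clinical workflow review"},
]

def triggered_actions(observed: dict) -> list[str]:
    # Returns the pre-defined actions whose thresholds have been crossed.
    return [rule["action"] for rule in ESCALATION_RULES
            if observed.get(rule["metric"], 0.0) > rule["threshold"]]
```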

PMCF for AI devices

Post-market clinical follow-up under Annex XIV Part B is not separate from the drift discussion — it is the clinical arm of exactly the same loop. PMCF for an AI medical device has to collect and evaluate real-world clinical data from the device's use, confirm that the clinical performance and safety claims still hold in the intended use population, identify previously unknown side effects or performance gaps, and feed the findings back into the clinical evaluation report.

For AI products, PMCF has a specific shape. It has to cover the intended use population the device is actually seeing in the field, including the subgroups where bias risk exists. It has to collect enough data to detect changes in clinical performance, not only clinical complications. It has to use a defined, documented methodology — the PMCF plan specifies the data sources, the sample sizes, the metrics, and the cadence. And for higher-class AI devices, the findings feed directly into the PSUR under Article 86, which explicitly names the main findings of the PMCF as a required element.

A PMCF plan that says "we will conduct a retrospective study in year two" is a placeholder, not a plan. A PMCF plan that specifies which hospitals will provide which data on which cadence, with which ground-truth adjudication process and which analysis protocol, is a plan. The difference matters at audit and it matters when drift actually happens and the team needs to respond.

For the broader PMCF framework, see PMCF under MDR — a guide for startups.

Triggers for model retraining

When drift monitoring surfaces a problem, the PMS plan has to specify what happens next. The usual options are: continue monitoring at a higher cadence, adjust the clinical indications or restrict the intended use, retrain the model, or withdraw the device. Each of these is a different kind of corrective action with different regulatory consequences.

Retraining is the option most AI teams reach for first, and it is also the option that has the most regulatory friction. A retrained model is a changed model. The change is evaluated under the Notified Body change notification framework. If the change is significant — and a retraining driven by drift detection usually is — notification is required, and where the change affects safety or performance, re-assessment follows. Teams that have a predetermined change control plan in place can handle certain retraining events inside the pre-authorised envelope; teams that do not have to go through full change notification each time.

The PMS plan should pre-define the trigger conditions for retraining — which metrics, at what thresholds, for what duration — and the process that follows when a trigger fires. Pre-defined triggers prevent two failure modes. The first is retraining too eagerly, chasing noise and introducing new variance. The second is retraining too late, after drift has persisted long enough to harm patients. Both happen when triggers are improvised.
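
A sketch of what a pre-defined trigger can look like when it includes a duration condition, so that a single noisy evaluation run does not open a change control event by itself; the metric, the floor, and the run count are illustrative and belong in the PMS plan, not only in code.

```python
# Retraining trigger with a persistence condition: sensitivity from the
# reference-set runs must stay below the floor for N consecutive runs.
def retraining_trigger(sensitivity_history: list[float],
                       floor: float = 0.90,
                       consecutive_runs: int = 2) -> bool:
    if len(sensitivity_history) < consecutive_runs:
        return False
    # Firing the trigger opens a change control event (or a step inside a
    # predetermined change control envelope); it does not retrain anything.
    return all(value < floor
               for value in sensitivity_history[-consecutive_runs:])
```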

For the deeper treatment of locked versus adaptive models and change control envelopes, see locked vs. adaptive AI algorithms under MDR.

Documentation — what the PMS file has to contain

The PMS documentation set for an AI medical device has to satisfy Annex III and the class-specific reporting obligations under Articles 85 or 86, with AI-specific content woven through. Concretely, the documentation includes:

  • A PMS plan that specifies every element of Annex III, with AI-specific monitoring methods, thresholds, and response pathways named in each relevant section.
  • A PMCF plan under Annex XIV Part B that addresses the intended use population, with subgroup-level coverage where bias risk exists.
  • A PMS Report under Article 85 for Class I devices, or a PSUR under Article 86 for Class IIa, IIb, and III devices, updated on the class-specific cadence and reflecting the actual drift and performance data from the period.
  • A defined record of retraining triggers and outcomes, tied to change control records.
  • Traceability from PMS findings into the risk management file under EN ISO 14971:2019+A11:2021 and the clinical evaluation report under Article 61.

The documentation has to be more than a template. It has to reflect the actual operational system the team is running. A PMS plan that describes a drift monitoring pipeline the engineering team does not actually operate is a finding waiting to happen.

Common mistakes startups make

The patterns that show up most often in early-stage AI MedTech PMS work:

  • Building dashboards that nobody watches. Monitoring infrastructure without a review cadence is decoration.
  • Equating complaint intake with PMS. Complaint intake is necessary but not sufficient for an AI device.
  • Copying a classical software PMS plan and changing the product name. AI-specific failure modes are not in a classical template.
  • Defining thresholds vaguely ("we will investigate significant drops") instead of numerically.
  • No locked reference set. Without a reference that never changes, you cannot detect drift reliably.
  • PMCF as a future project. The PMCF plan is a pre-market deliverable, not a year-two problem.
  • Confusing automatic retraining pipelines with regulatory permission to retrain. The pipeline is engineering; the permission is regulatory; they are not the same thing.
  • No traceability from PMS findings into the risk file. Drift that does not feed the risk file is drift the Regulation cannot see.

The Subtract to Ship angle

The Subtract to Ship framework applied to AI PMS produces a sharp discipline: build the smallest PMS system that satisfies every Article 83 to 86 obligation for the device class AND actively detects the drift modes the specific product is exposed to. Everything beyond that is waste. Everything less is a non-conformity.

For an AI decision-support tool at Class IIa, the Subtract to Ship PMS looks something like this. One complaint intake channel. One locked reference set held outside the training and validation sets. One scheduled pipeline that runs the reference set through the current production model on a defined cadence and logs the metric trajectory. One input distribution monitor on the features or pixels that matter most for this product. One dashboard that is reviewed on a defined cadence by a named owner. One PMCF plan with a clinical outcome tracking protocol at a defined cadence. One PMS plan that names every Annex III element with the AI-specific monitoring methods in each section. One PSUR under Article 86 on the class-specific cadence that reflects the actual drift and performance data. One defined set of retraining triggers with pre-specified thresholds. And one feedback loop from each of these into the risk file and the clinical evaluation.

That is lean. It is not minimal in the sense of skimping. Every element traces to a specific MDR obligation or to a specific AI-hazard control in the risk file. What the Subtract to Ship PMS does not include is the elaborate monitoring surface that looks impressive in a deck and runs unattended in production, the quarterly PMS reports that repeat the previous quarter's copy, the dashboards that have never been opened, and the process documents describing activities the team does not actually perform.

Reality Check — Where do you stand?

  1. Does your PMS plan specify the drift detection methods used for your AI device by name, or does it say "we monitor performance" without detail?
  2. Do you have a locked reference set that never changes, held outside your training and validation sets, used for periodic re-evaluation of the production model?
  3. For each performance metric you monitor, is there a numerical threshold for escalation written down in the PMS plan?
  4. Is there a named owner of the drift monitoring review, with a defined cadence that is actually followed?
  5. Does your PMCF plan specify data sources, sample sizes, ground-truth adjudication, and analysis methods for the intended use population — including subgroup-level coverage where bias risk exists?
  6. Are your retraining triggers pre-defined in the PMS plan, tied to specific metric thresholds and to your change control process?
  7. Does every PMS finding have a documented path into your risk management file under EN ISO 14971:2019+A11:2021 and your clinical evaluation?
  8. If your Notified Body asked for your PSUR under Article 86 tomorrow, would it reflect real drift and performance data from the period, or would it be a copy-paste of the last one?
  9. Have you read MDCG 2025-10 (December 2025) end-to-end, or have you only skimmed it?

Frequently Asked Questions

Does MDR require drift detection for AI medical devices explicitly? Not in those words. MDR Article 83 requires a PMS system proportionate to the risk class and appropriate for the device. For an AI device whose effective performance can degrade silently when the input distribution shifts, "appropriate" means active drift detection — because a PMS system that cannot catch the primary failure mode of the device is not appropriate for the device. Notified Bodies in 2026 expect to see it.

Is drift monitoring sufficient for Class IIa and above, or is PMCF also required? PMCF is also required. Annex XIV Part B requires proactive collection and evaluation of clinical data from the use of the CE-marked device, and this applies to AI devices the same way it applies to any other. Drift monitoring handles the technical side; PMCF handles the clinical side. For AI devices they are complementary, not substitutes.

How often should an AI model be evaluated against a locked reference set? It depends on the risk class, the clinical domain, and the expected drift rate. Quarterly is a common cadence for many products. Monthly is reasonable for products with faster-changing upstream inputs. The cadence should be specified in the PMS plan, justified in the risk file, and actually executed — a plan that says monthly and runs yearly is worse than a plan that says quarterly and actually runs quarterly.

Is automatic retraining a way to avoid PMS obligations? No. Automatic retraining is a technical capability, not a regulatory pathway. Every change to a certified device is assessed under the change notification framework, and silent retraining without a defined change control envelope does not fit the MDR framework in 2026. A retraining pipeline is useful when it is wired into a disciplined change control process, not around one.

What is the relationship between the AI PMS plan and the PSUR under Article 86? The PMS plan specifies what will be monitored, how, on what cadence. The PSUR is the periodic report that summarises what the monitoring actually found and what was done about it. For Class IIa, IIb, and III AI devices, the PSUR has to reflect the drift and performance data from the period, the main findings of PMCF, the benefit-risk conclusions, and any corrective or preventive actions taken. It is not a template; it is a report on the real system.

Do I need a PMCF plan even if my model is locked? Yes. The locked-versus-adaptive distinction shapes the change control side of the story, not the clinical follow-up obligation. Annex XIV Part B requires PMCF regardless of whether the model ever changes, because clinical performance and safety in the intended use population has to be confirmed in real-world use, not only at the point of certification.

How does the EU AI Act affect AI PMS under MDR? The AI Act layers additional obligations on AI systems in safety-critical use, including post-market monitoring of AI system performance. Where the AI system is also a medical device, the practical expectation in 2026 is that the AI Act obligations are integrated into the MDR PMS system rather than run as a parallel process. The operational mechanics of that integration are still being clarified between the Commission, the Medical Device Coordination Group, Notified Bodies, and AI Act governance bodies. The safe move is to build the PMS system so it would satisfy an auditor looking at either Regulation.

Sources

  1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 83 (post-market surveillance system), Article 84 (post-market surveillance plan), Article 85 (PMS Report for Class I), Article 86 (PSUR for Class IIa, IIb, III), Annex III (technical documentation on post-market surveillance), Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
  2. MDCG 2025-10 — Guidance on post-market surveillance of medical devices and in vitro diagnostic medical devices. Medical Device Coordination Group, December 2025.
  3. MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR, October 2019, Revision 1 June 2025.
  4. EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.

This post is part of the AI, Machine Learning and Algorithmic Devices category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Post-market surveillance for AI devices is where the gap between what teams build pre-market and what actually happens in the field becomes visible, and it is where the hardest engineering-meets-regulation conversations usually land. If the shape of the right PMS and PMCF plan for your specific AI device is not obvious after reading this post, that is expected — the plan is bespoke work, and it is exactly the kind of decision where a sparring partner who has walked other AI MedTech founders through the same build earns their keep.