PMCF for software and AI/ML devices still has to satisfy MDR Annex XIV Part B and feed the Article 61(11) clinical evaluation update cycle, but the operational content looks nothing like a hardware PMCF plan. Instead of registries, post-market studies, and user surveys as the primary instruments, software PMCF is built around continuous performance monitoring from telemetry, per-version benchmarks against a reference dataset, drift detection for AI/ML models, and structured outcome tracking that ties real-world clinical use back to the claims in the clinical evaluation report. The data pipeline has to be designed before CE marking, because retrofitting a PMCF telemetry layer onto a software architecture that was not built for it is expensive and often impossible within GDPR constraints.

By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.


TL;DR

  • PMCF for software and AI/ML devices is mandatory under Annex XIV Part B of Regulation (EU) 2017/745. The Annex does not carve out software; "it is software" is not a non-applicability justification.
  • The primary PMCF data source for connected software is telemetry from the running product, not post-market studies or user surveys. The PMCF plan names the signals, the cadence, and the thresholds that turn data into clinical conclusions.
  • Per-version performance benchmarks against a fixed reference dataset catch regressions that aggregate metrics hide. A drop in v2.3 is invisible if the dashboard only shows totals across all versions.
  • AI/ML devices add a failure mode PMCF must explicitly monitor: silent drift when the input distribution shifts even though the model has not changed. Drift detection is a PMCF activity, not a separate engineering task.
  • User outcome tracking — the clinical decisions and outcomes that follow the device's output — is the piece most often missing. Telemetry shows what the software did; outcome tracking shows whether it helped.
  • Every PMCF finding has to flow into the EN 62304:2006+A1:2015 software maintenance process and the EN ISO 14971:2019+A11:2021 risk file, or the Article 61(11) loop is broken.

Why SaMD and AI/ML PMCF is different

The pillar post on post-market clinical follow-up under MDR walks through PMCF as a concept and the Annex XIV Part B architecture. The how-to post on writing a PMCF plan under Annex XIV Part B covers the required contents of the plan document itself. This post is the software-specific walkthrough. The reason it needs its own post is that the default PMCF playbook — user surveys, registries, post-market studies, literature surveillance — maps badly onto software products. Not because the Annex objectives change for software. They do not. Because the methods that satisfy those objectives for software are mostly not on the default menu.

A Class IIa hardware device with a clinical claim can often satisfy PMCF with a structured user survey, literature surveillance, and similar-device monitoring. A Class IIa software device that tries the same combination will miss its own primary failure modes. Software fails when a regression slips into a release, when a third-party dependency changes behaviour, when the deployment environment shifts, when the input distribution drifts, or when users adapt their workflow around a weakness without complaining. None of those are visible in a quarterly survey or a literature search. They are visible in the product's own telemetry, if the PMCF plan is designed to look at it.

The companion post on post-market surveillance for SaMD covers the broader PMS framework for software under Articles 83 to 86. This post narrows in on the clinical arm of that framework — the Annex XIV Part B obligations — and the specific methods that satisfy them for software and AI/ML products.

Telemetry-based monitoring as a PMCF method

Annex XIV Part B lists the general PMCF methods the plan can use: gathering clinical experience, user feedback, screening of scientific literature, and screening of other sources of clinical data. For a connected software device, the running product is itself a very rich source of clinical data. Every prediction it makes, every recommendation it emits, every decision-support output it generates is, potentially, a PMCF data point.

The PMCF plan has to translate that potential into a specification. For each telemetry-based PMCF signal, the plan names what is collected, how it is collected, how it is linked to the clinical claims in the CER, which Annex XIV Part B objective it addresses, the analysis cadence, and the threshold at which a finding triggers a PMCF action — a CER update, a risk file update, a labelling change, or a corrective action. "We collect usage data" is a placeholder. "For every diagnostic recommendation, the product logs the model version, input validity flags, output class, confidence score, and downstream clinician action, aggregated weekly, reviewed monthly, with a pre-specified threshold for deviation from the CER-stated performance envelope" is a PMCF specification.
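To make that specification concrete, here is a minimal Python sketch of what such a telemetry data point and its weekly aggregation might look like. The field names, the event shape, and the aggregated counts are illustrative assumptions, not a required schema; a real plan derives them from the device's claims.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecommendationEvent:
    # One PMCF telemetry data point per diagnostic recommendation.
    # All field names here are illustrative, not a mandated schema.
    model_version: str     # deployed software/model version
    input_valid: bool      # input validity flags collapsed to one check
    output_class: str      # class emitted to the user
    confidence: float      # model confidence score in [0, 1]
    clinician_action: str  # downstream action: "accepted", "overridden", "deferred"

def weekly_aggregate(events):
    """Roll events up into the per-version counts that the monthly
    review compares against pre-specified thresholds."""
    summary = {}
    for e in events:
        s = summary.setdefault(
            e.model_version, {"n": 0, "invalid_inputs": 0, "overridden": 0}
        )
        s["n"] += 1
        s["invalid_inputs"] += not e.input_valid
        s["overridden"] += e.clinician_action == "overridden"
    return summary

events = [
    RecommendationEvent("2.3.0", True, "positive", 0.91, "accepted"),
    RecommendationEvent("2.3.0", False, "negative", 0.55, "overridden"),
    RecommendationEvent("2.2.1", True, "positive", 0.87, "accepted"),
]
print(weekly_aggregate(events))
# → {'2.3.0': {'n': 2, 'invalid_inputs': 1, 'overridden': 1},
#    '2.2.1': {'n': 1, 'invalid_inputs': 0, 'overridden': 0}}
```

Note that the aggregation is keyed by model version from the start, which is what makes the per-version review in the next section possible at all.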

The proportionality test applies. Collect enough to satisfy the Annex XIV Part B objectives for this specific device, not more. GDPR and data-minimisation constraints are not optional and the plan has to respect them. For on-premises deployments where telemetry cannot be sent to the manufacturer, the PMCF plan needs an alternative route — structured periodic exports, customer-site summary reports, contractual access — or a documented reason why another method fills the gap. The absence of telemetry is not a reason to drop the Annex XIV Part B objective; it is a reason to design around it.

For the related PMS telemetry discussion, see PMS for software as a medical device.

Performance benchmarks against a reference dataset

A software PMCF plan worth the name includes a fixed reference dataset — a locked, versioned set of inputs with known expected outputs — against which every deployed version of the software is benchmarked before and during its time in the field. The reference dataset is the ground truth against which "the device still performs as the CER says it does" can be tested, repeatably and comparably across versions.

The reference dataset is not the training set. For AI/ML devices, mixing the two is a category error that invalidates the measurement. It is a curated evaluation set, ideally including representative clinical cases, edge cases, and cases that probe the boundaries of the intended purpose. It is locked so that performance across releases is comparable. It is versioned so that when the clinical evaluation is updated, the reference dataset can evolve alongside it in a documented way.

The PMCF benchmark process runs on a defined cadence — at a minimum at every release, usually also on an independent periodic cycle — and produces metrics that can be compared directly against the performance envelope stated in the CER. Sensitivity, specificity, positive and negative predictive values, calibration, error-rate distributions by subgroup where relevant: the specific metrics depend on the device, but the principle is the same. If the CER claims the product achieves a specific level of performance on a defined population, the PMCF plan has to demonstrate that the deployed product continues to achieve that level on data representative of that population.
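The release-gate check described above can be sketched in a few lines of Python. The envelope figures and the boolean-classification framing are illustrative assumptions; a real device's metrics and floors come from its own CER.

```python
def benchmark(predictions, ground_truth):
    """Sensitivity and specificity on the locked reference dataset.
    predictions/ground_truth are parallel lists of booleans
    (finding present: yes/no)."""
    tp = sum(p and t for p, t in zip(predictions, ground_truth))
    tn = sum(not p and not t for p, t in zip(predictions, ground_truth))
    fn = sum(not p and t for p, t in zip(predictions, ground_truth))
    fp = sum(p and not t for p, t in zip(predictions, ground_truth))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}

# Illustrative floor values; the real envelope is stated in the CER.
CER_ENVELOPE = {"sensitivity": 0.90, "specificity": 0.85}

def check_release(predictions, ground_truth, envelope=CER_ENVELOPE):
    """Return the metrics plus any metric below its envelope floor.
    A non-empty deviation list is a pre-specified PMCF finding."""
    metrics = benchmark(predictions, ground_truth)
    deviations = [m for m, floor in envelope.items() if metrics[m] < floor]
    return metrics, deviations

truth = [True] * 5 + [False] * 5
preds = [True, True, True, True, False] + [False] * 5  # one missed positive
metrics, deviations = check_release(preds, truth)
print(metrics, deviations)
# sensitivity 0.8 is below the 0.90 floor → a finding on sensitivity
```

The design point is that the comparison against the envelope is mechanical and pre-specified, so a deviation is a finding by construction rather than by after-the-fact judgement.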

When a benchmark deviates from the envelope, the PMCF evaluation report records the deviation, the analysis feeds into the risk file under EN ISO 14971:2019+A11:2021, and the finding results in a CER update, a maintenance action under EN 62304:2006+A1:2015, or a documented reasoned decision that no action is needed. The benchmark is the rigorous half of software PMCF — the half that generates comparable, defensible, quotable numbers the notified body can read.

Drift detection integration for AI/ML devices

AI/ML devices add a failure mode PMCF must explicitly monitor. Even a locked model — one whose weights are not being updated in the field — can degrade if the input distribution it sees shifts away from the distribution it was trained and validated on. Changes in clinical practice, changes in upstream imaging equipment, changes in patient demographics at a new customer site, seasonal variation: any of these can produce silent drift where the model is technically unchanged but its real-world performance has moved. For AI/ML devices inside the MDR, this is a PMCF concern that cannot be delegated to the engineering team alone.

The PMCF plan for an AI/ML device has to name the drift detection method explicitly. Input-side drift detection monitors the distribution of inputs — feature statistics, covariate distributions, data quality flags — against the reference distribution from the training and validation work. Output-side drift detection monitors the distribution of outputs — predicted class frequencies, confidence score distributions, disagreement rates with an independent reference where one exists — against the CER-stated baseline. Ground-truth-informed drift detection, where feasible, compares model outputs against eventual clinical outcomes and measures whether the observed predictive performance still matches the claim.
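As one illustration of input-side drift detection, the sketch below uses the Population Stability Index (PSI) to compare a field sample of a feature against the reference distribution from validation. PSI is only one of several usable statistics, and the 0.2 finding threshold is a common rule of thumb, not something mandated by MDR or any harmonised standard; the plan has to name and justify its own choices.

```python
import math

def psi(reference, observed, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (from the
    training/validation work) and a field sample. Illustrative choice
    of drift statistic; small eps avoids log(0) on empty bins."""
    lo = min(min(reference), min(observed))
    hi = max(max(reference), max(observed))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range
    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [(c + eps) / (len(sample) + bins * eps) for c in counts]
    r, o = proportions(reference), proportions(observed)
    return sum((oi - ri) * math.log(oi / ri) for ri, oi in zip(r, o))

def classify(psi_value, finding_threshold=0.2):
    """The pre-specified threshold that turns a signal into a finding."""
    return "finding" if psi_value >= finding_threshold else "no action"

reference = [i / 10 for i in range(100)]    # validation-era feature values
shifted = [5 + i / 10 for i in range(100)]  # field values, distribution shifted
print(classify(psi(reference, reference)))  # → "no action"
print(classify(psi(reference, shifted)))    # → "finding"
```

Output-side drift follows the same pattern with predicted-class frequencies or confidence scores in place of the input feature.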

Each of those methods produces a signal. The PMCF plan defines the cadence on which each signal is reviewed, the threshold that classifies a signal as a finding, and the action that follows a finding. A drift finding is a clinical signal in the Annex XIV Part B sense because it affects whether the safety and performance claims in the CER still hold. Routing it only into an engineering backlog and not into the PMCF evaluation report breaks the Article 61(11) loop. Routing it into both is the correct answer.

For AI/ML devices where the model is updated in the field — adaptive or continuously learning systems — the PMCF requirements intensify further, and the MDCG 2019-11 Rev.1 guidance on qualification and classification of software in MDR (Revision 1, June 2025) is part of the reading. The current regulatory posture in Europe strongly favours locked models with defined change-control pathways; adaptive learning without predefined change-control is not a path most notified bodies will accept without specific justification.

For the pillar companion on PMS for AI devices, see post-market surveillance for AI devices.

User outcome tracking

Telemetry shows what the software did. Benchmarks show whether it did it correctly against a reference. Neither of those, on its own, shows whether the software helped the patient — which is the question the clinical evaluation ultimately answers, and the question PMCF ultimately has to keep current. Outcome tracking is the third leg.

Outcome tracking for software PMCF means following what happened clinically after the software produced its output. For a decision-support tool, did the clinician agree with the recommendation, and what happened next? For a diagnostic aid, was the suspected finding confirmed or excluded by the confirmatory workup, and what did that mean for the patient? For a monitoring product, did the flagged events correspond to real clinical deteriorations, and were the unflagged periods actually uneventful? The exact form of outcome tracking depends on the device and the clinical workflow, but the principle is that the PMCF plan reaches beyond the software boundary and captures what the software's output contributed to the downstream clinical decision.

This is the element most often missing from software PMCF plans. Teams build elegant telemetry pipelines, run benchmarks, monitor drift, and stop at the software boundary because collecting outcome data is harder and touches the customer's workflow. The harder-to-collect data is exactly the data Annex XIV Part B is asking about. Confirming safety and performance throughout the expected lifetime, identifying previously unknown side-effects, detecting emerging risks on the basis of factual evidence, ensuring the continued acceptability of the benefit-risk ratio, and identifying systematic misuse — several of those objectives can only be answered with outcome data, not with telemetry alone.

Pragmatic routes to outcome tracking include structured feedback captured in the product itself at the point of decision, periodic chart-review studies at selected customer sites under appropriate agreements, participation in registries where the device's outputs can be linked to clinical outcomes, and contractual arrangements with customers that permit anonymised outcome reporting. None of these is free. All of them are lighter than a full post-market clinical study, and in many cases a combination is what the PMCF plan justifies as proportionate.

What to record in the PMCF evaluation report

The PMCF evaluation report for a software or AI/ML device closes the loop by consolidating telemetry findings, benchmark results, drift signals, and outcome data into a single document that feeds the next CER update under Article 61(11). The structure mirrors the generic PMCF evaluation report described in the pillar post, but the content is specific to software.

The report restates the PMCF plan objectives for the period, then works through the evidence:

  • The deployed-version landscape — which versions were in the field, in what proportions, at which customer sites — because every finding is version-specific.
  • The telemetry data by signal, with any pre-specified thresholds that were crossed.
  • The benchmark results per version against the reference dataset, compared against the CER performance envelope.
  • For AI/ML devices, the drift detection results with the methods, metrics, and findings.
  • The outcome data collected during the period, with the method by which it was collected and the limitations of the data.
  • An appraisal of all of the above against the pre-specified criteria from the plan, and an analysis of the combined findings against the Annex XIV Part B objectives.
  • The conclusions and the specific actions that follow — CER updates, risk file updates, EN 62304:2006+A1:2015 change requests, labelling changes, or a documented reasoned decision that no updates are needed.

MDCG 2025-10 (December 2025) is the operational reference for how PMCF sits inside the broader PMS system, and the software-specific PMCF report should read as a coherent clinical document that a notified body auditor can follow end-to-end without needing to ask where a particular finding came from.

Common mistakes

Six patterns recur in software and AI/ML PMCF plans reviewed across notified body audits.

Treating PMCF as inapplicable because "it is software." Annex XIV Part B does not exempt software, and claiming non-applicability without a written justification that addresses the Annex objectives is a documented non-conformity.

Relying on user complaints as the primary clinical data source. Software fails silently. A complaint inbox is not a PMCF method for most connected software products.

No reference dataset and no per-version benchmarks. The plan reports aggregate metrics that make regressions invisible and cannot demonstrate that the CER performance envelope still holds.

For AI/ML devices, no drift detection in the PMCF plan. Engineering runs drift monitoring quietly in the background; regulatory never sees it; the PMCF evaluation report does not mention it; the notified body auditor asks where drift is handled and the answer is "somewhere in engineering."

No outcome tracking at all. The plan stops at the software boundary, the Annex XIV Part B objectives related to benefit-risk and emerging risks cannot be answered, and the PMCF evaluation report becomes a telemetry summary rather than a clinical document.

Designing the PMCF data pipeline after CE marking. Retrofitting telemetry, benchmarks, and outcome collection onto an architecture that was not built for them is expensive and, under GDPR, sometimes not possible. The pipeline design belongs in the PMCF plan written before CE marking.

The Subtract to Ship angle — lean software PMCF that closes the loop

The Subtract to Ship framework for MDR applied to software PMCF produces a short rule. For each activity in the plan, trace it to a specific Annex XIV Part B objective and to a specific CER claim or risk file entry. If the trace does not exist, cut the activity. For each Annex XIV Part B objective and for each major clinical claim, name the lightest method that can genuinely address it — telemetry signal, benchmark metric, drift indicator, outcome track. If no method exists, add one.

A lean PMCF programme for a Class IIa AI/ML diagnostic aid typically includes:

  • a defined telemetry signal set covering input validity, output distribution, and per-version error rates, with explicit thresholds and a monthly review cadence;
  • a locked reference dataset with per-release benchmarks against the CER performance envelope;
  • input-side and output-side drift detection with quarterly review and finding-triggered analysis;
  • a lightweight outcome-tracking route through structured in-product feedback or periodic chart review at one or two partner sites;
  • integration with the complaint intake so user-reported signals and telemetry-detected signals are correlated;
  • structured literature surveillance on the clinical condition, the technology, and similar devices;
  • monitoring of clinical experience with equivalent or similar devices;
  • a defined cadence for producing the PMCF evaluation report and feeding it into the CER update cycle under Article 61(11); and
  • a deterministic path from every PMCF finding into the EN 62304:2006+A1:2015 software maintenance process and the EN ISO 14971:2019+A11:2021 risk file.

That programme traces to Annex XIV Part B line by line. It is runnable by a small team. It survives notified body surveillance because every element has an audit trail and produces comparable numbers release over release. It is not a token programme. It is the minimum real programme for a software product that takes Annex XIV Part B seriously. Minimum real beats elaborate fictional every time.

Reality Check — where do you stand?

  1. Does your PMCF plan name telemetry signals with collection method, analysis cadence, and pre-specified thresholds, or does it describe "usage data" without specifics?
  2. Do you have a locked reference dataset that every release is benchmarked against before it ships, with results compared explicitly to the performance envelope stated in the CER?
  3. For each PMCF telemetry signal and benchmark metric, can you name the specific Annex XIV Part B objective it addresses?
  4. If you are an AI/ML device, does your PMCF plan explicitly name the drift detection methods — input-side, output-side, and outcome-informed — and the thresholds that classify a signal as a finding?
  5. Is drift detection documented as a PMCF activity inside the PMCF plan, or does it live only in an engineering backlog that never reaches the PMCF evaluation report?
  6. Do you have any form of user outcome tracking in the plan, or does your PMCF stop at the software boundary?
  7. Is every PMCF finding routed into the EN 62304:2006+A1:2015 software maintenance process and the EN ISO 14971:2019+A11:2021 risk file on a documented path?
  8. Was the PMCF data pipeline designed before CE marking, or is it a retrofit?
  9. When the PMCF evaluation report is produced, does it reflect real per-version telemetry and benchmark data, or is it a copy of the previous period?
  10. Have you read MDCG 2025-10 (December 2025) and MDCG 2019-11 Rev.1 (Revision 1, June 2025) end-to-end for the regulatory context?

Frequently Asked Questions

Is PMCF required for software medical devices?

Yes. Annex XIV Part B of Regulation (EU) 2017/745 does not exempt software. PMCF is mandatory for software and AI/ML medical devices unless the manufacturer documents a specific written justification that addresses the Annex XIV Part B objectives one by one and survives notified body review. "It is software" is not a justification.

Can telemetry replace a post-market clinical study for software PMCF?

For many software devices at Class IIa and some at Class IIb, a well-designed telemetry, benchmark, and outcome-tracking programme can satisfy Annex XIV Part B without a dedicated post-market clinical study. The test is whether the chosen methods genuinely address every Annex XIV Part B objective for the device. For novel, high-risk, or AI/ML devices with open clinical questions, a structured post-market study may still be necessary alongside the continuous monitoring layer.

Is drift detection a regulatory requirement for AI/ML devices under MDR?

MDR does not use the word "drift," but Annex XIV Part B requires detecting emerging risks on the basis of factual evidence and confirming safety and performance throughout the expected lifetime. For an AI/ML device, silent drift is a direct threat to both objectives. A PMCF plan that does not address drift for an AI/ML device is a PMCF plan that does not meet Annex XIV Part B for that device. The mechanism is a practical consequence of the Regulation, not a separate rule.

How is software PMCF different from software PMS?

PMS is the broader post-market surveillance system required by Articles 83 to 86. It covers complaints, trend reporting, cybersecurity events, vigilance, and everything else that flows from real-world use. PMCF is the clinical arm of that system, required by Annex XIV Part B, focused on keeping the clinical evaluation current. For software, the two overlap significantly because telemetry often feeds both, but the outputs are different documents with different purposes: the PMS Report or PSUR on one side, the PMCF evaluation report and CER update on the other.

Can a PMCF plan use telemetry from a cloud-hosted product only?

Where telemetry can be collected lawfully under GDPR and the device's data-protection commitments, it is often the strongest PMCF data source for cloud-hosted software. For on-premises deployments, the plan needs an alternative — structured exports, customer-site reports, contractual data-sharing — or a documented reason why another method fills the same Annex XIV Part B objectives. A telemetry-only plan that ignores on-premises deployments of the same product is incomplete.

Does the PMCF reference dataset have to be updated?

The reference dataset is locked for benchmarking comparability across versions, but it can and should be extended in a documented way as the clinical evaluation evolves. New versions of the reference dataset are tracked with version identifiers so that historical benchmark results remain interpretable. Updating the reference dataset is a controlled change tied to the CER update cycle, not an ad-hoc engineering decision.
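That controlled-change discipline can be sketched as an append-only version record. Everything here is illustrative: the type, the field names, and the version strings are assumptions, and the point is only the invariant enforced in `extend`, namely that existing cases are never removed or altered, so historical benchmark results remain comparable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceDatasetVersion:
    # A locked reference dataset release. Benchmark results cite
    # dataset_version so historical numbers stay interpretable.
    dataset_version: str      # e.g. "refset-1.0" (illustrative naming)
    case_ids: tuple           # immutable tuple of case identifiers
    linked_cer_revision: str  # the CER update that approved this release
    change_note: str          # documented reason for the change

def extend(prev, new_case_ids, new_version, cer_revision, note):
    """Controlled extension: the previous cases are carried over
    unchanged, so benchmarks against prev stay comparable."""
    return ReferenceDatasetVersion(
        dataset_version=new_version,
        case_ids=prev.case_ids + tuple(new_case_ids),
        linked_cer_revision=cer_revision,
        change_note=note,
    )

v1 = ReferenceDatasetVersion("refset-1.0", ("case-001", "case-002"),
                             "CER rev 3", "initial locked set")
v2 = extend(v1, ["case-003"], "refset-1.1", "CER rev 4",
            "added post-market edge cases")
assert set(v1.case_ids) <= set(v2.case_ids)  # old cases preserved
```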

How often should the software PMCF evaluation report be produced?

The cadence is defined in the PMCF plan and has to be consistent with the clinical evaluation update cycle under Article 61(11). For Class III and implantable devices, the clinical evaluation has to be updated at least annually with PMCF data, which in practice anchors the report at an annual cadence for those classes. For Class IIa and IIb software, the cadence is justified in the plan — annual is common, more frequent cadences are appropriate for products with high release velocity or active drift risk.

Sources

  1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 61(11) (PMCF update of clinical evaluation), Articles 83 to 86 (post-market surveillance system, plan, PMS Report, and PSUR), and Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
  2. MDCG 2025-10 — Guidance on post-market surveillance of medical devices and in vitro diagnostic medical devices. Medical Device Coordination Group, December 2025.
  3. MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR. First publication October 2019; Revision 1, June 2025.
  4. EN 62304:2006+A1:2015 — Medical device software — Software life-cycle processes (IEC 62304:2006 + IEC 62304:2006/A1:2015).
  5. EN ISO 14971:2019+A11:2021 — Medical devices — Application of risk management to medical devices.

This post is part of the Post-Market Surveillance & Vigilance series in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Software PMCF is where the gap between a plan that looks compliant on paper and a plan that actually keeps the clinical evaluation current is widest, and it is where designing the data pipeline before CE marking pays for itself many times over during the first surveillance audit.