---
title: "Clinical Evaluation of Software as a Medical Device: The Unique Challenges"
description: Clinical evaluation for SaMD has the same MDR Article 61 backbone but unique data sources and validation methods. Here is what to know.
authors: Tibor Zechmeister, Felix Lenhard
category: Clinical Evaluation & Investigations
primary_keyword: clinical evaluation SaMD
canonical_url: https://zechmeister-solutions.com/en/blog/clinical-evaluation-samd
source: zechmeister-solutions.com
license: All rights reserved. Content may be cited with attribution and a link to the canonical URL.
---

# Clinical Evaluation of Software as a Medical Device: The Unique Challenges

*By Tibor Zechmeister (EU MDR Expert, Notified Body Lead Auditor) and Felix Lenhard. Last updated 10 April 2026.*

> **Clinical evaluation of SaMD follows the same MDR Article 61 and Annex XIV Parts A and B backbone as any other medical device: a planned, ongoing process to generate, collect, analyse, and assess clinical data verifying safety, performance, and clinical benefit for the intended purpose. What changes for software is what counts as clinical data and how it is produced. Literature on the clinical task, retrospective performance studies on curated datasets, prospective validation cohorts on independent sites, and a PMCF plan that tracks real-world software performance replace the patient-on-the-table data sources of a physical device. The framework is identical. The evidence sources are different.**

---

## TL;DR

- MDR Article 61 and Annex XIV Part A apply to SaMD with no carve-out. There is no software-specific reduced-evidence pathway.
- The clinical data for SaMD usually comes from scientific literature on the clinical task, retrospective studies on curated datasets, and prospective validation cohorts on independent sites — not from traditional interventional trials.
- Clinical performance under MDR is not the same thing as technical performance. An accuracy number on a test set is one input to the clinical evaluation, not the answer to it.
- Equivalence under MDCG 2020-5 is technically available but rarely usable for SaMD, because two pieces of software with the same intended purpose can behave very differently on the same inputs.
- PMCF for software under Annex XIV Part B is where the software-specific discipline lives: real-world performance tracking, usage telemetry, and monitoring for changes in the input environment that degrade clinical performance.
- MDCG 2019-11 Rev.1 (June 2025) is the definitive software guidance; every SaMD clinical evaluation has to be consistent with it.

---

## Why SaMD clinical evaluation is different in practice

The MDR framework for clinical evaluation was written with physical devices in mind as the baseline. The concepts generalise cleanly to software — Article 61 never says "physical device," and Annex XIV Part A never excludes software — but the concrete evidence sources look different. A clinical evaluation for an orthopaedic implant can draw on decades of published outcome data, registry information, and clinical investigations where surgeons implant the device and measure outcomes over years. A clinical evaluation for a SaMD triage tool cannot use any of that directly. The published literature talks about the clinical task, not the specific software. There are rarely registries. And the "clinical investigation" often looks more like a performance study on curated data than a traditional prospective trial.

This is where SaMD founders get tripped up. They read MDR Article 61, they read Annex XIV Part A, and they assume the framework does not fit. Then they read MDCG 2019-11 Rev.1 and realise it does fit — they just have to translate the Annex XIV data sources into the forms that actually exist for software. Literature on the clinical problem. Retrospective performance on archived cases. Prospective validation on independent cohorts. PMCF that tracks performance after deployment. Every one of those maps onto a clause in Annex XIV Part A or B. The translation is the work.

For the underlying clinical evaluation framework, our post on [what clinical evaluation under MDR is](/blog/what-is-clinical-evaluation-under-mdr) is the starting point. For SaMD qualification and classification, see our post on [what software as a medical device is under MDR](/blog/what-is-software-as-medical-device-samd-mdr). This post assumes both and focuses on the software-specific overlay.

## The data sources that actually work for software

Annex XIV Part A sets out the clinical evaluation process and the data sources that feed it. For SaMD, the three classical sources translate like this.

**Scientific literature.** For software, the literature usually falls into two buckets. The first bucket is literature on the clinical task itself — the diagnosis, the prognosis, the monitoring question — which establishes what the clinical benefit of correct classification looks like, how current practice performs, and what the consequences of error are. This bucket is almost always available and is essential context for the CER. The second bucket is literature on the specific software or on closely related software, which is rarer for startup SaMD and often non-existent for novel products. A well-built SaMD CER uses the first bucket heavily to frame the clinical problem, and then uses manufacturer-generated performance data to demonstrate what the specific software does on that problem.

**Equivalence.** Under MDR Article 61(4) and Annex XIV Part A(3), with MDCG 2020-5 (April 2020) as the authoritative guidance, a manufacturer can base clinical evaluation on data from an equivalent device if the two devices are equivalent in technical, biological, and clinical characteristics. For SaMD the biological characteristics dimension often collapses — there is no tissue contact — which superficially makes equivalence sound easier. In practice it is harder. Two pieces of software with the same intended purpose, trained on different data, with different architectures and preprocessing pipelines, produce genuinely different outputs on the same inputs. The technical characteristics dimension is usually different enough to block a defensible equivalence claim. Equivalence can still have a supporting role, but it is rarely the primary pathway for SaMD clinical evidence.

**Manufacturer's own clinical performance data.** This is the source that carries most SaMD clinical evaluations in practice. It takes two forms: retrospective performance studies on curated, annotated datasets, and prospective validation studies on independent cohorts. These are not "clinical investigations" in the traditional interventional sense — nobody is implanting anything — but they are structured clinical data generation activities that produce the evidence Annex XIV Part A asks for, and they need the same discipline around protocols, pre-specified analysis plans, and documentation.

## Performance versus clinical performance — the distinction that fails CERs

The single hardest conceptual move for SaMD founders is separating technical performance from clinical performance. The MDR uses "clinical performance" as a defined term, and it is not the same as the accuracy number on your test set.

Technical performance is what the software does on the data you feed it. Sensitivity, specificity, AUC, calibration, error rates under specific conditions. These are model-evaluation metrics, and they are necessary inputs to the clinical evaluation. They are not the clinical evaluation.

Clinical performance, in the MDR sense, is the ability of the device to achieve its intended purpose as claimed by the manufacturer — the clinically meaningful outcome for the intended use population when the software is used in the intended workflow by the intended user. The CER has to connect the technical metrics to that clinical reality. An AUC of 0.94 on a test set does not, on its own, tell you what happens to patients when the software is used. Who uses it? On which patients? In what workflow? What decision does it support? What does the user do when the output is wrong? What is the clinical benefit compared to current practice?
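
One concrete way to see the gap: the same test-set sensitivity and specificity translate into very different predictive values depending on the prevalence of the condition in the intended use population. The Python sketch below, with entirely made-up numbers, shows a classifier with 92% sensitivity and 90% specificity delivering a positive predictive value of roughly 80% in a high-prevalence specialist setting but under 10% in a 1% prevalence screening setting. It is an illustration of the principle, not a template from any real submission.

```python
# Illustrative only: the same test-set metrics mean very different things
# for patients depending on where the software is deployed.
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    tp = sensitivity * prevalence            # true positives per unit population
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.30, 0.05, 0.01):        # specialist clinic vs screening settings
    ppv, npv = ppv_npv(sensitivity=0.92, specificity=0.90, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.1%}")
```

The CER has to do this kind of translation explicitly, for the actual intended use population, and then connect the predictive values to the clinical consequences of each error type.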

Notified Bodies reading SaMD CERs are increasingly explicit about this distinction. A CER that stops at technical performance metrics and does not build the bridge to clinical performance for the intended use population is incomplete, and that incompleteness is one of the most common major findings on SaMD submissions. The connection is built with data, not assertions — which is why the retrospective and prospective study design choices in the next sections matter so much.

## Retrospective performance on curated datasets

Retrospective performance studies are the workhorse of SaMD clinical evidence for most startups. The structure is straightforward: assemble a dataset of historical cases with known ground truth, run the locked software version on the dataset, compare the software output to the ground truth, and report the performance metrics and their confidence intervals against a pre-specified analysis plan.
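
As a rough illustration of what the analysis step can look like, here is a minimal Python sketch that computes sensitivity, specificity, and AUC at a pre-specified threshold, with percentile-bootstrap confidence intervals. All names are illustrative; the point is that the threshold, the metrics, and the CI method are fixed in the analysis plan before the locked software version touches the data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)  # fixed seed so the analysis is reproducible

def metrics(y_true, y_score, threshold):
    """Sensitivity, specificity, and AUC at the pre-specified operating threshold."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    y_pred = y_score >= threshold
    sens = np.mean(y_pred[y_true == 1])      # true positive rate
    spec = np.mean(~y_pred[y_true == 0])     # true negative rate
    return sens, spec, roc_auc_score(y_true, y_score)

def bootstrap_ci(y_true, y_score, threshold, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence intervals for all three metrics."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample cases
        if len(np.unique(y_true[idx])) < 2:              # skip one-class resamples
            continue
        samples.append(metrics(y_true[idx], y_score[idx], threshold))
    lo, hi = np.percentile(samples, [100 * alpha / 2, 100 - 100 * alpha / 2], axis=0)
    return list(zip(lo, hi))  # [(sens_lo, sens_hi), (spec_lo, spec_hi), (auc_lo, auc_hi)]
```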

The discipline needed to make a retrospective study credible is substantial. The dataset has to be isolated from any data used in development of the software — no overlap with training, tuning, or model selection — and the isolation has to be documented to a level that a Notified Body auditor can verify. The ground truth has to be defensible: histopathology where applicable, expert consensus with a documented adjudication process, long-term outcome follow-up where appropriate for the clinical task. The analysis plan has to be locked before the software is run on the data, and the software version has to be locked before the analysis begins. A retrospective study that was adjusted after looking at the results is not evidence.
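
Documenting that isolation is easier when it is mechanically checkable. A minimal sketch, assuming each case is a file on disk and with placeholder directory paths: publish a manifest of content hashes for the locked test set and assert that none of them appear in any dataset used during development. Content hashing only catches byte-identical duplicates; patient-level overlap needs an additional check on de-identified patient identifiers.

```python
import hashlib
from pathlib import Path

def manifest(directory):
    """Map SHA-256 content hash -> file path for every case in a dataset."""
    hashes = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            hashes[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return hashes

test_set = manifest("data/test_locked")      # locked before any analysis begins
development = manifest("data/development")   # training + tuning + model selection

overlap = set(test_set) & set(development)
assert not overlap, f"Test set contamination: {len(overlap)} shared cases"
```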

For SaMD that processes image, signal, or structured clinical data, a well-designed retrospective study on a curated, independent dataset can carry most of the manufacturer-generated clinical evidence for the initial CE marking, particularly for Class IIa and some IIb devices. For higher-risk devices and novel clinical claims, the retrospective study is usually supplemented by prospective validation.

## Prospective validation cohorts

Prospective validation on independent sites is the strongest form of SaMD clinical evidence, because it tests the software on data the software has genuinely never seen, produced under the conditions of real clinical use. Prospective validation studies are more expensive and slower than retrospective studies, but they close the gap between performance on a curated dataset and performance in the clinical environment the device is actually intended for.

The design choices matter. The sites have to be independent of the development pipeline — not the same hospitals whose data trained the software, not the same scanners or devices whose output was used to build the preprocessing. The patient population has to reflect the intended use population, including the subgroups where the clinical task is hardest. The protocol has to follow the structure expected of any clinical investigation where applicable — documented inclusion and exclusion criteria, a pre-specified statistical analysis plan, a locked software version, ethical approval where required by national law. EN ISO 14155:2020+A11:2024 provides the discipline framework where the prospective study meets the threshold of a clinical investigation under MDR Articles 62 to 82.
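
For the pre-specified statistical analysis plan, the sample size calculation is a good example of what "locked before the study starts" means in practice. Here is a hedged sketch of one common single-arm design, in Python with scipy: test whether sensitivity exceeds a performance goal using an exact binomial test, and find the smallest number of positive cases that gives the required power. The performance goal, expected sensitivity, alpha, and power below are illustrative placeholders, not recommendations for any particular device.

```python
from scipy.stats import binom

def positives_needed(n, p0, alpha):
    """Smallest k such that P(X >= k | sensitivity = p0) <= alpha."""
    for k in range(n + 1):
        if binom.sf(k - 1, n, p0) <= alpha:  # sf(k-1) = P(X >= k)
            return k
    return None

def sample_size(p0=0.80, p1=0.90, alpha=0.025, power=0.90):
    """Smallest number of positive cases giving the required power against p0."""
    n = 1
    while True:
        k = positives_needed(n, p0, alpha)
        if k is not None and binom.sf(k - 1, n, p1) >= power:
            return n, k
        n += 1

n, k = sample_size()
print(f"{n} positive cases; claim holds if at least {k} are correctly detected")
```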

For many SaMD clinical evaluations, the prospective validation cohort is what carries the clinical performance demonstration across the line. Literature frames the problem. Retrospective data shows the software works on curated cases. Prospective validation shows it still works on data produced in the field under the conditions the intended purpose describes.

## PMCF for software — the lifecycle commitment

Annex XIV Part B governs post-market clinical follow-up, and PMCF for software is where the lifecycle discipline lives. Software is not a static device. The clinical environment in which it operates changes. The upstream data changes — new imaging hardware, new patient mix, new referral patterns. The software itself may be updated. The PMCF plan has to be built around those realities.

A credible SaMD PMCF plan has several elements the Notified Body will look for. Ongoing literature surveillance on the clinical task and on the technology. A mechanism for collecting real-world performance data where it is feasible — this can be periodic re-evaluation against held-out reference cases, outcome tracking where the clinical workflow allows it, structured feedback from users, or telemetry-based monitoring of the distribution of inputs and outputs. A plan for what happens when performance changes are detected, including the threshold at which action is triggered and the response pathway. And a periodic update cadence for the CER itself that reflects the risk class and the novelty of the device — at least annually for Class III and implantable, less frequent but never "never" for lower classes.

PMCF for software is not an afterthought. It is engineered into the product from the start. Telemetry hooks, reference data pipelines, threshold logic — these are features, not documents. A PMCF plan that depends entirely on manual complaint review is inadequate for any non-trivial SaMD.
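
A minimal sketch of what that threshold logic can look like, assuming the software produces a continuous output score and a reference distribution was frozen at validation time. The Population Stability Index and the 0.2 action threshold below are common industry conventions for distribution monitoring, used here purely as illustrative placeholders; the actual metric and threshold belong in the PMCF plan, pre-specified and justified.

```python
import numpy as np

def psi(reference, live, n_bins=10):
    """Population Stability Index between reference and live score distributions."""
    reference, live = np.asarray(reference), np.asarray(live)
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf    # catch out-of-range live scores
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    live_frac = np.histogram(live, edges)[0] / len(live)
    eps = 1e-6                               # avoid log(0) on empty bins
    return float(np.sum((live_frac - ref_frac) * np.log((live_frac + eps) / (ref_frac + eps))))

ACTION_THRESHOLD = 0.2  # pre-specified in the PMCF plan, never tuned after deployment

def window_triggers_response(reference_scores, window_scores):
    """True if this monitoring window should trigger the PMCF response pathway."""
    return psi(reference_scores, window_scores) > ACTION_THRESHOLD
```

The pre-specified threshold and the documented response pathway are what turn telemetry into PMCF evidence rather than a dashboard nobody acts on.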

## Common mistakes SaMD founders make

**Treating technical performance as clinical performance.** An AUC number is not a clinical evaluation. The CER has to connect the metric to the clinical benefit in the intended use population. This is the single most common major finding on SaMD submissions.

**Test set contamination.** A test set that leaked into training, tuning, or model selection is not an independent test set, and the performance numbers derived from it will not withstand Notified Body scrutiny. This is unfixable after the fact — the only solution is to isolate the test set from day one.

**Claiming equivalence where it does not hold.** Two SaMD products with the same intended purpose are almost never equivalent in the strict sense MDCG 2020-5 requires. Founders try equivalence because it sounds cheaper; Notified Bodies reject the claims because the technical characteristics dimension does not hold.

**Writing the CER from the data the team happens to have.** The clinical evaluation plan has to be written first. The plan specifies the intended purpose, the clinical performance claims, the evidence sources, and the appraisal criteria. The data collection then serves the plan. Reverse-engineering a CER from whatever data exists is how submissions fail.

**Under-specifying the intended purpose.** A vague intended purpose forces the CER to cover every possible use. A tight, honest intended purpose — one clinical task, one user group, one care setting — is dramatically cheaper to evidence. Broad claims in marketing that do not match the CER intended purpose are also a recurring finding.

**Ignoring PMCF until after CE marking.** The PMCF plan is part of the pre-market technical documentation. It has to be drafted and resourced before the CE mark, and the telemetry and monitoring infrastructure has to be built into the product from the start.

## The Subtract to Ship angle

The [Subtract to Ship framework](/blog/subtract-to-ship-framework-mdr) applied to SaMD clinical evaluation runs the Evidence Pass in the specific order that makes sense for software.

Start with the intended purpose and write it tight. Every word you add to the intended purpose is evidence you will eventually have to produce. Identify the specific general safety and performance requirements that need clinical evidence for this device — not all of them do. For each required piece of evidence, evaluate the sources in order: literature on the clinical task first, equivalence if it holds (for SaMD it usually does not), retrospective performance on curated independent data next, prospective validation last. Lock the software version before the evidence generation begins. Build the PMCF infrastructure into the product, not just the paperwork. And do not duplicate model evaluation and clinical evaluation — one pre-specified analysis plan, one locked test set, one report that serves both the technical file and the CER.

This is not about doing less clinical evaluation than the MDR requires. It is about not defaulting to the most expensive pathway before the cheaper ones have been properly used, and about not generating two parallel bodies of evidence when one disciplined body serves both purposes. For the underlying Evidence Pass logic see post 65 and post 111.

## Reality Check — Where do you stand?

1. Is your intended purpose written in one precise sentence covering the clinical task, the user, the care setting, and the patient population?
2. Have you written a clinical evaluation plan before starting to assemble the CER, with pre-specified appraisal and analysis criteria?
3. Is your test set isolated from training, tuning, and model selection, with documented provenance a Notified Body auditor could verify?
4. Does your clinical evidence connect technical performance metrics to clinical performance for the intended use population, or does it stop at the AUC?
5. If you are claiming equivalence under MDCG 2020-5, can you defend the technical, biological, and clinical characteristics dimensions with sufficient access to the other device's data?
6. Does your retrospective performance study have a pre-specified analysis plan and a locked software version, signed off before the data is looked at?
7. Is there a prospective validation cohort on independent sites, or a credible plan to collect one through PMCF, for any clinical claim the retrospective data cannot close?
8. Is your PMCF plan built into the product as engineering features, or is it a paragraph in a document that nobody has resourced?
9. Is your SaMD clinical evaluation consistent with the qualification and classification in MDCG 2019-11 Rev.1, and with the software lifecycle obligations under EN 62304:2006+A1:2015?

## Frequently Asked Questions

**Does MDR require a separate clinical evaluation pathway for SaMD?**
No. MDR Article 61 and Annex XIV Parts A and B apply to SaMD with the same structure and obligations as for any other medical device. MDCG 2019-11 Rev.1 confirms that software falls under the standard clinical evaluation framework. What differs is the concrete form of the clinical data — literature, retrospective studies, prospective validation cohorts, PMCF telemetry — not the framework itself.

**Is a retrospective performance study on a curated dataset enough clinical evidence for SaMD CE marking?**
For many Class IIa SaMD products and some Class IIb, a well-designed retrospective study on a genuinely independent, curated dataset with defensible ground truth and a pre-specified analysis plan can carry most of the manufacturer-generated clinical evidence, supplemented by literature on the clinical task. For higher-risk devices, novel claims, or cases where the retrospective data does not reflect the intended use environment, prospective validation on independent sites is usually needed in addition.

**Can I claim equivalence to another SaMD product under MDCG 2020-5?**
In principle yes. In practice rarely. The technical characteristics dimension — architecture, training data, preprocessing, output behaviour — is almost always different enough between two pieces of software that a defensible equivalence claim is difficult to construct, and Notified Bodies are sceptical of SaMD equivalence claims. Most SaMD clinical evaluations rely on manufacturer-generated performance data rather than equivalence.

**What is clinical performance for SaMD under MDR?**
Clinical performance is the ability of the software to achieve its intended purpose as claimed by the manufacturer, measured by the clinically meaningful outcome for the intended use population in the intended workflow. Technical metrics such as AUC or sensitivity are inputs to clinical performance, not substitutes for it. The CER has to build the bridge from the technical metrics to the clinical benefit for the patient.

**What does PMCF look like for SaMD?**
A PMCF plan for SaMD usually includes ongoing literature surveillance, periodic re-evaluation against held-out reference data, telemetry-based monitoring of input and output distributions, structured user feedback, and a defined response pathway when performance changes are detected. Passive complaint handling alone is not adequate PMCF for any non-trivial SaMD. The PMCF infrastructure is built into the product as engineering features, not added as a document after CE marking.

## Related reading

- [MDR Classification Rule 11 for Software](/blog/mdr-classification-rule-11-software) — the classification rule that frames most SaMD clinical evaluations.
- [What Is Clinical Evaluation Under MDR?](/blog/what-is-clinical-evaluation-under-mdr) — the Cat 3 pillar post on the Article 61 and Annex XIV process.
- [Sufficient Clinical Evidence Under MDR](/blog/sufficient-clinical-evidence-mdr) — how to decide when your clinical evidence is actually sufficient for conformity assessment.
- [Clinical Evaluation Plan (CEP) Under MDR](/blog/clinical-evaluation-plan-cep-mdr) — the plan document that every SaMD clinical evaluation needs before the CER is drafted.
- [Clinical Evaluation Report (CER) Structure Under MDR](/blog/clinical-evaluation-report-structure-mdr) — how the CER is structured and what each section has to contain.
- [Equivalence Under MDCG 2020-5](/blog/equivalence-mdr-clinical-evaluation) — the detailed treatment of equivalence and why SaMD rarely qualifies.
- [Post-Market Clinical Follow-Up (PMCF) Under MDR](/blog/pmcf-post-market-clinical-follow-up-mdr) — the PMCF framework the SaMD overlay sits on top of.
- [Clinical Investigations for Software Devices](/blog/clinical-investigations-software-devices) — when a prospective study crosses the threshold into a full clinical investigation under Articles 62 to 82.
- [What Is Software as a Medical Device (SaMD)?](/blog/what-is-software-as-medical-device-samd-mdr) — the SaMD category pillar on qualification and classification.
- [MDCG 2019-11 Rev.1 — What the Software Guidance Actually Says](/blog/mdcg-2019-11-software-guidance) — the full reading of the definitive software guidance document.
- [Clinical Evaluation of AI/ML Medical Devices](/blog/clinical-evaluation-ai-ml-medical-devices) — the AI-specific overlay for products where machine learning is the core technology.
- [The Subtract to Ship Framework for MDR Compliance](/blog/subtract-to-ship-framework-mdr) — the methodology pillar, including the Evidence Pass referenced in this post.

## Sources

1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 61 (clinical evaluation), Annex XIV Part A (clinical evaluation), Annex XIV Part B (post-market clinical follow-up). Official Journal L 117, 5.5.2017.
2. MDCG 2019-11 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR. First published October 2019; Revision 1, June 2025.
3. MDCG 2020-5 — Clinical Evaluation — Equivalence: A guide for manufacturers and notified bodies, April 2020.
4. EN 62304:2006+A1:2015 — Medical device software — Software life-cycle processes (IEC 62304:2006 + IEC 62304:2006/A1:2015).

---

*This post is part of the Clinical Evaluation and Clinical Investigations category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Clinical evaluation for SaMD is where the Article 61 framework meets the realities of software evidence generation — literature on the clinical task, retrospective studies on curated data, prospective validation cohorts, and PMCF built into the product. The framework is identical to any other device. The translation is the work.*

---

*This post is part of the [Clinical Evaluation & Investigations](https://zechmeister-solutions.com/en/blog/category/clinical-evaluation) cluster in the [Subtract to Ship: MDR Blog](https://zechmeister-solutions.com/en/blog). For EU MDR certification consulting, see [zechmeister-solutions.com](https://zechmeister-solutions.com).*
