Performance validation for an AI medical device is the evidence that your algorithm actually delivers the clinical benefit your intended purpose claims. Under MDR, that means pre-specified metrics, a clean training/validation/test split with no leakage, subgroup analysis across clinically relevant populations, and a study design tied to Annex I §1 benefit-risk and Article 61 clinical evidence obligations.
By Tibor Zechmeister and Felix Lenhard.
TL;DR
- MDR does not list "AUC" or "sensitivity" anywhere, but Annex I §1 requires devices to achieve their intended performance and Annex I §17.1 requires software-driven devices to ensure repeatability, reliability, and performance in line with their intended use.
- Your validation plan must pre-specify metrics, acceptance thresholds, and the clinical rationale for each before you look at the test set.
- Training, validation, and test sets must be disjoint at the patient level, not the image level. Leakage is the single most common reason startup AI submissions fail review.
- Sensitivity and specificity alone rarely tell the full story. For most AI devices you need AUC, calibration, subgroup performance, and failure mode analysis.
- An independent, locked test set tested once is worth more than ten iterated results on a validation set.
- Performance validation is evidence under Article 61 only when the test population matches your intended purpose population.
Why this matters
A founder walked into Tibor's office last year with a dermatology classifier and a deck full of 94% accuracy claims. The model had been trained on 12,000 images, validated on 3,000, and "tested" on another 3,000. Beautiful numbers. The problem: the split was random at the image level, and many patients had contributed multiple lesion photos. The same patient appeared in training and test. Once Tibor forced a patient-level split, test accuracy dropped to 71%, and three subgroups (darker skin types, lesions on acral sites, and images from one of the two contributing clinics) performed worse than a coin flip.
That startup had not done anything unusual. They had done what most teams do when nobody tells them the rules. And the rules, under MDR, are not optional. Annex I General Safety and Performance Requirements demand that the device "achieve the performance intended by their manufacturer" (Annex I §1) and that software be "developed and manufactured in accordance with the state of the art taking into account the principles of... verification and validation" (Annex I §17.2). A Notified Body reviewer who cannot reconstruct your validation from first principles will not sign off. And they will ask hard questions.
This post is the playbook Tibor wishes every AI MedTech startup had on day one.
What MDR actually says
MDR does not give you a metric. It gives you obligations. Four articles and annex sections matter most.
Annex I §1. General Safety and Performance Requirements: "Devices shall achieve the performance intended by their manufacturer and shall be designed and manufactured in such a way that, during normal conditions of use, they are suitable for their intended purpose." Your validation has to prove you achieved the performance you claimed. If you claim sensitivity above 90%, you need to prove sensitivity above 90% on a population representative of your intended use.
Annex I §17.1. Electronic programmable systems: "Devices that incorporate electronic programmable systems... shall be designed to ensure repeatability, reliability and performance in line with their intended use. In the event of a single fault condition, appropriate means shall be adopted to eliminate or reduce as far as possible consequent risks or impairment of performance." For an AI system, "single fault condition" reasoning includes adversarial input, out-of-distribution data, and sensor failure.
Article 61. Clinical evaluation: "Confirmation of conformity with relevant general safety and performance requirements... under the normal conditions of the intended use of the device, and the evaluation of the undesirable side-effects and of the acceptability of the benefit-risk-ratio... shall be based on clinical data providing sufficient clinical evidence." For AI devices, analytical and clinical performance both feed the clinical evaluation.
Annex XIV Part A. The clinical evaluation plan must specify performance characteristics, methods to evaluate them, and the clinical relevance of each. This is where your metric choices, thresholds, and study design live.
The state of the art for software validation is captured in EN 62304:2006+A1:2015, which MDR Annex I §17.2 effectively requires. EN 62304 does not tell you what metrics to compute, but it tells you that verification activities must trace to requirements and that validation must confirm the software meets its intended use.
Risk management under EN ISO 14971:2019+A11:2021 ties the loop: any performance shortfall becomes a hazardous situation, and your benefit-risk conclusion must hold with the actual measured performance, not an aspirational one.
A worked example
A Class IIa AI device for detecting diabetic retinopathy on fundus images. The intended purpose reads: "Assist ophthalmologists and trained screeners in identifying referable diabetic retinopathy (moderate non-proliferative or worse) in adults with diagnosed Type 1 or Type 2 diabetes."
That single sentence drives everything that follows.
Metrics, pre-specified:
- Primary: sensitivity for referable DR at a fixed operating point, with acceptance criterion ≥ 90%, 95% confidence interval lower bound ≥ 85%.
- Co-primary: specificity at the same operating point, ≥ 80%, CI lower bound ≥ 75%.
- Secondary: AUC across the full ROC curve.
- Secondary: calibration. Brier score and reliability diagram, because the device outputs a probability and clinicians will interpret it.
- Secondary: subgroup sensitivity and specificity for age bands, sex, ethnicity, camera make, and diabetes type.
- Failure analysis: rate of "ungradable" outputs and false negatives on proliferative DR (safety-critical misses).
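The pass/fail logic against those pre-specified thresholds is mechanical, which is the point. A minimal sketch, assuming hypothetical case counts and the Wilson score interval as the CI method (the article does not prescribe one):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (two-sided 95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def sensitivity_passes(tp: int, fn: int, point_min: float, ci_lower_min: float) -> bool:
    """Pass only if the point estimate AND the CI lower bound both clear
    their pre-specified acceptance thresholds."""
    sens = tp / (tp + fn)
    lo, _ = wilson_ci(tp, tp + fn)
    return sens >= point_min and lo >= ci_lower_min

# Hypothetical counts: 461 of 500 referable-DR cases flagged.
print(sensitivity_passes(461, 39, 0.90, 0.85))  # True: 92.2%, CI lower bound ~89.5%
```

The same two-condition check (point estimate and CI lower bound) applies to the co-primary specificity endpoint.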
Dataset split:
- Training: 40,000 images from three clinics, patient IDs A.
- Validation (hyperparameter tuning): 5,000 images from the same three clinics, patient IDs B, no overlap with A.
- Test: 4,000 images from two completely different clinics, patient IDs C, locked before any hyperparameter decisions.
Patient-level split. Clinic-level separation for the test set. The test set is opened once. If the team opens it twice, the second result is no longer a valid test. It is another validation round, and you need a new held-out set.
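A sketch of the split mechanics under those constraints, assuming string patient IDs. Hashing the ID rather than drawing random numbers makes the train/validation assignment deterministic and reproducible without storing a seed, and every image from one patient lands in the same split. The test set never passes through this function: it comes from separate clinics and is locked.

```python
import hashlib

def assign_split(patient_id: str, val_fraction: float = 0.15) -> str:
    """Deterministic patient-level train/validation assignment.

    Hashing the patient ID guarantees that all images from one patient
    receive the same assignment on every run. val_fraction is illustrative.
    """
    bucket = int(hashlib.sha256(patient_id.encode()).hexdigest(), 16) % 10_000
    return "val" if bucket < val_fraction * 10_000 else "train"

# Both images from patient "P-0042" get the same assignment:
images = [("P-0042", "lesion_a.png"), ("P-0042", "lesion_b.png"), ("P-0107", "lesion_c.png")]
for pid, fname in images:
    print(pid, fname, assign_split(pid))
```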
Study design choices:
- Reference standard: adjudicated ground truth from three retina specialists with a fourth as tiebreaker.
- Blinding: adjudicators do not see model outputs.
- Sample size justification: powered for the primary sensitivity endpoint with a two-sided 95% CI half-width of 3%.
- Pre-registration of the protocol internally, dated, signed, stored in the technical documentation.
What the team found when they ran it:
- Sensitivity 92.1%, CI [89.4, 94.3]. Passes.
- Specificity 82.4%, CI [80.1, 84.5]. Passes.
- AUC 0.958.
- Calibration: the model was overconfident above 0.8 probability. They recalibrated using isotonic regression on the validation set and re-tested on the held-out set (a pre-specified secondary analysis, documented).
- Subgroup: sensitivity dropped to 84% in patients over 75. Below the acceptance threshold. They added a contraindication to the intended purpose for that subgroup and documented the limitation in the IFU, or they would have had to collect more training data and re-validate.
That last point is the honest version of AI validation. You will find subgroups where the device underperforms. The question is not whether. It is what you do about it. The MDR answer is: constrain the intended purpose, or improve the device and re-test.
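The calibration point generalises: a model can discriminate well and still be overconfident, and the Brier score catches it. A toy illustration with made-up numbers, not figures from the study above:

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 outcome.
    Lower is better; a constant 0.5 prediction scores 0.25."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Hypothetical overconfident model: predicts 0.95 on cases that are
# positive only 80% of the time.
labels = [1] * 8 + [0] * 2
overconfident = brier_score([0.95] * 10, labels)
recalibrated = brier_score([0.80] * 10, labels)  # e.g. after isotonic regression
print(round(overconfident, 4), round(recalibrated, 4))  # recalibrated is lower
```

Discrimination (AUC) is unchanged by a monotone recalibration like isotonic regression; only the probability values, and therefore the Brier score, improve.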
The Subtract to Ship playbook
Most AI teams over-engineer the experiment tracking and under-engineer the regulatory bookkeeping. Flip it.
Step 1. Write the intended purpose first. Before a single line of training code runs. The intended purpose determines the reference population, which determines what a representative test set looks like. Article 2(12) defines intended purpose; Annex I §1 ties it to performance. See our intended purpose drives regulatory decisions post for the framework.
Step 2. Pre-specify your validation plan. Before touching data. Metrics, thresholds, subgroups, failure modes, sample size. Lock the document. Date it. Sign it. This is Annex XIV Part A territory and your auditor will ask for it.
Step 3. Split the data properly. Patient-level, ideally also site-level for the test set. No peeking. Store the split as a file, hashed, versioned, committed to your document control system. Leakage is the cause of more failed AI audits than any other single issue.
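"Store the split as a file, hashed, versioned" can be as small as this (file name and function are illustrative, not a prescribed format):

```python
import hashlib
import json

def lock_split(assignment: dict, path: str) -> str:
    """Serialise the patient-to-split mapping and return its SHA-256 digest.

    Record the digest in document control: any later change to the split
    file, however small, is then detectable."""
    blob = json.dumps(assignment, sort_keys=True).encode()
    with open(path, "wb") as f:
        f.write(blob)
    return hashlib.sha256(blob).hexdigest()

digest = lock_split({"P-0042": "train", "P-0107": "test"}, "split_v1.json")
print(digest[:16])  # goes into the validation plan / document control record
```

Sorting the keys before serialising makes the digest independent of dictionary order, so re-generating the file from the same split always yields the same hash.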
Step 4. Use validation data for tuning. Use test data once. If you use your test set for model selection, it becomes a validation set and you need a new test set. A locked envelope is the mental model: open once, compute metrics, write them down, close.
Step 5. Report subgroups honestly. If your model underperforms in subgroups, that is not a failure of the device, it is a finding of the study. Document it. Either constrain the intended purpose or improve the model. Do not average the numbers to hide the gap.
Step 6. Tie everything to risk management. Every identified failure mode (ungradable inputs, adversarial perturbation, out-of-distribution drift, subgroup underperformance) becomes an entry in your EN ISO 14971 risk file with a corresponding control.
Step 7. Plan for post-market. The validation you do before CE is not the end. Articles 83–86 and Annex III require a post-market surveillance plan. For AI, that means drift monitoring, real-world performance tracking, and a path to re-validation if performance degrades. See post-market surveillance for AI devices for the full lifecycle view.
The subtract is the number of metrics you report. Pick the few that matter clinically and defend them rigorously. A report with six well-chosen metrics on a clean test set beats a report with thirty metrics on a leaky one.
Reality Check
Ask yourself the following and answer honestly:
- Was your validation plan written and locked before you looked at the test set?
- Can you prove your training, validation, and test splits are disjoint at the patient level? Is there a script that reproduces the split from a seed?
- Have you analyzed performance across every clinically relevant subgroup, not just the overall population?
- Does your test set come from sites or scanners or populations that were NOT represented in training?
- Have you computed calibration, not just discrimination?
- Are your acceptance thresholds justified by clinical relevance, or did you pick them after seeing the results?
- Does every identified failure mode have a corresponding risk control in your ISO 14971 file?
- If a Notified Body reviewer asked you to rerun the test set calculation right now, could you do it in under an hour?
If you cannot answer all eight with a clean yes, your validation evidence is not yet MDR-ready.
Frequently Asked Questions
Is AUC sufficient as a primary endpoint for an AI medical device? Rarely. AUC measures discrimination across all thresholds, but clinicians use the device at one operating point. Report AUC as a secondary metric and use sensitivity and specificity at the chosen operating point as primaries.
How large does my test set need to be? Large enough to give a confidence interval narrow enough to support your acceptance decision. Sample size calculations should be pre-specified and driven by the primary endpoint, the expected proportion, and the margin you need to defend.
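A back-of-the-envelope version of that calculation, using the normal approximation. Note that for a sensitivity endpoint this gives the number of positive cases needed; divide by the expected prevalence to size the whole test set.

```python
import math

def n_for_ci_half_width(p_expected: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n such that a two-sided 95% CI for a proportion has roughly
    the requested half-width (normal approximation)."""
    return math.ceil(z**2 * p_expected * (1 - p_expected) / half_width**2)

# Expected sensitivity 92%, desired half-width of 3 percentage points:
print(n_for_ci_half_width(0.92, 0.03))  # 315 positive cases
```

The approximation degrades when the expected proportion is very close to 1; for final protocol numbers, an exact or Wilson-based calculation is the safer choice.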
Do I need prospective clinical data, or is retrospective enough? It depends on class and claim. Retrospective data can be sufficient for analytical performance validation, but clinical performance and benefit-risk often require prospective evidence. EN ISO 14155:2020+A11:2024 is the GCP standard. For AI specifics, see our post on clinical evaluation for AI/ML devices.
What counts as "data leakage"? Any situation where information from the test set influenced training. Examples: same patient in training and test, normalizing using statistics computed on the full dataset, tuning hyperparameters on the test set, using image-level split when multiple images come from the same study.
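The first of those examples, patient overlap, is also the easiest to guard against in code. A sketch of a check worth running as a gate before every training job (names are illustrative):

```python
def assert_patient_disjoint(train: set, val: set, test: set) -> None:
    """Fail loudly if any patient ID appears in more than one split."""
    assert not train & val, f"train/val leakage: {train & val}"
    assert not train & test, f"train/test leakage: {train & test}"
    assert not val & test, f"val/test leakage: {val & test}"

assert_patient_disjoint({"P-0001", "P-0002"}, {"P-0003"}, {"P-0004"})  # silent pass
```

The check must run on patient IDs, not file names: two differently named images from the same patient are exactly the leakage it exists to catch.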
Can we update the model after CE marking? Only within the bounds of your change control framework. Under MDR this is assessed via the "significant change" framework in Article 120 and MDCG 2020-3. There is no EU equivalent to the FDA Predetermined Change Control Plan yet. See our PCCP for AI medical devices post.
What if our subgroup performance is genuinely worse in a protected demographic? You have three options: improve training data and re-validate; constrain the intended purpose to exclude the subgroup; or document the limitation explicitly in labeling and IFU. Hiding the finding is not an option. It is fraud.
Related reading
- Clinical evaluation for AI/ML medical devices – how validation feeds into the clinical evaluation report.
- Training data requirements for AI medical devices – dataset curation and provenance under MDR.
- Data quality and bias in AI medical devices – the bias and representativeness dimension of validation.
- Locked vs adaptive AI algorithms under MDR – why locked models are the default starting point.
- Post-market surveillance for AI devices – what happens to validation evidence after CE.
Sources
- Regulation (EU) 2017/745 on medical devices, consolidated text. Annex I §1, §17.1; Article 61; Annex XIV Part A.
- EN ISO 14971:2019+A11:2021. Application of risk management to medical devices.
- EN 62304:2006+A1:2015. Medical device software. Software lifecycle processes.
- EN ISO 14155:2020+A11:2024. Clinical investigation of medical devices for human subjects. Good clinical practice.