Training data for an AI medical device is part of the device under MDR. Annex I Section 17 requires software to be developed in accordance with the state of the art, which for AI means documented, representative, controlled, and version-managed datasets. MDCG 2019-11 Rev.1 (June 2025) confirms AI software sits inside the same qualification and classification regime as other software, and EN ISO 14971:2019+A11:2021 pushes training data decisions into the risk management file as hazard sources in their own right. The AI Act layer (Regulation (EU) 2024/1689) adds horizontal obligations on training data quality on top of MDR. The auditable question is always the same: can you show, with evidence, that the data behind the model is appropriate for the intended use population?
By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.
TL;DR
- Training data is a regulated component of an AI medical device under MDR, not a private engineering matter. Annex I Section 17 and the risk management obligations of EN ISO 14971:2019+A11:2021 both reach into it.
- Representativeness is the central criterion. The dataset has to reflect the intended use population across the subgroups that matter clinically — age, sex, disease severity, scanner or device type, care setting, and any other axis where performance can plausibly vary.
- Documentation is the audit surface. Dataset provenance, inclusion and exclusion criteria, labelling protocol, inter-rater agreement, versioning, and access control all have to be captured at a level a Notified Body auditor can verify.
- Bias testing is not optional. Subgroup performance has to be measured, documented, and connected to the risk management file. A single overall accuracy number is not sufficient evidence.
- Splits and holdouts have to be defined and locked before evaluation. Test set contamination is the single most common Notified Body finding on AI submissions.
- The technical file has to contain a data governance dossier that is coherent with the clinical evaluation and the risk management file. Three documents telling three different stories is a fail.
Why training data matters legally under MDR
Founders sometimes treat training data as an internal engineering choice — something the data scientists handle, separate from the regulatory file. Under MDR that separation does not hold. The training data is a direct input to the device's behaviour in the field, and every provision that governs device safety and performance reaches into it.
Annex I of MDR sets out the general safety and performance requirements. Section 17 deals specifically with electronic programmable systems and software. It requires that software be developed and manufactured in accordance with the state of the art, taking into account the principles of development life cycle, risk management, and verification and validation. (Regulation (EU) 2017/745, Annex I, Section 17.) For an AI medical device, "state of the art" in 2026 is not a vague term. It includes the data governance practices that competent Notified Bodies and the broader MedTech AI community have converged on: documented provenance, representativeness analysis, bias testing, version control, and test set isolation. A dataset built without those practices does not meet the state of the art for AI software development, and the Section 17 obligation is not satisfied.
MDCG 2019-11 Rev.1 (June 2025) confirms that AI and ML software fall under the same qualification and classification regime as any other medical device software. There is no separate "AI track." Everything that would be expected of a classical SaMD development process is expected of an AI development process, and the AI-specific elements sit on top.
EN ISO 14971:2019+A11:2021 is the harmonised standard for risk management. For AI medical devices, the training data is itself a hazard source. A non-representative dataset is a hazard. A label quality problem is a hazard. A distribution that drifts between training and deployment is a hazard. Each one has to be identified in the hazard analysis, its risk has to be estimated and evaluated, and controls have to be put in place and verified. Training data decisions are risk management decisions, and they belong in the risk management file.
The AI Act (Regulation (EU) 2024/1689) adds horizontal obligations on training data quality for high-risk AI systems as a general matter. We reference it here only in general terms — the operational interface between the AI Act and MDR conformity assessment is still being clarified. The practical consequence for founders in 2026 is that a training data dossier built to satisfy MDR and EN ISO 14971 expectations will cover most of what the AI Act also asks for, and building the two compliance paths as one integrated file is cheaper than building them as two separate ones.
Representativeness — the central criterion
Representativeness is the single most important property of a training dataset for a medical device. A model can be trained with perfect technical discipline — clean code, good hyperparameters, solid architecture — and still produce a dangerous device if the data it learned from does not reflect the patients it will be used on.
The representativeness question starts from the intended use population. The intended purpose statement of the device defines who the device is for: the clinical task, the user group, the care setting, the patient population. Every axis on which patients can vary clinically within that population is an axis on which the training data has to be examined. Age distribution, sex, ethnicity and skin type where relevant, disease severity, comorbidities, imaging hardware or input device characteristics, clinical workflow, geographic region. Each of these is a potential subgroup where performance can diverge from the overall number.
The analysis is not "does the dataset look diverse." It is "does the dataset reflect the intended use population on each of the clinically relevant axes, and where it does not, have we either narrowed the intended use population to match the data, or collected more data to match the intended use, or documented the limitation and accepted the residual risk under the EN ISO 14971 process."
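The comparison described above can be made mechanical. The following is a minimal sketch, not a prescribed method: the function name, the age bands, and the 5-percentage-point tolerance are all illustrative assumptions — the real tolerances belong in your pre-specified protocol, justified per axis.

```python
from collections import Counter

def representativeness_gaps(dataset_labels, population_share, tolerance=0.05):
    """Compare the subgroup mix of a dataset against the intended use
    population and flag subgroups that deviate beyond a pre-set tolerance.

    dataset_labels: one subgroup label per record (e.g. age bands).
    population_share: subgroup -> expected share in the intended use
    population (values sum to 1.0). Tolerance is illustrative only.
    """
    n = len(dataset_labels)
    counts = Counter(dataset_labels)
    gaps = {}
    for subgroup, expected in population_share.items():
        observed = counts.get(subgroup, 0) / n
        if abs(observed - expected) > tolerance:
            gaps[subgroup] = {"observed": round(observed, 3),
                              "expected": expected}
    return gaps

# Hypothetical age-band axis where the 80+ band is underrepresented.
labels = ["18-40"] * 30 + ["40-65"] * 50 + ["65-80"] * 18 + ["80+"] * 2
population = {"18-40": 0.25, "40-65": 0.40, "65-80": 0.25, "80+": 0.10}
print(representativeness_gaps(labels, population))
```

The output is the list of flagged subgroups, and each flag forces exactly the decision described above: collect matched data, narrow the intended use, or document and accept the residual risk under EN ISO 14971.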
This is the discipline that separates an AI development team ready for CE marking from one that is not. It is not expensive once it is built into the project plan. It is prohibitively expensive if it has to be retrofitted after a Notified Body finding.
Documentation expectations — what has to be written down
The technical file does not ask for your source code and your Jupyter notebooks. It asks for a data governance dossier that an auditor can read and verify. The elements that have to be documented include, at minimum:
- Dataset identity and version. Each dataset used in training, validation, and testing is named, versioned, and fixed in time. "Version 1.3 of the development set, locked 2026-02-14" — not "the training data."
- Provenance. Where each record came from, under what legal basis, with what consent where applicable, from which clinical site, on which hardware. Data obtained from public sources still has provenance — the source, the licence, the collection conditions, and any known limitations.
- Inclusion and exclusion criteria. The rules that determined which records are in the dataset and which were excluded, written down before the dataset was assembled. Exclusions driven by data quality are fine. Exclusions driven by "these cases were hard" are not.
- Labelling protocol. Who labelled the data, with what instructions, with what reference standard, with what inter-rater agreement where multiple labellers were involved. For medical AI, the ground truth is often itself a judgment call, and the process that produced it is part of the evidence.
- Preprocessing pipeline. The transformations applied to raw data before it reached the model — normalisation, cropping, filtering, augmentation. Every transformation is part of the device and has to be documented and controlled as part of the configuration.
- Access control. Who could see which data, when, and why. Test set isolation is enforced by access control, and the access control has to be documented.
- Change history. Datasets evolve during development. Every change has to be logged with a reason, and the dataset version used for the final clinical evaluation has to be the one that shipped with the device.
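The inter-rater agreement mentioned in the labelling protocol bullet is usually reported as Cohen's kappa for two labellers: observed agreement corrected for the agreement expected by chance. A minimal sketch, with illustrative labels (the real dossier would report kappa per finding class, alongside the adjudication rule for disagreements):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two labellers over the same records.
    Assumes at least one disagreement is possible by chance
    (expected agreement < 1)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of records where the labellers match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical example: two labellers agreeing on 9 of 10 binary findings.
a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
b = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(a, b), 3))  # -> 0.8
```

Raw percent agreement here would be 90%, but kappa corrects it down to 0.8 because a coin-flip labeller would already agree half the time on this label mix — which is why the dossier should report kappa, not raw agreement.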
A Notified Body auditor reviewing an AI submission will ask for these documents by name. Teams that already have them hand them over. Teams that do not spend weeks assembling them retrospectively, and the retrospective assembly usually reveals gaps that force rework of the clinical evaluation itself.
Bias testing — measuring what representativeness bought you
Representativeness is an input. Bias testing is the verification that representativeness has achieved what it was supposed to achieve. The two are not the same thing.
Bias testing measures model performance on the subgroups identified in the representativeness analysis and compares them against the overall performance and against each other. A model with 91% overall sensitivity that holds at 89% to 93% across subgroups is a different device from a model with 91% overall sensitivity that drops to 68% in one subgroup. The second device is not a 91% device. It is a device with a known safety gap in a defined population, and the clinical evaluation and the risk management file both have to address that gap.
The bias testing protocol has to be pre-specified. Which subgroups are tested, what metrics are measured, what thresholds trigger action — all of these have to be defined before the numbers are looked at. Reading the subgroup results and then deciding which subgroups to report on is not bias testing. It is selective disclosure, and a competent auditor will recognise it.
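A pre-specified protocol of this kind reduces to a few fixed numbers checked mechanically. The sketch below is illustrative, not a recommended threshold set: the absolute floor, the maximum gap, and the subgroup sensitivities (echoing the 91%-overall, 68%-subgroup example above) are all assumptions that would be justified in your own protocol before evaluation.

```python
def bias_test(results_by_subgroup, min_sensitivity, max_gap):
    """Pre-specified subgroup check: every subgroup must meet an absolute
    floor, and no subgroup may trail the best-performing subgroup by more
    than max_gap. Both thresholds are fixed before the numbers are seen."""
    best = max(results_by_subgroup.values())
    findings = []
    for subgroup, sens in results_by_subgroup.items():
        if sens < min_sensitivity:
            findings.append((subgroup, "below absolute floor"))
        elif best - sens > max_gap:
            findings.append((subgroup, "gap to best subgroup too large"))
    return findings

# Hypothetical sensitivities and thresholds.
subgroup_sensitivity = {"18-40": 0.93, "40-65": 0.92,
                        "65-80": 0.89, "80+": 0.68}
print(bias_test(subgroup_sensitivity, min_sensitivity=0.85, max_gap=0.05))
```

Every finding the check emits has to land in the risk management file with one of the three dispositions described below; an empty findings list is itself evidence and gets recorded too.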
Where a subgroup shows degraded performance, the team has three options under EN ISO 14971:2019+A11:2021. Reduce the risk by further development — more data, better labelling, model changes. Restrict the intended use to exclude the subgroup where performance is not adequate. Accept the residual risk with clear documentation and labelling, where the benefit-risk balance supports it. Each option is legitimate. Pretending the gap does not exist is not.
Data governance — the operational layer
Data governance is the set of processes that keep the dataset discipline intact over time. It covers who can add or remove data, how changes are reviewed, how versions are tracked, how the dataset interacts with the QMS, and how dataset decisions flow into the technical file.
For a small startup, this does not have to be an elaborate system. It has to be a real one. A shared spreadsheet with dataset versions and access logs, reviewed in a recurring meeting, beats a fancy tool that nobody updates. The test is whether the governance can answer three questions on demand: what data was used to train the current shipped model, who had access to the test set during development, and what changed between the previous dataset version and the current one.
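The third of those questions — what changed between versions — falls straight out of a version log kept in chronological order. A minimal in-memory stand-in for the spreadsheet, with entirely hypothetical field names and entries:

```python
# Hypothetical stand-in for the dataset version log; the schema is
# illustrative, not prescribed. A real log would also carry author,
# review record, and access entries.
dataset_log = [
    {"version": "1.2", "locked": "2026-01-20",
     "change": "initial development set from sites A and B"},
    {"version": "1.3", "locked": "2026-02-14",
     "change": "added studies from site C to close an age-band gap"},
]

def changes_since(log, previous_version):
    """List every change after a given version.
    Assumes the log is kept in chronological order."""
    seen = False
    changes = []
    for entry in log:
        if seen:
            changes.append((entry["version"], entry["change"]))
        if entry["version"] == previous_version:
            seen = True
    return changes

print(changes_since(dataset_log, "1.2"))
```

The point is not the tooling — a spreadsheet does this fine — but that the answer is producible on demand rather than reconstructed from memory during an audit.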
Data governance intersects with EN ISO 13485 QMS obligations. The dataset management procedure is a QMS procedure. The roles and responsibilities for dataset decisions belong in the QMS role definitions. The records of dataset changes are QMS records. Treating data governance as an engineering concern outside the QMS creates a seam that auditors will find.
Splits and holdouts — the mechanics that decide credibility
Training, validation, and test splits are an engineering detail with regulatory consequences. The split is what determines whether the performance numbers in the clinical evaluation are credible evidence or an artefact of data leakage.
The principles are unambiguous in 2026. The test set is defined and isolated before model development starts. It is not used for training. It is not used for hyperparameter tuning. It is not used for model selection. It is looked at exactly once, at the end, to produce the performance numbers that go into the clinical evaluation. Teams that "take a peek" at the test set during development and then use it for final evaluation are not reporting test set performance. They are reporting contaminated validation performance.
Where the same patient contributes multiple records, splits have to be at the patient level, not the record level. A test set that contains different images from patients who also appear in the training set is not an independent test set for medical imaging purposes. The same patient's data in both sides of the split creates an information leak that inflates the numbers.
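One way to enforce a patient-level split is to decide each patient's side deterministically from a hash of the patient ID, so the assignment is reproducible and a patient can never straddle the boundary. This is a sketch under assumptions (field names and the 20% fraction are illustrative; the hash gives an approximate, not exact, split size), not the only valid method — stratified patient-level splits are common where subgroup balance matters.

```python
import hashlib

def patient_split(records, test_fraction=0.2):
    """Split at the patient level: every record of a given patient lands
    on the same side, decided by a deterministic hash of the patient ID.
    The resulting test fraction is approximate."""
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(rec["patient_id"].encode()).hexdigest()
        bucket = int(digest, 16) % 100
        (test if bucket / 100 < test_fraction else train).append(rec)
    return train, test

# Hypothetical records: 50 patients, three images each.
records = [{"patient_id": f"P{i:03d}", "image": f"img_{i}_{j}.dcm"}
           for i in range(50) for j in range(3)]
train, test = patient_split(records)

# The leakage check an auditor cares about: no patient on both sides.
train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
assert not (train_ids & test_ids)
```

The final assertion is the part worth automating in CI: it turns "the split is patient-level" from a claim in the technical file into a check that runs on every dataset version.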
For external validation — a second test set from a site, hardware, or population that was not part of the development pipeline at all — the isolation has to be even stricter. The external validation set is the closest pre-market proxy to real-world deployment, and its credibility depends on it being genuinely external.
What to record in the technical file
MDR Annex II sets out the structure of the technical documentation. For an AI medical device, the training data content has a natural home in several sections of that structure: the device description (what the device does and how the AI component fits), the design and manufacturing information (the development process including data management), the risk management file (hazards related to training data and the controls applied), and the clinical evaluation (performance evidence on the test set). The data governance dossier is not a single stand-alone document — it is a coherent thread that runs through several sections of the technical file and has to tell a single consistent story.
The minimum content that should be findable in the technical file:
- The intended use population, defined precisely.
- The dataset inventory, with versions and provenance.
- The representativeness analysis against the intended use population.
- The labelling protocol and inter-rater agreement where relevant.
- The preprocessing pipeline, documented and version-controlled.
- The split strategy, with the isolation rules.
- The bias testing protocol and results, pre-specified.
- The performance results on the independent test set.
- The link between dataset decisions and the EN ISO 14971 risk management file.
- The change control procedure that governs future dataset updates.
An auditor opening the technical file should be able to follow this thread end to end without asking a single clarifying question about the data.
Common mistakes
- Treating training data as engineering internals, outside the regulatory file. The dataset is part of the device. It belongs in the technical file and the risk management file.
- Writing a representativeness paragraph instead of doing a representativeness analysis. "The dataset is diverse and balanced" is a claim. The evidence is a breakdown by subgroup against the intended use population.
- Measuring bias only after the model is frozen, instead of pre-specifying the subgroups and thresholds before the numbers are seen.
- Splitting at the record level for patient data, allowing the same patient into both training and test sets.
- Letting the test set be used for hyperparameter tuning or model selection, then calling the final evaluation a "test set result."
- Using a dataset version for clinical evaluation that is not the version shipped with the device — the numbers have to come from the version on which the device will actually run.
- Writing a clinical evaluation and a risk management file that disagree about which subgroups were tested and what the performance was. Three documents, one story.
The Subtract to Ship angle
The Subtract to Ship framework applied to training data does not mean less data or less documentation. It means removing the work that does not trace to a specific MDR, MDCG, or harmonised standard obligation, and investing the saved effort where it is genuinely load-bearing.
Narrow the intended use population first. Every subgroup you exclude from the intended use is a subgroup you do not have to collect matched data for. Scoping the intended purpose precisely is the single biggest subtraction available in data governance, and it is cheaper to do before the first dataset is assembled than after.
Freeze the test set at the start of the project. Test set isolation is free when it is built in from day one and impossibly expensive to retrofit.
Collapse duplication. The same data governance dossier has to serve MDR, the risk management file, the clinical evaluation, and (in general terms) the AI Act. Maintaining one integrated dossier beats maintaining four overlapping ones. Subtract the duplication.
Keep the governance lean. A small team does not need a formal data governance board. It needs a named owner, a recurring review cadence, a version log, and an access control rule that is actually enforced. The test is whether the three questions in the data governance section can be answered on demand, not whether the governance process has an impressive diagram.
Reality Check — Where do you stand?
- Is your intended use population defined precisely enough to drive a representativeness analysis, or is it vague?
- Can you produce a dataset inventory right now, with version numbers and provenance for each dataset used in training, validation, and testing?
- Is the test set isolated and version-locked, with access control that you could evidence to an auditor?
- Have you pre-specified the subgroups for bias testing before any evaluation was run?
- Does your risk management file identify training data as a hazard source, with controls for non-representativeness, label quality, and distribution shift?
- Does the clinical evaluation report the performance numbers from the same dataset version that ships with the device?
- Is your data governance integrated with your QMS, or does it live in an engineering wiki that the QMS does not reference?
- If an auditor asked "show me the labelling protocol and the inter-rater agreement for the ground truth in your test set," could you answer within an hour?
Frequently Asked Questions
Is training data considered part of the medical device under MDR? Yes, in effect. MDR Annex I Section 17 requires software to be developed in accordance with the state of the art, and for AI medical devices the state of the art includes dataset governance, representativeness analysis, and version control. The training data determines device behaviour, so it is governed by the same safety and performance requirements as the rest of the device, and it belongs in the technical file and the risk management file.
What does "representativeness" mean in practice for an AI medical device dataset? It means the dataset reflects the intended use population across the clinically relevant axes — age, sex, disease severity, imaging hardware, care setting, and whatever else can plausibly affect performance in the field. The analysis compares the dataset distribution against the intended use population distribution and documents the result. A single overall diversity claim is not representativeness.
Is bias testing required for AI medical devices under MDR? Bias testing is not named as "bias testing" in the MDR text, but the obligation is implied by Annex I Section 17 (software state of the art), by the risk management requirements of EN ISO 14971:2019+A11:2021, and by the clinical evaluation obligations for performance in the intended use population. A competent Notified Body will expect pre-specified subgroup analysis as a normal part of the evidence package for an AI medical device in 2026.
How do I prevent test set contamination? Define the test set before model development starts. Lock it. Enforce access control so that the people doing model development cannot see it. Do not use it for hyperparameter tuning or model selection. Split at the patient level where patients contribute multiple records. Look at the test set exactly once, at the end, to produce the clinical evaluation numbers.
Does the AI Act add separate training data requirements on top of MDR? Yes, in general terms. Regulation (EU) 2024/1689 includes horizontal obligations on training data quality for high-risk AI systems, and medical devices covered by MDR are in the high-risk category. The detailed operational interface between the AI Act obligations and MDR conformity assessment is still being clarified in 2026. A data governance dossier built to satisfy MDR and EN ISO 14971 expectations will cover most of the AI Act expectations as well, and building one integrated file is cheaper than building two.
Where in the technical file does the training data documentation live? It threads through several sections of the Annex II structure: device description, design and manufacturing information, risk management file, and clinical evaluation. It is not a single standalone document. The key is that the thread is coherent — the dataset versions, subgroup analysis, and performance numbers referenced in each section agree with the other sections.
Related reading
- AI Medical Devices Under MDR: The Regulatory Landscape — the pillar post that frames the full AI MedTech regulatory picture this post sits inside.
- Machine Learning Medical Devices Under MDR — the companion post on ML development discipline under MDR.
- Locked Versus Adaptive AI Algorithms Under MDR — why locked models are the default pathway and how change control interacts with dataset updates.
- Classification of AI and ML Software Under Rule 11 — the classification walk-through that determines how much dataset evidence a given device needs.
- Risk Management for AI Medical Devices Under EN ISO 14971 — how training data hazards are captured in the risk management file.
- Clinical Evaluation of AI/ML Medical Devices — the clinical evidence process that the bias testing and test set results feed into.
- Technical Documentation for AI Medical Devices — how the data governance dossier threads through the Annex II technical file structure.
- AI/ML Model Validation and Verification Under MDR — the verification and validation activities that surround the dataset work.
- Post-Market Surveillance for AI Medical Devices — how drift detection after deployment closes the loop on pre-market dataset decisions.
- The Subtract to Ship Framework for MDR Compliance — the methodology behind the subtraction moves described in this post.
Sources
- Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Annex I (general safety and performance requirements, in particular Section 17 on electronic programmable systems and software), Annex II (technical documentation). Official Journal L 117, 5.5.2017.
- MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR, October 2019, Revision 1 June 2025.
- EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.
- Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Referenced in this post in general terms only, for the horizontal training data quality obligations that layer on top of MDR for high-risk AI systems. Founders should consult the official text on EUR-Lex.
This post is part of the AI, Machine Learning and Algorithmic Devices category in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Training data is the part of an AI medical device project where engineering discipline and regulatory discipline meet most directly, and where retrofitting is most expensive. Building the dataset governance into the project from day one is the cheapest move available — and when the specific situation of your device exceeds what a general post can cover, that is exactly the territory where a sparring partner who has walked other AI MedTech teams through the same dossier earns their keep.