---
title: Data Quality and Bias in AI Medical Devices: Regulatory Expectations Under MDR
description: Data quality and bias are core regulatory concerns for AI medical devices. Here is what MDR auditors expect for dataset characterisation and bias testing.
authors: Tibor Zechmeister, Felix Lenhard
category: AI, ML & Algorithmic Devices
primary_keyword: data quality bias AI medical devices MDR
canonical_url: https://zechmeister-solutions.com/en/blog/data-quality-bias-ai-medical-devices
source: zechmeister-solutions.com
license: All rights reserved. Content may be cited with attribution and a link to the canonical URL.
---

# Data Quality and Bias in AI Medical Devices: Regulatory Expectations Under MDR

*By Tibor Zechmeister (EU MDR Expert, Notified Body Lead Auditor) and Felix Lenhard.*

> **Data quality and bias are not soft engineering concerns for an AI medical device under MDR — they are regulatory obligations. Annex I Section 17 requires software to be developed in accordance with the state of the art, and for AI in 2026 the state of the art includes documented dataset characterisation, pre-specified subgroup performance reporting, and bias hazards captured in the EN ISO 14971:2019+A11:2021 risk management file. MDCG 2019-11 Rev.1 (June 2025) confirms AI software sits inside the same qualification and classification regime as other medical device software, with no separate leniency. A Notified Body auditor reviewing an AI submission will expect to see, on demand, how the dataset was characterised, which subgroups were tested, how performance varied across them, and how each bias-related hazard is controlled in the risk file.**

*Last updated 10 April 2026.*

---

## TL;DR

- Data quality and bias are regulatory obligations under MDR Annex I Section 17, not engineering preferences. A dataset that cannot be characterised cannot meet the state-of-the-art requirement for AI software.
- Dataset characterisation is the written description of what is in the data — source, distribution, labelling protocol, exclusions, known gaps — at a level an auditor can verify without opening a notebook.
- Bias in medical AI has several distinct sources: sampling bias, label bias, measurement bias, historical bias, and deployment bias. Each one is a separate hazard category under EN ISO 14971:2019+A11:2021.
- Subgroup performance reporting is the operational form of bias testing. A single overall accuracy number is not sufficient evidence for a CE submission in 2026.
- The risk management file has to name bias hazards explicitly, with controls, verification evidence, and residual risk acceptance decisions — not buried in a generic "data quality" line item.
- MDCG 2019-11 Rev.1 places AI software under the same classification and qualification regime as other MDSW. There is no lighter path because the device is AI-based.

---

## Why data quality and bias are a regulatory issue

Founders building AI medical devices sometimes frame data quality as a research question and bias as an ethics question. Under MDR both are compliance questions with a direct audit consequence. The legal hook is Annex I of the Regulation, which sets out the general safety and performance requirements. Section 17 deals with electronic programmable systems and software and requires that such systems be developed and manufactured in accordance with the state of the art, taking into account the principles of development life cycle, risk management, and verification and validation. (Regulation (EU) 2017/745, Annex I, Section 17.)

For an AI medical device, the state of the art includes a set of dataset practices that the Notified Body community and the broader MedTech AI field have converged on. Written dataset characterisation. Pre-specified subgroup analysis. Label quality evidence. Measurement and sampling bias controls. Deployment monitoring plans that tie back to pre-market findings. A team that cannot show these practices has not developed the software in accordance with the state of the art, and Section 17 is not satisfied. The obligation does not depend on any company choosing to care about bias — it follows from the Regulation.

MDCG 2019-11 Rev.1 (June 2025) is the guidance that confirms this. It places AI and machine learning software inside the same qualification and classification framework as any other medical device software. There is no separate AI track with relaxed data expectations. Everything a classical SaMD development process has to evidence, an AI development process has to evidence, plus the AI-specific data governance elements on top.

EN ISO 14971:2019+A11:2021 is where the operational teeth are. The standard requires hazards to be identified, risks estimated and evaluated, controls implemented, and residual risks accepted through a documented process. For an AI device, non-representative data, mislabelled data, biased measurement, and distribution shift are each hazards in their own right. They belong in the hazard analysis by name, and the controls have to be verified before the device ships.

## Dataset characterisation — the audit surface

Dataset characterisation is the written description of the data behind the model, at a level of detail an auditor can use to form an independent judgment. It is not an ethics statement and it is not a marketing paragraph. It is a structured artefact.

The elements that a competent Notified Body will expect to see, in the technical file or a dataset dossier referenced from it:

- **Source.** Where the data came from. Clinical sites, public datasets, synthetic generation, purchased datasets. Each source named, with the legal basis, consent status where relevant, and the collection conditions.
- **Size.** Number of records, patients, and cases, broken out per source and per split. Aggregate "total number of images" is not enough if patients contribute multiple images.
- **Demographic distribution.** Age, sex, and where relevant ethnicity and skin type, broken out as a table. Not a sentence claiming diversity — a table that an auditor can read.
- **Clinical distribution.** Disease severity, comorbidities, disease prevalence, imaging hardware or input device type, care setting, geographic origin.
- **Inclusion and exclusion criteria.** The rules that defined which records entered the dataset and which were removed. Written before the dataset was assembled, not after the fact.
- **Labelling protocol.** Who labelled the data, with what instructions, against what reference standard, and with what inter-rater agreement where multiple labellers were used.
- **Known gaps.** The subgroups and conditions the dataset does not cover well, documented explicitly. A gap that is named and addressed is a different regulatory object from a gap that is hidden.
- **Versioning.** Dataset version identifiers, lock dates, and the mapping of dataset version to model version to clinical evaluation report.

Teams that have these elements written down hand them to the auditor on request. Teams that do not have them written down spend the weeks before the audit retrofitting them, and the retrofit usually exposes gaps that push the clinical evaluation back into rework.
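
To make the characterisation concrete, here is a minimal sketch of how the size and demographic-distribution elements could be produced as auditor-readable tables. It assumes Python with pandas; the field names (`patient_id`, `source`, `fitzpatrick_type`) and the version identifier are illustrative assumptions, not a prescribed schema.

```python
# A minimal sketch of a dataset characterisation artefact, using pandas.
# Field names and values are illustrative assumptions, not a required schema.
import pandas as pd

records = pd.DataFrame([
    {"patient_id": "P001", "source": "site_vienna", "sex": "F",
     "age_band": "40-59", "fitzpatrick_type": "II", "label": 1},
    {"patient_id": "P002", "source": "site_vienna", "sex": "M",
     "age_band": "60+", "fitzpatrick_type": "III", "label": 0},
    {"patient_id": "P003", "source": "public_isic", "sex": "F",
     "age_band": "18-39", "fitzpatrick_type": "V", "label": 1},
])

# Size per source, counted in patients rather than images -- a patient can
# contribute several images, so image counts alone overstate coverage.
patients_per_source = records.groupby("source")["patient_id"].nunique()

# Demographic distribution as a table an auditor can read, not a sentence.
demo_table = pd.crosstab(records["sex"], records["fitzpatrick_type"])

characterisation = {
    "dataset_version": "ds-2026-04-01",  # locked ID, maps to model version and CER
    "patients_per_source": patients_per_source.to_dict(),
    "demographic_table": demo_table.to_dict(),
    "known_gaps": ["fitzpatrick_type VI under-represented"],  # named, not hidden
}
print(characterisation)
```

The design choice worth noting: generating the tables from the locked dataset rather than writing them by hand means the numbers in the dossier can be regenerated and re-verified at any point, which is exactly what an auditor probing the characterisation will want to see.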

## The categories of bias that matter for medical AI

Bias is an overloaded word. For regulatory purposes it helps to name the distinct categories, because each one has a different control strategy and a different home in the risk file.

- **Sampling bias.** The dataset over- or under-represents parts of the intended use population. A dermatology model trained mostly on light skin tones will under-perform on darker skin. A cardiology model trained mostly in one country's hospitals will under-perform on different care pathways. The control is representativeness analysis and, where gaps are found, either narrower intended use or more data.
- **Label bias.** The ground truth itself is systematically wrong for a subgroup. If the reference standard was produced by clinicians whose own accuracy varies across subgroups, the model inherits that variation. The control is labelling protocol design, reference standard selection, and inter-rater agreement measurement stratified by subgroup.
- **Measurement bias.** The raw data differs systematically across subgroups for reasons unrelated to the clinical condition. Different imaging scanners, different calibration, different acquisition protocols. The control is documentation of the acquisition conditions and, where needed, harmonisation or stratified evaluation.
- **Historical bias.** The dataset accurately reflects a past clinical reality that included disparities, and the model learns to reproduce them. The control is deliberate dataset construction and a clinical evaluation that tests the model against the intended deployment population, not only against the historical dataset.
- **Deployment bias.** The device is used outside the population or workflow it was validated for. This is a post-market problem more than a pre-market one, but it has to be anticipated in the intended use statement and the instructions for use, and monitored under the PMS plan.

Each of these categories belongs in the hazard analysis. Lumping them together as "data quality risk" and giving them one risk control makes the file thin in exactly the place auditors press hardest on AI submissions.
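
As one concrete control for the first category, here is a minimal sketch of a representativeness check: dataset subgroup proportions compared against the intended use population, with under-represented subgroups flagged. The reference shares and the tolerance factor are illustrative assumptions; real figures would come from the epidemiology of the intended use population.

```python
# A minimal sketch of a representativeness check for sampling bias.
# Counts, reference shares and the tolerance are illustrative assumptions.
dataset_counts = {"I": 120, "II": 310, "III": 280, "IV": 90, "V": 35, "VI": 5}
intended_use_share = {"I": 0.10, "II": 0.25, "III": 0.25,
                      "IV": 0.20, "V": 0.12, "VI": 0.08}

total = sum(dataset_counts.values())
TOLERANCE = 0.5  # flag subgroups at less than half their population share

for group, expected in intended_use_share.items():
    observed = dataset_counts.get(group, 0) / total
    if observed < TOLERANCE * expected:
        # Under MDR a flagged gap leads to one of two controls: collect
        # more data for the subgroup, or narrow the intended use so the
        # subgroup is no longer inside the claimed population.
        print(f"Gap: subgroup {group} is {observed:.1%} of the dataset "
              f"vs {expected:.1%} of the intended use population")
```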

## Subgroup performance reporting — what has to be measured and shown

Subgroup performance reporting is the operational form of bias testing. The principle is straightforward: performance is reported not just overall but on every subgroup where clinically meaningful variation is possible, and the subgroup analysis is pre-specified before the numbers are looked at.

What pre-specification means in practice (a minimal sketch of such a protocol follows the list):

- The subgroups to be tested are listed in a protocol document before evaluation begins.
- The metrics to be reported on each subgroup are defined in advance — sensitivity, specificity, AUROC, calibration, or whatever the clinical task demands.
- The thresholds that would trigger action — further development, intended use restriction, or residual risk acceptance — are defined in advance.
- The evaluation is run once against the locked test set, the numbers are recorded, and the decisions follow the pre-specified rules.
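
A pre-specification artefact can be as simple as a dated, hashed file written before the evaluation runs. The following sketch assumes Python; the subgroup axes, metric names, threshold value, and file name are all illustrative.

```python
# A minimal sketch of a pre-specification artefact: subgroups, metrics and
# action thresholds written to a dated, hashed file before any evaluation
# runs. All concrete values are illustrative assumptions.
import hashlib
import json
from datetime import date

protocol = {
    "protocol_date": date.today().isoformat(),  # dated before evaluation
    "subgroups": ["sex", "age_band", "fitzpatrick_type", "scanner_model"],
    "metrics": ["sensitivity", "specificity", "auroc"],
    "action_thresholds": {
        # Pre-specified rule: any subgroup sensitivity below this triggers
        # further development, intended use restriction, or documented
        # residual risk acceptance -- the rule decides, not the numbers.
        "min_subgroup_sensitivity": 0.85,
    },
}

blob = json.dumps(protocol, sort_keys=True).encode()
with open("subgroup_protocol_v1.json", "wb") as f:
    f.write(blob)
# Recording this hash in the QMS at protocol sign-off makes any later
# edit to the protocol detectable.
print("protocol sha256:", hashlib.sha256(blob).hexdigest())
```

The hash is the audit-friendly part: logged at sign-off, it lets the team demonstrate that the protocol content was fixed before the results existed.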

Reading the subgroup results first and then deciding which subgroups to report on is not bias testing. It is selective disclosure, and a competent auditor will recognise it. The defence against that perception is pre-specification, documented, dated, and produced before the evaluation was run.

Where a subgroup shows degraded performance, EN ISO 14971:2019+A11:2021 gives the team three legitimate options. Reduce the risk through further development — more data, better labelling, model changes. Restrict the intended use to exclude the population where performance is not adequate. Or accept the residual risk with clear documentation and explicit labelling in the instructions for use, where the benefit-risk balance supports it. Each option is defensible. Pretending the gap does not exist is not.
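
Continuing the sketch, the evaluation itself runs once against the locked test set and applies the pre-specified rule mechanically. The test values and the threshold are illustrative assumptions, and only sensitivity is computed for brevity; a real run would cover every metric the protocol names.

```python
# A minimal sketch of a pre-specified subgroup evaluation run once against
# a locked test set. Data and threshold are illustrative assumptions.
from collections import defaultdict

# (subgroup, ground_truth_label, model_prediction) per test case
test_results = [
    ("fitzpatrick_I_III", 1, 1), ("fitzpatrick_I_III", 1, 1),
    ("fitzpatrick_I_III", 0, 0), ("fitzpatrick_IV_VI", 1, 0),
    ("fitzpatrick_IV_VI", 1, 1), ("fitzpatrick_IV_VI", 0, 0),
]

MIN_SENSITIVITY = 0.85  # taken from the pre-specified protocol

tp, fn = defaultdict(int), defaultdict(int)
for group, truth, pred in test_results:
    if truth == 1:
        if pred == 1:
            tp[group] += 1
        else:
            fn[group] += 1

for group in sorted(tp.keys() | fn.keys()):
    sensitivity = tp[group] / (tp[group] + fn[group])
    verdict = "ok" if sensitivity >= MIN_SENSITIVITY else "ACTION REQUIRED"
    # "ACTION REQUIRED" routes to one of the three ISO 14971 options:
    # further development, intended use restriction, or documented
    # residual risk acceptance.
    print(f"{group}: sensitivity {sensitivity:.2f} -> {verdict}")
```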

## Documentation in the risk management file

The risk management file is where data quality and bias stop being an engineering concern and become a regulatory artefact. Under EN ISO 14971:2019+A11:2021 the hazard analysis has to identify sources of harm, and for an AI medical device the dataset is a source of harm like any other.

The minimum content that should be traceable in the risk file (a structured example entry follows the list):

- **Hazard identification.** Each bias category named explicitly — sampling, label, measurement, historical, deployment — with the specific device-level harm each one could produce. Not a generic "data quality" entry.
- **Risk estimation.** For each identified hazard, an estimation of severity and probability of occurrence of harm, with the reasoning written down.
- **Risk control measures.** The concrete controls implemented against each hazard. Representativeness analysis, labelling protocol, stratified evaluation, intended use restriction, IFU warnings, PMS monitoring.
- **Verification of controls.** Evidence that the controls actually work. A representativeness analysis that was run. A subgroup evaluation that produced numbers. A PMS plan that names the drift metrics.
- **Residual risk evaluation.** What remains after the controls, and whether the residual risk is acceptable under the benefit-risk framework of the standard.
- **Traceability to the clinical evaluation.** The subgroup performance numbers in the risk file agree with the subgroup performance numbers in the clinical evaluation report. Three documents, one story.
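
Here is a structured sketch of what one such entry could look like, using a plain Python dataclass. Identifiers, severity scales, and document references are illustrative assumptions, not a prescribed format.

```python
# A minimal sketch of one bias hazard entry in a machine-readable risk
# register. Field names follow the list above; values are illustrative.
from dataclasses import dataclass, field

@dataclass
class BiasHazard:
    hazard_id: str
    category: str              # sampling | label | measurement | historical | deployment
    harm: str                  # the device-level harm, not a generic note
    severity: str
    probability: str
    controls: list = field(default_factory=list)
    verification_evidence: list = field(default_factory=list)
    residual_risk_accepted: bool = False
    cer_reference: str = ""    # traceability into the clinical evaluation report

entry = BiasHazard(
    hazard_id="HAZ-BIAS-001",
    category="sampling",
    harm="Missed melanoma in patients with Fitzpatrick type V-VI skin",
    severity="critical",
    probability="occasional",
    controls=["representativeness analysis", "stratified evaluation", "IFU warning"],
    verification_evidence=["subgroup_eval_report_v3.pdf"],
    residual_risk_accepted=True,
    cer_reference="CER section 6.4, table 12",
)
```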

The coherence test matters as much as the content. A Notified Body auditor reading the technical file will check whether the dataset dossier, the clinical evaluation, and the risk management file tell a single consistent story about which subgroups were tested, what was found, and what the controls and residual risks are. Any discrepancy is a finding.
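
The coherence test itself can be automated. A minimal sketch, assuming the subgroup numbers have already been extracted from the two documents into dictionaries; the contents shown are illustrative stand-ins:

```python
# A minimal sketch of the coherence test as an automated check: subgroup
# numbers cited in the risk file must match the clinical evaluation
# report exactly. Dictionary contents are illustrative assumptions.
cer_sensitivity = {"fitzpatrick_I_III": 0.93, "fitzpatrick_IV_VI": 0.81}
risk_file_sensitivity = {"fitzpatrick_I_III": 0.93, "fitzpatrick_IV_VI": 0.84}

for group in sorted(cer_sensitivity.keys() | risk_file_sensitivity.keys()):
    cer_val = cer_sensitivity.get(group)
    rmf_val = risk_file_sensitivity.get(group)
    if cer_val != rmf_val:
        # Any discrepancy is a finding; running this check at every
        # document release keeps the files telling one story.
        print(f"Discrepancy in {group}: CER={cer_val} vs risk file={rmf_val}")
```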

## Common mistakes

- Writing a "data quality" paragraph in the technical file instead of producing a dataset characterisation artefact with tables and versions.
- Reporting a single overall accuracy number as the headline performance result, with no subgroup breakdown.
- Choosing subgroups to report on after seeing the evaluation results, rather than pre-specifying them.
- Treating bias as a single risk file line item instead of decomposing it into the distinct hazard categories.
- Letting the clinical evaluation report and the risk management file disagree on which subgroups were evaluated.
- Using demographic diversity claims as a substitute for quantitative representativeness analysis against the intended use population.
- Acknowledging a performance gap in a subgroup without either restricting the intended use, collecting more data, or documenting the residual risk acceptance — leaving the gap unresolved in the file.
- Assuming that because the AI Act adds horizontal obligations, MDR obligations on data quality are covered automatically. They are not. The MDR obligations are independent and predate the AI Act.

## The Subtract to Ship angle

The [Subtract to Ship framework](/blog/subtract-to-ship-framework-mdr) applied to data quality and bias does not mean less bias testing. It means removing the work that does not trace to a specific MDR, MDCG, or EN ISO 14971 obligation and investing the saved effort where it is genuinely load-bearing.

The biggest subtraction is upstream, in the intended use statement. Every subgroup excluded from the intended use is a subgroup that does not require matched data or matched subgroup evaluation. Narrowing the intended use population to what the device is actually validated for is a legitimate regulatory move, and it is cheaper to make before the dataset is assembled than after.

The second subtraction is duplication. The dataset characterisation artefact, the subgroup performance tables, and the bias hazard entries in the risk file all draw on the same underlying analysis. One integrated source, referenced from all three locations, beats three overlapping versions that drift out of agreement. Maintaining one is cheaper and safer than maintaining three.

The third subtraction is process theatre. A small team does not need a formal bias review board. It needs a pre-specified subgroup protocol, a locked test set, a recurring review of the results against the protocol, and a risk file that is actually referenced when decisions are made. Lean is not less rigorous. Lean is the discipline of keeping only the parts that carry weight.

## Reality Check — Where do you stand?

1. Do you have a written dataset characterisation document with tables for demographic and clinical distribution, or only a paragraph claiming diversity?
2. Have you named the bias categories — sampling, label, measurement, historical, deployment — in your risk management file, or are they lumped together?
3. Was your subgroup performance protocol pre-specified in writing before you looked at any evaluation numbers?
4. If an auditor asked "which subgroups did you test and why those specific ones," could you answer from the intended use statement?
5. For each subgroup where performance diverges from the overall number, is the divergence addressed in the risk file with a control, an intended use restriction, or a documented residual risk acceptance?
6. Do the subgroup numbers in the clinical evaluation report and the risk management file match exactly?
7. Does your PMS plan name the drift and deployment bias metrics it will monitor after release, and tie them back to the pre-market subgroup findings?

## Frequently Asked Questions

**Where in MDR does the obligation for data quality and bias testing come from?**
Annex I Section 17 requires software, including AI software, to be developed in accordance with the state of the art, taking into account development life cycle, risk management, and verification and validation. For AI in 2026, the state of the art includes written dataset characterisation, pre-specified subgroup performance reporting, and bias hazards captured under EN ISO 14971:2019+A11:2021. MDCG 2019-11 Rev.1 (June 2025) confirms AI software sits inside the same regulatory regime as any other medical device software.

**Is a single overall accuracy number enough for a Notified Body submission?**
No. A competent Notified Body reviewing an AI medical device submission in 2026 will expect subgroup performance numbers on the clinically relevant axes of the intended use population. A single aggregate number obscures exactly the failure modes that bias testing is meant to detect, and it will not satisfy the state-of-the-art expectation under Annex I Section 17.

**What does "pre-specified" subgroup analysis mean in practice?**
It means the subgroups, metrics, and action thresholds are written down in a protocol document before any evaluation is run against the locked test set. The point is to prevent selective reporting — choosing the subgroups that look good only after seeing the results. A pre-specification document is dated, stored, and produced on request during audit.

**How does the risk management file handle bias?**
EN ISO 14971:2019+A11:2021 treats bias sources as hazards. Each category — sampling, label, measurement, historical, deployment — is named in the hazard analysis, with its own risk estimation, controls, verification evidence, and residual risk decision. Lumping bias into one generic line item is one of the most common weaknesses auditors find in AI risk files.

**If the clinical evaluation and the risk file report different subgroup results, what happens?**
It is a finding. Three documents, one story, is the operating principle. The dataset dossier, the clinical evaluation report, and the risk management file all have to agree on which subgroups were tested, what was found, and how the gaps were handled. Any discrepancy invites the auditor to look for more.

**Does the AI Act replace these MDR obligations?**
No. Regulation (EU) 2024/1689 adds horizontal training data quality obligations for high-risk AI systems on top of MDR. It does not replace Annex I Section 17 or the risk management obligations under EN ISO 14971:2019+A11:2021. A data quality and bias dossier built to MDR expectations will cover most of what the AI Act also asks for, and building the two paths as one integrated file is cheaper than building them as two.

## Related reading

- [AI Medical Devices Under MDR: The Regulatory Landscape](/blog/ai-medical-devices-mdr-regulatory-landscape) — the pillar post framing the full AI MedTech regulatory picture.
- [Machine Learning Medical Devices Under MDR](/blog/machine-learning-medical-devices-mdr) — the companion post on ML development discipline.
- [Classification of AI and ML Software Under Rule 11](/blog/classification-ai-ml-software-rule-11) — how classification determines the depth of data quality evidence required.
- [Training Data Requirements for AI Medical Devices Under MDR](/blog/training-data-requirements-ai-medical-devices) — the companion post on dataset governance and representativeness.
- [Clinical Evaluation of AI/ML Medical Devices](/blog/clinical-evaluation-ai-ml-medical-devices) — how subgroup performance feeds into the clinical evidence.
- [Performance Validation for AI Medical Devices](/blog/performance-validation-ai-medical-devices) — the verification and validation activities that surround the bias testing.
- [Post-Market Surveillance for AI Medical Devices](/blog/post-market-surveillance-ai-devices) — how drift detection after deployment closes the loop on pre-market bias findings.
- [AI/ML Change Management and Retraining Assessment](/blog/ai-ml-change-management-retraining-assessment) — how dataset and model updates interact with bias controls.
- [The Subtract to Ship Framework for MDR Compliance](/blog/subtract-to-ship-framework-mdr) — the methodology behind the subtraction moves in this post.

## Sources

1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Annex I (general safety and performance requirements, in particular Section 17 on electronic programmable systems and software). Official Journal L 117, 5.5.2017.
2. MDCG 2019-11 Rev.1 — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 — MDR and Regulation (EU) 2017/746 — IVDR, October 2019, Revision 1 June 2025.
3. EN ISO 14971:2019 + A11:2021 — Medical devices — Application of risk management to medical devices.

---

*Data quality and bias are the part of an AI medical device file where regulatory expectations have moved the fastest in the last two years, and where a thin file is most quickly recognised by a competent auditor. Building the dataset characterisation, the subgroup protocol, and the bias hazard entries into the project from day one is the cheapest move available — and when the specific situation of your device exceeds what a general post can cover, that is exactly the territory where a sparring partner who has walked other AI MedTech teams through the same dossier earns their keep.*

---

*This post is part of the [AI, ML & Algorithmic Devices](https://zechmeister-solutions.com/en/blog/category/ai-ml-devices) cluster in the [Subtract to Ship: MDR Blog](https://zechmeister-solutions.com/en/blog). For EU MDR certification consulting, see [zechmeister-solutions.com](https://zechmeister-solutions.com).*
