This is a composite teaching case. It does not describe any specific company. It combines failure patterns observed across multiple real AI diagnostics audits into one instructive scenario. The details are deliberately generic. The failure modes are not.

By Tibor Zechmeister and Felix Lenhard.

TL;DR

  • This is a composite scenario built from recurring failure patterns in AI diagnostics audits. No real company is described.
  • The most damaging failure is almost always intended purpose drift: what the company says the device does changes between pitch deck, website, IFU and clinical evaluation.
  • Under MDR Annex VIII Rule 11, most diagnostic AI software lands in class IIa or higher, which means notified body involvement is mandatory and scrutiny is high.
  • Clinical evidence for AI devices must follow MDR Article 61 and Annex XIV. "We trained on 10,000 images" is not clinical evidence.
  • Model documentation gaps — training data provenance, validation methodology, generalisability evidence — are increasingly the source of major non-conformities.

Why this matters

Founders building AI diagnostic tools often come from ML research backgrounds. They understand model architectures, training regimes and validation metrics better than most regulatory professionals. And yet they fail notified body audits in disproportionate numbers, for reasons that have nothing to do with the quality of their models.

This composite case study walks through the most common failure modes observed across AI diagnostics audits. Again: this is not a real company. It is a deliberate composite. If any element sounds familiar, it is because these patterns repeat.

The composite scenario

The fictional company is a six-person startup with a deep learning model that analyses medical images and produces a probability score that a clinician uses to support a diagnostic decision. The founders have two ML PhDs, a clinical advisor, and a recently hired part-time regulatory contractor. They have raised a seed round, signed a pilot with a hospital network, and scheduled their notified body stage 2 audit nine months after starting the QMS build.

The audit fails with major non-conformities across four categories. Every category is fixable. None of them was caused by bad engineering. All of them could have been prevented by regulatory structure applied earlier.

What MDR actually says

Annex VIII Rule 11 places software intended to provide information which is used to take decisions with diagnosis or therapeutic purposes in class IIa, rising to class IIb or III depending on the severity of the potential harm. Most diagnostic AI software used by a clinician to inform a decision is therefore at least class IIa, and class IIa requires notified body involvement under the applicable conformity assessment routes in Annexes IX, X or XI.

Article 2(12) defines intended purpose as the use for which a device is intended according to the data supplied by the manufacturer on the label, in the instructions for use or in promotional or sales materials or statements and as specified by the manufacturer in the clinical evaluation. Read that again. Promotional material counts. Sales materials count. Every public claim is part of your intended purpose.

Article 61 and Annex XIV Part A set out the requirements for clinical evaluation. Clinical evidence must be sufficient to demonstrate conformity with the relevant GSPRs under normal conditions of use and to evaluate the benefit-risk determination. For AI devices this specifically means demonstrating that the model performs as claimed in the intended use population, not just in the training set.

EN 62304:2006+A1:2015 defines the software lifecycle requirements. Together with EN ISO 14971:2019+A11:2021 for risk management, it sets the expectation of traceability from risk to requirement to design to verification to validation to post-market feedback. For AI devices this trace must also cover training data selection, validation protocol, and model change control.

The four failure modes

Failure 1: Intended purpose drift

The website says the device "detects disease X with 95% accuracy." The pitch deck says it "diagnoses disease X." The IFU says it "provides a probability score as decision support." The clinical evaluation plan says it "assists clinicians in the assessment of patients with suspected disease X."

Four different statements. Four different regulatory meanings. Under Article 2(12), all of them count as intended purpose because they are all supplied by the manufacturer. The auditor opens the website during the audit, reads the claim, and writes a major non-conformity: the device as promoted is not the device as evaluated.

The fix: one single sentence of intended purpose, approved under document control, reproduced verbatim on every customer-facing surface. The website, the pitch deck, the IFU, the CEP, the CER, the DoC and the label all use the same words. Any deviation triggers document change control.
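This verbatim rule is mechanically checkable. The sketch below, in Python, compares an approved intended-purpose sentence against the statement found on each customer-facing surface and flags any deviation; the approved sentence, the surface names and the claim strings are all illustrative assumptions, not text from any real device.

```python
# Sketch of a verbatim-consistency check for the intended purpose statement.
# All document names and claim strings below are hypothetical illustrations.

import re

APPROVED = ("Provides a probability score to support clinician "
            "assessment of patients with suspected disease X.")

def normalise(text: str) -> str:
    """Collapse whitespace and case so only wording differences remain."""
    return re.sub(r"\s+", " ", text).strip().lower()

def check_surfaces(surfaces: dict[str, str]) -> list[str]:
    """Return the names of surfaces whose statement deviates from APPROVED."""
    target = normalise(APPROVED)
    return [name for name, text in surfaces.items()
            if normalise(text) != target]

surfaces = {
    "website": "Detects disease X with 95% accuracy.",  # drifted claim
    "ifu":     "Provides a probability score to support clinician "
               "assessment of patients with suspected disease X.",
}

print(check_surfaces(surfaces))  # a non-empty list must trigger change control
```

Running a check like this in CI against the website copy and document templates turns "any deviation triggers document change control" from a policy statement into an enforced gate.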

Failure 2: Clinical evidence is model validation, not clinical evaluation

The technical file contains a rigorous ML validation report. Sensitivity, specificity, AUC on a held-out test set of 2,000 images. The founders believe this is their clinical evidence. It is not.

Clinical evidence under Article 61 and Annex XIV must evaluate the device under its intended conditions of use with the intended user population. A retrospective performance study on curated images does not demonstrate that a clinician using the device in a real workflow makes better, safer or faster decisions. It does not cover the human-AI interaction. It does not cover generalisability across sites, scanners, patient demographics or disease prevalence.

The auditor is not rejecting the ML work. The auditor is pointing out that the ML work is a verification activity, not a clinical evaluation. The clinical evaluation is missing the human-in-the-loop performance, the external validation, the state-of-the-art comparison, and the benefit-risk analysis in the target use context.

The fix: separate model verification from clinical evaluation. Plan the clinical evaluation under MDR Article 61 and Annex XIV from day one. Use the ML validation as one input among several, not the whole answer. Build external validation on data not used in training. Plan a prospective clinical investigation, under MDR Article 62 and Annex XV, if the evidence is otherwise insufficient.

Failure 3: Post-market surveillance plan has no AI-specific content

The PMS plan is a generic template. It mentions complaint handling, trend reporting and PSUR preparation. What it does not mention is model drift, data distribution shift, generalisability monitoring across deployment sites, the process for detecting silent performance degradation, or the change control pathway when the model needs to be retrained.

Under MDR Article 83 the PMS system must be proportionate to the risk class and appropriate for the type of device. For AI devices, "appropriate" means AI-specific. The auditor issues a non-conformity against the PMS plan: it is not proportionate to the device type.

The fix: build an AI-specific PMS plan. Monitor input data distributions at each deployment site. Monitor performance metrics continuously where possible. Define triggers for investigation and retraining. Define the significant change assessment process for model updates — because a model update is very likely a significant change requiring notified body review, not a routine software patch.
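One concrete form of input-distribution monitoring is the population stability index (PSI) between a reference distribution (say, a model input feature over the validation set) and the live distribution at a deployment site. The sketch below assumes a single numeric feature; the 0.1 and 0.25 trigger levels are a common industry rule of thumb, not regulatory figures, and the escalation wording is illustrative.

```python
# Sketch of one AI-specific PMS check: input drift at a deployment site,
# via the population stability index (PSI) over a single model input feature.

import math

def psi(expected: list[float], observed: list[float], bins: int = 10) -> float:
    """Population stability index between a reference and a live distribution."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0  # degenerate case: all values identical

    def frac(data: list[float]) -> list[float]:
        counts = [0] * bins
        for x in data:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        n = len(data)
        return [max(c / n, 1e-6) for c in counts]  # floor avoids log(0)

    e, o = frac(expected), frac(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))

def drift_action(score: float) -> str:
    """Map a PSI score to a PMS action (thresholds are a rule of thumb)."""
    if score < 0.1:
        return "no action"
    if score < 0.25:
        return "investigate"
    return "escalate: assess retraining and significant-change review"
```

A plan built around checks like this gives the auditor exactly what the generic template lacks: a named metric, a defined trigger, and a defined action per deployment site.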

Failure 4: Model documentation gaps

The technical file does not document where the training data came from. It does not document the inclusion and exclusion criteria for training cases. It does not document the ethical approvals under which the data was obtained. It does not document the labelling process, inter-rater reliability, or the quality control applied to ground truth. It does not document the validation protocol in a way that a third party could reproduce.

The auditor issues non-conformities on data governance, traceability under EN 62304, and risk management under EN ISO 14971 (because the risk analysis cannot address hazards related to training data bias if the training data is not documented).

The fix: treat the dataset as a regulated artefact from day one. Data provenance, labelling protocol, inter-rater reliability, demographic characterisation, inclusion criteria — all under document control. Retain the raw data where legally permitted. Document everything such that a different team could repeat the validation.
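Treating the dataset as a regulated artefact can start with something as simple as a machine-readable manifest. The sketch below shows one possible shape: the field names follow the provenance items listed above, the values are placeholders, and the schema is an illustrative assumption, not a prescribed MDR format. The content hash makes any undocumented change to the record detectable in review.

```python
# Sketch of a dataset manifest kept under document control.
# Field names mirror the provenance items in the text; all values are
# illustrative placeholders, not a prescribed MDR schema.

import hashlib
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetManifest:
    dataset_id: str
    source: str                      # provenance: site, ethics approval ref
    inclusion_criteria: list[str]
    labelling_protocol: str          # document-controlled protocol reference
    inter_rater_kappa: float         # inter-rater reliability on ground truth
    demographics: dict[str, float] = field(default_factory=dict)

    def content_hash(self) -> str:
        """Stable hash of the manifest so any change is detectable in review."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

manifest = DatasetManifest(
    dataset_id="DS-0001",
    source="Hospital A, ethics approval REC-2023-17 (placeholder)",
    inclusion_criteria=["adult patients", "modality: CT"],
    labelling_protocol="LBL-PROTO-004 rev 2",
    inter_rater_kappa=0.82,
    demographics={"female": 0.51, "age_mean": 61.4},
)
print(manifest.content_hash()[:12])  # changing any field changes the hash
```

The point is not the exact schema but the discipline: every training run references a manifest version, and the manifest sits under the same change control as the rest of the technical file.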

The Subtract to Ship playbook

Step 1: Write the intended purpose first, once, and defend it. Before website, before pitch deck, before clinical evaluation plan. One sentence. Under document control. Reproduced verbatim everywhere.

Step 2: Classify honestly under Rule 11. Most diagnostic AI is at least class IIa. Some is higher. Trying to argue it down to class I will not survive audit and will cost you trust.

Step 3: Plan the clinical evaluation in parallel with model development, not after. Article 61 and Annex XIV define the framework. The CEP is a day-one deliverable, not a pre-audit scramble.

Step 4: Separate verification from clinical evaluation. ML validation belongs in verification. Clinical evaluation is a different document with different scope and different evidence requirements.

Step 5: Document the dataset as a regulated artefact. Provenance, consent, labelling, inclusion criteria, quality control, demographic characterisation. All under document control from the first training run.

Step 6: Build an AI-specific PMS plan. Drift detection, distribution shift, generalisability monitoring, retraining triggers, significant change assessment. Not a generic template.

Step 7: Rehearse the audit. An internal mock audit with the same scope as the notified body audit, run by someone experienced, catches the issues in this case study before they become major non-conformities.

Step 8: Assume the auditor will open your website. Because they will. And the claims there are part of your intended purpose.

Reality Check

  1. Is your intended purpose a single sentence, under document control, reproduced verbatim across website, pitch deck, IFU, CEP, CER, DoC and label?
  2. Is your clinical evaluation plan written against MDR Article 61 and Annex XIV, separate from your ML validation?
  3. Does your clinical evidence include external validation on data not used in training?
  4. Does your PMS plan include AI-specific monitoring — drift, distribution shift, generalisability, retraining triggers?
  5. Is your training dataset documented as a regulated artefact, with provenance, consent, labelling protocol and demographic characterisation?
  6. Do you have a significant change assessment process for model updates?
  7. Have you run a mock audit with the same scope as your notified body audit?
  8. If the auditor opens your website during the audit, is every claim consistent with your technical documentation?

Frequently Asked Questions

Is this a real company? No. This is a deliberate composite built from recurring failure patterns observed across multiple audits. No identifying details from any real company are included.

Is a 95% accuracy claim on my website legally binding? Under Article 2(12), yes. Promotional and sales materials are part of your intended purpose. If the claim is inconsistent with your clinical evidence, it is a regulatory issue.

Can my ML validation report serve as my clinical evaluation? No. Model validation is a verification activity. Clinical evaluation under Article 61 and Annex XIV covers the device in use, with the intended user, in the intended conditions. They are different documents with different scope.

What is the most common major non-conformity in AI diagnostics audits? Intended purpose inconsistency and insufficient clinical evidence are the two most common. Dataset documentation gaps are rising rapidly as a third.

Do model updates require a new conformity assessment? Often yes. A model retraining that changes performance characteristics is likely a significant change requiring notified body review. Assume yes until you have a documented assessment that concludes no.

How do I avoid the failures in this case study? Start the regulatory work when you start the ML work, not when you schedule the audit. The separation of concerns between verification and clinical evaluation, and the discipline of a one-sentence intended purpose, prevent most of the failures described here.

Sources

  1. Regulation (EU) 2017/745 on medical devices, consolidated text. Article 2(12), Article 61, Article 83, Annex VIII Rule 11, Annex XIV Part A.
  2. MDCG 2019-11 Rev.1 — Guidance on qualification and classification of software under MDR and IVDR.
  3. EN 62304:2006+A1:2015 — Medical device software — Software lifecycle processes.
  4. EN ISO 14971:2019+A11:2021 — Medical devices — Application of risk management.