A generative AI tool becomes a medical device under MDR the moment its intended purpose falls within Article 2(1): diagnosis, prevention, monitoring, prediction, prognosis, treatment or alleviation of disease. Once that line is crossed, Annex VIII Rule 11 typically pushes it to Class IIa or higher, and the non-deterministic output of an LLM has to be handled as a foreseeable safety hazard under Annex I.

By Tibor Zechmeister and Felix Lenhard.

TL;DR

  • A large language model is not a medical device by default; the intended purpose declared by the manufacturer decides qualification under MDR Article 2(1).
  • If the intended purpose includes diagnosis, treatment decisions, triage, or monitoring of individual patients, Rule 11 of Annex VIII usually classifies the software as Class IIa, IIb, or III.
  • Hallucination and non-determinism are not bugs to be patched later — under EN ISO 14971 they are foreseeable hazards that must be identified, evaluated, controlled, and disclosed.
  • Generic disclaimers like "for informational purposes only" do not remove a device from MDR scope if the promotional material or UI implies a medical use.
  • Clinical evidence for LLM-based devices is harder to generate than for classical SaMD: the clinical evaluation must cover the deployed system's actual behaviour, not a benchmark dataset.
  • MDCG 2019-11 Rev.1 (June 2025) is the primary interpretive text for software qualification and Rule 11 application.

Why this matters

A three-person team in Vienna builds a chatbot on top of a frontier LLM. They market it to "help patients understand their lab results." They think they are outside MDR scope because the model is third-party and they "don't give medical advice." Six months later their first hospital pilot asks for the CE certificate, and they discover they have been placing an unregistered Class IIa medical device on the market since day one.

This is the most common trap we see in generative AI MedTech right now. The founders know their LLM can hallucinate. They know they built a wrapper, not a model. What they miss is that MDR does not care who trained the weights. It cares about intended purpose, and intended purpose is whatever the manufacturer says it is — on the label, in the IFU, in promotional material, on the landing page, in the pitch deck, and in the clinical evaluation. Article 2(12) makes this explicit.

If you are building on LLMs for a healthcare use case, the regulatory question is not "is my model safe enough." It is "have I declared an intended purpose that makes this a medical device, and if so, am I ready to prove safety and performance under MDR?"

What MDR actually says

Article 2(1) — definition of a medical device. A device is qualified as a medical device when it is intended by the manufacturer to be used for one or more of the specific medical purposes listed: diagnosis, prevention, monitoring, prediction, prognosis, treatment or alleviation of disease, among others. Software is explicitly in scope.

Article 2(12) — intended purpose. "Intended purpose means the use for which a device is intended according to the data supplied by the manufacturer on the label, in the instructions for use or in promotional or sales materials or statements and as specified by the manufacturer in the clinical evaluation."

Read that sentence twice. It is the single most important sentence in MDR for a generative AI startup. Your landing page is regulatory evidence. Your sales deck is regulatory evidence. A demo video on LinkedIn is regulatory evidence. If any of them describe a medical purpose, you cannot later claim in your technical file that the purpose is something else.

Annex VIII Rule 11 — software classification. Software intended to provide information used to take decisions for diagnostic or therapeutic purposes is classified as Class IIa, except where such decisions have an impact that may cause death or irreversible deterioration (Class III) or serious deterioration of a person's state of health or surgical intervention (Class IIb). Software intended to monitor physiological processes is Class IIa, or Class IIb if it monitors vital physiological parameters where variations could result in immediate danger. All other software is Class I.

For any generative AI tool that produces text a clinician or patient might act on, Rule 11 is rarely going to leave you in Class I.

Annex I GSPR 1 and §17. The device shall achieve the performance intended by the manufacturer and be designed and manufactured in such a way that, during normal conditions of use, it is suitable for its intended purpose. For software, §17 requires the lifecycle to account for risk management, verification, and validation — which for an LLM means the foreseeable ways it can go wrong must be in scope of your risk file.

MDCG 2019-11 Rev.1 (June 2025) is the reference for qualifying and classifying software under Rule 11. It does not contain a dedicated section named "generative AI," but its qualification logic — decision-making intent, medical purpose, output used in care — applies directly.

A worked example

A startup wants to build an LLM-based summarisation tool for radiology reports. Two versions of the same product:

Version A — Documentation assistant. Intended purpose: "Assists radiologists by drafting structured narrative text from radiologist-authored findings for inclusion in the final report. The radiologist is responsible for reviewing and approving all output before it is saved." Marketing never claims the tool reads images, identifies pathology, or suggests diagnoses. The UI forces the radiologist to edit or approve every sentence.

This is probably not a medical device. The intended purpose is workflow and clerical support, not diagnosis. No medical decision flows from the software's output to the patient without human authorship. It may still sit under the EU AI Act as a high-risk system if used in healthcare, but MDR qualification under Article 2(1) is unlikely. Note: "probably not" still needs a documented qualification assessment under MDCG 2019-11 Rev.1, signed off, and kept on file.

Version B — Impression generator. Intended purpose: "Generates the 'Impression' section of radiology reports by summarising imaging findings and clinical context, to support radiologist interpretation." Marketing says it "helps radiologists reach diagnostic conclusions faster."

This is a medical device. The "Impression" is the diagnostic conclusion of a radiology report. Software generating that section is providing information used to take decisions with diagnostic purpose. Rule 11 classifies it at least Class IIa, and depending on body region and the severity of conditions involved, IIb is realistic. Notified body review is required. Clinical evaluation must demonstrate the tool's output is accurate and safe for the claimed purpose across the intended patient population — not against a benchmark dataset, but against the real-world clinical use.

Same underlying LLM. Same wrapper. Two completely different regulatory paths, driven entirely by intended purpose.

The Subtract to Ship playbook

Step 1 — Write the intended purpose first, not last. Before you write a single line of code, write the sentence that will appear in your technical file under "intended purpose." Then go to your landing page draft, your sales deck, and your demo script, and make sure none of them contradict it. If marketing wants to say something stronger, your choices are: change marketing, or accept the regulatory path that matches the stronger claim.

Step 2 — Do the qualification assessment in writing. Use MDCG 2019-11 Rev.1 as the framework. Document: is it software; is it a medical device under Article 2(1); if yes, what class under Rule 11. Sign it, date it, keep it. An auditor will ask for this. A VC doing diligence will ask for this.

Step 3 — Treat hallucination as a hazard in your risk file. Under EN ISO 14971, a hazard is a potential source of harm. Non-deterministic, plausible-sounding but incorrect output is a foreseeable source of harm in any clinical use of an LLM. Your risk file must identify it, estimate the risk, and specify controls: human-in-the-loop review, confidence signalling, output constraints, refusal behaviours, logging, and post-market monitoring of failure modes.
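The hazard entry Step 3 calls for can be sketched as data, using EN ISO 14971 vocabulary (hazard, hazardous situation, harm, risk controls). The structure, scoring scales, and example values below are assumptions for illustration; the standard prescribes a process, not a file format:

```python
from dataclasses import dataclass, field

@dataclass
class HazardEntry:
    """Illustrative risk-file entry; field names follow ISO 14971 vocabulary,
    but the 1-5 scales are an assumed convention, not from the standard."""
    hazard: str                 # potential source of harm
    hazardous_situation: str    # circumstance exposing people to the hazard
    harm: str                   # resulting injury or damage to health
    severity: int               # assumed scale: 1 (negligible) .. 5 (catastrophic)
    probability: int            # assumed scale: 1 (improbable) .. 5 (frequent)
    controls: list[str] = field(default_factory=list)

    def risk_score(self) -> int:
        # Simple severity x probability matrix; your risk policy defines
        # which scores are acceptable with and without controls.
        return self.severity * self.probability

hallucination = HazardEntry(
    hazard="Plausible but incorrect LLM output (hallucination)",
    hazardous_situation="Clinician reads an unverified generated summary",
    harm="Delayed or incorrect clinical decision",
    severity=4,
    probability=3,
    controls=[
        "Mandatory human review and approval before output is saved",
        "Refusal behaviour outside the validated scope",
        "Logging of model output vs. approved text for post-market analysis",
    ],
)
```

The point of writing it down this way is that "hallucination" stops being a vague worry and becomes a line item with an owner, a score, and named controls that verification can trace to.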

Step 4 — Decide on the architecture before you lock in the foundation model. A third-party foundation model is SOUP under EN 62304. You need a plan for how you evaluate and document it, how you monitor version changes from the provider, and what happens if the provider updates the model under you. If your provider will not give you version stability guarantees, that is a design input, not a procurement detail.
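One way to make provider version stability a design input rather than a procurement detail is to fail closed when the served model differs from the validated one. A minimal sketch; the metadata field name and model identifier below are hypothetical, and real provider APIs expose version information differently:

```python
# Hypothetical pinned identifier of the SOUP version documented in the
# technical file; real provider model names and metadata fields differ.
PINNED_MODEL = "frontier-llm-2024-06-01"

class UnvalidatedModelError(RuntimeError):
    """Raised when the provider serves a model version outside the
    validated configuration."""

def check_model_version(response_metadata: dict) -> None:
    """Fail closed: an unexpected SOUP version must trigger change
    control, not silent continued use."""
    served = response_metadata.get("model_version")
    if served != PINNED_MODEL:
        raise UnvalidatedModelError(
            f"Served model {served!r} differs from validated {PINNED_MODEL!r}"
        )
```

If the provider offers no version pinning or metadata at all, that absence itself belongs in your SOUP evaluation and risk file.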

Step 5 — Plan clinical evaluation against the deployed system. You cannot rely on benchmark scores from the foundation model's training paper. Your clinical evaluation must address your device — the wrapper, the prompts, the guardrails, the UI — used by the intended users on the intended population for the intended purpose. For a decision-support LLM, expect your notified body to want prospective clinical data, not a literature review of GPT benchmarks.

Step 6 — Subtract the claims, not the evidence. The cheapest regulatory path is always the one where you honestly claim less. If you do not need the tool to make diagnostic suggestions, don't let it. Constrain the output. Force human authorship. Narrow the population. Every claim you subtract saves months of clinical evidence work downstream. This is not hiding — it is honest scope management, and it is the core of the Subtract to Ship approach.

Step 7 — Run the EU AI Act layer in parallel. Even if your tool is outside MDR scope, a healthcare-adjacent LLM application is likely a high-risk AI system under the EU AI Act. If it is a medical device, both regimes apply. Build one documentation system that covers both, rather than two parallel bureaucracies.

Reality Check

  1. Can you write your device's intended purpose in one sentence, and does every piece of marketing material you have ever published match that sentence?
  2. Have you run a written qualification assessment under MDCG 2019-11 Rev.1, signed and dated?
  3. If your tool's output is wrong in a plausible, confident-sounding way, what is the worst realistic patient harm, and where is that analysis in your risk file?
  4. Who authored the final clinical decision in your workflow — the clinician or your software? Can you prove it from the audit log?
  5. Do you have a documented plan for what happens when the foundation model provider ships a new version next Tuesday?
  6. Is your clinical evaluation strategy scoped to your deployed system, or are you planning to lean on benchmark papers?
  7. Have you mapped which of your claims you would be willing to drop to move one class down under Rule 11?
  8. If a notified body asked tomorrow for your intended purpose, qualification decision, and risk file on hallucination, could you produce all three in under an hour?
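
Question 4 above stands or falls on the audit log. A minimal sketch of a record that can answer it after the fact; the field names are hypothetical assumptions, chosen so the log distinguishes what the software proposed from what the clinician approved:

```python
import json
import time

def audit_record(model_draft: str, final_text: str, clinician_id: str) -> str:
    """Hypothetical audit entry proving human authorship of the final
    clinical text; serialised for append-only storage."""
    return json.dumps({
        "timestamp": time.time(),
        "model_draft": model_draft,        # what the software proposed
        "final_text": final_text,          # what the clinician approved
        "edited": model_draft != final_text,  # did a human change it?
        "approved_by": clinician_id,       # named human author of record
    })
```

Keeping the model draft alongside the approved text also feeds post-market surveillance: the edit rate per output type is a direct signal of failure modes in the field.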

Frequently Asked Questions

Is a general-purpose chatbot a medical device if patients use it for health questions? Not by default. The manufacturer's intended purpose is what counts. A general assistant with no medical intended purpose is not a medical device even if users ask it health questions. But the moment the manufacturer markets it for a medical purpose, Article 2(1) applies.

Can we put a disclaimer saying "not a medical device" and stay out of scope? Only if the disclaimer matches reality. Disclaimers that contradict the actual intended purpose, UI, or marketing do not protect you. MDR looks at the whole picture under Article 2(12), not at a footer line.

Is using a third-party LLM enough to make us not the manufacturer? No. You are the manufacturer of the finished device you place on the market. The foundation model is a component — SOUP under EN 62304 — and your technical file must address it.

How does the EU AI Act interact with MDR for generative AI? For a medical device that is also a high-risk AI system, both apply. MDR conformity assessment covers the medical device requirements; the AI Act adds obligations on data governance, transparency, human oversight, and post-market monitoring. They are complementary, not alternative.

What clinical evidence does a notified body expect for an LLM-based Class IIa device? Evidence that the specific deployed system achieves its intended purpose safely in the intended population. Expect questions about dataset representativeness, failure mode characterisation, and prospective clinical performance data. Retrospective benchmarks alone are rarely sufficient.

Can we update prompts after CE marking without a new conformity assessment? It depends on whether the change is substantial. For a device certified under MDR, changes affecting the design or intended purpose must be reported to your notified body (Annex IX, Section 4.10), and a prompt change that alters intended purpose, performance, or safety falls into that category. Maintain a change control procedure that evaluates every prompt change against these criteria before deployment.

Sources

  1. Regulation (EU) 2017/745 on medical devices, consolidated text. Article 2(1), Article 2(12), Annex VIII Rule 11, Annex I GSPR 1 and §17.
  2. MDCG 2019-11 Rev.1 (October 2019, Rev.1 June 2025) — Guidance on qualification and classification of software in Regulation (EU) 2017/745.
  3. EN 62304:2006+A1:2015 — Medical device software — Software life cycle processes.
  4. EN ISO 14971:2019+A11:2021 — Application of risk management to medical devices.
  5. EN 62366-1:2015+A1:2020 — Application of usability engineering to medical devices.