NLP software is a medical device under MDR when its intended purpose involves diagnosis, prediction, treatment, or clinical decisions for individual patients. A pure transcription scribe is usually outside scope. An NLP tool that extracts clinical codes, summarises for decisions, or suggests diagnoses is almost always inside Rule 11.
By Tibor Zechmeister and Felix Lenhard.
TL;DR
- MDR Article 2(1) qualifies software as a device based on medical intended purpose — language processing is not exempt.
- Article 2(12) anchors intended purpose to labelling, IFU, promotional materials, and the clinical evaluation. Marketing copy counts.
- MDCG 2019-11 Rev.1 confirms that software acting on data for an individual patient's benefit is a device; literal speech-to-text is not, but interpretation and suggestion are.
- Annex VIII Rule 11 places NLP-driven decision support at Class IIa minimum, Class IIb/III for higher-risk decisions.
- Training data provenance, domain adaptation, and hallucination are regulatory problems, not just engineering problems, documented under EN 62304 and EN ISO 14971.
- The safest non-device NLP product is one that never interprets, only faithfully records, and never targets clinical decisions.
Why this matters
Every month a new clinical-NLP startup ships a demo: paste a consultation note, get a tidy summary with ICD-10 codes and a suggested differential. The founders insist it is "just a scribe" or "just a documentation assistant." Then the first hospital procurement office asks for the CE certificate and the CFO asks why it is not in the technical file.
Language models have collapsed the perceived distance between "typing assistant" and "diagnostic tool." A single prompt can turn a scribe into a decision-support system. The MDR does not care whether you used a transformer, a rule-based parser, or a regex. It cares what you claim the output is for.
This post separates the parts of an NLP product that sit outside MDR, the parts that sit inside, and the honest playbook for founders who want to ship without a nasty surprise at the notified body.
What MDR actually says
Article 2(1) of Regulation (EU) 2017/745 covers software intended for diagnosis, prevention, monitoring, prediction, prognosis, treatment, or alleviation of disease. NLP software that extracts a clinical finding, suggests a code used in billing or triage, summarises a note into a recommendation, or answers a clinical question about a specific patient is software intended for one of these purposes. Whether the manufacturer calls it "assistant," "copilot," or "scribe" does not change the qualification.
Article 2(12) fixes intended purpose to "the use for which a device is intended according to the data supplied by the manufacturer on the label, in the instructions for use or in promotional or sales materials or statements and as specified by the manufacturer in the clinical evaluation." For an NLP startup, this includes the landing page headline, the product demo video, the onboarding email, and what the CEO says on a podcast.
MDCG 2019-11 Rev.1 (June 2025 revision) draws a workable line. Software that only archives, stores, communicates, or performs simple searches is not a medical device. Software that creates or modifies medical information to support clinical decisions, or to provide information that influences them, is. Pure speech-to-text with no interpretation sits on the non-device side. Pure summarisation that preserves every clinical claim verbatim can also sit there, if the manufacturer genuinely claims nothing more. Anything that classifies, ranks, codes, suggests, prioritises, or extracts entities for clinical use crosses the line.
Annex VIII Rule 11 then classifies the device. Software intended to provide information used to take decisions with diagnostic or therapeutic purposes is Class IIa, rising to IIb or III depending on consequences. NLP that extracts a clinical condition from free text and forwards it into a decision pipeline lands in Class IIa at minimum. NLP that directly suggests a diagnosis to a clinician is Class IIa and often higher, depending on the clinical setting and the severity of the decisions it supports.
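The qualification-then-classification logic above can be sketched as code. This is purely illustrative: the `NlpProductClaims` flags and the `rough_rule_11_triage` function are hypothetical names invented for this sketch, and no script replaces a documented, line-by-line walk through Article 2(1), MDCG 2019-11, and Rule 11.

```python
from dataclasses import dataclass

@dataclass
class NlpProductClaims:
    """Hypothetical flags distilled from an intended purpose statement."""
    interprets_or_suggests: bool      # classifies, codes, ranks, extracts, recommends
    informs_clinical_decisions: bool  # output feeds diagnostic or therapeutic decisions
    serious_harm_if_wrong: bool       # e.g. missed acute deterioration

def rough_rule_11_triage(c: NlpProductClaims) -> str:
    # Qualification first (Article 2(1) / MDCG 2019-11): no interpretation
    # and no clinical-decision claim -> likely outside device scope.
    if not c.interprets_or_suggests and not c.informs_clinical_decisions:
        return "likely not a medical device (verify claims everywhere)"
    # Rule 11: information used for diagnostic/therapeutic decisions is IIa minimum,
    # rising with the severity of the decisions the output supports.
    if c.serious_harm_if_wrong:
        return "Class IIb or III candidate - seek notified body input"
    return "Class IIa minimum under Rule 11"

print(rough_rule_11_triage(NlpProductClaims(True, True, False)))
```

Note that the technology never appears as an input: only the claims do, which is exactly the point.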
A worked example
A startup builds three products on top of the same large language model backbone. Same infrastructure, same prompt library, different claims.
Product 1 — "Consultation Scribe." Intended purpose: "records the audio of a clinical consultation and produces a verbatim written transcript. Does not interpret, summarise, classify, diagnose, or recommend." The UI shows the transcript and only the transcript. Marketing is strict: "a faithful transcript, nothing more." This is not a medical device under Article 2(1), because there is no medical purpose beyond faithful recording. It is on the same regulatory footing as a dictaphone.
Product 2 — "Structured Note Generator." Intended purpose: "from a clinical consultation, generate a structured note with chief complaint, history, examination, assessment, and plan sections, plus suggested ICD-10 codes and differential considerations." This is a medical device. It creates new medical information, it classifies, it suggests codes used downstream, and it influences clinical decisions. Under Rule 11 it is Class IIa. If the suggested differentials could lead to treatment decisions that cause serious harm when wrong, a notified body will look hard at Class IIb.
Product 3 — "Inbox Triage." Intended purpose: "reads patient messages in a clinician inbox and ranks urgency, highlighting messages that may indicate acute deterioration." This is software providing information used for therapeutic decisions. Because the decisions relate to potential acute deterioration, Class IIb is realistic, and Class III is possible depending on the patient population. The founder's instinct will be to call it "workflow tooling." The claim "highlights messages that may indicate acute deterioration" is a diagnostic claim.
Three products, one model, three regulatory realities. The technology is identical. The claims are not.
The Subtract to Ship playbook
NLP products attract feature creep faster than any other category. Teams add summarisation, then coding, then recommendations, then Q&A, then agentic actions. Each addition changes the intended purpose and may change the class. Subtract to Ship applies hard.
Step 1: Define the narrowest useful intended purpose. Write one sentence. "This product produces a verbatim transcript of a clinical consultation." Or: "This product extracts ICD-10 codes from free-text clinical notes to support billing workflows." Narrow is safer. Narrow is cheaper. Narrow is shippable.
Step 2: Separate the medical and non-medical products architecturally. If you want both a scribe (non-device) and a structured note generator (device), they cannot share a UI that blurs which output came from which. A notified body will look at the user experience, not the source code repository layout. MDCG 2019-11 Rev.1 asks what the user sees and does.
Step 3: Take hallucination seriously as a risk control. EN ISO 14971:2019+A11:2021 requires a risk management file that identifies hazards and control measures. For NLP, hallucination is a named hazard. Evidence-grounded generation, citation of source text, confidence indicators, and refusal behaviours are risk controls and must be documented as such. "We tested it and it seemed fine" is not a risk file.
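One of the risk controls named above, evidence-grounded generation, can be sketched as a gate between the model and the user. This is a deliberately naive sketch: real systems use span alignment or entailment models rather than substring matching, and the function and variable names here are invented for illustration. The point is that the control is an explicit, testable software behaviour that can be documented in the risk file, not a vague assurance.

```python
def grounded_claims(source_text: str, extracted_claims: list[str]) -> dict:
    """Naive grounding gate: pass only claims whose wording appears verbatim
    in the source transcript; flag everything else for human review rather
    than emitting it. A sketch, not a production implementation."""
    passed, flagged = [], []
    for claim in extracted_claims:
        if claim.lower() in source_text.lower():
            passed.append(claim)
        else:
            flagged.append(claim)
    return {"passed": passed, "flagged_for_review": flagged}

note = "Patient reports chest pain on exertion. No shortness of breath."
result = grounded_claims(note, ["chest pain on exertion",
                                "prior myocardial infarction"])
# The second claim has no support in the note, so it is flagged, not shown.
```

Each flagged claim is a hazard instance caught by the control, which is exactly the kind of effectiveness evidence EN ISO 14971 expects.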
Step 4: Document training data and domain adaptation. Under Annex II technical documentation and EN 62304 lifecycle requirements, the provenance of training data, the domain of the data (which specialties, which languages, which EHR systems), and the mismatch between training and deployment contexts must be documented. An NLP model trained on US discharge summaries and deployed on German primary-care notes is a domain shift with clinical consequences. Write it down.
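"Write it down" can be made concrete as a structured provenance record. The field names below are a hypothetical schema, not anything prescribed by Annex II or EN 62304; what matters is that domain, language, and known gaps are captured in a form that can be versioned alongside the rest of the technical file and compared against the deployment context.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    """Hypothetical provenance record for the technical file.
    Field names are illustrative, not a prescribed MDR schema."""
    name: str
    source: str
    languages: list[str]
    specialties: list[str]
    ehr_systems: list[str]
    collection_period: str
    known_gaps: list[str] = field(default_factory=list)

training_set = DatasetProvenance(
    name="discharge-summaries-v3",
    source="US academic hospital discharge summaries (licensed)",
    languages=["en-US"],
    specialties=["internal medicine", "cardiology"],
    ehr_systems=["Epic"],
    collection_period="2018-2022",
    known_gaps=["no German primary-care notes",
                "paediatrics underrepresented"],
)
# Deployment context: German primary care. The domain shift is now
# explicit, auditable, and impossible to overlook in a review.
```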
Step 5: Lock the prompt stack and call it a software component. Under EN 62304, prompts, system messages, retrieval configurations, and post-processing rules are part of the software system. They are versioned, change-controlled, and tested. Treating prompts as "just config" invites non-conformities the first time an auditor walks through your change log.
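Treating the prompt stack as a configuration item can be as simple as fingerprinting it. The stack contents below are invented for illustration; the technique is just a deterministic hash over a canonical serialisation, so any change to a system message, retrieval setting, or post-processing rule produces a new fingerprint and therefore has to pass through change control.

```python
import hashlib
import json

# Hypothetical prompt stack treated as a software configuration item.
prompt_stack = {
    "system_message": "You are a clinical documentation assistant...",
    "retrieval": {"top_k": 5, "index": "guidelines-v12"},
    "post_processing": ["strip_speculation", "require_source_citation"],
    "version": "1.4.0",
}

def config_fingerprint(stack: dict) -> str:
    """Deterministic SHA-256 of the prompt stack; any edit anywhere in the
    stack yields a new fingerprint to record in the release manifest."""
    canonical = json.dumps(stack, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(config_fingerprint(prompt_stack)[:12])
```

An auditor walking the change log can then tie every fingerprint change to a reviewed, tested release, which is precisely what "just config" cannot demonstrate.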
Step 6: Do not let the sales team outrun the intended purpose. Under Article 2(12), one slide that says "our AI diagnoses conditions from notes" can reclassify the whole product. Train the sales team on what they can and cannot claim. Keep a log of external material for the technical file.
Step 7: Build a usability file under EN 62366-1. NLP interfaces are dense with automation bias risk. Users trust well-written text more than they should. EN 62366-1:2015+A1:2020 use-related risk analysis is the discipline that catches this, and Annex I §5 and §22 of the MDR require it for devices with a user interface.
Reality Check
- Can you state your NLP product's intended purpose in one sentence, without ambiguity about whether it interprets or merely records?
- Does your UI show anything beyond what the intended purpose allows — any ranking, suggestion, classification, or code?
- Have you walked Annex VIII Rule 11 line by line with your intended purpose and written down the class with reasoning?
- Is your training data provenance documented, including domain, language, specialty, and known gaps?
- Does your risk management file name hallucination as a hazard and list specific risk controls with effectiveness evidence?
- Are prompts, system messages, and retrieval rules under configuration management as software components per EN 62304?
- Have you run use-related risk analysis under EN 62366-1 for automation bias and over-trust in fluent output?
- If a market surveillance inspector read your website today, would they agree that it matches your intended purpose statement?
Frequently Asked Questions
Is a pure speech-to-text clinical scribe a medical device? Usually not, if it only produces a verbatim transcript and the manufacturer claims nothing more. The moment it summarises, classifies, codes, or suggests, it likely is.
Does using an LLM automatically make my product a medical device? No. Technology is not the trigger. Intended purpose is. An LLM-powered product with no medical claim is not a device. An LLM-powered product with a medical claim is.
Can I ship an NLP product in Europe without a notified body? Only if it is Class I or not a medical device at all. Under Rule 11, most useful clinical NLP lands in Class IIa or higher, which requires a notified body.
How do I document the "intelligence" of a language model? As a software item under EN 62304, with documented architecture, interfaces, requirements, and verification evidence. Training data, prompts, and retrieval components are all in scope. The fact that the model weights are opaque does not exempt the system from lifecycle documentation.
What about hallucinations? Can I even CE-mark something that can hallucinate? Yes, in principle, if the residual risk is acceptable per EN ISO 14971 and benefit-risk per MDR Annex I supports it. This usually requires strong grounding, clear user communication of uncertainty, limited intended purpose, and clinical evaluation evidence.
Does multilingual deployment change my regulatory file? Yes. Each language and clinical domain is a validation scope. Deploying into a new language or specialty is a change that may require re-validation and, for significant changes, a change notification to your notified body.
Related reading
- MDR Classification Rule 11 for Software — the full text of Rule 11 and classification worked examples
- MDCG 2019-11 Software Guidance — qualification and classification guidance for software
- Intended Purpose Drives Regulatory Decisions — why this one statement matters more than anything else
- Clinical Decision Support Under MDR — where CDS and NLP overlap
- Training Data Requirements for AI Medical Devices — documenting data provenance and fit-for-purpose
Sources
- Regulation (EU) 2017/745 on medical devices, consolidated text. Article 2(1), Article 2(12), Annex VIII Rule 11.
- MDCG 2019-11 Rev.1 (June 2025) — Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745.
- EN 62304:2006+A1:2015 — Medical device software lifecycle processes.
- EN ISO 14971:2019+A11:2021 — Application of risk management to medical devices.
- EN 62366-1:2015+A1:2020 — Application of usability engineering to medical devices.