For AI-powered medical devices, the usability file must address three hazards the notified body will ask about: how the user understands what the algorithm is doing, how the user interprets uncertainty, and how the UI prevents automation bias. These are use errors in the EN 62366-1:2015+A1:2020 sense, and they must be traced through the risk file and summative validation like any other use error.
By Tibor Zechmeister and Felix Lenhard.
TL;DR
- AI-powered medical devices fall under EN 62366-1:2015+A1:2020 exactly like any other device, with no AI-specific carve-out from the usability engineering process.
- Three AI-specific hazardous use scenarios show up in nearly every notified body review: misinterpretation of the algorithm's output, misreading of uncertainty or confidence information, and automation bias that causes the user to stop checking the algorithm.
- MDR Annex VIII Rule 11 drives most AI diagnostic and monitoring software to Class IIa or higher, which raises the depth of usability evidence the notified body expects.
- The UI must communicate what the algorithm can and cannot do, when its output is uncertain, and when the user must override it. Silence on any of these three points is a usability finding.
- MDCG 2019-11 Rev.1 governs software qualification and classification. It does not replace the usability engineering obligations of EN 62366-1; both apply together.
Why AI breaks assumptions in the usability file
A traditional medical device has deterministic behaviour. Press the button, the device does the same thing every time. The usability hazard is that the user pressed the wrong button. The risk control is clearer labelling, a confirmation step, or a physical guard. The evidence is formative and summative testing of the labelling and the confirmation flow.
An AI-powered device behaves differently. Even a locked model that returns the same output for an identical input produces that output with an implicit or explicit uncertainty, and near-identical inputs can flip the result. The user's task is not just to press the right button; the user's task is to interpret a probability, a classification, or a recommendation that the algorithm generated from training data the user has never seen. The usability hazard is that the user misreads the output, over-trusts it, under-trusts it, or does not understand what the algorithm was asked to decide in the first place.
This is why Tibor treats AI usability as a distinct cluster of hazardous use scenarios inside EN 62366-1. The standard itself does not name AI. It does not need to. Clause 5.4 on identifying hazard-related use scenarios (the standard's term for hazardous use scenarios) is broad enough to cover anything the user might do that leads to harm, and the three AI hazards above fit cleanly inside it.
What MDR actually says
MDR does not have an AI chapter. AI-powered medical devices are regulated through the existing MDR framework. Classification follows MDR Annex VIII, and for most AI diagnostic and monitoring software, Rule 11 applies. Rule 11 works like this:
- Software intended to provide information which is used to take decisions with diagnosis or therapeutic purposes is Class IIa.
- It rises to Class III if such decisions have an impact that may cause death or an irreversible deterioration of a person's state of health, and to Class IIb if they may cause a serious deterioration of a person's state of health or a surgical intervention.
- Software intended to monitor physiological processes is Class IIa, rising to Class IIb if it is intended for monitoring vital physiological parameters where the nature of variations could result in immediate danger.
- All other software is Class I.
MDCG 2019-11 Rev.1 gives the authoritative interpretation of Rule 11 and includes worked examples. It does not replace the usability engineering obligations of EN 62366-1 and does not discuss usability in depth. Usability sits under MDR Annex I §5 and §22 and under the software-specific requirements in Annex I §17 on electronic programmable systems and software.
Annex I §5 is worth rereading in the AI context: in eliminating or reducing risks related to use error, the manufacturer shall reduce as far as possible the risks related to the ergonomic features of the device and the environment in which the device is intended to be used. For AI, the "ergonomic features" include the way the algorithm's output is presented, and the "environment" includes the time pressure and cognitive load the user operates under.
EN 62366-1 clause 5.1 requires a use specification. For an AI device, the use specification must answer a question the notified body will ask: what is the user's actual task in the loop with the algorithm? Is the user confirming the algorithm's output, overriding it when it is wrong, acting on it without review, or using it as a second opinion alongside their own judgment? Different loops produce different hazardous use scenarios and different UI obligations.
A worked example
Consider a Class IIa AI-powered dermatology triage app. The algorithm takes a photograph of a skin lesion and outputs a classification into "benign," "suspicious," or "refer urgently," with a confidence score. The intended user is a general practitioner in a primary care setting. The use specification describes the GP as time-pressured, trained in general dermatology but not specialised, seeing a lesion photograph on a tablet during a consultation.
The team identifies three AI-specific hazardous use scenarios. First, the GP misreads the confidence score. A "72 percent confident benign" result is shown as a green badge, and the GP treats it as "the AI says benign, I agree, move on," without registering that even a well-calibrated 72 percent means roughly one in four such lesions is not benign. Second, the GP develops automation bias over weeks of use. The first hundred predictions were correct, so the GP stops examining suspicious lesions personally when the AI returns green. Third, the GP does not understand what the algorithm was trained on. The training data was mostly lighter skin tones, and the confidence scores are less reliable on darker skin. The GP has no way to know this from the UI.
Each of these is a hazardous use scenario under EN 62366-1 and a contributing cause to potential patient harm under EN ISO 14971:2019+A11:2021. The risk controls the team adopts include: replacing the green badge with a numeric confidence plus a written caveat, adding an explicit "algorithm does not replace your clinical judgment" screen that the GP must acknowledge on the first use of each session, showing a representative image from the training distribution alongside the patient's image so the GP can judge whether the case is in or out of distribution, and adding a forcing function that requires the GP to document their own assessment before the AI output is revealed on high-risk lesion categories.
That last one is a strong example of Subtract to Ship thinking. Instead of adding more features, the team subtracted the AI's ability to anchor the GP's judgment by delaying the output. It is also testable in summative: can users under realistic time pressure still arrive at their own assessment before the AI anchors them? If yes, the risk control works.
The Subtract to Ship playbook
Felix has watched AI startups burn months on explainability features that do not actually help the user. The Subtract to Ship instinct is to ask what the user needs to decide safely and cut everything else. The playbook has five moves.
Move one: define the human-in-the-loop role precisely. Is the user a decision maker, a supervisor, a reviewer, or a rubber stamp? The use specification must answer this clearly because every UI decision flows from it. Notified bodies will ask.
Move two: decide what the UI must communicate about the algorithm. At a minimum, most AI devices need to communicate what the algorithm does, what data it was trained on at a summary level, how confident it is on this specific input, and what the user's override options are. If any of those four is missing from the UI, the notified body will flag it and Tibor would flag it in an audit.
Move three: design against automation bias, not just for it. Automation bias is the well-documented tendency of users to over-trust automated outputs, especially under time pressure. The UI must include forcing functions that keep the user engaged. Examples: requiring the user to enter their own impression before the algorithm output is displayed on high-stakes cases, flagging cases where the input is out of the training distribution, and showing disagreement signals when the algorithm output conflicts with other available evidence.
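One of these controls, the out-of-distribution flag, can be sketched as a distance check against summary statistics of the training data. The single scalar feature and the threshold below are hypothetical placeholders; a real device would use a validated multivariate OOD detector.

```python
# Hypothetical training-set statistics for one input feature
# (e.g. mean lesion-image brightness); illustrative values only.
TRAIN_MEAN = 0.55
TRAIN_STD = 0.12
Z_THRESHOLD = 3.0  # flag inputs more than 3 SDs from the training mean

def ood_flag(feature_value: float) -> bool:
    """Return True when the input looks unlike the training data,
    so the UI can warn that the confidence score may be unreliable."""
    z = abs(feature_value - TRAIN_MEAN) / TRAIN_STD
    return z > Z_THRESHOLD
```

The usability point is not the statistics but the UI consequence: when the flag fires, the interface must say so in terms the user can act on, and the summative evaluation must show that users actually escalate flagged cases.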
Move four: communicate uncertainty in a form the user can actually use. A raw probability ("0.73") is usually not enough. Most clinical users find it easier to interpret a three-tier scale ("high confidence," "moderate confidence," "low confidence, manual review recommended") than a number. The scale must be calibrated: the summative evaluation must show that users who see "high confidence" actually treat it as high confidence and users who see "low confidence" actually escalate.
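The three-tier mapping can be sketched as a thin translation layer between the raw model probability and what the UI displays. The cut-points below are placeholders; in practice they must be derived from calibration data and confirmed in the summative evaluation.

```python
def confidence_tier(probability: float) -> str:
    """Map a raw model probability to a user-facing tier.
    Cut-points are illustrative and must be set from calibration
    data, then validated in summative evaluation."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability >= 0.90:
        return "high confidence"
    if probability >= 0.70:
        return "moderate confidence"
    return "low confidence, manual review recommended"
```

Keeping the mapping in one function also makes the calibration claim auditable: the thresholds are explicit design inputs that can be traced to the calibration study rather than buried in UI code.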
Move five: validate all of this with real users under realistic conditions. A summative evaluation where participants sit in a quiet room with no time pressure is not valid for a device used under clinical time pressure. Tibor has seen teams pass summative in a lab and fail in post-market surveillance because the lab conditions did not reflect the real environment. For AI devices, time pressure, cognitive load, and repetition over weeks are part of the use environment and must be considered.
Reality Check
- Does the use specification name the human-in-the-loop role precisely (decision maker, supervisor, reviewer, rubber stamp)?
- Does the UI communicate what the algorithm does, what it was trained on, how confident it is on this specific input, and how the user can override it?
- Have you identified hazardous use scenarios specifically for misinterpretation of output, misreading of uncertainty, and automation bias, and do they trace to risk controls in the risk file?
- Does the UI include at least one forcing function that keeps the user engaged on high-stakes cases, rather than assuming the user will stay vigilant indefinitely?
- Is uncertainty communicated in a form users actually act on correctly, validated in summative, not assumed from a designer's intuition?
- Does your summative evaluation reproduce the real cognitive load and time pressure of the intended use environment, or does it run in artificial lab conditions?
- Do your post-market surveillance plans include monitoring for automation bias and distributional drift, given that both tend to worsen over time rather than improve?
Frequently Asked Questions
Is there an MDR article specifically for AI? No. MDR does not have an AI chapter. AI-powered medical devices are regulated through the existing MDR framework. The EU AI Act is a separate piece of legislation with its own obligations, and high-risk medical AI will be subject to both MDR and the AI Act simultaneously. This post covers MDR usability obligations; AI Act usability obligations are related but distinct.
Does EN 62366-1 apply to AI devices the same way as to other devices? Yes. EN 62366-1:2015+A1:2020 applies without modification. The standard does not name AI but is broad enough to cover any device where user interaction creates risk. The specific hazardous use scenarios for AI look different, but the process is the same: use specification, hazardous use scenario identification, risk controls, formative evaluation, summative validation.
What does "explainability" mean in usability terms? For regulatory usability purposes, explainability is not about showing saliency maps or SHAP values. It is about whether the user understands what the algorithm is telling them, well enough to act on it safely. A saliency map that the user cannot interpret is worse than no explanation, because it creates false confidence. The test is empirical: does the summative evaluation show that users act correctly on the information provided?
How do I handle uncertainty in the UI? Show it, calibrate it, and validate that users act on it correctly. Most clinical users find categorical scales easier to act on than raw probabilities. Whatever form you choose, the summative evaluation must show that users in high-uncertainty cases actually escalate or seek a second opinion, not that they ignore the uncertainty and proceed.
Is automation bias a usability concern under MDR? Yes. Automation bias is a well-documented human factor that causes users to over-trust automated outputs, especially over time and under time pressure. Under EN 62366-1, it is a hazardous use scenario that must be identified, mitigated, and validated. Ignoring it in the usability file is a finding.
Does a continuously learning AI need extra usability evidence? Continuously learning AI has specific clinical evaluation obligations that are more demanding than locked algorithms. On the usability side, the core question is whether the user's mental model of the device remains accurate as the algorithm changes. If the algorithm is updated and now performs differently in edge cases, the user's interpretation of the output may no longer be correct. Post-market usability surveillance has to be part of the plan.
Related reading
- MDR classification Rule 11 for software covers the classification rule most AI medical software falls under.
- Clinical evaluation for AI and ML with continuous learning handles the clinical evidence side of AI devices.
- MDCG 2019-11 software guidance is the authoritative interpretation of Rule 11 for software, including AI.
- Risk management and usability engineering link shows how hazardous use scenarios flow into the risk file.
Sources
- Regulation (EU) 2017/745 on medical devices, consolidated text. Annex I §5, §17, §22, Annex VIII Rule 11.
- EN 62366-1:2015+A1:2020, Medical devices – Part 1: Application of usability engineering to medical devices.
- EN ISO 14971:2019+A11:2021, Medical devices – Application of risk management to medical devices.
- EN 62304:2006+A1:2015, Medical device software – Software life cycle processes.
- MDCG 2019-11 Rev.1 (June 2025), Guidance on Qualification and Classification of Software in Regulation (EU) 2017/745 – MDR and Regulation (EU) 2017/746 – IVDR.