Using AI for PMS automation means letting a qualified tool take on the structured, high-volume layer of the work MDR Articles 83 and 84 already require — complaint intake triage, trend detection against a locked baseline, literature surveillance, and first-draft narratives — while a qualified human at the manufacturer keeps the adjudication, the signal determination, and the vigilance reporting decisions under MDCG 2023-3 Rev.2. Done correctly, the tool absorbs the hours that eat a small team alive and returns those hours to risk-file integration, PMCF analysis, and the parts of PMS where judgement actually matters. Done incorrectly, it creates a ranked queue that nobody reads past the top, a signal detection dashboard that never triggers, and an audit trail that cannot answer a Notified Body's first question. This post walks through the PMS time problem, what AI automates well, what cannot be automated, the eleventh-result complacency risk, the human-in-the-loop discipline, the audit trail expectations Notified Bodies bring in 2026, the mistakes small teams keep making, and how the whole thing fits inside the Subtract to Ship framework.

By Tibor Zechmeister and Felix Lenhard. Last updated 10 April 2026.


TL;DR

  • MDR Articles 83 and 84 require a proactive PMS system and a PMS plan proportionate to the risk class. Annex III sets out what the plan must address. Nothing in the Regulation prohibits AI-assisted automation of the work; everything in the Regulation requires the manufacturer to remain responsible for the outputs.
  • AI automates the structured layer well: complaint triage, duplicate detection, clustering, component tagging, trend detection against a locked baseline, literature surveillance, and first-draft narratives.
  • AI cannot make the signal determination, cannot decide trend reporting under MDR Article 88, and cannot make vigilance reporting decisions. Those sit with a qualified human under MDCG 2023-3 Rev.2 (January 2025).
  • The eleventh-result pattern is the failure mode to fight: after the tool gets ten in a row right, the reviewer stops reviewing, and the one that matters slips through.
  • The audit trail a Notified Body expects in 2026 is concrete — tool version, inputs, draft outputs, human edits, approver identity, and QMS qualification of the tool under EN ISO 13485:2016+A11:2021.

The PMS time problem

Classical PMS in a small MedTech team used to fit on a shelf. A handful of complaints a month, a quarterly literature scan, a service log the regulatory lead read end to end on a Friday afternoon. That model held together while the product was small and the field was quiet.

It does not survive a product at real scale. A diagnostic that is actually in clinical use produces complaint intake through a customer portal, a field service system, distributor forwards, clinician emails, and support tickets that sometimes contain a safety-relevant sentence buried three paragraphs into a request about something else. MDCG 2025-10 (December 2025) expects the PMS system to proactively draw from all of this and more — similar devices, public literature, trend data, PSURs and FSCAs on comparable products. The sources grew. The team did not.

The Flinn customer case Tibor has described in other posts is the canonical version of this. Two qualified people whose job became Excel copy-paste. Hundreds of reports a week, each one categorised by hand, logged, and moved on from. Both of them quit. The work was necessary, the volume was real, and the mismatch between the qualification required to do the work and the mechanics of actually doing it had become unbearable.

Using AI for PMS automation is the response to that arithmetic. The obligations under Articles 83 and 84 did not change. The volume of data those obligations apply to did.

For the foundational framework, see what is post-market surveillance under MDR and MDR Articles 83 to 86 explained. For the adjacent signal-detection deep dive, see AI for post-market surveillance: complaint analysis and signal detection.

What AI automates well

The band where AI earns its place in PMS is narrow and real. It maps to the parts of Annex III that are structured enough to be handled by pattern recognition and stable enough that a validated tool can process them on a defined cadence.

Complaint intake triage. Draft categorisation of each incoming item — serious incident candidate, non-serious incident, malfunction, user error, no device relationship, off-topic — is a structured classification task current tools handle reliably. In the Flinn customer case, pre-categorisation of the safety database cut manual time by more than 80% against the previous Excel workflow. That number is not universal; it is a real measurement from one specific workflow under specific conditions.
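The tool's triage output is best treated as a contract, not a decision. A minimal sketch of what that contract could look like (the category labels come from the paragraph above; the `TriageDraft` fields, tool version string, and confidence score are illustrative assumptions, not any specific vendor's schema):

```python
from dataclasses import dataclass
from enum import Enum

class TriageCategory(Enum):
    SERIOUS_INCIDENT_CANDIDATE = "serious incident candidate"
    NON_SERIOUS_INCIDENT = "non-serious incident"
    MALFUNCTION = "malfunction"
    USER_ERROR = "user error"
    NO_DEVICE_RELATIONSHIP = "no device relationship"
    OFF_TOPIC = "off-topic"

@dataclass
class TriageDraft:
    """The tool's output is a draft, never a final record: it carries
    the proposed category, the tool's confidence, and the provenance
    a reviewer and an auditor both need."""
    item_id: str
    category: TriageCategory
    confidence: float
    tool_version: str
    human_confirmed: bool = False  # flips only after named human review

draft = TriageDraft("C-1007", TriageCategory.SERIOUS_INCIDENT_CANDIDATE,
                    confidence=0.91, tool_version="triage-1.4.2")
print(draft.human_confirmed)  # False until a qualified reviewer signs off
```

The `human_confirmed` default of `False` encodes the discipline in the data model itself: nothing the tool emits counts as reviewed until a named person makes it so.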

Duplicate detection. The same event arriving through two channels, a clinician who files twice after a follow-up, a distributor forwarding a complaint already logged. A similarity check over text and metadata removes duplicates before they inflate apparent rates and waste reviewer hours.
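A minimal sketch of the similarity check over text and metadata, using the simplest possible comparison (Python's standard-library `SequenceMatcher`; real tools use stronger matching, and the field names, threshold, and date window here are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher

@dataclass
class Report:
    source: str      # e.g. "portal", "distributor"
    device_id: str
    event_date: date
    text: str

def likely_duplicates(a: Report, b: Report,
                      text_threshold: float = 0.8,
                      date_window_days: int = 7) -> bool:
    """Flag two reports as duplicate candidates when the device matches,
    the event dates are close, and the narratives are highly similar.
    Candidates go to a human for confirmation; the tool never silently
    drops a report."""
    if a.device_id != b.device_id:
        return False
    if abs((a.event_date - b.event_date).days) > date_window_days:
        return False
    similarity = SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()
    return similarity >= text_threshold

r1 = Report("portal", "DX-114", date(2026, 3, 2),
            "Device displayed error E12 during calibration and shut down.")
r2 = Report("distributor", "DX-114", date(2026, 3, 4),
            "Device displayed error E12 during calibration and shut down")
print(likely_duplicates(r1, r2))  # True: same event via two channels
```

The design choice that matters is the metadata gate before the text comparison: two near-identical narratives about different devices are not duplicates, and rate inflation is only prevented when the merge is device-specific.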

Clustering and component tagging. Grouping complaints that describe the same failure mode, and tagging each one against the specific component, subsystem, or clinical context it touches. This is the layer that turns unstructured text into a dataset the reviewer can query, filter, and count. Trend detection depends on it.

Trend detection against a locked baseline. For each event category, the PMS plan defines a baseline rate from an agreed reference period. An AI-assisted pipeline computes the current rate on a defined cadence and compares it to the baseline. A pre-defined threshold triggers formal review. MDR Article 88 governs trend reporting of statistically significant increases in the frequency or severity of incidents that are not themselves serious incidents — the trend detection layer feeds that decision. The threshold is set in the plan before the signal happens, not invented after.
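The locked-baseline comparison itself is a few lines of arithmetic. A sketch under stated assumptions: the category name, exposure measure, and 2x trigger ratio are hypothetical examples, and the real values are fixed in the PMS plan before monitoring starts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Baseline:
    """Locked in the PMS plan before monitoring starts; frozen=True
    mirrors the regulatory intent that the baseline is not edited."""
    category: str
    events: int             # events in the reference period
    exposure: int           # units in use (or procedures) in that period
    threshold_ratio: float  # pre-defined trigger, e.g. 2.0 = double the baseline

    @property
    def rate(self) -> float:
        return self.events / self.exposure

def trend_review_triggered(baseline: Baseline,
                           current_events: int,
                           current_exposure: int) -> bool:
    """Compare the current period's rate to the locked baseline.
    Crossing the threshold opens formal human review; the Article 88
    reporting decision itself stays with a qualified person."""
    current_rate = current_events / current_exposure
    return current_rate >= baseline.threshold_ratio * baseline.rate

b = Baseline(category="connector_fracture", events=4, exposure=10_000,
             threshold_ratio=2.0)
print(trend_review_triggered(b, current_events=11, current_exposure=12_000))
# True: 11/12,000 exceeds double the baseline rate of 4/10,000
```

Note what the function returns: a trigger for review, not a conclusion. The output is an input to the qualified person's Article 88 determination, never a substitute for it.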

Literature surveillance. Running Annex III-required literature monitoring against a defined query set, on a defined cadence, with the results ranked for relevance. The reviewer reads the top of the list; the tool saves the hours that went into reading the bottom of the list.

First-draft narratives. A short structured draft of what happened, to whom, with which device, with what outcome. The reviewer edits the draft against the raw input. The draft is never the final record.

Felix's summary from the interviews applies here cleanly. AI maintains documentation, flags discrepancies, runs questionnaires, increases speed, maintains quality, reduces costs — and in PMS the flagging-discrepancies part is where the value concentrates. The tool reads faster than a human and raises its hand when something looks off.

What cannot be automated

The things AI cannot do in PMS, and must not be allowed to do, sit in a different band entirely.

Signal determination. Deciding whether a cluster, a rate change, or a narrative pattern constitutes a real safety signal is a judgement call with patient-safety consequences. A tool surfaces the material. A qualified human decides whether the material means something.

Trend reporting under Article 88. The decision to report a statistically significant increase in frequency or severity is a manufacturer decision, made by a qualified person against the methods and protocols specified in the PMS plan under Annex III. The tool can compute the rate change. It cannot decide what the rate change means for the benefit-risk determination.

Vigilance reporting decisions. Whether an event meets the serious-incident definition, whether an FSCA is warranted, how and when to report to the competent authority — all of this lives with a qualified human at the manufacturer, against the terms and criteria of MDCG 2023-3 Rev.2 (January 2025).

Risk file and clinical evaluation updates. A signal that lands in the PMS system has to close the loop into the risk management file and the clinical evaluation. That is interpretive work. A tool drafts; a human concludes.

The rule Felix keeps coming back to is simple. AI flags issues, the qualified human adjudicates. Never the other way around. An expert who rubber-stamps AI output is not adjudicating; they are laundering a machine decision through a human signature, and a Notified Body reading the audit trail will see it.

The eleventh-result complacency risk

Here is the failure mode Tibor watches for in every AI-assisted PMS workflow, including at Flinn customers.

The tool gets the first ten results right. By the tenth, the reviewer has stopped really reviewing. They are clicking approve because the previous nine were fine and the queue is long. The eleventh result is wrong — a serious incident candidate mis-tagged as a non-serious malfunction, a safety-relevant literature hit marked off-topic, a cluster that should have triggered trend review dismissed as noise — and nobody catches it. The review stopped being a review several results earlier.

This is automation complacency, and it is well-documented in human-automation research across domains. Aviation trains against it. Clinical AI studies it. Regulatory operations, at the moment, mostly do not talk about it. In sales the cost of the eleventh wrong result is a bad email. In PMS it can be a missed safety signal that affects patients. The calibration of how much attention the reviewer brings to the tool's output has to account for this explicitly, in the SOP, not in people's heads.

The countermeasures are the same ones that work in every other human-automation system. They are not novel. They are the basic mechanics of keeping a human-in-the-loop system actually in the loop.

Human-in-the-loop discipline

Named adjudicator per decision. Every signal that enters formal review has a named human responsible for the conclusion. Every trend reporting decision under Article 88 and every vigilance reporting decision is made by a qualified person, with their name attached.

Full review of the top of the queue from raw input. The reviewer reads the items the tool ranked most likely safety-relevant from the raw source, not from the tool's summary. A draft narrative is a starting point; the record is built against the underlying data.

Mandatory spot-check rate on the bottom of the queue. The failure mode to fight is not that the top is wrong — it is that the bottom is wrong and nobody reads the bottom. A fixed percentage of items the tool ranked low-relevance are re-reviewed from scratch by a human on a defined cadence. This is the single most important control in the whole pipeline.
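The spot-check can be made both mandatory and reproducible. A sketch assuming hash-based deterministic sampling (the rate, period key, and ID format are illustrative; the point is that the selection is fixed per period and auditable, not chosen by the reviewer):

```python
import hashlib

def spot_check_sample(low_ranked_ids: list[str],
                      rate: float = 0.10,
                      period: str = "2026-W15") -> list[str]:
    """Deterministically select a fixed fraction of low-relevance items
    for full re-review from the raw source. Hash-based selection means
    the sample can be reproduced exactly for an auditor, and nobody can
    steer which items get picked."""
    def score(item_id: str) -> float:
        # Map (period, item) to a stable pseudo-random value in [0, 1).
        digest = hashlib.sha256(f"{period}:{item_id}".encode()).hexdigest()
        return int(digest[:8], 16) / 0xFFFFFFFF
    return [i for i in low_ranked_ids if score(i) < rate]

ids = [f"C-{n:04d}" for n in range(1, 501)]
sample = spot_check_sample(ids)
print(len(sample), "of", len(ids), "low-ranked items selected for re-review")
```

Seeding the hash with the period key gives a fresh sample every cadence while keeping each period's sample fully reproducible, which is exactly the property the audit trail needs.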

Override logging. Every time the reviewer disagrees with the tool's categorisation, cluster assignment, or ranking, the disagreement is logged with a reason. The log is reviewed periodically for drift in either direction — the tool getting worse, or the reviewer getting lazier.
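A minimal sketch of what an override log entry might carry (the field names, tool version string, and reviewer identifier are hypothetical; the non-negotiable part is that an override without a reason is rejected):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OverrideEntry:
    item_id: str
    tool_version: str
    tool_value: str    # what the tool proposed
    human_value: str   # what the reviewer recorded instead
    reviewer: str
    reason: str
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

log: list[OverrideEntry] = []

def log_override(entry: OverrideEntry) -> None:
    """Reject reason-free overrides: an undocumented disagreement
    cannot be reviewed later for drift in either direction."""
    if not entry.reason.strip():
        raise ValueError("an override without a reason is not auditable")
    log.append(entry)

log_override(OverrideEntry(
    item_id="C-2031", tool_version="triage-1.4.2",
    tool_value="non-serious malfunction",
    human_value="serious incident candidate",
    reviewer="j.meier",
    reason="narrative describes patient harm in paragraph three"))
print(len(log))  # 1
```

The periodic drift review then runs over `log`: a rising override rate suggests the tool is degrading, a falling one that the reviewer may have stopped looking.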

Rotation. The same reviewer does not supervise the same AI-assisted workflow for months on end. Fresh eyes re-establish critical distance. In a three-person team this is logistically hard. It is also where complacency hits fastest.

Pre-defined escalation triggers. Anything that matches a serious-incident pattern goes to a second reviewer regardless of what the tool said about it.

Audit trail expectations

Under MDR Article 10 the manufacturer is responsible for the PMS system and everything it produces. Under EN ISO 13485:2016+A11:2021 the QMS has to describe how any software used in the process is qualified and controlled. For an AI-assisted PMS pipeline in 2026, the audit trail a Notified Body expects is concrete.

  • Tool identity and version. Which tool produced the triage, clustering, ranking, and trend computations; which version was in effect on the date of the records. A version change is itself a change control event in the QMS.
  • Inputs used. Which fields from which source systems went into the tool, and which did not. An auditor can ask. The answer has to be in the record.
  • Draft outputs preserved. The tool's raw categorisation, cluster assignments, draft narratives, and rate computations — stored as-is, not overwritten by the reviewer's edits.
  • Human review evidence. Who reviewed, on what date, what they changed, what they kept, why. The override log is part of this.
  • Approval and signature. The named human whose signature closes the record, with role and date.
  • Qualification of the tool in the QMS. A record of the tool's intended use, validation evidence, scope, limits, and the controls that apply when it runs. MDCG 2025-10 (December 2025) expects PMS processes to be described in real operational terms, not abstracted.
  • Fall-back plan. A documented workflow for operating the PMS system if the tool is unavailable. A process that cannot run without the tool is a process with a single point of failure.
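The bullet points above can be sketched as a single record structure, to make concrete what preserving the trail alongside the final entry means (field names and example values are hypothetical, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PMSAuditRecord:
    """One triaged item's trail, stored alongside the final entry."""
    item_id: str
    tool_name: str
    tool_version: str              # version in effect on the record date
    inputs_used: tuple[str, ...]   # source fields fed to the tool
    draft_output: str              # tool output preserved as-is
    final_entry: str               # what the reviewer actually recorded
    reviewer: str
    review_date: str
    approver: str
    approver_role: str

def version_change(a: PMSAuditRecord, b: PMSAuditRecord) -> bool:
    """A tool version change between records is itself a change-control
    event in the QMS, not a silent update."""
    return (a.tool_name, a.tool_version) != (b.tool_name, b.tool_version)

r_old = PMSAuditRecord("C-1007", "triage-tool", "1.4.2",
                       ("portal.free_text", "portal.device_id"),
                       draft_output="non-serious incident",
                       final_entry="serious incident candidate",
                       reviewer="j.meier", review_date="2026-04-02",
                       approver="a.huber", approver_role="PRRC")
r_new = PMSAuditRecord("C-1031", "triage-tool", "1.5.0",
                       ("portal.free_text", "portal.device_id"),
                       draft_output="malfunction",
                       final_entry="malfunction",
                       reviewer="j.meier", review_date="2026-04-09",
                       approver="a.huber", approver_role="PRRC")
print(version_change(r_old, r_new))  # True: change control applies
```

Keeping `draft_output` and `final_entry` as separate fields is the whole point: a record where they are the same string for every item is exactly the rubber-stamping pattern an auditor is trained to spot.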

A record that shows only the approved final entry, with no evidence of the triage process underneath it, looks fine on first reading and will not survive a serious Notified Body review. The auditor will ask how the item got to the top of the queue and what happened to the items that did not. The answer has to exist.

Common mistakes

  • Treating the ranked queue as a sorted answer. The queue is an ordering, not a conclusion. Reviewing only the top and never sampling the bottom turns the tool into a filter that hides the signals that matter most.
  • No locked baseline. Without a reference period and agreed baseline rates per category, trend detection has nothing to compare against. "We noticed an increase" is not an auditable statement under Article 88.
  • Threshold-free dashboards. Monitoring everything and triggering nothing. A dashboard with no pre-defined escalation threshold is a decoration.
  • Skipping QMS qualification of the tool. Using a tool that is not described anywhere in the QMS creates an invisible dependency. Describe it, control it, keep the description current.
  • Copying the draft narrative into the record without checking it against the raw complaint. The draft is a starting point. A record that is the unedited draft is a record the reviewer did not make.
  • Merging triage and adjudication in a single screen. If the reviewer approves the tool's category on the same screen it is shown on, the approval step will drift toward rubber-stamping the fastest. Separate the surfaces.
  • No fall-back plan. If the tool disappears tomorrow and the PMS process stops running, the tool has become the process. That is the wrong dependency.

The Subtract to Ship angle

From a Subtract to Ship perspective, using AI for PMS automation fits the framework cleanly when it subtracts hours from the structured triage and surveillance layer that Articles 83, 84, and 88 already require. The obligations do not move. The time spent on the structured layer does. A team that clears triage faster and reinvests the recovered hours in risk-file integration, trend adjudication, and PMCF analysis is subtracting waste and keeping compliance. A team that uses the tool to skip the human review step is cutting compliance, and the framework does not allow it.

The test stays the same as everywhere else in the framework. Every activity in the PMS plan has to trace to a specific MDR article, annex, or harmonised standard. AI changes how fast an activity runs. It does not change whether the activity is required.

The lean AI-assisted PMS pipeline looks like this.

  • One ingestion pathway per source, normalised into a common schema.
  • One AI-assisted triage step that categorises, deduplicates, clusters, tags, and ranks.
  • One human review step with a named adjudicator, a mandatory spot-check rate on the bottom of the queue, and override logging.
  • One trend detection layer with locked baselines and pre-defined thresholds under Article 88.
  • One literature surveillance layer on a defined cadence with a ranked, human-reviewed output.
  • One escalation pathway into vigilance assessment under MDCG 2023-3 Rev.2.
  • One feedback loop into the risk file and the clinical evaluation.
  • One PMS Report or PSUR that reflects the real data from the period.

Nothing in that list is optional. Nothing beyond it earns its place.

For the post-market clinical follow-up side of the same pipeline, see PMCF under MDR — a guide for startups. For the AI devices themselves — where PMS has a drift-oriented character that changes the math — see post-market surveillance for AI devices.

Reality Check — Where do you stand?

  1. For your product's actual complaint volume, can a human-only workflow read every incoming item end to end this month? If not, what is the current coping strategy?
  2. If you use an AI tool in triage or trend detection, is it described in your QMS with an intended use, a validation record, a scope, and a named owner?
  3. Do you have locked baseline rates per event category, and pre-defined thresholds that trigger formal review under Article 88?
  4. Is there a mandatory spot-check rate on the bottom of the ranked queue, and can you prove from the records it is actually being done?
  5. For every vigilance reporting decision in the last quarter, is there a named human adjudicator with a documented rationale independent of the tool's draft?
  6. If a Notified Body auditor asked how a specific complaint ended up where it did in your process — and what happened to the fifty complaints around it — could you answer from the records without improvising?
  7. If the AI tool disappeared tomorrow, how long before your PMS process fell behind, and is there a documented fall-back workflow?
  8. When was the last time you reviewed the override log for drift, in either direction?

Frequently Asked Questions

Does the MDR allow using AI for PMS automation? Yes, within limits. MDR Articles 83 and 84 require a PMS system and plan proportionate to the risk class and appropriate for the device, and Annex III specifies what the plan has to address. Nothing in those provisions prohibits AI-assisted automation. EN ISO 13485:2016+A11:2021 requires the QMS to describe how any software used in the process is qualified and controlled, and MDCG 2025-10 (December 2025) expects PMS processes to be described in real operational terms. If the tool is qualified, the human-in-the-loop discipline is intact, and the audit trail exists, AI-assisted PMS is a legitimate productivity layer.

Can an AI tool make the trend reporting decision under Article 88? No. An AI tool can compute rate changes against a locked baseline and surface the cases where the pre-defined threshold has been crossed. The decision that the change constitutes a statistically significant increase in frequency or severity requiring trend reporting is a manufacturer decision made by a qualified person against the methods and protocols specified in the PMS plan under Annex III.

What is the single most important control in an AI-assisted PMS workflow? A mandatory spot-check rate on the bottom of the ranked queue, combined with override logging. The failure mode to fight is not that the top of the queue is wrong. It is that the bottom of the queue is wrong and nobody reads the bottom. Without the spot-check, the tool silently becomes a filter that hides the signals that matter most.

How do I avoid over-fitting trend detection thresholds to noise? Set thresholds from a defined baseline period where the data is well-understood, document the statistical reasoning in the PMS plan under Annex III, and re-evaluate on a fixed cadence as more data accumulates. Pre-defined thresholds prevent both over-reaction to noise and under-reaction to real signals, and the reasoning is written down so a Notified Body reviewer can follow it.
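One common way to turn that reasoning into a number is a simple Poisson-based margin over the baseline rate. This is a sketch of one possible method, not a prescribed one: the z value and the Poisson assumption itself must be whatever the PMS plan actually documents.

```python
from math import sqrt

def trigger_rate(baseline_events: int, baseline_exposure: int,
                 z: float = 3.0) -> float:
    """Trigger rate = baseline rate plus z standard deviations of the
    event count under a Poisson assumption, converted to a rate.
    z = 3 is a common starting point, not a regulatory requirement."""
    rate = baseline_events / baseline_exposure
    # Poisson std dev of the count is sqrt(events); divide to get a rate.
    return rate + z * sqrt(baseline_events) / baseline_exposure

# 4 events over 10,000 units in the baseline period:
print(trigger_rate(4, 10_000))  # ≈ 0.001, i.e. 2.5x the baseline rate
```

A statistically derived margin like this resists over-fitting better than an eyeballed multiplier, because the allowance for noise scales with how little baseline data exists, and the derivation is one line a Notified Body reviewer can re-run.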

Does the Notified Body need to know that I use AI in my PMS process? Transparency is the right posture. Notified Bodies in 2026 are increasingly familiar with AI-assisted workflows in QMS processes and will look for the tool's qualification in the QMS, the validation evidence, the human review controls, and the audit trail. Describing the tool's role honestly is stronger than hiding it.

What happens if the tool is wrong and a signal is missed? The responsibility is the manufacturer's regardless of the tool. The PMS system must be designed so that tool errors do not translate directly into missed signals — the spot-check rate, the override logging, the rotation, and the independent human adjudication exist precisely for this. If a signal is missed, the root-cause analysis covers the process controls that should have caught it, and the process is corrected accordingly.

Sources

  1. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices — Article 83 (post-market surveillance system), Article 84 (post-market surveillance plan), Article 88 (trend reporting), Annex III (technical documentation on post-market surveillance). Official Journal L 117, 5.5.2017, consolidated text.
  2. MDCG 2025-10 — Guidance on post-market surveillance of medical devices and in vitro diagnostic medical devices. Medical Device Coordination Group, December 2025.
  3. MDCG 2023-3 Rev.2 — Questions and Answers on vigilance terms and concepts as outlined in Regulation (EU) 2017/745 and Regulation (EU) 2017/746. Medical Device Coordination Group, first publication February 2023, Revision 2 January 2025.
  4. EN ISO 13485:2016 + A11:2021 — Medical devices — Quality management systems — Requirements for regulatory purposes.

This post is part of the Post-Market Surveillance & Vigilance series in the Subtract to Ship: MDR blog. Authored by Felix Lenhard and Tibor Zechmeister. Using AI for PMS automation is a productivity layer on top of the Article 83, 84, and 88 obligations, not a way around them — the qualified human at the manufacturer still owns the determinations, and the tool exists to make that human's attention land on the items where the determination actually matters.