The 20-step cliff

On April 2, the FDA issued its first warning letter citing the inappropriate use of AI in pharmaceutical manufacturing.

The firm had used AI agents to generate drug specifications and master production records, and nobody on the quality side had checked them.

The agentic workflow ran 20 steps, and the audit was the first thing to catch the problem.

How long are the agentic AI workflows being pitched at your plant?

👋🏻 I'm Leonardo Ubbiali. This week we're looking at the cliff that every agentic AI workflow hits at step 20, and what to ask the next vendor who pitches you a long-chain agent.

Picture an agent handling a deviation investigation in a pharma plant.

It pulls the batch record, cross-references three historical deviations, drafts the root-cause narrative, routes it for review, and updates the quality system.

Then it does the same for the next deviation. By the end of the shift it has run 30 steps.

20 steps in, the record on screen no longer matches the record in your historian.

Nobody knows where it went wrong.

That is the failure mode Microsoft Research documented in a paper published on April 17.

Frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4 corrupt an average of 25% of document content by the end of a 20-step workflow.

Across all 19 models tested, the average is 50%.

Python was the only professional domain where most models cleared a 98 percent reliability bar, while every other domain failed.

There is one more finding in the paper.

The strongest models postpone the problem.

Roughly 80% of the total degradation comes from sparse single-round drops of 10 to 30 points, and frontier models push those drops to later rounds.

The chain looks clean for the early interactions but fails in the last stages.
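To put the 25% and 50% figures in per-step terms, here is a back-of-envelope sketch in Python. It assumes the loss is spread evenly and independently across steps, which is not what the paper reports (most of it arrives in a few large drops), so read the result as an average rate and not a prediction. The function names are my own, not the paper's.

```python
def implied_per_step_retention(final_retention: float, steps: int) -> float:
    """Per-step retention r such that r ** steps == final_retention.

    Assumption: loss is uniform across steps (a simplification).
    """
    return final_retention ** (1.0 / steps)


def retained_after(per_step_retention: float, steps: int) -> float:
    """Fraction of content still intact after `steps` steps at a constant rate."""
    return per_step_retention ** steps


if __name__ == "__main__":
    # Corruption figures quoted above for a 20-step workflow.
    for label, corrupted in [("frontier models", 0.25), ("all 19 models", 0.50)]:
        r = implied_per_step_retention(1.0 - corrupted, steps=20)
        print(
            f"{label}: ~{r:.1%} retained per step, "
            f"{retained_after(r, 3):.1%} after 3 steps, "
            f"{retained_after(r, 20):.1%} after 20 steps"
        )
```

Under that assumption, a chain that ends 25% corrupted at step 20 is losing only about 1.4% per step on average, which is why a short demo can look nearly clean.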

I have not seen a single agentic AI demo this year that runs past step 10. The reliability problem starts past step 15.

Kruger and Nissha Metallizing Solutions are the two named pilots for the Accenture, Avanade, and Microsoft agentic factory unveiled at Hannover Messe in April.

The pilots run on a deliberate constraint most vendors do not advertise.

They are downtime and diagnostics products.

The agent spots the bearing degrading, drafts the ticket, orders the spare part, and the chain finishes before any drift can compound.

Eric Ashby, COO at Kruger, projects a 10 to 15 percent reduction in mean time to repair.

Multimillion-dollar savings across their sites, all inside a three-step loop.

The trouble starts when the same architecture gets pushed onto a 20-step batch record workflow it was never designed for.

That is exactly what the FDA cited Purolea Cosmetics Lab for on April 2.

The firm used AI agents to write drug specifications, procedures, and master production records.

The Quality Unit approved the outputs without checking them against 21 CFR 211.22(c), and the warning letter has a dedicated section titled "Inappropriate Use of Artificial Intelligence in Pharmaceutical Manufacturing."

First of its kind from the FDA.

When asked why no process validation had been done, the firm said its AI agent never told them validation was required.

The plants getting this right design the chain around the model's limits. That is the whole job.

Anyone signing a 20-step batch record contract this quarter should expect a warning letter.
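Designing the chain around the model's limits can start with something as simple as capping how many steps run between human reviews. A minimal sketch in Python, assuming you have already picked a maximum unreviewed chain length from your own validation work. The function name and the cap of three, borrowed from the Kruger loop above, are illustrative choices, not a recommendation from the paper or the FDA.

```python
def checkpoint_steps(total_steps: int, max_chain: int) -> list[int]:
    """Step numbers after which a human review is required, so that no
    unreviewed run of agent steps is longer than `max_chain`."""
    if total_steps < 1 or max_chain < 1:
        raise ValueError("total_steps and max_chain must be positive")
    # Review after every `max_chain` steps...
    checkpoints = list(range(max_chain, total_steps, max_chain))
    # ...and always review the final output.
    checkpoints.append(total_steps)
    return checkpoints


if __name__ == "__main__":
    print(checkpoint_steps(20, 3))
```

A 20-step workflow capped at three steps per segment needs seven reviews. If that number looks unaffordable, that is useful information to have before signing.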

Five things you can do this quarter

The problem: Your plant is being pitched a multi-step agentic AI workflow and you need to know whether the chain is too long to be reliable.

What you need: The exact workflow steps the vendor is proposing, what data and records each step touches, what gets reviewed by a human and when, and which records are subject to audit.

The Prompt (copy this):

❝

I'm a [YOUR ROLE] at a [FACILITY TYPE] manufacturing plant. A vendor is proposing an agentic AI workflow with [NUMBER] steps that handles [DESCRIBE PROCESS]. Records touched include [LIST]. Human review happens at [DESCRIBE WHEN]. The records subject to audit are [LIST].

Tell me: At what step count does this workflow become unreliable per the Microsoft Research DELEGATE-52 benchmark? Where should I require human checkpoints to catch silent corruption before it compounds? What audit trail and validator requirements should I put in the contract before signing? What is the strongest case for breaking this into shorter chains?

What you'll see:

A reliability assessment of the proposed workflow length, a checkpoint design for catching silent corruption, a contract requirements list for audit trails and validators, and a recommendation for restructuring into shorter chains.

PHILIPPE LABAN ON X

Microsoft Research: LLMs Corrupt Your Documents When You Delegate

The Table 2 section on which professional domains fall into catastrophic corruption is the clearest argument for putting deterministic validators around every agent output. The Python exception explains why batch records and certificates of analysis are exposed in a way Python scripts are not.
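A deterministic validator of the kind described here can be small. A minimal sketch in Python, assuming the source record and the agent's output are both available as dictionaries and that you can name the fields the agent must never alter. The field names and values are hypothetical, and nothing here comes from the paper's own code.

```python
import hashlib
import json


def changed_fields(source: dict, agent_output: dict, protected: list[str]) -> list[str]:
    """Names of protected fields whose value differs between source and output."""
    return [name for name in protected if source.get(name) != agent_output.get(name)]


def fingerprint(record: dict, protected: list[str]) -> str:
    """SHA-256 over the protected fields only, serialized in a stable key order."""
    subset = {name: record.get(name) for name in protected}
    payload = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate(source: dict, agent_output: dict, protected: list[str]) -> str:
    """Raise if the agent altered a protected field; otherwise return a
    fingerprint that can be written to the audit trail."""
    diff = changed_fields(source, agent_output, protected)
    if diff:
        raise ValueError(f"agent altered protected fields: {diff}")
    return fingerprint(agent_output, protected)


if __name__ == "__main__":
    # Hypothetical batch record: the agent may rewrite the narrative,
    # but not the batch ID or the assay result.
    source = {"batch_id": "B-1042", "assay_pct": 99.2, "narrative": "draft"}
    output = dict(source, narrative="root cause: gasket wear", assay_pct=92.9)
    try:
        validate(source, output, ["batch_id", "assay_pct"])
    except ValueError as err:
        print(err)
```

The check is plain equality on named fields, so it gives the same answer every time it runs, which is the property a validator needs and a model does not have.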

Time to value: 20 minutes

Kruger runs the agentic factory at three steps and projects multimillion-dollar savings. The firm the FDA cited ran the same architecture across 20-plus steps.

Where does the agentic AI workflow at your plant fall between those two?

Hit reply. I read every email.
Leo
