Key insight

Human-in-the-loop is a control only when the human can understand the action and can realistically refuse it. Opaque requests, high volume, an approve-by-default flow, or a powerless “no” turn oversight into theatre — false assurance that launders the agent's choice as human judgement.

Why approval steps decay into rubber stamps

“A human approves it” is the most common answer to “how do you keep the agent safe?” — and the most common place where the answer is hollow. Human oversight is a genuine control only under conditions that are easy to omit. The reviewer must understand what they are approving, which fails when the request is a terse summary or an opaque payload. They must have the attention to evaluate it, which fails when they're approving hundreds of items an hour. They must be able to refuse, which fails when the interface defaults to approve, makes declining slow, or routes a refusal nowhere. Remove any of these and the human becomes a formality.

This matters because regulators and frameworks increasingly require effective human oversight, not just a checkbox. Article 14 of the EU AI Act obliges providers of high-risk systems to enable oversight by people who can understand the system's output and can decide not to use it or to override it. The NIST AI RMF likewise frames human oversight as something that must be meaningful and informed. A rubber stamp satisfies neither, and it is arguably worse than honest automation: it manufactures false assurance and pins accountability on a person who was never equipped to exercise it.

The Rubber-Stamp Approval anti-pattern

Anti-pattern

Rubber-Stamp Approval

Definition. A human-in-the-loop approval is treated as a control while the conditions that make oversight real — comprehension, attention, and a genuine ability to refuse — are absent, so the human approves without meaningfully evaluating the action.

Symptoms. Approval requests that don't show what will actually happen or why; reviewers approving high volumes reflexively; an interface that defaults to approve or makes refusal harder than acceptance; declines that don't change the outcome; approval rates at or near 100% that nobody measures or questions.

Why it is hazardous. The agent's decision proceeds with the appearance of human judgement but none of its substance, producing false assurance, undermining compliance with effective-oversight requirements, and placing blame on a person who could not realistically have said no.

Related controls. Present the full action and rationale in understandable terms; reserve approval for high-stakes actions so attention is preserved; make refusal as easy and effective as approval; and measure approval rates to detect rubber-stamping.

A hypothetical approval that wasn't

The following illustrates a plausible failure mode. No specific incident is implied.

An operations agent proposes infrastructure changes, and a human must approve each one. The approval screen shows a one-line summary: “Apply configuration update to cluster prod-3.” The reviewer sees dozens of these an hour, they are nearly always routine, and the green “Approve” button is the obvious path forward; declining means opening a ticket and explaining why.

One day the agent, misreading its task, proposes a change that opens a network rule far wider than intended. The summary looks like every other update. The reviewer, conditioned by a hundred safe approvals, clicks approve. The change ships. In the post-incident review, the “human approval” control is revealed as a reflex: the reviewer never saw the actual diff, had no time to evaluate it, and faced friction for saying no. Had the screen shown the full configuration change and its security impact, flagged this one as high-risk, and made “reject” a first-class action, the human could have caught it — which is the entire point of the step.
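As a minimal sketch of what the better screen in this scenario might carry, the Python below builds an approval request that includes the full diff, the agent's rationale, and a risk flag. The `ApprovalRequest` class, its field names, and the example firewall rule are all hypothetical, invented for illustration; how the risk level gets assigned is outside the sketch.

```python
import difflib
from dataclasses import dataclass


@dataclass
class ApprovalRequest:
    """Hypothetical approval payload: everything the reviewer needs to judge."""
    summary: str
    target: str
    current_config: str
    proposed_config: str
    rationale: str
    risk: str  # "low" or "high"; assigning it is outside this sketch

    def render(self) -> str:
        """Build the text shown to the reviewer, including the full diff."""
        diff = difflib.unified_diff(
            self.current_config.splitlines(),
            self.proposed_config.splitlines(),
            fromfile="current",
            tofile="proposed",
            lineterm="",
        )
        return "\n".join([
            f"[{self.risk.upper()} RISK] {self.summary}",
            f"Target: {self.target}",
            f"Why: {self.rationale}",
            "Change:",
            *diff,
        ])


# Illustrative data only: the same one-line summary, now with the real change.
request = ApprovalRequest(
    summary="Apply configuration update to cluster prod-3",
    target="prod-3",
    current_config="allow tcp 443 from 10.0.0.0/8",
    proposed_config="allow tcp 443 from 0.0.0.0/0",
    rationale="Agent believes a new client needs access to the service",
    risk="high",
)
print(request.render())
```

With the diff on screen, the widened rule appears as an explicit removed line and added line rather than hiding behind a summary that looks like every other update.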

Four layers that compose into a defence

  1. Show the whole action, in understandable terms.

    The approval surface presents exactly what will happen — the concrete change, the affected systems, the data involved — and why the agent proposes it, in language the reviewer can evaluate. An approval the human cannot understand is not an approval.

  2. Reserve approval for actions that warrant it.

    Gate by stakes, not uniformly. Routine, low-risk actions proceed automatically; consequential ones require review. Asking a human to approve everything guarantees they evaluate nothing, so spend their attention where the cost of error is real.

  3. Make refusal as easy and effective as approval.

    “Reject” is a first-class action with no more friction than “approve,” the interface does not default to approve, and a refusal genuinely stops or changes the outcome. Oversight without a workable veto is not oversight.

  4. Measure the approvals.

    Track approval rates, time-to-decision, and override frequency. A 100% approval rate or sub-second decisions on complex actions are signals that the control has decayed into a rubber stamp. Measurement is how you catch the decay before an incident does — the unsupervised perimeter lesson applied to your own reviewers.
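The measurement layer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended implementation: the `Decision` record, the `rubber_stamp_signals` function, and both thresholds are hypothetical, and the sketch covers only approval rate and time-to-decision, not override frequency.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Decision:
    """One logged reviewer decision (hypothetical record shape)."""
    approved: bool
    seconds_to_decide: float


def rubber_stamp_signals(decisions, max_approval_rate=0.98, min_median_seconds=5.0):
    """Return warnings if the approval log looks reflexive.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not decisions:
        return []
    warnings = []
    approval_rate = sum(d.approved for d in decisions) / len(decisions)
    median_seconds = median(d.seconds_to_decide for d in decisions)
    if approval_rate >= max_approval_rate:
        warnings.append(
            f"approval rate {approval_rate:.0%} is at or above {max_approval_rate:.0%}"
        )
    if median_seconds < min_median_seconds:
        warnings.append(
            f"median decision time {median_seconds:.1f}s is below {min_median_seconds:.1f}s"
        )
    return warnings


# Fifty approvals, each decided in under a second: both signals fire.
reflexive_log = [Decision(approved=True, seconds_to_decide=0.8)] * 50
for warning in rubber_stamp_signals(reflexive_log):
    print(warning)
```

A log with real refusals and considered decision times returns an empty list. The useful thresholds depend on how complex the reviewed actions are, so they would need tuning against your own data.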

A control no one can fail is not a control.

If the reviewer cannot realistically reject an action — because they can't see it, can't keep up, or can't say no without friction — then the approval adds no safety, only the appearance of it. Design the step so “no” is a real, reachable outcome.
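One way to make “no” a real outcome is to fail closed: the action runs only on an explicit approval, and a rejection or a missing answer both stop it. The sketch below combines that with stakes-based gating. The names `Verdict`, `run_action`, and `ask_reviewer` are hypothetical, and deciding whether an action counts as high-stakes is assumed to happen elsewhere.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"


def run_action(
    action: Callable[[], str],
    high_stakes: bool,
    ask_reviewer: Callable[[], Optional[Verdict]],
) -> str:
    """Execute `action` only if policy allows it; fail closed otherwise."""
    if not high_stakes:
        # Routine work proceeds without consuming reviewer attention.
        return action()
    # None models a timeout or a dismissed dialog.
    verdict = ask_reviewer()
    if verdict is Verdict.APPROVE:
        return action()
    # A rejection and a missing answer both stop the action.
    return "blocked"


def apply_change() -> str:
    return "applied"


print(run_action(apply_change, high_stakes=True, ask_reviewer=lambda: None))
print(run_action(apply_change, high_stakes=True, ask_reviewer=lambda: Verdict.APPROVE))
```

The design choice here is that silence never counts as consent: there is no code path on which a high-stakes action runs without an explicit `Verdict.APPROVE`.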

A practical checklist

Test your own codebase in ten minutes

The fastest way to find out whether this anti-pattern is present in your own system is to ask an AI coding assistant to look for it. Run the prompt below in a fresh chat session, on its own — and judge the system by what the code actually does, not by what its documentation claims.

Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.

You are looking for one specific failure mode: a human-in-the-loop
approval step is treated as a control, but the approval request does
not show the reviewer what will actually happen and why, OR every
action requires approval (exhausting attention), OR the flow defaults
to approve / makes refusal harder than acceptance, OR a refusal does
not actually stop the action. In short, approval the human cannot
meaningfully exercise.

If there is no human approval step, say "not applicable".

Respond with exactly these four sections:
1. VERDICT: one of [present / not present / unclear / not applicable]
2. EVIDENCE: file path + line numbers + a one-line quote per claim
3. WHY IT MATTERS: two sentences, plain English
4. FIX: a concrete change, with a short before/after code snippet
   if applicable. If "unclear", list the one piece of context you
   need to decide.

Insist on the four-part answer: a verdict with a file path, a line number, and a one-line quote is something you can act on; a verdict on its own is just an opinion. If the verdict is “present,” the FIX section is your starting point: surface the full action, gate by stakes, and make refusal real. Re-run the same prompt after the change to confirm the verdict flips to “not present.”

Conclusion

Human oversight is one of the strongest controls available for an agent — and one of the easiest to fake. The difference is whether the human can see the action, has the attention to weigh it, and can actually refuse. Build the approval step for genuine judgement: full context, stakes-based gating, a frictionless “no,” and measurement to catch decay. Anything less is not a control; it is a signature on a decision someone else already made.

References & further reading

  1. EU AI Act, Article 14 — human oversight of high-risk AI systems.
  2. NIST AI Risk Management Framework — meaningful, informed human oversight.
  3. ReAct Discipline for Secure AI Agents — structuring agent reasoning so a human can review it.