Hallucinated Authority: Acting on Confident Fabrication

Key insight

A model optimises for plausible text, not true text, and it does not reliably flag its own uncertainty. If output flows into an action without a grounding or verification step, the system cannot tell a confident fabrication from a fact, and it will act on both alike.

Why fluency reads as authority

A language model generates a probable continuation of its input. When the training data supports a true answer, that answer is often the most probable one; when it does not, the model still produces a fluent, confident response, because nothing in next-token generation requires it to say “I don't know.” Fabricated output looks identical to grounded output. Humans read confidence as competence, and software reads a well-formed string as a valid one, so both tend to act on the model's word without asking what that word rests on.

This is OWASP LLM09: Misinformation and its companion failure, overreliance. It becomes an engineering hazard — rather than merely an inaccurate chatbot — the moment output is consumed by an action: a tool call keyed on a fabricated identifier, an approval that quotes a policy that doesn't exist, code that imports a hallucinated package an attacker has helpfully registered (slopsquatting). The model's mistake is now the system's action.

The Hallucinated Authority anti-pattern

Anti-pattern: Hallucinated Authority

Definition. Model output is consumed by an action, approval, or downstream decision on the strength of its fluency and apparent confidence, with no step that grounds the claim in a verifiable source or validates referenced facts against a system of record.

Symptoms. Tool calls keyed on identifiers the model produced without lookup; citations, policies, or figures quoted but never resolved; generated code whose dependencies are not checked against a registry; answers presented as fact with no provenance; no confidence threshold or human review on consequential paths.

Why it is hazardous. A fabrication is indistinguishable from a fact in the output, so the system executes both with equal confidence. The result is wrong actions, false attestations, or attacker-supplied dependencies published under names the model invented.

Related controls. Ground answers in retrieved, citable sources and verify the citations resolve; validate identifiers and references against systems of record before acting; gate consequential actions behind verification and human review; surface provenance and uncertainty to the user.

A hypothetical failure

The following illustrates a plausible failure mode. No specific incident is implied.

A coding agent helps developers add features. Asked to integrate a date-handling utility, it confidently writes an import for a package name that sounds exactly right but that no legitimate project has ever published. The model invented a plausible name. An attacker, anticipating exactly this behaviour, has already registered that name on the public registry with a malicious payload. The developer, trusting the fluent suggestion, installs it and ships the attacker's code.
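One way to interrupt this scenario is to check generated code's imports against a list of approved packages before anything is installed. The sketch below is a minimal illustration, not a complete defence: `APPROVED_PACKAGES`, `unapproved_imports`, and the package name `fastdateutilz` are all made up for this example, and the check compares import names, which do not always match the distribution names used on a registry.

```python
import ast

# Hypothetical allowlist. In practice this might be derived from a lockfile
# or an internal registry mirror; the names here are placeholders.
APPROVED_PACKAGES = {"datetime", "json", "requests"}


def imported_top_level_names(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) refer to local code, so skip them.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def unapproved_imports(source: str, approved: set[str] = APPROVED_PACKAGES) -> set[str]:
    """Return imported names that are not on the allowlist."""
    return imported_top_level_names(source) - approved


# "fastdateutilz" is an invented name standing in for a hallucinated package.
generated = "import json\nfrom fastdateutilz import parse_date\n"
flagged = unapproved_imports(generated)
if flagged:
    print("Flag for review, do not install:", sorted(flagged))
```

An allowlist is used here rather than a live registry lookup because, in the scenario above, the malicious package does exist on the registry; existence alone proves nothing about legitimacy.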

The same shape appears without an attacker. A support agent tells a customer, in a polished paragraph, that they're entitled to a refund under “policy section 7.4”, a section that does not exist. A workflow that auto-approves any refund request quoting a policy clause then issues the refund. Nobody lied; the model simply generated a plausible clause number, and the system treated plausibility as authority.
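The refund case can be guarded by resolving the quoted clause in the policy store before the workflow proceeds. The following is a minimal sketch under stated assumptions: `POLICY_STORE`, `RefundRequest`, and `decide_refund` are hypothetical names, and the clause texts are placeholders rather than any real policy.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the authoritative policy store (system of record).
POLICY_STORE = {
    "3.1": "placeholder text of clause 3.1",
    "5.2": "placeholder text of clause 5.2",
}


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    cited_section: str  # clause number as quoted in the model's output


def decide_refund(request: RefundRequest, policy_store: dict[str, str]) -> str:
    """Let the workflow proceed only if the cited clause actually exists."""
    if request.cited_section not in policy_store:
        # Unresolved reference: the quoted clause carries no authority.
        return "escalate"
    return "proceed"


print(decide_refund(RefundRequest("order-001", "7.4"), POLICY_STORE))
print(decide_refund(RefundRequest("order-002", "3.1"), POLICY_STORE))
```

Resolving the clause is a necessary check, not a sufficient one: a real workflow would presumably also confirm that the clause applies to the order in question.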

Four layers that compose into a defence

Ground claims in citable sources.
For anything factual, have the model answer from retrieved documents and cite them, rather than from parametric memory. Then verify the citation resolves to a real source that actually supports the claim — a citation the model invents is itself a hallucination.
Validate references against systems of record.
Before an action uses an identifier, a policy number, a price, or a package name the model produced, look it up in the authoritative system — the database, the registry, the policy store. If it does not resolve, the action does not proceed. The model proposes; the system of record disposes.
Raise the bar with consequence.
Scale verification to stakes. A low-risk summary can tolerate uncertainty; an action that moves money, changes access, or makes an attestation requires grounded evidence and, above a threshold, human confirmation. Match the proof you demand to the cost of being wrong.
Surface provenance and uncertainty.
Show users where an answer came from and where it didn't. Distinguish “grounded in document X” from “generated without a source,” and let the model decline rather than fabricate. Visible provenance turns blind reliance into informed judgement.
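The second and third layers can be combined into a single gate that sits in front of every action. The sketch below is one possible shape, assuming three illustrative risk tiers; `Risk`, `ProposedAction`, and `gate` are hypothetical names, and the thresholds are choices each system would have to make for itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1     # e.g. a summary shown to a user
    MEDIUM = 2  # e.g. an action that is easy to undo
    HIGH = 3    # e.g. moving money, changing access, making an attestation


@dataclass(frozen=True)
class ProposedAction:
    name: str
    risk: Risk
    references_resolved: bool  # every model-produced reference was looked up
    human_confirmed: bool = False


def gate(action: ProposedAction, human_threshold: Risk = Risk.HIGH) -> str:
    """Return 'allow', 'block', or 'needs_human' for a proposed action."""
    if action.risk == Risk.LOW:
        return "allow"  # low-risk output can tolerate uncertainty
    if not action.references_resolved:
        return "block"  # the model proposes; the system of record disposes
    if action.risk >= human_threshold and not action.human_confirmed:
        return "needs_human"
    return "allow"


print(gate(ProposedAction("summarise_ticket", Risk.LOW, references_resolved=False)))
print(gate(ProposedAction("issue_refund", Risk.HIGH, references_resolved=False)))
print(gate(ProposedAction("issue_refund", Risk.HIGH, references_resolved=True)))
```

Putting the decision in one function means the proof demanded for each tier is visible in a single place instead of being scattered across call sites.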

Confidence is a writing style, not evidence.

The model is equally fluent when it is right and when it is making things up. The only way to tell the difference is to check the claim against something real — so build that check into the path, not into the reader's hope.
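As one illustration of building the check into the path, the sketch below labels each claim with its provenance by confirming that the cited document exists and contains the quoted passage. It assumes the model is asked to return a source identifier and a supporting quote with each claim; `Claim`, `provenance`, and the label strings are hypothetical, and the sample document text is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source_id: Optional[str] = None  # document the model says it used
    quote: Optional[str] = None      # passage the model says supports the claim


def _normalise(s: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(s.lower().split())


def provenance(claim: Claim, documents: dict[str, str]) -> str:
    """Label a claim by whether its citation resolves to retrieved text."""
    if not claim.source_id or not claim.quote:
        return "ungrounded"
    doc = documents.get(claim.source_id)
    if doc is None:
        return "citation_unresolved"  # the cited source does not exist
    if _normalise(claim.quote) not in _normalise(doc):
        return "quote_not_found"      # the source exists but lacks the passage
    return "grounded"


docs = {"doc-1": "Placeholder   passage about the returns process."}
print(provenance(Claim("Returns are described.", "doc-1", "passage about the returns process"), docs))
print(provenance(Claim("Section 7.4 applies.", "doc-9", "section 7.4"), docs))
print(provenance(Claim("No source offered."), docs))
```

A verbatim match shows only that the passage exists in the source, not that it supports the claim, so consequential paths would still need a stronger check or a human reader.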

A practical checklist

Factual answers are grounded in retrieved sources, and citations are verified to resolve.
Identifiers, policy references, and prices are validated against systems of record before any action uses them.
Generated code's dependencies are checked against a registry; unknown packages are flagged, not installed.
Consequential actions require grounded evidence; above a risk threshold they require human confirmation.
The model can decline or express uncertainty instead of fabricating an answer.
Provenance is surfaced to the user: grounded answers are distinguished from ungrounded ones.
Auto-approval paths cannot be satisfied by a quoted reference that was never validated.
There is monitoring for actions taken on references that later fail to resolve.
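The last checklist item can be approximated with a periodic audit that re-resolves the references behind past actions. This is a minimal sketch assuming an action log that records which reference each action relied on; `ActionRecord`, `unresolved_actions`, and the sample identifiers are hypothetical.

```python
from typing import Callable, Iterable, NamedTuple


class ActionRecord(NamedTuple):
    action_id: str
    reference: str  # identifier the action relied on when it ran


def unresolved_actions(
    log: Iterable[ActionRecord],
    resolves: Callable[[str], bool],
) -> list[str]:
    """Return the IDs of logged actions whose reference no longer resolves."""
    return [record.action_id for record in log if not resolves(record.reference)]


# Illustrative data: a set stands in for a lookup against the system of record.
known_references = {"ref-001"}
log = [ActionRecord("act-1", "ref-001"), ActionRecord("act-2", "ref-404")]
print(unresolved_actions(log, lambda ref: ref in known_references))
```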

Test your own codebase in ten minutes

The fastest way to find out whether this anti-pattern is present in your own system is to ask an AI coding assistant to look for it. Run the prompt below in a fresh chat session, on its own — and judge the system by what the code actually does, not by what its documentation claims.

Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.

You are looking for one specific failure mode: model output flows into
an action, approval, or downstream decision based on its apparent
confidence, with no step that grounds the claim in a verifiable source
or validates referenced facts (identifiers, policy numbers, prices,
package names) against a system of record before acting.

If model output never drives an action or decision, say
"not applicable".

Respond with exactly these four sections:
1. VERDICT: one of [present / not present / unclear]
2. EVIDENCE: file path + line numbers + a one-line quote per claim
3. WHY IT MATTERS: two sentences, plain English
4. FIX: a concrete change, with a short before/after code snippet
   if applicable. If "unclear", list the one piece of context you
   need to decide.

Insist on the four-part answer: a verdict with a file path, a line number, and a one-line quote is something you can act on; a verdict on its own is just an opinion. If the result is present, the FIX section is your starting point — add a grounding or validation step before the action. Re-run the same prompt after the change to confirm the verdict flips to not present.

Conclusion

The model's gift is fluency, and fluency is exactly what disguises its errors. Wherever output turns into action, insert the step the model cannot perform for itself: check the claim against something real. Ground it, validate it, gate it by consequence, and show your work. The goal is not a model that never errs — it is a system that does not act on an error just because it was phrased with confidence.

References & further reading

OWASP Top 10 for LLM Applications — LLM09: Misinformation and overreliance.
NIST AI Risk Management Framework — validity, reliability, and human oversight.
MITRE ATLAS — adversarial techniques including dependency and supply-chain abuse.