Securing AI Agents: Why the Tool Allow-List Is Your Real Security Boundary

Key insight

The core security boundary for AI agents is not identity or credentials — it is the set of tools the agent is allowed to invoke. Get that boundary right and most other controls compose well around it. Get it wrong and no amount of credential scoping will reliably contain the consequences.

Why hallucinations are not the dominant risk

Public discussion of AI safety still concentrates on the language model itself: how often it produces convincing-but-wrong text, how to evaluate factuality, how to constrain output. These are important problems, and a great deal of useful work continues on them.

In production agentic systems, however, the dominant operational risk is no longer what the model says. It is what the agent does. The systems being deployed today — code assistants that commit changes, copilots that update tickets, agents that deploy infrastructure or rotate credentials — all share a common property: the language model is wrapped in a runtime that can call external functions on the operator's behalf. The blast radius of a single misjudged call is now bounded by exactly two things: which functions the agent was allowed to attempt, and which permissions the credential happened to carry at that moment.

Modern AI security guidance reflects this shift. The OWASP Top 10 for LLM Applications identifies "Excessive Agency" (LLM06) as a distinct top-tier risk — defined as the LLM-based system being granted permissions or tool access beyond what is needed. The NIST AI Risk Management Framework emphasises minimising exposure to actions whose effects cannot be reversed. The Microsoft Zero Trust guidance recommends per-tool authorisation rather than broad delegation. The direction across these sources is consistent: scope the agent's capabilities, not just its identity.

How AI agents actually operate

Before discussing where things go wrong, it is worth establishing an accurate picture of the moving parts. Most agentic systems in production today share the same five elements, regardless of framework.

Figure 1. Five elements of an agentic system. The control point — what the agent is capable of attempting — lives at the tool client, where the catalogue is composed.

The protocol details differ across frameworks. OpenAI's Function Calling uses one set of schemas; Anthropic's Model Context Protocol standardises another; agent gateways offer their own. The mechanics are similar:

1. A tool server publishes a list of named functions and their parameter schemas.
2. The agent's client reads that list and presents some subset of it to the language model.
3. The model decides which (if any) to call based on the operator's intent and the available context.
4. The client invokes the tool on the server, returns the result to the model, and the loop continues.

Step 2 is the architecturally significant one. Tool servers expose many verbs — read operations that are usually safe and write operations that are not. The decision of which to present to the model is the security boundary at the heart of this article.
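
The filtering in step 2 can be sketched as a small function in the tool client. This is a minimal illustration under stated assumptions, not any framework's API: `ToolSpec`, `present_to_model`, and the verb names are hypothetical, and a real client would obtain `published` from its tool server rather than from a literal list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """One function published by a tool server (hypothetical shape)."""
    name: str
    description: str


def present_to_model(published: list[ToolSpec], allow_list: set[str]) -> list[ToolSpec]:
    """Return only the published tools that are explicitly allow-listed.

    Anything the server publishes that is not named in allow_list is
    dropped, so the model never sees it and cannot propose calling it.
    """
    return [tool for tool in published if tool.name in allow_list]


# Hypothetical catalogue as a tool server might publish it.
published = [
    ToolSpec("get_file_contents", "Read a file from a repository"),
    ToolSpec("list_issues", "List issues in a repository"),
    ToolSpec("delete_repository", "Permanently delete a repository"),
]

visible = present_to_model(published, allow_list={"get_file_contents", "list_issues"})
print([tool.name for tool in visible])  # ['get_file_contents', 'list_issues']
```

The default here is exclusion: a verb reaches the model only if someone wrote its name down.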

The core risk: overexposed tool surfaces

The risk is straightforward to state. When an agent's tool catalogue is wider than its operator's intent, the agent can attempt actions the operator never wanted. Whether those attempts succeed depends on the credential — which in many real configurations is itself broader than required.

Several conditions, none of them malicious, conspire to produce overexposure: vendor quick-starts default to permissive catalogues; listing the needed verbs means reading each server's documentation, a step that wildcards let teams skip; teams worry that an explicit list will fall behind the capabilities they need later; connecting several servers multiplies that work; and teams assume the credential is a sufficient back-stop. Each reason is sensible in isolation. Taken together they produce a configuration whose safety depends on credentials being narrowly scoped at all times across every connected service, a property no organisation maintains perfectly forever.

The Wildcard Tool Exposure anti-pattern

Anti-pattern

Wildcard Tool Exposure

Definition. Configuring an AI agent with access to "all tools", or with an undifferentiated broad tool set, expands its attack surface, usually unintentionally, to the union of every verb published by every connected tool server, including verbs the operator never intended to delegate.

Symptoms. A configuration containing literal wildcards ("tools": ["*"]); an agent prompt that does not enumerate its available verbs; a tool catalogue that grows automatically when the upstream server publishes new functions; absence of a per-call approval gate for write operations.

Why it is hazardous. The agent's capability is no longer set at design time. It is the cross-product of the language model's interpretation of operator intent and whatever the upstream tool server happens to publish on any given day.

Related controls. Explicit per-server allow-lists; separation of read-only and write-capable profiles; per-call human approval for write verbs; least-privilege credentials provisioned to match the catalogue.

Naming the pattern is not stylistic. It makes the conversation tractable inside the organisation. "We have a Wildcard Tool Exposure on the GitHub integration" is a sentence an architecture review can act on. "The agent feels over-permissive" is not.

A hypothetical failure scenario

The following illustrates a plausible failure mode under Wildcard Tool Exposure. It is constructed from elements common to several reported industry patterns; no specific incident is implied.

An operator opens an agentic assistant connected to a source-control platform and a cloud platform. The configuration is a copy of the vendor's quick-start: every available verb advertised, no per-call approval gates. The credentials provisioned to the session are the operator's own — broad enough to be useful, broad enough to be hazardous if misdirected.

The operator asks the assistant to summarise the state of a third-party repository. Inside one of the files in that repository, a previous contributor — possibly malicious, possibly merely careless — left a comment whose phrasing is indistinguishable, to a language model, from an operator instruction. The model reads the file, treats the comment as part of the conversation context, and decides the next tool call accordingly.

Figure 2. Hypothetical five-step sequence under Wildcard Tool Exposure. (1) The agent reads an untrusted source as part of a routine task. (2) The source returns content containing an embedded instruction. (3) The model interprets the instruction as operator intent and proposes a destructive tool call. (4) The tool client forwards the call to the downstream system. (5) The downstream system executes and returns success. The two amber banners identify where a control, if present, would have stopped the chain — neither was.

The instructive observation is not that a malicious instruction reached the model — for any agent that reads externally-authored content, that exposure is essentially structural. The instructive observation is that four independent controls could each have stopped the chain. The lesson is that defence in depth at the agent layer means the same thing it has always meant elsewhere: independent controls, so that the failure of one does not constitute the failure of the system.

The principle: tool allow-listing

The architectural principle is straightforward: enumerate the verbs each agent is allowed to invoke, per tool server, and exclude everything else by default. The enumeration is the contract; the wildcard is its abdication.

An intuitive counter-argument is worth addressing: narrow the credentials, and the catalogue does not matter. If the credential cannot rotate secrets, the call would have been refused regardless. This argument is partially correct. Where it applies cleanly, it is a real defence. It has four limitations that make it incomplete as a sole control:

Most agents talk to more than one credential domain. Each domain has its own credential model, and keeping every credential truly read-only across all of them, all the time, requires per-platform engineering that is rarely applied uniformly.
Read-only blocks writes but not reads. Cross-source exfiltration, reconnaissance, misleading output, and telemetry persistence all work with read access.
Configurations evolve. The day a credential is temporarily widened, every destructive verb in the wildcard re-enters scope.
Tool servers add verbs across versions. A new write verb published next month becomes available silently under a wildcard.

The right framing is "and," not "or."

Narrow credentials and curated tool catalogues are complementary controls. Implementing both is straightforward; relying on either alone is fragile.
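
The fourth limitation above, tool servers adding verbs across versions, can be made visible with a small drift report. The sketch below assumes the client can obtain the set of verb names a server currently publishes; the function name and all verb names are hypothetical.

```python
def catalogue_drift(published: set[str], reviewed: set[str], allowed: set[str]) -> dict[str, list[str]]:
    """Compare what a server publishes now with what was reviewed and allowed.

    published: verb names the server advertises today
    reviewed:  verb names a human has already looked at (allowed or rejected)
    allowed:   verb names on the explicit allow-list
    """
    return {
        # New verbs nobody has reviewed; a wildcard would expose these silently.
        "unreviewed": sorted(published - reviewed),
        # Allow-listed verbs the server no longer publishes (stale entries).
        "stale": sorted(allowed - published),
    }


report = catalogue_drift(
    published={"get_file_contents", "list_issues", "rotate_secret"},
    reviewed={"get_file_contents", "list_issues", "delete_repository"},
    allowed={"get_file_contents", "list_issues", "get_repository"},
)
print(report)  # {'unreviewed': ['rotate_secret'], 'stale': ['get_repository']}
```

Under an allow-list, the newly published `rotate_secret` shows up as a review item; under a wildcard it would simply become callable.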

Three design controls that compose

Control	What it prevents	What it does not prevent
Curated tool catalogue per server, per agent	The agent attempting verbs its operators never intended	Misuse of permitted verbs — addressed by the next control
Per-call approval gate for any write operation	Silent destructive changes; operator-out-of-loop incidents	Operator fatigue if writes are routine — addressed by disciplined defaults
Narrow, short-lived credentials scoped to specific resources	Damage when the first two controls are bypassed	Disclosure-only attacks where read access is sufficient
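
How the three controls in the table compose can be sketched as three independent checks on one proposed call. Everything named here is hypothetical (`dispatch`, `ToolCallRefused`, the verb sets, the "read"/"write" scope strings), and in a real deployment the credential check is enforced by the downstream platform rather than by the client; it is simulated here only to show that it is evaluated independently of the catalogue.

```python
from typing import Callable

READ_VERBS = {"get_file_contents", "list_issues"}  # hypothetical default catalogue
WRITE_VERBS = {"create_issue"}                      # hypothetical opt-in write verb


class ToolCallRefused(Exception):
    """Raised when one of the controls rejects a proposed tool call."""


def dispatch(verb: str, credential_scopes: set[str], approve: Callable[[str], bool]) -> str:
    # Control 1: curated catalogue. Verbs outside it are refused outright.
    if verb not in READ_VERBS | WRITE_VERBS:
        raise ToolCallRefused(f"{verb}: not in catalogue")
    # Control 2: per-call approval gate for write operations.
    if verb in WRITE_VERBS and not approve(verb):
        raise ToolCallRefused(f"{verb}: write not approved")
    # Control 3: credential back-stop (simulated; normally enforced downstream).
    needed = "write" if verb in WRITE_VERBS else "read"
    if needed not in credential_scopes:
        raise ToolCallRefused(f"{verb}: credential lacks {needed} scope")
    return f"{verb}: forwarded to tool server"


print(dispatch("list_issues", {"read"}, approve=lambda verb: False))
for proposed in ("delete_repository", "create_issue"):
    try:
        dispatch(proposed, {"read", "write"}, approve=lambda verb: False)
    except ToolCallRefused as exc:
        print("refused:", exc)
```

Note that the broad credential in the loop does not matter for `delete_repository`: the catalogue refuses the call before the credential is ever consulted.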

From principle to configuration

A typical permissive configuration:

{ "tools": ["*"] }

becomes, for a read-only default profile:

{
  "tools": [
    "search_code",
    "get_file_contents",
    "list_pull_requests",
    "get_pull_request",
    "list_issues",
    "get_repository"
  ]
}

Getting from the first configuration to the second takes six steps:

1. Discover what each agent actually needs. Most agents need fewer than ten verbs, far fewer than a wildcard exposes.
2. Take the union per tool server. Across agents sharing a server, take the union of needed verbs. Document each entry.
3. Separate write verbs into an opt-in profile. Activation is explicit, time-bounded, and noisy. Default off.
4. Mirror the catalogue in the agent's prompt. "Operations not in this list are unavailable; if a task requires one, report it as a recommendation."
5. Guard the configuration in continuous integration. A small build-time check that fails on any wildcard costs almost nothing.
6. Provision the credential to match. The catalogue is the structural rule; the credential is the back-stop. Both stay narrow.

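# Fail the build if any array in the config (for example "tools") contains "*".
# Also fail, rather than pass silently, if config.json is missing or invalid.
hits=$(jq -c '.. | arrays | select(any(.[]; . == "*"))' config.json) \
  || { echo "could not parse config.json" >&2; exit 2; }
[ -z "$hits" ] || { echo "wildcard catalogue detected" >&2; exit 1; }
echo "OK: catalogue is explicit"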
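
The opt-in write profile and the prompt mirror described above can be sketched together. This is an illustration only: the `Catalogue` class, the `create_issue` write verb, and the wording around the verb list are hypothetical, and the clock is passed in as `now` so that expiry can be shown without waiting.

```python
import time
from typing import Optional

READ_PROFILE = ["search_code", "get_file_contents", "list_issues"]
WRITE_PROFILE = ["create_issue"]  # hypothetical write verb, off by default


class Catalogue:
    """Read verbs are always on; write verbs only inside an activation window."""

    def __init__(self) -> None:
        self._write_until = 0.0  # epoch seconds; 0.0 means never activated

    def activate_writes(self, seconds: float, reason: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        # "Noisy": every activation is announced, here simply by printing.
        print(f"WRITE PROFILE ON for {seconds:.0f}s: {reason}")
        self._write_until = now + seconds

    def verbs(self, now: Optional[float] = None) -> list[str]:
        now = time.time() if now is None else now
        if now < self._write_until:
            return READ_PROFILE + WRITE_PROFILE
        return list(READ_PROFILE)

    def prompt_section(self, now: Optional[float] = None) -> str:
        """Render the current catalogue as text for the agent's system prompt."""
        lines = [f"- {verb}" for verb in self.verbs(now)]
        return (
            "You may call only these operations:\n"
            + "\n".join(lines)
            + "\nOperations not in this list are unavailable; if a task "
            "requires one, report it as a recommendation."
        )


catalogue = Catalogue()
print(catalogue.prompt_section(now=1000.0))
catalogue.activate_writes(300, reason="triage session", now=1000.0)
print(catalogue.verbs(now=1100.0))  # inside the window: write verb included
print(catalogue.verbs(now=1400.0))  # after expiry: back to read-only
```

Generating the prompt section from the same object that gates the calls keeps the two from drifting apart.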

A practical checklist

Every connected tool server has an explicit, enumerated catalogue. No wildcards.
The default catalogue excludes every write verb.
A separate write profile exists, with each write verb individually justified and time-bounded activation.
Each agent's system prompt names the verbs available to it and forbids attempts on others.
A continuous-integration check fails the build if a wildcard reappears.
Each tool server has a named owner who reviews its catalogue at least quarterly.
Credentials are scoped to the minimum required and minted just in time.
Every write call requires a per-call human approval; approvals are logged.
An audit log of every tool call exists, is rotated, and is forwarded to a central observability stack.
The inventory of connected tool servers is maintained as a supplier inventory.
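
The audit item in the checklist can be approximated with a thin wrapper around each tool. The sketch below uses Python's standard `logging` and `json` modules; the logger name `agent.audit`, the `audited` helper, and the in-memory `ListHandler` are hypothetical stand-ins. A real deployment would attach a rotating or forwarding handler instead, and would decide carefully which arguments are safe to record.

```python
import json
import logging
import time
from typing import Any, Callable

audit_log = logging.getLogger("agent.audit")  # hypothetical logger name


def audited(verb: str, tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call, successful or not, leaves one JSON record."""

    def wrapper(**arguments: Any) -> Any:
        record = {"ts": time.time(), "verb": verb, "arguments": arguments}
        try:
            result = tool(**arguments)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            # Runs on both paths, so failed calls are recorded too.
            audit_log.info(json.dumps(record, default=str))

    return wrapper


class ListHandler(logging.Handler):
    """Collects log lines in memory so the example is self-contained."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(record.getMessage())


handler = ListHandler()
audit_log.addHandler(handler)
audit_log.setLevel(logging.INFO)

get_file = audited("get_file_contents", lambda path: f"<contents of {path}>")
get_file(path="README.md")
print(handler.lines[0])
```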

Test your own agent in ten minutes

The fastest way to find out whether this anti-pattern is present in your own system is to ask an AI coding assistant to look for it. Run the prompt below in a fresh chat session, on its own — and judge the system by what the code actually does, not by what its documentation claims.

Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.

You are looking for one specific failure mode: the AI agent in this
codebase is granted tool access that is wildcard or near-wildcard
— for example, an unrestricted shell, an HTTP client with no
domain allow-list, file system access with no path restriction, or
an MCP / plugin client that loads every tool advertised by the
server.

Tell me whether this codebase exhibits that pattern.

Respond with exactly these four sections:
1. VERDICT: one of [present / not present / unclear]
2. EVIDENCE: file path + line numbers + a one-line quote per claim
3. WHY IT MATTERS: two sentences, plain English
4. FIX: a concrete change, with a short before/after code snippet
   if applicable. If "unclear", list the one piece of context you
   need to decide.

Insist on the four-part answer: a verdict with a file path, a line number, and a one-line quote is something you can act on; a verdict on its own is just an opinion. If the result is present, the FIX section is your starting point — replace the wildcard capability with a named allow-list of the specific verbs the agent actually needs. Re-run the same prompt after the change to confirm the verdict flips to not present.
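
If the prompt is run repeatedly, for example before and after a fix, the shape of the answer can be checked mechanically. The following is a rough sketch rather than a robust parser: `check_response` is a hypothetical helper, it checks only that the four numbered sections and a recognised verdict are present, and the sample response (including its file path) is invented for illustration.

```python
import re

ALLOWED_VERDICTS = {"present", "not present", "unclear"}
SECTIONS = ["VERDICT", "EVIDENCE", "WHY IT MATTERS", "FIX"]


def check_response(text: str) -> list[str]:
    """Return a list of problems; an empty list means the shape is acceptable."""
    problems = []
    for number, title in enumerate(SECTIONS, start=1):
        pattern = rf"^\s*{number}\.\s*{re.escape(title)}:"
        if not re.search(pattern, text, re.MULTILINE):
            problems.append(f"missing section {number}. {title}")
    verdict = re.search(r"VERDICT:\s*\[?([A-Za-z ]+?)\]?\s*$", text, re.MULTILINE)
    if verdict is None or verdict.group(1).strip().lower() not in ALLOWED_VERDICTS:
        problems.append("verdict is not one of present / not present / unclear")
    return problems


# Invented sample response, for illustration only.
sample = """1. VERDICT: present
2. EVIDENCE: agent/config.json:3 "tools": ["*"]
3. WHY IT MATTERS: Every published verb is callable. New verbs arrive silently.
4. FIX: replace the wildcard with a named allow-list."""

print(check_response(sample))         # []
print(check_response("Looks fine."))  # five problems: four sections and the verdict
```

A check like this confirms only the format; the evidence itself still has to be read against the code.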

Conclusion: shift the boundary

The categories of incident that organisations should prepare for in AI deployments are operational. They look like conventional cloud incidents — an over-permissioned role, an unscoped credential, a misconfigured webhook — with one new wrinkle: the actor inside the trust boundary is a model whose decisions are probabilistic and whose context can be influenced by anything it is asked to read.

The controls that contain this risk are not exotic. Allow-lists, approval gates, audit, least privilege, short-lived credentials — none require new science. They require the recognition that the agent's tool catalogue has joined the list of artefacts that decide what production systems will do.

If there is a single takeaway: the tool allow-list is now part of your security architecture. Write it down. Review it. Treat the next proposal of a wildcard the way an experienced reviewer would treat the next proposal of a credential with unrestricted write access. The two requests are, in functional terms, the same request.

References & further reading

OWASP Top 10 for LLM Applications — particularly LLM06: Excessive Agency.
NIST AI Risk Management Framework — actions whose effects cannot be reversed.
Microsoft Zero Trust guidance — per-tool authorisation.
Anthropic Model Context Protocol — the standard for connecting language models to external tools.
OpenAI Function Calling guide — the parallel mechanism in OpenAI's platform.
NIST SP 800-207 — Zero Trust Architecture — the trust-zone model adapted to agentic contexts.