Key insight

An agent is a data connector, and a connector with broad read access and broad outbound reach will eventually move data across a line it shouldn't. Put the decision in the middle: classify and tag data, enforce egress policy at the tool boundary, pin processing to permitted regions, and log every crossing.

Why agents move data across boundaries

The thing that makes an agent useful — it reads from your systems, reasons over what it finds, and acts through external tools — is exactly what makes it a powerful data-movement engine. Each tool call is a potential egress: the agent can place data it read from a regulated store into a prompt sent to a model hosted elsewhere, forward it to a third-party API, or write it to a destination in another region or another tenant. None of these are exotic; they are the agent's ordinary repertoire. The boundaries that data is supposed to respect — jurisdiction, contract, purpose, tenant — are invisible to the model unless something enforces them.

These boundaries are legal and contractual obligations, not preferences. The GDPR restricts transfers of personal data to third countries unless specific safeguards apply, binds processing to the purposes for which the data was collected, and places obligations on processors and sub-processors; contracts and data-processing agreements often add residency commitments on top. The EU AI Act adds data-governance duties, most explicitly for high-risk AI systems. When an agent crosses one of these lines, the consequence is a regulatory finding, a breached contract, or a broken data-processing agreement: harms that don't show up as an exception in your logs.

The Boundaryless Data Egress anti-pattern

Anti-pattern

Boundaryless Data Egress

Definition. An agent has broad read access and broad outbound reach with no control that decides, per piece of data, whether it may cross a jurisdictional, contractual, purpose, or tenant boundary — so data leaves wherever the model's chosen tool call sends it.

Symptoms. Regulated data placed into prompts sent to models in unknown or non-permitted regions; one tenant's data reachable by tools scoped to another; data collected for one purpose used for another; no classification or residency tagging on data the agent handles; no egress policy at the tool boundary; egress unlogged.

Why it is hazardous. The agent composes its own actions, so a boundary crossing happens with no human present to catch it, and the result is a regulatory violation, a contract breach, or a residency failure that surfaces as a finding rather than an error.

Related controls. Data classification with residency, purpose, and tenant tags; egress policy enforced at the tool boundary; processing and sub-processors pinned to permitted regions; minimisation before data leaves; and logging of every egress for audit.
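
In code, the anti-pattern tends to look unremarkable, which is part of why it goes unnoticed. The sketch below is a hypothetical dispatcher written only to illustrate the definition: the tool names, the `TOOLS` registry, and `dispatch` are invented, and the "tools" are plain functions standing in for outbound calls.

```python
from typing import Callable, Dict


# Hypothetical tools: plain functions standing in for outbound network calls.
def in_region_summariser(record: dict) -> str:
    return f"summarised {sorted(record)} in-region"


def out_of_region_enricher(record: dict) -> str:
    return f"enriched {sorted(record)} out-of-region"


TOOLS: Dict[str, Callable[[dict], str]] = {
    "summarise": in_region_summariser,
    "enrich": out_of_region_enricher,
}


def dispatch(tool_name: str, record: dict) -> str:
    # The anti-pattern: the record carries no residency, purpose, or tenant tag,
    # the tool's region and tenant scope are never consulted, and nothing is
    # logged. Whatever tool the model names receives the full record.
    return TOOLS[tool_name](record)


record = {"name": "A. Customer", "email": "a@example.com"}
print(dispatch("enrich", record))
```

The call succeeds and returns a normal result. Nothing in the code path distinguishes the out-of-region tool from the in-region one, so there is nothing for monitoring to flag.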

A hypothetical crossing

The following illustrates a plausible failure mode. No specific incident is implied.

A customer-service agent serves users in a region with strict data-residency rules: personal data must be processed within that region. The agent is wired to a general-purpose model endpoint and a set of enrichment tools that were chosen for capability and price; some of them run in other regions. When the agent handles a request, it places the customer's personal details into a prompt and calls whichever tool seems most helpful, including one hosted abroad.

Nothing in the path asks whether this data is allowed to leave the region. The agent does its job; the data crosses a border it is legally required to stay within. The breach is invisible operationally (no error, no failed request) and surfaces only later, in an audit or a complaint, as a residency violation. Had the data been classified and residency-tagged, and had the tool boundary enforced an egress policy that blocked out-of-region processing of that tag, the agent would have been confined to in-region tools and the crossing would never have happened.
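
The confinement described in this scenario can be sketched as a filter over the tool registry. This is a minimal illustration under stated assumptions, not a complete control: `Tool`, `tools_for`, the region labels, and the tool names are all hypothetical, and the sketch assumes that untagged data should be offered no tools at all (fail closed).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    region: str  # where the tool processes data (hypothetical label)


def tools_for(residency: Optional[str], registry: List[Tool]) -> List[Tool]:
    """Return the tools the agent may be offered for data with this residency tag.

    Untagged data (residency is None) gets no tools: a missing tag means
    nobody has decided where the data may go, so the sketch fails closed.
    """
    if residency is None:
        return []
    return [tool for tool in registry if tool.region == residency]


REGISTRY = [
    Tool("crm_lookup", region="eu"),
    Tool("address_enrichment", region="us"),
    Tool("summarise", region="eu"),
]

allowed = tools_for("eu", REGISTRY)
print([tool.name for tool in allowed])
```

Filtering before the model sees the tool list means the out-of-region tool is never proposed in the first place. It complements a check at call time rather than replacing it.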

Four layers that compose into a defence

  1. Classify and tag the data.

    Know what the agent handles. Classify data by sensitivity and attach the metadata that boundaries depend on — residency region, permitted purpose, owning tenant. You cannot enforce a boundary on data you have not labelled; classification is the input every other control reads.

  2. Enforce egress policy at the tool boundary.

    Put the decision where data actually leaves: the tool layer. Before any outbound call, a policy checks the data's tags against the destination — its region, its tenant scope, the permitted purpose — and blocks calls that would cross a forbidden line. The model proposes the action; the boundary decides if it's allowed.

  3. Pin processing to permitted regions.

    Constrain the model endpoints, tools, and sub-processors the agent may use to those in permitted regions for the data at hand, and confirm your providers' residency and sub-processor commitments. Residency is only real if the entire processing chain honours it.

  4. Minimise and log every egress.

    Send the least data the task requires — redact or tokenise what the destination doesn't need — and record every egress: what data, what classification, to which destination and region. Logging makes crossings auditable, and minimisation shrinks the harm of any that slip through. This is the least-privilege instinct applied to data flow.
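
Layers 1 and 2 can be sketched together: tags travel with the data, and a check at the tool boundary compares them with the destination before any call goes out. Everything named here (`Tags`, `Destination`, `egress_violation`, the field names and example values) is an assumption made for illustration. A real policy would likely be richer, for example allowing several permitted regions per tag.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Tags:
    residency: str             # region the data must stay in, e.g. "eu"
    purposes: FrozenSet[str]   # purposes the data was collected for
    tenant: str                # owning tenant


@dataclass(frozen=True)
class Destination:
    name: str
    region: str        # where the destination processes data
    tenant_scope: str  # tenant the destination is scoped to
    purpose: str       # purpose this call serves


def egress_violation(tags: Optional[Tags], dest: Destination) -> Optional[str]:
    """Return the reason a call must be blocked, or None if it may proceed."""
    if tags is None:
        # Unlabelled data cannot be checked against any boundary, so block it.
        return "data is untagged"
    if dest.region != tags.residency:
        return f"region {dest.region} is outside {tags.residency}"
    if dest.tenant_scope != tags.tenant:
        return f"destination is scoped to tenant {dest.tenant_scope}"
    if dest.purpose not in tags.purposes:
        return f"purpose {dest.purpose} is not permitted"
    return None


tags = Tags(residency="eu", purposes=frozenset({"support"}), tenant="acme")
in_region = Destination("summarise", region="eu", tenant_scope="acme", purpose="support")
abroad = Destination("address_enrichment", region="us", tenant_scope="acme", purpose="support")

print(egress_violation(tags, in_region))
print(egress_violation(tags, abroad))
```

Returning a reason rather than a bare boolean gives the caller something to log and gives the agent something to report when a proposed action is refused.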

Every tool call is a potential export.

Putting data into a prompt sent elsewhere is a data transfer, even though it doesn't feel like one. Treat the model endpoint and every external tool as an egress point that an explicit policy must clear.
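
One way to treat every outbound call as an egress point is to route all of them through a single wrapper. The sketch below covers only the minimise-and-log step, using Python's standard `logging` and `json` modules. The function names, the log field names, and the choice to log field names rather than values are assumptions made for illustration, and `fake_send` stands in for a real outbound call.

```python
import json
import logging
from typing import Any, Callable, Set

logger = logging.getLogger("egress")


def minimise(payload: dict, needed: Set[str]) -> dict:
    """Keep only the fields the destination needs; drop everything else."""
    return {key: value for key, value in payload.items() if key in needed}


def call_with_egress_log(
    send: Callable[[dict], Any],
    payload: dict,
    needed: Set[str],
    classification: str,
    destination: str,
    region: str,
) -> Any:
    """Minimise the payload, record the egress, then make the outbound call."""
    outbound = minimise(payload, needed)
    # Log field names, not values, so the audit log does not become a
    # second copy of the data it is meant to protect.
    logger.info(json.dumps({
        "event": "egress",
        "fields": sorted(outbound),
        "classification": classification,
        "destination": destination,
        "region": region,
    }))
    return send(outbound)


sent = []


def fake_send(body: dict) -> str:
    # Stand-in for a real outbound call; records what would have left.
    sent.append(body)
    return "ok"


result = call_with_egress_log(
    send=fake_send,
    payload={"name": "A. Customer", "email": "a@example.com", "order_id": "1234"},
    needed={"order_id"},
    classification="personal",
    destination="order_status_api",
    region="eu",
)
print(sent)
```

Only the order identifier reaches the destination; the name and email never leave. In a fuller version the policy check would run inside the same wrapper, before the call, so that no tool can be reached without passing through it.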

A practical checklist

Test your own codebase in ten minutes

The fastest way to find out whether this anti-pattern is present in your own system is to ask an AI coding assistant to look for it. Run the prompt below in a fresh chat session, on its own — and judge the system by what the code actually does, not by what its documentation claims.

Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.

You are looking for one specific failure mode: the agent has broad
read access and broad outbound reach (model endpoints, external tools,
write destinations) with no control that decides, per piece of data,
whether it may cross a jurisdictional, contractual, purpose, or tenant
boundary — no data classification/residency tagging, no egress
policy at the tool boundary, processing not pinned to permitted
regions, and egress unlogged.

If the agent never handles regulated/multi-tenant data or has no
external egress, say "not applicable" and skip the sections below.

Respond with exactly these four sections:
1. VERDICT: one of [present / not present / unclear]
2. EVIDENCE: file path + line numbers + a one-line quote per claim
3. WHY IT MATTERS: two sentences, plain English
4. FIX: a concrete change, with a short before/after code snippet
   if applicable. If "unclear", list the one piece of context you
   need to decide.

Insist on the four-part answer: a verdict with a file path, a line number, and a one-line quote is something you can act on; a verdict on its own is just an opinion. If the result is present, the FIX section is your starting point — classify data and enforce egress policy at the tool boundary. Re-run the same prompt after the change to confirm the verdict flips to not present.

Conclusion

Agents are connectors, and a connector that can read broadly and reach outward freely will, given enough tasks, move data across a line the law or a contract drew. The boundaries are real obligations; make them real controls. Classify the data, enforce egress at the tool boundary, pin the processing chain to permitted regions, and log every crossing. Then the agent's reach becomes a capability you govern rather than a liability you discover in an audit.

References & further reading

  1. GDPR — international transfers, purpose limitation, and processor obligations.
  2. EU AI Act — data-governance duties for AI systems.
  3. NIST AI Risk Management Framework — data-governance and privacy controls.
  4. Standing Credential — least privilege, applied here to data flow.