How to use
Each prompt below opens with a search instruction so the assistant scans the whole repository rather than a list you have to compile by hand — see Tip 1 for why this beats pinning files on anything but a toy project. Run prompts one at a time, in a fresh chat session per prompt — mixing them in one long thread is the single most common reason the assistant’s answers degrade. Capture each verdict; Tip 4 walks through the fix-list format.
The common response shape
Every prompt below ends with the same instruction block. It puts the evidence first and the judgement second, so the verdict and the severity follow mechanically from what was found — not from the model’s memory. Reading the response is much faster when it is always laid out the same way:
First, show your work in this order:
A. SEARCH — list the exact search terms and globs you used and the
files you opened. Ignore generated, vendored, and dependency folders.
B. EVIDENCE — enumerate every match as file path + line number +
a one-line quote. The verdict below must follow from this list.
Then respond with exactly these sections:
1. VERDICT: one of [present / not present / unclear / not applicable]
present = at least one cited artefact matches the failure mode
not present = the surface was searched and the safe pattern (or a
clear absence of the failure mode) is shown
not applicable = the precondition does not exist (name it)
unclear = the surface exists but the evidence is insufficient
(name the one fact you are missing)
2. SEVERITY: one of [Critical / High / Medium / Low / N/A], computed as
Likelihood x Impact using the rubric in "How verdicts and severity
are decided" below. State the Impact baseline, the Likelihood level,
and the matrix cell you landed on. Use N/A for not present / not
applicable.
3. WHY IT MATTERS: two sentences, plain English.
4. FIX: a concrete change, with a short before/after code snippet
if applicable. If "unclear", list the one piece of context you
need to decide.
5. AUDIT BLOCK: a fenced json object with this exact schema, so two
runs on the same commit are diff-able:
{"id":"<check-slug>","verdict":"...","impact":"...",
"likelihood":"...","severity":"...",
"evidence":[{"file":"...","line":0,"quote":"..."}],
"search_terms":["..."]}
Each prompt below assumes you append that block. To keep the page readable it is shown once, then referenced as [response shape].
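Because the audit block has a fixed schema, a response can be checked by machine before anyone reads the prose. The following is a minimal sketch of such a check, not part of the pack itself: the function name `check_audit_block` is hypothetical, and the allowed verdict and severity values are copied from the response shape above.

```python
import json

# Keys and allowed values taken from the response shape above.
REQUIRED_KEYS = {"id", "verdict", "impact", "likelihood",
                 "severity", "evidence", "search_terms"}
VERDICTS = {"present", "not present", "unclear", "not applicable"}
SEVERITIES = {"Critical", "High", "Medium", "Low", "N/A"}
EVIDENCE_KEYS = {"file", "line", "quote"}


def check_audit_block(raw: str) -> list:
    """Return a list of problems in one audit block; an empty list means it is well-formed."""
    try:
        block = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(block, dict):
        return ["top level is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - block.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if block.get("verdict") not in VERDICTS:
        problems.append(f"unknown verdict: {block.get('verdict')!r}")
    if block.get("severity") not in SEVERITIES:
        problems.append(f"unknown severity: {block.get('severity')!r}")

    evidence = block.get("evidence", [])
    if not isinstance(evidence, list):
        problems.append("evidence must be a list")
    else:
        for i, item in enumerate(evidence):
            if not isinstance(item, dict) or set(item) != EVIDENCE_KEYS:
                problems.append(f"evidence[{i}] must have exactly file, line, quote")
    return problems
```

A response whose block fails this check is a sign the response shape was ignored, and is usually worth re-running before reading further.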
How verdicts and severity are decided
The value of this pack for an audit is that the verdict and the severity are not the model’s opinion — they are derived, by a fixed rule, from evidence the model has to cite. Two reviewers (or two runs) looking at the same code and the same evidence should reach the same verdict and the same severity. This section is that rule, so anyone can re-derive a result without trusting the model.
How the verdict is decided
The verdict is gated on cited evidence: a claim with no file:line quote behind it is not a finding.
| Verdict | Decision rule | Audit trail it leaves |
|---|---|---|
| present | At least one artefact in the code or config matches the failure-mode definition. | The cited file path + line + one-line quote. |
| not present | The relevant surface was searched and the safe pattern (or a clear absence of the failure mode) is shown. | The search terms used, so the absence is verifiable, not assumed. |
| not applicable | The precondition for the failure mode does not exist in this codebase. | The named missing precondition (e.g. “no CI files”, “no end-user auth”). |
| unclear | The surface exists but the evidence is insufficient to decide. | The single named fact a reviewer needs to resolve it. |
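The evidence gate in the table can be applied mechanically to a parsed audit block. The sketch below assumes the block is already a Python dict and uses a hypothetical helper name, `evidence_gate`. It only covers the two verdicts whose audit trail lives in the JSON; the named precondition for "not applicable" and the named missing fact for "unclear" appear in the prose sections, so a reviewer still has to read those.

```python
from typing import Optional


def evidence_gate(block: dict) -> Optional[str]:
    """Return a rejection reason if the verdict lacks its audit trail, else None."""
    verdict = block.get("verdict")
    if verdict == "present" and not block.get("evidence"):
        # A claim with no file:line quote behind it is not a finding.
        return "present verdict with no cited evidence"
    if verdict == "not present" and not block.get("search_terms"):
        # Absence is only verifiable if the search terms are recorded.
        return "not present verdict with no search terms"
    return None
```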
How severity is calculated
Severity is Likelihood × Impact. Neither axis is a free-form judgement:
- Impact is a fixed baseline per anti-pattern — not a per-run guess. It reflects the consequence the anti-pattern’s framework mapping (OWASP LLM Top 10, OWASP Agentic, NIST, MITRE) attaches to it, recorded once in the catalogue’s evidence crosswalk. The baselines are below.
- Likelihood is scored per run from observable signals, on a fixed three-level scale.
Impact baselines. High = can lead to code execution, credential or data compromise, or supply-chain control. Medium = data exposure, or a governance / integrity gap. (No anti-pattern has a Low Impact baseline, so the Low column of the matrix goes unused; Low severity only emerges when Likelihood is Low.)
| Impact baseline | Anti-patterns |
|---|---|
| High | Wildcard tool exposure; unauthenticated tool channel; conflated context; comment-to-commit promotion; live-fetch dependency; standing credential; mutable reference trust; unsupervised perimeter; shared identity runtime; poisoned knowledge base; persistent memory poisoning; unsanitised output sink; implicit agent-to-agent trust; system prompt as secret store; boundaryless data egress |
| Medium | Phishable flow; plaintext journal; documented defence that doesn’t exist; internal-to-product gap; porous vector store; hallucinated authority; unbounded consumption; unattributable action; rubber-stamp approval; ungated model swap |
Likelihood (scored per run).
| Likelihood | When |
|---|---|
| High | An unsafe path is reachable by untrusted input (internet-facing, or third-party / user-supplied content) and no mitigating control is present. |
| Medium | Exploitation needs an authenticated or internal actor, or a partial / incomplete control exists. |
| Low | Exploitation needs privileged local access or several unlikely preconditions, or a strong compensating control is already in place. |
The matrix. Severity is the cell where the two axes meet:
| Likelihood ↓ \ Impact → | Low | Medium | High |
|---|---|---|---|
| High | Medium | High | Critical |
| Medium | Low | Medium | High |
| Low | Low | Low | Medium |
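Since severity is a lookup, the matrix can be written down as a table in code and used to re-derive any reported severity. This is an illustrative sketch: the `severity` function and its argument order are assumptions, and the cell values are copied from the matrix above. Scoring an "unclear" verdict through the matrix is also an assumption, since the response shape only states N/A for "not present" and "not applicable".

```python
# Outer key: Likelihood (scored per run). Inner key: Impact baseline.
MATRIX = {
    "High":   {"Low": "Medium", "Medium": "High",   "High": "Critical"},
    "Medium": {"Low": "Low",    "Medium": "Medium", "High": "High"},
    "Low":    {"Low": "Low",    "Medium": "Low",    "High": "Medium"},
}


def severity(verdict: str, impact: str, likelihood: str) -> str:
    """Re-derive severity from the two inputs and the matrix."""
    if verdict in ("not present", "not applicable"):
        return "N/A"
    # Assumption: any other verdict, including "unclear", is scored normally.
    return MATRIX[likelihood][impact]


print(severity("present", "High", "Medium"))  # prints "High"
```

If the severity in an audit block disagrees with this lookup for the stated Impact and Likelihood, the block is inconsistent and the finding should be re-run.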
Why a run is auditable
Three properties let a reviewer who was not in the room reconstruct any result:
- Evidence-gated verdicts. Every “present” carries a file:line quote a reviewer can open; every “not present” carries the search terms, so absence can be re-checked rather than taken on faith.
- Rule-derived severity. Severity is a lookup: a fixed Impact baseline crossed with an observable Likelihood. The answer to “why is this High?” is the two inputs and the matrix cell, never a vibe.
- A diff-able audit block + run record. Each finding ends with the JSON audit block; each run is stamped with the model, version, temperature, date, and commit SHA (Tip 4). Same inputs should produce the same blocks; any drift is visible by diffing them.
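Diffing the audit blocks of two runs can be as simple as comparing them field by field. The sketch below assumes each run is a list of parsed audit blocks (Python dicts) and uses a hypothetical function name, `diff_runs`. Evidence is compared by file and line only, on the assumption that the quoted text may differ in whitespace between runs.

```python
COMPARED_FIELDS = ("verdict", "impact", "likelihood", "severity")


def evidence_locations(block: dict) -> set:
    """Reduce a block's evidence list to a set of (file, line) pairs."""
    return {(e["file"], e["line"]) for e in block.get("evidence", [])}


def diff_runs(run_a: list, run_b: list) -> list:
    """List every difference between two runs of the same checks on the same commit."""
    blocks_a = {block["id"]: block for block in run_a}
    blocks_b = {block["id"]: block for block in run_b}
    drift = []
    for check_id in sorted(blocks_a.keys() | blocks_b.keys()):
        if check_id not in blocks_a or check_id not in blocks_b:
            drift.append(f"{check_id}: reported in only one run")
            continue
        a, b = blocks_a[check_id], blocks_b[check_id]
        for field in COMPARED_FIELDS:
            if a.get(field) != b.get(field):
                drift.append(f"{check_id}: {field} {a.get(field)!r} -> {b.get(field)!r}")
        if evidence_locations(a) != evidence_locations(b):
            drift.append(f"{check_id}: evidence locations differ")
    return drift
```

An empty result means the two runs agree on every rule-derived field; anything else is the drift a reviewer should look at first.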
The honest limit — and how to pin a run down
AI assistant output is probabilistic: byte-identical text is not guaranteed across runs, even at temperature zero, because hosted models change under you and tool-call order varies. So “consistent” here has a precise meaning — same model build + same commit + same prompt → the same rule-derived verdict and severity, with any difference surfaced by diffing the audit blocks rather than hidden in prose. To get there:
- Pin the model name and version, set temperature to its lowest setting, and run one prompt per fresh session (see Tip 1) — mixing checks in one thread is the main reason a combined run and an individual run disagree.
- Record the run (model, version, temperature, date, commit SHA) next to each finding — Tip 4 has the format.
- For the mechanically-detectable checks, back the AI pass with a deterministic static check, so those verdicts are 100% reproducible:
| Check | Deterministic rule you can also run |
|---|---|
| Live-fetch dependency; mutable reference trust | grep / semgrep for :latest, @main, unpinned actions/*@v*, and curl | sh; a CI lint that requires SHA-pinned actions. |
| Wildcard tool exposure | semgrep for shell=True, unrestricted exec, and HTTP clients with no host allow-list. |
| Standing credential; system prompt as secret store | a secret scanner (gitleaks / trufflehog) over source and prompt templates. |
| Plaintext journal | a lint that flags logging of raw prompt / tool-argument variables with no redaction wrapper. |
The AI pass owns the judgement-heavy checks; the static rules own the mechanical ones. Where both cover a check, the static rule is the tie-breaker of record — it cannot hallucinate.
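As an illustration of the first row of the table, a deterministic check for floating references can be a handful of regular expressions. The patterns below are a rough sketch rather than a replacement for semgrep or a CI lint: the rule names and the `scan` function are hypothetical, and the patterns will miss variants (for example third-party actions outside `actions/`).

```python
import re

# Illustrative patterns for the references named in the table above.
RULES = {
    "floating image tag": re.compile(r":latest\b"),
    "branch reference": re.compile(r"@main\b"),
    "action pinned by tag, not SHA": re.compile(r"uses:\s*actions/[\w.-]+@v\d"),
    "pipe to shell": re.compile(r"curl\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba)?sh\b"),
}


def scan(text: str, path: str = "<input>") -> list:
    """Return (path, line number, rule name, line) for every rule match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                hits.append((path, lineno, name, line.strip()))
    return hits


sample = """uses: actions/checkout@v4
uses: actions/checkout@8f4b7f84864484a7bf31766abe9204da3cbe65b3
image: python:latest
run: curl -fsSL https://example.com/install.sh | sh
"""
for hit in scan(sample, "ci.yml"):
    print(hit)
```

On the sample, lines 1, 3, and 4 are flagged and the SHA-pinned line 2 is not. The output has the same file, line, and quote shape as the evidence list, so it can be compared directly against what the AI pass cited.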
The twenty-five prompts
Reminder: AI assistant output is probabilistic. Verify every claim against your source before acting on it. False positives and false negatives are expected.
Prompt 1 — Is the tool catalogue a wildcard?
Looks for the agent being granted an open-ended set of capabilities (shell, arbitrary HTTP, broad file access) when a named verb list would do.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the AI agent in this
codebase is granted tool access that is wildcard or near-wildcard
— for example, an unrestricted shell, an HTTP client with no
domain allow-list, file system access with no path restriction, or
an MCP / plugin client that loads every tool advertised by the
server.
Tell me whether this codebase exhibits that pattern.
[response shape]
Prompt 2 — Are remote tools authenticated and trusted?
Catches connectors / plugins / MCP servers loaded over plain HTTP, without auth, or from sources that are not on a controlled list.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the AI agent connects
to remote tool servers (HTTP endpoints, MCP servers, plugins,
function-call backends) without authentication, without TLS, or from
URLs that are user-controlled / configuration-controlled with no
allow-list of trusted hosts.
[response shape]
Prompt 3 — Is untrusted content mixed with instructions?
The structural prompt-injection problem: content fetched from the web, a database, or a tool result is concatenated into the same context as the system prompt with no separator and no sanitisation.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: untrusted text
(web page contents, tool results, user-supplied documents, RAG
retrievals) is placed into the model’s context window without
clear separation from trusted instructions, without sanitisation,
and without any downstream check that prevents the model from
following directives embedded in that text.
[response shape]
Prompt 4 — Can a comment trigger a write action?
Relevant if your tool integrates with code review, issue trackers, or chat. Looks for the pattern where any commenter implicitly becomes a committer because the agent acts on their words.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: a comment, issue,
chat message, or other text written by anyone with read access can
cause the agent to perform a write action (commit code, merge a
pull request, deploy, modify a ticket, send a message) without an
additional authorisation step that checks the actor’s
permission to perform that specific write.
If the codebase does not integrate with any such surface, say
"not applicable" and stop.
[response shape]
Prompt 5 — Are dependencies pinned or fetched at runtime?
Floating tags (@latest, :main), curl | sh, runtime pip install / npm install — anything that resolves an external artefact at run-time instead of at build-time.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent (or its
install / startup script) fetches code or container images using
floating references — for example ":latest", "@main",
unpinned package installs at runtime, or "curl ... | sh"
patterns. A pinned version, a lock file, or a content hash counts
as not-present.
[response shape]
Prompt 6 — Does the agent hold long-lived broad credentials?
Looks for static API keys, long-lived OAuth tokens, or service-account secrets baked into the agent’s environment instead of minted just-in-time from a workload identity.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent process
holds a credential that is long-lived (no automatic rotation), broad
in scope, and continuously available in memory or on disk —
for example a static API key, a personal access token, or a
service-account secret read from environment variables at startup
and never refreshed.
A short-lived token minted just-in-time per call from a workload
identity counts as not-present.
[response shape]
Prompt 7 — Is the login flow appropriate for the device?
The specific case of OAuth device-code flow running on a device that has a perfectly good browser — turning a fallback flow into the attacker’s preferred channel.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent uses an
authentication flow that is meaningfully more phishable than the
device can support — most commonly OAuth device-code flow
running on a machine that has a browser available, where the
authorisation-code-with-PKCE flow would have worked.
If the codebase does not perform end-user authentication at all,
say "not applicable".
[response shape]
Prompt 8 — Does telemetry leak sensitive content?
Logs / traces that record prompts, tool arguments, retrieved documents, or model outputs verbatim — turning observability into a pre-staged data leak.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent writes
telemetry (logs, traces, metrics, spans) that contains raw model
prompts, raw tool arguments, retrieved document contents, or raw
model output, without redaction of personal data, credentials, or
business-sensitive content, and without an access boundary that
matches the sensitivity of the original data.
[response shape]
Prompt 9 — Are CI / pipeline references pinned by hash?
CI actions / workflows / orchestration steps referenced by tag (@v4) instead of by commit hash. The reference can be repointed after you reviewed it.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: any CI workflow,
pipeline definition, build script, or orchestration manifest in
this codebase references third-party actions, images, or modules
by a mutable name (a tag, a branch, "latest") instead of by a
content-addressable hash (commit SHA, image digest).
If the codebase has no CI / pipeline files, say "not applicable".
[response shape]
Prompt 10 — Are all shipped agents governed, or only the headline ones?
Catches the gap where one curated agent has policies, evals, and rate limits, but the same product ships ten other invocable agents or sub-agents with none of that.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the codebase
defines or registers multiple agents, sub-agents, or tool-bearing
flows, but the security and governance controls (rate limits, tool
allow-lists, evaluation hooks, logging, content filters) are
applied to only a subset of them.
Enumerate every distinct agent or sub-agent you can find across the
repository and state for each whether the controls applied to
the main one also apply to it.
[response shape]
Prompt 12 — Do the docs match the code?
README / docs claim a control (“all tool calls are sandboxed”, “prompts are redacted before logging”) that the code does not actually implement.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Read every source file
the failure mode could touch, and also read README.md and any docs/
file that mentions security, sandboxing, isolation, logging,
redaction, authentication, or permissions. List the search terms you
used so I can confirm nothing was missed.
You are looking for one specific failure mode: a security-relevant
claim in the documentation is not backed by an implementation in
the source.
For each claim, quote the doc sentence verbatim, then either point
to the file/lines that implement it or state "no implementation
found".
[response shape]
Prompt 13 — Is this tool ready to be shipped to others?
A tool built for internal use takes on new obligations the moment it is shipped — multi-tenant isolation, abuse handling, supportable error messages, a security contact. Catches the gap before launch day.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Read the source plus
README.md, LICENSE, SECURITY.md, and any docs/ file. List the search
terms you used so I can confirm nothing was missed.
Assume this codebase is about to be published for use by people
other than its authors. Identify obligations that apply to shipped
software but not to internal tools, and state whether each is met:
- multi-tenant isolation (no cross-tenant data leakage)
- abuse-handling path (rate limits, reporting channel)
- user-facing error messages (no internal stack traces / secrets)
- a security contact and disclosure policy
- a clear statement of what data the tool sends where
[response shape]
Candidate checks (14–25). The twelve prompts below target the candidate anti-patterns under review — gaps mapped against the OWASP LLM Top 10, the OWASP Agentic taxonomy, NIST AI RMF, MITRE ATLAS, and the compliance regimes. Run them the same way: one at a time, fresh chat per prompt, append the [response shape].
Prompt 14 — Is the RAG / training corpus trusted blindly?
Documents the agent retrieves or is fine-tuned on are an injection surface; a poisoned corpus steers behaviour long after ingestion with no runtime prompt to inspect.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: documents ingested
into a retrieval-augmented-generation index or a fine-tuning /
training corpus are treated as trusted — there is no provenance
check, no integrity validation, and no sanitisation of content that
could carry embedded instructions, before that content can influence
the model’s behaviour.
If the codebase has no RAG corpus or training data, say
"not applicable".
[response shape]
Prompt 15 — Is cross-session memory written without validation?
An agent that persists “learnings” across sessions can be taught a malicious instruction once and replay it indefinitely. Memory is state, and unvalidated state is an attack surface.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent persists
memory, notes, or “learnings” across sessions, and content can be
written into that store from untrusted input (user messages, tool
results, retrieved documents) without validation, attribution, or a
scope boundary — so a malicious instruction written once is
replayed in later sessions.
If the agent keeps no cross-session memory, say "not applicable".
[response shape]
Prompt 16 — Does retrieval enforce per-tenant isolation?
A shared vector index that ignores per-user access control leaks one tenant’s documents into another’s retrieval results; embeddings can also be inverted back toward source text.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: a vector store or
retrieval index is queried without enforcing the caller’s access
rights — there is no per-tenant or per-user filter on the
similarity search, so one user’s query can return another’s
documents, and the embeddings are not protected against inversion.
If the codebase has no vector store or retrieval index, say
"not applicable".
[response shape]
Prompt 17 — Is model output trusted as code, SQL, or markup?
Passing raw model output into a shell, a query, or a browser DOM turns the model into an injection vector for the classic web vulnerabilities. Output is untrusted input to the next system.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: raw model output is
passed into a downstream interpreter or sink — a shell command,
a SQL query, an HTML / DOM context, an eval, a file path, or a
generated API call — without encoding, parameterisation, or
validation appropriate to that sink, so the model can emit an
injection payload.
[response shape]
Prompt 18 — Are confident agent assertions acted on unverified?
Acting on a confident but fabricated answer — a non-existent API, a wrong policy, a made-up citation — because the interface presents it with no uncertainty signal or verification step.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the system takes a
consequential action, or presents a fact to a user as authoritative,
based purely on the model’s assertion — with no verification
against a source of truth, no uncertainty signal, and no
ground-truth check — so a confident hallucination (a
non-existent API, a wrong policy, a fabricated citation) is acted on.
[response shape]
Prompt 19 — Do sub-agents trust each other without checks?
In a multi-agent system, one agent accepting another’s output or tool calls without authentication or scope checks lets a single compromised agent pivot through the whole orchestration.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: in a multi-agent or
sub-agent system, one agent accepts another agent’s output,
instructions, or tool calls as trusted — without authenticating
the source, validating the message, or constraining it to that
agent’s permitted scope — so a single compromised agent can
pivot through the orchestration.
If the codebase has only a single agent, say "not applicable".
[response shape]
Prompt 20 — Are secrets or policy hidden in the prompt?
Embedding API keys, connection strings, or access rules in the system prompt assumes the prompt is confidential. It is extractable, so anything in it is effectively published.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the system prompt,
developer prompt, or any instruction text sent to the model contains
secrets (API keys, connection strings, passwords, tokens) or
security-relevant access rules that are relied on for enforcement
— treating the prompt as confidential when it is extractable.
[response shape]
Prompt 21 — Is there a token, cost, or loop ceiling?
An agent with no token budget, recursion limit, or request cap can be driven into runaway loops or adversarial cost amplification — a denial-of-service that bills you instead of crashing.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent has no
ceiling on consumption — no maximum token budget per request,
no recursion / tool-call-loop limit, no per-user rate or spend cap
— so it can be driven into a runaway loop or adversarial cost
amplification (denial of wallet).
[response shape]
Prompt 22 — Can every action be traced to who and why?
When an agent acts under a shared identity with no per-decision log, you cannot answer “who did this and why” after an incident — failing the non-repudiation controls auditors expect.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: actions the agent
takes (tool calls, writes, approvals, external requests) are not
recorded in a tamper-resistant audit trail that links each action to
the initiating user / session, the agent decision, and a timestamp
— so after an incident you cannot answer “who did this, and
why”.
[response shape]
Prompt 23 — Does the human-approval gate cause fatigue?
A human-approval gate that fires on every routine action trains the reviewer to click “approve” without reading. The control exists on paper but provides no real oversight.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: a human-in-the-loop
approval gate fires on every action regardless of risk, presents too
little context to make a real decision, or has no consequence for
bulk-approving — so the reviewer is trained to rubber-stamp and
the control provides no real oversight.
If there is no human-approval gate at all, say "not applicable".
[response shape]
Prompt 24 — Can the model or prompt change with no eval gate?
Switching the underlying model or editing the system prompt without a regression-eval gate ships silent behaviour and safety changes straight to production, outside change management.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the model version,
system prompt, or generation parameters can be changed and reach
production without passing an automated evaluation gate — no
representative regression suite, no comparison of new versus previous
behaviour, and no gradual rollout or quick revert.
If the system has no configurable model / prompt, say
"not applicable".
[response shape]
Prompt 25 — Is data sent across boundaries without control?
Routing user data to a model or third party in another jurisdiction or tenant, with no classification, residency rule, or egress allow-list, breaches privacy and sovereignty obligations.
Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Identify every location
the failure mode below could occur, read those files in full before
you judge, and list the search terms you used so I can confirm nothing
was missed.
You are looking for one specific failure mode: the agent has broad
read access and broad outbound reach (model endpoints, external
tools, write destinations) with no control that decides, per piece
of data, whether it may cross a jurisdictional, contractual,
purpose, or tenant boundary — no data classification / residency
tagging, no egress policy at the tool boundary, and unlogged egress.
If the agent never handles regulated / multi-tenant data or has no
external egress, say "not applicable".
[response shape]
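For a sense of what the missing control might look like, here is a minimal sketch of an egress check at the tool boundary. It assumes every payload already carries a classification tag and that destinations are governed by a host allow-list; the host names, class labels, and the `Payload` type are all hypothetical, and a real policy would also cover tenant and purpose boundaries.

```python
import logging
from dataclasses import dataclass
from urllib.parse import urlparse

logger = logging.getLogger("egress")

# Hypothetical policy: which data classes each destination host may receive.
# A host that is not listed may receive nothing.
EGRESS_POLICY = {
    "api.model-provider.example": {"public", "internal"},
    "eu.model-provider.example": {"public", "internal", "personal-eu"},
}


@dataclass(frozen=True)
class Payload:
    text: str
    classification: str  # for example "public", "internal", "personal-eu"
    tenant_id: str


def may_egress(payload: Payload, url: str) -> bool:
    """Decide, per piece of data, whether it may be sent to the given URL."""
    host = urlparse(url).hostname or ""
    allowed_classes = EGRESS_POLICY.get(host, set())
    decision = payload.classification in allowed_classes
    # Log the decision and its metadata, never the payload text itself.
    logger.info(
        "egress host=%s class=%s tenant=%s allowed=%s",
        host, payload.classification, payload.tenant_id, decision,
    )
    return decision


record = Payload("customer address ...", "personal-eu", "tenant-a")
print(may_egress(record, "https://api.model-provider.example/v1/chat"))  # False
print(may_egress(record, "https://eu.model-provider.example/v1/chat"))   # True
print(may_egress(record, "https://unknown.example/upload"))              # False
```

The point of the sketch is the default: an unknown host or an untagged class is denied, and every decision leaves a log line, which addresses the "unlogged egress" part of the failure mode.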
How long this takes
End-to-end, expect 30–45 minutes for a small tool, longer for a larger one — most of which is reading the answers, not running the prompts. Several of the twenty-five will return “not applicable” for any given tool; that is fine and is the point of having a checklist.
Failure modes & triage
| Symptom | Likely cause | Fix |
|---|---|---|
| Every prompt returns “not present” | Scope is wrong (assistant is reading docs, not code) or the response shape is being ignored. | Re-do Tip 1. Confirm with the sanity-check prompt at the bottom of Tip 1. |
| Verdict is “unclear” on most prompts | The files the assistant opened are not the ones that contain the relevant logic. | Ask the assistant which file would contain <the topic>, then re-run with that file named explicitly. |
| Findings are confidently wrong | Hallucination; the assistant did not actually read the file. | Switch from chat to inline-edit / file-attach mode that forces a read. Always check that quoted lines actually appear in the file. |
| Same prompt gives different answers on re-run | Model non-determinism; longer answers are more variable. | Lower temperature if your assistant exposes it; otherwise run a prompt three times and take the intersection of findings. |
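Taking the intersection of findings across three runs can be done by hand, but it is also easy to script once each run ends with a JSON audit block. The sketch below assumes the audit block carries an `evidence` list with `file` and `line` fields, as named in the combined prompt; the three sample runs are invented for illustration.

```python
import json
from typing import Iterable, Set, Tuple

EvidenceKey = Tuple[str, int]


def evidence_keys(audit_block: dict) -> Set[EvidenceKey]:
    """Reduce an audit block to the set of (file, line) pairs it cites."""
    return {(e["file"], int(e["line"])) for e in audit_block.get("evidence", [])}


def stable_findings(runs: Iterable[str]) -> Set[EvidenceKey]:
    """Keep only the evidence that every run of the same prompt reported."""
    key_sets = [evidence_keys(json.loads(raw)) for raw in runs]
    if not key_sets:
        return set()
    return set.intersection(*key_sets)


# Three hypothetical audit blocks from re-running one check on one commit.
run_1 = json.dumps({"id": 17, "verdict": "present", "evidence": [
    {"file": "agent/run.py", "line": 42, "quote": "subprocess.run(cmd, shell=True)"},
    {"file": "agent/db.py", "line": 10, "quote": "cursor.execute(sql)"},
]})
run_2 = json.dumps({"id": 17, "verdict": "present", "evidence": [
    {"file": "agent/run.py", "line": 42, "quote": "subprocess.run(cmd, shell=True)"},
]})
run_3 = json.dumps({"id": 17, "verdict": "present", "evidence": [
    {"file": "agent/run.py", "line": 42, "quote": "subprocess.run(cmd, shell=True)"},
    {"file": "agent/web.py", "line": 7, "quote": "innerHTML = reply"},
]})

stable = stable_findings([run_1, run_2, run_3])
print(sorted(stable))  # [('agent/run.py', 42)]
```

Findings that appear in only one run are not necessarily wrong; they are the ones to check by hand before you act on them.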
Run all twenty-five as one prompt
The twenty-five prompts above are best run one at a time, in a fresh chat per prompt; that keeps each answer sharp. If you want a single paste that walks the assistant through every check in one pass (useful for a quick first sweep, or for assistants that hold context well), the combined prompt below chains all twenty-five and expands the [response shape] once at the top. It finishes by writing a professional Markdown and HTML report into a Self-Testing-Report/ folder at the repository root. The report is accessibility-compliant and Fluent 2 themed, with scalable SVG diagrams (no overlapping or overflowing text, a click-to-expand full-page view, and subtle motion where it aids understanding). Paste it whole.
Treat the combined run as fast triage, not the audit of record. Spreading one paste across twenty-five checks gives each check less of the model’s attention, so it is more prone to missed evidence and merged answers than the individual prompts — it will not match a per-prompt run check-for-check, and it is not meant to. Use it to surface candidates quickly; before you rely on any present verdict or its severity, re-run that single check on its own in a fresh session (Tip 1) and keep the individual run, with its audit block and run record, as the result of record.
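Before relying on any present verdict, it also helps to confirm mechanically that each quoted evidence line really exists where the assistant says it does. The following is a minimal sketch, assuming evidence entries of the form file + line + quote; the function name, the two-line tolerance window, and the sample file are illustrative choices, not part of the prompt pack.

```python
import tempfile
from pathlib import Path


def quote_appears(repo_root: Path, file: str, line: int, quote: str,
                  window: int = 2) -> bool:
    """True if `quote` occurs within `window` lines of the cited
    (1-indexed) line in the cited file, ignoring whitespace differences."""
    root = repo_root.resolve()
    path = (repo_root / file).resolve()
    if root not in path.parents or not path.is_file():
        return False  # path escapes the repository or does not exist
    needle = " ".join(quote.split())
    if not needle:
        return False  # an empty quote proves nothing
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    lo = max(0, line - 1 - window)
    hi = min(len(lines), line + window)
    return any(needle in " ".join(text.split()) for text in lines[lo:hi])


# Demonstration against a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "agent.py").write_text(
        "import subprocess\nsubprocess.run(cmd, shell=True)\n", encoding="utf-8"
    )
    found = quote_appears(root, "agent.py", 2, "subprocess.run(cmd,  shell=True)")
    missing = quote_appears(root, "agent.py", 2, "eval(user_input)")
    escaped = quote_appears(root, "../outside.py", 1, "anything")

print(found, missing, escaped)  # True False False
```

A quote that fails this check is a strong hint that the assistant did not actually read the file, which is the "confidently wrong" symptom in the triage table.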
The combined sweep — twenty-five checks, one paste
Runs every failure-mode check in sequence in a single session, asks for a severity-ordered summary, then generates a Microsoft-standard Markdown and HTML report. The report uses scalable SVG diagrams with no text/diagram overlap and no overflow, a click-to-open full-page view for each diagram, gentle animation where it helps (honouring reduced-motion), light/dark Fluent 2 theming, and WCAG 2.1 AA accessibility. Slower and more variable than running prompts individually, but zero copy-paste juggling and you finish with a shareable artefact.
You are performing a security self-assessment of this entire
codebase. Work through all TWENTY-FIVE checks below in order, in this
one session, and do not stop until every check has an answer.
For each check: search the whole repository to find where it applies
— do not wait for me to list files. Ignore generated, vendored,
and dependency folders (build output, node_modules, vendor). Read the
relevant files in full before you judge, and list the search terms you
used so I can confirm nothing was missed.
For EVERY check, first list the search terms and globs you used and
the files you opened (ignore generated, vendored, and dependency
folders), then enumerate the matching file path + line + one-line
quote. The verdict must follow from that list. Then respond with
exactly these sections, numbered to match the check:
1. VERDICT: one of [present / not present / unclear / not applicable]
2. SEVERITY: one of [Critical / High / Medium / Low / N/A], computed as
Likelihood x Impact. Impact is the inherent blast radius of the
failure mode (High = code execution, credential or data compromise,
or supply-chain control; Medium = data exposure, governance or
integrity gaps; Low = limited or local effect). Likelihood is High
when an unsafe path is reachable by untrusted input with no
mitigation, Medium when it needs an authenticated or internal actor
or a partial control exists, Low when it needs privileged local
access or a strong control is already present. Severity is the
matrix cell: High+High=Critical; one High and one Medium=High;
Medium+Medium or one High and one Low=Medium; anything else=Low.
Use N/A for not present / not applicable.
3. WHY IT MATTERS: two sentences, plain English
4. FIX: a concrete change, with a short before/after code snippet
if applicable. If "unclear", list the one piece of context you
need to decide.
5. AUDIT BLOCK: a fenced json object {"id","verdict","impact",
"likelihood","severity","evidence":[{"file","line","quote"}],
"search_terms"} so re-runs on the same commit are diff-able.
After CHECK 25, end with a one-paragraph SUMMARY listing every check
whose verdict is "present", ordered by severity (most serious first),
then a RUN RECORD line: model name and version, temperature, the date,
and the repository commit SHA you assessed.
CHECK 1 — Wildcard tool exposure: the AI agent is granted tool
access that is wildcard or near-wildcard — for example, an
unrestricted shell, an HTTP client with no domain allow-list, file
system access with no path restriction, or an MCP / plugin client
that loads every tool advertised by the server.
CHECK 2 — Unauthenticated tool channel: the agent connects to
remote tool servers (HTTP endpoints, MCP servers, plugins,
function-call backends) without authentication, without TLS, or from
URLs that are user-controlled / configuration-controlled with no
allow-list of trusted hosts.
CHECK 3 — Conflated context (prompt injection): untrusted text
(web page contents, tool results, user-supplied documents, RAG
retrievals) is placed into the model’s context window without
clear separation from trusted instructions, without sanitisation, and
without any downstream check that prevents the model from following
directives embedded in that text.
CHECK 4 — Comment-to-commit promotion: a comment, issue, chat
message, or other text written by anyone with read access can cause
the agent to perform a write action (commit code, merge a pull
request, deploy, modify a ticket, send a message) without an
additional authorisation step that checks the actor’s permission
to perform that specific write. If the codebase does not integrate
with any such surface, answer "not applicable".
CHECK 5 — Live-fetch dependency: the agent (or its install /
startup script) fetches code or container images using floating
references — for example ":latest", "@main", unpinned package
installs at runtime, or "curl ... | sh" patterns. A pinned version, a
lock file, or a content hash counts as "not present".
CHECK 6 — Standing credential: the agent process holds a
credential that is long-lived (no automatic rotation), broad in scope,
and continuously available in memory or on disk — for example a
static API key, a personal access token, or a service-account secret
read from environment variables at startup and never refreshed. A
short-lived token minted just-in-time per call from a workload
identity counts as "not present".
CHECK 7 — Phishable flow: the agent uses an authentication flow
that is meaningfully more phishable than the device can support —
most commonly OAuth device-code flow running on a machine that has a
browser available, where authorisation-code-with-PKCE would have
worked. If the codebase does not perform end-user authentication at
all, answer "not applicable".
CHECK 8 — Plaintext journal: the agent writes telemetry (logs,
traces, metrics, spans) that contains raw model prompts, raw tool
arguments, retrieved document contents, or raw model output, without
redaction of personal data, credentials, or business-sensitive
content, and without an access boundary that matches the sensitivity
of the original data.
CHECK 9 — Mutable reference trust: any CI workflow, pipeline
definition, build script, or orchestration manifest references
third-party actions, images, or modules by a mutable name (a tag, a
branch, "latest") instead of by a content-addressable hash (commit
SHA, image digest). If the codebase has no CI / pipeline files, answer
"not applicable".
CHECK 10 — Unsupervised perimeter: the codebase defines or
registers multiple agents, sub-agents, or tool-bearing flows, but the
security and governance controls (rate limits, tool allow-lists,
evaluation hooks, logging, content filters) are applied to only a
subset of them. Enumerate every distinct agent or sub-agent you can
find and state for each whether the controls applied to the main one
also apply to it.
CHECK 11 — Shared identity runtime: the agent executes in the
same security context as the operator who launched it (same OS user,
same cloud identity, same file system permissions, same process),
rather than in a sandbox / separate identity with a narrower set of
permissions.
CHECK 12 — Documented defence that doesn’t exist: a
security-relevant claim in the documentation is not backed by an
implementation in the source. Also read README.md and any docs/ file
that mentions security, sandboxing, isolation, logging, redaction,
authentication, or permissions. For each claim, quote the doc sentence
verbatim, then either point to the file/lines that implement it or
state "no implementation found".
CHECK 13 — Internal-to-product gap: assume this codebase is about
to be published for use by people other than its authors. Read the
source plus README.md, LICENSE, SECURITY.md, and any docs/ file.
Identify obligations that apply to shipped software but not to internal
tools, and state whether each is met:
- multi-tenant isolation (no cross-tenant data leakage)
- abuse-handling path (rate limits, reporting channel)
- user-facing error messages (no internal stack traces / secrets)
- a security contact and disclosure policy
- a clear statement of what data the tool sends where
CHECK 14 — Poisoned knowledge base: documents ingested into a
retrieval-augmented-generation index or a fine-tuning / training
corpus are treated as trusted — no provenance check, no integrity
validation, and no sanitisation of content that could carry embedded
instructions, before that content can influence the model. If there is
no RAG corpus or training data, answer "not applicable".
CHECK 15 — Persistent memory poisoning: the agent persists
memory or “learnings” across sessions, and content can be written
into that store from untrusted input (user messages, tool results,
retrieved documents) without validation, attribution, or a scope
boundary — so a malicious instruction written once replays in
later sessions. If the agent keeps no cross-session memory, answer
"not applicable".
CHECK 16 — Porous vector store: a vector store or retrieval index
is queried without enforcing the caller’s access rights — no
per-tenant or per-user filter on the similarity search, so one
user’s query can return another’s documents, and the embeddings are
not protected against inversion. If there is no vector store or
retrieval index, answer "not applicable".
CHECK 17 — Unsanitised output sink: raw model output is passed
into a downstream interpreter or sink — a shell command, a SQL
query, an HTML / DOM context, an eval, a file path, or a generated API
call — without encoding, parameterisation, or validation
appropriate to that sink, so the model can emit an injection payload.
CHECK 18 — Hallucinated authority: the system takes a
consequential action, or presents a fact to a user as authoritative,
based purely on the model’s assertion — no verification against
a source of truth, no uncertainty signal, no ground-truth check —
so a confident hallucination (a non-existent API, a wrong policy, a
fabricated citation) is acted on.
CHECK 19 — Implicit agent-to-agent trust: in a multi-agent or
sub-agent system, one agent accepts another’s output, instructions,
or tool calls as trusted — without authenticating the source,
validating the message, or constraining it to that agent’s permitted
scope — so a single compromised agent can pivot through the
orchestration. If there is only a single agent, answer
"not applicable".
CHECK 20 — System prompt as secret store: the system prompt,
developer prompt, or any instruction text sent to the model contains
secrets (API keys, connection strings, passwords, tokens) or
security-relevant access rules relied on for enforcement —
treating the prompt as confidential when it is extractable.
CHECK 21 — Unbounded consumption: the agent has no ceiling on
consumption — no maximum token budget per request, no recursion /
tool-call-loop limit, no per-user rate or spend cap — so it can
be driven into a runaway loop or adversarial cost amplification
(denial of wallet).
CHECK 22 — Unattributable action: actions the agent takes (tool
calls, writes, approvals, external requests) are not recorded in a
tamper-resistant audit trail that links each action to the initiating
user / session, the agent decision, and a timestamp — so after an
incident you cannot answer “who did this, and why”.
CHECK 23 — Rubber-stamp approval: a human-in-the-loop approval
gate fires on every action regardless of risk, presents too little
context to make a real decision, or has no consequence for
bulk-approving — so the reviewer is trained to rubber-stamp and
the control provides no real oversight. If there is no human-approval
gate at all, answer "not applicable".
CHECK 24 — Ungated model swap: the model version, system prompt,
or generation parameters can be changed and reach production without
passing an automated evaluation gate — no representative
regression suite, no comparison of new versus previous behaviour, and
no gradual rollout or quick revert. If the system has no configurable
model / prompt, answer "not applicable".
CHECK 25 — Boundaryless data egress: the agent has broad read
access and broad outbound reach (model endpoints, external tools,
write destinations) with no control that decides, per piece of data,
whether it may cross a jurisdictional, contractual, purpose, or tenant
boundary — no data classification / residency tagging, no egress
policy at the tool boundary, and unlogged egress. If the agent never
handles regulated / multi-tenant data or has no external egress,
answer "not applicable".
=== DELIVERABLE: WRITE TWO REPORT FILES ===
After you have completed all twenty-five checks, create a folder named
"Self-Testing-Report" at the repository root and write BOTH of these
files into it (create the folder if it does not exist). Append the
run's generation timestamp to each filename in the form
YYYY-MM-DD-HHmmss (24-hour local time) so every run is preserved and a
new report never overwrites an earlier one:
1. Self-Testing-Report/security-self-assessment-YYYY-MM-DD-HHmmss.md
2. Self-Testing-Report/security-self-assessment-YYYY-MM-DD-HHmmss.html
Use the SAME timestamp for both files in a single run, and use that
same timestamp as the generation date in the report metadata (section A).
Both files must contain the SAME findings and meet Microsoft writing
and documentation standards: clear heading hierarchy, plain concise
language, sentence-case headings, active voice, no marketing tone, and
every acronym expanded on first use. Include a generation date and a
one-line note that the report is AI-generated and must be verified.
Report structure (both files, in this order):
A. Title + metadata (repository name, date, assistant/model used,
scope, and the disclaimer that findings are probabilistic and
must be human-verified).
B. Executive summary: one short paragraph on what this assessment
is and how to read it, then total checks, counts by verdict
(present / not present / unclear / not applicable), and the top
risks in plain English for a non-technical reader.
C. Severity-ordered findings table with columns: # | Check |
Verdict | Severity (Critical/High/Medium/Low/N/A) | One-line
summary. Order most-serious first.
D. Detailed findings: one section per check (all twenty-five), in the
same order, each using this exact consistent layout so the
report reads cleanly and never feels cluttered. Use short
paragraphs and sub-headings, not dense walls of text:
- Heading: the check number and name.
- What this check is: 1-2 plain-English sentences defining the
failure mode, written so a non-specialist understands it.
Expand any acronym on first use.
- Why we check it: 1-2 sentences on what risk it maps to (cite
the relevant framework where natural, e.g. OWASP LLM Top 10,
NIST AI RMF, MITRE ATLAS) and why it matters for an AI agent.
- What goes wrong if it is not fixed: 1-3 sentences describing
the concrete failure / attack and its real-world impact
(data loss, privilege escalation, supply-chain compromise,
etc.) so the reader understands the stakes.
- Verdict: one of [present / not present / unclear /
not applicable].
- Evidence: file path + line numbers + a one-line quote per
claim (or "no occurrences found" with the terms you searched).
- Why this verdict, here: 1-2 sentences tying the evidence in
THIS codebase to the verdict.
- Recommended fix and why it helps: a concrete change with a
short before/after snippet where applicable, plus one
sentence on what the fix prevents. For "not present", state
briefly what good looks like so the reader can keep it that
way; for "unclear", list the one piece of context needed.
Keep each section self-contained and scannable: lead with the
plain-English explanation, then the evidence, then the fix.
Do not repeat the full definitions in other sections.
Diagrams (required):
- Render every diagram as crisp, scalable SVG. In the .md use
```mermaid fenced blocks (Mermaid emits SVG); in the .html embed
the Mermaid runtime via a pinned CDN script tag and render the
same definitions to inline SVG. Where a diagram is highly bespoke,
you may hand-author inline SVG instead of Mermaid, but it must
follow the same layout and accessibility rules below.
- For EVERY finding whose verdict is "present", include a Mermaid
sequenceDiagram that shows how the gap is exploited: the actor /
untrusted input, the agent, the tool or credential or channel,
and the resulting impact. Give each its own heading.
- Include one overall data-flow / trust-boundary diagram (Mermaid
flowchart) showing untrusted inputs, the agent, its tools, its
credentials, and where the trust boundaries sit.
- STRONGLY prefer Mermaid over hand-authored SVG: Mermaid sizes
each node box to its text automatically, which avoids clipping.
Only hand-author SVG if Mermaid genuinely cannot express the
diagram, and if you do, you MUST wrap every label in a
<foreignObject> with real HTML text so it wraps and the box
grows to fit — never paint text into a fixed-width <rect>.
- Labels must be COMPLETE, never truncated. Do not cut a label to
fit a box and do not end one mid-word (no "comment tex", no
"remote M", no "single OS id"). If a label is long, rephrase it
to be genuinely shorter, or wrap it onto two lines with a Mermaid
line break (<br>), but always show the full meaning. After
rendering, re-read every node and edge label and confirm none is
clipped or running past its box; fix any that are before saving.
- Layout quality is mandatory: NO text may overlap another node,
edge, or label, and NO text or shape may overflow the diagram's
bounds or be clipped. Give nodes generous padding, keep adequate
nodeSpacing and rankSpacing between nodes and edges, and let the
SVG size to its content (keep its viewBox; do not impose a fixed
pixel width/height that squashes it) so nothing is cut off at any
zoom level or screen width.
- Provide a full-page / expand option for visibility, and make it
actually fill the screen. Each diagram has a "Full screen" button
that opens a full-viewport overlay (a native <dialog> or a
fixed-position lightbox). Inside the overlay the diagram must
SCALE UP to fill the available space — it must not sit tiny
in the middle of a large empty area. Achieve this by: keeping the
SVG's viewBox, removing any fixed width/height attributes on the
SVG (or setting width/height to 100%), adding
preserveAspectRatio="xMidYMid meet", and styling the SVG with
width:100%; height:100%; max-width:96vw; max-height:90vh inside a
flex container that centres it (display:flex; align-items:center;
justify-content:center). The result should be a large, crisp,
centred diagram that uses most of the screen. Include a clearly
labelled, keyboard-operable Close control (Esc must also close
it), dim the page behind, return focus to the trigger on close,
and trap focus while open (role, accessible name, screen-reader
usable).
- Use subtle, purposeful animation where it aids comprehension (for
example, an animated pulse or moving dot along the exploit path in
a "present" finding's diagram to trace the flow of untrusted data,
or a gentle highlight of the trust boundary that is crossed). Keep
animations short, looping calmly, and never essential to meaning.
You MUST honour prefers-reduced-motion: reduce by disabling or
freezing all motion for users who request it.
- Accessibility for diagrams is mandatory and must meet WCAG 2.1
AA: give every Mermaid diagram a title via the "accTitle" and a
longer text alternative via "accDescr" (for hand-authored SVG use
<title> and <desc> plus role="img" and an aria-label), AND
immediately follow each diagram with a short plain-text summary
paragraph that fully conveys the same information to anyone who
cannot see the diagram. Never rely on colour alone to convey
meaning (pair colour with a label, icon, or text). Ensure all
text/background pairs meet a contrast ratio of at least 4.5:1.
HTML report requirements (the timestamped .html report only):
- A single self-contained file (inline CSS and the one Mermaid
script tag); it must open correctly from the local filesystem.
- Use the Fluent 2 design language: font stack
"Segoe UI Variable, Segoe UI, system-ui, sans-serif", Fluent 2
spacing and rounded corners, and Fluent 2 colour tokens.
- Support BOTH a light and a dark theme. Honour the OS setting via
prefers-color-scheme AND provide a visible, keyboard-operable
theme-toggle button. Configure Mermaid's theme to follow the
active mode so diagrams are legible in both light and dark.
- Initialise Mermaid for legibility and no clipping: set
startOnLoad true, theme to match the mode, securityLevel 'loose'
so HTML labels and line breaks render, and a flowchart config of
htmlLabels:true, useMaxWidth:true, nodeSpacing:60, rankSpacing:70
(and equivalent spacing for sequence diagrams). After Mermaid
renders, each diagram's inline SVG should display at a readable
size on the page (not shrunk to a thumbnail) and expand cleanly
in the full-screen overlay described above.
- Accessibility (WCAG 2.1 AA): semantic landmarks (header, main,
nav, footer), a "skip to main content" link, a single h1 then a
correct heading order, a visible :focus indicator, descriptive
link text, table headers with scope attributes, and a lang
attribute on the html element. All interactive controls must be
reachable and operable by keyboard.
- Keep the severity table pre-sorted, most serious first, so it reads
in priority order without any script-based sorting, and use both a
colour AND a text label for each severity badge.
After writing both files, print the two file paths and a one-line
confirmation of how many findings were marked "present".
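Because the combined prompt spells out the severity matrix in words, you can recompute it from the `impact` and `likelihood` fields of each audit block and flag any block whose stated severity does not match. This is a small sketch of that check; the function names are hypothetical, and treating an "unclear" verdict like "present" for scoring purposes is an assumption, since the prompt only defines N/A for not present and not applicable.

```python
LEVELS = ("High", "Medium", "Low")


def severity(impact: str, likelihood: str) -> str:
    """Matrix cell for Likelihood x Impact, as described in the prompt."""
    # Sorting by rank makes the lookup order-independent; an unknown
    # level raises ValueError from tuple.index.
    pair = sorted((impact, likelihood), key=LEVELS.index)
    if pair == ["High", "High"]:
        return "Critical"
    if pair == ["High", "Medium"]:
        return "High"
    if pair in (["Medium", "Medium"], ["High", "Low"]):
        return "Medium"
    return "Low"


def expected_severity(block: dict) -> str:
    """Severity an audit block should state, given its other fields."""
    if block["verdict"] in ("not present", "not applicable"):
        return "N/A"
    return severity(block["impact"], block["likelihood"])


block = {"id": 6, "verdict": "present", "impact": "High",
         "likelihood": "Medium", "severity": "Critical"}
print(expected_severity(block))                       # High
print(expected_severity(block) == block["severity"])  # False: overstated
```

A mismatch does not tell you which field is wrong, only that the block is internally inconsistent and worth re-running on its own.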
Next tip
The prompt pack is structured and predictable, which is a strength and a weakness. Tip 3 is the unstructured counterpart: a single longer prompt that role-plays an attacker against your tool and tells you what they would try first.