Tip 3 — One red-team prompt that tries to break your agent

Key insight

Structured checklists find structured failures. Adversarial role-play finds the failures the checklist did not anticipate — the chain of two small mistakes that becomes one large one, the assumption the code makes that an attacker can flip. Both passes matter; neither replaces the other.

The prompt

Paste this into a fresh chat with whole-repository search set up as described in Tip 1.

Reminder: AI assistant output is probabilistic. Verify every claim against your source before acting on it. False positives and false negatives are expected.

Search the whole repository to find where this applies — do not
wait for me to list files. Ignore generated, vendored, and dependency
folders (build output, node_modules, vendor). Read every file that
forms an input surface or handles untrusted data in full, and list
the search terms you used so I can confirm nothing was missed.

Take the position of an adversary who wants this AI tool / agent to
do something its authors did not intend — leak data, perform
an unauthorised action, execute attacker-supplied code, exhaust
resources, or impersonate a user.

You have read access to the source. You can send any input the
tool accepts (prompts, tool arguments, files, URLs, configuration).
You cannot modify the source.

Produce, in this exact format:

ATTACK PLAYBOOK
For each attack you can construct (aim for at least five), give:
  - GOAL: what you want the tool to do
  - ENTRY POINT: which input surface you use
  - STEPS: the concrete sequence (specific values, not hand-waving)
  - WHY IT WORKS: which line or design choice makes it possible
  - EFFORT: low / medium / high
  - BLAST RADIUS: contained / lateral movement / full compromise

DEFENCES THAT WOULD STOP EACH ATTACK
For each attack above, the smallest change to the source that
would defeat it. Include a before/after snippet.

WHAT YOU COULD NOT FIND A WAY TO ATTACK
List the parts of the tool that resisted your attempts, and say
briefly why — so the authors know what to keep.

Constraints:
- No generic advice. Every claim must reference a specific file
  and line number in the source.
- If you do not have enough context to construct any attack, say
  so and list the single piece of information you need.
- Do not invent code that is not in the files you read.

Reading the answer

Expect one of four outcomes, three useful and one less so:

Useful: specific attacks with line refs. Take each one, verify that the cited lines really say what the assistant claims (always; do not skip this), and add the verified ones to your fix list (Tip 4).
Useful: attacks that turn out to be wrong. Often the assistant has missed a check that is there. Note these too: they are evidence that the check is hard to find, which is itself worth fixing (move the check closer to the entry point, or add a comment pointing to it that the next review pass will pick up).
Useful: the “could not find a way to attack” section. This is where the assistant tells you which design decisions are quietly doing real work. Keep them.
Less useful: generic platitudes, such as “An attacker could try injection” with no specifics. Re-run with the search narrowed to the input-handling code and an explicit reminder that every claim needs a line number.
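
The verification step for line refs can be partly mechanised. The following is a minimal sketch, assuming you copy each claim out of the answer by hand into a small record; the names Citation and check_citation are hypothetical, not part of any tool. It only confirms that the quoted code exists at or near the cited line. Whether the attack actually works still needs a human read.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Citation:
    """One file-and-line claim copied from the assistant's answer."""
    path: str    # path relative to the repository root
    line: int    # 1-based line number the assistant cited
    quoted: str  # the code the assistant says is on that line


def _squash(text: str) -> str:
    """Collapse runs of whitespace so indentation differences do not matter."""
    return " ".join(text.split())


def check_citation(repo_root: Path, c: Citation, window: int = 2) -> str:
    """Return 'match', 'nearby', 'mismatch', or 'missing' for one citation."""
    file_path = repo_root / c.path
    if not file_path.is_file():
        return "missing"  # the cited file does not exist
    lines = file_path.read_text(encoding="utf-8", errors="replace").splitlines()
    if not 1 <= c.line <= len(lines):
        return "missing"  # the cited line number is out of range
    needle = _squash(c.quoted)
    if not needle:
        return "mismatch"  # nothing was quoted, so nothing can be confirmed
    idx = c.line - 1
    if needle in _squash(lines[idx]):
        return "match"
    # Line numbers are often off by one or two; look just around the cited line.
    lo, hi = max(0, idx - window), min(len(lines), idx + window + 1)
    if any(needle in _squash(lines[i]) for i in range(lo, hi)):
        return "nearby"
    return "mismatch"
```

A "nearby" result usually means the assistant miscounted lines. A "mismatch" or "missing" result means the claim goes into the second bucket above, attacks that turn out to be wrong, until you have read the file yourself.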

When to use this prompt vs. the prompt pack

Situation	Reach for
First-ever self-review of a new tool	Tip 2 (prompt pack) first, this prompt second.
Familiar tool, before a release	This prompt first — you already know the structured answers.
Refactor of a security-critical component	Both, on just the changed files.
Adding a new tool / capability to the agent	This prompt, scoped to the new file plus the dispatcher / catalogue file.
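
For the scoped rows in the table, one way to build the narrowed prompt is sketched below. This is an illustration under assumptions: the helper names changed_files and scope_prompt are hypothetical, the base branch is assumed to be called main, and prompt_body is assumed to be the prompt from "Take the position of an adversary" onward, since the opening whole-repository search paragraph would contradict a file-scoped run.

```python
import subprocess


def changed_files(base: str = "main") -> list[str]:
    """List files that differ from `base`, using `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def scope_prompt(prompt_body: str, files: list[str]) -> str:
    """Prepend a scope restriction naming the files to review."""
    if not files:
        raise ValueError("no files to scope to")
    # Sort and de-duplicate so the same change set always gives the same prompt.
    listing = "\n".join(f"- {f}" for f in sorted(set(files)))
    return (
        "Limit your review to these files, and read each one in full:\n"
        f"{listing}\n\n{prompt_body}"
    )
```

For a refactor you might call scope_prompt(body, changed_files()). For a new capability, pass the new file and the dispatcher / catalogue file by hand instead.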

Failure modes & triage

Symptom	Likely cause	Fix
Assistant refuses to role-play an attacker	Safety policy is reading “adversary” as malicious intent.	Reframe: “Act as a security reviewer producing a defensive playbook for the authors of this tool.” Same content, different framing.
Attacks are imaginative but unrelated to your code	Assistant defaulted to general LLM-security tropes instead of reading the source.	Add: “Quote the exact line of code your attack depends on. If you cannot, do not include the attack.”
Every attack is rated “low” effort	Calibration drift; the assistant is overstating ease.	Add: “Effort is ‘low’ only if the attack works on the first try without any reconnaissance. Most real attacks are medium or high.”
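
Two of these symptoms can be spotted before you read the answer in detail. The sketch below assumes the assistant followed the plain "FIELD: value" layout the prompt asks for; answers that wrap field names in markdown or rename them will not parse. The function names and the citation pattern (a file name followed by ":42" or "line 42") are assumptions made for illustration.

```python
import re

FIELDS = ("GOAL", "ENTRY POINT", "STEPS", "WHY IT WORKS", "EFFORT", "BLAST RADIUS")
FIELD_RE = re.compile(r"^\s*[-*]?\s*(" + "|".join(FIELDS) + r")\s*:\s*(.*)$")
# Assumed citation shapes: "path/file.py:42" or "file.py line 42".
LINE_REF_RE = re.compile(r"[\w./-]+\.\w+(?::\d+|,?\s+lines?\s+\d+)", re.IGNORECASE)


def parse_attacks(text: str) -> list[dict[str, str]]:
    """Split the ATTACK PLAYBOOK section into one dict per attack."""
    attacks: list[dict[str, str]] = []
    current = None
    last_field = None
    for raw in text.splitlines():
        if raw.strip().upper().startswith("DEFENCES THAT WOULD STOP"):
            break  # the playbook section has ended
        m = FIELD_RE.match(raw)
        if m:
            name, value = m.group(1), m.group(2).strip()
            if name == "GOAL":  # each GOAL line starts a new attack
                current = {}
                attacks.append(current)
            if current is not None:
                current[name] = value
                last_field = name
        elif current is not None and last_field and raw.strip():
            # Continuation line: append it to the field above.
            current[last_field] += " " + raw.strip()
    return attacks


def triage(text: str) -> list[str]:
    """Return warnings that map onto the symptoms in the table above."""
    attacks = parse_attacks(text)
    if not attacks:
        return ["no attacks parsed: check the answer's format by hand"]
    warnings = []
    if len(attacks) < 5:
        warnings.append(f"only {len(attacks)} attacks; the prompt asks for at least five")
    if all(a.get("EFFORT", "").lower().startswith("low") for a in attacks):
        warnings.append("every attack is rated low effort: possible calibration drift")
    for i, a in enumerate(attacks, start=1):
        if not LINE_REF_RE.search(a.get("WHY IT WORKS", "")):
            warnings.append(f"attack {i} cites no file and line: likely generic")
    return warnings
```

An empty list from triage(answer_text) does not mean the answer is good, only that it avoided these particular symptoms. A refusal to role-play shows up as the "no attacks parsed" warning.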

Next tip

You now have a structured pass and an adversarial pass. Both produce findings. Tip 4 is the tiny convention for turning those findings into changes that actually ship.
