Prompt Injection: Threat Model & Defenses

What prompt injection really is

Prompt injection is an attempt to influence an application’s behavior by manipulating the text and context it feeds to the model. Attackers aim to override instructions, extract sensitive context, or coerce unsafe tool actions.

A simple threat model

Target	How it’s attacked	Impact
Instruction hierarchy	Role confusion, “ignore previous instructions”, multi-turn manipulation	Policy bypass, unsafe content, incorrect actions
Context boundary (RAG)	Prompting model to reveal retrieved context or fetch restricted docs	Data leakage, IP exposure
Tool boundary (agents)	Coercing tools to execute unsafe actions or expand scope	Unauthorized actions, workflow compromise

Layered defenses that scale

Avoid a single brittle control. Use layers that reinforce each other:

Policy gates: enforce non-negotiable boundaries before responses or tool calls.
Context hygiene: sanitize retrieved context and restrict what can be exposed.
Tool allowlists: constrain which tools can run, with scoped parameters.
Action verification: confirm “high impact” actions with explicit checks.
Adversarial regression tests: re-run injection tests as prompts change.

What to measure

Security teams need signals. Measure injection attempts by category, boundary touched (prompt/context/tool), outcomes (blocked/allowed), and drift over time.