What prompt injection really is
Prompt injection is an attempt to influence an application’s behavior by manipulating the text and context it feeds to the model. Attackers aim to override instructions, extract sensitive context, or coerce unsafe tool actions.
A simple threat model
| Target | How it’s attacked | Impact |
|---|---|---|
| Instruction hierarchy | Role confusion, “ignore previous instructions”, multi-turn manipulation | Policy bypass, unsafe content, incorrect actions |
| Context boundary (RAG) | Prompting model to reveal retrieved context or fetch restricted docs | Data leakage, IP exposure |
| Tool boundary (agents) | Coercing tools to execute unsafe actions or expand scope | Unauthorized actions, workflow compromise |
Layered defenses that scale
Avoid a single brittle control. Use layers that reinforce each other:
- Policy gates: enforce non-negotiable boundaries before responses or tool calls.
- Context hygiene: sanitize retrieved context and restrict what can be exposed.
- Tool allowlists: constrain which tools can run, with scoped parameters.
- Action verification: confirm “high impact” actions with explicit checks.
- Adversarial regression tests: re-run injection tests as prompts change.
What to measure
Security teams need signals. Measure injection attempts by category, boundary touched (prompt/context/tool), outcomes (blocked/allowed), and drift over time.