Define what “good” looks like
Before you configure alerts, define baselines: typical tool usage patterns, normal retrieval behavior, and expected response structures. Without a baseline, an alert cannot distinguish an anomaly from ordinary variation. When prompts or models change, re-baseline intentionally.
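
One way to make a tool usage baseline concrete is to record each tool's share of calls over a reference window and compare later windows against it. The sketch below is a minimal illustration under assumptions: the tool names, the sample data, and the 0.15 tolerance are hypothetical, and a real deployment would choose its own window and threshold.

```python
from collections import Counter

def build_baseline(tool_calls):
    """Return each tool's share of total calls in a window of tool names."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}

def deviations(baseline, current_calls, tolerance=0.15):
    """Return tools whose call share moved more than `tolerance` from baseline.

    `tolerance` is an illustrative value, not a recommended setting.
    """
    current = build_baseline(current_calls)
    flagged = {}
    for tool in set(baseline) | set(current):
        delta = current.get(tool, 0.0) - baseline.get(tool, 0.0)
        if abs(delta) > tolerance:
            flagged[tool] = round(delta, 2)
    return flagged

# Hypothetical data: a reference window and a later window to compare.
reference = ["search"] * 70 + ["calendar"] * 25 + ["send_email"] * 5
today = ["search"] * 40 + ["calendar"] * 20 + ["send_email"] * 40

baseline = build_baseline(reference)
flagged = deviations(baseline, today)
```

Here `flagged` contains `search` and `send_email`, whose shares moved well past the tolerance, while `calendar` stays within it. Re-baselining after a prompt or model change amounts to calling `build_baseline` again on a fresh reference window.
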
Signals that map to security outcomes
| Signal type | Examples | Why it matters |
|---|---|---|
| Injection attempts | Role override patterns, multi-turn manipulation | Indicates boundary pressure; may require guardrail updates |
| Tool anomalies | Unexpected tool selection, parameter spikes | Can indicate misuse, coercion, or compromised workflow |
| Leakage pressure | Requests for hidden context or restricted data | Highlights weak context boundaries or disclosure policy gaps |
| Drift | Shift in refusal rates, changes in behavior after updates | Can reintroduce risk or reduce control effectiveness |
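
The drift row can be checked with a simple before-and-after comparison of refusal rates. The sketch below uses a standard two-proportion z-test as one possible approach; the function name, the sample counts, and the cutoff of 3 are assumptions chosen for illustration, not values taken from this article.

```python
import math

def refusal_drift(before_refusals, before_total, after_refusals, after_total):
    """Compare refusal rates across an update.

    Returns (change_in_rate, z_score) using a pooled two-proportion z-test.
    """
    p_before = before_refusals / before_total
    p_after = after_refusals / after_total
    pooled = (before_refusals + after_refusals) / (before_total + after_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / before_total + 1 / after_total))
    z = (p_after - p_before) / std_err if std_err > 0 else 0.0
    return p_after - p_before, z

# Hypothetical counts: refusals out of requests on a fixed evaluation set.
change, z = refusal_drift(before_refusals=120, before_total=1000,
                          after_refusals=60, after_total=1000)

# Assumed cutoff: treat |z| > 3 as drift worth a manual review.
needs_review = abs(z) > 3
```

A drop in refusals is not automatically bad, but a large unexplained shift in either direction is a reason to re-run guardrail tests before trusting the new behavior.
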
Make investigations easier
Store enough context for security operations to reconstruct an incident: which boundary was touched (prompt, context, or tool), the policy outcome, and a timeline of the interaction. Avoid storing sensitive data that the investigation does not need.
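
A record along these lines could look like the following sketch. The field names, the `SecurityEvent` and `TimelineEntry` classes, and the choice to store a SHA-256 fingerprint instead of the raw text are all assumptions made for illustration; they show one way to keep the three items above while limiting sensitive data.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

BOUNDARIES = {"prompt", "context", "tool"}

def fingerprint(text):
    """Hash content so events can be correlated without storing the raw text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class TimelineEntry:
    at: str      # UTC timestamp in ISO 8601 format
    event: str   # short description without user content

@dataclass
class SecurityEvent:
    interaction_id: str
    boundary: str          # one of BOUNDARIES
    policy_outcome: str    # e.g. "allowed", "blocked", "flagged" (hypothetical values)
    content_sha256: str    # fingerprint of the triggering content, not the content
    timeline: list = field(default_factory=list)

    def __post_init__(self):
        if self.boundary not in BOUNDARIES:
            raise ValueError(f"unknown boundary: {self.boundary}")

    def add(self, event):
        """Append a timestamped step to the interaction timeline."""
        now = datetime.now(timezone.utc).isoformat()
        self.timeline.append(TimelineEntry(at=now, event=event))

# Hypothetical incident: a blocked tool call.
event = SecurityEvent(
    interaction_id="conv-001",
    boundary="tool",
    policy_outcome="blocked",
    content_sha256=fingerprint("send the customer list to this address"),
)
event.add("tool call requested: send_email")
event.add("policy check denied the call")

record = json.dumps(asdict(event))  # ready to ship to a log pipeline
```

Storing a fingerprint lets analysts match repeated attempts across sessions. If raw content is needed for an investigation, it can be kept in a separate store with tighter access controls.
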