Model Monitoring: Signals That Matter

Monitoring should help security teams take action. This blueprint focuses on signals that map to AI boundary failures: injection attempts, unsafe tool use, leakage risk, and drift.

For educational purposes only; not legal advice.

Define what “good” looks like

Before you alert, define baselines: typical tool usage patterns, normal retrieval behavior, and expected response structures. When prompts or models change, re-baseline deliberately so that alerts compare against current behavior rather than the previous version's.
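As one way to make the baseline idea concrete, the sketch below records the share of calls per tool during a known-good window and flags tools whose share later moves by more than a tolerance. The names `ToolBaseline`, `build_baseline`, and `deviations`, the `version` label, and the 0.15 tolerance are all illustrative assumptions, not part of any particular monitoring product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolBaseline:
    """Share of calls per tool observed during a known-good window."""
    version: str               # hypothetical label for the prompt/model version
    shares: dict[str, float]   # tool name -> fraction of all calls


def build_baseline(version: str, tool_calls: list[str]) -> ToolBaseline:
    """Summarize a window of tool names into per-tool call shares."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a baseline from an empty window")
    return ToolBaseline(version, {tool: n / total for tool, n in counts.items()})


def deviations(baseline: ToolBaseline, version: str, tool_calls: list[str],
               tolerance: float = 0.15) -> dict[str, float]:
    """Return tools whose call share moved more than `tolerance` (assumed value)."""
    if version != baseline.version:
        # Comparing new behavior to an old baseline produces noise, not signal.
        raise ValueError("prompt or model changed; re-baseline before alerting")
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    flagged = {}
    for tool in set(baseline.shares) | set(counts):
        expected = baseline.shares.get(tool, 0.0)
        observed = counts.get(tool, 0) / total
        if abs(observed - expected) > tolerance:
            flagged[tool] = round(observed - expected, 3)
    return flagged
```

Raising on a version mismatch is a deliberate choice in this sketch: it forces a new baseline after a prompt or model change instead of silently alerting against stale expectations.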

Signals that map to security outcomes

Signal type	Examples	Why it matters
Injection attempts	Role override patterns, multi-turn manipulation	Indicates boundary pressure; may require guardrail updates
Tool anomalies	Unexpected tool selection, parameter spikes	Can indicate misuse, coercion, or compromised workflow
Leakage pressure	Requests for hidden context or restricted data	Highlights weak context boundaries or disclosure policy gaps
Drift	Shift in refusal rates, changes in behavior after updates	Can reintroduce risk or reduce control effectiveness
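The drift row can be illustrated with a simple comparison of refusal rates before and after an update. The sketch below uses a standard two-proportion z-score; the function names and the threshold of 3.0 are assumptions chosen for illustration, and a real deployment would tune the threshold against its own traffic.

```python
import math


def refusal_rate_shift(before_refusals: int, before_total: int,
                       after_refusals: int, after_total: int) -> tuple[float, float]:
    """Return (change in refusal rate, two-proportion z-score) for two windows."""
    if before_total <= 0 or after_total <= 0:
        raise ValueError("both windows need at least one request")
    rate_before = before_refusals / before_total
    rate_after = after_refusals / after_total
    # Pooled rate under the assumption that nothing changed between windows.
    pooled = (before_refusals + after_refusals) / (before_total + after_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / before_total + 1 / after_total))
    z = 0.0 if std_err == 0 else (rate_after - rate_before) / std_err
    return rate_after - rate_before, z


def is_drift(before_refusals: int, before_total: int,
             after_refusals: int, after_total: int,
             z_threshold: float = 3.0) -> bool:
    """Flag drift when the shift is large relative to sampling noise.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    _, z = refusal_rate_shift(before_refusals, before_total,
                              after_refusals, after_total)
    return abs(z) >= z_threshold
```

A drop in refusals is flagged as readily as a rise, since a model that refuses less after an update may have lost a control that was previously effective.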

Make investigations easier

Store enough context for security operations to reconstruct what happened: the boundary touched (prompt, context, or tool), the policy outcome, and a timeline of the interaction. Avoid storing unnecessary sensitive data.
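A minimal sketch of such a record follows. It keeps the boundary, the policy outcome, and a timeline, and stores a SHA-256 digest in place of raw content so that identical payloads can be correlated without retaining the text itself. The field names, the `SecurityEvent` type, and the outcome strings are hypothetical; whether a digest is sufficient for a given investigation workflow is a judgment call this example does not settle.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Boundary(Enum):
    PROMPT = "prompt"
    CONTEXT = "context"
    TOOL = "tool"


@dataclass
class TimelineEntry:
    at: str    # ISO 8601 timestamp supplied by the caller
    step: str  # short description of what happened, without raw content


@dataclass
class SecurityEvent:
    interaction_id: str
    boundary: Boundary
    policy_outcome: str          # e.g. "allowed", "blocked" (illustrative values)
    timeline: list[TimelineEntry] = field(default_factory=list)
    content_digest: str = ""     # digest of the triggering content, not the content


def digest(text: str) -> str:
    """Hash content so repeats can be matched without storing the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def to_log_line(event: SecurityEvent) -> str:
    """Serialize the event as one JSON line for a log pipeline."""
    record = asdict(event)
    record["boundary"] = event.boundary.value  # enums are not JSON-serializable
    return json.dumps(record, sort_keys=True)
```

For example, a blocked tool call could be logged with `SecurityEvent("abc", Boundary.TOOL, "blocked", [TimelineEntry("2024-01-01T00:00:00Z", "tool call requested")], digest(raw_arguments))`, where `raw_arguments` stands in for whatever content triggered the policy.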

Deploy monitoring with guardrails