RAG threat surface
RAG introduces new failure modes: retrieving sensitive documents, mixing trust levels, exposing proprietary context, or allowing untrusted sources to steer the model.
Four controls that matter
- Provenance & trust tiers: classify sources and restrict mixing across trust levels.
- Retrieval constraints: enforce query filters, document allowlists, and scope boundaries.
- Context sanitization: strip executable instructions, sensitive data, and unsafe segments.
- Response policy: restrict what the model can disclose and how it cites context.
Poisoning and instruction steering
A common pitfall is allowing retrieved text to behave like “instructions.” Treat retrieved content as untrusted input and apply sanitization before it reaches the model. This reduces instruction injection via documents.
Monitoring and evidence
Track retrieval outcomes by source tier, query patterns, and leakage risk. Record when controls block or redact content, and maintain an evidence trail for governance reviews.