System Design: Defense-in-Depth for LLM Safety
A practical, layered safety architecture for GenAI systems: where to place guardrails, how to keep latency low, and how to handle edge cases.
Problem Statement
LLM applications face prompt injection, policy violations, PII leakage, and unsafe outputs. A single safety layer is insufficient: attacks bypass heuristics, and models can hallucinate unsafe content even when inputs are clean.
The goal is to build a defense-in-depth system that:
- Catches obvious attacks fast
- Escalates only suspicious traffic to expensive checks
- Provides auditability and incident response
- Fails closed on safety errors
High-Level Architecture
Client
-> Input Validation (schema, size, encoding)
-> Tier 1 Heuristics (rules, regex, blocklists)
-> Tier 2 Fast ML (toxicity, injection classifier)
-> Tier 3 Deep ML (LLM judge / ensemble)
-> Model Inference
-> Output Filtering (PII redaction, policy checks)
-> Response
Observability + Audit Logs across all tiers
Key design principle: cheap checks first, expensive checks last.
Components
1) Input Validation
- Reject malformed requests early
- Enforce length limits, encoding, required fields
- Prevent resource exhaustion
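A minimal validation sketch, assuming a JSON request body with a single prompt field (the field name and size limit are illustrative, not part of the design above):

```python
# Hypothetical input-validation gate: reject structurally bad requests
# before any safety tier or model work is spent on them.
MAX_PROMPT_CHARS = 8_000  # illustrative limit

def validate_request(req: dict) -> tuple[bool, str]:
    """Return (ok, reason); fails fast on malformed input."""
    prompt = req.get("prompt")
    if not isinstance(prompt, str):
        return False, "missing or non-string 'prompt' field"
    try:
        prompt.encode("utf-8")  # reject lone surrogates / bad encoding early
    except UnicodeEncodeError:
        return False, "invalid encoding"
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt exceeds length limit"
    return True, "ok"
```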
2) Tier 1 Heuristics
- Regex patterns for known jailbreaks
- Blocklists for unsafe terms
- Bloom filter for known bad prompts
Latency: sub-millisecond
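A Tier 1 sketch: compiled regexes for known jailbreak phrasings plus an exact-match blocklist. The patterns and verdict labels here are illustrative; a real deployment would swap the in-memory set for a Bloom filter over hashed known-bad prompts to keep memory bounded.

```python
import re

# Illustrative patterns only -- production rule sets are much larger
# and maintained from incident data.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"pretend you have no restrictions", re.IGNORECASE),
]
BLOCKLIST = {"how to build a bomb"}  # stand-in for a Bloom filter

def tier1_verdict(prompt: str) -> str:
    """Return 'block', 'suspicious', or 'pass' in sub-millisecond time."""
    normalized = prompt.strip().lower()
    if normalized in BLOCKLIST:
        return "block"
    if any(p.search(prompt) for p in JAILBREAK_PATTERNS):
        return "suspicious"
    return "pass"
```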
3) Tier 2 Fast ML
- Small classifiers for toxicity, injection
- Run on GPU when available, CPU otherwise
- Sample traffic selectively (e.g., all suspicious requests plus ~30% of clean traffic, to measure Tier 1 misses)
Latency: 5-20ms
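The sampling rule above can be sketched as a single gate (the 30% rate is the example figure from this section, not a recommendation):

```python
import random

def should_run_fast_ml(tier1_verdict: str, sample_rate: float = 0.3) -> bool:
    """Run the fast classifier on all suspicious traffic, plus a random
    sample of clean traffic so Tier 1 false negatives stay measurable."""
    if tier1_verdict == "suspicious":
        return True
    return random.random() < sample_rate
```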
4) Tier 3 Deep ML
- Expensive LLM-based analysis
- Use only on edge cases
- Strict budget per request
Latency: 50-100ms (rarely triggered)
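One way to enforce the per-request budget is a deadline gate in front of the judge; note that exhausting the budget fails closed (block) rather than silently skipping the check. `judge_fn` and the deadline mechanics are assumptions for illustration:

```python
import time

def run_deep_check(prompt: str, deadline_s: float, judge_fn) -> str:
    """Invoke the expensive LLM judge only if latency budget remains.
    deadline_s is an absolute time.monotonic() deadline."""
    remaining = deadline_s - time.monotonic()
    if remaining <= 0:
        return "block"  # fail closed: no budget left for the safety check
    return judge_fn(prompt)
```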
5) Output Filtering
- Detect PII, unsafe output, policy violations
- Redact or block
- Record audit trail
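A redaction sketch covering two PII categories (email, US-style SSN). The patterns are deliberately narrow; production detectors combine broader regexes, NER models, and checksum validation, and the returned category list feeds the audit trail:

```python
import re

# Illustrative PII patterns -- not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace PII spans with placeholders; return the redacted text
    and the categories found, for audit logging."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found
```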
Key Data Flows
Normal (safe) flow
- Input validation passes
- Heuristics pass
- Skip ML tiers
- Output filter passes
- Response returned
Suspicious flow
- Heuristics trigger
- Fast ML invoked
- If borderline, deep ML invoked
- Output filter applied
- Response modified or blocked
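Both flows reduce to one escalation function, assuming each tier returns one of `'pass'`, `'suspicious'`, or `'block'` (the verdict vocabulary is an assumption of this sketch):

```python
def moderate(prompt: str, tier1, fast_ml, deep_ml) -> str:
    """Escalation skeleton: only borderline traffic reaches the
    expensive tier; clean traffic skips the ML tiers entirely."""
    v1 = tier1(prompt)
    if v1 == "block":
        return "block"
    if v1 == "pass":
        return "pass"            # normal flow: skip ML tiers
    v2 = fast_ml(prompt)         # suspicious flow
    if v2 != "suspicious":
        return v2
    return deep_ml(prompt)       # borderline only: deep ML decides
```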
Failure Handling
- Fail closed on safety service errors
- Provide clear rejection reasons to callers
- Log with request IDs for investigation
- Escalate to human review when confidence is low
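The fail-closed rule can be enforced with a wrapper around every safety check: an exception blocks the request rather than letting it through, and the request ID lands in the log for investigation. Details (UUID request IDs, string verdicts) are assumptions of the sketch:

```python
import logging
import uuid

def safe_call(check_fn, prompt: str) -> str:
    """Fail closed: a crashing safety check blocks the request
    instead of passing it through unchecked."""
    request_id = str(uuid.uuid4())
    try:
        return check_fn(prompt)
    except Exception:
        logging.exception("safety check failed, request_id=%s", request_id)
        return "block"
```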
Observability
Track:
- Safety verdict distribution
- False positive rate
- Latency per tier (p50/p95/p99)
- Trigger rates for each tier
- Block reasons by category
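A minimal in-process sketch of per-tier metrics; a real system would export these to a metrics backend (Prometheus or similar) rather than hold them in memory:

```python
import statistics
from collections import defaultdict

class TierMetrics:
    """Illustrative per-tier latency and verdict tracking."""

    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.verdicts = defaultdict(int)

    def record(self, tier: str, latency_ms: float, verdict: str) -> None:
        self.latencies_ms[tier].append(latency_ms)
        self.verdicts[f"{tier}:{verdict}"] += 1

    def p95(self, tier: str) -> float:
        # 19 cut points at 5% steps; the last one is the 95th percentile.
        return statistics.quantiles(self.latencies_ms[tier], n=20)[-1]
```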
Tradeoffs
- Latency vs coverage: more layers increase safety but add delay
- Cost vs accuracy: deep ML is expensive, limit its usage
- User friction: blocking outright vs rewriting or guiding the response toward a safe answer
Summary
Defense-in-depth provides resilience to evolving attack patterns. The system must be measurable, budget-aware, and designed to fail safe.
If you’re building GenAI systems, treat safety as architecture, not a bolt-on feature.