System Design: Defense-in-Depth for LLM Safety
A practical, layered safety architecture for GenAI systems: where to place guardrails, how to keep latency low, and how to handle edge cases.
Problem Statement
LLM applications face prompt injection, policy violations, PII leakage, and unsafe outputs. A single safety layer is insufficient: attacks bypass heuristics, and models can hallucinate unsafe content even when inputs are clean.
The goal is to build a defense-in-depth system that:
- Catches obvious attacks fast
- Escalates only suspicious traffic to expensive checks
- Provides auditability and incident response
- Fails closed on safety errors
High-Level Architecture
Client
-> Input Validation (schema, size, encoding)
-> Tier 1 Heuristics (rules, regex, blocklists)
-> Tier 2 Fast ML (toxicity, injection classifier)
-> Tier 3 Deep ML (LLM judge / ensemble)
-> Model Inference
-> Output Filtering (PII redaction, policy checks)
-> Response
Observability + Audit Logs across all tiers
Key design principle: cheap checks first, expensive checks last.
Components
1) Input Validation
- Reject malformed requests early
- Enforce length limits, encoding, required fields
- Prevent resource exhaustion
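A minimal validation sketch, assuming a JSON request body with a single prompt field (the field name and size limit are illustrative, not part of the design above):

```python
# Hypothetical input-validation gate: reject structurally bad requests
# before any safety tier or model work is spent on them.
MAX_PROMPT_CHARS = 8_000  # illustrative limit

def validate_request(req: dict) -> tuple[bool, str]:
    """Return (ok, reason); fails fast on malformed input."""
    prompt = req.get("prompt")
    if not isinstance(prompt, str):
        return False, "missing or non-string 'prompt' field"
    try:
        prompt.encode("utf-8")  # reject lone surrogates / bad encoding early
    except UnicodeEncodeError:
        return False, "invalid encoding"
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt exceeds length limit"
    return True, "ok"
```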
2) Tier 1 Heuristics
- Regex patterns for known jailbreaks
- Blocklists for unsafe terms
- Bloom filter for known bad prompts
Latency: sub-millisecond
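A Tier 1 sketch: compiled regexes for known jailbreak phrasings plus an exact-match blocklist. The patterns and verdict labels here are illustrative; a real deployment would swap the in-memory set for a Bloom filter over hashed known-bad prompts to keep memory bounded.

```python
import re

# Illustrative patterns only -- production rule sets are much larger
# and maintained from incident data.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"pretend you have no restrictions", re.IGNORECASE),
]
BLOCKLIST = {"how to build a bomb"}  # stand-in for a Bloom filter

def tier1_verdict(prompt: str) -> str:
    """Return 'block', 'suspicious', or 'pass' in sub-millisecond time."""
    normalized = prompt.strip().lower()
    if normalized in BLOCKLIST:
        return "block"
    if any(p.search(prompt) for p in JAILBREAK_PATTERNS):
        return "suspicious"
    return "pass"
```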
3) Tier 2 Fast ML
- Small classifiers for toxicity, injection
- Run on GPU when available, CPU otherwise
- Sample traffic selectively (e.g., all suspicious requests plus ~30% of clean traffic, to measure Tier 1 misses)
Latency: 5-20ms
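The sampling rule above can be sketched as a single gate (the 30% rate is the example figure from this section, not a recommendation):

```python
import random

def should_run_fast_ml(tier1_verdict: str, sample_rate: float = 0.3) -> bool:
    """Run the fast classifier on all suspicious traffic, plus a random
    sample of clean traffic so Tier 1 false negatives stay measurable."""
    if tier1_verdict == "suspicious":
        return True
    return random.random() < sample_rate
```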
4) Tier 3 Deep ML
- Expensive LLM-based analysis
- Use only on edge cases
- Strict budget per request
Latency: 50-100ms (rarely triggered)
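One way to enforce the per-request budget is a deadline gate in front of the judge; note that exhausting the budget fails closed (block) rather than silently skipping the check. `judge_fn` and the deadline mechanics are assumptions for illustration:

```python
import time

def run_deep_check(prompt: str, deadline_s: float, judge_fn) -> str:
    """Invoke the expensive LLM judge only if latency budget remains.
    deadline_s is an absolute time.monotonic() deadline."""
    remaining = deadline_s - time.monotonic()
    if remaining <= 0:
        return "block"  # fail closed: no budget left for the safety check
    return judge_fn(prompt)
```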
5) Output Filtering
- Detect PII, unsafe output, policy violations
- Redact or block
- Record audit trail
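A redaction sketch covering two PII categories (email, US-style SSN). The patterns are deliberately narrow; production detectors combine broader regexes, NER models, and checksum validation, and the returned category list feeds the audit trail:

```python
import re

# Illustrative PII patterns -- not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace PII spans with placeholders; return the redacted text
    and the categories found, for audit logging."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found
```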
Key Data Flows
Normal (safe) flow
- Input validation passes
- Heuristics pass
- Skip ML tiers
- Output filter passes
- Response returned
Suspicious flow
- Heuristics trigger
- Fast ML invoked
- If borderline, deep ML invoked
- Output filter applied
- Response modified or blocked
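Both flows reduce to one escalation function, assuming each tier returns one of `'pass'`, `'suspicious'`, or `'block'` (the verdict vocabulary is an assumption of this sketch):

```python
def moderate(prompt: str, tier1, fast_ml, deep_ml) -> str:
    """Escalation skeleton: only borderline traffic reaches the
    expensive tier; clean traffic skips the ML tiers entirely."""
    v1 = tier1(prompt)
    if v1 == "block":
        return "block"
    if v1 == "pass":
        return "pass"            # normal flow: skip ML tiers
    v2 = fast_ml(prompt)         # suspicious flow
    if v2 != "suspicious":
        return v2
    return deep_ml(prompt)       # borderline only: deep ML decides
```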
Failure Handling
- Fail closed on safety service errors
- Provide clear rejection reasons to callers
- Log with request IDs for investigation
- Escalate to human review when confidence is low
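The fail-closed rule can be enforced with a wrapper around every safety check: an exception blocks the request rather than letting it through, and the request ID lands in the log for investigation. Details (UUID request IDs, string verdicts) are assumptions of the sketch:

```python
import logging
import uuid

def safe_call(check_fn, prompt: str) -> str:
    """Fail closed: a crashing safety check blocks the request
    instead of passing it through unchecked."""
    request_id = str(uuid.uuid4())
    try:
        return check_fn(prompt)
    except Exception:
        logging.exception("safety check failed, request_id=%s", request_id)
        return "block"
```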
Observability
Track:
- Safety verdict distribution
- False positive rate
- Latency per tier (p50/p95/p99)
- Trigger rates for each tier
- Block reasons by category
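A minimal in-process sketch of per-tier metrics; a real system would export these to a metrics backend (Prometheus or similar) rather than hold them in memory:

```python
import statistics
from collections import defaultdict

class TierMetrics:
    """Illustrative per-tier latency and verdict tracking."""

    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.verdicts = defaultdict(int)

    def record(self, tier: str, latency_ms: float, verdict: str) -> None:
        self.latencies_ms[tier].append(latency_ms)
        self.verdicts[f"{tier}:{verdict}"] += 1

    def p95(self, tier: str) -> float:
        # 19 cut points at 5% steps; the last one is the 95th percentile.
        return statistics.quantiles(self.latencies_ms[tier], n=20)[-1]
```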
Tradeoffs
- Latency vs coverage: more layers increase safety but add delay
- Cost vs accuracy: deep ML is expensive, limit its usage
- User friction: blocking outright vs rewriting or guiding the response toward a safe answer
Summary
Defense-in-depth provides resilience to evolving attack patterns. The system must be measurable, budget-aware, and designed to fail safe.
If you’re building GenAI systems, treat safety as architecture, not a bolt-on feature.