3 min read

System Design: Defense-in-Depth for LLM Safety

A practical, layered safety architecture for GenAI systems: where to place guardrails, how to keep latency low, and how to handle edge cases.

#System Design #AI Safety #LLM #Architecture

Problem Statement

LLM applications face prompt injection, policy violations, PII leakage, and unsafe outputs. A single safety layer is insufficient: attacks bypass heuristics, and models can hallucinate unsafe content even when inputs are clean.

The goal is to build a defense-in-depth system that:

  • Catches obvious attacks fast
  • Escalates only suspicious traffic to expensive checks
  • Provides auditability and incident response
  • Fails closed on safety errors

High-Level Architecture

Client
  -> Input Validation (schema, size, encoding)
  -> Tier 1 Heuristics (rules, regex, blocklists)
  -> Tier 2 Fast ML (toxicity, injection classifier)
  -> Tier 3 Deep ML (LLM judge / ensemble)
  -> Model Inference
  -> Output Filtering (PII redaction, policy checks)
  -> Response

Observability + Audit Logs across all tiers

Key design principle: cheap checks first, expensive checks last.
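The tiered flow above can be sketched as an ordered chain of checks where a clear allow/block verdict short-circuits, and only "suspicious" results escalate to the next, more expensive tier. This is a minimal illustration with hypothetical names (`Verdict`, `CheckResult`, `tiered_check`), not a production implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    SUSPICIOUS = "suspicious"
    BLOCK = "block"

@dataclass
class CheckResult:
    verdict: Verdict
    reason: str = ""

def tiered_check(prompt: str, tiers) -> CheckResult:
    """Run checks cheapest-first; stop early on a clear ALLOW or BLOCK.

    `tiers` is an ordered list of callables: prompt -> CheckResult.
    A SUSPICIOUS verdict escalates to the next (more expensive) tier.
    """
    result = CheckResult(Verdict.ALLOW)
    for check in tiers:
        result = check(prompt)
        if result.verdict != Verdict.SUSPICIOUS:
            return result  # clear verdict: no need for pricier tiers
    # Still suspicious after the last tier: fail closed.
    return CheckResult(Verdict.BLOCK, reason="unresolved after all tiers")
```

Note the last line: exhausting all tiers without a clear verdict blocks the request, matching the fail-closed principle discussed later.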


Components

1) Input Validation

  • Reject malformed requests early
  • Enforce length limits, encoding, required fields
  • Prevent resource exhaustion
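A minimal sketch of what this stage might look like, assuming a JSON payload with a `prompt` field; the size and length limits are illustrative placeholders, not recommended values:

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # assumed hard cap; tune per product
MAX_PROMPT_CHARS = 8_000       # assumed limit; tune per product

def validate_request(raw: bytes) -> dict:
    """Reject malformed requests before any ML runs. Raises ValueError on failure."""
    if len(raw) > MAX_PAYLOAD_BYTES:  # resource exhaustion guard
        raise ValueError("payload too large")
    try:
        text = raw.decode("utf-8")  # enforce encoding
    except UnicodeDecodeError:
        raise ValueError("invalid UTF-8")
    body = json.loads(text)  # JSONDecodeError is a ValueError subclass
    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("missing required field: prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds length limit")
    return body
```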

2) Tier 1 Heuristics

  • Regex patterns for known jailbreaks
  • Blocklists for unsafe terms
  • Bloom filter for known bad prompts

Latency: sub-millisecond

3) Tier 2 Fast ML

  • Small classifiers for toxicity, injection
  • Run on GPU when available, CPU otherwise
  • Sample only when needed (e.g., 30% of clean traffic + all suspicious)

Latency: 5-20ms
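The sampling rule above (all suspicious traffic, a fraction of clean traffic) is a one-liner; the 30% rate is the example figure from the bullet, not a recommendation:

```python
import random

SAMPLE_RATE_CLEAN = 0.30  # example rate from above; tune against FP/FN budget

def should_run_fast_ml(suspicious: bool, rng: random.Random = random) -> bool:
    """All suspicious traffic is classified; clean traffic is sampled."""
    if suspicious:
        return True
    return rng.random() < SAMPLE_RATE_CLEAN
```

Sampling clean traffic gives an unbiased estimate of how much unsafe content slips past Tier 1, which feeds the false-positive/false-negative metrics in the observability section.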

4) Tier 3 Deep ML

  • Expensive LLM-based analysis
  • Use only on edge cases
  • Strict budget per request

Latency: 50-100ms (rarely triggered)

5) Output Filtering

  • Detect PII, unsafe output, policy violations
  • Redact or block
  • Record audit trail
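A minimal redaction sketch: typed placeholders preserve readability while the returned labels feed the audit trail. The regexes are toy examples; production PII detection typically combines patterns with dedicated NER models:

```python
import re

# Toy patterns for illustration; real detectors are far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matches with typed placeholders; return redacted text + audit labels."""
    found = []
    for label, pat in PII_PATTERNS.items():
        if pat.search(text):
            found.append(label)
            text = pat.sub(f"[REDACTED_{label.upper()}]", text)
    return text, found
```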

Key Data Flows

Normal (safe) flow

  • Input validation passes
  • Heuristics pass
  • Skip ML tiers
  • Output filter passes
  • Response returned

Suspicious flow

  • Heuristics trigger
  • Fast ML invoked
  • If borderline, deep ML invoked
  • Output filter applied
  • Response modified or blocked

Failure Handling

  • Fail closed on safety service errors
  • Provide clear rejection reasons to callers
  • Log with request IDs for investigation
  • Escalate to human review when confidence is low
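Failing closed means an error in the safety service itself is treated as a block, never an allow. A minimal wrapper, assuming checks return a verdict dict (in a real system the exception would also be logged with a request ID):

```python
def fail_closed(check, prompt: str) -> dict:
    """If a safety check itself errors, treat the request as blocked, not allowed."""
    try:
        return check(prompt)
    except Exception as exc:
        # A real system logs `exc` with the request ID before returning.
        return {"verdict": "block", "reason": f"safety_check_error:{type(exc).__name__}"}
```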

Observability

Track:

  • Safety verdict distribution
  • False positive rate
  • Latency per tier (p50/p95/p99)
  • Trigger rates for each tier
  • Block reasons by category
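As one concrete example, per-tier latency quantiles can be computed from recorded samples. This in-memory sketch (hypothetical `TierMetrics` class) stands in for whatever metrics backend the deployment actually uses:

```python
from collections import defaultdict

class TierMetrics:
    """Record per-tier latencies; report quantiles (p50/p95/p99) for dashboards."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, tier: str, latency_ms: float) -> None:
        self.samples[tier].append(latency_ms)

    def quantile(self, tier: str, q: float) -> float:
        """Nearest-rank quantile over recorded samples for `tier`."""
        xs = sorted(self.samples[tier])
        idx = min(int(q * len(xs)), len(xs) - 1)
        return xs[idx]
```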

Tradeoffs

  • Latency vs coverage: more layers increase safety but add delay
  • Cost vs accuracy: deep ML is expensive, limit its usage
  • User friction: blocking outright is safest but frustrates users; redacting or guiding them toward a safe rephrasing preserves the experience at some risk

Summary

Defense-in-depth provides resilience to evolving attack patterns. The system must be measurable, budget-aware, and designed to fail safe.

If you’re building GenAI systems, treat safety as architecture, not a bolt-on feature.