Guardrails

Guardrails are rules applied to the user's input, the model's output, or both (io) to enforce content policies, protect privacy, and ensure your AI applications behave safely and appropriately.


Execution Flow

Guardrails run in a defined order around the model call:

  1. Input Phase: input and io guardrails inspect user messages before they reach the model.
  2. Model Invocation: The AI model processes the (filtered) request.
  3. Output Phase: output and io guardrails inspect the model's response before it reaches the user.
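
For example, a single configuration can place guardrails at each phase. A minimal sketch, with illustrative guardrail names:

guardrails:
  - name: input_spam_filter        # inspected during the Input Phase only
    type: contains
    where: input
    behavior: block
    parameters:
      values: ["spam"]

  - name: output_shouting_check    # inspected during the Output Phase only
    type: regex
    where: output
    behavior: warn
    parameters:
      values: ['\b[A-Z]{2,}\b']

  - name: both_phases_check        # inspected during both phases
    type: contains
    where: io
    behavior: warn
    parameters:
      values: ["forbidden_word"]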

Guardrail Classes

Guardrails belong to one of two classes:

  • Check Guardrails — inspect content and, if triggered, can block, soft_block, or warn.
  • Redact Guardrails — modify content by masking or removing sensitive information. They always redact and do not require a behavior.
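
The difference shows up directly in configuration: a Check guardrail declares a behavior, while a Redact guardrail omits it. A minimal sketch using the Presidio types described below:

guardrails:
  - name: email_check              # Check: declares a behavior
    type: presidio_analyzer
    where: input
    behavior: block
    parameters:
      language: en
      entities: ["EMAIL_ADDRESS"]

  - name: email_redact             # Redact: no behavior, always redacts
    type: presidio_anonymizer
    where: input
    parameters:
      language: en
      entities: ["EMAIL_ADDRESS"]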

Configuration Fields

| Field | Description |
| --- | --- |
| name | A unique, descriptive identifier for the guardrail |
| type | The type of check to perform (see types below) |
| where | Where to apply the guardrail: input, output, or io (both) |
| behavior | Action to take if triggered: block, soft_block, or warn. Not required for Redact types. |
| response_message | Optional message returned to the user when the guardrail triggers |
| parameters | A dictionary of settings specific to the guardrail type |
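
Putting the fields together, a single guardrail entry might look like the following sketch (the name and values are illustrative):

guardrails:
  - name: profanity_filter                  # name: unique identifier
    type: contains                          # type: which check to run
    where: input                            # where: input, output, or io
    behavior: block                         # behavior: block, soft_block, or warn
    response_message: "Content blocked"     # response_message: returned when triggered
    parameters:                             # parameters: type-specific settings
      values: ["forbidden_word"]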

Guardrail Types at a Glance

| Type | Class | Description | parameters example |
| --- | --- | --- | --- |
| starts_with | Check | Text starts with specific string | values: ["Hello", "Hi"] |
| ends_with | Check | Text ends with specific string | values: ["?", "!"] |
| contains | Check | Text contains specific substring | values: ["forbidden_word"] |
| regex | Check | Text matches a regular expression | values: ['\b[A-Z]{2,}\b'] |
| presidio_analyzer | Check | Detects PII using Microsoft Presidio | language: en, entities: ["EMAIL_ADDRESS"] |
| presidio_anonymizer | Redact | Masks PII using Microsoft Presidio | language: en, entities: ["EMAIL_ADDRESS"] |
| judge | Check | Semantic evaluation via an LLM-as-a-Judge | prompt_ref: "toxicity_check.md", model_id: "gpt-4o-mini" |

Traditional Guardrails

Fast, rule-based filtering using pattern matching. These should be your first line of defense due to their low latency.

Starts With

guardrails:
  - name: greeting_check
    type: starts_with
    where: input
    behavior: warn
    parameters:
      values: ["Hello", "Hi", "Good morning"]

Ends With

guardrails:
  - name: question_check
    type: ends_with
    where: input
    behavior: soft_block
    parameters:
      values: ["?", "??", "???"]

Contains

guardrails:
  - name: profanity_filter
    type: contains
    where: input
    behavior: block
    parameters:
      values: ["inappropriate", "offensive", "spam"]
    response_message: "Content blocked due to inappropriate language"

Regex

guardrails:
  - name: email_detector
    type: regex
    where: input
    behavior: block
    parameters:
      values: ['\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b']
    response_message: "Email addresses are not allowed"

Presidio Guardrails

PII detection and anonymization powered by Microsoft Presidio.

Presidio Analyzer (Check)

Detects PII entities and blocks, soft-blocks, or warns depending on the configured behavior.

guardrails:
  - name: pii_detector
    type: presidio_analyzer
    where: input
    behavior: block
    parameters:
      language: en
      entities: ["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"]
    response_message: "Personal information detected and blocked"

Presidio Anonymizer (Redact)

Masks detected PII with placeholders (e.g., replaces an email with <EMAIL_ADDRESS>). No behavior needed — it always redacts.

guardrails:
  - name: pii_anonymizer
    type: presidio_anonymizer
    where: io
    parameters:
      language: en
      entities: ["EMAIL_ADDRESS", "PHONE_NUMBER", "IBAN_CODE", "IT_IDENTITY_CARD"]

LLM Judge Guardrails

Semantic evaluation using a language model as a "judge". The JudgeEngine passes the user input or model output through a prompt template, invokes the configured LLM, and parses the result into a structured verdict (is_triggered, optional reasoning).

How It Works

  1. The input or output text is extracted and normalized.
  2. The text is inserted into a prompt template (prompt_ref).
  3. The LLM defined in model_id is invoked via LangChain's init_chat_model().
  4. The response is parsed into a JudgeResult object.
  5. The is_triggered flag determines whether the guardrail fires.
  6. If the primary model fails, an optional fallback_model_id is tried automatically.
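
The parsed verdict is small. Based on the output schema used by the prompt templates (shown under Custom Prompt Templates below), a triggered JudgeResult might look like this (values are illustrative):

{
  "is_triggered": true,
  "reasoning": "The message contains a direct personal insult."
}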

Engine Components

| Component | Description |
| --- | --- |
| JudgePromptManager | Loads built-in and user-defined prompt templates |
| PromptTemplate | Defines structured LLM prompts with formatting and output schema |
| PydanticOutputParser | Parses model output into a JudgeResult object |
| Model Cache | Optimizes repeated model usage by caching initialized instances |
| Fallback Logic | Retries with a fallback model if the primary fails |

Built-in Prompt Templates

| Prompt | Purpose |
| --- | --- |
| toxicity_check.md | Detects offensive, abusive, or harmful content |
| business_context_check.md | Validates if the request aligns with your business domain |
| prompt_injection_check.md | Identifies prompt injection or jailbreak attempts |
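
Any of these templates can be referenced from a judge guardrail via prompt_ref. For example, a sketch wiring up the prompt-injection template (the model choice is illustrative):

guardrails:
  - name: injection_judge
    type: judge
    where: input
    behavior: block
    parameters:
      prompt_ref: "prompt_injection_check.md"
      model_id: "gpt-4o-mini"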

Custom Prompt Templates

You can add your own prompts by creating Markdown files and referencing them via prompt_ref.

1. Create your prompt file (e.g., custom_ethical_check.md):

You are a compliance officer ensuring that all AI responses adhere to ethical standards.
Evaluate the following user input and decide if it violates company ethical policies.

Return JSON:
{
  "is_triggered": boolean,
  "reasoning": string | null
}

2. Set the environment variable and mount the directory into the container:

  • Environment variable: JUDGE_PROMPTS_DIR=/radicalbit_ai_gateway/radicalbit_ai_gateway/guardrails/judges/custom-prompts
  • Mount: your local custom-prompts folder → /radicalbit_ai_gateway/radicalbit_ai_gateway/guardrails/judges/custom-prompts (read-only)
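
With Docker Compose, for example, the variable and the read-only mount might be wired up as follows (a sketch; the service name and image are placeholders):

services:
  gateway:                       # service name is a placeholder
    image: your-gateway-image    # replace with the actual gateway image
    environment:
      - JUDGE_PROMPTS_DIR=/radicalbit_ai_gateway/radicalbit_ai_gateway/guardrails/judges/custom-prompts
    volumes:
      - ./custom-prompts:/radicalbit_ai_gateway/radicalbit_ai_gateway/guardrails/judges/custom-prompts:ro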

Custom prompts are checked before built-in defaults. Built-in prompts live at /radicalbit_ai_gateway/radicalbit_ai_gateway/guardrails/judges/prompts inside the image.

Configuration Examples

Toxicity detection with fallback model:

guardrails:
  - name: toxicity_judge
    type: judge
    where: input
    behavior: block
    response_message: "🚨 Toxic content detected and blocked"
    parameters:
      prompt_ref: "toxicity_check.md"
      model_id: "gpt-4o-mini"
      temperature: 0.0
      max_tokens: 100
      fallback_model_id: "gpt-3.5-turbo"

Custom ethical policy check:

guardrails:
  - name: ethical_guardrail
    type: judge
    where: io
    behavior: block
    parameters:
      prompt_ref: "custom_ethical_check.md"
      model_id: "gpt-4o-mini"
      temperature: 0.3
      max_tokens: 150

Guardrail Behaviors

| Behavior | Action | Description |
| --- | --- | --- |
| block | 🚫 | Fully reject the request |
| soft_block | ⚠️ | Reject but return a user-friendly message in the response content |
| warn | 🟡 | Log a warning but allow the request to continue |

Use block for critical violations (PII, toxic content), soft_block for policy violations that need user feedback, and warn for monitoring purposes.
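
For instance, a domain-relevance check that should give the user feedback rather than fail silently might use soft_block. A sketch using the built-in business-context template (the message wording is illustrative):

guardrails:
  - name: off_topic_check
    type: judge
    where: input
    behavior: soft_block
    response_message: "This request is outside our supported business domain."
    parameters:
      prompt_ref: "business_context_check.md"
      model_id: "gpt-4o-mini"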


Layering Guardrails

Stack guardrails from fastest to slowest for optimal performance:

  1. Traditional Filters — fast rule-based screening
  2. Presidio Analysis — PII detection and masking
  3. LLM Judge — deep semantic safety validation

Layered configuration example:

guardrails:
  - name: basic_filter
    type: contains
    where: input
    behavior: warn
    parameters:
      values: ["spam", "scam"]

  - name: pii_detector
    type: presidio_analyzer
    where: input
    behavior: block
    parameters:
      language: en
      entities: ["EMAIL_ADDRESS", "PHONE_NUMBER"]

  - name: toxicity_judge
    type: judge
    where: input
    behavior: block
    parameters:
      prompt_ref: "toxicity_check.md"
      model_id: "gpt-4o-mini"

Monitoring

Track guardrail activity through the following metric:

| Metric | Description |
| --- | --- |
| gateway_guardrails_triggered_total | Number of times each guardrail was triggered |
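
When scraped by Prometheus, a sample of this counter might look like the following (hypothetical; the exact label set depends on your deployment, so check the gateway's metrics endpoint):

# hypothetical sample; label names may differ
gateway_guardrails_triggered_total{guardrail="pii_detector"} 42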

See the Monitoring guide for details on how to access and visualize these metrics.

