Advanced Configuration

info

This page is under development.

In this page we are going to explain how to configure the Gateway routes in all its component and features.

Routes
Models
Guardrails
Fallback
Caching
Rate Limiting
Token Limiting
Routing

The gateway's entire behavior is controlled by a single YAML configuration file. This file defines:

Reusable model definitions at top-level (chat_models, embedding_models)
Routes that reference models by model ID
Optional features applied per-route (guardrails, fallback, caching, limits, etc.)

Routes

The top-level key is routes. Each key under routes defines a separate API endpoint with its own independent configuration (e.g. customer-service, business-development, finance).

With the new structure:

chat_models inside a route is a list of model IDs (strings), not full model objects.
embedding_models inside a route (optional) is a list of embedding model IDs (strings).
Actual model configurations live at the top level under chat_models and embedding_models.

Example of config.yaml:

chat_models:
  - model_id: gpt-4o-mini
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY
    params:
      temperature: 0.7
      max_tokens: 300
    # Use either `prompt` OR `prompt_ref` (mutually exclusive)
    prompt_ref: "helpful_assistant.md"
    role: system

  - model_id: claude-3-sonnet
    model: anthropic/claude-3-5-sonnet-latest
    credentials:
      api_key: !secret ANTHROPIC_API_KEY
    params:
      temperature: 0.7
      max_tokens: 300
    prompt: "You are a financial advisor."
    role: system

embedding_models:
  - model_id: text-embedding-3-small
    model: openai/text-embedding-3-small
    credentials:
      api_key: !secret OPENAI_API_KEY

  - model_id: text-embedding-ada-002
    model: openai/text-embedding-ada-002
    credentials:
      api_key: !secret OPENAI_API_KEY

routes:
  customer-service:
    chat_models:
      - gpt-4o-mini

  business-development:
    chat_models:
      - gpt-4o-mini

  finance:
    chat_models:
      - claude-3-sonnet

  search-and-analytics:
    chat_models:
      - gpt-4o-mini
    embedding_models:
      - text-embedding-3-small
      - text-embedding-ada-002

Models

The Radicalbit AI Gateway supports all OpenAI-compatible models. This means that the gateway integrates with the vast majority of models on the market, both proprietary and open-source.

The Gateway supports the following types of models:

Chat Completion
Embeddings

With the new configuration layout, models are defined once at top-level and then referenced by ID inside routes.

Chat Completion

OpenAI
OpenAI-like
Gemini

chat_models:
  - model_id: openai-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
    params:
      temperature: 1
      max_tokens: 200
    # Use either `prompt` OR `prompt_ref` (mutually exclusive)
    prompt_ref: "helpful_assistant.md"
    role: system

routes:
  your-route:
    chat_models:
      - openai-4o

model_id: Unique identifier for the model (Required)
model: Model identifier in format provider/model_name (e.g., openai/gpt-4o) (Required)
credentials: API credentials for accessing the model
params: Model parameters (temperature, max_tokens, etc.)
retry_attempts: Number of retry attempts (default: 3)
prompt: Optional inline system/developer prompt (mutually exclusive with prompt_ref)
prompt_ref: Optional reference to a Markdown file containing the prompt
role: Role used when injecting prompt/prompt_ref (allowed: system or developer when prompt is set)
input_cost_per_million_tokens: Cost per million input tokens
output_cost_per_million_tokens: Cost per million output tokens

chat_models:
  - model_id: llama3
    model: openai/llama3.2:3b
    credentials:
      base_url: "http://host.docker.internal:11434/v1"
    params:
      temperature: 0.7
      top_p: 0.9
    prompt_ref: "assistant.md"
    role: system

routes:
  your-route:
    chat_models:
      - llama3

chat_models:
  - model_id: gemini-pro
    model: google-genai/gemini-2.5-flash
    credentials:
      api_key: !secret GOOGLE_API_KEY
    params:
      temperature: 0.7
      max_output_tokens: 1024
    prompt: "You are a helpful assistant powered by Google Gemini."
    role: system

routes:
  your-route:
    chat_models:
      - gemini-pro

File-based prompts (`prompt_ref`)

When using prompt_ref, the referenced Markdown file must be available inside the gateway container. Mount a host folder containing prompt files and configure the directories via environment variable:

PROMPTS_DIR: directory for chat model prompts (prompt_ref)

Example (docker compose snippet):

environment:
  PROMPTS_DIR: "/radicalbit_ai_gateway/radicalbit_ai_gateway/prompts"
volumes:
  - ${PROMPTS_HOST_DIR:-./prompts}:/radicalbit_ai_gateway/radicalbit_ai_gateway/prompts:ro

Embeddings

OpenAI
OpenAI-like
Gemini

embedding_models:
  - model_id: emb_model_for_caching
    model: openai/text-embedding-3-small
    credentials:
      api_key: !secret OPENAI_API_KEY

routes:
  your-route:
    embedding_models:
      - emb_model_for_caching

model_id: Unique identifier for the model (Required)
model: Model identifier in format provider/model_name (Required)
credentials: API credentials for accessing the model

embedding_models:
  - model_id: emb_model_for_caching
    model: openai/text-embedding-3-small
    credentials:
      base_url: "http://host.docker.internal:11434/v1"

routes:
  your-route:
    embedding_models:
      - emb_model_for_caching

embedding_models:
  - model_id: gemini-embedding
    model: google-genai/models/gemini-embedding-001
    credentials:
      api_key: !secret GOOGLE_API_KEY
    params:
      task_type: RETRIEVAL_QUERY  # Optional: RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING

routes:
  your-route:
    embedding_models:
      - gemini-embedding

Guardrails

Text Control

PII detection and masking

LLM-as-a-Judge

Fallback

Defines a chain of backup models to use if the primary model fails (e.g., due to an API error or downtime). The gateway will automatically try the fallbacks in the order they are listed.

target: The model_id of the primary model.
fallbacks: A list of model_ids to try in sequence if the target fails.
type (optional): Use embedding for embedding fallbacks.

Example (chat fallback):

routes:
  route-name:
    chat_models:
      - openai-4o
      - llama3.2
      - qwen
    fallback:
      - target: openai-4o
        fallbacks:
          - llama3.2
          - qwen

If a request is routed to openai-4o and it fails, the gateway will retry the same request with llama3.2. If llama3.2 also fails, it will try qwen.

Example (embedding fallback):

routes:
  route-name:
    embedding_models:
      - text-embedding-3-small
      - text-embedding-ada-002
    fallback:
      - target: text-embedding-3-small
        fallbacks:
          - text-embedding-ada-002
        type: embedding

Caching

Exact Cache

Exact caching serves identical requests from memory instead of calling the LLM again.

type: exact
ttl: Time-to-live in seconds

At the top level of the config.yaml, a cache object must be defined if any route has caching enabled.

Example:

routes:
  route-name:
    chat_models:
      - openai-4o
    caching:
      type: exact
      ttl: 300

cache:
  redis_host: "valkey"
  redis_port: 6379

Semantic Cache

Semantic cache retrieves responses based on similarity. The route must declare:

at least one chat_model
one embedding_model used to generate embeddings for cache lookup/storage

For each new request, the embedding model is invoked, and a similarity score is computed against stored vectors. If a cached entry exceeds similarity_threshold, the cached response is returned.

routes:
  your-route:
    chat_models:
      - openai-4o
    embedding_models:
      - text-embedding-3-small
    caching:
      type: semantic
      ttl: 60
      embedding_model_id: text-embedding-3-small
      similarity_threshold: 0.80
      distance_metric: cosine
      dim: 1536

cache:
  redis_host: "valkey"
  redis_port: 6379

ttl: The time-to-live (in seconds) for a cached entry.
type: The caching strategy (semantic enables vector-based caching).
embedding_model_id: The embedding model ID used to generate/compare embeddings.
similarity_threshold: Minimum similarity score to accept a cached match.
distance_metric: Similarity metric (cosine, euclidean, dot).
dim: Dimensionality of produced embeddings (must match the model output).

Rate Limiting

Controls the number of requests allowed over a time window for a given route.

algorithm: The limiting algorithm. Currently, only fixed_window is supported.
window_size: The duration of the time window (e.g., 1 minute, 120 seconds).
max_requests: The maximum number of requests allowed in that window.

Example:

routes:
  route-name:
    chat_models:
      - openai-4o
    rate_limiting:
      algorithm: fixed_window
      window_size: 1 minute
      max_requests: 20

Token Limiting

Controls the cumulative number of tokens processed for a route's inputs and outputs over a time window. This is excellent for managing costs.

It has two sections: input and output.

algorithm: The limiting algorithm (e.g., fixed_window).
window_size: The duration of the time window.
max_token: The total number of tokens that can be processed in that window.

Example:

routes:
  route-name:
    chat_models:
      - openai-4o
    token_limiting:
      input:
        window_size: 10 seconds
        max_token: 1000
      output:
        window_size: 10 minutes
        max_token: 500

Routing

Intelligent routing allows the gateway to dynamically select which model handles a request based on configurable rules. Routing configs are defined at top-level under routing and referenced by name from routes.

name: Unique identifier for the routing config.
type (optional): The routing strategy. Defaults to deterministic.
default_model_id: Model ID to use when no rule condition matches.
rule: The rule type to apply. One of: keyword, token_length, time, budget.
output_mapping: List of entries mapping conditions to model IDs.

Example (keyword routing):

routing:
  - name: keyword-routing
    type: deterministic
    default_model_id: gpt-4o-mini
    rule: keyword
    output_mapping:
      - model_id: gpt-4o
        conditions:
          - "urgent"
          - "complex"
      - model_id: gpt-4o-mini
        conditions:
          - "simple"

routes:
  customer-service:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: keyword-routing

If a user message contains "urgent" or "complex", the request is routed to gpt-4o. If it contains "simple", it goes to gpt-4o-mini. Otherwise, the default_model_id (gpt-4o-mini) is used.

Example (budget routing):

routing:
  - name: budget-routing
    type: deterministic
    default_model_id: gpt-4o
    rule: budget
    output_mapping:
      - model_id: gpt-4o-mini
        conditions:
          threshold: 0.8

routes:
  production:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: budget-routing
    budget_limiting:
      input:
        algorithm: fixed_window
        window_size: 1 hour
        max_budget: 50.0
      output:
        algorithm: fixed_window
        window_size: 1 hour
        max_budget: 100.0

The threshold is evaluated against the combined input + output budget. In this example, max_budget = $50 (input) + $100 (output) = $150 total. When more than 80% of that combined budget ($120+) has been consumed, requests are automatically routed to the cheaper gpt-4o-mini model.

For full details on all rule types (keyword, token length, time, budget), see the Intelligent Routing page.

Complete Configuration Example

This example showcases multiple routes and features using the new top-level model definitions:

guardrails:
  - name: presidio_anonymizer
    type: presidio_anonymizer
    description: Anonymize IBAN and emails codes
    where: io
    parameters:
      language: it
      entities:
        - EMAIL_ADDRESS
        - IBAN_CODE

  - name: presidio_analyzer
    type: presidio_analyzer
    description: Block italian Identity card
    where: input
    behavior: block
    parameters:
      language: it
      entities:
        - IT_IDENTITY_CARD

chat_models:
  - model_id: qwen
    model: openai/qwen2.5:3b
    credentials:
      base_url: "http://host.docker.internal:11434/v1"
    params:
      temperature: 0.7
      top_p: 0.9
    # Use either `prompt` OR `prompt_ref` (mutually exclusive)
    prompt_ref: "customer_service.md"
    role: system

  - model_id: llama3.2
    model: openai/llama3.2
    credentials:
      base_url: "http://host.docker.internal:11434/v1"
    prompt: "You are a helpful assistant and you are nice to the customer that you are facing. Do not take initiatives"
    role: system
    params:
      temperature: 0.7
      top_p: 0.9

embedding_models:
  - model_id: text-embedding-3-small
    model: openai/text-embedding-3-small
    credentials:
      api_key: "your-api-key"

  - model_id: text-embedding-ada-002
    model: openai/text-embedding-ada-002
    credentials:
      api_key: "your-api-key"

routing:
  - name: keyword-routing
    type: deterministic
    default_model_id: qwen
    rule: keyword
    output_mapping:
      - model_id: llama3.2
        conditions:
          - "finance"
          - "budget"
      - model_id: qwen
        conditions:
          - "support"
          - "help"

routes:
  customer-service:
    chat_models:
      - qwen
      - llama3.2
    embedding_models:
      - text-embedding-3-small
      - text-embedding-ada-002
    guardrails:
      - presidio_analyzer
      - presidio_anonymizer
    fallback:
      - target: qwen
        fallbacks:
          - llama3.2
      - target: text-embedding-3-small
        fallbacks:
          - text-embedding-ada-002
        type: embedding
    rate_limiting:
      algorithm: fixed_window
      window_size: 30 seconds
      max_requests: 20

  business-development:
    chat_models:
      - qwen
      - llama3.2
    routing: keyword-routing
    rate_limiting:
      algorithm: fixed_window
      window_size: 20 seconds
      max_requests: 2
    token_limiting:
      input:
        window_size: 10 seconds
        max_token: 5

  finance:
    chat_models:
      - llama3.2
    caching:
      type: exact
      ttl: 300

cache:
  redis_host: "valkey"
  redis_port: 6379

Routes​

Models​

Chat Completion​

File-based prompts (prompt_ref)​

Embeddings​

Guardrails​

Text Control​

PII detection and masking​

LLM-as-a-Judge​

Fallback​

Caching​

Exact Cache​

Semantic Cache​

Rate Limiting​

Token Limiting​

Routing​

Complete Configuration Example​