Advanced Configuration
This page is under development.
In this page we are going to explain how to configure the Gateway routes in all its component and features.
The gateway's entire behavior is controlled by a single YAML configuration file. This file defines:
- Reusable model definitions at top-level (
chat_models,embedding_models) - Routes that reference models by model ID
- Optional features applied per-route (guardrails, fallback, caching, limits, etc.)
Routes
The top-level key is routes. Each key under routes defines a separate API endpoint with its own independent configuration (e.g. customer-service, business-development, finance).
With the new structure:
chat_modelsinside a route is a list of model IDs (strings), not full model objects.embedding_modelsinside a route (optional) is a list of embedding model IDs (strings).- Actual model configurations live at the top level under
chat_modelsandembedding_models.
Example of config.yaml:
chat_models:
- model_id: gpt-4o-mini
model: openai/gpt-4o-mini
credentials:
api_key: !secret OPENAI_API_KEY
params:
temperature: 0.7
max_tokens: 300
# Use either `prompt` OR `prompt_ref` (mutually exclusive)
prompt_ref: "helpful_assistant.md"
role: system
- model_id: claude-3-sonnet
model: anthropic/claude-3-5-sonnet-latest
credentials:
api_key: !secret ANTHROPIC_API_KEY
params:
temperature: 0.7
max_tokens: 300
prompt: "You are a financial advisor."
role: system
embedding_models:
- model_id: text-embedding-3-small
model: openai/text-embedding-3-small
credentials:
api_key: !secret OPENAI_API_KEY
- model_id: text-embedding-ada-002
model: openai/text-embedding-ada-002
credentials:
api_key: !secret OPENAI_API_KEY
routes:
customer-service:
chat_models:
- gpt-4o-mini
business-development:
chat_models:
- gpt-4o-mini
finance:
chat_models:
- claude-3-sonnet
search-and-analytics:
chat_models:
- gpt-4o-mini
embedding_models:
- text-embedding-3-small
- text-embedding-ada-002
Models
The Radicalbit AI Gateway supports all OpenAI-compatible models. This means that the gateway integrates with the vast majority of models on the market, both proprietary and open-source.
The Gateway supports the following types of models:
- Chat Completion
- Embeddings
With the new configuration layout, models are defined once at top-level and then referenced by ID inside routes.
Chat Completion
- OpenAI
- OpenAI-like
- Gemini
chat_models:
- model_id: openai-4o
model: openai/gpt-4o
credentials:
api_key: !secret OPENAI_API_KEY
params:
temperature: 1
max_tokens: 200
# Use either `prompt` OR `prompt_ref` (mutually exclusive)
prompt_ref: "helpful_assistant.md"
role: system
routes:
your-route:
chat_models:
- openai-4o
model_id: Unique identifier for the model (Required)model: Model identifier in formatprovider/model_name(e.g.,openai/gpt-4o) (Required)credentials: API credentials for accessing the modelparams: Model parameters (temperature, max_tokens, etc.)retry_attempts: Number of retry attempts (default: 3)prompt: Optional inline system/developer prompt (mutually exclusive withprompt_ref)prompt_ref: Optional reference to a Markdown file containing the promptrole: Role used when injectingprompt/prompt_ref(allowed: system or developer when prompt is set)input_cost_per_million_tokens: Cost per million input tokensoutput_cost_per_million_tokens: Cost per million output tokens
chat_models:
- model_id: llama3
model: openai/llama3.2:3b
credentials:
base_url: "http://host.docker.internal:11434/v1"
params:
temperature: 0.7
top_p: 0.9
prompt_ref: "assistant.md"
role: system
routes:
your-route:
chat_models:
- llama3
chat_models:
- model_id: gemini-pro
model: google-genai/gemini-2.5-flash
credentials:
api_key: !secret GOOGLE_API_KEY
params:
temperature: 0.7
max_output_tokens: 1024
prompt: "You are a helpful assistant powered by Google Gemini."
role: system
routes:
your-route:
chat_models:
- gemini-pro
File-based prompts (prompt_ref)
When using prompt_ref, the referenced Markdown file must be available inside the gateway container.
Mount a host folder containing prompt files and configure the directories via environment variable:
PROMPTS_DIR: directory for chat model prompts (prompt_ref)
Example (docker compose snippet):
environment:
PROMPTS_DIR: "/radicalbit_ai_gateway/radicalbit_ai_gateway/prompts"
volumes:
- ${PROMPTS_HOST_DIR:-./prompts}:/radicalbit_ai_gateway/radicalbit_ai_gateway/prompts:ro
Embeddings
- OpenAI
- OpenAI-like
- Gemini
embedding_models:
- model_id: emb_model_for_caching
model: openai/text-embedding-3-small
credentials:
api_key: !secret OPENAI_API_KEY
routes:
your-route:
embedding_models:
- emb_model_for_caching
model_id: Unique identifier for the model (Required)model: Model identifier in formatprovider/model_name(Required)credentials: API credentials for accessing the model
embedding_models:
- model_id: emb_model_for_caching
model: openai/text-embedding-3-small
credentials:
base_url: "http://host.docker.internal:11434/v1"
routes:
your-route:
embedding_models:
- emb_model_for_caching
embedding_models:
- model_id: gemini-embedding
model: google-genai/models/gemini-embedding-001
credentials:
api_key: !secret GOOGLE_API_KEY
params:
task_type: RETRIEVAL_QUERY # Optional: RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING
routes:
your-route:
embedding_models:
- gemini-embedding
Guardrails
Text Control
PII detection and masking
LLM-as-a-Judge
Fallback
Defines a chain of backup models to use if the primary model fails (e.g., due to an API error or downtime). The gateway will automatically try the fallbacks in the order they are listed.
target: Themodel_idof the primary model.fallbacks: A list ofmodel_ids to try in sequence if thetargetfails.type(optional): Useembeddingfor embedding fallbacks.
Example (chat fallback):
routes:
route-name:
chat_models:
- openai-4o
- llama3.2
- qwen
fallback:
- target: openai-4o
fallbacks:
- llama3.2
- qwen
If a request is routed to openai-4o and it fails, the gateway will retry the same request with llama3.2. If llama3.2 also fails, it will try qwen.
Example (embedding fallback):
routes:
route-name:
embedding_models:
- text-embedding-3-small
- text-embedding-ada-002
fallback:
- target: text-embedding-3-small
fallbacks:
- text-embedding-ada-002
type: embedding
Caching
Exact Cache
Exact caching serves identical requests from memory instead of calling the LLM again.
type:exactttl: Time-to-live in seconds
At the top level of the config.yaml, a cache object must be defined if any route has caching enabled.
Example:
routes:
route-name:
chat_models:
- openai-4o
caching:
type: exact
ttl: 300
cache:
redis_host: "valkey"
redis_port: 6379
Semantic Cache
Semantic cache retrieves responses based on similarity. The route must declare:
- at least one
chat_model - one
embedding_modelused to generate embeddings for cache lookup/storage
For each new request, the embedding model is invoked, and a similarity score is computed against stored vectors. If a cached entry exceeds similarity_threshold, the cached response is returned.
routes:
your-route:
chat_models:
- openai-4o
embedding_models:
- text-embedding-3-small
caching:
type: semantic
ttl: 60
embedding_model_id: text-embedding-3-small
similarity_threshold: 0.80
distance_metric: cosine
dim: 1536
cache:
redis_host: "valkey"
redis_port: 6379
ttl: The time-to-live (in seconds) for a cached entry.type: The caching strategy (semanticenables vector-based caching).embedding_model_id: The embedding model ID used to generate/compare embeddings.similarity_threshold: Minimum similarity score to accept a cached match.distance_metric: Similarity metric (cosine,euclidean,dot).dim: Dimensionality of produced embeddings (must match the model output).
Rate Limiting
Controls the number of requests allowed over a time window for a given route.
algorithm: The limiting algorithm. Currently, onlyfixed_windowis supported.window_size: The duration of the time window (e.g.,1 minute,120 seconds).max_requests: The maximum number of requests allowed in that window.
Example:
routes:
route-name:
chat_models:
- openai-4o
rate_limiting:
algorithm: fixed_window
window_size: 1 minute
max_requests: 20
Token Limiting
Controls the cumulative number of tokens processed for a route's inputs and outputs over a time window. This is excellent for managing costs.
It has two sections: input and output.
algorithm: The limiting algorithm (e.g.,fixed_window).window_size: The duration of the time window.max_token: The total number of tokens that can be processed in that window.
Example:
routes:
route-name:
chat_models:
- openai-4o
token_limiting:
input:
window_size: 10 seconds
max_token: 1000
output:
window_size: 10 minutes
max_token: 500
Routing
Intelligent routing allows the gateway to dynamically select which model handles a request based on configurable rules. Routing configs are defined at top-level under routing and referenced by name from routes.
name: Unique identifier for the routing config.type(optional): The routing strategy. Defaults todeterministic.default_model_id: Model ID to use when no rule condition matches.rule: The rule type to apply. One of:keyword,token_length,time,budget.output_mapping: List of entries mapping conditions to model IDs.
Example (keyword routing):
routing:
- name: keyword-routing
type: deterministic
default_model_id: gpt-4o-mini
rule: keyword
output_mapping:
- model_id: gpt-4o
conditions:
- "urgent"
- "complex"
- model_id: gpt-4o-mini
conditions:
- "simple"
routes:
customer-service:
chat_models:
- gpt-4o
- gpt-4o-mini
routing: keyword-routing
If a user message contains "urgent" or "complex", the request is routed to gpt-4o. If it contains "simple", it goes to gpt-4o-mini. Otherwise, the default_model_id (gpt-4o-mini) is used.
Example (budget routing):
routing:
- name: budget-routing
type: deterministic
default_model_id: gpt-4o
rule: budget
output_mapping:
- model_id: gpt-4o-mini
conditions:
threshold: 0.8
routes:
production:
chat_models:
- gpt-4o
- gpt-4o-mini
routing: budget-routing
budget_limiting:
input:
algorithm: fixed_window
window_size: 1 hour
max_budget: 50.0
output:
algorithm: fixed_window
window_size: 1 hour
max_budget: 100.0
The threshold is evaluated against the combined input + output budget. In this example, max_budget = $50 (input) + $100 (output) = $150 total. When more than 80% of that combined budget ($120+) has been consumed, requests are automatically routed to the cheaper gpt-4o-mini model.
For full details on all rule types (keyword, token length, time, budget), see the Intelligent Routing page.
Complete Configuration Example
This example showcases multiple routes and features using the new top-level model definitions:
guardrails:
- name: presidio_anonymizer
type: presidio_anonymizer
description: Anonymize IBAN and emails codes
where: io
parameters:
language: it
entities:
- EMAIL_ADDRESS
- IBAN_CODE
- name: presidio_analyzer
type: presidio_analyzer
description: Block italian Identity card
where: input
behavior: block
parameters:
language: it
entities:
- IT_IDENTITY_CARD
chat_models:
- model_id: qwen
model: openai/qwen2.5:3b
credentials:
base_url: "http://host.docker.internal:11434/v1"
params:
temperature: 0.7
top_p: 0.9
# Use either `prompt` OR `prompt_ref` (mutually exclusive)
prompt_ref: "customer_service.md"
role: system
- model_id: llama3.2
model: openai/llama3.2
credentials:
base_url: "http://host.docker.internal:11434/v1"
prompt: "You are a helpful assistant and you are nice to the customer that you are facing. Do not take initiatives"
role: system
params:
temperature: 0.7
top_p: 0.9
embedding_models:
- model_id: text-embedding-3-small
model: openai/text-embedding-3-small
credentials:
api_key: "your-api-key"
- model_id: text-embedding-ada-002
model: openai/text-embedding-ada-002
credentials:
api_key: "your-api-key"
routing:
- name: keyword-routing
type: deterministic
default_model_id: qwen
rule: keyword
output_mapping:
- model_id: llama3.2
conditions:
- "finance"
- "budget"
- model_id: qwen
conditions:
- "support"
- "help"
routes:
customer-service:
chat_models:
- qwen
- llama3.2
embedding_models:
- text-embedding-3-small
- text-embedding-ada-002
guardrails:
- presidio_analyzer
- presidio_anonymizer
fallback:
- target: qwen
fallbacks:
- llama3.2
- target: text-embedding-3-small
fallbacks:
- text-embedding-ada-002
type: embedding
rate_limiting:
algorithm: fixed_window
window_size: 30 seconds
max_requests: 20
business-development:
chat_models:
- qwen
- llama3.2
routing: keyword-routing
rate_limiting:
algorithm: fixed_window
window_size: 20 seconds
max_requests: 2
token_limiting:
input:
window_size: 10 seconds
max_token: 5
finance:
chat_models:
- llama3.2
caching:
type: exact
ttl: 300
cache:
redis_host: "valkey"
redis_port: 6379