# Intelligent Routing

This page covers intelligent routing configuration for the Radicalbit AI Gateway, enabling dynamic model selection based on configurable rules.
## Overview

Intelligent routing in the Radicalbit AI Gateway allows you to automatically select which model handles a request based on rule-based logic. Instead of always routing to a fixed model, the gateway evaluates incoming requests against configurable rules (such as keywords in the user message, token count per message or per conversation, time of day, or budget consumption) and directs each request to the most appropriate model.
There are three routing categories:

| Category | `type` value | Decision basis | Added latency |
|---|---|---|---|
| Deterministic | `deterministic` | Rule evaluated locally | Negligible |
| Text Classification | `text_classification` | External ML classifier over HTTP | HTTP call latency |
| Semantic | `semantic` | Embedding similarity against example utterances | Embedding call latency (startup + per query) |
## Deterministic Routing

The deterministic strategy uses rule-based logic to select a model for each request. Each routing config defines a single `rule` type and a list of `output_mapping` entries that map conditions to specific models. When a request arrives, the rule is evaluated against the mapping entries, and the first match determines the model. If no condition matches, the `default_model_id` is used.
### Configuration Structure

Routing is defined as a top-level key in the gateway configuration:

```yaml
routing:
  - name: my-routing-rule
    type: deterministic
    default_model_id: gpt-4o-mini
    rule: keyword
    output_mapping:
      - model_id: gpt-4o
        conditions:
          - "urgent"
          - "complex"
```
A route references a routing config by its `name`:

```yaml
routes:
  customer-service:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: my-routing-rule
```
### Routing Config Fields

#### Required Fields

- `name`: Unique identifier for the routing config (used by routes to reference it)
- `default_model_id`: Model ID to use when no rule condition matches
- `rule`: The rule type to apply. One of: `keyword`, `token_length`, `context_length`, `time`, `budget`
- `output_mapping`: List of entries mapping conditions to model IDs

#### Optional Fields

- `type`: The routing strategy. Defaults to `deterministic`

### Output Mapping Fields

Each entry in `output_mapping` defines a model and the conditions under which it is selected:

- `model_id`: The model ID to select if the conditions match (must reference a top-level `chat_models` entry)
- `conditions`: Rule-specific conditions (format varies by rule type; see sections below)
### Keyword Rule

The keyword rule matches keywords against user messages and selects a model based on the first match.

**How it works:** The gateway extracts the last user message (lowercased), then checks each `output_mapping` entry in order. If any keyword from an entry's `conditions` is found as a substring in that message, that entry's `model_id` is selected. First match wins.

**Conditions type:** `list[str]` (a list of keyword strings)
```yaml
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: gpt-4o-mini
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY

routing:
  - name: keyword-routing
    type: deterministic
    default_model_id: gpt-4o-mini
    rule: keyword
    output_mapping:
      - model_id: gpt-4o
        conditions:
          - "urgent"
          - "complex"
      - model_id: gpt-4o-mini
        conditions:
          - "simple"

routes:
  customer-service:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: keyword-routing
```
Keyword matching is case-insensitive and uses substring matching. A keyword "urgent" will match messages containing "urgent", "URGENT", or "urgently".
**Behavior:**

- Only the last user message is evaluated; keywords in earlier messages do not affect routing
- Entries are evaluated in order; the first keyword match across all entries wins
- If no keyword matches any entry, `default_model_id` is used
- Supports both plain string and multipart content message formats
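The first-match logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the gateway's actual implementation: `output_mapping` is modeled as a plain list of dicts, and messages are assumed to carry plain-string content.

```python
def route_by_keyword(messages, output_mapping, default_model_id):
    """Pick a model by substring-matching keywords in the last user message."""
    # Find the last user message (assumes plain-string content for simplicity)
    last_user = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
    ).lower()
    for entry in output_mapping:  # entry order matters: first match wins
        if any(kw.lower() in last_user for kw in entry["conditions"]):
            return entry["model_id"]
    return default_model_id  # no keyword matched anywhere

messages = [{"role": "user", "content": "This is URGENT, please help"}]
mapping = [
    {"model_id": "gpt-4o", "conditions": ["urgent", "complex"]},
    {"model_id": "gpt-4o-mini", "conditions": ["simple"]},
]
print(route_by_keyword(messages, mapping, "gpt-4o-mini"))  # → gpt-4o
```

Note how case-insensitive substring matching means "URGENT" triggers the `urgent` keyword, mirroring the behavior documented above.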
### Token Length Rule

The token length rule routes requests based on the number of tokens in the last user message, allowing you to send longer or more complex prompts to more capable models.

**How it works:** The gateway extracts the last user message and counts its tokens using the default model's tokenizer. Each `output_mapping` entry specifies a condition (`gte`, greater than or equal; `lte`, less than or equal; or `between`, an inclusive range), and the first matching entry determines the model.

**Conditions type:** `TokenLengthConditions`, an object with exactly one of the following fields:

| Field | Type | Description |
|---|---|---|
| `gte` | int | Matches when token count ≥ the value |
| `lte` | int | Matches when token count ≤ the value |
| `between` | `[int, int]` | Matches when token count is within the inclusive range `[min, max]` |

Each entry must set exactly one condition (`gte`, `lte`, or `between`). Setting none or more than one will cause a validation error. For `between`, the first value must be ≤ the second value.
**Example: Tiered routing with all condition types**

```yaml
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: gpt-4o-mini
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: gpt-4.1
    model: openai/gpt-4.1
    credentials:
      api_key: !secret OPENAI_API_KEY

routing:
  - name: token-length-routing
    type: deterministic
    default_model_id: gpt-4o
    rule: token_length
    output_mapping:
      - model_id: gpt-4o-mini
        conditions:
          lte: 999  # Short messages → lightweight model
      - model_id: gpt-4.1
        conditions:
          between: [1000, 4999]  # Medium messages → mid-tier model
      - model_id: gpt-4o
        conditions:
          gte: 5000  # Long messages → most capable model

routes:
  production:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
      - gpt-4.1
    routing: token-length-routing
```
**Behavior:**

- Only the last user message is evaluated; earlier messages and system messages do not affect the token count
- Each entry is checked against the token count using its condition type (`gte`, `lte`, or `between`)
- For the example above: a message with 500 tokens routes to `gpt-4o-mini`, 2500 tokens routes to `gpt-4.1`, and 6000 tokens routes to `gpt-4o`
- If no condition matches, `default_model_id` is used
#### Validation Rules

The gateway validates your configuration at startup and rejects invalid setups:

- Each entry must have exactly one of `gte`, `lte`, or `between` set
- `between` ranges must have `between[0] ≤ between[1]`
- `between` ranges must not overlap with other `between` ranges or with `gte`/`lte` conditions
- Multiple `gte` or multiple `lte` entries are allowed (the router sorts them deterministically)
If you need to route based on the total token count of the entire conversation (including system and assistant messages), use the `context_length` rule instead.
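The condition-matching semantics can be sketched as follows. This is an illustrative stand-in for the gateway's logic, using plain dicts to model `TokenLengthConditions` and the tiered example config from above:

```python
def matches(condition, token_count):
    """Check a TokenLengthConditions-style dict against a token count."""
    if "gte" in condition:
        return token_count >= condition["gte"]
    if "lte" in condition:
        return token_count <= condition["lte"]
    lo, hi = condition["between"]  # inclusive range
    return lo <= token_count <= hi

def route_by_token_length(token_count, output_mapping, default_model_id):
    for entry in output_mapping:  # first matching entry wins
        if matches(entry["conditions"], token_count):
            return entry["model_id"]
    return default_model_id

tiers = [
    {"model_id": "gpt-4o-mini", "conditions": {"lte": 999}},
    {"model_id": "gpt-4.1", "conditions": {"between": [1000, 4999]}},
    {"model_id": "gpt-4o", "conditions": {"gte": 5000}},
]
print(route_by_token_length(500, tiers, "gpt-4o"))   # → gpt-4o-mini
print(route_by_token_length(2500, tiers, "gpt-4o"))  # → gpt-4.1
print(route_by_token_length(6000, tiers, "gpt-4o"))  # → gpt-4o
```

Because the startup validation forbids overlapping ranges, at most one entry can match a given count, so "first match wins" and "the only match wins" coincide here.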
### Context Length Rule

The context length rule routes requests based on the total token count of the entire conversation, including system messages, assistant messages, and all user messages. This is useful when you want routing decisions to reflect the full context window usage, not just the latest message.

**How it works:** The gateway concatenates the content of all messages in the conversation and counts the total tokens using the default model's tokenizer. Each `output_mapping` entry specifies a condition (`gte`, `lte`, or `between`), and the first matching entry determines the model. The conditions work identically to the token length rule.

**Conditions type:** `TokenLengthConditions`, an object with exactly one of `gte`, `lte`, or `between` (same as `token_length`)
```yaml
chat_models:
  - model_id: deepseek-chat
    model: deepseek/deepseek-chat
    credentials:
      api_key: !secret DEEPSEEK_API_KEY
  - model_id: gpt-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: claude-long-context
    model: anthropic/claude-3-5-sonnet-latest
    credentials:
      api_key: !secret ANTHROPIC_API_KEY

routing:
  - name: context-length-routing
    type: deterministic
    default_model_id: deepseek-chat
    rule: context_length
    output_mapping:
      - model_id: gpt-4o
        conditions:
          between: [2000, 7999]  # Medium conversations → standard model
      - model_id: claude-long-context
        conditions:
          gte: 8000  # Long conversations → large context model

routes:
  production:
    chat_models:
      - deepseek-chat
      - gpt-4o
      - claude-long-context
    routing: context-length-routing
```
**Behavior:**

- All messages are included in the token count: system messages, assistant messages, and all user messages (not just the last one)
- Each entry is checked against the total token count using its condition type (`gte`, `lte`, or `between`)
- For the example above: a conversation with 10,000 total tokens routes to `claude-long-context`, one with 3,000 tokens routes to `gpt-4o`, and one with 500 tokens uses the default `deepseek-chat`
- If no condition matches, `default_model_id` is used
- The same validation rules apply as for token length (exactly one condition per entry, no overlapping `between` ranges)
Use `context_length` when conversations grow over time and you want to automatically escalate to models with larger context windows. Use `token_length` when you want routing based solely on the complexity of the current user message.
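The only difference from `token_length` is what gets counted. A minimal sketch of the counting step, using a naive whitespace tokenizer as a stand-in for the default model's real tokenizer:

```python
def conversation_token_count(messages, count_tokens):
    """Total token count over the concatenated content of all messages."""
    return count_tokens(" ".join(m["content"] for m in messages))

# Naive stand-in; the gateway uses the default model's actual tokenizer
count_tokens = lambda text: len(text.split())

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Summarize our discussion so far"},
    {"role": "assistant", "content": "Sure, here is a summary"},
]
print(conversation_token_count(messages, count_tokens))  # → 15
```

The resulting count then flows through the same condition matching as the token length rule.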
### Time Rule

The time rule routes requests based on the current time using cron expressions, enabling different models for different time windows (e.g., business hours vs. off-hours).

**How it works:** The gateway evaluates the current UTC time against the cron expressions in each `output_mapping` entry. The first entry with a matching cron expression is selected.

**Conditions type:** `list[str]` (a list of cron expression strings)
```yaml
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: gpt-4o-mini
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY

routing:
  - name: time-routing
    type: deterministic
    default_model_id: gpt-4o-mini
    rule: time
    output_mapping:
      - model_id: gpt-4o
        conditions:
          - "0 9-17 * * 1-5"  # Business hours: Mon-Fri, 9 AM - 5 PM UTC
      - model_id: gpt-4o-mini
        conditions:
          - "0 0-8,18-23 * * *"  # Off-hours

routes:
  production:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: time-routing
```
All cron expressions are evaluated in UTC. Make sure to adjust your schedules accordingly if your users are in different time zones.
**Behavior:**

- Entries are evaluated in order; the first matching cron expression wins
- If no cron expression matches the current time, `default_model_id` is used
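To illustrate how a UTC time can be tested against the five cron fields, here is a deliberately minimal matcher. It is a hypothetical helper, not the gateway's parser: it handles only `*`, single values, ranges (`9-17`), and comma lists (`0-8,18-23`), and real cron semantics (step values like `*/5`, day-of-month/day-of-week combination rules) are not covered.

```python
from datetime import datetime, timezone

def field_matches(field, value):
    """Match one cron field ('*', '9-17', '0-8,18-23', '5') against a value."""
    if field == "*":
        return True
    for part in field.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr, dt):
    """Check a 5-field cron expression against a datetime (cron: 0 = Sunday)."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # Python Monday=0 → cron Sunday=0
    return (field_matches(minute, dt.minute) and field_matches(hour, dt.hour)
            and field_matches(dom, dt.day) and field_matches(month, dt.month)
            and field_matches(dow, cron_dow))

# 2024-01-02 is a Tuesday; 10:00 UTC falls inside the business-hours window
dt = datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)
print(cron_matches("0 9-17 * * 1-5", dt))      # → True
print(cron_matches("0 0-8,18-23 * * *", dt))   # → False
```

Note that the example schedules use `0` in the minute field; how the gateway treats sub-hour granularity is an implementation detail not shown here.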
### Budget Rule

The budget rule routes requests based on the current budget consumption ratio, allowing you to switch to cheaper models as spending approaches the budget limit.

**How it works:** The gateway tracks cumulative spending (input and output token costs combined) against the single `max_budget` configured on the route. It calculates `usage_ratio = 1 - (remaining_budget / max_budget)`, then sorts `output_mapping` entries by `threshold` descending and selects the first entry whose `threshold` is less than or equal to the usage ratio.

**Conditions type:** `BudgetConditions`, an object with a `threshold` field (float, `0.0` to `1.0`)

The budget rule requires `budget_limiting` to be configured on the route. Without it, the rule will fall back to `default_model_id`.
```yaml
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
    input_cost_per_million_tokens: 5.0
    output_cost_per_million_tokens: 15.0
  - model_id: gpt-4o-mini
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY
    input_cost_per_million_tokens: 0.15
    output_cost_per_million_tokens: 0.6

routing:
  - name: budget-routing
    type: deterministic
    default_model_id: gpt-4o
    rule: budget
    output_mapping:
      - model_id: gpt-4o-mini
        conditions:
          threshold: 0.8  # When > 80% of combined (input + output) budget used, switch to cheaper model

routes:
  production:
    chat_models:
      - gpt-4o
      - gpt-4o-mini
    routing: budget-routing
    budget_limiting:
      algorithm: fixed_window
      window_size: 1 hour
      max_budget: 150.0
```
**Behavior:**

- Entries are sorted by `threshold` descending (highest first); the highest threshold that the usage ratio meets or exceeds wins
- The usage ratio is computed against the single `max_budget` value. In the example above, a `threshold` of `0.8` triggers when $120 or more of the $150 budget has been spent (input + output costs combined)
- If no threshold is met, or if no budget limiter is configured, `default_model_id` is used
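The threshold selection can be sketched as follows. This is a simplified stand-in for the gateway's logic (function name and dict shapes are illustrative), using the $150 budget from the example above:

```python
def route_by_budget(remaining_budget, max_budget, output_mapping, default_model_id):
    """Pick the entry with the highest threshold the usage ratio has reached."""
    if max_budget is None:  # no budget limiter configured on the route
        return default_model_id
    usage_ratio = 1 - (remaining_budget / max_budget)
    # Sort thresholds descending so the most aggressive downgrade wins first
    for entry in sorted(output_mapping,
                        key=lambda e: e["conditions"]["threshold"], reverse=True):
        if entry["conditions"]["threshold"] <= usage_ratio:
            return entry["model_id"]
    return default_model_id  # no threshold reached yet

mapping = [{"model_id": "gpt-4o-mini", "conditions": {"threshold": 0.8}}]
print(route_by_budget(30.0, 150.0, mapping, "gpt-4o"))   # → gpt-4o-mini ($120 spent, ratio 0.8)
print(route_by_budget(100.0, 150.0, mapping, "gpt-4o"))  # → gpt-4o (ratio ≈ 0.33)
```

With multiple entries (e.g., thresholds `0.5` and `0.8`), the descending sort guarantees that once spending crosses 80%, the `0.8` entry takes precedence over the `0.5` one.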
## Text Classification Routing

Text classification routing delegates the routing decision to an external ML model (e.g., a classifier deployed via MLflow). The gateway sends the last user message to a configurable HTTP endpoint and maps the returned class label to a model.

**How it works:** For each incoming request, the gateway extracts the last human message and POSTs it to the configured `url`. The classifier responds with a predicted class. The gateway looks up the class in `output_mapping.conditions` and routes to the matching model. If the class is not found, or if the HTTP call fails (timeout or error response), the gateway falls back silently to `default_model_id`.
### Configuration Structure

```yaml
routing:
  - name: classification-routing
    type: text_classification
    url: http://text-classifier:8888  # HTTP endpoint of the classifier
    timeout: 5.0  # Request timeout in seconds (default: 5.0)
    default_model_id: fallback-model
    output_mapping:
      - model_id: model-a
        conditions:
          - CLASS_A
      - model_id: model-b
        conditions:
          - CLASS_B
          - CLASS_C

routes:
  my-route:
    chat_models:
      - model-a
      - model-b
      - fallback-model
    routing: classification-routing
```
### HTTP Contract

The gateway sends a POST request to the configured `url` using MLflow's `dataframe_records` format:

**Request body:**

```json
{
  "dataframe_records": [{ "inputs": "<last user message>" }]
}
```
**Expected response body:**

```json
{
  "predictions": [
    {
      "class": "CLASS_A",
      "score": 0.95
    }
  ]
}
```
The gateway extracts the `class` field from the first element of `predictions` and looks it up in the `output_mapping` conditions.

If the classifier times out, returns an HTTP error, or returns a class not present in any `output_mapping` entry, the gateway falls back silently to `default_model_id`. No error is returned to the client.
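The response handling side of this contract can be sketched like this. It is an illustrative helper, not the gateway's code; the HTTP call itself is omitted so the sketch focuses purely on parsing the response body and mapping the class to a model:

```python
import json

def pick_model(response_body, output_mapping, default_model_id):
    """Map a classifier response to a model_id, falling back on any problem."""
    try:
        label = json.loads(response_body)["predictions"][0]["class"]
    except (json.JSONDecodeError, KeyError, IndexError):
        return default_model_id  # malformed response → silent fallback
    for entry in output_mapping:
        if label in entry["conditions"]:
            return entry["model_id"]
    return default_model_id  # class not in any output_mapping entry

mapping = [
    {"model_id": "model-a", "conditions": ["CLASS_A"]},
    {"model_id": "model-b", "conditions": ["CLASS_B", "CLASS_C"]},
]
body = '{"predictions": [{"class": "CLASS_B", "score": 0.95}]}'
print(pick_model(body, mapping, "fallback-model"))  # → model-b
```

Timeouts and HTTP error statuses would be caught around the request itself and take the same silent-fallback path.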
### Routing Config Fields

#### Required Fields

- `name`: Unique identifier for the routing config
- `type`: Must be `text_classification`
- `url`: Full HTTP URL of the classifier endpoint
- `default_model_id`: Model ID to use when no class matches or on error
- `output_mapping`: List of entries mapping class labels to model IDs

#### Optional Fields

- `timeout`: HTTP request timeout in seconds (default: `5.0`)
### Full Example

```yaml
chat_models:
  - model_id: sentiment-positive-model
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: sentiment-negative-model
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: fallback-model
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY

routing:
  - name: sentiment-routing
    type: text_classification
    url: http://sentiment-classifier:8888
    timeout: 3.0
    default_model_id: fallback-model
    output_mapping:
      - model_id: sentiment-positive-model
        conditions:
          - SAT
      - model_id: sentiment-negative-model
        conditions:
          - WRONG_ANSWER
          - NEED_CLARIFICATION

routes:
  feedback:
    chat_models:
      - sentiment-positive-model
      - sentiment-negative-model
      - fallback-model
    routing: sentiment-routing
```
Text classification routing is ideal for intent detection or sentiment-based routing using a custom ML model trained on your domain data (e.g., classifying user feedback as `SAT`, `NEED_CLARIFICATION`, or `WRONG_ANSWER`).
**Behavior:**

- The gateway evaluates the last user message only
- The `conditions` list for each `output_mapping` entry contains the class labels that should map to that model; multiple labels can share a model
- Entry order does not matter for class lookup (unlike keyword/time rules)
- On any failure (timeout, HTTP error, unknown class), `default_model_id` is used
## Semantic Routing
Semantic routing uses embedding similarity to route requests to the most appropriate model. Instead of matching keywords or calling an external classifier, it compares the user's message against pre-computed example utterances using vector similarity.
This is ideal when you want intent-based routing without deploying a separate ML classifier — you simply provide example utterances for each model, and the gateway handles the rest.
### How It Works

Semantic routing operates in two phases:

**Initialization phase (runs once at startup):**

- For each model in `output_mapping`, the gateway takes the list of example utterances from `conditions`
- Embeds every utterance using the model referenced by `embedding_model_id`
- Computes a normalized centroid vector (average of embeddings) per model

**Query phase (runs per request):**

- Extracts the last human message from the conversation
- Embeds it using the same embedding model
- Computes cosine similarity between the message embedding and every stored centroid
- If the highest similarity score exceeds `similarity_threshold`, routes to that model
- Otherwise, routes to `default_model_id`
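The centroid and similarity math behind these two phases can be sketched with plain Python lists standing in for real embedding vectors. This is a simplified illustration (tiny 2-dimensional vectors instead of real embedding-model output, and hypothetical function names):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Normalized average of a model's example-utterance embeddings."""
    dim = len(vectors[0])
    avg = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(avg)

def route_semantic(query_vec, centroids, threshold, default_model_id):
    """Route to the highest-scoring centroid if it exceeds the threshold."""
    q = normalize(query_vec)
    best_model, best_score = default_model_id, threshold
    for model_id, c in centroids.items():  # insertion order breaks ties
        score = sum(a * b for a, b in zip(q, c))  # cosine sim of unit vectors
        if score > best_score:
            best_model, best_score = model_id, score
    return best_model

# Initialization phase: one centroid per output_mapping entry
centroids = {
    "code-model": centroid([[1.0, 0.1], [0.9, 0.2]]),
    "general-model": centroid([[0.1, 1.0], [0.2, 0.9]]),
}
# Query phase: a "code-like" embedding lands near the code-model centroid
print(route_semantic([0.95, 0.15], centroids, 0.35, "default-model"))  # → code-model
print(route_semantic([-1.0, -1.0], centroids, 0.35, "default-model")) # → default-model
```

Because both the query vector and the centroids are normalized, the dot product equals cosine similarity, which is why only one multiplication pass per centroid is needed at query time.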
### Configuration Structure

```yaml
embedding_models:
  - model_id: text-embedding-3-small
    model: openai/text-embedding-3-small
    credentials:
      api_key: !secret OPENAI_API_KEY

chat_models:
  - model_id: code-model
    model: openai/gpt-4o
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: general-model
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY
  - model_id: default-model
    model: openai/gpt-4o-mini
    credentials:
      api_key: !secret OPENAI_API_KEY

routing:
  - name: intent-routing
    type: semantic
    default_model_id: default-model
    embedding_model_id: text-embedding-3-small
    similarity_threshold: 0.35
    output_mapping:
      - model_id: code-model
        conditions:
          - "write a python function"
          - "debug this code"
          - "explain this algorithm"
          - "refactor this class"
      - model_id: general-model
        conditions:
          - "what is the weather"
          - "tell me a joke"
          - "summarize this article"

routes:
  production:
    chat_models:
      - code-model
      - general-model
      - default-model
    routing: intent-routing
```
### Routing Config Fields

#### Required Fields

- `name`: Unique identifier for the routing config
- `type`: Must be `semantic`
- `default_model_id`: Model ID used when no centroid exceeds the similarity threshold, or if initialization fails
- `embedding_model_id`: References a top-level `embedding_models` entry used for embedding both utterances and queries
- `output_mapping`: List of entries mapping example utterances to model IDs

#### Optional Fields

- `similarity_threshold`: Cosine similarity threshold (default: `0.35`, range: `0.0`–`1.0`). A message must exceed this score against a centroid to be routed to that model

### Output Mapping Fields

Each entry in `output_mapping`:

- `model_id`: The model to route to (must reference a top-level `chat_models` entry)
- `conditions`: `list[str]` of example utterances representing the kind of messages this model should handle. These are embedded at startup to form the centroid
The embedding model referenced by `embedding_model_id` must be defined in the top-level `embedding_models` section. If the embedding model is unreachable or initialization fails, the gateway logs a warning and falls back to `default_model_id` for all requests.

Write `conditions` that are representative of real user messages. More diverse examples produce a better centroid and more accurate routing. Aim for 5–10 examples per model covering the range of expected intents.
**Behavior:**

- Only the last human message is evaluated
- Initialization is asynchronous at startup; if it fails, the gateway falls back to `default_model_id` with a warning (no error returned to clients)
- At query time, cosine similarity is computed against all centroids; the highest-scoring centroid wins if it exceeds the threshold
- Entry order in `output_mapping` does not matter; selection is purely by similarity score
- If two centroids have the same score, the first one in `output_mapping` order wins
## Configuration Reference

### Routing Config

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Unique name for the routing config |
| `type` | string | Yes | Routing strategy: `deterministic`, `text_classification`, or `semantic` |
| `default_model_id` | string | Yes | Fallback model ID when no rule matches or on error |
| `rule` | string | Deterministic only | Rule type: `keyword`, `token_length`, `context_length`, `time`, or `budget` |
| `url` | string | Text classification only | HTTP endpoint of the external classifier |
| `timeout` | float | No | Classifier request timeout in seconds (default: `5.0`) |
| `embedding_model_id` | string | Semantic only | References a top-level `embedding_models` entry for embedding utterances and queries |
| `similarity_threshold` | float | No | Cosine similarity threshold (default: `0.35`, range: `0.0`–`1.0`). Semantic only |
| `output_mapping` | list | Yes | List of condition-to-model mappings |
### Output Mapping Entry

| Field | Type | Required | Description |
|---|---|---|---|
| `model_id` | string | Yes | Model ID to select when conditions match |
| `conditions` | varies | Yes | Rule-specific conditions (see below) |
### Conditions by Rule / Type

| Routing type / Rule | Conditions Type | Format |
|---|---|---|
| `deterministic` / `keyword` | `list[str]` | List of keyword strings |
| `deterministic` / `token_length` | `TokenLengthConditions` | Object with exactly one of: `gte: int`, `lte: int`, or `between: [int, int]` |
| `deterministic` / `context_length` | `TokenLengthConditions` | Object with exactly one of: `gte: int`, `lte: int`, or `between: [int, int]` |
| `deterministic` / `time` | `list[str]` | List of cron expressions |
| `deterministic` / `budget` | `BudgetConditions` | Object with `threshold: float` (`0.0`–`1.0`) |
| `text_classification` | `list[str]` | List of class label strings returned by the classifier |
| `semantic` | `list[str]` | List of example utterances used to compute the centroid for the model |
## Best Practices

### Rule Selection
- Use keyword routing when request intent is clearly expressed in the message content
- Use token length routing to send complex, long prompts to more capable models based on the current message
- Use context length routing to escalate to larger context window models as conversations grow over time
- Use time routing to optimize costs during off-peak hours
- Use budget routing to gracefully degrade to cheaper models as spending increases
- Use text classification routing when intent detection requires an ML model — e.g., sentiment analysis, topic classification, or domain-specific labelling
- Use semantic routing when you want intent-based model selection without deploying an external classifier — just provide example utterances per model
### Model Configuration

- Ensure all models referenced in `output_mapping` and `default_model_id` are defined in `chat_models`
- All models used in routing must also be listed in the route's `chat_models`
- Configure cost information on models when using budget routing
- For semantic routing, ensure the embedding model referenced by `embedding_model_id` is defined in top-level `embedding_models`
### General Tips

- Keep `output_mapping` entries ordered intentionally; entry order matters for keyword and time rules
- Test routing rules in non-production environments before deploying
- Use descriptive `name` values for routing configs (e.g., `keyword-routing`, `budget-aware-routing`)
- For semantic routing, write diverse, representative example utterances (5–10 per model); the quality of routing depends on how well the centroids represent each intent cluster
- Set `similarity_threshold` conservatively (`0.3`–`0.5`) to start, then adjust based on how many requests fall through to the default model
## Troubleshooting

### Common Issues

- **Model Not Found**: Ensure all `model_id` values in `output_mapping` and `default_model_id` exist in the top-level `chat_models` and are referenced in the route's `chat_models` list
- **Budget Rule Not Working**: Verify that `budget_limiting` is configured on the route. Without it, the budget rule always falls back to `default_model_id`
- **Time Rule Not Matching**: Cron expressions are evaluated in UTC. Double-check that your expressions account for the correct time zone offset
- **Unexpected Model Selection**: For keyword and time rules, the first match wins. Review the order of your `output_mapping` entries
- **Text Classification Always Using Fallback**: Check that the classifier is reachable from the gateway (correct `url`, network connectivity). Verify the response contains a `predictions[0].class` field and that its value matches a label defined in `output_mapping.conditions`. Increase `timeout` if the classifier is slow to respond
- **Semantic Routing Always Using Default Model**: Check that the embedding model referenced by `embedding_model_id` is defined in top-level `embedding_models` and is reachable. Review gateway startup logs for initialization warnings. Verify that `similarity_threshold` is not set too high; try lowering it (e.g., from `0.8` to `0.6`) and inspect whether any centroid scores appear in debug logs
- **Semantic Routing Selecting the Wrong Model**: Improve `conditions` in `output_mapping` by adding more diverse example utterances that better represent the target intent. Ensure conditions across different models are sufficiently distinct (overlapping utterance themes produce overlapping centroids)
## Next Steps

- **Fallback**: Set up automatic failover when models fail
- **Budget Limiting**: Configure budget limits (required for budget routing)
- **Semantic Caching**: Another embedding-based feature for caching similar requests
- **Advanced Configuration**: Enterprise configuration options