Caching
This page covers caching configuration and features in the Radicalbit AI Gateway.
Overview
Caching in the Radicalbit AI Gateway improves performance and reduces costs by storing responses for frequently repeated requests.
The gateway supports two caching strategies:
- Semantic Cache: similarity-based caching using embeddings
- Exact Cache: exact-match caching for identical requests
The gateway supports Redis (recommended for distributed deployments) and can also work without Redis configuration depending on the deployment (e.g. single-instance/in-memory setups).
Cache Backends
Redis Caching
Persistent caching using Redis:
cache:
redis_host: localhost
redis_port: 6379
In-Memory Caching
For single-instance deployments, caching may work without a Redis cache section.
If you deploy multiple gateway replicas, Redis (or a shared backend) is strongly recommended to keep cache consistent across instances.
Route-Level Caching
Caching is configured per-route via the caching section.
With the new config structure:
- Models are defined at top-level (
chat_models,embedding_models) - Routes reference models by model ID (strings)
Exact Cache (Basic)
chat_models:
- model_id: gpt-4o
model: openai/gpt-4o
routes:
production:
chat_models:
- gpt-4o
caching:
type: exact
ttl: 3600 # Cache for 1 hour
Exact Cache (Advanced)
routes:
production:
chat_models:
- gpt-4o
caching:
type: exact
ttl: 7200 # Cache for 2 hours
Semantic Cache
Semantic cache retrieves responses based on similarity instead of exact textual matching. For semantic caching you must configure:
- at least one
chat_modelin the route - at least one
embedding_modelin the route caching.type: semanticcaching.embedding_model_idpointing to one of the route embedding model IDs
Semantic Cache Example
chat_models:
- model_id: assistant
model: openai/gpt-4o
embedding_models:
- model_id: text-embedding-3-small
model: openai/text-embedding-3-small
routes:
semantic-cache-demo:
chat_models:
- assistant
embedding_models:
- text-embedding-3-small
caching:
type: semantic
ttl: 120
embedding_model_id: text-embedding-3-small
similarity_threshold: 0.85
distance_metric: cosine
dim: 1536
Semantic Cache Fields
type: must besemanticttl: time-to-live in secondsembedding_model_id: embedding model ID used to build/compare vectorssimilarity_threshold: minimum similarity score to accept a cached matchdistance_metric:cosine,euclidean, ordotdim: embedding vector dimensionality (must match the model output)
Global Cache Settings
If you use Redis, declare it at top-level:
cache:
redis_host: redis-server
redis_port: 6379
Cache Keys
Cache keys are automatically generated based on:
- Route name
- Request content (and relevant configuration) hashed
For semantic cache, the gateway also stores/queries the embedding vectors for similarity search.
Best Practices
TTL Configuration
- Short TTL (300-1800s): dynamic content
- Medium TTL (1800-7200s): semi-static content
- Long TTL (7200-86400s): static content
Memory Management
- Use appropriate TTL values
- Prefer Redis for distributed deployments
Performance Optimization
- Enable exact cache for highly repetitive requests
- Use semantic cache for “similar but not identical” questions
- Monitor cache hit rates and adjust
similarity_threshold
Monitoring
Cache Metrics
- Cache Hit Counter: number of times the gateway hits the cache
- (optional) cache miss counters / latency distribution depending on your metrics setup
Troubleshooting
Common Issues
- Cache Not Working: verify
caching.typeis set on the route and Redis is reachable (if configured) - High Memory Usage: reduce TTL and monitor cache size
- Redis Connection Issues: check Redis server status and network configuration
- Semantic Cache Always Misses: validate
embedding_model_id,dim, and tunesimilarity_threshold
Next Steps
- Fallback - Set up automatic failover
- Monitoring - Set up observability and metrics
- API Reference - Complete API documentation