Rate Limiting
This page covers rate limiting configuration and features in the Radicalbit AI Gateway.
Overview
Rate limiting in the Radicalbit AI Gateway controls the number of requests that can be made within a specific time window, helping to manage costs and prevent abuse.
With the new configuration structure:
- Models are defined at the top level (chat_models, embedding_models)
- Routes reference models by model ID (strings)
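Because routes refer to models only by ID, a typo in a route's chat_models list points at a model that is not defined. The gateway may already reject such a configuration itself. The sketch below is only an illustration. It assumes the YAML has been loaded into plain Python dicts (for example with a YAML parser) and checks chat_models only. The function name undefined_model_refs is hypothetical.

```python
def undefined_model_refs(config: dict) -> dict[str, list[str]]:
    """Hypothetical check: per route, list chat model IDs not defined at top level."""
    # Collect every model_id declared in the top-level chat_models list
    defined = {m["model_id"] for m in config.get("chat_models", [])}
    missing: dict[str, list[str]] = {}
    for route_name, route in config.get("routes", {}).items():
        # Routes reference models by ID string; flag any ID with no definition
        unknown = [mid for mid in route.get("chat_models", []) if mid not in defined]
        if unknown:
            missing[route_name] = unknown
    return missing


# Dict form of a config shaped like the examples on this page
config = {
    "chat_models": [{"model_id": "gpt-4o", "model": "openai/gpt-4o"}],
    "routes": {
        "production": {"chat_models": ["gpt-4o"]},
        "staging": {"chat_models": ["gpt-4o-mini"]},  # not defined above
    },
}
print(undefined_model_refs(config))  # {'staging': ['gpt-4o-mini']}
```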
Rate Limiting Types
Request Rate Limiting
Limit the number of requests per time window:
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
routes:
  production:
    chat_models:
      - gpt-4o
    rate_limiting:
      algorithm: fixed_window
      window_size: 1 minute
      max_requests: 100
Rate Limiting Algorithms
Fixed Window
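The gateway uses the fixed window algorithm for rate limiting. Requests are counted within consecutive windows of length window_size; once max_requests is reached, further requests are rejected until the next window begins: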
rate_limiting:
  algorithm: fixed_window
  window_size: 1 minute
  max_requests: 100
Note: Other algorithms (sliding_window, sliding_window_counter) may exist in the schema, but if they are not implemented by your gateway version, behavior will still be fixed-window.
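A minimal sketch of how a fixed-window counter behaves, assuming a single in-memory counter per route in one process. The class name FixedWindowLimiter, the injectable clock, and the alignment of windows to multiples of window_seconds are assumptions for illustration. The gateway's actual storage, keying, and window alignment are not documented here.

```python
import time
from typing import Callable, Optional


class FixedWindowLimiter:
    """Hypothetical single-process, in-memory fixed-window request counter."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.current_window: Optional[int] = None
        self.count = 0

    def allow(self) -> bool:
        """Return True and count the request if the current window has room."""
        # Windows are aligned to multiples of window_seconds on the clock
        window = int(self.clock() // self.window_seconds)
        if window != self.current_window:
            # A new window has started: the counter resets to zero
            self.current_window = window
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False  # a gateway would answer this request with HTTP 429


# Usage with a fake clock so the example is deterministic
now = [0.0]
limiter = FixedWindowLimiter(max_requests=3, window_seconds=60, clock=lambda: now[0])
print([limiter.allow() for _ in range(4)])  # [True, True, True, False]
now[0] = 60.0  # first instant of the next window
print(limiter.allow())  # True
```

The counter only needs one integer and one window index per key, which is why fixed windows are cheap; the trade-off is that bursts can straddle a window boundary.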
Configuration Examples
Basic Rate Limiting
chat_models:
  - model_id: gpt-3.5-turbo
    model: openai/gpt-3.5-turbo
routes:
  api:
    chat_models:
      - gpt-3.5-turbo
    rate_limiting:
      algorithm: fixed_window
      window_size: 1 minute
      max_requests: 60 # ~1 request per second
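The "~1 request per second" comment is simply max_requests divided by the window length in seconds. The helper below makes that arithmetic explicit. The exact window_size grammar the gateway accepts is not documented on this page; window_to_seconds is a hypothetical parser that recognises only "<number> second(s)/minute(s)/hour(s)/day(s)".

```python
import re

# Hypothetical unit table; the gateway's accepted spellings may differ
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def window_to_seconds(window_size: str) -> int:
    """Convert strings like '1 minute' or '5 minutes' to a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s+([a-z]+?)s?\s*", window_size.lower())
    if not match or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"unrecognised window_size: {window_size!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def average_rate_per_second(max_requests: int, window_size: str) -> float:
    """Long-run average rate allowed by a limit (bursts within a window may be higher)."""
    return max_requests / window_to_seconds(window_size)


print(average_rate_per_second(60, "1 minute"))  # 1.0
print(window_to_seconds("5 minutes"))  # 300
```

The result is an average only: a fixed window does not spread the 60 requests evenly, so all of them may arrive in the first second of the window.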
Combined Limiting (Rate + Token)
chat_models:
  - model_id: gpt-4o
    model: openai/gpt-4o
routes:
  enterprise:
    chat_models:
      - gpt-4o
    rate_limiting:
      algorithm: fixed_window
      window_size: 1 minute
      max_requests: 1000
    token_limiting:
      input:
        algorithm: fixed_window
        window_size: 1 hour
        max_token: 2000000
      output:
        algorithm: fixed_window
        window_size: 1 hour
        max_token: 2000000
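This page does not spell out how the request limit and the token limits interact. One plausible reading is that a request is admitted only while every configured budget still has room. The sketch below assumes that policy, counts output tokens only after a response is produced (they are not known in advance), and uses hypothetical names (WindowCounter, admit). It illustrates the idea of combined budgets and is not the gateway's logic.

```python
from dataclasses import dataclass


@dataclass
class WindowCounter:
    """Hypothetical fixed-window budget: `limit` units per `window_seconds`."""
    limit: int
    window_seconds: int
    window_index: int = -1
    used: int = 0

    def _roll(self, now: float) -> None:
        # Reset usage when `now` falls into a new aligned window
        index = int(now // self.window_seconds)
        if index != self.window_index:
            self.window_index = index
            self.used = 0

    def remaining(self, now: float) -> int:
        self._roll(now)
        return self.limit - self.used

    def consume(self, now: float, amount: int) -> None:
        self._roll(now)
        self.used += amount


def admit(now: float, requests: WindowCounter, input_tokens: WindowCounter,
          output_tokens: WindowCounter, prompt_tokens: int) -> bool:
    """Hypothetical policy: admit only if every budget still has room."""
    if requests.remaining(now) < 1:
        return False
    if input_tokens.remaining(now) < prompt_tokens:
        return False
    if output_tokens.remaining(now) <= 0:
        return False
    requests.consume(now, 1)
    input_tokens.consume(now, prompt_tokens)
    return True


# Budgets mirroring the "enterprise" route above
requests = WindowCounter(limit=1000, window_seconds=60)
input_tokens = WindowCounter(limit=2_000_000, window_seconds=3600)
output_tokens = WindowCounter(limit=2_000_000, window_seconds=3600)

print(admit(0, requests, input_tokens, output_tokens, prompt_tokens=1500))  # True
# Output tokens are only known after the model responds, so they are recorded afterwards
output_tokens.consume(0, 2_000_000)
print(admit(1, requests, input_tokens, output_tokens, prompt_tokens=10))  # False
```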
Advanced Configuration
Custom Window Sizes
rate_limiting:
  algorithm: fixed_window
  window_size: 5 minutes
  max_requests: 500
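500 requests per 5 minutes has the same average rate as 100 per minute, but it tolerates much larger bursts. With fixed windows, a client can spend a full quota just before a window boundary and another full quota just after it. The sketch below counts what a fixed-window limiter would admit for such a burst. It assumes windows aligned to multiples of the window length; the gateway's alignment is not documented here, and count_admitted is a hypothetical helper.

```python
def count_admitted(timestamps, max_requests, window_seconds):
    """Count how many requests a fixed-window limiter admits (illustrative)."""
    used_per_window = {}
    admitted = 0
    for t in timestamps:
        # Assumes windows aligned to multiples of window_seconds
        window = int(t // window_seconds)
        if used_per_window.get(window, 0) < max_requests:
            used_per_window[window] = used_per_window.get(window, 0) + 1
            admitted += 1
    return admitted


# 600 requests at t=299s and 600 at t=300s, straddling a 5-minute boundary
burst = [299.0] * 600 + [300.0] * 600
print(count_admitted(burst, max_requests=500, window_seconds=300))  # 1000
print(count_admitted(burst, max_requests=100, window_seconds=60))  # 200
```

Both limits average about 1.67 requests per second, yet the 5-minute window lets 1000 requests through within about one second.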
Algorithm Configuration
# Fixed window (currently implemented)
rate_limiting:
  algorithm: fixed_window
  window_size: 1 minute
  max_requests: 100
Note: While you can specify sliding_window or sliding_window_counter in the configuration, your gateway version may implement only fixed window limiting.
Rate Limiting Behavior
When Limits Are Exceeded
- HTTP 429: Too Many Requests status code
- Retry-After: Header indicating when to retry (if enabled)
- Error Message: Descriptive error message
Response Headers (if exposed by the gateway)
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Header names/availability may vary depending on the deployment and gateway version.
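A client can derive how long to wait from these headers when they are present. The sketch below assumes Retry-After carries a number of seconds (HTTP also allows an HTTP-date form, which this sketch does not handle) and that X-RateLimit-Reset is a Unix timestamp in seconds, as the example value 1640995200 suggests; neither assumption is confirmed for every gateway version. The function name seconds_to_wait is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> Optional[float]:
    """Derive a wait time from a 429 response's headers, or None if none apply."""
    # HTTP header names are case-insensitive; normalise before lookup
    lowered = {k.lower(): v for k, v in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)  # delta-seconds form only
    reset = lowered.get("x-ratelimit-reset")
    if reset is not None:
        # Assumed to be Unix epoch seconds, as in the example above
        now = time.time() if now is None else now
        return max(0.0, float(reset) - now)
    return None


print(seconds_to_wait({"Retry-After": "60", "X-RateLimit-Remaining": "0"}))  # 60.0
print(seconds_to_wait({"X-RateLimit-Reset": "1640995200"}, now=1640995170))  # 30.0
print(seconds_to_wait({}))  # None
```

Retry-After is preferred when both are present because it is a relative value and does not depend on the client's clock being in sync with the gateway.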
Monitoring
Rate Limiting Metrics
- Requests Blocked: Number of requests blocked by rate limiting
- Rate Limit Utilization: Percentage of rate limit used
- Window Resets: Number of rate limit window resets
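How the gateway computes and exports these metrics is not described on this page. As an illustration of what "blocked" and "utilization" mean, the sketch below assumes a hypothetical log of (window_index, allowed) pairs per route and defines utilization as admitted requests divided by max_requests for each window. The function name summarize and the log format are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize(events: Iterable[Tuple[int, bool]], max_requests: int) -> dict:
    """Blocked count and per-window utilization from (window_index, allowed) events."""
    admitted_per_window: Counter = Counter()
    blocked = 0
    for window, allowed in events:
        if allowed:
            admitted_per_window[window] += 1
        else:
            blocked += 1
    # Utilization: share of the per-window quota actually used, in percent
    utilization = {
        window: 100.0 * admitted / max_requests
        for window, admitted in sorted(admitted_per_window.items())
    }
    return {
        "requests_blocked": blocked,
        "utilization_percent": utilization,
        "windows_observed": len(admitted_per_window),
    }


events = [(0, True)] * 80 + [(1, True)] * 100 + [(1, False)] * 5
print(summarize(events, max_requests=100))
# {'requests_blocked': 5, 'utilization_percent': {0: 80.0, 1: 100.0}, 'windows_observed': 2}
```

A window that sits at 100% utilization with blocked requests, as window 1 does here, is the signal to revisit the limit.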
Best Practices
Window Size Selection
- Short Windows (1-5 minutes): burst protection
- Medium Windows (1 hour): sustained load protection
- Long Windows (24 hours): daily caps
Limit Configuration
- Start with conservative limits
- Monitor usage patterns
- Adjust based on actual usage
Error Handling
- Implement proper retry logic
- Handle 429 responses gracefully
- Provide user feedback
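The error-handling practices above can be combined into a small client-side retry loop. This is a sketch under stated assumptions: the send callable and the (status_code, headers) response shape are hypothetical stand-ins for your HTTP client, Retry-After is read only in its delta-seconds form, and the fallback when no Retry-After is available is exponential backoff with full jitter, a common client technique rather than something the gateway prescribes.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status_code, headers). Adapt to your HTTP client.
Response = Tuple[int, Mapping[str, str]]


def call_with_retries(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429, honouring Retry-After when present."""
    for attempt in range(max_attempts):
        status, headers = send()
        # Return anything that is not a rate-limit rejection, or the final attempt
        if status != 429 or attempt == max_attempts - 1:
            return status, headers
        lowered = {k.lower(): v for k, v in headers.items()}
        retry_after = lowered.get("retry-after", "").strip()
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            # No usable Retry-After: exponential backoff with full jitter
            delay = random.uniform(0, base_delay * 2 ** attempt)
        sleep(min(delay, max_delay))
    raise ValueError("max_attempts must be at least 1")


# Usage with a scripted sequence of responses and a recording sleep
responses = iter([(429, {"Retry-After": "2"}), (429, {}), (200, {})])
waits = []
status, _ = call_with_retries(lambda: next(responses), sleep=waits.append)
print(status)  # 200
```

If the final attempt still returns 429, the loop hands that response back so the caller can tell the user the limit was reached instead of retrying indefinitely.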
Troubleshooting
Common Issues
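- Too Restrictive: Increase max_requests; if you lengthen window_size, raise max_requests proportionally, otherwise the effective limit becomes stricter
- Not Working: Verify that rate_limiting is configured on the route
- Inconsistent Behavior: Check algorithm selection and deployment topology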
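For the "Not Working" case, a quick pre-deployment check can list routes that lack a rate_limiting block. The sketch assumes the configuration has been loaded into Python dicts and treats algorithm, window_size, and max_requests as required keys, matching the examples on this page; whether the gateway requires or defaults any of them is not confirmed. The function name rate_limit_problems is hypothetical.

```python
def rate_limit_problems(config: dict) -> list[str]:
    """Hypothetical lint: report routes with missing or incomplete rate_limiting."""
    problems = []
    for name, route in sorted(config.get("routes", {}).items()):
        rl = route.get("rate_limiting")
        if not rl:
            problems.append(f"{name}: no rate_limiting block")
            continue
        # Keys used in every example on this page (assumed required)
        for key in ("algorithm", "window_size", "max_requests"):
            if key not in rl:
                problems.append(f"{name}: rate_limiting.{key} missing")
    return problems


config = {
    "routes": {
        "api": {
            "chat_models": ["gpt-3.5-turbo"],
            "rate_limiting": {"algorithm": "fixed_window", "window_size": "1 minute", "max_requests": 60},
        },
        "internal": {"chat_models": ["gpt-3.5-turbo"]},
    }
}
print(rate_limit_problems(config))  # ['internal: no rate_limiting block']
```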
Next Steps
- Token Limiting - Configure token-based limits
- Budget Limiting - Set up cost controls
- Monitoring - Set up observability and metrics
- API Reference - Complete API documentation