Why Rate Limiting Matters More Than You Think
The obvious reason to rate limit is to protect your API from abuse: scrapers, DDoS attempts, badly written client loops. But rate limiting also protects you from your own users.
A client that encounters an error will often retry. If it retries immediately and the retry fails, many libraries will retry again with exponential backoff. This is sensible behaviour. But under certain failure conditions (a brief database hiccup, a cold-start delay), every client hits the error simultaneously and all of them start their retry loops at the same time. Without rate limiting and proper Retry-After headers, this can turn a 10-second blip into a 10-minute outage. This is the thundering herd problem, and rate limiting is one of the primary defences against it.
The Four Main Rate Limiting Algorithms
1. Fixed Window Counter
The simplest approach. You define a time window (e.g., 1 minute) and count requests against it. When the count hits the limit, requests are rejected until the window resets.
The problem: A burst attack at the window boundary can push through 2× your intended limit. If your limit is 100 req/min and a window resets at :00, an attacker can send 100 requests at :59 and another 100 at :00, getting 200 requests through in about 2 seconds.
Best for: Internal services, admin APIs, or any context where the boundary attack is acceptable.
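import time
import redis

# Reuse one client (and its connection pool) instead of reconnecting per request.
r = redis.Redis()

def is_allowed_fixed_window(user_id: str, limit: int, window_seconds: int) -> bool:
    # One counter key per user per window; the window index changes every window_seconds.
    key = f"rate:{user_id}:{int(time.time() // window_seconds)}"
    # Send INCR and EXPIRE in one transaction so the key can never be left without a TTL.
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()
    return count <= limit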
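To see the boundary problem concretely, here is a minimal in-memory sketch of the same counting scheme with the timestamp passed in, so the clock can be controlled. The class name `FixedWindowCounter` and its `now` parameter are illustrative assumptions, not part of any library.

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory fixed window counter (illustrative only, not thread-safe)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (user_id, window index) -> request count

    def is_allowed(self, user_id: str, now: float) -> bool:
        # `now` is supplied by the caller (an assumption made for testability).
        window = int(now // self.window_seconds)
        self.counts[(user_id, window)] += 1
        return self.counts[(user_id, window)] <= self.limit


limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 in the first second of the next.
accepted = sum(limiter.is_allowed("alice", now=59.0) for _ in range(100))
accepted += sum(limiter.is_allowed("alice", now=60.0) for _ in range(100))
print(accepted)  # 200
```

Every one of the 200 requests is accepted because the two bursts land in different windows, even though they are only a second apart.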
2. Sliding Window Log
Instead of a fixed bucket, you keep a log of request timestamps and count how many fall within the last N seconds. This eliminates the boundary attack: the limit holds over every N-second span, not just over aligned windows. The trade-off is memory usage: storing a timestamp per request per user doesn't scale at high volume.
Best for: Public APIs with moderate traffic and strict fairness requirements.
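A minimal single-process sketch of the idea, using a deque of timestamps per user. The class name `SlidingWindowLog` and the caller-supplied `now` argument are assumptions made for illustration; a shared deployment would more likely keep the log in something like a Redis sorted set. This version records only accepted requests, which is one of several reasonable choices.

```python
from collections import deque


class SlidingWindowLog:
    """Per-user log of request timestamps (illustrative, single-process only)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # user_id -> deque of accepted request timestamps

    def is_allowed(self, user_id: str, now: float) -> bool:
        log = self.logs.setdefault(user_id, deque())
        # Drop timestamps that have aged out of the window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLog(limit=100, window_seconds=60)
first = sum(limiter.is_allowed("alice", now=59.0) for _ in range(100))
second = sum(limiter.is_allowed("alice", now=60.0) for _ in range(100))
print(first, second)  # 100 0
```

The same two bursts that slipped through a fixed window are now capped at 100 in total, at the cost of storing up to `limit` timestamps per user.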
3. Token Bucket ⭐ (Most Common in Production)
Imagine each user has a bucket that holds tokens. Tokens are added at a fixed rate (the "fill rate"). Each request consumes one token. If the bucket is empty, the request is rejected.
The key property: Users can burst up to the bucket capacity, then are throttled to the fill rate. Example: a bucket capacity of 20 tokens and a fill rate of 5/second. A user can fire 20 requests immediately, then is limited to 5/second. A user who has drained the bucket and then makes no requests for 4 seconds is back to a full 20 tokens (4 s × 5 tokens/s).
import time

class TokenBucket:
    def __init__(self, capacity: int, fill_rate: float):
        self.capacity = capacity
        self.fill_rate = fill_rate  # tokens per second
        self.tokens = capacity      # start with a full bucket
        self.last_refill = time.monotonic()

    def consume(self, tokens: int = 1) -> bool:
        # Refill lazily: credit the tokens earned since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
Best for: Public APIs, SDKs, anything where bursty-but-average-rate clients are legitimate (e.g., a mobile app that fires several requests when it resumes from background).
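When a token bucket rejects a request, the same state also tells you how long the client should wait, which is a natural value for a Retry-After header. The helper below is a hedged sketch; the function name `seconds_until_tokens` and its parameters are hypothetical and not part of the class above.

```python
import math


def seconds_until_tokens(current_tokens: float, needed: int, fill_rate: float) -> int:
    """Whole seconds until `needed` tokens are available, rounded up."""
    if fill_rate <= 0:
        raise ValueError("fill_rate must be positive")
    deficit = needed - current_tokens
    if deficit <= 0:
        return 0  # enough tokens already, no wait required
    return math.ceil(deficit / fill_rate)


# Using the example bucket: capacity 20, fill rate 5 tokens/second.
print(seconds_until_tokens(current_tokens=0, needed=1, fill_rate=5))    # 1
print(seconds_until_tokens(current_tokens=0, needed=20, fill_rate=5))   # 4
print(seconds_until_tokens(current_tokens=3.5, needed=1, fill_rate=5))  # 0
```

Rounding up errs on the side of the client waiting slightly too long rather than retrying a fraction of a second too early.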
4. Leaky Bucket
The inverse of token bucket. Requests go into a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests overflow and are dropped. Output is perfectly smooth regardless of input burst, but users don't "earn back" capacity during quiet periods.
Best for: Systems where smooth downstream throughput is critical: payment processors, email senders, any service that can't handle bursts even briefly.
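The queue behaviour described above can be sketched as follows. This is a single-process illustration under stated assumptions: the class name `LeakyBucket`, the `offer`/`drain` method names, and the caller-supplied `now` are all hypothetical, and a real deployment would typically call `drain` from a worker loop or timer.

```python
from collections import deque


class LeakyBucket:
    """Queue-based leaky bucket (illustrative, single-process, not thread-safe)."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request, now: float) -> bool:
        """Queue a request; return False if the bucket is full (overflow)."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Idle time earns no credit: the leak clock restarts on arrival.
            self.last_leak = now
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Return the requests that are due for processing at time `now`."""
        due = int((now - self.last_leak) * self.leak_rate)
        released = []
        while due > 0 and self.queue:
            released.append(self.queue.popleft())
            due -= 1
        # Advance the clock only by the slots actually used.
        self.last_leak += len(released) / self.leak_rate
        return released


bucket = LeakyBucket(capacity=5, leak_rate=2.0)
accepted = [bucket.offer(i, now=0.0) for i in range(8)]
print(accepted.count(True), accepted.count(False))  # 5 3
first_batch = bucket.drain(now=1.0)
second_batch = bucket.drain(now=2.5)
print(first_batch)   # [0, 1]
print(second_batch)  # [2, 3, 4]
```

A burst of 8 requests is trimmed to the 5 that fit in the bucket, and those 5 leave at a steady 2 per second no matter how quickly they arrived.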
Calculating Your Rate Limits
The right rate limit depends on your backend capacity and your users' expected behaviour. Use this formula as a starting point:
Sustained rate limit = (backend capacity × safety factor) ÷ expected concurrent users
Burst limit = sustained rate × burst multiplier (typically 2× to 5×)
For example, with a backend that handles 1,000 req/sec, a 0.7 safety factor, and 100 concurrent users:
Sustained limit = (1,000 × 0.7) ÷ 100 = 7 req/sec per user
Burst limit = 7 × 3 = 21 requests (burst capacity)
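The same arithmetic as a small helper, in case you want to try different inputs. The function name `recommend_limits` and the default burst multiplier of 3 are assumptions chosen to match the worked example, not a prescribed API.

```python
def recommend_limits(backend_capacity: float, safety_factor: float,
                     concurrent_users: int, burst_multiplier: float = 3.0):
    """Return (sustained req/sec per user, burst capacity in requests)."""
    if concurrent_users <= 0:
        raise ValueError("concurrent_users must be positive")
    sustained = backend_capacity * safety_factor / concurrent_users
    burst = sustained * burst_multiplier
    return sustained, burst


sustained, burst = recommend_limits(
    backend_capacity=1000, safety_factor=0.7, concurrent_users=100
)
print(round(sustained, 2), round(burst, 2))  # 7.0 21.0
```

In token bucket terms, the sustained value maps to the fill rate and the burst value to the bucket capacity.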
HTTP Headers: Tell Your Clients What's Happening
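Rate limiting is only half the battle. Clients need to know what happened and when they can retry. The X-RateLimit-* headers are a widely used convention rather than a formal standard; Retry-After is defined by the HTTP specification. A typical 429 response looks like this: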
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719532800
Retry-After: 47
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute.",
  "retry_after_seconds": 47
}
The Retry-After header is especially important. Without it, clients have no information to guide their retry logic and will often retry immediately, creating the thundering herd you were trying to prevent.
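One way to assemble these headers on the server is sketched below. The function name `rate_limit_headers` and its parameters are hypothetical; how you attach the resulting dictionary to a response depends on your web framework.

```python
import math
import time


def rate_limit_headers(limit: int, remaining: int, reset_epoch: float, now=None) -> dict:
    """Build rate limit headers; adds Retry-After once the quota is exhausted."""
    # `now` is injectable for testing (an assumption); defaults to the wall clock.
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Round up and never advertise less than one second.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers


headers = rate_limit_headers(
    limit=100, remaining=0, reset_epoch=1719532800, now=1719532753
)
print(headers["Retry-After"])  # 47
```

With the inputs shown, the result matches the example response: the window resets 47 seconds after `now`.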
Exponential Backoff with Jitter on the Client Side
If you're building a client that calls a rate-limited API, implement exponential backoff with jitter. Plain exponential backoff (retry after 1s, 2s, 4s, 8s…) can still cause herd behaviour if many clients fail simultaneously. Adding random jitter spreads the retries out:
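import time
import random

class RateLimitError(Exception):
    """Placeholder: substitute whatever your HTTP client raises on a 429."""

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with full jitter
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)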
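A variant that prefers the server's Retry-After value when one is available, and falls back to capped full-jitter backoff otherwise, might look like the sketch below. The `retry_after` attribute on `RateLimitError`, the `next_delay` helper, and the injectable `sleep` argument are all assumptions made for illustration; adapt them to whatever your HTTP client actually exposes.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error type carrying the parsed Retry-After value, if any."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, or None if the header was absent


def next_delay(attempt, retry_after=None, base_delay=1.0, max_delay=64.0):
    """Prefer the server's Retry-After; otherwise use capped full-jitter backoff."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))


def retry_with_retry_after(fn, max_retries=5, base_delay=1.0, max_delay=64.0,
                           sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries - 1:
                raise
            sleep(next_delay(attempt, err.retry_after, base_delay, max_delay))


# Demo: fail twice with Retry-After: 2, then succeed. Waits are recorded, not slept.
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError(retry_after=2)
    return "ok"


result = retry_with_retry_after(flaky, sleep=waits.append)
print(result, waits)  # ok [2.0, 2.0]
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` applies.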
Summary Comparison
| Algorithm | Burst Handling | Memory | Best Use Case |
|---|---|---|---|
| Fixed Window | Poor (boundary attack) | Low | Internal APIs |
| Sliding Window Log | Excellent | High | Moderate-traffic public APIs |
| Token Bucket | Excellent | Low | Public APIs (most common) |
| Leaky Bucket | None (smooths everything) | Medium | Payment / email senders |
For most public-facing APIs, token bucket is the right default. Set your burst capacity to 3× to 5× your sustained rate and always return Retry-After on a 429.
Frequently Asked Questions
429 Too Many Requests is the correct status code per RFC 6585. Some older APIs use 503 Service Unavailable with a Retry-After header, which is also acceptable but less semantically precise.
redis-py, ioredis, and node-rate-limiter-flexible handle this pattern with atomic operations to avoid race conditions.
Honour the Retry-After header if present. If not, start with a 1-second delay and double it on each subsequent 429. Cap the delay at a maximum (e.g., 64 seconds). Log every rate limit hit: if you're hitting limits regularly, you need to either optimise your call patterns or upgrade your plan.
Always return Retry-After on a 429, and implement jitter on the client side. Work out your specific numbers with the DevOpsArsenal Rate Limit Calculator before you ship.