API Rate Limiting: Token Bucket vs Leaky Bucket

Why Rate Limiting Matters More Than You Think

The obvious reason to rate limit is to protect your API from abuse — scrapers, DDoS attempts, badly-written client loops. But rate limiting also protects you from your own users.

A client that encounters an error will often retry. If it retries immediately and the retry fails, many libraries will retry again with exponential backoff. This is sensible behaviour. But under certain failure conditions (a brief database hiccup, a cold-start delay) every client hits the error simultaneously and all of them start their retry loops at the same time. Without rate limiting and proper Retry-After headers, this can turn a 10-second blip into a 10-minute outage. This is the thundering herd problem, and rate limiting is one of the primary defences against it.

The Four Main Rate Limiting Algorithms

1. Fixed Window Counter

The simplest approach. You define a time window (e.g., 1 minute) and count requests against it. When the count hits the limit, requests are rejected until the window resets.

The problem: A burst attack at the window boundary can send 2× your intended limit. If your limit is 100 req/min and a window resets at :00, an attacker can send 100 requests at :59 and 100 at :00 — 200 requests in 2 seconds.

Best for: Internal services, admin APIs, or any context where the boundary attack is acceptable.

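import time
import redis

# Reuse one client (and its connection pool) instead of reconnecting on every call.
r = redis.Redis()

def is_allowed_fixed_window(user_id: str, limit: int, window_seconds: int) -> bool:
    # One key per user per window; the window index changes every window_seconds.
    key = f"rate:{user_id}:{int(time.time() // window_seconds)}"
    # Send INCR and EXPIRE in a single MULTI/EXEC transaction so a crash between
    # the two commands cannot leave a counter without a TTL.
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()
    return count <= limit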
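
To make the boundary problem concrete, here is a minimal in-memory sketch that needs no Redis. The class name `FixedWindowCounter` and its `allow(now)` method are illustrative assumptions, and the timestamp is passed in so the result is deterministic.

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory fixed window counter (hypothetical helper for illustration)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # window index -> requests seen

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        self.counts[window] += 1
        return self.counts[window] <= self.limit


limiter = FixedWindowCounter(limit=100, window_seconds=60)
# 100 requests in the last half second of one window,
# then 100 more right as the next window opens.
late = sum(limiter.allow(59.5) for _ in range(100))
early = sum(limiter.allow(60.0) for _ in range(100))
accepted_around_boundary = late + early  # twice the intended per-minute limit
```

Every one of those requests is accepted, even though they all arrive within about half a second of each other.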

2. Sliding Window Log

Instead of a fixed bucket, you keep a log of request timestamps and count how many fall within the last N seconds. This eliminates the boundary attack — the rate is genuinely smooth. The trade-off is memory usage: storing a timestamp per request per user doesn't scale at high volume.

Best for: Public APIs with moderate traffic and strict fairness requirements.
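
A minimal single-process sketch of the idea, assuming one instance per user. The class name `SlidingWindowLog` and the explicit `now` argument are illustrative choices, not taken from a particular library; a production version would more likely keep the log in a shared store.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request (hypothetical helper)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Discard timestamps that have aged out of the trailing window.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


log = SlidingWindowLog(limit=100, window_seconds=60)
late = sum(log.allow(59.5) for _ in range(100))   # fills the window
early = sum(log.allow(60.0) for _ in range(100))  # all rejected: window still full
```

The memory cost is visible in the sketch: the deque holds up to `limit` timestamps for every active user.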

3. Token Bucket ⭐ (Most Common in Production)

Imagine each user has a bucket that holds tokens. Tokens are added at a fixed rate (the "fill rate"). Each request consumes one token. If the bucket is empty, the request is rejected.

The key property: Users can burst up to the bucket capacity, then are throttled to the fill rate. Example: a bucket capacity of 20 tokens, fill rate of 5/second. A user can fire 20 requests immediately, then is limited to 5/second. A user who has emptied the bucket and then makes no requests for 4 seconds gets all 20 tokens back (5 tokens/second × 4 seconds).

import time

class TokenBucket:
    def __init__(self, capacity: int, fill_rate: float):
        self.capacity = capacity
        self.fill_rate = fill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def consume(self, tokens: int = 1) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

Best for: Public APIs, SDKs, anything where bursty-but-average-rate clients are legitimate (e.g., a mobile app that fires several requests when it resumes from background).
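
Because the bucket refills at a known rate, the wait before a rejected request could succeed is simple arithmetic. The helper below is a sketch under that assumption; the function name `seconds_until_available` is hypothetical, and it rounds up to whole seconds so the value could be used as a Retry-After hint.

```python
import math


def seconds_until_available(tokens: float, fill_rate: float, needed: int = 1) -> int:
    """Whole seconds until the bucket holds at least `needed` tokens."""
    if tokens >= needed:
        return 0
    # Round up so a client that waits this long is never early.
    return math.ceil((needed - tokens) / fill_rate)


wait_for_one = seconds_until_available(tokens=0, fill_rate=5)              # 1
wait_for_full = seconds_until_available(tokens=0, fill_rate=5, needed=20)  # 4
```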

4. Leaky Bucket

The inverse of token bucket. Requests go into a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests overflow and are dropped. Output is perfectly smooth regardless of input burst — but users don't "earn back" capacity during quiet periods.

Best for: Systems where smooth downstream throughput is critical — payment processors, email senders, any service that can't handle bursts even briefly.
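
The queue-and-drain behaviour can be sketched as follows. This is a simplified single-process illustration: the names `LeakyBucket`, `offer`, and `leak` are assumptions, and it presumes a worker calls `leak(now)` on a regular tick to pull the requests that are due.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate (hypothetical helper)."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request) -> bool:
        """Queue a request; return False if the bucket is full (request dropped)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self, now: float) -> list:
        """Return the requests that are due for processing at time `now`."""
        due = int((now - self.last_leak) * self.leak_rate)
        if due <= 0:
            return []
        # Advance only by the time actually consumed so fractions carry over.
        self.last_leak += due / self.leak_rate
        return [self.queue.popleft() for _ in range(min(due, len(self.queue)))]


bucket = LeakyBucket(capacity=10, leak_rate=5.0)
accepted = sum(bucket.offer(i) for i in range(15))  # 10 queued, 5 overflow
first_batch = bucket.leak(1.0)                      # 5 requests after one second
```

However large the incoming burst, the worker only ever sees `leak_rate` requests per second.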

Calculating Your Rate Limits

The right rate limit depends on your backend capacity and your users' expected behaviour. Use this formula as a starting point:

Sustained rate limit = (backend capacity × safety factor) ÷ expected concurrent users
Burst limit          = sustained rate × burst multiplier (typically 2–5×)

For example — backend handles 1,000 req/sec, 0.7 safety factor, 100 concurrent users:

Sustained limit = (1,000 × 0.7) ÷ 100 = 7 req/sec per user
Burst limit     = 7 × 3 = 21 requests (burst capacity)
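
The same arithmetic as a small function, as a sketch only. The name `recommended_limits` and the default `burst_multiplier` of 3 are assumptions chosen to match the worked example above.

```python
def recommended_limits(backend_capacity: float, safety_factor: float,
                       concurrent_users: int, burst_multiplier: float = 3.0):
    """Return (sustained req/sec per user, burst capacity in requests)."""
    sustained = (backend_capacity * safety_factor) / concurrent_users
    burst = sustained * burst_multiplier
    return sustained, burst


sustained, burst = recommended_limits(
    backend_capacity=1000, safety_factor=0.7, concurrent_users=100
)
# Rounded for display; floating-point results may differ in the last digits.
print(f"sustained={sustained:.1f} req/sec, burst={burst:.0f} requests")
```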

HTTP Headers: Tell Your Clients What's Happening

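Rate limiting is only half the battle. Clients need to know what happened and when they can retry. Retry-After is a standard HTTP header; the X-RateLimit-* headers are a widely used convention rather than a formal standard. A typical 429 response looks like this: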

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719532800
Retry-After: 47
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute.",
  "retry_after_seconds": 47
}

The Retry-After header is especially important. Without it, clients have no information to guide their retry logic and will often retry immediately — creating the thundering herd you were trying to prevent.
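
A sketch of how a server might assemble these headers. The function name `rate_limit_headers` and its parameters are hypothetical; it assumes the reset time is a Unix timestamp, as in the example response above.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int, now: float) -> dict:
    """Build rate limit headers; adds Retry-After only when the client is out of quota."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Retry-After is expressed in whole seconds; round up so clients never retry early.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers


headers = rate_limit_headers(limit=100, remaining=0,
                             reset_epoch=1719532800, now=1719532753.0)
```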

Exponential Backoff with Jitter on the Client Side

If you're building a client that calls a rate-limited API, implement exponential backoff with jitter. Plain exponential backoff (retry after 1s, 2s, 4s, 8s…) can still cause herd behaviour if many clients fail simultaneously. Adding random jitter spreads the retries out:

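import time
import random

class RateLimitError(Exception):
    """Placeholder: raise this (or map your HTTP client's error to it) on a 429."""

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            # Out of retries: let the caller see the error.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter: sleep a random time
            # between 0 and base_delay * 2^attempt seconds.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)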
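
The helper above ignores any Retry-After value the server sends and has no upper bound on the delay. A possible delay calculation that prefers the server's hint and otherwise falls back to capped, jittered backoff is sketched below. The function name `next_delay` and the 64-second default cap are assumptions, and only the delay-in-seconds form of Retry-After is parsed.

```python
import random
from typing import Optional


def next_delay(attempt: int, retry_after: Optional[str] = None,
               base_delay: float = 1.0, max_delay: float = 64.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Server-provided hint wins. HTTP-date values are not parsed here
            # and fall through to the backoff calculation.
            return max(0.0, float(int(retry_after)))
        except ValueError:
            pass
    # Full jitter: pick uniformly between 0 and the capped exponential ceiling.
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0, ceiling)


server_hint = next_delay(attempt=0, retry_after="47")  # 47.0
fallback = next_delay(attempt=10)                      # somewhere in [0, 64]
```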

Summary Comparison

Algorithm	Burst Handling	Memory	Best Use Case
Fixed Window	Poor (boundary attack)	Low	Internal APIs
Sliding Window Log	Excellent	High	Moderate-traffic public APIs
Token Bucket	Excellent	Low	Public APIs (most common)
Leaky Bucket	None (smooths everything)	Medium	Payment / email senders

For most public-facing APIs, token bucket is the right default. Set your burst capacity to 3–5× your sustained rate and always return Retry-After on a 429.

Frequently Asked Questions

What's the difference between rate limiting and throttling?

They are often used interchangeably, but technically: rate limiting rejects requests once a limit is hit (returns 429). Throttling slows requests down — it queues them and introduces delay rather than rejecting them outright. Leaky bucket is a form of throttling; token bucket can be configured either way.

What HTTP status code should a rate-limited response use?

429 Too Many Requests is the correct status code per RFC 6585. Some older APIs use 503 Service Unavailable with a Retry-After header, which is also acceptable but less semantically precise.

How do I handle distributed rate limiting across multiple servers?

You need a shared store, and Redis is the standard choice. Each server checks and updates the rate limit counter in Redis on every request. Redis clients such as redis-py and ioredis expose the atomic building blocks (INCR, transactions, Lua scripts) needed to avoid race conditions, and libraries such as rate-limiter-flexible for Node.js implement the whole pattern on top of them.

What should I do when I'm being rate limited by a third-party API?

Respect the Retry-After header if present. If not, start with a 1-second delay and double it on each subsequent 429. Cap the delay at a maximum (e.g., 64 seconds). Log every rate limit hit — if you're hitting limits regularly, you need to either optimise your call patterns or upgrade your plan.

For most public-facing APIs, token bucket is the right default. It's battle-tested, understood by clients, and allows the bursty behaviour that real-world applications exhibit. When in doubt, set your burst capacity to 3–5× your sustained rate, always return Retry-After on a 429, and implement jitter on the client side. Work out your specific numbers with the DevOpsArsenal Rate Limit Calculator before you ship.