Rate Limiting

API Design · Security · Scalability · Traffic Management

Overview

Rate limiting is a strategy to control the amount of incoming and outgoing traffic to or from a network, API, or service. It restricts the number of requests a client can make within a specific time period, protecting systems from abuse, overload, and denial-of-service attacks while ensuring fair resource distribution.

What is Rate Limiting?

Rate limiting is a technique used to control the rate at which a client can make requests or perform operations in a system. It acts as a traffic cop for your applications, ensuring that no single user, service, or API client can overwhelm your system by sending too many requests in a short period of time.

Think of rate limiting like a busy coffee shop that only allows customers to order one drink every 5 minutes. This ensures the baristas aren't overwhelmed with orders, keeps wait times reasonable for everyone, and prevents any single customer from monopolizing the service. Similarly, rate limiting helps distribute computing resources fairly among all users.

Why Use Rate Limiting?

Rate limiting serves several critical purposes in modern application design:

  • Prevent Resource Exhaustion: Protects your systems from being overwhelmed by too many requests
  • Defend Against Attacks: Mitigates denial-of-service (DoS) attacks by limiting request frequency
  • Control Costs: Especially important for services that charge by usage (like cloud providers or third-party APIs)
  • Maintain Service Quality: Ensures consistent performance for all users by preventing any single user from consuming all resources
  • Enforce Fair Usage: Implements usage tiers and prevents abuse of free or shared services
  • Regulate Traffic Flow: Smooths out traffic spikes and creates more predictable system behavior

Common Rate Limiting Algorithms

Several algorithms are commonly used to implement rate limiting, each with its own strengths and ideal use cases:

1. Token Bucket Algorithm

The Token Bucket algorithm is one of the most widely used rate limiting approaches. Imagine a bucket that is constantly being filled with tokens at a steady rate up to a maximum capacity. Each request consumes one token. If a token is available, the request proceeds; if not, the request is either rejected or delayed until a token becomes available.

  • Strengths: Allows for bursts of traffic (up to bucket capacity) while maintaining a long-term rate limit
  • Implementation: Two parameters, the token refill rate (e.g., 10 tokens per second) and the bucket capacity (e.g., 100 tokens); see the JavaScript implementation later in this article
  • Example: Allow 100 requests per minute, but also allow short bursts of up to 20 requests at once

2. Leaky Bucket Algorithm

The Leaky Bucket algorithm is similar to Token Bucket but processes requests at a fixed rate. Think of a bucket with a small hole in the bottom. Water (requests) drips in from the top and leaks out at a constant rate from the bottom. If the bucket overflows, new requests are rejected or delayed.

  • Strengths: Guarantees a constant outflow rate, smoothing out bursts
  • Implementation: Queue (bucket) size and outflow/processing rate
  • Ideal for: Traffic shaping and situations requiring steady processing rates (a sketch of the simpler "meter" variant follows this list)
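
A minimal Python sketch of the leaky bucket as a meter: instead of a literal queue, the "water level" rises by one for each accepted request and drains at a constant rate, and requests that would overflow the bucket are rejected rather than queued. The class and parameter names here are illustrative, not from any particular library:

Leaky Bucket Implementation in Python

python
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity        # maximum "water" the bucket holds
        self.leak_rate = leak_rate      # units drained per second
        self.water = 0.0
        self.last_check = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain water according to elapsed time, never below empty
        elapsed = now - self.last_check
        self.water = max(0.0, self.water - elapsed * self.leak_rate)
        self.last_check = now

        if self.water + 1 <= self.capacity:
            self.water += 1             # the request fits in the bucket
            return True
        return False                    # bucket would overflow: reject

# Roughly 5 requests/second sustained, with overflow room for 10
limiter = LeakyBucket(capacity=10, leak_rate=5)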

3. Fixed Window Counter

The Fixed Window Counter algorithm divides time into fixed windows (e.g., 1-minute intervals) and counts requests in each window. Once the counter reaches the limit, new requests are rejected until the next window begins.

  • Strengths: Simple to understand and implement
  • Weakness: Can allow up to twice the limit around window boundaries; e.g., with a limit of 100 requests per minute, a client can send 100 requests at 0:59 and 100 more at 1:01
  • Implementation: Time window size and maximum request count per window (the Redis example later in this article implements this approach)

4. Sliding Window Log

The Sliding Window Log algorithm keeps track of timestamps for each request. When a new request arrives, it counts how many requests have occurred in the time window ending at the current time. If the count exceeds the limit, the request is rejected.

  • Strengths: Most accurate, true rolling window with precise tracking
  • Weakness: Higher memory usage as it stores each request timestamp
  • Implementation: Time window duration and maximum requests per window (see the sketch after this list)
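
A minimal sliding window log sketch in Python, assuming only accepted requests are logged (some implementations also log rejected requests, which makes the limiter stricter under sustained abuse):

Sliding Window Log Implementation in Python

python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()       # one timestamp per accepted request

    def allow(self):
        now = time.monotonic()
        # Evict timestamps that have fallen out of the rolling window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLog(limit=100, window_seconds=60)  # 100 requests/minute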

5. Sliding Window Counter

The Sliding Window Counter is a hybrid approach that combines the Fixed Window Counter with a weighted calculation based on the previous window's count. It provides a more accurate rate limit without storing individual request timestamps.

  • Strengths: Good accuracy with lower memory requirements than Sliding Window Log
  • Implementation: Current window counter, previous window counter, and weight calculation
  • Example: 30% into the current window, estimated count = 0.7 * previous_count + current_count; the request is allowed if this estimate is below the limit (as in the sketch below)
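
A minimal single-process sketch in Python; the window bookkeeping (rolling the counters forward and weighting the previous window) is the part that carries over to a shared-store version:

Sliding Window Counter Implementation in Python

python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_start = self._window_start(time.time())
        self.current_count = 0
        self.previous_count = 0

    def _window_start(self, now):
        return int(now // self.window) * self.window

    def allow(self):
        now = time.time()
        start = self._window_start(now)
        if start != self.current_start:
            # Roll forward; if more than one full window has passed,
            # the previous window's count is effectively zero.
            gap = start - self.current_start
            self.previous_count = self.current_count if gap == self.window else 0
            self.current_count = 0
            self.current_start = start

        # Weight the previous window by how much of it still overlaps
        # the rolling window that ends now.
        elapsed_fraction = (now - start) / self.window
        estimated = self.previous_count * (1 - elapsed_fraction) + self.current_count

        if estimated < self.limit:
            self.current_count += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100, window_seconds=60)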

Rate Limiting in Distributed Systems

Implementing rate limiting in distributed systems introduces additional challenges. When requests are spread across multiple servers or microservices, coordination is needed to maintain accurate global rate limits.

Common approaches for distributed rate limiting include:

  • Centralized Rate Limiter: All application servers check with a central service (like Redis) before processing requests
  • Sticky Sessions: Route the same user to the same server, enabling per-server rate limiting
  • Synchronized Counters: Periodically synchronize rate limit counters across all instances
  • Token Bucket with Shared State: Implement the token bucket algorithm using distributed shared state (sketched below)
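
One common way to get the shared-state atomicity right is to push the whole read-modify-write into a Redis Lua script, which Redis executes as a single operation, so concurrent application servers cannot race on the same bucket. A sketch using the redis-py client; the key format and parameters are illustrative, and HSET with multiple fields requires Redis 4.0+:

Distributed Token Bucket with Redis Lua Script

python
import time
import redis

# Refill the bucket based on elapsed time, then try to take one token.
# Returns 1 if the request is allowed, 0 otherwise.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local fill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + (now - last_refill) * fill_rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / fill_rate) * 2)
return allowed
"""

client = redis.Redis(host='localhost', port=6379, db=0)
token_bucket = client.register_script(TOKEN_BUCKET_LUA)

def is_allowed(identifier, capacity=20, fill_rate=2.0):
    # 20-token bursts, refilling at 2 tokens/second (120/minute sustained)
    result = token_bucket(keys=[f"bucket:{identifier}"],
                          args=[capacity, fill_rate, time.time()])
    return result == 1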

Rate Limiting Granularity

Rate limits can be applied at different levels of granularity, depending on your system's needs; as the sketch after this list shows, the level you choose mostly changes which request attributes identify the "client" being limited:

  • IP-based: Limit requests from each IP address (simplest but can be problematic with shared IPs)
  • User-based: Limit requests per user account (requires authentication)
  • API Key-based: Limit requests per API key (common for third-party API access)
  • Endpoint-based: Different limits for different API endpoints (based on resource intensity)
  • Service-based: Limit requests between internal microservices
  • Region/Data Center-based: Different limits based on geographic regions or data centers
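
In practice, the granularity level often comes down to how the limiter key is composed. A hypothetical helper (the Request fields and key format are illustrative, not from any particular framework):

Rate Limit Key Composition Example

python
from dataclasses import dataclass

@dataclass
class Request:
    client_ip: str
    user_id: str
    api_key: str
    method: str
    path: str

def rate_limit_key(request: Request, granularity: str) -> str:
    if granularity == "ip":
        return f"rl:ip:{request.client_ip}"
    if granularity == "user":
        return f"rl:user:{request.user_id}"
    if granularity == "api_key":
        return f"rl:key:{request.api_key}"
    if granularity == "endpoint":
        # Combine user and endpoint so expensive endpoints get their own budget
        return f"rl:ep:{request.user_id}:{request.method}:{request.path}"
    raise ValueError(f"unknown granularity: {granularity}")

req = Request("203.0.113.7", "user_123", "key_abc", "GET", "/search")
print(rate_limit_key(req, "endpoint"))  # rl:ep:user_123:GET:/search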

Client Response Strategies

When a client exceeds the rate limit, there are several ways to respond:

  • Reject with 429 Status Code: Return "Too Many Requests" HTTP status code
  • Throttling: Delay processing the request rather than rejecting it
  • Queuing: Place excess requests in a queue to be processed later
  • Degraded Service: Provide limited functionality rather than full rejection
  • Retry-After Header: Indicate when the client should try again

HTTP Rate Limit Response Example

http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 60
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1620000000

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per hour",
  "retry_after_seconds": 60
}

Implementation Examples

Redis-based Rate Limiter

Redis is commonly used for rate limiting due to its speed, atomic operations, and built-in expiration functionality. Here's an example of implementing a simple fixed window rate limiter with Redis:

Redis Rate Limiter Implementation

python
import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client, key_prefix, limit, window_seconds):
        self.redis = redis_client
        self.key_prefix = key_prefix
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, identifier):
        # Create a key that combines the prefix, identifier, and current window
        window_start = int(time.time() / self.window) * self.window
        key = f"{self.key_prefix}:{identifier}:{window_start}"

        # Increment the counter for this window and identifier
        current_count = self.redis.incr(key)

        # Set the key to expire at the end of the window if it's new.
        # (INCR and EXPIRE are two separate round trips here; in production,
        # wrap them in a pipeline or Lua script to make the pair atomic.)
        if current_count == 1:
            self.redis.expire(key, self.window)

        # Check if the count exceeds the limit
        return current_count <= self.limit

# Usage example
redis_client = redis.Redis(host='localhost', port=6379, db=0)
rate_limiter = RedisRateLimiter(
    redis_client,
    key_prefix="rate_limit:api",
    limit=100,            # 100 requests allowed
    window_seconds=3600   # per hour
)

user_id = "user_123"
if rate_limiter.is_allowed(user_id):
    print("Request allowed")      # process the request
else:
    print("Rate limit exceeded")  # return a 429 response

Token Bucket Implementation

Here's a simple token bucket implementation in JavaScript:

Token Bucket Implementation in JavaScript

javascript
class TokenBucket {
  constructor(capacity, fillRate) {
    // Maximum tokens the bucket can hold
    this.capacity = capacity;

    // Tokens added per second
    this.fillRate = fillRate;

    // Current token count
    this.tokens = capacity;

    // Last refill timestamp
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsedTime = (now - this.lastRefill) / 1000; // in seconds

    // Calculate how many tokens to add based on elapsed time
    const newTokens = elapsedTime * this.fillRate;

    // Update token count, but don't exceed capacity
    this.tokens = Math.min(this.capacity, this.tokens + newTokens);

    // Update last refill time
    this.lastRefill = now;
  }

  tryConsume(tokensRequired = 1) {
    // Refill tokens based on elapsed time
    this.refill();

    // Check if enough tokens are available
    if (this.tokens >= tokensRequired) {
      this.tokens -= tokensRequired;
      return true;
    }

    return false;
  }
}

// Usage example
const rateLimiter = new TokenBucket(10, 2); // 10 tokens capacity, 2 tokens/second refill

function handleRequest() {
  if (rateLimiter.tryConsume()) {
    console.log("Request processed");
    // Process the request
  } else {
    console.log("Rate limit exceeded");
    // Return 429 Too Many Requests
  }
}

// Simulate requests
for (let i = 0; i < 15; i++) {
  handleRequest();
}

// After a delay, tokens refill
setTimeout(() => {
  console.log("After delay:");
  for (let i = 0; i < 5; i++) {
    handleRequest();
  }
}, 3000);

Handling Rate Limits in Clients

Well-behaved clients should handle rate limiting gracefully. Here are best practices for client-side rate limit handling:

  • Respect Retry-After Headers: Wait the specified time before retrying
  • Exponential Backoff: Increase wait time between retries if repeatedly rate-limited (combined with Retry-After handling in the sketch below)
  • Rate Awareness: Track rate limits from response headers (X-RateLimit-*)
  • Request Prioritization: When near limits, prioritize important requests
  • Pre-emptive Throttling: Stay under limits by self-imposing delays
  • Cache Responses: Reduce need for repeated requests
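
The sketch below combines the first two practices: it honours Retry-After when the server sends one and falls back to exponential backoff with jitter otherwise. It assumes the popular Python requests library and, for simplicity, that Retry-After carries a delay in seconds rather than an HTTP date: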

Client Retry Strategy with Exponential Backoff
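
python
import random
import time
import requests

def get_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)                # server-specified wait
        else:
            delay = (2 ** attempt) + random.random()  # 1s, 2s, 4s... plus jitter

        time.sleep(delay)

    raise RuntimeError(f"still rate limited after {max_retries} retries")

# Usage example (hypothetical endpoint)
response = get_with_backoff("https://api.example.com/items")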

Real-World Rate Limiting Examples

Many popular services implement rate limiting to protect their resources:

  • GitHub API: 5,000 requests per hour for authenticated users, 60 requests per hour for unauthenticated requests
  • Twitter API: 300/450/900 tweets per 3-hour window (based on account tier)
  • Google Cloud APIs: Varies by service, typically quota-based with per-minute and per-day limits
  • AWS API Gateway: Configurable, default of 10,000 requests per second
  • Stripe API: 100 read requests per second, 100 write requests per second

Best Practices for Rate Limiting

  • Documentation: Clearly document your rate limits for API users
  • Response Headers: Include rate limit information in response headers
  • Differentiated Limits: Apply different limits based on user tiers or endpoint sensitivity
  • Graceful Degradation: Consider throttling rather than hard rejection for important users
  • Monitoring: Alert on suspicious patterns that might indicate abuse or attacks
  • Testing: Include rate limit testing in your API test suite
  • Caching: Implement caching to reduce the need for repeated requests
  • Geographic Distribution: Consider region-specific limits for global services

Conclusion

Rate limiting is an essential technique in modern system design that protects your services from abuse while ensuring fair resource allocation among clients. By implementing appropriate rate limiting strategies, you can maintain system stability, control costs, and provide consistent service quality even under high load conditions.

The choice of rate limiting algorithm and implementation approach should be based on your specific requirements for accuracy, performance, and complexity. In distributed systems, additional considerations around synchronization and state sharing are critical to ensure effective rate limiting across multiple service instances.