Rate limiting is a foundational control in API security, abuse prevention, and distributed systems reliability. It determines how systems allocate finite capacity across users, services, and regions. Every rate limiting algorithm encodes a different model of time, fairness, and burst tolerance. Choosing the wrong one can create boundary exploits, uneven load, or distributed consistency failures. This guide explains how the major rate limiting algorithms work, where they fail in practice, and which one is the strongest default for modern APIs.
| Algorithm | Fairness | Burst Tolerance | Memory Cost | Best For |
|---|---|---|---|---|
| Fixed Window | Low | High at window boundaries | Very Low | Simple internal limits |
| Sliding Window Log | High | Low | High | Strict fairness enforcement |
| Sliding Window Counter | Medium to High | Low to Medium | Moderate | Scalable public APIs |
| Token Bucket | Medium to High | Controlled bursts | Low | Developer-facing APIs |
| Leaky Bucket | High output smoothing | None | Low | Traffic shaping |
Key conclusions:

- Token bucket is the strongest default for most public, developer-facing APIs.
- Sliding window counter is the most practical compromise for large-scale distributed systems.
- Fixed window is rarely appropriate for public endpoints because boundary amplification can be exploited.
- Leaky bucket is better suited to traffic shaping than to user-facing fairness.

A useful mental model: token bucket models capacity accumulation, sliding window models recent history, and fixed window models discrete accounting periods.
Rate limiting constrains how frequently an identity can perform an action within a defined time interval. The identity may be an IP address, API key, user ID, tenant ID, or service.
At a systems level, every rate limiter defines:

- an identity to track (IP address, API key, user ID, tenant ID, or service)
- a limit and the time interval over which it applies
- an enforcement action when the limit is exceeded

Rate limiting is not just about blocking traffic; it encodes a policy for fairness, abuse resistance, and infrastructure protection.
The fixed window algorithm divides time into discrete intervals such as 60 seconds. For each identity, a counter tracks how many requests occur within the current interval. When the counter exceeds the configured limit, additional requests are rejected until the next window begins.
A limit of 100 requests per minute means:

- requests 1 through 100 within the current window are accepted
- request 101 and beyond are rejected
- the counter resets when the next window begins

This is computationally efficient: it requires only a counter and a window identifier. However, it introduces boundary amplification. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00; the system technically enforces 100 per minute, yet 200 requests arrive within two seconds.
Fixed window is rarely appropriate for public-facing APIs unless burst amplification is acceptable.
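The fixed window mechanics described above can be sketched in a few lines. This is a minimal in-process illustration (class and method names are my own, not from any particular library); the final lines reproduce the boundary amplification scenario:

```python
import time

class FixedWindowLimiter:
    """Minimal sketch: one counter per identity per discrete window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (identity, window_id) -> request count

    def allow(self, identity, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)  # discrete accounting period
        key = (identity, window_id)
        count = self.counters.get(key, 0)
        if count >= self.limit:
            return False
        self.counters[key] = count + 1
        return True

# Boundary amplification: 100 requests at t=59s and 100 more at t=60s
# land in different windows, so all 200 are accepted within one second.
limiter = FixedWindowLimiter(limit=100)
burst_1 = sum(limiter.allow("client-a", now=59.0) for _ in range(100))
burst_2 = sum(limiter.allow("client-a", now=60.0) for _ in range(100))
print(burst_1 + burst_2)  # 200
```

A production version would also need to expire old counter keys; that bookkeeping is omitted here.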
The sliding window algorithm evaluates requests against a rolling time interval instead of discrete blocks. There are two primary implementations: sliding window log and sliding window counter.
Sliding Window Log: The system stores a timestamp for every request. When a new request arrives, timestamps older than the window duration are removed. The remaining entries are counted. This provides strong fairness. At any moment, the system evaluates exactly the last N seconds of traffic. However, memory usage grows with request volume. During traffic spikes, log-based implementations can create memory pressure and increased CPU overhead for pruning. Failure example: a high-throughput endpoint using sliding logs can experience memory amplification during an attack spike.
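A sliding window log can be sketched with a timestamp deque per identity (illustrative names, not a specific library). Note how the log stores one entry per accepted request, which is exactly where the memory amplification risk comes from:

```python
import time
from collections import deque

class SlidingWindowLogLimiter:
    """Sketch: store a timestamp per request, prune entries older than the window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # identity -> deque of request timestamps

    def allow(self, identity, now=None):
        now = time.time() if now is None else now
        log = self.logs.setdefault(identity, deque())
        # Prune timestamps that have fallen out of the rolling window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)  # memory grows with accepted request volume
        return True
```

At any moment, `len(log)` is exactly the number of accepted requests in the last N seconds, which is what gives this approach its fairness precision.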
Sliding Window Counter: Instead of storing every timestamp, this approach blends counters from the current and previous window based on elapsed time. It approximates sliding behavior with far lower memory cost. Accuracy is slightly reduced, but boundary amplification is dramatically minimized.
For large-scale APIs, sliding counter is often the scalable alternative to sliding log.
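One common way to implement the sliding window counter is a weighted blend: the previous window's count is scaled by how much of it still overlaps the rolling window. The sketch below assumes that interpolation scheme (a frequent choice, though implementations vary):

```python
import time

class SlidingWindowCounterLimiter:
    """Sketch: approximate a rolling window by blending two fixed counters."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # identity -> (window_id, current_count, previous_count)

    def allow(self, identity, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        stored_id, current, previous = self.counters.get(identity, (window_id, 0, 0))
        if window_id != stored_id:
            # Roll forward; a gap of more than one window means the count is stale.
            previous = current if window_id == stored_id + 1 else 0
            current = 0
        # Weight the previous window by how much of it the rolling window still covers.
        elapsed_fraction = (now % self.window) / self.window
        estimated = previous * (1 - elapsed_fraction) + current
        if estimated >= self.limit:
            self.counters[identity] = (window_id, current, previous)
            return False
        self.counters[identity] = (window_id, current + 1, previous)
        return True
```

Only two counters per identity are stored regardless of traffic volume, which is the memory advantage over the log approach; the cost is that the estimate assumes requests were evenly distributed across the previous window.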
The token bucket algorithm models capacity as tokens accumulating over time.
Each identity has:

- a bucket with a maximum token capacity
- a refill rate, the pace at which tokens are added over time

Requests consume tokens; if no tokens remain, the request is rejected. If a client is idle, tokens accumulate up to the bucket's maximum capacity, which allows controlled bursts without violating long-term average limits. For most developer-facing APIs, token bucket is the strongest default: it enforces a steady average rate while allowing short bursts, which matches real-world API usage patterns.
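The refill-and-consume cycle can be sketched with continuous refill computed lazily at request time (an illustrative implementation; real libraries differ in refill granularity):

```python
import time

class TokenBucket:
    """Sketch: tokens refill continuously; each request consumes one token."""

    def __init__(self, capacity, refill_rate_per_second):
        self.capacity = capacity
        self.refill_rate = refill_rate_per_second
        self.tokens = float(capacity)  # start full: permits an initial burst
        self.last_refill = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.last_refill = now
        # Accumulate tokens for elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `capacity=10` and `refill_rate_per_second=1`, a client can burst 10 requests at once, then proceeds at a steady 1 request per second; idle time rebuilds burst capacity up to the cap.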
Leaky bucket enforces a constant output rate. Incoming requests enter a queue that drains at a steady pace. If the queue is full, new requests are rejected. Unlike token bucket, leaky bucket does not accumulate burst capacity; it smooths traffic strictly.
This is useful for:

- traffic shaping and steady output control
- protecting downstream systems that have fixed throughput
- smoothing bursty input into a predictable processing rate
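The queue-that-drains model can be sketched as follows (illustrative names; the drain is computed lazily from elapsed time rather than by a background worker):

```python
import time
from collections import deque

class LeakyBucket:
    """Sketch: a bounded queue drained at a fixed rate; a full queue rejects."""

    def __init__(self, capacity, leak_rate_per_second):
        self.capacity = capacity
        self.leak_rate = leak_rate_per_second
        self.queue = deque()
        self.last_leak = 0.0

    def _leak(self, now):
        # Drain however many whole requests the fixed output rate
        # has had time to serve since the last drain.
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            # Advance by the drained amount to preserve fractional time.
            self.last_leak += drained / self.leak_rate

    def submit(self, request, now=None):
        now = time.monotonic() if now is None else now
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False  # bucket full: no burst credit exists to absorb this
        self.queue.append(request)
        return True
```

Note the contrast with the token bucket: idle time here only empties the queue back to zero; it never banks extra capacity for a future burst.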
Distributed rate limiting introduces coordination complexity. If multiple instances serve traffic for the same identity, they must share state. Without coordination, each node may independently allow traffic, multiplying the effective limit.
Strong consistency ensures accurate limits but increases latency and reduces availability. Eventual consistency improves resilience but allows temporary overages. In multi-region deployments, global rate limiting across regions requires explicit tradeoffs. During a regional partition, systems may double-allow traffic if coordination fails.
Distributed rate limiting is fundamentally a consistency problem. Algorithm choice matters, but state coordination dominates complexity at scale.
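The limit-multiplication failure mode is easy to demonstrate. The sketch below uses an in-process dict as a stand-in for a shared store (in a real deployment this role is typically played by something like Redis, using an atomic increment such as INCR plus key expiry); everything here is illustrative:

```python
class SharedStore:
    """Stand-in for a shared coordination store; a plain dict here."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        # Stand-in for an atomic increment on a shared store.
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def allow(store, identity, window_id, limit):
    # Increment-then-check against shared state for a fixed window.
    return store.incr((identity, window_id)) <= limit

# Uncoordinated: each of 3 nodes keeps its own local store,
# so the effective limit is multiplied by the node count.
local_stores = [SharedStore() for _ in range(3)]
allowed_local = sum(
    allow(local_stores[i % 3], "client-a", 0, limit=100) for i in range(300)
)
print(allowed_local)  # 300

# Coordinated: all nodes consult one shared store, so the global limit holds.
shared = SharedStore()
allowed_shared = sum(allow(shared, "client-a", 0, limit=100) for i in range(300))
print(allowed_shared)  # 100
```

The in-process dict hides exactly the hard parts the text describes: in a real system the increment must be atomic across the network, and partitions or replication lag are what allow temporary overages.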
In microservices architectures, rate limiting must be placed deliberately. Limits at the edge or API gateway protect external exposure and enforce per-identity fairness. Internal limits serve a different purpose: they protect downstream dependencies and reduce the risk of cascading failures.
Problems arise when retries are not coordinated with enforcement: if a downstream service begins failing and upstream services retry aggressively, total traffic can exceed the original request volume. In fan-out systems, where one request triggers multiple internal calls, this effect multiplies quickly.
Effective rate limiting in microservices requires aligning enforcement with retry behavior and service topology. Otherwise, it can amplify instability instead of containing it.
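A rough first-order estimate makes the amplification concrete (this ignores retries of retries, so it understates the worst case; the function and its parameters are my own illustration):

```python
def retry_amplification(client_requests, fanout, retries_per_failure, failure_rate):
    """Estimate total downstream calls when upstream retries on failure.

    Each client request fans out to `fanout` internal calls; each failed
    call is retried `retries_per_failure` times. First-order only:
    retries that themselves fail and get retried are not compounded.
    """
    base_calls = client_requests * fanout
    retry_calls = base_calls * failure_rate * retries_per_failure
    return base_calls + retry_calls

# 1,000 client requests, fan-out of 5, 3 retries per failure,
# 50% downstream failure rate:
print(retry_amplification(1000, 5, 3, 0.5))  # 12500.0
```

Here 1,000 client requests become 12,500 downstream calls, a 12.5x amplification, which is why enforcement must account for retry budgets and fan-out topology, not just the original request rate.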
For most public APIs, the token bucket algorithm is the strongest default. It enforces a predictable long-term rate while allowing controlled bursts that reflect real usage patterns, which makes it well suited for developer-facing platforms.
If strict fairness is the primary requirement, particularly in security-sensitive environments, a sliding window log provides the highest accuracy because it evaluates exact recent request history. However, that precision comes with higher memory and operational cost.
For large-scale distributed systems where memory usage and coordination overhead matter, a sliding window counter is often the most practical compromise. It reduces boundary burst effects while remaining efficient enough to operate at high throughput.
Fixed window algorithms are rarely appropriate for public endpoints because boundary amplification can be exploited. Leaky bucket is better suited for traffic shaping and steady output control than for user-facing API fairness.
Ultimately, the correct choice depends on whether the system prioritizes fairness, scalability, burst tolerance, or downstream protection. In modern API architectures, the token bucket is typically the right starting point.