Kwaku Boateng

Software Engineer and Writer

Every so often, I get this question from mentees: “How can I master system design?” Knowing how to design systems at scale is a highly sought‑after skill, especially if you're preparing for interviews. Designing systems at scale is what separates senior engineers from junior and mid‑level ones.

My personal journey learning how to design applications at scale was rough and extremely fragmented. I learned a bit of everything from everywhere. I firmly believe that approach will not suffice now; a methodical, step‑by‑step approach is required to get a firm grip on system design concepts quickly. I'll lay out a structured way of mastering system design here. It might feel overwhelming at first, but once you take each concept in isolation and understand its trade‑offs, it will all come together in no time.

Below, I have outlined the structure I would recommend:

  1. Watch mock interviews of senior engineers. Watching mock interviews of highly skilled engineers will give you a fair idea of the building blocks of efficient, scalable systems. In these interviews, engineers explain their thought process and the trade‑offs they're making. I personally found them quite insightful. For example, I learned how companies like X use change data capture (CDC) through a mock interview on Hello Interview. We learn by watching others do what we aspire to do. You'll find some of these interviews on Hello Interview and I Got An Offer.

  2. Learn core concepts. Build a foundation in system design by understanding key principles:

    • Latency vs. throughput: Latency measures speed (how long it takes a single request to complete), while throughput measures capacity (how many requests a system can handle per unit time).
    • Fault tolerance: Know how a system continues working in spite of failures, e.g., through graceful degradation and redundancy.
    • Synchronous vs. asynchronous communication: Be clear on when to use each.
    • Writing to disk vs. writing to RAM: Understand how storage choices affect performance. For example, Kafka relies on sequential disk writes and the operating system’s page cache to achieve high throughput.
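The latency/throughput distinction above can be made concrete with a little arithmetic. This is a toy model (not a benchmark), using made-up numbers: at a fixed concurrency, steady-state throughput is roughly concurrency divided by latency (Little's law), so the two metrics are related but not interchangeable.

```python
# Toy illustration of latency (speed) vs. throughput (capacity).
# Assume a server that takes 50 ms per request and serves 4 requests in parallel.
latency_s = 0.050     # time for ONE request to complete (latency)
concurrency = 4       # requests processed simultaneously

# By Little's law, steady-state throughput = concurrency / latency.
throughput_rps = concurrency / latency_s
print(throughput_rps)  # ~80 requests per second

# Halving latency doubles throughput at the same concurrency;
# you can also raise throughput without touching latency by adding concurrency.
print(concurrency / (latency_s / 2))   # ~160 rps
print((concurrency * 2) / latency_s)   # ~160 rps, but each request still takes 50 ms
```

Note the last line: doubling concurrency doubles capacity, yet each individual user still waits 50 ms, which is why optimizing throughput alone can leave latency unchanged.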
  3. The building blocks.

    • API gateways and load balancing: Learn load‑balancing algorithms like round‑robin and least connections, and the advantages of each. Study API gateways and how they handle routing and rate limiting.
    • Content delivery networks (CDNs): See how CDNs cache content close to users to reduce round‑trips.
    • Message queues: Essential for event‑driven designs; they handle asynchronous tasks.
    • Caching: Critical for reducing latency. Learn common strategies (cache‑aside, write‑through) and eviction policies such as LRU and TTL‑based expiry.
    • Circuit breakers: Prevent cascading failures and improve fault tolerance.
    • Stateful vs. stateless design: Stateless systems do not retain session data between requests, so any instance can serve any request; this enables parallel execution and easy horizontal scaling. Explore why and when to use each approach.
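As a quick illustration of the two load‑balancing algorithms mentioned above, here is a toy sketch. The server names and connection counts are made up; real balancers like Nginx track live connection state.

```python
import itertools
import heapq

# Round-robin: hand requests to servers in a fixed rotation.
# Simple and fair when all requests cost about the same.
servers = ["app-1", "app-2", "app-3"]
rotation = itertools.cycle(servers)
picks = [next(rotation) for _ in range(5)]
print(picks)  # ['app-1', 'app-2', 'app-3', 'app-1', 'app-2']

# Least connections: route to the server with the fewest active connections.
# Better when request durations vary widely. Each entry: (active_conns, server).
heap = [(2, "app-1"), (0, "app-2"), (5, "app-3")]
heapq.heapify(heap)
conns, choice = heapq.heappop(heap)
print(choice)  # 'app-2' — currently the least-loaded server
```

Round‑robin needs no shared state, while least connections requires the balancer to track load per backend; that bookkeeping cost is the trade‑off for smarter routing.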
  4. Databases.

    • Database types: Understand the trade‑offs among SQL, NoSQL, graph, key‑value, vector, time‑series and wide‑column databases.
    • OLTP vs. OLAP: Know the distinction between transactional (OLTP) and analytical (OLAP) workloads.
    • Full‑text search: Traditional SQL queries (e.g., LIKE) are inefficient for large‑scale text searches. Tools such as Elasticsearch or Algolia provide specialized indexing and search capabilities.
    • Indexes: Learn why and when to apply indexes.
    • Debugging slow queries: Practice analyzing and optimizing queries in production.
    • Sharding: Explore strategies for distributing data across multiple nodes.
    • Consistent hashing: An algorithm that evenly distributes data across a cluster and minimizes data movement when nodes are added or removed.
    • Replication: Understand how data is replicated for high availability.
    • ACID transactions: ACID stands for atomicity, consistency, isolation and durability. Each property ensures data integrity.
    • Partitioning: Learn how partitioning improves database performance.
    • Strong vs. eventual consistency: Know when each consistency model is appropriate.
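The consistent hashing idea above can be sketched in a few lines. This toy ring omits virtual nodes (which real systems add to even out the key distribution), and all names (`ConsistentHashRing`, `db-1`, …) are illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring (no virtual nodes, for clarity)."""

    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)
        self.hashes = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's hash; wrap around at the end.
        i = bisect.bisect(self.hashes, _hash(key)) % len(self.ring)
        return self.ring[i][1]

keys = ["user:1", "user:2", "user:3", "user:4"]
before = {k: ConsistentHashRing(["db-1", "db-2", "db-3"]).node_for(k) for k in keys}
after = {k: ConsistentHashRing(["db-1", "db-2", "db-3", "db-4"]).node_for(k) for k in keys}

# Adding db-4 only moves the keys whose clockwise successor changed --
# every moved key lands on the new node, and all other keys stay put.
moved = [k for k in keys if before[k] != after[k]]
print(moved)
```

Contrast this with naive `hash(key) % node_count` sharding, where adding one node changes the modulus and remaps almost every key.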
  5. APIs.

    • Authentication and authorization: Authentication answers “Who are you?” and authorization answers “What can you do?” Explore role‑based permissions, attribute‑based control and OAuth.
    • Pagination: Understand offset‑based versus cursor‑based pagination.
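The difference between offset‑based and cursor‑based pagination is easiest to see in code. This is a toy in‑memory sketch (the `get_page` helper is hypothetical): a cursor remembers the last id seen, so rows inserted or deleted between fetches don't shift page boundaries the way `OFFSET` does, and the database can seek by index instead of scanning past skipped rows.

```python
# Toy dataset: rows ordered by an increasing id (the cursor column).
rows = [{"id": i, "name": f"item-{i}"} for i in range(1, 11)]

def get_page(after_id, limit=3):
    """Cursor pagination: 'give me `limit` rows with id > after_id'.
    Equivalent to: SELECT ... WHERE id > :after_id ORDER BY id LIMIT :limit"""
    start = 0 if after_id is None else after_id
    page = [r for r in rows if r["id"] > start][:limit]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor

page1, cursor = get_page(None)
print([r["id"] for r in page1], cursor)  # [1, 2, 3] 3
page2, cursor = get_page(cursor)
print([r["id"] for r in page2], cursor)  # [4, 5, 6] 6
```

With offset pagination, deleting `item-2` after fetching page 1 would make `OFFSET 3` skip `item-5`; the cursor version is immune because it anchors on the last id actually returned.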
  6. Big data.

    • Batching vs. streaming: Understand when periodic batch processing is sufficient and when streaming is more suitable, e.g., for real‑time, event‑driven architectures.
  7. Learn from real‑world systems and courses. Read engineering blog posts from companies like Shopify, Netflix, Facebook and Google to see how they solve challenges at scale. Courses such as Alex Xu’s ByteByteGo and the System Design School are great resources.

  8. Architectural patterns. Use patterns as blueprints:

    • Client–server architecture: Understand the basic two‑tier model and its limitations.
    • Microservices architecture: Learn how to break monolithic applications into independently deployable services.
    • Serverless architecture: Explore functions‑as‑a‑service offerings like AWS Lambda or Azure Functions.
    • Event‑driven architecture: Learn how events orchestrate workflows in systems like e‑commerce and IoT.
    • Peer‑to‑peer architecture: Study decentralized systems such as torrents or blockchains.
  9. Practice. Design systems you use every day: your blog, your social media apps, etc. Practice consistently, do mock interviews, and seek feedback.

Due to a spike in requests on our public API at Regulon, I had to implement a rate limiter to improve the availability and user experience of our APIs.

There are different ways to rate-limit an API. I'll list three here and explain one.

  1. Fixed window: This is the most straightforward way. You limit every user to a number of requests within a fixed time window; once the limit is reached, further requests are rejected until the window resets. This approach handles sudden bursts of traffic poorly: with a limit of 100 requests per minute, a user who uses up all 100 requests in the first few seconds must wait for the next minute to make another request.
  2. Sliding window: This is similar to the fixed window, except that the window “slides” forward with the current time: the quota is always evaluated relative to the current moment. Example: if the limit is 100 requests per hour and a request arrives at 2:30 PM, the system counts requests made between 1:30 PM and 2:30 PM; at 4:00 PM, it counts requests made between 3:00 PM and 4:00 PM.
  3. Token bucket: This is the industry standard, used by Nginx, Envoy, and AWS API Gateway. Every user has a bucket holding up to a fixed number of tokens; each request removes one token, and if the bucket is empty the request is rejected. Tokens are added back at a steady refill rate, up to the bucket's maximum capacity. This makes the approach ideal for absorbing short bursts of traffic while enforcing a smooth long‑term rate. For example, with a bucket of 100 tokens and a refill rate of 10 tokens per second, a user can burst 100 requests in one second, then make up to 10 requests per second afterward.

I chose the token bucket approach since it lets our users send short bursts of traffic while keeping the long‑term request rate under control.

Implementing the Token Bucket Algorithm:

Make sure Redis is installed. For Regulon's use case, I chose Redis because it offers fast reads: there's no need for a round trip to other services and then the database just to check whether a user is allowed to access the system. The rate limiter should be the first thing an authenticated user interacts with.

Now let's look at the code that handles the implementation:

Ruby:

Install Redis

gem install redis

or add it to your Rails app:

# Gemfile
gem 'redis'

rate_limiter.rb

require 'redis'
require 'time'

class RateLimiter
  def initialize(rate_per_minute: 100, bucket_size: 100)
    @redis = Redis.new
    # convert to a per-second refill rate
    @rate_per_second = rate_per_minute.to_f / 60.0
    @bucket_size = bucket_size
  end

  def allow_request(user_id)
    # create keys in Redis for the user
    tokens_key = "tb:#{user_id}:tokens"
    ts_key     = "tb:#{user_id}:ts"

    # read the stored state; fall back to a full bucket on the first request
    tokens = @redis.get(tokens_key)&.to_f || @bucket_size
    last_ts = @redis.get(ts_key)&.to_f || Time.now.to_f

    # calculate the time elapsed since the last request
    now = Time.now.to_f
    elapsed = now - last_ts

    # refill tokens based on elapsed time, capped at the bucket size:
    # if 0.5 seconds passed at a refill rate of 10 tokens/second, add 5 tokens
    tokens_to_add = elapsed * @rate_per_second
    tokens = [@bucket_size, tokens + tokens_to_add].min

    # check if enough tokens are available
    if tokens < 1
      retry_after = (1 - tokens) / @rate_per_second
      @redis.set(tokens_key, tokens)
      @redis.set(ts_key, now)
      return [false, retry_after]
    end

    tokens -= 1

    # save state in one round trip; redis-rb yields the pipeline to the block
    @redis.pipelined do |pipeline|
      pipeline.set(tokens_key, tokens)
      pipeline.set(ts_key, now)
    end

    [true, nil]
  end
end

Usage:

    limiter = RateLimiter.new(rate_per_minute: 100)
    allowed, retry_after = limiter.allow_request(user_id)

    unless allowed
      render json: {
        error: "Too Many Requests",
        retry_after: retry_after
      }, status: 429
    end

For Python:

rate_limiter.py

import time
import redis

class RateLimiter:
    def __init__(self, redis_client, rate_per_minute=100, bucket_size=100):
        self.redis = redis_client
        self.rate_per_second = rate_per_minute / 60.0
        self.bucket_size = bucket_size

    def allow_request(self, user_id):
        key_tokens = f"tb:{user_id}:tokens"
        key_timestamp = f"tb:{user_id}:ts"

        # fetch current values
        tokens = self.redis.get(key_tokens)
        last_ts = self.redis.get(key_timestamp)

        tokens = float(tokens) if tokens else self.bucket_size
        last_ts = float(last_ts) if last_ts else time.time()

        now = time.time()
        elapsed = now - last_ts

        # refill based on elapsed time
        tokens = min(self.bucket_size, tokens + elapsed * self.rate_per_second)

        if tokens < 1:
            # not enough tokens - reject
            self.redis.set(key_tokens, tokens)
            self.redis.set(key_timestamp, now)
            retry_after = (1 - tokens) / self.rate_per_second
            return False, retry_after

        # consume 1 token
        tokens -= 1

        # save state
        pipeline = self.redis.pipeline()
        pipeline.set(key_tokens, tokens)
        pipeline.set(key_timestamp, now)
        pipeline.execute()

        return True, None

Usage:

from flask import Flask, request, jsonify
import redis
from rate_limiter import RateLimiter

app = Flask(__name__)
r = redis.Redis(host="localhost", port=6379)
limiter = RateLimiter(r)

@app.route("/api/test")
def test():
    user_id = request.headers.get("X-User-ID", "anonymous")
    allowed, retry_after = limiter.allow_request(user_id)

    if not allowed:
        return jsonify({
            "error": "Too Many Requests",
            "retry_after": retry_after
        }), 429

    return jsonify({"message": "success"})

In conclusion, the token bucket algorithm is an effective way to rate‑limit APIs. Try it out, and happy coding.