🏗️

System Design

👨‍🍳 Chef

The art of designing systems that scale

Imagine you're the architect of a restaurant. You decide not only where the tables go, but also how the kitchen flows, how many cooks you need, where ingredients are stored, and what happens when 500 customers arrive instead of 50.

System design is exactly that: planning how to build software that works well today and can grow tomorrow.

A good system design isn't the most complex one; it's the one that solves today's problem and leaves room to grow.


Monolith vs Microservices

The first architectural decision you'll face.

Monolith: All in one

┌─────────────────────────────────────┐
│           APPLICATION               │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ Auth │ │Users │ │Orders│ │ Pay  │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│         Single database             │
│              ┌───┐                  │
│              │ DB│                  │
│              └───┘                  │
└─────────────────────────────────────┘

Advantages:

  • Simple to develop and deploy
  • Easy to debug (everything in one place)
  • Single database = consistency
  • Ideal for small teams (<10 devs)

Disadvantages:

  • Scale all or nothing
  • One bug can bring everything down
  • Risky deployments
  • Hard to maintain as it grows

Microservices: Divide and conquer

┌─────────┐  ┌─────────┐  ┌─────────┐
│  Auth   │  │  Users  │  │ Orders  │
│ Service │  │ Service │  │ Service │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
┌────┴────┐  ┌────┴────┐  ┌────┴────┐
│Auth DB  │  │Users DB │  │Orders DB│
└─────────┘  └─────────┘  └─────────┘

Advantages:

  • Scale only what you need
  • Independent teams
  • One service fails, not all
  • Different technologies per service

Disadvantages:

  • High operational complexity
  • Distributed debugging is hard
  • Eventual consistency (not immediate)
  • Requires mature DevOps

When to use each

Scenario                          Recommendation
Startup, MVP, < 5 devs            Monolith
Proven product, > 20 devs         Microservices
Parts with very different loads   Hybrid
Don't know which to choose        Monolith

Golden rule: Start with monolith. Extract microservices when the pain is real, not imagined.


The CAP Theorem

In a distributed system, you can guarantee only 2 of these 3 properties at the same time:

        Consistency
           /\
          /  \
         /    \
        /      \
       /   ??   \
      /          \
     /____________\
Availability    Partition
                Tolerance
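  • Consistency (C): Every read sees the most recent write, so everyone sees the same data at the same time
  • Availability (A): Every request gets a response, even if some nodes are down
  • Partition Tolerance (P): The system keeps working when the network between nodes fails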

In practice

Network partitions can always happen, so partition tolerance is not optional. The real choice is what to give up while a partition lasts:

System   Chooses        Sacrifices     Example
CP       Consistency    Availability   Banks, inventory
AP       Availability   Consistency    Social networks, cache

Real example: In a bank, if there's a network failure, you prefer the ATM to say "Not available" (CP) rather than let you withdraw money you don't have (AP).


Scaling: Vertical vs Horizontal

Vertical: Bigger machine

Before:         After:
┌─────┐         ┌─────────┐
│ 4GB │   →     │  64GB   │
│ 2CPU│         │  32CPU  │
└─────┘         └─────────┘
  • Simple: just upgrade the server
  • Has physical limits
  • Single point of failure

Horizontal: More machines

Before:         After:
┌─────┐         ┌─────┐ ┌─────┐ ┌─────┐
│ 4GB │   →     │ 4GB │ │ 4GB │ │ 4GB │
└─────┘         └─────┘ └─────┘ └─────┘
  • Theoretically infinite
  • Requires Load Balancer
  • Your app must be stateless

Load Balancers

Distribute traffic among multiple servers.

                    ┌─────────┐
                    │  Load   │
        Users   →   │Balancer │
                    └────┬────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
      ┌─────────┐   ┌─────────┐   ┌─────────┐
      │Server 1 │   │Server 2 │   │Server 3 │
      └─────────┘   └─────────┘   └─────────┘

Distribution algorithms

Algorithm           How it works                              When to use
Round Robin         In turn: 1, 2, 3, 1, 2, 3...              Identical servers
Least Connections   Server with the fewest open connections   Long-lived connections
IP Hash             Same client → same server                 Sticky sessions
Weighted            More traffic to more powerful servers     Servers of different sizes
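
The four algorithms in the table can be sketched in a few lines each. This is a minimal illustration, not how any particular load balancer implements them; the class and function names are made up for the example, and servers are plain strings.

```python
import hashlib
import itertools
import random


class RoundRobin:
    """Hands out servers in a fixed rotation: 1, 2, 3, 1, 2, 3..."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Picks the server that currently has the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # the caller now holds one connection
        return server

    def release(self, server):
        self.active[server] -= 1  # call when the connection closes


def ip_hash(servers, client_ip):
    """Maps the same client IP to the same server on every call."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def weighted(servers, weights):
    """Picks a server at random, in proportion to its weight."""
    return random.choices(servers, weights=weights, k=1)[0]
```

Note that `ip_hash` only stays sticky while the server list is unchanged; adding or removing a server remaps most clients.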

Scaling Databases

Replication: Read copies

     Writes
        │
        ▼
   ┌─────────┐
   │ Primary │──────────────┐
   │  (RW)   │              │ Replication
   └─────────┘              │
        │                   │
        ▼                   ▼
   ┌─────────┐         ┌─────────┐
   │ Replica │         │ Replica │
   │  (RO)   │         │  (RO)   │
   └─────────┘         └─────────┘
        ▲                   ▲
        │                   │
      Reads              Reads
  • Scales reads, not writes
  • Eventual consistency (replication lag)

Sharding: Split the data

user_id 1-1000      user_id 1001-2000    user_id 2001-3000
       │                   │                   │
       ▼                   ▼                   ▼
  ┌─────────┐         ┌─────────┐         ┌─────────┐
  │ Shard 1 │         │ Shard 2 │         │ Shard 3 │
  └─────────┘         └─────────┘         └─────────┘
  • Scales both reads and writes
  • Complexity: JOINs between shards are expensive
  • Choosing a good shard key is critical
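
To make the shard key choice concrete, here is a small sketch of two common routing schemes: range-based, as in the diagram above, and hash-based. The shard names and function names are assumptions made for the example.

```python
import bisect
import hashlib

# Inclusive upper bound of each shard's user_id range, as in the diagram.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]
SHARDS = ["shard-1", "shard-2", "shard-3"]


def shard_by_range(user_id):
    """Range sharding: consecutive ids live on the same shard."""
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if user_id < 1 or index == len(SHARDS):
        raise ValueError(f"user_id {user_id} is outside every shard range")
    return SHARDS[index]


def shard_by_hash(key):
    """Hash sharding: spreads keys evenly, but loses range locality."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Range sharding keeps range queries on one shard but can create hot spots (for example, if the newest users are the most active). Hash sharding evens out the load at the cost of scattering related keys.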

Caching: The key to performance

Cache strategies

Cache-Aside (Lazy Loading)

1. App asks the cache for the data
2. Cache hit? → Return it from the cache
3. Cache miss? → Read from DB → Store in cache → Return it

┌─────┐  miss   ┌─────┐        ┌────┐
│ App │ ──────→ │Cache│        │ DB │
│     │ ←────── │     │        │    │
└─────┘  hit    └─────┘        └────┘
    │                              ▲
    └──────────────────────────────┘
         miss: read and store
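
The cache-aside flow can be sketched as follows. This is a minimal illustration: a dict stands in for the cache (Redis, Memcached, or similar), and `load_from_db` is a hypothetical function supplied by the caller.

```python
class CacheAside:
    """Reads go to the cache first; misses fall through to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical DB read function
        self._cache = {}                   # stand-in for a real cache server
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:             # cache hit: no DB access
            self.hits += 1
            return self._cache[key]
        self.misses += 1                   # cache miss: read and store
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the DB row changes."""
        self._cache.pop(key, None)
```

Until `invalidate` is called (or an expiry kicks in, which this sketch omits), the cache keeps returning the old value even if the database has changed.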

Write-Through

Write to cache AND DB at the same time
- Data always consistent
- Slower writes

Write-Behind (Write-Back)

Write to cache, then async to DB
- Fast writes
- Risk of data loss if cache fails
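
The difference between the two write strategies is easiest to see side by side. In this sketch, dicts stand in for both the cache and the database, and `flush()` is an assumed name for whatever background job would persist pending writes.

```python
from collections import deque


class WriteThroughCache:
    """Every write reaches the database and the cache before returning."""

    def __init__(self, db):
        self.db = db            # dict standing in for the database
        self.cache = {}

    def write(self, key, value):
        self.db[key] = value    # slower: the caller waits for the DB
        self.cache[key] = value


class WriteBehindCache:
    """Writes land in the cache first and reach the database on flush()."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.pending = deque()  # writes not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.pending.append((key, value))  # fast: no DB access here

    def flush(self):
        """Persist pending writes; in practice this runs asynchronously."""
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value
```

Anything still in `pending` when the cache process dies is lost, which is the data-loss risk mentioned above.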

What to cache

Candidate                           Priority
Data that doesn't change (config)   High
Frequently read data                High
Expensive calculation results       High
Active user data                    Medium
Data that changes every second      Low

Message Queues

For asynchronous communication between services.

┌─────────┐     ┌─────────────┐     ┌─────────────┐
│Producer │ ──→ │    Queue    │ ──→ │  Consumer   │
│ (API)   │     │ (RabbitMQ)  │     │  (Worker)   │
└─────────┘     └─────────────┘     └─────────────┘

Use cases

  • Email sending: API enqueues, worker sends
  • Image processing: Upload enqueues, worker processes
  • Notifications: Event enqueues, multiple consumers notify
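
The email use case can be sketched with Python's standard library. Here an in-process `queue.Queue` stands in for a real broker such as RabbitMQ, and "sending" just records a string; the function name is made up for the example.

```python
import queue
import threading


def run_email_pipeline(recipients):
    """Producer enqueues recipients; a worker thread 'sends' the emails."""
    jobs = queue.Queue()  # in-process stand-in for a message broker
    sent = []

    def worker():
        while True:
            recipient = jobs.get()
            if recipient is None:  # sentinel: no more jobs
                break
            sent.append(f"email sent to {recipient}")

    consumer = threading.Thread(target=worker)
    consumer.start()

    for recipient in recipients:   # producer side: enqueue and move on
        jobs.put(recipient)
    jobs.put(None)                 # tell the worker to stop
    consumer.join()
    return sent
```

The producer never waits for an email to be sent; it only waits for `put`, which is why the API can respond quickly even when sending is slow.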

Popular tools

Tool            Best for
RabbitMQ        Traditional messaging, complex routing
Redis Streams   Simple, you already have Redis
Kafka           High volume, event sourcing
SQS             AWS native, simple

Practical case: Designing a URL Shortener

Requirements

Functional:

  • Shorten long URL → short code
  • Redirect code → original URL
  • URLs expire (optional)

Non-functional:

  • 100M new URLs/month
  • 10:1 read:write ratio
  • Latency < 100ms

Estimations

URLs/month: 100M
URLs/sec: 100M / (30 * 24 * 3600) ≈ 40 URLs/sec writes
Reads: 40 * 10 = 400 URLs/sec reads

Storage (5 years):
100M * 12 * 5 = 6B URLs
6B * 500 bytes = 3TB
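
The same back-of-the-envelope arithmetic, written as a small function so the inputs can be changed. The function name and parameters are made up for the example; the numbers are the ones used above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def estimate(urls_per_month, read_write_ratio, years, bytes_per_url):
    """Returns (writes/sec, reads/sec, total URLs, storage in TB)."""
    writes_per_sec = urls_per_month / SECONDS_PER_MONTH
    reads_per_sec = writes_per_sec * read_write_ratio
    total_urls = urls_per_month * 12 * years
    storage_tb = total_urls * bytes_per_url / 1e12
    return writes_per_sec, reads_per_sec, total_urls, storage_tb


writes, reads, total, storage_tb = estimate(
    urls_per_month=100_000_000,
    read_write_ratio=10,
    years=5,
    bytes_per_url=500,
)
print(f"{writes:.1f} writes/sec, {reads:.1f} reads/sec")
print(f"{total:,} URLs, {storage_tb:.1f} TB")
```

The exact write rate comes out slightly under 40 per second; rounding up is the usual practice for capacity estimates.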

Short code design

Base62: [a-zA-Z0-9] = 62 characters

7 characters = 62^7 = 3.5 trillion combinations
At 100M new URLs per month (1.2B per year), that is enough for roughly 2,900 years
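
One way to produce such codes is to Base62-encode a numeric ID. The sketch below assumes each URL gets a unique integer (for example from a database sequence); the text above does not say how codes are generated, so treat that as an assumption of the example.

```python
import string

# a-z, A-Z, 0-9: the 62 characters listed above.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def encode(number, length=7):
    """Encodes a non-negative integer as a fixed-length Base62 code."""
    if not 0 <= number < 62 ** length:
        raise ValueError("number does not fit in the requested length")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))  # most significant character first


def decode(code):
    """Inverse of encode: turns a Base62 code back into its integer."""
    number = 0
    for char in code:
        number = number * 62 + ALPHABET.index(char)
    return number
```

Sequential IDs make codes guessable. If that matters, a common alternative is to pick codes at random and retry on collision, at the cost of an extra lookup per write.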

Final architecture

                    ┌───────────┐
     Users     ───→ │    LB     │
                    └─────┬─────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
         ┌─────────┐             ┌─────────┐
         │ API 1   │             │ API 2   │
         └────┬────┘             └────┬────┘
              │                       │
              └───────────┬───────────┘
                          │
                    ┌─────┴─────┐
                    │   Redis   │ (Cache hot URLs)
                    └─────┬─────┘
                          │
                    ┌─────┴─────┐
                    │ Postgres  │ (Sharded by hash)
                    └───────────┘

Practice

-> Architecture Workshop - Design a real system step by step