The art of designing systems that scale
Imagine you're the architect of a restaurant. You don't just decide where the tables go. You also decide how the kitchen flows, how many cooks you need, where to store ingredients, and what happens when 500 customers arrive instead of 50.
System design is exactly that: planning how to build software that works well today and can grow tomorrow.
A good system design isn't the most complex one; it's the one that solves the current problem and leaves room to grow.
Monolith vs Microservices
The first architectural decision you'll face.
Monolith: All in one
┌─────────────────────────────────────┐
│             APPLICATION             │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐    │
│ │Auth │ │Users│ │Orders│ │Pay  │    │
│ └─────┘ └─────┘ └──────┘ └─────┘    │
│           Single database           │
│               ┌────┐                │
│               │ DB │                │
│               └────┘                │
└─────────────────────────────────────┘
Advantages:
- Simple to develop and deploy
- Easy to debug (everything in one place)
- Single database = consistency
- Ideal for small teams (<10 devs)
Disadvantages:
- Scaling is all or nothing
- One bug can bring everything down
- Risky deployments
- Hard to maintain as it grows
Microservices: Divide and conquer
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Auth   │  │  Users  │  │ Orders  │
│ Service │  │ Service │  │ Service │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
┌────┴────┐  ┌────┴────┐  ┌────┴────┐
│Auth DB  │  │Users DB │  │Orders DB│
└─────────┘  └─────────┘  └─────────┘
Advantages:
- Scale only what you need
- Independent teams
- One service fails, not all
- Different technologies per service
Disadvantages:
- High operational complexity
- Distributed debugging is hard
- Eventual consistency (not immediate)
- Requires mature DevOps
When to use each
| Scenario | Recommendation |
|---|---|
| Startup, MVP, < 5 devs | Monolith |
| Proven product, > 20 devs | Microservices |
| Parts with very different loads | Hybrid |
| Don't know which to choose | Monolith |
Golden rule: Start with a monolith. Extract microservices when the pain is real, not imagined.
The CAP Theorem
In distributed systems, you can only have 2 of 3:
        Consistency
             /\
            /  \
           /    \
          /      \
         /   ??   \
        /          \
       /____________\
Availability      Partition
                  Tolerance
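- Consistency (C): Every read sees the most recent write, so everyone sees the same data at the same time
- Availability (A): Every request gets a response, even if the data is not the latest
- Partition Tolerance (P): The system keeps working when the network between nodes fails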
In practice
Network partitions CAN ALWAYS happen. So you really choose between:
| System | Chooses | Sacrifices | Example |
|---|---|---|---|
| CP | Consistency | Availability | Banks, inventory |
| AP | Availability | Consistency | Social networks, cache |
Real example: In a bank, if there's a network failure, you prefer the ATM to say "Not available" (CP) rather than let you withdraw money you don't have (AP).
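The ATM trade-off can be sketched as a toy function. This is only an illustration under stated assumptions: `withdraw`, `ServiceUnavailable`, and the `mode` labels are invented for this sketch, and a real bank system is far more involved.

```python
class ServiceUnavailable(Exception):
    """Raised when a CP node refuses to answer during a partition."""


def withdraw(local_balance: int, amount: int, partitioned: bool, mode: str) -> int:
    """Return the new balance as seen by this node.

    mode is "CP" or "AP"; both are labels used only in this sketch.
    """
    if mode not in ("CP", "AP"):
        raise ValueError("mode must be 'CP' or 'AP'")
    if partitioned and mode == "CP":
        # CP: without contact with the other nodes the local balance may be
        # stale, so refuse instead of risking an inconsistent answer.
        raise ServiceUnavailable("cannot confirm the balance right now")
    # AP, or no partition: answer from the local copy, which may be stale.
    if amount > local_balance:
        raise ValueError("insufficient funds")
    return local_balance - amount
```

During a partition the CP branch gives up availability by raising an error, while the AP branch keeps answering from data that may already be out of date.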
Scaling: Vertical vs Horizontal
Vertical: Bigger machine
Before:        After:
┌─────┐        ┌─────────┐
│ 4GB │   →    │  64GB   │
│ 2CPU│        │  32CPU  │
└─────┘        └─────────┘
- Simple: just upgrade the server
- Has physical limits
- Single point of failure
Horizontal: More machines
Before:        After:
┌─────┐        ┌─────┐ ┌─────┐ ┌─────┐
│ 4GB │   →    │ 4GB │ │ 4GB │ │ 4GB │
└─────┘        └─────┘ └─────┘ └─────┘
- Theoretically unlimited capacity
- Requires a load balancer
- Your app must be stateless (no session data kept in a single server's memory)
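To see why statelessness matters, here is a minimal sketch comparing a server that keeps sessions in its own memory with one that uses a shared store. The class names are hypothetical, and the plain dict stands in for an external store such as Redis or a database.

```python
from typing import Dict, Optional


class StatefulServer:
    """Keeps sessions in its own memory, so only this instance knows them."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def login(self, session_id: str, user: str) -> None:
        self._sessions[session_id] = user

    def whoami(self, session_id: str) -> Optional[str]:
        return self._sessions.get(session_id)


class StatelessServer:
    """Holds no session data itself; every instance shares one external store."""

    def __init__(self, shared_store: Dict[str, str]) -> None:
        self._store = shared_store  # stand-in for Redis or a database

    def login(self, session_id: str, user: str) -> None:
        self._store[session_id] = user

    def whoami(self, session_id: str) -> Optional[str]:
        return self._store.get(session_id)
```

With `StatefulServer`, a user who logs in on one instance is unknown to the next one the load balancer picks. With `StatelessServer`, any instance can serve any request.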
Load Balancers
Distribute traffic among multiple servers.
                 ┌─────────┐
                 │  Load   │
        Users →  │Balancer │
                 └────┬────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │Server 1 │   │Server 2 │   │Server 3 │
   └─────────┘   └─────────┘   └─────────┘
Distribution algorithms
| Algorithm | How it works | When to use |
|---|---|---|
| Round Robin | Each server in turn: 1, 2, 3, 1, 2, 3... | Servers with equal capacity |
| Least Connections | Sends to the server with the fewest active connections | Long-lived connections |
| IP Hash | Same client → same server | Sticky sessions |
| Weighted | Sends more traffic to more powerful servers | Servers with different capacity |
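The four algorithms in the table can be sketched in a few lines each. This is an illustrative sketch, not how any particular load balancer implements them; the function names and the use of SHA-256 for the IP hash are assumptions made for the example.

```python
import hashlib
import itertools
from typing import Dict, Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Yield servers in a repeating 1, 2, 3, 1, 2, 3... order."""
    return itertools.cycle(servers)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the server that currently holds the fewest open connections."""
    return min(active, key=active.get)


def ip_hash(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to a fixed server so repeat requests land together."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Repeat each server in proportion to its weight, then cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)
```

For example, `weighted_round_robin({"big": 2, "small": 1})` hands out `big, big, small` and then repeats. Note that the simple modulo in `ip_hash` remaps most clients when the server list changes.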
Scaling Databases
Replication: Read copies
       Writes
          │
          ▼
     ┌─────────┐
     │ Primary │──────────────┐
     │  (RW)   │              │ Replication
     └─────────┘              │
          │                   │
          ▼                   ▼
     ┌─────────┐         ┌─────────┐
     │ Replica │         │ Replica │
     │  (RO)   │         │  (RO)   │
     └─────────┘         └─────────┘
          ▲                   ▲
          │                   │
        Reads               Reads
- Scales reads, not writes
- Eventual consistency (replication lag)
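Replication lag is easier to picture with a toy model. The sketch below assumes an in-memory `ReplicatedStore` (a hypothetical class) where replication only happens when `replicate()` is called, which makes the lag visible on purpose.

```python
import itertools
from typing import Dict, List, Optional


class ReplicatedStore:
    """Toy primary/replica setup: writes go to the primary, reads to replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: Dict[str, str] = {}
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key: str, value: str) -> None:
        # Only the primary accepts writes.
        self.primary[key] = value

    def replicate(self) -> None:
        # In a real database this runs continuously in the background.
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key: str) -> Optional[str]:
        # Reads rotate across the read-only replicas.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)
```

A `read` issued right after a `write` returns `None` until `replicate()` has run. That window is the replication lag behind "eventual consistency".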
Sharding: Split the data
user_id 1-1000   user_id 1001-2000   user_id 2001-3000
      │                  │                   │
      ▼                  ▼                   ▼
 ┌─────────┐        ┌─────────┐         ┌─────────┐
 │ Shard 1 │        │ Shard 2 │         │ Shard 3 │
 └─────────┘        └─────────┘         └─────────┘
- Scales both reads and writes
- Complexity: JOINs between shards are expensive
- Choosing a good shard key is critical
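Routing a request to the right shard comes down to a small function of the shard key. The sketch below shows the range scheme from the diagram plus a hash-based alternative; the function names, the `SHARD_SIZE` constant, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib

SHARD_SIZE = 1000  # users per shard, matching the ranges in the diagram


def range_shard(user_id: int) -> int:
    """Return the 1-based shard number for a user_id using fixed ranges."""
    if user_id < 1:
        raise ValueError("user_id must be positive")
    return (user_id - 1) // SHARD_SIZE + 1


def hash_shard(key: str, shard_count: int) -> int:
    """Return a 0-based shard index from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Range sharding keeps neighboring IDs together, which is convenient for range scans but can create hot shards when new users are the most active. Hashing spreads keys more evenly at the cost of making range queries touch every shard.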
Caching: The key to performance
Cache strategies
Cache-Aside (Lazy Loading)
1. App requests data
2. Cache miss? → Read from DB → Store in cache
3. Cache hit? → Return from cache
┌─────┐  miss  ┌─────┐      ┌────┐
│ App │ ─────→ │Cache│      │ DB │
│     │ ←───── │     │      │    │
└─────┘  hit   └─────┘      └────┘
   │                          ▲
   └──────────────────────────┘
         miss: read and store
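The three steps above can be sketched as a small wrapper. This is a minimal sketch: the `CacheAside` class and its `load_from_db` callback are hypothetical names, the cache is a plain dict, and there is no expiry or invalidation.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """The app checks the cache first and fills it from the DB on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._cache: Dict[str, str] = {}
        self._load_from_db = load_from_db

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            # Cache hit: return without touching the database.
            return self._cache[key]
        # Cache miss: read from the database and store the result.
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value
```

The second `get` for the same key never reaches the database. In a real system you would also set a TTL or invalidate the entry when the underlying row changes.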
Write-Through
Write to cache AND DB at the same time
- Data always consistent
- Slower writes
Write-Behind (Write-Back)
Write to cache, then async to DB
- Fast writes
- Risk of data loss if cache fails
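The difference between the two write strategies fits in a short sketch. Both classes are hypothetical and use plain dicts for the cache and the DB. In a real write-behind setup, `flush` would run from a background task rather than being called by hand.

```python
from typing import Dict


class WriteThroughCache:
    """Every write goes to the DB and the cache before returning."""

    def __init__(self, db: Dict[str, str]) -> None:
        self.db = db
        self.cache: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self.db[key] = value      # the slower, durable write
        self.cache[key] = value   # cache stays in sync with the DB


class WriteBehindCache:
    """Writes land in the cache first and reach the DB later, in batches."""

    def __init__(self, db: Dict[str, str]) -> None:
        self.db = db
        self.cache: Dict[str, str] = {}
        self.pending: Dict[str, str] = {}  # entries not yet persisted

    def put(self, key: str, value: str) -> None:
        self.cache[key] = value
        self.pending[key] = value  # lost if the cache dies before flush()

    def flush(self) -> int:
        """Persist pending writes and return how many were written."""
        written = len(self.pending)
        self.db.update(self.pending)
        self.pending.clear()
        return written
```

Anything still sitting in `pending` when the cache process crashes is the data loss risk mentioned above.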
What to cache
| Candidate | Priority |
|---|---|
| Data that doesn't change (config) | High |
| Frequently read data | High |
| Expensive calculation results | High |
| Active user data | Medium |
| Data that changes every second | Low |
Message Queues
For asynchronous communication between services.
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│Producer │ ──→ │    Queue    │ ──→ │  Consumer   │
│  (API)  │     │ (RabbitMQ)  │     │  (Worker)   │
└─────────┘     └─────────────┘     └─────────────┘
Use cases
- Email sending: API enqueues, worker sends
- Image processing: Upload enqueues, worker processes
- Notifications: Event enqueues, multiple consumers notify
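The email use case can be sketched with Python's standard library alone. Here `queue.Queue` stands in for a real broker such as RabbitMQ, and `run_email_pipeline` is a hypothetical helper that "sends" by appending to a list.

```python
import queue
import threading
from typing import List


def run_email_pipeline(recipients: List[str]) -> List[str]:
    """Enqueue one job per recipient and let a worker thread process them."""
    jobs: "queue.Queue" = queue.Queue()
    sent: List[str] = []

    def worker() -> None:
        while True:
            recipient = jobs.get()
            if recipient is None:  # sentinel: no more work
                jobs.task_done()
                break
            # A real worker would call an email provider here.
            sent.append(f"sent welcome email to {recipient}")
            jobs.task_done()

    consumer = threading.Thread(target=worker)
    consumer.start()

    # The producer (the API) only enqueues and returns immediately.
    for recipient in recipients:
        jobs.put(recipient)
    jobs.put(None)

    jobs.join()      # wait until every job has been processed
    consumer.join()
    return sent
```

The producer never waits for an email to go out. It only pays the cost of `jobs.put`, which is the point of moving slow work behind a queue.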
Popular tools
| Tool | Best for |
|---|---|
| RabbitMQ | Traditional messaging, complex routing |
| Redis Streams | Simple queues when you already run Redis |
| Kafka | High volume, event sourcing |
| SQS | AWS native, simple |
Practical case: Designing a URL Shortener
Requirements
Functional:
- Shorten long URL → short code
- Redirect code → original URL
- URLs expire (optional)
Non-functional:
- 100M new URLs/month
- 10:1 read:write ratio
- Latency < 100ms
Estimations
URLs/month: 100M
URLs/sec: 100M / (30 * 24 * 3600) ≈ 40 URLs/sec writes
Reads: 40 * 10 = 400 URLs/sec reads
Storage (5 years):
100M * 12 * 5 = 6B URLs
6B * 500 bytes = 3TB
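The arithmetic above can be rechecked with a few lines of Python. The variable names are invented for this sketch, and the inputs are the requirements already stated. The exact write rate is closer to 39 per second; the estimate rounds it to 40 to keep the mental math simple.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

new_urls_per_month = 100_000_000
read_write_ratio = 10
bytes_per_url = 500
years = 5

# Throughput
writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH  # about 38.6
reads_per_sec = writes_per_sec * read_write_ratio        # about 386

# Storage over five years
total_urls = new_urls_per_month * 12 * years             # 6 billion
storage_tb = total_urls * bytes_per_url / 10**12         # 3 TB (decimal)

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec:  {reads_per_sec:.1f}")
print(f"total URLs: {total_urls:,}")
print(f"storage:    {storage_tb:.1f} TB")
```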
Short code design
Base62: [a-zA-Z0-9] = 62 characters
7 characters = 62^7 = 3.5 trillion combinations
Enough for 100M/month for centuries
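One common way to produce such codes is to encode a numeric ID in Base62. The sketch below assumes each URL already has a unique integer ID (for example from a counter), which the design above does not specify. The function names and the fixed-length padding are choices made for this example.

```python
import string

# [a-zA-Z0-9], in that order: 26 + 26 + 10 = 62 characters
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
BASE = len(ALPHABET)


def encode_base62(number: int, length: int = 7) -> str:
    """Encode a non-negative integer as a fixed-length Base62 code."""
    if number < 0 or number >= BASE ** length:
        raise ValueError("number does not fit in the requested length")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Turn a Base62 code back into the integer it was built from."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number
```

With this alphabet, ID `0` becomes `aaaaaaa` and ID `62` becomes `aaaaaba`. Sequential IDs produce guessable codes, so a real service might shuffle the alphabet or hash the ID first.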
Final architecture
                 ┌───────────┐
     Users ───→  │    LB     │
                 └─────┬─────┘
                       │
           ┌───────────┴───────────┐
           ▼                       ▼
      ┌─────────┐             ┌─────────┐
      │  API 1  │             │  API 2  │
      └────┬────┘             └────┬────┘
           │                       │
           └───────────┬───────────┘
                       │
                 ┌─────┴─────┐
                 │   Redis   │ (Cache hot URLs)
                 └─────┬─────┘
                       │
                 ┌─────┴─────┐
                 │ Postgres  │ (Sharded by hash)
                 └───────────┘
Recommended resources
- Book: "Designing Data-Intensive Applications" - Martin Kleppmann
- Book: "System Design Interview" - Alex Xu
- Web: system-design-primer
- Practice: Exercism System Design
Practice
-> Architecture Workshop - Design a real system step by step