The black box problem
Imagine driving a car without a dashboard: you don't know how fast you're going, how much fuel is left, or whether the engine is about to overheat. That's what operating an application in production without observability is like.
Without observability, you only learn about problems when your users complain. With observability, you detect them before users notice.
The three pillars of observability
Observability is built on three complementary pillars:
┌──────────────────┬──────────────────┬──────────────────┐
│       LOGS       │     METRICS      │      TRACES      │
│                  │                  │                  │
│ What happened?   │ How much?        │ Where?           │
│                  │                  │                  │
│ Discrete         │ Numbers you      │ Path of a        │
│ events with      │ can graph        │ request across   │
│ context          │ and alert on     │ services         │
└──────────────────┴──────────────────┴──────────────────┘
| Pillar | Question it answers | Example |
|---|---|---|
| Logs | What exactly happened? | "User X failed login due to wrong password" |
| Metrics | How is the system doing? | "95% of requests < 200ms, 2% errors" |
| Traces | Where did the request go? | "API → Auth → DB → Cache → Response (340ms)" |
Logs: The event record
Why console.log doesn't scale
// In development: works fine
console.log('User logged in:', userId);
// In production with 1000 requests/sec:
// - How do you search for a specific log?
// - How do you correlate logs from the same request?
// - How do you filter only errors?
// - How do you keep logs when the server dies?
Structured logging with Pino
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// Create logger with request context
function createRequestLogger(req) {
return logger.child({
requestId: req.id,
userId: req.user?.id,
path: req.path,
method: req.method,
});
}
// Usage in endpoint
app.get('/api/orders', async (req, res) => {
const log = createRequestLogger(req);
log.info('Fetching orders');
try {
const orders = await getOrders(req.user.id);
log.info({ count: orders.length }, 'Orders fetched successfully');
res.json(orders);
} catch (error) {
log.error({ error: error.message, stack: error.stack }, 'Failed to fetch orders');
res.status(500).json({ error: 'Internal error' });
}
});
Structured JSON output:
{
"level": "info",
"time": "2026-01-15T10:30:00.000Z",
"requestId": "abc-123",
"userId": "user-456",
"path": "/api/orders",
"method": "GET",
"msg": "Orders fetched successfully",
"count": 25
}
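To make correlation automatic, you can attach the child logger to each request in a middleware instead of calling createRequestLogger inside every handler. A minimal sketch (the req.log property and the UUID-based request id are assumptions, not part of the code above):
import { randomUUID } from 'node:crypto';
// Attach a per-request child logger so every handler shares the same requestId
app.use((req, res, next) => {
  req.id = req.headers['x-request-id'] || randomUUID(); // reuse an upstream id if present
  req.log = createRequestLogger(req);                   // child logger with request context
  next();
});
// Handlers can then log with full context:
// req.log.info('Fetching orders');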
Log levels
| Level | When to use | Example |
|---|---|---|
| error | Something failed and needs attention | DB connection lost |
| warn | Something odd but not critical | Rate limit almost reached |
| info | Important business events | User created order |
| debug | Details for troubleshooting | Query executed in 50ms |
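The configured level acts as a threshold: anything below it is dropped before serialization. A quick sketch using the logger defined above, assuming LOG_LEVEL=info (the values are illustrative):
logger.debug({ durationMs: 50 }, 'Query executed');                  // suppressed: debug < info
logger.info({ orderId: 'order-789' }, 'User created order');         // emitted
logger.warn({ remaining: 5 }, 'Rate limit almost reached');          // emitted
logger.error({ error: 'connection refused' }, 'DB connection lost'); // emitted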
Metrics: Numbers that matter
The 4 Golden Signals (Google SRE)
Google SRE defines 4 key signals for monitoring any service:
┌───────────────┬───────────────┬───────────────┬───────────────┐
│    LATENCY    │    TRAFFIC    │    ERRORS     │  SATURATION   │
│               │               │               │               │
│ How long      │ How much      │ What %        │ How full      │
│ does it       │ demand?       │ fails?        │ is it?        │
│ take?         │               │               │               │
│               │               │               │               │
│ p50, p95,     │ req/sec       │ % HTTP 5xx    │ CPU, RAM,     │
│ p99           │ users         │ % timeouts    │ connections   │
└───────────────┴───────────────┴───────────────┴───────────────┘
Prometheus + prom-client
import promClient from 'prom-client';
// Collect default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();
// Custom metric: request latency
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
// Custom metric: total requests
const httpRequestsTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
});
// Middleware to measure requests
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const labels = {
method: req.method,
route: req.route?.path || 'unknown',
status_code: res.statusCode,
};
httpRequestDuration.observe(labels, duration);
httpRequestsTotal.inc(labels);
});
next();
});
// Endpoint for Prometheus
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
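The /metrics endpoint serves Prometheus' plain-text exposition format. The numbers below are illustrative output, not real measurements:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/orders",status_code="200"} 1027
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/api/orders",status_code="200"} 980
http_request_duration_seconds_sum{method="GET",route="/api/orders",status_code="200"} 51.4
http_request_duration_seconds_count{method="GET",route="/api/orders",status_code="200"} 1027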
Metric types
| Type | Description | Example |
|---|---|---|
| Counter | Only increments | Total requests, errors |
| Gauge | Goes up and down | Active connections, temperature |
| Histogram | Distribution of values | Latency (p50, p95, p99) |
| Summary | Like a histogram, but quantiles are calculated in the client process | Pre-computed p95/p99 of request duration |
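The snippet above already uses a Counter and a Histogram; a Gauge is the natural fit for saturation-style values that move in both directions. A minimal sketch (the metric name http_active_connections is illustrative):
// Gauge: a value that can go up and down, e.g. requests currently in flight
const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of HTTP requests currently being handled',
});
app.use((req, res, next) => {
  activeConnections.inc();                          // request started
  res.on('finish', () => activeConnections.dec());  // request finished
  next();
});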
Traces: Following the path
The microservices problem
In a monolith, following a request is easy. In microservices:
User → API Gateway → Auth Service → Order Service → Payment Service → DB
            ?              ?               ?               ?
                    Where is the bottleneck? 🤷
OpenTelemetry: The standard
OpenTelemetry is the vendor-neutral, open source standard for instrumentation: one SDK and one wire format (OTLP) for traces, metrics, and logs, regardless of which backend you export to.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
serviceName: 'order-service',
traceExporter: new OTLPTraceExporter({
url: 'http://jaeger:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
}),
],
});
sdk.start();
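Auto-instrumentation covers the HTTP, Express, and PostgreSQL layers; for business logic you can open spans manually through the @opentelemetry/api package. A sketch that builds on the setup above (the span name, attribute, and chargeAndConfirm helper are illustrative):
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  // startActiveSpan makes this span the parent of anything created inside the callback
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      return await chargeAndConfirm(orderId); // hypothetical business helper
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}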
Anatomy of a trace
Trace ID: abc-123-xyz
└── Span: HTTP GET /api/order/123 (450ms)
    ├── Span: Auth Middleware (20ms)
    ├── Span: DB Query: SELECT * FROM orders (150ms)
    ├── Span: HTTP POST payment-service/charge (250ms)
    │   ├── Span: Validate card (30ms)
    │   └── Span: Process payment (220ms)
    └── Span: Send confirmation email (30ms)
Recommended stack 2026
Open Source (self-hosted)
# docker-compose.yml - Observability stack
services:
# Metrics
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
# Visualization
grafana:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=secret
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
# Logs
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
# Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4318:4318"    # OTLP HTTP

volumes:
  grafana-data:
Managed Services (SaaS)
| Service | Specialty | Free Tier |
|---|---|---|
| Grafana Cloud | Metrics + Logs + Traces | 10k series, 50GB logs |
| Datadog | All-in-one observability | 14-day trial |
| New Relic | Complete APM | 100GB/month free |
| Sentry | Errors and performance | 5k errors/month |
Effective alerting
Why "alert on everything" is counterproductive
Alert: CPU > 50%      → Ignored (normal during peaks)
Alert: Memory > 60%   → Ignored (always like this)
Alert: Slow request   → Ignored (seen 100 today)
Alert: Database down  → Ignored out of habit... OOPS
Result: Alert fatigue. The team ignores all alerts.
SLOs and Error Budgets
Instead of alerting on internal symptoms, alert when the user experience is actually affected:
SLO (Service Level Objective):
"99.9% of requests complete in less than 500ms"
Error Budget:
- 30 days × 24 hours × 60 min = 43,200 minutes
- 0.1% error budget = 43.2 minutes of allowed downtime per month
Alert when:
- More than 50% of the budget is consumed in 1 hour
- More than 80% of the budget is consumed in the month
(A burn-rate expression implementing this idea is sketched after the alert example below.)
Anatomy of a good alert
# prometheus/alerts.yml
groups:
- name: api-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.01
for: 5m # Only alert if it persists for 5 min
labels:
severity: critical
annotations:
summary: "Error rate > 1% for 5 minutes"
runbook: "https://wiki.your-company.com/runbooks/high-error-rate"
dashboard: "https://grafana.your-company.com/d/api-health"
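To alert on the error budget itself rather than a raw error rate, a burn-rate rule can be built on the latency histogram defined earlier. This sketch uses the Google SRE Workbook's fast-burn threshold of 14.4x (roughly 2% of a 30-day budget consumed in one hour); the factor is an assumption to adjust to the budget policy above, and the rule goes under the same rules: list:
- alert: LatencySLOFastBurn
  expr: |
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h])) /
        sum(rate(http_request_duration_seconds_count[1h]))
      )
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Burning the 30-day latency error budget ~14x faster than allowed"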
Observability checklist
- Structured logs with requestId for correlation
- 4 Golden Signals metrics exposed
- Traces configured between services
- Dashboards with key metrics
- Alerts based on SLOs, not symptoms
- Runbooks documented for each alert
- Data retention defined (30 days logs, 15 days traces)
Practice
-> Set up Prometheus + Grafana
-> Professional logging with Pino + Loki