📊 Observability

The black box problem

Imagine driving a car without a dashboard. You don't know how fast you're going, how much fuel you have left, or whether the engine is about to overheat. That's what operating an application in production without observability is like.

Without observability, you only learn about problems when your users complain. With observability, you detect them before users notice.


The three pillars of observability

Observability is built on three complementary pillars:

┌─────────────────┬─────────────────┬─────────────────┐
│      LOGS       │     METRICS     │     TRACES      │
│                 │                 │                 │
│  What happened? │  How much?      │  Where?         │
│                 │                 │                 │
│  Discrete       │  Numbers you    │  Path of a      │
│  events with    │  can graph      │  request across │
│  context        │  and alert on   │  services       │
└─────────────────┴─────────────────┴─────────────────┘

Pillar    Question it answers        Example
Logs      What exactly happened?     "User X failed login due to wrong password"
Metrics   How is the system doing?   "95% of requests < 200ms, 2% errors"
Traces    Where did the request go?  "API → Auth → DB → Cache → Response (340ms)"

Logs: The event record

Why console.log doesn't scale

// In development: works fine
console.log('User logged in:', userId);

// In production with 1000 requests/sec:
// - How do you search for a specific log?
// - How do you correlate logs from the same request?
// - How do you filter only errors?
// - How do you keep logs when the server dies?

Structured logging with Pino

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Create logger with request context
function createRequestLogger(req) {
  return logger.child({
    requestId: req.id,
    userId: req.user?.id,
    path: req.path,
    method: req.method,
  });
}

// Usage in endpoint
app.get('/api/orders', async (req, res) => {
  const log = createRequestLogger(req);

  log.info('Fetching orders');

  try {
    const orders = await getOrders(req.user.id);
    log.info({ count: orders.length }, 'Orders fetched successfully');
    res.json(orders);
  } catch (error) {
    log.error({ error: error.message, stack: error.stack }, 'Failed to fetch orders');
    res.status(500).json({ error: 'Internal error' });
  }
});

Structured JSON output:

{
  "level": "info",
  "time": "2026-01-15T10:30:00.000Z",
  "requestId": "abc-123",
  "userId": "user-456",
  "path": "/api/orders",
  "method": "GET",
  "msg": "Orders fetched successfully",
  "count": 25
}
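
In practice you usually don't hand-roll createRequestLogger: the pino-http middleware attaches a per-request child logger with a generated request id to every request. A minimal sketch, assuming the same Express app and the logger defined above:

import pinoHttp from 'pino-http';

// Every request gets req.log: a child logger that already carries
// a generated request id plus the request method and URL
app.use(pinoHttp({ logger }));

app.get('/api/orders', async (req, res) => {
  req.log.info('Fetching orders'); // automatically correlated via req.id
  // ...
});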

Log levels

Level    When to use                            Example
error    Something failed and needs attention   DB connection lost
warn     Something odd but not critical         Rate limit almost reached
info     Important business events              User created order
debug    Details for troubleshooting            Query executed in 50ms
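
As a quick illustration, this is how those levels map onto the logger defined in the Pino example above (the payloads are invented for illustration):

logger.error({ error: 'connection refused' }, 'Database connection lost'); // needs attention now
logger.warn({ remaining: 3 }, 'Rate limit almost reached');                // odd, not critical
logger.info({ orderId: 'ord-789' }, 'Order created');                      // business event
logger.debug({ durationMs: 50 }, 'Query executed');                        // troubleshooting detail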

Metrics: Numbers that matter

The 4 Golden Signals (Google SRE)

Google SRE defines 4 key signals for monitoring any service:

┌──────────────┬──────────────┬──────────────┬──────────────┐
│   LATENCY    │   TRAFFIC    │   ERRORS     │  SATURATION  │
│              │              │              │              │
│  How long    │  How much    │  What %      │  How full    │
│  does it     │  demand?     │  fails?      │  is it?      │
│  take?       │              │              │              │
│              │              │              │              │
│  p50, p95,   │  req/sec     │  % HTTP 5xx  │  CPU, RAM,   │
│  p99         │  users       │  % timeouts  │  connections │
└──────────────┴──────────────┴──────────────┴──────────────┘

Prometheus + prom-client

import promClient from 'prom-client';

// Collect default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

// Custom metric: request latency
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

// Custom metric: total requests
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

// Middleware to measure requests
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode,
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestsTotal.inc(labels);
  });

  next();
});

// Endpoint for Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

Metric types

Type        Description              Example
Counter     Only increments          Total requests, errors
Gauge       Goes up and down         Active connections, temperature
Histogram   Distribution of values   Latency (p50, p95, p99)
Summary     Similar to histogram     Client-side calculated percentiles
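
The snippet above only used a Counter and a Histogram. A Gauge fits values that move in both directions, such as open connections; here is a small sketch reusing the promClient import from earlier (the metric name and the server returned by app.listen() are assumptions):

// Gauge: a value that can go up and down
const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of currently open TCP connections',
});

const server = app.listen(3000);

// Increment on each new connection, decrement when it closes
server.on('connection', (socket) => {
  activeConnections.inc();
  socket.on('close', () => activeConnections.dec());
});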

Traces: Following the path

The microservices problem

In a monolith, following a request is easy. In microservices:

User → API Gateway → Auth Service → Order Service → Payment Service → DB
              ↓               ↓              ↓              ↓
           Where is the bottleneck? 🤷

OpenTelemetry: The standard

OpenTelemetry is the open source standard for instrumentation.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
    }),
  ],
});

sdk.start();

Anatomy of a trace

Trace ID: abc-123-xyz
├── Span: HTTP GET /api/order/123 (450ms)
│   ├── Span: Auth Middleware (20ms)
│   ├── Span: DB Query: SELECT * FROM orders (150ms)
│   ├── Span: HTTP POST payment-service/charge (250ms)
│   │   ├── Span: Validate card (30ms)
│   │   └── Span: Process payment (220ms)
│   └── Span: Send confirmation email (30ms)

Recommended stack 2026

Open Source (self-hosted)

# docker-compose.yml - Observability stack
services:
  # Metrics
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  # Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secret
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  # Logs
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4318:4318"    # OTLP HTTP

Managed Services (SaaS)

Service          Specialty                   Free Tier
Grafana Cloud    Metrics + Logs + Traces     10k series, 50GB logs
Datadog          All-in-one observability    14-day trial
New Relic        Complete APM                100GB/month free
Sentry           Errors and performance      5k errors/month

Effective alerting

Why "alert on everything" is counterproductive

Alert: CPU > 50%          → Ignored (normal during peaks)
Alert: Memory > 60%       → Ignored (always like this)
Alert: Slow request       → Ignored (seen 100 today)
Alert: Database down      → Ignored out of habit... OOPS

Result: Alert fatigue. The team ignores all alerts.

SLOs and Error Budgets

Instead of alerting on raw resource metrics, alert when users are actually affected:

SLO (Service Level Objective):
"99.9% of requests in less than 500ms"

Error Budget:
- 30 days × 24 hours × 60 min = 43,200 minutes
- 0.1% error budget = 43.2 minutes of allowed downtime

Alert when:
- Consumed > 50% of budget in 1 hour
- Consumed > 80% of budget in the month
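
The same arithmetic as a tiny sketch, if you want to wire it into a script or a dashboard annotation (the request counts below are placeholders):

// SLO: 99.9% of requests succeed over a rolling 30-day window
const slo = 0.999;
const windowMinutes = 30 * 24 * 60;                    // 43,200 minutes
const errorBudgetMinutes = windowMinutes * (1 - slo);  // ~43.2 minutes

// Fraction of the error budget burned in the window (1.0 = budget exhausted)
function budgetConsumed(failedRequests, totalRequests) {
  const errorRate = failedRequests / totalRequests;
  return errorRate / (1 - slo);
}

console.log(errorBudgetMinutes.toFixed(1));              // "43.2"
console.log(budgetConsumed(500, 1_000_000).toFixed(2));  // "0.50" -> half the budget is gone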

Anatomy of a good alert

# prometheus/alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m  # Only alert if it persists for 5 min
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% for 5 minutes"
          runbook: "https://wiki.your-company.com/runbooks/high-error-rate"
          dashboard: "https://grafana.your-company.com/d/api-health"

Observability checklist

  • Structured logs with requestId for correlation
  • 4 Golden Signals metrics exposed
  • Traces configured between services
  • Dashboards with key metrics
  • Alerts based on SLOs, not raw resource thresholds
  • Runbooks documented for each alert
  • Data retention defined (30 days logs, 15 days traces)

Practice

→ Set up Prometheus + Grafana
→ Professional logging with Pino + Loki


Resources