The black box problem
Imagine driving a car without a dashboard: you don't know how fast you're going, how much fuel is left, or whether the engine is about to overheat. That's what operating an application in production without observability is like.
Without observability, you only learn about problems when your users complain. With observability, you detect them before users notice.
The three pillars of observability
Observability is built on three complementary pillars:
┌──────────────────┬──────────────────┬──────────────────┐
│       LOGS       │     METRICS      │      TRACES      │
│                  │                  │                  │
│ What happened?   │ How much?        │ Where?           │
│                  │                  │                  │
│ Discrete         │ Numbers you      │ Path of a        │
│ events with      │ can graph        │ request across   │
│ context          │ and alert on     │ services         │
└──────────────────┴──────────────────┴──────────────────┘
| Pillar | Question it answers | Example |
|---|---|---|
| Logs | What exactly happened? | "User X failed login due to wrong password" |
| Metrics | How is the system doing? | "95% of requests < 200ms, 2% errors" |
| Traces | Where did the request go? | "API → Auth → DB → Cache → Response (340ms)" |
Logs: The event record
Why console.log doesn't scale
// In development: works fine
console.log('User logged in:', userId);
// In production with 1000 requests/sec:
// - How do you search for a specific log?
// - How do you correlate logs from the same request?
// - How do you filter only errors?
// - How do you keep logs when the server dies?
Structured logging with Pino
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// Create logger with request context
function createRequestLogger(req) {
return logger.child({
requestId: req.id,
userId: req.user?.id,
path: req.path,
method: req.method,
});
}
// Usage in endpoint
app.get('/api/orders', async (req, res) => {
const log = createRequestLogger(req);
log.info('Fetching orders');
try {
const orders = await getOrders(req.user.id);
log.info({ count: orders.length }, 'Orders fetched successfully');
res.json(orders);
} catch (error) {
log.error({ error: error.message, stack: error.stack }, 'Failed to fetch orders');
res.status(500).json({ error: 'Internal error' });
}
});
Structured JSON output:
{
"level": "info",
"time": "2026-01-15T10:30:00.000Z",
"requestId": "abc-123",
"userId": "user-456",
"path": "/api/orders",
"method": "GET",
"msg": "Orders fetched successfully",
"count": 25
}
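To make correlation automatic, you can attach the child logger to each request in a middleware instead of calling createRequestLogger inside every handler. A minimal sketch (the req.log property and the UUID-based request id are assumptions, not part of the code above):
import { randomUUID } from 'node:crypto';
// Attach a per-request child logger so every handler shares the same requestId
app.use((req, res, next) => {
  req.id = req.headers['x-request-id'] || randomUUID(); // reuse an upstream id if present
  req.log = createRequestLogger(req);                   // child logger with request context
  next();
});
// Handlers can then log with full context:
// req.log.info('Fetching orders');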
Log levels
| Level | When to use | Example |
|---|---|---|
| error | Something failed and needs attention | DB connection lost |
| warn | Something odd but not critical | Rate limit almost reached |
| info | Important business events | User created order |
| debug | Details for troubleshooting | Query executed in 50ms |
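The configured level acts as a threshold: anything below it is dropped before serialization. A quick sketch using the logger defined above, assuming LOG_LEVEL=info (the values are illustrative):
logger.debug({ durationMs: 50 }, 'Query executed');                  // suppressed: debug < info
logger.info({ orderId: 'order-789' }, 'User created order');         // emitted
logger.warn({ remaining: 5 }, 'Rate limit almost reached');          // emitted
logger.error({ error: 'connection refused' }, 'DB connection lost'); // emitted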
Metrics: Numbers that matter
The 4 Golden Signals (Google SRE)
Google SRE defines 4 key signals for monitoring any service:
┌───────────────┬───────────────┬───────────────┬───────────────┐
│    LATENCY    │    TRAFFIC    │    ERRORS     │  SATURATION   │
│               │               │               │               │
│ How long      │ How much      │ What %        │ How full      │
│ does it       │ demand?       │ fails?        │ is it?        │
│ take?         │               │               │               │
│               │               │               │               │
│ p50, p95,     │ req/sec       │ % HTTP 5xx    │ CPU, RAM,     │
│ p99           │ users         │ % timeouts    │ connections   │
└───────────────┴───────────────┴───────────────┴───────────────┘
Prometheus + prom-client
import promClient from 'prom-client';
// Collect default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();
// Custom metric: request latency
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
// Custom metric: total requests
const httpRequestsTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
});
// Middleware to measure requests
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const labels = {
method: req.method,
route: req.route?.path || 'unknown',
status_code: res.statusCode,
};
httpRequestDuration.observe(labels, duration);
httpRequestsTotal.inc(labels);
});
next();
});
// Endpoint for Prometheus
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
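The /metrics endpoint serves Prometheus' plain-text exposition format. The numbers below are illustrative output, not real measurements:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/orders",status_code="200"} 1027
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/api/orders",status_code="200"} 980
http_request_duration_seconds_sum{method="GET",route="/api/orders",status_code="200"} 51.4
http_request_duration_seconds_count{method="GET",route="/api/orders",status_code="200"} 1027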
Metric types
| Type | Description | Example |
|---|---|---|
| Counter | Only increments | Total requests, errors |
| Gauge | Goes up and down | Active connections, temperature |
| Histogram | Distribution of values | Latency (p50, p95, p99) |
| Summary | Like a histogram, but quantiles are calculated in the client process | Pre-computed p95/p99 of request duration |
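The snippet above already uses a Counter and a Histogram; a Gauge is the natural fit for saturation-style values that move in both directions. A minimal sketch (the metric name http_active_connections is illustrative):
// Gauge: a value that can go up and down, e.g. requests currently in flight
const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of HTTP requests currently being handled',
});
app.use((req, res, next) => {
  activeConnections.inc();                          // request started
  res.on('finish', () => activeConnections.dec());  // request finished
  next();
});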
Traces: Following the path
The microservices problem
In a monolith, following a request is easy. In microservices:
User → API Gateway → Auth Service → Order Service → Payment Service → DB
            ?              ?               ?               ?
                    Where is the bottleneck? 🤷
OpenTelemetry: The standard
OpenTelemetry is the vendor-neutral, open source standard for instrumentation: one SDK and one wire format (OTLP) for traces, metrics, and logs, regardless of which backend you export to.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
serviceName: 'order-service',
traceExporter: new OTLPTraceExporter({
url: 'http://jaeger:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
}),
],
});
sdk.start();
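Auto-instrumentation covers the HTTP, Express, and PostgreSQL layers; for business logic you can open spans manually through the @opentelemetry/api package. A sketch that builds on the setup above (the span name, attribute, and chargeAndConfirm helper are illustrative):
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  // startActiveSpan makes this span the parent of anything created inside the callback
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      return await chargeAndConfirm(orderId); // hypothetical business helper
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}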
Anatomy of a trace
Trace ID: abc-123-xyz
└── Span: HTTP GET /api/order/123 (450ms)
    ├── Span: Auth Middleware (20ms)
    ├── Span: DB Query: SELECT * FROM orders (150ms)
    ├── Span: HTTP POST payment-service/charge (250ms)
    │   ├── Span: Validate card (30ms)
    │   └── Span: Process payment (220ms)
    └── Span: Send confirmation email (30ms)
Recommended stack 2026
Open Source (self-hosted)
# docker-compose.yml - Observability stack
services:
# Metrics
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
# Visualization
grafana:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=secret
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
# Logs
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
# Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4318:4318"    # OTLP HTTP

volumes:
  grafana-data:
Managed Services (SaaS)
| Service | Specialty | Free Tier |
|---|---|---|
| Grafana Cloud | Metrics + Logs + Traces | 10k series, 50GB logs |
| Datadog | All-in-one observability | 14-day trial |
| New Relic | Complete APM | 100GB/month free |
| Sentry | Errors and performance | 5k errors/month |
Effective alerting
Why "alert on everything" is counterproductive
Alert: CPU > 50%      → Ignored (normal during peaks)
Alert: Memory > 60%   → Ignored (always like this)
Alert: Slow request   → Ignored (seen 100 today)
Alert: Database down  → Ignored out of habit... OOPS
Result: Alert fatigue. The team ignores all alerts.
SLOs and Error Budgets
Instead of alerting on internal symptoms, alert when the user experience is actually affected:
SLO (Service Level Objective):
"99.9% of requests complete in less than 500ms"
Error Budget:
- 30 days × 24 hours × 60 min = 43,200 minutes
- 0.1% error budget = 43.2 minutes of allowed downtime per month
Alert when:
- More than 50% of the budget is consumed in 1 hour
- More than 80% of the budget is consumed in the month
(A burn-rate expression implementing this idea is sketched after the alert example below.)
Anatomy of a good alert
# prometheus/alerts.yml
groups:
- name: api-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.01
for: 5m # Only alert if it persists for 5 min
labels:
severity: critical
annotations:
summary: "Error rate > 1% for 5 minutes"
runbook: "https://wiki.your-company.com/runbooks/high-error-rate"
dashboard: "https://grafana.your-company.com/d/api-health"
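To alert on the error budget itself rather than a raw error rate, a burn-rate rule can be built on the latency histogram defined earlier. This sketch uses the Google SRE Workbook's fast-burn threshold of 14.4x (roughly 2% of a 30-day budget consumed in one hour); the factor is an assumption to adjust to the budget policy above, and the rule goes under the same rules: list:
- alert: LatencySLOFastBurn
  expr: |
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h])) /
        sum(rate(http_request_duration_seconds_count[1h]))
      )
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Burning the 30-day latency error budget ~14x faster than allowed"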
Observability checklist
- Structured logs with requestId for correlation
- 4 Golden Signals metrics exposed
- Traces configured between services
- Dashboards with key metrics
- Alerts based on SLOs, not symptoms
- Runbooks documented for each alert
- Data retention defined (30 days logs, 15 days traces)
Practice
-> Set up Prometheus + Grafana
-> Professional logging with Pino + Loki