📡 Observability · ✍️ Khoa · 📅 19/04/2026 · ☕ 2 min read

👀 Observability — Production-Grade Deep Dive

Table of Contents

  • Metrics, Logs và Traces — Deep Dive

    • Storage architecture & trade-offs (Prometheus/Thanos/VictoriaMetrics)
    • Cardinality explosion & mitigation
    • Log cost optimization (sampling, tiering, schema-on-read)
    • Tail-based sampling implementation
    • Context propagation patterns
    • Correlation & exemplars
  • OpenTelemetry — Production Architecture

    • Collector deployment patterns (agent vs gateway)
    • Pipeline configuration (receivers → processors → exporters)
    • W3C Trace Context deep dive
    • Context propagation in async flows (Kafka, background jobs)
    • Semantic conventions & custom instrumentation
    • Performance overhead analysis & benchmarks
    • Collector scaling & HA
  • SLI, SLO, SLA & Error Budget — SRE Framework

    • SLI selection framework (availability, latency, durability)
    • Error budget calculation & burn rate
    • Multi-window, multi-burn-rate alerting (Google SRE)
    • Error budget policy enforcement & auto-freeze
    • Recording rules & dashboard implementation
    • Composite SLOs & user journey tracking
    • Rolling vs calendar windows
  • Production Debugging Methodology

    • Structured incident investigation workflow
    • Distributed trace reading, timeline reconstruction
    • Common cross-service failure patterns (retry storms, connection pool exhaustion)
  • Incident Management & Postmortem Culture

    • Severity classification, incident response roles
    • Blameless postmortem framework
    • Building runbook culture

Target Audience

Level: Senior+ engineers, SRE, platform teams
Prerequisite: Experience with monitoring tools and production debugging
Scope: Architecture decisions, cost optimization, implementation patterns


Coverage Map

Foundation ────────────────► Advanced
│                              │
├─ Metrics/Logs/Traces         ├─ Cardinality control
├─ Prometheus basics           ├─ Tail sampling strategies
├─ Structured logging          ├─ Context propagation
│                              ├─ Error budget automation
│                              └─ Cost optimization ($1K→$200/mo)

Real-world Scenarios Covered

✅ Reducing observability cost by 80% (sampling, tiering, pruning)
✅ Implementing tail-based sampling at 10K QPS
✅ Debugging distributed traces with missing spans
✅ Setting up multi-window burn-rate alerts
✅ Enforcing error budget policy via CI/CD
✅ Scaling OTel Collector for high availability
✅ Correlating metrics → logs → traces in production incidents
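
To make the multi-window burn-rate scenario above concrete, here is a minimal Python sketch of the paging decision. It assumes a 30-day SLO and uses the window pairs and thresholds commonly cited from the Google SRE Workbook (1h/5m at 14.4x, 6h/30m at 6x). The names `BurnRateWindow`, `PAGE_WINDOWS`, `burn_rate`, and `should_page` are hypothetical and chosen for illustration only. In a real setup this logic would more likely live in Prometheus recording and alerting rules than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    """One long/short window pair and the burn rate at which it should page."""
    long_window: str
    short_window: str
    threshold: float


# Assumed values for a 30-day SLO: 14.4x burns 2% of the budget in 1 hour,
# 6x burns 5% of the budget in 6 hours.
PAGE_WINDOWS = (
    BurnRateWindow("1h", "5m", 14.4),
    BurnRateWindow("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratios: dict, slo_target: float) -> bool:
    """Page only when BOTH windows of at least one pair exceed the threshold.

    error_ratios maps a window name ("1h", "5m", ...) to the observed
    error ratio (failed requests / total requests) over that window.
    """
    for window in PAGE_WINDOWS:
        long_burn = burn_rate(error_ratios[window.long_window], slo_target)
        short_burn = burn_rate(error_ratios[window.short_window], slo_target)
        if long_burn >= window.threshold and short_burn >= window.threshold:
            return True
    return False
```

Requiring the short window to agree with the long one is what lets the alert reset quickly once the incident is over, instead of firing for the remainder of the long window.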


Maturity Progression

Level 1: Basic Monitoring (0-3 months)

  • Prometheus/Grafana setup
  • Golden signals dashboards
  • Simple threshold alerts
  • Unstructured logs

Level 2: Structured Observability (3-6 months)

  • Structured logging + indexing
  • Distributed tracing (head sampling)
  • SLO dashboards
  • On-call runbooks

Level 3: Advanced SRE (6-12 months)

  • Tail-based sampling
  • Multi-window burn-rate alerts
  • Error budget enforcement
  • Auto-correlation (trace_id in logs)
  • Cost optimization

This section targets Level 2→3 progression.
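
As an illustration of the Level 3 item "Auto-correlation (trace_id in logs)", the sketch below stamps every structured log line with the active trace ID using only the Python standard library. It assumes the trace ID is held in a `contextvars.ContextVar`. With OpenTelemetry you would typically read it from the current span instead. `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, `build_logger`, and the logger name `checkout` are hypothetical names for this example.

```python
import contextvars
import io
import json
import logging

# Assumption: request-handling code sets this when a trace starts.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "-"),
        })


def build_logger(stream) -> logging.Logger:
    """Build a logger that writes trace-aware JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    # Example trace ID in the 32-hex-character W3C Trace Context format.
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("payment failed")
    print(buffer.getvalue().strip())
```

With the trace ID present in every log line, a log query filtered on `trace_id` returns exactly the entries belonging to one trace, which is the pivot from a trace view to its logs.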