👀 Observability — Production-Grade Deep Dive
Table of Contents
Metrics, Logs & Traces — Deep Dive
- Storage architecture & trade-offs (Prometheus/Thanos/VictoriaMetrics)
- Cardinality explosion & mitigation
- Log cost optimization (sampling, tiering, schema-on-read)
- Tail-based sampling implementation
- Context propagation patterns
- Correlation & exemplars
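Correlation ultimately depends on one convention: every log line carries the `trace_id` of the active span, so a dashboard can pivot from a metric spike to the exact traces and logs behind it. A minimal sketch using only the standard library (the `JsonFormatter` class and `checkout` logger name are illustrative, not from any framework):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line.

    trace_id/span_id are optional correlation fields, passed by the
    caller via `extra=` (in real services they come from the active span).
    """
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")  # hypothetical service logger
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits: {"level": "INFO", "msg": "payment authorized", "trace_id": "...", ...}
log.info("payment authorized",
         extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                "span_id": "00f067aa0ba902b7"})
```

Because every line is one JSON object with a stable `trace_id` key, the log backend can index it and link directly to the trace view — the same mechanism exemplars provide on the metrics side.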
OpenTelemetry — Production Architecture
- Collector deployment patterns (agent vs gateway)
- Pipeline configuration (receivers → processors → exporters)
- W3C Trace Context deep dive
- Context propagation in async flows (Kafka, background jobs)
- Semantic conventions & custom instrumentation
- Performance overhead analysis & benchmarks
- Collector scaling & HA
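The W3C Trace Context spec defines the `traceparent` header that all of the propagation patterns above rely on: `version-traceid-parentid-flags`, dash-separated lowercase hex, with all-zero ids explicitly invalid. In async flows the same string is carried in Kafka message headers instead of HTTP headers. A sketch of a parser (`parse_traceparent` and `TraceContext` are illustrative names, not an OTel API):

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool  # lowest bit of the flags field

def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a W3C `traceparent` header; return None if malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version 0xff and all-zero ids are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A downstream consumer that gets `None` here should start a new root span rather than drop telemetry — a common source of the "missing spans" problem covered below.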
Production Debugging Methodology
- Structured incident investigation workflow
- Distributed trace reading, timeline reconstruction
- Common cross-service failure patterns (retry storms, connection pool exhaustion)
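Retry storms in particular have a standard mitigation worth keeping in mind while debugging: capped exponential backoff with full jitter, so failed clients desynchronize instead of hammering the dependency in lockstep. A sketch (the `backoff_delays` helper is illustrative; "full jitter" here means each delay is drawn uniformly from zero up to the exponential cap):

```python
import random

def backoff_delays(base: float = 0.1, cap: float = 10.0, attempts: int = 5,
                   rng=random.random):
    """Full-jitter exponential backoff schedule.

    delay_n = uniform(0, min(cap, base * 2**n)).  The jitter spreads retries
    out in time; the cap bounds the worst-case wait.  `rng` is injectable
    so the schedule is deterministic in tests.
    """
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]
```

With `rng` pinned to 1.0 the schedule is the pure exponential envelope (0.1, 0.2, 0.4, ... capped at 10.0); in production the uniform draw keeps thousands of clients from retrying at the same instant.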
Incident Management & Postmortem Culture
- Severity classification, incident response roles
- Blameless postmortem framework
- Building runbook culture
SLOs & Error Budgets
- SLI selection framework (availability, latency, durability)
- Error budget calculation & burn rate
- Multi-window, multi-burn-rate alerting (Google SRE)
- Error budget policy enforcement & auto-freeze
- Recording rules & dashboard implementation
- Composite SLOs & user journey tracking
- Rolling vs calendar windows
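The core arithmetic behind burn-rate alerting is small: burn rate is the observed error ratio divided by the budget ratio (1 − SLO), and the Google SRE Workbook's page-worthy threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720h). A sketch with illustrative function names (`burn_rate`, `should_page` are not from any library):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate = observed error ratio / error budget ratio (1 - slo).

    1.0 means the budget is consumed exactly at the rate the SLO allows.
    """
    return error_ratio / (1.0 - slo)

def should_page(err_1h: float, err_5m: float, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Multi-window, multi-burn-rate rule (per the Google SRE Workbook).

    Page only when BOTH the long window (1h) and the short window (5m)
    burn above the threshold: the long window filters out blips, the
    short window makes the alert reset quickly once the incident ends.
    """
    return (burn_rate(err_1h, slo) >= threshold
            and burn_rate(err_5m, slo) >= threshold)
```

In Prometheus the same logic becomes recording rules over two range windows joined with `and`; the Python form above makes the thresholds easy to sanity-check before encoding them in alert rules.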
Target Audience
Level: Senior+ engineers, SRE, platform teams
Prerequisite: Experience with monitoring tools, production debugging
Scope: Architecture decisions, cost optimization, implementation patterns
Coverage Map
Foundation ────────────────► Advanced
│                            │
├─ Metrics/Logs/Traces       ├─ Cardinality control
├─ Prometheus basics         ├─ Tail sampling strategies
├─ Structured logging        ├─ Context propagation
│                            ├─ Error budget automation
│                            └─ Cost optimization ($1K→$200/mo)
Real-world Scenarios Covered
✅ Reducing observability cost by 80% (sampling, tiering, pruning)
✅ Implementing tail-based sampling at 10K QPS
✅ Debugging distributed traces with missing spans
✅ Setting up multi-window burn-rate alerts
✅ Enforcing error budget policy via CI/CD
✅ Scaling OTel Collector for high availability
✅ Correlating metrics → logs → traces in production incidents
Maturity Progression
Level 1: Basic Monitoring (0-3 months)
- Prometheus/Grafana setup
- Golden signals dashboards
- Simple threshold alerts
- Unstructured logs
Level 2: Structured Observability (3-6 months)
- Structured logging + indexing
- Distributed tracing (head sampling)
- SLO dashboards
- On-call runbooks
Level 3: Advanced SRE (6-12 months)
- Tail-based sampling
- Multi-window burn-rate alerts
- Error budget enforcement
- Auto-correlation (trace_id in logs)
- Cost optimization
This section targets Level 2→3 progression.