Observability: Metrics, Logs và Traces — Deep Dive
"Observability is not about collecting data. It's about asking questions you didn't know you needed to ask." — Charity Majors
The goal is not "pretty dashboards" or "exhaustive logs"; it is the ability to debug unknown-unknowns in production when the system runs at scale.
1. Ba Pillar — Architecture & Trade-offs
1.1 Metrics: Time-Series Data at Scale
Core concept: metrics are numerical data aggregated over time windows.
http_requests_total{method="POST", route="/orders", status="200"} = 1523 @ t1
http_requests_total{method="POST", route="/orders", status="200"} = 1847 @ t2
Storage Backend Trade-offs
| Backend | Storage Model | Query Perf | Cardinality Limit | Cost |
|---|---|---|---|---|
| Prometheus | Local TSDB | Fast (in-memory blocks) | ~10M series | Low (self-hosted) |
| Thanos/Cortex | Object storage (S3) | Medium (remote read) | Unlimited | Medium |
| VictoriaMetrics | Optimized TSDB | Very fast | ~100M series | Low-Medium |
| Datadog/New Relic | Managed SaaS | Fast | High | High ($$) |
Architecture pattern — Federation:
┌──────────────────────────────────────────────────────┐
│ Application Pods (1000s) │
│ ├─ Expose /metrics endpoint │
└────────────────┬─────────────────────────────────────┘
│ scrape every 15s
▼
┌──────────────────────────────────────────────────────┐
│ Prometheus (per-cluster) │
│ ├─ Local storage: 15 days retention │
│ ├─ Aggregation: recording rules │
└────────────────┬─────────────────────────────────────┘
│ remote_write
▼
┌──────────────────────────────────────────────────────┐
│ Long-term Storage (Thanos/Cortex) │
│ ├─ Downsampling: 5m → 1h resolution after 7 days │
│ ├─ Retention: 13 months │
└──────────────────────────────────────────────────────┘
Cardinality Explosion — The Silent Killer
Problem: every unique label combination creates a separate time series. High cardinality → memory exhaustion.
// 🔥 DANGEROUS: Unbounded cardinality
http_requests_total{user_id="12345", ip="1.2.3.4", session_id="abc..."}
// Worst case: 1M users × 1M IPs × 1M sessions = 10^18 potential series → OOM kill
// ✅ SAFE: Bounded cardinality
http_requests_total{method="POST", route="/orders", status="200"}
// 10 methods × 50 routes × 10 status codes = 5000 series
Rule: label values must come from a finite set. Never use user_id, request_id, or email as labels.
Solution for high-cardinality data: push it into logs or exemplars (Prometheus 2.26+).
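One practical way to enforce the bounded-label rule is to normalize raw request paths into route templates before using them as label values. A minimal sketch, assuming numeric path segments are the only unbounded part (real routers expose the matched template directly, which is preferable):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numericSegment matches path segments that are purely numeric IDs.
var numericSegment = regexp.MustCompile(`^\d+$`)

// normalizeRoute collapses unbounded path segments (numeric IDs here)
// into a fixed placeholder, so the "route" label stays a finite set.
func normalizeRoute(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if numericSegment.MatchString(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(normalizeRoute("/orders/12345"))      // /orders/:id
	fmt.Println(normalizeRoute("/users/42/orders/7")) // /users/:id/orders/:id
}
```

With this, `route="/orders/:id"` stays a single series no matter how many orders exist.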
Recording Rules — Pre-aggregation at Scale
With 1000 pods, the query rate(http_requests_total[5m]) must aggregate across 1000 series at query time → slow.
Recording rule: pre-compute the aggregation every 15s.
# rules.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: api:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (route, status)
Dashboards now query api:http_requests:rate5m → near-instant, with no runtime aggregation.
Trade-off: extra storage for the pre-aggregated series, but queries run ~100x faster.
1.2 Logs: Structured Events at Petabyte Scale
Storage Architecture
Application
↓ stdout/stderr
Log Shipper (Fluent Bit, Vector)
↓ buffer + transform
Message Queue (Kafka — optional for high volume)
↓
Log Processor (Logstash, Vector aggregator)
↓ parse + enrich
Storage
├─ Hot tier (Elasticsearch): 7 days, full-text search
├─ Warm tier (S3 + Athena): 30 days, SQL queries
└─ Cold tier (Glacier): 1 year, compliance
Cost Optimization — The 80/20 Rule
Reality: 80% of logs are never read. The remaining 20% are critical for debugging.
Strategy:
- Sampling at source (application-level):
// Sample verbose logs
if rand.Float64() < 0.01 { // 1% sample rate
log.Debug("processing item", "item_id", id)
}
// Always log errors
log.Error("payment failed", "error", err)
- Dynamic log levels:
// Default: INFO
// On incident: Flip to DEBUG for specific service via config reload
logger := log.WithLevel(dynamicLevel())
- Schema on read (vs schema on write):
- Schema on write: parse logs at ingest (Logstash filters) → CPU cost at write time
- Schema on read: store raw, parse at query time (Athena, BigQuery) → CPU cost at read time
Trade-off: schema on read = cheap ingest, expensive queries. Good for logs that are rarely queried.
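The per-call rand.Float64() sampler in the first strategy makes an independent decision for every log line, so a single request's verbose logs may be half-kept, half-dropped. A common refinement is to hash a stable key such as the trace ID, so all lines of one request share the same fate. A minimal sketch; the FNV hash and 1% rate are illustrative choices:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample returns the same answer for every call with the same
// traceID, so a request's verbose logs are kept or dropped as a unit.
// rate is in [0,1]; e.g. 0.01 keeps roughly 1% of traces.
func shouldSample(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	// Map the hash onto [0,1) and compare against the target rate.
	return float64(h.Sum32())/float64(1<<32) < rate
}

func main() {
	// Same trace ID → same decision, every time, on every service.
	fmt.Println(shouldSample("abc123", 0.01) == shouldSample("abc123", 0.01)) // true
	fmt.Println(shouldSample("abc123", 1.0))                                  // true: rate 1.0 keeps everything
}
```

Because the decision depends only on the trace ID, every service in the call chain can apply it independently and still agree.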
Structured Logging — JSON or Not?
// Option 1: JSON (Elasticsearch-friendly)
log.Info("order created",
"order_id", order.ID,
"user_id", user.ID,
"total", order.Total,
)
// {"level":"info","order_id":"ord_123","user_id":456,"total":99.99}
// Option 2: Logfmt (human-readable, Loki-friendly)
// level=info order_id=ord_123 user_id=456 total=99.99
// Option 3: Hybrid (JSON for indexing, text for context)
// {"level":"info","order_id":"ord_123"} msg="Created order for premium user"
Production pattern: JSON + selective field extraction.
# Vector config: Extract only high-cardinality fields for indexing
[transforms.parse_logs]
type = "remap"
source = '''
parsed = parse_json!(.message)
.order_id = parsed.order_id # Index this
.level = parsed.level # Index this
# Don't index full message → save storage
'''
Query Performance — Inverted Index Limits
Elasticsearch bottleneck: full-text search across 10TB of logs is slow.
Solution: pre-filter with a time range and indexed fields:
// ❌ SLOW: Full-text search
GET /logs/_search
{
"query": {
"match": { "message": "payment timeout" }
}
}
// ✅ FAST: Time range + exact match
GET /logs/_search
{
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-1h" } } },
{ "term": { "service": "payment-api" } },
{ "term": { "level": "error" } }
],
"must": [
{ "match": { "message": "timeout" } }
]
}
}
}
Rule: always include a time range. Without one, the query scans the entire index → OOM.
1.3 Traces: Distributed Debugging at 10K QPS
Trace = Directed Acyclic Graph (DAG)
TraceID: abc123
├─ Span: API Gateway [0-250ms]
│ ├─ Span: Auth Service [10-50ms]
│ └─ Span: Order Service [60-240ms]
│ ├─ Span: Inventory Check [70-100ms]
│ ├─ Span: Payment Charge [110-200ms]
│ │ └─ Span: Stripe API [120-190ms]
│ └─ Span: DB Insert [210-230ms]
Each span carries:
- span_id, parent_span_id, trace_id
- start_time, duration
- attributes (key-value metadata)
- events (logs scoped to the span)
- links (references to related spans)
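The DAG above can be modeled with a handful of fields per span; parent links are enough to reconstruct the tree and answer simple questions like the root's total duration. A minimal sketch, with Start/End as millisecond offsets from trace start (span IDs are made up for illustration):

```go
package main

import "fmt"

// Span carries the core fields listed above; Start/End are ms offsets
// from the beginning of the trace.
type Span struct {
	SpanID   string
	ParentID string // "" for the root span
	Name     string
	Start    int
	End      int
}

// rootDuration finds the span without a parent and returns its duration.
func rootDuration(spans []Span) int {
	for _, s := range spans {
		if s.ParentID == "" {
			return s.End - s.Start
		}
	}
	return 0
}

// children returns the direct child spans of a given span ID.
func children(spans []Span, parentID string) []Span {
	var out []Span
	for _, s := range spans {
		if s.ParentID == parentID {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	trace := []Span{
		{"s1", "", "API Gateway", 0, 250},
		{"s2", "s1", "Auth Service", 10, 50},
		{"s3", "s1", "Order Service", 60, 240},
		{"s4", "s3", "Payment Charge", 110, 200},
	}
	fmt.Println(rootDuration(trace))        // 250
	fmt.Println(len(children(trace, "s1"))) // 2
}
```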
Sampling Strategies — The Critical Decision
Problem: storing 100% of traces at 10K QPS is infeasible → 10K traces/sec × 10KB/trace = 100MB/sec of ingest.
Strategy matrix:
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Head-based | Decide at the root span | Simple, cheap | Misses rare errors |
| Tail-based | Decide after the trace completes | Keeps all errors | Complex, stateful |
| Adaptive | Raise the rate when errors spike | Balances cost/coverage | Requires tuning |
Tail-based sampling architecture:
Application → OTLP Collector (buffering)
↓
[Wait for trace complete — 10s timeout]
↓
Decision: Keep or Drop?
- Keep if: has error span
- Keep if: latency > p99
- Keep if: rare endpoint (< 1% traffic)
- Drop: happy path, common endpoint
↓
Storage (Tempo, Jaeger)
Implementation (OpenTelemetry Collector):
# otel-collector.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample_happy_path
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
Trade-off: tail-based sampling must buffer in-flight traces (10s × 10K QPS = 100K traces in memory) → significant RAM.
Context Propagation — The Make-or-Break
W3C Trace Context (standard header):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ version
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (32 hex chars)
                                                 ^^^^^^^^^^^^^^^^ span-id (16 hex chars)
                                                                  ^^ flags (01 = sampled)
Propagation across multiple protocols:
// HTTP
req.Header.Set("traceparent", traceparent)
// gRPC (metadata)
md := metadata.Pairs("traceparent", traceparent)
ctx = metadata.NewOutgoingContext(ctx, md)
// Kafka (message header)
msg.Headers = append(msg.Headers, kafka.Header{
Key: "traceparent",
Value: []byte(traceparent),
})
// Database (SQL comment injection)
query := fmt.Sprintf("/* traceparent=%s */ SELECT ...", traceparent)
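The header itself is easy to validate and split; a minimal parser following the W3C layout above, with error handling reduced to fixed-width checks (a production library would also validate hex digits and all-zero IDs):

```go
package main

import (
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a W3C traceparent header.
type TraceParent struct {
	Version, TraceID, SpanID, Flags string
}

// parseTraceParent splits "version-traceid-spanid-flags" and checks the
// fixed field widths from the spec (2/32/16/2 hex characters).
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 ||
		len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return TraceParent{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	return TraceParent{parts[0], parts[1], parts[2], parts[3]}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(err == nil) // true
	fmt.Println(tp.TraceID) // 4bf92f3577b34da6a3ce929d0e0e4736
}
```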
Pitfall: if even one service fails to propagate the header, the trace breaks.
✓ Service A → Service B → Service C (full trace)
✗ Service A → Service B (missing propagation) → Service C (new trace)
Solution: auto-instrumentation libraries handle propagation for you, but custom code paths (background jobs, async tasks) must propagate manually.
2. Three Pillars in Production — Real-world Integration
Correlation: The Holy Trinity
Goal: from a metric spike, drill into the logs, then into the trace.
Implementation:
1. Alert fires: error_rate > 5%
↓
2. Dashboard shows spike at 10:23 AM
↓
3. Query logs: service=payment-api level=error @timestamp:[10:23 TO 10:24]
↓
4. Log entry: {"trace_id": "abc123", "error": "stripe timeout"}
↓
5. Open trace: trace_id=abc123
↓
6. See span: Stripe API took 30s (timeout=10s)
Technical glue:
// Attach the trace_id to every log line. Read it straight from the
// active span rather than stuffing it into context values.
log := logger.With("trace_id", span.SpanContext().TraceID().String())
log.Error("payment failed", "error", err)
// Grafana: Link from metric to logs
{
"targets": [{
"expr": "rate(http_errors[5m])",
"legendFormat": "{{service}}"
}],
"links": [{
"title": "View Logs",
"url": "/explore?queries=[{\"expr\":\"{service=\\\"${__field.labels.service}\\\"}\"}]"
}]
}
Exemplars — Bridging Metrics & Traces
Exemplar = Trace ID attached to metric sample.
http_request_duration_seconds_bucket{le="0.5"} 145 # {trace_id="xyz789"} 0.234
Prometheus UI: Click metric point → jump to trace.
Configuration (Prometheus 2.26+):
# prometheus.yml (also requires the --enable-feature=exemplar-storage flag)
global:
  scrape_interval: 15s
storage:
  exemplars:
    max_exemplars: 100000
scrape_configs:
  - job_name: 'api'
    metrics_path: /metrics
Application (OpenTelemetry Go):
// Record metric with exemplar
attrs := metric.WithAttributes(
attribute.String("method", "POST"),
attribute.String("route", "/orders"),
)
histogram.Record(ctx, duration, attrs)
// Span context from ctx automatically becomes exemplar
3. Cost Optimization — Senior-level Reality Check
Cost breakdown for a 10K QPS service
Metrics: $500/month
- 50K series × $0.01/series = $500
Logs: $5000/month
- 10K QPS × 2KB/log × 86400s × 30 days = 50TB/month
- 50TB × $0.10/GB = $5000
Traces: $2000/month
- 10K QPS × 1% sample rate = 100 traces/sec
- 100 × 86400 × 30 = 260M traces/month
- 260M × 10KB ≈ 2.6TB; 2.6TB × $0.75/GB ≈ $2000
Total: $7500/month
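The log line dominates the bill, and its arithmetic is worth making explicit. A sketch of the back-of-envelope model using the illustrative rates and unit price from the breakdown above:

```go
package main

import "fmt"

// logCostPerMonth estimates monthly log volume and spend from request
// rate, bytes per log line, and a storage price per GB.
func logCostPerMonth(qps, bytesPerLog, pricePerGB float64) (tb, usd float64) {
	const secondsPerMonth = 86400 * 30
	bytes := qps * bytesPerLog * secondsPerMonth
	gb := bytes / 1e9
	return gb / 1e3, gb * pricePerGB
}

func main() {
	// 10K QPS × 2KB/log at $0.10/GB, as in the breakdown above.
	tb, usd := logCostPerMonth(10_000, 2_000, 0.10)
	fmt.Printf("%.1f TB/month, about $%.0f/month\n", tb, usd) // ~51.8 TB, ~$5184
}
```

Which is where the "50TB/month ≈ $5000" figure comes from; halving bytes per line or sampling 10x directly scales the dollar figure.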
Optimization levers:
- Metrics: Recording rules (reduce query cost), prune unused series
- Logs: Sampling (10x reduction), cold tier (5x cheaper)
- Traces: Adaptive sampling (5x reduction), tail-based (10x smarter)
After optimization: $1500/month (80% reduction).
4. Observability-Driven Development (ODD)
Philosophy: instrumentation is not something you "add later"; it is a first-class citizen of the design.
Design pattern: Observable by Default
// Service template with observability built-in
type ObservableService struct {
	svc      CoreService
	tracer   trace.Tracer
	logger   *slog.Logger
	duration metric.Float64Histogram // from meter.Float64Histogram("process_order_duration_seconds")
	errors   metric.Int64Counter     // from meter.Int64Counter("process_order_errors_total")
}
func (s *ObservableService) ProcessOrder(ctx context.Context, order Order) error {
	// Start span
	ctx, span := s.tracer.Start(ctx, "ProcessOrder")
	defer span.End()
	// Record duration on the way out
	start := time.Now()
	defer func() {
		s.duration.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("order_type", order.Type)))
	}()
	// Log with trace context
	log := s.logger.With("trace_id", span.SpanContext().TraceID().String())
	log.Info("processing order", "order_id", order.ID)
	// Actual business logic
	err := s.svc.ProcessOrder(ctx, order)
	// Record the error on the span, the log, and the error counter
	if err != nil {
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
		log.Error("order failed", "error", err)
		s.errors.Add(ctx, 1,
			metric.WithAttributes(attribute.String("error_type", errorType(err))))
	}
	return err
}
Anti-pattern Checklist
❌ Log everything → log volume explosion
❌ High-cardinality labels → Prometheus OOM
❌ No trace sampling → storage cost explodes
❌ No context propagation → broken traces
❌ Logging sensitive data (passwords, tokens)
❌ Alerting on absolute thresholds (cpu > 80%) → false positives as you scale
❌ No runbook → the alert fires but nobody knows what to do
✅ Log by level, sample verbose logs
✅ Bounded cardinality; use exemplars for high-cardinality data
✅ Tail-based sampling, always keep errors
✅ Auto-propagate context via middleware
✅ Structured logging, redact PII
✅ Alert on rate of change (e.g. CPU up 50% in 5 min)
✅ Every alert links to a runbook
5. Maturity Model — From Basic to Advanced
| Level | Metrics | Logs | Traces | Correlation |
|---|---|---|---|---|
| L1: Ad-hoc | CPU/memory only | Unstructured, grep | None | Manual |
| L2: Reactive | Golden signals | Structured JSON | Head sampling (1%) | None |
| L3: Proactive | SLO-based | Indexed fields | Tail sampling | Trace ID in logs |
| L4: Predictive | Anomaly detection | Adaptive sampling | Exemplars | Auto-correlation |
Current goal: reach L3 within 3-6 months.
6. Tooling Landscape — Senior Perspective
Open-source stack (cost-effective)
Metrics: Prometheus + Thanos (long-term)
Logs: Loki (no indexing) or OpenSearch
Traces: Tempo (object storage-based)
Frontend: Grafana (unified dashboards)
Pros: Cheap, flexible, self-hosted
Cons: Operational overhead, need expertise
Commercial stack (ease-of-use)
All-in-one: Datadog, New Relic, Dynatrace
Specialized: Honeycomb (events), Lightstep (traces)
Pros: Turnkey, great UX, support
Cons: Expensive ($10K+/month at scale)
Hybrid approach (pragmatic)
Metrics: Prometheus (cheap, mature)
Logs: Commercial (Datadog) — log search UX matters
Traces: Tempo (cheap storage) + Grafana
Rationale: logs are queried daily, metrics hourly, traces weekly; optimize spend accordingly.
Summary: Production Checklist
- Metrics have recording rules for expensive queries
- Cardinality < 1M series per Prometheus instance
- Logs have a sampling strategy (< 50TB/month ingestion)
- Traces use tail-based sampling (keep 100% of errors, 1% of the happy path)
- Trace context propagates through every service
- Logs carry a trace_id field for correlation
- Alerts have runbook links and don't spam (< 5 false positives/week)
- Cost monitoring in place (budget alerts set)
- Retention policy documented (hot/warm/cold tiers)
- Team trained (not just SREs; devs must be able to query too)