Observability: Metrics, Logs và Traces — Deep Dive
"Observability is not about collecting data. It's about asking questions you didn't know you needed to ask." — Charity Majors
The goal is not "pretty dashboards" or "exhaustive logs"; it is the ability to debug unknown-unknowns in production when the system runs at scale.
1. Ba Pillar — Architecture & Trade-offs
1.1 Metrics: Time-Series Data at Scale
Core concept: metrics are numerical data aggregated over time windows.
http_requests_total{method="POST", route="/orders", status="200"} = 1523 @ t1
http_requests_total{method="POST", route="/orders", status="200"} = 1847 @ t2
Storage Backend Trade-offs
| Backend | Storage Model | Query Perf | Cardinality Limit | Cost |
|---|---|---|---|---|
| Prometheus | Local TSDB | Fast (in-memory blocks) | ~10M series | Low (self-hosted) |
| Thanos/Cortex | Object storage (S3) | Medium (remote read) | Unlimited | Medium |
| VictoriaMetrics | Optimized TSDB | Very fast | ~100M series | Low-Medium |
| Datadog/New Relic | Managed SaaS | Fast | High | High ($$) |
Architecture pattern — Federation:
┌──────────────────────────────────────────────────────┐
│ Application Pods (1000s) │
│ ├─ Expose /metrics endpoint │
└────────────────┬─────────────────────────────────────┘
│ scrape every 15s
▼
┌──────────────────────────────────────────────────────┐
│ Prometheus (per-cluster) │
│ ├─ Local storage: 15 days retention │
│ ├─ Aggregation: recording rules │
└────────────────┬─────────────────────────────────────┘
│ remote_write
▼
┌──────────────────────────────────────────────────────┐
│ Long-term Storage (Thanos/Cortex) │
│ ├─ Downsampling: 5m → 1h resolution after 7 days │
│ ├─ Retention: 13 months │
└──────────────────────────────────────────────────────┘
Cardinality Explosion — The Silent Killer
Problem: every unique label combination creates a separate time series. High cardinality → memory exhaustion.
// 🔥 DANGEROUS: Unbounded cardinality
http_requests_total{user_id="12345", ip="1.2.3.4", session_id="abc..."}
// Worst case: 1M users × 1M IPs × 1M sessions = 10^18 potential series → OOM kill
// ✅ SAFE: Bounded cardinality
http_requests_total{method="POST", route="/orders", status="200"}
// 10 methods × 50 routes × 10 status codes = 5000 series
Rule: label values must come from a finite set. Never use user_id, request_id, or email as labels.
Solution for high-cardinality data: push it into logs or exemplars (Prometheus 2.26+).
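One practical way to enforce the bounded-label rule is to normalize raw request paths into route templates before using them as label values. A minimal sketch, assuming numeric path segments are the only unbounded part (real routers expose the matched template directly, which is preferable):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numericSegment matches path segments that are purely numeric IDs.
var numericSegment = regexp.MustCompile(`^\d+$`)

// normalizeRoute collapses unbounded path segments (numeric IDs here)
// into a fixed placeholder, so the "route" label stays a finite set.
func normalizeRoute(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if numericSegment.MatchString(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(normalizeRoute("/orders/12345"))      // /orders/:id
	fmt.Println(normalizeRoute("/users/42/orders/7")) // /users/:id/orders/:id
}
```

With this, `route="/orders/:id"` stays a single series no matter how many orders exist.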
Recording Rules — Pre-aggregation at Scale
With 1000 pods, the query rate(http_requests_total[5m]) must aggregate across 1000 series at query time → slow.
Recording rule: pre-compute the aggregation every 15s.
# rules.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: api:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (route, status)
Dashboards now query api:http_requests:rate5m → near-instant, with no runtime aggregation.
Trade-off: extra storage for the pre-aggregated series, but queries run ~100x faster.
1.2 Logs: Structured Events at Petabyte Scale
Storage Architecture
Application
↓ stdout/stderr
Log Shipper (Fluent Bit, Vector)
↓ buffer + transform
Message Queue (Kafka — optional for high volume)
↓
Log Processor (Logstash, Vector aggregator)
↓ parse + enrich
Storage
├─ Hot tier (Elasticsearch): 7 days, full-text search
├─ Warm tier (S3 + Athena): 30 days, SQL queries
└─ Cold tier (Glacier): 1 year, compliance
Cost Optimization — The 80/20 Rule
Reality: 80% of logs are never read. The remaining 20% are critical for debugging.
Strategy:
- Sampling at source (application-level):
// Sample verbose logs
if rand.Float64() < 0.01 { // 1% sample rate
log.Debug("processing item", "item_id", id)
}
// Always log errors
log.Error("payment failed", "error", err)
- Dynamic log levels:
// Default: INFO
// On incident: Flip to DEBUG for specific service via config reload
logger := log.WithLevel(dynamicLevel())
- Schema on read (vs schema on write):
- Schema on write: parse logs at ingest (Logstash filters) → CPU cost at write time
- Schema on read: store raw, parse at query time (Athena, BigQuery) → CPU cost at read time
Trade-off: schema on read = cheap ingest, expensive queries. Good for logs that are rarely queried.
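The per-call rand.Float64() sampler in the first strategy makes an independent decision for every log line, so a single request's verbose logs may be half-kept, half-dropped. A common refinement is to hash a stable key such as the trace ID, so all lines of one request share the same fate. A minimal sketch; the FNV hash and 1% rate are illustrative choices:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample returns the same answer for every call with the same
// traceID, so a request's verbose logs are kept or dropped as a unit.
// rate is in [0,1]; e.g. 0.01 keeps roughly 1% of traces.
func shouldSample(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	// Map the hash onto [0,1) and compare against the target rate.
	return float64(h.Sum32())/float64(1<<32) < rate
}

func main() {
	// Same trace ID → same decision, every time, on every service.
	fmt.Println(shouldSample("abc123", 0.01) == shouldSample("abc123", 0.01)) // true
	fmt.Println(shouldSample("abc123", 1.0))                                  // true: rate 1.0 keeps everything
}
```

Because the decision depends only on the trace ID, every service in the call chain can apply it independently and still agree.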
Structured Logging — JSON or Not?
// Option 1: JSON (Elasticsearch-friendly)
log.Info("order created",
"order_id", order.ID,
"user_id", user.ID,
"total", order.Total,
)
// {"level":"info","order_id":"ord_123","user_id":456,"total":99.99}
// Option 2: Logfmt (human-readable, Loki-friendly)
// level=info order_id=ord_123 user_id=456 total=99.99
// Option 3: Hybrid (JSON for indexing, text for context)
// {"level":"info","order_id":"ord_123"} msg="Created order for premium user"
Production pattern: JSON + selective field extraction.
# Vector config: Extract only high-cardinality fields for indexing
[transforms.parse_logs]
type = "remap"
source = '''
parsed = parse_json!(.message)
.order_id = parsed.order_id # Index this
.level = parsed.level # Index this
# Don't index full message → save storage
'''
Query Performance — Inverted Index Limits
Elasticsearch bottleneck: full-text search across 10TB of logs is slow.
Solution: pre-filter with a time range and indexed fields:
// ❌ SLOW: Full-text search
GET /logs/_search
{
"query": {
"match": { "message": "payment timeout" }
}
}
// ✅ FAST: Time range + exact match
GET /logs/_search
{
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-1h" } } },
{ "term": { "service": "payment-api" } },
{ "term": { "level": "error" } }
],
"must": [
{ "match": { "message": "timeout" } }
]
}
}
}
Rule: always include a time range. Without one, the query scans the entire index → OOM.
1.3 Traces: Distributed Debugging at 10K QPS
Trace = Directed Acyclic Graph (DAG)
TraceID: abc123
├─ Span: API Gateway [0-250ms]
│ ├─ Span: Auth Service [10-50ms]
│ └─ Span: Order Service [60-240ms]
│ ├─ Span: Inventory Check [70-100ms]
│ ├─ Span: Payment Charge [110-200ms]
│ │ └─ Span: Stripe API [120-190ms]
│ └─ Span: DB Insert [210-230ms]
Each span carries:
- span_id, parent_span_id, trace_id
- start_time, duration
- attributes (key-value metadata)
- events (logs scoped to the span)
- links (references to related spans)
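The DAG above can be modeled with a handful of fields per span; parent links are enough to reconstruct the tree and answer simple questions like the root's total duration. A minimal sketch, with Start/End as millisecond offsets from trace start (span IDs are made up for illustration):

```go
package main

import "fmt"

// Span carries the core fields listed above; Start/End are ms offsets
// from the beginning of the trace.
type Span struct {
	SpanID   string
	ParentID string // "" for the root span
	Name     string
	Start    int
	End      int
}

// rootDuration finds the span without a parent and returns its duration.
func rootDuration(spans []Span) int {
	for _, s := range spans {
		if s.ParentID == "" {
			return s.End - s.Start
		}
	}
	return 0
}

// children returns the direct child spans of a given span ID.
func children(spans []Span, parentID string) []Span {
	var out []Span
	for _, s := range spans {
		if s.ParentID == parentID {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	trace := []Span{
		{"s1", "", "API Gateway", 0, 250},
		{"s2", "s1", "Auth Service", 10, 50},
		{"s3", "s1", "Order Service", 60, 240},
		{"s4", "s3", "Payment Charge", 110, 200},
	}
	fmt.Println(rootDuration(trace))        // 250
	fmt.Println(len(children(trace, "s1"))) // 2
}
```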
Sampling Strategies — The Critical Decision
Problem: storing 100% of traces at 10K QPS is infeasible → 10K traces/sec × 10KB/trace = 100MB/sec of ingest.
Strategy matrix:
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Head-based | Decide at the root span | Simple, cheap | Misses rare errors |
| Tail-based | Decide after the trace completes | Keeps all errors | Complex, stateful |
| Adaptive | Raise the rate when errors spike | Balances cost/coverage | Requires tuning |
Tail-based sampling architecture:
Application → OTLP Collector (buffering)
↓
[Wait for trace complete — 10s timeout]
↓
Decision: Keep or Drop?
- Keep if: has error span
- Keep if: latency > p99
- Keep if: rare endpoint (< 1% traffic)
- Drop: happy path, common endpoint
↓
Storage (Tempo, Jaeger)
Implementation (OpenTelemetry Collector):
# otel-collector.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample_happy_path
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
Trade-off: tail-based sampling must buffer in-flight traces (10s × 10K QPS = 100K traces in memory) → significant RAM.
Context Propagation — The Make-or-Break
W3C Trace Context (standard header):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ version
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (32 hex chars)
                                                 ^^^^^^^^^^^^^^^^ span-id (16 hex chars)
                                                                  ^^ flags (01 = sampled)
Propagation across multiple protocols:
// HTTP
req.Header.Set("traceparent", traceparent)
// gRPC (metadata)
md := metadata.Pairs("traceparent", traceparent)
ctx = metadata.NewOutgoingContext(ctx, md)
// Kafka (message header)
msg.Headers = append(msg.Headers, kafka.Header{
Key: "traceparent",
Value: []byte(traceparent),
})
// Database (SQL comment injection)
query := fmt.Sprintf("/* traceparent=%s */ SELECT ...", traceparent)
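The header itself is easy to validate and split; a minimal parser following the W3C layout above, with error handling reduced to fixed-width checks (a production library would also validate hex digits and all-zero IDs):

```go
package main

import (
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a W3C traceparent header.
type TraceParent struct {
	Version, TraceID, SpanID, Flags string
}

// parseTraceParent splits "version-traceid-spanid-flags" and checks the
// fixed field widths from the spec (2/32/16/2 hex characters).
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 ||
		len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return TraceParent{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	return TraceParent{parts[0], parts[1], parts[2], parts[3]}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(err == nil) // true
	fmt.Println(tp.TraceID) // 4bf92f3577b34da6a3ce929d0e0e4736
}
```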
Pitfall: if even one service fails to propagate the header, the trace breaks.
✓ Service A → Service B → Service C (full trace)
✗ Service A → Service B (missing propagation) → Service C (new trace)
Solution: auto-instrumentation libraries handle propagation for you, but custom code paths (background jobs, async tasks) must propagate manually.
2. Three Pillars in Production — Real-world Integration
Correlation: The Holy Trinity
Goal: from a metric spike, drill into the logs, then into the trace.
Implementation:
1. Alert fires: error_rate > 5%
↓
2. Dashboard shows spike at 10:23 AM
↓
3. Query logs: service=payment-api level=error @timestamp:[10:23 TO 10:24]
↓
4. Log entry: {"trace_id": "abc123", "error": "stripe timeout"}
↓
5. Open trace: trace_id=abc123
↓
6. See span: Stripe API took 30s (timeout=10s)
Technical glue:
// Attach the trace_id to every log line. Read it straight from the
// active span rather than stuffing it into context values.
log := logger.With("trace_id", span.SpanContext().TraceID().String())
log.Error("payment failed", "error", err)
// Grafana: Link from metric to logs
{
"targets": [{
"expr": "rate(http_errors[5m])",
"legendFormat": "{{service}}"
}],
"links": [{
"title": "View Logs",
"url": "/explore?queries=[{\"expr\":\"{service=\\\"${__field.labels.service}\\\"}\"}]"
}]
}
Exemplars — Bridging Metrics & Traces
Exemplar = Trace ID attached to metric sample.
http_request_duration_seconds_bucket{le="0.5"} 145 # {trace_id="xyz789"} 0.234
Prometheus UI: Click metric point → jump to trace.
Configuration (Prometheus 2.26+):
# prometheus.yml (also requires the --enable-feature=exemplar-storage flag)
global:
  scrape_interval: 15s
storage:
  exemplars:
    max_exemplars: 100000
scrape_configs:
  - job_name: 'api'
    metrics_path: /metrics
Application (OpenTelemetry Go):
// Record metric with exemplar
attrs := metric.WithAttributes(
attribute.String("method", "POST"),
attribute.String("route", "/orders"),
)
histogram.Record(ctx, duration, attrs)
// Span context from ctx automatically becomes exemplar
3. Cost Optimization — Senior-level Reality Check
Cost breakdown for a 10K QPS service
Metrics: $500/month
- 50K series × $0.01/series = $500
Logs: $5000/month
- 10K QPS × 2KB/log × 86400s × 30 days = 50TB/month
- 50TB × $0.10/GB = $5000
Traces: $2000/month
- 10K QPS × 1% sample rate = 100 traces/sec
- 100 × 86400 × 30 = 260M traces/month
- 260M × 10KB ≈ 2.6TB; 2.6TB × $0.75/GB ≈ $2000
Total: $7500/month
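The log line dominates the bill, and its arithmetic is worth making explicit. A sketch of the back-of-envelope model using the illustrative rates and unit price from the breakdown above:

```go
package main

import "fmt"

// logCostPerMonth estimates monthly log volume and spend from request
// rate, bytes per log line, and a storage price per GB.
func logCostPerMonth(qps, bytesPerLog, pricePerGB float64) (tb, usd float64) {
	const secondsPerMonth = 86400 * 30
	bytes := qps * bytesPerLog * secondsPerMonth
	gb := bytes / 1e9
	return gb / 1e3, gb * pricePerGB
}

func main() {
	// 10K QPS × 2KB/log at $0.10/GB, as in the breakdown above.
	tb, usd := logCostPerMonth(10_000, 2_000, 0.10)
	fmt.Printf("%.1f TB/month, about $%.0f/month\n", tb, usd) // ~51.8 TB, ~$5184
}
```

Which is where the "50TB/month ≈ $5000" figure comes from; halving bytes per line or sampling 10x directly scales the dollar figure.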
Optimization levers:
- Metrics: Recording rules (reduce query cost), prune unused series
- Logs: Sampling (10x reduction), cold tier (5x cheaper)
- Traces: Adaptive sampling (5x reduction), tail-based (10x smarter)
After optimization: $1500/month (80% reduction).
4. Observability-Driven Development (ODD)
Philosophy: instrumentation is not something you "add later"; it is a first-class citizen of the design.
Design pattern: Observable by Default
// Service template with observability built-in
type ObservableService struct {
	svc      CoreService
	tracer   trace.Tracer
	logger   *slog.Logger
	duration metric.Float64Histogram // from meter.Float64Histogram("process_order_duration_seconds")
	errors   metric.Int64Counter     // from meter.Int64Counter("process_order_errors_total")
}
func (s *ObservableService) ProcessOrder(ctx context.Context, order Order) error {
	// Start span
	ctx, span := s.tracer.Start(ctx, "ProcessOrder")
	defer span.End()
	// Record duration on the way out
	start := time.Now()
	defer func() {
		s.duration.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("order_type", order.Type)))
	}()
	// Log with trace context
	log := s.logger.With("trace_id", span.SpanContext().TraceID().String())
	log.Info("processing order", "order_id", order.ID)
	// Actual business logic
	err := s.svc.ProcessOrder(ctx, order)
	// Record the error on the span, the log, and the error counter
	if err != nil {
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
		log.Error("order failed", "error", err)
		s.errors.Add(ctx, 1,
			metric.WithAttributes(attribute.String("error_type", errorType(err))))
	}
	return err
}
Anti-pattern Checklist
❌ Log everything → log volume explosion
❌ High-cardinality labels → Prometheus OOM
❌ No trace sampling → storage cost explodes
❌ No context propagation → broken traces
❌ Logging sensitive data (passwords, tokens)
❌ Alerting on absolute thresholds (cpu > 80%) → false positives as you scale
❌ No runbook → the alert fires but nobody knows what to do
✅ Log by level, sample verbose logs
✅ Bounded cardinality; use exemplars for high-cardinality data
✅ Tail-based sampling, always keep errors
✅ Auto-propagate context via middleware
✅ Structured logging, redact PII
✅ Alert on rate of change (e.g. CPU up 50% in 5 min)
✅ Every alert links to a runbook
5. Maturity Model — From Basic to Advanced
| Level | Metrics | Logs | Traces | Correlation |
|---|---|---|---|---|
| L1: Ad-hoc | CPU/memory only | Unstructured, grep | None | Manual |
| L2: Reactive | Golden signals | Structured JSON | Head sampling (1%) | None |
| L3: Proactive | SLO-based | Indexed fields | Tail sampling | Trace ID in logs |
| L4: Predictive | Anomaly detection | Adaptive sampling | Exemplars | Auto-correlation |
Current goal: reach L3 within 3-6 months.
6. Tooling Landscape — Senior Perspective
Open-source stack (cost-effective)
Metrics: Prometheus + Thanos (long-term)
Logs: Loki (no indexing) or OpenSearch
Traces: Tempo (object storage-based)
Frontend: Grafana (unified dashboards)
Pros: Cheap, flexible, self-hosted
Cons: Operational overhead, need expertise
Commercial stack (ease-of-use)
All-in-one: Datadog, New Relic, Dynatrace
Specialized: Honeycomb (events), Lightstep (traces)
Pros: Turnkey, great UX, support
Cons: Expensive ($10K+/month at scale)
Hybrid approach (pragmatic)
Metrics: Prometheus (cheap, mature)
Logs: Commercial (Datadog) — log search UX matters
Traces: Tempo (cheap storage) + Grafana
Rationale: logs are queried daily, metrics hourly, traces weekly; optimize spend accordingly.
Summary: Production Checklist
- Metrics have recording rules for expensive queries
- Cardinality < 1M series per Prometheus instance
- Logs have a sampling strategy (< 50TB/month ingestion)
- Traces use tail-based sampling (keep 100% of errors, 1% of the happy path)
- Trace context propagates through every service
- Logs carry a trace_id field for correlation
- Alerts have runbook links and don't spam (< 5 false positives/week)
- Cost monitoring in place (budget alerts set)
- Retention policy documented (hot/warm/cold tiers)
- Team trained (not just SREs; devs must be able to query too)