📡 Observability · ✍️ Khoa · 📅 19/04/2026 · ☕ 12 min read

Reliability Engineering: When Production Goes Down at 3 AM

"Everything fails, all the time." (Werner Vogels, CTO of Amazon.) And if you have not prepared for failure, you will fail at the way you handle failure.

Observability lets you see the problem. Reliability Engineering prepares you for the problem: before, during, and after it happens. This is the piece that turns a Senior Engineer into the person the whole team trusts at 3 AM.

1. Chaos Engineering: Break the System to Make It Stronger

1.1 Philosophy

Wrong question: "Is the system OK?"
Right question: "How will the system fail?"

Chaos Engineering = deliberately injecting failure in a controlled
environment to find weak points BEFORE production finds them
for you (usually on Black Friday).

It works like a vaccine: inject a weakened virus → the immune system
learns to fight it → when the real virus arrives, it is ready.

1.2 Principles of Chaos Engineering (Netflix)

1. Start with a Hypothesis
   "If the Redis primary goes down, the application fails over to
   a replica in < 5 seconds and users see no errors"

2. Vary Real-world Events
   → Kill container/pod
   → Inject network latency (100ms → 5000ms)
   → Fill disk
   → CPU stress
   → DNS failure
   → Clock skew

3. Run in Production (once you are confident enough)
   → Staging cannot fully replicate production behavior
   → Start small: 1% of traffic, 1 availability zone

4. Minimize Blast Radius
   → Have a kill switch (stop the experiment the moment you need to)
   → Run during business hours (when people are watching)
   → Have a clear rollback plan

1.3 Tools

Chaos Monkey (Netflix):
  → Randomly kills instances in production
  → "If an instance dies and the service is still OK → good"
  → Philosophy: build resilient services by default

Litmus Chaos (CNCF):
  → Kubernetes-native chaos engineering
  → ChaosExperiment CRDs
  → Pod delete, network chaos, disk fill, node drain
  → Ships a Chaos Hub with pre-built experiments

Chaos Mesh (CNCF):
  → Also K8s-native
  → Time chaos (shifts the system clock)
  → IO chaos (injects I/O errors)

Gremlin (SaaS):
  → Enterprise chaos platform
  → Includes gameday management
  → Scenarios library

1.4 Running Your First Chaos Experiment

# Litmus ChaosEngine example: delete pods of order-service
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
spec:
  engineState: "active"          # Set to "stop" to abort the experiment
  appinfo:
    appns: production
    applabel: "app=order-service"
    appkind: deployment          # Assumed: order-service runs as a Deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"        # Run chaos for 30 seconds in total
            - name: CHAOS_INTERVAL
              value: "10"        # Delete a pod every 10 seconds
            - name: FORCE
              value: "false"     # Graceful delete

Procedure:
  1. Define steady state: "p99 < 200ms, error rate < 0.1%"
  2. Hypothesize: "A pod restart does not affect steady state"
  3. Inject chaos: Delete pod
  4. Observe: Do the metrics cross the thresholds?
  5. Learn: If they do → fix the weakness → re-run
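
Steps 1, 2 and 4 of the procedure can be sketched in a few lines of Python. This is only an illustration under assumptions: the `SteadyState` and `Sample` types and the sample readings are hypothetical, and in a real experiment the readings would come from your monitoring system rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that define 'healthy' for the experiment (step 1)."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%


@dataclass(frozen=True)
class Sample:
    """One metrics reading taken while chaos is being injected (hypothetical shape)."""
    p99_ms: float
    error_rate: float


def violations(steady: SteadyState, samples: list[Sample]) -> list[Sample]:
    """Return every sample that breaks a steady-state threshold (step 4)."""
    return [
        s for s in samples
        if s.p99_ms >= steady.max_p99_ms or s.error_rate >= steady.max_error_rate
    ]


def hypothesis_holds(steady: SteadyState, during_chaos: list[Sample]) -> bool:
    """The hypothesis (step 2) holds only if no sample violates steady state."""
    return not violations(steady, during_chaos)


# Made-up readings captured while pods were being deleted.
observed = [Sample(150.0, 0.0004), Sample(180.0, 0.0008), Sample(420.0, 0.012)]
print(hypothesis_holds(SteadyState(), observed))  # False: the third sample breaks both limits
```

A `False` result is the useful outcome here: it points at a weakness to fix before re-running the experiment.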

2. Incident Management: When Sh*t Hits the Fan

2.1 Severity Levels

SEV1 (Critical) 🔴
  → Revenue impact or data loss
  → Example: Checkout down, database corruption
  → Response: War room immediately, all-hands-on-deck
  → Comms: Update stakeholders every 15-30 minutes
  → Target MTTR: < 1 hour

SEV2 (Major) 🟠
  → A core feature is affected, a workaround exists
  → Example: Search 10x slower, push notifications delayed
  → Response: On-call + escalation if needed
  → Comms: Update every hour
  → Target MTTR: < 4 hours

SEV3 (Minor) 🟡
  → A secondary feature is affected, few users notice
  → Example: Avatar upload fails, report generation is slow
  → Response: On-call handles it during business hours
  → Target MTTR: < 24 hours

SEV4 (Low) 🟢
  → Cosmetic issue, no business impact
  → Track in the backlog, fix when there is bandwidth
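
One way to keep triage consistent is to encode the table above as data. The sketch below is an assumption about how that could look, not a standard: the `SeverityPolicy` type, the `classify` function and its three yes/no inputs are hypothetical, and SEV1 uses the tighter end (15 minutes) of the 15-30 minute update range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response: str
    update_every_min: Optional[int]   # None = no scheduled stakeholder updates listed
    target_mttr_hours: Optional[int]  # None = no MTTR target listed


# Values copied from the severity table; SEV1 uses the lower bound of "15-30 minutes".
POLICIES = {
    "SEV1": SeverityPolicy("war room, all hands on deck", 15, 1),
    "SEV2": SeverityPolicy("on-call, escalate if needed", 60, 4),
    "SEV3": SeverityPolicy("on-call during business hours", None, 24),
    "SEV4": SeverityPolicy("track in backlog", None, None),
}


def classify(revenue_or_data_loss: bool, core_feature_hit: bool,
             secondary_feature_hit: bool) -> str:
    """Pick the highest severity whose condition is met (hypothetical triage rule)."""
    if revenue_or_data_loss:
        return "SEV1"
    if core_feature_hit:
        return "SEV2"
    if secondary_feature_hit:
        return "SEV3"
    return "SEV4"


sev = classify(revenue_or_data_loss=False, core_feature_hit=True, secondary_feature_hit=False)
print(sev, POLICIES[sev].target_mttr_hours)  # SEV2 4
```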

2.2 Incident Response Process

Phase 1: DETECT (minutes 0-5)
  → Alert fires (PagerDuty, Opsgenie)
  → On-call acknowledges
  → Quick triage: which SEV is this?

Phase 2: RESPOND (minutes 5-15)
  → Open an incident channel (Slack #inc-YYYY-MM-DD-title)
  → Assign roles:
    • Incident Commander (IC): Coordinates, does NOT debug
    • Technical Lead: Debugs and fixes
    • Communications Lead: Updates stakeholders
  → Start the incident timeline doc

Phase 3: MITIGATE (minutes 15-60)
  → Goal: Restore service, NOT fix the root cause
  → Options:
    • Roll back the deployment
    • Fail over to a replica
    • Turn off a feature flag
    • Scale up / rate limit
    • Restart the service
  → Record every action in the timeline

Phase 4: RESOLVE
  → Service restored, metrics back to normal
  → Declare the incident resolved
  → Schedule the post-mortem (within 24-48 hours)

Phase 5: LEARN (post-mortem)
  → See Section 3 below
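
Two mechanical parts of this process, naming the incident channel and recording every action in a timeline, are easy to script. The following is a minimal sketch, not a Slack integration: `channel_name` and `Timeline` are hypothetical helpers that only build strings in memory.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone


def channel_name(opened_at: datetime, title: str) -> str:
    """Build a channel name following the #inc-YYYY-MM-DD-title pattern."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"#inc-{opened_at:%Y-%m-%d}-{slug}"


@dataclass
class Timeline:
    """Append-only record of who did what and when during the incident."""
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, at: datetime, who: str, action: str) -> None:
        self.entries.append((at, who, action))

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        return "\n".join(
            f"{at:%H:%M} UTC | {who} | {action}" for at, who, action in sorted(self.entries)
        )


opened = datetime(2024, 3, 15, 14, 27, tzinfo=timezone.utc)
print(channel_name(opened, "Order Service Outage"))  # #inc-2024-03-15-order-service-outage

timeline = Timeline()
timeline.log(datetime(2024, 3, 15, 14, 40, tzinfo=timezone.utc), "ic", "decided to roll back")
timeline.log(datetime(2024, 3, 15, 14, 35, tzinfo=timezone.utc), "tech-lead", "connection pool exhausted")
print(timeline.render())
```

Sorting on render means late entries (someone remembering an action ten minutes after the fact) still end up in the right place.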

2.3 War Room Dynamics

Do:
  ✅ The IC leads and asks "status update?" every 5-10 minutes
  ✅ One person debugs, the rest support (not five people SSH'd into the same server)
  ✅ Communicate clearly: "I am checking DB connections"
  ✅ Escalate early: "I need help from the Database team"
  ✅ Record every action in the timeline

Don't:
  ❌ Blame: "Who deployed this?" (do NOT ask this during the incident)
  ❌ Chaos: Everyone running commands at random
  ❌ Silent debugging: Debugging for 20 minutes without updating anyone
  ❌ Hero culture: One person trying to fix everything without sharing context

3. Blameless Post-mortems: Learning from Failure

3.1 Why Blameless?

Blame culture:
  → "A deployed buggy code and caused the incident"
  → Next time A is afraid to deploy → deploys less often → bigger batches
  → Bigger batches → higher risk → bigger incidents 🔄

Blameless culture:
  → "The system allowed insufficiently tested code to reach production"
  → Fix: add an integration test gate to CI/CD
  → Systemic improvement → fewer incidents for the WHOLE team

3.2 Post-mortem Template

# Post-mortem: Order Service Outage (2024-03-15)

## Summary
The order service could not process orders for 47 minutes
(14:23 - 15:10 UTC). ~2,300 orders were affected.
Estimated revenue impact: $45,000.

## Timeline (UTC)
| Time  | Event |
|-------|-------|
| 14:15 | v2.4.1 deployed to production |
| 14:23 | Error rate rises from 0.1% → 35% |
| 14:25 | PagerDuty alert fires |
| 14:27 | On-call acknowledges, opens #inc-2024-03-15 |
| 14:30 | IC assigned, triage begins |
| 14:35 | Identified: connection pool exhausted (500/500) |
| 14:40 | Decision to roll back to v2.4.0 |
| 14:50 | Rollback complete, connections recovering |
| 15:10 | Error rate back to 0.1%, incident resolved |

## Root Cause
v2.4.1 added a feature that calls an external API synchronously
in the request path. That API has p99 = 2s, which exhausted
the connection pool under load.

## Contributing Factors
- No load test for the new feature
- Connection pool size hardcoded (500), with no monitoring
- External API timeout too high (30s default)

## What Went Well
- Alert fired within 2 minutes
- Rollback was decided quickly (10 minutes after triage began)
- Clear communication in the incident channel

## What Went Wrong
- No integration test against the external API
- No monitoring for the connection pool yet
- Executing the rollback took 10 minutes (should be < 5 minutes)

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool metrics + alert | @tuan | 2024-03-22 |
| External API call → async (queue) | @khoa | 2024-03-29 |
| Load test gate in CI for critical paths | @hoa | 2024-04-05 |
| Reduce rollback time (blue/green deploy) | @devops | 2024-04-12 |

## Lessons Learned
A synchronous external API call in the request path is a
ticking time bomb. Any dependency that is slower than
expected → cascading failure.
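
The numbers quoted in the template (2 minutes to alert, 10 minutes to roll back, 47 minutes of outage) all fall out of the timeline, so they can be computed instead of typed by hand. A small sketch, assuming the timeline is available as a dictionary; the key names and the helpers `at` and `minutes_between` are made up for this example.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on the incident date (2024-03-15, UTC)."""
    return datetime.strptime(f"2024-03-15 {hhmm}", "%Y-%m-%d %H:%M")


# Events copied from the timeline table; the key names are hypothetical.
TIMELINE = {
    "deploy": at("14:15"),
    "impact_start": at("14:23"),
    "alert": at("14:25"),
    "ack": at("14:27"),
    "rollback_decided": at("14:40"),
    "rollback_done": at("14:50"),
    "resolved": at("15:10"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((TIMELINE[end] - TIMELINE[start]).total_seconds() // 60)


metrics = {
    "time_to_detect": minutes_between("impact_start", "alert"),                 # 2
    "time_to_ack": minutes_between("alert", "ack"),                             # 2
    "rollback_duration": minutes_between("rollback_decided", "rollback_done"),  # 10
    "total_outage": minutes_between("impact_start", "resolved"),                # 47
}
print(metrics)
```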

3.3 Post-mortem Facilitation

Facilitator checklist:
  □ Schedule within 24-48 hours (while details are still fresh)
  □ Timeline is filled in before the meeting
  □ Invite the right people (involved + interested)
  □ Set the tone at the start: "Blameless, we focus on the system"
  □ Use "we" instead of "you/he/she"
  □ Ask "what" and "how", NOT "who" or "why" (they sound accusatory)
  □ End with concrete action items + owners + deadlines
  □ Publish the post-mortem for the whole org to read

4. Error Budgets: "Spending" Reliability on Purpose

4.1 Concept

SLO: 99.9% availability = 43.8 minutes of downtime per month
(average month of ~30.44 days; 43.2 minutes for a 30-day window)

Error Budget = 100% - SLO = 0.1% = 43.8 minutes

Meaning: the team is ALLOWED 43.8 minutes of downtime per month
while still "meeting the SLO".

Error budget > 0: Push features, deploy frequently
Error budget ≈ 0: Freeze deploys, focus on reliability
Error budget < 0: ALL STOP, fix reliability first
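
The arithmetic behind the budget is short enough to write down. The sketch below assumes a time-based availability SLO; the function names are hypothetical, and the 5% "near zero" threshold is an assumption rather than a rule, since what counts as "≈ 0" is a policy choice.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of downtime allowed by the SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: float = 30.0) -> float:
    """Fraction of the budget left; negative once the budget is overspent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def deploy_policy(remaining: float, near_zero: float = 0.05) -> str:
    """Map the remaining budget onto the three regimes (threshold is an assumption)."""
    if remaining < 0:
        return "all stop"
    if remaining <= near_zero:
        return "freeze deploys"
    return "ship features"


print(round(error_budget_minutes(0.999), 1))               # 43.2 (30-day window)
print(round(error_budget_minutes(0.999, 365.25 / 12), 1))  # 43.8 (average month)
print(deploy_policy(budget_remaining(0.999, 47.0)))        # all stop
```

The last line shows why a single 47-minute outage matters at 99.9%: it alone overspends the whole month.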

4.2 Burn Rate

Burn rate = how fast the error budget is being spent,
relative to a rate that spends it in exactly 30 days

Burn rate = 1:  Budget used up in exactly 30 days → normal
Burn rate = 2:  Budget used up in 15 days → warning
Burn rate = 10: Budget used up in 3 days → dangerous
Burn rate = 60: Budget used up in 12 hours → SEV1

Multi-window alerting (recommended by Google SRE):
  Fast burn: burn_rate > 14.4 over a 1h window → Page immediately (2% of the budget gone)
  Slow burn: burn_rate > 6 over a 6h window → Page (5% of the budget gone)
  Chronic:   burn_rate > 1 over 3 days → Ticket (10% of the budget gone)
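
The burn-rate table and the three alert tiers translate directly into code. This is a sketch under assumptions: the function names are hypothetical, and the per-window burn rates are expected as inputs computed elsewhere (for example by a metrics query).

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """How long the full budget lasts if the given burn rate is sustained."""
    return window_days * 24 / rate


def alert_action(rate_1h: float, rate_6h: float, rate_3d: float) -> str:
    """Apply the three tiers from the text, most urgent first."""
    if rate_1h > 14.4:
        return "page (fast burn)"
    if rate_6h > 6:
        return "page (slow burn)"
    if rate_3d > 1:
        return "ticket (chronic)"
    return "ok"


print(round(burn_rate(0.06, 0.999)))  # 60: a 6% error ratio against a 99.9% SLO
print(hours_to_exhaustion(60))        # 12.0
print(alert_action(rate_1h=20.0, rate_6h=3.0, rate_3d=0.5))  # page (fast burn)
```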

4.3 Error Budget Policy

When the error budget runs out:
  1. Feature freeze: Ship reliability improvements only
  2. Mandatory post-mortem for every incident
  3. Increase test coverage for critical paths
  4. On-call review: are the runbooks good enough?
  5. Chaos testing: find more weak points

When the error budget is plentiful:
  1. Ship features faster
  2. Try experimental deployments
  3. Accept controlled risk
  4. Reduce over-engineering for reliability

→ The error budget gives Product and Engineering a shared
  language: "80% of the budget is left this week, let's ship" vs
  "5% of the budget is left, freeze deploys, fix stability first"

5. Failure Mode and Effects Analysis (FMEA)

5.1 Pre-mortem Analysis

Post-mortem: Learn after the outage
Pre-mortem (FMEA): Imagine the outage BEFORE it happens

For each component:
  1. Failure mode: How can it fail?
  2. Effect: What is the impact when it fails?
  3. Severity: How serious is it? (1-10)
  4. Probability: How likely is it? (1-10)
  5. Detection: Can it be detected? (1-10, 10 = hard to detect)
  6. RPN = S × P × D (Risk Priority Number)

FMEA for the Order Service:

| Failure Mode | Effect | S | P | D | RPN | Mitigation |
|---|---|---|---|---|---|---|
| DB primary down | Writes fail | 9 | 3 | 2 | 54 | Auto-failover |
| Redis down | DB load increases 10x | 7 | 4 | 3 | 84 | Cache-aside fallback |
| Payment API timeout | Checkout fails | 9 | 5 | 2 | 90 | Async + retry queue |
| Disk full | Service crash | 8 | 3 | 4 | 96 | Alert at 80% + log rotation |
| Memory leak | OOM kill | 8 | 4 | 5 | 160 | Memory limit + profiling |

Highest RPN → fix first.
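
Computing and ranking RPN is a one-liner once the table is data. A minimal sketch using the five rows above; the `FailureMode` type is a hypothetical container for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int     # S, 1-10
    probability: int  # P, 1-10
    detection: int    # D, 1-10 (10 = hard to detect)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: S x P x D."""
        return self.severity * self.probability * self.detection


# Rows copied from the Order Service FMEA table.
modes = [
    FailureMode("DB primary down", 9, 3, 2),
    FailureMode("Redis down", 7, 4, 3),
    FailureMode("Payment API timeout", 9, 5, 2),
    FailureMode("Disk full", 8, 3, 4),
    FailureMode("Memory leak", 8, 4, 5),
]

# Highest RPN first: this is the order in which to work on mitigations.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:>4}  {m.name}")
```

Note how the ranking differs from sorting by severity alone: the memory leak comes first because it is both fairly likely and hard to detect.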

6. Toil Reduction

6.1 What Is Toil?

Toil (Google SRE definition):
  → Manual (a human has to do it by hand)
  → Repetitive (done over and over)
  → Automatable (a script/tool could do it)
  → Tactical (reactive, not strategic)
  → No enduring value (does not improve the system)
  → Scales linearly (service grows → toil grows)

Examples of toil:
  → Manually restarting crashed pods
  → Copy-pasting config between environments
  → Manually scaling up before peak hours
  → SSHing into a server to check logs
  → Manually approving routine deploys

6.2 Toil Budget

Google SRE target: < 50% of time on toil

How to measure:
  1. Track on-call tasks for 2 weeks
  2. Classify: toil vs engineering
  3. Calculate: toil_hours / total_hours
  4. If > 50% → automate the top toil items

Automation priority:
  High frequency + Low complexity = Automate NOW
  High frequency + High complexity = Invest time, high ROI
  Low frequency + Low complexity = Maybe later
  Low frequency + High complexity = Don't automate
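
Both the ratio and the priority matrix are simple enough to express as functions. A sketch, assuming you already have hours classified as toil or engineering; the function names and the sample hours are made up.

```python
def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Share of tracked time spent on toil (step 3 of the measurement)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def automation_priority(high_frequency: bool, high_complexity: bool) -> str:
    """Look up the 2x2 priority matrix from the text."""
    if high_frequency and not high_complexity:
        return "automate now"
    if high_frequency and high_complexity:
        return "invest time, high ROI"
    if not high_frequency and not high_complexity:
        return "maybe later"
    return "don't automate"


# Hypothetical two-week sample: 45 of 80 tracked hours were toil.
ratio = toil_ratio(45, 80)
print(f"{ratio:.0%}", "over budget" if ratio > 0.5 else "within budget")  # 56% over budget
print(automation_priority(high_frequency=True, high_complexity=False))   # automate now
```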

7. On-Call Best Practices

7.1 On-Call Design

Rotation:
  → Minimum 2 people per rotation (primary + secondary)
  → 1-week rotation (longer = burnout)
  → Follow-the-sun for global teams
  → Handoff meeting: "Anything still ongoing from last week?"

Compensation:
  → Pay for on-call (or give comp time off)
  → If paged outside working hours → at least half a day of comp time
  → On-call should not be "free labor"

Alert quality:
  → Every alert MUST be actionable
  → If an alert fires and the action is "dismiss" → delete the alert
  → Target: < 2 pages per on-call shift
  → Review alert noise monthly
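
The monthly noise review can start from a plain export of pages and what the responder did with each one. The sketch below assumes such an export exists as `(alert_name, outcome)` pairs; the outcome labels, the 80% dismiss threshold, the helper names, and the alert `DiskLatencyHigh` are all assumptions for illustration.

```python
from collections import Counter


def noisy_alerts(pages: list[tuple[str, str]], min_dismiss_ratio: float = 0.8) -> list[str]:
    """Return alerts that were dismissed at least min_dismiss_ratio of the time.

    Each page is (alert_name, outcome), with outcome either "acted" or "dismissed".
    """
    total = Counter(name for name, _ in pages)
    dismissed = Counter(name for name, outcome in pages if outcome == "dismissed")
    return sorted(n for n in total if dismissed[n] / total[n] >= min_dismiss_ratio)


def within_page_target(page_count: int, shifts: int, max_per_shift: float = 2.0) -> bool:
    """Check the '< 2 pages per on-call shift' target."""
    return page_count / shifts < max_per_shift


# Hypothetical month of pages.
pages = (
    [("DiskLatencyHigh", "dismissed")] * 4
    + [("DiskLatencyHigh", "acted")]
    + [("OrderServiceErrorRate", "acted")] * 2
)
print(noisy_alerts(pages))                                # ['DiskLatencyHigh']
print(within_page_target(page_count=len(pages), shifts=4))  # True: 7 pages over 4 shifts
```

Alerts that show up in the first list are candidates for deletion or re-tuning, per the "dismiss → delete" rule above.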

7.2 Runbook Template

# Runbook: High Error Rate — Order Service

## Alert
`OrderServiceErrorRate > 5% for 5 minutes`

## Quick Check (< 2 minutes)
1. Check Grafana dashboard: [link]
2. Error logs: `kubectl logs -l app=order-service --tail=100`
3. Recent deployments: `kubectl rollout history deployment/order-service`

## Common Causes & Fixes

### Cause 1: Bad deployment
**Symptoms**: Error spike right after a deploy
**Fix**: `kubectl rollout undo deployment/order-service`

### Cause 2: Database connection exhaustion
**Symptoms**: "too many connections" in the logs
**Fix**:
  1. Check connection count: [Grafana link]
  2. If > 80% pool: restart pods (rolling), e.g. `kubectl rollout restart deployment/order-service`
  3. If recurring: increase pool size in config

### Cause 3: Downstream service failure
**Symptoms**: Timeout errors to payment/inventory service
**Fix**: Check downstream service status
  - If downstream down: Circuit breaker should activate
  - If CB not activating: Manual CB override: [link]

## Escalation
- After 15 min no progress → Page secondary
- After 30 min → Page team lead
- If data loss suspected → Page engineering manager
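
The escalation ladder is a good candidate for automation, since nobody watches the clock well in the middle of an incident. A minimal sketch of the rules above; `escalation_targets` is a hypothetical helper that only returns who should be paged and does not call any paging API.

```python
def escalation_targets(minutes_without_progress: int, data_loss_suspected: bool = False) -> list[str]:
    """Return everyone who should have been paged by now, per the runbook ladder."""
    targets = []
    if minutes_without_progress >= 15:
        targets.append("secondary on-call")
    if minutes_without_progress >= 30:
        targets.append("team lead")
    if data_loss_suspected:
        # Data loss escalates regardless of how much time has passed.
        targets.append("engineering manager")
    return targets


print(escalation_targets(10))                            # []
print(escalation_targets(20))                            # ['secondary on-call']
print(escalation_targets(35, data_loss_suspected=True))  # ['secondary on-call', 'team lead', 'engineering manager']
```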

8. Summary

Reliability Engineering Checklist:

  □ Chaos Engineering: Break it yourself before production breaks it for you
  □ Incident Management: Roles, process, communication
  □ Post-mortems: Blameless, with action items, published widely
  □ Error Budgets: A shared language for Engineering and Product
  □ FMEA: Pre-mortem analysis for critical systems
  □ Toil Reduction: < 50% of time on manual work
  □ On-Call: Actionable alerts, runbooks, fair rotation

References

Google SRE Book (free online)
Google SRE Workbook
Book: Release It! by Michael Nygard
Book: Accelerate by Forsgren, Humble, Kim
Principles of Chaos Engineering
PagerDuty Incident Response Guide

💡 Remember: "Hope is not a strategy." Hoping the system will not go down is the worst strategy. Preparing for failure is the only one. 🛡️