SLI, SLO, SLA và Error Budget — Site Reliability Engineering
"SLOs are the interface between SRE and product. They define what 'good enough' means." — Google SRE Book
Don't aim for 100% uptime. Aim for the level of reliability that balances user happiness with development velocity.
1. Definitions — The SLx Hierarchy
SLA (Service Level Agreement)
↓ business contract, legal consequences
SLO (Service Level Objective)
↓ internal target, drives engineering decisions
SLI (Service Level Indicator)
↓ measurement, actual user experience
1.1 SLI — What to Measure
Bad SLI: Server uptime (the server can be up while users still cannot reach it)
Good SLI: Request success rate from user perspective
SLI Categories (Google SRE):
| Type | Definition | Example |
|---|---|---|
| Availability | Fraction of successful requests | good_requests / total_requests |
| Latency | Fraction of requests faster than threshold | requests < 300ms / total_requests |
| Durability | Fraction of data not lost | retained_records / written_records |
| Correctness | Fraction of correct responses | valid_outputs / total_outputs |
| Freshness | How up-to-date data is | records < 5min old / total_records |
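The availability and latency SLIs in the table reduce to simple ratio arithmetic. A minimal Python sketch (the counter values and function names are illustrative, not from any metrics library):

```python
# Illustrative SLI arithmetic; inputs are hypothetical counters.
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of successful requests (0.0-1.0)."""
    return good_requests / total_requests if total_requests else 1.0

def latency_sli(durations_ms: list[float], threshold_ms: float = 300) -> float:
    """Fraction of requests faster than the threshold."""
    if not durations_ms:
        return 1.0
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

print(availability_sli(999_000, 1_000_000))   # 0.999
print(latency_sli([120, 250, 310, 90, 800]))  # 3 of 5 under 300ms -> 0.6
```

The other SLI types (durability, correctness, freshness) follow the same good-over-total shape with different numerators.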
1.2 SLO — Picking the Right Target
Framework: User Journey Mapping
User journey: Checkout flow
1. Add to cart → 99.9% availability, p99 < 500ms
2. View cart → 99.5% availability, p99 < 1s
3. Enter payment → 99.95% availability, p99 < 300ms ← Critical
4. Confirm order → 99.99% availability, p99 < 200ms ← Most critical
Rationale: Confirm order failures = revenue loss. View cart failures = minor UX issue.
Anti-pattern: the same SLO for every endpoint.
1.3 SLA — The Contract
Typical SLA structure:
Service: Payment API
SLA: 99.9% monthly uptime
Measurement: Successful 2xx responses / total requests
Exclusions: Planned maintenance (max 4 hours/month), user errors (4xx)
Consequences:
- 99.9% - 99.0%: 10% credit
- 99.0% - 95.0%: 25% credit
- < 95.0%: 100% credit
Rule: SLA < SLO. If SLO = 99.9%, SLA = 99.5% (buffer for internal incidents).
2. Error Budget — The Engineering Currency
2.1 Calculating Error Budget
Given: SLO = 99.9% availability over 30 days
Total time = 30 days × 24 hours × 60 min = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes
Request-based (for high-traffic APIs):
Total requests = 100M/month
Allowed failures = 100M × (1 - 0.999) = 100,000 requests
Current status (real-time tracking):
Time elapsed: 10 days (33.3% of month)
Budget used: 15 minutes (34.7% of budget)
Remaining: 28.2 minutes
Burn rate: 34.7% budget / 33.3% time = 1.04× normal
Status: ⚠️ Slightly elevated, monitor closely
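The arithmetic behind this status snapshot, as a Python sketch (the numbers come from the example above, not real telemetry):

```python
# Error budget tracking for the 99.9% / 30-day example above.
SLO = 0.999
WINDOW_MIN = 30 * 24 * 60             # 43,200 minutes in 30 days

budget_min = WINDOW_MIN * (1 - SLO)   # 43.2 allowed minutes of downtime
used_min = 15.0                       # downtime so far
elapsed_frac = 10 / 30                # 10 of 30 days elapsed (33.3%)

used_frac = used_min / budget_min     # ~34.7% of budget consumed
burn_rate = used_frac / elapsed_frac  # ~1.04x normal

print(f"budget: {budget_min:.1f} min, remaining: {budget_min - used_min:.1f} min")
print(f"burn rate: {burn_rate:.2f}x")
```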
2.2 Error Budget Policy — The Rule Book
Policy document (example):
# Error Budget Policy — Payment API
## Measurement Window
- SLO: 99.9% over rolling 30 days
- Evaluated: Daily at 00:00 UTC
## Budget Status Tiers
### Tier 1: Healthy (> 50% budget remaining)
- ✅ All development activities allowed
- ✅ Deploy on-demand (no freeze)
- ✅ Experiment with new features
### Tier 2: Warning (20% - 50% budget remaining)
- ⚠️ Increase monitoring frequency
- ⚠️ Freeze risky changes (major refactors)
- ⚠️ Post-mortems mandatory for incidents
### Tier 3: Critical (< 20% budget remaining)
- 🔴 Feature freeze — only reliability fixes
- 🔴 On-call team escalates to senior SRE
- 🔴 Daily review meetings until recovery
- 🔴 No deployments without SRE approval
### Tier 4: Exhausted (0% budget remaining)
- 🚨 Complete freeze — no changes except emergency
- 🚨 CTO-level review required
- 🚨 Root cause analysis + blameless post-mortem
- 🚨 Recovery plan due within 24 hours
## Exceptions
- Security patches: Always allowed (but count toward budget)
- Legal compliance: Always allowed
- Data loss prevention: Always allowed
## Budget Reset
- Automatic: Monthly on 1st day 00:00 UTC
- Manual: CTO approval only (extreme circumstances)
Key insight: the error budget is a negotiation tool between product (more features) and SRE (more stability).
3. Multi-Window, Multi-Burn-Rate Alerting
3.1 The Problem with Single-Threshold Alerts
# Naive alert
alert: HighErrorRate
expr: rate(http_errors[5m]) > 0.01
Issues:
- ❌ False positive: a 1-minute spike (user retries) triggers an alert
- ❌ Slow detection: gradual degradation over 2 hours never alerts
3.2 Burn Rate — The Better Metric
Burn rate = the rate at which you consume error budget, relative to normal.
Normal burn rate = 1.0
(e.g., with a 99.9% SLO, a sustained 0.1% error rate consumes the budget exactly by the end of the window)
Burn rate = actual_error_rate / (1 - SLO)
Example:
SLO = 99.9% → allowed error rate = 0.1%
Actual error rate = 1% → burn rate = 1% / 0.1% = 10×
Interpretation: at burn rate 10×, the budget is exhausted in 3 days instead of 30.
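The formula and its implication as a small Python helper (a sketch, not any monitoring library's API):

```python
# Burn rate and the time-to-exhaustion it implies.
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'normal' the budget is being consumed."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a sustained burn rate, when does a fresh budget run out?"""
    return window_days / rate

r = burn_rate(0.01, 0.999)    # 1% errors vs 0.1% allowed -> ~10x
print(days_to_exhaustion(r))  # ~3 days instead of 30
```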
3.3 Multi-Window Alert (Google SRE Workbook)
Strategy: Combine short window (detect fast) + long window (reduce noise).
# Prometheus alert rules
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Page-worthy: 14.4x burn (~2% of the 30-day budget consumed per hour)
      - alert: ErrorBudgetBurnCritical
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[1h]))
            / sum(rate(http_requests[1h]))
          ) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
          and
          (
            sum(rate(http_requests{status=~"5.."}[5m]))
            / sum(rate(http_requests[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Critical: error budget burning at 14.4x (exhausted in ~2 days if sustained)"
          runbook: "https://runbook.example.com/error-budget-burn"
      # Ticket-worthy: 6x burn (~5% of the budget consumed per 6 hours)
      - alert: ErrorBudgetBurnHigh
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[6h]))
            / sum(rate(http_requests[6h]))
          ) > (6 * 0.001) # 6x burn rate
          and
          (
            sum(rate(http_requests{status=~"5.."}[30m]))
            / sum(rate(http_requests[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "High: error budget burning at 6x (exhausted in ~5 days if sustained)"
Parameters (Google recommendation):
| Alert | Burn Rate | Long Window | Short Window | Severity |
|---|---|---|---|---|
| Critical | 14.4× | 1h | 5m | Page |
| High | 6× | 6h | 30m | Ticket |
| Medium | 3× | 1d | 2h | Ticket |
| Low | 1× | 3d | 6h | No alert |
Why these numbers?
SLO = 99.9% → budget = 0.1% of requests over 30 days (= 43.2 min of full downtime/month)
Critical (14.4×):
30 days / 14.4 ≈ 2 days to exhaustion if sustained
(≈ 2% of the monthly budget burned per hour) → worth paging immediately
High (6×):
30 days / 6 = 5 days to exhaustion if sustained
(≈ 5% of the budget per 6 hours) → a ticket is fast enough
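These parameters can be derived rather than memorized: the alert threshold is burn_rate × (1 - SLO), and the budget consumed over the long window is burn_rate × window_hours / 720. A Python sketch for the 99.9% SLO:

```python
# Deriving multi-burn-rate alert parameters for a 99.9% SLO.
SLO = 0.999
WINDOW_H = 30 * 24  # 720 hours in the 30-day SLO window

def threshold(burn: float) -> float:
    """Error-rate threshold at which to alert for a given burn rate."""
    return burn * (1 - SLO)

def budget_consumed(burn: float, long_window_h: float) -> float:
    """Fraction of the monthly budget consumed over the long window."""
    return burn * long_window_h / WINDOW_H

for name, burn, long_h in [("critical", 14.4, 1), ("high", 6, 6), ("medium", 3, 24)]:
    print(name, f"threshold={threshold(burn):.4%}",
          f"budget/{long_h}h={budget_consumed(burn, long_h):.0%}")
# critical: threshold 1.44% errors,  2% of budget per hour
# high:     threshold 0.60% errors,  5% of budget per 6h
# medium:   threshold 0.30% errors, 10% of budget per day
```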
4. Implementation Patterns
4.1 SLI Recording Rules — Pre-compute SLO
# Prometheus recording rules
groups:
  - name: sli_rules
    interval: 15s
    rules:
      # Availability SLI
      - record: sli:http_requests:availability:1m
        expr: |
          sum(rate(http_requests{status!~"5.."}[1m]))
          / sum(rate(http_requests[1m]))
      # Latency SLI (fraction < 300ms)
      - record: sli:http_requests:latency:1m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1m]))
          / sum(rate(http_request_duration_seconds_count[1m]))
      # Error budget remaining (rolling 30d)
      # assumes a 30d availability series also exists, e.g. recorded as
      # avg_over_time(sli:http_requests:availability:1m[30d])
      - record: slo:error_budget_remaining
        expr: |
          1 - (
            (1 - sli:http_requests:availability:30d)
            / (1 - 0.999)
          )
4.2 Dashboard — Error Budget Burn Down
┌────────────────────────────────────────────────────┐
│ Error Budget Status │
│ │
│ SLO: 99.9% │ Budget: 43.2 min │ Remaining: 28 min │
│ │
│ ████████████████████████████░░░░░░░░ 65% │
│ │
│ Burn rate: 1.2× (slightly elevated) │
│ Time to exhaustion: 16 days (if sustained) │
│ │
│ [Graph: Budget over time] │
│ ▲ │
│ 100│╲ │
│ │ ╲___ │
│ 50│ ╲___ │
│ │ ╲___ │
│ 0└─────────────────────────────► Time │
│ 0d 15d 30d │
└────────────────────────────────────────────────────┘
PromQL (for dashboard):
# Budget remaining (0-1)
1 - (
  (1 - sli:http_requests:availability:30d{service="payment-api"})
  / (1 - 0.999)
)
# Burn rate (current error rate relative to the allowed error rate)
(1 - sli:http_requests:availability:1h)
/ (1 - 0.999)
# Time to exhaustion (days) at the current 1h error rate:
# remaining budget in "error-hours", divided by the error rate, then by 24
(
  slo:error_budget_remaining * (1 - 0.999) * 720  # 720 hours in 30 days
) / (1 - sli:http_requests:availability:1h) / 24
4.3 Automated Freeze — GitOps Integration
# Conceptual sketch: ArgoCD has no built-in error-budget field, so this is
# not directly expressible in an Application/ApplicationSet manifest.
# A practical pattern: a small controller or CronJob queries Prometheus for
# slo:error_budget_remaining and patches the Application's syncPolicy.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-api
spec:
  # Budget healthy (>= 20% remaining): automated sync enabled
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  # Budget < 20%: the controller patches syncPolicy to {} so every
  # deployment requires an explicit manual sync (freeze)
Integration with CI/CD:
# Pre-deployment check
budget=$(curl -s "http://prometheus/api/v1/query?query=slo:error_budget_remaining" \
  | jq -r '.data.result[0].value[1]')
if awk -v b="$budget" 'BEGIN { exit !(b < 0.2) }'; then
  echo "❌ Deployment blocked: Error budget < 20%"
  exit 1
fi
5. Advanced: Composite SLOs
5.1 Multi-Component SLO
Scenario: E-commerce checkout flow = 3 services.
SLO_checkout = SLO_cart × SLO_payment × SLO_inventory
= 0.999 × 0.999 × 0.999
= 0.997
→ Target individual SLOs = 99.9% to achieve 99.7% end-to-end
Implication: sub-components need stricter SLOs than the aggregate target in order to achieve it.
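The serial-dependency math is a plain product of availabilities. A one-function Python sketch:

```python
# End-to-end availability of serially-dependent components:
# every component must succeed, so availabilities multiply.
from math import prod

def composite_slo(component_slos: list[float]) -> float:
    return prod(component_slos)

print(composite_slo([0.999, 0.999, 0.999]))  # ~0.997, i.e. 99.7% end-to-end
```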
5.2 User Journey SLO
# Define user journey as single SLI
sli:checkout_journey:success:1m =
(
requests with:
- view_product: 200 OK
- add_to_cart: 200 OK
- checkout: 200 OK
- payment: 200 OK
) / total_checkout_attempts
Measurement: End-to-end synthetic monitoring (e.g., Playwright script).
6. Rolling Window vs Calendar Window
| Approach | Definition | Pros | Cons |
|---|---|---|---|
| Rolling Window | Last 30 days | Smooth, fair | Complex to compute |
| Calendar Window | This month (1st-30th) | Simple | Resets on 1st (gameable) |
Production choice: Rolling window with calendar reporting.
Measure: Rolling 30-day window (for accuracy)
Report: Monthly (for business alignment)
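A rolling window can be approximated from per-day good/total counts. A Python sketch with hypothetical traffic (a real system would aggregate these from the metrics store):

```python
# Rolling 30-day availability SLI from daily (good, total) request counts.
from collections import deque

class RollingSLI:
    def __init__(self, window_days: int = 30):
        self.days = deque(maxlen=window_days)  # oldest day falls out automatically

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def sli(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0

r = RollingSLI(window_days=30)
for _ in range(45):  # 45 days of traffic; only the last 30 count
    r.record_day(good=99_950, total=100_000)
print(r.sli())       # 0.9995 over the rolling window
```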
7. SRE Maturity Model
| Level | SLO Adoption | Error Budget | Alerting | Process |
|---|---|---|---|---|
| L0 | No SLOs | N/A | Threshold-based | Ad-hoc |
| L1 | 1-2 SLOs defined | Calculated manually | Threshold + SLO | Quarterly review |
| L2 | SLOs per service | Automated dashboard | Multi-window burn rate | Monthly review |
| L3 | SLOs per endpoint | Auto-freeze on exhaustion | Adaptive sampling | Weekly review |
| L4 | SLOs drive roadmap | Error budget = sprint points | ML-based anomaly | Daily review |
Target: reach L2 within 6 months, L3 within 1 year.
8. Common Pitfalls — War Stories
Pitfall 1: Overly aggressive SLO
Story: The team set 99.99% uptime (52 min/year of allowed downtime). Every deploy became scary. Velocity dropped 50%.
Fix: Downgrade to 99.9% (43 min/month). Result: 10× more budget, velocity recovered.
Pitfall 2: Measuring wrong thing
Story: SLO = server uptime. The server stayed up while the database was down, so users could not access the service, yet the SLO still read 100%.
Fix: Measure user-facing requests, not server health.
Pitfall 3: Error budget policy not enforced
Story: Budget exhausted, but the team kept deploying features. An incident followed and the SLA was breached.
Fix: Automated deployment block when budget < 20%.
Pitfall 4: Alert spam
Story: 20 alerts/day; the team started ignoring them and missed a real incident.
Fix: Multi-window burn rate → 2 alerts/week, all actionable.
Summary: Implementation Checklist
Foundation (Month 1-2):
- Define 1-2 SLIs per service (availability + latency)
- Set SLO targets based on current performance + 10% buffer
- Implement recording rules for SLIs
- Build error budget dashboard
Intermediate (Month 3-4):
- Multi-window, multi-burn-rate alerts configured
- Error budget policy documented and communicated
- Integrate error budget check into CI/CD pipeline
- Monthly SLO review meetings established
Advanced (Month 5-6):
- Auto-freeze deployments when budget < threshold
- Composite SLOs for critical user journeys
- Adaptive sampling based on error budget status
- Error budget = engineering planning input (sprint points)
Ongoing:
- Quarterly SLO adjustment (tighten as reliability improves)
- Blameless post-mortems for budget exhaustion
- Share SLO reports with stakeholders (product, exec)