📡 Observability · ✍️ Khoa · 📅 19/04/2026 · ☕ 9 min read

SLI, SLO, SLA và Error Budget — Site Reliability Engineering

"SLOs are the interface between SRE and product. They define what 'good enough' means." — Google SRE Book

Don't aim for 100% uptime. Aim for the right level of reliability, one that balances user happiness with development velocity.


1. Definitions — The SLx Hierarchy

SLA (Service Level Agreement)
  ↓ business contract, legal consequences
SLO (Service Level Objective)
  ↓ internal target, drives engineering decisions
SLI (Service Level Indicator)
  ↓ measurement, actual user experience

1.1 SLI — What to Measure

Bad SLI: Server uptime (the server can be up while users still cannot reach it)

Good SLI: Request success rate from user perspective

SLI Categories (Google SRE):

| Type | Definition | Example |
|---|---|---|
| Availability | Fraction of successful requests | good_requests / total_requests |
| Latency | Fraction of requests faster than threshold | requests_under_300ms / total_requests |
| Durability | Fraction of data not lost | retained_records / written_records |
| Correctness | Fraction of correct responses | valid_outputs / total_outputs |
| Freshness | How up-to-date data is | records_under_5min_old / total_records |
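The first two SLI categories can be computed directly from request records. A minimal Python sketch (the `Request` shape and function names here are illustrative, not from any library):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # observed latency

def availability_sli(requests: list) -> float:
    """good_requests / total_requests (non-5xx counts as good)."""
    return sum(r.status < 500 for r in requests) / len(requests)

def latency_sli(requests: list, threshold_ms: float = 300) -> float:
    """Fraction of requests faster than the latency threshold."""
    return sum(r.duration_ms < threshold_ms for r in requests) / len(requests)

reqs = [Request(200, 120), Request(200, 450), Request(503, 90), Request(200, 80)]
print(availability_sli(reqs))  # 0.75 (3 of 4 succeeded)
print(latency_sli(reqs))       # 0.75 (3 of 4 under 300 ms)
```

Note that both SLIs are deliberately measured from the user-visible response, not from host health.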

1.2 SLO — Picking the Right Target

Framework: User Journey Mapping

User journey: Checkout flow
  1. Add to cart     → 99.9% availability, p99 < 500ms
  2. View cart       → 99.5% availability, p99 < 1s
  3. Enter payment   → 99.95% availability, p99 < 300ms  ← Critical
  4. Confirm order   → 99.99% availability, p99 < 200ms  ← Most critical

Rationale: Confirm order failures = revenue loss. View cart failures = minor UX issue.

Anti-pattern: the same SLO for every endpoint.

1.3 SLA — The Contract

Typical SLA structure:

Service: Payment API
SLA: 99.9% monthly uptime
Measurement: Successful 2xx responses / total requests
Exclusions: Planned maintenance (max 4 hours/month), user errors (4xx)
Consequences:
  - 99.9% - 99.0%: 10% credit
  - 99.0% - 95.0%: 25% credit
  - < 95.0%:       100% credit

Rule: SLA < SLO. If SLO = 99.9%, SLA = 99.5% (buffer for internal incidents).


2. Error Budget — The Engineering Currency

2.1 Calculating Error Budget

Given: SLO = 99.9% availability over 30 days

Total time = 30 days × 24 hours × 60 min = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes

Request-based (for high-traffic APIs):

Total requests = 100M/month
Allowed failures = 100M × (1 - 0.999) = 100,000 requests

Current status (real-time tracking):

Time elapsed: 10 days (33.3% of month)
Budget used: 15 minutes (34.7% of budget)
Remaining: 28.2 minutes

Burn rate: 34.7% budget / 33.3% time = 1.04× normal
Status: ⚠️ Slightly elevated, monitor closely
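The budget and burn-rate arithmetic above fits in two small helpers (an illustrative Python sketch; the function names are mine, not from any SRE library):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) for a time-based SLO over the window."""
    return window_days * 24 * 60 * (1 - slo)

def burn_status(slo: float, downtime_min: float, days_elapsed: float,
                window_days: int = 30) -> tuple:
    """Return (fraction of budget used, burn rate relative to normal)."""
    used = downtime_min / error_budget_minutes(slo, window_days)
    burn = used / (days_elapsed / window_days)
    return used, burn

print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes per 30 days
used, burn = burn_status(0.999, downtime_min=15, days_elapsed=10)
print(f"{used:.1%} of budget used, burn rate {burn:.2f}x")
# → 34.7% of budget used, burn rate 1.04x
```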

2.2 Error Budget Policy — The Rule Book

Policy document (example):

# Error Budget Policy — Payment API

## Measurement Window
- SLO: 99.9% over rolling 30 days
- Evaluated: Daily at 00:00 UTC

## Budget Status Tiers

### Tier 1: Healthy (> 50% budget remaining)
- ✅ All development activities allowed
- ✅ Deploy on-demand (no freeze)
- ✅ Experiment with new features

### Tier 2: Warning (20% - 50% budget remaining)
- ⚠️ Increase monitoring frequency
- ⚠️ Freeze risky changes (major refactors)
- ⚠️ Post-mortems mandatory for incidents

### Tier 3: Critical (< 20% budget remaining)
- 🔴 Feature freeze — only reliability fixes
- 🔴 On-call team escalates to senior SRE
- 🔴 Daily review meetings until recovery
- 🔴 No deployments without SRE approval

### Tier 4: Exhausted (0% budget remaining)
- 🚨 Complete freeze — no changes except emergency
- 🚨 CTO-level review required
- 🚨 Root cause analysis + blameless post-mortem
- 🚨 Recovery plan due within 24 hours

## Exceptions
- Security patches: Always allowed (but count toward budget)
- Legal compliance: Always allowed
- Data loss prevention: Always allowed

## Budget Reset
- Automatic: Monthly on 1st day 00:00 UTC
- Manual: CTO approval only (extreme circumstances)

Key insight: the error budget is a negotiation tool between product (ship more features) and SRE (keep things stable).


3. Multi-Window, Multi-Burn-Rate Alerting

3.1 The Problem with Single-Threshold Alerts

# Naive alert
alert: HighErrorRate
expr: rate(http_errors[5m]) > 0.01

Issues:

  • ❌ False positive: a 1-minute spike (user retries) triggers an alert
  • ❌ Slow detection: a gradual degradation over 2 hours never alerts

3.2 Burn Rate — The Better Metric

Burn rate = how fast the error budget is being consumed relative to normal.

Normal burn rate = 1.0
  (with a 99.9% SLO, a sustained 0.1% error rate spends exactly the whole budget over the window)

Burn rate = actual_error_rate / (1 - SLO)

Example:
  SLO = 99.9% → allowed error rate = 0.1% of requests
  Actual error rate = 1% → burn rate = 1% / 0.1% = 10×

Interpretation: a 10× burn rate exhausts the budget in 3 days instead of 30.
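The formula and its consequence, as an illustrative Python sketch (assuming a 30-day window):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """actual_error_rate / (1 - SLO); 1.0 means spending exactly the budget."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Starting from a full budget, how long until this sustained rate spends it."""
    return window_days / rate

rate = round(burn_rate(error_rate=0.01, slo=0.999), 6)
print(rate)                      # 10.0
print(days_to_exhaustion(rate))  # 3.0 (days, instead of 30)
```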

3.3 Multi-Window Alert (Google SRE Workbook)

Strategy: Combine short window (detect fast) + long window (reduce noise).

# Prometheus alert rules
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Page-worthy: 14.4× burn (budget exhausted in ~2 days if sustained)
      - alert: ErrorBudgetBurnCritical
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[1h]))
            / sum(rate(http_requests[1h]))
          ) > (14.4 * 0.001)   # 14.4× burn rate for 99.9% SLO
          and
          (
            sum(rate(http_requests{status=~"5.."}[5m]))
            / sum(rate(http_requests[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Critical: error budget burning at 14.4x (exhausted in ~2 days if sustained)"
          runbook: "https://runbook.example.com/error-budget-burn"
      
      # Ticket-worthy: 6× burn (budget exhausted in ~5 days if sustained)
      - alert: ErrorBudgetBurnHigh
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[6h]))
            / sum(rate(http_requests[6h]))
          ) > (6 * 0.001)      # 6× burn rate
          and
          (
            sum(rate(http_requests{status=~"5.."}[30m]))
            / sum(rate(http_requests[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "High: error budget burning at 6x (exhausted in ~5 days if sustained)"

Parameters (Google recommendation):

| Alert | Burn Rate | Long Window | Short Window | Severity |
|---|---|---|---|---|
| Critical | 14.4× | 1h | 5m | Page |
| High | 6× | 6h | 30m | Ticket |
| Medium | 3× | 1d | 2h | Ticket |
| Low | 1× | 3d | 6h | No alert |

Why these numbers?

SLO = 99.9% → budget = 0.1% of requests ≈ 43.2 min per 30 days

Critical (14.4×):
  Budget exhausted in 30 days / 14.4 ≈ 2.1 days if sustained
  (≈ 2% of the monthly budget burned per hour)
  The 1h window detects the burn quickly; the 5m window confirms it is still ongoing

High (6×):
  Budget exhausted in 30 days / 6 = 5 days if sustained
  (≈ 5% of the budget every 6 hours) → a ticket still leaves time to respond
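Converting each tier's burn rate into budget consumed per hour and time to exhaustion (an illustrative Python sketch; the 3× and 1× rates for the 1d and 3d windows are common multiwindow defaults, assumed here):

```python
def budget_fraction_per_hour(rate: float, window_days: int = 30) -> float:
    """Fraction of the whole budget consumed each hour at this burn rate."""
    return rate / (window_days * 24)

def days_to_empty(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    return window_days / rate

for tier, rate in [("Critical", 14.4), ("High", 6), ("Medium", 3), ("Low", 1)]:
    print(f"{tier}: {budget_fraction_per_hour(rate):.2%} of budget/hour, "
          f"empty in {days_to_empty(rate):.1f} days")
# Critical: 2.00% of budget/hour, empty in 2.1 days
# High: 0.83% of budget/hour, empty in 5.0 days
# Medium: 0.42% of budget/hour, empty in 10.0 days
# Low: 0.14% of budget/hour, empty in 30.0 days
```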

4. Implementation Patterns

4.1 SLI Recording Rules — Pre-compute SLO

# Prometheus recording rules
groups:
  - name: sli_rules
    interval: 15s
    rules:
      # Availability SLI
      - record: sli:http_requests:availability:1m
        expr: |
          sum(rate(http_requests{status!~"5.."}[1m]))
          / sum(rate(http_requests[1m]))
      
      # Latency SLI (fraction < 300ms)
      - record: sli:http_requests:latency:1m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1m]))
          / sum(rate(http_request_duration_seconds_count[1m]))
      
      # 30-day availability (assumption: traffic is roughly steady;
      # a request-weighted version would use sum_over_time of counts)
      - record: sli:http_requests:availability:30d
        expr: avg_over_time(sli:http_requests:availability:1m[30d])

      # Error budget remaining (rolling 30d)
      - record: slo:error_budget_remaining
        expr: |
          1 - (
            (1 - sli:http_requests:availability:30d)
            / (1 - 0.999)
          )

4.2 Dashboard — Error Budget Burn Down

┌────────────────────────────────────────────────────┐
│ Error Budget Status                                 │
│                                                     │
│ SLO: 99.9% │ Budget: 43.2 min │ Remaining: 28 min │
│                                                     │
│ ████████████████████████████░░░░░░░░ 65%          │
│                                                     │
│ Burn rate: 1.2× (slightly elevated)                │
│ Time to exhaustion: 16 days (if sustained)         │
│                                                     │
│ [Graph: Budget over time]                          │
│    ▲                                                │
│ 100│╲                                               │
│    │ ╲___                                           │
│  50│     ╲___                                       │
│    │         ╲___                                   │
│   0└─────────────────────────────► Time            │
│     0d        15d        30d                        │
└────────────────────────────────────────────────────┘

PromQL (for dashboard):

# Budget remaining (0-1)
1 - (
  (1 - sli:http_requests:availability:30d{service="payment-api"})
  / (1 - 0.999)
)

# Burn rate (current error rate vs. allowed error rate)
(1 - sli:http_requests:availability:1h)
/ (1 - 0.999)

# Time to exhaustion (days) if the current 1h burn rate is sustained
slo:error_budget_remaining * (1 - 0.999) * 30
/ (1 - sli:http_requests:availability:1h)

4.3 Automated Freeze — GitOps Integration

# ArgoCD ApplicationSet (illustrative sketch) — auto-disable automated sync when the budget is low
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payment-api
spec:
  generators:
    - list:
        elements:
          - env: production
            syncPolicy: |
              {{- if gt .Values.errorBudgetRemaining 0.2 }}
              automated:
                prune: true
                selfHeal: true
              {{- else }}
              # Freeze when budget < 20%
              manual: {}
              {{- end }}

Integration with CI/CD:

# Pre-deployment check
curl -s "http://prometheus/api/v1/query?query=slo:error_budget_remaining" \
  | jq -r '.data.result[0].value[1]' \
  | awk '{if ($1 < 0.2) exit 1}'  # Exit 1 if < 20%

if [ $? -eq 1 ]; then
  echo "❌ Deployment blocked: Error budget < 20%"
  exit 1
fi

5. Advanced: Composite SLOs

5.1 Multi-Component SLO

Scenario: E-commerce checkout flow = 3 services.

SLO_checkout = SLO_cart × SLO_payment × SLO_inventory
             = 0.999 × 0.999 × 0.999
             = 0.997

→ Target individual SLOs = 99.9% to achieve 99.7% end-to-end

Implication: sub-components need stricter SLOs than the end-to-end goal they must jointly deliver.
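The multiplication above, plus the inverse question (what equal per-component SLO reaches a given end-to-end target), as an illustrative Python sketch:

```python
import math

def composite_slo(*components: float) -> float:
    """Serial dependency: the journey succeeds only if every component succeeds."""
    return math.prod(components)

def per_component_target(end_to_end: float, n: int) -> float:
    """Equal per-component availability needed to hit an end-to-end target."""
    return end_to_end ** (1 / n)

print(round(composite_slo(0.999, 0.999, 0.999), 4))  # 0.997
print(round(per_component_target(0.999, 3), 5))      # 0.99967
```

So three services at 99.9% each deliver only 99.7% end-to-end, and a 99.9% end-to-end goal would require roughly 99.967% per service.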

5.2 User Journey SLO

# Define user journey as single SLI
sli:checkout_journey:success:1m =
  (
    requests with:
      - view_product: 200 OK
      - add_to_cart: 200 OK
      - checkout: 200 OK
      - payment: 200 OK
  ) / total_checkout_attempts

Measurement: End-to-end synthetic monitoring (e.g., Playwright script).


6. Rolling Window vs Calendar Window

| Approach | Definition | Pros | Cons |
|---|---|---|---|
| Rolling window | Last 30 days | Smooth, fair | Complex to compute |
| Calendar window | This month (1st to last day) | Simple | Resets on the 1st (gameable) |

Production choice: Rolling window with calendar reporting.

Measure: Rolling 30-day window (for accuracy)
Report: Monthly (for business alignment)

7. SRE Maturity Model

| Level | SLO Adoption | Error Budget | Alerting | Process |
|---|---|---|---|---|
| L0 | No SLOs | N/A | Threshold-based | Ad-hoc |
| L1 | 1-2 SLOs defined | Calculated manually | Threshold + SLO | Quarterly review |
| L2 | SLOs per service | Automated dashboard | Multi-window burn rate | Monthly review |
| L3 | SLOs per endpoint | Auto-freeze on exhaustion | Adaptive sampling | Weekly review |
| L4 | SLOs drive roadmap | Error budget = sprint points | ML-based anomaly detection | Daily review |

Target: reach L2 within 6 months, L3 within 1 year.


8. Common Pitfalls — War Stories

Pitfall 1: Overly aggressive SLO

Story: The team set 99.99% uptime (52 min of downtime per year). Every deploy became scary. Velocity dropped 50%.

Fix: Downgraded to 99.9% (43 min/month). Result: 10× more budget, velocity recovered.

Pitfall 2: Measuring the wrong thing

Story: SLO = server uptime. The server was up but the database was down, so users could not reach the service. The SLO still read 100%.

Fix: Measure user-facing requests, not server health.

Pitfall 3: Not enforcing the error budget policy

Story: The budget was exhausted, yet the team kept shipping features. An incident hit and breached the SLA.

Fix: Automated deployment block when the budget drops below 20%.

Pitfall 4: Alert spam

Story: 20 alerts a day; the team tuned them out and missed a real incident.

Fix: Multi-window burn rate → 2 alerts/week, all actionable.


Summary: Implementation Checklist

Foundation (Month 1-2):

  • Define 1-2 SLIs per service (availability + latency)
  • Set SLO targets based on current performance + 10% buffer
  • Implement recording rules for SLIs
  • Build error budget dashboard

Intermediate (Month 3-4):

  • Multi-window, multi-burn-rate alerts configured
  • Error budget policy documented and communicated
  • Integrate error budget check into CI/CD pipeline
  • Monthly SLO review meetings established

Advanced (Month 5-6):

  • Auto-freeze deployments when budget < threshold
  • Composite SLOs for critical user journeys
  • Adaptive sampling based on error budget status
  • Error budget = engineering planning input (sprint points)

Ongoing:

  • Quarterly SLO adjustment (tighten as reliability improves)
  • Blameless post-mortems for budget exhaustion
  • Share SLO reports with stakeholders (product, exec)