SLI, SLO, SLA và Error Budget — Site Reliability Engineering
"SLOs are the interface between SRE and product. They define what 'good enough' means." — Google SRE Book
Don't aim for 100% uptime. Aim for the level of reliability that balances user happiness with development velocity.
1. Definitions — The SLx Hierarchy
SLA (Service Level Agreement)
↓ business contract, legal consequences
SLO (Service Level Objective)
↓ internal target, drives engineering decisions
SLI (Service Level Indicator)
↓ measurement, actual user experience
1.1 SLI — What to Measure
Bad SLI: Server uptime (the server can be up while users still cannot reach it)
Good SLI: Request success rate from user perspective
SLI Categories (Google SRE):
| Type | Definition | Example |
|---|---|---|
| Availability | Fraction of successful requests | good_requests / total_requests |
| Latency | Fraction of requests faster than threshold | requests < 300ms / total_requests |
| Durability | Fraction of data not lost | retained_records / written_records |
| Correctness | Fraction of correct responses | valid_outputs / total_outputs |
| Freshness | How up-to-date data is | records < 5min old / total_records |
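The availability and latency SLIs in the table reduce to simple ratio arithmetic. A minimal Python sketch (the counter values and function names are illustrative, not from any metrics library):

```python
# Illustrative SLI arithmetic; inputs are hypothetical counters.
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of successful requests (0.0-1.0)."""
    return good_requests / total_requests if total_requests else 1.0

def latency_sli(durations_ms: list[float], threshold_ms: float = 300) -> float:
    """Fraction of requests faster than the threshold."""
    if not durations_ms:
        return 1.0
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

print(availability_sli(999_000, 1_000_000))   # 0.999
print(latency_sli([120, 250, 310, 90, 800]))  # 3 of 5 under 300ms -> 0.6
```

The other SLI types (durability, correctness, freshness) follow the same good-over-total shape with different numerators.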
1.2 SLO — Picking the Right Target
Framework: User Journey Mapping
User journey: Checkout flow
1. Add to cart → 99.9% availability, p99 < 500ms
2. View cart → 99.5% availability, p99 < 1s
3. Enter payment → 99.95% availability, p99 < 300ms ← Critical
4. Confirm order → 99.99% availability, p99 < 200ms ← Most critical
Rationale: Confirm order failures = revenue loss. View cart failures = minor UX issue.
Anti-pattern: the same SLO for every endpoint.
1.3 SLA — The Contract
Typical SLA structure:
Service: Payment API
SLA: 99.9% monthly uptime
Measurement: Successful 2xx responses / total requests
Exclusions: Planned maintenance (max 4 hours/month), user errors (4xx)
Consequences:
- 99.9% - 99.0%: 10% credit
- 99.0% - 95.0%: 25% credit
- < 95.0%: 100% credit
Rule: SLA < SLO. If SLO = 99.9%, SLA = 99.5% (buffer for internal incidents).
2. Error Budget — The Engineering Currency
2.1 Calculating Error Budget
Given: SLO = 99.9% availability over 30 days
Total time = 30 days × 24 hours × 60 min = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes
Request-based (for high-traffic APIs):
Total requests = 100M/month
Allowed failures = 100M × (1 - 0.999) = 100,000 requests
Current status (real-time tracking):
Time elapsed: 10 days (33.3% of month)
Budget used: 15 minutes (34.7% of budget)
Remaining: 28.2 minutes
Burn rate: 34.7% budget / 33.3% time = 1.04× normal
Status: ⚠️ Slightly elevated, monitor closely
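The arithmetic behind this status snapshot, as a Python sketch (the numbers come from the example above, not real telemetry):

```python
# Error budget tracking for the 99.9% / 30-day example above.
SLO = 0.999
WINDOW_MIN = 30 * 24 * 60             # 43,200 minutes in 30 days

budget_min = WINDOW_MIN * (1 - SLO)   # 43.2 allowed minutes of downtime
used_min = 15.0                       # downtime so far
elapsed_frac = 10 / 30                # 10 of 30 days elapsed (33.3%)

used_frac = used_min / budget_min     # ~34.7% of budget consumed
burn_rate = used_frac / elapsed_frac  # ~1.04x normal

print(f"budget: {budget_min:.1f} min, remaining: {budget_min - used_min:.1f} min")
print(f"burn rate: {burn_rate:.2f}x")
```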
2.2 Error Budget Policy — The Rule Book
Policy document (example):
# Error Budget Policy — Payment API
## Measurement Window
- SLO: 99.9% over rolling 30 days
- Evaluated: Daily at 00:00 UTC
## Budget Status Tiers
### Tier 1: Healthy (> 50% budget remaining)
- ✅ All development activities allowed
- ✅ Deploy on-demand (no freeze)
- ✅ Experiment with new features
### Tier 2: Warning (20% - 50% budget remaining)
- ⚠️ Increase monitoring frequency
- ⚠️ Freeze risky changes (major refactors)
- ⚠️ Post-mortems mandatory for incidents
### Tier 3: Critical (< 20% budget remaining)
- 🔴 Feature freeze — only reliability fixes
- 🔴 On-call team escalates to senior SRE
- 🔴 Daily review meetings until recovery
- 🔴 No deployments without SRE approval
### Tier 4: Exhausted (0% budget remaining)
- 🚨 Complete freeze — no changes except emergency
- 🚨 CTO-level review required
- 🚨 Root cause analysis + blameless post-mortem
- 🚨 Recovery plan due within 24 hours
## Exceptions
- Security patches: Always allowed (but count toward budget)
- Legal compliance: Always allowed
- Data loss prevention: Always allowed
## Budget Reset
- Automatic: Monthly on 1st day 00:00 UTC
- Manual: CTO approval only (extreme circumstances)
Key insight: the error budget is a negotiation tool between product (more features) and SRE (more stability).
3. Multi-Window, Multi-Burn-Rate Alerting
3.1 The Problem with Single-Threshold Alerts
# Naive alert
alert: HighErrorRate
expr: rate(http_errors[5m]) > 0.01
Issues:
- ❌ False positive: a 1-minute spike (user retries) triggers an alert
- ❌ Slow detection: gradual degradation over 2 hours never alerts
3.2 Burn Rate — The Better Metric
Burn rate = the rate at which you consume error budget, relative to normal.
Normal burn rate = 1.0
(e.g., with a 99.9% SLO, a sustained 0.1% error rate consumes the budget exactly by the end of the window)
Burn rate = actual_error_rate / (1 - SLO)
Example:
SLO = 99.9% → allowed error rate = 0.1%
Actual error rate = 1% → burn rate = 1% / 0.1% = 10×
Interpretation: at burn rate 10×, the budget is exhausted in 3 days instead of 30.
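The formula and its implication as a small Python helper (a sketch, not any monitoring library's API):

```python
# Burn rate and the time-to-exhaustion it implies.
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'normal' the budget is being consumed."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a sustained burn rate, when does a fresh budget run out?"""
    return window_days / rate

r = burn_rate(0.01, 0.999)    # 1% errors vs 0.1% allowed -> ~10x
print(days_to_exhaustion(r))  # ~3 days instead of 30
```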
3.3 Multi-Window Alert (Google SRE Workbook)
Strategy: Combine short window (detect fast) + long window (reduce noise).
# Prometheus alert rules
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Page-worthy: 14.4x burn (~2% of the 30-day budget consumed per hour)
      - alert: ErrorBudgetBurnCritical
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[1h]))
            / sum(rate(http_requests[1h]))
          ) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
          and
          (
            sum(rate(http_requests{status=~"5.."}[5m]))
            / sum(rate(http_requests[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Critical: error budget burning at 14.4x (exhausted in ~2 days if sustained)"
          runbook: "https://runbook.example.com/error-budget-burn"
      # Ticket-worthy: 6x burn (~5% of the budget consumed per 6 hours)
      - alert: ErrorBudgetBurnHigh
        expr: |
          (
            sum(rate(http_requests{status=~"5.."}[6h]))
            / sum(rate(http_requests[6h]))
          ) > (6 * 0.001) # 6x burn rate
          and
          (
            sum(rate(http_requests{status=~"5.."}[30m]))
            / sum(rate(http_requests[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "High: error budget burning at 6x (exhausted in ~5 days if sustained)"
Parameters (Google recommendation):
| Alert | Burn Rate | Long Window | Short Window | Severity |
|---|---|---|---|---|
| Critical | 14.4× | 1h | 5m | Page |
| High | 6× | 6h | 30m | Ticket |
| Medium | 3× | 1d | 2h | Ticket |
| Low | 1× | 3d | 6h | No alert |
Why these numbers?
SLO = 99.9% → budget = 0.1% of requests over 30 days (= 43.2 min of full downtime/month)
Critical (14.4×):
30 days / 14.4 ≈ 2 days to exhaustion if sustained
(≈ 2% of the monthly budget burned per hour) → worth paging immediately
High (6×):
30 days / 6 = 5 days to exhaustion if sustained
(≈ 5% of the budget per 6 hours) → a ticket is fast enough
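These parameters can be derived rather than memorized: the alert threshold is burn_rate × (1 - SLO), and the budget consumed over the long window is burn_rate × window_hours / 720. A Python sketch for the 99.9% SLO:

```python
# Deriving multi-burn-rate alert parameters for a 99.9% SLO.
SLO = 0.999
WINDOW_H = 30 * 24  # 720 hours in the 30-day SLO window

def threshold(burn: float) -> float:
    """Error-rate threshold at which to alert for a given burn rate."""
    return burn * (1 - SLO)

def budget_consumed(burn: float, long_window_h: float) -> float:
    """Fraction of the monthly budget consumed over the long window."""
    return burn * long_window_h / WINDOW_H

for name, burn, long_h in [("critical", 14.4, 1), ("high", 6, 6), ("medium", 3, 24)]:
    print(name, f"threshold={threshold(burn):.4%}",
          f"budget/{long_h}h={budget_consumed(burn, long_h):.0%}")
# critical: threshold 1.44% errors,  2% of budget per hour
# high:     threshold 0.60% errors,  5% of budget per 6h
# medium:   threshold 0.30% errors, 10% of budget per day
```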
4. Implementation Patterns
4.1 SLI Recording Rules — Pre-compute SLO
# Prometheus recording rules
groups:
  - name: sli_rules
    interval: 15s
    rules:
      # Availability SLI
      - record: sli:http_requests:availability:1m
        expr: |
          sum(rate(http_requests{status!~"5.."}[1m]))
          / sum(rate(http_requests[1m]))
      # Latency SLI (fraction < 300ms)
      - record: sli:http_requests:latency:1m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1m]))
          / sum(rate(http_request_duration_seconds_count[1m]))
      # Error budget remaining (rolling 30d)
      # assumes a 30d availability series also exists, e.g. recorded as
      # avg_over_time(sli:http_requests:availability:1m[30d])
      - record: slo:error_budget_remaining
        expr: |
          1 - (
            (1 - sli:http_requests:availability:30d)
            / (1 - 0.999)
          )
4.2 Dashboard — Error Budget Burn Down
┌────────────────────────────────────────────────────┐
│ Error Budget Status │
│ │
│ SLO: 99.9% │ Budget: 43.2 min │ Remaining: 28 min │
│ │
│ ████████████████████████████░░░░░░░░ 65% │
│ │
│ Burn rate: 1.2× (slightly elevated) │
│ Time to exhaustion: 16 days (if sustained) │
│ │
│ [Graph: Budget over time] │
│ ▲ │
│ 100│╲ │
│ │ ╲___ │
│ 50│ ╲___ │
│ │ ╲___ │
│ 0└─────────────────────────────► Time │
│ 0d 15d 30d │
└────────────────────────────────────────────────────┘
PromQL (for dashboard):
# Budget remaining (0-1)
1 - (
  (1 - sli:http_requests:availability:30d{service="payment-api"})
  / (1 - 0.999)
)
# Burn rate (current error rate relative to the allowed error rate)
(1 - sli:http_requests:availability:1h)
/ (1 - 0.999)
# Time to exhaustion (days) at the current 1h error rate:
# remaining budget in "error-hours", divided by the error rate, then by 24
(
  slo:error_budget_remaining * (1 - 0.999) * 720  # 720 hours in 30 days
) / (1 - sli:http_requests:availability:1h) / 24
4.3 Automated Freeze — GitOps Integration
# Conceptual sketch: ArgoCD has no built-in error-budget field, so this is
# not directly expressible in an Application/ApplicationSet manifest.
# A practical pattern: a small controller or CronJob queries Prometheus for
# slo:error_budget_remaining and patches the Application's syncPolicy.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-api
spec:
  # Budget healthy (>= 20% remaining): automated sync enabled
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  # Budget < 20%: the controller patches syncPolicy to {} so every
  # deployment requires an explicit manual sync (freeze)
Integration with CI/CD:
# Pre-deployment check
budget=$(curl -s "http://prometheus/api/v1/query?query=slo:error_budget_remaining" \
  | jq -r '.data.result[0].value[1]')
if awk -v b="$budget" 'BEGIN { exit !(b < 0.2) }'; then
  echo "❌ Deployment blocked: Error budget < 20%"
  exit 1
fi
5. Advanced: Composite SLOs
5.1 Multi-Component SLO
Scenario: E-commerce checkout flow = 3 services.
SLO_checkout = SLO_cart × SLO_payment × SLO_inventory
= 0.999 × 0.999 × 0.999
= 0.997
→ Target individual SLOs = 99.9% to achieve 99.7% end-to-end
Implication: sub-components need stricter SLOs than the aggregate target in order to achieve it.
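The serial-dependency math is a plain product of availabilities. A one-function Python sketch:

```python
# End-to-end availability of serially-dependent components:
# every component must succeed, so availabilities multiply.
from math import prod

def composite_slo(component_slos: list[float]) -> float:
    return prod(component_slos)

print(composite_slo([0.999, 0.999, 0.999]))  # ~0.997, i.e. 99.7% end-to-end
```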
5.2 User Journey SLO
# Define user journey as single SLI
sli:checkout_journey:success:1m =
(
requests with:
- view_product: 200 OK
- add_to_cart: 200 OK
- checkout: 200 OK
- payment: 200 OK
) / total_checkout_attempts
Measurement: End-to-end synthetic monitoring (e.g., Playwright script).
6. Rolling Window vs Calendar Window
| Approach | Definition | Pros | Cons |
|---|---|---|---|
| Rolling Window | Last 30 days | Smooth, fair | Complex to compute |
| Calendar Window | This month (1st-30th) | Simple | Resets on 1st (gameable) |
Production choice: Rolling window with calendar reporting.
Measure: Rolling 30-day window (for accuracy)
Report: Monthly (for business alignment)
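A rolling window can be approximated from per-day good/total counts. A Python sketch with hypothetical traffic (a real system would aggregate these from the metrics store):

```python
# Rolling 30-day availability SLI from daily (good, total) request counts.
from collections import deque

class RollingSLI:
    def __init__(self, window_days: int = 30):
        self.days = deque(maxlen=window_days)  # oldest day falls out automatically

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def sli(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0

r = RollingSLI(window_days=30)
for _ in range(45):  # 45 days of traffic; only the last 30 count
    r.record_day(good=99_950, total=100_000)
print(r.sli())       # 0.9995 over the rolling window
```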
7. SRE Maturity Model
| Level | SLO Adoption | Error Budget | Alerting | Process |
|---|---|---|---|---|
| L0 | No SLOs | N/A | Threshold-based | Ad-hoc |
| L1 | 1-2 SLOs defined | Calculated manually | Threshold + SLO | Quarterly review |
| L2 | SLOs per service | Automated dashboard | Multi-window burn rate | Monthly review |
| L3 | SLOs per endpoint | Auto-freeze on exhaustion | Adaptive sampling | Weekly review |
| L4 | SLOs drive roadmap | Error budget = sprint points | ML-based anomaly | Daily review |
Target: reach L2 within 6 months, L3 within 1 year.
8. Common Pitfalls — War Stories
Pitfall 1: Overly aggressive SLO
Story: The team set 99.99% uptime (52 min/year of allowed downtime). Every deploy became scary. Velocity dropped 50%.
Fix: Downgrade to 99.9% (43 min/month). Result: 10× more budget, velocity recovered.
Pitfall 2: Measuring wrong thing
Story: SLO = server uptime. The server stayed up while the database was down, so users could not access the service, yet the SLO still read 100%.
Fix: Measure user-facing requests, not server health.
Pitfall 3: Error budget policy not enforced
Story: Budget exhausted, but the team kept deploying features. An incident followed and the SLA was breached.
Fix: Automated deployment block when budget < 20%.
Pitfall 4: Alert spam
Story: 20 alerts/day; the team started ignoring them and missed a real incident.
Fix: Multi-window burn rate → 2 alerts/week, all actionable.
Summary: Implementation Checklist
Foundation (Month 1-2):
- Define 1-2 SLIs per service (availability + latency)
- Set SLO targets based on current performance + 10% buffer
- Implement recording rules for SLIs
- Build error budget dashboard
Intermediate (Month 3-4):
- Multi-window, multi-burn-rate alerts configured
- Error budget policy documented and communicated
- Integrate error budget check into CI/CD pipeline
- Monthly SLO review meetings established
Advanced (Month 5-6):
- Auto-freeze deployments when budget < threshold
- Composite SLOs for critical user journeys
- Adaptive sampling based on error budget status
- Error budget = engineering planning input (sprint points)
Ongoing:
- Quarterly SLO adjustment (tighten as reliability improves)
- Blameless post-mortems for budget exhaustion
- Share SLO reports with stakeholders (product, exec)