Cost Engineering & FinOps: When the CTO Asks "Why Did the Cloud Bill Triple?"
Being good at code without knowing what the system costs makes you an architect with one blind eye. A Staff Engineer must be able to answer: "How much does this feature cost per user per month?"
The era of the "unlimited cloud budget" is over. After the waves of layoffs in 2023-2024, companies everywhere are asking: "Does the infrastructure cost justify the value it delivers?" And that question will land on your desk.
1. Cloud Cost Anatomy: Where Does the Money Go?
1.1 Typical Cost Breakdown
Typical cloud bill breakdown:
Compute (EC2/GKE/ECS): 40-50% ← Biggest chunk
Database (RDS/CloudSQL): 20-30% ← Often over-provisioned
Storage (S3/GCS): 5-10%
Network (Data Transfer): 5-15% ← Hidden cost killer!
Other (Lambda, SQS, etc): 5-10%
Surprise costs:
→ Cross-AZ data transfer: $0.01/GB in each direction (seems small, adds up FAST)
→ NAT Gateway: $0.045/GB processed (why is this so expensive?!)
→ CloudWatch Logs: $0.50/GB ingested
→ EBS snapshots: accumulate over time
→ Idle load balancers: $18/month EACH even with 0 traffic
1.2 Unit Economics — Cost Per X
Instead of looking at the total bill, look at cost per business unit:
Cost per request: $0.0001/request
Cost per user/month: $0.15/user
Cost per order: $0.02/order
Cost per GB stored: $0.023/month
Why it matters:
→ Total bill $50K/month → "Is that expensive or cheap?" (impossible to say without context)
→ $0.15/user, 300K users, $25 revenue/user → 0.6% infra cost, margin OK ✅
→ $2/user, 300K users, $25 revenue/user → 8% infra cost ⚠️
Track unit cost over time:
→ If cost/user rises while revenue/user stays flat = bad
→ If cost/user falls = engineering efficiency is improving
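To make the comparison concrete, here is a minimal Python sketch that turns a monthly bill into cost per user and infrastructure share of revenue. The function name `unit_economics` and the two example bills are illustrative assumptions, chosen so the results match the $0.15/user and $2/user scenarios.

```python
def unit_economics(monthly_bill_usd: float, active_users: int,
                   revenue_per_user_usd: float) -> dict[str, float]:
    """Return cost per user and the share of per-user revenue spent on infra."""
    cost_per_user = monthly_bill_usd / active_users
    return {
        "cost_per_user_usd": cost_per_user,
        "infra_share_of_revenue": cost_per_user / revenue_per_user_usd,
    }


# Hypothetical bills that reproduce the two scenarios: $0.15/user and $2/user.
healthy = unit_economics(45_000, 300_000, 25.0)
strained = unit_economics(600_000, 300_000, 25.0)

for label, result in (("healthy", healthy), ("strained", strained)):
    print(f"{label}: ${result['cost_per_user_usd']:.2f}/user, "
          f"{result['infra_share_of_revenue']:.1%} of revenue")
```

Recomputing these two numbers every month is usually enough to see the trend in cost/user against revenue/user.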
2. Cost Optimization Strategies
2.1 Right-sizing: Don't Take a Ferrari to the Grocery Store
Problem #1: Over-provisioned instances
A common reality:
→ The team provisions m5.xlarge (4 vCPU, 16 GB) because it feels "safe"
→ Average CPU usage: 15%
→ Average memory usage: 30%
→ You are paying for roughly 70-85% unused resources
Fix:
→ Monitor actual usage (CPU, memory) for 2 weeks
→ Right-size: m5.xlarge → m5.large (cuts cost by 50%)
→ Or use auto-scaling (scale down off-peak)
Tools:
→ AWS Compute Optimizer
→ GCP Recommender
→ Kubecost (Kubernetes)
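The right-sizing decision can be sketched as a simple check. This is a rough illustration, not how the tools above work internally: the names `UsageSample` and `can_halve_instance`, the 80% ceiling, and the choice to look at peak rather than average usage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class UsageSample:
    cpu_percent: float
    memory_percent: float


def can_halve_instance(samples: list[UsageSample],
                       max_utilization_percent: float = 80.0) -> bool:
    """Check whether peak usage would still fit after halving vCPU and memory.

    Halving the instance size roughly doubles utilization, so the doubled
    peak must stay at or below the chosen ceiling.
    """
    peak_cpu = max(s.cpu_percent for s in samples)
    peak_memory = max(s.memory_percent for s in samples)
    return (2 * peak_cpu <= max_utilization_percent
            and 2 * peak_memory <= max_utilization_percent)


# Hypothetical two-week summary: mostly idle, with one busier sample.
quiet = [UsageSample(15, 30), UsageSample(25, 35)]
busy = [UsageSample(15, 30), UsageSample(45, 35)]
print(can_halve_instance(quiet))  # doubled peaks are 50% CPU and 70% memory
print(can_halve_instance(busy))   # doubled CPU peak would be 90%
```

Checking peaks instead of averages is the conservative choice: an instance that averages 15% CPU can still be too small if it spikes during business hours.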
2.2 Reserved Instances & Savings Plans
On-demand: Pay full price, fully flexible
Reserved: Commit for 1 or 3 years, save 30-60%
Spot: Uses spare capacity, saves 60-90%, can be interrupted
Strategy:
┌──────────────────────────────────────────┐
│ Workload Mix │
│ │
│ Base load (stable): Reserved/Savings Plan│
│ ████████████████████████████ │
│ │
│ Variable load: On-demand + Auto-scaling │
│ ████████████████████ │
│ ████████ │
│ │
│ Fault-tolerant jobs: Spot instances │
│ ████ (batch processing, CI/CD workers) │
└──────────────────────────────────────────┘
Savings plan coverage:
→ 60-70% base load = Savings Plans (1-year, no upfront)
→ 20-30% variable = On-demand
→ CI/CD workers = Spot instances (save 70%+)
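The effect of such a mix on the bill can be estimated with a few lines of Python. The discount rates below (30% for Savings Plans, 70% for Spot) are assumptions picked from within the ranges quoted earlier, and the $10,000 on-demand baseline is hypothetical.

```python
def blended_monthly_cost(on_demand_equivalent_usd: float,
                         mix: dict[str, float],
                         discounts: dict[str, float]) -> float:
    """Cost of a workload mix, given what it would cost fully on-demand.

    `mix` maps purchase option -> share of the workload (shares sum to 1).
    `discounts` maps purchase option -> discount versus on-demand (0..1).
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("workload shares must sum to 1")
    return sum(on_demand_equivalent_usd * share * (1 - discounts[option])
               for option, share in mix.items())


# Assumed mix and discounts, for illustration only.
mix = {"savings_plan": 0.65, "on_demand": 0.25, "spot": 0.10}
discounts = {"savings_plan": 0.30, "on_demand": 0.0, "spot": 0.70}

cost = blended_monthly_cost(10_000, mix, discounts)
print(f"blended: ${cost:,.0f} vs $10,000 fully on-demand")
```

Under these assumptions the blended bill comes to roughly $7,350, about a quarter below the on-demand baseline, with most of the saving coming from covering the base load.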
2.3 Database Cost Optimization
Common wastes:
→ RDS Multi-AZ for development environments
→ Over-provisioned IOPS (provisioned IOPS $$)
→ Keeping old snapshots forever
→ Read replicas nobody uses
Fixes:
→ Dev/staging: Single-AZ, smaller instance, stop off-hours
→ Production: Right-size, use gp3 instead of io1 (cheaper IOPS)
→ Snapshot lifecycle policy (delete after 30 days)
→ Audit read replicas quarterly
→ Consider Aurora Serverless v2 for variable workloads
Storage tiering:
→ Hot data (< 3 months): SSD, primary DB
→ Warm data (3-12 months): Cheaper storage, read replicas
→ Cold data (> 12 months): S3/GCS, Glacier
→ Archive (> 3 years): Glacier Deep Archive ($1/TB/month!)
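As a sketch, the tiering rule can be written as a small function that a lifecycle job might call for each record or object. The function name `storage_tier` and the approximation of a month as 30 days are assumptions for this example.

```python
def storage_tier(age_days: int) -> str:
    """Map the age of a record to a tier, approximating a month as 30 days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 90:
        return "hot"       # SSD, primary DB
    if age_days <= 365:
        return "warm"      # cheaper storage, read replicas
    if age_days <= 3 * 365:
        return "cold"      # S3/GCS, Glacier
    return "archive"       # Glacier Deep Archive


for age in (10, 200, 500, 2000):
    print(age, "days ->", storage_tier(age))
```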
2.4 Network Cost — Hidden Killer
AWS data transfer pricing:
→ Same AZ: FREE
→ Cross-AZ: $0.01/GB (both directions)
→ Cross-region: $0.02/GB
→ Internet out: $0.09/GB (first 10TB)
Scenario: Service A ↔ Service B, 1TB/day cross-AZ
Cost: 1000GB × $0.01 × 2 directions × 30 days = $600/month
Just for data transfer between 2 services! 😱
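The same arithmetic as a small Python helper, which makes it easy to test what a fix would save. The function name and the 30% compression ratio are hypothetical; the $0.01/GB rate is the one from the pricing list above.

```python
CROSS_AZ_USD_PER_GB = 0.01  # per direction, from the pricing list above


def cross_az_monthly_cost(gb_per_day: float, days: int = 30,
                          usd_per_gb: float = CROSS_AZ_USD_PER_GB) -> float:
    """Monthly cross-AZ cost, charged in both directions."""
    return gb_per_day * usd_per_gb * 2 * days


baseline = cross_az_monthly_cost(1000)
# Hypothetical: compression shrinks the payload to 30% of its original size.
compressed = cross_az_monthly_cost(1000 * 0.30)
print(f"baseline: ${baseline:.0f}/month, compressed: ${compressed:.0f}/month")
```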
Fixes:
→ Co-locate services in the same AZ when possible
→ Compress data between services (gzip, protobuf)
→ Cache responses (reduce redundant transfers)
→ Use VPC endpoints for AWS services (avoid NAT Gateway)
→ CDN cho static content (edge caching)
3. FinOps Framework
3.1 FinOps Lifecycle
┌─────────────┐
│ INFORM │ ← Visibility: who uses what, and what does it cost?
└──────┬──────┘
│
┌──────▼──────┐
│ OPTIMIZE │ ← Right-size, reserved, cleanup waste
└──────┬──────┘
│
┌──────▼──────┐
│ OPERATE │ ← Budgets, alerts, accountability
└─────────────┘
3.2 Cost Allocation & Tagging
Tagging strategy (CRITICAL: missing tags = flying blind):
Required tags for EVERY resource:
team: "order-team"
environment: "production" | "staging" | "development"
service: "order-service"
cost-center: "engineering"
→ Enforce via CI/CD: reject the deploy if tags are missing
→ AWS: Tag Policies, SCP (deny untagged resources)
→ K8s: Labels + Kubecost
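A CI/CD gate for the required tags could look roughly like the sketch below. It only shows the validation step; how the pipeline extracts tags from a Terraform plan or a manifest is left out, and the function name `tag_violations` is an assumption for this example.

```python
REQUIRED_TAGS = {"team", "environment", "service", "cost-center"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the deploy may proceed."""
    problems = [f"missing tag: {name}"
                for name in sorted(REQUIRED_TAGS - set(tags))]
    environment = tags.get("environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {environment}")
    return problems


good = {"team": "order-team", "environment": "production",
        "service": "order-service", "cost-center": "engineering"}
bad = {"team": "order-team", "environment": "prod"}

print(tag_violations(good))
print(tag_violations(bad))
```

In a pipeline, a non-empty result would fail the job and print the list, so the author sees exactly which tags to add.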
Showback vs Chargeback:
Showback: "Team A used $5K this month" (inform only)
Chargeback: "$5K is deducted from Team A's budget" (accountability)
Start with showback → mature to chargeback
3.3 Budget & Alerts
Cloud budget alerts:
→ 50% budget consumed: Email notification
→ 80% budget consumed: Slack alert
→ 100% budget consumed: Page team lead
→ 120% budget: Auto-alert engineering manager
Anomaly detection:
→ AWS Cost Anomaly Detection: ML-based
→ Custom: Alert when daily cost > 2x average
→ "Why did cost jump 300% yesterday?"
Usually: someone forgot to stop a test cluster 😅
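Both ideas fit in a short Python sketch: an escalation ladder for budget thresholds and the custom "daily cost > 2x average" rule. The function names are assumptions, and the actual delivery of notifications (email, Slack, paging) is deliberately left out.

```python
from statistics import mean
from typing import Optional

# (fraction of budget consumed, action), highest threshold first.
ALERT_LEVELS = [
    (1.20, "alert engineering manager"),
    (1.00, "page team lead"),
    (0.80, "Slack alert"),
    (0.50, "email notification"),
]


def budget_action(spend_usd: float, budget_usd: float) -> Optional[str]:
    """Return the highest escalation step reached, or None below 50%."""
    consumed = spend_usd / budget_usd
    for threshold, action in ALERT_LEVELS:
        if consumed >= threshold:
            return action
    return None


def is_cost_anomaly(today_usd: float, recent_daily_usd: list[float],
                    factor: float = 2.0) -> bool:
    """Flag a day whose cost exceeds `factor` times the recent average."""
    return today_usd > factor * mean(recent_daily_usd)


print(budget_action(8_500, 10_000))
print(is_cost_anomaly(400, [100, 110, 90]))
```

A fixed 2x rule is crude compared with an ML-based detector, but it is cheap to run daily and catches the forgotten test cluster.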
4. Kubernetes Cost Optimization
K8s specific optimizations:
1. Resource Requests & Limits:
❌ No requests set → the scheduler cannot place pods efficiently
❌ Requests = Limits → no bursting, waste resources
✅ Requests = typical usage, Limits = 2x requests
2. Cluster Autoscaler:
→ Scale nodes down when pods don't need them
→ Configure scale-down delays such as --scale-down-delay-after-add (avoid thrashing)
→ Use multiple node pools (different instance types)
3. Pod Disruption Budgets + Spot Nodes:
→ Non-critical workloads → spot node pool (70% cheaper)
→ PDB ensures enough pods survive spot interruption
4. Namespace quotas:
→ Prevent any team from consuming all resources
→ ResourceQuota per namespace
5. Kubecost / OpenCost:
→ Per-pod, per-namespace, per-team cost breakdown
→ Efficiency score: actual usage / requested resources
→ Recommend right-sizing
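The efficiency score and the requests/limits rule of thumb from item 1 are easy to compute yourself. This is a minimal sketch with assumed function names; it does not reproduce how Kubecost or OpenCost calculate their scores in detail.

```python
def efficiency_score(used: float, requested: float) -> float:
    """Actual usage divided by requested resources (same unit for both)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return used / requested


def suggested_resources(typical_usage_millicores: int) -> dict[str, int]:
    """Apply the rule of thumb above: requests = typical usage, limits = 2x."""
    return {
        "request_millicores": typical_usage_millicores,
        "limit_millicores": 2 * typical_usage_millicores,
    }


# Hypothetical pod: requests 1000m CPU but typically uses 250m.
print(f"efficiency: {efficiency_score(250, 1000):.0%}")
print(suggested_resources(250))
```

A pod at 25% efficiency reserves four times the CPU it uses, which is capacity the scheduler cannot hand to other pods.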
5. Cost-Aware Architecture Decisions
Every architecture decision = a cost decision:
Microservices vs Monolith:
→ 50 microservices × one load balancer each ($18/mo) = $900/mo
just for load balancers!
→ Monolith: $18/mo
→ Microservices cost more. Choose them when the benefit outweighs the cost.
Sync vs Async:
→ Sync: hold connection open → more instances needed
→ Async (queue): process when ready → fewer instances
→ Queue cost (SQS: $0.40/million messages) is usually < compute savings
Cache vs No Cache:
→ Redis (r6g.large): ~$200/month
→ Savings: reduce DB reads 80% → smaller DB instance
→ ROI is usually positive if DB cost > $500/month
Serverless vs Container:
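→ Lambda: $0 when idle, $0.0000166667/GB-sec while running (plus a per-request fee)
→ Container: $X/month 24/7, even when idle
→ Low traffic: Lambda wins
→ High traffic (>1M req/day): containers are usually cheaper; the exact break-even depends on duration and memory per request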
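The serverless break-even can be estimated with a short calculation. The GB-second rate is the one quoted above; the $0.20 per million requests fee, the 200 ms duration, the 512 MB memory size, and the $60/month container are assumptions for illustration, and the free tier is ignored.

```python
LAMBDA_USD_PER_GB_SECOND = 0.0000166667   # rate quoted above
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000  # assumed request fee


def lambda_cost_per_request(avg_duration_s: float, memory_gb: float) -> float:
    """Compute charge plus request fee for a single invocation."""
    return (avg_duration_s * memory_gb * LAMBDA_USD_PER_GB_SECOND
            + LAMBDA_USD_PER_REQUEST)


def lambda_monthly_cost(requests_per_day: float, avg_duration_s: float,
                        memory_gb: float, days: int = 30) -> float:
    return (requests_per_day * days
            * lambda_cost_per_request(avg_duration_s, memory_gb))


def break_even_requests_per_day(container_monthly_usd: float,
                                avg_duration_s: float, memory_gb: float,
                                days: int = 30) -> float:
    """Daily request volume at which Lambda costs as much as the container."""
    per_request = lambda_cost_per_request(avg_duration_s, memory_gb)
    return container_monthly_usd / (per_request * days)


# Hypothetical workload: 200 ms per request at 512 MB, versus a $60/month container.
monthly = lambda_monthly_cost(1_000_000, 0.2, 0.5)
threshold = break_even_requests_per_day(60.0, 0.2, 0.5)
print(f"Lambda at 1M req/day: ${monthly:.2f}/month")
print(f"break-even: {threshold:,.0f} req/day")
```

Under these assumptions the break-even lands a little above 1M requests/day, in line with the rule of thumb. Longer or more memory-hungry requests pull the threshold down quickly.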
6. Summary
FinOps Checklist for Staff Engineers:
□ Know your bill: Cost breakdown by service, team, environment
□ Unit economics: Cost per user, per request, per order
□ Tagging: Enforce tags for all cloud resources
□ Right-sizing: Review monthly, auto-recommendations
□ Reserved/Savings: 60-70% base load coverage
□ Network: Monitor cross-AZ/region transfer
□ Storage: Lifecycle policies, tiering
□ K8s: Resource requests, autoscaler, Kubecost
□ Alerts: Budget alerts + anomaly detection
□ Architecture: Consider cost in every design decision
References
- FinOps Foundation
- Cloud FinOps — J.R. Storment & Mike Fuller (O'Reilly)
- AWS Well-Architected: Cost Optimization
- Kubecost
- OpenCost
💡 Remember: "The cloud is someone else's computer, and they send you the bill." The point of optimizing is not to save pennies but to keep growth sustainable. 💰