✍️ Khoa 📅 19/04/2026 ☕ 14 min read

Cloud Cost Optimization & FinOps: Don't Let the Cloud Bill Kill Your Startup

"The cloud is not cheaper than on-premise. The cloud is more agile — and agility has a price. Your job is to make sure you're paying for agility, not paying for waste."

Most teams discover the cost problem too late: after receiving an $80k invoice for a month that was originally estimated at $20k. This article helps you think about cost from the moment you design the architecture, not after the money is already gone.

1. FinOps Mindset: What Do Engineers Need to Know?

1.1 FinOps Is Not the Finance Department

FinOps (Financial Operations) is the practice of optimizing cloud spending, and it requires collaboration between Engineering, Finance, and Product.

Traditional:
  Finance: "Why did this month's bill go up 40%?"
  Engineering: "I don't know, ask DevOps"
  DevOps: "I only set up the infra, I don't track cost"

FinOps:
  Engineers understand the cost implications of their design decisions
  → Choose instance types with price awareness
  → Know how many BigQuery slots a query consumes
  → Know what streaming 1TB/day through Kinesis costs

1.2 Unit Economics: The Right Way to Think About Cost

Don't ask "What is this month's cloud bill?". Ask instead:

Cost per request: How much does each API call cost?
Cost per user: How much does serving one active user cost per month?
Cost per transaction: How much does each processed payment cost?

# Simplified: track unit cost
monthly_cloud_bill = 50_000   # USD
monthly_api_requests = 500_000_000  # 500M requests

cost_per_request = monthly_cloud_bill / monthly_api_requests
# = $0.0001/request = 0.01 cent/request

# If you need a margin > 40% and charge $0.001/request
# → Cost is 10% of revenue → healthy
# → At 6x ($0.0006/request) the margin drops to the 40% floor;
#   at 10x ($0.001/request) the margin is gone entirely
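To sanity-check this kind of margin reasoning, a small sketch can compute the gross margin per request and how far unit cost can grow before a target margin breaks. The function names and the 40% floor are illustrative assumptions, not part of any library.

```python
def gross_margin(price_per_request: float, cost_per_request: float) -> float:
    """Fraction of revenue left after cloud cost (0.9 means 90%)."""
    return (price_per_request - cost_per_request) / price_per_request


def max_cost_multiplier(price_per_request: float,
                        cost_per_request: float,
                        min_margin: float) -> float:
    """How many times unit cost can grow before margin drops below min_margin."""
    max_cost = price_per_request * (1 - min_margin)
    return max_cost / cost_per_request


price = 0.001    # USD charged per request (figure from the example above)
cost = 0.0001    # USD cloud cost per request ($50k / 500M requests)

print(f"margin today: {gross_margin(price, cost):.0%}")
print(f"cost headroom at a 40% margin floor: "
      f"{max_cost_multiplier(price, cost, 0.40):.1f}x")
```

Tracking the headroom multiplier alongside the bill makes it clear when cost growth is a pricing problem rather than an infrastructure problem.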

1.3 The Three Main Levers

Cloud Cost = Resources × Time × Unit Price

Reduce Resources:  Right-sizing, auto-scaling, eliminate waste
Reduce Time:       Scale to zero, scheduled scaling
Reduce Unit Price: Spot instances, Reserved Instances, Committed Use, Savings Plans
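The formula can be made concrete with a few lines of arithmetic. This is only a sketch: the instance count, the $0.20/hour price, and the 260 office hours per month are hypothetical numbers chosen to show each lever in isolation.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(instances: int, hours: float, price_per_hour: float) -> float:
    """Cloud Cost = Resources x Time x Unit Price."""
    return instances * hours * price_per_hour


# Hypothetical baseline: 10 instances, always on, $0.20/hour each.
baseline = monthly_cost(10, HOURS_PER_MONTH, 0.20)

levers = {
    "fewer resources (10 -> 6 instances)": monthly_cost(6, HOURS_PER_MONTH, 0.20),
    "less time (730h -> 260h, office hours only)": monthly_cost(10, 260, 0.20),
    "lower unit price (30% discount)": monthly_cost(10, HOURS_PER_MONTH, 0.14),
}

for name, cost in levers.items():
    print(f"{name}: ${cost:,.0f}/month, saves {1 - cost / baseline:.0%}")
```

Because the three factors multiply, the levers compound: applying two of them together saves more than either one alone.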

2. Compute Cost: Spot vs Reserved vs On-Demand

2.1 Decision Framework

Which type of compute do you need?

START HERE
    │
    ▼
[Can the workload be interrupted?]
    │
    ├─ YES → Spot/Preemptible (save 60-90%)
    │         Use for: Batch jobs, ML training, data processing
    │         Do NOT use for: Production API, database, stateful apps
    │
    └─ NO → [Does the workload have predictable usage?]
                │
                ├─ YES, runs 24/7 → Reserved Instances / Savings Plans
                │                    Commit for 1 or 3 years, save 30-72%
                │                    Use for: Production databases, core services
                │
                ├─ Mostly ON, occasional spikes → Savings Plans (flexible)
                │                              Commit by $/hour, not by instance type
                │
                └─ Unpredictable / Dev environment → On-demand
                                                     Or Savings Plans at the baseline level
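The decision tree above can be written down as a small function, which is handy for tagging workloads in an inventory script. This is a minimal sketch; the enum values and return labels are made-up names for illustration.

```python
from enum import Enum


class UsagePattern(Enum):
    ALWAYS_ON = "always_on"          # predictable, runs 24/7
    MOSTLY_ON = "mostly_on"          # steady base with occasional spikes
    UNPREDICTABLE = "unpredictable"  # dev environments, spiky or unknown usage


def choose_purchase_model(interruptible: bool, usage: UsagePattern) -> str:
    """Walk the decision tree: interruption tolerance first, then predictability."""
    if interruptible:
        return "spot"
    if usage is UsagePattern.ALWAYS_ON:
        return "reserved_or_savings_plan"
    if usage is UsagePattern.MOSTLY_ON:
        return "savings_plan"
    return "on_demand"


print(choose_purchase_model(True, UsagePattern.UNPREDICTABLE))   # batch job
print(choose_purchase_model(False, UsagePattern.ALWAYS_ON))      # production database
print(choose_purchase_model(False, UsagePattern.UNPREDICTABLE))  # dev environment
```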

2.2 Spot Instances Strategy

AWS reclaims Spot capacity with only a 2-minute warning when it needs the capacity back. Here is how to use it properly:

# EKS: Deployment that prefers Spot nodes (fragment: selector, labels, and containers omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      # Node affinity: prefer Spot nodes. "preferred" is a soft rule, so pods can still
      # land on On-Demand nodes; use requiredDuringSchedulingIgnoredDuringExecution to force Spot.
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: eks.amazonaws.com/capacityType
                operator: In
                values: ["SPOT"]
      # Tolerate the taint on Spot nodes (assumes they are tainted spot=true:NoSchedule)
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      # Graceful shutdown: must finish inside the 2-minute warning from AWS
      terminationGracePeriodSeconds: 90

# Terraform: ASG with a mixed instances policy
resource "aws_autoscaling_group" "workers" {
  # ... (min_size, max_size, vpc_zone_identifier omitted)

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2     # Minimum 2 on-demand
      on_demand_percentage_above_base_capacity = 20    # Above the base: 20% on-demand, 80% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        # Assumes a launch template named "workers" is defined elsewhere
        launch_template_id = aws_launch_template.workers.id
      }

      override {
        instance_type = "m5.xlarge"
      }
      override {
        instance_type = "m5a.xlarge"   # Diversified instance types → fewer reclaims
      }
      override {
        instance_type = "m4.xlarge"
      }
    }
  }
}
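To see what the two `instances_distribution` settings mean for a given fleet size, the split can be approximated in a few lines. This is a sketch under an assumption: the fractional On-Demand share above the base is rounded up here, and the exact rounding AWS applies should be checked against the Auto Scaling documentation.

```python
import math


def split_capacity(desired: int,
                   on_demand_base: int,
                   on_demand_pct_above_base: int) -> tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity.

    Assumption: the On-Demand share above the base is rounded up.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


# With the policy above (base 2, 20% On-Demand above the base):
for desired in (1, 2, 12, 52):
    on_demand, spot = split_capacity(desired, 2, 20)
    print(f"desired={desired}: {on_demand} on-demand, {spot} spot")
```

The base capacity dominates small fleets, so the advertised "80% Spot" only shows up once the group is well above the base.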

2.3 Savings Plans vs Reserved Instances

	Reserved Instances	Savings Plans
Commitment	Specific instance type/region	$/hour spending
Flexibility	Low (Convertible RIs are more flexible)	High (Compute Savings Plans apply across EC2, Fargate, and Lambda)
Discount	Up to 72%	Up to 66% (Compute Savings Plans)
Best for	Stable, predictable workload	Microservices, multi-service
Risk	Locked into an instance family	Lower risk

Practical advice: Buy Compute Savings Plans at 70-80% of your baseline spending. Leave the rest On-demand to handle spikes. Don't over-commit.
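As a rough illustration of the 70-80% rule, the commitment can be derived from observed hourly compute spend. This sketch assumes "baseline" means the lowest hourly spend in the sample window, which is one reasonable reading and not an AWS definition; the sample numbers are hypothetical.

```python
def recommended_commitment(hourly_spend: list[float], coverage: float = 0.75) -> float:
    """Suggest a $/hour Savings Plans commitment.

    Assumption: baseline = the lowest observed hourly spend, so the
    commitment is fully used even in the quietest hour.
    """
    if not hourly_spend:
        raise ValueError("need at least one hourly spend sample")
    baseline = min(hourly_spend)
    return round(baseline * coverage, 2)


# Hypothetical on-demand-equivalent spend per hour over a sample window (USD).
samples = [10.0, 12.5, 18.0, 11.0]
print(f"commit to ${recommended_commitment(samples):.2f}/hour")
```

Committing below the baseline leaves some discount on the table, but it avoids paying for unused commitment if traffic drops or a right-sizing pass shrinks the fleet.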

3. Storage Cost: The Silent Waste Trap

3.1 S3/GCS Lifecycle Policies: The Money Is in the Config

# Terraform: A realistic S3 lifecycle configuration
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    filter {}  # Empty filter: apply the rule to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA"    # ~45% cheaper than Standard, 30-day minimum
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"     # Up to 68% cheaper than Standard-IA, millisecond retrieval
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"   # ~95% cheaper than Standard, retrieval 12-48h
    }

    expiration {
      days = 2555  # 7 years, then delete (compliance requirement)
    }
  }

  # Abort incomplete multipart uploads (often forgotten)
  rule {
    id     = "cleanup-multipart"
    status = "Enabled"

    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}

Quick comparison of S3 storage classes:

Class	$/GB/month	Retrieval time	Min duration	Use case
Standard	$0.023	Immediate	None	Active data
Standard-IA	$0.0125	Immediate	30 days	Monthly access
Glacier Instant	$0.004	Milliseconds	90 days	Quarterly access
Glacier Flexible	$0.0036	3-5 hours	90 days	Annual access
Glacier Deep Archive	$0.00099	12-48 hours	180 days	Compliance backup
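To estimate what a lifecycle policy is worth, the per-GB prices in the table can be applied to data bucketed by age. This is a simplified sketch: it uses the 30/90/365-day thresholds from the lifecycle rule, ignores transition request fees, retrieval fees, and minimum-duration charges, and the data volumes are hypothetical.

```python
# $/GB/month, taken from the comparison table above.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def storage_class_for_age(age_days: int) -> str:
    """Map object age to the class the 30/90/365-day lifecycle rule would choose."""
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"


def monthly_storage_cost(gb_by_age: list[tuple[int, float]]) -> float:
    """gb_by_age: (age_days, gigabytes) buckets. Returns USD per month."""
    return sum(gb * PRICE_PER_GB_MONTH[storage_class_for_age(age)]
               for age, gb in gb_by_age)


# Hypothetical 10 TB data lake, bucketed by object age.
data = [(10, 1000), (60, 2000), (200, 3000), (800, 4000)]
tiered = monthly_storage_cost(data)
all_standard = sum(gb for _, gb in data) * PRICE_PER_GB_MONTH["STANDARD"]
print(f"tiered: ${tiered:.2f}/month vs all Standard: ${all_standard:.2f}/month")
```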

3.2 EBS Volume Audit: A Gold Mine of Waste

# Find EBS volumes not attached to any EC2 instance (what are you paying for?)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime,Tags]' \
  --output table

# Find EBS snapshots older than 90 days (cutoff computed with GNU date)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<'${CUTOFF}'].[SnapshotId,VolumeSize,StartTime]" \
  --output table

In practice: after a year, typically 20-30% of EBS volumes are "zombies": nobody uses them, and nobody dares delete them because nobody knows which service they belong to. Tagging from day one solves this.
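Once the volume list is exported, a short script can turn it into a zombie report with an owner and an estimated monthly cost. The sketch below works on plain dictionaries shaped like the entries of an EC2 `describe-volumes` response, so it runs without AWS credentials. The `Team` tag, the 30-day age threshold, and the $0.08/GB-month price are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

PRICE_PER_GB_MONTH = 0.08  # assumed gp3 list price; check your region and volume type


def find_zombie_volumes(volumes: list[dict], now: datetime,
                        min_age_days: int = 30) -> list[dict]:
    """Return unattached volumes older than min_age_days, with owner and cost."""
    cutoff = now - timedelta(days=min_age_days)
    zombies = []
    for vol in volumes:
        unattached = vol["State"] == "available" and not vol.get("Attachments")
        if unattached and vol["CreateTime"] < cutoff:
            tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
            zombies.append({
                "VolumeId": vol["VolumeId"],
                "SizeGiB": vol["Size"],
                "Owner": tags.get("Team", "UNKNOWN"),
                "MonthlyCostUSD": round(vol["Size"] * PRICE_PER_GB_MONTH, 2),
            })
    return zombies


utc = timezone.utc
sample = [
    {"VolumeId": "vol-1", "State": "available", "Size": 100, "Attachments": [],
     "CreateTime": datetime(2025, 1, 1, tzinfo=utc),
     "Tags": [{"Key": "Team", "Value": "data"}]},
    {"VolumeId": "vol-2", "State": "in-use", "Size": 50,
     "Attachments": [{"InstanceId": "i-1"}],
     "CreateTime": datetime(2025, 1, 1, tzinfo=utc)},
    {"VolumeId": "vol-3", "State": "available", "Size": 20, "Attachments": [],
     "CreateTime": datetime(2026, 4, 10, tzinfo=utc)},   # too recent to flag
    {"VolumeId": "vol-4", "State": "available", "Size": 500, "Attachments": [],
     "CreateTime": datetime(2024, 6, 1, tzinfo=utc)},    # no tags at all
]

report = find_zombie_volumes(sample, now=datetime(2026, 4, 19, tzinfo=utc))
for row in report:
    print(row)
```

Volumes that come back with owner `UNKNOWN` are exactly the ones nobody dares to delete, which is the argument for enforcing tags at creation time.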

4. Database Cost: The Expensive Decisions

4.1 RDS Sizing Pitfalls

Pitfall 1: Over-provisioned from day one

A common story:
  - Estimate: "The app will have 100k users → we need db.r5.4xlarge (16 vCPU, 128GB RAM)"
  - Reality after 6 months: 20k users, CPU average 8%, RAM average 15%
  - Waste: Paying $2,400/month for something that needs $600/month

Pitfall 2: Unnecessary read replicas

A read replica is needed when:

Read load is high (> 60% of queries are SELECT)
You want to offload analytics/reporting queries
You need a cross-Region replica for disaster recovery (within a Region, Multi-AZ is the usual high-availability option)

It is not needed when:

The app is mostly writes
Traffic is still low (< 100 req/s)
It was added only because of "best practice", with no data behind it

Pitfall 3: Aurora Serverless v2 for latency-sensitive OLTP

Aurora Serverless v2 scales by ACU (Aurora Capacity Units). Scale-up is fast (< 1s) but can cause small latency spikes. For payment systems or order processing, prefer provisioned Aurora.

4.2 RDS vs Aurora vs DynamoDB: Cost Decision

Monthly cost estimate (moderate traffic, ap-southeast-1):

RDS MySQL db.t3.medium (2 vCPU, 4GB):
  Instance: ~$50/month
  Storage: 100GB = $11.5/month
  Backup: 7 days = ~$8/month
  Total: ~$70/month

Aurora MySQL Serverless v2 (min 0.5 ACU, max 64 ACU):
  Compute: $0.12/ACU-hour × average 2 ACU × 720h = ~$173/month
  Storage: $0.10/GB × 100GB = $10/month
  I/O: Depends on traffic
  Total: $183-400/month (depending on traffic)

DynamoDB (On-demand mode, 1M reads + 100k writes/day):
  Reads: 30M × $0.00000025 = $7.5/month
  Writes: 3M × $0.00000125 = $3.75/month
  Storage: 10GB × $0.25 = $2.5/month
  Total: ~$14/month
  → Scaling further needs no architecture change; the bill simply grows linearly
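The three estimates above are plain arithmetic, so they are easy to turn into functions and re-run with your own traffic numbers. This sketch reuses the unit prices quoted in the estimate as defaults; they are the article's approximations, not a live price list, and Aurora I/O is left out because it depends on traffic.

```python
HOURS_PER_MONTH = 720


def rds_monthly(instance_usd: float, storage_usd: float, backup_usd: float) -> float:
    return instance_usd + storage_usd + backup_usd


def aurora_serverless_v2_monthly(avg_acu: float, storage_gb: float,
                                 acu_hour_price: float = 0.12,
                                 storage_gb_price: float = 0.10) -> float:
    """Compute plus storage only; I/O charges are excluded."""
    return avg_acu * acu_hour_price * HOURS_PER_MONTH + storage_gb * storage_gb_price


def dynamodb_on_demand_monthly(reads_per_day: int, writes_per_day: int,
                               storage_gb: float, days: int = 30,
                               read_price: float = 0.00000025,
                               write_price: float = 0.00000125,
                               storage_gb_price: float = 0.25) -> float:
    requests = days * (reads_per_day * read_price + writes_per_day * write_price)
    return requests + storage_gb * storage_gb_price


rds = rds_monthly(50, 11.5, 8)
aurora = aurora_serverless_v2_monthly(avg_acu=2, storage_gb=100)
dynamo = dynamodb_on_demand_monthly(1_000_000, 100_000, storage_gb=10)
print(f"RDS: ${rds:.2f}  Aurora Serverless v2: ${aurora:.2f}  DynamoDB: ${dynamo:.2f}")
```

Multiplying the DynamoDB inputs by ten shows the linear growth directly, while the RDS figure stays flat until the instance runs out of headroom.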

5. Egress Cost: The Most Hidden Trap

5.1 Anatomy of Egress Charges

Egress = Data leaving the AWS network

FREE:
  ├── Inbound from Internet → AWS (inbound is always free)
  ├── Same AZ, same service → Free (EC2 → EC2 same AZ, over private IPs)
  └── AWS → CloudFront

CHARGED:
  ├── EC2/RDS → Internet: $0.09/GB (first 10TB)
  ├── Cross-AZ: $0.01/GB in each direction (often overlooked)
  ├── Cross-Region: $0.02/GB
  └── AWS → S3 Transfer Acceleration: an extra $0.04/GB

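5.2 Cross-AZ Cost: The Invisible Architecture Trap

Microservice architecture:
  Service A (AZ-a) → Service B (AZ-b) → Database (AZ-a)

Each request path costs:
  A→B: $0.01/GB
  B→DB: $0.01/GB
  DB response→B: $0.01/GB
  B response→A: $0.01/GB
  = $0.04 per GB moved along the full path
  (counted at $0.01/GB per hop; since both the sending and the receiving side are billed, the real figure can be up to double)

With 1TB/day of internal traffic:
  = $40/day = $1,200/month just for cross-AZ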
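The same estimate as a reusable calculation, so it can be re-run for other traffic volumes or hop counts. This is a sketch that follows the per-hop model above ($0.01/GB per cross-AZ hop); the function name and the 30-day month are assumptions.

```python
CROSS_AZ_PRICE_PER_GB = 0.01  # USD per GB per cross-AZ hop, as in the model above


def cross_az_monthly_cost(gb_per_day: float, cross_az_hops: int,
                          days: int = 30,
                          price_per_gb: float = CROSS_AZ_PRICE_PER_GB) -> float:
    """Monthly cross-AZ transfer cost for traffic that crosses AZs `cross_az_hops` times."""
    return gb_per_day * cross_az_hops * price_per_gb * days


# A (AZ-a) -> B (AZ-b) -> DB (AZ-a) and back: 4 cross-AZ hops per request path.
print(f"4 hops: ${cross_az_monthly_cost(1000, 4):,.0f}/month")
# Same traffic if B is co-located with A and the database: no cross-AZ hops.
print(f"0 hops: ${cross_az_monthly_cost(1000, 0):,.0f}/month")
```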

Fix: AZ-aware routing in the service mesh / load balancer

# Kubernetes: Topology-aware routing (prefer same zone)
apiVersion: v1
kind: Service
metadata:
  name: backend-service
  annotations:
    service.kubernetes.io/topology-mode: "Auto"

5.3 Egress Optimization Strategies

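CloudFront in front of everything: CDN caching reduces egress from the origin. For media/static assets it can cut egress by as much as 80%.
S3 Transfer Acceleration vs Direct: Use Transfer Acceleration only when users are far from the region. It costs more but transfers faster.
VPC Endpoints for S3: EC2→S3 traffic within the same region carries no data transfer charge, but when it flows through a NAT Gateway you pay the NAT's per-GB processing fee. A gateway VPC Endpoint is free and keeps that traffic off the NAT, which adds up quickly for S3-heavy workloads.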

# Terraform: VPC Endpoint for S3 (gateway type = free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.ap-southeast-1.s3"
  
  route_table_ids = [aws_route_table.private.id]
}

Data locality: Place compute close to the data. Keep EC2 and RDS in the same region, and in the same AZ when possible.

6. Kubernetes Cost Optimization

6.1 The Problem: K8s Clusters Often Waste 40-60%

Reasons:
  - Resource requests set far above actual usage
  - Nodes are not packed well (many small nodes, empty space)
  - DaemonSets consume resources on every node
  - Dev/staging environments run 24/7
  - No scale-to-zero

Real-world example:
  10 nodes × m5.2xlarge (8 CPU, 32GB RAM)
  = 80 CPU, 320GB RAM total
  
  Actual pod usage: 25 CPU, 80GB RAM
  = 31% CPU utilization, 25% RAM utilization
  = Roughly 69% of the compute spend is wasted
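The example can be reproduced, and extended to "how many nodes would actually be enough", with a short calculation. This is a sketch: the 70% target utilization is an assumed headroom policy, and real bin packing also has to account for DaemonSets and pod sizes.

```python
import math


def cluster_utilization(nodes: int, cpu_per_node: float, ram_gb_per_node: float,
                        used_cpu: float, used_ram_gb: float) -> tuple[float, float]:
    """Return (cpu_utilization, ram_utilization) as fractions of total capacity."""
    return used_cpu / (nodes * cpu_per_node), used_ram_gb / (nodes * ram_gb_per_node)


def nodes_needed(used_cpu: float, used_ram_gb: float,
                 cpu_per_node: float, ram_gb_per_node: float,
                 target_utilization: float = 0.7) -> int:
    """Nodes required to hold the usage while staying at or below the target utilization."""
    by_cpu = used_cpu / (cpu_per_node * target_utilization)
    by_ram = used_ram_gb / (ram_gb_per_node * target_utilization)
    return math.ceil(max(by_cpu, by_ram))


cpu_util, ram_util = cluster_utilization(10, 8, 32, used_cpu=25, used_ram_gb=80)
print(f"CPU {cpu_util:.0%}, RAM {ram_util:.0%}")
print(f"nodes needed at 70% target: {nodes_needed(25, 80, 8, 32)} (currently 10)")
```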

6.2 Right-sizing with VPA (Vertical Pod Autoscaler)

# VPA recommendation mode: recommends only, does not apply changes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  updatePolicy:
    updateMode: "Off"   # "Off" = recommend only, "Auto" = apply automatically

# View the recommendation
kubectl describe vpa backend-api-vpa

# Output:
# Recommendation:
#   Container Recommendations:
#     Container Name: backend
#       Lower Bound:   CPU: 50m, Memory: 128Mi
#       Target:        CPU: 250m, Memory: 512Mi   ← Use this one
#       Upper Bound:   CPU: 1000m, Memory: 2Gi

6.3 Cluster Autoscaler vs Karpenter

	Cluster Autoscaler	Karpenter
Scale speed	1-2 minutes	10-60 seconds
Node selection	From predefined node groups	Dynamic, just-in-time
Spot support	Needs multiple node groups	Built-in, diversified instance types
Cost savings	Good	Better (tighter bin packing)
Complexity	Simple	More complex

Example Karpenter NodePool (AWS):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: cost-optimized
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
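          values: ["amd64", "arm64"]   # Graviton (ARM) is about 20% cheaper
      nodeClassRef:
        name: default
  disruption:
    # Consolidate underutilized nodes, which also removes empty ones.
    # consolidateAfter is dropped: in v1beta1 it is only accepted with WhenEmpty.
    consolidationPolicy: WhenUnderutilized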

6.4 Namespace Resource Quotas — Prevent Runaway Cost

apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"        # Staging as a whole cannot request more than 20 CPU
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    count/pods: "50"

7. Cost Allocation: Who Spends What?

7.1 Tagging Strategy: Get It Right From the Start

# Terraform: Mandatory tags for every resource
locals {
  mandatory_tags = {
    Environment = var.environment    # prod, staging, dev
    Team        = var.team           # backend, frontend, data
    Service     = var.service_name   # user-service, payment-api
    CostCenter  = var.cost_center    # 1001 (Engineering), 2001 (Data)
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "app" {
  # ...
  tags = merge(local.mandatory_tags, {
    Name = "${var.service_name}-${var.environment}"
  })
}

# SCP: Require the Team tag at creation time (prevents untagged resources)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {
        "Null": {
          "aws:RequestTag/Team": "true"
        }
      }
    }
  ]
}

7.2 Chargeback vs Showback

Model	Description	Best fit
Showback	Cost is reported per team, but everything is paid from one shared budget	Small teams, early stage
Chargeback	Each team actually pays for its own cloud cost	Enterprise, multiple separate P&Ls
Forecasting	Forecast cost to plan budgets	Every stage

8. Tools: Visibility First, Optimization Second

8.1 AWS Native Tools

# AWS Cost Explorer: query cost by service, tag, account
# (End is exclusive, so 2024-02-01 covers all of January)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics "BlendedCost" \
  --group-by Type=TAG,Key=Team

# Find idle or underutilized EC2 instances via rightsizing recommendations
aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration RecommendationTarget=SAME_INSTANCE_FAMILY,BenefitsConsidered=true
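The JSON returned by `get-cost-and-usage` is nested, so a small helper makes it usable as a showback report. The sketch below works on a dictionary shaped like that response (for example, loaded from the CLI output with `json.load`). Tag groups arrive as keys of the form `Team$backend`, with an empty value after the `$` for untagged spend; the sample amounts are made up.

```python
def cost_by_team(response: dict, metric: str = "BlendedCost") -> dict[str, float]:
    """Sum cost per Team tag value across all time periods in the response."""
    totals: dict[str, float] = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Keys look like "Team$backend"; "Team$" means the resource had no Team tag.
            _, _, tag_value = group["Keys"][0].partition("$")
            team = tag_value or "UNTAGGED"
            amount = float(group["Metrics"][metric]["Amount"])
            totals[team] = totals.get(team, 0.0) + amount
    return totals


# Hypothetical response fragment with the same shape as the CLI output.
sample_response = {
    "ResultsByTime": [{
        "TimePeriod": {"Start": "2024-01-01", "End": "2024-02-01"},
        "Groups": [
            {"Keys": ["Team$backend"],
             "Metrics": {"BlendedCost": {"Amount": "1200.50", "Unit": "USD"}}},
            {"Keys": ["Team$data"],
             "Metrics": {"BlendedCost": {"Amount": "800.25", "Unit": "USD"}}},
            {"Keys": ["Team$"],
             "Metrics": {"BlendedCost": {"Amount": "300.00", "Unit": "USD"}}},
        ],
    }]
}

totals = cost_by_team(sample_response)
for team, amount in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{team:10s} ${amount:,.2f}")
```

The size of the `UNTAGGED` bucket is a useful health metric on its own: the larger it is, the less the rest of the report can be trusted.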

8.2 Infracost: Cost in the CI Pipeline

# .github/workflows/infracost.yml
name: Infracost

on: [pull_request]

jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      
      - name: Generate Infracost diff
        run: |
          infracost diff --path=. \
            --format=json \
            --out-file=/tmp/infracost.json
      
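      - name: Post Infracost comment
        # PR comment: "This change will increase costs by $X/month"
        # Needs pull-requests: write permission on the job's GITHUB_TOKEN
        run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --github-token=${{ github.token }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --behavior=update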

Every PR that changes Terraform automatically gets a comment such as "This PR will increase cost by $X/month" or "This PR saves $X/month", which prevents surprise bills.

8.3 Kubecost: K8s Cost Visibility

Kubecost allocates K8s cost by namespace, deployment, and label. The free tier is enough for most teams.

# Install Kubecost
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="<token>"

# Then access the dashboard: kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090

9. The Architecture Decisions That Affect Cost the Most

9.1 Decisions That Are Often Overlooked

1. Sync vs Async Processing

Sync: User request → Lambda → process for 30s → response
  = Lambda runs 30s × every request

Async: User request → SQS → Lambda (batch 100 items) → process
  = One Lambda invocation shared across 100 items
  = Invocation count drops 100x; duration cost falls only as far as per-request overhead is amortized across the batch
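How much batching really saves depends on the split between per-invocation overhead and per-item work, which a quick model makes visible. This is a sketch under stated assumptions: the prices ($0.20 per million requests, $0.0000166667 per GB-second) are typical x86 Lambda list prices, and the overhead and work durations are hypothetical.

```python
REQUEST_PRICE = 0.20 / 1_000_000   # USD per invocation (assumed list price)
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second (assumed list price)


def lambda_monthly_cost(items: int, batch_size: int, overhead_s: float,
                        work_per_item_s: float, memory_gb: float) -> float:
    """Cost of processing `items` per month when each invocation handles `batch_size` items.

    Each invocation pays a fixed overhead (connections, setup) plus the per-item work.
    """
    invocations = items / batch_size
    duration_s = overhead_s + batch_size * work_per_item_s
    request_cost = invocations * REQUEST_PRICE
    compute_cost = invocations * duration_s * memory_gb * GB_SECOND_PRICE
    return request_cost + compute_cost


# Hypothetical workload: 10M items/month, 200 ms overhead, 50 ms work per item, 512 MB.
sync = lambda_monthly_cost(10_000_000, 1, overhead_s=0.2, work_per_item_s=0.05, memory_gb=0.5)
batched = lambda_monthly_cost(10_000_000, 100, overhead_s=0.2, work_per_item_s=0.05, memory_gb=0.5)
print(f"sync: ${sync:.2f}/month, batched: ${batched:.2f}/month, ratio {sync / batched:.1f}x")
```

In this model the invocation count falls 100x but the bill falls by roughly 5x, because the per-item work still has to be paid for. The closer the workload is to pure overhead, the closer the saving gets to the batch size.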

2. Monolith vs Microservices

Microservices have hidden costs:

Each service needs its own: load balancer (an ALB is about $16/month before traffic), ECS task, logging
Cross-service network calls → data transfer charges
More Lambda functions → more cold-start overhead

For small teams (< 20 engineers), a monolith is usually cheaper and faster.

3. SQL vs NoSQL — Cost Implication

Scenario	SQL (RDS)	NoSQL (DynamoDB)
1k users, simple CRUD	$70/month	$3/month
1M users, uniform access	$400/month	$50/month
10M users, hot/cold data	$2k/month	$500/month with tiering
Complex analytics queries	$500/month (read replica)	BAD CHOICE

4. Data Transfer Architecture

Expensive:
  EC2 → Internet → S3 → another EC2
  (egress paid twice)

Cheap:
  EC2 → S3 (via VPC Endpoint, free)
  EC2 → SQS → Lambda → S3 (internal, very little egress)

10. Case Study: From $50k/month Down to $20k/month

Context

B2B SaaS startup, 200k users, backend on AWS. Bill of $52k/month, with only 8 months of runway left.

Two-week audit

Top spending breakdown:
  EC2/EKS nodes:     $28k (54%)  ← Largest
  RDS:               $12k (23%)
  Data Transfer:      $6k (12%)
  S3/CloudFront:      $3k (6%)
  Other:              $3k (6%)

Interventions (ordered by ROI)

Week 1: Quick wins

1. Shut down dev/staging outside working hours
   Cron: 7am ON, 8pm OFF (Mon-Fri)
   Savings: -$3,200/month

2. Delete zombie resources:
   - 47 unattached EBS volumes: -$940/month
   - 12 unused Elastic IPs: -$120/month
   - 3 idle Load Balancers: -$540/month
   
3. Enable S3 Intelligent-Tiering for the data lake:
   Savings: -$800/month

Week 2-4: Right-sizing

4. RDS right-sizing:
   db.r5.4xlarge (16 vCPU) → db.r5.xlarge (4 vCPU)
   CPU average: 12% → 45% (acceptable)
   Savings: -$4,800/month

5. EC2 → 60% Spot + 40% On-demand:
   EKS worker nodes moved to Spot with Karpenter
   Savings: -$9,000/month

6. Purchase 1-year Compute Savings Plans (baseline):
   Savings: -$3,600/month (24% off baseline)

Month 2-3: Architecture changes

7. CDN for static assets (S3 + CloudFront):
   Egress from S3 down 70%
   Savings: -$2,100/month

8. SQS batching for async jobs:
   Lambda invocations down 80%
   Savings: -$1,200/month

9. RDS read replica → ElastiCache for hot data:
   Removed 2 read replicas
   Savings: -$2,400/month
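Adding up the itemized monthly savings listed above is a quick consistency check on the plan. The dictionary keys are shortened labels for the interventions, and the figures are copied from the list.

```python
BILL_BEFORE = 52_000  # USD/month, from the audit

MONTHLY_SAVINGS = {
    "dev/staging off-hours shutdown": 3_200,
    "unattached EBS volumes": 940,
    "unused Elastic IPs": 120,
    "idle load balancers": 540,
    "S3 Intelligent-Tiering": 800,
    "RDS right-sizing": 4_800,
    "Spot for EKS workers": 9_000,
    "Compute Savings Plans": 3_600,
    "CloudFront for static assets": 2_100,
    "SQS batching": 1_200,
    "read replicas -> ElastiCache": 2_400,
}

total_saved = sum(MONTHLY_SAVINGS.values())
bill_after = BILL_BEFORE - total_saved

print(f"saved:  ${total_saved:,}/month ({total_saved / BILL_BEFORE:.0%})")
print(f"after:  ${bill_after:,}/month")

# Largest items first: this is where the audit effort pays off.
for name, amount in sorted(MONTHLY_SAVINGS.items(), key=lambda kv: -kv[1])[:3]:
    print(f"  {name}: ${amount:,}")
```

Spot, RDS right-sizing, and Savings Plans account for well over half of the total, which matches the audit finding that compute and database dominate the bill.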

Results

Before: $52,000/month
After:  $23,300/month (the low $20k range)
Saved:  $28,700/month (~55% reduction, the sum of the interventions above)
Time:   3 months

ROI:
  3 months at the new run rate = $86,100 saved
  Runway extended by 4 months

11. Mental Model: Cost Optimization Pyramid

                    ┌──────────────────┐
                    │  Architectural   │  Highest impact,
                    │  Decisions       │  hardest to change
                    └────────┬─────────┘
                   ┌─────────┴──────────┐
                   │  Purchasing Model  │  RIs, Savings Plans
                   │  (Commit Discounts)│  Moderate effort
                   └─────────┬──────────┘
                  ┌──────────┴───────────┐
                  │   Right-sizing &     │  Easy wins,
                  │   Waste Elimination  │  do first
                  └──────────────────────┘

Rule of thumb:
  First: Kill waste (delete unused, right-size)
  Then: Commit discounts (RI/Savings Plans)
  Finally: Architectural optimization (async, caching, CDN)
  
  Don't buy Reserved Instances for overprovisioned infrastructure:
  you would be committing a year to waste.

A cost review should be a monthly ritual for the Engineering team, not a quarterly one for Finance.