🚀 DevOps ✍️ Khoa 📅 19/04/2026 ☕ 12 min read

DevOps: Production-Grade CI/CD Pipeline for Docker + K8s (Intermediate++)

Good CI/CD is not just "automated deploys". It is a pipeline you trust enough to deploy at 11 p.m. on a Friday without worrying. This article digs into how to build that pipeline: optimizing Docker builds, security scanning, managing secrets in CI, and zero-downtime deployment strategies.


1. Docker build in CI: fast, secure, reproducible

1.1 BuildKit and caching strategy

BuildKit has been the default build backend since Docker Engine 23.0. It provides:

  • Parallel builds: independent stages run concurrently.
  • Cache mounts: persist npm/go/pip caches across builds without baking them into the image.
  • Secret mounts: pass secrets into the build without leaving them in any layer.

# syntax=docker/dockerfile:1.6
FROM golang:1.22-alpine AS builder

WORKDIR /app

# Cache Go modules separately: only re-download when go.mod/go.sum change.
# The official golang image sets GOPATH=/go, so the module cache lives in /go/pkg/mod.
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go mod download

# Copy the source and build; VERSION is supplied with --build-arg
ARG VERSION=dev
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 GOOS=linux \
    go build -ldflags="-w -s -X main.version=${VERSION}" \
    -o /payment-api ./cmd/server

# ---
FROM gcr.io/distroless/static-debian12:nonroot AS runtime

COPY --from=builder /payment-api /payment-api

# Distroless: no shell, no package manager,
# so the attack surface is minimal
EXPOSE 8080
ENTRYPOINT ["/payment-api"]
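
Why the Dockerfile copies go.mod and go.sum before the rest of the source: a layer can be reused only while the inputs of its instruction are unchanged. The sketch below is a simplified model of that idea, not BuildKit's actual algorithm; layer_key and the file contents are hypothetical.

```python
import hashlib


def layer_key(files: dict, copied: list) -> str:
    """Hash the paths and contents that a COPY instruction brings into the build."""
    digest = hashlib.sha256()
    for path in sorted(copied):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(files[path])
    return digest.hexdigest()


# Hypothetical repository contents before and after a code-only change.
before = {"go.mod": b"module payment-api\n", "go.sum": b"", "main.go": b"package main // v1\n"}
after = {**before, "main.go": b"package main // v2\n"}

deps = ["go.mod", "go.sum"]
print(layer_key(before, deps) == layer_key(after, deps))                 # True: dependency layer reused
print(layer_key(before, list(before)) == layer_key(after, list(after)))  # False: source layer rebuilt
```

Because the dependency inputs hash to the same value, the go mod download step stays cached and only the build step reruns.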

GitHub Actions with layer cache:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push image
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: |
      ghcr.io/myorg/payment-api:${{ github.sha }}
      ghcr.io/myorg/payment-api:latest
    cache-from: type=gha             # GitHub Actions cache
    cache-to: type=gha,mode=max
    build-args: |
      VERSION=${{ github.sha }}

GitLab CI with registry cache:

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_BUILDKIT: "1"
    IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker buildx create --use
    - docker buildx build
        --cache-from type=registry,ref=$CI_REGISTRY_IMAGE:buildcache
        --cache-to type=registry,ref=$CI_REGISTRY_IMAGE:buildcache,mode=max
        --tag $IMAGE
        --push .

1.2 Image tagging strategy

| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Git SHA | abc1234f | Immutable, traceable | Not human-readable |
| Semantic version | 1.23.0 | Readable, familiar | Needs a version-bump process |
| Semantic version + Git SHA | 1.23.0-abc1234 | Both advantages | Long tags |
| latest | latest | Convenient for dev | ❌ Do not use in production |

Recommended for production:

# Primary tag: git SHA (immutable, full traceability)
IMAGE_TAG="${GITHUB_SHA:0:8}"  # the first 8 characters are unique enough in practice

# Additional tag when this is a release
if [[ "$GITHUB_REF" =~ ^refs/tags/v ]]; then
  SEMVER="${GITHUB_REF#refs/tags/v}"  # refs/tags/v1.23.0 → 1.23.0
fi
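
The same tagging rule can be expressed as a small function, which makes it easy to unit-test outside the pipeline. This is a sketch under assumptions: image_tags is a hypothetical helper, and unlike the shell snippet it only accepts three-part versions such as v1.23.0.

```python
import re


def image_tags(sha: str, ref: str) -> list:
    """Return the tags to push: always the short SHA, plus the semver for release tags."""
    tags = [sha[:8]]
    match = re.fullmatch(r"refs/tags/v(\d+\.\d+\.\d+)", ref)
    if match:
        tags.append(match.group(1))
    return tags


print(image_tags("abc1234f9e8d7c6b", "refs/heads/main"))    # ['abc1234f']
print(image_tags("abc1234f9e8d7c6b", "refs/tags/v1.23.0"))  # ['abc1234f', '1.23.0']
```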

1.3 Build secrets: keep them out of the image

# Use SSH to clone a private repo during the build
RUN --mount=type=ssh \
    git clone git@github.com:myorg/private-lib.git /tmp/private-lib

# Use a secret file (it never appears in any layer)
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci

# GitHub Actions: pass the secrets into the build
- name: Build with secrets
  uses: docker/build-push-action@v5
  with:
    secrets: |
      npmrc=${{ secrets.NPMRC_FILE }}
    ssh: |
      default=${{ env.SSH_AUTH_SOCK }}

2. Image scanning: security before deploy

2.1 Trivy: one of the most widely used scanners

# GitHub Actions: scan and fail on critical or high CVEs
security-scan:
  runs-on: ubuntu-latest
  needs: build
  permissions:
    contents: read
    security-events: write               # Required to upload SARIF results
  steps:
    - name: Run Trivy vulnerability scanner
      uses: aquasecurity/trivy-action@master   # Pin to a release tag or commit SHA in production
      with:
        image-ref: ghcr.io/myorg/payment-api:${{ github.sha }}
        format: sarif                    # For the GitHub Security tab
        output: trivy-results.sarif
        severity: CRITICAL,HIGH
        exit-code: '1'                   # Fail the pipeline on findings
        ignore-unfixed: true             # Skip CVEs that have no fix yet

    - name: Upload Trivy scan results
      uses: github/codeql-action/upload-sarif@v3
      if: always()                       # Upload even when the scan fails
      with:
        sarif_file: trivy-results.sarif

# Local scan
trivy image ghcr.io/myorg/payment-api:abc1234

# Scan the filesystem (in CI, before building the image)
# --scanners replaces the deprecated --security-checks flag
trivy fs --scanners vuln,secret,misconfig .

# Scan IaC (Kubernetes YAML)
trivy config ./k8s/

# Scan with an ignore file (.trivyignore)
# CVE-2023-XXXXX   # Accepted risk: mitigated by WAF
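
The gate expressed by severity, exit-code and ignore-unfixed can also be applied to a saved JSON report, for example to post a summary on a pull request. A minimal sketch, assuming the report layout written by trivy image --format json (Results, then Vulnerabilities with VulnerabilityID, Severity and FixedVersion); the helper name and the sample CVE IDs are placeholders.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict, ignore_unfixed: bool = True) -> list:
    """Return the IDs of vulnerabilities that should fail the pipeline."""
    found = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") not in BLOCKING:
                continue
            if ignore_unfixed and not vuln.get("FixedVersion"):
                continue  # mirrors ignore-unfixed: true
            found.append(vuln["VulnerabilityID"])
    return found


# Placeholder report with made-up IDs, trimmed to the fields used above.
sample = json.loads("""
{"Results": [{"Target": "payment-api", "Vulnerabilities": [
  {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "1.2.3"},
  {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH"},
  {"VulnerabilityID": "CVE-0000-0003", "Severity": "LOW", "FixedVersion": "2.0.0"}
]}]}
""")

print(blocking_findings(sample))                        # ['CVE-0000-0001']
print(blocking_findings(sample, ignore_unfixed=False))  # ['CVE-0000-0001', 'CVE-0000-0002']
```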

2.2 SBOM — Software Bill of Materials

An SBOM is the complete list of components in an image. It matters for compliance (SSDF, Executive Order 14028).

- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: ghcr.io/myorg/payment-api:${{ github.sha }}
    format: spdx-json                # or cyclonedx-json
    output-file: sbom.spdx.json
    artifact-name: sbom-payment-api

- name: Attest SBOM to image
  uses: actions/attest-sbom@v1
  with:
    subject-name: ghcr.io/myorg/payment-api
    subject-digest: ${{ steps.build.outputs.digest }}
    sbom-path: sbom.spdx.json
    push-to-registry: true
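
Once generated, the SBOM is plain JSON and easy to inspect. The sketch below lists the packages of an SPDX document, assuming the SPDX 2.x JSON layout (a top-level packages array whose entries carry name and versionInfo); list_packages and the sample entries are hypothetical.

```python
import json


def list_packages(sbom: dict) -> list:
    """Return sorted (name, version) pairs from an SPDX JSON document."""
    return sorted(
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in sbom.get("packages", [])
    )


# Hypothetical, heavily trimmed SPDX document.
sample = json.loads("""
{"spdxVersion": "SPDX-2.3", "packages": [
  {"name": "golang.org/x/net", "versionInfo": "v0.20.0"},
  {"name": "github.com/example/lib"}
]}
""")

for name, version in list_packages(sample):
    print(f"{name} {version}")
```

In CI the document would come from json.load on sbom.spdx.json instead of the inline sample.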

2.3 Container signing with Cosign (Sigstore)

- name: Sign container image
  env:
    COSIGN_EXPERIMENTAL: "1"        # Only needed on Cosign 1.x; keyless OIDC signing is the default in 2.x
  run: |
    cosign sign --yes \
      ghcr.io/myorg/payment-api@${{ steps.build.outputs.digest }}

# In K8s: use Policy Controller to verify the signature before admitting a pod

3. Testing in CI: reliable enough to deploy

3.1 Test strategy per layer

# .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  # Layer 1: Unit tests (fast, run first)
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: true
      - name: Run unit tests
        run: go test -race -short ./...
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  # Layer 2: Integration tests (need Docker, slower)
  integration-test:
    runs-on: ubuntu-latest
    needs: unit-test
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: true
      - name: Run integration tests
        env:
          DATABASE_URL: postgres://postgres:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379
        run: go test -tags=integration ./...

  # Layer 3: Build image (after the tests pass)
  build:
    runs-on: ubuntu-latest
    needs: [unit-test, integration-test]
    permissions:
      contents: read
      packages: write
      id-token: write              # For keyless signing
    steps:
      - uses: actions/checkout@v4
      # ... build steps

3.2 Contract tests between services

# Consumer-driven contract testing with Pact
contract-test:
  runs-on: ubuntu-latest
  steps:
    - name: Run Pact consumer tests
      run: go test ./... -run TestPact -v
      env:
        PACT_BROKER_URL: https://pact.myorg.internal
        PACT_BROKER_TOKEN: ${{ secrets.PACT_TOKEN }}

    - name: Publish pacts
      run: |
        pact-broker publish ./pacts \
          --consumer-app-version ${{ github.sha }} \
          --broker-base-url https://pact.myorg.internal \
          --broker-token ${{ secrets.PACT_TOKEN }}

    - name: Can I deploy?
      run: |
        pact-broker can-i-deploy \
          --pacticipant payment-api \
          --version ${{ github.sha }} \
          --to-environment production \
          --broker-base-url https://pact.myorg.internal \
          --broker-token ${{ secrets.PACT_TOKEN }}

4. Secrets in CI: never hardcode

4.1 GitHub Actions OIDC: no long-lived credentials needed

Instead of storing an AWS access key and secret key in GitHub Secrets, use OIDC so that GitHub Actions assumes an IAM role directly:

permissions:
  id-token: write    # Required for OIDC
  contents: read

steps:
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-actions-payment-api
      aws-region: us-east-1
      # No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY needed!

  - name: Push image to ECR
    run: |
      aws ecr get-login-password | docker login --username AWS \
        --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
      docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-api:${{ github.sha }}

IAM trust policy for GitHub Actions:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/payment-api:*"
      }
    }
  }]
}
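
How wide the sub pattern is decides which workflows can assume the role. The sketch below approximates the two conditions with Python's fnmatchcase; it is only an illustration (IAM's StringLike supports * and ?, while fnmatch also gives [ ] a special meaning), and the sample sub claims are assumptions based on the format GitHub documents for branch and pull request runs.

```python
from fnmatch import fnmatchcase

EXPECTED_AUD = "sts.amazonaws.com"


def is_allowed(claims: dict, sub_pattern: str) -> bool:
    """Approximate the StringEquals (aud) and StringLike (sub) conditions."""
    return (
        claims.get("aud") == EXPECTED_AUD
        and fnmatchcase(claims.get("sub", ""), sub_pattern)
    )


main_branch = {"aud": EXPECTED_AUD, "sub": "repo:myorg/payment-api:ref:refs/heads/main"}
pull_request = {"aud": EXPECTED_AUD, "sub": "repo:myorg/payment-api:pull_request"}
other_repo = {"aud": EXPECTED_AUD, "sub": "repo:someone/payment-api:ref:refs/heads/main"}

wide = "repo:myorg/payment-api:*"
narrow = "repo:myorg/payment-api:ref:refs/heads/main"

print(is_allowed(main_branch, wide), is_allowed(pull_request, wide), is_allowed(other_repo, wide))
# True True False
print(is_allowed(main_branch, narrow), is_allowed(pull_request, narrow))
# True False
```

With the wildcard pattern from the policy, pull request runs can assume the role too; pinning the pattern to a branch ref is a stricter option worth considering for production roles.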

4.2 Workload Identity for GKE

Instead of storing a Service Account key JSON in a K8s Secret, use Workload Identity:

# The K8s Service Account is bound to a GCP Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: payments
  annotations:
    iam.gke.io/gcp-service-account: payment-api@myproject.iam.gserviceaccount.com

# Bind IAM policy
gcloud iam service-accounts add-iam-policy-binding \
  payment-api@myproject.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[payments/payment-api]"

At runtime the pod obtains GCP credentials automatically through the metadata server, with no secret required.

4.3 External Secrets Operator: sync from Vault/AWS SM

# SecretStore: connects to AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: payments
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: payment-api    # Uses IRSA (IAM Roles for Service Accounts)

---
# ExternalSecret: pulls the secret and creates a K8s Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-api-db-creds
  namespace: payments
spec:
  refreshInterval: 1h              # Periodic sync, so rotated values are picked up
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: payment-api-secret       # Name of the K8s Secret to create
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL      # Key in the K8s Secret
      remoteRef:
        key: prod/payment-api/db   # Key in AWS SM
        property: connection_string
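
What key, property and secretKey do together can be modelled in a few lines: the remote value is treated as JSON, property selects one field, and the result is stored base64-encoded under secretKey. This is a sketch of the mapping only, not the operator's code; both helpers and the sample payload are hypothetical.

```python
import base64
import json
from typing import Optional


def resolve_remote_ref(remote_value: str, prop: Optional[str] = None) -> str:
    """Return the whole remote value, or one field of it when a property is given."""
    if prop is None:
        return remote_value
    return str(json.loads(remote_value)[prop])


def k8s_secret_data(secret_key: str, value: str) -> dict:
    """Build the data map of a K8s Secret, whose values are base64-encoded."""
    return {secret_key: base64.b64encode(value.encode()).decode()}


# Placeholder payload standing in for the value stored at prod/payment-api/db.
remote = json.dumps({
    "connection_string": "postgres://user:pass@db:5432/payments",
    "engine": "postgres",
})

data = k8s_secret_data("DATABASE_URL", resolve_remote_ref(remote, "connection_string"))
print(base64.b64decode(data["DATABASE_URL"]).decode())
# postgres://user:pass@db:5432/payments
```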

5. Deploy strategy: zero downtime, fast rollback

5.1 Rolling update with zero downtime

To make sure no request is dropped during a deploy, several settings have to be configured together:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0    # Never reduce capacity
      maxSurge: 25%        # Create at most 25% extra pods

  template:
    spec:
      containers:
        - name: payment-api
          # 1. preStop hook: wait for the load balancer to drain connections before the process receives SIGTERM
          lifecycle:
            preStop:
              exec:
                # Requires a shell and sleep in the image; a distroless image has neither,
                # so in that case delay shutdown inside the app's SIGTERM handler instead
                command: ["/bin/sh", "-c", "sleep 5"]

          # 2. readinessProbe strict enough that traffic is routed only when the pod is really ready
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 2    # Needs 2 consecutive passes
            failureThreshold: 3

      # 3. terminationGracePeriodSeconds: time budget for the preStop hook plus finishing in-flight requests
      terminationGracePeriodSeconds: 60

Why is the preStop sleep needed? Pod termination and endpoint removal happen in parallel: once a pod starts terminating, kube-proxy still needs a few seconds to remove it from its iptables rules. If the process exits immediately, requests keep arriving at that pod during this window and fail with connection refused. preStop: sleep 5 delays SIGTERM and gives kube-proxy time to update.
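
Two pieces of arithmetic sit behind the manifest above: how many pods may exist during the rollout, and how much of the grace period is left after the preStop sleep. The sketch below works through both; the replica count and the 30-second and 70-second drain times are hypothetical inputs, and the rounding follows the documented Kubernetes behaviour (maxSurge percentages round up, maxUnavailable percentages round down).

```python
def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = -(-replicas * max_surge_pct // 100)            # ceiling division
    unavailable = replicas * max_unavailable_pct // 100    # floor division
    return replicas + surge, replicas - unavailable


def shutdown_budget(grace_period_s: int, pre_stop_s: int, drain_s: int) -> int:
    """Seconds left before SIGKILL; the grace period also covers the preStop hook."""
    return grace_period_s - pre_stop_s - drain_s


print(rollout_bounds(10, 25, 0))   # (13, 10)
print(shutdown_budget(60, 5, 30))  # 25
print(shutdown_budget(60, 5, 70))  # -15
```

A negative budget means the kubelet would kill the container while requests are still in flight, so terminationGracePeriodSeconds has to grow with the longest request the service handles.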

5.2 Blue/Green with Argo Rollouts

Argo Rollouts (installed into the cluster as an add-on) provides advanced progressive delivery:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: payment-api-active     # Service receiving production traffic
      previewService: payment-api-preview   # Service receiving preview traffic
      autoPromotionEnabled: false           # Requires manual approval
      scaleDownDelaySeconds: 30             # Wait 30s after promotion, then scale down blue

  selector:
    matchLabels:
      app: payment-api
  template:
    # ... pod spec

# After deploying, watch the status
kubectl argo rollouts get rollout payment-api --watch

# Promote green to production (after QA signs off)
kubectl argo rollouts promote payment-api

# Abort if something goes wrong (rolls back to blue automatically)
kubectl argo rollouts abort payment-api

5.3 Canary with automated analysis

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  strategy:
    canary:
      canaryService: payment-api-canary
      stableService: payment-api-stable
      trafficRouting:
        nginx:
          stableIngress: payment-api-ingress
      steps:
        - setWeight: 5            # 5% of traffic goes to the canary
        - pause: {duration: 10m}  # Wait 10 minutes
        - analysis:               # Analyze metrics
            templates:
              - templateName: success-rate
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # Full promotion

  analysis:                       # Run history limits are set at the Rollout spec level
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3

---
# AnalysisTemplate: defines the pass/fail criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 5m
      count: 3
      successCondition: result[0] >= 0.99   # At least 99% of requests succeed
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="payment-api",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payment-api"}[5m]))
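
How count, successCondition and failureLimit interact is easier to see with numbers. The sketch below is a simplified model of the template above, assuming that a run fails once the number of failed measurements exceeds failureLimit; it ignores inconclusive and errored measurements, and the request counts are made up.

```python
def success_rate(ok_requests: int, total_requests: int) -> float:
    """Ratio computed by the Prometheus query; assumes total_requests > 0."""
    return ok_requests / total_requests


def evaluate(measurements: list, threshold: float = 0.99, failure_limit: int = 1) -> str:
    """Return 'Failed' once more than failure_limit measurements miss the threshold."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
        if failures > failure_limit:
            return "Failed"
    return "Successful"


# Three measurements, matching count: 3 in the template.
print(evaluate([success_rate(999, 1000), success_rate(985, 1000), success_rate(995, 1000)]))
# Successful: one miss is tolerated by failureLimit: 1
print(evaluate([success_rate(999, 1000), success_rate(985, 1000), success_rate(970, 1000)]))
# Failed: the second miss exceeds the limit
```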

6. Full pipeline example — GitHub Actions

# .github/workflows/deploy.yml
name: Build, Scan, Test, Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ─── Tests ────────────────────────────────────────────────────
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go test -race -count=1 -short ./...

  integration-test:
    runs-on: ubuntu-latest
    needs: unit-test
    services:
      postgres:
        image: postgres:16-alpine
        env: { POSTGRES_PASSWORD: test, POSTGRES_DB: testdb }
        options: --health-cmd pg_isready --health-interval 5s --health-retries 5
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go test -tags=integration ./...
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb

  # ─── Build & Security ─────────────────────────────────────────
  build:
    runs-on: ubuntu-latest
    needs: [unit-test, integration-test]
    permissions:
      contents: read
      packages: write
      id-token: write
      security-events: write
      attestations: write       # Required by actions/attest-sbom
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,suffix=,format=long
            type=semver,pattern={{version}}

      - uses: docker/setup-buildx-action@v3

      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
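          push: ${{ github.event_name != 'pull_request' }}
          load: ${{ github.event_name == 'pull_request' }}   # On PRs keep the image local so the scan step can find it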
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: VERSION=${{ github.sha }}

      - name: Scan image
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: sarif
          output: trivy.sarif
          severity: CRITICAL,HIGH
          exit-code: '1'
          ignore-unfixed: true

      - name: Upload scan results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: trivy.sarif

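      - name: Generate SBOM
        if: github.event_name != 'pull_request'
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
          format: spdx-json
          output-file: sbom.spdx.json

      - name: Attest SBOM
        if: github.event_name != 'pull_request'
        uses: actions/attest-sbom@v1
        with:
          subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          subject-digest: ${{ steps.build.outputs.digest }}
          sbom-path: sbom.spdx.json
          push-to-registry: true

      - name: Install Cosign
        if: github.event_name != 'pull_request'
        uses: sigstore/cosign-installer@v3

      - name: Sign image
        if: github.event_name != 'pull_request'
        run: |
          cosign sign --yes \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}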

  # ─── Deploy to Staging ────────────────────────────────────────
  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    if: github.event_name != 'pull_request'
    environment:
      name: staging
      url: https://staging.payment.example.com
    steps:
      - name: Update staging image tag in GitOps repo
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.GITOPS_TOKEN }}
          repository: myorg/k8s-config
          event-type: update-image
          client-payload: |
            {
              "service": "payment-api",
              "environment": "staging",
              "image_tag": "${{ github.sha }}",
              "digest": "${{ needs.build.outputs.image-digest }}"
            }

  # ─── Deploy to Production (manual approval) ───────────────────
  deploy-production:
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]   # build is listed so its outputs can be read here
    environment:
      name: production          # GitHub Environments: require manual review
      url: https://api.payment.example.com
    steps:
      - name: Update production image tag
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.GITOPS_TOKEN }}
          repository: myorg/k8s-config
          event-type: update-image
          client-payload: |
            {
              "service": "payment-api",
              "environment": "production",
              "image_tag": "${{ github.sha }}",
              "digest": "${{ needs.build.outputs.image-digest }}"
            }

7. Observability in the pipeline

7.1 Deployment tracking

# After the deploy finishes, notify the monitoring system
- name: Create Datadog deployment event
  run: |
    curl -X POST https://api.datadoghq.com/api/v1/events \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
      -d '{
        "title": "payment-api deployed",
        "text": "Version ${{ github.sha }} deployed to production",
        "tags": ["service:payment-api", "env:production"],
        "alert_type": "info"
      }'

7.2 DORA metrics: measuring pipeline effectiveness

The four key metrics (Google DORA Research):

| Metric | Description | Elite performer |
|---|---|---|
| Deployment Frequency | How often you deploy | Multiple per day |
| Lead Time for Changes | From commit to production | < 1 hour |
| Change Failure Rate | % of deploys that cause an incident | < 5% |
| Time to Restore | Time to recover from an incident | < 1 hour |

# Track in CI and write to the monitoring dashboard
- name: Track DORA metrics
  run: |
    LEAD_TIME=$(($(date +%s) - $(git log -1 --format=%ct)))
    echo "Lead time: ${LEAD_TIME}s"
    
    # Push to Prometheus Pushgateway
    echo "cicd_lead_time_seconds{service=\"payment-api\",env=\"production\"} ${LEAD_TIME}" \
      | curl --data-binary @- http://pushgateway:9091/metrics/job/cicd
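
Three of the four metrics can be derived from plain deployment records. The sketch below shows the arithmetic on made-up data; the Deployment record and dora_summary are hypothetical, and a real setup would read these events from the CI system or the monitoring backend.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    caused_incident: bool = False


def dora_summary(deploys: list, days: int) -> dict:
    """Compute deployment frequency, median lead time and change failure rate."""
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time_s": median(lead_times),
        "change_failure_rate": sum(d.caused_incident for d in deploys) / len(deploys),
    }


# Made-up sample: four deployments over two days, one of which caused an incident.
history = [
    Deployment(datetime(2026, 4, 18, 9, 0), datetime(2026, 4, 18, 9, 20)),
    Deployment(datetime(2026, 4, 18, 13, 0), datetime(2026, 4, 18, 13, 30)),
    Deployment(datetime(2026, 4, 19, 10, 0), datetime(2026, 4, 19, 10, 40), caused_incident=True),
    Deployment(datetime(2026, 4, 19, 15, 0), datetime(2026, 4, 19, 15, 50)),
]

print(dora_summary(history, days=2))
# {'deploys_per_day': 2.0, 'median_lead_time_s': 2100.0, 'change_failure_rate': 0.25}
```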

Summary: Production-Grade CI/CD Checklist

Build:

  • Multi-stage Dockerfile with an optimized layer cache
  • BuildKit with --mount=type=cache for dependencies
  • Distroless or minimal base image
  • Image tag = git SHA (do not use latest)

Security:

  • Trivy scan: fail on CRITICAL/HIGH CVEs
  • SBOM generated and attached to the image
  • Image signing with Cosign
  • Secrets via OIDC/Workload Identity, not long-lived credentials
  • .trivyignore reviewed periodically

Testing:

  • Unit tests with the race detector
  • Integration tests against real dependencies (service containers in CI)
  • Contract tests when there are multiple services
  • Test reports and coverage tracking

Deploy:

  • preStop hook + a long enough terminationGracePeriodSeconds
  • readinessProbe fails fast; livenessProbe is more lenient
  • maxUnavailable: 0 for zero downtime
  • PDB to prevent mass eviction
  • Manual approval gate before production
  • Automatic rollback when the error rate rises
  • Deployment event sent to the monitoring system