DevOps: Production-Grade CI/CD Pipeline for Docker + K8s (Intermediate++)
Good CI/CD is not just "automated deploys". It is a pipeline you trust enough to deploy at 11 p.m. on a Friday without worrying. This article digs into how to build that pipeline: optimizing Docker builds, security scanning, managing secrets in CI, and zero-downtime deployment strategies.
1. Docker build in CI: fast, secure, reproducible
1.1 BuildKit and caching strategy
BuildKit has been the default build backend since Docker 23 and provides:
- Parallel builds: independent stages run concurrently.
- Cache mounts: npm/go/pip caches persist between builds without being baked into the image.
- Secret mounts: secrets are passed into the build without being left in any layer.
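# syntax=docker/dockerfile:1.6
FROM golang:1.22-alpine AS builder
WORKDIR /app
# Cache Go modules separately: re-downloaded only when go.mod/go.sum change
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    go mod download
# Copy the source and build
COPY . .
# VERSION arrives via --build-arg; it must be declared to be visible in RUN
ARG VERSION=dev
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 GOOS=linux \
    go build -ldflags="-w -s -X main.version=${VERSION}" \
    -o /payment-api ./cmd/server
# ---
FROM gcr.io/distroless/static-debian12:nonroot AS runtime
# Copy only the compiled binary out of the builder stage
COPY --from=builder /payment-api /payment-api
# Distroless: no shell and no package manager,
# so the attack surface is minimal
EXPOSE 8080
ENTRYPOINT ["/payment-api"]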
GitHub Actions with layer cache:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
- name: Build and push image
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: |
      ghcr.io/myorg/payment-api:${{ github.sha }}
      ghcr.io/myorg/payment-api:latest
    cache-from: type=gha # GitHub Actions cache
    cache-to: type=gha,mode=max
    build-args: |
      VERSION=${{ github.sha }}
GitLab CI with registry cache:
build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_BUILDKIT: "1"
    IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  script:
    # --password-stdin keeps the password out of the process list
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker buildx create --use
    - docker buildx build
      --cache-from type=registry,ref=$CI_REGISTRY_IMAGE:buildcache
      --cache-to type=registry,ref=$CI_REGISTRY_IMAGE:buildcache,mode=max
      --tag $IMAGE
      --push .
1.2 Image tagging strategy
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Git SHA | abc1234f | Immutable, traceable | Not human-readable |
| Semantic version | 1.23.0 | Readable, familiar | Needs a version-bump process |
| Semantic version + Git SHA | 1.23.0-abc1234 | Both advantages | Long tag |
| latest | latest | Convenient for dev | ❌ Do not use in production |
Recommended for production:
# Primary tag: git SHA (immutable, full traceability)
IMAGE_TAG="${GITHUB_SHA:0:8}" # first 8 characters; unique enough for most repositories
# Additional tag when the build is a release
if [[ "$GITHUB_REF" =~ ^refs/tags/v ]]; then
  SEMVER="${GITHUB_REF#refs/tags/v}" # v1.23.0 → 1.23.0
fi
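The tagging rule above can be wrapped in a small function so the same logic runs locally and in CI. This is a minimal sketch: the function name image_tags is hypothetical, and it uses POSIX constructs (printf and case) in place of the bash-only substring and regex syntax.

```shell
# image_tags SHA REF: print one image tag per line.
# (Hypothetical helper; SHA and REF correspond to GITHUB_SHA and GITHUB_REF.)
image_tags() {
  sha=$1
  ref=$2
  # Primary tag: first 8 characters of the commit SHA.
  printf '%.8s\n' "$sha"
  # Extra tag only for release refs: refs/tags/v1.23.0 -> 1.23.0
  case "$ref" in
    refs/tags/v*) printf '%s\n' "${ref#refs/tags/v}" ;;
  esac
}

# A release build gets two tags: 01234567 and 1.23.0
image_tags 0123456789abcdef0123456789abcdef01234567 refs/tags/v1.23.0
```

A branch build passes a ref such as refs/heads/main and gets only the SHA tag.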
1.3 Build secrets: keep them out of the image
# Use SSH to clone a private repo during the build
RUN --mount=type=ssh \
    git clone git@github.com:myorg/private-lib.git /tmp/private-lib
# Use a secret file (it never appears in any layer).
# The id matches the "npmrc" secret passed by CI; the target path is an assumption.
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci
# GitHub Actions: pass the secrets into the build
- name: Build with secrets
  uses: docker/build-push-action@v5
  with:
    secrets: |
      npmrc=${{ secrets.NPMRC_FILE }}
    ssh: |
      default=${{ env.SSH_AUTH_SOCK }}
2. Image scanning: security before deploy
2.1 Trivy: one of the most widely used scanners
# GitHub Actions: scan and fail on CRITICAL or HIGH CVEs
security-scan:
  runs-on: ubuntu-latest
  needs: build
  steps:
    - name: Run Trivy vulnerability scanner
      uses: aquasecurity/trivy-action@master # pin to a release tag or commit SHA in real pipelines
      with:
        image-ref: ghcr.io/myorg/payment-api:${{ github.sha }}
        format: sarif # GitHub Security tab
        output: trivy-results.sarif
        severity: CRITICAL,HIGH
        exit-code: '1' # Fail the pipeline when findings exist
        ignore-unfixed: true # Skip CVEs that have no fix yet
    - name: Upload Trivy scan results
      uses: github/codeql-action/upload-sarif@v3
      if: always() # Upload even when the scan fails
      with:
        sarif_file: trivy-results.sarif
# Local scan
trivy image ghcr.io/myorg/payment-api:abc1234
# Scan the filesystem (in CI, before building the image)
# --scanners replaces the deprecated --security-checks flag
trivy fs --scanners vuln,secret,misconfig .
# Scan IaC (Kubernetes YAML)
trivy config ./k8s/
# Scan with an ignore file (.trivyignore); example entry:
# CVE-2023-XXXXX # Accepted risk: mitigated by WAF
2.2 SBOM (Software Bill of Materials)
An SBOM is the complete list of components inside an image. It matters for compliance (SSDF, Executive Order 14028).
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: ghcr.io/myorg/payment-api:${{ github.sha }}
    format: spdx-json # or cyclonedx-json
    output-file: sbom.spdx.json
    artifact-name: sbom-payment-api
- name: Attest SBOM to image
  uses: actions/attest-sbom@v1
  with:
    subject-name: ghcr.io/myorg/payment-api
    subject-digest: ${{ steps.build.outputs.digest }}
    sbom-path: sbom.spdx.json
    push-to-registry: true
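2.3 Container signing with Cosign (Sigstore)
- name: Install Cosign
  uses: sigstore/cosign-installer@v3
- name: Sign container image
  # Cosign 2.x signs keylessly via OIDC by default (the job needs id-token: write);
  # COSIGN_EXPERIMENTAL=1 was only required on 1.x.
  run: |
    cosign sign --yes \
      ghcr.io/myorg/payment-api@${{ steps.build.outputs.digest }}
# In K8s: use the Sigstore Policy Controller to verify signatures before a pod is admitted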
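A signature is only useful if something checks it. The sketch below wraps cosign verify for a keyless signature made from GitHub Actions. The function name verify_image and the identity pattern (any workflow in the myorg/payment-api repository) are assumptions for illustration; tighten the pattern to the exact workflow path in a real setup.

```shell
# verify_image IMAGE_REF: succeed only if the image carries a keyless
# signature issued to a GitHub Actions workflow in the expected repository.
# (Hypothetical helper; the repository in the identity pattern is an assumption.)
verify_image() {
  cosign verify \
    --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
    --certificate-identity-regexp "^https://github.com/myorg/payment-api/" \
    "$1"
}

# Usage (requires cosign and registry access):
#   verify_image ghcr.io/myorg/payment-api@sha256:<digest>
```

Verifying by digest rather than by tag ensures that the artifact checked is the artifact deployed.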
3. Testing in CI: reliable enough to deploy
3.1 Test strategy per layer
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  # Layer 1: Unit tests (fast, run first)
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: true
      - name: Run unit tests
        # -coverprofile writes the report that the next step uploads
        run: go test -race -short -coverprofile=coverage.out ./...
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: coverage.out
  # Layer 2: Integration tests (need service containers, slower)
  integration-test:
    runs-on: ubuntu-latest
    needs: unit-test
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: true
      - name: Run integration tests
        env:
          DATABASE_URL: postgres://postgres:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379
        run: go test -tags=integration ./...
  # Layer 3: Build the image (after the tests pass)
  build:
    runs-on: ubuntu-latest
    needs: [unit-test, integration-test]
    permissions:
      contents: read
      packages: write
      id-token: write # For keyless signing
    steps:
      - uses: actions/checkout@v4
      # ... build steps
3.2 Contract tests between services
# Consumer-driven contract testing with Pact
contract-test:
  runs-on: ubuntu-latest
  steps:
    - name: Run Pact consumer tests
      run: go test ./... -run TestPact -v
      env:
        PACT_BROKER_URL: https://pact.myorg.internal
        PACT_BROKER_TOKEN: ${{ secrets.PACT_TOKEN }}
    - name: Publish pacts
      run: |
        pact-broker publish ./pacts \
          --consumer-app-version ${{ github.sha }} \
          --broker-base-url https://pact.myorg.internal \
          --broker-token ${{ secrets.PACT_TOKEN }}
    - name: Can I deploy?
      run: |
        pact-broker can-i-deploy \
          --pacticipant payment-api \
          --version ${{ github.sha }} \
          --to-environment production \
          --broker-base-url https://pact.myorg.internal \
          --broker-token ${{ secrets.PACT_TOKEN }}
4. Secrets in CI: never hardcode them
4.1 GitHub Actions OIDC: no long-lived credentials
Instead of storing an AWS access key and secret key in GitHub Secrets, use OIDC so that GitHub Actions assumes an IAM role directly:
permissions:
  id-token: write # Required for OIDC
  contents: read
steps:
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-actions-payment-api
      aws-region: us-east-1
      # No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY needed!
  - name: Push image to ECR
    run: |
      aws ecr get-login-password | docker login --username AWS \
        --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
      docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-api:${{ github.sha }}
IAM trust policy for GitHub Actions. The StringLike condition on sub controls which workflows may assume the role; repo:myorg/payment-api:* admits every branch, tag, and pull request in that repository:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/payment-api:*"
      }
    }
  }]
}
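To see what the wildcard in the sub condition admits, the sketch below approximates IAM's StringLike matching with shell glob matching (both treat * as "any run of characters", including "/" and ":"). The helper names are hypothetical, and shell globs also support bracket expressions that IAM does not, so this is an illustration rather than a policy simulator. The sub values follow GitHub's documented formats for branch refs and pull requests.

```shell
# sub_matches PATTERN SUB: succeed when SUB matches the wildcard PATTERN.
# The unquoted $1 in the case pattern is glob-matched by the shell.
sub_matches() {
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# check PATTERN SUB: print the decision for one sub claim.
check() {
  if sub_matches "$1" "$2"; then echo "ALLOW $2"; else echo "DENY  $2"; fi
}

broad='repo:myorg/payment-api:*'
strict='repo:myorg/payment-api:ref:refs/heads/main'

check "$broad"  'repo:myorg/payment-api:ref:refs/heads/main'   # ALLOW
check "$broad"  'repo:myorg/payment-api:pull_request'          # ALLOW
check "$broad"  'repo:myorg/other-repo:ref:refs/heads/main'    # DENY
check "$strict" 'repo:myorg/payment-api:pull_request'          # DENY
```

The second line is the one to notice: with the broad pattern, a workflow triggered by a pull request can assume the same role as a build of main. Pinning the condition to a branch ref or to a GitHub environment narrows that.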
4.2 Workload Identity for GKE
Instead of keeping a service account key JSON in a K8s Secret, use Workload Identity:
# The K8s ServiceAccount is bound to a GCP service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: payments
  annotations:
    iam.gke.io/gcp-service-account: payment-api@myproject.iam.gserviceaccount.com
# Bind the IAM policy
gcloud iam service-accounts add-iam-policy-binding \
  payment-api@myproject.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[payments/payment-api]"
At runtime the pod obtains GCP credentials automatically through the metadata server, so no secret has to be stored.
4.3 External Secrets Operator: sync from Vault/AWS SM
# SecretStore: connection to AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: payments
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: payment-api # IAM role bound to this ServiceAccount (IRSA, the AWS counterpart of Workload Identity)
---
# ExternalSecret: pulls the secret and creates a K8s Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-api-db-creds
  namespace: payments
spec:
  refreshInterval: 1h # Re-sync hourly, so values rotated upstream are picked up automatically
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: payment-api-secret # Name of the K8s Secret that gets created
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL # Key in the K8s Secret
      remoteRef:
        key: prod/payment-api/db # Key in AWS SM
        property: connection_string
5. Deploy strategy: zero downtime, fast rollback
5.1 Rolling update with zero downtime
To make sure no request is dropped during a deploy, several settings have to be configured together:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never reduce capacity
      maxSurge: 25% # Create at most 25% extra new pods
  template:
    spec:
      # 2. terminationGracePeriodSeconds: total budget for the preStop hook plus finishing in-flight requests
      terminationGracePeriodSeconds: 60
      containers:
        - name: payment-api
          # 1. preStop hook: wait for load balancers to drain connections before the process receives SIGTERM
          lifecycle:
            preStop:
              # The distroless image has no /bin/sh, so an exec-based "sleep 5" cannot run in it.
              # The built-in sleep action is available by default from Kubernetes 1.30;
              # on older clusters use an image with a shell and exec ["/bin/sh", "-c", "sleep 5"].
              sleep:
                seconds: 5
          # 3. readinessProbe strict enough that traffic is routed only when the pod is really ready
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 2 # Needs 2 consecutive passes
            failureThreshold: 3
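The percentages in the rollingUpdate settings above resolve to whole pods: Kubernetes rounds maxSurge up and maxUnavailable down. A small worked example, assuming a Deployment with 5 replicas (the replica count and the helper names are assumptions, since the manifest does not show them):

```shell
# ceil_percent TOTAL PCT  -> ceil(TOTAL * PCT / 100), the rounding used for maxSurge
# floor_percent TOTAL PCT -> floor(TOTAL * PCT / 100), the rounding used for maxUnavailable
ceil_percent()  { echo $(( ($1 * $2 + 99) / 100 )); }
floor_percent() { echo $(( $1 * $2 / 100 )); }

replicas=5                                   # assumed replica count
surge=$(ceil_percent "$replicas" 25)         # 25% of 5 = 1.25, rounded up to 2
unavailable=$(floor_percent "$replicas" 0)   # maxUnavailable: 0

echo "extra pods allowed during rollout: $surge"
echo "peak pod count: $((replicas + surge))"
echo "minimum available pods: $((replicas - unavailable))"
```

With these numbers the rollout may briefly run 7 pods and never drops below 5 available, so the cluster needs headroom for two extra pods.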
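Why is the preStop sleep needed? When a pod starts terminating, its removal from the Service endpoints propagates asynchronously: kube-proxy (and any ingress controller or load balancer) needs a few seconds to drop the pod from its iptables rules. If the process exits immediately, requests can still reach the pod during that window and fail with connection refused. A 5-second preStop sleep delays SIGTERM long enough for the routing update to land. The sleep counts against terminationGracePeriodSeconds.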
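The preStop delay only helps if the process itself reacts to SIGTERM by finishing its work instead of dying mid-request. The sketch below shows that contract with a shell stand-in for the application: the serve function and its "work loop" are hypothetical placeholders, and a real Go service would do the equivalent with its HTTP server's graceful shutdown.

```shell
# Hypothetical stand-in for the application process.
serve() {
  stop=0
  trap 'stop=1' TERM       # on SIGTERM: stop taking new work, but do not exit yet
  : > "$1"                 # tell the caller that the handler is installed
  while [ "$stop" -eq 0 ]; do
    sleep 0.1              # pretend to serve requests
  done
  echo "drained in-flight work, exiting cleanly"
}

tmp=$(mktemp -d)
serve "$tmp/ready" > "$tmp/out" &
pid=$!
# Wait until the handler is in place, then send the signal.
while [ ! -e "$tmp/ready" ]; do sleep 0.05; done
kill -TERM "$pid"          # what the kubelet sends once the preStop hook has finished
wait "$pid"
status=$?
result=$(cat "$tmp/out")
rm -rf "$tmp"
echo "$result (exit status $status)"
```

A process without such a handler is killed by the default SIGTERM action as soon as the preStop hook ends, and whatever it was serving at that moment is lost.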
5.2 Blue/Green with Argo Rollouts
Argo Rollouts (installed into the cluster as an add-on) provides advanced progressive delivery:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: payment-api-active # Service receiving production traffic
      previewService: payment-api-preview # Service receiving preview traffic
      autoPromotionEnabled: false # Requires manual approval
      scaleDownDelaySeconds: 30 # Wait 30s after promotion before scaling down blue
  selector:
    matchLabels:
      app: payment-api
  template:
    # ... pod spec
# After deploying, watch the status
kubectl argo rollouts get rollout payment-api --watch
# Promote green to production (after QA signs off)
kubectl argo rollouts promote payment-api
# Abort if something is wrong (traffic goes back to blue)
kubectl argo rollouts abort payment-api
5.3 Canary with automated analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  strategy:
    canary:
      canaryService: payment-api-canary
      stableService: payment-api-stable
      trafficRouting:
        nginx:
          stableIngress: payment-api-ingress
      steps:
        - setWeight: 5 # 5% of traffic goes to the canary
        - pause: {duration: 10m} # Wait 10 minutes
        - analysis: # Analyze metrics
            templates:
              - templateName: success-rate
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100 # Full promotion
  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3
---
# AnalysisTemplate: defines the pass/fail criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 5m
      count: 3
      successCondition: result[0] >= 0.99 # At least 99% of requests succeed
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="payment-api",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payment-api"}[5m]))
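The pass/fail rule in the AnalysisTemplate can be traced by hand: three measurements are taken, each passes when the success ratio is at least 0.99, and the analysis fails once the number of failed measurements exceeds failureLimit. The sketch below reproduces only that arithmetic. The function names and request counts are made up for illustration, and it ignores the inconclusive and error states that Argo Rollouts also tracks.

```shell
# measurement_passes GOOD TOTAL MIN: succeed when GOOD/TOTAL >= MIN.
measurement_passes() {
  awk -v good="$1" -v total="$2" -v min="$3" \
    'BEGIN { exit !(total > 0 && good / total >= min) }'
}

# analysis_result FAILURE_LIMIT "GOOD TOTAL"...: print Successful or Failed.
analysis_result() {
  limit=$1
  shift
  failures=0
  for pair in "$@"; do
    # Intentionally unquoted: "GOOD TOTAL" splits into two arguments.
    measurement_passes $pair 0.99 || failures=$((failures + 1))
  done
  if [ "$failures" -gt "$limit" ]; then echo "Failed"; else echo "Successful"; fi
}

# Ratios 0.995, 0.989, 0.999: one failed measurement, within failureLimit 1.
analysis_result 1 "9950 10000" "9890 10000" "9990 10000"
# Ratios 0.995, 0.989, 0.980: two failed measurements, limit exceeded.
analysis_result 1 "9950 10000" "9890 10000" "9800 10000"
```

The first call prints Successful and the second prints Failed, which in a Rollout would abort the canary and shift traffic back to the stable version.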
6. Full pipeline example: GitHub Actions
# .github/workflows/deploy.yml
name: Build, Scan, Test, Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  # ─── Tests ────────────────────────────────────────────────────
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go test -race -count=1 -short ./...
  integration-test:
    runs-on: ubuntu-latest
    needs: unit-test
    services:
      postgres:
        image: postgres:16-alpine
        env: { POSTGRES_PASSWORD: test, POSTGRES_DB: testdb }
        options: --health-cmd pg_isready --health-interval 5s --health-retries 5
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22', cache: true }
      - run: go test -tags=integration ./...
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb
  # ─── Build & Security ─────────────────────────────────────────
  build:
    runs-on: ubuntu-latest
    needs: [unit-test, integration-test]
    permissions:
      contents: read
      packages: write
      id-token: write # keyless signing and attestation
      attestations: write # required by actions/attest-sbom
      security-events: write # SARIF upload
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long makes the tag equal to the full github.sha used by the scan and deploy steps
          tags: |
            type=sha,prefix=,suffix=,format=long
            type=semver,pattern={{version}}
      - uses: docker/setup-buildx-action@v3
      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: VERSION=${{ github.sha }}
      # The steps below need the pushed image, so they are skipped on pull requests
      - name: Scan image
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master # pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: sarif
          output: trivy.sarif
          severity: CRITICAL,HIGH
          exit-code: '1'
          ignore-unfixed: true
      - name: Upload scan results
        uses: github/codeql-action/upload-sarif@v3
        if: always() && github.event_name != 'pull_request'
        with:
          sarif_file: trivy.sarif
      - name: Generate SBOM
        if: github.event_name != 'pull_request'
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
          format: spdx-json
          output-file: sbom.spdx.json
      - name: Attest SBOM
        if: github.event_name != 'pull_request'
        uses: actions/attest-sbom@v1
        with:
          subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          subject-digest: ${{ steps.build.outputs.digest }}
          sbom-path: sbom.spdx.json
          push-to-registry: true
      - name: Install Cosign
        if: github.event_name != 'pull_request'
        uses: sigstore/cosign-installer@v3
      - name: Sign image
        if: github.event_name != 'pull_request'
        # Cosign 2.x signs keylessly by default; COSIGN_EXPERIMENTAL is not needed
        run: |
          cosign sign --yes \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
  # ─── Deploy to Staging ────────────────────────────────────────
  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    if: github.event_name != 'pull_request'
    environment:
      name: staging
      url: https://staging.payment.example.com
    steps:
      - name: Update staging image tag in GitOps repo
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.GITOPS_TOKEN }}
          repository: myorg/k8s-config
          event-type: update-image
          client-payload: |
            {
              "service": "payment-api",
              "environment": "staging",
              "image_tag": "${{ github.sha }}",
              "digest": "${{ needs.build.outputs.image-digest }}"
            }
  # ─── Deploy to Production (manual approval) ───────────────────
  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment:
      name: production # GitHub Environments: require manual review
      url: https://api.payment.example.com
    steps:
      - name: Update production image tag
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.GITOPS_TOKEN }}
          repository: myorg/k8s-config
          event-type: update-image
          client-payload: |
            {
              "service": "payment-api",
              "environment": "production",
              "image_tag": "${{ github.sha }}"
            }
7. Observability in the pipeline
7.1 Deployment tracking
# After the deploy finishes, notify the monitoring system
- name: Create Datadog deployment event
  run: |
    curl -X POST https://api.datadoghq.com/api/v1/events \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
      -d '{
        "title": "payment-api deployed",
        "text": "Version ${{ github.sha }} deployed to production",
        "tags": ["service:payment-api", "env:production"],
        "alert_type": "info"
      }'
7.2 DORA metrics: measuring pipeline effectiveness
Four key metrics (Google DORA Research). The elite thresholds below are indicative and shift between report years:
| Metric | Description | Elite performer |
|---|---|---|
| Deployment Frequency | How often you deploy | Multiple per day |
| Lead Time for Changes | From commit to production | < 1 hour |
| Change Failure Rate | % of deploys that cause an incident | < 5% |
| Time to Restore | Time to recover from an incident | < 1 hour |
# Track in CI and write to the monitoring dashboard (requires the repository to be checked out)
- name: Track DORA metrics
  run: |
    # Lead time here = now minus the timestamp of the latest commit
    LEAD_TIME=$(($(date +%s) - $(git log -1 --format=%ct)))
    echo "Lead time: ${LEAD_TIME}s"
    # Push to the Prometheus Pushgateway
    echo "cicd_lead_time_seconds{service=\"payment-api\",env=\"production\"} ${LEAD_TIME}" \
      | curl --data-binary @- http://pushgateway:9091/metrics/job/cicd
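Two of the four metrics reduce to simple arithmetic that can be checked offline before wiring it into CI. The sketch below uses hypothetical helper names and made-up inputs (fixed Unix timestamps, 2 failed deploys out of 40); the metric line follows the same format as the Pushgateway step above.

```shell
# lead_time_seconds COMMIT_TS DEPLOY_TS: seconds from commit to deploy.
lead_time_seconds() { echo $(( $2 - $1 )); }

# change_failure_rate FAILED TOTAL: percentage of deploys that caused an incident.
change_failure_rate() {
  awk -v failed="$1" -v total="$2" \
    'BEGIN { if (total == 0) { print "n/a"; exit } printf "%.1f\n", 100 * failed / total }'
}

# pushgateway_line SERVICE ENV VALUE: one metric line in Prometheus text format.
pushgateway_line() {
  printf 'cicd_lead_time_seconds{service="%s",env="%s"} %s\n' "$1" "$2" "$3"
}

lt=$(lead_time_seconds 1700000000 1700001800)     # 1800 s = 30 minutes
pushgateway_line payment-api production "$lt"
echo "change failure rate: $(change_failure_rate 2 40)%"
```

With these inputs the lead time is 30 minutes and the change failure rate is 5.0%, right at the elite threshold in the table.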
Summary: Production-Grade CI/CD Checklist
Build:
- Multi-stage Dockerfile with optimized layer caching
- BuildKit with --mount=type=cache for dependencies
- Distroless or minimal base image
- Image tag = git SHA (do not use latest)
Security:
- Trivy scan: fail on CRITICAL/HIGH CVEs
- SBOM generated and attached to the image
- Image signing with Cosign
- Secrets via OIDC/Workload Identity rather than long-lived credentials
- .trivyignore reviewed periodically
Testing:
- Unit tests with the race detector
- Integration tests against real dependencies (service containers in CI)
- Contract tests when there are multiple services
- Test reports and coverage tracking
Deploy:
- preStop hook + a sufficiently long terminationGracePeriodSeconds
- readinessProbe fails fast, livenessProbe is more lenient
- maxUnavailable: 0 for zero downtime
- PDB to prevent mass eviction
- Manual approval gate before production
- Automatic rollback when the error rate rises
- Deployment events sent to the monitoring system