🚀 DevOps ✍️ Khoa 📅 19/04/2026 ☕ 15 min read

DevOps: Hands-On K8s Troubleshooting (Case-based)

Effective Kubernetes troubleshooting is the skill that separates an average operator from someone who can calmly debug at 2 a.m. This post is organized around real-world cases, each with a clear diagnostic flow instead of a random list of commands.

Debugging principle: don't guess. Observe → Form a hypothesis → Verify → Fix.

Tools to install

# kubectl plugins (installed via krew)
kubectl krew install ctx ns stern tail resource-capacity

# k9s: a TUI for Kubernetes (highly recommended)
brew install k9s

# kubectx + kubens: switch context/namespace quickly
brew install kubectx

# stern: tail logs from multiple pods at once
brew install stern

General diagnostic framework

Before diving into each case, here is the general flow for any incident:

1. Scope: Where is the problem? (Pod? Node? Namespace? Whole cluster?)
   kubectl get pods -n <namespace>
   kubectl get nodes

2. Events: What did K8s record?
   kubectl describe pod <pod> -n <namespace>   ← Events section at the bottom
   kubectl get events -n <namespace> --sort-by='.lastTimestamp'

3. Logs: What does the app say?
   kubectl logs <pod> -n <namespace>
   kubectl logs <pod> -n <namespace> --previous    ← Logs from the previous crashed container

4. State: What does the resource look like?
   kubectl get pod <pod> -n <namespace> -o yaml
   kubectl get deployment <name> -n <namespace> -o yaml

5. Network: Does traffic reach the right place?
   kubectl get endpoints <service> -n <namespace>
   kubectl exec -it <debug-pod> -n <namespace> -- curl http://<service>:<port>/healthz

Case 1: Pod stuck in CrashLoopBackOff

Symptoms

kubectl get pods -n payments
# NAME                          READY   STATUS             RESTARTS   AGE
# payment-api-7d4f8b9c4-xk2m9   0/1     CrashLoopBackOff   8          15m

Diagnostic flow

Step 1: Read the logs of the container that just crashed

# Current logs (the container is restarting)
kubectl logs payment-api-7d4f8b9c4-xk2m9 -n payments

# Logs from the PREVIOUS crash (usually more informative)
kubectl logs payment-api-7d4f8b9c4-xk2m9 -n payments --previous

Step 2: Check the pod's events

kubectl describe pod payment-api-7d4f8b9c4-xk2m9 -n payments
# Look at the "Events:" section at the end of the output
# You will typically see: Back-off restarting failed container

Step 3: Check the exit code

kubectl get pod payment-api-7d4f8b9c4-xk2m9 -n payments -o json \
  | jq '.status.containerStatuses[0].lastState.terminated'

# exitCode: 1   → The app exited with an error (check the logs)
# exitCode: 137 → Killed by SIGKILL (137 = 128 + 9); usually OOMKilled, confirm with the "reason" field
# exitCode: 143 → Terminated by SIGTERM (143 = 128 + 15); the app did not handle the signal and exit cleanly
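
The 128 + signal convention behind these exit codes can be sketched as a small helper. This is only an illustration: `decode_exit_code` is a hypothetical name, not a kubectl feature, and it covers just the signals mentioned above.

```shell
# Decode a container exit code into a short hint.
# Codes above 128 follow the convention 128 + signal number.
decode_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL)" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit $code)"
  fi
}

decode_exit_code 137
decode_exit_code 143
decode_exit_code 1
```

A SIGKILL alone does not prove an OOM kill, so the `reason` field in the terminated state remains the authoritative source.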

Common causes and fixes

Cause 1: The app crashes at startup because of a wrong config/secret

# Check env vars and secret mounts
kubectl exec payment-api-7d4f8b9c4-xk2m9 -n payments -- env | grep DATABASE

# If the container crashes too fast to exec into, attach an ephemeral container
kubectl debug -it payment-api-7d4f8b9c4-xk2m9 -n payments \
  --image=busybox --target=payment-api -- sh
# In the shell: while the target process is alive, its env and files are visible
# under /proc/<pid>/environ and /proc/<pid>/root (e.g. /proc/<pid>/root/etc/config)

# Or check that the secret has the expected keys (values are base64-encoded)
kubectl get secret payment-api-secret -n payments -o jsonpath='{.data}' | jq
Cause 2: OOMKilled (exitCode 137)

# Confirm the OOM kill
kubectl describe pod <pod> -n payments | grep -A5 "OOMKilled"
# Reason: OOMKilled

# Check current memory usage (requires metrics-server; shows live usage only, not pre-crash history)
kubectl top pod <pod> -n payments

# Check the current memory limit
kubectl get pod <pod> -n payments -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# Fix: raise the limit or find the leak
# For Java/JVM: add -XX:MaxRAMPercentage=75.0 so the heap does not consume the entire limit

Cause 3: The liveness probe keeps failing

# Inspect the probe config
kubectl get pod <pod> -n payments -o yaml | grep -A20 livenessProbe

# Hit the endpoint manually (port-forward blocks, so run curl in a second terminal)
kubectl port-forward pod/<pod> 8080:8080 -n payments
curl localhost:8080/healthz/live
# If the liveness check queries the DB and the DB is slow → probe timeout → restart loop

Cause 4: Image pull error

# Note: this shows up as ErrImagePull/ImagePullBackOff in STATUS rather than CrashLoopBackOff
kubectl describe pod <pod> -n payments | grep -i "failed\|error\|pull"
# ErrImagePull: the pull attempt failed (wrong image name or tag, bad credentials, unreachable registry)
# ImagePullBackOff: the kubelet is backing off before retrying after repeated pull failures

# Check imagePullSecrets
kubectl get pod <pod> -n payments -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <pull-secret> -n payments -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

Case 2: Pod stuck in Pending, not being scheduled

Symptoms

kubectl get pods -n payments
# NAME                          READY   STATUS    RESTARTS   AGE
# payment-api-7d4f8b9c4-xk2m9   0/1     Pending   0          5m

Diagnostic flow

# The most important step: read the pod's Events
kubectl describe pod payment-api-7d4f8b9c4-xk2m9 -n payments

# The output will contain one of the following messages:
# 0/3 nodes are available: 3 Insufficient cpu.
# 0/3 nodes are available: 3 Insufficient memory.
# 0/3 nodes are available: 3 node(s) had untolerated taint {key: value}.
# 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.
# 0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims.

Common causes

Cause 1: Not enough resources on any node

# See allocated resources on every node
kubectl describe nodes | grep -A5 "Allocated resources"

# Or use the resource-capacity plugin
kubectl resource-capacity --pods --sort cpu.request

# Total CPU requests in the namespace, in millicores
# (quantities are strings like "250m" or "2", so normalize them before summing)
kubectl get pods -n payments -o json \
  | jq '[.items[].spec.containers[].resources.requests.cpu // "0"
         | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end] | add'

# Short-term fix: find the pods using the most CPU
kubectl top pods -n payments --sort-by=cpu
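
CPU requests are Kubernetes quantities such as `250m`, `2` or `0.5`, so they have to be normalized to one unit before they can be added up. The helpers below sketch that normalization in plain shell; `to_millicores` and `sum_cpu` are hypothetical names used for illustration, and only the plain and `m`-suffixed forms are handled.

```shell
# Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                         # already millicores
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;; # whole or fractional cores
  esac
}

# Sum any number of CPU quantities and print the total in millicores.
sum_cpu() {
  total=0
  for q in "$@"; do
    total=$((total + $(to_millicores "$q")))
  done
  echo "${total}m"
}

sum_cpu 100m 250m 2 0.5   # prints 2850m
```

Comparing such a total with the node's allocatable CPU shows quickly whether a new pod's request can fit at all.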

Cause 2: The node has a taint the pod does not tolerate

# List the taints on every node
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,TAINTS:.spec.taints'

# Example: a node has the taint "dedicated=gpu:NoSchedule"
# The pod needs this toleration:
# tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "gpu"
#     effect: "NoSchedule"

Cause 3: The PVC cannot bind

# Check the PVC
kubectl get pvc -n payments
# STATUS: Pending → no PV matches (storageClass, accessMode, capacity)

kubectl describe pvc <pvc-name> -n payments
# Events: no persistent volumes available for this claim and no storage class is set
# On the pod you may instead see: volume node affinity conflict (e.g. an EBS volume lives in one AZ, the node is in another)

Cause 4: Node affinity is too strict

# Check the affinity in the pod spec
kubectl get pod <pod> -n payments -o jsonpath='{.spec.affinity}' | jq

# List node labels
kubectl get nodes --show-labels

# If no node carries the label the affinity requires → Pending forever
# Label a node only if it really has that property
kubectl label node <node-name> disk-type=ssd

Case 3: Service does not route traffic to the pod

Symptoms

The app runs normally and the pods are Running, but requests through the Service return connection refused or 502.

Diagnostic flow

# Step 1: Check whether the Service selects the right pods
kubectl get endpoints payment-api -n payments
# NAME          ENDPOINTS                         AGE
# payment-api   10.244.1.5:8080,10.244.2.8:8080   5d
# If ENDPOINTS is <none> → the label selector is wrong, or no selected pod is Ready

# Step 2: Compare the Service selector with the Pod labels
kubectl get service payment-api -n payments -o jsonpath='{.spec.selector}'
# {"app":"payment-api","version":"v2"}

kubectl get pods -n payments --show-labels
# NAME                          LABELS
# payment-api-xxx               app=payment-api,version=v1   ← different version!

# Step 3: Test connectivity directly to the Pod (bypassing the Service)
kubectl exec -it debug-pod -n payments -- \
  curl http://10.244.1.5:8080/healthz
# If OK → the problem is in the Service/kube-proxy
# If it fails → the problem is in the app or a NetworkPolicy

# Step 4: Test through the Service DNS name
kubectl exec -it debug-pod -n payments -- \
  curl http://payment-api.payments.svc.cluster.local:8080/healthz

# Step 5: Check whether a NetworkPolicy is blocking traffic
kubectl get networkpolicy -n payments
kubectl describe networkpolicy <name> -n payments

Common causes

Label selector does not match:

# Service selector: app=payment-api
# Pod labels: app=payment-api-v2   ← NO MATCH

# Fix: change the Service selector or the pod labels
kubectl patch service payment-api -n payments \
  -p '{"spec":{"selector":{"app":"payment-api-v2"}}}'
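
A Service selects a pod only when every key=value pair in its selector appears among the pod's labels; extra labels on the pod are fine. The sketch below reproduces that rule offline on comma-separated strings like the ones printed by `--show-labels`. `selector_matches` is a hypothetical helper written for illustration, not part of kubectl.

```shell
# Return 0 if every key=value pair in the selector also appears in the labels.
# Both arguments are comma-separated lists, e.g. "app=payment-api,version=v2".
selector_matches() {
  selector="$1"
  labels=",$2,"
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                      # pair present, keep checking
      *) IFS="$old_ifs"; return 1 ;;       # pair missing: the pod is not selected
    esac
  done
  IFS="$old_ifs"
  return 0
}

if selector_matches "app=payment-api,version=v2" "app=payment-api,version=v1"; then
  echo "match"
else
  echo "no match"
fi
```

The comparison is exact, so `app=payment-api` does not match a pod labeled `app=payment-api-v2`.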

Pod is not Ready (readinessProbe failing):

# Endpoints only include pods that are Ready
kubectl get pod <pod> -n payments -o jsonpath='{.status.conditions}'
# If Ready=False → the pod is excluded from the endpoints

# Check why readiness is failing
kubectl describe pod <pod> -n payments | grep -A10 "Readiness"

Wrong targetPort:

kubectl get service payment-api -n payments -o yaml | grep -A5 ports
# port: 80
# targetPort: 8080   ← must be the port the container actually listens on

Case 4: Node under pressure, Pods get Evicted

Symptoms

kubectl get pods -n payments
# NAME                          STATUS    REASON
# payment-api-xxx               Evicted   The node was low on resource: memory.

Diagnostic flow

# Step 1: Check node conditions
kubectl describe node <node-name>
# Conditions:
#   MemoryPressure: True    ← node is running out of memory
#   DiskPressure: True      ← node is running out of disk
#   PIDPressure: True       ← too many processes

# Step 2: Find the pods using the most resources
kubectl top pods -n payments --sort-by=memory

# Step 3: Check node resource usage
kubectl top nodes

# Step 4: Check the eviction thresholds (kubelet config, read through the API server proxy)
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq '.kubeletconfig.evictionHard'
# {
#   "memory.available": "100Mi",   ← evict when less than 100Mi is free
#   "nodefs.available": "10%"
# }

Remediation and prevention

# Emergency cleanup
# Delete completed/failed pods (they take up etcd and disk space, not RAM)
kubectl delete pod --field-selector=status.phase==Succeeded -n payments
kubectl delete pod --field-selector=status.phase==Failed -n payments

# Remove unused images on the node (SSH into the node)
crictl rmi --prune

# Delete evicted pods (to keep the cluster tidy)
kubectl get pods -A | grep Evicted | awk '{print $2 " -n " $1}' \
  | xargs -L1 kubectl delete pod

Long-term prevention:

# 1. Set resource requests on every pod
# 2. Use a LimitRange to enforce defaults in the namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-limit-range
  namespace: payments
spec:
  limits:
    - type: Container
      default:             # Applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:      # Applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      max:                 # Upper bound a container may not exceed
        cpu: 4000m
        memory: 4Gi
---
# 3. ResourceQuota for the namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    count/pods: "50"

Case 5: DNS resolution fails

Symptoms

kubectl exec -it payment-api-xxx -n payments -- \
  curl http://user-service.users.svc.cluster.local
# curl: (6) Could not resolve host: user-service.users.svc.cluster.local

Diagnostic flow

# Step 1: Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# If the CoreDNS pods are crashing → DNS is affected cluster-wide

# Step 2: Test DNS resolution from the pod (the image must ship nslookup)
kubectl exec -it payment-api-xxx -n payments -- nslookup kubernetes.default
# Server: 10.96.0.10   ← CoreDNS ClusterIP
# Address: 10.96.0.10#53
#
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1

# If even kubernetes.default (a built-in service) does not resolve → CoreDNS is broken

# Step 3: Test the specific service
kubectl exec -it payment-api-xxx -n payments -- nslookup user-service.users
# If this fails but kubernetes.default works → the service "user-service" does not exist in namespace "users"

# Step 4: Check that the service actually exists
kubectl get service user-service -n users
# Error from server (NotFound): services "user-service" not found   ← wrong name or wrong namespace

CoreDNS is broken: detailed checks

# View CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# CoreDNS ConfigMap (custom DNS rules)
kubectl get configmap coredns -n kube-system -o yaml

# Query CoreDNS directly (10.96.0.10 is an example; look up the real IP with
# kubectl get svc kube-dns -n kube-system)
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
  nslookup payment-api.payments.svc.cluster.local 10.96.0.10

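A common ndots pitfall:

# The pod's /etc/resolv.conf:
# nameserver 10.96.0.10
# search payments.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# "ndots:5" means: if a hostname has fewer than 5 dots,
# the search domains are tried first, before the name is resolved as absolute
# → "redis" from this pod resolves as "redis.payments.svc.cluster.local"
# → "redis.payments" first tries "redis.payments.payments.svc.cluster.local" (NXDOMAIN),
#   then "redis.payments.svc.cluster.local"
# → external names such as "api.example.com" also walk the whole search list first;
#   add a trailing dot ("api.example.com.") to skip it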
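
The lookup order that `ndots` and the `search` list produce can be made visible with a small simulation. This is a sketch of glibc-style resolver behavior only (other resolvers, such as musl, differ in detail); `resolve_candidates` is a hypothetical helper that prints names and performs no DNS queries.

```shell
# Print the names a resolver would try, in order, for a given hostname.
# Arguments: hostname, ndots value, then the search domains.
resolve_candidates() {
  name="$1"; ndots="$2"; shift 2
  case "$name" in
    *.) echo "$name"; return 0 ;;     # trailing dot: fully qualified, no search list
  esac
  stripped=$(printf '%s' "$name" | tr -d '.')
  dots=$(( ${#name} - ${#stripped} ))  # number of dots in the name
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."                     # enough dots: the name as-is is tried first
  fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."                     # too few dots: the absolute lookup comes last
  fi
}

resolve_candidates redis.payments 5 \
  payments.svc.cluster.local svc.cluster.local cluster.local
```

For `redis.payments` this prints four candidates, and the second one is the name that actually resolves, so every lookup costs at least one wasted query.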

Case 6: Unexplained CPU throttling

Symptoms

The app shows periodic latency spikes. There are no errors, no OOMKills, and CPU usage looks normal on the dashboard.

Diagnostic flow

# Step 1: Check CPU throttling in Prometheus
# Metric: container_cpu_cfs_throttled_seconds_total
# If its rate is rising → the container is being throttled

# Step 2: Check the current limits
kubectl get pod <pod> -n payments -o jsonpath='{.spec.containers[0].resources}'

# Step 3: Compare actual CPU usage with the limits
kubectl top pod <pod> -n payments --containers

# CPU throttling happens even when average usage is low!
# Example: limit = 500m, and a multi-threaded burst burns the 50ms quota early in a 100ms period → throttled
# → the average still looks like 200m, but the app was paused for the rest of that period

How it works

The Linux CFS (Completely Fair Scheduler) bandwidth controller divides time into periods (100ms by default). Each container may consume its CPU quota within each period:

Limit = 500m → quota = 50ms per 100ms period

If the container wants 80ms of CPU in one period → it runs out after 50ms, and the remaining 30ms has to wait for the next period → every thread in the container is paused until then
Average usage never exceeds 500m, yet latency spikes still occur
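
The quota arithmetic can be checked with two small shell functions. They are a simplified model for illustration (hypothetical names, integer milliseconds, one period at a time), not a reproduction of the kernel's accounting.

```shell
# Quota per CFS period in milliseconds for a CPU limit given in millicores.
# With the default 100ms period, 1000m equals 100ms of CPU time per period.
cfs_quota_ms() {
  limit_m="$1"; period_ms="${2:-100}"
  echo $(( limit_m * period_ms / 1000 ))
}

# Milliseconds of wanted CPU time that do not fit into one period's quota.
throttled_ms() {
  wanted_ms="$1"; quota_ms="$2"
  if [ "$wanted_ms" -gt "$quota_ms" ]; then
    echo $(( wanted_ms - quota_ms ))
  else
    echo 0
  fi
}

quota=$(cfs_quota_ms 500)
echo "quota: ${quota}ms per 100ms period"
echo "deferred: $(throttled_ms 80 "$quota")ms"
```

With a 500m limit this prints a 50ms quota and 30ms of deferred work, matching the numbers above.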

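# Check throttling directly in the cgroup (run on the node; cgroup v1 layout shown,
# the exact path depends on the cgroup driver and the pod's QoS class)
cat /sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/cpu.stat
# nr_throttled 1234          ← number of periods in which the container was throttled
# throttled_time 98765432    ← total nanoseconds spent throttled (cgroup v2 reports throttled_usec)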

Solutions

# Option 1: Raise limits.cpu
# Option 2: Do not set limits.cpu (accepting the noisy-neighbor risk)
# Option 3: Tune the CFS quota period (cluster-level, rarely used)

# For JVM workloads: parallel GC threads often cause the bursts that get throttled
# Add JVM flags (JAVA_OPTS only works if the image's entrypoint passes it to java):
env:
  - name: JAVA_OPTS
    value: >-
      -XX:+UseG1GC
      -XX:MaxGCPauseMillis=200
      -XX:+ParallelRefProcEnabled
      -XX:MaxRAMPercentage=75

Case 7: Deploy is stuck, rollout makes no progress

Symptoms

kubectl rollout status deployment/payment-api -n payments
# Waiting for deployment "payment-api" rollout to finish: 1 out of 3 new replicas have been updated...
# (stuck, makes no further progress)

Diagnostic flow

# Step 1: Check the deployment status
kubectl describe deployment payment-api -n payments
# Conditions:
#   Available: True
#   Progressing: False   ← reason ProgressDeadlineExceeded once the deadline passes

# Step 2: Check the ReplicaSets
kubectl get replicaset -n payments
# NAME                          DESIRED   CURRENT   READY   AGE
# payment-api-new-xxx           1         1         0       5m   ← new pod is not Ready
# payment-api-old-xxx           2         2         2       2d

# Step 3: See what the new pod is doing
kubectl get pods -n payments
# payment-api-new-xxx   0/1   CrashLoopBackOff   5   5m
# → New pod crashes → readiness fails → rollout is stuck

# Step 4: Roll back
kubectl rollout undo deployment/payment-api -n payments

# Inspect a revision in the rollout history
kubectl rollout history deployment/payment-api -n payments --revision=2

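Reasons a rollout gets stuck:

New pods crash-loop → the readiness probe never passes.
maxUnavailable: 0 while the surge pod cannot start, e.g. no node capacity or an exhausted ResourceQuota (the API rejects setting both maxUnavailable and maxSurge to 0).
A node drain running at the same time is blocked by a PDB (not enough minAvailable); the PDB itself does not limit rolling updates, which bypass the Eviction API.
progressDeadlineSeconds is too short (default 600s).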

Case 8: Secret/ConfigMap changed but the pod did not pick it up

Symptoms

You ran kubectl apply with a new secret, but the app in the pod still reads the old value.

Explanation

Kubernetes does not automatically restart pods when a ConfigMap/Secret changes (unless you use Reloader or a special setup).

# Environment variables: only updated when the pod restarts
# Volume mounts: updated automatically within about a minute (depends on the kubelet sync period);
#   subPath mounts are never updated
# The app must still watch for file changes itself if it should reload the config

Solutions

# Option 1: Restart the deployment manually
kubectl rollout restart deployment/payment-api -n payments

# Option 2: Use Reloader (an operator that restarts workloads when a secret/configmap changes)
# https://github.com/stakater/Reloader
# Add this annotation to the deployment:
# annotations:
#   reloader.stakater.com/auto: "true"

# Option 3: Put a checksum of the secret into the pod annotations (triggers a rollout automatically)
# In a Helm template:
# annotations:
#   checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
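
The checksum idea also works without Helm: any change to the pod template, including an annotation, makes the Deployment roll its pods. The sketch below builds such a patch from a file's SHA-256. It assumes GNU `sha256sum` is available; `checksum_patch` is a hypothetical helper, and the temporary file only stands in for a real secret manifest.

```shell
# Build a patch that stores a checksum of the given file in the pod template
# annotations. A changed checksum changes the template, which triggers a rollout.
checksum_patch() {
  sum=$(sha256sum "$1" | awk '{print $1}')
  printf '{"spec":{"template":{"metadata":{"annotations":{"checksum/secret":"%s"}}}}}\n' "$sum"
}

# Example with a throwaway file standing in for the secret manifest.
tmp=$(mktemp)
printf 'password=old\n' > "$tmp"
checksum_patch "$tmp"
rm -f "$tmp"

# Applying it would look like this (not run here):
# kubectl patch deployment payment-api -n payments -p "$(checksum_patch secret.yaml)"
```

Running the patch again with an unchanged file produces the same annotation, so no rollout happens unless the content really changed.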

Advanced debugging tools

Ephemeral containers: debug a running pod

# Add a debug container to a running pod (K8s 1.23+)
# --image: netshoot ships a full set of network tools
# --target: share the process namespace with the main container
kubectl debug -it payment-api-xxx \
  --image=nicolaka/netshoot \
  --target=payment-api \
  -n payments

# Inside the netshoot container:
ss -tlnp                          # List listening ports
tcpdump -i eth0 port 8080         # Capture traffic
curl http://localhost:8080/debug/pprof/  # If the app exposes pprof
dig user-service.users.svc.cluster.local # Test DNS

Node debugging

# Debug a node without SSH: starts a pod on the node with the host filesystem mounted at /host
kubectl debug node/worker-node-1 -it --image=ubuntu

# Inside the debug pod, switch to the host root to use the node's own tools
chroot /host
crictl ps
crictl logs <container-id>
crictl inspect <container-id> | jq .info.pid

# Inspect resource usage at the kernel level
cat /proc/<pid>/status | grep -i vm  # Memory stats
ls /proc/<pid>/fd | wc -l             # Number of open file descriptors

stern: tail logs from multiple pods at once

# Logs from all pods with the label app=payment-api
stern -l app=payment-api -n payments

# Filter with a regex
stern payment-api -n payments --include="ERROR|FATAL|panic"

# Logs across namespaces
stern -l app=payment-api --all-namespaces

# Add timestamps
stern payment-api --timestamps -n payments

k9s: a TUI for the cluster

k9s                          # Open the TUI
# Shortcuts:
# :pod → view pods
# :svc → view services
# :dp → view deployments
# l → view logs of the selected pod
# d → describe pod
# s → shell into the container
# ctrl+k → kill (force-delete) the selected pod
# / → filter
# ? → help

Quick debug checklist (print it and keep it handy)

Pod PENDING:
  □ kubectl describe pod → Events: Insufficient resources? Taint? Affinity?
  □ kubectl get nodes → are the nodes Ready?
  □ kubectl top nodes → how much capacity is left?

Pod CRASHLOOP:
  □ kubectl logs --previous → what is the error message?
  □ exit code? 137=SIGKILL (usually OOM), 1=app error, 143=SIGTERM
  □ kubectl describe pod → probe failure? image pull error?

Service NOT ROUTING:
  □ kubectl get endpoints → any IPs listed?
  □ Do the pod labels match the service selector?
  □ kubectl exec → does curl straight to the pod IP work?
  □ Is a NetworkPolicy blocking traffic?

HIGH LATENCY:
  □ CPU throttling? (Prometheus: container_cpu_cfs_throttled_seconds_total)
  □ Memory pressure? (kubectl top nodes, node conditions)
  □ Slow DNS? (measure nslookup latency)
  □ Recently added network policy?
  □ DB connection pool exhausted?

ROLLOUT STUCK:
  □ Are the new pods crashing? (kubectl get pods)
  □ Is a PDB blocking a concurrent node drain? (kubectl get pdb)
  □ kubectl rollout undo if you need to roll back right away