🧠 Programming · ✍️ Khoa · 📅 19/04/2026 · ☕ 15 min read

Go Performance & Profiling

Performance optimization is not guesswork: it must be driven by data. This guide shows how to profile, benchmark, and optimize Go code systematically.

💡 "Premature optimization is the root of all evil." — Donald Knuth
"Measure first, optimize later." — Go wisdom


Performance Mindset

The standard optimization loop

1. Measure (Profile)
   ↓
2. Identify bottleneck
   ↓
3. Optimize
   ↓
4. Measure again (Verify improvement)
   ↓
5. Repeat until the target is met

Never optimize before profiling.
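The measure steps of this loop can even be scripted outside `go test` with testing.Benchmark; a minimal sketch, where sum is a hypothetical function under test:

```go
package main

import (
	"fmt"
	"testing"
)

// sum is a hypothetical function under test.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	data := []int{1, 2, 3, 4, 5}

	// Measure: run a benchmark programmatically, outside `go test`.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = sum(data)
		}
	})

	// Optimize, then run the same benchmark again and compare ns/op.
	fmt.Println(res.N > 0, res.NsPerOp() >= 0) // true true
}
```

Running the same closure before and after a change gives a quick, scriptable before/after comparison.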


pprof: CPU Profiling

Enable pprof

import (
    "log"
    "net/http"
    _ "net/http/pprof"  // registers /debug/pprof handlers on DefaultServeMux
)

func main() {
    // Start the debug server on its own goroutine
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Your app code
}

Collect CPU profile

# Collect 30s CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Analyze
go tool pprof cpu.prof

pprof commands

(pprof) top             # Top functions by CPU time
(pprof) top -cum        # Top by cumulative time
(pprof) list <func>     # Line-by-line breakdown
(pprof) web             # Visualize call graph (requires Graphviz)
(pprof) peek <func>     # Callers and callees
(pprof) traces          # Sample traces

Example output:

(pprof) top
Showing nodes accounting for 2.50s, 83.33% of 3.00s total
      flat  flat%   sum%        cum   cum%
     1.20s 40.00% 40.00%      1.50s 50.00%  main.expensiveFunc
     0.80s 26.67% 66.67%      0.80s 26.67%  runtime.memmove
     0.50s 16.67% 83.33%      0.50s 16.67%  crypto/sha256.block

Explanation:

  • flat: CPU time spent in the function itself (excluding calls)
  • cum (cumulative): CPU time including calls

Flame graph

# Generate flame graph
go tool pprof -http=:8080 cpu.prof

A browser opens; click "Flame Graph" for a visual representation of the call stacks.


pprof: Heap Profiling

Collect heap profile

curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof

pprof heap commands

(pprof) top -cum           # Top allocators
(pprof) list <func>        # Line-by-line allocations
(pprof) web                # Visualize

Heap profile types:

# In-use objects (current memory usage)
curl http://localhost:6060/debug/pprof/heap > heap.prof

# Allocated objects (total allocations)
curl http://localhost:6060/debug/pprof/allocs > allocs.prof

Memory leak detection

# Baseline
curl http://localhost:6060/debug/pprof/heap > heap1.prof

# Wait...

# After some time
curl http://localhost:6060/debug/pprof/heap > heap2.prof

# Diff
go tool pprof -base heap1.prof heap2.prof

If some functions' allocations keep growing between snapshots, investigate a leak.


Execution Tracer

The trace shows a timeline of program execution: goroutine scheduling, GC, and syscalls.

Collect trace

curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out

# View
go tool trace trace.out

Trace UI

A browser opens with several views:

  1. View trace: Timeline của events
  2. Goroutine analysis: Goroutine lifecycle, blocking
  3. Network/Syscall blocking: I/O latency
  4. Scheduler latency: GC pauses, scheduling delays

Use cases:

  • Debug latency spikes (inspect GC pauses)
  • Detect goroutine contention
  • Identify blocking I/O

Benchmarking

Basic benchmark

func BenchmarkSum(b *testing.B) {
    data := []int{1, 2, 3, 4, 5}
    
    b.ResetTimer()  // Reset the timer after setup
    
    for i := 0; i < b.N; i++ {
        _ = sum(data)
    }
}

Run:

go test -bench=. -benchmem

Output:

BenchmarkSum-8    10000000    112 ns/op    0 B/op    0 allocs/op

Explanation:

  • 10000000: number of iterations (b.N)
  • 112 ns/op: average time per operation
  • 0 B/op: bytes allocated per operation
  • 0 allocs/op: number of allocations per operation

Benchmark với setup/teardown

func BenchmarkDB(b *testing.B) {
    // Setup (excluded from benchmark time)
    db := setupTestDB()
    defer db.Close()
    
    b.ResetTimer()
    
    for i := 0; i < b.N; i++ {
        // Query returns (rows, err); close rows to release the connection
        if rows, err := db.Query("SELECT * FROM users"); err == nil {
            rows.Close()
        }
    }
}

Sub-benchmarks

func BenchmarkEncode(b *testing.B) {
    data := []byte("hello world")
    
    b.Run("json", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            json.Marshal(data)
        }
    })
    
    b.Run("msgpack", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            msgpack.Marshal(data)
        }
    })
}

Output:

BenchmarkEncode/json-8       1000000    1052 ns/op
BenchmarkEncode/msgpack-8    2000000     652 ns/op

Benchmark comparison (benchstat)

# Baseline
go test -bench=. -count=10 > old.txt

# After optimization
go test -bench=. -count=10 > new.txt

# Compare
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt

Output:

name     old time/op  new time/op  delta
Sum-8      112ns ± 2%    87ns ± 1%  -22.32%

Memory Profiling Techniques

1. Check allocations

func BenchmarkFunc(b *testing.B) {
    b.ReportAllocs()  // Report allocations
    
    for i := 0; i < b.N; i++ {
        _ = expensiveFunc()
    }
}

2. Profile allocations with pprof

go test -bench=BenchmarkFunc -memprofile=mem.prof
go tool pprof mem.prof

3. Escape analysis

go build -gcflags='-m -m' main.go 2>&1 | grep "escapes to heap"
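As a toy illustration of what that grep surfaces (function names are hypothetical): a value returned by pointer escapes to the heap, while one used only locally can stay on the stack:

```go
package main

import "fmt"

// newCounter's x escapes to the heap: the returned pointer outlives
// the call (gcflags -m reports "moved to heap: x").
func newCounter() *int {
	x := 0
	return &x
}

// sumLocal's slice is only used locally, so it does not escape and
// can live on the stack.
func sumLocal() int {
	s := make([]int, 8)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

func main() {
	c := newCounter()
	*c++
	fmt.Println(*c, sumLocal()) // 1 28
}
```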

CPU Profiling Techniques

1. Find hot functions

go test -bench=. -cpuprofile=cpu.prof
go tool pprof cpu.prof
(pprof) top

2. Line-by-line profile

(pprof) list <function_name>

Example:

ROUTINE ======================== main.process
     1.20s      1.50s (flat, cum) 50.00% of Total
         .          .     10:func process(data []byte) {
         .          .     11:    var result []byte
     0.30s      0.30s     12:    for _, b := range data {
     0.90s      0.90s     13:        result = append(result, transform(b))
         .      0.30s     14:    }
         .          .     15:    return result
         .          .     16:}

Line 13 costs 0.90s, so that is where to optimize.


Optimization Strategies

1. Reduce allocations

Before:

func concat(strs []string) string {
    result := ""
    for _, s := range strs {
        result += s  // Allocates a new string each iteration
    }
    return result
}

After:

func concat(strs []string) string {
    var b strings.Builder
    for _, s := range strs {
        b.WriteString(s)
    }
    return b.String()
}

Benchmark:

BenchmarkConcat/before-8    10000    105234 ns/op    503800 B/op    100 allocs/op
BenchmarkConcat/after-8    100000     14523 ns/op      1024 B/op      1 allocs/op

2. Pre-allocate slices

Before:

var result []int
for i := 0; i < 1000; i++ {
    result = append(result, i)  // Multiple reallocations
}

After:

result := make([]int, 0, 1000)
for i := 0; i < 1000; i++ {
    result = append(result, i)  // No reallocation
}

3. Use sync.Pool for reusable objects

Before:

func process() {
    buf := make([]byte, 4096)  // Allocates on every call
    // Use buf
}

After:

var bufPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 4096)
    },
}

func process() {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf)
    // Use buf
}

4. Avoid unnecessary conversions

Before:

func hash(s string) uint32 {
    return crc32.ChecksumIEEE([]byte(s))  // Allocates
}

After:

import "unsafe"

func hash(s string) uint32 {
    // Go 1.20+: zero-copy view of the string's bytes; never mutate them
    return crc32.ChecksumIEEE(unsafe.Slice(unsafe.StringData(s), len(s)))
}

Warning: unsafe approach; only use it when profiling proves the copy matters.

5. Use fast paths for common cases

Before:

func parseInt(s string) (int, error) {
    return strconv.Atoi(s)
}

After:

func parseInt(s string) (int, error) {
    // Fast path for single digit
    if len(s) == 1 && s[0] >= '0' && s[0] <= '9' {
        return int(s[0] - '0'), nil
    }
    return strconv.Atoi(s)
}

GC Tuning

Monitor GC impact

GODEBUG=gctrace=1 ./myapp

Output:

gc 1 @0.001s 0%: 0.018+0.23+0.003 ms clock, 4->4->0 MB, 5 MB goal

If GC runs frequently:

  • Raise GOGC to lower GC frequency
  • Reduce allocations in hot paths

GOGC=200 ./myapp  # run GC after 200% heap growth (default GOGC=100)

Set memory limit (Go 1.19+)

GOMEMLIMIT=2GiB ./myapp

Runtime Metrics

Expose metrics

import "runtime"

func recordMetrics() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    
    // Expose to Prometheus, Datadog, etc.
    // (gauge here is a placeholder for your metrics client)
    gauge.Set("heap_alloc_bytes", float64(m.Alloc))
    gauge.Set("num_gc", float64(m.NumGC))
    gauge.Set("goroutines", float64(runtime.NumGoroutine()))
}

Key metrics

  • runtime.NumGoroutine(): number of goroutines
  • runtime.NumCPU(): number of CPU cores
  • runtime.MemStats.Alloc: heap bytes allocated
  • runtime.MemStats.NumGC: number of completed GC cycles
  • runtime.MemStats.PauseTotalNs: total GC pause time

Common Performance Pitfalls

❌ 1. String concatenation in loop

// BAD: O(n²) due to allocations
s := ""
for i := 0; i < 10000; i++ {
    s += "x"
}

Fix: use strings.Builder.

❌ 2. Defer in tight loop

// BAD: defers don't run until the function returns, so the mutex
// stays locked (deadlocks on the next iteration) and defers pile up
for i := 0; i < 1000000; i++ {
    mu.Lock()
    defer mu.Unlock()
    // ...
}

Fix: extract the loop body into a function, or unlock manually.

❌ 3. Range copy large struct

type Large struct {
    data [1024]byte
}

// BAD: copies the struct on every iteration
for _, item := range items {  // items []Large
    process(item)  // copies 1024 bytes
}

Fix: range over indices (or store pointers).

for i := range items {
    process(&items[i])
}

❌ 4. Inefficient map usage

// BAD: Check existence, then access (2 lookups)
if _, ok := m[key]; ok {
    val := m[key]
    process(val)
}

Fix: Single lookup.

if val, ok := m[key]; ok {
    process(val)
}

Performance Testing in CI

Benchmark regression detection

# .github/workflows/benchmark.yml
name: Benchmark
on: [pull_request]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Benchmark baseline (main)
        run: |
          git fetch origin main
          git checkout main
          go test -bench=. -count=5 | tee old.txt
      
      - name: Benchmark PR
        run: |
          git checkout ${{ github.head_ref }}
          go test -bench=. -count=5 | tee new.txt
      
      - name: Compare
        run: benchstat old.txt new.txt

Best Practices Summary

✅ Profile before optimizing
✅ Benchmark to verify improvements
✅ Reduce allocations (pre-allocate, reuse, pool)
✅ Use pprof + trace to identify bottlenecks
✅ Monitor GC impact (gctrace, metrics)
✅ Test for performance regressions in CI
✅ Optimize hot paths; don't waste time on cold paths




Profiling Tools

1. pprof — CPU & Memory Profiling

Enable pprof server:

import (
    "log"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Your application code
}

Profile types:

Profile     URL                                Purpose
CPU         /debug/pprof/profile?seconds=30    CPU usage
Heap        /debug/pprof/heap                  Memory allocation
Goroutine   /debug/pprof/goroutine             Goroutine stack traces
Block       /debug/pprof/block                 Blocking operations
Mutex       /debug/pprof/mutex                 Lock contention

Capture profile:

# CPU profile (30 seconds)
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Heap profile
curl http://localhost:6060/debug/pprof/heap > heap.prof

# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof

Analyze with pprof:

go tool pprof cpu.prof

Interactive commands:

(pprof) top         # Top functions by time/memory
(pprof) top -cum    # Top functions by cumulative time
(pprof) list <func> # Line-by-line breakdown
(pprof) web         # Visualize call graph (requires graphviz)
(pprof) peek <func> # Callers and callees

Example output:

(pprof) top
Showing nodes accounting for 2.50s, 83.33% of 3.00s total
      flat  flat%   sum%        cum   cum%
     1.50s 50.00% 50.00%      2.00s 66.67%  main.processData
     0.60s 20.00% 70.00%      0.60s 20.00%  runtime.mallocgc
     0.40s 13.33% 83.33%      0.40s 13.33%  runtime.scanobject

Explanation:

  • flat: time spent in the function itself
  • cum: cumulative time (including callees)
  • main.processData consumes 50% of total CPU time

2. trace — Execution Tracer

Capture trace:

curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out

View trace:

go tool trace trace.out

What you see:

  • Goroutine execution timeline
  • GC events
  • System calls
  • Network I/O
  • Lock contention

Use cases:

  • Find blocked goroutines
  • See GC impact on latency
  • Detect scheduling issues

Example findings:

  • "Goroutine X blocked 90% of time on channel receive"
  • "GC pause causing p99 latency spike"
  • "Too many goroutines competing for same lock"

3. Benchmarking

Write benchmarks:

func BenchmarkProcessData(b *testing.B) {
    data := generateTestData()
    
    b.ResetTimer()  // Don't count setup time
    b.ReportAllocs()  // Report allocations
    
    for i := 0; i < b.N; i++ {
        processData(data)
    }
}

Run benchmarks:

go test -bench=. -benchmem ./...

Output:

BenchmarkProcessData-8    500000    3521 ns/op    2048 B/op    5 allocs/op

Explanation:

  • -8: GOMAXPROCS=8
  • 500000: number of iterations
  • 3521 ns/op: average time per operation
  • 2048 B/op: bytes allocated per operation
  • 5 allocs/op: number of allocations per operation

Compare benchmarks:

# Before optimization
go test -bench=. -benchmem > old.txt

# After optimization
go test -bench=. -benchmem > new.txt

# Compare
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt

Output:

name              old time/op    new time/op    delta
ProcessData-8     3.52µs ± 2%    1.98µs ± 1%   -43.75%  (p=0.000 n=10+10)

name              old alloc/op   new alloc/op   delta
ProcessData-8     2.05kB ± 0%    0.51kB ± 0%   -75.12%  (p=0.000 n=10+10)

name              old allocs/op  new allocs/op  delta
ProcessData-8      5.00 ± 0%      1.00 ± 0%   -80.00%  (p=0.000 n=10+10)

→ 43% faster, 75% less memory, 80% fewer allocations!


4. perf (Linux) / Instruments (macOS)

perf (Linux):

# Record
perf record -g ./myapp

# View
perf report

Instruments (macOS):

# Open Instruments
instruments -t "Time Profiler" ./myapp

Common Performance Issues

Issue 1: Excessive allocations

Symptom: High GC time, p99 latency spikes.

Detection:

go tool pprof http://localhost:6060/debug/pprof/allocs
(pprof) top
      flat  flat%   sum%        cum   cum%
    500MB 50.00% 50.00%     500MB 50.00%  main.processRequest
    200MB 20.00% 70.00%     200MB 20.00%  encoding/json.Marshal

Solutions:

1. Reuse buffers with sync.Pool:

var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func handler(w http.ResponseWriter, r *http.Request) {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufferPool.Put(buf)
    
    // Use buf
    json.NewEncoder(buf).Encode(data)
    w.Write(buf.Bytes())
}

2. Pre-allocate slices:

// Before
result := []int{}
for i := 0; i < 1000; i++ {
    result = append(result, i)  // Reallocates
}

// After
result := make([]int, 0, 1000)  // Pre-allocate capacity
for i := 0; i < 1000; i++ {
    result = append(result, i)  // No reallocation
}

3. Avoid string concatenation:

// Before
s := ""
for _, str := range strings {
    s += str  // New string each time
}

// After
var b strings.Builder
b.Grow(estimatedSize)
for _, str := range strings {
    b.WriteString(str)
}
s := b.String()

Issue 2: Lock contention

Symptom: Low CPU usage, high wait time.

Detection:

go test -bench=. -mutexprofile=mutex.prof
go tool pprof mutex.prof
(pprof) top
      flat  flat%   sum%        cum   cum%
      10s 50.00% 50.00%       10s 50.00%  sync.(*Mutex).Lock
       5s 25.00% 75.00%        5s 25.00%  main.updateCounter

Solutions:

1. Shard locks:

// Before: Single lock
type Counter struct {
    mu    sync.Mutex
    count int64
}

func (c *Counter) Inc() {
    c.mu.Lock()
    c.count++
    c.mu.Unlock()
}

// After: Sharded locks
type ShardedCounter struct {
    shards [64]struct {
        mu    sync.Mutex
        count int64
        _     [56]byte  // Padding to avoid false sharing
    }
}

func (c *ShardedCounter) Inc() {
    // Pick a shard; rand.IntN is from math/rand/v2 (Go 1.22+).
    // In practice you might hash a key or use a per-goroutine id instead.
    shard := &c.shards[rand.IntN(64)]
    shard.mu.Lock()
    shard.count++
    shard.mu.Unlock()
}

func (c *ShardedCounter) Count() int64 {
    var total int64
    for i := range c.shards {
        c.shards[i].mu.Lock()
        total += c.shards[i].count
        c.shards[i].mu.Unlock()
    }
    return total
}

2. Use atomic operations:

// Before
var counter int64
var mu sync.Mutex

func inc() {
    mu.Lock()
    counter++
    mu.Unlock()
}

// After
var counter int64

func inc() {
    atomic.AddInt64(&counter, 1)
}

3. RWMutex for read-heavy workloads:

// Before: Mutex blocks readers
type Cache struct {
    mu   sync.Mutex
    data map[string]string
}

// After: RWMutex allows concurrent reads
type Cache struct {
    mu   sync.RWMutex
    data map[string]string
}

func (c *Cache) Get(key string) (string, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    
    val, ok := c.data[key]
    return val, ok
}

Issue 3: Goroutine leak

Symptom: NumGoroutine() keeps climbing and memory grows.

Detection:

curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof
(pprof) top
      flat  flat%   sum%        cum   cum%
      1000 100.00% 100.00%     1000 100.00%  main.worker
(pprof) list worker
      1000     .      select {
      .        .      case <-ch:  // Never receives
      .        .          process()
      .        .      }

Solutions:

1. Add a cancellable context:

// Before
func worker(ch <-chan Job) {
    for job := range ch {
        process(job)
    }
}

// After
func worker(ctx context.Context, ch <-chan Job) {
    for {
        select {
        case <-ctx.Done():
            return  // Exit on cancel
        case job := <-ch:
            process(job)
        }
    }
}

2. Add a timeout:

func worker(ch <-chan Job) {
    for {
        select {
        case job := <-ch:
            process(job)
        case <-time.After(10 * time.Second):
            return  // Exit if no job in 10s
        }
    }
}

Issue 4: Slow I/O

Symptom: High wait time trong CPU profile.

Detection: Profile shows time spent in syscalls.

Solutions:

1. Batching:

// Before: write each record directly
for _, record := range records {
    file.Write(record)  // one syscall per record
}

// After: Buffer writes
buf := bufio.NewWriter(file)
for _, record := range records {
    buf.Write(record)
}
buf.Flush()  // Single syscall

2. Concurrency:

// Before: Sequential I/O
for _, url := range urls {
    fetch(url)
}

// After: Concurrent I/O
var wg sync.WaitGroup
for _, url := range urls {
    wg.Add(1)
    go func(u string) {
        defer wg.Done()
        fetch(u)
    }(url)
}
wg.Wait()
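Unbounded fan-out can itself become a problem; a buffered channel works as a semaphore to cap concurrency. A sketch, where fetchStub is a hypothetical stand-in for the real fetch:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchStub stands in for a real fetch(url) call.
func fetchStub(u string) string { return "fetched:" + u }

func main() {
	urls := []string{"a", "b", "c", "d", "e"}

	sem := make(chan struct{}, 2) // at most 2 fetches in flight
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []string
	)

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			r := fetchStub(u)
			mu.Lock()
			results = append(results, r)
			mu.Unlock()
		}(url)
	}
	wg.Wait()

	fmt.Println(len(results)) // 5
}
```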

Optimization Techniques

1. Avoid interface{} boxing

// Slow: interface{} boxes value
func sumInterface(nums []interface{}) int {
    sum := 0
    for _, n := range nums {
        sum += n.(int)  // Type assertion + unboxing
    }
    return sum
}

// Fast: Concrete type
func sumInt(nums []int) int {
    sum := 0
    for _, n := range nums {
        sum += n
    }
    return sum
}

Benchmark:

BenchmarkSumInterface-8    10000000    150 ns/op
BenchmarkSumInt-8          50000000     30 ns/op

→ 5x faster!

2. Reduce pointer chasing

// Slow: Pointer indirection
type Node struct {
    Value *int
    Next  *Node
}

// Fast: value types where possible
type Node struct {
    Value int
    Next  *Node
}

3. Inline small functions

The compiler inlines small functions automatically; Go has no directive to force inlining. The only related pragma is //go:noinline, which disables it (occasionally useful in benchmarks):

//go:noinline
func add(a, b int) int {
    return a + b
}

Note: usually nothing is needed here; the inliner is already quite good.

4. Use buffered channels

// Slow: Unbuffered (blocks sender/receiver)
ch := make(chan int)

// Fast: Buffered (reduces blocking)
ch := make(chan int, 100)
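A quick sketch of the semantics: sends to a buffered channel succeed without a waiting receiver until the buffer fills, which is what reduces blocking:

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 3) // buffered: sends don't block until full

	// No receiver is running yet; these sends still succeed.
	ch <- 1
	ch <- 2
	ch <- 3
	close(ch)

	total := 0
	for v := range ch {
		total += v
	}
	fmt.Println(total) // 6
}
```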

5. Escape analysis awareness

// Allocates on heap
func newSlice() []int {
    s := make([]int, 100)
    return s  // Escapes
}

// May allocate on stack
func processSlice() {
    s := make([]int, 100)
    // Use s locally, doesn't escape
}

Production Monitoring

Metrics to track

1. QPS (Queries Per Second):

var requests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
    },
    []string{"method", "endpoint"},
)

2. Latency (Histogram):

var latency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Buckets: []float64{.001, .005, .01, .05, .1, .5, 1, 5},
    },
    []string{"method", "endpoint"},
)

3. Error rate:

var errors = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_request_errors_total",
    },
    []string{"method", "endpoint", "code"},
)

4. Goroutines:

prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "go_goroutines",
    },
    func() float64 {
        return float64(runtime.NumGoroutine())
    },
)

5. Memory:

prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "go_memstats_alloc_bytes",
    },
    func() float64 {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        return float64(m.Alloc)
    },
)

Continuous Profiling

Tools:

  • Grafana Pyroscope
  • Parca
  • Google Cloud Profiler
Benefits:

  • Always-on profiling in production
  • Compare profiles over time
  • Correlate perf changes with deploys

Summary

Tool                  Use case
pprof                 CPU, memory profiling
trace                 Execution timeline, GC events
benchstat             Compare benchmark results
GODEBUG               Runtime debugging (gctrace, schedtrace)
Prometheus            Production metrics
Continuous profiling  Always-on production profiling

Workflow:

  1. Benchmark baseline
  2. Profile to find bottleneck
  3. Optimize
  4. Benchmark again
  5. Deploy with monitoring
  6. Repeat
