Go Performance & Profiling
Performance optimization is not guesswork; it must be based on data. This guide shows how to profile, benchmark, and optimize Go code systematically.
💡 "Premature optimization is the root of all evil." — Donald Knuth
"Measure first, optimize later." — Go wisdom
Performance Mindset
The standard optimization workflow
1. Measure (Profile)
↓
2. Identify bottleneck
↓
3. Optimize
↓
4. Measure again (Verify improvement)
↓
5. Repeat until the target is met
Never optimize before profiling.
pprof: CPU Profiling
Enable pprof
import (
_ "net/http/pprof"
"net/http"
)
func main() {
// Start debug server
go func() {
http.ListenAndServe("localhost:6060", nil)
}()
// Your app code
}
Collect CPU profile
# Collect 30s CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Analyze
go tool pprof cpu.prof
pprof commands
(pprof) top # Top functions by CPU time
(pprof) top -cum # Top by cumulative time
(pprof) list <func> # Line-by-line breakdown
(pprof) web # Visualize call graph (requires Graphviz)
(pprof) peek <func> # Callers and callees
(pprof) traces # Sample traces
Example output:
(pprof) top
Showing nodes accounting for 2.50s, 83.33% of 3.00s total
flat flat% sum% cum cum%
1.20s 40.00% 40.00% 1.50s 50.00% main.expensiveFunc
0.80s 26.67% 66.67% 0.80s 26.67% runtime.memmove
0.50s 16.67% 83.33% 0.50s 16.67% crypto/sha256.block
Explanation:
- flat: CPU time spent in the function itself (excluding callees)
- cum (cumulative): CPU time including callees
Flame graph
# Generate flame graph
go tool pprof -http=:8080 cpu.prof
A browser window opens; click "Flame Graph" for a visual representation of the call stacks.
pprof: Heap Profiling
Collect heap profile
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof
pprof heap commands
(pprof) top -cum # Top allocators
(pprof) list <func> # Line-by-line allocations
(pprof) web # Visualize
Heap profile variants:
# In-use objects (current memory usage)
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Allocated objects (total allocations)
curl http://localhost:6060/debug/pprof/allocs > allocs.prof
Memory leak detection
# Baseline
curl http://localhost:6060/debug/pprof/heap > heap1.prof
# Wait...
# After some time
curl http://localhost:6060/debug/pprof/heap > heap2.prof
# Diff
go tool pprof -base heap1.prof heap2.prof
If a function's allocations keep growing between snapshots, investigate a possible leak.
Execution Tracer
The trace shows a timeline of program execution: goroutine scheduling, GC, syscalls.
Collect trace
curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
# View
go tool trace trace.out
Trace UI
A browser window opens with several views:
- View trace: Timeline của events
- Goroutine analysis: Goroutine lifecycle, blocking
- Network/Syscall blocking: I/O latency
- Scheduler latency: GC pauses, scheduling delays
Use cases:
- Debug latency spikes (look at GC pauses)
- Detect goroutine contention
- Identify blocking I/O
Benchmarking
Basic benchmark
func BenchmarkSum(b *testing.B) {
data := []int{1, 2, 3, 4, 5}
b.ResetTimer() // reset the timer after setup
for i := 0; i < b.N; i++ {
_ = sum(data)
}
}
Run:
go test -bench=. -benchmem
Output:
BenchmarkSum-8 10000000 112 ns/op 0 B/op 0 allocs/op
Explanation:
- 10000000: number of iterations (b.N)
- 112 ns/op: average time per operation
- 0 B/op: bytes allocated per operation
- 0 allocs/op: number of allocations per operation
Benchmark với setup/teardown
func BenchmarkDB(b *testing.B) {
// setup (not counted in the benchmark time)
db := setupTestDB()
defer db.Close()
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = db.Query("SELECT * FROM users")
}
}
Sub-benchmarks
func BenchmarkEncode(b *testing.B) {
data := []byte("hello world")
b.Run("json", func(b *testing.B) {
for i := 0; i < b.N; i++ {
json.Marshal(data)
}
})
b.Run("msgpack", func(b *testing.B) {
for i := 0; i < b.N; i++ {
msgpack.Marshal(data)
}
})
}
Output:
BenchmarkEncode/json-8 1000000 1052 ns/op
BenchmarkEncode/msgpack-8 2000000 652 ns/op
Benchmark comparison (benchstat)
# Baseline
go test -bench=. -count=10 > old.txt
# After optimization
go test -bench=. -count=10 > new.txt
# Compare
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt
Output:
name old time/op new time/op delta
Sum-8 112ns ± 2% 87ns ± 1% -22.32%
Memory Profiling Techniques
1. Check allocations
func BenchmarkFunc(b *testing.B) {
b.ReportAllocs() // Report allocations
for i := 0; i < b.N; i++ {
_ = expensiveFunc()
}
}
2. Profile allocations with pprof
go test -bench=BenchmarkFunc -memprofile=mem.prof
go tool pprof mem.prof
3. Escape analysis
go build -gcflags='-m -m' main.go 2>&1 | grep "escapes to heap"
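To see what the flag reports, a small sketch (function names are illustrative): with -gcflags='-m', the compiler should flag the returned slice in makeBuf as escaping to the heap, while the slice in sumLocal stays local and can live on the stack.

```go
package main

import "fmt"

// makeBuf returns its slice, so the slice outlives the call and the
// compiler must heap-allocate it ("escapes to heap").
func makeBuf() []byte {
	return make([]byte, 64)
}

// sumLocal never lets its slice leave the function, so escape analysis
// can keep the allocation on the stack.
func sumLocal() int {
	buf := make([]byte, 64)
	for i := range buf {
		buf[i] = byte(i)
	}
	total := 0
	for _, b := range buf {
		total += int(b)
	}
	return total
}

func main() {
	fmt.Println(len(makeBuf()), sumLocal())
}
```
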
CPU Profiling Techniques
1. Find hot functions
go test -bench=. -cpuprofile=cpu.prof
go tool pprof cpu.prof
(pprof) top
2. Line-by-line profile
(pprof) list <function_name>
Example:
ROUTINE ======================== main.process
1.20s 1.50s (flat, cum) 50.00% of Total
. . 10:func process(data []byte) []byte {
. . 11: var result []byte
0.30s 0.30s 12: for _, b := range data {
0.90s 0.90s 13: result = append(result, transform(b))
. 0.30s 14: }
. . 15: return result
. . 16:}
Line 13 costs 0.90s, so optimize there.
Optimization Strategies
1. Reduce allocations
Before:
func concat(strs []string) string {
result := ""
for _, s := range strs {
result += s // allocates a new string each time
}
return result
}
After:
func concat(strs []string) string {
var b strings.Builder
for _, s := range strs {
b.WriteString(s)
}
return b.String()
}
Benchmark:
BenchmarkConcat/before-8 10000 105234 ns/op 503800 B/op 100 allocs/op
BenchmarkConcat/after-8 100000 14523 ns/op 1024 B/op 1 allocs/op
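A benchmark producing numbers of this shape might look like the following sketch (file and function names are assumed, not part of the original):

```go
// concat_test.go (illustrative)
package main

import (
	"strings"
	"testing"
)

// concatPlus is the "before" version: += reallocates every time.
func concatPlus(strs []string) string {
	result := ""
	for _, s := range strs {
		result += s
	}
	return result
}

// concatBuilder is the "after" version using strings.Builder.
func concatBuilder(strs []string) string {
	var b strings.Builder
	for _, s := range strs {
		b.WriteString(s)
	}
	return b.String()
}

func BenchmarkConcat(b *testing.B) {
	strs := make([]string, 100)
	for i := range strs {
		strs[i] = "hello"
	}
	b.Run("before", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = concatPlus(strs)
		}
	})
	b.Run("after", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = concatBuilder(strs)
		}
	})
}
```

Run it with `go test -bench=Concat -benchmem`.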
2. Pre-allocate slices
Before:
var result []int
for i := 0; i < 1000; i++ {
result = append(result, i) // Multiple reallocations
}
After:
result := make([]int, 0, 1000)
for i := 0; i < 1000; i++ {
result = append(result, i) // No reallocation
}
3. Use sync.Pool for reusable objects
Before:
func process() {
buf := make([]byte, 4096) // allocated on every call
// Use buf
}
After:
var bufPool = sync.Pool{
New: func() interface{} {
return make([]byte, 4096)
},
}
func process() {
buf := bufPool.Get().([]byte)
defer bufPool.Put(buf)
// Use buf
}
4. Avoid unnecessary conversions
Before:
func hash(s string) uint32 {
return crc32.ChecksumIEEE([]byte(s)) // Allocates
}
After:
import "unsafe"
func hash(s string) uint32 {
	// zero-copy view of the string's bytes (Go 1.20+)
	return crc32.ChecksumIEEE(unsafe.Slice(unsafe.StringData(s), len(s)))
}
Warning: use unsafe only when profiling proves the copy matters, and never mutate the returned bytes; a string's backing memory is read-only.
5. Use fast paths for common cases
Before:
func parseInt(s string) (int, error) {
return strconv.Atoi(s)
}
After:
func parseInt(s string) (int, error) {
// Fast path for single digit
if len(s) == 1 && s[0] >= '0' && s[0] <= '9' {
return int(s[0] - '0'), nil
}
return strconv.Atoi(s)
}
GC Tuning
Monitor GC impact
GODEBUG=gctrace=1 ./myapp
Output:
gc 1 @0.001s 0%: 0.018+0.23+0.003 ms clock, 4->4->0 MB, 5 MB goal
If GC runs too frequently:
- Increase GOGC to reduce GC frequency
- Reduce allocations on the hot path
GOGC=200 ./myapp  # collect when the heap grows 200% over the live heap (default 100)
Set memory limit (Go 1.19+)
GOMEMLIMIT=2GiB ./myapp
Runtime Metrics
Expose metrics
import "runtime"
func recordMetrics() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
// Expose to Prometheus, Datadog, etc.
gauge.Set("heap_alloc_bytes", float64(m.Alloc))
gauge.Set("num_gc", float64(m.NumGC))
gauge.Set("goroutines", float64(runtime.NumGoroutine()))
}
Key metrics
- runtime.NumGoroutine(): number of goroutines
- runtime.NumCPU(): number of CPU cores
- runtime.MemStats.Alloc: heap bytes currently allocated
- runtime.MemStats.NumGC: number of completed GC cycles
- runtime.MemStats.PauseTotalNs: total GC pause time
Common Performance Pitfalls
❌ 1. String concatenation in loop
// BAD: O(n²) due to allocations
s := ""
for i := 0; i < 10000; i++ {
s += "x"
}
Fix: use strings.Builder.
❌ 2. Defer in tight loop
// BAD: defers pile up until the enclosing function returns,
// so the mutex is never released inside the loop and the
// second mu.Lock() deadlocks
for i := 0; i < 1000000; i++ {
	mu.Lock()
	defer mu.Unlock()
	// ...
}
Fix: extract the loop body into its own function, or unlock manually.
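The extraction fix in code (incOnce is an illustrative name): the defer now runs when the small function returns, i.e. once per iteration, instead of accumulating.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu      sync.Mutex
	counter int
)

// incOnce holds the critical section; its deferred unlock runs at the
// end of each call rather than piling up until the caller's loop ends.
func incOnce() {
	mu.Lock()
	defer mu.Unlock()
	counter++
}

func main() {
	for i := 0; i < 1000; i++ {
		incOnce()
	}
	fmt.Println(counter) // 1000
}
```
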
❌ 3. Range copy large struct
type Large struct {
data [1024]byte
}
// BAD: copies the struct on every iteration
for _, item := range items { // items []Large
	process(item) // copies 1024 bytes
}
Fix: range over indices and take a pointer.
for i := range items {
process(&items[i])
}
❌ 4. Inefficient map usage
// BAD: Check existence, then access (2 lookups)
if _, ok := m[key]; ok {
val := m[key]
process(val)
}
Fix: Single lookup.
if val, ok := m[key]; ok {
process(val)
}
Performance Testing in CI
Benchmark regression detection
# .github/workflows/benchmark.yml
name: Benchmark
on: [pull_request]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # need full history to check out main and the PR branch
      - name: Benchmark baseline (main)
        run: |
          git checkout origin/main
          go test -bench=. -count=5 | tee old.txt
      - name: Benchmark PR
        run: |
          git checkout ${{ github.head_ref }}
          go test -bench=. -count=5 | tee new.txt
      - name: Compare
        run: |
          go install golang.org/x/perf/cmd/benchstat@latest
          benchstat old.txt new.txt
Best Practices Summary
✅ Profile before you optimize
✅ Benchmark to verify improvements
✅ Reduce allocations (pre-allocate, reuse, pool)
✅ Use pprof + trace to identify bottlenecks
✅ Monitor GC impact (gctrace, metrics)
✅ Test for performance regressions in CI
✅ Optimize hot paths; don't waste time on cold paths
References
- pprof documentation: https://go.dev/blog/pprof
- Profiling Go Programs: https://go.dev/blog/profiling-go-programs
- Execution Tracer: https://go.dev/blog/execution-tracer
- High Performance Go: https://github.com/dgryski/go-perfbook
- Dave Cheney - High Performance Go Workshop: https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html
Go Performance & Profiling
Performance optimization in Go is not about tricks; it is about measuring, analyzing, and optimizing based on data. This file covers the tools and techniques for profiling and optimizing Go applications.
💡 "Premature optimization is the root of all evil." — Donald Knuth
The right process: Profile → Identify bottleneck → Optimize → Measure again.
Profiling Tools
1. pprof — CPU & Memory Profiling
Enable pprof server:
import _ "net/http/pprof"
import "net/http"
func main() {
go func() {
http.ListenAndServe("localhost:6060", nil)
}()
// Your application code
}
Profile types:
| Profile | URL | Purpose |
|---|---|---|
| CPU | /debug/pprof/profile?seconds=30 | CPU usage |
| Heap | /debug/pprof/heap | Memory allocation |
| Goroutine | /debug/pprof/goroutine | Goroutine stack traces |
| Block | /debug/pprof/block | Blocking operations |
| Mutex | /debug/pprof/mutex | Lock contention |
Capture profile:
# CPU profile (30 seconds)
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Heap profile
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
Analyze with pprof:
go tool pprof cpu.prof
Interactive commands:
(pprof) top # Top functions by time/memory
(pprof) top -cum # Top functions by cumulative time
(pprof) list <func> # Line-by-line breakdown
(pprof) web # Visualize call graph (requires graphviz)
(pprof) peek <func> # Callers and callees
Example output:
(pprof) top
Showing nodes accounting for 2.50s, 83.33% of 3.00s total
flat flat% sum% cum cum%
1.50s 50.00% 50.00% 2.00s 66.67% main.processData
0.60s 20.00% 70.00% 0.60s 20.00% runtime.mallocgc
0.40s 13.33% 83.33% 0.40s 13.33% runtime.scanobject
Explanation:
- flat: time spent in the function itself
- cum: cumulative time (including callees)
- main.processData consumes 50% of total CPU time
2. trace — Execution Tracer
Capture trace:
curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
View trace:
go tool trace trace.out
What you see:
- Goroutine execution timeline
- GC events
- System calls
- Network I/O
- Lock contention
Use cases:
- Find blocked goroutines
- See the impact of GC on latency
- Detect scheduling issues
Example findings:
- "Goroutine X blocked 90% of time on channel receive"
- "GC pause causing p99 latency spike"
- "Too many goroutines competing for same lock"
3. Benchmarking
Write benchmarks:
func BenchmarkProcessData(b *testing.B) {
data := generateTestData()
b.ResetTimer() // Don't count setup time
b.ReportAllocs() // Report allocations
for i := 0; i < b.N; i++ {
processData(data)
}
}
Run benchmarks:
go test -bench=. -benchmem ./...
Output:
BenchmarkProcessData-8 500000 3521 ns/op 2048 B/op 5 allocs/op
Explanation:
- -8: GOMAXPROCS=8
- 500000: number of iterations
- 3521 ns/op: average time per operation
- 2048 B/op: bytes allocated per operation
- 5 allocs/op: number of allocations per operation
Compare benchmarks:
# Before optimization
go test -bench=. -benchmem > old.txt
# After optimization
go test -bench=. -benchmem > new.txt
# Compare
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt
Output:
name old time/op new time/op delta
ProcessData-8 3.52µs ± 2% 1.98µs ± 1% -43.75% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
ProcessData-8 2.05kB ± 0% 0.51kB ± 0% -75.12% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
ProcessData-8 5.00 ± 0% 1.00 ± 0% -80.00% (p=0.000 n=10+10)
→ 43% faster, 75% less memory, 80% fewer allocations!
4. perf (Linux) / Instruments (macOS)
perf (Linux):
# Record
perf record -g ./myapp
# View
perf report
Instruments (macOS):
# Open Instruments
instruments -t "Time Profiler" ./myapp
Common Performance Issues
Issue 1: Excessive allocations
Symptom: High GC time, p99 latency spikes.
Detection:
go tool pprof http://localhost:6060/debug/pprof/allocs
(pprof) top
flat flat% sum% cum cum%
500MB 50.00% 50.00% 500MB 50.00% main.processRequest
200MB 20.00% 70.00% 200MB 20.00% encoding/json.Marshal
Solutions:
1. Reuse objects with sync.Pool:
var bufferPool = sync.Pool{
New: func() interface{} {
return new(bytes.Buffer)
},
}
func handler(w http.ResponseWriter, r *http.Request) {
buf := bufferPool.Get().(*bytes.Buffer)
buf.Reset()
defer bufferPool.Put(buf)
// Use buf
json.NewEncoder(buf).Encode(data)
w.Write(buf.Bytes())
}
2. Pre-allocate slices:
// Before
result := []int{}
for i := 0; i < 1000; i++ {
result = append(result, i) // Reallocates
}
// After
result := make([]int, 0, 1000) // Pre-allocate capacity
for i := 0; i < 1000; i++ {
result = append(result, i) // No reallocation
}
3. Avoid string concatenation:
// Before
s := ""
for _, str := range parts {
	s += str // new string each time
}
// After (note: don't name the slice "strings"; it would shadow the package)
var b strings.Builder
b.Grow(estimatedSize)
for _, str := range parts {
	b.WriteString(str)
}
s := b.String()
Issue 2: Lock contention
Symptom: Low CPU usage, high wait time.
Detection:
go test -bench=. -mutexprofile=mutex.prof
go tool pprof mutex.prof
(pprof) top
flat flat% sum% cum cum%
10s 50.00% 50.00% 10s 50.00% sync.(*Mutex).Lock
5s 25.00% 75.00% 5s 25.00% main.updateCounter
Solutions:
1. Shard locks:
// Before: Single lock
type Counter struct {
mu sync.Mutex
count int64
}
func (c *Counter) Inc() {
c.mu.Lock()
c.count++
c.mu.Unlock()
}
// After: Sharded locks
type ShardedCounter struct {
shards [64]struct {
mu sync.Mutex
count int64
_ [56]byte // Padding to avoid false sharing
}
}
func (c *ShardedCounter) Inc() {
	// pick a shard pseudo-randomly; rand.IntN (math/rand/v2, Go 1.22+)
	// is cheap and does not take a global lock
	shard := &c.shards[rand.IntN(len(c.shards))]
	shard.mu.Lock()
	shard.count++
	shard.mu.Unlock()
}
func (c *ShardedCounter) Count() int64 {
var total int64
for i := range c.shards {
c.shards[i].mu.Lock()
total += c.shards[i].count
c.shards[i].mu.Unlock()
}
return total
}
2. Use atomics:
// Before
var counter int64
var mu sync.Mutex
func inc() {
mu.Lock()
counter++
mu.Unlock()
}
// After
var counter int64
func inc() {
atomic.AddInt64(&counter, 1)
}
3. RWMutex for read-heavy workloads:
// Before: Mutex blocks readers
type Cache struct {
mu sync.Mutex
data map[string]string
}
// After: RWMutex allows concurrent reads
type Cache struct {
mu sync.RWMutex
data map[string]string
}
func (c *Cache) Get(key string) (string, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
val, ok := c.data[key]
return val, ok
}
Issue 3: Goroutine leak
Symptom: NumGoroutine() keeps growing, and memory grows with it.
Detection:
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof
(pprof) top
flat flat% sum% cum cum%
1000 100.00% 100.00% 1000 100.00% main.worker
(pprof) list worker
1000 . select {
. . case <-ch: // Never receives
. . process()
. . }
Solutions:
1. Add a cancellable context:
// Before
func worker(ch <-chan Job) {
for job := range ch {
process(job)
}
}
// After
func worker(ctx context.Context, ch <-chan Job) {
for {
select {
case <-ctx.Done():
return // Exit on cancel
case job := <-ch:
process(job)
}
}
}
2. Add a timeout:
func worker(ch <-chan Job) {
for {
select {
case job := <-ch:
process(job)
case <-time.After(10 * time.Second):
return // Exit if no job in 10s
}
}
}
Issue 4: Slow I/O
Symptom: high wait time in the CPU profile.
Detection: Profile shows time spent in syscalls.
Solutions:
1. Batching:
// Before: write each record separately
for _, record := range records {
	file.Write(record) // one syscall per record
}
}
// After: Buffer writes
buf := bufio.NewWriter(file)
for _, record := range records {
buf.Write(record)
}
buf.Flush() // Single syscall
2. Concurrency:
// Before: Sequential I/O
for _, url := range urls {
fetch(url)
}
// After: Concurrent I/O
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go func(u string) {
defer wg.Done()
fetch(u)
}(url)
}
wg.Wait()
Optimization Techniques
1. Avoid interface{} boxing
// Slow: interface{} boxes value
func sumInterface(nums []interface{}) int {
sum := 0
for _, n := range nums {
sum += n.(int) // Type assertion + unboxing
}
return sum
}
// Fast: Concrete type
func sumInt(nums []int) int {
sum := 0
for _, n := range nums {
sum += n
}
return sum
}
Benchmark:
BenchmarkSumInterface-8 10000000 150 ns/op
BenchmarkSumInt-8 50000000 30 ns/op
→ 5x faster!
2. Reduce pointer chasing
// Slow: Pointer indirection
type Node struct {
Value *int
Next *Node
}
// Fast: value types where possible
type Node struct {
Value int
Next *Node
}
3. Inline small functions
The compiler inlines small functions automatically; there is no directive to force inlining. The opposite exists: //go:noinline disables it for a function, which is occasionally useful in microbenchmarks:
//go:noinline
func add(a, b int) int {
	return a + b
}
Note: you rarely need this; the compiler's inlining heuristics are already good.
4. Use buffered channels
// Slow: Unbuffered (blocks sender/receiver)
ch := make(chan int)
// Fast: Buffered (reduces blocking)
ch := make(chan int, 100)
5. Escape analysis awareness
// Allocates on heap
func newSlice() []int {
s := make([]int, 100)
return s // Escapes
}
// May allocate on stack
func processSlice() {
s := make([]int, 100)
// Use s locally, doesn't escape
}
Production Monitoring
Metrics to track
1. QPS (Queries Per Second):
var requests = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "endpoint"},
)
2. Latency (Histogram):
var latency = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: []float64{.001, .005, .01, .05, .1, .5, 1, 5},
},
[]string{"method", "endpoint"},
)
3. Error rate:
var errors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_request_errors_total",
},
[]string{"method", "endpoint", "code"},
)
4. Goroutines:
prometheus.NewGaugeFunc(
prometheus.GaugeOpts{
Name: "go_goroutines",
},
func() float64 {
return float64(runtime.NumGoroutine())
},
)
5. Memory:
prometheus.NewGaugeFunc(
prometheus.GaugeOpts{
Name: "go_memstats_alloc_bytes",
},
func() float64 {
var m runtime.MemStats
runtime.ReadMemStats(&m)
return float64(m.Alloc)
},
)
Continuous Profiling
Tools:
- Pyroscope: https://pyroscope.io/
- Google Cloud Profiler: https://cloud.google.com/profiler
- Datadog Continuous Profiler: https://www.datadoghq.com/product/code-profiling/
Benefits:
- Always-on profiling in production
- Compare profiles over time
- Correlate perf changes with deploys
Summary
| Tool | Use case |
|---|---|
| pprof | CPU, memory profiling |
| trace | Execution timeline, GC events |
| benchstat | Compare benchmark results |
| GODEBUG | Runtime debugging (gctrace, schedtrace) |
| Prometheus | Production metrics |
| Continuous profiling | Always-on production profiling |
Workflow:
- Benchmark baseline
- Profile to find bottleneck
- Optimize
- Benchmark again
- Deploy with monitoring
- Repeat
References
- Profiling Go Programs: https://go.dev/blog/pprof
- Diagnostics: https://go.dev/doc/diagnostics
- Dave Cheney - High Performance Go: https://dave.cheney.net/high-performance-go-workshop/dotgo-paris.html
- Go Performance Tips: https://github.com/dgryski/go-perfbook