🧠 Programming · ✍️ Khoa · 📅 19/04/2026 · ☕ 10 min read

Go Scheduler Internals: G-M-P Deep Dive

Go's scheduler is one of the most sophisticated parts of the runtime. Understanding it helps you debug performance issues, optimize concurrent code, and answer interview questions with confidence.

💡 Go's scheduler started out cooperative: goroutines yield at safe points such as function calls. Since Go 1.14 the runtime can also preempt them asynchronously when needed.

Why Do We Need a Scheduler?

Problem: you can create millions of goroutines, but you only have N CPU cores.

1,000,000 goroutines
        ↓
      ???
        ↓
    8 CPU cores

Solution: the scheduler efficiently multiplexes M goroutines onto N OS threads (M:N scheduling).
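The sketch below shows M:N scheduling from the user's side. The program starts 100,000 goroutines, but at most GOMAXPROCS threads run Go code at any moment. The goroutine count and the helper name `runMany` are arbitrary choices for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runMany starts n goroutines, waits for all of them, and returns how many ran.
func runMany(n int) int64 {
	var wg sync.WaitGroup
	var done atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			done.Add(1) // trivial work: the point is the goroutine count
		}()
	}
	wg.Wait()
	return done.Load()
}

func main() {
	// Far more goroutines than Ps: the scheduler multiplexes them onto
	// at most GOMAXPROCS threads running Go code at once.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("completed:", runMany(100_000))
}
```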

Compared with OS threads:

Aspect	OS Thread	Goroutine
Stack size	1-2 MB (fixed)	2 KB initially (grows dynamically)
Creation cost	~1-2 µs	~200 ns
Context switch	~1-2 µs	~20 ns
Scheduling	Kernel-level (expensive)	User-level (cheap)
Max practical	~10,000	Millions

→ Goroutines are roughly 10-100x cheaper than OS threads.

G-M-P Model

Three main components

┌─────────────────────────────────────────────────────────┐
│                      Go Runtime                         │
│                                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  │
│  │    G    │  │    G    │  │    G    │  │    G    │  │  G = Goroutine
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  │  (task to execute)
│       │            │            │            │        │
│  ┌────▼────────────▼────────────▼────────────▼─────┐ │
│  │                     P                            │ │  P = Processor
│  │  (Local run queue + execution context)          │ │  (token to run)
│  └────┬─────────────────────────────────────────────┘ │
│       │                                                │
│  ┌────▼────┐                                          │
│  │    M    │                                          │  M = Machine
│  └────┬────┘                                          │  (OS thread)
│       │                                                │
└───────┼────────────────────────────────────────────────┘
        │
   ┌────▼────┐
   │   CPU   │
   └─────────┘

G (Goroutine)

Definition: represents a goroutine, whether it is running or waiting.

Structure (simplified from runtime/runtime2.go):

type g struct {
    stack       stack       // Stack memory
    stackguard0 uintptr     // Stack overflow detection
    m           *m          // Current M running this G
    sched       gobuf       // Saved registers (PC, SP)
    atomicstatus uint32     // State: runnable, running, waiting, dead
    goid        int64       // Goroutine ID
    waitsince   int64       // Approximate time the G became blocked
    lockedm     *m          // Locked to specific M?
}

States:

_Gidle: just allocated, not yet initialized
_Grunnable: ready to run, sitting in a run queue
_Grunning: running on an M
_Gsyscall: executing a system call
_Gwaiting: blocked (channel, sleep, I/O)
_Gdead: finished (the g struct may be reused)

M (Machine)

Definition: the OS thread that actually executes code.

Structure (simplified):

type m struct {
    g0      *g          // Goroutine for scheduling (not user code)
    curg    *g          // Current user goroutine
    p       *p          // Current P (can be nil)
    nextp   *p          // Next P to run after syscall
    id      int64
    spinning bool       // Looking for work?
    park    note        // Sleep/wake mechanism
    alllink *m          // Link in allm list
}

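The number of Ms is not fixed:

Ms are created on demand; at most GOMAXPROCS of them run Go code at any moment, because each one needs a P
If an M blocks in a syscall, the runtime hands its P to another M, creating a new M if none is idle
Maximum: 10,000 Ms by default (changeable with runtime/debug.SetMaxThreads); exceeding the limit crashes the program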
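Both the M limit and the number of Ms created so far can be read from standard packages. `debug.SetMaxThreads` returns the previous limit but has no getter. The helper `maxThreads` below is therefore a hypothetical convenience that sets a value and restores the old one. The `threadcreate` profile counts OS threads created so far.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"runtime/pprof"
)

// maxThreads reports the current limit on Ms. SetMaxThreads has no getter,
// so we set a value, keep the previous one, and restore it immediately.
func maxThreads() int {
	prev := debug.SetMaxThreads(10000)
	debug.SetMaxThreads(prev)
	return prev
}

func main() {
	fmt.Println("max threads:", maxThreads())
	// The "threadcreate" profile records every OS thread (M) created so far.
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}
```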

P (Processor)

Definition: the execution context, a token an M must hold to run Go code.

Structure (simplified):

type p struct {
    id          int32
    status      uint32      // _Pidle, _Prunning, _Psyscall, _Pgcstop, _Pdead
    link        *p
    m           *m          // Current M owning this P
    runqhead    uint32      // Local run queue head
    runqtail    uint32      // Local run queue tail
    runq        [256]*g     // Local run queue (circular buffer)
    runnext     *g          // Next G to run (priority)
    
    // Per-P memory allocator cache (small-object allocation)
    mcache      *mcache
    
    // Stats
    schedtick   uint32      // Number of schedules
}

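Number of Ps: GOMAXPROCS (defaults to the number of logical CPUs, runtime.NumCPU(); since Go 1.25, on Linux it is also capped by the container's cgroup CPU limit)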

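// Check the current value from Go code (an argument < 1 only queries)
runtime.GOMAXPROCS(0)

// Set in code (returns the previous value)
runtime.GOMAXPROCS(8)

# Set via the environment variable
GOMAXPROCS=8 go run main.go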

Scheduling Flow

1. A goroutine is created

go func() {
    fmt.Println("Hello")
}()

What the runtime does:

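Allocate a g struct (reusing a dead G from a free list when possible)
Set up its stack (2 KB initially)
Put it in the current P's runnext slot; the G previously in runnext moves to the tail of the local run queue
If the local queue is full → move half of it (plus that displaced G) to the global run queue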

2. The scheduler picks the next G

How it finds work:

An M that holds a P but has no G to run searches in this order:

1. Check P.runnext (priority slot)
   ↓ None
2. Check local run queue (P.runq)
   ↓ Empty
3. Check global run queue
   ↓ Empty
4. Steal from other P (work stealing)
   ↓ All empty
5. Check network poller (netpoll)
   ↓ None
6. M goes to sleep

Work stealing:

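An idle P steals half of another P's local run queue (victims are tried in random order)
This prevents load imbalance between Ps

P1: [G1, G2, G3, G4, G5, G6]
P2: []

→ Work stealing (P2 takes the first half, from the head of P1's queue)

P1: [G4, G5, G6]
P2: [G1, G2, G3]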

3. G runs on an M

M + P + G
  ↓
Execute G's code
  ↓
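One of:
  - G finishes → pick next G
  - G blocks on a channel, lock, or timer → park G, pick next G
  - G enters a blocking syscall → M blocks with G; P is handed to another M
  - G yields (runtime.Gosched) → put G on the global run queue, pick next G
  - G is preempted → save state, put G back on a run queue, pick next G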

4. G blocks

Scenario 1: Blocking syscall (read, write, ...)

M1 + P1 + G1 (calling syscall)
  ↓
G1 enters syscall
  ↓
P1 detaches from M1 (at once for known-blocking calls, otherwise when sysmon retakes it)
  ↓
P1 is handed to M2 (an idle M, or a newly created one)
  ↓
M2 + P1 continues scheduling other Gs
  ↓
M1 waits for syscall to finish
  ↓
Syscall done → M1 tries to reacquire a P (its old P1 first, then any idle P)
  - Success: M1 continues running G1
  - Fail: G1 goes to the global run queue, M1 parks on the idle M list

Scenario 2: Blocking at the Go level (channel, select): only the G parks; the M keeps running other Gs

G1 waiting on channel
  ↓
G1 state = _Gwaiting
  ↓
G1 put in channel's wait queue
  ↓
M picks next G from run queue
  ↓
... later ...
  ↓
Another G sends to channel
  ↓
G1 woken up → state = _Grunnable
  ↓
G1 back to run queue
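The park/wake cycle above looks like this from user code. A receiver parks on an unbuffered channel, and a send wakes it. The comments describe what the runtime does under the hood. The short sleep only makes it likely that the receiver parks before the send; it is not needed for correctness. `handoff` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// handoff parks a receiver on an unbuffered channel, then wakes it by sending.
func handoff() string {
	ch := make(chan string)
	result := make(chan string)

	go func() {
		// No sender yet: this G parks in _Gwaiting on ch's receive queue,
		// and its M moves on to other runnable Gs.
		v := <-ch
		result <- "woke up with " + v
	}()

	// Not needed for correctness; gives the receiver time to park first.
	time.Sleep(10 * time.Millisecond)
	// The send hands the value straight to the parked receiver and marks it
	// runnable (the runtime typically puts it in this P's runnext slot).
	ch <- "hello"
	return <-result
}

func main() {
	fmt.Println(handoff())
}
```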

Preemption

Cooperative Preemption (before Go 1.14)

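How it works: the compiler inserts a stack-bound check into most function prologues. To request preemption, the runtime poisons the goroutine's stackguard0, so the next function call fails the check and enters the scheduler.

func foo() {
    // Compiler-inserted prologue: compare SP against g.stackguard0;
    // a pending preemption request diverts this check into the scheduler
    bar()
}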

Problem: a tight loop with no function calls is never preempted.

// Before Go 1.14, this goroutine can monopolize its P
for {
    i++
}

Async Preemption (since Go 1.14)

How it works: the runtime sends a signal (SIGURG on Unix) to interrupt the goroutine.

Flow:

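Sysmon (a background runtime thread) detects a G that has been running for more than 10ms
It sends a preemption signal to the M running that G
If the G is at an async-safe point, the signal handler saves its state and switches to the scheduler
The scheduler picks another G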

Benefit: tight loops no longer starve other goroutines.

// Go 1.14+: this loop can still be preempted
for {
    i++
}

Sysmon Thread

Sysmon is a special M that runs without a P and handles background tasks.

Responsibilities:

Preempt goroutines that have been running too long (> 10ms)
Force a GC if none has run for 2 minutes
Retake Ps from Ms stuck in syscalls
Poll the network if nobody has polled it for more than 10ms

Sysmon loop (sleeps 20µs between iterations, backing off to 10ms when idle):
  ↓
1. Retake Ps stuck in a syscall (quickly if other work is waiting, after 10ms otherwise)
2. Preempt Gs running > 10ms
3. Force GC if none has run for > 2min
4. Poll the network if it has not been polled for > 10ms
  ↓
Sleep, then repeat
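The adaptive sleep can be modeled in a few lines, following the backoff in sysmon's loop in runtime/proc.go. Sysmon sleeps 20µs while it keeps finding work. After more than 50 consecutive idle cycles (about 1ms), it doubles the sleep each cycle, capped at 10ms. This is a simplified model, and the function name `delayAfter` is hypothetical.

```go
package main

import "fmt"

// delayAfter models sysmon's sleep (in µs) after `idle` consecutive cycles
// in which it found nothing to do: 20µs at first, doubling once idle for
// more than 50 cycles, capped at 10ms. A model, not the runtime's code.
func delayAfter(idle int) int {
	delay := 0
	for i := 0; i <= idle; i++ {
		if i == 0 {
			delay = 20
		} else if i > 50 {
			delay *= 2
		}
		if delay > 10_000 {
			delay = 10_000
		}
	}
	return delay
}

func main() {
	for _, idle := range []int{0, 50, 51, 55, 60} {
		fmt.Printf("idle cycles=%d  sleep=%dµs\n", idle, delayAfter(idle))
	}
}
```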

Network Poller

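Problem: blocking I/O calls (network, files) tie up an OS thread.

Solution: an integrated network poller (epoll on Linux, kqueue on BSD/macOS, IOCP on Windows). It covers network connections; regular file I/O still goes through blocking syscalls, handled by the P hand-off from Scenario 1 above.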

How it works:

G1: conn.Read()
  ↓
Syscall would block
  ↓
Runtime puts G1 in netpoll wait list
  ↓
M picks next G (G2)
  ↓
... G2 runs ...
  ↓
Netpoller detects conn is ready
  ↓
G1 moved to run queue
  ↓
M eventually picks G1 again
  ↓
G1 resumes, Read() returns

Benefit: thousands of concurrent connections without thousands of OS threads.

GOMAXPROCS: Which Value to Choose?

Rule of thumb

CPU-bound:

GOMAXPROCS = number of cores
Going higher does not help (it only adds context-switch overhead)

I/O-bound:

GOMAXPROCS = number of cores is still fine
Goroutines blocked on I/O do not hold a P, so they do not waste one

Mixed workload:

The default (number of cores) is usually good
With distinct CPU-intensive and blocking-I/O-heavy parts, you can try 1.5-2x cores, but benchmark first, since blocked goroutines already release their P

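When Should You Tune It?

Signs you should lower GOMAXPROCS:

High CPU context-switch rate
High P99 latency caused by contention
GOMAXPROCS above the container's CPU limit, causing CFS throttling (Go 1.25+ accounts for cgroup limits by default on Linux)

Signs you should raise it:

Low CPU utilization despite many runnable goroutines
This is rare; raising it is usually unnecessary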

A Practical Example

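package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// cpuBound is pure computation: no blocking, no I/O
func cpuBound() {
    sum := 0
    for i := 0; i < 1e9; i++ {
        sum += i * i
    }
    _ = sum
}

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU()) // same as the default

    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() { defer wg.Done(); cpuBound() }()
    }
    wg.Wait() // wait for the work instead of sleeping a fixed 10s
    fmt.Println("elapsed:", time.Since(start))
}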

With GOMAXPROCS=1: the 100 goroutines time-share a single core (each preempted after roughly 10ms)
With GOMAXPROCS=8: they run in parallel on 8 cores (up to ~8x faster on an otherwise idle 8-core machine)

Debugging the Scheduler

1. GODEBUG=schedtrace

GODEBUG=schedtrace=1000 ./myapp

Output:

SCHED 0ms: gomaxprocs=8 idleprocs=0 threads=10 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0]

Explanation:

gomaxprocs=8: 8 Ps
idleprocs=0: 0 idle Ps (all busy)
threads=10: 10 Ms exist
spinningthreads=0: 0 Ms spinning in search of work
idlethreads=4: 4 Ms sleeping
runqueue=0: 0 Gs in the global run queue
[0 0 0 0 0 0 0 0]: the local run queue length of each P (all empty here)

2. runtime.NumGoroutine()

fmt.Println("Goroutines:", runtime.NumGoroutine())

If this number keeps growing, you likely have a goroutine leak.

3. pprof goroutine profile

go tool pprof http://localhost:6060/debug/pprof/goroutine

Inside pprof:

(pprof) top
(pprof) list <function>
(pprof) traces

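This shows where goroutines are waiting. The URLs in this section assume the program imports net/http/pprof and serves HTTP on localhost:6060.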

4. Execution tracer

curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
go tool trace trace.out

In the trace UI:

View the goroutine timeline
Inspect scheduling events
Spot blocking and contention

Advanced: LockOSThread

Use case: binding a goroutine to a specific OS thread (OpenGL, C libraries that rely on thread-local storage).

func init() {
    runtime.LockOSThread()
}

func main() {
    // Locking in init pins the main goroutine to the main OS thread
    // Useful for C libraries that require thread-local state
}

Note: avoid it unless you really need it. While locked, that M runs only this goroutine, which reduces scheduling flexibility.
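A common variant locks one dedicated goroutine instead of main, and sends it every call that must run on the same thread. The sketch shows that pattern. It is an assumption for illustration, not a specific library's API. The funcs sent to it stand in for hypothetical calls into a thread-sensitive C library.

```go
package main

import (
	"fmt"
	"runtime"
)

// startLockedWorker runs every submitted func on one dedicated OS thread.
// Closing the returned channel stops the worker and unlocks its thread.
func startLockedWorker() chan<- func() {
	jobs := make(chan func())
	go func() {
		runtime.LockOSThread() // this G and its M are now wired together
		defer runtime.UnlockOSThread()
		for job := range jobs {
			job() // always executes on the same OS thread
		}
	}()
	return jobs
}

func main() {
	jobs := startLockedWorker()
	done := make(chan string)
	jobs <- func() { done <- "ran on the locked thread" }
	fmt.Println(<-done)
	close(jobs)
}
```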

Performance Tips

1. Avoid creating too many goroutines at once

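// ❌ BAD: 1 million goroutines created at once
for i := 0; i < 1_000_000; i++ {
    go process(i)
}

// ✅ GOOD: bound concurrency with a semaphore (buffered channel)
const workers = 100
sem := make(chan struct{}, workers)

for i := 0; i < 1_000_000; i++ {
    sem <- struct{}{} // blocks once 100 goroutines are in flight
    go func(id int) {
        defer func() { <-sem }() // release the slot
        process(id)
    }(i)
}

// Wait for the stragglers: acquiring every slot means all goroutines are done
for i := 0; i < workers; i++ {
    sem <- struct{}{}
}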
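Another way to cap goroutines is a classic fixed-size worker pool. Only `workers` goroutines exist, however many jobs there are. In the sketch, squaring each job stands in for the hypothetical `process`, and `sumSquares` is a made-up name.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares processes n jobs with a fixed pool of workers goroutines.
func sumSquares(n, workers int) int {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j // stand-in for process(j)
			}
		}()
	}

	go func() { // feed jobs, then let workers drain and exit
		for i := 0; i < n; i++ {
			jobs <- i
		}
		close(jobs)
	}()
	go func() { wg.Wait(); close(results) }() // close results after the last worker

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares(1000, 8))
}
```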

2. Goroutines should not spin while waiting

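// ready is set by another goroutine, so it must be an atomic.Bool
// (sync/atomic); polling a plain bool here would be a data race
var ready atomic.Bool

// ❌ BAD: busy wait
for !ready.Load() {
    // Monopolizes the CPU
}

// ✅ GOOD: yield while polling (still burns some CPU)
for !ready.Load() {
    runtime.Gosched() // Yield to other goroutines
}

// ✅ BETTER: block on a channel; the G parks and uses no CPU
<-readyChan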
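A channel also works as a one-to-many "ready" signal. Closing it wakes every waiter at once, because a receive on a closed channel returns immediately. The sketch shows this with a hypothetical helper `waitAll`.

```go
package main

import (
	"fmt"
	"sync"
)

// waitAll starts n waiters on a ready channel and then closes it; closing
// releases every waiter at once, with no busy-waiting.
func waitAll(n int) int {
	ready := make(chan struct{})
	var wg sync.WaitGroup
	var mu sync.Mutex
	woken := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ready // parks (if not yet closed) without using CPU
			mu.Lock()
			woken++
			mu.Unlock()
		}()
	}
	close(ready) // broadcast: all pending and future receives return
	wg.Wait()
	return woken
}

func main() {
	fmt.Println("woken:", waitAll(5))
}
```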

3. Batch work to reduce scheduling overhead

// ❌ BAD: one goroutine per item
for _, item := range items {
    go process(item)  // Scheduling overhead for every item
}

// ✅ GOOD: batch items (Item and process are placeholders)
chunkSize := 100
var wg sync.WaitGroup
for i := 0; i < len(items); i += chunkSize {
    end := i + chunkSize
    if end > len(items) {
        end = len(items)
    }

    chunk := items[i:end]
    wg.Add(1)
    go func(batch []Item) {
        defer wg.Done()
        for _, item := range batch {
            process(item)
        }
    }(chunk)
}
wg.Wait() // wait for every batch to finish

Summary

Concept	Meaning
G	Goroutine (task): lightweight, fast to create and destroy
M	OS thread: expensive, count is dynamic
P	Processor (token): count = GOMAXPROCS
Work stealing	An idle P steals from other Ps to balance load
Preemption	Asynchronous, signal-based (since Go 1.14)
Sysmon	Background thread for maintenance tasks
Netpoller	Async network I/O without blocking OS threads
GOMAXPROCS	Default = number of cores; rarely needs tuning
