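The Complete Guide to Go Performance Debugging

Overview of Go Performance Tools

Tool	Purpose	How to enable
race detector	Detects data races (unsynchronized concurrent reads and writes)	`go run -race` / `go test -race`
pprof CPU	Finds which functions consume CPU time	benchmark or net/http/pprof
pprof Memory	Finds memory leaks and excessive allocation	same as above
pprof Block	Finds goroutines blocked on channels or mutexes	same as above, plus `runtime.SetBlockProfileRate`
pprof Mutex	Finds mutex contention hot spots	same as above, plus `runtime.SetMutexProfileFraction`
runtime/trace	Microsecond-level events, scheduler detail	`runtime/trace.Start`
GODEBUG	Built-in runtime debugging output	`GODEBUG=gctrace=1`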

Race Detector

The -race flag is Go's most important concurrency debugging tool. Any program that uses goroutines should have its tests run under the race detector.

Enabling

go run -race main.go
go test -race ./...
go build -race -o myapp # do not ship this build to production (typically 2-20x slower and 5-10x more memory)

Example: a detected race

package main

import (
    "fmt"
    "sync"
)

func main() {
    var counter int
    var wg sync.WaitGroup

    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // ⚠️ data race
        }()
    }
    wg.Wait()
    fmt.Println(counter)
}

$ go run -race main.go
==================
WARNING: DATA RACE
Read at 0x00c00001a0a8 by goroutine 8:
  main.main.func1()
      /tmp/race.go:14 +0x39
Previous write at 0x00c00001a0a8 by goroutine 7:
  main.main.func1()
      /tmp/race.go:14 +0x4d
==================

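The race detector reports both conflicting accesses: which goroutines were involved and the file and line of each access.

Fix

// Option 1: sync.Mutex
var mu sync.Mutex
mu.Lock()
counter++
mu.Unlock()

// Option 2: atomic (atomic.Int64 requires Go 1.19 or later)
import "sync/atomic"
var counter atomic.Int64
counter.Add(1)

// Option 3: channel
counts := make(chan int, 10)
// ... workers send, main receives and sums

See the article on Go sync synchronization primitives for details.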
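
The fragments above can be assembled into a complete program. The following is a minimal sketch, assuming Go 1.19 or later for atomic.Int64; the function names countWithMutex and countWithAtomic are illustrative and do not come from the original example. Running it with `go run -race` should report no race.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countWithMutex increments a shared counter from n goroutines,
// guarding every increment with a mutex.
func countWithMutex(n int) int {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			counter++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counter
}

// countWithAtomic does the same with an atomic counter (Go 1.19+).
func countWithAtomic(n int) int64 {
	var (
		counter atomic.Int64
		wg      sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Add(1)
		}()
	}
	wg.Wait()
	return counter.Load()
}

func main() {
	fmt.Println(countWithMutex(10), countWithAtomic(10)) // prints "10 10"
}
```

For a single integer counter the atomic version is usually the simpler choice; the mutex version generalizes to updates that touch more than one variable.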

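Limitations

It only detects races that actually occur during a run (a racy code path that never executes is not caught)
It can have false negatives: a clean run only means no race was detected in that run
Run it in both development and CI

CI convention

# always pass -race in CI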
- run: go test -race -cover ./...

Introduction to pprof

pprof is Go's built-in profiling tool. It can produce CPU, memory, block, and mutex profiles.

Two ways to enable it

Mode 1: HTTP endpoint (long-running servers)

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    go func() {
        // internal / debug use only; do not expose publicly
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... normal program logic
}

Endpoints:

http://localhost:6060/debug/pprof/ (index page)
http://localhost:6060/debug/pprof/profile?seconds=30 (30-second CPU profile)
http://localhost:6060/debug/pprof/heap (current heap snapshot)
http://localhost:6060/debug/pprof/goroutine (goroutine stack dump)

Mode 2: runtime/pprof (CLI tools / benchmarks)

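import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // run the code to be profiled
}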

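Analyzing with `go tool pprof`

# capture a 30-second CPU profile from the HTTP endpoint
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# from a file
go tool pprof cpu.prof

# commands in interactive mode
(pprof) top                # print the 10 functions using the most CPU
(pprof) top 20             # print 20
(pprof) list FuncName      # show line-level hot spots within the function
(pprof) web                # open the call graph as SVG (requires graphviz)
(pprof) png > out.png      # write a PNG

Web UI (the most intuitive)

go tool pprof -http=:8080 cpu.prof

This opens a browser with these views:

Top: functions sorted by cost
Graph: call graph
Flame Graph: the fastest way to see where time goes
Source: line-level hot spots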

CPU Profile

Collecting a profile

# capture 30 seconds over HTTP
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# from a benchmark
go test -bench=. -cpuprofile=cpu.prof ./pkg
go tool pprof cpu.prof

Interpreting the output

(pprof) top10
Showing nodes accounting for 5.20s, 86.67% of 6.00s total
      flat  flat%   sum%        cum   cum%
     2.50s 41.67% 41.67%      2.80s 46.67%  json.parseString
     1.20s 20.00% 61.67%      1.50s 25.00%  encoding/json.indirect
     0.80s 13.33% 75.00%      0.80s 13.33%  runtime.mallocgc

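Column	Meaning
flat	Time spent in the function itself
flat%	flat / total
sum%	Running total of flat% down the list
cum	Time spent in the function plus everything it calls
cum%	cum / total

High flat means the function body itself is expensive; high cum with low flat means the cost is in the functions it calls.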
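
To see the flat / cum distinction in a real profile, a small program helps. The sketch below is illustrative only: spin and handle are made-up names, and `//go:noinline` is used so both functions keep their own frames. After running it, `go tool pprof cpu.prof` followed by `top` should show spin with high flat time and handle with high cum but low flat.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

// spin burns CPU in its own body, so it should show high flat time.
//
//go:noinline
func spin(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// handle does almost no work itself and delegates to spin,
// so it should show high cum time but low flat time.
//
//go:noinline
func handle(rounds, n int) int {
	total := 0
	for r := 0; r < rounds; r++ {
		total += spin(n)
	}
	return total
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Enough iterations for the sampler to collect a useful number of samples.
	fmt.Println(handle(200, 5_000_000))
}
```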

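Flame Graph

The most intuitive visualization of a CPU profile:

go tool pprof -http=:8080 cpu.prof
# select "Flame Graph" in the browser

Width = share of time
Vertical axis = call chain (in pprof's web UI the root is at the top and callees stack downward)
Wide frames are the hot spots; height only reflects stack depth

Note: a CPU profile only counts on-CPU time

A CPU profile does not include:

I/O waits
channel blocking
mutex waits
sleep

For those, use the block profile, mutex profile, or trace.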

Memory Profile

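Two heap profiles

# current heap snapshot (default sample type: inuse_space)
go tool pprof http://localhost:6060/debug/pprof/heap

# cumulative allocations, including memory the GC has already freed (default sample type: alloc_space)
go tool pprof http://localhost:6060/debug/pprof/allocs

Interpreting the output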

(pprof) top
Showing nodes accounting for 512MB, 80% of 640MB total
      flat  flat%   sum%        cum   cum%
     250MB 39.06% 39.06%      250MB 39.06%  bytes.Buffer.grow
     180MB 28.12% 67.19%      180MB 28.12%  strings.Builder.Grow

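Four sample types

# "heap" stands for the profile file or URL; -sample_index=<name> is the equivalent long form
go tool pprof -alloc_space  heap   # cumulative bytes allocated
go tool pprof -alloc_objects heap  # cumulative objects allocated
go tool pprof -inuse_space  heap   # bytes currently in use
go tool pprof -inuse_objects heap  # objects currently in use

Goal	Sample type
Find memory leaks	`-inuse_space` (who keeps holding memory without releasing it)
Find GC pressure	`-alloc_objects` (who keeps allocating, even if the memory is freed right away)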

Benchmarks with a memory profile

go test -bench=. -benchmem -memprofile=mem.prof ./pkg
go tool pprof mem.prof

# -benchmem adds B/op and allocs/op to the output

BenchmarkParseJSON-8    100000   12000 ns/op    1024 B/op    8 allocs/op

Optimization target: reducing allocs/op usually matters more than reducing B/op, because every heap allocation adds allocator and GC work.
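
A quick way to see how a code change affects allocation counts, outside a full benchmark, is testing.AllocsPerRun. The sketch below compares growing a slice by repeated append against preallocating its capacity; growByAppend, preallocate, and allocsPerCall are hypothetical names chosen for illustration, and the exact counts depend on the Go version.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so the compiler cannot optimize the work away.
var sink []int

// growByAppend lets append reallocate the backing array as the slice grows.
func growByAppend(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// preallocate reserves the full capacity up front, so append never reallocates.
func preallocate(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// allocsPerCall reports the average number of heap allocations per call of f(n).
func allocsPerCall(f func(int) []int, n int) float64 {
	return testing.AllocsPerRun(100, func() { sink = f(n) })
}

func main() {
	fmt.Printf("append:      %.0f allocs/op\n", allocsPerCall(growByAppend, 1000))
	fmt.Printf("preallocate: %.0f allocs/op\n", allocsPerCall(preallocate, 1000))
}
```

The preallocated version should report far fewer allocations per call, which is the same signal `-benchmem` gives in the allocs/op column.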

Block and Mutex Profile

Block profile: find goroutines blocked on channels / select

import "runtime"

func init() {
    // The rate is in nanoseconds of blocking per sample.
    // 1 = record every blocking event (testing); in production use 100000 or more.
    runtime.SetBlockProfileRate(1)
}

go tool pprof http://localhost:6060/debug/pprof/block

Shows where goroutines block on channel operations, <-ctx.Done(), sync.WaitGroup.Wait, and similar.

Mutex profile: find lock contention hot spots

runtime.SetMutexProfileFraction(1) // 1 = record every contention event (testing)

go tool pprof http://localhost:6060/debug/pprof/mutex

Shows which mutexes are most contended.
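
For a program without an HTTP endpoint, the mutex profile can also be written to a file directly. The following is a rough sketch under that assumption: contendedCount and mutexProfile are illustrative names, and the workload exists only to create contention on a single lock.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contendedCount increments one counter from several goroutines
// that all fight over the same mutex.
func contendedCount(workers, perWorker int) int {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

// mutexProfile records every contention event while workload runs
// and returns the encoded profile.
func mutexProfile(workload func()) ([]byte, error) {
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev) // restore the previous setting

	workload()

	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := mutexProfile(func() { contendedCount(8, 100_000) })
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("mutex.prof", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("wrote %d bytes to mutex.prof\n", len(data))
}
```

The resulting file can be opened with `go tool pprof mutex.prof`; the block profile works the same way with runtime.SetBlockProfileRate and pprof.Lookup("block").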

When to use which

CPU not saturated but latency is high → Block / Mutex profile
Goroutine count exploding → goroutine profile
Pure computation is slow → CPU profile

Integrating Benchmarks with pprof

The most common profiling workflow: write a benchmark, then use it to find hot spots.

Full workflow

// pkg/parser/parser_test.go
package parser

import "testing"

func BenchmarkParse(b *testing.B) {
    input := loadTestData() // setup cost, excluded from timing by ResetTimer
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        Parse(input)
    }
}

# 1. run the benchmark and write profiles
go test -bench=BenchmarkParse -cpuprofile=cpu.prof -memprofile=mem.prof ./pkg/parser

# 2. inspect CPU hot spots
go tool pprof -http=:8080 cpu.prof

# 3. inspect memory allocation
go tool pprof -http=:8081 mem.prof

Comparing before and after an optimization

# before the code change
go test -bench=. -count=5 -benchmem > old.txt

# after the code change
go test -bench=. -count=5 -benchmem > new.txt

# statistically sound comparison
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt

Output:

            old time/op    new time/op    delta
Parse-8     12.0µs ± 1%    8.0µs ± 1%    -33.3%
            old allocs/op  new allocs/op  delta
Parse-8     8.0 ± 0%       3.0 ± 0%      -62.5%

Always pass `-benchmem`

go test -bench=. -benchmem ./...

Without it you only see time; with it you also see B/op and allocs/op, and many performance problems trace back to allocation.

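runtime/trace: Microsecond-Level Event Tracing

pprof is sampling-based; trace is event-based. A trace shows:

When each goroutine starts, blocks, and is switched
When the GC runs and for how long
Network / syscall blocking

Enabling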

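import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    // program logic
}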

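Over HTTP

import _ "net/http/pprof"
// GET /debug/pprof/trace?seconds=5 captures a 5-second trace

Viewing a trace

go tool trace trace.out

This opens a browser with:

Goroutine analysis: the lifecycle of each goroutine
Network blocking profile
Synchronization blocking profile
Syscall blocking profile
Scheduler latency

When to use trace

Latency problems that pprof cannot explain
You want to know how often goroutines are switched
Analyzing GC impact
Distribution of lock wait time

Traces are large, so capture them only in development or for short windows in production.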

Worked Examples

Example 1: finding CPU hot spots in a web service

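package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    // Business server on its own mux, so the public port does not
    // also serve the /debug/pprof handlers from http.DefaultServeMux.
    api := http.NewServeMux()
    api.HandleFunc("/api/process", processHandler)
    go func() {
        log.Fatal(http.ListenAndServe(":8080", api))
    }()

    // pprof server (internal only): DefaultServeMux carries /debug/pprof/*
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}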

# 1. load test (generate load)
hey -z 30s -c 10 http://localhost:8080/api/process

# 2. meanwhile, capture a 30-second CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# 3. inspect top and the call graph (use -http for the flame graph view)
(pprof) top
(pprof) web

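Example 2: finding a memory leak

Symptom: the service's memory keeps growing the longer it runs, and a leak is suspected.

# Step 1: after the service has run for a while, take the first heap snapshot
curl http://localhost:6060/debug/pprof/heap > heap1.prof

# Step 2: let it run longer, then take a second snapshot
curl http://localhost:6060/debug/pprof/heap > heap2.prof

# Step 3: compare the growth
go tool pprof -base heap1.prof heap2.prof
(pprof) top  # find which functions keep growing inuse_space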
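
When the process has no HTTP endpoint, the same two snapshots can be written from inside the program. This is a minimal sketch, not the only way to do it: writeHeapSnapshot is a hypothetical helper, and the retained slice merely simulates a leak between the two snapshots.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// retained simulates a leak: memory that stays reachable forever.
var retained [][]byte

// writeHeapSnapshot writes the current heap profile to path.
func writeHeapSnapshot(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Run a GC first so the profile reflects up-to-date live-heap statistics.
	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := writeHeapSnapshot("heap1.prof"); err != nil {
		log.Fatal(err)
	}

	// Simulated leak: 64 MiB that is never released.
	for i := 0; i < 64; i++ {
		retained = append(retained, make([]byte, 1<<20))
	}

	if err := writeHeapSnapshot("heap2.prof"); err != nil {
		log.Fatal(err)
	}
}
```

Comparing the two files with `go tool pprof -base heap1.prof heap2.prof` should then attribute the growth to the allocation site in main.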

Example 3: optimizing a hot path

Scenario: pprof shows JSON parsing taking 40% of CPU.

// Before
func ParseRequests(data []byte) ([]Request, error) {
    var reqs []Request
    err := json.Unmarshal(data, &reqs)
    return reqs, err
}

func BenchmarkParseRequests(b *testing.B) {
    data := generateTestData(1000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ParseRequests(data)
    }
}

$ go test -bench=. -benchmem -cpuprofile=cpu.prof
BenchmarkParseRequests-8    100   12000000 ns/op   2048000 B/op   3000 allocs/op

Look at cpu.prof:

(pprof) top
     5.0s  41.67%  json.parseString
     2.0s  16.67%  reflect.Value.Set

Improvement options: switch to jsoniter, write a custom parser, or reuse decoder buffers.

// After: use jsoniter
import jsoniter "github.com/json-iterator/go"

var jsonp = jsoniter.ConfigCompatibleWithStandardLibrary

func ParseRequests(data []byte) ([]Request, error) {
    var reqs []Request
    err := jsonp.Unmarshal(data, &reqs)
    return reqs, err
}

$ benchstat old.txt new.txt
            old time/op    new time/op    delta
ParseRequests-8  12.0ms ± 1%    7.5ms ± 1%    -37.5%

Example 4: detecting goroutine leaks

# take a goroutine snapshot
curl "http://localhost:6060/debug/pprof/goroutine?debug=1"

Many goroutines stuck at the same stack trace usually means a channel nobody reads from or a context that is never canceled.

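Best Practices

1. Run race tests on all concurrent code

# required in CI
go test -race ./...

2. Always mount a pprof endpoint on production services (internal only)

go func() {
    http.ListenAndServe("localhost:6060", nil) // ⭐ bind to localhost only
}()

Security: never expose pprof publicly (it leaks memory addresses and stack information).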
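
One way to keep the pprof handlers off the public port is to give them their own mux. The sketch below is an assumption-laden illustration: newDebugMux, newAPIMux, and statusOf are made-up names, and statusOf uses in-memory requests only to show which routes each mux answers. Note that importing net/http/pprof still registers its handlers on http.DefaultServeMux, so the public server must not use the default mux.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves only the pprof handlers.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// newAPIMux returns the public mux; it has no pprof routes.
func newAPIMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/process", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// statusOf sends an in-memory GET request to h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("debug mux /debug/pprof/:", statusOf(newDebugMux(), "/debug/pprof/"))
	fmt.Println("api mux   /debug/pprof/:", statusOf(newAPIMux(), "/debug/pprof/"))
	fmt.Println("api mux   /api/process: ", statusOf(newAPIMux(), "/api/process"))
}
```

In a real service the two muxes would be passed to separate listeners, for example `http.ListenAndServe("localhost:6060", newDebugMux())` for the internal port and `http.ListenAndServe(":8080", newAPIMux())` for the public one.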

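3. Always run benchmarks with `-benchmem`

go test -bench=. -benchmem ./...

Allocation counts often reveal performance problems that timings alone hide.

4. Use benchstat instead of eyeballing

benchstat old.txt new.txt
# reports statistical significance so noise does not mislead you

5. Profile in a production-like environment

Dev machines differ a lot from production (CPU, memory, Linux kernel). For important services, profile in staging as well.

6. Find the bottleneck before optimizing

Measure → find the bottleneck → optimize → measure again

Do not optimize on intuition; let the profile tell you what is slow.

7. Find allocation sources

import "runtime"

func init() {
    runtime.MemProfileRate = 1 // record every allocation
}

Do not use this setting in production (high overhead); it is fine for benchmarks.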

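8. Match `GOMAXPROCS` to the actual CPU limit

In containerized environments:

# Before Go 1.25, GOMAXPROCS inside a container defaults to the host CPU count.
# With a cgroup CPU limit of 2, running one P per host core only adds scheduling overhead.
# `go env` does not report GOMAXPROCS; read the effective value in-process with runtime.GOMAXPROCS(0).

With automaxprocs (Uber):

import _ "go.uber.org/automaxprocs"

It sets GOMAXPROCS automatically from the cgroup limit. Go 1.25 and later apply the cgroup CPU limit by default on Linux, so the package matters mainly on older toolchains.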
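
To check what the runtime actually uses inside a container, the value can be printed from the program itself. This is a small sketch; schedulerInfo is an illustrative name.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo reports the number of logical CPUs visible to the process
// and the current GOMAXPROCS setting.
func schedulerInfo() (numCPU, maxProcs int) {
	// GOMAXPROCS(0) queries the current value without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	numCPU, maxProcs := schedulerInfo()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", numCPU, maxProcs)
}
```

If GOMAXPROCS is much larger than the container's CPU limit, that mismatch is worth fixing before profiling anything else.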

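Common Problems

Problem 1: the race detector did not catch a race

Possible causes:

The racy path was never executed (insufficient test coverage)
The race is rare (run more times or add load)

Fixes:

Add stress tests
Run go test -race -count=100 to repeat the tests
Make sure critical code paths have tests

Problem 2: pprof shows nothing useful

Possible causes:

Sampling window too short (30 seconds or more is more reliable)
Not enough load (generate load with hey or vegeta)
CPU is not the bottleneck (try the block / mutex profile)

Problem 3: production CPU rises but pprof does not show why

Possible cause: GC pressure (too many allocations)

Fix:

# enable the GC trace
GODEBUG=gctrace=1 ./myapp
# prints details for every GC cycle

If the GC runs very frequently, dig into the memory profile to find allocation hot spots.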
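
GC activity can also be measured around a specific piece of code with runtime.ReadMemStats. The sketch below assumes a hypothetical helper, measureGC, and a synthetic workload, churn, that allocates and immediately drops buffers; the numbers it prints vary by machine and Go version.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink forces each buffer onto the heap.
var sink []byte

// gcDelta summarizes allocator and GC activity during a workload.
type gcDelta struct {
	Cycles     uint32 // completed GC cycles
	AllocBytes uint64 // bytes allocated on the heap
	Mallocs    uint64 // heap objects allocated
}

// measureGC runs workload and returns the change in GC statistics.
func measureGC(workload func()) gcDelta {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	workload()
	runtime.ReadMemStats(&after)
	return gcDelta{
		Cycles:     after.NumGC - before.NumGC,
		AllocBytes: after.TotalAlloc - before.TotalAlloc,
		Mallocs:    after.Mallocs - before.Mallocs,
	}
}

// churn allocates many short-lived 64 KiB buffers, creating GC pressure.
func churn() {
	for i := 0; i < 2000; i++ {
		sink = make([]byte, 64<<10)
	}
}

func main() {
	d := measureGC(churn)
	fmt.Printf("GC cycles=%d allocated=%d bytes mallocs=%d\n",
		d.Cycles, d.AllocBytes, d.Mallocs)
}
```

A high cycle count relative to the work done points at the same problem gctrace reveals, and the memory profile with `-alloc_objects` then shows where the allocations come from.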

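Problem 4: goroutine count explodes

Symptom: runtime.NumGoroutine() keeps rising

Check:

curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

Look at the goroutine stacks. Common causes:

A channel nobody reads from → the sender blocks
A context that is never canceled
An infinite loop with no exit condition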
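
The first two causes often appear together: a producer keeps sending after its reader has gone away. The following sketch shows one common fix, selecting on ctx.Done() in the producer and canceling when the reader is finished. The names generate and firstN are illustrative.

```go
package main

import (
	"context"
	"fmt"
)

// generate sends 0, 1, 2, ... on out until ctx is canceled.
// Without the ctx.Done() case, the goroutine would block forever on
// "out <- i" once the reader stops receiving, which is a goroutine leak.
func generate(ctx context.Context, out chan<- int) {
	for i := 0; ; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return
		}
	}
}

// firstN reads n values and then shuts the producer down cleanly.
func firstN(n int) []int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	out := make(chan int)
	done := make(chan struct{})
	go func() {
		defer close(done)
		generate(ctx, out)
	}()

	values := make([]int, 0, n)
	for len(values) < n {
		values = append(values, <-out)
	}

	cancel() // signal the producer to exit
	<-done   // wait until it has actually exited
	return values
}

func main() {
	fmt.Println(firstN(3)) // prints "[0 1 2]"
}
```

Waiting on done is optional in production code, but it makes the shutdown observable, which is handy in tests that assert no goroutines are left behind.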

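Problem 5: your own functions are missing from the pprof flame graph

Possible causes:

Inlining (the compiler inlined the function, so it has no separate frame)
PGO (profile-guided optimization) effects

Fix:

# disable inlining to see frames clearly
go test -gcflags="-l" -bench=. -cpuprofile=cpu.prof

Do not build for production this way.

Problem 6: runtime/trace files are too large

Fixes:

Trace only the critical path: call trace.Start / trace.Stop around it, and label it with trace.WithRegion / trace.Log so it is easy to find
Shorten the capture window
When full event detail is not needed, use sampling-based pprof profiles instead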
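
Limiting a trace to one code path can look like the sketch below. It is an illustration under stated assumptions: traceCriticalPath and criticalPath are made-up names, and the task and region labels ("criticalPath", "compute") are arbitrary strings that show up as annotations in `go tool trace`.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"os"
	"runtime/trace"
)

// criticalPath is the code we care about; the region and log entry
// make it easy to locate in the trace viewer.
func criticalPath(ctx context.Context) int {
	sum := 0
	trace.WithRegion(ctx, "compute", func() {
		for i := 0; i < 1_000_000; i++ {
			sum += i % 3
		}
	})
	trace.Log(ctx, "result", fmt.Sprint(sum))
	return sum
}

// traceCriticalPath records a trace only while criticalPath runs
// and returns the trace bytes together with the computed result.
func traceCriticalPath() ([]byte, int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, 0, err
	}
	ctx, task := trace.NewTask(context.Background(), "criticalPath")
	sum := criticalPath(ctx)
	task.End()
	trace.Stop() // flushes the remaining trace data into buf
	return buf.Bytes(), sum, nil
}

func main() {
	data, sum, err := traceCriticalPath()
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("sum=%d, wrote %d bytes to trace.out\n", sum, len(data))
}
```

Because tracing is active only between trace.Start and trace.Stop, the file size scales with the critical path rather than with the lifetime of the process.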

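Summary

Key Points

Tool map:
- race detector → concurrency correctness
- pprof CPU → function time
- pprof Memory → allocation / leaks
- pprof Block / Mutex → channel / lock contention
- runtime/trace → microsecond-level events

Performance debugging workflow:

Measure: benchmark or production pprof
Locate: top / flame graph
Understand: list source / trace
Improve: reduce allocations, pick the right data structures
Verify: compare with benchstat

Cheat Sheet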

# Race detector
go test -race ./...
go run -race main.go

# Benchmark
go test -bench=. -benchmem ./pkg
go test -bench=. -count=5 -benchmem > out.txt
benchstat old.txt new.txt

# CPU profile
go test -bench=. -cpuprofile=cpu.prof
go tool pprof -http=:8080 cpu.prof

# Memory profile
go test -bench=. -memprofile=mem.prof
go tool pprof -http=:8080 mem.prof

# Live server pprof
import _ "net/http/pprof"
http.ListenAndServe("localhost:6060", nil)

# fetch over HTTP
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Trace
go test -bench=. -trace=trace.out
go tool trace trace.out

# GC trace
GODEBUG=gctrace=1 ./myapp

Which tool for which symptom

Symptom	Tool
Corrupted data under concurrency	`-race`
CPU at 100% and slow	CPU profile
Memory keeps growing	Memory profile (`-inuse_space`)
GC runs too often	Memory profile (`-alloc_objects`) + `GODEBUG=gctrace=1`
High latency but low CPU	Block profile + Trace
Lock contention	Mutex profile
Goroutine explosion	Goroutine profile
Scheduler behavior	runtime/trace

References

Profiling Go Programs (Go blog): https://go.dev/blog/pprof
Diagnostics (Go documentation): https://go.dev/doc/diagnostics
runtime/trace: https://pkg.go.dev/runtime/trace
Brendan Gregg, Flame Graphs: https://www.brendangregg.com/flamegraphs.html
benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
automaxprocs: https://github.com/uber-go/automaxprocs

Created: 2026-05-16 Last updated: 2026-05-16
