Debugging performance issues in Go programs
Let's assume you have a Go program and want to improve its performance. There are several tools available that can help with this task. These tools can help you identify various types of hotspots (CPU, IO, memory); hotspots are the places that you need to concentrate on in order to significantly improve performance. However, another outcome is possible -- the tools can help you identify obvious performance defects in the program. For example, you prepare an SQL statement before each query while you could prepare it once at program startup. Another example is an O(N^2) algorithm that somehow slipped in where an obvious O(N) one exists and is expected. In order to identify such cases you need to sanity-check what you see in profiles. For example, in the first case significant time spent in SQL statement preparation would be the red flag.
It's also important to understand the various bounding factors for performance. For example, if the program communicates via a 100 Mbps network link and already utilizes >90 Mbps, there is not much you can do with the program to improve its performance. There are similar bounding factors for disk IO, memory consumption and computational tasks.
With that in mind we can look at the available tools.
Note: the tools can interfere with each other. For example, precise memory profiling skews CPU profiles, goroutine blocking profiling affects the scheduler trace, etc. Use tools in isolation to get more precise info.
Note: the formats described here are based on the Go 1.3 release.
CPU Profiler
The Go runtime contains a built-in CPU profiler, which shows what functions consume what percent of CPU time. There are 3 ways you can get access to it:
1. The simplest one is the -cpuprofile flag of the 'go test' command. For example, the following command:
$ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http
will profile the given benchmark and write the CPU profile into the 'cprof' file.
Then:
$ go tool pprof --text http.test cprof
will print a list of the hottest functions.
There are several output types available; the most useful ones are: --text, --web and --list. Run 'go tool pprof' to get the complete list.
The obvious drawback of this option is that it works only for tests.
2. The net/http/pprof package. This is the ideal solution for network servers. You merely need to import net/http/pprof, and collect profiles with:
$ go tool pprof --text mybin http://myserver:6060/debug/pprof/profile
3. Manual profile collection. You need to import runtime/pprof and add the following code to the main function:
if *flagCpuprofile != "" {
	f, err := os.Create(*flagCpuprofile)
	if err != nil {
		log.Fatal(err)
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()
}
The profile will be written to the specified file; visualize it the same way as in the first option.
Here is an example of a profile visualized with the --web option:
You can investigate a single function with the --list=funcname option. For example, the following profile shows that the time was spent in the append function:
.  .  93: func (bp *buffer) WriteRune(r rune) error {
.  .  94:     if r < utf8.RuneSelf {
5  5  95:         *bp = append(*bp, byte(r))
.  .  96:         return nil
.  .  97:     }
.  .  98:
.  .  99:     b := *bp
.  . 100:     n := len(b)
.  . 101:     for n+utf8.UTFMax > cap(b) {
.  . 102:         b = append(b, 0)
.  . 103:     }
.  . 104:     w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r)
.  . 105:     *bp = b[:n+w]
.  . 106:     return nil
.  . 107: }
You can find detailed info on the pprof tool and a description of the numbers in the graph here.
There are also 3 special entries that the profiler uses when it can't unwind the stack: GC, System and ExternalCode. GC means time spent during garbage collection; see the Memory Profiler and Garbage Collector Trace sections below for optimization suggestions. System means time spent in the goroutine scheduler, stack management code and other auxiliary runtime code. ExternalCode means time spent in native dynamic libraries.
Here are some hints with respect to how to interpret what you see in the profile.
If you see lots of time spent in the runtime.mallocgc function, the program potentially makes an excessive number of small memory allocations. The profile will tell you where the allocations are coming from. See the Memory Profiler section for suggestions on how to optimize this case.
If lots of time is spent in channel operations, sync.Mutex code and other synchronization primitives or the System component, the program probably suffers from contention. Consider restructuring the program to eliminate frequently accessed shared resources. Common techniques for this include sharding/partitioning, local buffering/batching and the copy-on-write technique.
If lots of time is spent in syscall.Read/Write, the program potentially makes an excessive number of small reads and writes. bufio wrappers around os.File or net.Conn can help in this case.
If lots of time is spent in the GC component, the program either allocates too many transient objects or the heap size is very small so garbage collections happen too frequently. See the Garbage Collector Trace and Memory Profiler sections for optimization suggestions.
Note: the CPU profiler currently does not work on darwin.
Note: on Windows you need to install Cygwin, Perl and Graphviz to generate svg/web profiles.
Note: on Linux you can also try the perf system profiler. It can't unwind Go stacks, but it can profile and unwind cgo/SWIG code and the kernel. So it can be useful to get insights into native/kernel performance bottlenecks.
Memory Profiler
The memory profiler shows what functions allocate heap memory. You can collect it in similar ways as a CPU profile: with 'go test --memprofile', with net/http/pprof via http://myserver:6060/debug/pprof/heap or by calling runtime/pprof.WriteHeapProfile.
You can visualize only allocations live at the time of profile collection (--inuse_space flag to pprof, default), or all allocations that have happened since program start (--alloc_space flag to pprof). The former is useful for profiles collected with net/http/pprof on live applications; the latter is useful for profiles collected at program end (otherwise you will see an almost empty profile).
Note: the memory profiler is sampling; that is, it collects information only about some subset of memory allocations. The probability of sampling an object is proportional to its size. You can change the sampling rate with the go test --memprofilerate flag, or by setting the runtime.MemProfileRate variable at program startup. A rate of 1 will lead to collection of information about all allocations, but it can slow down execution. The default sampling rate is 1 sample per 512KB of allocated memory.
You can also visualize the number of bytes allocated or the number of objects allocated (--inuse/alloc_space and --inuse/alloc_objects flags, respectively). The profiler tends to sample larger objects more during profiling. But it's important to understand that large objects affect memory consumption and GC time, while a large number of tiny allocations affects execution speed (and GC time to some degree as well). So it may be useful to look at both.
Objects can be persistent or transient. If you have several large persistent objects allocated at program start, they will most likely be sampled by the profiler (because they are large). Such objects do affect memory consumption and GC time, but they do not affect normal execution speed (no memory management operations happen on them). On the other hand, if you have a large number of objects with very short lifetimes, they can be barely represented in the profile (if you use the default --inuse_space mode). But they do significantly affect execution speed, because they are constantly allocated and freed. So, once again, it may be useful to look at both types of objects.
So, generally, if you want to reduce memory consumption, you need to look at the --inuse_space profile collected during normal program operation. If you want to improve execution speed, look at the --alloc_objects profile collected after significant running time or at program end.
There are several flags that control reporting granularity. --functions makes pprof report at the function level (default). --lines makes pprof report at the source line level, which is useful if hot functions allocate on different lines. And there are also --addresses and --files for the exact instruction address and file level, respectively.
There is a useful option for the memory profile -- you can look at it right in the browser (provided that you imported net/http/pprof). If you open http://myserver:6060/debug/pprof/heap?debug=1 you should see a heap profile along the lines of:
heap profile: 4: 266528 [123: 11284472] @ heap/1048576
1: 262144 [4: 376832] @ 0x28d9f 0x2a201 0x2a28a 0x2624d 0x26188 0x94ca3 0x94a0b 0x17add6 0x17ae9f 0x1069d3 0xfe911 0xf0a3e 0xf0d22 0x21a70
#	0x2a201		cnew+0xc1				runtime/malloc.goc:718
#	0x2a28a		runtime.cnewarray+0x3a			runtime/malloc.goc:731
#	0x2624d		makeslice1+0x4d				runtime/slice.c:57
#	0x26188		runtime.makeslice+0x98			runtime/slice.c:38
#	0x94ca3		bytes.makeSlice+0x63			bytes/buffer.go:191
#	0x94a0b		bytes.(*Buffer).ReadFrom+0xcb		bytes/buffer.go:163
#	0x17add6	io/ioutil.readAll+0x156			io/ioutil/ioutil.go:32
#	0x17ae9f	io/ioutil.ReadAll+0x3f			io/ioutil/ioutil.go:41
#	0x1069d3	godoc/vfs.ReadFile+0x133		godoc/vfs/vfs.go:44
#	0xfe911		godoc.func·023+0x471			godoc/meta.go:80
#	0xf0a3e		godoc.(*Corpus).updateMetadata+0x9e	godoc/meta.go:101
#	0xf0d22		godoc.(*Corpus).refreshMetadataLoop+0x42	godoc/meta.go:141

2: 4096 [2: 4096] @ 0x28d9f 0x29059 0x1d252 0x1d450 0x106993 0xf1225 0xe1489 0xfbcad 0x21a70
#	0x1d252		newdefer+0x112				runtime/panic.c:49
#	0x1d450		runtime.deferproc+0x10			runtime/panic.c:132
#	0x106993	godoc/vfs.ReadFile+0xf3			godoc/vfs/vfs.go:43
#	0xf1225		godoc.(*Corpus).parseFile+0x75		godoc/parser.go:20
#	0xe1489		godoc.(*treeBuilder).newDirTree+0x8e9	godoc/dirtrees.go:108
#	0xfbcad		godoc.func·002+0x15d			godoc/dirtrees.go:100