Managing Goroutines
It’s surprisingly easy to start goroutines. Unfortunately, it isn’t quite as easy to orchestrate their cleanup. Avoiding deadlocks is also challenging. Most often this boils down to an ordering problem, where a goroutine receiving on a go-chan exits before the upstream goroutines sending on it.
Why care at all though? It’s simple: an orphaned goroutine is a memory leak. Memory leaks in long-running daemons are bad, especially when the expectation is that your process will be stable when all else fails.
To further complicate things, a typical nsqd process has many goroutines involved in message delivery. Internally, message “ownership” changes often. To be able to shut down cleanly, it’s incredibly important to account for all intraprocess messages.
Although there aren’t any magic bullets, the following techniques make it a little easier to manage…
WaitGroups
The sync package provides sync.WaitGroup, which can be used to perform accounting of how many goroutines are live (and provide a means to wait on their exit).
To reduce the typical boilerplate, nsqd uses this wrapper:
    type WaitGroupWrapper struct {
        sync.WaitGroup
    }

    func (w *WaitGroupWrapper) Wrap(cb func()) {
        w.Add(1)
        go func() {
            cb()
            w.Done()
        }()
    }

    // can be used as follows:
    wg := WaitGroupWrapper{}
    wg.Wrap(func() { n.idPump() })
    ...
    wg.Wait()
Exit Signaling
The easiest way to trigger an event in multiple child goroutines is to provide a single go-chan that you close when ready. All pending receives on that go-chan will activate, rather than having to send a separate signal to each goroutine.
    func work() {
        exitChan := make(chan int)
        go task1(exitChan)
        go task2(exitChan)
        time.Sleep(5 * time.Second)
        close(exitChan)
    }

    func task1(exitChan chan int) {
        <-exitChan
        log.Printf("task1 exiting")
    }

    func task2(exitChan chan int) {
        <-exitChan
        log.Printf("task2 exiting")
    }
Synchronizing Exit
It was quite difficult to implement a reliable, deadlock-free exit path that accounted for all in-flight messages. A few tips:
- Ideally the goroutine responsible for sending on a go-chan should also be responsible for closing it.
- If messages cannot be lost, ensure that pertinent go-chans are emptied (especially unbuffered ones!) to guarantee senders can make progress.
- Alternatively, if a message is no longer relevant, sends on a single go-chan should be converted to a select with the addition of an exit signal (as discussed above) to guarantee progress.
- The general order should be:
  - Stop accepting new connections (close listeners)
  - Signal exit to child goroutines (see above)
  - Wait on WaitGroup for goroutine exit (see above)
  - Recover buffered data
  - Flush anything left to disk
Logging
Finally, the most important tool at your disposal is to log the entrance and exit of your goroutines! It makes it infinitely easier to identify the culprit in the case of deadlocks or leaks.
nsqd log lines include information to correlate goroutines with their siblings (and parent), such as the client’s remote address or the topic/channel name.
The logs are verbose, but not to the point where they become overwhelming. There’s a fine line, but nsqd leans towards having more information in the logs when a fault occurs rather than trying to reduce chattiness at the expense of usefulness.