Proposal: Low-cost defers through inline code, and extra funcdata to manage the panic case

Author(s): Dan Scales, Keith Randall, and Austin Clements (with input from many others, including Russ Cox and Cherry Zhang)

Last updated: 2019-09-23

Discussion at https://golang.org/issue/34481

General defer performance discussion at https://golang.org/issue/14939.

Abstract

As of Go 1.13, most defer operations take about 35ns (reduced from about 50ns in Go 1.12). In contrast, a direct call takes about 6ns. This gap incentivizes engineers to eliminate defer operations from hot code paths, which takes away time from more productive tasks, leads to less maintainable code (e.g., if a panic is later introduced, the "optimization" is no longer correct), and discourages people from using a language feature when it would otherwise be an appropriate solution to a problem.

We propose a way to make most defer calls no more expensive than open-coding the call, hence eliminating the incentives to shy away from using this language feature to its fullest extent.

Background

Go 1.13 implements the defer statement by calling into the runtime to push a "defer object" onto the defer chain. Then, in any function that contains defer statements, the compiler inserts a call to runtime.deferreturn at every function exit point to unwind that function's defers. Both of these cause overhead: the defer object must be populated with function call information (function pointer, arguments, closure information, etc.) when it is added to the chain, and deferreturn must find the right defers to unwind, copy out the call information, and invoke the deferred calls. Furthermore, this inhibits compiler optimizations like inlining, since the defer functions are called from the runtime using the defer information.

When a function panics, the runtime runs the deferred calls on the defer chain until one of these calls recover or it exhausts the chain, resulting in a fatal panic. The stack itself is not unwound unless a deferred call recovers. This has the important property that examining the stack from a deferred call run during a panic will include the panicking frame, even if the defer was pushed by an ancestor of the panicking frame.

In general, this defer chain is necessary since a function can defer an unbounded or dynamic number of calls that must all run when it returns. For example, a defer statement can appear in a loop or an if block. This also means that, in general, the defer objects must be heap-allocated, though the runtime uses an allocation pool to reduce the cost of the allocation.

This is notably different from exception handling in C++ or Java, where the applicable set of except or finally blocks can be determined statically at every program counter in a function. In these languages, the non-exception case is typically inlined and exception handling is driven by a side table giving the locations of the except and finally blocks that apply to each PC.

However, while Go's defer mechanism permits unbounded calls, the vast majority of functions that use defer invoke each defer statement at most once, and do not invoke defer in a loop. Go 1.13 adds an optimization to stack-allocate defer objects in this case, but they must still be pushed and popped from the defer chain. This applies to 363 out of the 370 static defer sites in the cmd/go binary and speeds up this case by 30% relative to heap-allocated defer objects.

This proposal combines this insight with the insights used in C++ and Java to make the non-panic case of most defer operations no more expensive than the manually open-coded case, while retaining correct panic handling.

Requirements

While the common case of defer handling is simple enough, it can interact in often non-obvious ways with things like recursive panics, recover, and stack traces. Here we attempt to enumerate the requirements that any new defer implementation should likely satisfy, in addition to those in the language specification for Defer statements and Handling panics.

Executing a defer statement logically pushes a deferred function call onto a per-goroutine stack. Deferred calls are always executed starting from the top of this stack (hence in the reverse order of the execution of defer statements). Furthermore, each execution of a defer statement corresponds to exactly one deferred call (except in the case of program termination, where a deferred function may not be called at all).
Defer calls are executed in one of two ways. Whenever a function call returns normally, the runtime starts popping and executing all existing defer calls for that stack frame only (in reverse order of original execution). Separately, whenever a panic (or a call to Goexit) occurs, the runtime starts popping and executing all existing defer calls for the entire defer stack. The execution of any defer call may be interrupted by a panic within the execution of the defer call.
A program may have multiple outstanding panics, since a recursive (second) panic may occur during any of the defer calls being executed during the processing of the first panic. A previous panic is “aborted” if the processing of defers by the new panic reaches the frame where the previous panic was processing defers when the new panic happened. When a defer call returns that did a successful recover that applies to a panic, the stack is immediately unwound to the frame which contains the defer that did the recover call, and any remaining defers in that frame are executed. Normal execution continues in the preceding frame (but note that normal execution may actually be continuing a defer call for an outer panic). Any panic that has not been recovered or aborted must still appear on the caller stack. Note that the first panic may never continue its defer processing, if the second panic actually successfully runs all defer calls, but the original panic must appear on the stack during all the processing by the second panic.
When a defer call is executed because a function is returning normally (whether there are any outstanding panics or not), the call site of a deferred call must appear to be the function that invoked defer to push that function on the defer stack, at the line where that function is returning. A consequence of this is that, if the runtime is executing deferred calls in panic mode and a deferred call recovers, it must unwind the stack immediately after that deferred call returns and before executing another deferred call.
When a defer call is executed because of an explicit panic, the call stack of a deferred function must include runtime.gopanic and the frame that panicked (and its callers) immediately below the deferred function call. As mentioned, the call stack must also include any outstanding previous panics. If a defer call is executed because of a run-time panic, the same condition applies, except that runtime.gopanic does not necessarily need to be on the stack. (In the current gc-go implementation, runtime.gopanic does appear on the stack even for run-time panics.)

Proposal

We propose optimizing deferred calls in functions where every defer is executed at most once (specifically, a defer may be on a conditional path, but is never in a loop in the control-flow graph). In this optimization, the compiler assigns a bit for every defer site to indicate whether that defer had been reached or not. The defer statement itself simply sets the corresponding bit and stores all necessary arguments in specific stack slots. Then, at every exit point of the function, the compiler open-codes each deferred call, protected by (and clearing) each corresponding bit.

For example, the following:

defer f1(a)
if cond {
 defer f2(b)
}
body...

would compile to

deferBits |= 1<<0
tmpF1 = f1
tmpA = a
if cond {
 deferBits |= 1<<1
 tmpF2 = f2
 tmpB = b
}
body...
exit:
if deferBits & 1<<1 != 0 {
 deferBits &^= 1<<1
 tmpF2(tmpB)
}
if deferBits & 1<<0 != 0 {
 deferBits &^= 1<<0
 tmpF1(tmpA)
}

In order to ensure that the value of deferBits and all the tmp variables are available in case of a panic, these variables must be allocated explicit stack slots and the stores to deferBits and the tmp variables (tmpF1, tmpA, etc.) must write the values into these stack slots. In addition, the updates to deferBits in the defer exit code must explicitly store the deferBits value to the corresponding stack slot. This will ensure that panic processing can determine exactly which defers have been executed so far.

However, the defer exit code can still be optimized significantly in many cases. We can refer directly to the deferBits and tmpA ‘values’ (in the SSA sense), and these accesses can therefore be optimized in terms of using existing values in registers, propagating constants, etc. Also, if the defers were called unconditionally, then constant propagation may in some cases to eliminate the checks on deferBits (because the value of deferBits is known statically at the exit point).

If there are multiple exits (returns) from the function, we can either duplicate the defer exit code at each exit, or we can have one copy of the defer exit code that is shared among all (or most) of the exits. Note that any sharing of defer-exit code code may lead to less specific line numbers (which don’t indicate the exact exit location) if the user happens to look at the call stack while in a call made by the defer exit code.

For any function that contains a defer which could be executed more than once (e.g. occurs in a loop), we will fall back to the current way of handling defers. That is, we will create a defer object at each defer statement and push it on to the defer chain. At function exit, we will call deferreturn to execute an active defer objects for that function. We may similarly revert to the defer chain implementation if there are too many defers or too many function exits. Our goal is to optimize the common cases where current defers overheads show up, which is typically in small functions with only a small number of defers statements.

Panic processing

Because no actual defer records have been created, panic processing is quite different and somewhat more complex in the open-coded approach. When generating the code for a function, the compiler also emits an extra set of FUNCDATA information that records information about each of the open-coded defers. For each open-coded defer, the compiler emits FUNCDATA that specifies the exact stack locations that store the function pointer and each of the arguments. It also emits the location of the stack slot containing deferBits. Since stack frames can get arbitrarily large, the compiler uses a varint encoding for the stack slot offsets.

In addition, for all functions with open-coded defers, the compiler adds a small segment of code that does a call to runtime.deferreturn and then returns. This code segment is not reachable by the main code of the function, but is used to unwind the stack properly when a panic is successfully recovered.

To handle a panic, the runtime conceptually walks the defer chain in parallel with the stack in order to interleave execution of pushed defers with defers in open-coded frames. When the runtime encounters an open-coded frame F executing function f, it executes the following steps.

The runtime reads the funcdata for function f that contains the open-defer information.
Using the information about the location in frame F of the stack slot for deferBits, the runtime loads the current value of deferBits for this frame. The runtime processes each of the active defers, as specified by the value of deferBits, in reverse order.
For each active defer, the runtime loads the function pointer for the defer call from the appropriate stack slot. It also builds up an argument frame by copying each of the defer arguments from its specified stack slot to the appropriate location in the argument frame. It then updates deferBits in its stack slot after masking off the bit for the current defer. Then it uses the function pointer and argument frame to call the deferred function.
If the defer call returns normally without doing a recovery, then the runtime continues executing active defer calls for frame F until all active defer calls have finished.
If any defer call returns normally but has done a successful recover, then the runtime stops processing defers in the current frame. There may or may not be any remaining defers to process. The runtime then arranges to jump to the deferreturn code segment and unwind the stack to frame F, by simultaneously setting the PC to the address of the deferreturn segment and setting the SP to the appropriate value for frame F. The deferreturn code segment then calls back into the runtime. The runtime can now process any remaining active defers from frame F. But for these defers, the stack has been appropriately unwound and the defer appears to be called directly from function f. When all defers for the frame have finished, the deferreturn finishes and the code segment returns from frame F to continue execution.

If a deferred call in step 3 itself panics, the runtime starts its normal panic processing again. For any frame with open-coded defers that has already run some defers, the deferBits value at the specified stack slot will always accurately reflect the remaining defers that need to be run.

The processing for runtime.Goexit is similar. The main difference is that there is no panic happening, so there is no need to check for or do special handling for recovery in runtime.Goexit. A panic could happen while running defer functions for runtime.Goexit, but that will be handled in runtime.gopanic.

Rationale

One other approach that we extensively considered (and prototyped) also has inlined defer code for the normal case, but actual executes the defer exit code directly even in the panic case. Executing the defer exit code in the panic case requires duplication of stack frame F and some complex runtime code to start execution of the defer exit code using this new duplicated frame and to regain control when the defer exit code complete. The required runtime code for this approach is much more architecture-dependent and seems to be much more complex (and possibly fragile).

Compatibility

This proposal does not change any user-facing APIs, and hence satisfies the compatibility guidelines.

Implementation

An implementation has been mostly done. The change is here. Comments on the design or implementation are very welcome.

Some miscellaneous implementation details:

We need to restrict the number of defers in a function to the size of the deferBits bitmask. To minimize code size, we currently make deferBits to be 8 bits, and don’t do open-coded defers if there are more than 8 defers in a function. If there are more than 8 defers in a function, we revert to the standard defer chain implementation.
The deferBits variable and defer arguments variables (such as tmpA) must be declared (via OpVarDef) in the entry block, since the unconditional defer exit code at the bottom of the function will access them, so these variables are live throughout the entire function. (And, of course, they can be accessed by panic processing at any point within the function that might cause a panic.) For any defer argument stack slots that are pointers (or contain pointers), we must initialize those stack slots to zero in the entry block. The initialization is required for garbage collection, which doesn’t know which of these defer arguments are active (i.e. which of the defer sites have been reached, but the corresponding defer call has not yet happened)
Because the deferreturn code segment is disconnected from the rest of the function, it would not normally indicate that any stack slots are live. However, we want the liveness information at the deferreturn call to indicate that all of the stack slots associated with defers (which may include pointers to variables accessed by closures) and all of the return values are live. We must explicitly set the liveness for the deferreturn call to be the same as the liveness at the first defer call on the defer exit path.