PJ Malloy (@thepudds)
January 2025
Discussion at https://go.dev/issue/74299.
CL stack at https://go.dev/cl/673695.
This covers a design and initial implementation for a mechanism
to free user memory eagerly to the Go runtime. The freed memory can be reused
immediately in subsequent allocations without needing to wait for the GC cycle to progress.
The mechanism is implemented in a new runtime function, runtime.freegc, which is inserted
by the compiler in cases where it can prove the memory is no longer used.
A call to runtime.freegc records memory as dead and tracks it for
later reuse. The allocator is updated to be able to safely
reuse these heap objects, and care is taken so that runtime.freegc is able
to coexist and cooperate with the concurrently running garbage collector.
Updates to the compiler include automatically inserting calls
to runtime.freegc such as when the compiler proves heap allocated memory
does not live past a function return and therefore may be safely freed
on function exit.
The compiler is also updated to do a form of alias analysis so that even escaping heap memory can be immediately freed once it is dead in some cases, such as the intermediate memory for appends of a non-aliased slice in a loop.
When enabled, runtime.freegc allows faster reuse of user memory,
reducing the allocation rate from the GC's perspective, and thus
reducing the frequency of GC cycles, total GC CPU usage, and how often
write barriers are enabled. It can also produce a more cache-friendly
allocation pattern for user code.
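To make the kind of garbage this targets concrete, here is a small standalone illustration (not part of the design) that uses testing.AllocsPerRun to count the intermediate backing arrays an ordinary append loop produces today; with runtime.freegc, those intermediates could be reused immediately rather than waiting for a GC cycle:

```go
package main

import (
	"fmt"
	"testing"
)

// buildSlice appends n ints one at a time, forcing several
// backing-array growths; each old backing array becomes garbage
// that today must wait for the GC to be reclaimed.
func buildSlice(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	allocs := testing.AllocsPerRun(100, func() {
		_ = buildSlice(1000)
	})
	// Multiple allocations per run: one per growth step.
	fmt.Printf("allocations per run: %.0f\n", allocs)
}
```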
This design composes well with the compiler's historic ability to do stack allocation of statically-known sizes by filling in gaps that stack allocation cannot currently handle. It also composes well with possible future improvements, such as an enhancement where w.Write(b) would allow b to be stack allocated, but which would be limited to helping only statically-sized slices unless there is an ability to hook into something like runtime.freegc. Other improvements also become more broadly applicable and hence more worthwhile if runtime.freegc exists. There is also a reasonably rich group of direct follow-on targets for runtime.freegc
that likely become available, and it becomes another tool in the toolbox for
the compiler and runtime.
Today, users can implement their own free lists and achieve some of the memory reuse
benefits described here. However, runtime.freegc is implemented at a low level of
the runtime and has various advantages like being able to reuse memory across types
more safely and easily, more directly hook into what the GC is doing, more efficiently
track reusable objects, and even change the GC shape of reusable objects, which is
something even unsafe user code cannot do.
Updating the compiler translates to the reuse being both safer and more automated than hand-written code using a custom free list.
And of course, the goal is to speed up safe, idiomatic Go code without requiring a variety of ad-hoc user written free lists.
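For comparison, here is a sketch of the kind of hand-written reuse users resort to today (a typical sync.Pool-based free list; the names here are illustrative, not part of this proposal). It works, but it is type-specific, manual, and easy to get wrong, since the caller must remember to return objects and must not retain aliases afterward:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool is a hand-rolled free list. Reuse only works for this one
// type, and correctness depends entirely on caller discipline.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func process(data string) int {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	b.WriteString(data)
	n := b.Len()
	bufPool.Put(b) // caller must not use b after Put
	return n
}

func main() {
	fmt.Println(process("hello")) // prints 5
}
```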
In an attempt to help triangulate on what is possible for the runtime and compiler to do together to reduce how much work the GC must do, this document proposes updates to:
The runtime: implement runtime.freegc and associated runtime entry-points.
Freed objects are recorded in one of two types of new free lists, and
are then immediately reusable for the next allocation in the same span class
via updates to the allocator.
The compiler: automatically insert runtime calls to track and free slices
allocated via make when the compiler can prove it is safe to do so. Also implemented
is automatically inserting calls to free the intermediate memory
from repeated calls to append in a loop (without freeing the final slice) when it
can prove the memory is unaliased. This is done regardless of whether
the final slice escapes (for example, even if the final slice is returned from
the function that built the slice using append in a loop).
A sample benchmark included below shows strings.Builder operating
roughly 2x faster.
An experimental modification to reflect to use runtime.freegc
and then using that reflect with json/v2 gave reported memory
allocation reductions of -43.7%, -32.9%, -21.9%, -22.0%, -1.0%
for the 5 official real-world unmarshalling benchmarks from
go-json-experiment/jsonbench by the authors of json/v2, covering
the CanadaGeometry through TwitterStatus datasets.
A separate CL updates the builtin maps so that, during growth and split operations, logically dead table memory that is not held by a map iterator is freed.
Note: there is no current intent to modify the standard library to have
explicit calls to runtime.freegc, and of course such an ability
would never be exposed to end-user code. The reflect and other standard
library modifications were experiments that were helpful while designing
and implementing the runtime side prior to some of the compiler work.
Additional steps that are natural follow-ons (but not yet attempted) include:
Recognizing append in loops that start with a slice of unknown provenance,
such as a function parameter, while still being able to automatically
free intermediate memory starting with the second slice growth if the slice
does not otherwise become aliased.
Recognizing append in an iterator-based loop to be able to prove it is
safe to free unaliased intermediate memory, such as in the patterns
shown in slices.Collect. (Update: this is now implemented but not yet in review.)
Possibly extend to freeing more than just slices. Slices are a natural first step, but the large majority of the work implemented is independent of whether the heap object represents a slice. For example, on function exit, we likely will be able to free the table memory for non-escaping maps.
There are currently four primary runtime entry-points with working implementations, plus some additional variations on these that are not listed here. There are additional details with more draft API doc towards the end of this document.
freegc is the primary entry-point for freeing user memory. It is emitted directly
by the compiler and also underlies the implementation of the other free* runtime
entry-points.
func freegc(ptr unsafe.Pointer, size uintptr, noscan bool)
growsliceNoAlias is used during an append. It is like the existing growslice,
but only for the case where we know there are no aliases for
the backing memory of the slice, and hence we can free the old backing
memory as the slice is updated with new, larger backing memory. Importantly, this
works even if the slice escapes, such as if the resulting slice is returned
from a function. growsliceNoAlias does both the grow and the free.
func growsliceNoAlias(oldPtr *any, newLen, oldCap, num int, et *byte) (ary []any)
makeslicetracked64 (and variations) allocates a heap object and records
or "tracks" the allocated object's pointer. makeslicetracked64 is like
the existing makeslice64, but fills in a trackedObjs for later
use by freegcTracked. The compiler is responsible for reserving stack space
for a properly sized backing array for trackedObj based on how many objects
will be tracked, which gives the runtime the space it needs to write down
its tracking information. (An alternative not yet attempted might be to use
the standard makeslice64 and have compiler-generated code update the tracking
information.)
func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
freegcTracked is automatically inserted by the compiler in cases where we know the
lifetime of a heap-allocated object is tied to a certain scope (usually a function),
which allows for tracked heap objects to be automatically freed as a scope is exited.
freegcTracked must be paired with functions like makeslicetracked64.
func freegcTracked(trackedObjs *[]trackedObj)
An additional runtime API was temporarily used by some CLs as a
stand-in for makeslicetracked64 for convenience, but there was no
plan to use it beyond helping to bootstrap some of the work:
// TEMPORARY API.
func trackSlice(sl unsafe.Pointer, trackedObjs *[]trackedObj)
freegc vs. freegcTracked

This section summarizes some differences between freegc and freegcTracked
and attempts to explain why freegcTracked exists at all.
First, note that freegc accepts an unsafe.Pointer and hence keeps the pointer
alive. It therefore could be a pessimization in some cases
(such as a long-lived function or function that straddles GC transitions)
if the caller does not call freegc before or roughly when
the liveness analysis of the compiler would otherwise have determined
the heap object is reclaimable by the GC.
For certain allocations or in certain loops, we directly insert freegc
via the compiler almost exactly at the point where the memory goes dead (or
similarly insert a layer on top of freegc like growsliceNoAlias).
freegc says that the memory is immediately available for reuse,
which is especially valuable in a loop or when reusing memory is otherwise
useful without waiting to exit a function.
However, we would be giving up on many valuable targets if we did not also take advantage of escape analysis results (based on its current implementation and based on some escape analysis modifications as part of this work).
Escape analysis today provides primarily function-scoped lifetime results.
If we merely added freegc to function exit points to free objects with function-scoped
lifetimes, that would likely be a pessimization if the object otherwise would have
been dead before the function exit.
This is part of the reason freegcTracked exists and takes a different approach
than freegc. Our tracking of an allocation does not keep the tracked memory alive,
therefore avoiding the liveness pessimization described above when used at scope
exit points. Another benefit of freegcTracked is that it frees multiple
objects at scope exit, which means fewer calls into the runtime and less emitted
code (including compared to if we were to instead have one freegc
per object at scope exit).
In addition, a given pointer today can go dead at multiple places in a function,
which would also add complexity if we were to directly use freegc automatically
wherever the pointer goes dead, in contrast to freegcTracked effectively being
a single clean up operation as the function exits.
Another difference is freegcTracked is essentially a two-step process -- first
there must be a tracked allocation, and later there is a freegcTracked call
that operates on those tracked allocation(s). In contrast, freegc can
operate on any heap allocation as long as we know it is safe to do so.
Summarizing freegc vs. freegcTracked:
freegc                                  freegcTracked
--------------------------------------  ----------------------------------------
frees single heap object                can free multiple heap objects
lower-level primitive                   implemented on top of freegc
operates on any heap allocation         paired with tracked allocation function
call keeps pointer alive                tracking does not keep pointer alive
primarily intended for use the moment   primarily intended for use at
a pointer is otherwise dead             scope exit
The initial implementation of tracking avoided keeping the pointer alive
via an approach of storing a uintptr on the stack along with a sequence number (which is
incremented as the GC cycle progresses) to then later know whether it is
still valid to use the stored uintptr to free its heap object as the scope is exited.
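As a rough model of that original sequence-number scheme (illustrative Go only; the real implementation stores a uintptr on the goroutine stack and the runtime advances the sequence as the GC cycle progresses):

```go
package main

import "fmt"

// gcSeq models a counter the runtime advances as the GC cycle
// progresses. Entries recorded under an older sequence number are
// no longer trusted: the GC may have already reclaimed them.
var gcSeq uint64

type trackedObj struct {
	addr uintptr // recorded without keeping the object alive
	seq  uint64  // gcSeq at the time of tracking
}

func track(addr uintptr) trackedObj {
	return trackedObj{addr: addr, seq: gcSeq}
}

// freeTracked frees only entries whose sequence number still
// matches; stale entries are skipped (the GC owns them now).
func freeTracked(objs []trackedObj) (freed int) {
	for _, o := range objs {
		if o.seq == gcSeq {
			freed++ // in the runtime: hand o.addr to the free list
		}
	}
	return freed
}

func main() {
	objs := []trackedObj{track(0x1000), track(0x2000)}
	gcSeq++ // a GC transition happens before scope exit
	fmt.Println(freeTracked(objs)) // prints 0: the tracking went stale
}
```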
While reviewing this work, Keith Randall recently suggested an
improvement to the implementation of freegcTracked where the runtime zeros out
the tracked pointers on the stack as the GC cycle progresses, which is somewhat
similar in effect to a weak pointer. The GC-related portions of the runtime
would take a more active role now in the tracking, rather than the freegcTracked code
inferring what has happened based on the sequence number. This new approach might
touch more pieces of the code base, but it is better overall, and the intent is to now
implement this improvement. The general characteristics of freegcTracked
described above still apply (except now there is no sequence number), and most
of the associated implementation is unaffected.
There are more details on these functions at the bottom of this document, including draft API doc.
The changes described in this subsection do not rely on runtime.freegcTracked,
but do rely on runtime.freegc and runtime.growsliceNoAlias.
As part of this work, the compiler has been updated to prove whether certain variables
are aliased or not, and to also examine append calls to see if it can prove
whether the slice is not aliased.
If so, the compiler automatically inserts a call to the new
runtime function runtime.growsliceNoAlias (briefly described above),
which frees the old backing memory after slice growth is complete
and the old storage is logically dead.
This works with multiple append calls for the same slice,
including inside loops, and the final slice memory can be
escaping, such as returned from the function. (In that case,
the final slice value is of course not freed.)
An example target is:
func f1(input []int) []int {
var s []int
for _, x := range input {
s = append(s, g(x))
}
return s
}
After proving that the slice s cannot be aliased in that loop,
the compiler converts that roughly to:
func f1(input []int) []int {
var s []int
for _, x := range input {
// s cannot be aliased here.
// The old backing array for s is freed inside growsliceNoAlias after growth.
// The final slice is not freed. It is returned to the caller of f1.
// Note: this is simplified from the actual signature for growsliceNoAlias.
s = growsliceNoAlias(s, g(x))
}
return s
}
Note that proving a slice is unaliased must be careful in the presence
of loops (or gotos that form unstructured loops). In this next example,
we do not treat s as entirely unaliased because it is aliased on
the second pass through the loop:
var alias []int
func f2() {
var s []int
for i := range 10 {
s = append(s, i)
alias = s
}
}
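To see why the alias matters, this standalone snippet (ordinary Go, no runtime changes; the classic loop form is used in place of range-over-int) shows that after an iteration, the global alias shares the backing array with s, so freeing the old backing memory on a later growth would leave alias pointing at reclaimed memory:

```go
package main

import "fmt"

var alias []int

// f2 mirrors the example above: s becomes aliased on each pass
// through the loop, so its backing memory must not be freed.
func f2() {
	var s []int
	for i := 0; i < 10; i++ {
		s = append(s, i)
		alias = s
	}
	// alias and s now share the same backing array.
	s[0] = 42
}

func main() {
	f2()
	fmt.Println(alias[0]) // prints 42: alias observes writes through s
}
```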
For some additional details on the overall approach, see the new
src/cmd/compile/internal/escape/alias.go file in
CL 710015 that is part of this work (especially
the (*aliasAnalysis).analyze method).
By tracking aliasing and renamings across function inlinings (as
well as some support for recognizing whether a goto is an unstructured loop),
the compiler is also able to automatically free intermediate memory from appends
used with iterators, such as in:
s := slices.Collect(seq)
or
var s2 []int
s2 = slices.AppendSeq(s2, seqA)
s2 = slices.AppendSeq(s2, seqB)
s2 = slices.AppendSeq(s2, seqC)
(An earlier version of this work was only able to handle these iterator cases
via manual edits to the slices package; that is no longer needed).
make

The changes described in this subsection do rely on runtime.freegcTracked.
Additional CLs update the compiler to automatically insert free calls for slices allocated via
make when the compiler can prove it is safe to do so.
Consider for example the following code:
func f1(size int) {
s := make([]int64, size) // heap allocated if > 32 bytes
_ = s
}
In today's compiler, the backing array for s is not marked as escaping,
but will be heap allocated if larger than 32 bytes.
As part of this work, the compiler has been updated so that the
backing array for s is recognized as non-escaping but unknown size and
it is valid to free it when f1 is done.
The compiler automatically transforms f1 to roughly:
func f1(size int) {
var freeablesArr [1]struct { // compiler counts how many are needed (one, in this case)
ptr uintptr
// Possibly 1-2 additional fields in future.
}
freeables := freeablesArr[:0]
defer freegcTracked(&freeables)
s := make([]int64, size)
trackSlice(unsafe.Pointer(&s), &freeables) // NOTE: only temporarily two runtime calls
_ = s
}
TODO: remove the trackSlice from this example, which was only a temporary API used to help
bootstrap this work.
TODO: mention make-like uses of append, such as append([]byte(nil), make([]byte, n)...).
The freeablesArr is stored on the stack and automatically sized by the compiler based
on the number of target sites. The defer is currently inserted in a way so that it is
an open-coded defer and hence has low performance impact, though the plan is to switch
to inserting the calls without relying on defer.
For more details on freegcTracked, see the high-level description above or the API doc below.
The transformed code just above shows what is currently implemented, but two items seen there might change as the implementation is polished:
freegcTracked currently takes a slice for convenience, but we could make that tighter.
The intended allocation entry point is makeslicetracked64 (for example, in f1 above),
but the compiler initially inserted calls to a new runtime.trackSlice instead,
which initially seemed slightly easier to implement on the compiler side.
The plan is to have the compiler use makeslicetracked64 instead.
If there are multiple target sites within a function, such as in f2 here:
func f2(size1 int, size2 int, x bool) {
s1 := make([]int64, size1) // heap allocated if > 32 bytes
if x {
s2 := make([]int64, size2) // heap allocated if > 32 bytes
_ = s2
}
_ = s1
}
The compiler counts the max number of sites (not sensitive to conditionals),
and sizes freeablesArr appropriately (two element array in this case),
with a single call to freegcTracked per function:
func f2(size1 int, size2 int, x bool) {
var freeablesArr [2]struct { // compiler counts how many are needed (two, in this case)
ptr uintptr
// Possibly 1-2 additional fields in future.
}
freeables := freeablesArr[:0]
defer freegcTracked(&freeables) // only a single freegcTracked call, regardless of how many tracked sites
s1 := make([]int64, size1)
trackSlice(unsafe.Pointer(&s1), &freeables)
if x {
s2 := make([]int64, size2)
trackSlice(unsafe.Pointer(&s2), &freeables) // NOTE: only temporarily two runtime calls
_ = s2
}
_ = s1
}
TODO: remove the trackSlice from this example, which was only a temporary API used to help
bootstrap this work.
The max number of automatically inserted call sites is bounded
based on the count of make calls in a given function.
Inside for loops or when a goto looks like it is
being used in an unstructured loop, the compiler
currently skips automatically inserting these calls.
However, a modest extension (and the current plan) is to have the compiler instead use a different runtime entry point for freeing in loops. The idea is to allow heap-allocated slice memory to be reclaimed on each loop iteration when it returns to the same allocation site, and when it's safe to do so. Because it's the same allocation site, this doesn't require unbounded or dynamic space in the trackedObj slice. (The compiler can already prove it is safe and the runtime entry point is prototyped.)
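A rough model of that per-site reuse idea (illustrative only; the real design operates inside the runtime allocator): because each iteration returns to the same allocation site, a single cached buffer per site suffices, with no dynamic tracking space:

```go
package main

import "fmt"

// siteCache models one reusable buffer per allocation site.
// When the loop returns to the same site with a compatible request,
// the previous iteration's (now dead) buffer is reused.
type siteCache struct {
	buf    []byte
	reused int
}

func (c *siteCache) alloc(n int) []byte {
	if cap(c.buf) >= n {
		c.reused++
		return c.buf[:n] // reuse last iteration's dead buffer
	}
	c.buf = make([]byte, n)
	return c.buf
}

func main() {
	var site siteCache
	for i := 0; i < 100; i++ {
		b := site.alloc(64) // same allocation site each iteration
		b[0] = byte(i)      // buffer is dead by the iteration's end
	}
	fmt.Println(site.reused) // prints 99: only the first iteration allocates
}
```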
strings.Builder

As an early experiment, we updated strings.Builder to have manual calls
to runtime.freegc, though there is no intent to actually manually
modify strings.Builder.
strings.Builder is a somewhat
low-level part of the stdlib that already helps encapsulate some
performance-related trickery, including using unsafe and
internal/bytealg.MakeNoZero. A strings.Builder already owns the in-progress
buffer that is being built up, until it returns the (usually final) string.
The existing internal comment says:
// External users should never get direct access to this buffer, since
// the slice at some point will be converted to a string using unsafe,
// also data between len(buf) and cap(buf) might be uninitialized.
buf []byte
If we update strings.Builder to explicitly call runtime.freegc on
that buf after a resize operation (but without freeing the usually final
incarnation of buf that will be returned to the user as a string via Builder.String), we
can see some nice benchmark results on the existing strings benchmarks
that call strings.(*Builder).Write N times and then call String. Here, the
(uncommon) case of a single Write is not helped (since the buffer never
resizes after the first allocation if there is only one Write), but the
impact grows such that it is up to ~2x faster as more Write calls cause
more resize operations:
│ disabled.out │ new-free-20.txt │
│ sec/op │ sec/op vs base │
BuildString_Builder/1Write_36Bytes_NoGrow-4 55.82n ± 2% 55.86n ± 2% ~ (p=0.794 n=20)
BuildString_Builder/2Write_36Bytes_NoGrow-4 125.2n ± 2% 115.4n ± 1% -7.86% (p=0.000 n=20)
BuildString_Builder/3Write_36Bytes_NoGrow-4 224.0n ± 1% 188.2n ± 2% -16.00% (p=0.000 n=20)
BuildString_Builder/5Write_36Bytes_NoGrow-4 239.1n ± 9% 205.1n ± 1% -14.20% (p=0.000 n=20)
BuildString_Builder/8Write_36Bytes_NoGrow-4 422.8n ± 3% 325.4n ± 1% -23.04% (p=0.000 n=20)
BuildString_Builder/10Write_36Bytes_NoGrow-4 436.9n ± 2% 342.3n ± 1% -21.64% (p=0.000 n=20)
BuildString_Builder/100Write_36Bytes_NoGrow-4 4.403µ ± 1% 2.381µ ± 2% -45.91% (p=0.000 n=20)
BuildString_Builder/1000Write_36Bytes_NoGrow-4 48.28µ ± 2% 21.38µ ± 2% -55.71% (p=0.000 n=20)
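The benchmarked pattern is simply N Writes followed by String; for reference, its shape is roughly the following (a sketch mirroring the benchmark names above, not the benchmark source itself):

```go
package main

import (
	"fmt"
	"strings"
)

// buildString mirrors the BuildString_Builder benchmarks: write a
// 36-byte chunk n times, then materialize the string. Each internal
// resize of the builder's buffer strands an old backing array, which
// is what the experimental freegc call reclaims immediately.
func buildString(n int) string {
	chunk := strings.Repeat("x", 36)
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString(chunk)
	}
	return b.String()
}

func main() {
	fmt.Println(len(buildString(100))) // prints 3600
}
```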
There is no plan to hand insert calls to runtime.freegc within strings.Builder
(and there are alternatives that could be used within
strings.Builder, such as a segmented buffer and/or a sync.Pool,
as used within cmd/compile). We are mainly using it here as a
small-ish concrete example.
In addition, after doing the hand edited strings.Builder prototype,
the compiler work has since progressed further, and I have some hope for
the compiler to be able to automatically free intermediate memory
for many common uses of strings.Builder via growsliceNoAlias
(with roughly 2 of 3 steps towards that goal currently implemented;
strings.Builder is more complex than an ordinary use of append
for a few reasons).
There are other potential optimizations for strings.Builder,
like allocating an initial buf (e.g., 64 bytes)
on the first Write, faster growth than current 1.25x growth for larger
slices, and "right sizing" the final string; an earlier cut did
implement those optimizations.
For the cases where a normal allocation is happening without any reuse, some initial micro-benchmarks suggest the impact of these changes could be small to negligible (at least with GOAMD64=v3):
goos: linux
goarch: amd64
pkg: runtime
cpu: AMD EPYC 7B13
│ base-512M-v3.bench │ ps16-512M-goamd64-v3.bench │
│ sec/op │ sec/op vs base │
Malloc8-16 11.01n ± 1% 10.94n ± 1% -0.68% (p=0.038 n=20)
Malloc16-16 17.15n ± 1% 17.05n ± 0% -0.55% (p=0.007 n=20)
Malloc32-16 18.65n ± 1% 18.42n ± 0% -1.26% (p=0.000 n=20)
MallocTypeInfo8-16 18.63n ± 0% 18.36n ± 0% -1.45% (p=0.000 n=20)
MallocTypeInfo16-16 22.32n ± 0% 22.65n ± 0% +1.50% (p=0.000 n=20)
MallocTypeInfo32-16 23.37n ± 0% 23.89n ± 0% +2.23% (p=0.000 n=20)
geomean 18.02n 18.01n -0.05%
This section lists the runtime APIs that are automatically inserted by the compiler.
Note: the term "logically dead" here is distinct from the compiler's liveness analysis. The runtime free and reuse implementations must be robust against sequences such as: a first alloc of a given pointer; the GC observes and queues the pointer; runtime.free; the GC dequeues the pointer for scanning; the runtime reuses the same pointer -- all within a single GC phase. (TODO: more precise term than "logically dead"?)
// freegc records that a heap object is reusable and available for
// immediate reuse in a subsequent mallocgc allocation, without
// needing to wait for the GC cycle to progress.
//
// The information is recorded in a free list stored in the
// current P's mcache. The caller must pass in the user size
// and whether the object has pointers, which allows a faster free
// operation.
//
// freegc must be called by the effective owner of ptr who knows
// the pointer is logically dead, with no possible aliases that might
// be used past that moment. In other words, ptr must be the
// last and only pointer to its referent.
//
// The intended callers are the runtime and compiler-generated calls.
//
// ptr must point to a heap object or into the current g's stack,
// in which case freegc is a no-op. In particular, ptr must not point
// to memory in the data or bss sections, which is partially enforced.
// For objects with a malloc header, ptr should point mallocHeaderSize bytes
// past the base; otherwise, ptr should point to the base of the heap object.
// In other words, ptr should be the same pointer that was returned by mallocgc.
//
// In addition, the caller must know that ptr's object has no specials, such
// as might have been created by a call to SetFinalizer or AddCleanup.
// (Internally, the runtime deals appropriately with internally-created
// specials, such as specials for memory profiling).
//
// If the size of ptr's object is less than 16 bytes or greater than
// 32KiB - gc.MallocHeaderSize bytes, freegc is currently a no-op.
//
// freegc must only be called in alloc-safe places.
//
// Note that freegc accepts an unsafe.Pointer and hence keeps the pointer
// alive. It therefore could be a pessimization in some cases (such
// as a long-lived function) if the caller does not call freegc before
// or roughly when the liveness analysis of the compiler
// would otherwise have determined ptr's object is reclaimable by the GC.
func freegc(ptr unsafe.Pointer, size uintptr, noscan bool) bool
// growsliceNoAlias is like growslice but only for the case where
// we know that oldPtr is not aliased.
//
// In other words, the caller must know that there are no other references
// to the backing memory of the slice being grown aside from the slice header
// that will be updated with new backing memory when growsliceNoAlias
// returns, and therefore oldPtr must be the only pointer to its referent
// aside from the slice header updated by the returned slice.
func growsliceNoAlias(oldPtr unsafe.Pointer, newLen, oldCap, num int, et *_type) slice
// makeslicetracked64 is like the existing makeslice64, but the caller
// also passes in a pointer to a trackedObj slice that allows the runtime
// to track the heap-allocated backing array and possibly later
// free the tracked objects via freegcTracked. The contents of
// the trackedObj slice are opaque to the caller, but the caller
// provides the space for the trackedObj slice.
//
// Currently, the only intended caller is code inserted by
// the compiler, which only does so when escape analysis can
// prove the associated pointer has a known scope -- currently,
// when it can prove the pointer does not outlive the current
// function. This effectively creates a scope for the
// associated heap pointers. The compiler also arranges for
// freegcTracked to be called on function exit, so that
// the heap objects are automatically freed when leaving the
// scope. The compiler also arranges for the trackedObj slice's
// backing array to be placed on the user's normal goroutine
// stack to provide storage for the tracking information,
// with that storage matching the lifetime of the tracked
// heap objects (that is, both match the lifetime of
// the function call).
func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
// freegcTracked marks as reusable the objects allocated by
// calls to makeslicetracked64 and tracked in the trackedObjs
// slice, unless the GC has already possibly reclaimed
// those objects.
//
// The trackedObj slice does not keep the associated pointers
// alive past the start of a mark cycle, and
// freegcTracked becomes a no-op if the GC cycle
// has progressed such that the GC might have already reclaimed
// the tracked pointers before freegcTracked was called.
// This might happen for example if a function call lasts
// multiple GC cycles.
//
// The compiler statically counts the maximum number of
// possible makeslicetracked64 calls in a function
// in order to statically know how much user
// stack space is needed for the trackedObj slice.
func freegcTracked(trackedObjs *[]trackedObj)
// trackedObj is used to track a heap-allocated object that
// has a proven scope and can potentially be freed by freegcTracked.
//
// trackedObj does not keep the tracked object alive.
//
// trackedObj is opaque outside the allocation and free code.
//
// TODO: update description to describe runtime zeroing.
//
// TODO: there is a plan to allow a trackedObj to have specials,
// including user-created via SetFinalizer or AddCleanup, but
// not yet implemented. (In short -- update runtime.addspecial
// to update a bitmap of objIndexes that have specials so
// that the fast path for allocation & free does not need to
// walk an mspan's specials linked list, and then a trackedObj with
// specials will not be reused. An alternative might be to
// sweep the span and process the specials after reaching
// the end of an mspan in the mcache if there are reusable
// objects with specials, but that might be harder or perhaps
// infeasible. More generally, sweeping sooner than the
// normal GC but not in the allocation/free fast path (perhaps
// in a goroutine managed by the runtime), might be worth
// exploring for other reasons than just specials).
type trackedObj struct {
ptr uintptr // tracked object
// Future: possibly 1-2 additional metadata fields to speed up free.
}
An earlier version of this work also implemented an API that just took an
unsafe.Pointer, without any size or noscan parameters, like:
func free(ptr unsafe.Pointer)
However, this was more expensive, primarily due to needing to call
spanOf, findObject or similar. Early measurements suggested that
passing the size and noscan parameters like in freegc allowed
eliminating roughly 60% of the cost of a free operation.
(A small historical note: the more explicit API taking size and noscan parameters
was originally named freeSized but was later renamed to freegc; the older freeSized name
might still be visible in some earlier prototype CLs.)
Separately, we will likely also use a runtime.freegcSlice. It is a
small wrapper over runtime.freegc that accepts a slice. (Initially it was
just a test function, but ended up being more convenient in some cases,
including in some of the early prototype compiler work.)