Coroutine HALO (Clang 20+)

Heap Allocation eLision Optimization (HALO) is an optimization technique applied by the compiler that allows child task allocations to be combined into the parent’s allocation, eliminating per-task heap allocations for improved performance.

In practice, compilers often do not apply HALO automatically. However, starting with Clang 20, the Clang compiler offers two additional attributes, [[clang::coro_await_elidable]] and [[clang::coro_await_elidable_argument]], which can be used as a hint to the compiler that it should apply this optimization. TMC provides these attributes for several types and functions. On non-Clang compilers or Clang versions prior to 20, these types and functions are safe to use, but provide no additional optimization.
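A minimal sketch of how a library might apply these attributes portably. The macro names DEMO_CORO_AWAIT_ELIDABLE and DEMO_CORO_AWAIT_ELIDABLE_ARGUMENT, the demo_task type, and the demo_passthrough() function are hypothetical and exist only for illustration; TMC's own macros and types are not shown here and may differ. The attributes are emitted only when the compiler reports support for them, so the same source still compiles elsewhere.

```cpp
// Hypothetical macros: expand to the Clang attributes when supported,
// and to nothing on other compilers or older Clang versions.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::coro_await_elidable)
#    define DEMO_CORO_AWAIT_ELIDABLE [[clang::coro_await_elidable]]
#  endif
#  if __has_cpp_attribute(clang::coro_await_elidable_argument)
#    define DEMO_CORO_AWAIT_ELIDABLE_ARGUMENT [[clang::coro_await_elidable_argument]]
#  endif
#endif
#ifndef DEMO_CORO_AWAIT_ELIDABLE
#  define DEMO_CORO_AWAIT_ELIDABLE
#endif
#ifndef DEMO_CORO_AWAIT_ELIDABLE_ARGUMENT
#  define DEMO_CORO_AWAIT_ELIDABLE_ARGUMENT
#endif

// Placeholder for a coroutine task type. A real task type would also
// define a promise_type; this sketch only shows where the attribute goes.
struct DEMO_CORO_AWAIT_ELIDABLE demo_task {
  int value;
};

// Placeholder for a function that wraps a task argument. The parameter
// attribute marks the argument as a candidate for elision when the call
// itself is awaited directly.
inline demo_task demo_passthrough(DEMO_CORO_AWAIT_ELIDABLE_ARGUMENT demo_task t) {
  return t;
}
```

Because the macros expand to nothing when the attributes are unavailable, code that uses demo_task behaves identically on every compiler; only the optimization hint differs.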

TMC has defined these attributes for the following types and functions:

However, for Clang to apply HALO via these attributes, the usage must be “a direct call expression that returns a prvalue of a type that is the immediate right-hand side operand to a co_await expression”. In other words, HALO can work if you await one of these expressions directly, but if you assign the expression to a variable and then await that variable, HALO cannot be applied.

// HALO: This allocation can be elided as it is directly awaited
auto result = co_await task_int(1);

// Non-HALO: This allocation cannot be elided as it is stored in a
// variable before being awaited
auto t = task_int(2);
auto result = co_await std::move(t);

// Non-HALO: This allocation cannot be elided because the .resume_on()
// member function call breaks Clang's requirement that it be
// an "immediate right-hand side operand to a co_await expression".
auto result = co_await task_int(3).resume_on(some_executor);

The *_clang() functions exist to satisfy this “immediately awaited” requirement for use cases where it otherwise would not work. They all return a dummy awaitable which must be awaited immediately. This dummy awaitable doesn’t do anything other than trick the compiler. In the case of the fork_clang() functions, this dummy awaitable returns the real awaitable, which can be saved and awaited later to join the forked task.

// HALO: fork_clang() is directly awaited
auto forked = co_await tmc::fork_clang(some_task());
do_some_work();
co_await std::move(forked);

// Non-HALO: fork_clang() stored in variable
auto dummy = tmc::fork_clang(some_task());
auto forked = co_await std::move(dummy);
do_some_work();
co_await std::move(forked);

See test_halo.cpp for a complete listing of every scenario in which HALO does, and does not, work.

HALO is only for coroutines

Awaitable types that expose await_ready() / await_suspend() / await_resume() or operator co_await(), but are not themselves coroutines, cannot be HALO’d. In fact, the concept of HALO does not apply to these types at all, because they may not allocate anything in the first place.

HALO is only for coroutines.

How can you tell if HALO is working?

HALO is an optimization, so it may not be applied at -O0 optimization level.

Since the compiler doesn’t give any direct feedback about HALO, TMC offers an optional counter that you can use to profile the number of tmc::task allocations in your own code. When this is enabled, TMC will track all calls to tmc::task::operator new(). When HALO is applied, operator new() is not called, and this counter is not incremented. The usage of this feature is demonstrated in test_halo.cpp.

Usage Steps:

  1. Define TMC_DEBUG_TASK_ALLOC_COUNT in your build script.

  2. Call tmc::debug::get_task_alloc_count() to read the number of tmc::task allocations that have occurred.

  3. (optional) Call tmc::debug::set_task_alloc_count(0) to reset the value of the counter.

This functionality is implemented as an atomic variable increment, so it has a small but non-zero overhead that is present only when TMC_DEBUG_TASK_ALLOC_COUNT is defined.
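The mechanism described above can be sketched roughly as follows. This is not TMC's implementation: the demo namespace, the DEMO_DEBUG_TASK_ALLOC_COUNT macro, and the frame_stand_in type are assumptions made for illustration. The sketch relies on the fact that a coroutine promise type can declare a class-specific operator new, which is the hook where such a counter would be incremented; a plain struct stands in for the promise type here to keep the example short.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Stand-in for defining the macro in the build script (hypothetical name).
#define DEMO_DEBUG_TASK_ALLOC_COUNT

namespace demo {

// Single process-wide counter, created on first use.
inline std::atomic<std::size_t>& task_alloc_counter() {
  static std::atomic<std::size_t> count{0};
  return count;
}

inline std::size_t get_task_alloc_count() {
  return task_alloc_counter().load(std::memory_order_relaxed);
}

inline void set_task_alloc_count(std::size_t value) {
  task_alloc_counter().store(value, std::memory_order_relaxed);
}

// Stands in for a coroutine promise type. If an allocation is elided,
// operator new is never called and the counter stays unchanged.
struct frame_stand_in {
  int payload;

  static void* operator new(std::size_t size) {
#ifdef DEMO_DEBUG_TASK_ALLOC_COUNT
    // One atomic increment per allocation: the only added overhead.
    task_alloc_counter().fetch_add(1, std::memory_order_relaxed);
#endif
    return ::operator new(size);
  }

  static void operator delete(void* p) noexcept { ::operator delete(p); }
};

} // namespace demo
```

When the macro is not defined, the increment is compiled out entirely, which matches the statement that the overhead is present only when counting is enabled.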

API Reference

inline size_t tmc::debug::get_task_alloc_count()

Returns the current value of the tmc::task allocation counter. This is useful to determine if HALO is working; tasks that have their allocations folded into the parent allocation by HALO do not increase this counter.

inline void tmc::debug::set_task_alloc_count(size_t Value)

Sets the tmc::task allocation counter to Value. Use this to reset the counter (for example, to 0) so that you can count the number of allocations in a specific program section.
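One way to use the two functions together is a small helper that resets the counter, runs a section of code, and reads the counter back. The sketch below is hypothetical: count_allocs_in() and the fake_* stand-in counter are not part of TMC. The helper takes the getter and setter as parameters, so it should also work when passed tmc::debug::get_task_alloc_count and tmc::debug::set_task_alloc_count in a build that defines TMC_DEBUG_TASK_ALLOC_COUNT.

```cpp
#include <cstddef>

namespace demo {

// Stand-in counter so the sketch compiles without TMC (hypothetical).
inline std::size_t& fake_counter() {
  static std::size_t count = 0;
  return count;
}
inline std::size_t fake_get() { return fake_counter(); }
inline void fake_set(std::size_t value) { fake_counter() = value; }

// Resets the counter, runs the section, and returns how many
// allocations were recorded while the section ran.
template <typename Get, typename Set, typename Section>
std::size_t count_allocs_in(Get get, Set set, Section&& section) {
  set(0);
  section();
  return get();
}

} // namespace demo
```

A result lower than the number of tasks created in the section suggests that some allocations were elided. Note that the section must run to completion synchronously for the count to be meaningful, and that allocations made concurrently by other threads are counted too.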