Futexes

From Linux Checkpoint / Restart Wiki
Jump to: navigation, search

Contents

Futex checkpoint/restart

Futexes use atomic operations on userspace memory (the "futex word") to implement the uncontended path for locks and other synchronization primitives. The contended cases are resolved by the kernel. Futexes have received many enhancements over the years to provide robustness in the face of exiting tasks and priority inheritance for example.

Plain Futexes (explicitly* tested)

Plain futexes use a simple 32-bit futex word and the WAIT and WAKE operations. In the "uncontended" userspace-only paths the checkpoint of memory contents and the fact that the tasks using the memory are frozen will ensure a consistent checkpoint. When waiters and/or wakers are executing syscalls restarting the syscalls suffices. For restart it suffices to restart the futex syscalls which the contending waiters are already making.

Plain futexes may or may not use the tid in the futex word. It depends on the code that uses these futexes whether or not a pid namespace is needed to make operation after restart perfectly reliable. A conservative restart would always use a new pid namespace.

Robust Futexes (explicitly* tested)

set_robust_list() and get_robust_list() register a per-task list of futexes the task has acquired. When the task dies the kernel walks these lists carefully and wakes a waiter for the held futexes.

Checkpoint/restart of these lists involves saving and restoring the list head pointer. This is sufficient because the list head pointer is a pointer to userspace memory, which is already being checkpointed. Since the tasks using the memory are frozen we can be sure we won't race with robust list operations. The freezer also prevents any new concurrent set_robust_list() syscalls. If the task was in the set_robust_list() syscall prior to being frozen then we know those changes are already reflected inside the kernel by the time we reach checkpoint because that requires a FROZEN -> CHECKPOINTING transition in the cgroup freezer.

Robust futexes store tids in the lower 29 bits of the futex word. Thus, to maintain robustness and for reliable restart, any program using robust futexes must be restarted in a new pid namespace so that its pid is guaranteed to be available. Fortunately, this can be detected by looking at the robust list field for the task in the checkpoint image.

Priority Inheritance (PI) Futexes (explicitly* tested)

PI futexes have much more kernel-internal state. They utilize rt mutexes to implement priority inheritance. The naieve approach of checkpointing these mutexes and restarting them is actually much more work than necessary thanks to the semantics of the futex syscall. (TODO: verify that these syscalls are restarted after the rt priorities have been restored -- otherwise priority inheritance may not cause the correct waiter to be woken next).

PI futexes store tids in the lower 29 bits of the futex word. Thus, for reliable restart, any program using PI futexes must be restarted in a new pid namespace so that its pid is guaranteed to be available. Unfortunately, there is no practical way to detect that PI futexes are being used unless we are checkpointing tasks in the contended path. However, depending on how the tid is cached this may not be a big problem once all the pi/robust futexes are released -- the new tid may be picked up from gettid().

Futex wake ops, bitsets...

These were evaluated for checkpoint/restart and it appeared that the existing strategy -- restarting syscalls (esp. of the waiters) -- will suffice. Unfortunately, this has not been explicitly* tested.

Combinations

PI and robust futexes can be combined. Also expected to work yet not explicitly* tested.


Footnotes

  • explicit testing -- Explicit testing means we have a test case that is guaranteed test checkpointing in states which also involve kernel state. See also implicit testing.
  • implicit testing -- We've checkpointed and restarted things which use these futexes but we haven't made an effort to guarantee that these futexes were "being operated on" during the checkpoint. This is a non-deterministic method of testing since it relies purely on timing.
Personal tools