Notes from Checkpoint/Restart BoF at Linux Plumbers Conference, Sep 24, 2009.
- Oren Laadan, Joeseph Ruscio (Librato)
- Pavel Emelyanov (OpenVZ)
- Ying Han, Salman Qazi (Google)
- Dan Smith, Matt Helsley, Sukadev Bhattiprolu, Dave Hansen (IBM)
The topics discussed can be roughly categorized into:
Current c/r code (version ckpt-v18) already supports uts-ns, ipc-ns, user-ns, and pid-ns (not yet nested). Work is underway to add mount-ns, network-ns, and integrate with devpts-ns. There is interest in "dev-ns" and "time-ns" as well.
Mount namespaces: Make a distinction between external and internal mount points. Anything that is shared with the container's parent is considered external. The rest appear only in namespaces that belong to the container/subtree, and are internal.
- External filesystems/mount points are expected to already exist for restart to succeed. Userspace is responsible for setting up the proper filesystem view for the container/subtree before the restart.
- Internal filesystems/mount points will be restored during the restart. This includes bind-mounts, loop-mounts, as well as mounts of a device partition or a remote filesystem.
- Mount points shared with a mount-ns outside the container/subtree will be considered external. Disabling shared mount propagation is one way to ensure that no such sharing occurs.
- Internal remote filesystems that require a network connection (e.g. NFS, FUSE) introduce a chicken-and-egg problem during live migration: the network is shut down for the duration of the migration, but the migration cannot complete before the filesystem is mounted, which requires network traffic...
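One way userspace might check the shared-propagation condition above is to read /proc/<pid>/mountinfo and look for the optional "shared:N" tag on each mount. The sketch below parses that format. Treating every shared mount as external is a simplifying assumption (a peer group only matters if it has members outside the container), and classify_mounts is a hypothetical helper, not part of the c/r tools.

```python
def parse_mountinfo_line(line):
    """Split one mountinfo line into (mount point, fs type, optional tags)."""
    # Fields before " - ": mount ID, parent ID, major:minor, root,
    # mount point, mount options, then zero or more optional tags.
    left, _, right = line.partition(" - ")
    fields = left.split()
    mount_point = fields[4]
    tags = fields[6:]
    fstype = right.split()[0]
    return mount_point, fstype, tags


def classify_mounts(lines):
    """Return (external, internal) mount points.

    Assumption for illustration: a mount with a "shared:N" tag is
    treated as potentially external; everything else as internal.
    """
    external, internal = [], []
    for line in lines:
        if not line.strip():
            continue
        mount_point, _fstype, tags = parse_mountinfo_line(line)
        shared = any(tag.startswith("shared:") for tag in tags)
        (external if shared else internal).append(mount_point)
    return external, internal


# Made-up sample lines in mountinfo format.
SAMPLE = [
    "36 35 98:0 / / rw,noatime shared:1 - ext3 /dev/root rw,errors=continue",
    "40 36 0:22 / /mnt/scratch rw,relatime - tmpfs tmpfs rw",
]
external, internal = classify_mounts(SAMPLE)
```

On a live system the same function could be fed the lines of /proc/self/mountinfo instead of SAMPLE.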
Related are unlinked files, unlinked directories and lazy-unmounted mount points. For unlinked files and directories, if the underlying filesystem supports a "re-link" operation that re-attaches a name to an existing (but otherwise unreachable) inode, then c/r can leverage this functionality. Otherwise, we are probably bound to storing the entire contents of those unlinked files as part of the checkpoint image, or somewhere else.
Network namespaces: Everyone agrees that network namespaces - specifically the network configuration and setup - are better handled in userspace. The actual state of network endpoints (i.e. sockets) is restored in the kernel.
Time namespace: Such a feature is not even available for containers, but it is definitely needed to be able to control how restarted applications perceive the time-warp that they experience. A few issues were pointed out:
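As an illustration of the fallback path, the sketch below detects an open-but-unlinked file from userspace (link count of zero) and reads its full contents so they could be stored alongside an image. The helper names are hypothetical, and the actual c/r code does this work in the kernel; this only shows the idea.

```python
import os
import tempfile


def is_unlinked(fd):
    """True if the open file behind fd has no remaining directory entries."""
    return os.fstat(fd).st_nlink == 0


def save_unlinked_contents(fd):
    """Read the whole file without disturbing the fd's current offset."""
    chunks = []
    offset = 0
    while True:
        chunk = os.pread(fd, 65536, offset)
        if not chunk:
            break
        chunks.append(chunk)
        offset += len(chunk)
    return b"".join(chunks)


# Create a file, keep it open, and remove its only name.
fd, path = tempfile.mkstemp()
os.write(fd, b"scratch data")
os.unlink(path)

unlinked = is_unlinked(fd)
saved = save_unlinked_contents(fd)
os.close(fd)
```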
- Use absolute time or relative time ?
- Do new children inherit the policy ?
- Do we gradually adjust from relative to absolute time ?
Detailed discussion of this topic was deferred.
Device namespace: This, too, is not even available for containers, although the need is clear. It was mentioned through the example of /dev/rtc, for instance if one would like to migrate a user-session that contains audio or video playback.
Task creation: The extended clone (clone2/clone3/clone_ext/clone_with_pids) was briefly discussed. All parties agreed with the current approach of restoring the process-tree (aka task creation) in userspace.
Network-ns: see above.
Mount-ns: see above.
Image format: The format of the checkpoint image was discussed briefly. This format is not written in stone and may change in the future. Furthermore, backward compatibility is to be implemented in userspace. Newer kernels may not be able to interpret images produced by older kernels. Instead, userspace will convert the image format to suit the target kernel.
That said, Pavel requested that changes and extensions to the data structures be kept to a minimum, e.g. by only adding new data fields to the end of a data structure.
Inspection: If you wish to inspect the contents of a given checkpoint image, the ckptinfo tool is your friend. At the moment it provides very basic information, but can be easily extended to provide a more detailed view of the contents of an image.
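To see why append-only changes keep userspace conversion simple, consider the sketch below. The record layouts (HDR_V1, HDR_V2) and their fields are invented for illustration and do not reflect the actual ckpt-v18 image format. A reader that only knows the old layout can still decode the prefix of a newer record, and a converter only has to append a default value.

```python
import struct

# Hypothetical little-endian layouts: record type, record length, payload.
HDR_V1 = struct.Struct("<IIQ")
# Version 2 appends one new 32-bit field at the end; nothing else moves.
HDR_V2 = struct.Struct("<IIQI")


def read_v1_prefix(buf):
    """Decode the v1 fields from either a v1 or a v2 record."""
    return HDR_V1.unpack_from(buf, 0)


def upgrade_v1_to_v2(buf, new_field_default=0):
    """Userspace conversion: re-pack a v1 record with the new trailing field."""
    rec_type, _old_len, payload = HDR_V1.unpack(buf)
    return HDR_V2.pack(rec_type, HDR_V2.size, payload, new_field_default)


v1_record = HDR_V1.pack(7, HDR_V1.size, 4096)
v2_record = HDR_V2.pack(7, HDR_V2.size, 4096, 1)

prefix_of_v2 = read_v1_prefix(v2_record)
upgraded = HDR_V2.unpack(upgrade_v1_to_v2(v1_record))
```

Had the new field been inserted in the middle, both the prefix read and the converter would need per-version knowledge of every offset.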
Interfaces to userspace
Checkpoint-able or not ? Users may want to tell whether a container/subtree is checkpoint-able or not. Earlier, a suggestion to track this per process/container was dismissed. Instead, we suggested allowing a dry-run (without dumping real data) to test whether it succeeds. The application must remain frozen if the user is to depend on the result of a dry-run. This feature is not yet implemented.
Related to this, Pavel asked about leak-detection when dealing with full containers. The leak detection in place is based on Alexey's work, and adds detection of "reverse-leaks".
Restart-able or not ? Users may also want to know whether a checkpoint image is restart-able or not without having it execute and fail. Moreover, a restart may appear to succeed, only for one of the restarted processes to fail soon after because, for example, it expects to use a feature not available on the current hardware or kernel.
One way to address this is to encapsulate within the checkpoint image a representation of the "capabilities" (broadly defined) of the environment where the application was checkpointed. This includes hardware capabilities (e.g. CPU and FPU specifics where they matter), and kernel capabilities (e.g. whether FUTEXes are supported). Note that the application may not be using such a resource at checkpoint time, yet plan to use it because it has already tested for its availability.
Because there could be too many such capabilities that are hard to enumerate and encapsulate, or even track at runtime, it was decided for now not to add such complicated logic to the kernel code. Instead, userspace tools could obtain and place such information in some metadata that will surely accompany checkpoint images as they migrate between hosts.
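A minimal sketch of what such a userspace tool could look like follows. The metadata layout, the choice of fields, and the helper names are all assumptions made for illustration; nothing here was specified at the BoF. It records the kernel release and the x86 CPU flag set at checkpoint time, then reports flags missing on the restart host.

```python
import json
import platform


def read_cpu_flags(cpuinfo_text):
    """Extract the flag set from /proc/cpuinfo-style text (x86 "flags" line)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def collect_metadata(cpuinfo_text):
    """Build a hypothetical sidecar record describing the environment."""
    return {
        "kernel_release": platform.release(),
        "machine": platform.machine(),
        "cpu_flags": sorted(read_cpu_flags(cpuinfo_text)),
    }


def missing_cpu_flags(image_meta, host_meta):
    """CPU flags present at checkpoint time but absent on the restart host."""
    return sorted(set(image_meta["cpu_flags"]) - set(host_meta["cpu_flags"]))


# Made-up cpuinfo excerpts standing in for the source and target hosts.
SOURCE_CPUINFO = "processor\t: 0\nflags\t\t: fpu sse sse2 avx\n"
TARGET_CPUINFO = "processor\t: 0\nflags\t\t: fpu sse sse2\n"

sidecar = json.dumps(collect_metadata(SOURCE_CPUINFO))
target_meta = collect_metadata(TARGET_CPUINFO)
missing = missing_cpu_flags(json.loads(sidecar), target_meta)
```

On a real host, the text of /proc/cpuinfo would replace the sample strings, and the JSON sidecar would travel with the image.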
Error reporting: Error reporting is important to inform the user why a checkpoint or a restart operation failed. Currently, we provide this information during checkpoint by appending a special error-record to the output file in case the checkpoint fails. For restart, such information can only be obtained from the kernel logs (dmesg) if debug-code is enabled.
It was suggested instead to extend the API of both sys_checkpoint and sys_restart by adding a file-descriptor into which the kernel will log information about the progress of the checkpoint or restart. The caller will control the verbosity level (e.g. status, log, debug) using the existing @flags argument. (This is preferred over passing a buffer for the kernel to fill).
It was also requested that at least some part of the error reporting have a standard (unified) format that is suitable for automatic (userspace) tools to parse and interpret. However, the format of the log information is not an ABI, and may change over time. The current code already provides some form of unified format, which may be improved.
Controlling checkpoint/restart: It is desirable to allow userspace to manage how certain resources are saved and restored. For example, to optimize for speed and space, an application may tell the kernel that some memory is "scratch" memory and need not be saved. Applications may want to substitute an alternative resource for an existing one during restart, e.g. replace /dev/tty1 with /dev/tty2 (and at checkpoint, advise the kernel to not fail when reaching /dev/tty1 although ttys are not supported).
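The caller's side of the proposed log file-descriptor could look roughly like the sketch below. Everything specific in it is hypothetical: the verbosity constants, the "<level> <message>" line format, and fake_checkpoint, which merely stands in for the kernel writing to the descriptor. The point is only to show a pipe being handed over and the records being parsed mechanically afterwards.

```python
import os

# Hypothetical verbosity levels; real flag values were not defined.
LOG_STATUS, LOG_LOG, LOG_DEBUG = 0, 1, 2


def fake_checkpoint(log_fd, verbosity):
    """Stand-in for the kernel side: emit records at or below `verbosity`."""
    records = [
        (LOG_STATUS, "checkpoint started"),
        (LOG_DEBUG, "saving task 1"),
        (LOG_STATUS, "error: unsupported file type"),
    ]
    for level, message in records:
        if level <= verbosity:
            os.write(log_fd, f"{level} {message}\n".encode())


def run_with_log(verbosity):
    """Create a pipe, pass the write end as the log fd, parse what comes back."""
    read_end, write_end = os.pipe()
    try:
        fake_checkpoint(write_end, verbosity)
    finally:
        os.close(write_end)
    parsed = []
    with os.fdopen(read_end) as log:
        for line in log:
            level, message = line.rstrip("\n").split(" ", 1)
            parsed.append((int(level), message))
    return parsed


status_only = run_with_log(LOG_STATUS)
with_debug = run_with_log(LOG_DEBUG)
```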
One way to communicate such wishes is to introduce a new system call, e.g. cradvise, whose API is yet to be defined. However, participants argued against introducing this syscall, mainly to avoid another complex ioctl-like interface and out of concern about an unfavorable response from the community. An alternative would be to reuse existing "hinting" syscalls, such as madvise, shmctl, and fcntl, to accomplish the same purpose, and add new syscalls as needed.
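To give a feel for the "reuse an existing hinting syscall" option, the sketch below applies madvise to an anonymous mapping. It uses MADV_DONTDUMP, an existing Linux hint that excludes pages from core dumps, purely as an analogy: that a checkpoint would honor this (or any) madvise value for "scratch" memory is an assumption, not something that was decided. mark_scratch is a hypothetical helper.

```python
import mmap


def mark_scratch(region):
    """Try to apply an existing madvise hint to `region`.

    Returns True if the hint was applied, False if this platform or
    Python build does not offer it or the kernel rejects it.
    """
    advice = getattr(mmap, "MADV_DONTDUMP", None)
    if advice is None or not hasattr(region, "madvise"):
        return False
    try:
        region.madvise(advice)
    except OSError:
        return False
    return True


# One page of anonymous memory standing in for an application's scratch buffer.
scratch = mmap.mmap(-1, mmap.PAGESIZE)
marked = mark_scratch(scratch)
scratch.close()
```

The appeal of this route is that the call shape already exists; only a new advice value (or a new meaning for an existing one) would have to be agreed upon.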
Assorted other points
- Ying Han asked whether there is a performance difference between the original instance of an application and the restarted instance (e.g. on NUMA, if the application was on one node at checkpoint and ended up on another node after restart). We do not have a performance evaluation that can answer the question. However, we do not expect any performance degradation beyond transient effects soon after restart, such as an increased page-fault rate due to cold cache state, and re-shuffling of memory across NUMA nodes.
- The VDSO page is a major headache for two reasons: it provides direct access to kernel variables (e.g. current time) bypassing syscalls, and its binary code may change between different kernel builds. In OpenVZ this feature is simply disabled. Addressing the issue remains a challenge.
- An application may have pending asynchronous I/O at the time of the checkpoint. A simple solution, taken by OpenVZ, is to flush all pending I/O requests prior to the operation (similar to msync of shared-mapped files). It remains to explore alternatives that reduce application downtime during a checkpoint.
- It would be useful to merge the test suites developed by OpenVZ and the evolving test-suite for the current c/r work. OpenVZ's suite makes various applications enter specific states and wait for a checkpoint. After the checkpoint, and again after restart, they check that nothing has changed unexpectedly.
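The flush-before-checkpoint idea mentioned above for pending I/O can be illustrated from userspace as follows. This is only an analogy: it covers buffered writes and shared file mappings (mmap.flush wraps msync), not in-flight kernel AIO requests, and flush_before_checkpoint is a hypothetical helper rather than what OpenVZ actually does in the kernel.

```python
import mmap
import os
import tempfile


def flush_before_checkpoint(files, mappings):
    """Push dirty mapped pages and buffered writes down to the files."""
    for mapping in mappings:
        mapping.flush()           # msync() on the mapped range
    for f in files:
        f.flush()                 # drain Python's userspace buffer
        os.fsync(f.fileno())      # ask the kernel to write the file out


with tempfile.TemporaryFile() as f:
    f.truncate(mmap.PAGESIZE)
    mapping = mmap.mmap(f.fileno(), mmap.PAGESIZE)
    mapping[:5] = b"hello"        # dirty the shared mapping
    flush_before_checkpoint([f], [mapping])
    f.seek(0)
    on_disk = f.read(5)
    mapping.close()
```

The cost of this approach is the downtime spent waiting for the flush, which is the trade-off the notes leave open.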