Problem: Tasks being Checkpointed may have open devices
Tasks may have device nodes open during checkpoint. Depending on the device, this state may be impossible to checkpoint. Currently, most open device nodes cause the kernel to refuse to checkpoint and return an error from sys_checkpoint(). The example that inspired this RFC, and one worth keeping in mind, is the X DRM device.
Solution Idea: Hot-unplug the devices during Restart
There are other ways to solve this problem. One is to create virtual devices which, when open, can still be checkpointed. For example, X can be run over VNC rather than directly on hardware. However, this limits the use of checkpoint/restart to slow VNC graphics. Other kinds of devices may not have suitable virtual equivalents at all.
This solution does not require virtual devices and instead allows tasks to have device nodes of hot-pluggable devices open.
NOTE: Both of the solutions mentioned above preserve the reliability of checkpoint -- if an open device node cannot be checkpointed then the kernel refuses to checkpoint and returns an error.
Use the cold/hot plug device support to emulate addition/removal of the devices rather than trying to virtualize or reconnect them all. Devices that support hot unplug would need just a single line of code, much like generic filesystems use generic_file_checkpoint:
.checkpoint = generic_unplug_device_after_restart,
This function would generate the hot-unplug uevents without actually sending them out. The events would be stored in the checkpoint image. During restart we would replace the "open device nodes" with special files that simulate unplugged devices. After restart has connected the new tasks with all sockets capable of listening for uevents, we first "cold plug" the existing devices so the restarted programs recognize the new alternatives. This would be similar to what already happens on boot. Next we emit the stored hot-unplug uevents so the tasks disconnect from the non-existent devices. Any attempt to access file descriptors of those devices would return EBADF.
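The capture-then-replay idea above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not kernel code: `struct uevent_log`, `struct dev_ops_sketch`, `sketch_unplug_device_after_restart`, `replay_unplug_events`, and `emit_to_stdout` are all hypothetical names, the real hook signature is not defined by this RFC, and the cold-plug step is omitted. The only detail borrowed from existing behaviour is the `remove@<devpath>` header that uevent messages start with.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS    16
#define MAX_EVENT_LEN 128

/* Hypothetical stand-in for the uevent buffers stored in the checkpoint image. */
struct uevent_log {
    char events[MAX_EVENTS][MAX_EVENT_LEN];
    int count;
};

/* Hypothetical per-device operations table with a checkpoint hook. */
struct dev_ops_sketch {
    int (*checkpoint)(struct uevent_log *saved, const char *devpath);
};

/*
 * Userspace stand-in for the proposed generic helper: format a "remove"
 * uevent header for the device and store it instead of sending it.
 * Returns 0 on success, -1 if the log is full or the path is too long.
 */
int sketch_unplug_device_after_restart(struct uevent_log *saved,
                                       const char *devpath)
{
    int n;

    assert(saved != NULL && devpath != NULL);
    if (saved->count >= MAX_EVENTS)
        return -1;
    n = snprintf(saved->events[saved->count], MAX_EVENT_LEN,
                 "remove@%s", devpath);
    if (n < 0 || n >= MAX_EVENT_LEN)
        return -1;
    saved->count++;
    return 0;
}

/* The "single line" a hot-unpluggable driver would add (assumed shape). */
const struct dev_ops_sketch drm_ops_sketch = {
    .checkpoint = sketch_unplug_device_after_restart,
};

/* Callback used at restart to deliver one stored event to the new tasks. */
typedef void (*emit_fn)(const char *event, void *ctx);

/* Replay every stored hot-unplug event in order; returns how many were sent. */
int replay_unplug_events(const struct uevent_log *saved, emit_fn emit, void *ctx)
{
    int i;

    for (i = 0; i < saved->count; i++)
        emit(saved->events[i], ctx);
    return saved->count;
}

/* Example emitter: print the event and bump an optional int counter. */
void emit_to_stdout(const char *event, void *ctx)
{
    int *counter = ctx;

    printf("%s\n", event);
    if (counter != NULL)
        (*counter)++;
}
```

At checkpoint time the caller would invoke `drm_ops_sketch.checkpoint(&saved, devpath)` for each open device node. At restart, after the cold-plug pass, it would call `replay_unplug_events(&saved, emit_to_stdout, NULL)` with whatever emitter actually reaches the restarted tasks.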
Subproblems and their potential solutions:
- Memory-mapped devices: difficult or impossible. (Would putting the task to sleep until after the hot-unplug uevent has been sent work as a fault handler? Or are hot-unpluggable devices generally not mmap-able?)
- uevents are broadcast on netlink sockets. We want only the restarted programs to respond to these uevents, but we still want the restarted programs to receive normal uevents too. Could solve with:
  - A netlink proxy (new concept). A netlink proxy would receive normal uevents and forward them to another, private, broadcast socket that replaces the checkpointed netlink broadcast socket during restart. The proxy would start by sending the cold-plug and hot-unplug uevents.
  - Sequence numbers (easy, but may not work). Move the next uevent sequence number so it's greater than the highest checkpointed sequence number + the number of cold-plug events. Then only the restarted tasks will pay attention to the cold-plug and hot-unplug uevents sent during restart.
- Hot-unplugging devices during restart requires modifications to the uevent code to:
  - stimulate generation of these uevents during checkpoint. Tricky because we may have to stimulate bus-level uevents too (when the last device goes away).
  - redirect the stimulated uevents from the netlink socket to a list of uevent buffers. Looks like there's just one function to hook and that will do the trick.
  - possibly handle sequence numbers specially.
- Need to get reactions from the "device model" community on whether this is too ugly.
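The arithmetic behind the sequence-number idea in the list above can be written down as a small sketch. It assumes 64-bit sequence numbers and a listener that ignores any event whose number is not newer than the last one it saw. Both function names are hypothetical, and whether real listeners filter this way is exactly the open question the list raises.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest value the next uevent sequence number can be moved to so that
 * it is greater than (highest checkpointed number + cold-plug event count).
 * Hypothetical helper; asserts that the sum cannot overflow.
 */
uint64_t next_seqnum_after_restart(uint64_t highest_checkpointed,
                                   uint64_t coldplug_events)
{
    assert(coldplug_events < UINT64_MAX - highest_checkpointed);
    return highest_checkpointed + coldplug_events + 1;
}

/*
 * Assumed listener policy: react only to events that are newer than the
 * last sequence number this listener has already seen.
 * Returns 1 if the event would be handled, 0 if it would be ignored.
 */
int listener_accepts(uint64_t last_seen, uint64_t event_seqnum)
{
    return event_seqnum > last_seen;
}
```

For example, with a highest checkpointed number of 100 and 5 cold-plug events, the replayed events would occupy 101 through 105 and normal numbering would resume at 106. A restarted task that last saw 100 accepts 101, while a listener that has already seen 106 ignores it.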
This approach may work for the X Window System as an alternative to always using VNC. It may even be useful for devices that aren't normally hot-plug capable if we have a sysfs attribute for each (kind of) device which enables/advertises "virtual hotplug support". Setting those files to "1" ("0" by default) would indicate that userspace is prepared for those (kinds of) devices to be virtually hot-(un)plugged.