- Can this checkpoint/restart implementation be used with liblxc?
- Yes. The instructions are at LXC-USERCR.
- What is the big deal? There are several implementations of C/R (see www.checkpointing.org). How is this different?
Several implementations listed on checkpointing.org are tailored for specific applications. Some require running applications with special LD_PRELOAD values. Some do not restore all resources (sockets, files, pipes, signals, process hierarchy, etc.). This implementation aims to be general purpose and tries to restore all the resources listed and more. See also http://lists.linux-foundation.org/pipermail/clusters_sig/attachments/20051025/88ed8b30/Comparison-CR-0001.pdf. A more detailed comparison with some of the C/R implementations is given below.
- Why not checkpoint/restart from user space?<p>C/R from user space would require a lot of kernel state to be exported to user space via ioctls, /proc, or new system calls. Such interfaces would become permanent and complicate maintenance.
- How does this implementation differ from BLCR (http://ftg.lbl.gov/checkpoint)?
- How does this implementation differ from Cryopid (http://cryopid.berlios.de/)?
- How does this implementation differ from Zap?
- How does this implementation differ from OpenVZ?
- Can non-root users checkpoint/restart an application?
For now, only users with the CAP_SYS_ADMIN capability can C/R an application. The restriction guards against tampered checkpoint images: an image is treated like a loadable kernel module, so only a privileged user may supply one.
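As a rough illustration of the privilege requirement above, the sketch below checks whether a process holds the capability by reading the `CapEff` mask from `/proc/<pid>/status`. It assumes a Linux system; the bit index 21 is the value of `CAP_SYS_ADMIN` in `<linux/capability.h>`, and the helper names are hypothetical, not part of this C/R implementation.

```python
# Sketch only: helper names are illustrative, not an API of the C/R patches.
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in <linux/capability.h>


def parse_cap_eff(status_text):
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal number without a 0x prefix.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_sys_admin(status_text):
    """True if the CAP_SYS_ADMIN bit is set in the effective mask."""
    return bool((parse_cap_eff(status_text) >> CAP_SYS_ADMIN) & 1)


def current_process_may_checkpoint():
    """Check the calling process (Linux only, reads /proc/self/status)."""
    with open("/proc/self/status") as f:
        return has_cap_sys_admin(f.read())
```

Calling `current_process_may_checkpoint()` as root in the initial user namespace would normally return `True`; for an unprivileged user it would normally return `False`.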
- Freezer: Why not use SIGSTOP to stop the application rather than the freezer?<p>SIGCONT can be blocked or handled by the application, so using SIGSTOP/SIGCONT would not be transparent to the application.
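The transparency problem can be seen with a small sketch: an application that installs a SIGCONT handler is notified every time it is resumed, so it can tell that it was stopped. The function name below is hypothetical, and the example sends SIGCONT to itself rather than going through a full SIGSTOP/SIGCONT cycle, which would need a second process.

```python
import os
import signal


def deliver_sigcont_to_self():
    """Install a SIGCONT handler, send SIGCONT to this process, and
    return the list of signal numbers the handler observed."""
    observed = []

    def on_sigcont(signum, frame):
        # A real application could run arbitrary code here, e.g. notice
        # that wall-clock time jumped while it was stopped.
        observed.append(signum)

    previous = signal.signal(signal.SIGCONT, on_sigcont)
    try:
        # SIGCONT is delivered to the handler even if the process was
        # not stopped, which is enough to show that it is catchable.
        os.kill(os.getpid(), signal.SIGCONT)
    finally:
        signal.signal(signal.SIGCONT, previous)
    return observed
```

A task frozen and thawed by the freezer receives no such notification, which is why the freezer is the less intrusive choice.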
- Pros/Cons of recreating process-tree in-kernel (as in OpenVZ)
- A single task calls sys_restart(); all other processes are created in the kernel.
- The kernel has complete control of all processes until return from sys_restart().
- Can only restart an entire container. To restart a subset of the checkpointed processes, the checkpoint image must be chopped from user space.
- Pros/Cons of recreating processes in user-space (as in Zap)
- Leverage existing calls like fork(), clone()
- Allow partial-container checkpoint (i.e. subtree checkpoint)
- Needs kernel glue/synchronization for all processes that call sys_restart()
- Needs a system call like clone_with_pids() to create processes with specific pids.
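The user-space approach can be sketched with plain fork(). The example below rebuilds a small process hierarchy from a saved description and reports which new pid each saved pid ended up with. The tree layout, the pid values 100 to 103, and the function names are made up for illustration; a real restart would also have each process call sys_restart() to load its state, which is not shown.

```python
import os

# Hypothetical saved hierarchy: (checkpointed pid, [child subtrees]).
SAVED_TREE = (100, [(101, []), (102, [(103, [])])])


def recreate(node, report_fd):
    """Report this process's pid for the saved pid, then fork the children."""
    saved_pid, children = node
    # Short writes to a pipe are atomic, so lines do not interleave.
    os.write(report_fd, f"{saved_pid} {os.getpid()}\n".encode())
    kids = []
    for child in children:
        pid = os.fork()
        if pid == 0:
            _run_child(child, report_fd)
        kids.append(pid)
    for pid in kids:
        os.waitpid(pid, 0)


def _run_child(node, report_fd):
    """Body of a forked child: recreate its subtree, then exit immediately."""
    status = 1
    try:
        recreate(node, report_fd)
        status = 0
    finally:
        os._exit(status)


def restart_tree(tree):
    """Recreate the tree under a new root; return {saved pid: new pid}."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        _run_child(tree, write_fd)
    os.close(write_fd)
    os.waitpid(pid, 0)
    with os.fdopen(read_fd) as f:
        lines = f.read().splitlines()
    return {int(saved): int(new) for saved, new in (l.split() for l in lines)}
```

The shape of the hierarchy is preserved, but fork() hands out whatever pids the kernel chooses, so the new pids will generally differ from the saved ones. That gap is what a call like clone_with_pids() is meant to close.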
- Should we allow/deny subtree (partial-container) checkpoint?
- Useful for development
- Useful for some workloads, as long as the user understands the limitations and leaks
- Useful for gathering more information during core dumps -- perhaps enough to restart the application from the core dump, attach with a debugger, and then observe/fix the problem.