Google Summer of Code 2016/lxc migration
During summer 2016, I worked as part of Google Summer of Code on adding support for migration of Linux containers in libvirt. While there is still a lot of ongoing work to make live migration fully functional, some basic functionality can already be tested, and I describe it below.
The underlying migration technology is CRIU, which requires some kernel configuration options to be enabled. Instructions on how to configure your kernel for CRIU can be found here. At the time of writing, I am using the criu-dev branch, which is at version 2.4.
Checkpoint/Restore (C/R) of Linux containers
The first thing I implemented for this GSoC project was saving a Linux container's state into files and restoring the container from these files later. To this end, I implemented new functions inside lxc_driver that back the generic domainSave and domainRestore driver functions. As a result, two additional commands can now be invoked from virsh: virsh save and virsh restore.
virsh save [domain-name | domain-id | domain-uuid] [directory name]
virsh restore [directory name]
The virsh save command stops the guest you specify and saves its data into files inside the provided directory. This may take some time, depending on the amount of memory in use by your guest.
The virsh restore command restarts the saved guest, which may take some time as well. The guest's name and UUID are preserved, but the guest is assigned a new ID.
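As a rough illustration, the two commands can be driven from a small script. This is only a sketch: it assumes a virsh binary built with the C/R patches is on the PATH, and the domain name `shell-container` and the directory `/tmp/checkpoint` are made-up examples. By default it only prints the commands instead of running them.

```python
import subprocess

LXC_URI = "lxc:///"  # libvirt connection URI for the LXC driver


def save_cmd(domain, checkpoint_dir):
    """Build the argv for `virsh save` with a directory as the last argument."""
    return ["virsh", "-c", LXC_URI, "save", domain, checkpoint_dir]


def restore_cmd(checkpoint_dir):
    """Build the argv for `virsh restore` from a checkpoint directory."""
    return ["virsh", "-c", LXC_URI, "restore", checkpoint_dir]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical domain name and directory, shown as a dry run.
run(save_cmd("shell-container", "/tmp/checkpoint"))
run(restore_cmd("/tmp/checkpoint"))
```

Passing `dry_run=False` would actually invoke virsh and raise an exception if the command fails.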
Currently, libvirt only informs the user whether the operation succeeded or failed. For more information, look inside the checkpoint directory for the .log file produced by CRIU. The above operations can be invoked either on the same host or on different hosts. For the latter to work, two extra steps are needed:
- rsync the root filesystem to the destination host
- rsync the checkpoint directory to the destination host
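The two rsync steps could be scripted roughly as follows. This is a sketch under assumptions: the host name `dst-host` and both paths are hypothetical, and the helper only builds the command lines so that they can be inspected before being run.

```python
def rsync_cmd(src_dir, dest_host, dest_dir):
    """Build an rsync invocation that mirrors src_dir into dest_dir on dest_host."""
    # -a (archive) keeps permissions, timestamps and symlinks; keeping
    # ownership as well generally requires running rsync as root.
    # The trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    source = src_dir.rstrip("/") + "/"
    target = f"{dest_host}:{dest_dir.rstrip('/')}/"
    return ["rsync", "-a", source, target]


# Hypothetical locations of the container rootfs and the checkpoint directory.
steps = [
    rsync_cmd("/var/lib/lxc/shell-container/rootfs", "dst-host",
              "/var/lib/lxc/shell-container/rootfs"),
    rsync_cmd("/tmp/checkpoint", "dst-host", "/tmp/checkpoint"),
]
for step in steps:
    print(" ".join(step))
```

Each list can be handed to `subprocess.run(step, check=True)` once the paths have been adjusted to the actual setup.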
Here is a DEMO showing the functionality for the above commands. In this short terminal recording we show the C/R operations for a shell container that is running a simple script.
Here is a link to the patches that add the above C/R functionality to the libvirt lxc_driver. These patches are not yet merged, for reasons I describe below.
As already mentioned, the C/R functionality in lxc_driver uses CRIU as the underlying technology. CRIU is given the pid of the container's root process and dumps the container's state into a set of image files in various formats (Google protocol buffers, binary, raw). With CRIU's current implementation, all libvirt can get from CRIU is a directory containing the .img files. This is inconsistent with libvirt's existing architecture, which expects 'virsh save' to produce a single file and 'virsh restore' to restore from a file, not a directory. One could argue that we could tar the directory and hand the tar file to these two commands. That is only a short-term solution, and here is why:
Consider the live migration scenario, where it is very important to keep the container's downtime as short as possible. What affects the downtime is too long a story to discuss here, but what matters for this scenario is that the architecture should not force us to store the .img files produced by CRIU on disk before transferring them to the destination host. There are ways to work around this, such as CRIU's page server or a setup in which the produced files are saved to an NFS-exported directory that lives on tmpfs. However, it would be far more practical if CRIU could stream the produced files to libvirt in binary format, for example through sockets, and let libvirt decide what to do with them: either store them on disk in the desired format in the simple C/R scenario, or stream them to the destination host through libvirt's internal streaming protocol in the live migration scenario. Such an implementation would not only provide a generic solution for both scenarios, it would also have the lowest possible latency in the live migration scenario, since the images could stay in memory and be transferred directly to the destination.
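To make the "let libvirt decide" idea more concrete, here is a toy model in Python. It is not libvirt or CRIU code: the socket pairs merely stand in for the CRIU-to-libvirt channel and for the link to the destination host, and all function names are made up for this sketch. The same incoming stream is either written to a file (simple C/R) or forwarded without touching the disk (live migration).

```python
import os
import socket
import tempfile


def read_chunks(conn, size=4096):
    """Yield data from a socket until the sender closes its end."""
    while True:
        data = conn.recv(size)
        if not data:
            return
        yield data


def store_on_disk(chunks, path):
    """Simple C/R case: write the streamed image data to a file."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)


def forward_to_peer(chunks, conn):
    """Live migration case: pass the data on without touching the disk."""
    for chunk in chunks:
        conn.sendall(chunk)


def demo():
    payload = b"fake image data" * 100  # small enough to fit in socket buffers

    # Case 1: the stream ends up in a file.
    criu_end, libvirt_end = socket.socketpair()
    criu_end.sendall(payload)
    criu_end.close()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.img")
        store_on_disk(read_chunks(libvirt_end), path)
        with open(path, "rb") as f:
            on_disk = f.read()
    libvirt_end.close()

    # Case 2: the stream is forwarded to a stand-in for the destination host.
    criu_end, libvirt_end = socket.socketpair()
    src_end, dst_end = socket.socketpair()
    criu_end.sendall(payload)
    criu_end.close()
    forward_to_peer(read_chunks(libvirt_end), src_end)
    src_end.close()
    forwarded = b"".join(read_chunks(dst_end))
    libvirt_end.close()
    dst_end.close()
    return on_disk, forwarded


on_disk, forwarded = demo()
print(len(on_disk), len(forwarded))
```

The point of the design is that the producer side is identical in both cases. Only the consumer chosen by libvirt differs.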
For this purpose, I dug up some old patches from CRIU's archives that add the desired functionality but were never merged. I had to rebase them onto the current master branch and adjust them as needed. After the required changes, the latest version of these patches can be found here. They still need some minor bug fixes before they can be merged, but they are functional.
The basic idea behind these patches is the following: instead of saving files on disk, CRIU dump sends the images to a process (image-proxy) that keeps them in memory. Another process (image-cache) on the destination node receives the files from image-proxy and keeps them in memory until CRIU restore is invoked and asks for them. All file transfers go through sockets.
src_node: CRIU dump -> (sends images using a local socket) -> image-proxy
dst_node: CRIU restore <- (receives images from a local socket) <- image-cache
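The flow in the diagram can be mimicked with a small single-process simulation. This is a toy model only: the framing format, the `ImageStore` class and the image names are invented for illustration and are not CRIU's actual wire protocol or implementation.

```python
import socket
import struct


def send_image(sock, name, data):
    """Frame one image as: name length, data length, name, data."""
    encoded = name.encode()
    sock.sendall(struct.pack("!II", len(encoded), len(data)) + encoded + data)


def recv_exact(sock, n):
    """Read exactly n bytes or raise EOFError if the peer closed early."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise EOFError("peer closed the socket")
        buf += part
    return buf


def recv_image(sock):
    """Return (name, data), or None once the sender has closed the socket."""
    try:
        header = recv_exact(sock, 8)
    except EOFError:
        return None
    name_len, data_len = struct.unpack("!II", header)
    name = recv_exact(sock, name_len).decode()
    return name, recv_exact(sock, data_len)


class ImageStore:
    """In-memory holder, used here for both image-proxy and image-cache."""

    def __init__(self):
        self.images = {}

    def receive_all(self, sock):
        while True:
            item = recv_image(sock)
            if item is None:
                break
            name, data = item
            self.images[name] = data

    def send_all(self, sock):
        for name, data in self.images.items():
            send_image(sock, name, data)


def demo():
    dump_sock, proxy_in = socket.socketpair()   # CRIU dump -> image-proxy
    proxy_out, cache_in = socket.socketpair()   # image-proxy -> image-cache
    images = {"core-1.img": b"\x01\x02", "pagemap-1.img": b"\x03" * 16}

    for name, data in images.items():           # the "dump" side
        send_image(dump_sock, name, data)
    dump_sock.close()

    proxy = ImageStore()                        # source node, in memory
    proxy.receive_all(proxy_in)
    proxy.send_all(proxy_out)
    proxy_out.close()

    cache = ImageStore()                        # destination node, in memory
    cache.receive_all(cache_in)
    proxy_in.close()
    cache_in.close()
    return cache.images                         # what "restore" would ask for


print(sorted(demo()))
```

In a real setup the proxy and the cache run on different hosts and the transfers overlap in time. Here everything is sequential to keep the sketch short.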
Other challenges met
One of the most challenging parts was managing to C/R the container's pseudoterminal interfaces. The problem lies in the fact that a pty has two ends that communicate through a channel: a master end and a slave end. The slave end lives inside the container's subtree that we dump, whilst the master end does not. The problems show up during restore, when CRIU tries to restore a slave end that it dumped but cannot find the corresponding master end. CRIU provides a way to C/R these external ttys, and this has been successfully adopted for /dev/tty1, which is the default console. If higher-numbered ttys are configured for the lxc guest, you will run into problems with the current implementation. Adding support for them should be an easy task for the future.
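For readers unfamiliar with ptys, the two ends can be observed with Python's standard pty module. The sketch below is unrelated to the libvirt or CRIU code. It only shows that a pty is a pair of file descriptors and that data written to one end is read from the other, which is why restoring the slave end alone is not enough.

```python
import os
import pty
import tty


def pty_roundtrip():
    """Open a pty pair and pass data in both directions."""
    master_fd, slave_fd = pty.openpty()
    try:
        # Raw mode turns off echo and newline translation, so the bytes
        # arrive unchanged at the other end.
        tty.setraw(slave_fd)
        slave_name = os.ttyname(slave_fd)      # e.g. a /dev/pts/N device on Linux

        os.write(master_fd, b"to-slave")       # master -> slave
        seen_by_slave = os.read(slave_fd, 1024)

        os.write(slave_fd, b"to-master")       # slave -> master
        seen_by_master = os.read(master_fd, 1024)
        return slave_name, seen_by_slave, seen_by_master
    finally:
        os.close(master_fd)
        os.close(slave_fd)


name, seen_by_slave, seen_by_master = pty_roundtrip()
print(name, seen_by_slave, seen_by_master)
```

In the container case, the process holding the slave end is part of the dumped subtree, while the master end is held by a process outside of it.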
Extra work done as part of the GSoC project
- Added job support to the lxc driver and used the job functions to support simultaneous jobs. This matters, for example, when a long job such as migration is running on a domain and we want to query the domain's state at the same time.
- Reworked reference counting in the lxc driver.
Remaining work
- Fix bugs in CRIU's patches for the --remote option and merge them into criu-dev.
- Use the above patches in the existing C/R patches for libvirt, so that file transfers between libvirt and CRIU, in both directions, follow the desired architecture.
- Add support for higher numbered ptys.
- I also started implementing live migration support for the lxc_driver some time ago, but after the underlying protocol changed I stopped in order to work on the CRIU project. The work on live migration support can currently only be found on my GitHub profile.
If you simply want to test the above, the simplest way is probably to clone the latest work, which can be found in my own fork of the libvirt project here.