Debugging 101 – Part II (docker-machine zombies)

This is a follow-up to the Debugging 101 post, in which we apply the process to a real-world docker-machine problem.

We recently encountered an issue where our provisioning container started accumulating docker-machine zombie processes. A quick Google search pointed to some aufs-related issues that should already have been resolved in the version we were using. At this point we had no idea why we were still seeing the zombies. Let's start with the problem definition.

 

Define the problem

  • What? We see an increasing number of docker-machine zombie processes.
  • When? It seems to happen on almost any docker-machine command, but not always. For example, running docker-machine ls sometimes creates the zombies.
    Now we skip to Collect the Data to find more information about when exactly the issue does and does not happen.
  • When does it happen? It happens whenever we have at least one host created with docker-machine.
  • When does it not happen? When there is no host created with docker-machine.
    Yay! We got Catch the failure for free, so all we need to do is create a docker-machine host and run docker-machine ls.
# ps aux | grep Z
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18   0.0  0.0 0    0   ?  Z    18:38 0:00 [docker-machine] <defunct>
root 132  0.0  0.0 0    0   ?  Z    18:45 0:00 [docker-machine] <defunct>
# ps aux | grep Z | wc -l
10
# docker-machine ls > /dev/null; ps aux | grep Z | wc -l;
11 .... we got a new zombie!
  • Where does it happen? Inside the provisioning container.
  • Where does it not happen?
    • On the container host, even with the same docker-machine binary
    • In another container created from the same image, run interactively with a bash shell
  • Who? Running the container as privileged or unprivileged does not seem to affect the issue.
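
The ps aux | grep Z check above is quick, but its absolute count is inflated: the ps header line contains "VSZ", and the grep process itself can also match. The increase from one run to the next is still meaningful. A stricter count can be sketched in Python by reading the state field from /proc directly. This is a Linux-only sketch, and the helper names process_state and find_zombies are made up for illustration.

```python
import os


def process_state(pid):
    """Return (command, state) for a PID, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None  # the process disappeared (or is not readable)
    # Format is "pid (comm) state ..."; comm may itself contain spaces or
    # parentheses, so split on the last ")".
    head, _, tail = data.rpartition(")")
    command = head.split("(", 1)[1]
    state = tail.split()[0]
    return command, state


def find_zombies():
    """Return {pid: command} for every process currently in state Z."""
    zombies = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        info = process_state(int(entry))
        if info is not None and info[1] == "Z":
            zombies[int(entry)] = info[0]
    return zombies


if __name__ == "__main__":
    found = find_zombies()
    for pid, command in sorted(found.items()):
        print(pid, command)
    print(len(found), "zombie(s)")
```

Running it before and after docker-machine ls inside the container should show the same one-zombie increase as the transcript, without the header and grep lines being counted.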

Understand the system

What is a zombie process? When is it created and how do you get rid of it?

In Unix, when a process terminates, it does not disappear immediately. The kernel releases most of its resources, such as memory and open file descriptors, but keeps its entry in the process table so that the parent process can check the exit status. Until the parent collects that status, the process is a zombie (shown as Z or <defunct> in ps). The parent collects the exit status using the wait() family of system calls, such as waitpid().
If the parent process exits without waiting for the child, the child is orphaned and is re-parented to the ancestor of all processes in the system.
In a normal Unix system, this is the init process. The init process reaps orphaned processes by waiting on them when they exit, which prevents them from lingering as zombies.
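
The lifecycle described above can be sketched in a few lines of Python. This is an illustration under assumptions, not how docker-machine itself is written: it relies on os.waitid() with WNOWAIT (available on Linux) to pause until the child has exited without reaping it, and the function names are invented for the example.

```python
import os


def entry_exists(pid):
    """True while the kernel still has a process table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
        return True
    except ProcessLookupError:
        return False


def zombie_lifecycle(exit_code=7):
    """Fork a child that exits at once; return what the parent observes."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before_wait = entry_exists(pid)  # the zombie still occupies its entry

    # waitpid() collects the exit status and lets the kernel drop the entry.
    _, status = os.waitpid(pid, 0)
    after_wait = entry_exists(pid)

    return before_wait, os.WEXITSTATUS(status), after_wait


if __name__ == "__main__":
    print(zombie_lifecycle())
```

The first value shows that the exited child is still present before the parent waits, the second is the exit status the parent collected, and the third shows that the entry is gone once waitpid() has returned.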

What is special about containers?

Inside a container, the ancestor is the container's entrypoint process, which may not provide the functionality of a full-fledged init system.
Most containers are designed to run one process and exit, and do not need a full-fledged init system. However, if your container starts subprocesses that expect the ancestor to perform the reaping, you have to provide this functionality yourself.
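
To make "provide this functionality" concrete, here is a minimal sketch of an entrypoint wrapper that starts one main command and reaps every child that exits in the meantime. It is an assumption-laden illustration, not the fix we used: run_and_reap and exit_code_from are hypothetical names, and the sketch does not forward signals or handle shutdown the way a real init would.

```python
import os
import sys


def exit_code_from(status):
    """Translate a wait() status into a shell-style exit code."""
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)
    return os.WEXITSTATUS(status)


def run_and_reap(argv):
    """Run argv as the main child and reap every child that exits meanwhile.

    When this process is the top-level process of a container, orphans are
    re-parented to it, so the same wait() loop collects them too.
    """
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # replace the child with the command
        finally:
            os._exit(127)  # only reached if exec failed

    while True:
        pid, status = os.wait()  # blocks until any child exits
        if pid == main_pid:
            return exit_code_from(status)
        # Any other pid was an extra or adopted child; it is now reaped.


if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(run_and_reap(sys.argv[1:]))
```

The loop returns as soon as the main command exits, which suits a stateless container that can be shut down at any time. Children that are still running at that point are not waited for.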

Identify potential causes

With the understanding of the system and the problem definition, potential causes jump out at us.

  1. docker-machine creates child processes and does not wait for them to exit. It spawns docker-machine child processes to load its driver plugins, and the parent exits before they do. This is a cause, but it also seems to be expected behavior.
  2. The init subsystem is not reaping the orphaned docker-machine processes. Wait… What init subsystem? We don’t have one for our container – so we have our root cause.
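
The interaction between the two causes can be reproduced in isolation. The sketch below is not docker-machine code. It assumes a Linux-like system and uses an invented function name, orphan_demo: a "middle" process starts a child and exits without waiting for it, and the child reports which process adopted it.

```python
import os
import time


def orphan_demo(timeout=5.0):
    """Return (middle_pid, adopter_pid) for a child orphaned by its parent."""
    read_fd, write_fd = os.pipe()
    middle_pid = os.fork()
    if middle_pid == 0:
        try:
            os.close(read_fd)
            my_pid = os.getpid()
            if os.fork() == 0:
                # Grandchild: poll until the kernel has re-parented us.
                deadline = time.monotonic() + timeout
                while os.getppid() == my_pid and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
        finally:
            # The middle process exits right away, without waiting for its
            # child; the grandchild exits after reporting its new parent.
            os._exit(0)

    os.close(write_fd)
    os.waitpid(middle_pid, 0)  # reap the middle process itself
    with os.fdopen(read_fd) as reader:
        adopter_pid = int(reader.read())  # EOF arrives when the grandchild exits
    return middle_pid, adopter_pid


if __name__ == "__main__":
    middle, adopter = orphan_demo()
    print("orphan was adopted by PID", adopter, "instead of", middle)
```

Whether the orphan later lingers as a zombie depends entirely on the adopter: if that process never calls wait(), as in our container, the entry stays in the process table.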

Fix, test and test

The fix was easy: run the container with bash, or any other process that reaps orphaned processes, as its top-level process. Our provisioning container is stateless and can be shut down at any time, so we did not require a full-fledged init system. If you do require one, there are solutions like baseimage-docker, which provide this functionality for you.
We tested this with our automated integration tests as well as our bench test for scale testing and found no more zombies. We declare success at this point!

References

  1. Kepner-Tregoe Problem Solving
  2. Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems, by David J. Agans
  3. Advanced Programming in the UNIX® Environment, Third Edition, by W. Richard Stevens and Stephen A. Rago
  4. Slides from the DevPulseCon Debugging 101 session.
