As a software developer, you can use software containers to run applications consistently across operating systems. Containers are an operating-system paradigm in which the kernel allows multiple, isolated user-space instances to coexist. To create isolated process environments within a system, containers combine a bundle of techniques, the main ones being the following (sketched briefly after the list):

  • Linux namespaces: process space and IDs
  • chroot: filesystem visibility
  • cgroups: allocated resources (networking, number of processes, memory…)
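To get a feel for these primitives without any container platform, here's a minimal sketch using unshare from util-linux (Linux only, root required; flags assume a reasonably recent util-linux):

    # Start a shell inside new PID and mount namespaces
    sudo unshare --fork --pid --mount-proc /bin/sh

    # Inside that shell, the process tree is isolated:
    # the shell sees itself as PID 1
    ps aux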

Ideally, containers can run side by side on any machine, with all dependencies already included. Containerization tackles the “well, it worked on my machine” issue, where software seemingly runs fine on your machine but fails on another system.

There are platforms that help you create and run containers, but for the highly initiated, you can build a container from scratch, syscalls and all: see Liz Rice’s talk about building software containers from scratch (video link).

Docker is a branded containerization platform that packages and runs containers. Docker lets you build containers from templates with all the configurations and environment variables you need. You bake those configurations into an image and share it so others can run it in the same, consistent environment.

Why containers, generally

Using containers sets you up for application portability and reproducibility. Furthermore, you can use container orchestration software to add scalability for continuous integration/continuous deployment. Container orchestrators (such as Kubernetes) are another software abstraction level that automatically scales resources, monitors, and regulates clusters of containers.

Issue: “It worked on my machine.” -> 😣
Realize you need application runtime environment standardization -> 🙂
Containers help your application's portability + reproducibility -> 😁
Use container orchestration and regulate containers -> 🤩

‘Hello World’ Dockerfile

A Docker image is a read-only template used to build containers. Images are used to store and ship applications. Docker images are usually specified with directives in a file called the Dockerfile. A directive is an instruction to the Docker build engine:

FROM The file contents below tell Docker to build an image starting from an Alpine Linux distribution that includes Python 3.7 (lines starting with # are comments). Alpine is selected for its minimal size. The base image is hosted on and downloaded from Docker Hub, a repository of container images that the development community can build on for their specific use cases. Another repository is Google’s Container Registry, aimed at businesses, and there are numerous others.

WORKDIR The working directory, /app, is a directory created in the container. The rest of the commands run as if a terminal were navigated to /app.

COPY COPY . /app tells Docker to copy everything in the current directory (i.e., .) into the container’s /app directory. In this example, a helloworld.py file is being copied.

CMD The line with the Docker directive CMD executes the command python helloworld.py when the container runs (again, as if from a terminal navigated to /app). CMD is the command the container executes by default when the built image is launched.

Lastly, build and run With my host machine’s terminal navigated to the location of the Dockerfile and Python script, I build the container image with the command docker build --tag hello-world .. The tag option assigns a custom name so I can easily identify the image when it runs; the trailing . sets the build context to the current directory. Docker downloads the Linux distribution during the build, and running the image then executes the script in an isolated environment on my local machine.

# base image
FROM python:3.7-alpine3.9

# Set working dir to /app and copy over everything in current dir
WORKDIR /app
COPY . /app

# Run py script on container launch
CMD ["python", "helloworld.py"]

Jupyter notebook in a container

There are similar motivations to use containers in data science. When a business has a question to answer with data, a data scientist comes up with a methodology to produce findings, communicate results, and explain their decision. They need reproducible results for auditable work. Containers help package code, data, and all dependencies to reproduce that analysis.

The Dockerfile contents below create a Docker image with Jupyter Notebook, Python libraries, and data so that others can audit the work.

VOLUME The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from the native host or other containers. I’ll use the -v option to mount a host folder to that container volume.

RUN RUN is an image build step; the state of the container after a RUN command will be committed to the built image. In this example, I install jupyter and some other data science libraries.

EXPOSE This instruction tells Docker that the container listens on port 8888 at runtime. I’ll use the -p option to forward host-side port 9999 to container-side 8888.

Build and run command On my host machine, I’ll do:

docker build -t hello-jupyter .

to build the image specified by the Dockerfile, then:

docker run -p 9999:8888 -v "%cd%\shared_data":/app/shared_data hello-jupyter

to run the container (tagged “hello-jupyter”) and forward host-side port 9999 to container-side 8888. Files in \shared_data on the host are mounted into the container at /app/shared_data. Once it’s running, I can point a web browser at localhost:9999 to access Jupyter Notebook.

FROM python:3.6.5-slim
WORKDIR /app
COPY . /app

# define container-side path to share
VOLUME /app/shared_data

# Install dependencies
RUN pip --no-cache-dir install jupyter numpy pandas seaborn

# Make container-side port 8888 available
EXPOSE 8888

# Run jupyter when container starts (bind to 0.0.0.0 so it's reachable
# from outside the container; quoted wildcards aren't unwrapped in exec form)
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

Best practices

Use a base image with minimal resources and build it up with your particular dependencies.
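One quick way to see the difference: pull a minimal and a full-featured base image and compare their on-disk footprints (a sketch using the tags from earlier in this post):

    # Compare image sizes between a slim and a full base image
    docker pull python:3.7-alpine3.9
    docker pull python:3.7
    docker images python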

Be explicit about establishing resources in the Dockerfile (see the pinned example after this list):

  • Python and package versions, e.g.: pip install numpy==1.0
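For instance, the install step in the Jupyter Dockerfile above could pin every library (these version numbers are illustrative, not recommendations):

    # Pinned versions make image rebuilds reproducible
    pip install --no-cache-dir jupyter==1.0.0 numpy==1.16.2 pandas==0.24.2 seaborn==0.9.0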

Define default behavior. For example, ENTRYPOINT and CMD seemingly do the same thing, but you’d use the following to communicate that some Python code is supposed to run:

    ENTRYPOINT ["python"]
    CMD ["./main.py"]

Arguments appended to the docker run command override CMD (here, the target file main.py) but not ENTRYPOINT; replacing the ENTRYPOINT instruction requires docker run’s --entrypoint flag.
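A quick sketch of that difference, assuming the pair above was built into an image named hello-entry and a second script other.py was copied in (both names are hypothetical):

    # Arguments after the image name replace CMD: this runs `python other.py`
    docker run hello-entry other.py

    # ENTRYPOINT itself can only be swapped out with the --entrypoint flag
    docker run --entrypoint /bin/sh hello-entry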

🛳