Docker for reproducible research - Data Science Toolkit

10/5/2022 10-minute read

These are my notes on Docker for reproducible research from Data Science Toolkit, by David Benkeser.

Video | Notes | Github

Part 1 (0:00 - 32:50)

Pre-lecture Question

Q: Can we review project organization?

Phony data analysis

$\underline{\text{Create and Open Project then Create Folders}}$

In the terminal:

mkdir example_project_2 $\quad\quad\quad\text{: make directory }$
subl example_project_2 $\quad\quad\quad\quad\text{: open in sublime }$

Then in a terminal in sublime create some folders:

mkdir code $\quad\quad\quad\quad\quad\quad\quad\text{: make directory }$
mkdir raw_data $\quad\quad\quad\quad\quad\quad\text{: make directory }$
mkdir clean_data $\quad\quad\quad\quad\quad\quad\text{: make directory }$
mkdir output $\quad\quad\quad\quad\quad\quad\quad\text{: make directory }$

$\underline{\text{Save Raw Data}}$

In the terminal:

R $\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: open R }$
data(mtcars) $\quad\quad\quad\quad\quad\quad\quad\text{: load mtcars data set }$
head(mtcars) $\quad\quad\quad\quad\quad\quad\quad\text{: show head of mtcars }$
getwd() $\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: get working directory }$
saveRDS(mtcars, "raw_data/mtcars.rds") $\quad\text{: save as .rds file}$
save(mtcars, file = "raw_data/mtcars.RData") $\quad\text{: save as .RData file}$
q() $\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: quit R }$
y $\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: answer the save-workspace prompt and quit }$
rm raw_data/mtcars.RData $\quad\quad\text{: remove data from folder}$

Note that .Rds is preferred.

$\underline{\text{Create Clean Data File}}$

[command] [N] : to create a new file.

Then save that file in the code directory as “clean_data.R” with the following code:

Note the 3 different places the here package is used.
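The listing itself did not survive in these notes. As a rough stand-in, here is a minimal sketch that writes such a script from the terminal. Only the input and output paths come from the notes; the cleaning step (keeping cars with mpg above 15) and the exact three here calls are assumptions for illustration, not the lecture's code.

```shell
# Hypothetical sketch, not the lecture's file: write code/clean_data.R from the terminal.
mkdir -p code
cat > code/clean_data.R <<'EOF'
# 1) declare where this script sits relative to the project root
here::i_am("code/clean_data.R")

# 2) read the raw data through a path built from the project root
mtcars_raw <- readRDS(here::here("raw_data", "mtcars.rds"))

# placeholder cleaning step (assumption): keep cars with mpg above 15
mtcars_cleaned <- subset(mtcars_raw, mpg > 15)

# 3) save the cleaned data through a path built from the project root
saveRDS(mtcars_cleaned, here::here("clean_data", "mtcars_cleaned.rds"))
EOF
```

Because every path is built from the project root, the script behaves the same whether it is run from the project directory or through make.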

$\underline{\text{Create Makefile}}$

[command] [N] : to create a new file.

Then save that file in the project directory as “Makefile” with the following code:
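The Makefile listing is missing from these notes. A minimal sketch of a first rule, assuming the cleaned data depends on the cleaning script and the raw data file (the prerequisites are a guess; the target and the Rscript recipe match the terminal session shown in these notes):

```shell
# Hypothetical sketch of the first Makefile rule. printf is used so the
# recipe line starts with a literal tab, which make requires.
printf '%s\n\t%s\n' \
  'clean_data/mtcars_cleaned.rds: code/clean_data.R raw_data/mtcars.rds' \
  'Rscript code/clean_data.R' > Makefile
```

With a rule like this, make re-runs the script only when a prerequisite is newer than the target.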

In the terminal:

make clean_data/mtcars_cleaned.rds
Rscript code/clean_data.R

Verify data is in clean_data folder

$\underline{\text{Create make table file}}$

[command] [N] : to create a new file.

Then save that file in the code folder as “make_table.R” with the following code:

$\underline{\text{Update Makefile}}$

Now update the Makefile with the following:

In the terminal:

make output/data_for_table.rds

Verify contents in output folder.

$\underline{\text{Create report.Rmd}}$

[command] [N] : to create a new file.

Then save that file in the project directory as “report.Rmd” with the following code:

$\underline{\text{Update Makefile}}$

Update the Makefile with the following:
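The final Makefile is also missing here. A sketch of how the three targets used in these notes could chain together; the target names come from the notes, while the prerequisites and recipes are assumptions:

```shell
# Hypothetical sketch of the finished Makefile. The format string is reused
# for each target/recipe pair and puts a literal tab before every recipe.
printf '%s\n\t%s\n\n' \
  'report.html: report.Rmd output/data_for_table.rds' \
  "Rscript -e \"rmarkdown::render('report.Rmd')\"" \
  'output/data_for_table.rds: code/make_table.R clean_data/mtcars_cleaned.rds' \
  'Rscript code/make_table.R' \
  'clean_data/mtcars_cleaned.rds: code/clean_data.R raw_data/mtcars.rds' \
  'Rscript code/clean_data.R' > Makefile
```

Each target lists the previous step's output as a prerequisite, so asking for report.html rebuilds whatever upstream pieces are out of date.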

In the terminal:

make report.html
open report.html

Pre-lecture Question (2)

Q: Why doesn’t Git / GitHub recognize empty folders?

A: Git tracks files, not directories, so an empty folder gives it nothing to track.

Make Github aware of empty folder

In terminal

rm output/* $\quad\quad\quad\quad\text{: remove everything in output folder}$
git init $\quad\quad\quad\quad\quad\quad\text{: initialize git}$
git status $\quad\quad\quad\quad\quad\text{: status}$

Notice the output folder isn’t included.

cd output $\quad\quad\quad\quad\quad\text{: change to output directory}$
touch .gitkeep $\quad\quad\quad\text{: add empty file}$
ls -a $\quad\quad\quad\quad\quad\quad\text{: list all files}$
cd .. $\quad\quad\quad\quad\quad\quad\text{: change directory up one level}$
git status $\quad\quad\quad\quad\quad\text{: status}$

Now the output folder is visible.
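The same behavior can be reproduced in a throwaway repository. This is a sketch; the directory name gitkeep_demo is made up, and .gitkeep is only a naming convention (any file would do):

```shell
# Sketch in a scratch directory (hypothetical name: gitkeep_demo).
mkdir -p gitkeep_demo/output
cd gitkeep_demo
git init --quiet

# With only an empty folder present, the status listing has no entries.
echo "before: $(git status --porcelain)"

# Add a placeholder file; git now has something to track in output/.
touch output/.gitkeep
echo "after: $(git status --porcelain)"
cd ..
```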

Part 2: Docker (33:00 - 1:30:45)

The problem: R depends on things on your computer that have nothing to do with R, and R won’t help you track those dependencies. renv only helps pin the versions of R packages.

You can download R code from github, but that doesn’t mean you can run it.

Key Principle: Software is Code!

running code requires software
we want a way to package up all analysis and code AND software

Containerization gives us a way to do this.

Docker is a popular containerization program

Think of a Docker container as a virtual machine:

Has its own (Unix) operating system
Has its own file system
Has its own software applications

(A computer within a computer)

Images Vs. Containers

Docker image:

source code, libraries, dependencies, tools, and other files needed to run a container.
a blueprint for the environment in which you will execute your analysis

Docker container:

created when we run an image
constructs a run-time environment to execute your code
or provides an interactive way to execute code.

How does it work?

Install the Docker program.
Write a Dockerfile, a plain text file that tells Docker what to put in the image.
Build the image.
Run the image (creating a container) to execute code.

Interactive Ubuntu container

In the terminal:

docker pull ubuntu $\quad\quad\quad\text{: this is downloading an image from the internet}$

This downloads a bare Ubuntu operating system image.

If you have issues with this step check out this stack overflow link here or here.

docker run -it ubuntu /bin/bash $\quad\text{: puts you inside interactive (-it) ubuntu operating system}$

In a different terminal on your computer:

docker container ps $\quad\text{: prints all the containers you have running}$

Back inside docker terminal:

touch my_file_in_my_container $\quad\quad\text{: add file to container}$
exit $\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: container stops running}$
docker container ps $\quad\quad\quad\quad\text{: see that the container is no longer running}$
docker run -it ubuntu /bin/bash $\quad\quad\text{: start a docker container}$

Notice this is a different container, and the file that was created no longer exists.

Containers are transient!

exit $\quad\quad\quad\quad\quad\quad\quad\quad\quad\text{: exit container}$

Adding a text editor to a docker file

In the terminal:

mkdir docker_ex $\quad\quad \text{: create docker_ex directory}$
subl docker_ex $\quad\quad\text{: open in sublime}$

[command] [s] : save file as “Dockerfile”

FROM: sets the starting point, here the Ubuntu operating system image.
apt-get: the package management tool in Ubuntu.
update: refreshes apt-get’s list of available packages.
install vim: installs the text editor vim.
RUN: Dockerfile instruction that tells Docker to run those two commands inside the Ubuntu image during the build.
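The Dockerfile listing is missing from these notes. A sketch consistent with the description above, written from the directory that contains docker_ex; the exact layout of the lecture's file, including its line numbers, is unknown:

```shell
# Hypothetical reconstruction, not the lecture's file.
mkdir -p docker_ex
cat > docker_ex/Dockerfile <<'EOF'
# start from the official ubuntu image
FROM ubuntu

# refresh the package list, then install vim
RUN apt-get update && apt-get install vim
EOF
```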

Now back in the terminal:

cd docker_ex/
ls
docker build -t ubuntu_plus_vim . $\quad\quad\text{: build an image from the Dockerfile in the current directory (that's what "." means)}$

The build fails because apt-get asks for confirmation and we aren’t inside the container to answer yes.

Fix: add “-y” to the install command (line 7), which answers yes automatically:
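A sketch of a Dockerfile with the fix applied, written from inside docker_ex. It is a minimal version, so its line numbers will not match the file from the lecture:

```shell
# Hypothetical minimal Dockerfile with the -y fix.
cat > Dockerfile <<'EOF'
FROM ubuntu
# -y answers "yes" to apt-get's confirmation prompt so the build runs unattended
RUN apt-get update && apt-get install -y vim
EOF
```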

Now that it works let’s run it.

docker run -it ubuntu_plus_vim /bin/bash $\quad\quad\text{: open the Ubuntu image with vim that we just created}$
vim $\quad\quad\text{: opens text editor for editing}$
:q $\quad\quad\quad\text{: to quit vim}$

If you need to put a special piece of software inside a container you would google “how to install ____ on ubuntu”, because Ubuntu is the operating system inside the container.

Dockerfile

COPY Example

Create a new file called hello_world.txt.

Now add the following line to your Dockerfile:

COPY hello_world.txt home/hello_world.txt

In the sublime terminal:

docker build -t ubuntu_plus_vim_plus_helloworld .
docker run -it ubuntu_plus_vim_plus_helloworld
ls
cd home
ls
cat hello_world.txt
exit

This will be useful so that we can copy our code into the container.

WORKDIR

Sets the working directory for all of your commands.

Reason: We don’t always know what the directories are within Docker (like home in the previous example), so we want to point Docker at the directory where our project lives.

Update Dockerfile with the following:
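The updated listing is missing here. A sketch that would give the result described in this section, assuming the working directory is the absolute path /analysis_directory (the notes only give the directory name):

```shell
# Hypothetical sketch of the updated Dockerfile.
cat > Dockerfile <<'EOF'
FROM ubuntu
RUN apt-get update && apt-get install -y vim

# later instructions, and any shell started in the container, use this directory
WORKDIR /analysis_directory

# the relative destination "." now resolves to /analysis_directory
COPY hello_world.txt .
EOF
```

WORKDIR creates the directory if it does not already exist, so no separate mkdir step is needed.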

In the terminal:

docker build -t try_workdir .
docker run -it try_workdir /bin/bash

Notice that we start out in the analysis_directory, and inside the directory is hello_world.txt.

exit

CMD

Container’s entry point: what is the container going to do by default?

Use cases:

interactively run bash: CMD /bin/bash
open R: CMD ["R"]
run whole analysis and produce report: CMD Rscript -e "rmarkdown::render(…)"

Breakout Exercise 1:

Inside docker_ex directory:

In the terminal:

echo "This file is silly" > silly_file.txt
subl

Create and save the following Dockerfile within the same directory as newimage.Dockerfile:
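The exercise's Dockerfile is not reproduced in these notes. One possible version that would satisfy the checks in this exercise; the value of my_variable and the copy destination are assumptions:

```shell
# Hypothetical sketch of newimage.Dockerfile.
cat > newimage.Dockerfile <<'EOF'
FROM ubuntu

# copy the file created above into the image's root directory
COPY silly_file.txt /silly_file.txt

# hypothetical value; the notes only show that my_variable is set
ENV my_variable="some value"

# default command when the container is run without arguments
CMD ["echo", "Hello from your computer"]
EOF
```

Passing a command on the docker run line, such as /bin/bash, replaces the CMD for that run, which is why the same image can both print the message and open a shell.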

Then in the terminal:

docker build -t br1 --file newimage.Dockerfile .
docker run br1

(should print message: Hello from your computer)

docker run -it br1 /bin/bash
ls
cat silly_file.txt
echo $my_variable
exit

Part 3 (1:31:20 - 3:24:40)

Make sure you have a Docker account and are logged in. Then in the terminal:

docker pull rocker/rstudio
docker pull rocker/tidyverse

DockerHub

DockerHub is the GitHub for images. Many are free, some are not. Example: Ubuntu.

Pushing to DockerHub

Share images by building locally and using docker push. In the docker_ex directory:

docker build -t user_name/test_repo .
docker push user_name/test_repo

This should now be available on your DockerHub.

Viewing / removing containers / images

docker container ps : list running containers; add -a to also see stopped containers.

docker images : list all downloaded images

docker image rm [id] : remove an image; replace [id] with the value from the IMAGE ID column of docker images

docker system prune : remove stopped containers, dangling images, and other unused data. Add -a to also remove every image not used by a container.

You can always pull your images again from DockerHub, so you don’t need to keep them on your computer all the time.

Rocker images

R/RStudio images are available from Rocker.

R Studio Images

The entry point for rocker/rstudio runs RStudio Server, a web browser interface to RStudio.

In the terminal:

docker run -e PASSWORD="s123" -p 8787:8787 rocker/rstudio

Then open the following webpage in your web browser: http://localhost:8787

The default username: rstudio, and password: s123 (what you set before).

Now this RStudio is running inside your Docker container, and you will have the same environment regardless of what computer you are using.

[control] [C] : stops the container.

Note: adding -d runs the container in detached mode, so it keeps running even if you close the terminal window.

docker run -e PASSWORD="s123" -p 8787:8787 -d rocker/rstudio

To kill:

docker container kill ##container_id##

An R Project Workflow

Download the example_project folder, then navigate to it in the terminal.

cd "example_project 2"
docker build -t ex_proj .
subl .

Look through the files, and read what is happening in the Dockerfile.

docker run -it ex_proj
cd project
ls
echo $which_fig
R
library(wesanderson)
?wesanderson
q()
exit
docker container ps $\quad\quad\text{: verify container isn't running}$

Typical Workflow:

updating analysis code, running, …
tweak figures, compile report

Docker Workflows (slightly more complicated)

different R version/packages between container and local?
update code locally, rebuild image, run code in container, …
tweak figures locally, rebuild image, compile report in container,…

Mounted Directories

We can write code and have changes appear instantly inside a running container by using the -v option to docker run

“Mounts” a local directory to a directory in the container.
i.e. directory on local computer appears in container.

For example: on local computer, project directory is at /path/to/project. In container, project directory is /project/.

In terminal:

(example)

docker run -v /path/to/project/R:/project/R -it ex_proj $\quad\quad\text{: connect path of project/R on computer to path of project/R in container}$

(mine)

docker run -v ~/Desktop/2022/"example_project 2"/R:/project/R -it ex_proj
ls
cd project
cd R
cat make_barchart.R

On local computer change the contents of make_barchart.R, by adding a new comment.

cat make_barchart.R $\quad\quad\quad\text{: print the file again; the new comment appears}$

Mounted directories are also useful for retrieving final output.

cd .. $\quad\quad\text{: move back into project folder}$
make report.html $\quad\quad\text{: creates .html file in project}$

Output folder

Create one output folder that has anything you want the user to retrieve. Then instruct the user that if they want to retrieve the output from this container that they need to mount a local directory to this output folder.

Environment variables

A useful way to control the run-time behavior of docker containers is to set environment variables at runtime.

Recall we set our environment variable to "gears_vs_cylinders". However, if we check the code for make_barchart.R we notice an if/else statement with if(which_fig == "gears_vs_cylinders") … else if(which_fig == "gears_vs_carb").

In the terminal, in ex_proj docker

R $\quad\text{: open R}$
Sys.getenv("which_fig")
q() $\quad\text{: quit R}$
export some_other_variable="hello!"
R $\quad\text{: open R}$
Sys.getenv("some_other_variable")
q() $\quad\text{: quit R}$
exit $\quad\text{: close docker}$

We can override this variable at runtime using -e.

In the terminal:

docker run -e which_fig="something_else" -e some_other_variable="some_other_value" -it ex_proj
echo $which_fig
echo $some_other_variable
R
Sys.getenv("some_other_variable")
q()
exit
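The -e flag follows the usual rule for environment variables: a value supplied when a process is launched replaces the default it would otherwise see. A sketch of that rule without Docker, using a made-up script pick_figure.sh as a stand-in for the branching described for make_barchart.R; the default value is assumed to be the one set in the Dockerfile:

```shell
# Hypothetical stand-in for a container entry point that branches on which_fig.
cat > pick_figure.sh <<'EOF'
#!/bin/sh
# fall back to an assumed default when which_fig is unset or empty
case "${which_fig:-gears_vs_cylinders}" in
  gears_vs_cylinders) echo "plot gears vs. cylinders" ;;
  gears_vs_carb)      echo "plot gears vs. carb" ;;
  *)                  echo "unknown figure: ${which_fig}" ;;
esac
EOF

# no override: the default branch runs (if which_fig is not already set)
sh pick_figure.sh

# override for this one run only, analogous to docker run -e which_fig=...
which_fig="gears_vs_carb" sh pick_figure.sh
```

A value that matches no branch, such as "something_else", falls through to the last case, which is a cheap way to catch misspelled settings.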

Breakout Exercise 2

Run the ex_proj container with the /project/ folder mounted to a local directory. Examine the report generated.
Look at example_project/R/make_barchart.R and Dockerfile to see how the ENV variable which_fig is used.
Re-run the container using option -e which_fig="…" to produce a figure of gears vs. carb (replace … with the proper string).
Modify the Dockerfile to add a separate directory for the output to the container. Modify the Makefile to write the html report to this directory.
Re-build and re-run the container with a local directory mounted to your new output directory in the container.
Change the CMD command so that the entry point is to make the report.