Project Organization - Data Science Toolkit

5/5/2022 12-minute read

These are my notes on Project Organization from Data Science Toolkit, by David Benkeser.

Part 1 (0:00 - 36:54)

Pre-lecture Questions

How to add headings?

In R chunk use code options : fig.cap="**Figure Caption**"

barplot(c(1,2,3,4))

Figure 1: Figure Caption

Another option would be to have the first row of your table be your title and then format so the first row doesn’t have any line filler in it.

Stackoverflow

report <- list()
report[[1]] <- "Report Name"
report[[2]] <- head(mtcars)
report

## [[1]]
## [1] "Report Name"
## 
## [[2]]
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Your goal is to minimize time as much as possible.

Don’t spend a lot of time formatting your tables until they are going to be final.

What is `::` in R?

A better way of referencing functions from a particular package.

For example: bookdown::htmldoc2 means that in that in the bookdown package there is a function htmldoc2. That way if multiple functions from different packages have the same name, now your code knows which package to call it from.

This is less ambiguous.

Lecture

Basic Principles

Use git

Put everything in one version-controlled directory.
Jenny Bryan talks about Project Oriented Workflow

Website | Twitter | GitHub
Everything related to one project lives in one folder on you computer.
Don’t spread things out across multiple folders.
Put project directory on version control, and that’s going to be your git repository.

Develop your own system

Be Consistent, but look for ways to improve.
- Naming conventions, file structure, make strucure

Data

Raw data is sacred and kept seperate from everything else.
Separate code and data
Use make files and/ or READMEs to document dependencies

Names

No spaces in file names
use meaningful file names
use YY-MM-DD date formatting
- this sorts you data (ex: most recent)properly

Code

Modularize R code.
Know where to go to make changes.
No absolute paths.
Use a package management system

What to organize?

It is probably useful to have a system for organizing:

data analysis projects;
first-author papers;
talks.

Think about organization of a project from the outset!

Collaborative Projects

Google drive
Overleaf

Some advice:

Address organization from the outset.
Ideally, bring people on board to your (version controlled, reproducible) system.
Keep open lines of communication (especially if using GitHub)

Even if some elements of the project are outside your control, you can try to bring in elements to your workflow.

E.g., if you receive comments with tracked changes from a colleague, incorporate them into the document, add a commit message describing who’s edits they were.

If working on a shared GitHub repo, keep open lines of communication, e.g., short emails (or slack messages, etc…), “Just pushed x…”

raw data is sacred
clean data will be saved into the data folder
serperate scripts out by language or have a code directory
numbering equations will give an indicator in what order the code runs.
sandbox directory for exploratory data analysis that no one really needs to see.
Makefile
renv is a way to manage R package

Note: To go back two levels in your file directory: ./../../

Organizing Data

Dont be tempted to edit raw data by hand
Everything scripted
Let collaborators know: “Don’t color code things.”
Ask for a moc data set ahead of time so you know their form ahead of time to talk about raw data formatting ahead of time.

Use meta-data files to describe raw and cleaned data.

structure as data (e.g. .csv so easy to read)

Tidy Data by Hadley Wickham

Worth the read
Each Varibale forms a column
Each observation forms a row
Each observation unit forms a table.

Exploring data

One of the first things we’ll often do is open the data and start poking around.

Could be informal, “getting to know you.”
Could be more formal, “see if anything looks interesting.”

This is often done in an ad-hoc way:

entering commands directly into R;
making and saving plots “by hand”;
etc…

Slow down and document

Your future self will thank you
even if it is just in a google doc to yourself.

You want to avoid situations like:

need to recreate a plot that you made “by hand” and saved “by hand”;
figuring out why you removed certain observations;
trying to remember what variables had an interesting relationship that you wanted to follow up on later.

Write out a set of comments describing what you are try to accomplish and fill in code from there.

I do this for every coding project.
- Data analysis, methods coding, package development

Leave a search-able comment tag by code to return to later

I use e.g., # TO DO: add math expression to labels; make colors prettier.

Sets “the bones” of a formal analysis in place while allowing for some creative flow.

From the outset, stop and think about what you want to do. Start filling in details from there. That simple approach will increase efficiency and reproducibility.

Other helpful ideas for formalizing exploratory data analysis:

.Rhistory files
- all the commands used in an R session
Informal .Rmd documents.
- easy way to organize code/comments into readable format
save intermediate objects and workspaces
- and document what they contain!
knitr::spin
- writing .R scripts with rendered-able comments
Juypyter Notebook

The here package

No absolute paths

Absolute paths are the enemy of project reproducibility.

For R projects, the here package provides a simple way to use relative file paths.

Read Jenny Bryan and James Hester’s chapter on project-oriented work-flows.

The use of here is dead-simple and best illustrated by example.

Root directory is my_project

this is where .git lives
all file paths should be relative to this

Each R script or Rmd report, should contain a call to here::i_am('path/to/this/file') at the top

here::i_am means use function i_am from here

For example:

# include at the top of script
here::i_am('R/my_analysis.R')

## here() starts at /Users/randi/Desktop/2022/rbolt2/content/english/post/2022-05-05-project-organization-data-sciene-toolkit

# now add all your R code

Now anytime you make analysis you can use the here function.

# include at the top of script
here::i_am('index.en.Rmd')

## here() starts at /Users/randi/Desktop/2022/rbolt2/content/english/post/2022-05-05-project-organization-data-sciene-toolkit

# load data 
my_data <- read.csv(here::here('data','my_data.csv'))

# do some analysis to get results
my_results <- sum(my_data)

# save results
save(my_results, file = here::here('output', 'my_results.RData'))

Part 2 (39:40 - 1:31:20)

Open Example Project in Sublime

Example Code

To begin look at the Makefile

The analysis here is going to culminate in a report, so we have a rule to make a report which says to render the report.Rmd document that lives in the Rmd folder.

We can see the rule for making our report has a prerequisite. It depends on the source code (rmd shown below) and the barchart.png figure in fig folder.

The rule for making that chart is shown on line 4: figs/barchart.png, which depends on the barchart.R code in the R directory.

This report is saying that we looked at some data and make a figure which lives in the “figs” folder and is called “barchart.png”.

This is making a barchart out of the mtcars data.

In the terminal now run make report, then open the report by typing open Rmd/report.html.

Here package (Two examples)

Identity where the folder is in relation to the project root.

In the R folder create an R file test.R with the following in code:

Terminal Command	Description
Rscript R/test.R	return the absolute path

In test.R change line 3 from here::here() to here::here("figs"), save, and run:

Terminal Command	Description
Rscript R/test.R	return the absolute path of the `figs` folder

Example 2

Terminal Command	Description
cd R	change directory to R folder
ls	list all the files in current folder
mkdir R_subdir	make new directory called `R_subdir`

Now create an R file similar to the previous example and play around with the here function. Then remove the files that were just added:

Terminal Command	Description
rm -rf R/R_subdir/	recursice force to action removal (for directories) of R_subdir
rm R/test.R	remove test.R
rm figs/barchart.png	remove barchart.png

Package management system : `renv``

We want to record not only what packages we are using but:

which versions of those packages we are using
Where we downloaded those packages from
what version of R we are using

In example_project identify the projects that we need:

here
wesanderson
knitr
rmarkdown

Terminal Command	Description
pwd	print working directory
R	open R
renv::init()	package management that will also return what version of R we are working in

Note: renv needs to already be installed in R.

A list should be returned that shows the packages we are using, and the packages those packages depend on. Then it is saving all of that information into a lockfile which is now saved into the project file.

Notice that other files appeared as well, such as “.Rprofile” which is a single line that says one line: ‘source(“renv/activate.R”)’. When you open R the software is looking for these .Rprofile files (similar to bash or zshell). Be careful messing with this file, because that creates issues with reproducibility.

The line `(“renv/activate.R”)’ is making R aware that we are in a project directory and there is a package library that’s associated with that package directory.

We also now have an “renv” folder which contains activate scripts, and takes all of the specific versions of packages and places them into the project library. Note that it wont actually put those files into the folder, but will search your computer for that version of the package and link to that. That way you don’t have a lot of duplicate packages on your computer.

Terminal Command	Description
q()	close R
R	open R

Notice there is a different output, which recognizes our package environment. If you get errors loading packages then type the following into the terminal:

Terminal Command	Description
renv::restore()	install packages on your computer

Now look back at the rmd file report.Rmd at lines 8 and 9, shown below.

Line 8. Recognizes where this file is in relation to the project directory.

Line 9. This is saying anytime you find R code in this rmd script, run it from the project directly (which is where the code needs to be run from to make it aware of renv and packages needed).

Moral of the story is to copy these two lines of code at the top of all of our rmd.

Part 3 (1:21:20 - 2:05:49)

What does a collaborative workflow look like?

Colaborating with renv

User A initializes the lockfile using renv::init().

User A commits the following to github:

renv.lock
.Rprofile
renv/activate.R

User B clones and downloads repo, and uses renv::restore() to synchronize their local project directory.

User B adds new packages to code, uses renv::snapshot() to record changes to renv.lock

User B commits renv.lock and pushes to GitHub.

User A pulls from GitHub, opens R, and uses renv::restore() to synchronize their local project directory.

Breakout Exercise

Terminal Command	Description
q()	quit R
pwd	print working directory
git init	initialize git repository

Verify a git folder has been added.

Now create a github repository for example project, and copy the line of code that looks like the following into the terminal:

Terminal Command	Description
git remote add origin https://github.com/user_name/example_project.git	connect github to computer

Note: you might have issues if your github isn’t cached.

For the first Commit:

Terminal Command	Description
git status	display status of working directory
git add *	add all files (with names)
git status	display status of working directory

Verify everything was added, otherwise:

Terminal Command	Description
it add .thing1 .thing2 .thing3	add files that begin with `.`
git status	display status of working directory

Verify everything was added.

Terminal Command	Description
git commit -m `my first commit`	commit with message
git push origin main	push commit to main directory

Verify github was updated.

User B then forks the repository to make a copy of it onto their computer. Cope a similar line of code shown below into the terminal:

Terminal Command	Description
git clone https://github.com/user_name/example_project.git	clone repository
cd example_project	change to example_project folder
R	open R
renv::restore()	restore enviroment
q()	quit
make report	make report
open Rmd/report.html	open Rmd/report.html

Verify report runs.

Terminal Command	Description
R	open R
renv::remove(‘wesanderson’)	remove `wesanderson` package from revn enviroment

Then open the barchart.R file to change following two lines of code:

Save.

Terminal Command	Description
renv::snapshot()	print enviroment snapshot

This will return the following:

Terminal Command	Description
q()	quit R
y	save

You will now have a new lockfile.

Terminal Command	Description
git diff	check that changes were made

Then commit.

User A can then try these changes out in an isolated environment.

Terminal Command	Description
git remote add userb https://github.com/user_b/example_project.git	link to user b’s repository
git fetch user_b main	fetch user_b’s main branch
git checkout remotes/user_b/main	checkout user_b’s main branch

Verify changes.

Terminal Command	Description
git checkout -b user_b	grab user_b’s branch
git merge user_b	merge user_b’s branch
git push	push to main

Now user A wants to change the colors again.

Terminal Command	Description
R	open R
renv::remove(‘RColorBrewer’)	remove ‘RColorBreker’ package from enviroment
q()	quit R

Then update the barchart.R file by removing lines 4 and 5, then replacing them with:

Save, then in the terminal:

Terminal Command	Description
make report	make report
open Rmd/report.html	open report

Verify report looks good.

Terminal Command	Description
git status	display status of working directory
git add –all	add all files
git status	display status of working directory
git commit -m `removed RColorBrewere and changed to baseR`	commit with message about change
git push origin main	push changes to main branch

B’s then copies the updated repository, and in the terminal:

Terminal Command	Description
git remote add user_a git config advice.addIgnoredFile false	creates path to user A github repository
git fetch user_a main	fetch user_a’s main branch
git checkout remotes/user_a/main	checkout user_a’s main branch
R	open R
renv::status()	check enviroment
q()	quit R
make report	make report
open Rmd/report.html	open report

Verify colors are baseR.

Terminal Command	Description
git checkout -b user_a	grab user_a’s update branch
git checkout main	checkout to main
git merge user_a	merge user_a’s updates
git push origin main	push to main

The end ~

Questions

Q: Do we have to go between bash and R and bash again to touch do anything with env?

A: Yes, but we can also do something else …

In make file add the following lines of code:

this will automate that entire process for us.

Now in terminal we can type: make restore which will cut out the middle man which avoids the need to open an R session every time.

Note: My completed replication of this example_project can be found here.