Project Organization - Data Science Toolkit
These are my notes on Project Organization from Data Science Toolkit, by David Benkeser.
Part 1 (0:00 - 36:54)
Pre-lecture Questions
How to add headings?
In an R chunk, use the chunk option fig.cap="**Figure Caption**"
barplot(c(1,2,3,4))
Figure 1: Figure Caption
Another option would be to have the first row of your table be your title and then format so the first row doesn’t have any line filler in it.
report <- list()
report[[1]] <- "Report Name"
report[[2]] <- head(mtcars)
report
## [[1]]
## [1] "Report Name"
##
## [[2]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Your goal is to minimize the time spent on formatting as much as possible.
Don’t spend a lot of time formatting your tables until they are going to be final.
What is :: in R?
A better way of referencing functions from a particular package.
For example: bookdown::html_document2 means that in the bookdown package there is a function html_document2. That way, if functions from different packages have the same name, your code knows which package to call it from.
This is less ambiguous.
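As a minimal sketch of why the explicit form is less ambiguous, the base R example below defines a user function named median (a hypothetical name clash chosen only for illustration) and shows that stats::median still reaches the packaged version.

```r
# Hypothetical clash: a user-defined function that shares its name
# with median() from the stats package.
median <- function(x) "my own median"

x <- c(1, 5, 9)

# The bare name resolves to the definition above, which masks the stats version.
own_result <- median(x)

# The package::function form always refers to the version inside that package.
stats_result <- stats::median(x)
```

Here own_result comes from the user-defined function, while stats_result is the actual median of x.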
Lecture
Basic Principles
Use git
Put everything in one version-controlled directory.
Jenny Bryan talks about Project Oriented Workflow
Everything related to one project lives in one folder on your computer.
Don’t spread things out across multiple folders.
Put project directory on version control, and that’s going to be your git repository.
Develop your own system
- Be Consistent, but look for ways to improve.
- Naming conventions, file structure, make structure
Data
- Raw data is sacred and kept separate from everything else.
- Separate code and data
- Use make files and/or READMEs to document dependencies
Names
- No spaces in file names
- use meaningful file names
- use YY-MM-DD date formatting
- this sorts your data properly (e.g., by most recent)
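To see why a date prefix sorts properly, here is a small sketch in base R. The file names are hypothetical and only serve to illustrate the naming convention.

```r
# Hypothetical file names, each starting with a YY-MM-DD date.
files <- c("22-05-05_report.Rmd", "21-11-30_report.Rmd", "22-01-17_report.Rmd")

# Because year, month, and day are zero-padded and ordered from the largest
# unit to the smallest, alphabetical order is also chronological order.
sorted_files <- sort(files)

# The most recent file is the last element after sorting.
most_recent <- tail(sorted_files, 1)
```

A format such as DD-MM-YY would not have this property, since sorting would group files by day of the month first.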
Code
- Modularize R code.
- Know where to go to make changes.
- No absolute paths.
- Use a package management system
What to organize?
It is probably useful to have a system for organizing:
- data analysis projects;
- first-author papers;
- talks.
Think about organization of a project from the outset!
Collaborative Projects
- Google drive
- Overleaf
Some advice:
- Address organization from the outset.
- Ideally, bring people on board to your (version controlled, reproducible) system.
- Keep open lines of communication (especially if using GitHub)
Even if some elements of the project are outside your control, you can try to bring in elements to your workflow.
- E.g., if you receive comments with tracked changes from a colleague, incorporate them into the document and add a commit message describing whose edits they were.
If working on a shared GitHub repo, keep open lines of communication, e.g., short emails (or slack messages, etc…), “Just pushed x…”

- raw data is sacred
- clean data will be saved into the data folder
- separate scripts out by language or have a code directory
- numbering scripts gives an indication of the order in which the code runs.
- sandbox directory for exploratory data analysis that no one really needs to see.
- Makefile
- renv is a way to manage R packages
Note: To go back two levels in your file directory: ./../../
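The note about moving up two levels can be checked with a short base R sketch. The folder names below (my_project, R, helpers) are hypothetical and are created inside a temporary directory so nothing on your machine is touched.

```r
# Build a throwaway, hypothetical project layout inside the temp directory.
nested <- file.path(tempdir(), "my_project", "R", "helpers")
dir.create(nested, recursive = TRUE, showWarnings = FALSE)

# Each ".." moves up one level, so two of them move from helpers/ up to my_project/.
two_up <- normalizePath(file.path(nested, "..", ".."))

# The last component of the resolved path is the project folder.
project_name <- basename(two_up)
```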
Organizing Data
- Don’t be tempted to edit raw data by hand
- Everything scripted
- Let collaborators know: “Don’t color code things.”
- Ask for a mock data set ahead of time so you know its form and can discuss raw data formatting in advance.
Use meta-data files to describe raw and cleaned data.
- structure as data (e.g. .csv so easy to read)
- Worth the read
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table.
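As a rough illustration of the first two rules, the sketch below reshapes a small made-up table by hand in base R. The column names and values are invented for the example.

```r
# Hypothetical "wide" table: one column per year, so the variable "year"
# is hidden in the column names.
wide <- data.frame(
  id = c("a", "b"),
  score_2021 = c(1, 2),
  score_2022 = c(3, 4)
)

# Tidy version: each variable (id, year, score) is a column and
# each observation (one id in one year) is a row.
tidy <- data.frame(
  id = rep(wide$id, times = 2),
  year = rep(c(2021, 2022), each = nrow(wide)),
  score = c(wide$score_2021, wide$score_2022)
)
```

The tidy table has four rows, one per id and year combination, which makes it straightforward to filter or plot by year.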
Exploring data
One of the first things we’ll often do is open the data and start poking around.
- Could be informal, “getting to know you.”
- Could be more formal, “see if anything looks interesting.”
This is often done in an ad-hoc way:
- entering commands directly into R;
- making and saving plots “by hand”;
- etc…
Slow down and document
- Your future self will thank you
- even if it is just in a google doc to yourself.
You want to avoid situations like:
- need to recreate a plot that you made “by hand” and saved “by hand”;
- figuring out why you removed certain observations;
- trying to remember what variables had an interesting relationship that you wanted to follow up on later.
Write out a set of comments describing what you are trying to accomplish and fill in code from there.
- I do this for every coding project.
- Data analysis, methods coding, package development
Leave a searchable comment tag by code to return to later
- I use e.g.,
# TO DO: add math expression to labels; make colors prettier.
Sets “the bones” of a formal analysis in place while allowing for some creative flow.
From the outset, stop and think about what you want to do. Start filling in details from there. That simple approach will increase efficiency and reproducibility.
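A comment tag is only useful if it is easy to find again. The sketch below is one possible way to list tagged lines using base R only; the helper name find_todos and the throwaway script it searches are both hypothetical.

```r
# Create a throwaway script containing a tag (a stand-in for a real project file).
script_dir <- tempfile("scripts_")
dir.create(script_dir)
writeLines(
  c("x <- 1", "# TO DO: make colors prettier", "plot(x)"),
  file.path(script_dir, "plot.R")
)

# Hypothetical helper: report the file, line number, and text of every tagged line.
find_todos <- function(dir, tag = "# TO DO") {
  files <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f)
    idx <- grep(tag, lines, fixed = TRUE)
    data.frame(
      file = rep(basename(f), length(idx)),
      line = idx,
      text = lines[idx],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

todos <- find_todos(script_dir)
```

Most editors can do the same search interactively; the point is simply to pick one tag and use it consistently.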
Other helpful ideas for formalizing exploratory data analysis:
- .Rhistory files: all the commands used in an R session (informal)
- .Rmd documents: easy way to organize code/comments into readable format
- save intermediate objects and workspaces, and document what they contain!
- knitr::spin: writing .R scripts with render-able comments
- Jupyter Notebook
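For the knitr::spin option, a script might look like the sketch below. It runs as ordinary R because the text lines are comments; the file name explore.R mentioned at the end is an assumption for illustration.

```r
#' # Exploring mtcars
#'
#' Lines starting with #' are treated as text when the script is rendered,
#' and as ordinary comments when it is run as plain R.

# Mean miles per gallon for each cylinder count in the built-in mtcars data
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

#' To render, one could call knitr::spin("explore.R"), assuming this
#' script is saved under the hypothetical name explore.R.
```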
The here package
No absolute paths
- Absolute paths are the enemy of project reproducibility.
For R projects, the here package provides a simple way to use relative file paths.
- Read Jenny Bryan and James Hester’s chapter on project-oriented work-flows.
The use of here is dead-simple and best illustrated by example.

Root directory is my_project
- this is where .git lives
- all file paths should be relative to this
Each R script or Rmd report should contain a call to here::i_am('path/to/this/file') at the top
- here::i_am means use function i_am from here
For example:
# include at the top of script
here::i_am('R/my_analysis.R')
## here() starts at /Users/randi/Desktop/2022/rbolt2/content/english/post/2022-05-05-project-organization-data-sciene-toolkit
# now add all your R code
Now, any time you run an analysis, you can use the here function to build file paths.
# include at the top of script
here::i_am('index.en.Rmd')
## here() starts at /Users/randi/Desktop/2022/rbolt2/content/english/post/2022-05-05-project-organization-data-sciene-toolkit
# load data
my_data <- read.csv(here::here('data','my_data.csv'))
# do some analysis to get results
my_results <- sum(my_data)
# save results
save(my_results, file = here::here('output', 'my_results.RData'))
Part 2 (39:40 - 1:31:20)
Open Example Project in Sublime
Example Code
To begin look at the Makefile

The analysis here is going to culminate in a report, so we have a rule to make a report which says to render the report.Rmd document that lives in the Rmd folder.
We can see the rule for making our report has a prerequisite. It depends on the source code (rmd shown below) and the barchart.png figure in fig folder.
The rule for making that chart is shown on line 4: figs/barchart.png, which depends on the barchart.R code in the R directory.

This report is saying that we looked at some data and made a figure which lives in the “figs” folder and is called “barchart.png”.

This is making a barchart out of the mtcars data.
In the terminal now run make report, then open the report by typing open Rmd/report.html.
Here package (Two examples)
Identify where the folder is in relation to the project root.
In the R folder create an R file test.R with the following code:

| Terminal Command | Description |
|---|---|
| Rscript R/test.R | return the absolute path of the project root |
In test.R change line 3 from here::here() to here::here("figs"), save, and run:
| Terminal Command | Description |
|---|---|
| Rscript R/test.R | return the absolute path of the figs folder |
Example 2
| Terminal Command | Description |
|---|---|
| cd R | change directory to R folder |
| ls | list all the files in current folder |
| mkdir R_subdir | make new directory called R_subdir |
Now create an R file similar to the previous example and play around with the here function. Then remove the files that were just added:
| Terminal Command | Description |
|---|---|
| rm -rf R/R_subdir/ | recursively force removal of the R_subdir directory and its contents |
| rm R/test.R | remove test.R |
| rm figs/barchart.png | remove barchart.png |
Package management system: `renv`
We want to record not only what packages we are using but:
- which versions of those packages we are using
- Where we downloaded those packages from
- what version of R we are using
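To make the list above concrete, here is a hand-rolled sketch in base R of the kind of information involved. It is not how renv works internally, it does not capture where packages were downloaded from, and the two base packages are stand-ins for a project's real dependencies.

```r
# Base packages stand in for a project's real dependencies here.
pkgs <- c("stats", "utils")

# Look up the installed version of each package.
versions <- vapply(
  pkgs,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

# A hand-rolled record of the R version and package versions
# (illustrative only; renv generates its lockfile for you).
record <- list(r_version = R.version.string, packages = versions)
str(record)
```

Keeping such a record by hand quickly becomes tedious, which is the motivation for letting a package manager do it.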
In example_project identify the packages that we need:
- here
- wesanderson
- knitr
- rmarkdown
| Terminal Command | Description |
|---|---|
| pwd | print working directory |
| R | open R |
| renv::init() | package management that will also return what version of R we are working in |
Note: renv needs to already be installed in R.
A list should be returned that shows the packages we are using, and the packages those packages depend on. Then it is saving all of that information into a lockfile which is now saved into the project file.
Notice that other files appeared as well, such as “.Rprofile”, which contains a single line: source("renv/activate.R"). When you open R, the software looks for these .Rprofile files (similar to the startup files read by bash or zsh). Be careful editing this file, because that can create issues with reproducibility.
The line source("renv/activate.R") makes R aware that we are in a project directory and that there is a package library associated with that project directory.
We also now have an “renv” folder, which contains the activate script and the project library with the specific versions of the packages. Note that it won’t actually copy those packages into the folder, but will search your computer for that version of the package and link to it. That way you don’t have a lot of duplicate packages on your computer.
| Terminal Command | Description |
|---|---|
| q() | close R |
| R | open R |
Notice there is a different output, which recognizes our package environment. If you get errors loading packages, then run the following in the R session:
| Terminal Command | Description |
|---|---|
| renv::restore() | install packages on your computer |
Now look back at the rmd file report.Rmd at lines 8 and 9, shown below.

Line 8. Recognizes where this file is in relation to the project directory.
Line 9. This is saying that any time you find R code in this Rmd script, run it from the project directory (which is where the code needs to be run from to make it aware of renv and the packages needed).
Moral of the story: copy these two lines of code to the top of all of our Rmd files.
Part 3 (1:21:20 - 2:05:49)
What does a collaborative workflow look like?
Collaborating with renv
User A initializes the lockfile using renv::init().
User A commits the following to github:
- renv.lock
- .Rprofile
- renv/activate.R
User B clones and downloads repo, and uses renv::restore() to synchronize their local project directory.
User B adds new packages to code, uses renv::snapshot() to record changes to renv.lock
User B commits renv.lock and pushes to GitHub.
User A pulls from GitHub, opens R, and uses renv::restore() to synchronize their local project directory.
Breakout Exercise
| Terminal Command | Description |
|---|---|
| q() | quit R |
| pwd | print working directory |
| git init | initialize git repository |
Verify a git folder has been added.
Now create a github repository for example project, and copy the line of code that looks like the following into the terminal:
| Terminal Command | Description |
|---|---|
| git remote add origin https://github.com/user_name/example_project.git | connect github to computer |
Note: you might have issues if your GitHub credentials aren’t cached.
For the first Commit:
| Terminal Command | Description |
|---|---|
| git status | display status of working directory |
| git add * | add all files whose names do not begin with . |
| git status | display status of working directory |
Verify everything was added, otherwise:
| Terminal Command | Description |
|---|---|
| git add .thing1 .thing2 .thing3 | add files that begin with . |
| git status | display status of working directory |
Verify everything was added.
| Terminal Command | Description |
|---|---|
| git commit -m "my first commit" | commit with message |
| git push origin main | push commit to main branch |
Verify github was updated.
User B then forks the repository to make a copy of it onto their computer. Copy a line of code similar to the one shown below into the terminal:
| Terminal Command | Description |
|---|---|
| git clone https://github.com/user_name/example_project.git | clone repository |
| cd example_project | change to example_project folder |
| R | open R |
| renv::restore() | restore environment |
| q() | quit |
| make report | make report |
| open Rmd/report.html | open Rmd/report.html |
Verify report runs.
| Terminal Command | Description |
|---|---|
| R | open R |
| renv::remove('wesanderson') | remove wesanderson package from renv environment |
Then open the barchart.R file and change the following two lines of code:

Save.
| Terminal Command | Description |
|---|---|
| renv::snapshot() | record the current package state in renv.lock |
This will return the following:
| Terminal Command | Description |
|---|---|
| q() | quit R |
| y | save |
You will now have a new lockfile.
| Terminal Command | Description |
|---|---|
| git diff | check that changes were made |
Then commit.
User A can then try these changes out in an isolated environment.
| Terminal Command | Description |
|---|---|
| git remote add user_b https://github.com/user_b/example_project.git | link to user b’s repository |
| git fetch user_b main | fetch user_b’s main branch |
| git checkout remotes/user_b/main | checkout user_b’s main branch |
Verify changes.
| Terminal Command | Description |
|---|---|
| git checkout -b user_b | create a local branch from user_b’s checked-out commit |
| git checkout main | switch back to main |
| git merge user_b | merge user_b’s branch |
| git push | push to main |
Now user A wants to change the colors again.
| Terminal Command | Description |
|---|---|
| R | open R |
| renv::remove('RColorBrewer') | remove 'RColorBrewer' package from environment |
| q() | quit R |
Then update the barchart.R file by removing lines 4 and 5, then replacing them with:

Save, then in the terminal:
| Terminal Command | Description |
|---|---|
| make report | make report |
| open Rmd/report.html | open report |
Verify report looks good.
| Terminal Command | Description |
|---|---|
| git status | display status of working directory |
| git add --all | add all files |
| git status | display status of working directory |
| git commit -m "removed RColorBrewer and changed to baseR" | commit with message about change |
| git push origin main | push changes to main branch |
User B then fetches the updated repository, and in the terminal:
| Terminal Command | Description |
|---|---|
| git remote add user_a https://github.com/user_a/example_project.git | link to user A’s GitHub repository |
| git fetch user_a main | fetch user_a’s main branch |
| git checkout remotes/user_a/main | checkout user_a’s main branch |
| R | open R |
| renv::status() | check environment |
| q() | quit R |
| make report | make report |
| open Rmd/report.html | open report |
Verify colors are baseR.
| Terminal Command | Description |
|---|---|
| git checkout -b user_a | create a local branch from user_a’s checked-out commit |
| git checkout main | checkout to main |
| git merge user_a | merge user_a’s updates |
| git push origin main | push to main |
The end ~
Questions
Q: Do we have to go from bash to R and back to bash again to do anything with renv?
A: Yes, but we can also do something else …
In the Makefile, add the following lines of code:

This will automate that entire process for us.
Now in the terminal we can type make restore, which cuts out the middleman and avoids the need to open an R session every time.
Note: My completed replication of this example_project can be found here.