4.1 Scripts and Version Control
Code should be crafted according to the following guidelines:
- Use scripts
- Document scripts, but not too much
- Organize scripts consistently (see format below)
- Use Git to version control scripts
- Make atomic Git commits (see description below)
Script files should be documented and organized in a way that enhances readability and comprehension. For example, use a standardized header for general documentation, and use sections to make specific code easier to find and understand. Code should also be as self-documenting as possible, for example through descriptive object and function names. Additionally, use relative filepaths for importing and exporting objects. Scripts should also be modular, with each one focusing on a single general task: one script for cleaning data, another for visualizing data, and so on. A makefile can then be used to document the analysis workflow. There is an art to this organization, so keep in mind the general principle of making code easy to understand, both for your future self and for others.
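As a rough illustration of relative filepaths and self-documenting code, here is a minimal sketch. The folder and file names, the `remove_incomplete_rows` helper, and the small `survey` data frame are assumptions made up for this example and are not part of any project.

```r
# Build the path from components, relative to the project root
# (the folder and file names here are illustrative assumptions).
raw_data_file <- file.path("data", "raw", "my_raw_data.csv")

# A descriptive function name states the intent, so the call site needs
# no extra comment. `remove_incomplete_rows` is a hypothetical helper.
remove_incomplete_rows <- function(data) {
  data[complete.cases(data), , drop = FALSE]
}

# Small in-memory data frame standing in for imported data
survey <- data.frame(id = 1:3, score = c(10, NA, 30))
complete_survey <- remove_incomplete_rows(survey)
```

Because the path is relative, the script should run unchanged on any machine as long as the working directory is the project root.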
Here is an example template for R scripts:
# =============================================================================
# Name: script.R
# Description: Visualizes data
#
# Inputs: data.csv
# Outputs: graph.png
#
# Notes: - Use a diverging color palette for best results
# - Output format can be changed as needed
# =============================================================================
# Set up environment ----------------------------------------------------------
library(tidyverse)
# Set path for Google Drive File Stream based on OS type
# Note - this will work for Windows and Mac machines.
# If you use Linux, you will need to set your own path to where Google Drive File Stream lives.
team_path <- if (Sys.info()[["sysname"]] == "Windows") "G:/" else "/Volumes/GoogleDrive/"
# Next, set the path for data directory based on whether project is current or archived.
# Note that if you use a different Shared Drive file structure than the one recommended in the "File Structure" section, you will need to manually define your data path.
# You should always double-check the automatically generated paths in order to ensure they point to the correct directory.
# First, set the name of your project
project_name <- "my-project"
# This will automatically determine if the project exists in the "current-projects" or "archived-projects" Shared Drive folder, and set the appropriate data path accordingly.
data_path <- ifelse(
  dir.exists(paste0(team_path, "Shared drives/<UNK>/projects/current-projects/", project_name)),
  paste0(team_path, "Shared drives/<UNK>/projects/current-projects/", project_name, "/data/"),
  paste0(team_path, "Shared drives/<UNK>/projects/archived-projects/", project_name, "/data/")
)
# Import data -----------------------------------------------------------------
# Load data from Shared Drive using appropriate data path
my_raw_data <- read_csv(paste0(data_path,"raw/my_raw_data.csv"))
# Process data ----------------------------------------------------------------
# Analyze data ----------------------------------------------------------------
# Visualize results -----------------------------------------------------------
# Save results ----------------------------------------------------------------
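The template advises double-checking the automatically generated paths. One way to make that check explicit is a small guard that stops the script early when a directory is missing. This is only a sketch: `assert_dir_exists` is a hypothetical helper and not part of the template, and it is demonstrated on a temporary directory so that it runs anywhere.

```r
# `assert_dir_exists` is a hypothetical helper, not part of the template.
# It stops with an informative error if the directory does not exist,
# and otherwise returns the path invisibly.
assert_dir_exists <- function(path) {
  if (!dir.exists(path)) {
    stop("Directory not found: ", path, call. = FALSE)
  }
  invisible(path)
}

# Demonstrated on a temporary directory so the sketch runs anywhere;
# in a real script you would pass `data_path` instead.
example_path <- tempdir()
assert_dir_exists(example_path)
```

Placing a call such as `assert_dir_exists(data_path)` just before the import section would surface a wrong path immediately rather than as a less obvious file-reading error later.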
Git tracks changes in code line-by-line with the use of commits. Commits should be atomic: each one should record a single, specific change rather than several unrelated changes. Atomic commits can be small or large depending on the change being made, and they make code review and reversion easier. Git commit messages should be informative and follow a consistent style, such as the guide found here. There is also an art to the version control process, so keep in mind the general principle of making atomic commits.
More advanced workflows for using Git and GitHub, such as using pull requests or branches, will vary from project to project. It is important that the members of each project agree to and follow a specific workflow to ensure that collaboration is effective and efficient.