Setup guide for R tutorials

tutorials r coding

into-the-tidyverse is a self-contained set of tutorials for data wrangling/analysis using R (via RStudio), with a very heavy emphasis on the tidyverse libraries. These are state-of-the-art tools that have always been (and always will be) free and open-source.

These tutorials require you to use GitHub (via the easy-to-use GitHub Desktop interface), and encourage the practice of version control. Learning how to use GitHub is an increasingly important part of learning how to do data science. For that reason, you will be required to use GitHub in order to follow along with these tutorials. The learning curve is not too steep, and these skills will serve you well in the future.

Setting up R

Install R

The first thing you’ll need to do is to install R from CRAN (Comprehensive R Archive Network).

This will install the R programming language, some essential libraries, and a text-based console that might remind you of an old-school terminal window.

R console (left) compared against a macOS terminal console (right)

Install RStudio

We will not be using the R console directly. Instead, we’ll be using RStudio, which builds on top of the console and gives you a lot of nice tools that help you work more efficiently. Once R is installed, you’ll then want to install RStudio.

You’ll immediately note that you have a lot more tools at your disposal:

  1. R Console: Look familiar? You can use this console to type commands, just like in the “base” R console.
  2. Script Editor: Instead of using the console to type your commands one-by-one, you can write a script that contains all of your commands. That way, it’s easy to replicate an analysis because all of your code is contained within a single document, and can be run at will.
  3. Variable Viewer: When you start loading, creating, and manipulating data, you’ll see it here.
  4. Everything Else: I primarily use this panel to look at plots, or to look at the help documentation for functions I’m using.

Annotated image of RStudio window, entirely redundant with in-line text descriptions

Set up RStudio

Open up RStudio and then open up Preferences. Find the “Workspace” settings. Make sure the checkbox is de-selected for “Restore .Rdata into workspace at startup”. Specify never as the option for “Save workspace to .Rdata on exit”. This will ensure that every time you open up RStudio, you start out with a clean working environment. You’d be amazed at how many problems can be prevented by doing this.

Download R libraries

You’ll need to install the following R libraries. Copy/paste the following code into the console window, then press enter to run it. We may need to install other libraries in the future, but these ones are so useful that I use them in virtually every project.

install.packages("tidyverse")
install.packages("here")
install.packages("janitor")

Setting up GitHub

Create GitHub account

You will need to create an account at GitHub.

git is a system for performing version control. What does that mean? Basically, it allows you to save “snapshots” of your files at any given moment. This allows you to track how your code (and data) change over time. So if you accidentally delete a lot of important code, and later realize that you need to recover it, git allows you to go back to a snapshot that contains the code you want. So you can think of it like an archiving system.

GitHub is an online data storage platform that uses git to store a full archive of changes you’ve made to a particular project. It allows you to keep a “synchronized” copy of the most up-to-date files on any computer that’s given access.

Let’s walk through some of terminology, which will help you to understand how GitHub works. Let’s say you’ve got some projects on your computer’s hard drive. Each project is stored in its own repository. When you start using git to version control your repository, it creates some hidden folders which contain a full archive of changes you’ve made (sort of - more on this in a second). Using git, you can then specify when to create a snapshot of your repository. This snapshot is called a commit because you’re committing to saving a copy of your repository at this exact moment in time. Importantly, you choose when to commit. Different people have different preferences. Some people like to commit at set time increments, regardless of whether their changes are big or small. Some people wait to commit until they have something they feel is “worth” committing. This is entirely up to you, and you’ll develop your own preferences as you gain experience.

Once you’ve committed something, you’ve saved a snapshot on your personal computer. But, that commit only lives on your computer. If you dropped your computer on the floor and destroyed its hard drive, your archive is gone forever. This is where GitHub comes in. You can save a copy of your files and your commit history onto GitHub, which allows you to clone a copy of your repository (and its full commit history!) onto any other computer. When you upload your commits to GitHub, it’s called a push. The GitHub repository contains only the “latest” versions of your files, which is dictated by when you commit and push your commits to GitHub. This can get a little more complicated when there are multiple collaborators editing files in the same repository. For simplicity, we’ll only consider two cases: 1) when a collaborator pushes a commit while you’re in the middle of changing something, and 2) how to prevent conflicting changes by working on a parallel copy of the repository.

Once you’re done changing something in your code, you might create a commit and try pushing that change to GitHub. Only to find that your collaborator has also been changing things in the code, and has pushed something before you had a chance to work on the most “updated” version of your repository. Since your commit is now on an “older” version of the repository, you must first synchronize your computer’s (older) version of the repository with the (newer) GitHub version. When you check whether there’s a newer version, this is called a fetch. When you detect that there’s a newer version, you then pull the newer version on your computer. In the best-case scenario, there is no merge conflict, and you can then push your commit without any further action. In the worst-case scenario, there is a merge conflict that you manually have to resolve. That is a giant headache to do, and you’ll want to avoid merge conflicts as much as possible. The easiest way to do this is to work with a fork (you could also do this with a branch, but we won’t be using them in this workshop… once you understand forks, it’ll be really easy to google what branches do in the future).

You might be familiar with the idea of a “multiverse” from comic books or science fiction. You’re living in the “main” timeline, but there are alternative realities that come in and out of existence at key moments in time. For example, there might be an alternative reality that’s exactly the same as this one, except that Ringo Starr never became the drummer for the Beatles. That would be an alternative universe that “forked” from our reality in 1962. That is exactly the logic of git. You could be working on a project with a collaborator who wants to try rewriting some existing code, but who wants to avoid messing up your existing data analysis pipelines. That collaborator could create a fork of your project, which diverges from the main timeline at a particular moment in time. The collaborator could then do whatever they wanted with their forked copy of the repository, without ever affecting what happens in the main copy. In principle, they could maintain a separate fork forever. Oftentimes, though, they’ll finish editing the code and will then want to merge it with the main timeline. This gives you a gentle way of merging the timelines together, since you can work through any would-be merge conflicts before trying to merge the two timelines back together.

Power users often use text-based commands to manage their git repositories, but I prefer using GitHub Desktop, which an easy-to-use point-and-click interface.

Install GitHub Desktop

Install GitHub Desktop, then log in with your new GitHub credentials to set it up. These instructions are for the into-the-tidyverse tutorials, but it’s not too hard to figure out how to fork/clone other repositories using these instructions.

If this is your first time ever using GitHub Desktop, click on Clone a repository from the internet >> copy/paste psychNerdJae/into-the-tidyverse into the textbox >> Fetch origin. This creates a cloned copy of my repository onto your hard drive (scroll back up to review terminology if you’re getting lost).

If you’ve used GitHub Desktop before, you might have some git-managed repositories on your hard drive already. If that’s the case, click on Add >> Clone Repository >> URL >> copy/paste psychNerdJae/into-the-tidyverse into the textbox >> Fetch origin.

You can now access all of the latest workshop materials. If you start creating/modifying code or data, you’ll see those changes tracked through GitHub Desktop. Remember that the only changes that are saved are the ones you take a snapshot of by committing.

You’ll note that when you try to commit new changes for the first time, you’ll be prompted, “You don’t have write access to into-the-tidyverse. Want to create a fork?” Click that. When you try to push your commit, you’ll see some scary warning message that you don’t have write access to psychNerdJae/into-the-tidyverse. Go ahead and create a fork. In the next prompt, you’ll be asked how you’re planning on using this fork. Click on For my own purposes. Now, your changes are pushed to a forked copy of yourUserName/into-the-tidyverse, which runs in a parallel timeline to my original copy. This lets you play around with code as you’re learning to use R, without ever disturbing my original files.

What if I make a change to my repository that you want to synchronize with your fork? For example, I could add more files to my repository as I develop new workshop materials, which you want to download. Since my repository is the original that your fork copies, my repository is known as the upstream (since it’s upstream in the timeline). You can always check to see if there are changes in the upstream by clicking on Current Branch and seeing whether there’s an option for upstream/main. If you don’t, that means there are no new commits in my version of the repository.

Now that all of the materials live on your hard drive (in the cloned repository you forked from mine), you’re ready to learn!

File and folder structure

These instructions are specifically for into-the-tidyverse, but should apply to any other tutorials I write.

…Now what?

If you’re ready to get started, go to the folder you’ve just created on your hard drive. From there, click into the Tutorials folder, and have fun with the first session!