2 Why use R?

Consider some of the main tasks done in Social Sciences. As in all sciences, you work with data. You inspect and manipulate that data in different ways, and modify it to make it amenable to analysis and visualizations. Visualizations and statistical analyses help to explore relationships you have observed in the real world and test for relationships in your experiments/studies. You then use these to communicate some findings to the general public.

So why would you want to use R or any programming language vs a Graphical User Interface (GUI) software like Excel, JMP, SPSS, or SAS, to do this?

To start, there are some simple and practical considerations:

  1. R is free. Most other software is not.
  2. R can handle very large datasets, GUI-based software struggle and often crash with these.
  3. R is an open-source language, which means that anyone can contribute to it. This results in it constantly evolving and improving, increasing its capabilities.
  1. As a consequence of this, with R, you can perform MUCH more complex and sophisticated statistical analyses and data visualization. R was literally created for this very purpose!

These are all important, however, one of the main reasons to use a programming language to accomplish your main tasks in Data Science and Social Sciences is for Reproducibility.

Excel and the conventional statistics software like SPSS use a point-and-click GUI, and thus lack a simple way to document and share your analyses. This makes repeating or editing your work later very hard as you will need to rely on memory and repetition. It is hard to work backwards from the end result and see how you arrived there because it is not documented anywhere. You will have to “reinvent the wheel” to perform the same analysis across different data sets. R is an open-source interpreted programming language. Instead of pointing and clicking through some software to accomplish your tasks, you write code that will execute them. All the code you write is read and evaluated line by line in documents you save. This has some very practical consequences for you:

  • Selfishly, it allows you to prepare your cleaning, analysis, and visualization code beforehand. As soon as you finish with data collection, you can just run all the code you’ve written and you will immediately be done!
  • Also, it allows you to automate tasks for yourself. This way, you do not have to repeat the same process of clicking through the GUI and re-doing the same thing many times.

Source: trendct.org

  • Allows you to create literate programming documents (you will do this in markdown, more on that later) – where you explain in English what your code is doing. This drastically helps you communicate your results by making your code be more interpretable, even when you re-visit it weeks/months/years down the line.
    • “You are always working with at least one collaborator: Future you.”
      -Hadley Wickham
  • Often times you will be working with other non-you collaborators, and this makes doing so easier. People can build off the work (code writing) you have done and see step by step exactly what that work is.
  • Generally, this helps others be able to see exactly what you did and run the same analyses themselves. This is great practice for the open-science movement, and can actually help you fight back against any critiques or questions about your data/analyses.

Artwork by Horst (2022)

So why R in particular instead of any other programming language?

“What about Python?” – you might ask. You should learn that too! Python is great for advanced machine learning (neural networks and deep learning), fMRI and other image analyses, and lots of other advanced programming things. However, for data science tasks that social scientists and researchers are most often concerned with, R is preferable because it uses the same basic syntax and approach to everything. With Python, your toolbox may be bigger, but all the tools are very different.

from Tobi Gerstenberg’s Statistical Methods for Behavioral and Social Sciences class

2.1 R tools

2.1.1 Base R vs RStudio

When talking about R, what is being referred to is the programming language R itself. You use RStudio to help write and use R code. RStudio is an Integrated Development Environment (IDE). Basically, it provides an interface to make writing and using R code easier. An analogy here is how you use Microsoft Word or Google Docs to write a paper. RStudio is the interface you use to do the work (write the essay/code file).

  • R itself can be downloaded here.
  • RStudio can be downloaded here.
    • You must download R before RStudio.

2.1.2 RStudio Tour

Generally speaking, you will see 4 sections in your RStudio window. To quickly overview each:

R file / document

This is the contents of your file where you write all of your code. This is usually the largest section and at the top left of this window.

Console

This is where all code is executed/run, and where the output of that code appears after running. You can run a line of code from your file by clicking on that line and pressing command+enter for OSX, or control+enter for windows.

Environment

This is where all your defined objects and variables are, more on this later.

Extra/swiss army knife section

This is where any graphs/visualizations you make will appear (under plots), the list of your installed or loaded packages (under Packages), or any R Documentation or functions you look up (under Help).

2.1.3 Customize Your RStudio

Looking at a default white screen for a long time can put a lot of strain on your eyes. RStudio comes with a lot of different themes you can apply by going to:

preferences -> appearance.

I personally use a custom theme called “Mojave Dark” (the one you see in the screenshots is called “Cobalt”). What theme to use is personal preference.

Additionally, there are three other settings tweaks to make. Go to preferences and then…

  1. R Markdown -> deselect “show output inline for all R markdown documents”
  2. Code -> Display -> check “Enable preview of named and hexadecimal colors”
  3. Code -> Display -> check “Use Rainbow Parentheses”

2.1.4 Different types of files

There are three main types of R files.

  1. R scripts (“fileName.r”) which contain only R code.
  2. R markdown notebooks (“fileName.rmd”) which contain plain text, R code, markdown/html code, and outputs.

Artwork by Horst (2022)

  1. RData files (“fileName.RData”), which contain saved R objects/variables.

You will use markdown files in this class because they enable you to more easily annotate your code and create literate programming documents. They also allow you to easily abide by a general programming philosophy called “separation of concerns,” which says code should be organized into distinct sections, each handling a specific task. Doing this enables you to quickly navigate your code and know where to look if something is not working as intended.

Markdown files can also be used to create documents that help communicate what you are doing in your data analysis pipeline. If you use Microsoft Word, then every time a figure or result changes, you have to go through and copy/paste a bunch of stuff, make a million changes throughout, save new files, it is a nuisance and often error prone. R Markdown files are living, breathing documents. Figures and results are automatically updated if the data is changed. Also, everything is in one place. In fact, what you are looking at here is the output of a markdown file. This entire textbook was made in RStudio with markdown files!

Artwork by Horst (2022)

2.1.5 Components of a Markdown File

Markdown files themselves have 2 components:

  1. General text, like what this is
  2. Code chunks, like this:
# This is a code chunk -- essentially mini r files
# You put R code inside chunks, text outside.
# You can comment out text by starting the line with a hashtag. 
  # Comments are text that is not run.

You put code inside chunks, text outside. You can comment out text by starting the line with a hashtag. Comments are text that is ignored by R when running your code.

You create a code chunk by putting three back-ticks ( ``` ) on two lines with a break between them, and adding a set of curly brackets { } with an “r” in it next to the first set of back ticks: ```{

You can also insert code chunks by pressing control+options+i on OSX or control+alt+i on windows (RStudio is full of nifty shortcuts!)

Generally speaking, you want to name your code chunks to help readily identify what task that section of code is handling.