2  Modern R

2.1 Initial Development of R (1990s)

In 1991, two professors in New Zealand began to develop R, a process which they documented in Ihaka and Gentleman (1996). R is very similar to S; so similar, in fact, that it is frequently called a dialect of S. What is the difference between S and R? The creators of R describe it as having the syntax of S (meaning that most examples, including the following one, can be run in either S or R) but the semantics of Scheme (a programming language from the Lisp family).

Probably the key difference between the two languages is lexical scoping. Whenever you use R (or most other programming languages), you are working with something called a frame. Frames contain things like functions and named variables, and each function creates its own frame. The frame for the function f in the example below (from Ihaka and Gentleman (1996)) contains the named variable y and the named function g. The named function g, as a function, creates its own frame (in which to store variables and functions). There is also something called the global frame which, in the following example, includes an assignment of the value 123 to the name y and the assignment of some function to the name f.

y <- 123 # assign 123 to the name y
f <- function(x) {          # <1>
  y <- x * x # assign x times x to the name y
  g <- function() print(y)  # <2> create new function and new scope
  g() # return the output of function g
}
f(x = 10)                   # <3>

  1. the beginning of the declaration of function f (between lines 2 and 6)
  2. the declaration of function g
  3. a function call for function f (with the argument x set to equal 10)

[1] 100

Figure 2.1: an example from Ihaka and Gentleman (1996)

As you can see, in R, running the function f with 10 as an argument results in the function returning 100 (10 times 10). In S, this very same code would have resulted in f returning the value 123. In S, when we define the function g, S uses the global frame as the basis for the function, including the assignment of the value 123 to y. R, by contrast, creates g with a lexically-scoped frame, meaning that the frame for g includes the assignment of the value x * x to the name y (an assignment inherited from the parent frame, the frame of f). Thus, in S, the function g is evaluated as print(123), but in R it is evaluated as print(x * x) (the function f is responsible for substituting x to make this print(10 * 10)).

Unlike Scheme, but like S, R uses lazy evaluation. In essence, this means that R does not run your code unless it absolutely has to. I’ll use the example in Figure 2.1 to explain what this means. Lines 2 through 6 contain the declaration of function f (even though line 4 also contains the declaration of function g). When you run line 2, all of the lines down to line 6 (where the closing brace, }, is located) get stored in R’s memory next to the name f. However, R will not run the function f until you actually go to use it (i.e., until you make the function call in line 7). This is why we call R lazy, but what’s the big deal?

If you make an error in your declaration of function f, R is going to have to tell you about it at some point. In a language that is not lazy, the language evaluates function f when you store it. Thus, a non-lazy language will send you an error after you run the function declaration (i.e., after you run line 2, which also causes lines 3-6 to run). If Figure 2.1 were written in a non-lazy language, the error would occur where the 1 annotation is. In R, however, the function is merely stored when you run lines 2-6. Function f does not actually run until you call it in line 7 (marked with a 3 in Figure 2.1). Laziness is a feature that R inherited from S, which is also lazy.
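To see what this looks like in practice, here is a small sketch of my own (the function h and the nonexistent function it calls are made up for illustration): defining h succeeds without complaint, and R only objects when h is actually called.

h <- function() a_function_that_does_not_exist(1) # R happily stores this
h() # only when h is called does R evaluate the body and complain that the function can't be found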

This is getting a bit technical. The two men who developed R are Ross Ihaka and Robert Gentleman. On a family tree posted on Ihaka’s personal website, he lists himself as the academic grandchild of John Tukey, the statistician at Bell Labs who popularized exploratory data analysis (the framework that created the need for S, which was also, if you’ll recall, developed at Bell Labs).

Ihaka is a now-retired statistician from the University of Auckland. Gentleman is a bioinformatician who currently works at Harvard and 23andMe. Allegedly, Ihaka and Gentleman developed R for use in teaching statistics. That was part of both of their jobs as professors, after all. However, this doesn’t seem very plausible (why would the professors write their own statistical programming language rather than use a well-documented existing one, which would seem better for pedagogy?), nor have I seen any specific evidence for it. That being said, in 1993-94, R was stuck at the University of Auckland, being used by the two of them, probably by their peers, and less probably by their students; the software was not yet being distributed the way S or S-PLUS was.

In 1995, one of their colleagues convinced them to license the software as free software under the GNU General Public License.

2.1.1 Free Software

I just bolded the term free software; why? As it turns out, the term free software has a specific definition that extends far beyond the idea of “software that you don’t have to pay for.” So what is free software? Free software is characterized by the four freedoms (Free Software Foundation n.d.):

Free Software Foundation’s Four Essential Freedoms

  • Freedom 0: the freedom to run the program as you wish, for any purpose
  • Freedom 1: the freedom to study how the program works and to change it (access to the source code is a precondition for this)
  • Freedom 2: the freedom to redistribute copies so you can help others
  • Freedom 3: the freedom to distribute copies of your modified versions to others (access to the source code is a precondition for this)

The idea of free software, and its formulation in the four freedoms seen above, became popular in the mid-1980s, after the Reagan administration had made clear its stance (and the Republican, and soon the Democratic, party’s stance) on antitrust enforcement. In the wake of the disruption to the computing (IBM) and telephone (AT&T) industries, the Reaganites declared that we were entering an era of free, competitive trade while setting up a regulatory framework that would allow tech companies to consolidate power and market share ad infinitum, resulting in the current big 4(-ish): Apple, Alphabet (Google), Amazon, and Meta (and Microsoft, Nvidia, and potentially Tesla and maybe Netflix, depending on who you ask).

It is a good thing for us, then, that none of these companies own R, which the developers have promised will remain free software indefinitely.

2.1.2 Open Sourcing and Crowd Sourcing

These days, it feels like only a real purist will call R free software. The more en vogue term is “open source.” The Free Software Foundation, however, would like you to treat the terms as separate (see this article).

In reality, the labels “open source” and “free software” are mostly synonymous, in that they refer to largely the same software. As Freedoms 1 and 3 make clear, software must be open source before it can be free. The free software folks’ biggest problem with “open source” is one of semantics, really. They claim that the “open source” movement argues too much about how free software is good for business and software development (i.e., because curious users can look for and find bugs). The free software people are not interested in these practical matters, focusing instead on the moral question of what sort of software is right and wrong. They correspondingly argue their case in the form of moral imperatives (the four freedoms).

I am less interested in these theoretical questions, and more interested in explaining to you what the implication of free or open software is bound to be (at least in the case of R): crowd-sourced development.

R itself provides you with basic statistical functionality. However, the vast majority of what is commonly called “R” is not actually part of the base distribution of R. Instead, most of the functionality comes packaged in “packages,” which you load into R with the library() function, as shown below:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The tidyverse is really a “meta”-package (as is the tidymodels package, by the way). This means that it is a package that contains a bunch of other packages. What, then, is a package? A package is a collection of functions that someone has written and made available online for your (free) use. An example is the ggplot2 package, which is included in tidyverse. We’ll be using it later.

2.1.2.1 Where are packages and how many are there?

I can answer the first question, which will reveal why an accurate answer to the second is impossible. CRAN is the Comprehensive R Archive Network. CRAN stores a repository of several thousand R packages. When you use the function install.packages(), you are downloading packages from CRAN. However, there are also R packages that are not on CRAN.
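For example, installing a package from CRAN looks like this (ggplot2 is just an illustrative choice; any CRAN package installs the same way):

# download and install a package from CRAN
install.packages("ggplot2")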

The Bioconductor project is another repository of R packages (and also just a regular project, started by R co-creator Robert Gentleman). Bioconductor stores packages for a specific purpose, and that purpose is not ours, so we can safely ignore it. Packages can also be (and frequently are) stored on GitHub or similar online repositories. The devtools package has the functions you need to get packages from these other sources (e.g., devtools::install_github()), as well as functions to install packages stored on your computer.

Finally, there are a multitude of packages that are stored on people’s personal websites and other non-repository internet locations. Installing (or counting) these packages is trickier than the others, but if all else fails, you can download the package from the internet and install it with devtools::install_local().
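Here is a sketch of both routes (the GitHub repository and the file path below are hypothetical placeholders, not real packages you should install):

# install a package hosted on GitHub ("username/repository" format)
devtools::install_github("some-user/somepackage")
# install a package file you have already downloaded to your computer
devtools::install_local("~/Downloads/somepackage_0.1.0.tar.gz")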

2.1.3 Wickham’s Tidyverse

Let’s return to the tidyverse package, or more precisely, the packages within. They are:

  • ggplot2 for data visualization
  • dplyr for data manipulation
  • tidyr for data tidying
  • readr for data import
  • purrr for functional programming
  • tibble for tibbles (a modern version of data frames)
  • stringr for string manipulation
  • forcats for factor manipulation
  • lubridate for date and time manipulation (part of the core only since tidyverse 2.0)

Tidyverse also includes a number of non-core packages, such as haven (for importing SPSS, Stata, and SAS files), magrittr (for piping), readxl (for importing Excel files), modelr (for some, limited modeling tasks), reprex (for producing reproducible examples), rvest (for scraping web pages), and xml2 (for reading XML files). All of these packages (all 16 of them) were written (or co-written) by Hadley Wickham. Hadley also wrote blob, which is not in the list because I don’t know what it does. In the tidyverse, there are also two packages which Wickham did not write, those being jsonlite (for reading JSON files) and glue (not sure what this one does, either).

When you run library(tidyverse), as I did above, you load all nine of the core tidyverse packages. The non-core packages will not load, however. If you want to use a non-core package, you will have to load it separately or refer to its functions in the form package::function(). Let’s take an example from the lubridate package, which only joined the core tidyverse in version 2.0 (before that, it was non-core). lubridate has a function called now(), which returns the current date and time. Trying to use a function from a package that has not been loaded is equivalent to telling R to use a function without telling it that, or where, the function exists. If you want to use such a function, you have to load its package with library(package), or refer to it in the form package::function(). Both of these methods will allow R to locate and run the now() function.

# R won't throw an error if we specify that now is from the lubridate package
lubridate::now()
[1] "2024-05-25 09:15:42 EDT"
# we could also load lubridate and just use now normally
library(lubridate)
now()
[1] "2024-05-25 09:15:42 EDT"

If you wanted to unload a package (for some reason), you could use the unloadNamespace() function. (A namespace is just the collection of functions in a package, each of which is associated with a name, like now or even unloadNamespace.)
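For example (a sketch using lubridate; R may refuse if other loaded packages still need the namespace):

# unload the lubridate namespace
unloadNamespace("lubridate")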

Hadley Wickham

This is my friend Hadley Wickham. He is a god from New Zealand. He got his bachelor’s in biology at the University of Auckland (which you may recall as the home university of Ross Ihaka), where he also got a master’s in statistics; then, Iowa State University gave him a PhD in 2008. His PhD thesis was called “Practical tools for exploring data and models,” but I would probably read some of his other work (particularly R for Data Science) first/instead. He lives in Texas with his husband and dogs.

Hadley is an adjunct professor at Rice University (allegedly), as well as at Stanford (where he seems to have last taught in 2019 or 2020) and the University of Auckland (where he is an honorary professor). He is the chief scientist at Posit, the company that makes RStudio. Why do I worship this man? What has he done to deserve it?

Hadley first stepped onto the scene with ggplot2 (and a few other packages) in (and around) 2008. We’ll use ggplot2 soon enough, but I want to explain why it’s important first. The graphical capabilities of R are quite good on their own, but using the base R plotting functions is quite complicated. This is because the base plotting functions use a pen-on-paper approach, meaning that you have to specify each aspect of the plot individually, with very few defaults. Although the base plotting functions are complicated enough that I don’t know how to use them, ggplot2 is much easier because it is based on the grammar of graphics (and because the interface Hadley designed works really well).

Statisticians (and mathematicians) are really into the idea of abstraction. Other scientists use this tool as well, but usually with much less vigor than the mathematician. She would like to talk about a graphic, but not any particular graphic; she wants to talk about the abstract idea of a graphic. What are its component parts? How do they fit together? What are the different sorts of plots, and how do we tell them apart? What elements of a plot are required to communicate the information within? That sort of thing.

This is the sort of questioning that leads to the grammar of graphics, laid out in The Grammar of Graphics, a book by Leland Wilkinson. Wilkinson sets up a framework and an abstraction of the concept of “graphic” that is general enough to describe essentially any plot that has appeared, or will ever appear, in a scientific journal.

(Leland Wilkinson was an American psychologist, statistician, and computer scientist. He got two degrees at Harvard, including a second bachelor’s in sacred divinity, and then received a PhD in psychology from Yale. I’m not going to say a lot about this man, but he has had a large influence on psychological statistics, to such an extent that he was the primary author of the APA’s guidelines and explanations for statistical methods in psychology journals (Wilkinson 1999).)

In any case, Hadley Wickham read Wilkinson’s book and built ggplot2 according to the grammar of graphics framework (the gg stands for grammar of graphics). This was necessary because Wilkinson’s book is not software. It addresses theoretical questions, but not practical ones: how can I actually make this plot? ggplot2 addresses these practical questions: you make the plot with ggplot2. To the greatest extent possible, the plot is specified by default and in accordance with the grammar of graphics framework. The plot is still fully customizable (like the base R graphics), but unlike with the base graphics, you can produce a very fine, “finished”-looking plot with very little code (and almost every line of code will have a single, specific, easy-to-understand purpose).
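To give a flavor of what “very little code” means, here is a minimal sketch of my own using the built-in mtcars data set (not an example we will use later); each line has one specific job:

library(ggplot2)
# map the data to x and y positions, draw a layer of points, and label the axes
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")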

2.1.4 Tidy Data

One of Wickham’s first acts, the software package we now call ggplot2, was explosive: it permanently changed the way the majority of R users make plots, and it definitively gained the attention of the open-source R community, who were elated to make use of this new tool. As it turns out, this would be the first, but not the last, time that Hadley Wickham caused such an explosion by introducing an innovative and complete technology to the community.

Central to Wickham’s work, at least since the 2007 release of the reshape package, has been the structure of data. There are a lot of ways that you can put the same information in a table, and it is not easy to verbally describe the ways in which these schemes differ (at least not in any generalizable, abstract way). The case here is very similar to the case with graphics. Because “data” as such is probably most important to computer scientists, they have for literal decades been proposing abstractions and definitions according to which we should understand their use of that term. In the 1960s and 1970s, the new technology in this regard was the relational database, which should be somewhat familiar to anyone who has ever used the organization and note-taking app Notion (which makes heavy use of these sorts of databases). With large datasets, especially in the 1970s, using a relational structure to organize the data could speed up computation by reducing the amount of information that needed to be stored in most cases.

The relational database is not super important for our purposes. All you need to know is that it is a way of storing usually large amounts of data in separate, inter-linked tables. The complexity of this scheme lies in relating the multiple tables to each other, and this task becomes nearly impossible if you do not have a consistent “philosophy of data” that allows you to structure data consistently and to know how it is structured without having to look at it. Using such a consistent approach to relational databases has two primary effects. First, programming is quicker because, once all of the data is correctly formatted, you save the time you would otherwise spend checking how individual tables are structured. Second, if all of the data that you’re ever going to use has to be in the same format, then you can very easily reuse the same code to complete the same task with different data sets. Had all of these separate data sets been formatted differently, there would be nary a chance you could ever reuse code, at least not without some level of modification. Computer scientists and mathematicians can further elaborate on the benefits of a “philosophy of data,” but none of the alleged benefits will ever be as important to me as saving time.

That brings us to the 2014 paper in which Wickham published his “philosophy of data” (Wickham 2014). Wickham calls this framework the tidy data framework, and it has three relatively simple rules. (It may be useful to note that, as with ggplot2, the tidy data framework and the software that uses it are based on ideas that existed before Hadley Wickham. In this case, the tidy data framework is derived from the third normal form of relational databases, as set out by Codd, who also invented the relational database.)

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

These three rules are from Wickham’s textbook chapter on tidy data (Wickham and Grolemund 2017), rather than from the paper, which is older and less accessible. Take a look at the following examples of the same data in three different formats:

# table1, table2, and table3 are example data sets that ship with tidyr
table1
table2
table3

The first table is in tidy format because all the variables (country, year, cases, and population) have their own columns, each row is an observation of a specific country in a specific year, and each value has its own cell. This is how you want your data to look, because this is how Hadley Wickham expects it to look. The second and third tables, by contrast, are not tidy. table2 is untidy because the type column contains two variables (cases and population) as values in cells rather than where they should be (as the names of columns). table3 is untidy because it breaks the third rule, containing both the cases and the population in one column (the rate column), separated by a “/”. A far better solution, if the rate variable is needed, would be to do something like the following, in which that variable gets its own column.

# compute the rate in a new column, keeping one variable per column
table1 |> 
  mutate(rate = cases / population)
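For completeness, here is a sketch of how the two untidy tables could be made tidy with tidyr functions (this assumes the column names used in the tidyr versions of these tables: type and count in table2, rate in table3):

# table2: turn the two variables hiding in the type column into real columns
table2 |> 
  pivot_wider(names_from = type, values_from = count)
# table3: split the combined rate column back into cases and population
table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)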

The tidy data framework is the one that I use, and it’s the one that I’m going to teach you. In principle, there is nothing special about this framework; any ol’ “philosophy of data” will do, so long as it is consistent. That is in principle. In practice, it is not the case that any ol’ philosophy of data will do. Many of the tools that we will be using were developed by Hadley Wickham, and those tools are designed to work specifically with data that is in tidy format. In practice, then, it is very important that you use this framework, because any other framework would require you to develop your own set of tools, which is a lot of work that Hadley Wickham has already done for you.