3 Some Useful Bits of R

3.1 You Gotta Have Style

Good programming style is incredibly important. It makes your code easier to read and edit. That leads to fewer errors.

Here is a brief style guide you are expected to follow for all code in this course. For more detail, see http://adv-r.had.co.nz/Style.html, on which this guide is based.

  1. No long lines.

    Lines of code should have at most 80 characters.

    Lines of code should not wrap on their own; you should choose the line breaks yourself, in natural places. In R Markdown, long code lines don’t wrap. They flow off the page, so the end isn’t visible. (And it makes it obvious that you didn’t look at your own printout.)

    # Don't ever use really long lines.  Not even in comments.  They spill off the page and make people wonder what they are missing. 
  2. Use your space bar.

    There is a reason it is the largest key on the keyboard. Use it often. Spaces after commas (always). Spaces around operators (always). Spaces after the comment symbol # (always). Use other spaces judiciously to align similar code chunks to make things easier to read or compare.

    x<-c(1,2,4)+5          # BAD BAD BAD
    x <- c(1, 2, 4) + 5    # Ah :^)
  3. But don’t go crazy with the space bar.

    There are a few places you should not use spaces:

    • after open parentheses or before closed parentheses
    • between function names and the parentheses that follow
  4. Indent to show the structure of your code.

    Use 2 spaces to indent (to keep things from drifting right too quickly).

    Fortunately, this is really easy. Highlight your code and hit <CTRL>-i (PC) or <command>-i (Mac). If the indentation looks odd to you, you have most likely messed up commas, quotes, parentheses, or curly braces.

  5. Choose names wisely and consistently.

    Naming things is hard, but take a moment to choose good names, and go back and change them if you come up with a better name later. Here are some helpful hints:

    1. Very short names should only be used for a very short time (a couple of lines of code). Otherwise we tend to forget what they meant. Avoid names like x, f, etc. unless the use is brief and mimics some common mathematical formula.

    2. Break long names visually. Common ways to do this are with a dot (.), an underscore (_), or camelCase. Each of the three has its advocates among R coders, but don’t mix and match for similar kinds of things. That just makes it harder to remember what to do the next time.

    good_name <- 10
    good.name <- 10
    goodName  <- 10    # note alignment via extra space in this line
    really_terrible.ideaToDo <- -5

    The trend in R is toward using underscore (_) and I recommend it. Older code often used dot (.). CamelCase is the least common in R.

    3. Recommendation: capitalize data frame names; use lower case for variables inside data frames.

      This is not a common convention in R, but it can really help to keep things straight. I’ll do this in the data sets I create, but when we use other data sets, they may not follow this convention.

    4. Avoid using names that are already in use by R.

      This can be hard to avoid when you are starting out because you don’t yet know everything that is defined. Here are a few things to avoid.

      T           # abbreviation for TRUE
      F           # abbreviation for FALSE
      c           # used to concatenate vectors and lists
      df          # density function for F distributions
      dt          # density function for t distributions
  6. Use comments (#), but use them for the right thing.

    Comments can be used to clarify names, point out subtleties in code, etc. They should not be used for your analysis or discussion of results. Don’t comment things that are obvious without comment. Comments should add value.

    x <- 4   # set x to 4  <----------------------- no need for this comment
    x <- 4   # b/c there are four grade levels in the study  <------- useful
  7. Exceptions should be exceptional.

    No style guide works perfectly in all situations. Occasionally you may need to violate the style guide. But these instances should be rare and should have a good reason. They should not arise from sloppiness or laziness.
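
The guidelines above can be seen together in a short sketch. The names and values here (grade_levels, scores, mean_score) are made up for illustration and are not taken from the course materials.

```r
# Hypothetical example: names and values are invented for illustration.
grade_levels <- 4                   # b/c there are four grade levels
scores       <- c(71, 84, 90, 65)   # one made-up score per grade level

# Two-space indentation shows which code belongs to the function body.
mean_score <- function(x, digits = 1) {
  round(mean(x), digits)
}

mean_score(scores)
```

Note the spaces after commas and around operators, the aligned assignments, the underscore-separated names, and a comment that explains why rather than what.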

3.1.1 An additional note about homework

When you do homework, I want to see your code and the results (and your discussion of those results). Writing in R Markdown makes this all easy to do.

But make sure that I can see everything necessary to evaluate what you are doing. You have access to your code and can investigate variables, etc. But make sure I can see what’s going on in the document. This often means displaying intermediate results. One common way to do this is with a semicolon:

x <- 57 * 23; x
## [1] 1311
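
Another option, which also appears in the exercises at the end of this chapter, is to wrap the assignment in parentheses. A minimal sketch:

```r
(x <- 57 * 23)    # parentheses make R print the value that was assigned
x                 # x is still available for later use
```

Either form works; the point is that the reader of the document sees the value.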

3.2 Vectors, Lists, and Data Frames

3.2.1 Vectors

In R, a vector is a homogeneous ordered collection (indexing starts at 1). By homogeneous, we mean that each element is the same kind of thing.

Short vectors can be created using c():

x <- c(1, 3, 5)
x
## [1] 1 3 5
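
Because vectors are homogeneous, c() converts mixed inputs to a single common type rather than storing them as given. Here is a small sketch of that behavior (the variable names are just for illustration):

```r
mixed_num_chr <- c(1, "a", TRUE)     # everything becomes a character string
mixed_num_lgl <- c(1, TRUE, FALSE)   # logicals become numbers (TRUE is 1)

class(mixed_num_chr)
class(mixed_num_lgl)
mixed_num_lgl
```

If a vector of numbers unexpectedly prints with quotation marks, a stray character value is a likely cause.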

Evenly spaced sequences can be created using seq():

x <- seq(0, 100, by = 5); x
##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
y <- seq(0, 1, length.out = 11); y
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0:10      # short cut for consecutive integers
##  [1]  0  1  2  3  4  5  6  7  8  9 10

Repeated values can be created with rep():

rep(5, 3)
## [1] 5 5 5
rep(c(1, 2, 3), each  = 2)
## [1] 1 1 2 2 3 3
rep(c(1, 2, 3), times = 2)
## [1] 1 2 3 1 2 3
rep(c(1, 2, 3), times = c(1, 2, 3))
## [1] 1 2 2 3 3 3
rep(c(1, 2, 3), each  = c(1, 2, 3))    # Ack! see warning message.
## Warning in rep(c(1, 2, 3), each = c(1, 2, 3)): first element used of 'each' argument
## [1] 1 2 3

When a function acts on a vector, there are several things that could happen.

  • One result can be computed from the entire vector.

    x <- c(1, 3, 6, 10)
    length(x)
    ## [1] 4
    mean(x)
    ## [1] 5
  • The function may be applied to each element of the vector and a vector of results returned. (Such functions are called vectorized.)

    log(x)
    ## [1] 0.000 1.099 1.792 2.303
    2 * x
    ## [1]  2  6 12 20
    x^2
    ## [1]   1   9  36 100
  • The first element of the vector may be used and the others ignored. (Less common but dangerous – be on the lookout. See example above.)

Items in a vector can be accessed using []:

x <- seq(10, 20, by = 2)
x[2]
## [1] 12
x[10]       # NA indicates a missing value
## [1] NA
x[10] <- 4          
x           # missing values filled in to make room!
##  [1] 10 12 14 16 18 20 NA NA NA  4
is.na(x)    
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

In addition to using integer indices, there are two other ways to access elements of a vector: names and logicals. If the items in a vector are named, names are displayed when a vector is displayed, and names can be used to access elements.

x <- c(a = 5, b = 3, c = 12, 17, 1)
x
##  a  b  c       
##  5  3 12 17  1
names(x)
## [1] "a" "b" "c" ""  ""
x["b"]
## b 
## 3

Logicals (TRUE and FALSE) are very interesting in R. In indexing, they tell us which items to keep and which to discard.

x <- (1:10)^2
x < 50
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
x[x < 50]
## [1]  1  4  9 16 25 36 49
x[c(TRUE, FALSE)]   # T/F recycled to length 10
## [1]  1  9 25 49 81
which(x < 50)
## [1] 1 2 3 4 5 6 7
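
Logical vectors can also be combined with & (and) and | (or), and because TRUE counts as 1 and FALSE as 0, sum() counts how many elements satisfy a condition. Negative integer indices drop elements instead of keeping them. A brief sketch using the same x:

```r
x <- (1:10)^2

x[x > 10 & x < 50]     # both conditions must hold
x[x < 5 | x > 80]      # at least one condition must hold
sum(x < 50)            # number of elements less than 50
x[-1]                  # everything except the first element
x[-(1:3)]              # everything except the first three elements
```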

3.2.2 Lists

Lists are a lot like vectors, but with a few differences:

  • Create a list with list().
  • The elements can be different kinds of things (including other lists).
  • Use [[ ]] to access elements. You can also use $ to access named elements.
  • If you use [ ] you will get a list back, not an element.
# a messy list
L <- list(5, list(1, 2, 3), x = TRUE, y = list(5, a = 3, 7))  
L
## [[1]]
## [1] 5
## 
## [[2]]
## [[2]][[1]]
## [1] 1
## 
## [[2]][[2]]
## [1] 2
## 
## [[2]][[3]]
## [1] 3
## 
## 
## $x
## [1] TRUE
## 
## $y
## $y[[1]]
## [1] 5
## 
## $y$a
## [1] 3
## 
## $y[[3]]
## [1] 7
L[[1]]
## [1] 5
L[[2]]
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
L[1]
## [[1]]
## [1] 5
L$y
## [[1]]
## [1] 5
## 
## $a
## [1] 3
## 
## [[3]]
## [1] 7
L[["x"]]
## [1] TRUE
L[1:3]
## [[1]]
## [1] 5
## 
## [[2]]
## [[2]][[1]]
## [1] 1
## 
## [[2]][[2]]
## [1] 2
## 
## [[2]][[3]]
## [1] 3
## 
## 
## $x
## [1] TRUE
glimpse(L)    # glimpse() is provided by the dplyr package
## List of 4
##  $  : num 5
##  $  :List of 3
##   ..$ : num 1
##   ..$ : num 2
##   ..$ : num 3
##  $ x: logi TRUE
##  $ y:List of 3
##   ..$  : num 5
##   ..$ a: num 3
##   ..$  : num 7
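
The difference between [ ] and [[ ]] is easy to miss in printed output, so it can help to ask R what kind of object comes back. A short sketch using the same list L:

```r
L <- list(5, list(1, 2, 3), x = TRUE, y = list(5, a = 3, 7))

class(L[1])        # "list": single brackets return a (shorter) list
class(L[[1]])      # "numeric": double brackets return the element itself
L$y$a              # $ can be chained to reach into nested lists
L[["y"]][["a"]]    # the same element using [[ ]]
L[[2]][[3]]        # third element of the second element
```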

3.2.3 Data frames for rectangular data

Rectangular data is organized in rows and columns (much like an Excel spreadsheet). These rows and columns have a particular meaning:

  • Each row represents one observational unit. Observational units go by many other names depending on whether they are people, inanimate objects, events, etc. Examples include case, subject, and item. Regardless, the observational units are the things about which we collect information, and each one gets its own row in rectangular data.

  • Each column represents a variable – one thing that is “measured” and recorded (at least in principle, some measurements might be missing) for each observational unit.

Example: In a study of nutritional habits of college students, our observational units are the college students in the study. Each student gets their own row in the data frame. The variables might include things like an ID number (or name), sex, height, weight, whether the student lives on campus or off, what type of meal plan they have at the dining hall, etc. Each of these is recorded in a separate column.

Data frames are the standard way to store rectangular data in R. Usually the variables (the columns, which R stores as elements of a list) are vectors, but this isn’t required; sometimes you will see list variables in data frames. Each variable must have the same length (to keep things rectangular).

Here, for example, are the first few rows of a data set called KidsFeet:

library(mosaicData)   # Load package to make KidsFeet data available
head(KidsFeet)        # first few rows
name   birthmonth birthyear length width sex biggerfoot domhand
David           5        88   24.4   8.4 B   L          R
Lars           10        87   25.4   8.8 B   L          L
Zach           12        87   24.5   9.7 B   R          R
Josh            1        88   25.2   9.8 B   L          R
Lang            2        88   25.1   8.9 B   L          R
Scotty          3        88   25.7   9.7 B   R          R

3.2.3.1 Accessing via [ ]

We can access rows, columns, or individual elements of a data frame using [ ]. This is the more usual way to do things.

KidsFeet[, "length"]
##  [1] 24.4 25.4 24.5 25.2 25.1 25.7 26.1 23.0 23.6 22.9 27.5 24.8 26.1 27.0 26.0 23.7 24.0 24.7 26.7
## [20] 25.5 24.0 24.4 24.0 24.5 24.2 27.1 26.1 25.5 24.2 23.9 24.0 22.5 24.5 23.6 24.7 22.9 26.0 21.6
## [39] 24.6
KidsFeet[, 4]
##  [1] 24.4 25.4 24.5 25.2 25.1 25.7 26.1 23.0 23.6 22.9 27.5 24.8 26.1 27.0 26.0 23.7 24.0 24.7 26.7
## [20] 25.5 24.0 24.4 24.0 24.5 24.2 27.1 26.1 25.5 24.2 23.9 24.0 22.5 24.5 23.6 24.7 22.9 26.0 21.6
## [39] 24.6
KidsFeet[3, ]
  name birthmonth birthyear length width sex biggerfoot domhand
3 Zach         12        87   24.5   9.7 B   R          R
KidsFeet[3, 4]
## [1] 24.5
KidsFeet[3, 4, drop = FALSE]  # keep it a data frame
  length
3   24.5
KidsFeet[1:3, 2:3]
birthmonth birthyear
         5        88
        10        87
        12        87

By default,

  • Accessing a row returns a 1-row data frame.
  • Accessing a column returns a vector (at least for vector columns)
  • Accessing an element returns that element (technically a vector with one element in it).
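
One way to check these defaults is to ask R what kind of object each expression returns. The sketch below uses a small made-up data frame (Toy, with invented values) rather than KidsFeet so that it runs without any packages:

```r
# Hypothetical data frame; the values are invented for illustration.
Toy <- data.frame(
  name   = c("Ann", "Bo", "Cy"),
  length = c(24.4, 25.4, 24.5),
  width  = c(8.4, 8.8, 9.7)
)

is.data.frame(Toy[2, ])                        # a row: still a data frame
is.data.frame(Toy[, "length"])                 # a column: a plain vector
is.data.frame(Toy[, "length", drop = FALSE])   # drop = FALSE keeps the frame
Toy[2, "length"]                               # a single element
```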

3.2.3.2 Accessing columns via $

We can also access individual variables using the $ operator:

KidsFeet$length
##  [1] 24.4 25.4 24.5 25.2 25.1 25.7 26.1 23.0 23.6 22.9 27.5 24.8 26.1 27.0 26.0 23.7 24.0 24.7 26.7
## [20] 25.5 24.0 24.4 24.0 24.5 24.2 27.1 26.1 25.5 24.2 23.9 24.0 22.5 24.5 23.6 24.7 22.9 26.0 21.6
## [39] 24.6
KidsFeet$length[2]
## [1] 25.4
KidsFeet[2, "length"]
## [1] 25.4
KidsFeet[2, "length", drop = FALSE]    # keep it a data frame
  length
2   25.4

As we will see, there are other tools that will help us avoid needing to use $ or [ ] to access columns in a data frame. This is especially nice when we are working with several variables all coming from the same data frame.

3.2.3.3 Accessing by number is dangerous

Generally speaking, it is safer to access things by name than by number when that is an option. It is easy to miscalculate the row or column number you need, and if rows or columns are added to or deleted from a data frame, the numbering can change.
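
A small sketch of how numbering can go wrong, again using a made-up data frame (Toy and its values are invented for illustration):

```r
# Hypothetical data frame; the values are invented for illustration.
Toy <- data.frame(
  name   = c("Ann", "Bo", "Cy"),
  length = c(24.4, 25.4, 24.5),
  width  = c(8.4, 8.8, 9.7)
)

Toy[, 2]                  # column 2 is length ...
Toy$name <- NULL          # ... until a column is removed
Toy[, 2]                  # now column 2 is width
Toy[, "length"]           # the name still finds the right column
```

No error or warning is produced when the number silently points at a different column, which is what makes this kind of mistake hard to notice.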

3.2.3.4 Implementation

Data frames are implemented in R as a special type (technically, class) of list. The elements of the list are the columns in the data frame. Each column must have the same length (so that our data frame has coherent rows). Most often the columns are vectors, but this isn’t required.

This explains why $ works the way it does – we are just accessing one item in a list. It also means that we can use [[ ]] to access a column:

KidsFeet[["length"]]
##  [1] 24.4 25.4 24.5 25.2 25.1 25.7 26.1 23.0 23.6 22.9 27.5 24.8 26.1 27.0 26.0 23.7 24.0 24.7 26.7
## [20] 25.5 24.0 24.4 24.0 24.5 24.2 27.1 26.1 25.5 24.2 23.9 24.0 22.5 24.5 23.6 24.7 22.9 26.0 21.6
## [39] 24.6
KidsFeet[[4]]
##  [1] 24.4 25.4 24.5 25.2 25.1 25.7 26.1 23.0 23.6 22.9 27.5 24.8 26.1 27.0 26.0 23.7 24.0 24.7 26.7
## [20] 25.5 24.0 24.4 24.0 24.5 24.2 27.1 26.1 25.5 24.2 23.9 24.0 22.5 24.5 23.6 24.7 22.9 26.0 21.6
## [39] 24.6
KidsFeet[4]
length
24.4
25.4
24.5
25.2
25.1
25.7
26.1
23.0
23.6
22.9
27.5
24.8
26.1
27.0
26.0
23.7
24.0
24.7
26.7
25.5
24.0
24.4
24.0
24.5
24.2
27.1
26.1
25.5
24.2
23.9
24.0
22.5
24.5
23.6
24.7
22.9
26.0
21.6
24.6
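
The list structure can be checked directly. In this sketch (using a made-up data frame, Toy, with invented values), single brackets return a one-column data frame and double brackets return the column itself, just as with ordinary lists:

```r
# Hypothetical data frame; the values are invented for illustration.
Toy <- data.frame(
  length = c(24.4, 25.4, 24.5),
  width  = c(8.4, 8.8, 9.7)
)

is.list(Toy)              # a data frame is a list ...
class(Toy)                # ... with class "data.frame"
length(Toy)               # the number of list elements, i.e., columns
names(Toy)                # the element names are the column names
class(Toy[1])             # [ ] gives back a (smaller) data frame
class(Toy[[1]])           # [[ ]] gives back the column vector
sapply(Toy, mean)         # list tools work column by column
```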

3.2.4 Other types of data

Some types of data do not work well in the rectangular arrangement of a data frame, and there are many other ways to store data. In R, other types of data commonly get stored in a list of some sort.

3.3 Plotting with ggformula

R has several plotting systems. Base graphics is the oldest. lattice and ggplot2 are both built on a system called grid graphics. ggformula is built on ggplot2 to make it easier to use and to bring in some of the advantages of lattice.

You can find out more about ggformula at https://projectmosaic.github.io/ggformula/news/index.html.
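
As a rough sketch of what ggformula code looks like (assuming the ggformula and mosaicData packages are installed), plotting functions take a formula of the form y ~ x and a data frame. The particular plot chosen here is just an illustration:

```r
library(ggformula)    # plotting functions such as gf_point()
library(mosaicData)   # provides the KidsFeet data

# Scatter plot of foot length (y) against foot width (x), colored by sex.
p <- gf_point(length ~ width, data = KidsFeet, color = ~ sex)
p
```

Attributes that depend on a variable are written with a one-sided formula (~ sex); attributes set to a fixed value are written without the tilde.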

3.4 Creating data with expand.grid()

We will frequently have need of synthetic data that includes all combinations of some variable values. expand.grid() does this for us:

expand.grid(
  a = 1:3, 
  b = c("A", "B", "C", "D"))
a b
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
1 D
2 D
3 D
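
The number of rows is the product of the numbers of values supplied, and the result is an ordinary data frame, so further columns can be computed from the grid. A short sketch (the column names and values are invented for illustration):

```r
Grid <- expand.grid(
  x = c(0, 0.5, 1),
  y = 1:4
)

nrow(Grid)                   # 3 values of x times 4 values of y
Grid$z <- Grid$x * Grid$y    # compute a new column for each combination
head(Grid, 4)
```

As in the output above, the first variable changes fastest from row to row.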

3.5 Transforming and summarizing data with dplyr and tidyr

See the tutorial at http://rsconnect.calvin.edu/wrangling-jmm2019 or https://rpruim.shinyapps.io/wrangling-jmm2019
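
As a small taste of what the tutorial covers, here is a sketch using a few dplyr verbs on a made-up data frame (assuming the dplyr package is installed; Toy and its values are invented for illustration). Each verb takes a data frame and returns a new one, and %>% passes the result along to the next step:

```r
library(dplyr)

# Hypothetical data frame; the values are invented for illustration.
Toy <- data.frame(
  name  = c("Ann", "Bo", "Cy", "Di"),
  group = c("a", "b", "a", "b"),
  score = c(10, 20, 30, 40)
)

GroupA <-
  Toy %>%
  mutate(double_score = 2 * score) %>%   # add a new variable
  filter(group == "a") %>%               # keep only some rows
  arrange(desc(score))                   # sort rows, largest score first
GroupA

ByGroup <-
  Toy %>%
  group_by(group) %>%
  summarise(mean_score = mean(score))    # one row per group
ByGroup
```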

3.6 Writing Functions

3.6.1 Why write functions?

There are two main reasons for writing functions.

  1. You may want to use a tool that requires a function as input.

    To use integrate(), for example, you must provide the integrand as a function.

  2. To make your own work easier.

    Functions make it easier to reuse code or to break larger tasks into smaller parts.

3.6.2 Function parts

Functions consist of several parts. Most importantly

  1. An argument list.

    A list of named inputs to the function. These may have default values (or not). There is also a special argument ... which gathers up any other arguments provided by the user. Many R functions make use of ...

    args(ilogit)  # one argument, called x, no default value
    ## function (x) 
    ## NULL
  2. The body.

    This is the code that tells R what to do when the function is executed.

    body(ilogit)
    ## {
    ##     exp(x)/(1 + exp(x))
    ## }
  3. An environment where code is executed.

    Each function has its own “scratch pad” where it can do work without interfering with global computations. But environments in R are nested, so it is possible to reach outside of this narrow environment to access other things (and possible to change them). For the most part we won’t worry about this, but if you use a variable not defined in your function but defined elsewhere, you may see unexpected results.

If you type the name of a function without parentheses, you will see all three parts listed:

ilogit
## function (x)
## {
##   exp(x)/(1 + exp(x))
## }
## <bytecode: 0x7f9491bdd618>
## <environment: namespace:mosaicCore>
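
The warning above about variables that are used in a function but defined elsewhere can be sketched as follows. The names (scale_factor, rescale_risky, rescale_safe) are hypothetical:

```r
scale_factor <- 10

# Risky: scale_factor is not an argument, so R looks for it outside
# the function and uses whatever value it finds there.
rescale_risky <- function(x) {
  x * scale_factor
}

# Safer: everything the function needs is passed in as an argument.
rescale_safe <- function(x, scale_factor = 10) {
  x * scale_factor
}

rescale_risky(1:3)
scale_factor <- 2       # changing the global variable ...
rescale_risky(1:3)      # ... silently changes the result
rescale_safe(1:3)       # unaffected by the global variable
```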

3.6.3 The function() function has its function

To write a function we use the function() function to specify the arguments and the body. (R will assign an environment for us.)

The general outline is

my_function_name <- 
  function(arg1 = default1, arg2 = default2, arg3, arg4, ...) {
    
    # stuff for my function to do
  }

We may include as many named arguments as we like, and some, all, or none of them may have default values. The value of the last expression evaluated in the function is returned. If we like, we can also use the return() function to make it clear what is being returned and when.

Let’s write a function that adds. (Redundant, but a useful illustration.)

foo <- function(x, y = 5) {
  x + y     # or return(x + y)
}
foo(3, 5)
## [1] 8
foo(3)
## [1] 8
foo(2, x = 3)   # Note: this makes y = 2
## [1] 5
foo(x = 1:3, y = 100)
## [1] 101 102 103
foo(x = 1:3, y = c(100, 200, 300))  # vectorized!
## [1] 101 202 303
foo(x = 1:3, y = c(100, 200))       # You have been warned!
## Warning in x + y: longer object length is not a multiple of shorter object length
## [1] 101 202 103
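
The special ... argument described earlier is typically passed along to another function. A minimal sketch, using a hypothetical function name:

```r
# Hypothetical helper: computes a mean after subtracting `center`.
# Any extra arguments (such as na.rm or trim) are handed on to mean().
centered_mean <- function(x, center = 0, ...) {
  mean(x - center, ...)
}

centered_mean(c(1, 2, 3, NA))                 # NA, since a value is missing
centered_mean(c(1, 2, 3, NA), na.rm = TRUE)   # na.rm is passed on to mean()
centered_mean(c(1, 2, 3), center = 2)
```

This lets users of centered_mean() take advantage of options of mean() without each one being listed as an argument.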

Here is a more useful example. Suppose we want to integrate \(f(x) = x^2 (2-x)\) on the interval from 0 to 2. Since this is such a simple function, if we are not going to reuse it, we don’t need to bother naming it. We can just create the function inside our call to integrate().

integrate(function(x) { x^2 * (2-x) }, 0, 2)
## 1.333 with absolute error < 1.5e-14
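
If the function will be reused, or if we need the numerical value for later computation, we can name both the function and the result. integrate() returns a list-like object whose value element holds the estimate. A short sketch (the names f and result are just for illustration):

```r
f <- function(x) {
  x^2 * (2 - x)
}

result <- integrate(f, lower = 0, upper = 2)
result$value          # the numerical estimate of the integral
result$abs.error      # an estimate of the absolute error
f(c(0, 1, 2))         # f is vectorized, as integrate() requires
```

The exact value of this integral is 4/3, which agrees with the printed estimate above.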

3.7 Some common error messages

3.7.1 object not found

If R claims some object is not found, the two most likely causes are

  • a typo – if you spell the name of the object slightly differently, R can’t figure out what you mean

    blah <- 17
    bla
    ## Error in eval(expr, envir, enclos): object 'bla' not found
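  • forgetting to load a package – if the object is in a package, that package must be loaded. (In the transcript below, detach() fails because another loaded package requires ggformula, so gf_line() remains available. In a fresh session where ggformula has not been loaded, calling gf_line() fails with a “could not find function” error.)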

    detach("package:mosaic", unload = TRUE)
    detach("package:ggformula", unload = TRUE)   # without packages, no gf_line()
    ## Error: package 'ggformula' is required by 'CalvinBayes' so will not be detached
    gf_line()
    ## gf_line() uses 
    ##     * a formula with shape y ~ x. 
    ##     * geom:  line 
    ##     * key attributes:  alpha, color, fill, group, linetype, size, lineend, linejoin, linemitre, arrow
    ## 
    ## For more information, try ?gf_line
    library(ggformula)      # reload package; now things work
    gf_line()
    ## gf_line() uses 
    ##     * a formula with shape y ~ x. 
    ##     * geom:  line 
    ##     * key attributes:  alpha, color, fill, group, linetype, size, lineend, linejoin, linemitre, arrow
    ## 
    ## For more information, try ?gf_line

3.7.2 Package inputenc Error: Unicode char not set up for use with LaTeX.

Sometimes if you copy and paste text from a web page or PDF document you will get symbols knitr doesn’t know how to handle. Smart quotes, ligatures, and other special characters are the most likely cause.

tools::showNonASCIIfile() can help you locate non-ASCII characters in a file.

3.7.3 Any message mentioning yaml

YAML originally stood for “yet another markup language.” The first part of an R Markdown file is called the YAML header. If you get a yaml error when you knit a document, most likely you have messed up the YAML header somehow. If you don’t see the problem, you can start a new document and copy-and-paste the contents (without the YAML header) into the new document (after its YAML header).

3.8 Exercises

  1. Create a function in R that converts Fahrenheit temperatures to Celsius temperatures. [Hint: \(C = (F-32) \cdot 5/9\).]

    What you turn in should show

    1. the code that defines your function.
    2. some test cases that show that your function is working. (Show that -40, 32, 98.6, and 212 convert to -40, 0, 37, and 100.) Note: you should be able to test all these cases by calling the function only once. Use c(-40, 32, 98.6, 212) as the input.
  2. See if you can predict the output of each line below. Then run the code in R to see if you are correct. If you are not correct, see if you can figure out why R does what it does (and make a note so you are not surprised the next time).

odds <- 1 + 2 * (0:4); odds
primes <- c(2, 3, 5, 7, 11, 13)
length(odds)
length(primes)
odds + 1
odds + primes
odds * primes
odds > 5
sum(odds > 5)
sum(primes < 5 | primes > 9)
odds[3]
odds[10]
odds[-3]
primes[odds]
primes[primes >= 7]
sum(primes[primes > 5])
sum(odds[odds > 5])
odds[10] <- 1 + 2 * 9
odds
y <- 1:10
y <- 1:10; y
(x <- 1:5)
  3. This problem uses the KidsFeet data set from the mosaicData package. The hints are suggested functions that might be of use.

    1. How many kids are represented in the data set? [Hint: nrow() or dim()]
    2. Which of the variables are factors? [Hint: glimpse()]
    3. Add a new variable called foot_ratio that is equal to length divided by width. [Hint: mutate()]
    4. Add a new variable called biggerfoot2 that has values "dom" (if domhand and biggerfoot are the same) and "nondom" (if domhand and biggerfoot are different). [Hint: mutate(), ==, ifelse()]
    5. Create a new data set called Boys that contains only the boys. [Hint: filter(), ==]
    6. What is the name of the boy with the largest foot_ratio? Show how to find this programmatically; don’t just scan through the whole data set yourself. [Hint: max() or arrange()]

3.9 Footnotes