class: center, middle, inverse, title-slide

# Rounding Data Frames
### Data 303

---

## Your task: Round the numbers in a data frame

1. How many ways can you think of to do this?
2. If you were to write a function to do this, what would the arguments be?
3. What test cases should you create?
4. In what ways can this task be generalized?
5. How would you document your function?

---

## Some Test Data

We should make a data frame with several different kinds of columns:

```r
n <- 7L
TestData <- tibble(
  double = rnorm(n, 100, 10),
  x = 123400 / 10^(1L:n),
* integer = (1L:n) * (1L:n),  # ^2 would return a double!!
  character = LETTERS[1L:n],
  factor = factor(letters[1L:n]),
  logical = rep(c(TRUE, FALSE), length.out = n)
)
```

---

## Some Test Data

```r
TestData
```

```
## # A tibble: 7 × 6
##   double          x integer character factor logical
##    <dbl>      <dbl>   <int> <chr>     <fct>  <lgl>
## 1   92.5 12340            1 A         a      TRUE
## 2  109.   1234            4 B         b      FALSE
## 3   98.8   123.           9 C         c      TRUE
## 4  105.     12.3         16 D         d      FALSE
## 5  102.      1.23        25 E         e      TRUE
## 6   96.5     0.123       36 F         f      FALSE
## 7  111.      0.0123      49 G         g      TRUE
```

```r
TestData |> str()
```

```
## tibble [7 × 6] (S3: tbl_df/tbl/data.frame)
##  $ double   : num [1:7] 92.5 108.7 98.8 104.8 101.7 ...
##  $ x        : num [1:7] 12340 1234 123.4 12.34 1.23 ...
##  $ integer  : int [1:7] 1 4 9 16 25 36 49
##  $ character: chr [1:7] "A" "B" "C" "D" ...
##  $ factor   : Factor w/ 7 levels "a","b","c","d",..: 1 2 3 4 5 6 7
##  $ logical  : logi [1:7] TRUE FALSE TRUE FALSE TRUE FALSE ...
```

---

## Using a for loop

```r
df_round1 <- function(.data, digits = 0, ...) {
  for (i in seq_along(.data)) {
    if (is.numeric(.data[[i]])) {
*     .data[[i]] <- round(.data[[i]], digits = digits)
    }
  }
  .data
}
```

---

## Using a for loop

```r
TestData |> df_round1(digits = 2)
```

```
## # A tibble: 7 × 6
##   double        x integer character factor logical
##    <dbl>    <dbl>   <dbl> <chr>     <fct>  <lgl>
## 1   92.5 12340          1 A         a      TRUE
## 2  109.   1234          4 B         b      FALSE
## 3   98.8   123.         9 C         c      TRUE
## 4  105.     12.3       16 D         d      FALSE
## 5  102.
1.23      25 E         e      TRUE
## 6   96.5     0.12      36 F         f      FALSE
## 7  111.      0.01      49 G         g      TRUE
```

---

## Avoiding a for loop -- Why?

* Many uses of for loops in R are computationally inefficient because
    * memory is repeatedly allocated and copied (this can be reduced with some care)
    * `[` and `[[` are functions, so every indexing step is a function call
* For loops hide the big idea behind lots of boilerplate
    * big idea: do something to each element in a container
* R has many features of a functional programming language that support other solutions
    * these can make your code more efficient and more readable/maintainable
    * they encourage good modularization

---

## Using a for loop -- lots of repetition

```r
df_round1 <- function(.data, digits = 0, ...) {
  for (`i` in seq_along(`.data`)) {
    if (is.numeric(`.data`[[`i`]])) {
      `.data`[[`i`]] <- round(`.data`[[`i`]], digits = digits)
    }
  }
  `.data`
}
```

* `.data` is mentioned five times
* `i` is mentioned four times -- and it's just a dummy variable!

---

## Avoiding a for loop -- How?

### Using `lapply()`

This almost works:

```r
df_round2 <- function(.data, digits = 0) {
  .data |>
    lapply(
*     function(x) if (is.numeric(x)) round(x, digits = digits) else x
    )
}
```

Try it and see what goes wrong.

---

## Avoiding a for loop -- How?

### Using `lapply()`

This almost works:

```r
df_round2 <- function(.data, digits = 0) {
  .data |>
    lapply(
      function(x) if (is.numeric(x)) round(x, digits = digits) else x
    )
}
TestData |> df_round2(digits = 2) |> str()
```

```
## List of 6
##  $ double   : num [1:7] 92.5 108.7 98.8 104.8 101.7 ...
##  $ x        : num [1:7] 12340 1234 123.4 12.34 1.23 ...
##  $ integer  : num [1:7] 1 4 9 16 25 36 49
##  $ character: chr [1:7] "A" "B" "C" "D" ...
##  $ factor   : Factor w/ 7 levels "a","b","c","d",..: 1 2 3 4 5 6 7
##  $ logical  : logi [1:7] TRUE FALSE TRUE FALSE TRUE FALSE ...
```

---

## Avoiding a for loop -- How?
### Using `lapply()`

```r
round_if_numeric <- function(x, digits = 0) {
* if (is.numeric(x)) round(x, digits = digits) else x
}

df_round2a <- function(.data, digits = 0) {
  lapply(.data, `round_if_numeric`, digits = digits) |>
*   as_tibble()
}
```

---

### Using `purrr::map_df()`

There are several flavors of `purrr::map()` that

* specify the output type
* allow you to describe the function in several ways.

```r
df_round3 <- function(.data, digits = 0) {
  .data |> `map_df`(round_if_numeric, digits = digits)
}

df_round3a <- function(.data, digits = 0) {
  .data |> map_df( ~ if (is.numeric(.)) round(., digits) else .)
}
```

---

## Just checking

```r
all.equal(TestData |> df_round3(digits = 2), TestData |> df_round3a(digits = 2))
```

```
## [1] TRUE
```

```r
all.equal(TestData |> df_round2a(digits = 2), TestData |> df_round3a(digits = 2))
```

```
## [1] TRUE
```

```r
TestData |> df_round3(digits = 2)
```

```
## # A tibble: 7 × 6
##   double        x integer character factor logical
##    <dbl>    <dbl>   <dbl> <chr>     <fct>  <lgl>
## 1   92.5 12340          1 A         a      TRUE
## 2  109.   1234          4 B         b      FALSE
## 3   98.8   123.         9 C         c      TRUE
## 4  105.     12.3       16 D         d      FALSE
## 5  102.      1.23      25 E         e      TRUE
## 6   96.5     0.12      36 F         f      FALSE
## 7  111.      0.01      49 G         g      TRUE
```

---

## Exercises

1. Apply some function to each column of a data frame: use `lapply()` or `purrr::map_df()`
    * This generalizes from `round()` to an arbitrary function.
    * The function can be split out into its own definition if it is useful elsewhere, or for clarity and debugging.

2. *Conditionally* apply some function to each column of a data frame.
    * Create a function with a data frame and *two* functions as input, one to check the condition and one to do the work.
    * You will need to think a bit about possible restrictions to place on the functions that users are allowed to provide.

3. Do exercises 2-5 of section 21.5.3 in [*R for Data Science*](https://r4ds.had.co.nz/iteration.html)

4. Still have time? Read more of Chapter 21 and do some of the other exercises.
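
---

## Exercises -- checking a helper on its own

One benefit of giving the per-column function its own definition is that it can be tried on single vectors before it is mapped over a data frame. The sketch below reuses `round_if_numeric()` from the earlier slides and needs only base R; the names `small` and `rounded` are made up for this illustration, and assigning into `rounded[]` is just one possible way to keep the data frame class.

```r
# Same helper as on the earlier slides.
round_if_numeric <- function(x, digits = 0) {
  if (is.numeric(x)) round(x, digits = digits) else x
}

# Try the helper on one vector of each kind.
round_if_numeric(c(1.234, 5.678), digits = 1)  # numeric: gets rounded
round_if_numeric(c("a", "b"))                  # character: returned as is
round_if_numeric(factor(c("u", "v")))          # factor: returned as is
round_if_numeric(c(TRUE, FALSE))               # logical: returned as is

# Then map it over a small (hypothetical) data frame.
small <- data.frame(
  num = c(1.234, 5.678),
  chr = c("a", "b"),
  stringsAsFactors = FALSE
)

# Assigning into rounded[] replaces the columns but keeps the
# data frame structure, so no conversion back from a list is needed.
rounded <- small
rounded[] <- lapply(small, round_if_numeric, digits = 1)
str(rounded)
```

Checks like these are also a starting point for the "What test cases should you create?" question.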
---

## A few notes

* Similar approaches work for other sorts of containers.
    * `apply()`, `vapply()`, `sapply()`, ...
    * `map()`, `map_df()`, `map_dbl()`, ...
* `Vectorize()` can be useful for turning an unvectorized function into a vectorized function. More info in [*R for Data Science*](https://r4ds.had.co.nz/iteration.html), chapter 21 (Iteration).
* For iteration in R: "This is the way"

---

<!-- ## One more example -->

<!-- ```{r} -->
<!-- df_apply_where <- -->
<!--   function(.data, .fun, .condition = function(x) TRUE, ...) { -->
<!--     .fun <- purrr::as_mapper(.fun) -->
<!--     .condition <- purrr::as_mapper(.condition) -->
<!--     .data |> map_df(~ if(.condition(.)) .fun(., ...) else .) -->
<!--   } -->

<!-- TestData |> -->
<!--   df_apply_where(factor, is.integer) |> -->
<!--   str() -->

<!-- TestData |> -->
<!--   df_apply_where( ~ factor(.), ~ is.integer(.) || is.character(.)) |> -->
<!--   str() -->
<!-- ``` -->
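
## A few notes -- `Vectorize()` and `vapply()`

A minimal sketch of two of the base R tools named in the notes. The helper `describe_sign()` and the list `cols` are hypothetical names invented for this example; only `Vectorize()`, `vapply()`, and `sapply()` come from base R.

```r
# Hypothetical unvectorized helper: `if` needs a single TRUE/FALSE,
# so this function only handles one number at a time.
describe_sign <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

# Vectorize() wraps the helper so it is applied to each element in turn.
describe_sign_v <- Vectorize(describe_sign)
signs <- describe_sign_v(c(-2, 0, 3))
signs

# vapply() works like sapply(), but its third argument declares what
# each result must look like, so the output type is known in advance.
cols <- list(a = 1:3, b = c(2.5, 3.5))
col_means <- vapply(cols, mean, numeric(1))
col_means

# With empty input, sapply() returns a list, while vapply() still
# returns a numeric vector (of length zero).
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, numeric(1))
str(empty_s)
str(empty_v)
```

The same idea of declaring the output type up front is what distinguishes `map_dbl()` and its relatives from plain `map()`.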