Using purr::pmap to fit many models
Overview
To show some uses of purrr::pmap
to fit models with different specifications or on subsets of a dataset.
Problem
A common task in my work is to fit a variety of models with different specifiations on the same dataset. For instance, perhaps we are interested in how the departure and arrival delays among flights departing from NYC in 2013 differ from airport to airport. We can use the nycflights13::flights
dataset to investigate this question:
library(nycflights13)
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
flights$origin <- factor(flights$origin, levels=c("JFK", "LGA", "EWR")) # Make JFK the reference
Suppose we’d like to fit a simple linear model for each response variable (dep_delay
and arr_delay
) with the origin airport as a predictor (and JFK as the reference), and we wanted to summarize the average difference in delay for LGA relative to JFK. How might we do this?
Solutions
Using base R, we might do the following:
model_parameters <- data.frame(outcome_var = c("dep_delay", "arr_delay"), stringsAsFactors=FALSE)
# define function for fitting model and returning coefficient
fit_model <- function(
outcome_var, # outcome variable
covariate=NULL, # optional covariate, will be used later
flights_data = flights # by default, use the full flights data
){
model_string <- paste0(outcome_var, "~ origin")
if(!is.null(covariate)){model_string <- paste0(model_string, "+", covariate)}
model_formula <- as.formula(model_string)
mod <- lm(model_formula, data=flights_data)
return(coef(mod)["originLGA"])
}
# lapply over response vars and combine
result <- do.call(c, lapply(model_parameters$outcome_var, FUN=fit_model))
library(tidyverse)
as_data_frame(cbind(model_parameters, LGA_JFK_diff=result))
## # A tibble: 2 x 2
## outcome_var LGA_JFK_diff
## <chr> <dbl>
## 1 dep_delay -1.77
## 2 arr_delay 0.232
Equivalently, using purrr::pmap
we could do the following:
library(purrr) # or library(tidyverse)
result <- pmap_dbl(model_parameters, .f=fit_model)
as_data_frame(cbind(model_parameters, LGA_JFK_diff=result))
## # A tibble: 2 x 2
## outcome_var LGA_JFK_diff
## <chr> <dbl>
## 1 dep_delay -1.77
## 2 arr_delay 0.232
The suffix _dbl
indicates that the output from the function should be combined into a vector of the numeric
type.
These solutions are both similar. However, the purr::pmap
solution is more easily extensible. Suppose we wanted to add another variable to the model, such as sched_dep_time
or distance
. The base R solution might look like as follows:
model_parameters <- expand.grid(
outcome_var = c("dep_delay", "arr_delay"),
covariate = c("sched_dep_time", "distance"),
stringsAsFactors = FALSE
)
result <- do.call(c, lapply(1:nrow(model_parameters), FUN=function(i){
fit_model(
outcome_var = model_parameters[i, "outcome_var"],
covariate = model_parameters[i, "covariate"]
)
}))
as_data_frame(cbind(model_parameters, LGA_JFK_diff=result))
## # A tibble: 4 x 3
## outcome_var covariate LGA_JFK_diff
## <chr> <chr> <dbl>
## 1 dep_delay sched_dep_time -0.115
## 2 arr_delay sched_dep_time 1.84
## 3 dep_delay distance -2.55
## 4 arr_delay distance -1.75
The purr::pmap
solution is more parsimonious:
result <- pmap_dbl(model_parameters, .f=fit_model)
as_data_frame(cbind(model_parameters, LGA_JFK_diff=result))
## # A tibble: 4 x 3
## outcome_var covariate LGA_JFK_diff
## <chr> <chr> <dbl>
## 1 dep_delay sched_dep_time -0.115
## 2 arr_delay sched_dep_time 1.84
## 3 dep_delay distance -2.55
## 4 arr_delay distance -1.75
This is because purr::pmap
automatically passes the named elements of its first argument (model_parameters
) to the corresponding arguments of its second argument (fit_model
). This is a handy feature which can be used to avoid manually filling out arguments as above.
Extension
Another common task is to fit models on subgroups of the original dataset. In this case, we might be interested in calculating the difference in dep_delay
or arr_delay
between LGA and JFK for different carriers in different months. Briefly, one solution to running these models using tidyverse
functions might be as follows:
model_parameters <- expand.grid(
outcome_var = c("dep_delay", "arr_delay"),
carrier = c("AA", "DL", "UA"), # AA 'American Airlines'; 'DL' Delta Airlines; 'UA' United Airlines
month = c(1, 6, 12), # January, June, December
stringsAsFactors = FALSE
)
# nest the carrier and month variables in a'subset_to' list variable
model_parameters_tidy <- model_parameters %>%
mutate(model_num = row_number()) %>%
nest(-model_num, -outcome_var, .key="subset_to") %>%
select(-model_num)
results <- pmap_dfr(model_parameters_tidy, .f=function(outcome_var, subset_to){
subset_flights <- inner_join(flights, subset_to) # subset the flights data
diff <- fit_model(outcome_var, flights_data=subset_flights)
obs <- nrow(subset_flights)
data.frame(obs, LGA_JFK_diff=diff)
})
as_data_frame(cbind(model_parameters, results))
## # A tibble: 18 x 5
## outcome_var carrier month obs LGA_JFK_diff
## <chr> <chr> <dbl> <int> <dbl>
## 1 dep_delay AA 1. 2794 -3.48
## 2 arr_delay AA 1. 2794 -0.410
## 3 dep_delay DL 1. 3690 -0.492
## 4 arr_delay DL 1. 3690 8.59
## 5 dep_delay UA 1. 4637 7.93
## 6 arr_delay UA 1. 4637 6.63
## 7 dep_delay AA 6. 2757 -0.0472
## 8 arr_delay AA 6. 2757 -2.63
## 9 dep_delay DL 6. 4126 -0.481
## 10 arr_delay DL 6. 4126 -0.763
## 11 dep_delay UA 6. 4975 9.00
## 12 arr_delay UA 6. 4975 -2.47
## 13 dep_delay AA 12. 2705 -5.63
## 14 arr_delay AA 12. 2705 -4.80
## 15 dep_delay DL 12. 4093 -0.0297
## 16 arr_delay DL 12. 4093 2.82
## 17 dep_delay UA 12. 4931 7.13
## 18 arr_delay UA 12. 4931 9.40
Here, we leverage the tidyr::nest
function to group some of the variables on the model_parameters
data.frame into a subset_to
list column, which we subsequently use to subset the data before running the model.
Acknowledgements
I first discovered purrr::pmap
via Jenny Bryan’s excellent Row-oriented workflows in R with the Tidyverse.