intro-to-gamezoneR.Rmd
We will be acquiring data from gamezone.stats.com, using the gamezoneR
package, created by Jack Lichtenstein.
This tutorial will require the use of R and RStudio. You can follow the instructions at R Studio on how to get started.
library(dplyr)
library(ggplot2)
# set a ggplot2 theme
theme_set(
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 32/.pt, hjust = 0),
plot.subtitle = element_text(face = "italic", size = 26/.pt),
plot.caption = element_text(face = "italic", size = 20/.pt),
strip.background = element_rect(color = "black", size = 3, linetype = "blank"),
strip.text = element_text(face = "bold", size = 24/.pt, hjust = 0),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.ticks = element_blank(),
axis.text = element_text(size = 24/.pt),
axis.title = element_text(face = "bold", size = 26/.pt),
legend.title = element_text(face = "bold", size = 26/.pt),
legend.text = element_text(size = 24/.pt)))
There are two functions to scrape schedules:
The gamezoneR::gamezone_mbb_master_schedule()
function has a parameter for ranked_games
, which is a logical argument specifying whether or not to scrape only the games on that date where at least one team is ranked in the AP top 25 poll. Turning this off will scrape all games from the inputted date, regardless of whether a team is ranked or not.
Please note that specifying ranked_games = T
increases the speed of the function because it utilizes the underlying API, while ranked_games = F
does not.
# first, the ranked games for a given date
schedule <- gamezoneR::gamezone_mbb_master_schedule(date = "2021-03-10", ranked_games = T)
## ℹ Scraping ranked games master schedule for: 2021-03-10
## ℹ There were 1 game(s) on 2021-03-10
head(schedule)
## # A tibble: 1 × 21
## season start_time game_date game_id tv home away home_abbr
## <chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr>
## 1 2020-21 9:30 PM ET 2021-03-10 2317212 ESPN Oklahoma Iowa State Okla
## # … with 13 more variables: home_ap_ranking <int>, home_id <int>,
## # home_is_winner <lgl>, home_logo_id <chr>, home_name <chr>,
## # home_total <dbl>, home_record <chr>, away_abbr <chr>, away_id <int>,
## # away_logo_id <chr>, away_name <chr>, away_total <dbl>, away_record <chr>
# then, all the games for a given date
schedule <- gamezoneR::gamezone_mbb_master_schedule(date = "2021-03-10", ranked_games = F)
## ℹ Scraping master GameZone schedule for: 2021-03-10
## ℹ There were 179 game(s) on 2021-03-10
head(schedule)
## # A tibble: 6 × 9
## season game_date game_id home away home_total away_total home_ap_ranking
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-21 2021-03-10 2383329 Cent. Conn. St. UConn NA NA NA
## 2 2020-21 2021-03-10 2395558 Delaware Davi… NA NA NA
## 3 2020-21 2021-03-10 2383212 UIC Dayt… NA NA NA
## 4 2020-21 2021-03-10 2373041 Kentucky Duke NA NA NA
## 5 2020-21 2021-03-10 2378078 Elon Flor… NA NA NA
## 6 2020-21 2021-03-10 2396371 Columbia Ford… NA NA NA
## # … with 1 more variable: away_ap_ranking <dbl>
We are able to scrape games by date, but also by team using the gamezoneR::get_team_schedule()
function.
duke_schedule <- gamezoneR::gamezone_mbb_team_schedule(team = "Duke", season = "2018-19")
## ℹ Scraping 2018-19 season schedule for: Duke
head(duke_schedule)
## # A tibble: 6 × 23
## season start_time game_date game_id tv home away home_abbr
## <chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr>
## 1 2018-19 9:49 PM ET 2018-11-06 2083190 ESPN Kentucky Duke Kty
## 2 2018-19 1:00 PM ET 2018-11-11 2087019 ESPN Duke Army We… Duke
## 3 2018-19 7:00 PM ET 2018-11-14 2087022 ACCN Duke Eastern… Duke
## 4 2018-19 6:09 PM ET 2018-11-19 2082919 ESP2 San Diego State Duke SDSU
## 5 2018-19 8:00 PM ET 2018-11-20 2082916 ESPN Duke Auburn Duke
## 6 2018-19 5:00 PM ET 2018-11-21 2082915 ESPN Gonzaga Duke Gonz
## # … with 15 more variables: home_ap_ranking <int>, home_id <int>,
## # home_logo_id <chr>, home_name <chr>, home_total <dbl>,
## # home_is_winner <lgl>, home_record <chr>, away_abbr <chr>,
## # away_ap_ranking <int>, away_id <int>, away_is_winner <lgl>,
## # away_logo_id <chr>, away_name <chr>, away_total <dbl>, away_record <chr>
Now that we have a schedule to work with, we can use the gamezoneR::gamezone_mbb_pbp()
function to scrape play-by-play data for individual gameIds. The argument sub_parse
is a logical parameter specifying whether or not to attempt to parse substitution and lineup data. If TRUE, the play-by-play will be appended to include columns home_1
through away_5
, specifying the names of the 10 players on the floor.
Please note that GameZone does not track all substitutions. As a result, a new column, sub_error
will reside in the appended play-by-play data. I have not come across many games with no such errors.
pbp <- gamezoneR::gamezone_mbb_pbp(duke_schedule$game_id[1],
sub_parse = T)
## • Scraping Play-by-Play for GameID: 2083190
## • Attempting to parse lineup and substitution data...
## Note that GameZone does not track all substitutions
## ✓ Completed GameID: 2083190
## Duke @ Kentucky
pbp %>%
dplyr::count(sub_error)
## # A tibble: 1 × 2
## sub_error n
## <chr> <int>
## 1 <NA> 377
head(pbp)
## # A tibble: 6 × 45
## season date game_id play_id neutral half home away home_name
## <chr> <date> <dbl> <dbl> <dbl> <int> <chr> <chr> <chr>
## 1 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## 2 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## 3 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## 4 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## 5 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## 6 2018-19 2018-11-06 2083190 1 0 1 Kentucky Duke Wildcats
## # … with 36 more variables: away_name <chr>, home_timeouts <dbl>,
## # away_timeouts <dbl>, home_score <dbl>, away_score <dbl>, score_diff <dbl>,
## # team_id <dbl>, event_team <chr>, game_secs_remaining <dbl>,
## # half_secs_remaining <dbl>, play_length <dbl>, desc <chr>,
## # shot_outcome <chr>, free_throw <lgl>, three_pt <lgl>, shot_desc <chr>,
## # loc_x <dbl>, loc_y <dbl>, shooter_id <int>, shooter <chr>, assist <chr>,
## # substitution <dbl>, poss_before <chr>, poss_after <chr>, …
Now, let’s scrape all of Duke’s games from the Zion Williamson 2018-19 season using purrr::map_df()
.
duke_pbp <- purrr::map_df(duke_schedule$game_id,
gamezoneR::gamezone_mbb_pbp, sub_parse = F)
Let’s plot some shot charts. We will use the built in gamezoneR::base_court
ggplot court.
duke_shots <- duke_pbp %>%
dplyr::filter(!is.na(loc_x),
shooter == "Zion Williamson")
gamezoneR::base_court +
geom_point(data = duke_shots,
aes(loc_x, loc_y, color = shot_outcome),
alpha = 0.8) +
theme(axis.line = element_blank(),
axis.text= element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5, size = 30/.pt, margin = margin(0, 0, 5, 0)),
plot.subtitle = element_text(face = "italic", hjust = 0.5, size = 24/.pt),
plot.caption = element_text(face = "italic", hjust = 1, size = 20/.pt, margin = margin(0, 0, 0, 0)),
legend.spacing.x = grid::unit(0, 'cm'),
legend.title = element_text(size = 20/.pt, face = "bold"),
legend.text = element_text(size = 16/.pt),
legend.margin = margin(0, 0, 0, 0),
legend.position = 'bottom',
legend.box.margin = margin(-35, 0, 0, 0),
plot.margin = margin(5, 0, 5, 0)) +
labs(title = "All of Zion Williamson's shots at Duke",
subtitle = "2018-19 college basketball season",
color = "Outcome",
caption = "Chart: @jacklich10 | Data: @gamezoneR")
You might notice that Zion seems to have very few shots at the rim compared to what you might expect. This is true! STATS LLC’s API codes all layups and dunks at either (25, 5.25) or (26, 6.25), so many shots are plotted on top of one another. I have no idea why this is the case, but you might want to experiment with randomizing the locations of these shots when plotting shot charts.
Since gamezoneR
version 0.1.1 now attempts to parse possession information from the play-by-play, we can also look at a team’s offensive and defensive efficiency (by points per possession). Below is Duke’s offensive and defensive efficiency by game from the 2018-19 college basketball season.
duke_pbp %>%
dplyr::filter(!is.na(poss_before)) %>%
dplyr::mutate(poss_number = as.numeric(poss_number),
shot_made_numeric = dplyr::case_when(
is.na(shot_outcome) ~ NA_real_,
shot_outcome == "made" ~ 1,
shot_outcome == "missed" ~ 0),
shot_value = dplyr::case_when(
is.na(shot_outcome) ~ NA_real_,
free_throw == 1 ~ 1,
three_pt == 1 ~ 3,
T ~ 2),
points = dplyr::case_when(
shot_made_numeric == 0 ~ 0,
shot_made_numeric == 1 & free_throw == 1 ~ 1,
shot_made_numeric == 1 & three_pt == 1 ~ 3,
shot_made_numeric == 1 & three_pt == 0 & free_throw == 0 ~ 2)) %>%
dplyr::group_by(date, game_id, poss_before, poss_number) %>%
dplyr::summarise(fgm = sum(shot_outcome == "made" & free_throw == F, na.rm = T),
fga = sum(!is.na(shot_outcome) & free_throw == F),
ftm = sum(shot_outcome == "made" & free_throw == T),
fta = sum(!is.na(shot_outcome) & free_throw == T),
points = sum(points, na.rm = T),
.groups = "drop") %>%
dplyr::group_by(date, game_id, poss_before) %>%
dplyr::summarise(poss = dplyr::n(),
across(fgm:points, sum),
.groups = "drop") %>%
dplyr::mutate(ppp = points/poss,
type = ifelse(poss_before == "Duke", "Offense", "Defense"),
color = ifelse(poss_before == "Duke", "black", "#003366"),
fill = ifelse(poss_before == "Duke", "#001A57", "white")) %>%
ggplot(aes(date, ppp, fill = fill, color = color)) +
geom_line() +
geom_point(aes(size = poss),
pch = 21, stroke = 0.9) +
scale_color_identity() +
scale_fill_identity() +
scale_size_continuous(range = c(0.8, 3.5)) +
labs(title = "Duke's offensive and defensive efficiency by game",
subtitle = "2018-19 college basketball season",
x = "Date",
y = "Points per possession",
size = "# of possessions",
caption = "Chart: @jacklich10 | Data: @gamezoneR")
Alternatively, you may want to load in all games at once from the remote repository. To do so, run the following lines of code:
future::plan("multisession")
progressr::with_progress({
pbp <- gamezoneR::load_gamezone_pbp(c("2016-17", "2017-18", "2018-19", "2019-20", "2020-21"))
})
Below, we aggregate each team’s offensive and defensive efficiency’s (again by points per possession) by game.
pbp_summarized <- pbp %>%
dplyr::distinct() %>%
dplyr::filter(!is.na(poss_before)) %>%
dplyr::mutate(poss_number = as.numeric(poss_number),
shot_made_numeric = dplyr::case_when(
is.na(shot_outcome) ~ NA_real_,
shot_outcome == "made" ~ 1,
shot_outcome == "missed" ~ 0),
shot_value = dplyr::case_when(
is.na(shot_outcome) ~ NA_real_,
free_throw == 1 ~ 1,
three_pt == 1 ~ 3,
T ~ 2),
points = dplyr::case_when(
shot_made_numeric == 0 ~ 0,
shot_made_numeric == 1 & free_throw == 1 ~ 1,
shot_made_numeric == 1 & three_pt == 1 ~ 3,
shot_made_numeric == 1 & three_pt == 0 & free_throw == 0 ~ 2)) %>%
dplyr::group_by(season, date, game_id, poss_before, poss_number) %>%
dplyr::summarise(fgm = sum(shot_outcome == "made" & free_throw == F, na.rm = T),
fga = sum(!is.na(shot_outcome) & free_throw == F),
ftm = sum(shot_outcome == "made" & free_throw == T),
fta = sum(!is.na(shot_outcome) & free_throw == T),
points = sum(points, na.rm = T),
.groups = "drop") %>%
dplyr::group_by(season, date, game_id, poss_before) %>%
dplyr::summarise(poss = dplyr::n(),
across(fgm:points, sum),
.groups = "drop") %>%
dplyr::mutate(ppp = points/poss)
Let’s take a look at the most proficient offenses in college basketball since 2016-17:
Best team offensive efficiencies | ||||
---|---|---|---|---|
2016-17 through 2020-21 college basketball seasons | ||||
season | Team | Games | PPP | |
2017-18 | Villanova | 40 | 1.22 | |
2020-21 | Gonzaga | 32 | 1.22 | |
2018-19 | Gonzaga | 37 | 1.22 | |
2016-17 | Oklahoma State | 33 | 1.21 | |
2016-17 | UCLA | 36 | 1.21 | |
2019-20 | Gonzaga | 33 | 1.20 | |
2020-21 | Baylor | 30 | 1.19 | |
2017-18 | TCU | 32 | 1.19 | |
2016-17 | Villanova | 36 | 1.18 | |
2017-18 | Purdue | 37 | 1.17 | |
2020-21 | Iowa | 31 | 1.17 | |
2017-18 | Western Kentucky | 34 | 1.17 | |
2017-18 | Saint Mary's | 24 | 1.17 | |
2017-18 | Gonzaga | 37 | 1.17 | |
2016-17 | Michigan | 38 | 1.17 | |
Table: @jacklich10 | Data: @gamezoneR |