6 Data augmentation

Some examples of adding extra columns to our play-by-play data in order to support particular analyses, or to make other data-wrangling tasks easier.

6.1 The setter of a given attack

Aim: Identify the player who made the set associated with each attack (noting that some files might not have the setting action coded for all attacks, or even coded at all).

px <- px %>% mutate(set_player_id = case_when(skill == "Attack" & lag(skill) == "Set" & team == lag(team) ~ lag(player_id))

(The set_player_id column will be NA when there is no matching set coded before the attack.)

6.2 Setter player IDs

Aim: identify the player_id of the setter on court for each data row.

The play-by-play data frame has home_player_id1, home_player_id2, …, home_player_id6, which give the player_id of the home team player currently in position 1, 2, …, and 6. It also has a home_setter_position column, which tells us which position the home team setter is in (1–6). So to create a home_setter_id column we simply need to extract the value in the home_player_idX column for each row of the data frame, where X is the value in the home_setter_position at each row.

One way to do this is:

px <- px %>% mutate(home_setter_id = case_when(home_setter_position == 1 ~ home_player_id1,
                                               home_setter_position == 2 ~ home_player_id2,
                                               home_setter_position == 3 ~ home_player_id3,
                                               home_setter_position == 4 ~ home_player_id4,
                                               home_setter_position == 5 ~ home_player_id5,
                                               home_setter_position == 6 ~ home_player_id6))

But that is rather cumbersome and inelegant. A more concise method is:

px <- px %>% rowwise() %>% mutate(home_setter_id = cur_data()[[paste0("home_player_id", home_setter_position)]]) %>% ungroup

6.3 Reception quality

Aim: add a column that tells us the reception quality associated with each rally.

The reception quality for a given rally is found in the evaluation column of the row that has the skill value of “Reception”. We just need to propagate this value to all other rows associated with that particular rally.

First we extract the reception quality for each rally (remember from the Identifiers section that the point_id value identifies the rally, but needs to be combined with match_id to be globally unique):

rq <- px %>% dplyr::filter(skill == "Reception") %>% group_by(match_id, point_id) %>%
  dplyr::summarize(reception_quality = if (n() == 1) .data$evaluation else NA_character_) %>% ungroup

So rq is a data.frame with the reception quality for each match_id and point_id combination. reception_quality will be NA if there was no reception (or more than one reception — perhaps due to a scouting error?) involved in that point. Now join rq back to our full plays dataframe:

px <- px %>% left_join(rq, by = c("match_id", "point_id"))

So now we have reception_quality for all rows of the data frame.