Visualizing NBA team win totals with animated bar charts in R / Habr

First, a little background. My name is Vladislav, and I first met R in August of last year. I decided to learn the language for a purely practical reason. Since childhood I have enjoyed keeping sports statistics. Over time this hobby grew into a desire to analyze those numbers and, where possible, to draw sensible conclusions from the analysis. The problem is that in recent years sports have been flooded with data: dozens of companies compete with one another, trying to count, describe, and feed into a neural network every action of a football, basketball, or baseball player on the field. Excel is no longer suited to this kind of analysis. So I decided to learn R so that even the simplest analysis would not take half a day. Along the way I became interested in programming as such, but that is a digression.

I should say right away that, in "the Simpsons already did it" fashion, much of what follows has already appeared on Habr in the article "Creating animated histograms using R". That article, in turn, is a translation of "Create Trending Animated Bar Charts using R" from Medium. To differ from those articles, I will try to describe what I am doing in more detail, including points that the original article does not cover. For example, I fill the bars with the NBA teams' own colors rather than the standard ggplot2 palette, and I process the data with data.table rather than dplyr. Everything is wrapped in a function, so it is enough to pass the team names and the years for which the wins should be counted.

Data

To build the chart, I used the number of wins of each of the 30 NBA teams over the last 15 seasons. The data were collected from stats.nba.com using the NBA Data Retriever extension, which uses the NBA API to produce CSV files with the required statistics. The complete data are available in my project on GitHub.

Used libraries

To process the data I use data.table (simply because I got to know this package first). I also load the whole tidyverse set rather than ggplot2 alone, so that I do not have to load another package from the set if an idea that needs it comes up during the analysis. In this particular case ggplot2 alone would be enough, since the other packages in the set are not used. Finally, gganimate sets the charts in motion.
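A minimal sketch of the package loading described above, assuming all three packages are already installed:

```r
# Package choice follows the text; installation is assumed.
library(data.table)  # data processing
library(tidyverse)   # attaches ggplot2 along with dplyr, stringr, and others
library(gganimate)   # animation layer on top of ggplot2
```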

Work with data

First the data need to be put in order. To build the charts we need only 2 of the 79 columns in the raw table. You can select the required columns first, or you can replace some values first. I took the second route.

A data.table query has the form dt[i, j, by], where by is responsible for grouping. I will group by the TeamName column, and there is a catch here. This column holds team nicknames: Lakers, Celtics, Heat, and so on. But during the period under review (starting with the 2004/05 season) several teams changed names: the New Orleans Hornets became the New Orleans Pelicans, the Charlotte Bobcats reclaimed the historical Charlotte Hornets name, and the Seattle SuperSonics became the Oklahoma City Thunder. This can cause confusion. The following transformations help to avoid it:
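The original code for this step is not reproduced here, so what follows is only a sketch of one way to do it. The table name raw, the win counts, and the TeamID values are placeholders made up for illustration; the point is that the ambiguous "Hornets" rows are disambiguated by TeamID before the plain renames are applied.

```r
library(data.table)

# Toy data: TeamID values and win counts are placeholders, not real NBA figures.
raw <- data.table(
  TeamID   = c(1L, 1L, 2L, 2L, 3L, 3L),
  TeamName = c("Hornets", "Pelicans", "Bobcats", "Hornets",
               "SuperSonics", "Thunder"),
  WINS     = c(18L, 34L, 21L, 48L, 35L, 23L)
)

# Chain of updates by reference: each [ ] returns the table for the next one.
# The New Orleans rows must be handled via TeamID, because "Hornets" is also
# the current Charlotte name.
raw[TeamID == 1L & TeamName == "Hornets", TeamName := "Pelicans"][
  TeamName == "Bobcats", TeamName := "Hornets"][
  TeamName == "SuperSonics", TeamName := "Thunder"]
```

After this, every franchise carries a single name across all seasons, so grouping by TeamName no longer mixes or splits teams.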

For this time period, the changes are minimal, but if you expand it, then it will become very difficult to group by TeamName and you will need to use a more reliable column. In this data, this is the TeamID.

To begin with, we get rid of the extra information, leaving only those columns that we need for work:
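A small sketch of the column selection, assuming a raw table named raw with an extra made-up column OTHER standing in for the 77 columns that are dropped:

```r
library(data.table)

# Toy stand-in for the raw table; OTHER is a hypothetical extra column.
raw <- data.table(
  TeamName = c("Suns", "Heat", "Spurs", "Pistons"),
  WINS     = c(62L, 59L, 59L, 54L),
  OTHER    = c(1, 2, 3, 4)
)

table1 <- raw[, .(TeamName, WINS)]      # selection with the .() shorthand
same   <- raw[, c("TeamName", "WINS")]  # selection by a character vector of names
```

Both forms return a new data.table with just the two columns.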

In data.table, the .() construct is shorthand for the list function. A more classic option for selecting columns is to pass their names as a character vector. The resulting table1 looks like this:

TeamName WINS
Suns 62
Heat 59
Spurs 59
Pistons 54

This is enough to animate each season separately, but to show the total number of wins over the selected period we need the cumulative sum of wins.

The cumsum function gives us the numbers we need. Using := instead of = adds a new column to the existing table rather than replacing the table with a single CumWins column. by = TeamName groups the data by team name, so the cumulative sum is calculated separately for each of the 30 teams.
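A minimal sketch of the grouped cumulative sum on toy data ("TeamB" and its win counts are made up; the rows are assumed to be in chronological order within each team):

```r
library(data.table)

# Two teams, three seasons each, interleaved to show that grouping matters.
table1 <- data.table(
  TeamName = c("Spurs", "TeamB", "Spurs", "TeamB", "Spurs", "TeamB"),
  WINS     = c(59L, 40L, 63L, 45L, 58L, 50L)
)

# := adds CumWins by reference; by = TeamName restarts the sum for each team.
table1[, CumWins := cumsum(WINS), by = TeamName]
```

Without by = TeamName the sum would run straight through both teams' rows.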

Next, I add a column with the year in which each season started. An NBA season starts in October and ends the following spring, so it spans two calendar years. I label each season by the year it began, i.e. Season: 2018 on the chart is really the 2018/19 season.

The original table already has this information. The SeasonID column holds a number of the form 2 followed by the year the season began, for example 22004. You could strip the leading 2 with the stringr package or with base R functions, but I took a slightly different route. As it turned out, I first use this column to select the required seasons, then drop it and create a column with the years all over again. These are unnecessary steps.

I did it by simply repeating the years from 2004 to 2018 for the 30 teams. I was "lucky" that the number of teams in the NBA did not change during the selected period. If you go further back in history, this method becomes inconvenient because the number of teams differs from season to season, so cleaning up the SeasonID column is the preferable option.
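The two options can be sketched side by side. This is an illustration on toy data, not the original code: it assumes the rows are ordered season by season, and the column name year_rep is made up for the comparison.

```r
library(data.table)

# Toy table: 2 teams x 3 seasons, rows ordered season by season.
table1 <- data.table(
  SeasonID = c(22004L, 22004L, 22005L, 22005L, 22006L, 22006L),
  TeamName = c("A", "B", "A", "B", "A", "B")
)

# Option 1: repeat the years. Only valid for a fixed team count and this row order.
n_teams <- 2L
table1[, year_rep := rep(2004:2006, each = n_teams)]

# Option 2: strip the leading 2 from SeasonID. Works for any number of teams.
table1[, year := SeasonID %% 10000L]
```

With a different row order (all seasons of one team, then the next team), option 1 would need times instead of each, which is one more reason to prefer option 2.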

Then we add the cumrank column.

It holds each team's rank within a season by cumulative number of wins and will be used for the x-axis values. frank is a faster data.table analog of base rank; the minus sign ranks in descending order (the same can be achieved with frankv and its order argument). I do not care in what order teams with the same number of wins appear, so ties.method = "random". All of this is grouped within a single year.
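A short sketch of the ranking step on toy cumulative totals (team names and values are made up):

```r
library(data.table)

# Toy cumulative wins for three teams in two seasons; B and C tie in 2004.
table1 <- data.table(
  year     = c(2004L, 2004L, 2004L, 2005L, 2005L, 2005L),
  TeamName = c("A", "B", "C", "A", "B", "C"),
  CumWins  = c(62L, 59L, 59L, 116L, 122L, 111L)
)

set.seed(1)  # ties are broken at random, so fix the seed for repeatability
# Negating CumWins makes the largest total rank 1; by = year ranks each season.
table1[, cumrank := frank(-CumWins, ties.method = "random"), by = year]
```

The random tie-breaking guarantees that every bar gets its own integer position, so two tied teams never overlap on the chart.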

And the last transformation of the table is the addition of the value_rel column.

This column is the ratio of each team's cumulative wins to the highest figure for that year. For the best team the value is 1; for the rest it is lower, depending on how successful they have been.
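As a sketch, the ratio can be computed per year like this (toy values; the column names follow the text):

```r
library(data.table)

table1 <- data.table(
  year     = c(2004L, 2004L, 2005L, 2005L),
  TeamName = c("A", "B", "A", "B"),
  CumWins  = c(62L, 59L, 116L, 122L)
)

# Divide by the largest cumulative total within each year.
table1[, value_rel := CumWins / max(CumWins), by = year]
```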

After all the additions, the table looks like this:

TeamName WINS CumWins year cumrank value_rel
Spurs 59 59 2004 3 0.9516129
Spurs 63 122 2005 1 1.0000000
Spurs 58 180 2006 2 0.9729730
Spurs 56 236 2007 1 1.0000000

Only one team is shown here so that the cumulative sum is easy to see. All of these steps, like the renaming, are written as a cascade of square brackets.

Changing the bar fill from the standard palette to team colors

We could move straight on to building the chart, but there is one point that seems important to me: the color of the bars. You can keep the standard ggplot2 palette, but that is a poor option. First, I find it ugly. Second, it makes it harder to find a team on the chart. For NBA fans each team is associated with a particular color: Boston is green, Chicago is red, Sacramento is purple, and so on. Filling the bars with team colors therefore helps to identify a team faster, despite the abundance of blue and red.

To do this, I create a table_color table with each team's name and its main color. The colors are taken from teamcolorcodes.com.

TeamName Team_color
Hawks #E03A3E
Celtics #007a33
Nets #000000

One more manipulation is needed with the color table. Because factors are used when the chart is built, the order of the teams changes: the Philadelphia 76ers come first, as the only team whose name starts with a digit, and the rest follow alphabetically. So the colors have to be arranged in the same order and then extracted from the table as a vector. I did it as follows:
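The original code is not shown here; one possible way to get the colors into factor-level order is sketched below. The Team_color column name and the 76ers color value are assumptions for illustration.

```r
library(data.table)

# Three rows from the color table plus the 76ers (placeholder color value).
table_color <- data.table(
  TeamName   = c("Hawks", "Celtics", "Nets", "76ers"),
  Team_color = c("#E03A3E", "#007a33", "#000000", "#006BB6")
)

# Take the levels exactly as factor() would order them, then pick each
# team's color in that order.
lv   <- levels(factor(table_color$TeamName))
cols <- table_color$Team_color[match(lv, table_color$TeamName)]
```

Matching against the factor levels, rather than sorting the table separately, keeps the colors aligned with the teams regardless of the locale's sort order. A named vector, for example setNames(table_color$Team_color, table_color$TeamName), is another option, since the manual scales in ggplot2 match named values to levels.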

Building a graph

In fact we build a single chart that contains all 450 win totals (15 seasons × 30 teams), and then "split" it by the variable we need (the year, in our case) using functions from the gganimate package.

First we create a plot object with the ggplot function. In the aes argument we specify how variables from the table are mapped onto the chart. We group them by TeamName; fill and color are responsible for the color of the bars.

Strictly speaking, it is not quite right to call them bars. With geom_tile we split the data on the chart into rectangles. Here is an example of a chart of this type:

You can see how the chart is "divided" into squares (the rectangles become squares when the coord_equal() layer is used), three in each column. But thanks to the width argument, our tiles take the shape of bars.

Next, I add two labels with geom_text: the team name and the number of wins. coord_flip swaps the axes, scale_fill_manual and scale_color_manual change the colors of the bars, and scale_x_reverse reverses the x axis. Note that the colors are taken from the cols vector created earlier.
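The layers described so far can be sketched as follows. This is an illustration on made-up data rather than the original code; the width, alpha, and label spacing are arbitrary choices.

```r
library(ggplot2)

# Toy data in the shape built earlier in the article (values are made up).
df <- data.frame(
  TeamName = c("Celtics", "Hawks", "Nets"),
  CumWins  = c(45, 13, 42),
  cumrank  = c(1, 3, 2)
)
cols <- c("#007a33", "#E03A3E", "#000000")  # same order as the factor levels

p <- ggplot(df, aes(x = cumrank, group = TeamName,
                    fill = as.factor(TeamName), color = as.factor(TeamName))) +
  # A tile centred at CumWins / 2 with height CumWins spans 0 to CumWins.
  geom_tile(aes(y = CumWins / 2, height = CumWins, width = 0.9), alpha = 0.8) +
  geom_text(aes(y = 0, label = paste0(TeamName, " ")), hjust = 1) +       # team name
  geom_text(aes(y = CumWins, label = paste0(" ", CumWins)), hjust = 0) +  # win count
  coord_flip(clip = "off") +   # bars run horizontally; labels may leave the panel
  scale_x_reverse() +          # rank 1 at the top
  scale_fill_manual(values = cols) +
  scale_color_manual(values = cols)
```

Printing p draws the static chart; the animation layers are added on top of this object later.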

The theme layer sets the display parameters of the chart. It specifies how the axis titles and labels should be drawn (not at all, as element_blank on the right-hand side of each assignment tells us). We remove the legend, the background, the border, and the grid lines along the y axis. The plot.title, plot.subtitle, and plot.caption arguments set the display parameters for the title, subtitle, and caption. More details on what each parameter means can be found on the ggplot2 site.
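A sketch of such a theme, stored in a hypothetical object called bar_theme so it can be added to any plot. The sizes, faces, and margins are illustrative values, not the original settings.

```r
library(ggplot2)

bar_theme <- theme(
  axis.title         = element_blank(),   # no axis titles
  axis.text          = element_blank(),   # no tick labels
  axis.ticks         = element_blank(),
  legend.position    = "none",            # the bars are labelled directly
  panel.background   = element_blank(),
  panel.border       = element_blank(),
  panel.grid.major.y = element_blank(),
  panel.grid.minor.y = element_blank(),
  plot.title    = element_text(size = 25, hjust = 0.5, face = "bold"),
  plot.subtitle = element_text(size = 18, hjust = 0.5, face = "italic"),
  plot.caption  = element_text(size = 8, hjust = 0.5, face = "italic"),
  plot.margin   = margin(2, 2, 2, 4, "cm")  # extra room on the left for team names
)
```

It is applied with the usual plus sign, for example some_plot + bar_theme.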

Animation creation

I will not dwell on the transition_states function; this part is identical to the earlier publication on Habr. As for labs, it creates the title, subtitle, and caption of the chart. A placeholder in the label text displays the specific year whose bars are currently on screen.
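For readers who have not seen that publication, here is a rough sketch of the animation layers on toy data. The transition_length and state_length values and the label texts are assumptions; {closest_state} is the label variable that gganimate provides for transition_states.

```r
library(ggplot2)
library(gganimate)

# Toy data: two teams, two seasons (values are made up).
df <- data.frame(
  TeamName = rep(c("A", "B"), times = 2),
  year     = rep(c(2004, 2005), each = 2),
  CumWins  = c(50, 40, 95, 100),
  cumrank  = c(1, 2, 2, 1)
)

p <- ggplot(df, aes(cumrank, group = TeamName, fill = TeamName)) +
  geom_tile(aes(y = CumWins / 2, height = CumWins, width = 0.9)) +
  coord_flip() +
  scale_x_reverse()

anim <- p +
  # One state per year; frames in between are interpolated.
  transition_states(year, transition_length = 4, state_length = 1) +
  labs(title    = "Cumulative wins",
       subtitle = "Season: {closest_state}",  # filled in with the current year
       caption  = "Data: stats.nba.com")
```

Nothing is rendered at this point; anim is only a description of the animation.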

The nba_cumulative_wins function for creating charts

Writing functions simplifies and speeds up the process of obtaining the result if you need to use the code more than once. Typically, a function in R looks like this:
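As a generic illustration (the function name and its arguments are made up for this example), a function definition has a name, a list of arguments, and a body whose last expression is the return value:

```r
# General shape: name <- function(arguments) { body }
wins_share <- function(wins, games = 82) {
  # The value of the last expression is returned.
  wins / games
}

wins_share(41)  # share of wins in an 82-game season
```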

To begin with, it is worth understanding what parameters you want to change using the function, its arguments will depend on this. The first argument is the name of the data table that is being input. This allows you to rename it if you so desire, without changing anything in the function itself. I also want to be able to display any number of teams on the chart, from one (which is meaningless) to 30 (there simply aren't any more). I also want to be able to look at any time period within the 15 years for which I have data. All this is implemented in this form of a function:

where table is the name of the table with the input data,
elements is the names of the teams that should be displayed on the chart,
first_season is the first season to be displayed on the chart,
last_season is the last season to be displayed on the chart.

If an argument is used very often with one specific value, you can make that value the default. Then, if the argument is omitted in the call, this value is substituted. For example, if last_season is given a default corresponding to the 2018/19 season in the function definition, the charts will run through that season unless otherwise specified.
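A stub with the four arguments described above shows how the default behaves. The name nba_cumulative_wins_sketch and the body are placeholders (the real function builds the chart), and the default of 2018 assumes that seasons are passed as their starting year.

```r
nba_cumulative_wins_sketch <- function(table, elements, first_season,
                                       last_season = 2018) {
  # Placeholder body: just report which teams and seasons were requested.
  list(elements = elements, seasons = first_season:last_season)
}

# last_season is omitted, so the default of 2018 is used.
res <- nba_cumulative_wins_sketch(NULL, "Spurs", first_season = 2015)
```

Passing last_season = 2016 explicitly would override the default.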

Working with the elements, first_season, and last_season arguments

With the elements argument we can specify the names of the teams we want to see on the chart. This is very convenient when there are 2 or 3 such teams, but if we want to display the entire league, we would have to write elements = c() with the names of all 30 teams in the parentheses.

So I decided to split the input values for the elements argument into several groups. The nba_cumulative_wins function can generate charts for individual teams, divisions, conferences, or the NBA as a whole. For this, I used the following construction:

The select_ character vectors contain the names of all 30 teams, 6 divisions, 2 conferences and the NBA, and the unique function leaves only one unique name, instead of 15 (by the number of years in the data).

Then, with an if ... else construct, the elements argument is checked for membership in one of these groups (%in% tests whether an element belongs to a vector), and the data table is modified accordingly. Now, if I want to see the results of the teams playing in the Southwest division, instead of

elements = c("Mavericks", "Spurs", "Rockets", "Grizzlies", "Pelicans")

I just need to enter

elements = "Southwest", which is much faster and more convenient.
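The grouping logic might look roughly like this. It is a sketch under assumptions: the Division and Conference column names, the helper name filter_elements, and the "NBA" keyword are illustrative and may differ from the original code.

```r
library(data.table)

# Toy table: three teams over two seasons.
table1 <- data.table(
  TeamName   = rep(c("Spurs", "Rockets", "Celtics"), times = 2),
  Division   = rep(c("Southwest", "Southwest", "Atlantic"), times = 2),
  Conference = rep(c("West", "West", "East"), times = 2)
)

# unique() keeps one copy of each name instead of one per season.
select_teams      <- unique(table1$TeamName)
select_division   <- unique(table1$Division)
select_conference <- unique(table1$Conference)

filter_elements <- function(dt, elements) {
  if (all(elements %in% select_teams)) {
    dt[TeamName %in% elements]
  } else if (all(elements %in% select_division)) {
    dt[Division %in% elements]
  } else if (all(elements %in% select_conference)) {
    dt[Conference %in% elements]
  } else if (identical(elements, "NBA")) {
    dt                                   # the whole league
  } else {
    stop("Unknown value in 'elements'")
  }
}

southwest <- filter_elements(table1, "Southwest")
```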

Due to the possibility of choosing seasons, the work with dates also changes. At the very beginning, a line is added:

So I leave in the table only those rows that fall within the time interval we have chosen. The code to create the year column also changes. Now it looks like this:
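A possible version of both steps, assuming SeasonID has the 2-plus-year form described earlier and that the rows are ordered season by season (the original lines are not reproduced here):

```r
library(data.table)

# Toy table: two teams, four seasons.
table1 <- data.table(
  SeasonID = rep(c(22004L, 22005L, 22006L, 22007L), each = 2),
  TeamName = rep(c("A", "B"), times = 4)
)
first_season <- 2005
last_season  <- 2006

# Keep only the rows whose start year falls inside the chosen interval.
table1 <- table1[SeasonID %% 10000L >= first_season &
                 SeasonID %% 10000L <= last_season]

# The year column now depends on the arguments instead of fixed 2004:2018.
n_teams <- uniqueN(table1$TeamName)
table1[, year := rep(first_season:last_season, each = n_teams)]
```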

Because the elements are grouped, obtaining the right colors becomes more complicated. The point is that table_color contains only team names, so the group shorthand has to be expanded back into team names. To do this, we again use an if ... else construct.

Next, we create a table with the names of the teams we need and join it with table_color using the inner_join function from the dplyr package. inner_join keeps only the rows that have a match in both tables.
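A minimal sketch of the join, with a hypothetical selected table holding the expanded team names and the Team_color column name assumed as before:

```r
library(data.table)
library(dplyr)

table_color <- data.table(
  TeamName   = c("Hawks", "Celtics", "Nets"),
  Team_color = c("#E03A3E", "#007a33", "#000000")
)
selected <- data.table(TeamName = c("Nets", "Celtics"))  # teams to display

# Only teams present in both tables survive the join.
picked <- inner_join(selected, table_color, by = "TeamName")

# Put the colors in factor-level order, as the chart expects.
lv   <- levels(factor(picked$TeamName))
cols <- picked$Team_color[match(lv, picked$TeamName)]
```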

The function also changes the wording of the title and subtitle. They take on this form:

Rendering

Finally, all of this is rendered.

I chose the nframes value empirically, so that the speed increases or decreases depending on the number of selected teams.
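The rendering call might be wrapped like this. The frame-count rule in frames_for is purely hypothetical (the author's empirical values are not given), as are the helper names, the output size, and the file name; animate and gifski_renderer are the gganimate functions for rendering to a GIF.

```r
# Hypothetical rule: more selected teams means more frames.
frames_for <- function(n_teams, base = 100, per_team = 10) {
  base + per_team * n_teams
}

render_chart <- function(anim, n_teams, file = "nba_wins.gif") {
  gganimate::animate(
    anim,
    nframes  = frames_for(n_teams),
    fps      = 10,
    width    = 1200,   # passed on to the graphics device, in pixels
    height   = 1000,
    renderer = gganimate::gifski_renderer(file)  # writes the GIF to file
  )
}
```

With a fixed fps, a larger nframes stretches the same transitions over more time, which is what slows the animation down; calling render_chart(anim, n_teams = 30) on a gganim object would produce the file.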

Chart

I hope you found this post interesting. The project code is on GitHub.

If you are interested in the sports side of these visualizations, you can visit my blog on sports.ru, "On both sides of the Atlantic".