Monday, September 2, 2013

2013 NFL Predicted Wins



Introduction

Hi all!  Welcome to the 2013 iteration of modeling NFL wins.  As a recap, the goal of this blog is to develop a regression model to predict the wins of all 32 NFL teams and then compare those predictions to the lines set by sportsbook.ag on May 30th.  I didn’t run this study last year due to work and school priorities, but the 2011 version correctly predicted the over/under on about 62% of teams.  

As in 2011, my model’s predictions will compete against picks made by myself and two colleagues.  What is different this year is that I ran three separate models using the same predictors.  This was done in the interest of identifying the most appropriate methodology for this particular problem.  The overall decision on whether to bet over or under is based on the model results plus common sense.

If you care about the methodology, continue reading.  If you’re only interested in the picks, skip down to the tables in the Results section at the end of this post.

Methodology

NFL statistics on over 40 variables, both offensive and defensive, from 1996 to the present were used as the data for these models.  For two of the models, the linear and Poisson regressions, the data was truncated to include only the last 4 years; as will be discussed below, the multilevel regression model necessitated the full set of data.  The truncation reflects my belief that NFL statistics aren’t comparable over more than 5 or so years.  For instance, the 1998 Dolphins played a vastly different type of football than the 2010 Dolphins.  Additionally, most teams completely turn over their rosters in a period of 4 years.  If we want to make a prediction for a team, it makes sense to include only data that pertain to players who might still be on the team.

As stated above, three models were used: a linear regression model, a Poisson regression model, and a multilevel linear regression model.  Below is a very brief summary of each.

If you’ve taken an introductory statistics class, you’ve encountered linear regression.  The goal of linear regression is to fit the line that minimizes the (squared) distance between the observed values of the dependent variable and the line.  However, linear regression makes several assumptions that are not met by this research question and data structure.  First, linear regression assumes that the outcome is continuous and normally distributed.  While NFL wins happen to be roughly normally distributed, you could argue that wins in a season is not a continuous variable, because each team falls into one of 17 possible response categories (0 to 16 wins).  In practice, though, a violation with this many response categories can often be ignored with minimal detriment to the linear model.  The bigger issue, and one that also plagues the Poisson regression model, is that linear regression assumes the errors in prediction are independent of one another.  Put another way, it assumes that the number of wins for a given team does not depend on the year the team was observed, the other teams it played, or roster turnover from year to year.  Because of this dependence, a linear model theoretically shouldn’t be appropriate.
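
If you want to eyeball the normality claim yourself, here is a minimal sketch in R (the data frame name nfl is a placeholder, not my actual script; Wins is the season win total):

# Quick look at whether season wins are roughly normal.
hist(nfl$Wins, breaks = seq(-0.5, 16.5, by = 1),
     main = "Distribution of season wins", xlab = "Wins")
shapiro.test(nfl$Wins)   # crude formal check; the discreteness will show up here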

The Poisson regression model solves the problem of wins not being a continuous variable.  Poisson regression models count data, which is a better characterization of wins in a season.  However, Poisson regression also assumes independence of errors, so that concern remains.  Poisson regression has the additional assumption that the mean and variance of the data are equal to one another.  Conveniently, this data actually meets that assumption (the first time I’ve seen it hold in practice).  It should be noted that this data could also be modeled with binomial regression, treating a season as 16 trials; that wasn’t done here due to time constraints.
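
A minimal sketch of the equidispersion check and the Poisson fit, again assuming a placeholder data frame nfl with the predictor columns from Table 1:

# Mean and variance of wins should be roughly equal for a Poisson model.
mean(nfl$Wins); var(nfl$Wins)

pois.fit <- glm(Wins ~ POPYPA + PPPG + PDPPG + PDRYards + PPPG:PDPPG,
                family = poisson, data = nfl)
summary(pois.fit)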

Finally, a multilevel regression model was used to address the dependent-errors issue.  Now, I don’t have much experience with multilevel models, so this discussion will be very basic.  Basically, multilevel models handle the dependence of errors by allowing specified regression parameters to vary by a grouping variable.  With this data, the parameters could vary either team by team or year by year.  I chose team as the grouping variable under the assumption that there is some organizational structure that persists across seasons, and also because I ran out of time.  Further, I only allowed the intercepts to vary.  While this model does address the dependent-errors issue (to a certain extent), it again treats the dependent variable as continuous.  A more appropriate model would’ve been a generalized linear mixed model.
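
A minimal sketch of the random-intercept model using the lme4 package (the data frame nfl and the grouping column Team are placeholder names):

library(lme4)

# Intercepts vary by team; the slopes are shared across all teams.
ml.fit <- lmer(Wins ~ POPYPA + PPPG + PDPPG + PDRYards + PPPG:PDPPG +
                 (1 | Team), data = nfl)
summary(ml.fit)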

In order to determine which of the 40 variables to use as predictors of the next year’s wins, I employed a forward selection approach using the Poisson model.  Initially, I correlated each of the potential predictor variables with the number of wins.  The predictor with the highest correlation was put in the model and checked for significance; if it was significant, the variable was kept, and I then checked the significance of its quadratic term.  Next, I placed the variable with the next highest correlation into the model, and if that variable was significant, I tested its quadratic term and the interaction term.  I iterated this process until the gain from adding a variable was negligible.  Table 1 shows the regression coefficients for the linear model.  The same predictors were used for each of the three models.
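
A rough sketch of that screening step in R (object names are placeholders, and the real process also tested quadratic and interaction terms at each step):

# Correlate every candidate predictor with wins and rank them.
predictors <- setdiff(names(nfl)[sapply(nfl, is.numeric)], c("Year", "Wins"))
cors <- sapply(predictors,
               function(v) cor(nfl$Wins, nfl[[v]], use = "complete.obs"))
head(sort(abs(cors), decreasing = TRUE))

# Add the top predictor, keep it if significant, then repeat with the next one.
fit1 <- glm(Wins ~ PPPG, family = poisson, data = nfl)
summary(fit1)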

Table 1.
Call:
lm(formula = Wins ~ POPYPA + PPPG + PDPPG + PDRYards + (PPPG:PDPPG))

Residuals:
    Min      1Q  Median      3Q     Max
-5.9365 -1.7734  0.0082  1.7961  6.2105

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -16.448580   9.354073  -1.758  0.08118 .
POPYPA        0.404652   0.480832   0.842  0.40168  
PPPG          1.197017   0.410161   2.918  0.00419 **
PDPPG         0.961434   0.400043   2.403  0.01775 *
PDRYards     -0.001923   0.001040  -1.849  0.06681 .
PPPG:PDPPG   -0.046147   0.018141  -2.544  0.01221 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.735 on 122 degrees of freedom
  (32 observations deleted due to missingness)
Multiple R-squared:  0.2563, Adjusted R-squared:  0.2258
F-statistic: 8.408 on 5 and 122 DF,  p-value: 7.356e-07

First, the predictors used were pass yards per game (POPYPA), offensive points per game (PPPG), points given up per game (PDPPG), rush yards allowed per game (PDRYards), and the interaction between points scored and points given up (PPPG:PDPPG).  The adjusted R-squared is .2258, which means that almost a quarter of the variance in wins is explained by this model.  An encouraging sign is that the direction of each regression coefficient makes sense.  For instance, the coefficient for points per game is 1.197, so, ignoring the interaction, every additional point scored per game corresponds to roughly 1.2 additional expected wins.  The coefficient for points allowed per game (PDPPG) appears to indicate that giving up more points results in more wins.  That may seem counter-intuitive, but consider that teams losing by multiple scores are more likely to pick up garbage points; this is reflected in the negative interaction term (PPPG:PDPPG), which pulls the net effect of points allowed back down.
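
To make the interaction concrete, here is a back-of-the-envelope calculation using the Table 1 coefficients.  Because of the interaction, the net effect of allowing one more point per game is 0.961 - 0.046 * PPPG, so for a team scoring at, say, 25 points per game (an illustrative value, not one from the data) it comes out negative:

# Net effect of one extra point allowed per game for a team scoring 25 PPG.
0.961434 - 0.046147 * 25   # about -0.19 wins, i.e. a worse defense still hurts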

Results

OK, this is probably what most people are interested in.  Table 2 shows the over/under picks made by me and my team of “experts” (the MZ, KK, and Wojay columns) against the sportsbook line.  Table 3 shows the win totals predicted by the three models, along with my decisions on how to bet based on the models.

Table 2.
Team         Sportsbook   MZ      KK      Wojay
Cardinals    5.5          Over    Over    Under
Falcons      10           Over    Over    Over
Ravens       8.5          Over    Under   Under
Bills        6.5          Over    Under   Under
Panthers     7            Over    Over    Over
Bears        8.5          Under   Over    Over
Bengals      8.5          Under   Over    Over
Browns       6            Over    Under   Over
Cowboys      8.5          Under   Under   Under
Broncos      11.5         Under   Under   Over
Lions        7.5          Over    Under   Over
Packers      10.5         Over    Over    Over
Texans       10.5         Under   Over    Over
Colts        8.5          Over    Under   Over
Jaguars      5            Under   Under   Under
Chiefs       7.5          Under   Over    Over
Dolphins     7.5          Over    Over    Over
Vikings      7.5          Under   Over    Over
Patriots     11.5         Under   Under   Under
Saints       9.5          Over    Over    Over
Giants       9            Over    Over    Under
Jets         6.5          Under   Under   Over
Raiders      5.5          Under   Under   Under
Eagles       7            Over    Under   Under
Steelers     9.5          Under   Over    Under
Chargers     7.5          Over    Under   Under
49ers        11.5         Under   Under   Over
Seahawks     10.5         Under   Under   Over
Rams         7.5          Under   Under   Over
Buccaneers   7.5          Under   Over    Over
Titans       6.5          Under   Under   Under
Redskins     8.5          Over    Over    Over

Table 3.
Team         Sportsbook   Poisson   Linear   Multilevel   Decision
Cardinals    5.5          5.5       5.5      6            Over
Falcons      10           9         9.5      9            Under
Ravens       8.5          8         8        8            Under
Bills        6.5          6.5       6.5      7            Over
Panthers     7            8.5       8.5      8            Over
Bears        8.5          8.5       9        9            Over
Bengals      8.5          8.5       9        9            Over
Browns       6            7         7        7.5          Over
Cowboys      8.5          7.5       7.5      7.5          Under
Broncos      11.5         13        12       11.5         Over
Lions        7.5          7.5       7.5      7.5          Over
Packers      10.5         9         9        9            Under
Texans       10.5         9.5       9.5      9.5          Under
Colts        8.5          7         7        7            Under
Jaguars      5            7         7        7            Over
Chiefs       7.5          7         7        7            Under
Dolphins     7.5          7         7        7            Under
Vikings      7.5          8         8        8            Over
Patriots     11.5         12.5      11.5     11           Over
Saints       9.5          6.5       6        6.5          Under
Giants       9            8.5       8.5      8.5          Under
Jets         6.5          6.5       6.5      6.5          Under
Raiders      5.5          7.5       7.5      7.5          Under
Eagles       7            7.5       7.5      7.5          Over
Steelers     9.5          8.5       8.5      8.5          Under
Chargers     7.5          8         8        8.5          Over
49ers        11.5         10.5      10.5     10           Under
Seahawks     10.5         11        11       10           Over
Rams         7.5          7         7        7            Under
Buccaneers   7.5          9         9        9            Over
Titans       6.5          7         7        7            Over
Redskins     8.5          9.5       9        9            Over



OK, a couple of things to point out.  First, a lot of the predictions are very close to the sportsbook line.  Part of that is due to the limited range of values that wins can take.  The other part serves as a bit of validation for my models.  However, two teams differ greatly from the sportsbook line and from common sense: the Raiders and the Saints.  The Saints are underestimated by this model because of their historically bad defense last year (defensive points allowed being a predictor in the model) and because the model can’t take bounty-gate into account.  The Raiders are overestimated because the model doesn’t take into account that Oakland is tanking this season.  This is where common sense comes into play.

Anyway, those are this year’s models.  If there are enough questions or enough interest, I may write a follow-up post.
 
Have a good season,

Mike