Introduction
Hi all!
Welcome to the 2013 iteration of modeling NFL wins. As a recap, the goal of this blog is to
develop a regression model to predict the wins of all 32 NFL teams and then
compare those predictions to the lines set by sportsbook.ag on May 30th. I didn’t run this study last year due to work
and school priorities, but the 2011 version correctly predicted the over/under
on about 62% of teams.
As in 2011, my model’s predictions will compete
against predictions made by myself and two colleagues. What is different this year is that I ran
three separate models using the same predictors. This was done in the interest of identifying the
best methodology for this particular problem. An overall decision on whether to bet under
or over will be based on the model results plus common sense.
At this point, if you care about the methodology,
continue reading. If you’re only
interested in the picks, skip down to the tables at the end of this post in the
results section.
Methodology
NFL statistics on over 40 variables from 1996 to the
present were used as the data for this model.
Both offensive and defensive statistics were included as variables. For two of the models, the linear and Poisson
regression models, the data was truncated to include only the last 4
years. As will be discussed below, the
multilevel regression model necessitated the full set of data. The truncation reflects my belief that NFL
statistics aren’t comparable over more than 5 or so years. For instance, the 1998 Dolphins played a vastly
different type of football than the 2010 Dolphins. Additionally, most teams completely turn over
their rosters in a period of 4 years. If
we want to make a prediction for a team, it makes sense to include data that
pertain to players who still might be on the team.
As stated above, three models were used: a linear
regression model, a Poisson regression model, and a multilevel linear regression
model. Below is a very brief summary of
each.
If you’ve taken an introductory statistics class,
you’ve encountered linear regression.
The goal of linear regression is to fit the regression line that
minimizes the squared distances from the observed values of the dependent variable to that line. However, linear regression makes several
assumptions that are not met with this research question and data
structure. First, linear regression
assumes that the data are continuous and normally distributed. While NFL wins do happen to be roughly normally
distributed, you could argue that wins in a season is not a
continuous variable, because each team falls into one of 17 possible
response categories (ranging from 0 wins to 16 wins). In practice, though, with this many response
categories the violation is often ignorable, with minimal detriment to the linear model. The bigger issue, and one that will also
plague the Poisson regression model, is that linear regression assumes that the
errors in prediction are independent of one another. Put another way, the model assumes that the number of
wins for a given team does not depend on what year the team was observed,
the other teams it played, or roster turnover from year to year. Because that assumption is clearly violated here, a linear model
theoretically shouldn’t be appropriate.
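As a toy sketch of what the linear model is doing (the numbers below are made up for illustration, not taken from my actual data set), here is an ordinary-least-squares fit of wins on points scored per game:

```python
import numpy as np

# Hypothetical (points per game, wins) pairs -- illustrative numbers,
# not the actual 1996-present data set used in this post.
ppg = np.array([14.0, 17.5, 20.0, 22.5, 25.0, 28.0, 31.0])
wins = np.array([3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 13.0])

# Fit wins = b0 + b1 * ppg by minimizing the squared residuals.
X = np.column_stack([np.ones_like(ppg), ppg])
(b0, b1), *_ = np.linalg.lstsq(X, wins, rcond=None)
print(round(b1, 2))  # slope: additional expected wins per extra point per game
```

The slope is the quantity being interpreted later in this post when I read off the regression coefficients.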
The Poisson regression model solves the problem of
wins not being a continuous variable.
Poisson regression models count data, which is a better
characterization of wins in a season. However, Poisson
regression also has the assumption of independence of errors, so that remains a
concern. Poisson regression models have
an additional assumption: that the mean and variance of the data are equal to one
another. Conveniently, these data
actually meet this assumption (the first time I’ve seen that happen in practice). It should be noted that this data could also
be modeled with binomial regression, treating each season as 16 games that are each won or lost.
That wasn’t done here due to time constraints.
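The mean-equals-variance assumption is easy to check before committing to a Poisson model. A quick sketch of the dispersion check, with made-up win totals rather than my actual data:

```python
import numpy as np

# Made-up season win totals for a 16-team sample -- illustrative only.
wins = np.array([3, 13, 5, 11, 8, 8, 4, 12, 6, 10, 7, 9, 5, 11, 8, 8])

# Poisson regression assumes the mean and variance of the outcome are
# equal; the dispersion ratio var/mean should be near 1 if that holds.
mean = wins.mean()
var = wins.var(ddof=1)
dispersion = var / mean
print(mean, round(var, 2), round(dispersion, 2))
```

A ratio well above 1 (overdispersion) or well below 1 (underdispersion) would argue for a different count model.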
Finally, a multilevel regression model was used to
solve the dependent-errors issue. Now, I
don’t have much experience with multi-level models, so this discussion will be
very basic. Basically, multi-level
models solve the dependence of errors problem by allowing specified regression
parameters to vary by a grouping variable.
With this data, the parameters could be allowed to vary either
team by team or year by year. I chose
the team grouping variable under the assumption that there is some
organizational structure that should hold and also because I ran out of
time. Further, I only allowed the
intercepts to vary. While this model
does solve the dependent-errors issue (to a certain extent), it also models the
dependent variable as continuous again.
A more appropriate model would’ve been a generalized linear model.
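I won’t reproduce the actual multilevel fit here, but the varying-intercept idea can be illustrated with a toy partial-pooling calculation: each team’s intercept is a compromise between its own average and the league-wide average. The team records and variance components below are assumed for illustration, not estimated from my data:

```python
import numpy as np

# Toy example of varying intercepts by team via partial pooling.
# (Made-up win histories; a real fit would estimate everything jointly.)
team_wins = {
    "A": [12, 11, 13, 10],  # consistently strong team
    "B": [4, 6, 5, 5],      # consistently weak team
    "C": [8, 9, 7, 8],      # average team
}

league_mean = np.mean([w for ws in team_wins.values() for w in ws])
sigma2_within = 4.0   # assumed within-team variance across seasons
tau2_between = 6.0    # assumed between-team variance of intercepts

intercepts = {}
for team, ws in team_wins.items():
    n = len(ws)
    # Shrinkage weight: more observed seasons -> trust the team's own
    # mean more; fewer seasons -> pull harder toward the league mean.
    w = (n / sigma2_within) / (n / sigma2_within + 1 / tau2_between)
    intercepts[team] = w * np.mean(ws) + (1 - w) * league_mean

print({t: round(v, 2) for t, v in intercepts.items()})
```

The grouping structure is what lets errors within a team be correlated while teams remain exchangeable, which is exactly the dependence the single-level models ignore.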
In order to determine which of the 40 variables to
use as predictors of the next year’s wins, I employed a forward-selection
approach using the Poisson model. Initially,
I correlated each of the potential predictor variables with the number of
wins. The predictor with the highest
correlation was then put in the model and checked for significance. If it was significant, then the variable was
kept. Next, I checked for the
significance of the quadratic term.
Next, I placed the variable with the next highest correlation into the
model. If that variable was significant,
I tested the quadratic term and interaction term. I iterated this process until the gain from
adding a variable was negligible. Table 1
shows the regression coefficients for the linear model. The same predictors were used for each of the
three models.
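A stripped-down sketch of that selection loop, run on synthetic data (the predictor names are placeholders, and this version skips the significance, quadratic-term, and interaction checks described above):

```python
import numpy as np

# Synthetic data: "ppg" and "dppg" truly drive wins, "pypg" is noise.
rng = np.random.default_rng(0)
n = 128
X = {name: rng.normal(size=n) for name in ["ppg", "pypg", "dppg"]}
wins = 8 + 1.2 * X["ppg"] - 0.8 * X["dppg"] + rng.normal(scale=2.0, size=n)

selected = []
residual = wins - wins.mean()
candidates = set(X)
while candidates:
    # Pick the remaining predictor most correlated with the unexplained part.
    best = max(candidates,
               key=lambda v: abs(np.corrcoef(X[v], residual)[0, 1]))
    if abs(np.corrcoef(X[best], residual)[0, 1]) < 0.1:
        break  # negligible gain from adding another variable
    selected.append(best)
    # Regress the residual on the chosen predictor; keep what is left over.
    b = np.cov(X[best], residual)[0, 1] / np.var(X[best], ddof=1)
    residual = residual - b * (X[best] - X[best].mean())
    candidates.remove(best)

print(selected)
```

The loop reliably recovers the two predictors that actually generate the wins; whether the pure-noise variable sneaks in depends on sampling luck, which is why the real procedure also demands statistical significance before keeping a variable.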
Table 1.
Call:
lm(formula = Wins ~ POPYPA + PPPG + PDPPG + PDRYards + (PPPG:PDPPG))

Residuals:
    Min      1Q  Median      3Q     Max
-5.9365 -1.7734  0.0082  1.7961  6.2105

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.448580   9.354073  -1.758  0.08118 .
POPYPA        0.404652   0.480832   0.842  0.40168
PPPG          1.197017   0.410161   2.918  0.00419 **
PDPPG         0.961434   0.400043   2.403  0.01775 *
PDRYards     -0.001923   0.001040  -1.849  0.06681 .
PPPG:PDPPG   -0.046147   0.018141  -2.544  0.01221 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.735 on 122 degrees of freedom
  (32 observations deleted due to missingness)
Multiple R-squared: 0.2563,  Adjusted R-squared: 0.2258
F-statistic: 8.408 on 5 and 122 DF,  p-value: 7.356e-07
First, the predictors used were pass yards per game,
offensive points per game, points given up per game, rush yards allowed per
game, and the interaction between points scored and points given up. The adjusted R² is .2258, which
means that almost a fourth of the variance in wins is explained by this
model. An encouraging sign is that the
direction of each regression coefficient makes sense. For instance, the regression coefficient for
points per game is 1.197. Therefore, for
every one-point-per-game increase, we expect an additional 1.197 wins, holding the other predictors constant. Now, the coefficient for points allowed per
game (PDPPG) indicates that giving up more points results in more wins. That may seem counter-intuitive, but consider that
teams losing by multiple scores are more likely to gain garbage points. This inference is reflected in the negative
interaction (PPPG:PDPPG).
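To make the coefficients concrete, here is the linear model’s point prediction for a hypothetical stat line. The coefficients come from Table 1; the team’s predictor values are invented for illustration (and, judging by the coefficient’s magnitude, I’m assuming POPYPA is on a per-attempt scale):

```python
# Coefficients from Table 1 (linear model).
coef = {
    "(Intercept)": -16.448580,
    "POPYPA":        0.404652,  # offensive passing yards stat
    "PPPG":          1.197017,  # points scored per game
    "PDPPG":         0.961434,  # points allowed per game
    "PDRYards":     -0.001923,  # rush yards allowed
    "PPPG:PDPPG":   -0.046147,  # interaction of points for and against
}

# Hypothetical team: these predictor values are made up for illustration.
team = {"POPYPA": 7.0, "PPPG": 24.0, "PDPPG": 20.0, "PDRYards": 1800.0}

pred = (coef["(Intercept)"]
        + coef["POPYPA"] * team["POPYPA"]
        + coef["PPPG"] * team["PPPG"]
        + coef["PDPPG"] * team["PDPPG"]
        + coef["PDRYards"] * team["PDRYards"]
        + coef["PPPG:PDPPG"] * team["PPPG"] * team["PDPPG"])
print(round(pred, 2))  # predicted wins for this hypothetical team
```

Note how the negative interaction term drags the prediction back down after the two points-per-game terms push it up, which is why the positive PDPPG coefficient on its own isn’t as strange as it looks.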
Results
OK, this is probably what most people are interested
in. Table 2 is the predictions made by
me and my team of “experts.” Table 3 is
the predictions made by my three models in addition to my decisions on how to
bet based on the models.
Table 2.
[Table contents not preserved.]
Table 3.
[Table contents not preserved.]
OK, a couple of things to point out. First, a lot of the predictions are very close
to the sportsbook line. Part of that is
due to the limited range of values that wins can take. The other part serves as a bit of validation
for my model. However, two teams vary
greatly from the sportsbook line and common sense: the Raiders and the Saints. The Saints are so underestimated by
this model because of their historically bad defense last year (defensive
points allowed being a predictor in the model) and because the model can’t take
into account bounty-gate. The Raiders
are so overestimated because the model doesn’t take into account that
Oakland is tanking this season. This is
where common sense comes into play.
Anyway, those are this year’s models. If there are enough questions/interest, I may
create a follow-up post.
Have a good season,
Mike