Wednesday, August 31, 2011

Stumbling Blindly Towards Tomorrow


“I don’t care what your stats say; I would whoop a chimpanzee’s ass in a fight.*”

-Channing Crowder

            This was supposed to be just a Facebook post but I ended up writing too much.  If I’m using too many stats terms please let me know and I’ll try to use more layman’s terms.  I will definitely have to change the variables included, since the current model contains the redundant variables plays, rush attempts and pass attempts.  I can’t keep all three in the model because plays = rush attempts + pass attempts, which adversely affects the predictions.  So instead, I will re-run the analysis without total plays.  However, I did want to present some ideas on the revision(s) in advance in order to solicit feedback.  I’ve listed the options below, with the first two being the ideas I need the most feedback on.
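To make the redundancy concrete, here is a minimal Python sketch. It assumes plays is exactly rush attempts plus pass attempts, as stated above. The coefficients are the rounded values for POPlays, PORAtt and POPAtt from the first model’s coefficients table, and the attempt totals are made up for illustration.

```python
# Sketch: when plays = rush attempts + pass attempts, very different
# coefficient sets give identical predictions, so the individual
# coefficients cannot be pinned down.

def contribution(coefs, rush_att, pass_att):
    """Part of predicted wins coming from the three play-count variables."""
    plays = rush_att + pass_att  # ASSUMPTION: the identity holds exactly
    return (coefs["plays"] * plays
            + coefs["rush"] * rush_att
            + coefs["pass"] * pass_att)

# Rounded coefficients from the first model (POPlays, PORAtt, POPAtt).
fitted = {"plays": -0.023, "rush": 0.025, "pass": 0.026}

# The same model with the plays coefficient folded into the other two.
folded = {"plays": 0.0, "rush": 0.025 - 0.023, "pass": 0.026 - 0.023}

# Hypothetical (rush attempts, pass attempts) season totals.
seasons = [(400, 600), (450, 550), (520, 480)]

for rush_att, pass_att in seasons:
    a = contribution(fitted, rush_att, pass_att)
    b = contribution(folded, rush_att, pass_att)
    print(rush_att, pass_att, round(a, 3), round(b, 3))
```

Both coefficient sets print the same contribution for every row.  Under that assumption the net effect of one extra rush attempt is closer to .002 wins than .025, which is why the three coefficients shouldn’t be read one at a time.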

Timeframe

            Currently, the model is analyzing data from 1996 through 2011.  I picked 1996 as the starting date for two reasons: that was the first year I can remember following the NFL, and that was when the Panthers and Jaguars joined the league.  However, I think it can be argued that the play-style of the NFL has changed significantly since 1996, particularly in the number of run versus pass plays called.  The increased restrictions on pressing receivers and the God damned Tom Brady QB protection rules come to mind.  That being said, I’ve identified two additional starting points for analysis.  The first is 2001, which was the peak of the Rams’ Greatest Show on Turf and also the beginning of the Spygate dynasty.  The other starting point would be 2006.  This date would provide for continuity of current rosters.  That is, we would expect to see that teams with relatively stable rosters (Steelers, Colts) win more games consistently.  Further, it should only take into account the latest play calling trends.  Please note that starting the analysis from a later date will affect the R Square value, and while I “think” it will improve it, there is a good chance that the model could be worse.  Any thoughts on 2001 versus 2006 as a starting date, or thoughts on other possible dates?

New Variables

            This involves new variables that could affect the number of predicted wins for a team.  For example, the best idea I’ve heard is to create a variable that rates how well a team collects draft picks from year to year, with higher value given to having more high round draft picks.  I need to say that this would be for next year’s analysis and that I refuse to include individual player variables.  So any suggestions for additional variables that are not included in the first post would be great.

Outliers and Influential Observations

            This is very statistical so I won’t describe this part of the analysis.  If you would like to know about this I will of course provide the details.  However, this will be affected by what timeframe is chosen.

Type of Model

            This is even more statistical; however, for business people and stats guys I want to speak on this a bit.  The distribution of wins seems to be pretty fucking normal.  Like normal to an extent I’ve rarely seen.  However, I received a suggestion about using Poisson regression since it can be argued that wins behave that way.  I may try that type of model out but I would guess it wouldn’t offer much improvement.  I am concerned that the data is clustered within teams (i.e. the 2008 Dolphins win total and roster is dependent to some extent on the 2007 Dolphins win total and roster performance).  So if anyone has any ideas on other types of GLM or cluster analysis (which I’m not sure is appropriate) that may better predict wins, I’m all ears.
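One quick sanity check on the Poisson idea: a Poisson variable has a variance equal to its mean.  The rough Python sketch below pulls the variance of wins out of the Total row of the first model’s ANOVA table.  The mean of 8 wins is an assumption (half of a 16 game schedule), not a number taken from the data.

```python
import math

# Total sum of squares and its df, from the first model's ANOVA table.
total_ss = 4369.923
total_df = 468

var_wins = total_ss / total_df   # sample variance of wins
sd_wins = math.sqrt(var_wins)    # sample standard deviation of wins

# ASSUMPTION: average wins is about 8 (half of a 16 game schedule).
assumed_mean = 8.0

# Under a Poisson model this ratio would be close to 1.
dispersion = var_wins / assumed_mean

print(round(var_wins, 2), round(sd_wins, 2), round(dispersion, 2))
```

A ratio near 1 is at least not at odds with a Poisson model, but this check is crude and does nothing about the clustering within teams.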

So those are the options for revisions.  Any thoughts?  Oh and an expansion will be announced next week thanks to a suggestion from David M.

*Channing Crowder quotes will be provided with every subsequent update.

Monday, August 29, 2011

Discrimination - It can be a good thing



            The first model to predict wins for an NFL season has been finished.  First I’m going to give a bit of background on the process used to arrive at the model.  If you don’t care about the statistics that went into these predictions, please jump ahead to the predicted wins compared with the betting line.  The model below was picked using backwards regression.  While the goal of backwards regression is usually to end with a model containing only significant predictors, I instead selected the model that had the highest Adjusted R Square.  As I said yesterday, I included all variables that were not linearly related.  However, due to an oversight I did include pass attempts, rush attempts, and total plays.  In future models, this will need to be corrected.

The variables included in the final model were:

·         Whether a team won the Super Bowl the previous year (SB_W_Prev)

·         Whether a team played in a Super Bowl the previous year (Sb_App_Prev)

·         Points scored last season (PPoints)

·         Pass attempts (POPAtt)

·         Rush attempts (PORAtt)

·         Offensive plays (POPlays)

·         Rush yards (PORYards)

·         Passing yards (POPYards)

·         Rushing TDs against (PDRTD)

·         Passing yards against (PDPYards)

·         Take Aways (PDTO)

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .385   .148       .128                2.854

Change Statistics

Model   R Square Change   F Change   df1   df2   Sig. F Change
1       .148              7.223      11    457   .000



In the above table, the only statistics of interest are R Square and Adjusted R Square.  R Square can be looked at as how much of the variation in wins the combination of the above 11 variables explains.  A model can explain between 0% and 100% of that variation.  However, R Square tends to overestimate how well the model will predict, and thus Adjusted R Square is the preferred measure.  Currently this model explains 12.8% of the variation in wins, which sucks.  A model with an Adjusted R Square of 30% is my overall goal for this project.
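For anyone who wants to check the adjustment, here is a short Python sketch using the standard Adjusted R Square formula.  The sums of squares and degrees of freedom come from the ANOVA table in this post.  The function name is just an illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R Square: R Square penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# From the ANOVA table: total df = 468, so n = 469 team seasons.
n_obs = 469
n_predictors = 11

# R Square = regression sum of squares / total sum of squares.
r2 = 647.227 / 4369.923
adj = adjusted_r2(r2, n_obs, n_predictors)

print(round(r2, 3), round(adj, 3))
```

The penalty is small here because there are far more team seasons than predictors.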




ANOVA

Model 1      Sum of Squares   df    Mean Square   F       Sig.
Regression   647.227          11    58.839        7.223   .000
Residual     3722.697         457   8.146
Total        4369.923         468






The above table is for those that are interested in overall model fit.  I won’t provide an explanation here but will gladly explain it for those interested.
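For those who do want to see how the pieces of the ANOVA table fit together, here is a minimal Python sketch that rebuilds the Mean Square and F columns from the sums of squares.  It also recovers the Std. Error of the Estimate from the Model Summary table.  The variable names are just for illustration.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 647.227, 11
ss_residual, df_residual = 3722.697, 457

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F compares variation explained per predictor to leftover variation.
f_stat = ms_regression / ms_residual

# The standard error of the estimate is the root of the residual mean square.
std_error = math.sqrt(ms_residual)

print(round(ms_regression, 3), round(ms_residual, 3))
print(round(f_stat, 3), round(std_error, 3))
```

A standard error of about 2.85 means a typical prediction misses by nearly three wins, which lines up with the low Adjusted R Square.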

Coefficients

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Model 1       B       Std. Error   Beta    t        Sig.
(Constant)    9.895   3.597                2.751    .006
Sb_App_Prev   -.951   .781         -.076   -1.218   .224
SB_W_Prev     1.741   1.051        .100    1.657    .098
PDTO          -.047   .024         -.102   -2.009   .045
POPYards      -.001   .001         -.188   -1.702   .089
PDPYards      -.001   .000         -.108   -2.283   .023
PDRTD         -.063   .025         -.116   -2.497   .013
PORYards      -.001   .001         -.088   -1.011   .312
POPlays       -.023   .013         -.347   -1.786   .075
PORAtt        .025    .013         .427    1.954    .051
POPAtt        .026    .013         .460    1.935    .054
PPoints       .020    .004         .437    4.412    .000



The coefficients table is the most interesting theoretically.  The model formula is taken from the B column, and each value in it is known as a regression coefficient.  The regression coefficients tell us how much predicted wins change when an individual predictor increases by 1 and the other predictors stay the same.  So for instance let’s look at the rushing attempts variable.  This variable has a regression coefficient of .025.  The interpretation for this is that for every 1 extra rush attempt, wins increase by .025.  This indicates that teams that have a high number of rushing attempts win more.  Notice some of the coefficients defy what we would expect to see.  For instance, for every extra take away we expect teams to lose .047 games.  This is the result of some of the variables being correlated, which can cause changes in the regression coefficients.
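To show how the B column turns into a prediction, here is a small Python sketch of the formula: the constant plus each coefficient times its stat.  The coefficients are the rounded values from the table, so the output is only approximate.  The stat line is completely made up for illustration and is not any real team’s season.

```python
# Rounded B values from the coefficients table.
CONSTANT = 9.895
COEFS = {
    "Sb_App_Prev": -0.951, "SB_W_Prev": 1.741, "PDTO": -0.047,
    "POPYards": -0.001, "PDPYards": -0.001, "PDRTD": -0.063,
    "PORYards": -0.001, "POPlays": -0.023, "PORAtt": 0.025,
    "POPAtt": 0.026, "PPoints": 0.020,
}

def predict_wins(stats):
    """Constant plus the sum of coefficient * previous-season stat."""
    return CONSTANT + sum(COEFS[name] * stats[name] for name in COEFS)

# HYPOTHETICAL previous-season stat line (not a real team).
example = {
    "Sb_App_Prev": 0, "SB_W_Prev": 0, "PDTO": 25,
    "POPYards": 3500, "PDPYards": 3500, "PDRTD": 12,
    "PORYards": 1800, "POPlays": 1000, "PORAtt": 450,
    "POPAtt": 550, "PPoints": 350,
}

# One extra rush attempt, with everything else held the same.
one_more_rush = dict(example, PORAtt=example["PORAtt"] + 1)

print(round(predict_wins(example), 3))
print(round(predict_wins(one_more_rush) - predict_wins(example), 3))
```

The second number printed is just the PORAtt coefficient, which is the “wins increase by .025” reading described above.  Because yards are in the thousands and their coefficients are rounded to .001, small rounding differences can move the prediction noticeably.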

Predictions

So using the above regression coefficients, I plugged the stats for last season into the model and came up with predicted wins.  However, the initial estimates provided no discrimination (hence the joke in the title).  By discrimination, I mean that the majority of estimates were between 6 and 9 wins and thus it was hard to differentiate good and bad teams.  To rectify this, I ordered the teams within their division.  I rounded the original estimates and then awarded bonuses and penalties based on that order.  First place teams had 2 wins added, 2nd place teams had 1 win added, 3rd place teams had 1 win subtracted and 4th place teams had 2 wins subtracted.  In the following table I included the original prediction, the revised prediction, and the current line.  Note, the site I used is currently not taking bets on the Colts because of the Peyton Manning issue.
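The revision rule can be sketched in a few lines of Python.  This assumes teams are ranked on the unrounded estimate and that rounding is to the nearest whole win.  The post does not say how ties or exact halves would be handled, so that part is a guess.  The numbers are the AFC East estimates from the table.

```python
import math

# Bonus or penalty by division rank: 1st, 2nd, 3rd, 4th.
BONUS = [2, 1, -1, -2]

def revise(division):
    """Map team -> revised wins for one division of original estimates."""
    # ASSUMPTION: rank on the unrounded estimate, highest first.
    ranked = sorted(division, key=division.get, reverse=True)
    return {
        # ASSUMPTION: round to the nearest win, halves rounding up.
        team: math.floor(division[team] + 0.5) + BONUS[place]
        for place, team in enumerate(ranked)
    }

# Original predicted wins for the AFC East.
afc_east = {"Dolphins": 7.13, "Bills": 7.01, "Patriots": 9.74, "Jets": 8.3}

print(revise(afc_east))
```

Running the same function on the other divisions should reproduce the Revised Wins column.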

Couple of quick notes.  For the most part I was surprised at how accurate the predictions were.  While some predictions seem really high (hello Raiders) or low (Cardinals got hosed), overall the model gave a sound order of finish within the division.  Also, should anyone decide to use this to actually place bets, please use this just as an extra resource in making a good decision.  Common sense should be used along with these predictions.  As an example, the Colts are predicted to win 12 games.  If Manning misses time, that becomes a more and more extreme prediction.  Hope you enjoyed this and I will post possible updates tomorrow.

Team          Predicted Wins   Revised Wins   Line 8/29*

AFC East
Dolphins      7.13             6              7.5
Bills         7.01             5              5.5
Patriots      9.74             12             11.5
Jets          8.3              9              10

AFC North
Ravens        8.21             9              10
Bengals       7.39             6              5.5
Browns        6.82             5              7
Steelers      9.62             12             10.5

AFC South
Texans        7.21             6              9
Colts         9.66             12             NA
Jags          7                5              6.5
Titans        8.15             9              6.5

AFC West
Broncos       6.55             5              6
Chiefs        8.13             7              7.5
Raiders       8.85             10             6.5
Chargers      9.25             11             10

NFC South
Falcons       9.26             11             10
Panthers      4.82             3              4.5
Saints        8.66             10             10
Buccaneers    7.72             7              8

NFC West
Cardinals     6.28             4              7
49ers         6.97             6              7.5
Seahawks      7.08             8              6
Rams          7.43             9              7.5

NFC East
Giants        8                9              9
Eagles        8.03             10             10.5
Cowboys       7.82             7              9
Redskins      5.6              4              6

NFC North
Lions         7.8              9              8
Packers       9.32             11             11.5
Vikings       6.83             6              7
Bears         6.77             5              8



*I used the following website for the lines http://www.sportsbook.ag/
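To see where the revised predictions disagree most with the line, here is a rough Python sketch that sorts a handful of teams by the size of the gap.  The revised wins and lines are copied from the table.  Which teams are included is arbitrary, and the Colts are left out because they have no line.

```python
def lean(revised, line):
    """Which side of the betting line the revised prediction falls on."""
    if revised > line:
        return "over"
    if revised < line:
        return "under"
    return "push"

# (revised wins, line 8/29) for a few teams from the table.
picks = {
    "Raiders": (10, 6.5),
    "Cardinals": (4, 7),
    "Titans": (9, 6.5),
    "Texans": (6, 9),
    "Saints": (10, 10),
}

# Biggest disagreement with the line first.
ranked = sorted(picks, key=lambda t: abs(picks[t][0] - picks[t][1]), reverse=True)

for team in ranked:
    revised, line = picks[team]
    print(f"{team:10s} {revised - line:+.1f} {lean(revised, line)}")
```

A big gap is as likely to flag a weak spot in the model as a bad line, so treat the ordering as a list of predictions to double check.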