“I don’t care what your stats say; I would whoop a chimpanzee’s ass in a fight.*”
-Channing Crowder
This was supposed to be just a Facebook post, but I ended up writing too much. If I’m using too many stats terms, please let me know and I’ll try to use more layman’s terms. I will definitely have to change the variables included, since the current model contains the redundant variables plays, rush attempts, and pass attempts. I can’t keep all three in the model because plays = rush attempts + pass attempts, which makes the individual coefficient estimates unreliable and adversely affects the predictions. So instead, I will re-run the analysis without total plays. However, I did want to present some ideas on the revision(s) in advance in order to solicit feedback. I’ve listed the options below, with the first two being the ideas I need the most feedback on.
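For anyone curious why the redundant variable is a problem, here is a minimal sketch in Python. The numbers are made up for illustration (they are not from my dataset), and the helper names `gram` and `det` are just mine. It leaves out the intercept to keep things short. The point is that when plays is an exact sum of the other two columns, the matrix the regression has to invert is singular.

```python
# Made-up season totals for illustration only (NOT real NFL data).
rush_attempts = [450, 410, 480, 395, 430]
pass_attempts = [520, 560, 500, 590, 545]
plays = [r + p for r, p in zip(rush_attempts, pass_attempts)]

def gram(columns):
    """Return X'X for a list of equal-length columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total

det_with_plays = det(gram([plays, rush_attempts, pass_attempts]))
det_without_plays = det(gram([rush_attempts, pass_attempts]))

print(det_with_plays)     # 0: X'X is singular, so no unique set of coefficients exists
print(det_without_plays)  # positive: the two remaining columns carry separate information
```

Dropping any one of the three columns would fix the singularity; I’m dropping total plays because rush and pass attempts are the more interesting pieces.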
Timeframe
Currently, the model is analyzing data from 1996 through 2011. I picked 1996 as the starting date for two reasons: that was the first year I can remember following the NFL, and that was right after the Panthers and Jaguars joined the league (both began play in 1995). However, I think it can be argued that the play style of the NFL has changed significantly since 1996, particularly in the number of run versus pass plays called. The increased restrictions on pressing receivers and the God damned Tom Brady QB protection rules come to mind. That being said, I’ve identified two additional starting points for analysis. The first is 2001, which was the peak of the Rams’ Greatest Show on Turf and also the beginning of the Spygate dynasty. The other starting point would be 2006. This date would provide for continuity of current rosters. That is, we would expect to see that teams with relatively stable rosters (Steelers, Colts) win more games consistently. Further, it would only take into account the latest play-calling trends. Please note that starting the analysis from a later date will affect the R-squared value, and while I “think” it will improve it, there is a good chance that the model could be worse, especially since a later start also means fewer team-seasons to fit on. Any thoughts on 2001 versus 2006 as a starting date, or thoughts on other possible dates?
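Here is a rough sketch of how the start dates could be compared. Everything in it is an assumption for illustration: the data is randomly generated rather than real, `yards_per_play` is a stand-in for whatever predictors end up in the model, and it uses a single predictor so that R-squared is just the squared correlation.

```python
import random

def r_squared(xs, ys):
    """R-squared of a one-predictor least-squares fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Synthetic team-seasons, NOT real data: (season, yards_per_play, wins).
rng = random.Random(0)
rows = []
for season in range(1996, 2012):
    for _ in range(30):
        yards_per_play = rng.uniform(4.0, 6.5)
        noisy_wins = 3.2 * yards_per_play - 9 + rng.gauss(0, 2)
        wins = min(16, max(0, round(noisy_wins)))
        rows.append((season, yards_per_play, wins))

def r_squared_from(start_year):
    """Fit only on seasons from start_year onward and return R-squared."""
    subset = [(x, y) for season, x, y in rows if season >= start_year]
    return r_squared([x for x, _ in subset], [y for _, y in subset])

results = {year: r_squared_from(year) for year in (1996, 2001, 2006)}
for year, value in results.items():
    print(year, round(value, 3))
```

Since the fake data has no era effect built in, the three values here only differ by noise. On the real data, any gap between them would be the thing to look at, keeping in mind that the 2006 fit uses far fewer seasons.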
New Variables
This involves new variables that could affect the number of predicted wins for a team. For example, the best idea I’ve heard is to create a variable that rates how well a team collects draft picks from year to year, on the theory that this influences wins. Higher values would go to teams with more high-round draft picks. I should say that this would be for next year’s analysis, and that I refuse to include individual player variables. So any suggestions for additional variables that were not included in the first post would be great.
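To make the draft pick idea concrete, here is one possible way to score it. The weighting is purely hypothetical (a first-round pick counts 7, a seventh-round pick counts 1), and the teams and picks are made up. The actual variable may end up looking quite different.

```python
def draft_capital(pick_rounds, total_rounds=7):
    """Sum a simple linear weight per pick: earlier rounds are worth more."""
    return sum(total_rounds + 1 - draft_round for draft_round in pick_rounds)

# Made-up draft classes: each list holds the round of every pick a team made.
team_picks = {
    "Team A": [1, 1, 2, 4, 6],
    "Team B": [2, 3, 5, 7],
}

scores = {team: draft_capital(rounds) for team, rounds in team_picks.items()}
print(scores)  # {'Team A': 26, 'Team B': 15}
```

A linear weight is the simplest choice; something that drops off faster after the first round might be more realistic, which is exactly the kind of feedback I’m looking for.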
Outliers and Influential Observations
This is very statistical, so I won’t describe this part of the analysis here. If you would like to know about it, I will of course provide the details. However, the results will be affected by which timeframe is chosen.
Type of Model
This is even more statistical; however, for the business people and stats guys, I want to speak on this a bit. The distribution of wins seems to be pretty fucking normal. Like, normal to an extent I’ve rarely seen. However, I received a suggestion about using Poisson regression, since it can be argued that wins behave like count data. I may try that type of model out, but I would guess it wouldn’t offer much improvement. I am concerned that the data is clustered within teams (i.e., the 2008 Dolphins’ win total and roster are dependent to some extent on the 2007 Dolphins’ win total and roster performance). So if anyone has any ideas on other types of GLM or cluster analysis (which I’m not sure is appropriate) that may better predict wins, I’m all ears.
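Two quick checks can help frame both questions before trying any new model. This is a sketch on made-up win totals, not my data, and the team names are placeholders. The first check compares the variance of wins to the mean, since a Poisson model assumes they are roughly equal. The second looks at how strongly a team’s wins track its own previous season, which is a crude read on the clustering concern.

```python
from statistics import mean, pvariance

# Made-up win totals by team, in season order (NOT real data).
wins_by_team = {
    "Team A": [11, 10, 12, 9, 13, 10],
    "Team B": [4, 6, 5, 7, 3, 6],
    "Team C": [8, 9, 7, 8, 10, 8],
    "Team D": [6, 5, 9, 11, 7, 8],
}

all_wins = [w for seasons in wins_by_team.values() for w in seasons]

# Check 1: variance-to-mean ratio. Near 1 is consistent with Poisson;
# well below 1 (underdispersion) or above 1 (overdispersion) is not.
dispersion = pvariance(all_wins) / mean(all_wins)

# Check 2: pair each season with the same team's previous season.
previous, current = [], []
for seasons in wins_by_team.values():
    previous.extend(seasons[:-1])
    current.extend(seasons[1:])

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

lag1_correlation = pearson(previous, current)

print("variance / mean:", round(dispersion, 2))
print("correlation with previous season:", round(lag1_correlation, 2))
```

Neither number settles anything on its own. The lag correlation in particular mixes together "good teams stay good" and genuine year-to-year dependence, so a large value would mainly be a signal that a model with team-level effects or clustered standard errors deserves a look.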
So those are the options for revisions. Any thoughts? Oh and an expansion will be announced next week thanks to a suggestion from David M.
*Channing Crowder quotes will be provided with every subsequent update.