Sunday, August 28, 2011

Why It's Taking Me 3 Years for a Master's

Sometimes a Wasted Notion (Working Title)
AKA What Happens When I Have Too Much Free Time over the Summer

OK, as most of you are aware, I am a huge NFL fan.  And as some of you may know, I’ve spent the last three years studying quantitative psychology (basically statistics lite) in the quaint state of Nebraska.  I guess you could say I have an interest in both.  The latter interest gives me some pretty in-depth insight into the advanced prediction models that statistics offers, plus access to some nifty statistical software packages.  Well, over the summer the combination of these interests created an impetus to find a statistical model that would most accurately predict the number of wins for every NFL team in the upcoming season.  Fine, that was a lie.  The impetus was that I’m stuck in this boring-ass state that has snow on the ground for seven months out of the year, and as a result I listen to a lot of Bill Simmons (and a bit of Chad Millman, although he seems like a dick).  Regardless, I decided to build this model with the overall goal of being correct on at least 56% of the 32 season win-total bets offered by Las Vegas (per Chad Millman, unfortunately).  That benchmark is apparently the point at which you can actually make money off gambling.
Note: I have little to no idea how gambling lines or sports books work, so part of the reason I made this public was to get input on that.
Note:  For the foreseeable future this will be hypothetical money, because I am operating under the assumption that I will not hit 56% of my picks.  Oh, and graduate school pays me little enough that I technically qualify for welfare.
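For anyone who wants to poke at that benchmark, here is a back-of-the-envelope sketch of how a break-even rate falls out of the odds. It assumes the common "-110" pricing (risk 110 to win 100), which is an assumption and not something taken from Millman or any actual sports book; season win-total bets may well be priced differently. The function names are made up for illustration.

```python
def break_even_rate(risk, win):
    """Share of bets you must win to break even when you risk `risk` to win `win`."""
    return risk / (risk + win)


def expected_profit(hit_rate, n_bets, risk=110.0, win=100.0):
    """Expected total profit over n_bets at a given hit rate (hypothetical money)."""
    return n_bets * (hit_rate * win - (1 - hit_rate) * risk)


# ASSUMPTION: every bet is priced at -110 (risk 110 to win 100).
rate = break_even_rate(110.0, 100.0)
print(f"Break-even hit rate at -110: {rate:.1%}")

# Expected profit on 32 win-total bets at a 56% hit rate, same assumed pricing.
print(f"Expected profit at 56%: {expected_profit(0.56, 32):.2f}")
```

Under that assumed pricing, the break-even point sits a bit above 52%, so a 56% target would leave some cushion. With different odds the numbers shift accordingly.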
So what happened next ended up being one of the top five nerdiest things I’ve ever done.  This is coming from the guy who Wikipedia’d Dragon Ball Z episode synopses for two solid hours and decided that didn’t make the top five.  I found a website that recorded team stats in CSV format and manually copied and pasted these stats into my own data file, composed of 40 variables for all NFL teams dating from 1996 to the present.  The list of variables is below.  However, some of these variables are composites of one another, which makes them redundant (or outright linearly dependent), so they cannot all be used in the same analysis.  For example, passing completion percentage is pass completions divided by passing attempts.  It would be redundant to analyze pass completions and passing attempts along with passing completion percentage.
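To see why the composite variables are a problem, here is a small sketch using made-up numbers (not the real data set). It assumes a 16-game regular season, so points per game is just points divided by 16 and carries no information that points does not already carry. Completion percentage is likewise fully determined by completions and attempts.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up team seasons (NOT real data): (points, completions, attempts)
seasons = [(310, 300, 520), (427, 350, 540), (265, 280, 510), (388, 330, 560)]
GAMES = 16  # ASSUMPTION: 16-game regular season

points = [s[0] for s in seasons]
points_per_game = [p / GAMES for p in points]

# Completion percentage is computed from two other columns, so it adds
# nothing new once completions and attempts are already in the model.
completion_pct = [100 * c / a for _, c, a in seasons]

# Points and points per game are perfectly correlated: one is a rescaled
# copy of the other, so a regression cannot separate their effects.
r_ppg = pearson_r(points, points_per_game)
print(f"correlation(points, points per game) = {r_ppg:.4f}")
```

The practical upshot is to keep one variable from each redundant group (for example, either points or points per game) and drop the rest before fitting anything.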


Offense                      Defense                      Other
Points                       Opponents Points             Wins
Points Per Game              Opponents Points Per Game    Previous Year Wins
Plays                        Plays                        Coaching Change
Yards Per Play               Yards Per Play               Playoff Appearance
1st Downs                    1st Downs                    Super Bowl Appearance
Passing Completion           Passing Completion           Super Bowl Win
Pass Attempts                Pass Attempts
Completion Percentage        Completion Percentage
Pass Yards Per Attempt       Pass Yards Per Attempt
Passing Yards                Passing Yards
Passing TDs                  Passing TDs
Interceptions                Interceptions
Rush Attempts                Rush Attempts
Rushing Yards                Rushing Yards
Rushing Yards Per Attempt    Rushing Yards Per Attempt
Rushing TDs                  Rushing TDs
Turnovers                    Turnovers

So, these are the variables I will be using in some way to predict wins.  As of right now I will be using multiple regression as my primary model.  However, my long-term goal is to create a path analysis model that will (ideally) have a better fit.  That is, it will be more predictive.  For the sake of this blog, path analysis can be thought of as a predictive model in which some variables are predicted by other variables and simultaneously predict wins.  See the model below for a simple graphical example in which rushing TDs is both predicted by rush attempts and also predicts wins.
Rush Attempts -> Rush TDs -> Wins
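As a rough illustration of that chain, the sketch below fits the two links as separate simple regressions and multiplies the slopes to get the indirect effect of rush attempts on wins. The numbers are invented for six hypothetical team seasons, and real path analysis software estimates all paths together and reports fit statistics, so treat this as a toy version of the idea and nothing more.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up numbers for six hypothetical team seasons (NOT real NFL data).
rush_attempts = [380, 410, 450, 470, 500, 520]
rush_tds = [8, 10, 11, 14, 15, 18]
wins = [5, 6, 8, 9, 10, 12]

# Path a: Rush Attempts -> Rush TDs
a, _ = fit_line(rush_attempts, rush_tds)
# Path b: Rush TDs -> Wins
b, _ = fit_line(rush_tds, wins)

# In a pure chain with no direct path, the effect of rush attempts on wins
# runs entirely through rush TDs and equals the product of the two slopes.
indirect_effect = a * b
print(f"a (TDs per attempt)  = {a:.4f}")
print(f"b (wins per TD)      = {b:.4f}")
print(f"indirect effect a*b  = {indirect_effect:.4f}")
```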
Either later tonight or, more probably, tomorrow afternoon I will post my first model for predicting wins.  As I said above, I will use multiple regression along with a monitored backward selection of variables in order to select the model with the best adjusted multiple coefficient of determination (adjusted R²).  Adjusted R² is the proportion of variance in wins that the model accounts for, penalized for the number of predictors in the model.  Said another way, it is a measure of how well the model explains wins: it tops out at 1 (a perfect explanatory model), while a useless model lands around 0 (or even slightly below it).
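For reference, the adjustment is the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below plugs in invented R² values (not results from this data set) to show the logic behind one backward-selection step: dropping predictors is worthwhile when adjusted R² goes up even though plain R² dips.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical fits on 32 observations (values invented for illustration).
full_model = adjusted_r2(0.80, n_obs=32, n_predictors=10)    # about 0.705
reduced_model = adjusted_r2(0.79, n_obs=32, n_predictors=6)  # about 0.740

# Plain R-squared fell slightly (0.80 -> 0.79), but adjusted R-squared rose,
# so a backward-selection step would keep the smaller model.
print(f"full model:    {full_model:.3f}")
print(f"reduced model: {reduced_model:.3f}")

# With few observations and weak predictors the value can drop below zero.
weak_model = adjusted_r2(0.05, n_obs=12, n_predictors=5)
print(f"weak model:    {weak_model:.3f}")
```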
So, what I need from everyone is this: how can I make the model better?  Now, that question precludes adding entirely new variables, because it’s a bitch to do and I don’t have the time.  However, adjustments can be made.  For instance, I am analyzing data from 1996 to the present, but I personally believe that the league has changed so much in that time frame that maybe the years I analyze should be restricted.  I was leaning towards 2001 (Greatest Show on Turf and the arrival of the fucking Patriots) and onward, but any recommendations with accompanying logic would be great.  If additional variables can be created from the variables at hand, that would be useful too.  Reader help will be most needed when the path analysis really starts.  Any other suggestions will be appreciated though (maybe differences between divisions?).  As a final note, I’d like to keep this going whether anyone reads it or not.
Limitations:
This effort does come with several limitations:
·         As I said earlier, I am completely ignorant of gambling lines and sports books.  Any feedback on how to find a sports book and interpret lines would be greatly appreciated
·         Since the data was compiled by one person, there is a high likelihood of data entry error
·         Likewise, since I’m the only one who did it I will not be sharing the entire data set (maybe limited portions)
·         Sharing of regression (and later path) estimates will be on a case-by-case basis
·         However, I will always specify every variable included in the model along with their inter-relationships
·         Kyle K is 32 years old and still is an undergraduate
