Fun with NFL Stats, Bokeh, and Pandas

j253.github.io

129 points by J253 5 years ago

sndean 5 years ago

> The small spikes at 5 yard increments is interesting and I don't really have a good explanation other than to think that whoever recorded the yardage data liked rounding to the nearest 5 if it was close. Anyone else have any other ideas?

I'll go with the idea that the refs are biased with their ball placement and tend to put the ball on lines [0]. Also, players/teams practice, speak, think in 5 yard increments, so that's got to bias things a bit. "We've got to get to the 35 yard line for Morten to have a chance."

[0] https://gutterstats.wordpress.com/2015/11/03/are-nfl-officia...

slg 5 years ago

There are multiple other factors involved that aren't discussed fully in that article.
To start, the ref bias is not necessarily unintentional. Measuring first downs and exact placement of the line of scrimmage becomes more difficult and takes more time when the ball is not initially spotted on a 5 yard increment or individual hash mark. The league offices might directly instruct refs to spot the ball on an exact yard line if there is any doubt of the spot in order to speed up the game. An obvious example of this intentional bias is when the ball is punted out of bounds. It is nearly impossible for a ref to get an exact spot in that situation and yet it is almost always spotted exactly on a hash mark.
Also lots of drives will start on a set yard line and penalties are often handed out in 5 yard increments. So while there are no rules that will start a drive on the 15, there are rules that will start a drive on the 20 or 25 and a standard 5 or 10 yard penalty would place the ball exactly at the 15 yard line.
Lastly, the spot is a continuous data point that is being recorded by humans as a discrete data point. That means the herding could be exaggerated in the recorded data and not necessarily as pronounced during the actual game. A ball might be spotted by the ref at 14.4 yards but the person responsible for data entry might eye the spot and record it as being at the 15 yard line.
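
One rough way to put a number on the herding described above is to compare how often spots land on multiples of 5 with how often they land on the yard lines in between. This is only a sketch: the `recorded` list is made up, `heaping_ratio` is a hypothetical helper, and a high ratio cannot tell you whether the cause is ball placement, penalty yardage, or data entry.

```python
from collections import Counter

def heaping_ratio(spots, field, multiple=5):
    """Average count on multiples of `multiple` divided by the average count elsewhere.

    `field` lists every yard line under consideration, so lines that were
    never recorded still count as zeros.
    """
    counts = Counter(spots)
    on = [counts[y] for y in field if y % multiple == 0]
    off = [counts[y] for y in field if y % multiple != 0]
    return (sum(on) / len(on)) / (sum(off) / len(off))

# Made-up recorded spots, NOT real NFL data.
recorded = [20, 20, 20, 25, 25, 25, 21, 22, 23, 24, 26, 27]
ratio = heaping_ratio(recorded, field=range(20, 28))
print(ratio)  # 3.0 for this made-up sample
```

A ratio near 1 would mean no visible heaping; anything well above 1 matches the spikes in the post's histogram.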
swasheck 5 years ago

I've often wondered if there was a way to detect bias in officiating. Naturally there's going to be a subjective element to the types of penalties called, but I also have wondered if there was a way to detect bias in the situations in which penalties are called and how they affect the probability of victory.
Additionally, with the whole Tim Donaghy [0] NBA thing, I have always been curious if there was a way to detect referee influence on the outcomes of games based on the line. There are certainly times when Vegas loses (a few weeks ago was particularly bad for them), but it would be interesting to see if there was a way to detect probabilities of officials, coaches, or even players deciding the outcome of matches beyond the course of the endeavor.
But really, it'd be fun to have statistical backing to confirm or refute the whole "College Football has a pro-East Coast bias" or "this referee always gives fouls to the European soccer sides", or "the Patriots _always_ get those calls because ... Tom Brady."
[0] https://en.wikipedia.org/wiki/2007_NBA_betting_scandal
endorphone 5 years ago

While I have no explanation for the clustering, a yard is a long, long measure in football, with a dozen 4K cameras and a lot of very vested interests. Occasionally spots seem generous or not, but that's to the measure of a couple of inches (e.g. he went down 4" short of the first down). As a long time fan it just doesn't seem reasonable to me that the ref opts for the 40 yard line instead of the 39 because there's a line. That's a 3 foot difference. To further support this, the bordering two yard counts do not depress the amount of the excess on the 5 yard marker.
Not to mention that where the refs place the ball -- near the center of the field, usually on the left or right side depending upon where the play ended -- has a line on every yard marker.
But again I have no explanation. We know that touchbacks start on the 20 or 25 yard line, and penalties are increments of 5 yards (e.g. face mask would be 15 yards, holding 10 yards, false start 5 yards, encroachment 5 yards, etc...and these can pile, kickoff out of the last five yards starts at the 40 yard line, etc), and to some degree this explains the 5 yard excess.
forapurpose 5 years ago

It also could be bias in the person recording the data, who may unconsciously prefer rounding placements between the 4 and 6 yard lines to 5.
J253 5 years ago

Thanks for the link! That's a great resource and write up about the statistical heaping effect. I'll include that in a post update.
dyim 5 years ago

ha! that's funny. "We've got to get to the 35 yard line, so Morten can kick a 53-yarder, so we can have a chance."

dbt00 5 years ago

As someone who's done a lot of numerical analysis and watches a lot of football, the analysis here is pretty rudimentary.

> On third down, pass attempts outnumber run attempts at almost a 4 to 1 clip. This is likely out of increased desperation to make a first down.

Running is a low variance low yardage option, passing is a high variance high yardage option. Passing on third and medium to long is an obvious dominant strategy. Pulling the goaltender in hockey or bringing the keeper forward in soccer when trailing late in games/matches serves much the same effect.
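
The variance argument can be illustrated with a toy model. The sketch below assumes yards gained on a play are normally distributed, which is a crude simplification, and the means and standard deviations are invented for illustration rather than fitted to any data.

```python
from statistics import NormalDist

# Invented parameters (yards), NOT estimated from real play-by-play data.
PLAY_MODELS = {
    "run": NormalDist(mu=4.0, sigma=3.0),    # low variance, low yardage
    "pass": NormalDist(mu=6.0, sigma=10.0),  # high variance, high yardage
}

def conversion_chance(play_type, yards_needed):
    """Probability that a single play gains at least `yards_needed` under the toy model."""
    return 1.0 - PLAY_MODELS[play_type].cdf(yards_needed)

for needed in (1, 8):
    run = conversion_chance("run", needed)
    pas = conversion_chance("pass", needed)
    print(f"need {needed}: run {run:.2f}, pass {pas:.2f}")
```

Under these made-up numbers the run is the better bet on third and 1, while the pass is clearly better on third and 8, which is the shape of the argument above.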

arglebarnacle 5 years ago

Does anyone have any insight about what kind of jobs are out there for people with the kind of skills demonstrated in this post?

I have a lot of data exploring, cleaning and visualizing skills, python/SQL skills and experience using it to make business decisions, but this type of thing falls short of what most people would consider "data science"

J253 5 years ago

Agreed. And I definitely make no claims about this being earth-shattering "data science". I just happened to spend a few hours over the weekend making some plots and commenting about what I saw with some Python tools.
I'll also state that I am neither a data scientist nor a statistician. I'm a Python application engineer with a background in mechanical engineering, so that might help set the context a bit more.
- rhcom2 5 years ago
  
  Most companies don't need "earth-shattering 'data science'", they need a way to convey a narrative with their information and maybe try to deduce something from it.
  I work as a programmer at an architecture company and we do visualizations like this all the time for campus classroom usage for example. Is it groundbreaking? Of course not, but it helps the clients and designers a ton.
kilbuz 5 years ago

In my experience, certainly a large number of jobs advertised as 'data scientist' would be exactly as you describe. Emphasis on the 'cleaning' part.
dataanalyst1 5 years ago

Data Analyst

chaosbutters 5 years ago

I feel like this is a waste of bokeh's potential and matplotlib would have sufficed. I love bokeh for the interactive capability and controlling what you plot, zooming, and just overall more immersive feeling than a static 2d plot.

Still very interesting and insightful and lovely plots generated.

J253 5 years ago

Thanks. And I agree with Bokeh being overkill for this. My original intention was to make it fully interactive but I hit some snags on keeping the interactivity through Pelican SSG so I just kept 'em static.

nubb 5 years ago

I've always enjoyed this project for pulling nfl stats. https://github.com/BurntSushi/nflgame

burntsushi 5 years ago
That project is no longer maintained because I don't use it any more, but others have picked up the baton: https://github.com/derek-adair/nflgame
Back in the day, I used nflgame along with
```
    https://github.com/BurntSushi/nfldb
    https://github.com/BurntSushi/nflvid
    https://github.com/BurntSushi/nflfan
```
to set up a simple local web UI that allowed me to quickly search through every play and watch any single play I wanted. Video footage was available as soon as the game was over, and play info was available live as the game was playing. It was amazing.
This worked because nflvid downloaded full HD NFL games from their CDN, which was unprotected at the time. (I paid for an NFL Game Pass subscription and never distributed the video footage.) They also had XML files that delineated the time at which each play started and its duration. Some ffmpeg slicing and dicing was all it took to cut up a full game and associate each clip with each play. That's all part of what nflvid does.
I hacked all of this together in my free time years ago, and I bet a lot of people would find it amazing. One wonders why the NFL doesn't build this and sell it themselves. When I used Game Pass a few years ago, you could search for plays with rudimentary criteria, but only over a single game at a time. It was artificially very limited.
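
For anyone curious what the ffmpeg slicing mentioned above might look like, here is a minimal sketch that builds one cut command from a play's start time and duration. It is not nflvid's actual code: `clip_command`, the file names, and the timing values are all hypothetical. Only the ffmpeg options themselves (`-ss`, `-i`, `-t`, `-c copy`) are standard.

```python
def clip_command(game_file, start_seconds, duration_seconds, out_file):
    """Build an ffmpeg command that copies one play out of a full-game video."""
    return [
        "ffmpeg",
        "-ss", str(start_seconds),    # seek to the start of the play
        "-i", game_file,              # full-game input file
        "-t", str(duration_seconds),  # keep only the play's duration
        "-c", "copy",                 # copy streams instead of re-encoding
        out_file,
    ]

# Hypothetical play timing: starts 754.2 s into the file and lasts 6.5 s.
cmd = clip_command("game.mp4", 754.2, 6.5, "play_0012.mp4")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. Because the streams are copied rather than re-encoded, cuts tend to snap to keyframes, so clip boundaries may be slightly off.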
- diminoten 5 years ago
  
  We briefly spoke via GitHub about a month ago, and during that convo (it was in a ticket), you mentioned that the source has inaccuracies. Is there any elaboration there or do the NFL people use a different data source to do things like Fantasy and official stats?
  
  burntsushi 5 years ago
  
  I'm not an NFL insider. I don't know what they do internally. I only know that 1) the undocumented NFL GameCenter JSON is not 100% accurate and that 2) any user of a fantasy league would notice these inaccuracies. I did a test a while back by comparing GameCenter data with Yahoo's data. Kickers tend to have the most inaccuracies: https://github.com/BurntSushi/nflgame/blob/master/test-data/... QB stats are more solid for example, but there are still minor problems: https://github.com/BurntSushi/nflgame/blob/master/test-data/...
  From those observations, you can't really make any solid conclusions. But if you think about it for a bit, you might be able to reason your way to some guesses. For example, one possibility is that the GameCenter data is NFL's own construction that's only used for their GameCenter interfaces, whereas places where "official" data is needed might be powered by Elias[1]. Why the discrepancy? Again, I don't know. It could be legacy software related. It could be contract/legal related. Or it could just be plain old bugs. E.g., maybe GameCenter hooks into an initial lossy but fast feed that is updated during the game, but never receives updates from a slower but more accurate feed later.
  Or maybe the NFL purposely inserts data canaries because they know this JSON feed is unprotected, and they intend on using those data canaries to detect folks using their data in an unlicensed fashion. I'm pretty sure IMDb does this, for example. Or maybe they just insert errors purposely to make it too costly for anyone to use this data in situations that require 100% accuracy (like fantasy football leagues).
  My guess is some innocuous blend of legal and legacy software reasons.
  [1] - http://www.esb.com/
- eunoia 5 years ago
  
  I learned Postgres while playing around with nfldb years ago. Great experience.
  Thank you for all your work, those are some very impressive projects.

tunesmith 5 years ago

For a while I had a process to extract the top ten highest-WPA plays from my favorite team's (Broncos) most recent game. But then my data source dried up. I'm glad to find out about nflscrapR; it seems like I might be able to figure out how to do that report again with recent play-by-play data.

Incidentally, that report was really fun during the 2011 Broncos season. Normally when you are finding the plays with the largest WPA swings, you'd expect them to be distributed among both teams. But since Tim Tebow became starting quarterback, I searched for the top ten largest WPA swings for the rest of the season - and from what I recall, every single one of those dramatic plays was in the Broncos' favor. Weird. :-)
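
A report like the one described above could be sketched roughly as follows. The play records, the `wpa` field name, and the sign convention (positive means the swing favored the team of interest) are assumptions for illustration; a real play-by-play source may use different names and conventions.

```python
import heapq

def top_wpa_swings(plays, n=10):
    """Return the n plays with the largest absolute win-probability swing."""
    return heapq.nlargest(n, plays, key=lambda p: abs(p["wpa"]))

# Made-up plays; positive wpa is assumed to favor the team of interest.
plays = [
    {"desc": "late field goal", "wpa": 0.41},
    {"desc": "interception", "wpa": -0.22},
    {"desc": "short run", "wpa": 0.01},
    {"desc": "long touchdown pass", "wpa": 0.35},
]

top = top_wpa_swings(plays, n=3)
favorable = sum(1 for p in top if p["wpa"] > 0)
print([p["desc"] for p in top], favorable)
```

Counting how many of the top swings are positive gives a quick check of whether the dramatic plays were split between both teams or went mostly one way.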

MaxLeiter 5 years ago

> Passing becomes more and more popular as you use up your downs.

Just in case author sees this, this is wrong, right? It should read Running becomes more and more popular as you use up your downs? (Regarding https://j253.github.io/blog/images/article_01/01_play_by_dow...)

dragonwriter 5 years ago

> Just in case author sees this, this is wrong, right? It should read Running becomes more and more popular as you use up your downs?
Er, no, the author is right; the share of all plays that are passing plays goes up with down number (until dropping at 4), the graph shows that quite clearly.
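
The quantity in question is a share rather than a raw count, and it can be computed directly from play-by-play rows. This is a sketch with invented data: `pass_share_by_down` is a hypothetical helper and the `(down, play_type)` tuples do not come from the post's dataset.

```python
from collections import defaultdict

def pass_share_by_down(plays):
    """Map each down to the fraction of its plays that were passes."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for down, play_type in plays:
        totals[down] += 1
        if play_type == "pass":
            passes[down] += 1
    return {down: passes[down] / totals[down] for down in sorted(totals)}

# Invented (down, play_type) rows, NOT real data.
sample = [
    (1, "run"), (1, "run"), (1, "pass"), (1, "pass"),
    (3, "pass"), (3, "pass"), (3, "pass"), (3, "run"),
]
shares = pass_share_by_down(sample)
print(shares)  # {1: 0.5, 3: 0.75}
```

Note that raw pass counts can fall on later downs simply because fewer plays happen there, while the pass share still rises.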

catbird 5 years ago

Very cool! I would love to see heatmaps of play type with downs on one axis and yards-to-first-down on the other.

petersalas 5 years ago

Something like this?
http://www.yardsgained.com/#(passes_~_sacks)_~_first_down_at...
- catbird 5 years ago
  
  Oh wow, that site is excellent. I think this is the query I was thinking of before, the percentage of passes conditional on down and distance:
  http://www.yardsgained.com/#(passes_~_sacks)_~_(passes_~_sac...

seanplusplus 5 years ago

this is super cool! well done.