This is a fun project and there's plenty to learn even if you don't end up with a great model.
But if you really want to try to solve this problem, you're going to need more granular data - play-by-play, per-play lineup, injuries, player tracking if you can get your hands on it. The stats you are using are very lossy summaries of the season, so they aren't very strong predictive features.
From a technical POV, consider bucketing some of those per-game stats (e.g. 4 binary features representing which quartile the team's stat falls in compared to every other team that season). This can help to adjust for year-to-year differences. Work with pace-adjusted stats if you have access to them. Find a baseline accuracy by picking the simplest possible non-ML strategy and measuring how accurate that is (e.g. what is the accuracy of a model that always picks the better seed or W/L record?).
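To make the bucketing and baseline ideas concrete, here's a rough pandas sketch. The column names and toy data are made up for illustration; this is just the shape of the idea, not a real pipeline:

```python
import pandas as pd

# Hypothetical per-team season stats; names and values are invented.
teams = pd.DataFrame({
    "season": [2018] * 4,
    "team": ["A", "B", "C", "D"],
    "ppg": [68.0, 74.5, 81.2, 77.3],
})

# Rank each stat within its own season and bucket into quartiles (0-3),
# then one-hot encode so the model sees 4 binary features. Bucketing
# within-season is what adjusts for year-to-year drift in raw stats.
teams["ppg_quartile"] = (
    teams.groupby("season")["ppg"]
    .transform(lambda s: pd.qcut(s, 4, labels=False))
)
features = pd.get_dummies(teams["ppg_quartile"], prefix="ppg_q")

# Baseline: how often does "always pick the better (lower) seed" win?
# Toy game log; real tournament results would go here.
games = pd.DataFrame({
    "seed_a": [1, 8, 5],
    "seed_b": [16, 9, 12],
    "a_won": [True, False, True],
})
pick_a = games["seed_a"] < games["seed_b"]
baseline_acc = (pick_a == games["a_won"]).mean()
print(f"better-seed baseline accuracy: {baseline_acc:.2f}")
```

Any model that can't beat that baseline number on held-out tournaments isn't adding value over the seeding committee.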
You need to adjust your training data to only use data that would have been available at the time of the game - if W/L or PPG includes data from future games, this is a form of data snooping and will probably give you results on your test set that won't generalize to the real world. Time-series snooping is a very easy mistake to make, but it's crucial to avoid it in order to build a good model.
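One cheap way to guard against this in pandas is to `shift` before aggregating, so the feature row for each game only summarizes games that had already been played. Toy data with hypothetical columns:

```python
import pandas as pd

# Hypothetical game log for one team, in chronological order.
log = pd.DataFrame({
    "game_date": pd.to_datetime(["2019-01-02", "2019-01-09", "2019-01-16"]),
    "points": [70, 80, 90],
    "won": [1, 0, 1],
})
log = log.sort_values("game_date")

# shift(1) before the expanding aggregation ensures the feature for
# game i only sees games 0..i-1 -- never the game being predicted.
log["ppg_to_date"] = log["points"].shift(1).expanding().mean()
log["winpct_to_date"] = log["won"].shift(1).expanding().mean()
print(log[["game_date", "ppg_to_date", "winpct_to_date"]])
```

Note the first game has no prior data (NaN), which is correct: pretending you knew a team's season PPG before its first game is exactly the leak being described.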
Interesting work, thanks for sharing!
You’re right. Like a lot of other engineers, I once thought “ML + high-level team info = $$$$” but quickly learned that you really can’t get an edge unless you’re digging into that more granular data, and even then it’s really tough. It can be very hard to improve on simple linear models.
The odds coming out of Vegas are usually priced correctly. Sports markets are very efficient—although perhaps not as ruthlessly efficient as public equity markets. I would imagine there are still syndicates out there that are the “RenTec of sports betting” and just printing alpha.
This is a great article about how the actions of some players are very predictive of the outcome of the game even if their stats don't reflect that: https://www.nytimes.com/2009/02/15/magazine/15Battier-t.html
> One in 9.2 quintillion. Those are the odds that you will correctly pick the winners of all 63 games played over the course of the tournament. Mathematically speaking, there are 2^63 (~9.2 quintillion) number of ways that you can fill the bracket
This is wrong. Many of the games are not even close to 50-50.
According to: https://math.duke.edu/news/duke-math-professor-says-odds-per... by using all available knowledge from seeding and betting odds you can get your odds down to a mere 1 in 2.4 trillion. Or perhaps even as good as 1 in 128 billion if this guy: https://www.youtube.com/watch?v=O6Smkv11Mj4 is to be believed.
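The arithmetic behind those numbers is easy to sanity-check: a coin-flip bracket has 2^63 outcomes, but if you assume you win each of the 63 picks with some average probability p > 0.5, a perfect bracket has probability p^63 and the odds collapse fast. The p values below are illustrative, not real estimates:

```python
# 63 games, each modeled as a fair coin flip: 2**63 possible brackets.
coin_flip_brackets = 2 ** 63
print(f"coin-flip brackets: {coin_flip_brackets:.2e}")  # ~9.22e18

# If you instead win each pick with average probability p, the chance
# of a perfect bracket is p**63. Small edges compound enormously over
# 63 games. These p values are made up for illustration.
for p in (0.5, 0.65, 0.75):
    print(f"p={p}: 1 in {1 / p ** 63:,.3e}")
```

Even an implausibly good 75%-per-game picker is still looking at odds in the tens of millions to one.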
So even in the best case you list (which I share your skepticism of), you are 1,000x more likely to win the MegaMillions or Powerball jackpot than pick the perfect winning grid.
To put the odds of picking a perfect grid by hand in rough perspective: it would be like winning the lottery, boarding a plane that crashes into the ocean, being the sole survivor floating in the water, then being struck by lightning three times and living, only to be eaten by a shark, all in a single day.
That makes me wonder what odds you'd get at the bookmakers for a perfect winning grid.
I think I remember seeing a prize somewhere of an obscene amount of money (memory is hazy, but was something like over $100M, maybe even a billion) for a perfect winning grid.
Edit: it was a billion.
https://genius.com/Warren-buffett-billion-dollar-march-madne...
Well since submission is free that bet has a positive expected value! Much better than the lottery.
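Back of the envelope, using the most optimistic odds quoted above (1 in 128 billion) and the $1B prize:

```python
# Rough expected value of a single free perfect-bracket entry.
# Entry cost is $0, so EV is strictly positive -- just not by much.
p_perfect = 1 / 128e9   # most optimistic odds quoted above
prize = 1e9             # the $1B Buffett prize
ev = p_perfect * prize
print(f"expected value per free entry: ${ev:.4f}")
```

That works out to a bit under a penny per entry: positive EV, but nobody is retiring on it.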
Yet last year, the "unthinkable" happened, when #1 UVA lost in the 1st round to a 16 seed.
The odds being in favor of one team does not preclude the other team from winning.
Wow, you don't say...
Here's the Kaggle competition for this year, with many more potential starting points and ressources: https://www.kaggle.com/c/womens-machine-learning-competition...
Our non-data scientists are eager to steal this so they have a chance to beat our data scientists. Someone asked me how to install Python to run this.
XD
Interesting blog post, but it looks like they'd be better off just guessing.
Cool project! It sounds like you've run this in previous years, so I'm curious - how well has the model done in the past?
Probably about as well as trying to predict the result of a coin toss.
If that were actually the case, this model would be far and away the best ever developed!
I mean, do you really think NCAA basketball games are just a random coin toss? If that were the case, wouldn't more than one 16 seed have won a game? Wouldn't more lower seeds have won the whole tournament? In that case every team would theoretically have the same chance to win it all, yet the lowest seed ever to win was an 8 seed.
So, 50%? :)
Yeah, tonmoy linked to the blog post that had the predictions. Unfortunately, the model pretty much predicted a higher probability for the higher-seeded team in each matchup, which led to some fairly expected predictions. If I remember correctly, it landed at a pretty average percentile (50-60) when I submitted it to the ESPN leaderboards.
He has predictions from 2017 and 2018 in this blog post I think
https://adeshpande3.github.io/adeshpande3.github.io/Applying...
I’d be interested in seeing something like this to create winning game strategies that coaches can utilize.
This exists, kinda:
http://www.sloansportsconference.com/wp-content/uploads/2018...
There was a mainstream article written about the tech a while back, which had GIFs demonstrating simulations, but I failed to find it. Essentially, it would model what the players did versus what they should have done, optimally.
This one? http://grantland.com/features/the-toronto-raptors-sportvu-ca...
That is very cool. I was thinking something more macro where you can see the expected outcome based on different aspects of game strategies such as pace, mix of plays, etc.
I actually used to work for a company where we did exactly that. We did it based off of in-game footage that we would stat and annotate; it was a really interesting product. I think Second Spectrum does something similar now for pro basketball teams.