solidasparagus 5 years ago

This is a fun project and there's plenty to learn even if you don't end up with a great model.

But if you really want to try to solve this problem, you're going to need more granular data - play-by-play, per-play lineup, injuries, player tracking if you can get your hands on it. The stats you are using are very lossy summaries of the season, so they aren't very strong predictive features.

From a technical POV, consider bucketing some of those per-game stats (e.g. 4 binary features representing which quartile the team's stat falls in compared to every other team that season). This can help to adjust for year-to-year differences. Work with pace-adjusted stats if you have access to them. Find a baseline accuracy by picking the simplest possible non-ML strategy and measuring how accurate that is (e.g. what is the accuracy of a model that always picks the better seed or W/L record?).
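
A rough sketch of the two ideas above (quartile bucketing within a season, and a better-seed baseline), in plain Python. The field names (`season`, `team`, `seed_a`, `seed_b`, `winner`) are hypothetical and not taken from the original project; ties in a stat are broken arbitrarily by sort order.

```python
from collections import defaultdict


def quartile_one_hot(rows, stat):
    """Bucket a per-game stat into within-season quartiles.

    rows: list of dicts with 'season', 'team', and the `stat` key
          (hypothetical schema, for illustration only).
    Returns {(season, team): [q1, q2, q3, q4]} with exactly one 1 per list,
    where q1 is the bottom quartile and q4 the top quartile of that season.
    """
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)

    features = {}
    for season, season_rows in by_season.items():
        # Rank teams only against other teams from the same season.
        ordered = sorted(season_rows, key=lambda r: r[stat])
        n = len(ordered)
        for rank, row in enumerate(ordered):
            quartile = rank * 4 // n  # 0..3 because rank < n
            one_hot = [0, 0, 0, 0]
            one_hot[quartile] = 1
            features[(season, row["team"])] = one_hot
    return features


def baseline_accuracy(games):
    """Accuracy of always picking the better (lower-numbered) seed.

    games: list of dicts with 'seed_a', 'seed_b', and 'winner' ('a' or 'b').
    Games between equal seeds are skipped, since the rule makes no pick.
    """
    decided = [g for g in games if g["seed_a"] != g["seed_b"]]
    correct = sum(
        1
        for g in decided
        if ("a" if g["seed_a"] < g["seed_b"] else "b") == g["winner"]
    )
    return correct / len(decided)
```

Whatever number `baseline_accuracy` returns on your historical games is the bar the ML model has to clear on the same games.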

You need to adjust your training data to only use data that would have been available at the time of the game - if W/L or PPG includes data from future games, this is a form of data snooping and will probably give you results on your test set that won't generalize to the real world. Time-series snooping is a very easy mistake to make, but it's crucial to avoid it in order to build a good model.
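
One way to avoid the leak, sketched under assumptions: walk the games in date order and compute each team's win percentage from earlier games only, updating the running totals after the feature row is recorded. The schema (`date` as an ISO string, `home`, `away`, `home_won`) is made up for illustration.

```python
from collections import defaultdict


def add_pregame_win_pct(games):
    """Attach point-in-time win percentages to each game.

    games: list of dicts with 'date' (ISO 'YYYY-MM-DD' string), 'home',
           'away', and 'home_won' (bool). Hypothetical schema.
    Returns new dicts in date order. Each team's win pct uses only games
    that come earlier in that order; it is None before a team's first game.
    """
    wins = defaultdict(int)
    played = defaultdict(int)

    def win_pct(team):
        return wins[team] / played[team] if played[team] else None

    out = []
    for game in sorted(games, key=lambda g: g["date"]):
        # Record features BEFORE this game's result touches the totals.
        out.append(
            {
                **game,
                "home_win_pct": win_pct(game["home"]),
                "away_win_pct": win_pct(game["away"]),
            }
        )
        played[game["home"]] += 1
        played[game["away"]] += 1
        winner = game["home"] if game["home_won"] else game["away"]
        wins[winner] += 1
    return out
```

The same pattern applies to PPG or any other running stat: the feature for game N may depend only on games 1 through N-1.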

Interesting work, thanks for sharing!

  • ham_sandwich 5 years ago

    You’re right. Like a lot of other engineers, I once thought “ML+High level team info=$$$$” but quickly learned that you really can’t get an edge unless you’re digging into that more granular data and even then it’s really tough. It can be very hard to improve on simple linear models.

    The odds coming out of Vegas are usually priced correctly. Sports markets are very efficient, although perhaps not as ruthlessly efficient as public equity markets. I would imagine there are still syndicates out there that are the “RenTec of sports betting” and just printing alpha.

QuackingJimbo 5 years ago

> One in 9.2 quintillion. Those are the odds that you will correctly pick the winners of all 63 games played over the course of the tournament. Mathematically speaking, there are 2^63 (~9.2 quintillion) number of ways that you can fill the bracket

This is wrong. Many of the games are not even close to 50-50.
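
To make that concrete, here is a small sketch. The chance of a perfect bracket is the product of the per-game probabilities that each pick is right, assuming the games are independent. The 0.7 below is a purely illustrative number, not a measured favorite win rate.

```python
import math


def perfect_bracket_odds(pick_probs):
    """Return N such that a bracket is perfect with probability 1 in N.

    pick_probs: probability that each individual pick is correct.
    Assumes the games are independent, which is a simplification.
    """
    # Sum logs instead of multiplying to avoid underflow on long lists.
    log_p = sum(math.log(p) for p in pick_probs)
    return math.exp(-log_p)


# Coin-flip assumption from the article: 63 games at 50-50.
coin_flip = perfect_bracket_odds([0.5] * 63)  # about 2**63, i.e. ~9.2 quintillion

# Illustrative only: every pick is right 70% of the time.
favorites = perfect_bracket_odds([0.7] * 63)

print(f"coin flip: 1 in {coin_flip:.3g}")
print(f"70% picks: 1 in {favorites:.3g}")
```

Any per-game probability above 0.5 shrinks the 1-in-N figure by many orders of magnitude, which is why the 2^63 number overstates the difficulty.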

  • dagw 5 years ago

    According to: https://math.duke.edu/news/duke-math-professor-says-odds-per... by using all available knowledge from seeding and betting odds you can get your odds down to a mere 1 in 2.4 trillion. Or perhaps even as good as 1 in 128 billion if this guy: https://www.youtube.com/watch?v=O6Smkv11Mj4 is to be believed.

    • jsjohnst 5 years ago

      So even in the best case you list (which I share your skepticism of), you are 1,000x more likely to win the MegaMillions or Powerball jackpot than pick the perfect winning grid.

      To put the odds of picking a perfect grid by hand in rough perspective, it would be like winning the lottery, boarding a plane that crashes into the ocean, being the only survivor floating in the ocean, then being struck by lightning 3 times and surviving, only to be eaten by a shark, all in a single day.

      • dagw 5 years ago

        > So even in the best case you list (which I share your skepticism of), you are 1,000x more likely to win the MegaMillions or Powerball jackpot than pick the perfect winning grid.

        That makes me wonder what odds you'd get at the bookmakers for a perfect winning grid.

        • jsjohnst 5 years ago

          I think I remember seeing a prize somewhere of an obscene amount of money (memory is hazy, but was something like over $100M, maybe even a billion) for a perfect winning grid.

          Edit: it was a billion.

          https://genius.com/Warren-buffett-billion-dollar-march-madne...

          • dagw 5 years ago

            Well since submission is free that bet has a positive expected value! Much better than the lottery.

  • chrisweekly 5 years ago

    Yet last year, the "unthinkable" happened, when #1 UVA lost in the 1st round to a 16 seed.

    • alextheparrot 5 years ago

      The odds being in favor of one team does not preclude the other team from winning.

dandigangi 5 years ago

Our non-data scientists are eager to steal this so they have a chance to beat our data scientists. Someone asked me how to install Python to run this.

XD

  • loblollyboy 5 years ago

    Interesting blog post but looks like they'd be better off just guessing

rococode 5 years ago

Cool project! It sounds like you've run this in previous years, so I'm curious - how well has the model done in the past?

  • aaaaaaaaaaab 5 years ago

    Probably about as well as trying to predict the result of a coin toss.

    • throwawaymath 5 years ago

      If that were actually the case, this model would be far and away the best ever developed!

    • jjuel 5 years ago

      I mean, do you really think NCAA basketball games are just a random coin toss? If that were the case, wouldn't more than one 16 seed have won a game? Wouldn't more lower seeds have won the whole tournament? In that case, theoretically every team has the same chance to win it all, yet the lowest seed ever to win was an 8 seed.

    • klohto 5 years ago

      So, 50%? :)

  • adeshpande3 5 years ago

    Yeah, tonmoy linked to the blog post that had the predictions. Unfortunately, the model pretty much predicted a higher probability for the higher-seeded team in each matchup, which led to some relatively expected predictions. If I remember correctly, it led to a pretty average percentile (50-60) when I submitted it to the ESPN leaderboards.

bitxbit 5 years ago

I’d be interested in seeing something like this to create winning game strategies that coaches can utilize.

  • navigatesol 5 years ago

    This exists, kinda:

    http://www.sloansportsconference.com/wp-content/uploads/2018...

    There was a mainstream article written about the tech a while back, which had GIFs demonstrating simulations, but I failed to find it. Essentially, it would model what the players did versus what they should have done, optimally.

  • shehryarrr 5 years ago

    I actually used to work for a company where we did exactly that. We did it based on in-game footage that we would stat and annotate. It was a really interesting product. I think Second Spectrum does something similar now for pro basketball teams.