vladislav 6 years ago

There are some interesting connections between GPs and neural networks. Deep neural networks with random weights behave like Gaussian processes as the layers get wide, so the two may not be so far apart in practice.

https://arxiv.org/pdf/1711.00165.pdf
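
For a quick illustration (my own toy sketch, not code from the paper): sample many wide one-hidden-layer networks with random weights, and the output at a fixed input looks increasingly Gaussian as the width grows.

  import numpy as np

  # Toy illustration: outputs of wide, randomly initialized one-hidden-layer
  # networks at a fixed input look increasingly Gaussian as the width grows.
  rng = np.random.default_rng(0)
  x = np.array([[0.3, -1.2]])              # a single 2-d input
  width = 4096                             # hidden-layer width
  n_networks = 2000                        # number of random initializations

  outputs = []
  for _ in range(n_networks):
      W1 = rng.normal(0, 1 / np.sqrt(x.shape[1]), size=(x.shape[1], width))
      b1 = rng.normal(0, 1, size=width)
      W2 = rng.normal(0, 1 / np.sqrt(width), size=(width, 1))
      h = np.tanh(x @ W1 + b1)             # hidden activations, shape (1, width)
      outputs.append((h @ W2).item())      # scalar network output

  outputs = np.array(outputs)
  # For large widths these samples should be close to a zero-mean Gaussian.
  print(f"mean ~ {outputs.mean():.3f}, std ~ {outputs.std():.3f}")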

jkabrg 6 years ago

Where have people used Gaussian Processes to good effect? And how do they compare to competing models? There appears to be a lot of theory in this book, and I'm wondering how much of it is useful to applied data science.

  • cultus 6 years ago

    I'm a data scientist who uses gaussian processes all the time. They are:

    1. Typically very accurate.
    2. Backed by sound theory, with good uncertainty estimates.
    3. Easy to tune, since they're Bayesian models.

    The main competing models for some of the same tasks are gradient boosted decision trees and sometimes neural networks. GBTs win over NNs for most tasks in practice, although they don't get much hype. GPs do well with smooth data in my experience, while GBTs win on data that a limited number of bespoke decision-tree splitting rules can represent well.

    Interestingly, damn near anything, including neural networks, linear regression, and GBTs, can be interpreted as a gaussian process (or an approximation of one) by a certain choice of covariance function. GPs are just functions in a reproducing kernel Hilbert space defined by the covariance function, and that can include almost anything.

    GPs with full covariance matrices don't scale past a few thousand examples (exact inference is O(n^3)), but approximations can be made to scale to large datasets.
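
    For anyone who wants to try the exact (non-approximate) version, here's a minimal sketch with scikit-learn's GaussianProcessRegressor on toy data (my own example, not any of the setups described above). The kernel hyperparameters are fit by maximizing the marginal likelihood, which is part of why tuning feels easy.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import RBF, WhiteKernel

      rng = np.random.default_rng(0)
      X = rng.uniform(-3, 3, size=(200, 1))             # toy 1-d inputs
      y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # noisy smooth target

      # Smoothness prior (RBF) plus a learned noise level; hyperparameters are
      # fit by maximizing the marginal likelihood, so there's little to tune.
      kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
      gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

      X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
      mean, std = gp.predict(X_test, return_std=True)   # prediction + uncertainty
      print(mean, std)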

    • Xcelerate 6 years ago

      > GBTs win over NN for most tasks in practice, although they don't get much hype

      I've always thought GBDTs get too much hype. As a data scientist, I see everyone wanting to immediately throw a random forest or GBDT at a problem without knowing anything else about it.

      • mxwsn 6 years ago

        Yeah, I think in the data science community GBDTs are appropriately hyped, since their dominant performance on Kaggle has been well known for some time now. On top of that, GBDTs are so easy to run that it's probably always correct to fit one as one of the first things you do after you've got the data wrangled. Of course, as a PhD-in-training data scientist, I feel disappointed (either in myself or in the task) if I can't think of a more interesting and better-performing method than a GBDT :)
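
        Something like this is usually all that first baseline takes (a hedged sketch on stand-in data, using scikit-learn's histogram-based GBDT rather than any particular library from this thread):

          from sklearn.datasets import make_regression
          from sklearn.ensemble import HistGradientBoostingRegressor
          from sklearn.model_selection import cross_val_score

          # Stand-in data; in practice X, y are whatever came out of the wrangling.
          X, y = make_regression(n_samples=5000, n_features=20, noise=10.0,
                                 random_state=0)

          # Near-default settings are usually a decent first baseline.
          model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05)
          scores = cross_val_score(model, X, y, cv=5, scoring="r2")
          print(f"5-fold R^2: {scores.mean():.3f} +/- {scores.std():.3f}")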

    • yxchng 6 years ago

      How big are the datasets you usually deal with? From my experience, you hit a computational bottleneck pretty quickly with GPs (100k data points with maybe 100 dimensions is pretty much the max you can deal with).

      And if you are talking about datasets of this scale, then I agree with you that GPs are better than NNs. However, people are excited about NNs' ability to deal with immensely large, high-dimensional datasets, not small-scale ones.

      • cultus 6 years ago

        Thousands to millions. There are approximations that work very well with millions of examples.

        Neural networks are empirically outperformed by gradient boosted trees (look at Kaggle competitions) on most practical tasks except for image, sound, and video problems.

        Neural networks can be very slow on large datasets. Training can often take days or weeks, even with a GPU. GBTs and GP approximations are faster.
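
        The parent doesn't say which approximation they use; one common family is inducing-point (subset-of-regressors / Nyström-style) approximations, which replace the O(n^3) solve with an O(n m^2) one for m << n inducing points. A rough numpy sketch, purely illustrative:

          import numpy as np

          def rbf(A, B, lengthscale=1.0):
              # Squared-exponential kernel matrix between the rows of A and B.
              d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
              return np.exp(-0.5 * d2 / lengthscale ** 2)

          rng = np.random.default_rng(0)
          n, m, noise = 50_000, 200, 0.1            # n data points, m inducing points
          X = rng.uniform(-3, 3, size=(n, 1))
          y = np.sin(X).ravel() + rng.normal(0, noise, n)

          Z = X[rng.choice(n, m, replace=False)]    # inducing points: a random subset

          K_mm = rbf(Z, Z)                          # m x m
          K_nm = rbf(X, Z)                          # n x m (the n x n matrix is never formed)

          # Subset-of-regressors posterior mean: solve an m x m system, O(n m^2) overall.
          A = noise ** 2 * K_mm + K_nm.T @ K_nm
          w = np.linalg.solve(A, K_nm.T @ y)

          X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
          print(rbf(X_test, Z) @ w)                 # approximate predictive mean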

  • cfcf14 6 years ago

    I've used Gaussian Processes to great effect in the field of energy regression/forecasting for commercial buildings, and have found them to be generally superior to other approaches due to the richness of prior information you can encode into the GP kernel. I'll be attending http://gpss.cc/ this year as well, so you could consider me something of an evangelist!

    For example, you might approach a regression problem with the information that you expect the outcome to vary very smoothly w.r.t. covariate 1, periodically (cyclically) over a long timespan w.r.t. covariate 2, and periodically but non-smoothly (with a known period) w.r.t. covariates 3 and 4.

    You don't know the exact form of these relationships (i.e. what kind of periodicity exactly, or the nature of the 'smooth' relationship), but you're pretty sure they are related in that way.

    GP regression allows you to draw from a posterior distribution over a space of functions with those properties, conditioned on the observed data. The resulting credible intervals (HPD) are directly interpretable and meaningful without hand-wringing. And in a much more practical sense, the results you get tend to be extremely good compared to other techniques (trees, neural nets, GAMs, etc.).
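
    As a sketch of how that kind of prior structure gets written down in practice (the covariates, periods, and data here are invented, not the building-energy setup above), scikit-learn lets you compose kernels with + and *:

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import (
          RBF, ExpSineSquared, WhiteKernel, ConstantKernel as C)

      # Invented daily series: a slow smooth trend plus a weekly cycle plus noise.
      rng = np.random.default_rng(0)
      t = np.arange(365.0).reshape(-1, 1)
      y = 0.01 * t.ravel() + np.sin(2 * np.pi * t.ravel() / 7) + rng.normal(0, 0.2, 365)

      kernel = (
          C(1.0) * RBF(length_scale=60.0)                               # smooth long-term trend
          + C(1.0) * ExpSineSquared(length_scale=1.0, periodicity=7.0)  # weekly cycle, shape learned
          + WhiteKernel(noise_level=0.05)                               # observation noise
      )
      gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)
      mean, std = gp.predict(np.array([[366.0], [370.0]]), return_std=True)
      print(mean, std)                          # forecasts with credible widths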

    Other very useful applications of GPs include hyperparameter optimization, classification, and GP latent variable models (GPLVMs) for dimensionality reduction and manifold embeddings. The only downside, which other users have mentioned, is the time and space complexity of inverting the covariance matrix, which makes the computation O(n^3). Various approaches to making GP fitting more performant, including sparse GP models with inducing points and variational approximations to the posterior, help a bit.

    My personal feeling is that these models fall squarely into the category of 'clearly the best option if it weren't for the computational complexity'. If that bridge is crossed, we're going to see their popularity grow dramatically.

  • jofer 6 years ago

    They've also been widely used within the geosciences for about 40 years. We refer to gaussian process regression as "kriging" (after the originator of the method) and classification as "indicator kriging", but it's the same method.

    If you've ever heard anyone refer to what they're doing as "geostatistics", they were almost definitely doing a regression problem using gaussian processes.

  • mxwsn 6 years ago

    I consider GPs in two ways: as a tool for hyperparameter optimization (where a GP surrogate is the most common instantiation of Bayesian optimization), and as a powerful non-linear regression method that supports uncertainty via the rich posterior distributions you get with GPs.

    However, on the hyperparameter optimization side, an experiment comparing Bayesian optimization against random search given double the computational resources [0] suggests that BO is not even 2x better than random search, which, to me, means that in practice BO isn't my first go-to for hyperparameter optimization.

    On the regression side, its poor time complexity (you need to invert the kernel matrix, which is O(n^3) and slow in practice beyond maybe 1000 points) makes it less of a first go-to for me than neural networks, which perhaps irrationally benefit from their current 'hotness', but for which there are also well-explored strategies for generating predictions with uncertainty [1]. Of course, GPs compare significantly more favorably on the axis of interpretability, since you can get a closed-form posterior distribution for any point of interest, and the manner in which your dataset influences this posterior is transparent.
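
    For concreteness, the closed-form posterior and the O(n^3) step look like this in plain numpy (a textbook-style sketch on toy data, using a Cholesky factorization rather than an explicit inverse):

      import numpy as np
      from scipy.linalg import cho_factor, cho_solve

      def rbf(A, B, lengthscale=1.0):
          # Squared-exponential kernel matrix between the rows of A and B.
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-0.5 * d2 / lengthscale ** 2)

      rng = np.random.default_rng(0)
      n, noise = 1000, 0.1
      X = rng.uniform(-3, 3, size=(n, 1))
      y = np.sin(X).ravel() + rng.normal(0, noise, n)

      # The O(n^3) step: factorize the n x n kernel matrix once.
      K = rbf(X, X) + noise ** 2 * np.eye(n)
      L = cho_factor(K, lower=True)
      alpha = cho_solve(L, y)

      # Closed-form posterior at a test point: mean k_*^T alpha,
      # variance k(x*, x*) - k_*^T K^{-1} k_*.
      x_star = np.array([[0.5]])
      k_star = rbf(X, x_star).ravel()
      mean = k_star @ alpha
      var = rbf(x_star, x_star).item() - k_star @ cho_solve(L, k_star)
      print(mean, var)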

    There have been, however, interesting developments in improving the efficiency of GP regression for large datasets, such as estimating the GP with a reduced and weighted set of data points [2]. The recent DeepMind paper on neural processes is also interesting as a combination of GPs and neural networks, and it boasts strong scalability [3].

    [0]: http://www.argmin.net/2016/06/23/hyperband/

    [1]: https://eng.uber.com/neural-networks-uncertainty-estimation/

    [2]: https://arxiv.org/abs/1106.5779

    [3]: https://arxiv.org/abs/1807.01622

  • thiago_lira 6 years ago

    I'm thinking about studying these for my MSc. A colleague has given me the idea of applying the Monte Carlo Dropout technique [1] (an interpretation of dropout as a GP approximation) to the CEAL framework [0].

    On the other hand my mentor wanted me to work on a regression problem with a dataset from a cement factory. Have people been using GPs for problems like that?

    [0] https://arxiv.org/abs/1701.03551 (CEAL) [1] https://arxiv.org/pdf/1506.02142 (MC dropout)

  • nbap 6 years ago

    A friend of mine is using Gaussian Processes with ML to fill gaps in time series. He is working on a project for a government agency that analyses data from roadside sensors (traffic counts, speed, etc.), so they are using Gaussian Processes to fill data gaps when those sensors fail.

    His group published a paper about it: https://trid.trb.org/view/1496472

    He also wrote about it in his MSc thesis but I can't find a link for it.

  • mandor 6 years ago

    Gaussian Processes are very good when you do not have much data (< 500 points) and your data are low-dimensional (< 10 dimensions). In that regime, they are more accurate than anything else, and they provide both a prediction and a variance.

  • abhgh 6 years ago

    One of their more interesting applications is in Bayesian Optimization (BO). You'd use BO where evaluating the objective function is expensive, as opposed to, let's say, settings where you might use gradient descent and navigate your space via repeated evaluation of the objective function.

    BO essentially starts building its own model of what the objective function looks like in relation to its parameters, and picks a few promising points to evaluate it at. Thus, it moves the burden of evaluating a costly objective multiple times to identifying these highly promising points.

    A BO algorithm can use a GP to build this model, since the uncertainty estimates that come with it are fairly valuable and help in this search for promising points. For example, the GP can say that at so-and-so value of the parameter the objective function is expected to do well, and it's confident about it, hence this point can be explored next ... or it can say that in a particular region it's extremely uncertain of its estimates, hence the region needs exploring. The BO keeps updating its model every time the objective function is evaluated. Since at any instant the model encodes knowledge from all past evaluations, this is also sometimes known as Sequential Model-Based Optimization (SMBO).
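
    A toy sketch of that loop (the objective, grid, and acquisition rule here are all invented for illustration), using a GP surrogate from scikit-learn and an upper-confidence-bound rule to pick the next point:

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import Matern

      def expensive_objective(x):
          # Stand-in for an expensive evaluation, e.g. training a model with
          # hyperparameter x and returning its validation score.
          return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

      grid = np.linspace(0, 1, 200).reshape(-1, 1)    # candidate hyperparameter values
      evaluated = {10, 120}                           # indices of two initial evaluations
      X_obs = [grid[i] for i in evaluated]
      y_obs = [expensive_objective(x[0]) for x in X_obs]

      gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
      for _ in range(10):                             # budget of 10 more expensive calls
          gp.fit(np.array(X_obs), np.array(y_obs))
          mean, std = gp.predict(grid, return_std=True)
          ucb = mean + 2.0 * std                      # promising: high mean OR high uncertainty
          ucb[list(evaluated)] = -np.inf              # don't re-query points already paid for
          idx = int(np.argmax(ucb))
          evaluated.add(idx)
          X_obs.append(grid[idx])
          y_obs.append(expensive_objective(grid[idx][0]))

      print("best x:", X_obs[int(np.argmax(y_obs))][0], "value:", max(y_obs))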

    If you need a practical example of its use within machine learning, BO with GPs has been used to find good hyperparameters for a model - [1], [2]. Traditionally you'd perform a grid search in the space of hyperparameters, building a model at every point on the grid; BO tells you that you don't really need to do that --- you can be smart about which points on the grid you actually build your model at.

    The problem with GP-based BO is scaling. There has been a fair amount of research (it's an active area) to address this - sometimes by performing BO without GPs, and sometimes by using GPs but trying to speed them up [3].

    [1] A good tutorial on the approach - https://arxiv.org/abs/1012.2599

    [2] A library for this: https://github.com/JasperSnoek/spearmint (the corresponding paper is linked).

    [3] This paper is a good example - it suggests doing BO both using GPs (in this case suggesting speed-ups, Section 3) and without it (section 4) https://papers.nips.cc/paper/4443-algorithms-for-hyper-param... . In fact [2] is also a good example of speeding up GP-based BO.

  • karthiktharava 6 years ago

    Being non-parametric models, they are pretty garbage at generalizing outside the domain of the data they have "seen", and they are also not very data-efficient. One area where I have seen success for them is when they are applied to automatic hyperparameter selection for neural networks.