versteegen 13 days ago

It's surreal to see this on the front page since I worked on minimax entropy learning for texture modelling (following these authors), then abandoned that line of research because I wanted to work on concept induction and reasoning instead. I'm amazed these authors are still drawing a direct line through 27 years of their papers.

To explain what this is about (not having read the whole paper): it's using 'classical', small-sample-size statistics to infer relations between very small sets of data, such as individual questions of an IQ test. Obviously that's going to give the best possible answer, because that's what classical statistics does. (Well, it would if you could solve it exactly, but with this model you actually can't.) Then it's just a question of having a good prior over possible concepts. In this paper the concepts are hand-defined parametrized features (aka 'filters' in the paper, from the texture modelling lineage), e.g. "families of arithmetic expressions of length from two to six" for the number problems. There will be a lot of details involved in that.
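
To make "a prior over possible concepts" concrete, here's a toy sketch of my own (not the paper's actual feature families, and the rule names are made up): enumerate a couple of hand-defined parametrized families of arithmetic rules, keep the ones consistent with the observed items, and prefer the one with the fewest free parameters.

    # Toy concept induction for a number-series problem. Hand-defined,
    # parametrized rule families stand in for the paper's 'filters'.
    import itertools

    observed = [2, 5, 10, 17]   # e.g. one row of a number-series question

    def candidates():
        # Family 1: affine rules a*n + b (two free parameters)
        for a, b in itertools.product(range(-3, 4), repeat=2):
            yield (f"{a}*n + {b}", lambda n, a=a, b=b: a * n + b, 2)
        # Family 2: quadratic rules n**2 + c (one free parameter)
        for c in range(-3, 4):
            yield (f"n**2 + {c}", lambda n, c=c: n * n + c, 1)

    def fits(rule):
        return all(rule(n) == x for n, x in enumerate(observed, start=1))

    # Simplicity prior: among rules that explain the data, take the one
    # with the fewest free parameters.
    consistent = [(name, k) for name, rule, k in candidates() if fits(rule)]
    print(min(consistent, key=lambda t: t[1])[0])   # -> "n**2 + 1"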

Alternatively, something like a deep neural network trained on vast numbers of generated examples could come to learn what a suitable prior should be by learning the patterns used to generate them. Yet another option is to use a universal prior as in Solomonoff induction, based on representation/coding length.

My conclusion: the theory is good and I'm happy to see this, but I don't think you can directly apply it to learning 'concepts' in anything other than IQ tests. The hard parts of generalising this are, firstly, finding the prior over the concepts, and secondly, the difficult optimisation step: using MCMC as part of the optimisation is both extremely slow and stochastic in a bad way (unlike SGD), which is the reason I abandoned this approach.

  • p1esk 13 days ago

    > deep neural network trained on vast numbers of generated examples could come to learn what a suitable prior should be by learning the patterns used to generate them

    This sounds promising. Do you see any potential issues with this approach?

    • versteegen 12 days ago

      DNNs are very good at learning features, so they are a natural fit if you have lots of training data. The main drawback is that you've then lost the ability to work with small data. The other drawback is the computational cost of the maxent model on top of it. So you get both the best and the worst of both halves.

      So:

      input --DNN--> feature vector --minimax_entropy_optimisation--> maxent model --sampling/inference--> output

      But minimax entropy learning is just learning a model that matches statistics of a set of features (filters). If you just want to draw a single sample from that model, or the maximum-likelihood point (e.g. you just want the answer to an IQ question), rather than doing complicated inference (in the DNN era, people have forgotten that "inference" is far more general than classification), then you don't actually need the model at all: you can skip learning it and go straight to computing a result with the desired statistics. This is around 100-1000x faster.

      input --DNN--> feature vector --statistics_matching--> output
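
      As a rough sketch of that "statistics_matching" arrow (my own toy PyTorch code, not from the paper or my thesis): start from noise and do gradient descent on the image itself until selected feature statistics, here Gram matrices of VGG feature maps, match those of an exemplar.

          # Toy statistics matching: optimise an image so its VGG feature
          # statistics (Gram matrices) match those of an exemplar image.
          import torch
          import torchvision.models as models

          vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
          for p in vgg.parameters():
              p.requires_grad_(False)

          layers = {1, 6, 11, 20}   # which feature maps supply the statistics

          def gram_stats(img):
              stats, x = [], img
              for i, layer in enumerate(vgg):
                  x = layer(x)
                  if i in layers:
                      f = x.flatten(2)                                  # (1, C, H*W)
                      stats.append(f @ f.transpose(1, 2) / f.shape[-1]) # Gram matrix
              return stats

          exemplar = torch.rand(1, 3, 256, 256)   # stand-in for a real exemplar image
          target = [g.detach() for g in gram_stats(exemplar)]
          x = torch.rand(1, 3, 256, 256, requires_grad=True)
          opt = torch.optim.Adam([x], lr=0.05)
          for _ in range(500):
              opt.zero_grad()
              loss = sum(((s - t) ** 2).mean() for s, t in zip(gram_stats(x), target))
              loss.backward()
              opt.step()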

      This is what I mostly did in my PhD work, and there's a famous paper doing this:

      A Neural Algorithm of Artistic Style https://arxiv.org/abs/1508.06576

      This has a sister paper which is even purer statistics-of-features-matching:

      Texture Synthesis Using Convolutional Neural Networks https://arxiv.org/abs/1505.07376

      However, people soon worked out that you can just train a DNN to do the statistics matching directly: initially with a fixed target style, and later taking the style as a second input. If the results are acceptable, this is vastly faster at inference time.

      input --DNN--> output
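
      That amortised version looks something like this (again just my own illustrative sketch, reusing gram_stats and target from above): train a small generator once, so that a single forward pass produces outputs whose statistics already match.

          # Amortised statistics matching: train a generator so that one
          # forward pass (rather than per-image optimisation) matches the
          # target feature statistics.
          import torch
          import torch.nn as nn

          gen = nn.Sequential(   # deliberately tiny generator
              nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
          )
          opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
          for _ in range(2000):
              z = torch.rand(1, 3, 256, 256)   # noise input
              opt.zero_grad()
              loss = sum(((s - t) ** 2).mean()
                         for s, t in zip(gram_stats(gen(z)), target))
              loss.backward()
              opt.step()
          # After training, generation is a single forward pass: gen(noise).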

      Using them in this way would likely mean just replacing the final layer. The original papers on 'style transfer' did this, a DNN plus minimax entropy learning. has been done in texture and image models.

      • versteegen 10 days ago

        (Whoops, ignore that last line; just fragments of unfinished sentences)

    • uoaei 13 days ago

      This is how NN theory describes the process of training, i.e., that's the mechanism by which all neural networks come to represent the problem they are being applied to solve. GP is using a fancy way to say "train NNs on massive datasets" which is par for the course.

shiandow 13 days ago

It's perhaps useful to note that their notation is not accidental.

When they state:

    P = (1/Z) exp(−Σ_i λ_i H_i)
They're directly referring to the Gibbs distribution that pops up in thermodynamics. The H_i are the different terms of the Hamiltonian and Z is the corresponding partition function.

From that point of view, they're taking a couple of 'response functions', treating them as if they were terms of some kind of energy, and finding the statistical distribution such a system would have at different temperatures, depending on the average value of each term. The next step is then to find the value for each term that best predicts the observed distribution.

It's a bit like observing how a system behaves at different volumes and then using that information to determine the temperature, pressure and volume of an observed system. (Perhaps volume isn't the best example, but something like molarity and chemical potential is hard to visualize.)
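
Here's a small numerical toy of that picture (my own sketch, with made-up H_i): over a tiny discrete state space, fit the λ_i by gradient ascent on the likelihood so that the model's average H_i match the observed averages; the resulting distribution is exactly the Gibbs form above.

    # Toy Gibbs / maxent fit on a discrete state space: choose the lambdas
    # so the model's averages of the H_i match the observed averages.
    import numpy as np

    states = np.arange(8)                          # tiny state space {0..7}
    H = np.stack([states, (states - 3.0) ** 2])    # two made-up 'Hamiltonian' terms
    observed_means = np.array([2.5, 4.0])          # target statistics E[H_i]

    lam = np.zeros(2)
    for _ in range(2000):
        logits = -lam @ H                          # -sum_i lambda_i H_i(x)
        p = np.exp(logits - logits.max())
        p /= p.sum()                               # the 1/Z normalisation
        model_means = H @ p                        # E_p[H_i]
        lam += 0.01 * (model_means - observed_means)   # log-likelihood gradient
    print(lam, model_means)                        # model_means -> observed_means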

  • versteegen 13 days ago

    Yes, but...

    > some kind of energy

    The negenergy (negative energy) here is just the (unnormalised, hence the 1/Z term) log-probability mass or density, like the logit outputs of a classifier.

    You shouldn't read anything more into the 'energy'. There is no temperature parameter in these minimax models (although adding one to smooth the model during sampling or learning can be useful, to interpolate between pure noise and the learnt model).

    • shiandow 11 days ago

      I've found that reading more into these 'accidental' similarities is a lot more useful than not reading into them at all.

      For what it's worth, there is a parameter in front that functions as a kind of temperature, except there's more than one parameter, so it's not an exact translation. You can still reason about it somewhat (or split it up into a single temperature times a convex combination of Hamiltonians).

      • versteegen 11 days ago

        Sure, there are fruitful parallels. But actually, energy is much 'realer' or more fundamental here than temperature. BTW, those λ_j Lagrange multipliers can be negative, so they don't map onto inverse temperature well. They're more like interaction weights in a Hopfield network.

        But there's one setting in which the energy E(x) = Σ_j λ_j H_j(x) really does become a physical (physics-esque) energy: if you sample from the distribution using Hamiltonian Monte Carlo (an MCMC algorithm). Then E(x) is the potential energy of a sample/particle at position x, which is also given a kinetic energy K, with the sum E+K held constant as the particle moves about to explore the distribution. It's a cool algorithm, but after having used it I'm still not sure how useful it is.
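
        A bare-bones sketch of that mechanic (my own toy code for a standard leapfrog HMC step, untuned; for the maxent model, E(x) = Σ_j λ_j H_j(x) and grad_E is its gradient in x): the particle gets a random momentum, follows the 'physics' for a few steps, and a Metropolis correction cleans up the discretisation error.

            # One Hamiltonian Monte Carlo step for a density p(x) ∝ exp(-E(x)).
            import numpy as np

            def hmc_step(x, E, grad_E, step=0.1, n_leapfrog=20, rng=np.random):
                p = rng.standard_normal(x.shape)        # random momentum, K = p·p/2
                x_new, p_new = x.copy(), p.copy()
                p_new -= 0.5 * step * grad_E(x_new)     # half kick
                for _ in range(n_leapfrog):             # leapfrog integration
                    x_new += step * p_new               # drift
                    p_new -= step * grad_E(x_new)       # kick
                p_new += 0.5 * step * grad_E(x_new)     # undo the surplus half kick
                h_old = E(x) + 0.5 * np.sum(p ** 2)         # total energy before...
                h_new = E(x_new) + 0.5 * np.sum(p_new ** 2) # ...and after the trajectory
                if rng.random() < np.exp(h_old - h_new):    # Metropolis correction
                    return x_new
                return x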

        • shiandow 11 days ago

          HMC allows you to jump quite a large distance by obeying the 'equations of motion' of the 'Hamiltonian'. This helps give more diverse samples, making them less correlated and thereby increasing the effective number of samples.

          It is indeed quite cool how it works.

          • versteegen 10 days ago

            Right, but I mean that HMC has parameters to tune and multiplies the computation needed per sample produced, so it's not clear to me for which problems it's significantly beneficial, and with how much tuning. I'd guess it's best where you otherwise don't have a good proposal distribution for Metropolis-Hastings, leading to either many rejections or very slow mixing.

            • shiandow 9 days ago

              The NUTS sampler limits how much tuning you need. As far as I know it's currently one of the best algorithms for sampling the kind of posterior probability distributions you get with Bayesian methods.
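
              In case it's useful, here's a hedged sketch of what that looks like in practice with PyMC (assuming the current v5-style API), where NUTS is picked automatically for continuous models:

                  # Minimal PyMC model: pm.sample() uses NUTS by default here,
                  # with step sizes and mass matrix tuned during warm-up.
                  import numpy as np
                  import pymc as pm

                  data = np.random.normal(1.0, 2.0, size=100)   # made-up observations
                  with pm.Model():
                      mu = pm.Normal("mu", mu=0.0, sigma=10.0)
                      sigma = pm.HalfNormal("sigma", sigma=5.0)
                      pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
                      idata = pm.sample(1000, tune=1000)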

janpmz 13 days ago

If I understood this correctly:

Minimizing entropy: learning filters that increase the likelihood of the observed data.

Maximizing entropy: learning how to combine the filters.

  • versteegen 13 days ago

    In the original FRAME, "learning filters" means selecting filters from a (very small) fixed set. In this paper they add* an additional "bilevel optimization", where they both learn θ_j, the parameters of the filters, and select the filters (indicator variables z_j).

    Either way, that learning is the entropy minimisation part.

    Maximizing entropy is not really about learning anything; it's about not overfitting. (Although you are correct in that you have to find the Lagrange multipliers to correctly combine the filters.) Basically, the desired probability distribution will be constrained only by certain statistics of the filters you learnt on the data, and nothing else. For example, you want a model that says "if panel 1 has an N-sided polygon, and panel 2 has an N+1-sided polygon, then panel 3 has an N+2-sided polygon", and says nothing about anything else.
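
    In symbols, roughly (my own simplified notation, not copied from the paper):

        % Max-entropy step: among all distributions that match the selected
        % filter statistics, pick the least committed one; this is where the
        % Gibbs form comes from.
        \max_{p} \; \mathcal{H}(p)
        \quad \text{s.t.} \quad
        \mathbb{E}_{p}\!\left[ H_j(x) \right] = \hat{\mu}_j
        \;\; \text{(observed statistic of each selected filter } j\text{)}
        \;\;\Longrightarrow\;\;
        p_{\lambda}(x) = \frac{1}{Z(\lambda)} \exp\!\Big( -\sum_j \lambda_j H_j(x) \Big)

        % Min-entropy step: choose the filters (and their parameters) so that
        % the resulting model is as peaked as possible, i.e. explains the data
        % as tightly as possible.
        \min_{\text{filters}} \; \mathcal{H}\!\left( p_{\lambda(\text{filters})} \right)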

    * I'm a bit peeved they didn't cite me for extending FRAME by learning the features like that. That was my PhD topic. But let's face it, they probably don't know about my work, or quickly forgot it because the papers were too ugly, and admittedly they've done it in a cleaner way. I had to wrap kludgy iterative feature addition around the feature optimisation step to make it more efficient for image modelling.

    • janpmz 11 days ago

      Sad they didn't cite you. But you never know, maybe your ideas influenced their future thinking pathways.

  • 3abiton 13 days ago

    Is this reasoning from the paper?

eli_gottlieb 13 days ago

So they minimize the forward KL divergence from the generative model to the data distribution, and then maximize the entropy of the generative model? Seems like it would be simpler to just minimize the reverse/exclusive KL divergence, giving you a cross-entropy term to minimize and then an entropy of the generative model to maximize.

I guess the obstacle to doing that is not being able to evaluate the true data log-density, only sample from it, so the cross-entropy term can't be estimated?
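
Spelling that out (standard identities, with p the data distribution and q_θ the generative model):

    % Forward/inclusive KL: the cross-entropy term needs log q_theta at data
    % samples, which is estimable from data alone.
    \mathrm{KL}(p \,\|\, q_\theta) \;=\; -\,\mathcal{H}(p) \;-\; \mathbb{E}_{x \sim p}\!\left[ \log q_\theta(x) \right]

    % Reverse/exclusive KL: the cross-entropy term now needs log p at model
    % samples, which can't be evaluated when we can only sample from the data.
    \mathrm{KL}(q_\theta \,\|\, p) \;=\; -\,\mathcal{H}(q_\theta) \;-\; \mathbb{E}_{x \sim q_\theta}\!\left[ \log p(x) \right]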

cs702 13 days ago

Interesting. There's a lot of preprocessing involved in getting these toy models to work, but the approach seems promising.

Is there code available, so others can examine it and replicate the results? I couldn't find a link.

Hugsun 13 days ago

I can't wait to see the next innovation, that is orthogonal to the scaling laws, that will cause the next leap in LLM performance.

davedx 13 days ago

This strikes me as potentially a high impact avenue of research. Inductive reasoning is core to human learning.