GistNoesis 5 years ago

I usually try a technique Andrej didn't mention here, which helps me a lot in the debugging and modelling phase: simulated data that encompasses a single key difficulty of the problem.

For example, along this line of thought in question answering, there is the bAbI dataset, which creates auxiliary problems so you know where the modelling problems are.

By pushing this idea to the extreme (for example, in NLP you can have tasks that consist of repeating a sequence of characters in reverse order, to demonstrate that the architecture is indeed capable of memorizing like a parrot), you can often create trivial problems which take minutes to run on a single machine and help discover most bugs.
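
A minimal sketch of such a toy task, assuming plain Python and a character-level setup. The function names, the sequence lengths, and the lowercase alphabet are illustrative choices, not something taken from the post or from bAbI:

```python
import random
import string


def make_reversal_example(rng, min_len=3, max_len=10, alphabet=string.ascii_lowercase):
    """Return one (source, target) pair where target is source reversed."""
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    source = "".join(rng.choice(alphabet) for _ in range(length))
    return source, source[::-1]


def make_reversal_dataset(n_examples, seed=0):
    """Build a reproducible list of (source, target) pairs."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return [make_reversal_example(rng) for _ in range(n_examples)]


dataset = make_reversal_dataset(1000)
```

A model that cannot drive the loss close to zero on a dataset like this most likely has a bug or lacks the capacity to memorize, and finding that out takes minutes rather than hours.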

You can also create hierarchies of such problems, so you know in which order you have to tackle them. And you can build sub-modules and reuse them.

Quite often the code you obtain this way is very explainable, and you know which situations will work and which probably will not. But this network architecture is usually "verbose" and numerically optimizes a little less well on large-scale problems. The trick is then to simplify your network mathematically into something that is more linear and more general. You can reorder some operations, like summing along a different dimension first. Semantically this will be different, but it will converge better, because for a network to optimize well it needs to work well in both the forward and the backward direction, so that the gradient flows well.

Once you have a set of simple problems that encompass your general problem, a good solution architecture is usually a more general mixture of the model architecture of the simple problems.

  • dual_basis 5 years ago

    This is great, like TDD for ML!

shmageggy 5 years ago

This might be the most "Deep Learning" thing I've ever read:

> One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).

  • akhilcacharya 5 years ago

    Hope they're not using a cloud instance - that sounds incredibly expensive!

6gvONxR4sf7o 5 years ago

>The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical.

This can't be overstated. I can't count the number of times I'm the first person to find a problem with the data. It's incredibly frustrating. Just look at your damned data to sanity check it and understand what's going on. Do the same with your model outputs. Don't just look at aggregates. Look at individual instances. Lots of them.
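
A rough sketch of what that first pass over the data could look like, assuming the dataset fits in memory as a list of `(input, label)` pairs with hashable inputs. `inspect_dataset` and the toy samples are made up for illustration:

```python
import random
from collections import Counter


def inspect_dataset(samples, n_show=5, seed=0):
    """Summarize a list of (input, label) pairs and pick a few instances to read."""
    label_counts = Counter(label for _, label in samples)
    input_counts = Counter(x for x, _ in samples)

    # Inputs that occur more than once
    duplicates = {x: c for x, c in input_counts.items() if c > 1}

    # Inputs that occur with more than one distinct label
    labels_per_input = {}
    for x, y in samples:
        labels_per_input.setdefault(x, set()).add(y)
    conflicting = {x: ys for x, ys in labels_per_input.items() if len(ys) > 1}

    # Aggregates are not enough: also return individual instances to read by eye
    rng = random.Random(seed)
    to_read = rng.sample(samples, min(n_show, len(samples)))

    return {
        "label_counts": dict(label_counts),
        "duplicates": duplicates,
        "conflicting": conflicting,
        "empty_inputs": sum(1 for x, _ in samples if not x),
        "to_read": to_read,
    }


toy = [("good movie", 1), ("bad movie", 0), ("good movie", 0), ("fine", 1), ("", 1)]
report = inspect_dataset(toy, n_show=3)
```

The same function can be pointed at `(input, model_prediction)` pairs to read individual model outputs instead of labels.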

  • odnes 5 years ago

    I was guilty of this once and would add some more specific advice: if your dataset consists of multiple possible labels for the same samples, do not just assume that the average of these labels will describe the best label. And don't assume that training with the unclean data will produce a net that magically learns to do the aggregation for you.

  • distant_hat 5 years ago

    The number of kids who dive right into modelling without sanity checking, visualizing, or otherwise exploring the data is shocking. The same goes for people who just look at the metrics and decide the model is better because the error is lower, ignoring that at times the output is meaningless, such as negative probabilities.

olooney 5 years ago

Well, this was unexpectedly excellent.

I don't think "stick with supervised learning" is very good advice, though. Unsupervised techniques sometimes work well for NLP and have worked well for other domains, such as medical records[1]. In particular, any time you have access to much more unlabeled data than labeled data, it is something you should at least consider.

[1]: https://www.nature.com/articles/srep26094

  • m0zg 5 years ago

    Why "unexpectedly"? Karpathy has a weapons-grade knack for explaining complex subjects in plain terms. Case in point: http://karpathy.github.io/2016/05/31/rl/ explains RL in a way even a non-practitioner will have little trouble understanding. Another prominent person with this skill is Chris Olah, one of the people behind Distill.

    • olooney 5 years ago

      Not knocking on this author at all... It's just that nowadays if I see a title in the vein of "7 Tips to Train Deep Neural Nets for Complete Beginners On Rails With Keras and TensorFlow" I click on it more out of a sense of obligation than anything but I don't go in with very high expectations. So I was pleasantly surprised to find this article was substantive and high quality.

cs702 5 years ago

As a practitioner, I found myself nodding in agreement again and again and again.

This blog post is full of the kind of real-world knowledge and how-to details that are not taught in books and often take endless hours to learn the hard way.

If you're interested in deep learning, do yourself a favor and go read this.

It is worth its weight in gold.

  • mark_l_watson 5 years ago

    I agree. I have been using machine learning since the 1980s and deep learning for the last 4 years. This is great advice that I have both bookmarked and made into a PDF to store away in my searchable collection of research material.

    Karpathy is amazing. I have had so much ‘mileage’ on two projects out of his unreasonable effectiveness of RNNs article.

IOT_Apprentice 5 years ago

Andrej is the guy doing Tesla's neural network for their FSD hardware. I truly appreciated his talk during the autonomy reveal the other day.

  • spectramax 5 years ago

    I have the opposite opinion. I dislike marketing wankery and investor bullshit, however entertaining it may be. Instead, I tremendously enjoyed Karpathy’s Stanford class and lecture videos.

    • Fricken 5 years ago

      I thought Karpathy's contribution to the presentation was an excellent technical summary of what Tesla is trying to do. It clarified a lot for me about what they're doing with Autopilot. Up until a few days ago my attempts at scrutinizing Autopilot had been limited to little snippets of information here and there, rumours, and guesswork.

      Bullshitting and wankery don't come naturally to Karpathy, so the few spots where he was under pressure to do as much stood out like a sore thumb.

    • coder543 5 years ago

      Based on your description, I can confidently say you and I didn't watch the same Investor Day presentation by Andrej.

    • Radzell 5 years ago

      It was basically the opposite. The talk was supposed to be very technical. Did you even get the stack 2.0?

ArtWomb 5 years ago

>>> There is a large number of fancy bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.

LOL. Human assisted training at scale is perfectly allowable for mission critical success. Especially if you enjoy an unlimited research budget!

You can follow these instructions to the letter. And the same problems around generalization will arise. It's foundational.

For 30fps camera images, handling new data in real time works fine for 99% of scenarios. But seeking usable convergence rates on petascale data problems, such as NVIDIA's recent work on deep learning for fusion reaction container design, requires a breakthrough, not just in software but in computation architectures as well.

Deep Reinforcement Learning and the Deadly Triad

https://arxiv.org/pdf/1812.02648.pdf

Identifying and Understanding Deep Learning Phenomena

http://deep-phenomena.org/

kriro 5 years ago

"""If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization."""

Can someone translate this to PyTorch for me? Or give a simple example of how one would go about doing this?

It means that if I have a 1:10 ratio in the data, an untrained net should predict positive in 10% of the cases, right?
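
One way to read the quoted advice, sketched in plain Python so the arithmetic is visible. This assumes a single-logit binary classifier with a sigmoid output; `initial_logit_bias` is a name invented here. The bias is simply the log-odds of the target probability, because sigmoid(b) = p exactly when b = log(p / (1 - p)):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def initial_logit_bias(p_positive):
    """Bias b such that sigmoid(b) == p_positive, i.e. the log-odds of p_positive."""
    return math.log(p_positive / (1.0 - p_positive))


# Probability 0.1, as in the quote: b = log(0.1 / 0.9) = log(1/9), about -2.197
bias = initial_logit_bias(0.1)

# Strictly, a 1:10 positives:negatives ratio gives p = 1/11, so b = log(1/10), about -2.303
bias_exact = initial_logit_bias(1.0 / 11.0)

# Assumption: with small initial weights the logit is dominated by the bias,
# so the untrained output is roughly sigmoid(bias) for every input.
initial_prediction = sigmoid(bias)
```

In PyTorch this value would typically be written into the bias of the last layer before training, for example with `torch.nn.init.constant_(final_layer.bias, bias)`, where `final_layer` is a hypothetical `nn.Linear` with one output. Under that reading, the untrained net outputs a probability of about 0.1 for every input, which is not quite the same as predicting "positive" on 10% of the inputs.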

OceanKing 5 years ago

I am bookmarking this article; this is pure gold.

Also, it seems to me that most of what he says can be distilled into a boilerplate/template structure for any given deep learning framework, from which new projects can be forked - does this already exist?

  • hnarayanan 5 years ago

    Yes, it’s called http://fast.ai

    • gojima2 5 years ago

      lol, when he wrote

      >> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)

      under "Neural net training is a leaky abstraction" my first thought was, this IS fastai's API

    • yorwba 5 years ago

      Fast.ai is nice, but not "a boilerplate/template structure for any given deep learning framework"

mitchellgoffpc 5 years ago

For anyone learning to build and train neural nets, this is a fantastic cheat sheet; Andrej is top-notch at explaining these kinds of things. The other posts on this blog are definitely worth a read as well!

  • eanzenberg 5 years ago

    A lot of this process goes beyond NN into generic ML. Especially understanding and diving into the data.

mollerhoj 5 years ago

"though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio"

What is he talking about here? BERT, GPT, etc. are not unsupervised; they are pretrained on a task that has naturally supervised data (language modelling).

indweller 5 years ago

In the blog, he refers to test losses at an early stage, like in "add significant digits to your eval". Does he actually refer to the test data or is he referring to validation data? I was under the idea that we were supposed to touch the test data only once at the end of all training and validation. What is the right way to handle the test data?
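
The convention described in the question can be sketched as a three-way split, assuming the data is a simple in-memory sequence. The fractions, the seed, and the name `split_dataset` are arbitrary illustrative choices:

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then carve out test, validation and training sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                # touched once, after all tuning is done
    val = shuffled[n_test:n_test + n_val]   # used for model selection and early stopping
    train = shuffled[n_test + n_val:]       # used for gradient updates
    return train, val, test


train, val, test = split_dataset(range(1000))
```

Every number watched during development would then come from `train` and `val`, with `test` kept aside until the end.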

  • snrji 5 years ago

    By "eval" you can also mean the training subset. As I understood it, it is the code that evaluates the network at a given point with a given dataset. For instance, after each epoch, the model is evaluated on both the training and validation sets (you see both losses).

    As you said, the test subset should only be used at the very end.

mendeza 5 years ago

This recipe results in large amounts of time spent before any results occur (depending on the task you are trying to solve). Classification is an easy task to apply this recipe to, but when you venture into object detection or pose estimation, data collection, labeling, and setting up training and evaluation infrastructure are much more complex.

  • liuliu 5 years ago

    Can you expand a little bit? I often find that if I skip one or more of the steps mentioned here, the later debugging is tremendously harder (and often involves going back to these steps again). Some of this advice, like visualization, is well supported in many frameworks, usually through TensorBoard. The rest are really just good common-sense try-first-or-you-will-regret-later steps that don't require a significant time investment.

cdelsolar 5 years ago

What a fantastic post, thank you for this.