xvilka 6 years ago

Sorry for repeating myself, but since this involves machine learning and OCaml, it's worth mentioning Owl [1], a library for numerical and scientific computing, including ML.

[1] https://github.com/owlbarn/owl

phonebucket 6 years ago

This is great. Functional languages have such an elegant representation of so many mathematical concepts. It's a bit of a shame that they don't have more widespread use in scientific computing.

  • flavio81 6 years ago

    > It's a bit of a shame that they don't have more widespread use in scientific computing

    In truth, they have. Lisp was the first functional language (or the first language that allowed that paradigm), and has been used a lot in scientific computing, for example for symbolic calculus and manipulation.

    • nikofeyn 6 years ago

      that doesn't matter when very few scientists have even heard of an ml (standard ml, ocaml, f#) or a lisp/scheme (common lisp, racket), much less have an inkling to use them. their use does of course exist but by no measure is it widespread.

      • ced 6 years ago

        That's part of what makes Julia promising: it's a numerically focused Lisp, without the parentheses.

  • mlthoughts2018 6 years ago

    I would suggest looking into Keras and PyTorch too. I honestly think they achieve a greater degree of elegance, and map the programming constructs onto the mental model of the domain expert better, than any FP interface to neural nets that I’ve seen yet.

    • phonebucket 6 years ago

      I use PyTorch a lot; it's definitely my preferred framework at the moment. I just wish there were something as thoughtfully done and well-supported in a more functionally oriented language.

      Flux.jl on Julia is the frontrunner in this regard, IMO. The added benefit is that being written in Julia the whole way down makes it easy for practitioners to delve into the source code and extend it in a performant way without going into the C level nitty gritty.

      • dnautics 6 years ago

        Flux is great. Because it's Julia, I could write a custom datatype that has fewer bits and test to see if inference and training are possible, and then apply that datatype to ml models without writing custom kernels (except convnets, but I'm going to push code for that.)

        • mlthoughts2018 6 years ago

          This is also very easy in Keras / Tensorflow using the FloatX parameter, or specifying e.g. float16 dtypes.

          However, I’d say desiring a framework that allows “easy” extensibility to choose float precisions lower than 16 bits and have it “just work” is actually a mistake. That type of flexibility is overkill.

          Instead, supporting a limited set of fixed types is better. To experiment with a new type requires some integration hurdle to make it recognized by the backend, and then requires published research or some similar type of evidence that there are use cases which materially benefit from that new additional fixed data type, to get a PR approved to add it.

          The reason is that permitting arbitrary complexity growth in the form of “easy” custom data type support has two big downsides: (a) the mechanism that makes it easy has to consume maintenance and development resources even if it’s a very obscure form of customization, and (b) more importantly, it proliferates and worsens the already insane problems of exporting / importing models from one language/framework to another.

          It’s a case study of KISS and YAGNI: this is super premature abstraction especially if it’s for experiments. And the hurdle of making a branch and adding your new dtype in the backend is not (and should not be seen as) a significant engineering hurdle. Rather it’s a very good check on complexity growth.

          • dnautics 6 years ago

            Yeah, except Flux code is way simpler than TensorFlow code, both for the end user and internally as well. It's not a premature optimization; it comes "for free" in Julia. Besides, you don't know what someone might need. Say someone wants to implement a deep learning model with complex-valued activations or quaternion-valued activations. What then?

            There is no complexity added in Flux to support arbitrary datatypes; a character-level LSTM takes about 30 lines of Flux. The Flux library itself is a very, very small library. Converting the character LSTM to a custom datatype is about three lines of code (plus about 60, reusable, for the datatype).

            This speaks to the good choice of abstractions in Julia. What you may call unnecessary optimization is for me critical research, since I'm investigating building hardware and I want to make sure the fp type I would implement (and it's not at all a standard IEEE type) is usable. Most deep learning is memory-bandwidth limited, so a decrease in bit size has an O(n^2) effect on computation speed.

            In the spirit of rapid iteration it was far preferable to implement 50 lines of code in Julia to get my type able to do machine learning and then know if it failed or succeeded rather than write a tf kernel, which probably would have taken me months.

            • mlthoughts2018 6 years ago

              > “Say someone wants to implement a deep learning model with complex valued activations or quaternion valued activations. What then?”

              This sounds like premature abstraction to me...

              • improbable22 6 years ago

                There is work on rotationally invariant networks, e.g. for identifying galaxies, or cells under a microscope. For example:

                https://arxiv.org/abs/1612.04642

                https://arxiv.org/abs/1805.12301

                I haven't looked closely enough to be sure if they literally had complex activations, but this seems like an obvious use. Maybe they would have, if only tensorflow made it easy.

                • mlthoughts2018 6 years ago

                  Why would you ever want to represent activations directly as complex numbers in that case? I can’t think of any good reason, compared with putting them in rotation matrix form or some other equivalent form that actually maps to the domain modeling problem.

                  Even when working in signal processing problems that require complex arithmetic, the underlying representations are just based on tuples of doubles and operator conventions, and you always need to map to real spaces (real part, imaginary part, angle, or magnitude) for any type of analytical representation that can be human readable.
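To make the "tuples of doubles plus operator conventions" point concrete, here is a small Python sketch (Python rather than a signal processing framework, purely as an illustration): the built-in complex type is just such a representation, and any human-readable analysis maps it back to real spaces.

```python
import cmath

# A complex value is, under the hood, a pair of doubles plus operator conventions.
z = complex(3.0, 4.0)

# Analytical, human-readable views always map back to real spaces:
print(z.real)          # real part: 3.0
print(z.imag)          # imaginary part: 4.0
print(abs(z))          # magnitude: 5.0
print(cmath.phase(z))  # angle: atan2(4.0, 3.0)
```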

                  In all these cases, the idea that what we should optimize for is overhead-free easy expression of cutesy math domain verbiage is a bad idea.

                  Writing libraries that expose an API that matches the user’s domain mental model is a great thing. But enforcing a particular abstraction and extensibility hierarchy so those things can be “autogenerated” just by parameterizing over a new type turns out to be actually much worse than just writing that type separately, with helper functions and converters, and customizing its API to be efficient from a domain mental model perspective.

                  A better way, for example, might be to use mixin patterns or decorators and other metaprogramming, while writing a custom data type and its associated methods.

              • ChrisRackauckas 6 years ago

                How is it premature abstraction if it takes zero extra lines of code to support it and have it optimized? That's kind of the beauty of Julia.

                • mlthoughts2018 6 years ago

                  It didn’t require zero lines of code. It required a huge amount of backend code to set up the abstraction and make lots of built-in types adhere to it. And when the abstraction fails to offer the exact kind of extensibility you need (which is most of the time, unless you’re authoring yet another highly abstracted library that can tie its use cases to that underlying abstraction, which in practice is never), that effort was wasted, and “no overhead” is a false description. You still have to dig into the guts of everything that gets auto-generated when you plug into the abstraction, and change the mechanism of how it gets auto-generated for your special case. Or else (usually easier) you just write separate data structures outside the abstraction vortex, with a few small converters or helpers that marshal your custom data type into and out of the abstraction for the really tiny amount of auto-generated features that actually matter to the use case.

                  The “but it requires zero lines of code” thing is so misleading once you hit real use cases where the choices of how the abstraction auto-generates things end up being unusable for some specific situation.

                  • ChrisRackauckas 6 years ago

                    Right here we have a classic case of someone on the internet anonymously saying something is impossible to do while there are many many examples of exactly this working. I recommend readers of these posts to ignore the FUD and do some Google searches to look through some Julia code repositories to see it in action. There are some great tutorials and fact-based discussions out there that can lead you to some useful examples with tricks you can employ in your own code.

                    • mlthoughts2018 6 years ago

                      It looks like you are just posting knee-jerk defenses of Julia, I guess. Whatever this is, it’s clearly not related to my earlier comments in the thread.

                      What are you talking about? Where did I say any of this was not possible? It’s obviously possible.

                      It just turns out to be bad when you do it. It causes problems that the company line memo about zero overhead never is upfront about.

              • phonebucket 6 years ago

                Premature abstraction is a problem for engineering.

                But what about scientists who are not too fussed with engineering considerations but would like to explore such things? Then this extensibility can be valuable.

                • dnautics 6 years ago

                  At some point one of those explorations becomes useful and it's no longer premature.

          • one-more-minute 6 years ago

            I can see where you're coming from in any other language. But in Julia it's important to realise that this "premature abstraction" isn't actually any extra work, it's just the default. If we write `f(x) = x+x` then `f` takes anything that can be added, which can be any custom number or matrix type, or really anything else. Adding type restrictions to make it work with only a limited set of types is completely doable, but actually more work than just leaving it generic.
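A rough Python analogue of that point (Python rather than Julia, so this shows only the duck-typing side, without Julia's type-specialized compilation; the `Dual` type here is a hypothetical custom number): a function written generically works for any type that supports `+`, with no extra effort from the author.

```python
from dataclasses import dataclass

def f(x):
    """Generic by default: works for anything that can be added."""
    return x + x

@dataclass
class Dual:
    """A hypothetical custom number type (value + derivative), defined long after f."""
    val: float
    eps: float
    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

print(f(3))               # 6
print(f([1, 2]))          # [1, 2, 1, 2]
print(f(Dual(1.0, 2.0)))  # Dual(val=2.0, eps=4.0)
```

The difference is that Julia's compiler additionally specializes `f` for each concrete argument type; in Python the dispatch stays dynamic.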

            We didn't at any point decide "it's worth the extra effort/complexity to make Flux work with custom number types"; it's just inadvertently been that way from day one, and I didn't even know anyone was making use of it until today.

            • mlthoughts2018 6 years ago

              This is the same for many languages that treat operators with type class patterns. It’s still usually bad design choice. I find a lot of language designers & programmers like it, but they are super disconnected from the realities.

              For an example consider breeze and spire in Scala. There’s so much effort to create these bloated numeric type hierarchies that abstract out things like monoids, rings, fields, iterability, sortability, etc.

              It’s not good. Just having really boring repetitive implementations for each distinct data structure would be better! No joke! Being able to write type generic functions over sortable matrix subclasses turns out to not be valuable unless you’re also writing a highly abstracted library, which is never, certainly not when you’re using it for experiments.

              Nobody needs to be able to make a DenseMatrix[Quaternion] and get it to automatically pick up implementations of fancy indexing. No. You can just write your own helper methods, and this is better, more convenient, applies less pressure for DenseMatrix to have some indecipherably complicated abstract implementation so it can be more free to just specialize on linear algebra functionality that works for DenseMatrix[Double] which is what is needed 99.999999999% of the time.

              • dnautics 6 years ago

                You really should try Julia before making claims about its complexity.

                The numeric type systems are simple, and designed for convenience, not to satisfy mathematical theory. In the case of FloatX, it's basically FloatX <: AbstractFloat <: Real <: Number <: Any.

                For complex datatypes, like vectors, matrices, dicts, etc., you have parametric (templatable) datatypes, but that is no more complex than C++, and actually far cleaner in implementation.

                For the most part, you do not NEED to make a Matrix{Quaternion}. And that's fine. However, if you do, the standard library will do the right thing, as if you had made a Matrix{Int32} or a Matrix{GaloisField8}. And if you choose to use Matrix{Float32}, the type system interacts with the compiler, and the standard library picks up the Fortran BLAS library, so you get faster-than-C performance.

                On the other hand, you might be deploying a really large matrix on a supercomputing cluster, and it might be useful to re-index the matrix as a datatype that fits in the L1 cache of your Knights Landing chips. In that case, you have the option of redeploying as an AbstractMatrix{Float64}, implementing index-caching functions, and dropping it into your code (probably about 100 lines of code total, if even) without having to rewrite every single matrix operation everywhere.
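A minimal sketch of the Matrix{Quaternion} idea in Python rather than Julia (the `Quaternion` class is deliberately incomplete and purely illustrative): once a type defines its arithmetic operators, generic code written without that type in mind works on it unchanged.

```python
from dataclasses import dataclass

@dataclass
class Quaternion:
    """A toy quaternion: just enough arithmetic for generic code to use it."""
    w: float
    x: float
    y: float
    z: float
    def __add__(self, q):
        return Quaternion(self.w + q.w, self.x + q.x, self.y + q.y, self.z + q.z)
    def __mul__(self, q):
        # Hamilton product
        return Quaternion(
            self.w*q.w - self.x*q.x - self.y*q.y - self.z*q.z,
            self.w*q.x + self.x*q.w + self.y*q.z - self.z*q.y,
            self.w*q.y - self.x*q.z + self.y*q.w + self.z*q.x,
            self.w*q.z + self.x*q.y - self.y*q.x + self.z*q.w,
        )

def dot(u, v):
    """Generic dot product: written once, never specialized for Quaternion."""
    acc = None
    for a, b in zip(u, v):
        acc = a * b if acc is None else acc + a * b
    return acc

i = Quaternion(0, 1, 0, 0)
j = Quaternion(0, 0, 1, 0)
print(dot([i, j], [i, j]))  # Quaternion(w=-2, x=0, y=0, z=0)
```

In Julia the analogous code would additionally be compiled to specialized machine code per element type; this sketch only shows the reuse-without-rewriting side.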

                • mlthoughts2018 6 years ago

                  It’s so funny to me how Julia proponents often make it an ad hominem attack as if the writer hasn’t used the language. I’ve been using and following Julia closely since late 2012, and even attended a few meetups / talks at MIT about it since I was a grad student at the time, and even took a random matrices class with Alan Edelman in which he talked quite a bit about early julia.

                  Julia is by no means the only language to have patterns like this either, and in fact it’s not even a language where these patterns are particularly easy to use (I would reserve that for Haskell, but admit there may be other languages I don’t know which also make the cut — not julia though).

                  Your two ending paragraphs read to me like a super naive restatement of the company line memo for why these types of parametric abstractions are supposed to be good. It’s like a political platform, and just like a political platform it doesn’t keep its promise.

                  I have worked on projects where we needed to customize bit packing, not for cache performance, but for control over a modified version of sparse matrix types.

                  And I’m telling you, the idea that we’d ever rely on the language’s chosen abstraction and do something like AbstractSparseMatrix{Float64} to pick up a bunch of interface properties “for free”, while making the underlying logic specialized for our sparse format, is crazy. It’s a naive false promise that grad students believe, and it gets quickly beaten out of them in the real world, once you realize that the type constraints and inheritance / type class extension constraints this places on you are too limiting, and end up requiring too much boilerplate that can’t quite be autogenerated, because the way the abstract interface was chosen just doesn’t quite match your use case.

                  Finally you realize going down this road was the wrong idea all along. You just write a super short implementation of MyCustomSparseMatrix or MyCustomCachePropertyMatrix for your case, and you fill in manually the logic you thought you’d cleverly get “for free” by plugging into some abstraction hierarchy. Often you realize that for your use case you hardly need to re-implement any of it, and can do the boring parts pretty easily with converters or helper functions that marshal between whatever “for free” functionality you hoped to get and your simple custom non-parametric type.

                  I’ve been down this road too many times, in many languages. I just leave it for the grad students who like playing with abstraction toys, and instead I just get back to actual work, solving problems economically, which warrants a super strong heuristic of avoiding this type of parametric abstraction pattern as much as possible.

          • ChrisRackauckas 6 years ago

            I'm using the abstract numbers for encodings of uncertainty and probability distributions. This goes far beyond FloatX types and is really helping spawn a whole new area of research. The number type abstractions in Flux are really one of a kind. Kudos to the developers.

            • mlthoughts2018 6 years ago

              I work on large-scale MCMC and causal inference problems. I would be very interested to know what this area of research is where you require abstract numeric types themselves to represent uncertainty, and why it would be different from other inference algorithms that handle uncertainty. I admit, at first blush I am extremely skeptical. It sounds like a silly sort of thing: instead of parameterizing over a numeric type, you might parameterize over a number-from-a-distribution type, and then try to make the type system represent how everything flows in an MCMC setup, similar to tools like pymc, except where those are declarative and procedural, this would attempt to embed that into the type system. I can scarcely think of a worse way to represent manipulating uncertainty, though. I hope I’m just reading your comment incorrectly.

              • ChrisRackauckas 6 years ago

                It's for differential equations. Getting uncertainty without parameter sampling saves a lot of computational time and really opens up the problems that can be solved.

                • mlthoughts2018 6 years ago

                  But why does “getting uncertainty without parameter sampling” have anything to do with parametric numeric types?

                  The latter is just a possible manner of implementation (that I’d argue is too cutesy), but there are many other ways to design a system like that, for example like fused types in Cython combined with numba class jitting.

                  I still see no reason to believe that something that parametrizes differential equation functions or linalg functions over “uncertainty primitives” would be anything but a functional programming hot take on something that could be more straightforwardly done many other ways not relying on parametric abstraction.

                  • ChrisRackauckas 6 years ago

                    The Julia solution didn't require anyone to actually think about making it work, and it works well (we found out from a Discourse post that it works, the developers didn't even know :)), and now we are using it in our research codes because it is a great way to speedup what was traditionally done via parameter sampling.

                    Until I see someone else take an existing ODE solver like LSODA and convert it into something that can output uncertainties without having to do parameter sampling, I won't think other ecosystems are very close to what we have already done. Places like SciPy are still calling out to Fortran routines from ODEPACK for this, so making it work with Numba class jitting is a long way away. Show how easy it is to code it by showing code. Ours is already done: the ball is in your court.

                    • mlthoughts2018 6 years ago

                      I just looked up your paper with Nie and indeed you’re just using a particular set of patterns with multiple dispatch and metaprogramming. Nothing is fundamentally different than other ways of implementing the same thing that don’t rely on parametric abstraction.

                      You seem not to know about numba and Cython given that you responded with a comment about scipy using FORTRAN, which is not relevant. You can do the exact same multiple dispatch patterns with Cython fused types, and with several dispatch techniques in numba.

                      Look, I’m glad people like your library. It doesn’t change the larger points about this type of design pattern being premature abstraction.

      • xvilka 6 years ago

        Julia is very good; the only problem is the requirement for a patched LLVM (the patches they provide are not yet merged upstream), which can cause conflicts with other frameworks if there is no separation.

        • dnautics 6 years ago

          This is fairly easily dealt with using containers.

      • amelius 6 years ago

        My biggest wish is that new research implementations in the field of ML come out in 1 framework, in 1 language.

        But for engineering purposes, it's nice that there is an ocaml framework now.

yunfeng_lin 6 years ago

So much bashing on static typing in deep learning :) Can anyone from Google explain the benefit, since you guys are working on Swift for TensorFlow?

https://medium.com/tensorflow/introducing-swift-for-tensorfl...

  • shoyer 6 years ago

    Static typing for catching errors is only a small part of the vision for Swift on TensorFlow. The real advantage of static typing is that it enables the compiler to reason about your code, e.g., to automatically rewrite it for a hardware accelerator with guaranteed correct semantics: https://github.com/tensorflow/swift/blob/master/docs/DesignO...

    This is obviously possible in Python as well (e.g., see Numba) but clearly has additional challenges: https://github.com/tensorflow/swift/blob/master/docs/WhySwif...

    (I work at Google, but not on the TensorFlow team.)

    • yunfeng_lin 6 years ago

      Thanks! That's a very interesting idea, and definitely worth exploring. I'm not sure whether it's just my impression, but many Python deep learning people seem so proud of their choice that it is difficult to convince them.

    • seanmcdirmid 6 years ago

      Static typing enables some optimizations, but not as many as we’d hope given the inexpressiveness of most type systems.

      The real advantage of static typing is code completion, which allows us to forget the nuances of our library naming schemes. TypeScript is so awesome in this regard, being neither sound nor used for optimizations, but still being very useful.

    • KenoFischer 6 years ago

      Static typing has very little to do with what the compiler can say about your code. You can have dynamic languages with very strong type systems and semantics as well as static languages with weak semantics. The only difference between static and dynamic languages is whether the compiler enforces completeness of the analysis or not.

      • yunfeng_lin 6 years ago

        I am not sure I agree with you. You do need compile-time types to generate efficient hardware-accelerated code.

        Python has strong typing, but it is only available at run time, which is not useful for generating code.

        But now Python also has optional type annotations; these might be utilized to generate more efficient code, though.

        • KenoFischer 6 years ago

          Look at e.g. Julia for a dynamic system with a strong type system that allows the compiler to reason about the code without forcing completeness.

          • yunfeng_lin 6 years ago

            Julia has optional typing, which is static as well.

            • KenoFischer 6 years ago

              As always in these sorts of discussions, it depends how you define your terminology. However, by most definitions of static typing, Julia's type system is not static. The julia type system is very much a property of the runtime language and behaves as such. It is quite strong, true, but still not statically enforced as you would expect from a static language. In particular, (to the extent that you can identify one), you never get any sort of compile-time type errors in julia.

pc86 6 years ago

So I was lost at the VGG19 example code, but probably because I have (a) no OCaml experience; and, (b) no ML/NN experience.

Still seems interesting, though. If anyone has any suggestions on basic sources for getting a background on the concepts here I'd definitely give them a read.

  • hackermailman 6 years ago

    Look through youtube for university lectures, like these ones https://www.youtube.com/playlist?list=PL_Ig1a5kxu57NQ50jSuf0...

    Most intro classes just require familiarity with basic calculus (differentiation, chain rule), linear algebra, and basic probability, all of which you can look up directly on https://www.expii.com for a short tutorial. Toolkits are usually in Python or Lua; there are also numerous textbooks like 'Deep Learning with Python' around, and DL-specific books such as http://www.deeplearningbook.org/.

    Afterwards, look around for adversarial learning, like detecting perturbations that force misclassification and other attacks described in papers by Carlini and Wagner. Currently there isn't a perfect defense for all of these attacks, except robust optimization, which provably defends against some of them. Attacks are an interesting area in DL you can get into, since we don't have access to large resources and can only do DL on a small scale (in my case, anyway).

mlthoughts2018 6 years ago

I had a very unpleasant interview regarding deep learning with Jane Street. I spoke to a member of their HR team to try to get significant assurances that the interview would actually be focused on deep learning and not puzzles or brain teasers, and that the job would really focus on deep learning for their actual business, and not just be a proxy for being generally smart and then work on whatever existing inhouse models. The HR employee reassured me significantly on both points.

Then the interview was nothing but deck of card puzzles and random riddles where you have to articulate a careful model of some physical quantity like speed or frequency to solve the puzzle. I hate that junk, never found that it correlates with a way of thinking that matters in quant finance (which I previously did for a living) and suitably failed the interview. Worse, I would have been happy to decline that interview and tell them I know I’m not their guy if only the HR staff had correctly depicted the interview & job to me.

Ok, enough grumbling. From this actual blog post,

> “Type-safety helps you ensure that your training script is not going to fail after a couple hours because of some simple type error.”

I really think this way of thinking about static typing is a very bad thing. This is not at all an actual benefit, because in any sane situation, you will use unit and integration tests that execute extremely quickly on small test data to exercise your end to end model training code.

What I currently do for this on my team is to always require that model training programs are deployed inside of containers that capture not just the state of the code, but also make it configurable to mount the training data volume and pass in ENV that governs what the training job really is.

So then Jenkins or whatever will build the container for any PRs that seek to implement or modify training, attach fixture data and fixture ENV settings, and give you quick feedback about the whole end to end training, even inclusive of GPU settings (we have to do a slight manual step to specify Jenkins running on a GPU server, but this is a vestige of some of our infra headaches).
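As a sketch of the kind of fast end-to-end check being described (everything here — the `train` function, the fixture data, the env var names — is hypothetical, just to show the shape of a smoke test that exercises the whole training path in seconds rather than hours):

```python
import os

def train(data, epochs, lr):
    """Stand-in for a real training entry point; assumed to return a loss history."""
    loss = 100.0
    history = []
    for _ in range(epochs):
        loss *= (1.0 - lr)  # fake "learning" so the smoke test has something to assert
        history.append(loss)
    return history

def smoke_test():
    # In CI this config would come from ENV injected by Jenkins into the container.
    epochs = int(os.environ.get("SMOKE_EPOCHS", "2"))  # tiny on purpose
    lr = float(os.environ.get("SMOKE_LR", "0.1"))
    fixture = [[0.0, 1.0], [1.0, 0.0]]                 # small fixture data, not the real volume

    history = train(fixture, epochs=epochs, lr=lr)

    # The point is exercising the whole path end to end, not the model quality:
    assert len(history) == epochs
    assert epochs == 1 or history[-1] < history[0]
    print("smoke test passed")

smoke_test()
```

A type error anywhere on that path surfaces here in seconds, which is the claimed substitute for compile-time checking.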

The point is that adding all sorts of extra code to embody type annotations, and limiting people from awesome dynamic typing features is a silly thing to do if you’re worried about type errors ruining a long-running job. That should be handled by fast integration tests.

Now, there are perfectly valid other reasons to like static typing. I just always hear this one, especially in regards to Python, and it’s really the wrong way to look at it.

The extra code and constraints of static typing are liabilities that should have to offer offsetting value to choose them. You already need integration and unit tests to reliably make changes and maintain the training code. If you can get the same benefit of overall job safety (or even 99% of the same benefit), from the tests, without paying the extra costs of static typing, then don’t!

Turning it around to act like static typing is de facto always a benefit is a very one-sided way to look at it.

  • flavio81 6 years ago

    >Now, there are perfectly valid other reasons to like static typing. I just always hear this one, especially in regards to Python, and it’s really the wrong way to look at it.

    >The extra code and constraints of static typing are liabilities that should have to offer offsetting value to choose them.

    Agree, agree SO much.

    After years of only using statically typed languages, and then switching to Python and Lisp, I never understood why "catching typos and type errors" was touted as the benefit. I also agree they are a sort of liability that has to be taken into account in order to turn it into a benefit.

    For me, it was mostly performance benefits.

    Note that I'm not talking about strong typing (vs. weak typing) here. Strong typing is always a good thing.

  • elihu 6 years ago

    > I really think this way of thinking about static typing is a very bad thing. This is not at all an actual benefit, because in any sane situation, you will use unit and integration tests that execute extremely quickly on small test data to exercise your end to end model training code.

    Unit and integration tests don't write themselves, and they will always be incomplete. You can't test for everything, and what you get from them will depend on how much effort you put in.

    Static typing prevents you from running code that tells the computer to do nonsensical things, and usually you'll get an error that tells you exactly what you did wrong. I see those as benefits. In languages like Ocaml or Haskell that have type inference, type annotations can even be omitted most of the time. In the effort-versus-confidence trade-off, I see static typing as low effort with a good payoff. Others might think static typing is too much work and would rather rely on tests. Both approaches are complementary; neither renders the other unnecessary or redundant.

    > Turning it around to act like static typing is de facto always a benefit is a very one-sided way to look at it.

    Sure, it's a trade-off. Opinions will always vary as to what an ideal productive development environment looks like and what trade-offs are worthwhile, but I think your dismissal of static typing as a tool for gaining some degree of confidence that some program will probably work correctly is also one-sided.

    • mlthoughts2018 6 years ago

      “You can’t test for everything” seems like a really bad counter argument in this case because you’ll still need to write the integration tests anyway, and if the tests are incomplete (which they always are, it’s life, whether you’re writing statically typed code or not, shrug), you have to improve the tests. You can still have runtime errors and incorrect logic in statically typed code... so?

      Now, suppose the process of that testing gets you 99% of the same overall job safety you'd get by growing the code 10% to add static typing annotations and data structure models (raising maintenance costs by that same 10%, and possibly adding bugs or painting yourself into rigid, hard-to-refactor corners even if they confer some short-term bug-prevention benefit via the type checking). Since those are tests you already need to write anyway, it's a no-brainer.

      I realize there are good uses of static typing and it can come down to style preference. But truly in this case of “what if my big scientific computing system hits a type-checking-could-prevent-it sort of bug after hours of computing time,” it’s just not a good argument.

      This is why people routinely write huge scientific computing systems in Python, and nobody worries about hitting type-checking-relevant errors after several hours.

      Some things where type checking can really help: ensuring you've exhaustively handled every case in an ADT, using the type system to prove state transitions are valid (e.g. with phantom types), and using the type system to encode side-effectfulness, like Haskell monads.
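      A rough Python analogue of the exhaustiveness point (a toy sketch, assuming Python 3.10+ for `match`; the `State` variants and the `assert_never` helper are hypothetical, not from the thread):

      ```python
      from dataclasses import dataclass
      from typing import Union

      @dataclass
      class Training:
          epoch: int

      @dataclass
      class Converged:
          loss: float

      State = Union[Training, Converged]

      def assert_never(value):
          # At runtime this is just a guard; a static checker treats
          # typing.assert_never as a proof that every variant was handled.
          raise AssertionError(f"unhandled state: {value!r}")

      def describe(state: State) -> str:
          match state:
              case Training(epoch=e):
                  return f"training, epoch {e}"
              case Converged(loss=l):
                  return f"converged at loss {l}"
              case _:
                  # A checker like mypy flags this branch as reachable if a
                  # new variant is added to State but not handled above.
                  assert_never(state)
      ```

      Adding a third variant to `State` only breaks this dynamically at runtime; with a checker and `typing.assert_never`, the same omission becomes a compile-time error.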

      These things often just aren’t important for something like a large-scale machine learning training program. The types of problems you run into just don’t happen to benefit much from that stuff, while the benefit of writing quick ad hoc functions that take arguments of unconstrained types and just make unchecked assumptions about their attributes is actually quite big.

      • ernst_klim 6 years ago

        >“You can’t test for everything” seems like a really bad counter argument in this case because you’ll still need to write the integration tests anyway

        Types are nothing more than a proof that some property holds for your code (the Curry-Howard correspondence). Tests are nothing but a proof that the property holds under the exact conditions tested. Types are always better than tests; the only question is how powerful your type system is and how many properties you can express as types. In F* or Idris you don't need tests; in OCaml and Haskell you sometimes need tests, when the type system is not powerful enough; in Python you have to write tests all the time.
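        (To make the Curry-Howard point concrete, here is a tiny sketch in Lean: a term inhabiting the type is a once-and-for-all proof, whereas a test could only ever check particular inputs.)

        ```lean
        -- A value of this type *is* a proof of "A implies (B implies A)",
        -- valid for all propositions A and B; the type checker verifies it.
        theorem k (A B : Prop) : A → B → A :=
          fun a _ => a
        ```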

        • mlthoughts2018 6 years ago

          > “Types are always better then tests, the only question is how powerful your type system is and how much properties you could express as types.”

          Again, as I’ve been saying, that is a super one-sided way to look at it. Type annotations and the use of appropriate patterns required for most modern “good” static typing are costly things. Organizing things with type-class patterns and algebraic data types costs you by making you write more code and more boilerplate, and gives you trickier things to reason about. Some languages (Scala) are worse than others (Haskell) in how badly this boilerplate affects you, but the restrictions it places on you to facilitate the type-based proofs of safety are real. It’s not free.

          Using static typing for these things is only better when (a) the overhead of adding static typing and associated pattern code (and associated maintenance of that extra code and restrictions of how you can write ad hoc code) is not too large and (b) the type-based safety proofs couldn’t have been gotten in some cheaper way.

          In a case like a big model-training program, (a) and (b) just don’t hold. The extra boilerplate and maintenance are very meaningful; just look at the difference between this blog post’s OCaml code and the equivalent stuff in Keras. The restrictions on ad hoc code also matter. If I can get back a dynamically typed container of settings, like a Python dict for passing into a GPUOptions setup in TensorFlow, and not have to conform to certain types before being allowed to write code that makes direct assumptions about what attributes I can access, or about which dict values will be strings that can serve as args to functions expecting strings, that saves me a lot of time and lets me write much shorter code. In this use case it is extremely easy to verify that the only data passed in will conform to those assumptions, something an integration test can check very quickly without any compromise on the handling of arbitrary attributes from config dict values.
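          A sketch of the dynamic style being described (the function and keys here are hypothetical stand-ins, not TensorFlow's actual API):

          ```python
          def setup_gpu(options):
              # Unchecked assumptions: both keys exist, the fraction is numeric,
              # the flag is boolean-ish; nothing enforces this before runtime.
              return {
                  "memory_fraction": float(options["per_process_gpu_memory_fraction"]),
                  "allow_growth": bool(options["allow_growth"]),
              }

          # A plain dict stands in for a typed config object.
          config = {"per_process_gpu_memory_fraction": 0.5, "allow_growth": True}
          ```

          A single integration-style call like `setup_gpu(config)` exercises every one of those assumptions at once, which is the "checked very quickly" claim above.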

          Not every case is like this. Some times going to the trouble of setting things up with static typing to prove complicated assumptions are valid within the code is better and ends up reducing code through disciplined use. Static typing can be cost-effective.

          This particular use case in the blog post, though, is not one of those cases at all.

          • ernst_klim 6 years ago

            >Organizing things with type class patterns and algebraic data types costs you by making you write more code and more boilerplate

            And tests are writing themselves for you?

            • mlthoughts2018 6 years ago

              I don’t understand why you believe that question is rhetorically interesting or connected to anything that has been discussed.

              You’d have to write virtually all the same tests in this type of use case whether you are using the static typing approach or not. The tests won’t explicitly check types in the dynamic typing case, but will verify type safety for fixture settings and data indirectly, as a byproduct of all the other testing.

              It seems like you are really missing the point. In use cases like a big training program, you have to write integration tests, period. The compiler is never a useful substitute for that in this case. Now, since we know you have to use integration tests, and so the cost of writing and maintaining those tests is baked in, we can ask: will those tests also cover what a compiler could have helped with, if we’re in the dynamic typing case? Yes.
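              As a toy illustration of that byproduct effect (hypothetical names and deliberately tiny numbers):

              ```python
              def train_step(weights, batch, lr):
                  # A string lr or a mis-shaped batch blows up right here, so a
                  # realistic fixture verifies the types without testing them directly.
                  return [w - lr * x for w, x in zip(weights, batch)]

              def test_train_step_on_fixture():
                  # One integration-style check of behavior that also, incidentally,
                  # confirms every argument has a workable type.
                  assert train_step([1.0, 3.0], [2.0, 4.0], lr=0.5) == [0.0, 1.0]
              ```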

              And so then the extra code we’d have to maintain and extra constraints we’d have to live with if choosing static typing turn out not to buy us anything we can’t already get with the baked-in costs of the integration tests.

              Tests don’t write themselves. Why would you ever ask that type of question like you’re being cheeky and rhetorically dramatic? It reveals that you’re still stuck imagining that you’d need to write extra type-specific tests in the dynamic typing case, which misses the point of the discussion.

              • ernst_klim 6 years ago

                >You’d have to write virtually all the same tests in this type of use case whether you are using the static typing approach or not.

                No, I don't need to write tests if I can prove the property with types. Here is an example of quicksort where all the invariants and properties are ensured with types, so the code does not need any tests at all.

                https://github.com/FStarLang/FStar/blob/master/examples/algo...

                >integration tests

                You could reason about your program's correctness on any level with types.

  • Nelkins 6 years ago

    Agreed that type-safety preventing training script failure is not the strongest argument for OCaml. In my experience, the far more compelling reason to use expressive type systems is that they allow you to be more specific about your domain models. This helps to not just prevent type errors (common but not that big of a deal) but also logic errors (I think more common, harder to suss out, and harder to catch with the equivalent amount of test coverage). This idea is often rephrased in FP communities as "Make illegal states unrepresentable."

    • jnbiche 6 years ago

      Exactly. When we praise statically-typed languages like OCaml, Scala, and Haskell, I think people miss that we're not just talking about random type annotations. It's exactly what you describe -- we're talking about being able to model domains using the type system, which goes far, far beyond simple type errors.

      Algebraic data types are the bedrock for this kind of modeling, and I can't for the life of me understand why more languages don't add them (particularly languages like Java and C#).

  • burkaman 6 years ago

    I applied for a more entry level software position at Jane Street, and while I would have failed the interview regardless, I had the same experience where the HR person had no idea what the interview was like.

    They assured me I'd be required to write OCaml, so I spent the weekend brushing up, and said I should bring my own laptop prepared with whatever development environment I wanted to use. In fact it was a couple of "whatever language you want" questions using their floating interview laptop, which threw me off a lot. But like I said, I would have failed anyway; it's the hardest interview I've ever had.

    • nikofeyn 6 years ago

      i wouldn't sweat it. people design those interviews to make them feel better about themselves rather than as an effective way to gauge someone's aptitude to work there.

      you'll notice that none of these places that ask these types of questions allow the candidate to ask them technical questions. it's always a one-way street. there are times where i have "failed" interviews of this type even though i'm certain i could have asked "simple" questions about software and programming that they couldn't answer.

      • lordnacho 6 years ago

        > i wouldn't sweat it. people design those interviews to make them feel better about themselves rather than as an effective way to gauge someone's aptitude to work there.

        Same reason I dislike that sort of interview. Of course the thing to do is to throw in a spanner.

        "Hmm, so I guess you didn't read about the Modified Banach-Wiles-Kolmogorov algorithm? I thought that was where we were going. Ok let's do it your way."

        Throw this bomb on the way out, of course.

      • sincerely 6 years ago

        I don't think I've ever had an interview where it would have been okay to quiz the interviewer on technical topics...could you explain a little better?

        • nikofeyn 6 years ago

          there's a couple things. why is it not okay? as a potential employee, you want to be sure that they have an inkling as to what they are doing and aren't just asking questions out of the "interviewing at google" book. i am not actually arguing that interviewees ask these questions (see below), but it is something to think about. i usually try to ask some questions that give me a sense of how big of a mess their software is. at one of the places i failed the final interview, i wasn't ever shown any code or product that they were working on but had a sense it was a bit of a mess with a bunch of smart people writing code with no hands at the wheel.

          i have the perspective that these types of puzzle questions by the interviewer are pointless. and i was getting at the point that the interviewee asking similar pointed questions would be similarly useless. because it's easy to take that high and mighty stance instead of having a conversation. it creates an artificial environment that doesn't really exist in actual working environments.

          and i generally feel that companies are far too arrogant in their hiring process. they very much create a one-way dialog, as if you should be thanking them for even interviewing you. they act like "we don't need you, you need us". it leaves a very bad taste in my mouth, and even if i were offered employment by such a company, it is possible i would turn it down unless i were convinced that their work environment is distinctly different from their interviewing process.

      • ummonk 6 years ago

        Presumably they wouldn't be employed there if they couldn't pass the interview though?

    • melse 6 years ago

      I don’t know how long ago, but in my experience of interviewing with Jane Street (at least as an intern), they were pretty upfront about the lack of expected OCaml experience, and I was able to complete all of the on site interviews in a language of my choice. The HR/recruiting team seem eager to improve their process based on people’s feedback, so maybe this is something that took time to get right.

      • burkaman 6 years ago

        I don't blame the engineers at all. I thought the questions were fair and not gimmicky, just unusually difficult. I failed because I was straight out of college and it was my second real interview ever. The only issue was that the HR person was very misinformed and put me a little off balance.

        • melse 6 years ago

          Yeah, likewise I hadn’t done many real interviews before this, so I guess I have nothing really to compare it to. I ended up with entirely programming questions rather than generic brain teasers, so maybe there’s a bit of a mixture of questions asked.

  • deepGem 6 years ago

    The type-safety argument is total BS. First of all, the training script will fail the very first time you run it if there is a type error. You'd have to be a moron to pass an argument of a different type 'a couple of hours' into the training; no sane programmer writes such code. What kind of nonsensical argument is this?

    What I have found static typing to be really useful for is in remembering what I have coded. It's quite hard to remember a dynamic type while you are writing code, given the number of variables you are dealing with. Seeing that type definition next to your variable name is a handy reference. I find it helpful to speed up coding a bit and being able to remember a lot more clearly what I have done.

    • mlthoughts2018 6 years ago

      Static typing is also a nice way to communicate design intentions. But for this to work, the annotations have to be very expressive.

      I don’t know the first thing about OCaml, but I have worked professionally with Haskell and static typing is a joy when it adds clarity and makes the contracts of functions instantly readable.

      Contrast this with Scala, which I have also worked with professionally and the difference is stark. Scala type annotations are much harder to read, and the mechanism of implicits can make for extremely mysterious code that looks like it shouldn’t compile and only once you track down some distant implicit that’s somehow in scope, can you make sense of the way types are flowing through some function contract.

      • deepGem 6 years ago

        Yep, agreed. Though you can communicate design intentions in comments, no? There is another argument that programmers might not follow the comments, so strict enforcement by types helps. I don't believe in such a philosophy, though; most programmers will do the right thing, and mistakes are not intentional.

        • elbear 6 years ago

          It's not that they don't follow the comments, it's the fact that comments aren't executable, so they rot. People forget to update them. That can't happen with type definitions, because your code stops compiling.

          Sure, comments serve their purpose, but that purpose only slightly overlaps with that of static types.

    • nnq 6 years ago

      > You'd be a moron to pass an argument of a different type 'a couple of hours' into the training

      Huh?! At least in non-ML code this happens all the time: data fetched by some thingie that uses a zillion chained libraries nobody has time to audit comes in hours or days late in a long-running service and blows it up... e.g. "oops, point.x is no longer an integer but more like a map[ErrorObject->vector[int]]" because something blew up in a very unexpected way in some other nodejs code light-years away from the business logic you hold in your head... (yeah, the service gets restarted, but at some point data that should have been saved in the DB hasn't been, and may need to be recovered manually from some obscure log, if it's recoverable at all)

      • gaius 6 years ago

        in some other nodejs code

        ML/DL is nothing at all like webdev :-) but these days you can compile OCaml to JavaScript if you want, I encourage you to check it out

      • deepGem 6 years ago

        Yeah, I should have been more explicit that this comment is in reference to ML code. Training is nothing but a loop, so it's unlikely you pass type A in iteration 1 and type B in iteration 200. If that happens, most likely your training data is messed up, and type safety cannot help you, as you would have compiled and tested for type A.

        • mlthoughts2018 6 years ago

          You could also have a complex training job that trains a shallow model for a while, then uses it to train a deeper model, or extract an embedding from some layer and train a classical prediction model that uses the embedding as the feature vector.

          But the point is that the right way to ensure safety is with realistic fixture-based integration testing. That’s not what static typing is for in that type of use case and is not a de facto benefit of static typing.

    • throwawaymath 6 years ago

      > What I have found static typing to be really useful for is in remembering what I have coded. It's quite hard to remember a dynamic type while you are writing code, given the number of variables you are dealing with. Seeing that type definition next to your variable name is a handy reference.

      If you consider this a benefit, then (for example with Python), don't you get the same benefit just by using docstrings? Stated another way, if all you want is a visual cue about what you're passing to a function and getting returned, why bother with all the scaffolding of type safety? You can get that just by using documentation facilities outlined in various languages' style guides. Those language facilities (such as docstrings) tend to be very useful and a good engineering practice in general.
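      For instance (a minimal sketch; both functions are hypothetical), the two styles of cue look like this in Python:

      ```python
      def scale_documented(xs, factor):
          """Multiply each element of xs (list of floats) by factor (a float)."""
          # The docstring is a cue for the reader but is never checked.
          return [x * factor for x in xs]

      def scale_annotated(xs: list[float], factor: float) -> list[float]:
          # The annotation is the same cue, and a tool like mypy can also
          # verify call sites against it.
          return [x * factor for x in xs]
      ```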

      The point of type safety is actually to obviate what you're talking about. Smart developers can and do make the mistakes you're saying only a moron would make, regardless of available visual cues. Offloading that decision making process to a language that complains when you make that mistake instead of being forgiving about it is entirely the point.

      So I guess what I'm saying is that I'm struggling to understand why you think type safety is BS. If I read you correctly, it sounds like you'd also say that developers committing memory corruption vulnerabilities are morons, and that the scaffolding of memory management and garbage collection is BS. Why not just have explicit references a developer can read while coding to make sure they're not overflowing a container, right?

  • bcyn 6 years ago

    > Then the interview was nothing but deck of card puzzles and random riddles

    That's really disappointing. Was the position you applied for Software Developer, or a specific deep learning position?

    • mlthoughts2018 6 years ago

      Specifically posted as a deep learning position.

  • typon 6 years ago

    I've always thought that static typing is the least interesting thing about functional languages.

    • rkido 6 years ago

      Not all function-oriented languages use static binding. Due to how the BEAM VM works, Elixir/Erlang is very, very function-oriented but also late-bound. Nevertheless, its type system is relatively rich compared to most other late-bound languages.

      It's never static binding that makes a language good or interesting, apart from performance benefits. You can get the productivity boon of early-warning type-checking with or without it. What makes a language good, in my opinion, is its ability to provide a type system that closely matches the needs of the domain in which that language is to be used. For example, game development requires a data-oriented approach, which Rust's type system practically forces the developer to adopt.

      More complex software engineering problems require more expressive type systems. But unless you're the one writing TensorFlow, machine learning is insignificant from a systems engineering perspective; it's simple enough for non-programmers. Thus, expressive type systems don't seem to offer much benefit here.

preparedzebra 6 years ago

I'm not convinced that functional programming will grow in terms of devs using it daily, but it has been very useful for myself in certain contexts (especially when I wrote math based libraries using permutations, heavy recursion, etc). The results of this seminar are awesome!

mark_l_watson 6 years ago

Very nice. I have spent many evenings playing with the Haskell bindings for TensorFlow that don’t have the coverage these OCaml bindings have (e.g., character seq models).

I have thought of learning some OCaml, maybe this will give me the kick in the butt to do it.

senorsmile 6 years ago

Am I the only one who gets confused by references to ML (the ML family of typed FP languages vs. machine learning)? The threads on this page represent a strange junction where I really have to think about what people mean, because they really could mean either!

  • yaseer 6 years ago

    This happens to me too, having dabbled with ML for theorem proving.

    Thing is, ML is an obscure language for most people. The association with machine learning probably dominates for 95% of people.

    • icc97 6 years ago

      It's becoming less obscure: F# (.NET), Elm, and Reason (JavaScript) are bringing ML to a wider audience. Plus, Jane Street do a great job of promoting the use of OCaml.

  • ummonk 6 years ago

    I usually assume ML refers to machine learning and ML-like, ML-family, ML-derived, etc. refers to the FP languages.

  • checkyoursudo 6 years ago

    ML with ML?

    I occasionally have to double check, yes.

gaius 6 years ago

Type-safety helps you ensure that your training script is not going to fail after a couple hours because of some simple type error.

This isn’t a failure mode that ever happens in DL... 2 hours into the job you will only be dealing with floats anyway no matter what language you are using. If you’re going to fail on anything typed it will be in the first 20 seconds probably, basically the instant you start your first epoch.

  • habitue 6 years ago

    This is true, but it's also because TensorFlow is effectively a typed language that borrows the syntax of Python. In TensorFlow (at least in the default mode), the graph is type-checked on startup, and it'll fail if anything is wrong.

    Contrast this with PyTorch, Chainer, or TensorFlow's dynamic computation graphs, which are much more likely to have a bug that surfaces later, since their graphs aren't verified up front.
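    The timing difference can be caricatured without either library (toy code, not the real TensorFlow or PyTorch APIs):

    ```python
    class MatMul:
        # Records operand shapes so they can be validated.
        def __init__(self, a_shape, b_shape):
            self.a_shape, self.b_shape = a_shape, b_shape

        def check(self):
            if self.a_shape[1] != self.b_shape[0]:
                raise ValueError("shape mismatch")

    def build_static_graph(ops):
        # Static-graph style: every op is validated at "startup",
        # before any training work happens.
        for op in ops:
            op.check()
        return ops

    def run_eager(ops):
        # Eager style: a bad op only surfaces when its step executes,
        # after earlier steps have already burned compute time.
        steps_done = 0
        for op in ops:
            op.check()
            steps_done += 1
        return steps_done
    ```

    With `[MatMul((2, 3), (3, 4)), MatMul((2, 3), (5, 4))]`, the static build raises before any step runs, while the eager loop completes a step before hitting the same error.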

    Unfortunately, typed languages won't help you much there. A big reason people use pytorch is because of its flexibility (i.e. they were bumping up against the constraints of a static graph system and wanted out)

    • gaius 6 years ago

      I admit I don’t have much experience with those; I’m a Keras and CNTK guy, but the principles will be the same: marshal your data into a huge matrix of floats/one-hot vectors and hand it off to training, where it will spend 99.9% of its time.

      I am a fan of strong/static typing and was once very active in the OCaml community but that just struck me as a very odd thing for the OP to say... it’s just not something that people doing DL worry about. It could be valuable in the marshalling phase but that all happens before DL begins and (in my experience) in a separate program.

rememberlenny 6 years ago

For reference, Jane Street is a financial firm known for its widespread use of OCaml.

  • remify 6 years ago

    I'd add that the author Laurent Mazare is a fucking brilliant person.

    • mi_lk 6 years ago

      You know him personally? Or he has some prior works that we can take a look?

      • remify 6 years ago

        Ahah no, I was just looking at his CV and the guy is jacked.

        • 3rdAccount 6 years ago

          Hahaha! This is a comment I would normally make and be scoffed at.

          I'm always amazed at how smart some people are.

      • bachmeier 6 years ago

        remify is Laurent Mazare.

        • mi_lk 6 years ago

          What a twist.