CS 522: Machine Learning Approaches to Decode the Human Genome

cs522.stanford.edu

171 points by casawa 6 years ago

hateful 6 years ago

Direct link to video (via link on top of site): https://www.youtube.com/watch?list=PLeEhPBsiwmeQPNgi1iHb4Bi4...

casawa 6 years ago

If of interest, other notes are available at cs522.stanford.edu and more will be available shortly!

indescions_2018 6 years ago

Yeah, definitely interested ;)
Any chance the course has room for the other side of the coin? Namely, how neuroevolution and genetic strategies inform deep reinforcement learning?
I'm also interested in learning about the state of the art in cloud-based packages. I noticed that Google recently released a tool called DeepVariant for use on their genomics platform.
https://github.com/google/deepvariant
Creating a universal SNP and small indel variant caller with deep neural networks
https://www.biorxiv.org/content/early/2018/01/09/092890
- chillee 6 years ago
  
  Answer: it doesn't.
  Genetic algorithms and the like are pretty much all terrible. They're ways of approximating your gradient, and they fall to the curse of dimensionality. The only reason Uber and OpenAI published their papers on evolutionary strategies (something pretty different from what people think of as genetic algorithms) is that current policy gradient methods are really bad as well, allowing what is effectively random search to do well.
  It's kinda like how Bayesian hyperparameter optimization is pretty terrible and 2x random search almost always beats it easily.
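To make "approximating your gradient" concrete, here is a minimal sketch of an evolution-strategies style gradient estimate using mirrored random perturbations. It is not the method from the Uber or OpenAI papers; the toy objective and the names `loss`, `es_gradient`, and `es_minimize` are assumptions made up for illustration.

```python
import random

def loss(theta):
    # Toy objective (an assumption): squared distance from the point (3, -2).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2

def es_gradient(f, theta, sigma=0.1, n_samples=50, rng=random):
    """Estimate the gradient of f at theta using only function evaluations."""
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        # Draw a random search direction.
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        # Mirrored sampling: a finite difference along eps, averaged over samples.
        scale = (plus - minus) / (2.0 * sigma * n_samples)
        for i, e in enumerate(eps):
            grad[i] += scale * e
    return grad

def es_minimize(f, theta, steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven by the noisy ES gradient estimate."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = es_gradient(f, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

if __name__ == "__main__":
    final = es_minimize(loss, [0.0, 0.0])
    print(final, loss(final))
```

Each step costs `2 * n_samples` evaluations of the objective, and the number of samples needed for a usable estimate grows with the number of parameters, which is the curse-of-dimensionality point made above.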
vrm 6 years ago

dude nice job getting exposure with this!
mchowdhury 6 years ago

+1
visopsys 6 years ago

+1

JepZ 6 years ago

Does anybody know why we can't just write a cell simulator and start experimenting that way with DNA manipulation? I mean, I have no idea what the first nucleotides are for, but if I had a simulator I could try changing them and see what happens.

This may sound like a naive approach, but is there anything special hindering us from building such a simulator or is it just that scientists will not find it useful as it will take too long before we know which nucleotides are important for our goals?

viewtransform 6 years ago

To build the cell simulator you would have to model all the biochemical pathways in a cell. This is a bit of a problem:
http://biochemical-pathways.com/#/map/1
http://biochemical-pathways.com/#/map/2
- JepZ 6 years ago
  
  Thanks, cool links. I didn't even know you could use Leaflet for something other than maps.
  
  constantlm 6 years ago
  
  Didn't even notice Leaflet there, nice spot. TIL
landryraccoon 6 years ago

> Does anybody know why we can't just write a cell simulator
This is hard. To my knowledge we can't even write the most basic components of a cell simulator yet. One of the obvious requirements would be a protein folding simulator. Nobody has been able to come up with a working one of those yet.
> when I have a simulator I can try changing them and see what happens?
If you write a simulator that does this you will be a billionaire and probably win a Nobel prize in medicine while you're at it.
- SimbaOnSteroids 6 years ago
  
  What is the major hurdle that needs to be cleared to get a NN to figure out protein folding? I understand that protein folding is complex, but it's also the sort of problem that I perceive, perhaps extremely naively, a NN being good at solving.
- hoelle 6 years ago
  
  Yep. I recently met a team working on this stuff at the Allen Institute for Cell Science. The problem space is huge and the field is totally in its infancy.
  If anyone is interested in this work, check their job site. Last I heard they were looking to hire a programmer to help write the sim. Cool team + science.
  
  jjjensen90 6 years ago
  
  Out of curiosity I checked their jobs site. What a cool opportunity, working in a nascent field of science at a prestigious institution.
  It was funny to notice that they seem to value a Bachelor's degree at a rate higher than 1:1 with career experience:
  > 4-5 years of experience in a software development team AND a bachelor’s degree in computer science or a related field | OR 15+ years of relevant working experience...
  > 6-9 years of experience in a software development team AND a bachelor’s degree in computer science or a related field | OR 20+ years of relevant working experience
  That is a pretty heavy premium to put on those years during undergrad. I certainly wasn't as good by undergrad + 5 years as at 15+ years in the field, but maybe they know better than me.
fsloth 6 years ago

I'm not sure if you could do that without a full quantum dynamic simulation of each cell organelle.
Is there anything better than the DFT method for quantum chemistry? You can look it up and see how much effort it takes to simulate just a few molecules.
I'm pretty sure full cell level simulation would revolutionize biology and medicine.

proc0 6 years ago

> Learning the DNA regulatory code of the genome

This is so interesting. I cannot imagine what kind of language evolution chose to build on top of DNA. There must be some paradigm it maps to, and it's going to be incredibly interesting to see if a compiler/interpreter can be made for DNA, along with higher level languages that compile down to it.

shaki-dora 6 years ago

If you’re interested, much of what you’re asking about is actually known: (some) DNA sequences map directly to protein sequences. Because DNA has a 4-letter alphabet (G, C, T, A), it takes three letters (a “codon”) to map to one of the 20 or so different amino acids that make up proteins, since two letters would give only 16 combinations. Proteins, in turn, are the “machines” that work in cells.
DNA also contains regulatory segments, errors, “dead” code left over (and carried through generations) long after it stopped being transcribed, DNA once inserted by viruses (both active and inactive) etc etc...
If you want an analogy with computer code: it’s most like a spaghetti-code hairball of assembly coding its own compiler, IDE, vim, a few games, and a neural network ten orders of magnitude the size of anything TensorFlow can do. It does it all on hardware that works only probabilistically. And it is constantly starved for resources, leading to hacks such as DNA sequences that code for two completely different functioning proteins depending on reading it either forward or backward, or starting to read at an offset (what’s called a “reading frame”).
There’s a thick book on molecular biology by Alberts et al. It’s the most fantastic deep dive into this, and many other, insanities. I believe Larry Page used to recommend it to all new Googlers.
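The codon mapping and the reading-frame trick described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard genetic code, ignores start-codon and splicing rules, and the names `CODON_TABLE`, `translate`, and `reverse_complement` plus the example sequence are made up here.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate DNA into one-letter amino acids, starting at offset `frame`.

    '*' marks a stop codon; an incomplete trailing codon is dropped.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[b] for b in reversed(dna.upper()))

if __name__ == "__main__":
    seq = "ATGGCCTAA"  # hypothetical example sequence
    for frame in range(3):
        print("forward frame", frame, translate(seq, frame))
    print("reverse frame 0", translate(reverse_complement(seq)))
```

The same nine letters read as different peptides depending on the frame and the strand, which is why one stretch of DNA can in principle encode more than one protein.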
- entee 6 years ago
  
  MBOC is the book you're looking for by Alberts:
  https://www.amazon.com/Molecular-Biology-Cell-Bruce-Alberts/...
  It's standard reading, at the very least for intro grad/senior undergrad students in the biosciences.
  Great book, well written, well curated.
  Parent commenter is correct, DNA->function is massively complicated. The main wiki article to start with is:
  https://en.wikipedia.org/wiki/Central_dogma_of_molecular_bio...
  As the article notes, there are endless exceptions and edge cases. It links to various examples of those.
- madhadron 6 years ago
  
  I've stopped recommending Alberts. It's a great cartoon guide to a mythical average eukaryotic cell, but it abstracts much farther than the data can bear and leaves the reader without the intellectual tools to work with the material in it. And so you get computer scientists thinking about assembly language and compiling and physicists building little stochastic models of state transitions without knowing the biological considerations that lead those efforts astray.
  Not that I have a book to recommend instead...
  
  entee 6 years ago
  
  I disagree, it's a decent overview that introduces the major concepts and how they work. It's not intended to be at the cutting edge or to provide all caveats. By definition all textbooks in biochemistry, biology and the like are out of date by the time they go to press.
  The average MBOC provides is enough of a basic understanding of molecular biology to move on to more advanced work including papers that start to get into the nitty gritty details. Note that many people haven't been even exposed to the basics!
  
  madhadron 6 years ago
  
  I'm not worried about it being out of date. I worry about the misconceptions it leaves in the minds of those who learn from it. They can parrot the words, but the mental models that result from studying it regularly lead people astray. At least, that's my anecdotal observation as a research biologist.
  
  collyw 6 years ago
  
  Wow, that's a blast from the past. I started a Molecular Biology undergrad in 1992 and that was one of the books we had to buy. Mr Alberts must be doing all right if it has been a course text on so many courses for so many years.
- AllegedAlec 6 years ago
  
  > If you want an analogy with computer code: it’s most like a spaghetti-code hairball of assembly coding its own compiler, IDE, vim, a few games, and a neural network ten orders of magnitude the size of anything TensorFlow can do. It does it all on hardware that works only probabilistically. And it is constantly starved for resources, leading to hacks such as DNA sequences that code for two completely different functioning proteins depending on reading it either forward or backward, or starting to read at an offset (what’s called a “reading frame”).
  Not only that, but it's also (quite literally) radiation hardened: the genome has evolved to a state where most changes to the DNA do very little. Or, if the environment is changing frequently, it'll have evolved to a place in the genome space where it has a higher chance to obtain more beneficial mutations (for more info on this sort of stuff, look up articles by Hogeweg, Colizzi or Crombach).
  As a computational biologist, I can tell you that evolution works in many ways, but most likely not in the way you'd expect it to.
madhadron 6 years ago

This is a common idea among computer scientists looking at biology. They see "sequence of base pairs" and immediately think of a Turing machine's tape or a memory segment in a computer. It leads them astray. There's no reason to think that evolution has or would construct anything resembling a language or a language paradigm. Evolution doesn't introduce abstractions.
Now, programmers, by our neurophysiology and training, recognize things that look like abstractions in what evolution produces, and near-abstractions are useful to reduce our cognitive load when we're studying biology, but a biologist always keeps in mind a list of exceptions to those abstractions. If you don't know of exceptions, it is almost always a fruitful research question to find them.
- Baeocystin 6 years ago
  
  I have always found this example of using evolutionary processes in FPGAs (and how utterly bizarre the resulting circuitry is) to be very useful in clarifying just how different the biological world is from the CS one.
  https://www.damninteresting.com/on-the-origin-of-circuits/
  tl;dr - evolution takes advantage of the entire solution space without any respect for the abstraction layers we've created in our minds.
  
  thaumasiotes 6 years ago
  
  The most interesting (IMO) extract:
  > after just over 4,000 generations, test system settled upon the best program. When Dr. Thompson played the 1kHz tone, the microchip unfailingly reacted by decreasing its power output to zero volts. When he played the 10kHz tone, the output jumped up to five volts.
  > Dr. Thompson peered inside his perfect offspring to gain insight into its methods, but what he found inside was baffling. The plucky chip was utilizing only thirty-seven of its one hundred logic gates, and most of them were arranged in a curious collection of feedback loops. Five individual logic cells were functionally disconnected from the rest— with no pathways that would allow them to influence the output— yet when the researcher disabled any one of them the chip lost its ability to discriminate the tones. Furthermore, the final program did not work reliably when it was loaded onto other FPGAs of the same type.
  > It seems that evolution had not merely selected the best code for the task, it had also advocated those programs which took advantage of the electromagnetic quirks of that specific microchip environment. The five separate logic cells were clearly crucial to the chip’s operation, but they were interacting with the main circuitry through some unorthodox method— most likely via the subtle magnetic fields that are created when electrons flow through circuitry, an effect known as magnetic flux.
  This nicely illustrates a major advantage of evolutionary processes: they can use any resource in the environment, whether you know that resource exists or not.
  The program's crippling overspecialization ("the final program did not work reliably when it was loaded onto other FPGAs of the same type") is also typical of evolutionary processes.

davidgl 6 years ago

For anyone who missed it, also see this great link on HN recently: https://news.ycombinator.com/item?id=16233644 - DNA seen through the eyes of a coder (2017)

zitterbewegung 6 years ago

Can I give you my genome? Or will I be able to reproduce your methods from this site?

sinab 6 years ago

I think there is an important distinction between sequencing a genome and then using ML methods to extract meaning from it. Sequencing has traditionally been hard but is now more of a tractable problem [1, 2].
To answer your question, if you have your genome and a dataset of known genomes marked with functions according to regions, then you could probably perform an interesting analysis.
[1] https://nanoporetech.com/ [2] https://www.illumina.com/
maxander 6 years ago

This is a course on a fairly novel field of research, not a method to do anything remotely consumer-facing.
- nextos 6 years ago
  
  I disagree! It's a fantastic domain for startups! Nanopore is readily available. I have sequenced stuff in my kitchen, and I'm a computer scientist.
  
  maxander 6 years ago
  
  DNA sequencing is mature and, if anything, almost getting to be late in the game for new startup entries. Determining gene function through ML methods is a whole different thing, and that's what the OP and I were talking about. Are we all responding to the same article here?
  
  alsocasey 6 years ago
  
  I haven't looked into their new MinION much yet, but how cheap are we really talking about?