572 points by matt_d 3 months ago
I read this book titled "Designing Data Intensive Applications", which covers this and a lot of other stuff about designing applications in general. https://www.amazon.com/Designing-Data-Intensive-Applications...
Just finished reading DDIA, can't recommend it enough! I learned a lot of new info in every single chapter even for topics I thought I had a firm grasp on. Great job Martin Kleppmann!
He also published this awesome paper on CRDTs, "A Conflict-Free Replicated JSON Datatype".
It's the best written computer-related book I've read. On a par with Friedl's "Mastering Regular Expressions".
Very highly recommended.
> On a par with Friedl's "Mastering Regular Expressions".
That comparison sold me. You deserve a commission.
Be warned, the O'Reilly print quality is miles apart between the two. Their print-on-demand text quality and the binding are real let-downs.
That's too bad. The MRE book was great - the special typography for zero-width indicators and whatnot was really great. Progress.
Currently reading the book and I agree with your recommendation, it's very well-written, perhaps written with an intuitive learner in mind.
Too bad I can't upvote this multiple times.
Thanks, I literally just came here to ask for book recommendations on this topic.
Are there any other suggestions?
I know this sounds like a cliche at this point, but volume 3 of Donald Knuth's "The Art of Computer Programming" goes into more depth on the theory that underpins these algorithms than anything else I've ever seen/read/heard of (in fairness, I haven't read OP's suggestion yet, though).
This is one of the best technology books I've ever read. I spent a few years diving into big data architecture. I thought I had a reasonably good handle on it, then I read this book. So many insights.
That book is a must-read for anyone dealing with data.
The simultaneity of your and mkandas89's comment is what I'd call a canonical example of spooky action at a distance.
Its a must read for anyone dealing with data
Completely agree, fantastic book. Does anyone know of any similarly wonderful technical books?
There's is one important detail that is never fully articulated in this article. It says B-trees are read optimized. This is true, but there's only a big difference for random reads and not sequential reads. Because seeking a key in LSM is relatively expensive, because it has to read multiple indexes. But after this is done, reading the a lot of keys an values from this point on is not expensive (or at least not much more so than in a B-tree). The reason for this is mentioned in the article, which are the SSTables. These allow for quick through the keys, because they are stored on disk in sorted order.
LSMs do not guarantee that sequential key/value pairs are stored in the same SSTable so reading "n" k/v pairs, in the worst case, could require seeking through "n" different SSTables. There are datastores that use LSMs such as HBase that provide guarantees on top of LSMs to facilitate efficient range reads and, of course, many LSM compaction algorithms improve range reads, but the basic LSM is optimized only for single k/v reads. That's the reason, for example, why Cassandra doesn't even offer range reads such as "SELECT * FROM mytable WHERE KEY > x AND key < (x + k)".
> That's the reason, for example, why Cassandra doesn't even offer range reads
This is not true. The reason why Cassandra doesn't support that is because of hashing of keys across the cluster -- you'd have to query all shards and merge the results. That has nothing to do with the LSM storage.
cough bcachefs's b+ tree can push about a million random updates per second (almost half a million single threaded, last I checked).
I haven't heard of anything faster...
edit: found the benchmarks I ran
Congrats! That is Redis-on-Xeon territory! https://redis.io/topics/benchmarks
Though I assume that, like Redis, a reboot would lose acked writes?
An SSD can't save a write in a microsecond, its latency is at least an order of magnitude higher, right?
Enterprise drives usually have ultra caps with enough energy to power the writeback of the writebuffer to stable storage in case of power loss, thus your write latency is just the time it takes to hit the memory buffer, typically dominated by the PCIe transfer, that is, a few µs (unless of course, the buffer has filled but that means hitting peak write bandwidth).
Consumer drives typically don't pay for this, but instead good drives buffer writes in SLC flash which has lower latency than MLC.
ADDENDUM: funny thing is that you have to actively manage the energy in the caps; the writebuffer must never have more data than you have energy to write back. This becomes especially important on power-up where the empty cap is charging.
Yes, unless you did a journal flush. Default journal flush interval is 1 second because of hard drives, but on a good SSD it can be turned up to around 10 ms and still maintain those numbers.
NAND program operations are in the order of O(100us) to O(1ms).
I see how you made that mistake now. SSDs typically have many (eg 10s-100s) program operations happening in parallel. They also pack each write into something around 16kb or larger units so it's possible multiple b+ op are in each program.
Isn’t this pretty traditional stuff? What is modern about it?
Does any of this map to a GPU for column-oriented analytical data processing? Basically, machines are only good at reading and writing large, contiguous chunks of data. As these machines evolve, the optimal width of those chunks keeps getting larger. The volume of data available is growing. And the types of operations being done on data are becoming more “analytical” (meaning column-oriented and streaming access, rather than row-oriented random access). I would expect “modern storage” algorithms to therefore be cache friendly, column oriented and take the modern, in-memory storage hierarchy into account (from on-chip registers to, to high bandwidth GPU type parallel devices, to NVRAM system memory).
This article comes off to me like a CS101 intro doing Big-O asymptotic analysis on linked lists, without even mentioning the existence and effects of memory caches.
This reminds me of RAID 6’s use of abstract algebra.
I wrote a blog post a while back that might be of interest: https://www.akalin.com/intro-erasure-codes . It's similar to Igor's but fills in a few more details.
That's a fantastic introduction (although it's far more than an introduction!). I wrote a native addon for Node.js that does Reed Solomon, also using optimized Cauchy matrices (MIT license): https://github.com/ronomon/reed-solomon
Very nice! I think I stumbled on your implementation when researching my post. :)
Thanks. If that was in November last year, then it's changed a lot since then (and now a few times faster). Back then it was using a Vandermonde matrix as it was based on Backblaze's Java implementation, but a month ago everything was rewritten to use optimized Cauchy matrices.
Link below with more info, for those who were interested like me!
Reed Solomon Coding.
Linear algebra. It’s so hot right now.
Until recently I worked for a molecular neurobiology research group that, among other things, used Single-molecule RNA Fluorescence in situ Hybridization (smFISH) to measure gene expression in tissue.
Basically: design viruses with fluorescent molecules attached to them such that they attach to specific sections of RNA in the cell that are associated with particular genes. Soak tissue in viruses. Look at tissue through a microscope and count individual fluorescent dots, each of which represents one RNA molecule (I find this absolutely mind-blowing). Wash off the viruses with a specific type of chemical, and repeat with a new one for a different gene.
You can only use a few colours at a time because otherwise the microscope cannot discern them. But that would severely limit the throughput - there are a lot of genes we want to check, but the tissue will also degenerate after repeated washing. So, what can we do? Well, as I've understood, scientists using smFISH and similar techniques now use multiplexing and with Hamming codes to get around that.
So yeah, linear algebra is definitely so, so hot right now.
Doesn't RAID 6 just use Reed-Soloing/Hamming codes? Is that based on abstract algebra?
Most interesting thing I have seen for for database indexes is this paper.
Using a Neural Network in place of a B-tree. What is interesting is a NN can use many processors at the same time versus you can not with a B-tree.
In the end it comes down the power as in joules to get something done.
it's more about space efficiency. if a model can fit in only one or two cachelines, and get you "close enough", you may end up touching a lot less memory than if you just did a binary search.
Obligatory reiserfs joke: http://www.baltimoremick.com/blog/wp-content/uploads/2008/07...
This was a bad post and shouldn't be preserved. And I can still edit it. So I did. Please see below, it's a better post (even if it's bad).
>> LSM-trees are great, but better stuff exists.
Sure. My knowledge is pretty much limited to what this articles talks about, so interested to know what else is out there.
Fractal Trees as used in TokuDB (now owned by Percona).
The big thing in practice is that TokuDB vs InnoDB is dealing with large datasets.
However, I don't know where the very latest MyRocks stands vs TokuDB.
This may be an interesting comment if it had some keywords to search or (better) links. As is, I really don't see the point of it.
I don't feel like it tonight. Drinking whiskey and trying not to think about how effed up the Israel footage and the North Korea situation are. Maybe that's why it's not the best post.
I don't mean to crap on the work presented. Great article, good summary, and the tech is solid. It's just older than what excites me; a lot of progress has been made in 20 years and the majority of it hasn't found commercial applications.
But here is a citation root for a lot of amazing work in this space:
My absolute hero Edward Kmett gave a talk stitching a lot of this work together a long time ago: https://www.youtube.com/watch?v=uA0Z7_4J7u8 . I have no idea if he's pursued it, it's just one of his many talks that left me with an incredibly lasting impression.
Variants of this technique work for arbitrary documents and structures, work better at very high volume, have cache oblivious properties, and support transactions. Universal indexes that are reasonably good for all queries (as opposed to specfiic queries) are also possible. Coupled with discrimination-based techniques for O(n) table joins, there's probably a whole startup around there.
Sorry I can't do better right now.
I'm curious what you mean by "cache oblivious". How can a data structure be "cache oblivious"?
If you're optimal for a cache without knowing how big the cache is, you're optimal for all caches. It's tough to do, but it happens.
This property probably doesn't get the respect it deserves in this super weird world where you can't really say how many caches are between you and your data.
An intro here: https://rcoh.me/posts/cache-oblivious-datastructures/
Really good read, thanks. My brain couldn't quite grok the idea of a "cache-oblivious" data structure, but the article suggests they'd be more aptly described as "cache size-oblivious".
Thank you. That youtubed presentation was quite interesting.
Are you connected somehow to the israel and north-korea situations ? (meaning more than a normal person)
Great response, thanks!