Qri: A global dataset version control system built on the distributed web

204 points by anewhnaccount2 5 years ago

I really love the design and style of Qri! It is fun!

Can I ask why, for a git-style system, IPFS was chosen instead of GUN or SSB?

Certainly, images/files/etc. are better in IPFS than GUN or SSB.

But, you're gonna have a nightmare doing any git-style index/patch/object/etc. operations with it - both GUN & SSB's algorithms are meant to handle this type of stuff.

Did you guys do any analysis?

b_fiive 5 years ago

hey, qri dev here. Delighted you like the design, we're hoping to make data a little more "approachable" :)
We did look into SSB. I'll admit to not hearing about it until only a few months ago, but the main reason we chose IPFS was for single-swarm behaviour, allowing for natural deduplication of content (a really nice property for dataset versioning).
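To make the deduplication point concrete, here is a minimal sketch of content addressing. It is not how IPFS or Qri is implemented: the `ContentStore` class, the one-block-per-row chunking, and the sample rows are all assumptions chosen for illustration. The idea is only that identical bytes hash to the same key, so a new version of a dataset stores just the blocks that changed.

```python
import hashlib


class ContentStore:
    """Toy content-addressed block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> raw bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes always produce the same key, so repeats are stored once.
        self.blocks[key] = data
        return key

    def put_dataset(self, rows):
        # Assumed chunking: one block per row. Real systems chunk differently.
        return [self.put(row.encode("utf-8")) for row in rows]


store = ContentStore()
v1 = store.put_dataset(["id,name", "1,alice", "2,bob"])
v2 = store.put_dataset(["id,name", "1,alice", "2,bob", "3,carol"])

# Blocks referenced by both versions are shared rather than copied.
shared = set(v1) & set(v2)
print(len(store.blocks), "blocks stored for", len(v1) + len(v2), "row references")
# prints: 4 blocks stored for 7 row references
```

In this toy, the second version costs one extra block even though it references four rows.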
The majority of our work has been in the exact area you mentioned, building up a dataset document model that will version, branch, and convert to different formats. We've gone so far as to write our own structured data differ (https://github.com/qri-io/deepdiff). I'm very happy with the progress we've made on this frontier so far.
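For readers wondering what a structured data differ does, the following is a rough sketch of the general idea: walk two nested documents and report changes by path. It is not the deepdiff algorithm or its API. The `diff` function, the tuple format, and the sample dataset (with made-up numbers) are assumptions for illustration, and the list comparison is deliberately naive.

```python
def diff(old, new, path=""):
    """Return (op, path, old_value, new_value) tuples for nested dicts and lists.

    Assumes dict keys are strings, as in JSON documents.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}/{key}"
            if key not in new:
                changes.append(("delete", child, old[key], None))
            elif key not in old:
                changes.append(("insert", child, None, new[key]))
            else:
                changes.extend(diff(old[key], new[key], child))
        return changes
    if isinstance(old, list) and isinstance(new, list):
        changes = []
        # Naive positional comparison: no detection of moved or shifted rows.
        for i in range(max(len(old), len(new))):
            child = f"{path}/{i}"
            if i >= len(new):
                changes.append(("delete", child, old[i], None))
            elif i >= len(old):
                changes.append(("insert", child, None, new[i]))
            else:
                changes.extend(diff(old[i], new[i], child))
        return changes
    if old != new:
        return [("update", path, old, new)]
    return []


# Two hypothetical versions of a small dataset (values are made up).
v1 = {"meta": {"title": "Population"},
      "body": [["USA", 2017, 325], ["FRA", 2017, 66]]}
v2 = {"meta": {"title": "Population", "license": "CC0"},
      "body": [["USA", 2017, 326], ["FRA", 2017, 66], ["DEU", 2017, 82]]}

changes = diff(v1, v2)
for change in changes:
    print(change)
```

A production differ would also need to handle row insertions in the middle of a list without reporting every later row as changed, which this sketch does not attempt.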
I'm a huge fan of SSB, but don't think it's well suited for making datasets globally discoverable across the network. In the end the libp2p project tipped the scales for us, providing a nice set of primitives to build on.
- marknadal 5 years ago
  
  Nice work!

DocSavage 5 years ago

Interesting project, particularly with the choice of IPFS and DCAT -- something I'll have to look into. There have been other efforts to handle mostly file-based scientific data with versioning in both distributed (Dat https://blog.datproject.org/tag/science/) and centralized ways (DataHub https://datahub.csail.mit.edu/www/). Juan Benet visited our research center to give a talk about IPFS a few years ago. Really fantastic stuff.

I'm the creator of DVID (http://dvid.io), which has an entirely different approach to how we might handle distributed versioning of scientific data primarily at a larger scale (100 GB to petabytes). Like Qri and IPFS, DVID is written in Go. Our research group works in Connectomics. We start with massive 3D brain image volumes and apply automated and manual segmentation to mine the neurons and synapses of all that data. There's also a lot of associated data to manage the production of connectomes.

One of our requirements, though, is having low-latency reads and writes to the data. We decided to create a Science API that shields clients from how the data is actually represented, and for now, have used ordered key-value stores for the backend. Pluggable "datatypes" provide the Science API and also translate requests into the underlying key-value pairs, which are the units for versioning. It's worked out pretty well for us and I'm now working on overhauling the store interface and improving the movement of versions between servers. At our scale, it's useful to be able to mail a hard drive to a collaborator to establish the base DAG data and then let them eventually do a "pull request" for their relatively small modifications.
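As a rough illustration of key-value pairs as the unit of versioning, here is a small sketch. It is not DVID's actual store interface: the `VersionedKV` class, its method names, and the key format are hypothetical, and it models a single-parent version tree rather than a full DAG with merges. Each version stores only what it changed, and reads fall back to ancestors.

```python
class VersionedKV:
    """Toy versioned key-value store: each version holds only its own changes."""

    _DELETED = object()  # tombstone marking a key removed in a version

    def __init__(self):
        self.parent = {}  # version id -> parent version id (None for the root)
        self.deltas = {}  # version id -> {key: value} written in that version

    def new_version(self, version, parent=None):
        if version in self.parent:
            raise ValueError(f"version {version!r} already exists")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[version] = parent
        self.deltas[version] = {}

    def put(self, version, key, value):
        self.deltas[version][key] = value

    def delete(self, version, key):
        self.deltas[version][key] = self._DELETED

    def get(self, version, key):
        # Walk from the requested version toward the root; the nearest write wins.
        v = version
        while v is not None:
            delta = self.deltas[v]
            if key in delta:
                value = delta[key]
                if value is self._DELETED:
                    raise KeyError(key)
                return value
            v = self.parent[v]
        raise KeyError(key)


store = VersionedKV()
store.new_version("base")
store.put("base", "block/0,0,0", b"raw voxels")
store.put("base", "block/0,0,1", b"more voxels")

# A collaborator branches from the base and changes a single block.
store.new_version("edit-1", parent="base")
store.put("edit-1", "block/0,0,1", b"corrected voxels")

print(store.get("edit-1", "block/0,0,0"))  # unchanged key, read from "base"
print(store.get("edit-1", "block/0,0,1"))  # overridden in "edit-1"
print(len(store.deltas["edit-1"]), "key stored for the child version")
```

In this model, shipping the base once and later exchanging only a child version's delta corresponds to the "mail a hard drive, then pull request" workflow described above.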

We've published some of our data online (http://emdata.janelia.org) and visitors can actually browse through the 3d images using a Google-developed web app, Neuroglancer. It's running on a relatively small VM so I imagine any significant HN traffic might crush it :/ We are still figuring out the best way to handle the public-facing side.

I think a lot of people are coming up with their own ideas about how to version scientific data, so maybe we should establish a meeting or workshop to discuss how some of these systems might interoperate? The RDA (https://rd-alliance.org/) has been trying to establish working groups and standards, although they weren't really looking at distributed versioning a few years ago. We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.

amirouche 5 years ago

> We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.
exactly my thought, do you know any working group that is working toward that goal?
- DocSavage 5 years ago
  
  If by working group you mean a cross-company collection of people, I don't know of any or I would've joined them :) I've been working toward that goal for the last 5 years, but primarily with an eye to our kinds of data problems in the Connectomics field. I've been meaning to look at RDA again but reluctant to start a working group myself.
  
  b_fiive 5 years ago
  
  Hey DocSavage! I'm one of these Qri folks, and I'd love to see that working group exist. I have a friend or two at the RDA. Maybe we should get an email going on the subject? Projects like these are bigger than any one company or tool :)
  
  amirouche 5 years ago
  
  I started an awesome list dubbed "awesome data distribution"; feedback is welcome at https://github.com/amirouche/awesome-data-distribution
  
  DocSavage 5 years ago
  
  Agreed. Will follow up on email through your Qri contact page.
  
  b_fiive 5 years ago
  
  delightful. thanks!
  
  benhamner 5 years ago
  
  Any way we can help at Kaggle? Is https://www.kaggle.com/datasets helpful for your work in connectomics?
- brynb 5 years ago
  
  We’re building something along these lines at Axon (http://axon.science). Sign up for our beta if you’re interested in checking it out, and we should be able to get you set up in the next few days (we’re just starting to roll things out to the public this week).
  The basic idea is distributed version control, like git, but over p2p swarms rather than clusters around “central” repositories. We have special handling for large datasets (but still using git) to improve transfer efficiency and diffing.
  There’s a UI layer for collaboration (discussion, PRs, review) that supports deep linking to and embedding of files at specific commits, which sounds a bit like what you’re looking for.
  Feedback is very much appreciated!
  
  DocSavage 5 years ago
  
  That looks very interesting, particularly the UI layer for collaboration. Your website says it supports “massive data sets” but I would spell out what you mean since data for different fields vary by several orders of magnitude. (Massive for me starts at TBs and goes to petabytes.)
  One of the issues for me is file-based versioning, which then requires the means to parse the format. A number of ventures and organizations (e.g., NeuroData without Borders) address versioning of the entire ecosystem necessary to correctly use the underlying data files, so not sure if that’s an explicit part of your ecosystem. Most importantly, is your stack going to be open source?
- benhamner 5 years ago
  
  We're working on that through Kaggle Datasets https://www.kaggle.com/datasets
  We support data versioning, interactive web previews, seamless loading into hosted Jupyter notebooks (Kaggle Kernels), seeing/sharing analytic results built on the data version, and adding direct collaborators right now.
  We don't support a data-oriented version of an "issue" or a "pull request" quite yet, but these needs are definitely on our radar.
- mbreese 5 years ago
  
  It's probably too late for this year, but ISMB is one of the traditional locations for such a working group in the biological sciences. It might be interesting for the meeting next year though. If anyone is interested in putting together a proposal, let me know. I'd be happy to help.
- j88439h84 5 years ago
  
  http://Dvc.org does this
  
  DocSavage 5 years ago
  
  What are the differences between Dvc and Pachyderm.io, which I should have mentioned earlier?
  
  dmpetrov 5 years ago
  
  "From a very high level perspective - Pachyderm is a data engineering tool designed with ML in mind, DVC.org is a tool to organize and version control an ML project. Probably, one way to think would be Spark/Hadoop or Airflow vs Git/Github." from https://news.ycombinator.com/item?id=19130499
- smarx007 5 years ago
  
  Zenodo?
ktpsns 5 years ago

> scientific data primarily at a larger scale (100 GB to petabytes)
Buying hard discs (100TB for a few 10kEUR a few years ago) is a real investment in our institute. As far as I understand, with distributed storage each participant volunteers to share their disc to store their own (and others') data. Here's the devil's advocate: Why should I share my expensively bought disc space with you?
- DocSavage 5 years ago
  
  Some institutions won't pay for others. In our space, big non-profit science institutions like Janelia and Allen Brain foot the bill for making the data available. Depending on the utility of the data, Amazon (https://aws.amazon.com/opendata/public-datasets/) or Google (https://cloud.google.com/public-datasets/) could also handle the cost of storing and distributing the data.
  With versioned data, you could leverage the largesse of the big institutions to provide the base data, and then only the deltas for the children versions need be handled by users making changes.

guywhocodes 5 years ago

What are the benefits of using qri over ipfs? At a glance it seems very similar, just narrower.

b_fiive 5 years ago

Imagine git were built on top of IPFS, and aimed specifically at datasets. Qri uses IPFS to store & move data, so all versions are just normal IPFS hashes. eg this: https://app.qri.io/b5/world_bank_population is just referencing this IPFS hash: https://ipfs.io/ipfs/QmXwh5kNGsNAysRx66jcMiw1grtFf9j7zLFGbK9...
full disclosure: I work at Qri
- guywhocodes 5 years ago
  
  Ah, that's excellent. Thanks for your time
- sjapkee 5 years ago
  
  > and aimed specifically at datasets
  What are the benefits of that? What did git not do well enough?
ekianjo 5 years ago

In IPFS you can't search from within the protocol as far as I understand. Qri focuses on datasets and provides a search layer directly from its tools.
teawrecks 5 years ago

IPFS is listed as a dependency

mewwts 5 years ago

I love how the distributed web is seemingly built more and more in golang these days.

- https://github.com/ethereum/go-ethereum

- https://github.com/ipfs/go-ipfs

- https://github.com/textileio/go-textile

- https://github.com/lightningnetwork/lnd

to name a few other projects.

rolleiflex 5 years ago

Mine is also (Aether - https://getaether.net). I’ve also gotten comments reflecting on this same thing. I love Go. It is boring: it makes sure that I focus on doing interesting things, not on writing interesting code.
- b_fiive 5 years ago
  
  aether is the coolest
Protostome 5 years ago

Why do you love that it's Go in particular? (Seriously asking, out of curiosity: why Go over all other languages, e.g. Rust and such?)
- sheeshkebab 5 years ago
  
  Go is simpler than most other high performance languages - easy to read and understand unfamiliar codebases. It helps that go compiles to native binaries for various platforms and runs with no or minimal dependencies.
- yahyaheee 5 years ago
  
  I would mostly attribute this to go’s compositional and prescriptive nature. Go sort of pushes you toward building highly reusable pieces that can be combined to create a system. It does that in a way that’s incredibly easy to grok, which allows developer communities to more easily grow around products.
  
  Ericson2314 5 years ago
  
  Prescriptive? Maybe. Compositional? Absolutely not.
  I blame Go for these things not being done sooner.
  
  yahyaheee 5 years ago
  
  Compositional meaning composed of core elements that are combined to create something else.
  
  Ericson2314 5 years ago
  
  I know what it means, I'm saying Go doesn't have it (relative to other languages).
  
  yahyaheee 5 years ago
  
  I would say that Go is very compositional in a simple manner that makes it easy to grok, and hence the tools end up being highly reusable. Not all languages push you toward decomposition, but I would argue it's the most important trait of a language and its community. But you know how programming language discussions go =P
  
  gameswithgo 5 years ago
  
  Well that’s everything
- maccio92 5 years ago
  
  Fanboyism
  
  stingraycharles 5 years ago
  
  On a more serious note, I do think group identity (described as “tribes” in popular media) probably explains it.
  A large project using their language of choice (Go in this instance) gives external validation that their tribe is growing, and thus that they made the correct choice in joining it.
  
  res0nat0r 5 years ago
  
  Cross compilation and static binaries.
sjapkee 5 years ago

It only means that all this will soon die. Ruby of 2017-2019.