nemothekid 6 years ago

>Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available

In my experience, yes, network partitions are incredibly rare. However, 99% of my distributed system partitions have had little to do with the network. When running databases in a cloud environment, partitions can occur for a variety of reasons that don’t actually involve the network link between databases:

1. The host database is written in a GC’d language and experiences a pathological GC pause.

2. The virtual machine is migrated and experiences a pathological pause.

3. Google migrates your machine with local SSDs, fucks up that process and you lose all your data on that machine (you do have backups right?)

4. AWS retires your instance and you need to reboot your VM.

You may never see these issues if you are running a 3- or 5-node cluster. I began seeing issues like this semi-regularly once the cluster grew to 30-40 machines (Cassandra). Now, I will agree that none of these issues took down a majority, but if your R=3, it really only takes an unlucky partition to fuck up an entire shard.

  • Quekid5 6 years ago

    I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

    If you can afford 3-5 machines/VMs for a cluster you can almost certainly afford a single machine/VM with 2-4x the resources/CPU and chances are that it'll perform just as well (or better) because it doesn't have network latency to contend with.

    Of course if you're around N >= 7 or N >= 9, then you should perhaps start considering "distributed X".

    As long as your application is mostly built on very few assumptions about consistency it's usually pretty painless to actually go distributed.

    Of course, there are legitimate cases where you want to go distributed even with N=3 or N=5, but IME they're very few and far between... but if your 3/5 machines are co-located then it's really likely that the "partitioning" problem is one of the scenarios where they actually all go down simultaneously or the availability goes to 0 because you can't actually reach any of the machines (which are likely to be on the 'same' network).

    • toast0 6 years ago

      I think part of this is that most of the common knowledge about scaling is hard-won knowledge from the 90s/2000s era. eBay got bigger and bigger Sun boxes to run Oracle, until they couldn't get anything bigger -- then they had a problem and had to shard their listings into categories, etc. In the last few Intel CPU generations, computation performance has seen only small gains, but addressable memory has doubled about every other release. You can now get machines with 2TB of RAM, and it's not that expensive.

      You can fit a lot of data in a 2TB in-memory database.

      • lllr_finger 6 years ago

        My experience is that scaling up can lead to owning pets instead of cattle. It's never fun having a 1TB ATS instance that's acting funky but you're terrified of bouncing it. That's more a devil's advocate anecdote than an argument against scaling up.

        • toast0 6 years ago

          I think you can (and clearly should) be ready for your big boxes to go down. And it is much more painful to bounce a large box than a small box. But that doesn't mean having, say, 4x [1] big boxes is strictly worse than having a whole bunch of smaller boxes.

          [1] One pair in two locations, basically the minimum amount, unless you're really latency-insensitive, in which case you could have one in each of three locations.

      • chubot 6 years ago

        Yup, Google clusters also originally had machines with 1 or 2 CPUs! SMP on Linux was a new thing!

        Nowadays you easily have 32 cores on a machine, and each core is significantly faster than it was back then (probably at least 10x). That is a compute cluster by the definition of 1999.

        So for certain "big data" tasks (really "medium data" but not everyone knows the difference), I just use a single machine and shell scripts / xargs -P instead of messing with cluster configs.

        You can crunch a lot of data on those machines as long as you use all the cores. 32x is the difference between a day and less than an hour, so we should all make sure we are using it!

        Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default" (due to GILs / event loop concurrency). You have to do some extra work, and not everyone does.
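
        For what it's worth, the "extra work" in Python is not much. A minimal sketch with the stdlib (the input files and the crunch function are made up, stand-ins for whatever the real job is):

          # CPU-bound work won't spread across cores with GIL-bound threads,
          # so fan out across processes instead, one worker per core.
          from multiprocessing import Pool, cpu_count

          def crunch(path):
              with open(path) as f:
                  return sum(1 for line in f if "ERROR" in line)

          if __name__ == "__main__":
              files = ["part-%04d.log" % i for i in range(256)]  # hypothetical inputs
              with Pool(cpu_count()) as pool:
                  totals = pool.map(crunch, files)
              print(sum(totals))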

        • guelo 6 years ago

          > Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default"

          If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core. And that makes the transition to distributed servers simpler anyway.

          • cdcfa78156ae5 6 years ago

            > If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core.

            That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.

            The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.

            • mercer 6 years ago

              Am I correct in assuming Elixir/Erlang does a much better job at this compared to Node/Python/etc., putting aside (what I understand to be) the rather big problem of their relative weakness for computation?

              • toast0 6 years ago

                Erlang can be a good fit -- the concurrency primitives allow for execution on multiple cores, and the in-memory database (ets) scales pretty well. Large (approaching 1TB per node) mnesia databases require a good bit of operational know-how, and a willingness to patch things, so try to work up to it. Mnesia is Erlang's optionally persistent, optionally distributed database layer that's included in the OTP distribution. It's got features for consensus if you want that, but I've almost always run it in 'dirty' mode, and used other layers to arrange for all the application-level writes for a key to be sent to a single process which then writes to mnesia -- this establishes an ordering on the updates and (mostly) eliminates the need for consensus.
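
                In non-Erlang terms, the single-writer-per-key trick looks roughly like this (a toy Python sketch, not what I actually ran; the real thing was Erlang processes in front of mnesia):

                  import queue, threading

                  N_WORKERS = 8
                  queues = [queue.Queue() for _ in range(N_WORKERS)]
                  store = {}  # stand-in for the real storage layer

                  def worker(q):
                      while True:
                          key, value = q.get()
                          store[key] = value  # only this worker ever writes this key

                  for q in queues:
                      threading.Thread(target=worker, args=(q,), daemon=True).start()

                  def write(key, value):
                      # all writes for a key land on the same worker, so they are
                      # applied in a single well-defined order -- no consensus needed
                      queues[hash(key) % N_WORKERS].put((key, value))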

              • e12e 6 years ago

                I believe that with the combination of native functions ("NIFs") in Rust and some work on the NIF interface (to avoid making it so easy to take down the whole BEAM VM on errors), you can get more of the best of both worlds today than you used to. As you say, Erlang itself is rather slow when it comes to raw compute.

                • mercer 6 years ago

                  Thankfully it's not an issue for me. Elixir/Erlang is pretty much perfect for most of my use-cases :). But I foresee a few projects where NIFs or perhaps using Elixir to 'orchestrate' NumPy stuff might be useful. Most of my work would remain on the Elixir side though.

              • cdcfa78156ae5 6 years ago

                Yes. Erlang is built around distributed message sending and serialization. Python does not have any such things; even some libraries like Celery punt on it by having you configure different messaging mechanisms ("backends" like RabbitMQ, Redis, etc.) and different serialization mechanisms (because built-in pickle sucks for different uses in different ways). Node.js does not come with distributed message sending and serialization either.

            • cbsmith 6 years ago

              > With Python, in practice most systems end up based on Celery, with huge long-running tasks.

              Oh dear... Yeah, that's a terrible distributed system. Interestingly, none of the distributed systems I've worked on with Python have had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.

              There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.

              • cdcfa78156ae5 6 years ago

                > There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.

                Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.

                • cbsmith 6 years ago

                  > Cassandra and Redis just mean that you have a database-backed application.

                  Oh, there's a lot more to it than that. CRDTs, for example.

              • pmart123 6 years ago

                Well, Celery uses RabbitMQ and typically Redis underneath: RabbitMQ to pass messages, and Redis to store results.

                You can scale up web servers to handle more requests, which then uses Celery to offload jobs to different clusters.

                • cbsmith 6 years ago

                  Yeah, but fundamentally Celery is a task queue. You don't build a distributed system around that.

            • lllr_finger 6 years ago

              I think the intention was that if you're gently coerced into working with a single thread, like with node, then you're also coerced into writing your code in a way that's independent from other threads. In theory, it's easier reasoning about doing parallel work when you start from this point - I've certainly noticed this effect before.

              I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.

          • icedchai 6 years ago

            Or you could use a language that supports threads, like Go, Java, C, C++, C#... This gives you much more flexibility.

      • Quekid5 6 years ago

        That's a really good observation and I agree. I sometimes have to pinch myself to make sure I'm not dreaming when I see that there are actually pretty affordable machines with these insane amounts of RAM. (I think the first machine I personally[0] owned had 512K or something like that.)

        [0] Well, my dad bought it, but y'know. He wasn't particularly interested, but I think he recognized an interest in computers in me :).

        • 300bps 6 years ago

          My first computer had 38911 BASIC BYTES FREE.

          • dvirsky 6 years ago

            Luxury! My first computer (Sinclair ZX Spectrum) had 48KB RAM total, and IIRC 8KB of BASIC RAM available.

            • placebo 6 years ago

              Guess I win this round, unless there are older farts than me - a Commodore VIC-20 with 3.5KB of RAM, which was plenty for me to do really cool things with in BASIC and 6502 machine language.

              • tssva 6 years ago

                My first computer was a TRS-80 Model 1 with Level 1 BASIC. 4KB of RAM and 4KB of ROM.

                It predated the VIC-20 by just under 3 years. You might have had 3.5KB free for BASIC programs but the VIC-20 had a comparatively spacious 5KB of RAM and 20KB of ROM.

                • placebo 6 years ago

                  I did start earlier than the VIC-20 (with TRS-80 Model 2 at school) but you definitely take the title :)

      • qaq 6 years ago

        You can get commodity x86 server with 12TB RAM and 224 cores.

        • toast0 6 years ago

          It looks like 12TB ram is available in eight socket Intel servers? Those are sort of commodity, but availability is limited, and NUMA becomes a much larger issue than in the more easily available dual socket configurations. Looks like Epyc can do 4TB in a dual socket configuration.

        • e12e 6 years ago

          Anyone have experience with mainframes when it comes to this level of cost/performance? I suppose 500k USD is still too "cheap" to consider more specialized hw/systems - but where's the cut-off where it makes sense to get a monster running DB2 and your code in a Linux VM, or something like that?

        • paulddraper 6 years ago

          Link?

          It depends what is meant by "commodity" I guess.

          The largest server AWS offers is only 64 physical cores (128 logical) and less than 4 TB RAM.

    • Serow225 6 years ago

      My experience is that it's unfortunately really hard to convince people who don't deeply understand distributed systems (but think they do) that just because a system is called 'HA', or can have an 'HA' mode turned on, turning it on still has downsides. They freak out if you try to propose _not_ running the HA mode, because they don't (or aren't willing to) understand the potential downsides of dealing with split-brain, bad recovery, etc. They just see it as an "enable checkbox and forget it" kind of thing that should always be turned on. This is a big struggle of mine; any suggestions would be appreciated...

      • how_gauche 6 years ago

        A rational way of explaining it is -- what is my service level objective? If my SLO is for two nines of uptime, then I can be down for 3.65 days a year, and I can get away with single-homing something, or going with a simple hot-failover replicated setup. I just need to be pretty confident that I can fix it if it breaks within a few hours.
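
        The downtime budgets are worth computing explicitly; a quick back-of-the-envelope in Python:

          for nines in (2, 3, 4):
              availability = 1 - 10 ** -nines
              allowed_hours_down = (1 - availability) * 365 * 24
              print(f"{availability:.2%} -> {allowed_hours_down:.1f} hours/year")
          # 99.00% -> 87.6 hours/year   (the ~3.65 days above)
          # 99.90% -> 8.8 hours/year
          # 99.99% -> 0.9 hours/year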

        If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level objective turns into a service level agreement, because then it usually costs you money when you mess up.

        Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.

        It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!

        • bostik 6 years ago

          > what is my service level objective?

          There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)

          If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.

          Let's say your main system that accepts writes is 10 minutes down in a month. That's easily good for >99.9% uptime, but if a failure + PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.

          So when setting SLOs, they should be set according to business needs. I may be a heretic saying this but not all downtime is equal.

          • wting 6 years ago

            Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?

            • jacques_chester 6 years ago

              This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger the probability of total outage, which would be measured in time, becomes a smaller fraction of the probability of partial outage.

              Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.
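
              A toy contrast with made-up numbers: a month where one shard quietly serves errors for a while but the service as a whole never goes fully down:

                requests, failed = 1_000_000, 20_000
                request_sli = 1 - failed / requests           # 0.98 -> 2% of user requests failed
                full_outage_min, total_min = 0, 30 * 24 * 60
                time_sli = 1 - full_outage_min / total_min    # 1.0 -> "100% uptime" by the clock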

            • bostik 6 years ago

              I wish it were that easy - our teams have their targets for p99 and p995 ratios, but they cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

                * maximum number of users affected
                * maximum time of unavailability
                * maximum observed latency 
                * highest ratio of failed requests over a sequence of relatively tight measurement windows
              
              Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users cannot place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.

              So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.

        • toast0 6 years ago

          > If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime).

          Three nines is 8 hours and 45 minutes a year. In my experience with quality hardware, at a facility with onsite spares, that gives you room for two to three hardware faults, assuming your software and network are perfect, which I think is a fair assumption :)

          Definitely the hardest problem when you switch to a distributed database is that you have to contemplate failure. When your SPOF database is down, it's easy to handle: nothing works. When your distributed database partially fails, chances are you have a weird UX problem.

        • Serow225 6 years ago

          Thank you very much, especially that first paragraph. I think you've given me some good discussion tools, and I really appreciate that.

      • eaguyhn 6 years ago

        It's a struggle for me, too. People sometimes rest on their assumptions even though new technologies and paradigms can invalidate old rules (I'm sure everyone has lots of examples).

        I found that constant communication and concrete scenarios help bridge the gap between how they think things work and how you think they work. It doesn't always work - a lot of the time people don't care to discuss things in detail.

        But sometimes it worked and the other person came to realize a new way of thinking. Sometimes the discussion made me realize I was the person using invalid assumptions :)

        • Serow225 6 years ago

          Thanks for your thoughts. And definitely yes re your last point! Like rubber duck debugging :)

    • matwood 6 years ago

      > I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

      I use this same reasoning to use an RDBMS until I find a reason it will not work. An ACID-compliant datastore makes so many things easier until you hit a very large scale.

    • lamontcg 6 years ago

      Also an "old fart", I would recommend postgresql first over any of these other databases. Solve the big data scaling problems when you actually have them. One database server with replication and failover is going to still solve 95-98% or more of the use cases on the web.

      • rtpg 6 years ago

        It unfortunately doesn’t solve the “tweak and update the database software in the middle of the day without requiring downtime” situation very well.

        We do rolling releases of software all the time but it’s pretty hard for us to do much optimisation of our DB setup without doing it in the middle of the night because of how all this stuff works.

        • lamontcg 6 years ago

          Where I worked in the past they had a rule to only hire DBAs that could manage schema changes and such without downtime. Proper planning up front can mitigate the need to take a lot of those outages.

          They still took a window a year for changes where there was no alternative. That still seems preferable to me to the kinds of ongoing problems that distributed, eventually consistent databases produce. You might have better availability numbers, but if customer service has to spend money to fix mistakes caused by data errors, I don't know that that is better for the business.

        • jazoom 6 years ago

          Exactly. And also things like rebooting the server, updating the database image, etc. This is why I use HA databases. I can do whatever I like and as long as most of them are still up, traffic continues to be served. If one node is having network issues, no problem. The database is still accessible. And so on.

        • e12e 6 years ago

          I would assume you have a staging system to test fixes/updates on, and that rolling out updates would take the form of updating a read slave, then doing a cutover?

          Or are you talking about other kinds of tweaks?

          • rtpg 6 years ago

            The issue we have is that the cutover is super hard to do in a zero-downtime way because we almost always have active connections and you can’t have multiple primaries. There’s tricks like read connections on the replicas but it’s still really hard to coordinate.

            Given that Postgres has built-in synchronous replication, I feel like it should also have some support for multiple primaries (during a cutover window) to allow for better HA.

    • joemag 6 years ago

      You are totally correct that most applications do not require more than a single host to handle all the load, especially given the hosts that are easily available today. However, the far more common reason to go distributed is redundancy.

      This is especially true for storage systems, which are the topic of this post, and are by far some of the most complex distributed systems. Losing a node can mean losing data. Losing a large, vertically scaled node can mean losing a lot of data.

      You can mitigate those risks with backups, and the cloud APIs of today give customers the ability to spin up a replacement host in seconds. But then the focus shifts to what will achieve higher availability: a complex distributed system, but one that always runs in an active/active mode, or a simpler single-node system, but one that relies on an infrequently exercised and risky path during failures.

      • ultraluminous 6 years ago

        There's a substantial difference between High Availability (HA) clustering and Load Sharing (LS) clustering. In an HA cluster a production database can run on a single active node while replicating to a separate standby node. Network partitions will cause a lag in replication but will not present an external consistency issue as there's still only one active node at any time. When the partition is resolved the replication catches up and all is right with the world. Which is to say, you can scale vertically while having both reliability and availability.

        Hard "distributed systems" problems in databases generally only creep up when you're trying to deploy a multi-active system.

        • joemag 6 years ago

          Agreed that such setups can be simpler than multi-master replication. Yet there's still a hard distributed systems challenge even in an active/passive setup like the one you describe. When a failure occurs, the clients need to agree on which host is the master and which is the slave. And in the presence of network partitions, it can be impossible to say whether the old master is dead, or merely partitioned away and hence still happily taking writes.

          The most common way in which such systems failover is by fencing off the old master prior to failover. Which in itself can be a hard distributed problem.

          • ultraluminous 6 years ago

            You can also just configure the client to point to a single db node and have an applicative, monitoring-based update to the db connection string upon failover. Then you're not actually dependent on your client load-balancing logic. It's very robust and, if you follow CQRS, also highly available.

            • joemag 6 years ago

              Except propagating a configuration update to multiple clients without a split brain is itself a hard distributed systems problem :). Especially in the presence of those pesky network partitions.

    • oconnor663 6 years ago

      Does running on 3-5 machines give you the advantage of being able to apply kernel patches and restart machines without bringing down the whole system? That might matter to people with paying customers, even if in theory they could've replaced their 3 machines with one bigger one.

      • Quekid5 6 years ago

        It may, but if your customer can't tolerate/afford 5-10 minutes of downtime you're already into "big league" territory. I could see a case if you're rolling out gradually, but with only 3-5 machines you're not really going to notice any subtle problems in a 1-vs-2 or 2-vs-3 situation.

      • mmt 6 years ago

        I'm not sure anyone would ever actually have only one machine, without a failover option, and failing over had better not bring down the whole system.

        • yjftsjthsd-h 6 years ago

          But now we're back to a distributed system

          • mmt 6 years ago

            I don't think that's true, in the usual sense of the term.

            When a replica is used as a read slave, that introduces problems with (strong) consistency, and that definitely starts to resemble any other distributed system.

            Adding automated failover adds the need for some kind of consensus algorithm, another feature of distributed systems.

            However, if failover is manual, for example, then 2 nodes can be nearly indistinguishable from 1 node, if no failover occurs.

            It also strains the definition of "distributed" if both nodes are adjacent, possibly even with direct network and/or storage connections to each other.

    • gwbas1c 6 years ago

      I wouldn't call you an old fart... It's just that you understand that massively distributed databases are usually premature optimization.

    • mjevans 6 years ago

      Past a given point it's time to re-design and optimize for today rather than for what was quick and easy with a small user base. Unless you know from the start that such a huge user base is the target, Quekid5's approach is the proper one.

      Past that easy single / very small cluster case, it's time to start asking more important design questions. As an example: which part(s) NEED to be ACID-compliant, and which can have eventual consistency (or do they even need that, as long as the data was valid 'at some point'? So just 'atomic'.)

    • sneak 6 years ago

      This is good and sound advice. The logical outgrowth of this is that the vast majority of organizations don’t need to use distributed databases. (Those that do should probably opt for hosted ones first, and then maybe consider running their own if they have the explicit need, and the substantial SRE budget for it.)

    • derefr 6 years ago

      The one resource you won't get any more of from any cloud provider by scaling vertically is the network link. (With most cloud providers, you've usually already got the fastest link you'll get for an instance on their cheapest plan!)

      Given the same provider, ten $50 instances can usually handle a much higher traffic load (in terms of sheer bulk of packets) than a single $500 instance can.

      Alternatively, you can switch over to using a mainframe architecture, where your $500 instance will actually have 10 IO-accelerated network cards, and so will be able to make effective use of them all without saturating its however-many-core CPU's DMA channels.

    • AtlasBarfed 6 years ago

      Then you have a huge single point of failure. AP distributed systems assume you will lose nodes and force you to deal with the reality of them happening.

      So you are either accepting the downtime or planting your head in the ground in denial.

  • gregwebs 6 years ago

    Losing a majority due to various faults is not equivalent to a network partition. In practice for the systems you are working with it may have a similar effect (or a much worse one!), but in theory it is possible to recover from these faults in a short amount of time as long as there are no network disruptions.

    • adrianmonk 6 years ago

      In particular, if a node is completely down, that's likely a lot better than a network partition because it eliminates the possibility that it's going to be accepting read and write requests that could have stale data or create inconsistent data (respectively).

  • ADefenestrator 6 years ago

    5. Your network isn't partitioned exactly, but someone bumped an Ethernet cable and the packet loss has reduced the goodput on that link to a level too low to sustain the throughput you need. With most congestion control algorithms (basically, not-BBR), at 10Gb even 1% packet loss is devastating (rough numbers in the sketch below).

    6. Well, basically the hundred other reasons the machine could brown-out enough that things start timing out even though it's sporadically online. Bad drive, rogue process, loose heatsink, etc.

    Dead hosts are easy. Half-dead hosts suck.
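
    To put a rough number on the 1% case in point 5: the classic Mathis et al. estimate for loss-limited TCP throughput is MSS/(RTT*sqrt(p)), give or take a constant factor:

      from math import sqrt

      mss = 1460 * 8   # bits per segment
      rtt = 0.001      # 1 ms round trip inside a datacenter
      p = 0.01         # 1% packet loss
      print(mss / (rtt * sqrt(p)) / 1e6)  # ~117 Mbit/s, nowhere near 10 Gbit/s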

  • uluyol 6 years ago

    These are partial node failures, not network partitions.

CydeWeys 6 years ago

It's definitely true that putting the burden of consistency on developers (instead of on the DB) results in a lot more tricky work for developers. On my project, which started six years ago, we use Cloud Datastore, because Cloud Spanner hadn't come out yet. It results in complicated, painful code that would be completely unnecessary with stronger transactional guarantees. Some examples: https://github.com/google/nomulus/blob/master/java/google/re... https://github.com/google/nomulus/blob/master/java/google/re... https://github.com/google/nomulus/blob/master/java/google/re...

It's no surprise that we're currently running screaming to something with stronger transactional guarantees.

  • davidgay 6 years ago

    It's worth noting that Cloud Datastore's follow-on, Cloud Firestore, does provide strong consistency, and includes a "Datastore mode" that supports the Datastore API.

    Firestore is currently in beta, but once it's GA we will be migrating all Datastore users to Firestore: https://cloud.google.com/datastore/docs/upgrade-to-firestore

    Disclaimer: I work on Cloud Datastore/Firestore.

    • stingraycharles 6 years ago

      I always got the impression that Firestore is aimed more at mobile apps than at backend applications, and as such is more or less a different kind of product. Is this not the case?

      • davidgay 6 years ago

        Datastore was initially designed for use from App Engine, i.e., an easy start, no management, automatic scaling environment ("serverless" to use the current in-vogue term).

        I would view the Firestore API as a further extension (the "Datastore Mode" functionality was always an element of the design) of that paradigm, extending to the case where you have no trusted piece of code to mediate requests to the database, thus allowing direct use from, e.g., mobile apps (at which point other issues such as disconnected operation surface).

        So not so much a "different kind of product" and more a product that supports a strict superset of use cases.

      • thesandlord 6 years ago

        While Firestore has an Android/iOS/Web SDK, it also has great backend support (Python, Java, Node, Go, Ruby, C#, PHP). The "realtime" features of Firestore are better suited for mobile IMO, but using Firestore as a scalable, consistent, document/nosql database for your backend is definitely a good use for it.

        I actually think most of the server SDKs don't even expose many of the realtime APIs. Maybe they will in the future, but it shows that you can use Firestore like a normal database just fine.

        (I work for GCP)

  • wikibob 6 years ago

    I’m guessing that Nomulus is your project?

    I ran across it before and just wanted to say it’s really cool that this is open sourced.

  • erikb 6 years ago

    Why does anybody need to provide consistency? Often you don't have complete consistency anyways. There are bugs, there are time delays, there is parallel processing. Why even have the requirement?

    • CydeWeys 6 years ago

      Let me expand on an example issue that comes up.

      Nomulus is software that runs a domain name registry, including most notably the .app TLD. There are three fundamental objects at play here; the domains themselves, contacts (ownership information that goes into WHOIS), and hosts (nameserver information that feeds into DNS). There's a many-to-many relationship here, in that contacts and hosts can be reused over an arbitrarily large number of domains.

      The problem is that you can't perform transactions over an arbitrarily large number of different objects in Cloud Datastore; you're limited to enlisting a maximum of 25 entity groups. This means that you can't perform operations that are strongly consistent when contacts or hosts are reused too often. This situation comes up a lot; registrars tend to reuse the same nameservers across many domains, as well as the contacts used for privacy/proxy services.

      These problems don't arise in a relational SQL database, because you can simply JOIN the relevant tables together (provided you have the correct indexes set up) and then perform your operations in a strongly consistent manner. That trades off scalability for consistency though, whereas in Spanner you give up neither.
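
      To make that concrete, the consistent "delete only if unused" check is roughly the following against a relational store (a hedged sketch with a made-up hosts/domain_hosts schema, not the actual Nomulus code):

        import psycopg2

        conn = psycopg2.connect("dbname=registry")  # hypothetical database
        with conn, conn.cursor() as cur:
            # The existence check and the delete run in one transaction; with a
            # foreign key from domain_hosts to hosts, a concurrent domain update
            # can't sneak in a new reference between the check and the delete.
            cur.execute(
                """
                DELETE FROM hosts h
                WHERE h.host_id = %s
                  AND NOT EXISTS (
                      SELECT 1 FROM domain_hosts dh WHERE dh.host_id = h.host_id
                  )
                """,
                ("ns1.example.net",),
            )
            deleted = cur.rowcount  # 0 means the host is still referenced (or missing)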

      • erikb 6 years ago

        I don't see how this relates to the question of whether consistency is needed. If two tables only have 80% of all the rows that you expect, you can STILL do a join on them. It's just that the join, just like your original data, doesn't contain all the data. The join itself will not raise an exception because of that.

        • CydeWeys 6 years ago

          Strong consistency is required because you cannot delete/rename a host or contact if it is in use at all. It's not good enough to say that you can go ahead with the operation because it's not used by at least 80% of domains; you need to know that it's not in use by any domain.

    • zzzcpan 6 years ago

      Well, it's pretty hard to even find applications that require strong global consistency and are willing to sacrifice latency for it. Typically apps don't need much consistency at all and can sacrifice some data instead, as with most RDBMS setups in the wild. Beyond that, SEC (strong eventual consistency) covers pretty much all the consistency needs there are.

      • erikb 6 years ago

        > it's pretty hard to even find applications that require strong global consistency

        Exactly my point, only a little generalized. The world isn't consistent, so why always put unrealistic constraints on ourselves? If we move from "I really need to have all the datasets in their truest form" to "well, there is some data coming, better than nothing", the whole system might be even more reliable. Things in the middle won't die just because the world is imperfect.

tytso 6 years ago

The post talks about Spanner using "a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers." But what it fails to mention is that this time is distributed using a software solution --- NTP, in fact.

Google makes its "leap-smeared NTP network" available via Google's Public NTP service. And it's not expensive for someone to buy their own GPS clock and run their own leap-smeared NTP service.

Yes, it means that someone who installs a NewSQL database will have to set up their own time infrastructure. But that's not hard! There are lots of things about hardware setup which are at a similar level of trickiness, such as a UPS with notification so that servers can shut down gracefully when the batteries are exhausted, locating your servers in multiple on-prem data centers so that if a flood takes out one data center you have continuity of service elsewhere, etc., etc.

Or of course, you can pay a cloud infrastructure provider (which Google happens to provide, but Amazon and Azure also provide similar services) to take care of all of these details for you. Heck, if you use Google Cloud Platform, you can use the original Spanner service (accept no substitutes :-)

  • adrianmonk 6 years ago

    It makes me wonder if, as an industry, we shouldn't just include hardware clocks in our basic expectations for a data center.

    That is, in order to be considered a satisfactory, non-bare-bones data center, you'd need to supply on-site, redundant, maintained, and monitored hardware clocks, verifiably meeting certain standards of accuracy and precision. Just like right now people expect backup power, cooling, network connectivity, physical security, access, etc.

    It doesn't seem like the burden would be that large for data center operators. The hardware costs don't sound large, and they already have people being paid to monitor and maintain things.

  • ryandrake 6 years ago

    > And it's not expensive for someone to buy their own GPS clock and run their own leap-smeared NTP service

    This is a fun weekend hobby project that anyone can do at home if they are curious. Get a GPS chip with a PPS output signal, a plain old “hockey puck” antenna, and a computer to hook it up to (I have an RPi in my attic) and you can deliver fairly accurate time to your home network. Not a setup I’d rely on for a commercial, globally distributed deployment, but accurate enough for home use, and probably better than your current time source.

    • deepsun 6 years ago

      The problem with production usage is not clock accuracy, but accuracy guarantees. If a clock is within a whopping 1ms, that's OK as long as I can prove it is definitely within that 1ms.

      Which, as far as I see, boils down to hardware+software reliability.

  • jakemoshenko 6 years ago

    I feel like the author is making a huge omission by not talking about the properties of multi-raft sharded DBs coupled with the TrueTime-like APIs from cloud providers in 2018. He links to a post from the CEO of CockroachDB written in early 2016, and Amazon launched their TimeSync service in 2017.

    It's likely that Spencer would have something different to say about the matter today.

    • Quekid5 6 years ago

      > I feel like the author is making a huge omission by not talking about the properties of multi-raft sharded DBs coupled with the TrueTime-like APIs from cloud providers in 2018.

      I mean, this is all fine and dandy if it works (plus, how do you know that it actually works in all the edge cases?), but that's a HUGE amount of complexity. IMO you really need to have a very good reason to even consider anything this complex.

gregwebs 6 years ago

In the post and another comment here it is stated that Calvin can handle any real-world workload. However, according to my reading of the Calvin paper, one must know the keys being updated before starting the transaction. I also experienced limitations when trying to use FaunaDB: it doesn't support ad-hoc queries and only allows indexed queries.

I really like the Calvin protocol, and it does seem perfectly suited to many application workloads, but it is odd to me to see Calvin presented as purely superior to all alternatives. It seems like more research and work needs to be done to create systems that can address its shortcomings around transactions that query for keys (3.2.1 Dependent transactions in the paper), including ad-hoc queries, and even interactive queries (within a transaction).

Disclaimer: I work on TiDB/TiKV. TiKV (distributed K/V store) uses the percolator model for transactions to ensure linearizability, but TiDB SQL (built on top) allows write skew.

  • freels 6 years ago

    Support for dependent reads is a subject that the Calvin paper only touches on briefly, describing one strategy of using "reconnaissance reads" to determine the key set.

    In FaunaDB, this is formalized as an optimistic concurrency control mechanism that combines snapshot reads at the coordinator and read-write conflict detection within the transaction engine. By the time a transaction is ready to commit the entire key set is known. I wrote a blog post that goes into more detail here: https://fauna.com/blog/acid-transactions-in-a-globally-distr...

    Better analytics support, including ad-hoc queries is on our roadmap. That being said, requiring indexes was a design choice: First and foremost, we want FaunaDB to be the best operational database. It is a lot easier to write fast, predictable queries if they cannot fall back to a table scan!

    While FaunaDB requires you to create indexes to support the views your workload requires, the reward is a clearer understanding of how queries perform. We've built indexes to be pretty low commitment, however. You can add or drop them on the fly based on how your workload evolves over time.

    • gregwebs 6 years ago

      Thanks for the link to the explanation. It seems like FaunaDB is doing the hard work to show how Calvin can actually be used for all types of workloads.

      • freels 6 years ago

        You're welcome! I've enjoyed following TiKV/TiDB's development as well. It is an interesting time for databases. :-)

jules 6 years ago

The CAP theorem has been truly disastrous for databases. The CAP theorem simply says that if you have a database on 2 servers and the connection between those servers goes down, then queries against one server don't see new updates from the other server, so you either have to give up consistency (serve stale data) or give up availability (one of the servers refuses to process further requests).

That's all that CAP is, but somehow half the industry has been convinced that slapping a fancy name on this concept is a justification for giving up consistency (even when the network and all servers are fully functional) to retain availability. The A for availability in CAP means that ALL database servers are fully available, which is unnecessary in practice, because clients can switch to the other servers. Giving up consistency introduces big engineering challenges. You're getting something that most people don't need in return for a large cost.

  • jshen 6 years ago

    "The A for availability in CAP means that ALL database servers are fully available"

    Is this true? I always thought it meant that clients could continue to read and write to "the database" which could include the client switching to another node. There is nothing in CAP theorem about latency, so switching, even if it adds high latency, is fine by CAP theorem.

    This lack of accounting for latency is what makes the CAP theorem less useful than a lot of people realize, IMO.

    • zzzcpan 6 years ago

      From an implementation point of view, networks are asynchronous and therefore are always partitioned. We don't actually have the luxury of arbitrary CAP interpretations; we can't know whether other nodes are available or not at any given moment. So instead we have to make requests to other nodes and always choose either to wait for responses (or for more complex communications to achieve consensus) and get C, or not wait for anything and get A, although each node can be a bit behind on updates from other nodes. Thus the CAP choices are pretty much about latency: waiting for globally visible updates vs. not waiting and getting low latency. Both can be mixed in various proportions to get consistency with good latency, but still reasonable tolerance of byzantine failures.

    • jules 6 years ago

      It is true: if you allow clients to switch then you can have all three C,A,P.

      • zbentley 6 years ago

        No, you can't. Say you have two nodes, and a client has just sent a COMMIT to node 1. Then Node 1 gets partitioned away from node 2 and the committing client vanishes. Giving clients the ability to switch to node 2 doesn't help you determine whether node 2 does or does not have the data that was committed (consistency), so you have to choose between CP and AP.

        • jules 6 years ago

          You can trivially redirect all clients to node 1 whether or not there is a partition, and voila, you have a CAP system under that definition. If clients can be partitioned from servers then we're back in the original scenario in which switching is not allowed: just partition each client from all but one server (but a different server for each client).

          Maybe there is a way to modify the CAP theorem so that it says something non-trivial (e.g. a theorem about the limit of how many node failures a Paxos-like algorithm can handle, but even this is probably trivial, namely half), but the CAP theorem as stated by the originators is trivial, and the proof is a dressed up version of what I stated above. I don't think the authors would disagree with this at all, since they explicitly state:

          "The basic idea of the proof is to assume that all messages between G1 and G2 are lost. If a write occurs in G1 and later a read occurs in G2, then the read operation cannot return the results of the earlier write operation."

          Isn't it interesting that such a paper has 1600 citations and has reshaped the database industry?

  • rwtxn 6 years ago

    > The A for availability in CAP means that ALL database servers are fully available, which is unnecessary in practice, because clients can switch to the other servers.

    This assumes that only the servers are partitioned from each other and clients are not partitioned from the majority quorum. This might be rare but it is not impossible at scale.

    There is also a latency cost of strict serializability or linearizability which is hard to mitigate at geo-replicated scale.

    • jules 6 years ago

      Read latency doesn't have to be impacted by linearizability. Write latency is higher, but we have to do a fair comparison. If a non-consistent database acknowledges a write, it doesn't have to have made that write visible, so comparing that latency to a linearizable DB write is apples to oranges. If we compare latency until the write is globally visible, the non-consistent DB can still do better, so it is definitely possible to imagine scenarios in which such a DB is better.

      However, for what percentage of use cases does a reduction in write latency weigh up against the disadvantages? I think that's a very small percentage. Heck, the vast majority of companies that are using a hip highly available NoSQL database cluster would probably be fine running a DB on a single server with a hot standby.

      You could imagine a system that gives you a bit of both. When you do a write, there are several points in time that you may care about, e.g. "the write has entered the system, and will eventually be visible to all clients" and "the write is visible to all clients". The database could communicate that information (asynchronously) to the client that's doing the write, so that this client can update the UI appropriately. When a client does a read, it could specify what level of guarantee it wants for the read: "give me the most recent data, regardless of whether the write has been committed" or "only give me data that is guaranteed to have been committed". Such a system could theoretically give you low latency or consistency on a case-by-case basis.

  • gowld 6 years ago

    You can't always "choose another server" while maintaining low latency operations.

  • zzzcpan 6 years ago

    > slapping a fancy name on this concept is a justification for giving up consistency (even when the network and all servers are fully functional) to retain availability

    This is a misconception. AP databases are not supposed to give up consistency, just not wait for all nodes to see updates. That's it. Consistency is still there, nodes resync, users always see their own updates and all that.

    • jules 6 years ago

      Consistency in the CAP sense has a precise meaning: all reads see all completed writes. That is, if the database has told some client that their write succeeded, then it is not allowed to still serve other clients the old data.

      Inconsistency in the CAP sense may also cause inconsistency in the sense that you mean, for instance if two transactions are simultaneously accepted, but each transaction violates the precondition of the other. In a consistent database one of the transactions will see the result of the other transaction and fail, whereas a DB without consistency may accept both transactions and end up in a semantically incorrect state.

    • he0001 6 years ago

      So then it’s CAP, not AP? I thought CAP was about not being able to have all three?

      • zzzcpan 6 years ago

        No, it's still AP; it's just that CAP consistency is a very specific thing, and you can't generalize it into "giving up consistency". AP systems don't give up consistency, they just don't explicitly wait for it.

        • he0001 6 years ago

          Then you are saying that all the trouble Spanner goes through to actually be consistent is not needed? By saying an AP system can be “consistent”, there’s no need for a product like Spanner. Consistency is obviously a timing problem: something that is not consistent at some point in time is not consistent, since it will be inconsistent at that time.

        • almostdeadguy 6 years ago

          CAP consistency is linearizability, so if what you mean is there are other models of consistency available to "AP systems" (kind of hate the framing that these are binary features of a distributed system rather than a huge space of design choices, but anyways), then yes that's true. But AP systems don't always guarantee things like read-your-own-write consistency models, and not all conflicts that occur during a partition can be resolved in many of these databases.

      • glic3rinu 6 years ago

        The C in there is for "Strong Consistency"; CAP allows for lesser forms of consistency, like "Eventual Consistency", without giving up on Availability.

        • he0001 6 years ago

          Well yes, but that’s not what we are talking about here. And “eventual consistency” is not “consistency”. I would argue that an “eventually consistent” system is not consistent, since it can result in fake states along the way.

abadid 6 years ago

I'm the author of that post. I'm happy to respond to comments on the post on this thread for the next several hours. You can also leave comments on the post itself, and I will respond there at any time.

  • mehrdadn 6 years ago

    The problem isn't a lack of very accurate clocks per se, right? What you need is an accurate bound on clock error, whatever it might be. It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee. Doesn't that mean the blame goes to whoever put inaccurate bound parameters into the software? Why blame the algorithm for garbage input?

    • xyzzy_plugh 6 years ago

      > It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee.

      I've been complaining about this for years and it's so nice to see others echo the sentiment. Everyone chases timestamps but in reality they're harder to get right than most people are willing to acknowledge.

      Very few places get NTP right at large scale, or at least within the accuracies required for this class of consistency. I've never seen anyone seriously measure their SLO for clock drift, often because their observability stack is incapable of achieving the necessary resolutions. Most places hand-wave the issue entirely and just assume their clocks will be fine.

      The paper linked within TFA suggests a hybrid clock which is better but still carries some complications. I'll continue to recommend vector clocks despite their shortcomings.

    • CydeWeys 6 years ago

      The problem is that the bound on clock error directly affects your performance. So if you're willing to accept, say, a second in clock error, then all transactions will take a minimum of one second. That level of performance is going to be unacceptable in many situations.

      The potential clock error on VMs without dedicated time-keeping hardware is so large that performance turns into absolute garbage.
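
      That minimum comes from Spanner-style commit wait: a commit isn't acknowledged until its timestamp is definitely in the past on every clock. A simplified Python sketch of the idea, where EPSILON stands in for the assumed error bound:

        import time

        EPSILON = 1.0  # assumed clock-uncertainty bound, in seconds

        def now_interval():
            # real time is only known to lie somewhere in [earliest, latest]
            t = time.time()
            return t - EPSILON, t + EPSILON

        def commit(apply_write):
            apply_write()
            commit_ts = now_interval()[1]  # no clock within the bound is past this yet
            # commit wait: only acknowledge once commit_ts is definitely in the
            # past everywhere, which stalls for on the order of EPSILON
            while now_interval()[0] <= commit_ts:
                time.sleep(0.01)
            return commit_ts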

      • mehrdadn 6 years ago

        I would understand if the complaint was that Spanner is too slow without expensively accurate clocks and synchronization. But the complaint is that Spanner fails to guarantee consistency, which doesn't make sense to me. The requirements clearly include giving a valid clock bound, so if you give an invalid clock bound, it's clearly your fault for getting incorrect results, not Spanner's!

        • CydeWeys 6 years ago

          Spanner does guarantee consistency, thanks to its use of hardware atomic clocks and GPS. It's alternatives like CockroachDB, which don't have this dedicated hardware, that can fail to guarantee consistency if clocks get out of sync (a problem that can't happen in Spanner).

          Spanner is really fast and massively parallelizable.

          • mvijaykarthik 6 years ago

            We recently open-sourced https://github.com/rubrikinc/kronos for the exact same problem. Coincidentally I shared that on Show HN just today: https://news.ycombinator.com/item?id=18037609

            • CydeWeys 6 years ago

              Neat. The obvious question would be, how does kronos compare with ntpd? Do they work together, is kronos a replacement, do they solve different problems, etc.? Is it expected that all the servers in a cluster synced by kronos are already running ntpd, and that kronos provides an additional level of reduced skew on top of that?

              It'd be great if you could address this somewhere in the top level README on GitHub.

              • mvijaykarthik 6 years ago

                I've added a section in the README for this. It works in conjunction with NTPD and the time provided by this library has some extra guarantees like monotonicity, immunity to large clock jumps etc (more info in the README).

  • joemag 6 years ago

    Hi, you used DynamoDB as an example of a weakly consistent system in the opening paragraph, but it actually supports both modes [1]. The point of confusion might come from the fact that the service described in the 2007 Dynamo paper was an inspiration for DynamoDB rather than being DynamoDB itself.

    Disclaimer, I work for AWS, but not on DynamoDB team.

    [1] https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

    • abadid 6 years ago

      Thanks. I added a parenthetical remark to the post to indicate that I was talking about DynamoDB's default settings.

      • Twirrim 6 years ago

        It's not really the default settings, per se. You don't have to change any bit of configuration about your database to get consistency. The DynamoDB API gives you the GetItem API call and a boolean property to choose to make it a consistent read.

        It's left as a very simple task for developers leveraging DynamoDB to make the appropriate trade-offs between consistent and inconsistent reads.
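
        For example, with boto3 it's literally one flag on the read call (table and key names here are made up):

          import boto3

          table = boto3.resource("dynamodb").Table("example-table")

          # default: eventually consistent read
          stale_ok = table.get_item(Key={"pk": "user#123"})

          # strongly consistent read: same call, one extra boolean
          fresh = table.get_item(Key={"pk": "user#123"}, ConsistentRead=True)
          item = fresh.get("Item")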

        source: Used to work for AWS on a service that heavily leveraged DynamoDB. Not _once_ did we experience any problems with consistency or reliability, despite them and us going through numerous network partitions in that time. The only major issue came towards the end of my time there when DynamoDB had that complete service collapse for several hours.

        On the sheer scale that DynamoDB operates at, it's more likely to be a question of "How many did we automatically handle this week?" than "How often do we have to deal with network partitions?"

        • dsp1234 6 years ago

          From the GetItem docs[0]

          "GetItem provides an eventually consistent read by default."

          This seems to meet the definition of "DynamoDB's default settings"

          [0] - https://docs.aws.amazon.com/amazondynamodb/latest/APIReferen...

          • rch 6 years ago

            It's enough of a gray area to make DynamoDB a poor example in this context, since if I were to claim that it was eventually consistent without additional configuration, then an informed person might reasonably assume I didn't know what I was talking about.

            It would be better to state that both eventually consistent and fully consistent reads are available, and consistency can be enforced up front via configuration.

  • jgalentine007 6 years ago

    I recently came across CockroachDB and thought its capabilities interesting, almost too good to be true. I have also been looking at Citus Data, which shards and distributes transactions; are you aware of any consistency shortcomings with it?

    • cube2222 6 years ago

      CockroachDB's performance is dependent on time synchronization. If a node detects that its clock is too far out of sync with the rest of the cluster, it will commit suicide.

      However, before it detects it, there is a possibility of stale reads.

      https://www.cockroachlabs.com/docs/stable/recommended-produc...
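
      Roughly, the self-protection works like the sketch below (a simplified illustration, not CockroachDB's actual code; the 500ms bound is an assumed configuration value): each node measures its clock offset against its peers and terminates if the offset exceeds the allowed maximum.

        import statistics
        import sys

        MAX_OFFSET = 0.5  # assumed configured bound, in seconds

        def check_clock(offsets_to_peers):
            """offsets_to_peers: measured clock offsets (seconds) to other nodes."""
            if not offsets_to_peers:
                return
            offset = statistics.median(offsets_to_peers)
            if abs(offset) > MAX_OFFSET:
                # Better to crash than to keep serving potentially inconsistent reads.
                sys.exit("clock offset %.3fs exceeds maximum %.3fs" % (offset, MAX_OFFSET))

        check_clock([0.010, -0.020, 0.015])   # fine
        # check_clock([0.9, 0.8, 0.85])       # would terminate the process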

      • jgalentine007 6 years ago

        Thanks for pointing that out - I have yet to RTFM and dive deep. I wonder how frequently time sync problems occur in virtual environments after NTP syncing - I've seen pretty erratic behavior on virtual Active Directory domain controllers even after syncing with Hyper-V and VMware.

        • brazzledazzle 6 years ago

          It’s been a long time since I messed with domain controllers but I believe Microsoft used to have explicit guides for handling time on virtual DCs. At that time we kept around a couple of hardware DCs to be safe, but I do remember that having the VMware agent correct the time could produce some bad results. I think it was because it immediately fixed the time and didn’t slowly correct the drift, but it’s been a long time so my memory could be off.

        • amaccuish 6 years ago

          Time shouldn't be a massive issue for AD, no? It's a vector clock, not a UTC clock. The UTC clock is only used to resolve conflicts, no?

    • atombender 6 years ago

      My experience from test workloads on Cockroach is that single transaction performance is very bad compared to something like Postgres -- with 2.0 I was seeing easily 10x worse performance than Postgres with a three-node test cluster on Google Cloud. My impression is that it's worse for apps that have lots of small CRUD transactions on low numbers of rows, as is typical with web/mobile UIs.

      Aggregate cluster performance seemed very good, though; i.e., adding a bunch more concurrent transactions did not slow down the other transactions noticeably.

  • felixgallo 6 years ago

    why didn't you disclose your connections in your post? It appears you've written much the same article for FaunaDB's official blog in the past (https://fauna.com/blog/distributed-consistency-at-scale-span...).

    • abadid 6 years ago

      That post was actually quite different than this one. This one focuses on distributed vs. global consensus.

      But I added a note at the end that clearly documents my connection to FaunaDB. As for Calvin, the post itself clearly says that it came out of my research group.

    • vidarh 6 years ago

      The article specifically states that Calvin comes out of his research group and that FaunaDB was inspired by Calvin.

  • xtacy 6 years ago

    Nice writeup, but given the title, I was hoping you would also have a constructive test case that shows that these systems fail to meet their guarantees (like Jepsen did). :)

    I wonder whether any of the aforementioned systems (Calvin/Spanner/YugaByte) can opportunistically commit transactions, detect issues, and roll back + retry, all within the scope of the RPC, so they can still conform to the linearizability requirement?

  • plopz 6 years ago

    I've only ever worked on small projects, so I'm not at all familiar with these very high-scale distributed databases, but the post seems to indicate that Spanner is in a league of its own because it integrates hardware into the mix where the others are software-only. What are the differences in scale between the two categories mentioned?

    • abadid 6 years ago

      Yes, Spanner is quite unusual in the distributed database world in how hardware is a pretty important part of their solution. Other systems may claim important integrations with hardware, but for Spanner, the architecture really relies on particular hardware assumptions.

      To answer your question about scale: there is no real practical difference in scalability between the two categories discussed in the post. Partitioned consensus has better theoretical scalability, but I am not aware of any real-world workload that cannot be handled by global consensus with batching.

      • ryanobjc 6 years ago

        I think this "global consensus with batching does everything partitioned does" is a very much theory vs practice type of statement. As in, in theory there is no difference between theory and practice :-)

        I've seen those batched consensus systems, and honestly, you're kidding yourself if you think they can handle a million qps. The transmit time of the data on Ethernet alone would become an issue! Even with 40 gig, transmit time never becomes free. So now you're stuffing 1/10th of a million qps worth of data through a single set of machines (3, 5, 7, 9? Some relatively small number).

        Am I misunderstanding you? Hopefully I am!

  • peterwwillis 6 years ago

    "Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition."

    Many of the distributed clusters I've maintained had crap infrastructure and no change control, and parts of the clusters were constantly going down from lack of storage, CPU and RAM, or bad changes. The only reason the applications kept working were either (1) the not-broken vnodes continued operating as normal and only broken vnodes were temporarily unavailable, or (2) we shifted traffic to a working region and replication automatically caught up the bad cluster once it was fixed. Clients experienced increased error rates due primarily to these infrastructure problems, and very rarely from network partition.

    Does your consistent model take this into account, or do you really assume that network partition will be the only problem?

    • mannykannot 6 years ago

      It seems you have other problems (crap infrastructure and no change control) to deal with before the issues in this article become your biggest concern, but are not the cases you list themselves partition problems?

      • peterwwillis 6 years ago

        They cause partition, but their origin isn't the network. Nobody who runs a large system has perfectly behaving infrastructure. Infrastructure always works better in a lab than in the real world. Even if you imagine your infrastructure is rock-solid, people often make assumptions, like that their quota is infinite, or that their application will scale past the theoretical limits of individual network segments, I/O bounds, etc.

        The point is, resources cause problems, and the network is just one of many resources needed by the system. Other resources actually have more constraints on them than the network does. If a resource is constrained, it will impact availability in a highly-consistent model.

        The author states that simply adding network redundancy would reduce partitions, and infrastructure problems are proof that this is very short-sighted. "You have bigger problems" - no kidding! Hence the weak-consistency model!

    • StreamBright 6 years ago

      Even if you maintain your infrastructure properly you run on x86 servers with disks and CPUs that need cooling, using network devices that have fascinating failure scenarios. I guess assuming that your infra is not reliable is a must for any database nowadays.

  • evanweaver 6 years ago

    Do these concerns also apply in an HTAP or OLAP context e.g. systems like Cloudera's Kudu, which uses Hybrid Time? Or maybe Volt which you also worked on?

    • nickerten 6 years ago

      I've worked on a system that loaded data into Kudu in near real time and simultaneously ran queries on the data. Kudu has no transactions; consistency is eventual, which was sufficient given our near-real-time constraint. However, you do need a stable NTP source. We lost data when the cluster could not get a reliable NTP connection, decided to shut down, and the tablet servers' data files became corrupted.

    • kwillets 6 years ago

      Vertica seems to include some of these observations, such as global consistency and group commit. I believe these are easier to achieve (lower overhead) in OLAP due to fewer, larger transactions.

    • xtacy 6 years ago

      OLAP systems tend to be read only (for analytics) so the question of transactions and consistency isn't really applicable.

      • kwillets 6 years ago

        They don't load themselves.

  • romed 6 years ago

    I question your statement that building apps on weakly-consistent systems is so difficult. I’ve worked on very large scale systems that you’ve definitely heard of and probably used that are built atop storage systems with very weak semantics and asynchronous replication. Aren’t such systems existence proofs, or do you think there’s just a huge difference between the abilities of engineers in various organizations?

    • cloakandswagger 6 years ago

      He said it was difficult, not impossible.

      I think it's a fairly noncontroversial statement. Dealing with eventual consistency is always going to be more difficult and require more careful thought and preparation than immediately consistent systems.

      How many programmers do you think are out there that have only ever worked on systems that use a single RDBMS instance, and what would happen if they tried to apply their techniques to a distributed, eventually consistent environment?

      • wpietri 6 years ago

        Exactly. My dad was coding when DBMSes rose to prominence, and it was basically a way to take a bunch of things that were hard to think about and sweep them under the rug. People wrote plenty of good software before they existed, but if you wanted to write a piece of data, you had to think about which disk and where on the disk and exactly the record format. Most programmers just wanted a genie that they could give data to and later ask for it back.

        It's the same today, but worse. Most programmers still want a simple abstraction that lets them just build things. But now it's not just which sector on which disk, but also which server in which data center on which continent, while withstanding the larger number of failure modes at that scale.

        When necessary, people can explicitly address that complexity. But it has a big cost, a high cognitive load.

    • groestl 6 years ago

      The biggest challenge, in my experience, is explaining the weak consistency guarantees to stakeholders, especially in a QA setting (e.g. product owner demos the product to colleagues, numbers are not immediately up-to-date).

  • CBLT 6 years ago

    Great read, thank you for sharing. Do you have any opinion on the design of Eris[0]? Consistency is achieved with extra hardware, but that hardware is a network-level sequencer.

    [0]: https://syslab.cs.washington.edu/papers/eris-sosp17.pdf

    • rwtxn 6 years ago

      AFAIK Eris (and previous similar work from the same group) assume network ordering guarantees that can be provided in a single datacenter but probably not in a WAN setting. This discussion is about distributed and geo-replicated databases.

  • Serow225 6 years ago

    I think it would be helpful to move the disclaimer to the top of the post, just for clarity's sake. Is there a forecasted date on the release of the independent Jepsen study? Who is performing it? Thanks!!

  • reilly3000 6 years ago

    Is there a "best-of-both-worlds" approach that could work, or are these two approaches mutually exclusive? I have to imagine that time drift can eventually reconciled with some kind of time delta.

    • abadid 6 years ago

      The two approaches seem mutually exclusive to me.

  • jakelarkin 6 years ago

    Won't batching transactions increase the failure rate by a multiple of the batch size? How should users reason about that tradeoff?

romed 6 years ago

I’ve seen comments on HN over the years in which someone Dunning-Kruegers their way into saying that TrueTime is easily replicated. I always wonder if they have sixteen senior SREs in their pocket, because that’s the level of production engineering Google applies to the problem. Time SRE has at various points had to take measures up to and including calling the USAF and telling them their satellites are fucked up. If you don’t have the staff for this, the easiest way to get access to TrueTime is probably to just use Cloud Spanner.

  • hosay123 6 years ago

    > Time SRE has at various points had to take measures up to and including calling the USAF and telling them their satellites are fucked up

    It's another cute anecdote, but Google culture is full of these, always scant on details and always intended to show how big/smart/important/complex/indispensable their engineering is.

    "Had to" is a strong term here, it's made to sound like USAF could not possibly have noticed some deviation they were likely to correct of their own accord as a matter of routine as they had been doing for the 20 years of the GPS project prior to Google being founded.

    The reality is that drift and bad clocks are, and always have been, a feature of GPS: one explicitly designed for, one an entire staff exists to cope with. Designs depending on the absolute accuracy of a single clock have never been correct.

    • wpietri 6 years ago

      So yes, Google can be very impressed with Google. But I'm not sure that's the issue here.

      Is it really surprising that people who have extremely precise time needs and a whole team devoted to solving them would notice issues that other people wouldn't? I think it's a very common pattern that a product has some set of trailblazer users who find issues before the people who make the product.

      Also, I think you're over-interpreting. "Had to" here only means that they noticed and reported the issue first because their system depended on GPS time being right. It doesn't preclude the possibility that the USAF would notice and fix the issue eventually, just with a higher latency than Google wanted.

      • hosay123 6 years ago

        If some condition existed that exceeded GPS's intended design, you most certainly wouldn't learn of it first from some random anecdote on HN... more likely from the front page of the BBC as the transportation system instantly collapses.

        So the anecdote itself is noise. It's intended to show how seriously intractable a problem accurate time is, but it doesn't do that; instead it only demonstrates OP's lack of familiarity with GPS and a willingness to regurgitate corporate old wives' tales.

        • joshuamorton 6 years ago

          A single satellite mildly misbehaving on occasion won't necessarily cause catastrophe. You're normally connected to more satellites than the minimum needed for a fix anyway, so you might notice less accuracy, but not anything terrible.

          Most of these systems are designed to work if you lose GPS entirely, so they fail gracefully.

          Planes won't actually fall out of the sky if GPS makes mistakes. That's y2k fearmongering.

          Why is it hard to believe that a group using GPS for a unique purpose has unique needs and detects unique issues?

        • skrebbel 6 years ago

          Sub-millisecond flaws in GPS would make the transportation system collapse? Why?

          • ihattendorf 6 years ago

            Here's an interesting article[1] about how relativity affects GPS satellites. The clocks in a GPS satellite need to be accurate to within 20-30 nanoseconds, and relativity makes them run about 38 microseconds/day fast, which has to be compensated for.

            [1] http://www.astronomy.ohio-state.edu/~pogge/Ast162/Unit5/gps....

            • jandrese 6 years ago

              GPS is one of the few technologies that have to account for both Special Relativity and General Relativity. The level of engineering that went into the system is just amazing.

              Fun fact, GPS satellites use rubidium clocks instead of cesium clocks, and only maintain their accuracy thanks to yet another incredible feat of engineering.

          • dleslie 6 years ago

            Triangulation of location is bounded by the accuracy of those clocks.

            1 microsecond is 300 meters of error.

            • tzs 6 years ago

              What if all the satellites are off by the same amount?

              According to the link posted higher up in the thread, in early 2016 they were all off by 13 microseconds for 12 hours, with no apparent consequences for anything ordinary people use GPS for such as location finding.

              To triangulate, I think you need to know (1) where the satellites are, and (2) how far you are from each satellite. I think either absolute distance or relative distance works.

              Getting both of these depends on knowing the time. That time comes from the satellites. Let's say they are all off by 1 us. Your time is derived from satellite time. That would mean the time you use to look up/calculate their positions will be off by 1 us from the correct time so you would get the wrong position for the satellites.

              A quick Googling says the satellites' orbital speed is 14000 km/hr, so using a time off by 1 us to look up/calculate satellite position would give you a position that is about 4 mm off.
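
              A quick sanity check of that arithmetic (using the rough speed figure quoted above):

                orbital_speed = 14000 * 1000 / 3600   # ~14,000 km/hr in m/s, about 3,889 m/s
                time_error = 1e-6                     # 1 microsecond
                print(orbital_speed * time_error)     # ~0.0039 m, i.e. about 4 mm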

              The procedure for deriving the time from the satellites would get some extra error from this, but that should be limited to about the time it takes light to travel 4 mm, so we can ignore that. As a result your distance measurements between you and satellites would be off by about 4 mm or less.

              The key here is that when all the satellites have the same error, the time you derive has the same error, so your distance calculations should still work, and so you only get an error of about how far satellites move in an interval equal to the time error.

              In summary, if all the satellites are off by 1 us, your triangulation seems like it would be about 4 mm more uncertain.

              If only one satellite is off, it is going to depend on how the time algorithm works. If the algorithm is such that it ends up with a time much closer to the times of the correct satellites than to the off satellite, then if it calculates the distance from the triangulated position to the expected positions of the satellites, and compares that to the measured distance, it should find that one is off by something on the order of the distance light travels in 1 us, and the others are all pretty close to where they should be. It should then be able to figure out that it has one unreliable satellite, drop it, and then get the right location.

              I have no idea if they actually take those kinds of precautions, though.

              The case that would really screw it up would be if several satellites are off, but by different amounts. With enough observation it should be possible in many cases to even straighten that out, but it might be too complicated or too time consuming to be practical. (This is assuming that the error is that the satellites are simply set to the wrong time, but that wrong time is ticking at the right rate).

          • ninkendo 6 years ago

            Precise geolocation relies on extreme time accuracy (the story always being that relativistic time dilation effects from the difference in gravity on the surface vs in orbit must be accounted for), so yeah, it wouldn't surprise me one bit that the accuracy required is on the order of much less than a millisecond.

      • 21 6 years ago

        > Is it really surprising that people who have extremely precise time needs and a whole team devoted to solving them would notice issues that other people wouldn't

        If GPS timing is bad, a lot of people will notice that their position on the map is incorrect, because that's the whole purpose of the GPS network.

        A 1 microsecond error is 300 meters.

        • zwegner 6 years ago

          > A 1 microsecond error is 300 meters.

          While the speed-of-light propagation is about 300 meters in a microsecond, isn't the final position error possibly much greater? For calculating position on Earth, you can think about a sphere expanding at the speed of light from each satellite. The 1 microsecond error here corresponds to a radius 300m bigger or smaller, which only corresponds to 300m horizontal distance on the ground if the satellite is on the horizon (assuming that Earth is locally a flat plane for simplicity here). For a satellite directly overhead, the 300m error is a vertical distance. Calculating the difference in horizontal position from this error is then finding the length of a leg of a right triangle with other leg length D and hypotenuse length D+300m, where D is the orbital distance from the satellite (according to Wikipedia, 20180km). The final horizontal distance error is then sqrt((D+300)^2 - D^2), or about 110km.

          Of course, this is just the effect of a 1us error in a single satellite, I'm sure there's ways to detect and compensate for these errors.
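
          Checking that geometry with the numbers above:

            import math

            D = 20180e3         # distance to an overhead satellite, in meters
            err = 300.0         # 1 microsecond of light travel, in meters

            # Leg of a right triangle with hypotenuse D + err and other leg D
            horizontal = math.sqrt((D + err) ** 2 - D ** 2)
            print(horizontal)   # ~110,000 m, i.e. roughly 110 km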

          • 21 6 years ago

            Intuitively this seems wrong to me. If the satellite is overhead, the error would put you 300m into the ground so to speak. I'm not sure why you project that horizontally, and especially why you take the distance to the satellite into account.

            As another sanity check, if the error for 1 us is 110 km, the error for 1 ns would be 110 m, and I suspect 1 ns error is not unusual for consumer electronics:

            > To reduce this error level to the order of meters would require an atomic clock. However, not only is this impracticable for consumer GPS devices, the GPS satellites are only accurate to about 10 nano seconds (in which time a signal would travel 3m)

            https://wiki.openstreetmap.org/wiki/Accuracy_of_GPS_data

            • zwegner 6 years ago

              > If the satellite is overhead, the error would put you 300m into the ground so to speak.

              Right, I was basically calculating where that signal would just be reaching the surface at the same time it was 300m under you. This is a circle around you with a radius of ~110km (again using the approximation of the ground as a flat plane). Thinking about it more, there's not much reason to do this (GPS isn't really tied to the surface of the Earth, it gives you 3-D coordinates). I guess my point was that the 300m of distance from 1us of light propagation should not be thought of as a horizontal distance.

            • SamReidHughes 6 years ago

              That would be if it were straight overhead, intersecting tangentially with another sphere. Realistically they're not overhead, but if two satellites are 30 degrees apart, the line of intersection between their spheres will move twice the distance one of the spheres moves. The magnifying factor is 1/sin(angle between the satellites from the observer).

        • makomk 6 years ago

          If I remember correctly, there was a bug a couple of years back which caused an incorrect time offset between GPS and UTC time to be uploaded to some of the satellites - off by a handful of microseconds. Didn't affect navigation but it did trip a bunch of alerts on systems that relied on precise time. I don't think Google was the one that alerted the USAF to that though, in fact they may not have had sufficiently accurate timekeeping back then.

        • mannykannot 6 years ago

          GPS is not typically used to confirm a position that is known accurately by other means, and that is not its purpose. Only in those cases where there is a manifest conflict with independent spatial information will the problem be evident.

          • wongarsu 6 years ago

            >GPS is not typically used to confirm a position that is known accurately by other means

            I am not so sure about that. The most common use of GPS is in satnav in cars. Satnavs typically show a map, and typically it is very easy to confirm your position on a map. Any inaccuracy by more than the usual few meters would be quickly noticed by the majority of GPS users.

          • Rapzid 6 years ago

            People are going to notice a 300m deviation due to landmarks and their eyes.

            • mannykannot 6 years ago

              Rarely, if you are navigating at sea or in the air or in the woods... and even on the road, it is not uncommon for my GPS device to be clearly off without justifying the conclusion that there is a fault in a satellite.

              • Rapzid 6 years ago

                Here are some users that have a high chance of noticing visually and in aggregate would probably produce a lot of noise:

                * Air and sea port operators and navigators

                * Military personnel running supply lines

                * Military personnel on foot in operations and training

                * SpaceX

                * NASA

                * River boats

                * Fresh water fishermen

                * Etc

                Given all the possible users who would notice a 300m deviation just from visual reconciliation, I personally would not say it would be so rare that the USAF would fail to find out very quickly. Of course, this is ignoring the equipment that would likely detect the issue well before somebody in the Army started phoning the USAF.

              • emiliobumachar 6 years ago

                Come on, in urban traffic a 300m error will easily place one in a parallel street.

              • gregdunn 6 years ago

                Unless I'm woefully off base here, if the satellites were incorrect, you would basically be permanently 300m off, not just temporarily.

                There aren't so many GPS satellites out there that you're going to be bouncing around between them all the time - even if only one is affected, it would be very noticeable for extended periods of time.

    • romed 6 years ago

      I specifically worded this to be about money not brains. Most readers here can probably imagine how to implement a bounded time service. Most readers here also cannot afford to operate one. That is the point. Operating software reliably at large scale happens to be very expensive. 24x7 coverage with a short time-to-repair costs at a minimum several million dollars per year.

      • michaelt 6 years ago

          24x7 coverage with a short time-to-repair costs at
          a minimum several million dollars per year.
        
        Interesting - what are the constituents of that cost?

        What sort of challenges do you face? Do you use PTP grandmaster clocks, or something else? How many sites, and how many clocks per site? Are the support issues mostly hardware failures, configuration problems, or something else? Is 24/7 support needed because the equipment lacks failover support, or is the failover support unreliable or insufficient?

        • akiselev 6 years ago

          You generally need at least 4-5 SREs for a high-availability subsystem at large (big 5) scale in a multinational corp, just to cover all of the timezones and make sure you're not frantically calling everyone when someone goes on vacation or has to pick up their kid from the nurse. The salary plus benefits and overhead on that is easily in the millions.

        • ahofmann 6 years ago

          I think the point was that Google has such high costs. I read somewhere that Google operates two atomic clocks in each of its data centers, but I can't find a source for it right now, just this: https://www.wired.com/2012/11/google-spanner-time/

          • jnwatson 6 years ago

            Atomic clocks aren't all that expensive. You can get a decent rubidium one for US $5K.

    • jandrese 6 years ago

      I'm guessing that was a reference to the January 2016 event[1]?

      Google wasn't the only company that noticed it, and I have no idea if they discovered it before the USAF, but I can believe that someone from Google would phone up Schriever and ask WTF is going on.

      [1] http://ptfinc.com/gps-glitch-january-2016/

    • espeed 6 years ago

      Google also reports software and hardware security vulnerabilities and infrastructure security issues to external companies, organizations, and stakeholders responsible for the design, operation, and maintenance of external systems. Google isn't the only company that does this - other organizations do too - but at this level expertise is a scarce resource, and we're all in the same boat, so it behooves everyone when those capable can and do cooperate and participate in keeping a vigilant watch. This ethos is one of the reasons the West dominates.

  • jhayward 6 years ago

    > including calling the USAF and telling them their satellites are fucked up

    Citation needed. There is a worldwide organization, led by the US Naval Observatory, that keeps constant watch on GPS satellite time performance. If Google noticed anything that the USNO and the other participants didn't, that would warrant a paper or three.

  • hinkley 6 years ago

    I’m sure someone has talked about this and proven or disproven it, but I’ve always been a little uneasy with the TrueTime protocol. I mean, it’s faster than Raft, but for the use case maybe we are trying to fix the wrong problem?

    If two events are independent, it matters very little what order we record them in the system of record.

    My whole career we have been building cause and effect at transaction time, but when we debug we stare at log files and timestamps like we are reading tea leaves, trying to figure out how situation A led to corrupted data in row B.

    Maybe there’s a way to record this stuff instead of time stamps? Something DVCS style. Or maybe it’s provably intractable.

    • apoverton 6 years ago

      I think if two events are independent then you can use an eventually consistent DB like Cassandra and move on with your day. The importance of something like TrueTime arises when you have an example like the one in the article. Step 1: remove your parents' permission to see your photo albums. Step 2: post your spring break photo album.

      • taeric 6 years ago

        That doesn't sound like a system you need "true time" for. :(

        Of course, I think I just like thinking about how true time will start to fail once we get beyond the earth. Consider, what is the true time for events we are seeing in the stars right now?

        Granted, I fully cede that being able to rely on a fully sequenced notion of time that everyone is a part of makes some reasoning much easier.

    • sseth 6 years ago

      TrueTime is not a timestamp, but a time with a spread which reflects the maximum difference in time between two data centers.

      If I understand correctly, if the spread grows too large, it brings down the number of transactions per second which can be done, so keeping the spread small is key. Hence the atomic clocks etc.

  • abadid 6 years ago

    Or just use global consensus with batching and avoid the need for clock synchronization altogether!

  • sabujp 6 years ago

    the only solution to this is to launch your own satellites :)

    • reilly3000 6 years ago

      If we seal the surface of the ocean and then boil it, we can launch our own time satellites with completely sustainable propulsion.

  • ianopolous 6 years ago

    I've always had an issue with systems assuming a universal time, because physics (Special Relativity) tells us there is no such thing. Two events in different places will be viewed as having a different order depending on your frame of reference. All that really matters on a physical level is causal connections.

    I believe vector clocks capture this semantic. But they have other trade-offs.
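
    For the curious, a bare-bones vector clock looks something like this (a toy sketch that ignores pruning and actor-identity issues):

      class VectorClock:
          def __init__(self):
              self.counters = {}                 # node id -> logical counter

          def tick(self, node):
              """Record a local event on `node`."""
              self.counters[node] = self.counters.get(node, 0) + 1

          def merge(self, other):
              """Absorb another clock (e.g. one received with a message)."""
              for node, n in other.counters.items():
                  self.counters[node] = max(self.counters.get(node, 0), n)

          def happened_before(self, other):
              """True if self causally precedes other. If neither precedes
              the other, the events are concurrent and need reconciliation."""
              return (all(n <= other.counters.get(node, 0)
                          for node, n in self.counters.items())
                      and self.counters != other.counters)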

    • amluto 6 years ago

      Special relativity has no such problem. In SR you can easily define a “time” coordinate everywhere such that all events can be timestamped with that coordinate and will respect causality.

      GR at least usually has this property as well. (It doesn’t in the presence of closed timelike curves. It does in weak gravity and in the FLRW metric in cosmology. I’m not sure about the general strong gravity case.)

      • ianopolous 6 years ago

        Of course every inertial frame in SR has a well defined time coordinate, but that is not a universal time - other frames will disagree on which of two not-causally-connected events happened first. This is normally explained through the lack of a well defined "simultaneity" across different inertial frames.

        • infogulch 6 years ago

          You are correct that there is no truly capital-U-Universal time, but it doesn't matter. You control the whole system, so just choose one and call it "true time" and make everything participating in the system match it. Simultaneity in all inertial frames can be translated between one another, so if you go to a new place that has, for example, more time dilation due to different gravity, just note the parameters and translate it into your chosen "true time" and adjust the spread.

          • ianopolous 6 years ago

            My point is not that it isn't possible, but that it is arbitrary and has no physical meaning. Writes thrown away in one frame because another concurrent write was "later" would in fact be kept in another frame. This is why it feels like a 'bug' to me conceptually - it's not how the universe works so why should a database need it.

            • infogulch 6 years ago

              To summarize, you're saying 'The universe doesn't need linear serializability, so why should we?'

              We build abstractions because they're useful to us, not because they have some special meaning to the universe. Systems with simpler abstractions are easier to understand and therefore build on top of. Complex numbers, for example, have no direct physical meaning in the (non-quantum mechanical) universe but can still be extremely useful and are sometimes the only/best way to solve some classical problems.

              If you want to use a database where a required step of querying it is specifying a reference frame that the ordering of events is relative to, feel free. "In fact, for any two spacelike separated events, it is possible to find a reference frame where you can reverse the order in which they happen."[1] Personally I'll take a hard pass on bug reports like 'Foreign key constraint fails in reference frame 0.992c at 37.2Mm vector towards Alpha Centauri' which reads like the climax of Dante's Inferno for Systems Programmers.

              [1]: https://physics.stackexchange.com/a/75765/28368

              • ianopolous 6 years ago

                No, I'm saying that the less impedance mismatch we have with the real physical world, the easier things will be to make work and to scale beyond Earth (we will need a network that can reach Mars, for example, soonish). Fundamentally, we are limited by the laws of physics, so it is beneficial to understand them.

                We build abstractions because they are useful, but we also often build the wrong abstractions - this is the entire history of science - building better abstractions. Simpler abstractions are great, but they can limit you. We can build all kinds of wonderful machinery with just classical physics, but if you want the modern world with GPS etc. you need relativity to make it work. All the databases based on truetime are great and marvels of engineering, but they won't be able to scale to even a second planet (getting a GPS equivalent to work across two planets is orders of magnitude harder than the earth one, not to mention the latency).

                I'm not advocating your idea of a database that explicitly uses frames, but rather something based on more physical foundations like cause-and-effect. As I mentioned, I think vector clocks satisfy this. But maybe there are other, better alternatives.

                I'm fully aware of the physics (I have a MPhys, and DPhil in Particle Physics from Oxford).

    • notatcomputer68 6 years ago

      Can't you solve that by just picking a reference frame though?

      • hinkley 6 years ago

        HTTP has had a reference frame since 1.0. It’s still one of my favorite features of just about any protocol ever.

        Cache headers can set expiration times or offsets, but the client and server both send their own time stamp in the message. So when the server says “expire this at noon” but your laptop’s clock is five minutes slow or three time zones away, you can still figure out what the server meant by noon because the server tells you what time it thinks it is now. The accuracy is limited by network delays but for the problem space getting the right answer to the nearest second is still pretty damned good.
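
        Concretely, a client can compute its skew from the server's Date header and interpret Expires relative to that. A rough sketch (real caches also fold in the Age header and response delay; the header values here are made up):

          import time
          from email.utils import parsedate_to_datetime

          def expires_in(response_headers):
              """Seconds until expiry, corrected for the gap between our clock
              and the server's clock as reported in the Date header."""
              server_now = parsedate_to_datetime(response_headers["Date"]).timestamp()
              expires_at = parsedate_to_datetime(response_headers["Expires"]).timestamp()
              skew = time.time() - server_now        # how far off our clock is
              return (expires_at + skew) - time.time()

          headers = {
              "Date": "Tue, 25 Sep 2018 12:00:00 GMT",
              "Expires": "Tue, 25 Sep 2018 12:05:00 GMT",
          }
          print(expires_in(headers))   # ~300 seconds, regardless of local clock error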

      • xenadu02 6 years ago

        Yes, and we all pick the Earth’s gravity well as our reference frame. Differences due to elevation and latitude are too small to measure for purposes of transaction clocks.

      • ianopolous 6 years ago

        You need a quite elaborate system of a grid of synchronized, stationary clocks and rods to correctly define a reference frame, which is worse than what's required to come up with a "good enough on earth to some time resolution" true time.

        • ianopolous 6 years ago

          To be clear this is pretty much exactly what the system of GPS satellites does, but without physical rods, and it is very expensive to construct and maintain. This works fine to a certain accuracy as long as you restrict your application to staying on earth.

  • dustingetz 6 years ago

    When infinite resources are available, organizations will somehow find a way to consume them. There are properly consistent not-SQL architectures that do not need clocks; Datomic is one, and it does not require coordination with any government agencies to operate. https://www.datomic.com/

  • Cshelton 6 years ago

    Also, I believe all the major cloud providers provide a "TrueTime" API service. I forgot the name that AWS uses, but you can call it on your EC2 instances and make sure your hosts are all in sync. It's pretty cool.

    • otterley 6 years ago

      AFAIK AWS is only offering NTP service with a GPS source (i.e. stratum 1). TrueTime appears to be a service offering that is a step above that in terms of the guarantees it provides.

      • Cshelton 6 years ago

        Ah, I've never had a use case for anything more accurate than NTP

        • jandrese 6 years ago

          You may never have built a system that needed it, but LTE cell handoffs require higher precision than NTP can supply. Almost every cell phone in use today benefits from highly precise timing.

          This is why PTP and high precision GPS devices are built to integrate with cell provider gear.

      • fintler 6 years ago

        TrueTime is more of a client side library that utilizes a service like NTPd.

        You can implement a toy version of TrueTime with about 60 lines of C which uses a single ntp_gettime call for each of the TrueTime api functions (now, before, after).

        If AWS's NTPd service offers drift <= 200 us/s, you could use it for TrueTime.
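
        In Python rather than C, a toy version of that interface might look like the sketch below, with the drift bound and sync interval simply assumed as constants instead of being read back from the NTP daemon:

          import time

          MAX_DRIFT = 200e-6      # assumed bound: 200 microseconds of drift per second
          SYNC_INTERVAL = 16.0    # assumed worst case: seconds since the last NTP sync

          def now():
              """Return (earliest, latest) bounds on the true time."""
              eps = MAX_DRIFT * SYNC_INTERVAL       # uncertainty accumulated since sync
              t = time.time()
              return t - eps, t + eps

          def before(t):
              """True if t is definitely still in the future."""
              return t > now()[1]

          def after(t):
              """True if t has definitely passed."""
              return t < now()[0]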

voidmain 6 years ago

This article lacks any reference to FoundationDB, which offers external consistency and serializable distributed transactions without trusting clocks in any way. We designed it starting in 2009 and so its lineage is independent of either Calvin or Spanner.

FDB doesn't have an actively developed SQL layer at the moment, so I guess you could say it isn't a "NewSQL" database, but none of the properties under discussion have much to do with the query language.

  • uluyol 6 years ago

    As best as I could find, FoundationDB doesn't provide strict serializability (i.e. linearizability + serializability) and doesn't tackle multi-DC deployments. So I wouldn't put it in the same class as FaunaDB or Spanner.

    • jscissr 6 years ago

      There is a "Datacenter-aware mode": https://apple.github.io/foundationdb/configuration.html#data...

      Here is some discussion about linearizabilty in fdb: https://news.ycombinator.com/item?id=16884882

      • uluyol 6 years ago

        Without knowing exactly how reads and writes are replicated, I am skeptical. I have gone through the technical documentation and haven't found many details. For reference, I've done work in storage and consensus algorithms and I can tell you for a fact that without using a consensus algorithm for either reconfiguration or request propagation, you will have consistency violations.

        I would love to be proven wrong, as more systems with strong consistency guarantees would be a good thing, but for now, I don't believe that FoundationDB provides stronger guarantees than serializable reads and writes.

        • voidmain 6 years ago

          FoundationDB uses a consensus algorithm for reconfiguration, but not in the (happy path) transaction pipeline. It provides (by default) strict serializability (i.e. serializability and external consistency/linearizability) for arbitrary, ad hoc, interactive transactions, and it's expected to provide excellent performance even when every single transaction is cross-node (so e.g. indexes can be efficiently updated this way). It provides better fault tolerance than consensus-replicated databases typically can, because it needs only N+1 replicas instead of 2N+1 to survive N faults (it keeps 2N+1 replicas of a tiny amount of configuration for consensus, of course). It has the best testing story in the industry and is used at scale by, among others, the largest company in the world. Because it doesn't have consensus in the data path, it can have lower latencies than consensus-replicated databases in multi-region deployments. It also supports asynchronous replication, and an upcoming feature will provide a unique option for multi-region failover with sub-geographic write latencies, maintaining full transaction durability in failover as if in synchronous replication, except in the exceptionally rare case where multiple datacenters in a region fail exactly simultaneously.

          Besides its extensive documentation, you can read its source code and run its deterministic simulation tests yourself if you are interested (it's Apache licensed). Skepticism on these points was reasonable when we originally launched it in 2012 but is getting a little silly in 2018.

          • jules 6 years ago

            How does FoundationDB do distributed transactions without consensus?

bnastic 6 years ago

> With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.

I haven’t worked with the NASDAQ stream directly, but knowing how fast equities tick, I find this “10,000 orders/sec” estimate quite low.

Not to mention that a 10ms delay in confirming an order would be really terrible.

haberman 6 years ago

> We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.

This post doesn't establish any "controversy" about Spanner's design decision. It only says that it requires special hardware, which other systems attempt to emulate despite not having this specialized hardware.

To call this decision "controversial" I think one would need to show that it has some significant problem in the environment it was designed for.

  • abadid 6 years ago

    By controversial, I mean that in the database research community (where I spend most of my time), there is significant disagreement about global consensus vs. Spanner's choice of partitioned consensus.

    • dekhn 6 years ago

      There may be disagreement about Spanner's choice, but my observation of it in a production environment is that the choices were reasonable tradeoffs that permitted a large number of products to be moved off BigTable with confidence about the results of the system running in a demanding environment. It has worked well enough to remain established, and I don't see any successor with a "better" design coming along and replacing it within a decade.

    • haberman 6 years ago

      Disagreement about what though? Does Spanner's solution have an objective problem? Do you or others in your community have specific reasons to believe that it cannot deliver on its promises?

      • abadid 6 years ago

        Spanner's approach requires help from hardware and several full-time employees maintaining and ensuring the uncertainty guarantees. This increases the cost of maintaining the system, which for Cloud Spanner is partially passed on to the end users. If you can build a system that doesn't require time synchronization, yet doesn't have any significant drawbacks relative to what Spanner provides, you'd be better off using this alternative system.

        • haberman 6 years ago

          > If you can build a system that doesn't require time synchronization, yet doesn't have any significant drawbacks relative to what Spanner provides, you'd be better off using this alternative system.

          But you describe exactly the drawbacks of giving up time synchronization:

          > The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total amount of transactions per second that the system can handle.

          How is "worse scalability" not a significant drawback?

          This just sounds like an engineering tradeoff. I don't think engineering tradeoffs are the same as controversy. I get that your group's DB takes a different approach. But "blaming" Spanner for making a different trade-off doesn't come off well (I approached the article with an open mind).

          • mrep 6 years ago

            You beat me to it. Also, per

            > Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.

            Maybe if you are only considering the small scope of just the order transactions, but they are actually writing way more data to their databases, such as logs, metrics, and current state for things like shopping carts. In my experience, what teams have done is split their data off into their own siloed databases to minimize scaling problems, but this becomes super painful when you want to join your data with others'. If Spanner can hold all of our data, scale, and handle joining across all of it, that sounds like a huge win.

        • ysleepy 6 years ago

          You clearly have an agenda here; the blog post contains little information while spreading FUD. I'd never heard of Calvin and will now avoid it.

          • atombender 6 years ago

            Calvin is a research database, described in a paper and also provided as an open source project, but last I checked, it was pure research, not usable for real work. The "production version" of Calvin is arguably FaunaDB, which is a hosted SaaS product.

            Abadi is a very well known and, as far as I know, respected database researcher. No need to avoid his work.

  • 1024core 6 years ago

    That's because the author is pushing Calvin, which came out of his research group.

dgreensp 6 years ago

This is the clearest presentation of why CAP is misleading that I’ve ever seen! Wonderful.

If the app programmer knows they don’t have global consistency, just consistency per partition, I wonder if in practice there are ways to achieve the necessary application-level guarantees such as in the photo-sharing case (not that it sounds that appealing to need to do so).

  • xapata 6 years ago

    > application-level guarantees

    If your database doesn't "guarantee" consistency, but instead provides eventual consistency, your application needs to be aware of how long "eventually" is. In effect, you'd be coding something similar to Spanner's timestamp uncertainty estimation. In the photo sharing example, it'd be reasonable to put a delay on sharing new photos with other users until after permission changes would have propagated, as in the sketch below.
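
    A minimal sketch of that idea (the five-second propagation bound is an assumption you would have to pick and defend yourself):

      import time

      MAX_REPLICATION_LAG = 5.0   # assumed worst-case propagation delay, in seconds

      def visible_to_others(photo):
          """Only expose a new photo once any earlier permission change has
          had time to reach every replica."""
          return time.time() >= photo["posted_at"] + MAX_REPLICATION_LAG

      photo = {"posted_at": time.time()}
      print(visible_to_others(photo))   # False until the delay has elapsed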

    • e12e 6 years ago

      > delay on sharing new photos with other users until after permissions changes would have propagated.

      And then you'd maybe have to map the UX of changing an existing permission to a three-state flow: user makes the change; UI shows the change in progress; change takes effect and is reflected in the UI.

beamatronic 6 years ago

Please be sure to distinguish between Couchbase and CouchDB. Couchbase is a CP system.

cbsmith 6 years ago

There seems to be a powerful straw man in there about eventual consistency.

The point of AP systems is not 100% availability, but rather higher availability.

By the same reasoning, one should never do CP, because it is also not possible to have 100% consistency. Disk/memory/network corruption, even with ECC, can overwhelm the ability to maintain consistency.

AtlasBarfed 6 years ago

The context of this article still seems to constrain consistency to guarantees for single atomic units of data (atomic at the node level).

But the second you have multi-node transactions (your data, after all, in a truly distributed system will exist on multiple nodes), the single-node reliability becomes dependent on multiple data exchanges across networks for confirmations. Then your network partition exposure starts to snowball.

Same thing for joins. Great if your joins are perfectly served by a node-local sharding strategy, but when you retrieve from multiple nodes, again your network partitioning risk starts to compound.

In AWS you will see noisy neighbor networks and network unreliability. Your VMs get yanked. Your EBS might experience an IO pause. GC pauses. Or simply one of your nodes might get swamped by a query spike.

yoava 6 years ago

I like the realistic view of NoSQL vs CAP and the different tradeoffs. We do need to talk more about the disadvantages of NoSQL systems and all the new database alternatives.

Which brings me to my point - what the author fails to talk about is why Spanner made the design decisions it did. He does claim scalability, but that is a very general word.

I believe the Spanner decision is to avoid a global vote, as in global across the entire world, from North America to Asia, Europe and even as far as Australia. Such a global vote would take a lot of time - imposing seconds of latency on any write.

I think the author, when he talks about having a single global vote with a small window, thinks in terms of a single database, maybe two regions in the continental USA, as opposed to Spanner trying to be geo-distributed.

idubrov 6 years ago

Any comments on how FoundationDB might be the same / different?

  • Dave_Rosenthal 6 years ago

    FoundationDB adheres to the highest level of consistency (linearizability) and, unlike Spanner or its derivatives, does not rely on clocks to achieve this.

    I'm surprised FoundationDB is not mentioned in the article. Maybe it didn't help the author make his point, but it is both the most mature of the new breed of ACID+noSQL databases and completely free/open source.

    • riku_iki 6 years ago

      One reason may be that it is not a 'SQL' DB, which is probably a prerequisite for the NewSQL category.

    • grogers 6 years ago

      According to their docs, FoundationDB only provides serializable isolation, so it isn't the same as the others in the post, which offer strict serializable isolation (the multi-key version of linearizability). Without strong clocks you can't have strict serializability and scale beyond a single log (but one log can get you pretty damn far). TBH most people probably only need serializable transactions anyway.

      https://apple.github.io/foundationdb/developer-guide.html#tr...

  • chubot 6 years ago

    I'm pretty sure FoundationDB still relies on the more traditional model of hardware: "many fast, identically-spec'd machines on a rack, with fast reliable networking".

    In other words, I don't think you can run Foundation DB across geographically dispersed data centers, as you can with Spanner, CockroachDB, and others.

    Running on cloud VMs is a very hostile environment for a database, as others have pointed out in this thread.

    Though I'm happy to hear from someone who knows more about FoundationDB.

    • zzzcpan 6 years ago

      FoundationDB could probably do better than CockroachDB in a multi-datacenter configuration, because of the different trade-offs they made, but I still wouldn't expect either of them to do well enough for most applications.

rkarthik007 6 years ago

CTO of YugaByte here. We firmly stand by our claims, and I wanted to explain more.

From the post by Daniel: << CockroachDB, to its credit, has acknowledged that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability).

YugaByte, however, continues to claim a guarantee of consistency. I would advise people not to trust this claim. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. >>

The statement about YugaByte DB is incorrect.

1. With respect to CAP, both Cockroach DB (https://www.cockroachlabs.com/blog/limits-of-the-cap-theorem...) and YugaByte DB (https://docs.yugabyte.com/latest/faq/architecture/#how-can-y...) are CP databases with HA and there is really no difference in the claims.

2. With respect to Isolation level in ACID, YugaByte DB does not make the linearizability (called external consistency by Google Spanner) claim. YugaByte DB offers Snapshot Isolation (detects write-write conflicts) today and Serializable isolation (detect read-write and write-write conflicts) is in the roadmap (https://docs.yugabyte.com/latest/architecture/transactions/i...).

3. We have publicly claimed that we do rely on NTP and max clock skew bounds to guarantee consistency. For example, on slide 43 of our NorCal DB Day talk (https://www.slideshare.net/YugaByte/yugabyte-db-architecture...) we mention that we are “relying on bounded clock sync (NTP, AWS Time Sync, etc).”

  • abadid 6 years ago

    Looks like you cross-posted this comment here and on my blog, so I'll also cross-post my response.

    I am quite confused by your statement that YugaByte does not claim linearizability. The C of CAP is linearizability (see the original CAP theorem https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.p...). By claiming to be CP from CAP, you are claiming linearizability.

    CockroachDB also makes the same CP claim. However, they explicitly walk back from this claim (https://www.cockroachlabs.com/blog/living-without-atomic-clo...). In contrast, YugaByte documentation makes no effort to walk back from this claim. Rather, the documentation seems to indicate that YugaByte is linearizable at: https://blog.yugabyte.com/jepsen-testing-on-yugabyte-db-data... and https://docs.yugabyte.com/latest/develop/learn/acid-transact....

    Also, I would encourage you to read the Herlihy-Wing paper on linearizability published in 1990. You seem to be confusing linearizability with ACID isolation levels, which are actually a different concept. Peter Bailis has a good blog post on the difference: http://www.bailis.org/blog/linearizability-versus-serializab....

    On your third point, we agree. However, the point of my post is that relying on max clock skew without hardware support is dangerous. I’m going so far as to say that it’s so dangerous that it is incorrect to claim consistency guarantees when you make such assumptions.
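
    To make the danger concrete, here is a toy simulation (hypothetical numbers and logic, not any particular system's code) of how a clock jump beyond the assumed skew bound can surface a stale read under timestamp-based snapshot reads:

        # Toy numbers: the system assumes clocks are within 100 ms of each other.
        MAX_SKEW_MS = 100

        class Node:
            def __init__(self, name, clock_offset_ms):
                self.name = name
                self.clock_offset_ms = clock_offset_ms  # error versus true time

            def now(self, true_time_ms):
                return true_time_ms + self.clock_offset_ms

        # Node B's clock has jumped 500 ms ahead, well beyond the assumed bound.
        a = Node("A", clock_offset_ms=0)
        b = Node("B", clock_offset_ms=500)

        store = {}  # key -> (commit_timestamp_ms, value)

        # True time 1000 ms: a client writes x=1 through node B, which stamps
        # the write with its own (wrong) clock.
        store["x"] = (b.now(1000), 1)        # commit timestamp 1500

        # True time 1050 ms: after that write has completed, another client
        # reads x through node A at a snapshot taken from A's clock, padded
        # by the assumed max skew.
        read_ts = a.now(1050) + MAX_SKEW_MS  # 1150, still below 1500
        commit_ts, value = store["x"]
        visible = value if commit_ts <= read_ts else None

        # The read misses a write that finished before the read began,
        # which is exactly a linearizability violation.
        print(visible)                       # -> None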

    I would really encourage YugaByte to consider using global consensus instead of partitioned consensus. I believe you will find it much easier to support serializability (which as you said, is on your road map) and linearizability.

    • rkarthik007 6 years ago

      HN ate my comment; hope to figure HN out someday! I had written:

      From the bailis.org link you posted:

      > Linearizability is a guarantee about single operations on single objects.

      The places we reference "linearizability" (in the Jepsen blog and docs) are in the context of single row-key operations; I can update our docs to clarify that further.

      For multi-key transactions, our docs clearly point out that we support Snapshot Isolation now and that Serializable Isolation is on the roadmap.

    • rkarthik007 6 years ago

      Will cross-post as well - unclear about protocols here :)

      From the Peter Bailis link you had:

      > Linearizability is a guarantee about single operations on single objects.

      Our references to "linearizability" in the Jepsen blog and the docs are in the context of single-key operations; happy to update our docs to clarify that further.

      For multi-key transactions in YugaByte DB, our docs clearly point out that we offer Snapshot Isolation today and that Serializable Isolation is on our roadmap (https://docs.yugabyte.com/latest/architecture/transactions/i...).

dboreham 6 years ago

In case you're wondering: this is a very worthwhile read.

  • strken 6 years ago

    The sibling comment by zzzcpan about consistency guarantees is dead and flagged, and can't be replied to, but I would be interested in reading other takes on this tradeoff.

    CockroachDB's take on CAP, for example, is at https://www.cockroachlabs.com/blog/living-without-atomic-clo...; it is linked in the GP article but somewhat downplayed, and it definitely deserves to be read.

    It seems obvious to me that we're still bound by CAP, but the real choices about what to sacrifice go beyond it.

  • zzzcpan 6 years ago

    Too many false claims for my taste, especially with respect to eventual consistency and consistency guarantees. Also some misinterpretation of the CAP theorem to fit the narrative. Very typical FaunaDB promotional post.

    • dgreensp 6 years ago

      The author of the CAP theorem says the same thing about CAP:

      https://www.infoq.com/articles/cap-twelve-years-later-how-th...

      • niftich 6 years ago

        The middle part of 'CAP Twelve Years Later' [1], especially the section on 'Managing Partitions' and its diagram, really should be mandatory reading for anyone who invokes CAP. It drives home the fact that a partition creates alternate universes of state, whose invariants have to be reconciled after the fact, and it points out that allowing reads only (erroring out on writes) is a valid strategy for providing Consistency, while systems that choose to provide Availability (trying to fulfill every transaction) will find themselves having to reconcile mutations that happened in separate universes, exactly as a version control system would. It also offers strategies for how some write-like operations can be done with minimal disruption to existing state, such as recording the intent to mutate some data in a way that enables duplicates to be discarded later, so the mutation can be performed at a time when it is safe.
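
        A minimal sketch of that intent-recording idea (the names and structure are made up, purely to show the shape of it): each side of a partition appends intents tagged with stable ids, and at reconciliation time duplicates are dropped and the surviving intents are applied in an agreed-upon order.

            import uuid

            def record_intent(log, operation, payload):
                # Append an intent instead of mutating state directly.
                log.append({"id": str(uuid.uuid4()), "op": operation, "payload": payload})

            def reconcile(logs):
                # Merge intent logs from both sides of a partition, discard
                # duplicates by id, then apply the rest in a deterministic
                # order once it is safe to do so.
                seen, merged = set(), []
                for log in logs:
                    for intent in log:
                        if intent["id"] not in seen:
                            seen.add(intent["id"])
                            merged.append(intent)
                merged.sort(key=lambda intent: intent["id"])  # any agreed-upon total order
                state = {}
                for intent in merged:
                    if intent["op"] == "set":
                        state[intent["payload"]["key"]] = intent["payload"]["value"]
                return state

            # The two sides of a partition record intents independently...
            side_a, side_b = [], []
            record_intent(side_a, "set", {"key": "cart:42", "value": ["book"]})
            record_intent(side_b, "set", {"key": "cart:42", "value": ["book", "pen"]})
            # A retried request that reached both sides carries a stable id,
            # so it is applied only once after the partition heals.
            shared = {"id": "req-123", "op": "set", "payload": {"key": "theme", "value": "dark"}}
            side_a.append(shared)
            side_b.append(dict(shared))
            print(reconcile([side_a, side_b]))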

        The author of the HN-featured article is the formulator of the PACELC theorem, which makes clear that replication is mandatory and thus always propagates state that needs to be reconciled. The ability of nodes to freely communicate with one another means that it's possible to delay answers to operations until all nodes have recorded their effects. In the event of a partition, nodes cannot share information with other nodes, so a choice must be made between denying operations that would risk states getting out of sync, or allowing them through and dealing with the fallout later.

        [1] https://www.infoq.com/articles/cap-twelve-years-later-how-th...

        • hinkley 6 years ago

          > recording the intent to mutate some data in a way that enables duplicates to be discarded later, and for the mutation to be performed at a time that it is safe.

          Isn’t that, fundamentally, old-school batch processing? Plus ça change, plus c'est la même chose... (the more things change, the more they stay the same).

    • haberman 6 years ago

      Without some description of what "false claims" and "misinterpretation" you mean, this is a content-free post.

      • jessaustin 6 years ago

        It has more "content" than the post to which it replies.

        "This is good."

        "No, it is bad, in these particular areas."

        [EDIT:] Granted, the post could still be wrong, but ISTM accusations of content-freedom tend to be projections.

        • pdpi 6 years ago

          "This is a good read" is a statement of opinion. You're free to assign whatever value you like to the op's opinion, but that opinion is "content" in and of itself.

          "The article has false claim" is a statement of fact. A vague statement of fact about a non-obvious topic that puts no effort into providing concrete examples and reasoning is pretty much worthless.

          • jessaustin 6 years ago

            Adding qualifiers makes a statement have less content? Please try to imagine using this questionable argument in any other context...

            It's a controversial topic and TFA was written by someone with a commercial interest. Everything is opinion. The level of comment-policing I see here makes me quite suspicious of everything in TFA.

            [EDIT:] Aargh, I'll never learn: every discussion that starts with the "content-free" cliché goes nowhere... just downvote and move on...

            • haberman 6 years ago

              If you're suspicious about me, it's not very well-founded, considering my main comment thread on this story takes issue with the blog's claims: https://news.ycombinator.com/item?id=18039801

              I just prefer to have discussions about specific things, not posturing and innuendo.

      • zzzcpan 6 years ago

        Too much of it is false to explain point by point; roughly half the article. So I just called it out instead. I know it's not very convincing, but I'm not getting paid to write giant FUD/PR posts like they do. I work on distributed stuff completely independently.

sseth 6 years ago

The author talks about high throughput achieved via batching, but he has not mentioned the latency implications of batching, or perhaps I missed that in the article.

Wouldn't batching lead to an increase in transaction latency, even if we achieve higher throughput?
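
A rough back-of-the-envelope illustration, with all the numbers made up: if each consensus/commit round has a fixed cost and transactions are accumulated over a batching window, throughput rises with batch size while worst-case latency grows by roughly the window.

    # Hypothetical numbers, purely illustrative.
    round_cost_ms = 2.0  # cost of one consensus/commit round (network + fsync)
    for batch_window_ms in (0.0, 5.0, 20.0):
        batch_size = max(1, int(batch_window_ms * 50))  # assume 50 txns/ms arriving
        throughput_per_sec = batch_size * 1000.0 / (batch_window_ms + round_cost_ms)
        worst_latency_ms = batch_window_ms + round_cost_ms
        print(f"window={batch_window_ms:>4} ms  ~{throughput_per_sec:>9,.0f} txn/s  "
              f"worst-case latency ~{worst_latency_ms:.0f} ms")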

dekhn 6 years ago

This reads like an advertisement for the researcher's system combined with some FUD.

  • reinhardt1053 6 years ago

    Would you mind elaborating?

    • wsy 6 years ago

      Just to mention a few points:

      - The title implies that Spanner is at fault for something, but the article itself admits that Spanner is totally fine.

      - Then it goes on to argue that this approach can't possibly work without physical devices for precise time. However, there is no proof whatsoever. Please be a researcher and design an experiment that demonstrates a violated consistency guarantee.

      - Finally, not disclosing financial relationships with products praised in the article is really low, and it was only fixed after being pointed out here. And it is not marked as a later addendum, as is usual for changed articles.

omneity 6 years ago

It's becoming harder to evaluate the guarantees of the most recent so-called "cloud-scale" databases, such as Spanner or CockroachDB.

Is there anything out there like a standard benchmark to compare these offerings more accurately?

acje 6 years ago

I don’t know for sure, but to me the AP vs. CP interpretation seems to be a true limitation only for distributed systems of exactly two nodes.

I also like the blog's point that availability is never 100%, but I think the added cost of a given availability level when going from an eventually consistent system to a linearizable one is underestimated, because performance is going to be a significant availability factor, not only the failure modes discussed.

rbranson 6 years ago

Maybe I’m missing something, but the way this reads seems to attribute Spanner’s concurrency control entirely to TrueTime. Spanner still uses partitioned Paxos to establish consensus and 2PC for multi-partition write transactions.
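
For anyone who wants the shape of that in code, here is a heavily simplified, hypothetical sketch of 2PC coordinated across partition leaders (in reality each leader fronts a Paxos group; replication, locking, and timestamps are all abstracted away):

    class PartitionLeader:
        """Stand-in for the leader of one Paxos-replicated partition."""

        def __init__(self, name):
            self.name = name
            self.prepared = {}  # txn_id -> staged writes
            self.data = {}

        def prepare(self, txn_id, writes):
            # Real system: acquire locks, replicate a PREPARE record
            # through Paxos, then acknowledge.
            self.prepared[txn_id] = writes
            return True

        def commit(self, txn_id):
            # Real system: replicate a COMMIT record through Paxos and
            # apply it at the agreed commit timestamp.
            self.data.update(self.prepared.pop(txn_id))

        def abort(self, txn_id):
            self.prepared.pop(txn_id, None)

    def two_phase_commit(txn_id, writes_by_partition):
        leaders = list(writes_by_partition)
        # Phase 1: ask every involved partition leader to prepare.
        if all(leader.prepare(txn_id, writes)
               for leader, writes in writes_by_partition.items()):
            # Phase 2: everyone prepared, so tell them all to commit.
            for leader in leaders:
                leader.commit(txn_id)
            return True
        # Someone refused or timed out: abort everywhere.
        for leader in leaders:
            leader.abort(txn_id)
        return False

    users = PartitionLeader("users")
    orders = PartitionLeader("orders")
    two_phase_commit("txn-1", {users: {"u1": "alice"}, orders: {"o9": "u1"}})

TrueTime's role in Spanner is narrower than "doing" the concurrency control: it supplies the commit timestamps (and the commit-wait) that make the ordering of those Paxos/2PC commits externally consistent.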

erikb 6 years ago

Consistency is a requirement that holds us back deeply. Implementing consistency with eventually elected, know-it-all masters also holds us back. Looking at this makes an engineer's heart cringe. In fact, the current architecture is only slightly better than what we had in 2005, more than 10 years ago. And if it doesn't run on Google or Amazon clouds, and therefore runs on UNRELIABLE networks, the voting itself can fail and you end up with a centralized system without a center. And while we claim that we need a central point of knowledge that stays in sync throughout the whole system, the whole world AROUND our IT systems works completely without being in sync and without knowing everything (sometimes even with no knowledge, or assumed-but-wrong knowledge), at a scale that IT systems only partially approach today.

rifung 6 years ago

Probably a dumb question, but if you need all servers to talk to each other to achieve consensus, how do you deal with network latency? Do people just keep data in separate zones so it isn't globally replicated?

jpalomaki 6 years ago

And the majority of developers don't need NoSQL or NewSQL; they are well off with PostgreSQL.

I think the tried and true should be the default choice here, with other paths taken only after careful consideration and good justification.

  • threeseed 6 years ago

    Or, instead of just blindly picking a system, how about spending a week or two evaluating a few that you think might mesh well with your data structures? PostgreSQL doesn't work well for many, many common use cases.

mr_pickles 6 years ago

No mention of Clustrix? The Spanner paper was in 2012. Clustrix started working on a "NewSQL" database in 2006 and had a product out in 2010.

  • antoncohen 6 years ago

    One big difference is that Clustrix (YC '06) was a single-datacenter consistent cluster, while Spanner can do multi-region consistency. It will be interesting to see what MariaDB does with Clustrix now that they have acquired it; fingers crossed that they open-source the technology.

apaprocki 6 years ago

> a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers

Maybe I'm wrong, but pretty much anyone with a DC is going to use Microsemi (previously Symmetricom) grandmaster clocks with all the bells and whistles, including an internal rubidium atomic oscillator, along with PTP to keep everything in sync with multiple layers of redundancy. A specialized hardware solution that uses "both GPS and atomic clocks" to guarantee ~15 ns RMS to UTC is just a shopping cart away.

shaklee3 6 years ago

I'm hoping aphyr will comment on this...

redwood 6 years ago

Odd to see the most popular distributed operational database system in the world left out here.

sabujp 6 years ago

Since Spanner is distributed on the backend, any function that issues multiple SQL statements, possibly called or ingested by multiple sharded servers before being sent to Spanner, should always send those statements in a single transaction.
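
For example, grouping the statements into one read-write transaction might look like this (a sketch using the Cloud Spanner Python client; the instance, database, table, and column names are made up):

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")

    def transfer(transaction):
        # Both statements commit (or roll back) atomically, no matter
        # which application server issued the call.
        transaction.execute_update(
            "UPDATE accounts SET balance = balance - 10 WHERE id = 1"
        )
        transaction.execute_update(
            "UPDATE accounts SET balance = balance + 10 WHERE id = 2"
        )

    database.run_in_transaction(transfer)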

bbcbasic 6 years ago

MongoDb is webscale. It scales right up. Sharding is the secret webscale sauce.

openloop 6 years ago

Bad spanner. Graph databases with consistency.

peterwwillis 6 years ago

Distributed databases are like systems of government. Some are worse than others, but all of them suck. I'm not aware of any study that shows that X type of database reduces bugs, increases availability, and makes customers happier. Pick one that fits your application and deal with the suckiness.

  • threeseed 6 years ago

    Distributed databases underpin every major web app today, e.g. Facebook, Instagram, Google, Apple, Spotify, etc.

    So clearly they are helping to increase availability and make customers happy.

  • erikb 6 years ago

    Why not take this pain and think about other forms of architecture? Maybe a consistent source of all truth is a bullshit requirement?