Ask HN: Which distributed database solutions would you suggest?

9 points by iKSv2 6 years ago

Hi guys,

One of my projects, started in 2013, has grown beyond the point of self-healing. It was a cool setup: active-active sync on the webapp (behind HAProxy) and one high-spec'd DB server (both web servers would connect to this DB server). Current DB size is ~145GB (based on backup; actual size might be more) and it's PostgreSQL (upgraded to 10).

Now the problem is that the database has grown beyond the point of ease: the constant need to restart services and kill long-running queries (a simple search easily takes upwards of 20s). Higher-ups have asked me to look for distributed solutions (I am strongly against NoSQL, while management is strongly against paid solutions, though they might give in after I tinker with some trials / open source versions). I am looking for guidance from you guys: are you using any distributed database in production (with a library available in PHP for connecting and such, obviously)?

I have tried installing Cassandra (after which I am strongly against NoSQL) and CitusDB, which has some very serious limitations in the community edition (their open source offering) and also in general. Some more searching turned up Postgres BDR, CockroachDB, and TimescaleDB (but that one doesn't partition on non-time columns).

My requirement is:

* To have a distributed database solution in place that is horizontally scalable.

* Easy to add and remove nodes without any downtime (sure, I can accommodate some write locks during setup).

* Have the ability to tweak the replication factor. Ideally I would love to have replication = number of nodes, i.e. each node has the complete database, so that simple queries don't have to be distributed (which makes the system slow), while a more complex query can run distributed, since the data is available on each node.

* Best case: some GIS-based plugin / extension for the solution would be icing on the cake.

* SQL-compatible, so that as little of the application as possible needs rewriting.

Please help ..

AznHisoka 6 years ago

You need more information on why you need to constantly restart services, and why there are long-running queries.

The latter is actually one of the easiest problems to find a solution for. If your problem were too many concurrent transactions / high load, that would call for distributed solutions. But slow queries? That is solvable with far less work.
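If it helps, Postgres can tell you directly which queries eat the most time. A minimal sketch, assuming you can enable the pg_stat_statements extension (it needs a config change and a restart):

  -- Needs shared_preload_libraries = 'pg_stat_statements'
  -- in postgresql.conf, plus a server restart.
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

  -- Top 10 statements by total time (column names as of Postgres 10):
  SELECT calls, total_time, mean_time, query
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 10;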

  • iKSv2 6 years ago

    The queries are slow because the resulting dataset is large and we then act on that data. While this can be remedied (it's being worked on), I have seen some demos of distributed systems on the web and think that sooner or later I will need to go that way. It makes sense to do it now.

eb0la 6 years ago

Try old school tricks first:

- EXPLAIN your queries: you will be surprised how much of the stuff ORMs generate is not well indexed.

- Take a look at the slow query log (and run those queries through EXPLAIN first).

- Exploit recency: I bet most reads are of recent data, or of data written near other data you ask for. If that is the case, consider partitioning the tables (see the sketch below).

- Think about the best index type for your tables. Some are better suited than others.

- If some data is not going to be written again, a columnar data store would help...

Don't go distributed at first.

Distributed means that when something fails, you have to figure out what is failing and where... instead of just what.

If you want to go distributed, try sharding first. If it works for you, add a cache and use the shards.

But try the old school, boring stuff first, because you might need it later anyway.
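For instance (table and column names invented; Postgres 10 syntax):

  -- See where the time goes in a slow search:
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM orders WHERE customer_id = 42;

  -- Exploit recency: a partitioned rebuild of the same table,
  -- using Postgres 10's native range partitioning:
  CREATE TABLE orders_partitioned (
      id          bigint NOT NULL,
      customer_id bigint,
      created_at  timestamptz NOT NULL
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE orders_2018 PARTITION OF orders_partitioned
      FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');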

  • iKSv2 6 years ago

    Will check what I and the team can do along the old school lines. Thanks

segmondy 6 years ago

Redesign and fix your database. Nine times out of ten it's a poorly designed DB and poorly designed queries. 145GB means nothing on its own. How many rows? Can you partition? Can you use smaller datatypes? Are you read-heavy or write-heavy? Are your indexes designed accordingly? Have you split your DB into OLTP and OLAP databases? Do you have lots of triggers? Are you making the database do lots of work that you could offload to the application? Did you actually tune your DB parameters? Have you scaled the hardware vertically as far as it will go? 145GB is nothing.
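Most of those questions can be answered straight from the catalog views; a rough sketch:

  -- Rows, on-disk size, and read/write patterns per table:
  SELECT relname,
         n_live_tup AS approx_rows,
         pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
         seq_scan,                            -- full scans (missing index?)
         idx_scan,                            -- index reads
         n_tup_ins + n_tup_upd + n_tup_del AS writes
  FROM pg_stat_user_tables
  ORDER BY pg_total_relation_size(relid) DESC;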

bufferoverflow 6 years ago

First, make sure you can't solve your problem by moving to a more powerful server. 145GB is tiny. 256GB of DDR3 is less than $600 on eBay and can fit your whole DB in memory.

Also, are you on a SSD?

  • iKSv2 6 years ago

    I certainly could, and yes, I am on SSD. But moving to a bigger server is essentially just delaying a problem I will have to solve at some point anyway. Also, right now I can afford a window of bad (slow) performance while I migrate.

    • JimmyAustin 6 years ago

      You'd be surprised how far you can push the problem into the future. OVH will sell you a server with 512GB of RAM for under $1k a month. You can easily go into the terabytes.

      • bufferoverflow 6 years ago

        Hetzner PX92 with 256GB DDR4 is just $219/month.

    • bufferoverflow 6 years ago

      Have you thought of installing slave DB servers to spread the reads?

      Have you looked into why your slow queries are slow? Seems like indexes are missing, or your queries are unoptimized.
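      Setting up a streaming replica on Postgres 10 is not much work either; a rough sketch (host, user, and data directory are placeholders, and pg_hba.conf still needs a replication entry):

        -- On the primary (both settings need a restart; wal_level
        -- already defaults to 'replica' on 10):
        ALTER SYSTEM SET wal_level = 'replica';
        ALTER SYSTEM SET max_wal_senders = 5;

        -- On each replica, clone the primary and start in standby mode:
        --   pg_basebackup -h primary-host -U repl_user \
        --       -D /var/lib/pgsql/10/data -R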

    • brudgers 6 years ago

      Curious about your staff's past experience with database migrations.

      • iKSv2 6 years ago

        This makes me uncomfortable, even though we have done some migrations (albeit at a smaller scale).

        Please share pointers, if any.

        • brudgers 6 years ago

          Your comment says most of what is worth considering.

          • iKSv2 6 years ago

            Gotcha

  • hiram112 6 years ago

    > 256GB of DDR3 is less than $600 on eBay...

    Wow - I see PC builders paying $150+ for 16GB. I guess RAM really does go down in price once you start buying in bulk / buying server memory.

    • s_esser 6 years ago

      DDR4 is the kind those PC builders are buying.

hiram112 6 years ago

I'd actually be really interested to hear more opinions, especially with regard to scaling Postgres.

It seems like a few companies have actively been working on the horizontal scaling part of Postgres, along with the RDS version.

Would this be something where he could move his data to AWS RDS, change no code, and have everything 'just work' because AWS handles the scaling of Postgres?

That seems to be their selling point, though I wonder if anyone has actually used it with this much data.

I'm really hesitant to use a NoSQL solution for any project, too, just because most don't have transactions, don't have real SQL APIs, etc.

borplk 6 years ago

Stick to Postgres and investigate how you can improve things.

I guarantee you have not reached "max Postgres capability", far from it.

You will 100% regret changing the database as a solution to your problem and you will buy yourself lots of other problems.

Cypher 6 years ago

bitcoin

  • jxub 6 years ago

    This would be a rather unintelligent choice, as the Bitcoin protocol is really inefficient, being optimized for resisting Sybil attacks and achieving consensus.

  • iKSv2 6 years ago

    You mean some kind of blockchain-type solution?

    • is_true 6 years ago

      With a pinch of AI

      • iKSv2 6 years ago

        Any documents / examples?

        • is_true 6 years ago

          I was just adding buzzwords.

          To be honest, if you haven't tried yet, optimize your hardware/code/schema/indexes. I think I would start with hardware, then schema and queries.

          I solved a problem with a database for a finance product, but the problem wasn't the database; it was abuse of the ORM that made resource usage go nuts. It made sense at the time the software was written because the deadlines were tight, but it should have been rewritten before the system started failing.
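          The classic ORM trap, for what it's worth (table names invented): a query-per-row loop instead of one set-based query.

            -- What the ORM ends up doing: one query per parent row,
            -- i.e. SELECT * FROM invoices WHERE account_id = $1  (x N rows)
            -- What one set-based query does instead:
            SELECT a.id, i.*
            FROM accounts a
            JOIN invoices i ON i.account_id = a.id;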

zzzcpan 6 years ago

There is no in-place replacement.

Most likely your app relies on database performance for a lot of queries that can only be fast on a local database. This makes a serious rewrite necessary if you want to go with a distributed solution, and it requires knowledge and experience in distributed systems, which takes a lot of time. That's why choosing a distributed database from the beginning, instead of MySQL/PostgreSQL, is so important.

Judging by your description, your best bet is probably pre-sharding. It's not a proper distributed solution, but it doesn't sound like you need one. It works like this: you partition your data into N shards, where N is two orders of magnitude higher than the number of servers you plan to have, then split each table into N tables belonging to each shard and use a static mapping of shards to servers in the application. Queries won't be able to cross shards. Some work will be necessary on the app and on the infrastructure tooling to manage servers, shards, replication, and monitoring, as you will need to be able to move shards between servers, restart them, back them up, and so on.
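Roughly, in Postgres terms (names invented; in practice the shard-to-server map lives in app config, shown as a table here just for illustration):

  -- 256 shards for a handful of servers; a stable hash picks the shard:
  SELECT hashtext('user-12345') & 255 AS shard_id;

  -- One physical table per shard:
  CREATE TABLE events_000 (LIKE events INCLUDING ALL);
  -- ... through events_255

  -- Static shard -> server mapping:
  CREATE TABLE shard_map (
      shard_id int  PRIMARY KEY,
      server   text NOT NULL  -- e.g. 'db1.internal'
  );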

  • iKSv2 6 years ago

    Suppose we keep the system as is and rewrite it in parallel (essentially from scratch): which distributed database would best fit our use case? (Fast, distributed, ACID, transactional, no single point of failure, and ideally with some GIS functionality.)