Ask HN: How would you build a website that runs long background jobs?

13 points by ryeguy_24 6 years ago

I am building a simple website whereby the user uploads data and the backend kicks off a long process (data computation and analytics). My preliminary approach is to use Python Flask with Celery and RabbitMQ for job execution. Is this a decent approach? Can anyone recommend alternative/better approaches?
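
For reference, a minimal sketch of that stack (the route names, the process_upload task, and the Redis result backend are illustrative choices, not requirements):

    # app.py -- minimal Flask + Celery + RabbitMQ sketch.
    from celery import Celery
    from flask import Flask, jsonify, request

    celery = Celery('tasks',
                    broker='amqp://guest@localhost//',   # RabbitMQ
                    backend='redis://localhost:6379/0')  # result store
    app = Flask(__name__)

    @celery.task
    def process_upload(payload):
        # The long-running computation/analytics lives here.
        return {'rows': len(payload)}

    @app.route('/jobs', methods=['POST'])
    def submit():
        # Enqueue the job and return immediately with its id.
        result = process_upload.delay(request.get_json())
        return jsonify({'job_id': result.id}), 202

    @app.route('/jobs/<job_id>')
    def status(job_id):
        # The client polls here until the job is done.
        result = celery.AsyncResult(job_id)
        return jsonify({'state': result.state,
                        'result': result.result if result.ready() else None})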

atmosx 6 years ago

Your stack is as good as any. Do yourself a favour if you're building a business: don't look at it as an engineer. If you're proficient in Python, use Flask and Celery. If you're proficient in Ruby, use Rails/Sinatra and Sidekiq. These tools are battle-tested and solid; you can build a business around them easily. If you ever hit their limits, then and only then look around for more.

My advice to you is to master these tools instead of learning a new language/stack/tool.

I see people talking about Elixir, Erlang, Phoenix, etc. It doesn't matter; what matters is for you to deliver. Python is an excellent choice and the tools you're talking about are solid... the only thing that matters now is for you to deliver.

nanoscopic 6 years ago

I actually wrote a solution for this exact problem for SUSE Hack Week 2018. I made a small job tracking / state tracking system using Perl and nanomsg. See https://github.com/nanoscopic/galear/blob/master/client/srv/...

Essentially it is just a small server that listens on a nanomsg queue for new tasks. Workers can then be created that periodically ping the server to grab a task off the queue. Optimally a nanomsg queue would also be used to push the tasks out to the workers; I just built it this way since I could implement the entire thing within hours and continue on with my hack week project.

The benefit of using nanomsg over several of the other message queues suggested here is that nanomsg is a brokerless message queue, meaning that there need not be any central queue tracking everything.

It is somewhat ironic, in that sense, that I essentially built a message broker using it, but it demonstrates how simple nanomsg is to use.

While the code there is written in Perl, it could easily be ported to any of the many other languages nanomsg has libraries for.

In summary, to handle long-running background jobs:

1. Create as many "worker" processes as you need to process multiple long-running jobs simultaneously.

2. Have a way to feed tasks into those workers (in my case, a central task tracker).

3. Have a way to feed worker status back to something central so you can tell what is going on.
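
In Python the same shape would look roughly like this, using pynng (a Python binding for nng, nanomsg's successor); the addresses and the toy in-memory task list are just for illustration:

    # tracker.py -- toy task tracker: workers ask, the tracker replies.
    import json
    from pynng import Rep0

    tasks = [{'id': i} for i in range(10)]       # stand-in job queue

    with Rep0(listen='tcp://127.0.0.1:5555') as sock:
        while tasks:
            sock.recv()                          # a worker pings for work
            sock.send(json.dumps(tasks.pop(0)).encode())

    # worker.py -- periodically ping the tracker to grab a task.
    import json
    import time
    from pynng import Req0

    with Req0(dial='tcp://127.0.0.1:5555') as sock:
        while True:
            sock.send(b'ready')
            task = json.loads(sock.recv().decode())
            print('working on', task)            # long-running work here
            time.sleep(1)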

nathan_long 6 years ago

Elixir runs on the Erlang VM, which lets you spin up processes at will. Here's an article with Phoenix tips; skip down to "AVOID TASK.ASYNC IF YOU DON’T PLAN TO TASK.AWAIT" to see how you'd spin off a background job.

https://dockyard.com/blog/2016/05/02/phoenix-tips-and-tricks

  • cutety 6 years ago

    Would second this; Elixir/Erlang is probably the best tool for this kind of job. The Erlang VM along with OTP was built for reliably running a ton of (potentially long-running) processes. And Elixir/Phoenix is especially great for this kind of task if you require a web front end: you can use all the Elixir OTP machinery directly, and with Phoenix's awesome channels (websockets) you can publish job progress and results straight from those processes. The best part is that you get all this without having to set up an intermediary queue/pub-sub (Redis, RabbitMQ) to manage it; it's all built into the Erlang VM.

    The downside is that you have to spend time really learning the actor concurrency model and OTP concepts (things like GenServer and supervision trees) to harness this power. It's not impossible to learn, but it sounds like OP is coming from Python, and while Elixir's syntax is close to Python (or other higher-level scripting languages), working with it is conceptually completely different from working in Python.

    A generalized solution: break the task (one big long-running job) into a bunch of small jobs (or processes) that are kicked off by a main job. This lets them execute quickly and, more importantly, keeps them idempotent; it's easier to restart and debug small jobs with little to no state than big jobs with lots of state.

    Before kicking off the main job, record it starting somewhere (database, redis, GenServer, etc.). Then kick off the main job; if you need to report progress, have the child jobs update it wherever you stored the start info, where it can be retrieved either by polling some endpoint or via pubsub and websockets. When the final job ends, mark it finished and store the results (or a reference to them), so clients can be notified through that same endpoint.

    If you need to keep state across the jobs, use a database or something like redis as the central store for the global job data (which you choose depends on whether you need the speed of a k-v store or the transactions/locking of a database); if you go the Elixir route, this would be a GenServer instead. A sketch of this pattern follows.
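
    Since OP is coming from Python, here's that same pattern sketched with Celery and Redis rather than OTP; this is a minimal illustration, and the key layout (job:<id>) and every name in it are made up rather than any library's convention:

        import redis
        from celery import Celery, group

        celery = Celery('jobs', broker='redis://localhost:6379/0')
        r = redis.Redis()

        def do_work(chunk):
            pass                                     # the real computation for one chunk

        @celery.task
        def small_job(job_id, chunk):
            do_work(chunk)                           # small, idempotent unit
            r.hincrby('job:%s' % job_id, 'done', 1)  # report progress

        @celery.task
        def main_job(job_id, chunks):
            # Record the job starting, then fan out the small jobs.
            r.hset('job:%s' % job_id, mapping={'total': len(chunks), 'done': 0})
            group(small_job.s(job_id, c) for c in chunks).delay()

        # A polling endpoint (or a websocket publisher) then just reads
        # r.hgetall('job:%s' % job_id) to report progress to clients.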

    I also couldn't recommend redis enough; you can use it as a queue as well as a pub-sub, reducing the number of dependencies if you go the Python/Ruby route.

mindcrime 6 years ago

I generally do something similar, where I push a message onto a queue to trigger a long-running job. I mostly work in Java, so I usually use a JMS provider of some sort, like ActiveMQ or HornetQ. Depending on what is supposed to happen on the receive side, I might run the job in a Java thread, or I might use ProcessBuilder to spawn a native process.

bdcravens 6 years ago

Celery (or, in Ruby, Sidekiq) is a pretty simple approach (I'd start with Redis for the ephemeral storage if you want to keep it simple). Capture the request in your database and update it when the job finishes. If you need client-side notification when the job finishes, use something like PubNub.
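
A minimal sketch of that capture-and-update flow, with sqlite3 and Celery standing in (the jobs table, the fake work, and the notification hook are all placeholders):

    # Each process opens its own connection; assumes a table like:
    #   CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, result TEXT)
    import sqlite3
    from celery import Celery

    celery = Celery('jobs', broker='redis://localhost:6379/0')

    def submit(payload):
        # Capture the request as a row before handing off to the worker.
        db = sqlite3.connect('app.db')
        cur = db.execute("INSERT INTO jobs (status) VALUES ('pending')")
        db.commit()
        run_job.delay(cur.lastrowid, payload)
        return cur.lastrowid

    @celery.task
    def run_job(job_id, payload):
        result = str(len(payload))               # stand-in for the real work
        db = sqlite3.connect('app.db')
        db.execute("UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
                   (result, job_id))
        db.commit()
        # Fire the client-side notification (PubNub, websocket) here.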

  • ryeguy_24 6 years ago

    What do you recommend for the actual job execution in a production environment? Is Celery sturdy enough?

    • shoo 6 years ago

      > Is Celery sturdy enough?

      It probably depends on what your exact requirements are, but celery is likely fine. My last project had celery doing batch job processing for a line-of-business enterprise web app. It was fine and flexible enough to do what we needed (thousands of jobs a week, job scheduling, rabbitmq broker & postgres result store, in use for years).

      One thing to be aware of: if you're not running on Windows, celery worker processes are forked, and by default Python's process abstractions (subprocess etc.) will fight you if you try to launch processes from inside a celery worker. This can be worked around, but it's a bit irritating.

      This is probably obvious, but you want to ensure your celery worker processes are properly run as, e.g., services with good monitoring, configured to automatically restart if they crash (due to defects in application-specific task logic, say).
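
      For instance, a bare-bones systemd unit along these lines gives you the auto-restart (all paths and names are placeholders):

        [Unit]
        Description=celery worker for myapp
        After=network.target

        [Service]
        WorkingDirectory=/srv/myapp
        ExecStart=/srv/myapp/venv/bin/celery -A myapp worker --loglevel=INFO
        Restart=always
        RestartSec=10

        [Install]
        WantedBy=multi-user.target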

      Some tips / past discussions:

      https://khashtamov.com/en/celery-best-practices-practical-ap...

      https://denibertovic.com/posts/celery-best-practices/

      https://news.ycombinator.com/item?id=7909201

      Sadly, the main issue I've seen with celery in the past is that it's a popular open source project with no income stream to fund development, so at times whole swathes of reported bugs have been closed with "won't fix; we don't have the resources".

    • bdcravens 6 years ago

      I can't speak to Celery; I've used Sidekiq, which is the Ruby equivalent, and I know they are similar. We run a tremendous workload on Sidekiq (millions of events a week) in production with no issues.

    • cimmanom 6 years ago

      Yes, celery is very commonly used in production for python applications.

87 6 years ago

Sounds good if you're already familiar with RabbitMQ. If not, I'd question how much sense it makes for low-frequency, long-running jobs, given the learning and complexity overhead.

osiutino 6 years ago

It depends, but usually polling the database in a loop isn't that bad. You don't really need a message queue just to kick off something new.
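
For example, a worker that just polls a jobs table (sqlite3 purely for illustration; with multiple workers you'd want an atomic claim instead of the separate SELECT/UPDATE):

    # poller.py -- claim-and-run loop against a jobs table.
    import sqlite3
    import time

    db = sqlite3.connect('app.db', isolation_level=None)  # autocommit
    db.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY,"
               " status TEXT DEFAULT 'pending', payload TEXT)")

    while True:
        row = db.execute("SELECT id, payload FROM jobs"
                         " WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            time.sleep(5)                        # nothing pending; check later
            continue
        job_id, payload = row
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        print('processing', payload)             # the long-running work
        db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))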

whb07 6 years ago

AWS Lambda for me. Just send the data out and fetch it later when it's ready. Run a simple Python function to do your processing on the AWS side.

So far I'm pretty happy with it. It minimizes the need to maintain Redis + workers, etc.
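
The hand-off is just an async Invoke; the function name and payload here are placeholders for whatever you've deployed:

    # Fire-and-forget: Lambda does the processing, we return immediately.
    import json
    import boto3

    lam = boto3.client('lambda')
    lam.invoke(
        FunctionName='process-upload',           # placeholder function name
        InvocationType='Event',                  # async invocation
        Payload=json.dumps({'s3_key': 'uploads/data.csv'}).encode(),
    )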

  • kohanz 6 years ago

    Lambda is not intended for long-running jobs, and it would be costly to use it that way.

    I'm using Lambda to kick off a worker running a Docker image on AWS Fargate. The worker only runs for the duration of the job (which lasts from 5 to 20 minutes).
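
    The Fargate hand-off from the Lambda typically looks something like this (cluster, task definition, subnet, and container names are placeholders, not my exact setup):

        # Inside the Lambda handler: launch a one-off Fargate task that
        # runs the job in its container and stops when it exits.
        import boto3

        ecs = boto3.client('ecs')
        ecs.run_task(
            cluster='jobs-cluster',
            launchType='FARGATE',
            taskDefinition='long-job:1',
            count=1,
            networkConfiguration={'awsvpcConfiguration': {
                'subnets': ['subnet-0123456789abcdef0'],
                'assignPublicIp': 'ENABLED',
            }},
            overrides={'containerOverrides': [{
                'name': 'worker',
                'environment': [{'name': 'JOB_ID', 'value': '42'}],
            }]},
        )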

    • scprodigy 6 years ago

      Hyper.sh is faster, launching your Docker image in 5 seconds.

knowsmorsecode 6 years ago

I use the Quartz scheduler to run background jobs.