75 points by alixaxel 2 months ago
I run a ton of Puppeteer jobs (300k in the last month), currently on EC2 and Digital Ocean VMs, mostly due to the subtle difficulties of running Puppeteer on Lambda.
Will certainly have a look at this project and contribute where possible.
My main concerns are not so much cold start time, as for my use case this is not really a huge issue, but mainly the performance of Chrome on AWS Lambda boxes. The rendering, navigation etc. needs to be snappy.
Google App Engine and Google Cloud Functions got native support for Puppeteer a few months ago as well. Let me know what you think if you try it out.
(I work for Google Cloud)
The performance of Puppeteer is super bad on GCF (you can read more about it here: https://github.com/GoogleChrome/puppeteer/issues/3120). It would actually be great to have someone really improve this situation instead of dismissing it as a weird IO problem.
Did some research internally, this is being tracked but still no root cause AFAIK :(
Would love to use GCF but the performance is terrible (as mentioned) and I need more geographic locations than GCF offers.
I also run hundreds of thousands of puppeteer sessions every month, all on Lambda and so far I'm pretty happy with it, from scalability itself to session performance.
Granted, there are some issues with rendering (fonts, emojis and whatnot) but meanwhile there are solutions available that could be explored.
Feel free to try it out and share your specific challenges on GitHub, I'll do my best to come up with solutions for them.
Out of curiosity, what is it that you do that demands so many sessions? Just web scraping?
What is the advantage of running automated tests on Lambda?
Typically, automated tests are long-running processes, so Lambda execution may not be suitable.
Lambda's cold start time is another challenge.
What is a good practice/model for running suites? One test per Lambda, or one spec per Lambda? I'm still inclined to have EC2 instances created and destroyed via DevOps tools like Terraform to run automation.
Well, puppeteer can be used for more than test suites (think screenshots, PDF rendering, proxified APIs, ...).
But for running long automated tests, I'd probably look into alternatives like Fargate, where the billing model is per-second with a one minute minimum. Terraform + EC2 spot instances works too, obviously. :)
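To make the billing point concrete, here is a minimal sketch of per-second billing with a one-minute minimum, as described above. The function name `billableSeconds` is made up for illustration, and no actual prices are modeled.

```typescript
// Billable seconds under a per-second model with a minimum charge.
// The 60-second default mirrors the "one minute minimum" mentioned above.
function billableSeconds(runSeconds: number, minimumSeconds: number = 60): number {
  if (runSeconds < 0) {
    throw new RangeError("runSeconds must be non-negative");
  }
  // Partial seconds are rounded up, then the minimum is applied.
  return Math.max(minimumSeconds, Math.ceil(runSeconds));
}

console.log(billableSeconds(12));   // 60 (the minimum applies)
console.log(billableSeconds(90.2)); // 91
console.log(billableSeconds(3600)); // 3600
```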
Thanks. The example you’ve mentioned will be useful.
There are quite a few disadvantages to using AWS Lambda for automated testing (test time is capped at 15 min, cold start waiting time, ...).
The advantage is that you can run a lot of tests concurrently at a relatively cheap cost.
The company I work for offers VMs that are created/destroyed automatically after each test. There's no cold start, and no time limit. Plus you can choose to run headless like Puppeteer or test in an actual OS like Win/Mac.
My SaaS uses Puppeteer for website monitoring. Shameless plug: https://checklyhq.com/product/transaction-monitoring/
It's been quite difficult to get it right, with Puppeteer being young etc. but it chugs along nicely now at around 10k-15k runs per day spread over 4 regions.
Lambda has a maximum running time of 15 min. Even with a cold start, it won't take more than a minute for Chrome to be up and running.
Meaning you’ll get at least 10 to 14 minutes of Headless testing.
As for recommendations on how to do so: unless your testing is super long, you should just run the entire thing in one function. Otherwise, decouple your testing based on the various modules of your app (i.e. one module per function).
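A rough sketch of the "split by module" idea, assuming you have a duration estimate per module. The `TestModule` shape, the `planInvocations` name, the module names and the 14-minute budget are all assumptions for illustration; it uses a simple greedy packing, not anything Lambda-specific.

```typescript
// Hypothetical description of one test module and how long it usually runs.
interface TestModule {
  name: string;
  estimatedSeconds: number;
}

// Greedy "first fit, longest first" packing: each inner array is the set of
// modules that one function invocation would run within the time budget.
function planInvocations(modules: TestModule[], budgetSeconds: number): TestModule[][] {
  const batches: { total: number; items: TestModule[] }[] = [];
  const sorted = [...modules].sort((a, b) => b.estimatedSeconds - a.estimatedSeconds);

  for (const mod of sorted) {
    if (mod.estimatedSeconds > budgetSeconds) {
      // A module longer than the budget cannot run in a single invocation.
      throw new RangeError(mod.name + " does not fit in a single invocation");
    }
    let placed = false;
    for (const batch of batches) {
      if (batch.total + mod.estimatedSeconds <= budgetSeconds) {
        batch.items.push(mod);
        batch.total += mod.estimatedSeconds;
        placed = true;
        break;
      }
    }
    if (!placed) {
      batches.push({ total: mod.estimatedSeconds, items: [mod] });
    }
  }
  return batches.map((batch) => batch.items);
}

// Assumed budget: 15 min limit minus about a minute for a cold start.
const plan = planInvocations(
  [
    { name: "checkout", estimatedSeconds: 500 },
    { name: "search", estimatedSeconds: 300 },
    { name: "login", estimatedSeconds: 200 },
    { name: "profile", estimatedSeconds: 100 },
  ],
  14 * 60
);
console.log(plan.map((batch) => batch.map((mod) => mod.name)));
// [ [ 'checkout', 'search' ], [ 'login', 'profile' ] ]
```

A module that exceeds the budget on its own raises an error here, which is the case where something without a hard time limit is the better fit.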
I understand the 15 min limit and the <1 min cold start. The question was more about test suites with hundreds of tests for really large products, which you cannot break up: the scenarios won't make sense. The feedback from these tests will not come back within 15 minutes.
I think it is more cost-efficient and scalable than having an EC2 instance created and destroyed every time you run a test.
Here's another alternative: a Lambda layer containing headless Chrome, with a Puppeteer example: https://github.com/RafalWilinski/serverless-puppeteer-layers
This is fantastic.
- I'm just getting started with Lambda so pardon if this is ignorant, but what's the cold start time of Chromium? Or can you warm start it somehow?
- Since scraping often depends on state, wouldn't you hit a timeout doing longer scraping jobs?
So usually with Lambda, you want your jobs to be as atomic/quick as possible, as Lambda is stateless and has a maximum duration of 15 minutes.
As for the warm-up times, the decompression of Chromium with Brotli takes about 700ms on a 1.5GB Lambda (this is faster than Gzip/Zip). Launching Chromium itself and opening a new tab takes another 400ms or so. If you keep your Lambdas warm (by registering a scheduled CloudWatch event every 15 minutes, for instance), your startup time will effectively be those 400ms.
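A minimal sketch of why a warm container only pays the launch cost: module-level state survives between invocations, so the decompression step can be cached and done once. The `once` helper, the `extractChromium` placeholder and the `/tmp/chromium` path are illustrative assumptions, not the project's actual code; the timings are just the figures quoted above.

```typescript
// Runs fn the first time and returns the cached result on later calls.
// If fn throws, nothing is cached and the next call retries.
function once<T>(fn: () => T): () => T {
  let called = false;
  let value: T | undefined;
  return () => {
    if (!called) {
      value = fn();
      called = true;
    }
    return value as T;
  };
}

let extractions = 0;

// Placeholder for the expensive step. Real code would Brotli-decompress
// Chromium here (likely returning a promise); the path is hypothetical.
const extractChromium = once(() => {
  extractions += 1;
  return "/tmp/chromium";
});

// Rough startup estimate from the numbers in the comment above.
function estimatedStartupMs(isWarm: boolean): number {
  const decompressMs = 700;   // paid only on a cold start
  const launchAndTabMs = 400; // paid on every invocation
  return isWarm ? launchAndTabMs : decompressMs + launchAndTabMs;
}

extractChromium(); // first (cold) invocation does the work
extractChromium(); // later (warm) invocations reuse the result
console.log(extractions);               // 1
console.log(estimatedStartupMs(false)); // 1100
console.log(estimatedStartupMs(true));  // 400
```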
If you keep your Lambda warm, shouldn't you just use something like browserless (https://github.com/joelgriffith/browserless)?
I run browserless Docker container on-prem and it works very well for us. Fire&forget, +1.
Presumably you can also keep Chromium running and keep a pool of tabs to reuse. Doing this, it would be pretty fast, I imagine.
Depending on your use case you can also disable security and open as many iframes as you want in a single tab. Not sure how this compares to multiple tabs though.
Of course you'll run into cold start again when lambda has to scale.
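The tab pool idea could look roughly like this. It is a sketch under assumptions: `TabLike` stands in for a real browser tab (for example a Puppeteer page), the tabs are assumed to be opened up front, and `TabPool` is a made-up name rather than an existing API.

```typescript
// Hypothetical stand-in for a real browser tab.
interface TabLike {
  id: number;
}

class TabPool<T> {
  private readonly idle: T[];
  private readonly busy: T[] = [];

  constructor(tabs: T[]) {
    this.idle = [...tabs];
  }

  // Returns an idle tab, or undefined when every tab is in use.
  acquire(): T | undefined {
    const tab = this.idle.pop();
    if (tab !== undefined) {
      this.busy.push(tab);
    }
    return tab;
  }

  // Hands a tab back to the pool; tabs the pool did not lend out are ignored.
  release(tab: T): void {
    const index = this.busy.indexOf(tab);
    if (index !== -1) {
      this.busy.splice(index, 1);
      this.idle.push(tab);
    }
  }

  available(): number {
    return this.idle.length;
  }
}

const tabs: TabLike[] = [{ id: 1 }, { id: 2 }];
const pool = new TabPool<TabLike>(tabs);

const first = pool.acquire();
const second = pool.acquire();
const third = pool.acquire();    // pool exhausted
console.log(third);              // undefined
if (first !== undefined) {
  pool.release(first);           // tab becomes reusable again
}
console.log(pool.available());   // 1
```

Note that reused tabs keep whatever cookies and storage the previous job left behind, so this only fits jobs that don't need a clean session.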
Yes, this is a good solution but only if you don't have any sort of session data that you want flushed out after it runs. One could argue you could use browser contexts (incognito tabs) to have ephemeral sessions, but unfortunately that feature doesn't work in --single-process mode (which AWS Lambda requires).
I'm using incognito mode to parse some pages that, for some reason, I can't parse using the normal context.
I have been considering moving my pool of chromium workers to lambda functions so we can avoid api slowdowns due to a high number of parsings at the same time.
Are there any other side effects of running chromium headless in a lambda function?