Show HN: Webpage to PDF Microservice

93 points by cjimti 6 years ago

krn 6 years ago

A basic command line alternative using Headless Chrome[1]:

  chrome --headless --disable-gpu --print-to-pdf https://www.chromestatus.com/

[1] https://developers.google.com/web/updates/2017/04/headless-c...
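
For anyone scripting this rather than typing it, a minimal Python sketch of the same invocation could look like the one below. The binary name `chrome` is copied from the command above and is an assumption (on many Linux systems it is `google-chrome` or `chromium`); `build_chrome_pdf_command` and `render_pdf` are hypothetical helper names, not part of any library.

```python
import subprocess


def build_chrome_pdf_command(url, output_path, binary="chrome"):
    """Return the argument list for a headless Chrome print-to-PDF run."""
    return [
        binary,  # assumption: adjust to the Chrome/Chromium binary on your system
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_path}",
        url,
    ]


def render_pdf(url, output_path, binary="chrome", timeout=60):
    """Run headless Chrome; raises CalledProcessError on a non-zero exit."""
    cmd = build_chrome_pdf_command(url, output_path, binary)
    # Passing a list (no shell=True) keeps the URL from being parsed by a shell.
    subprocess.run(cmd, check=True, timeout=timeout)


if __name__ == "__main__":
    print(build_chrome_pdf_command("https://www.chromestatus.com/", "out.pdf"))
```

Building the argument list separately from running it makes the command easy to inspect or log before anything is executed.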

rav 6 years ago

Similar functionality is packaged in wkhtmltopdf, which essentially runs WebKit headless to print to PDF.
https://wkhtmltopdf.org/
- forapurpose 6 years ago
  
  The article talks about wkhtmltopdf; in fact, they developed their server in response to its limitations:
  The wkhtmltopdf utility has been around a while and works great when you get it working correctly on your platform. However, the newest version as of this writing, 0.12.5, has a bug preventing TOC generation on some platforms. Some Linux platforms require the installation of Microsoft font packs, and compiling from source leads you down a rabbit hole of dependency hell.
- cjimti 6 years ago
  
  txPDF is a simple containerized web services wrapper around wkhtmltopdf, intended to be used as a Microservice component in a larger system.
  
  ashkulz 6 years ago
  
  wkhtmltopdf maintainer here. That's really cool!
  Did you manage to find a workaround for https://github.com/wkhtmltopdf/packaging/issues/2? If so, would appreciate a PR :-)
  
  cjimti 5 years ago
  
  Thanks, I'll check out that issue and see if there is anything I can contribute. wkhtmltopdf is a great utility and we rely on it heavily.

stevekemp 6 years ago

I've reported many bugs in projects that turn "URL" to "PDF".

You need to be sure you're limiting the kind of URLs that people can submit. For example, ensure that nobody can make a PDF of:

* file:////etc/passwd

* http://169.254.169.254/latest/meta-data/local-hostname

* http://localhost:8080/

I'd say over half of the "PDF-creation" projects posted here have been vulnerable to some/all of those attacks. (I continue to be surprised at how many web-to-pdf services exist. I guess there must be a lot of people paying for them?)
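
A rough sketch of that kind of check in Python, using only the standard library, might look like the following. The helper names (`is_public_ip`, `is_url_allowed`) and the choice to allow only `http`/`https` are assumptions for illustration. It rejects the three examples above, but it is not a complete defense: the renderer resolves the hostname again and may follow redirects, so network-level isolation of the rendering process is still advisable.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: no file://, ftp://, etc.


def is_public_ip(address):
    """True if the address is not loopback, private, link-local, or similar."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_url_allowed(url, resolve=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves solely to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port)
    except (OSError, ValueError):
        # Unresolvable host or malformed port: refuse rather than guess.
        return False
    # Every resolved address must be public; one internal address is enough to refuse.
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
```

The `resolve` parameter exists only so the resolver can be swapped out in tests; in normal use it defaults to `socket.getaddrinfo`.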

cjimti 6 years ago

These are great security suggestions, and I should clarify the intended use. We use txPDF as a backend microservice that is not open to direct public use. It is good for automating report generation from other portions of a larger system.
jarofgreen 6 years ago

Also make sure that people can't use them to mine cryptocurrency. I've seen the owner of one such project blog about how that happened to them.
cjr 6 years ago

I'm the owner/dev of one of those paid services, and yes, competition is fierce, but people do still pay for the convenience of not having to manage it themselves. One look at the issue count of puppeteer/phantomjs/selenium/slimer... tells its own story.

thomasfromcdnjs 6 years ago

Awesome timing. Just started work on a LinkedIn alternative called https://jaresume.com

We need a reliable way of turning people's resumes into PDFs.

Going to give this a go today or tomorrow.

Doing it with https://github.com/GoogleChrome/puppeteer also works quite well

vfulco2 6 years ago

I venture there is huge money in a sweet path from LaTeX resumes in PDF format to MS Word. I want to offer my clients a basic template, but if I choose the LaTeX route, I will inevitably have requests for the latter no matter how lame the format.
ivanche 6 years ago

Such an interesting concept! I just signed up one minute ago so I can't give much feedback yet, but I wish you great success with this!
dvh 6 years ago

Why not simply press Ctrl+p and print to PDF?
- thomasfromcdnjs 6 years ago
  
  We have tried it in the past, just doesn't work reliably with different html configurations.
  Does ctrl+p on this page -> https://jaresume.com/thomas look good for you?
  
  rcfox 6 years ago
  
  Have you tried using a print media stylesheet? You could hide the navigation, reduce the whitespace, maybe shrink the font size a little bit, and remove link text decoration.
  
  thomasfromcdnjs 6 years ago
  
  Great idea. I have used print media sheets in the past, but found them prone to regressions, e.g. elements that are introduced but not hidden. A webpage-to-PDF process is also vulnerable to that, though.
  I think ideally, because the resume renderer is a react component, I'd rather just boot up chromium with the react component and resume data and do a fully clean render of the page into pdf.
  We shall see.
  
  rcfox 6 years ago
  
  For me, when I load the page, I see a resume display and then a split second later it's all replaced with "AN UNEXPECTED ERROR HAS OCCURRED."
  Is the error so critical that it must hide your content? Did it accidentally include your AWS keys or something?
  
  thomasfromcdnjs 6 years ago
  
  Sorry about that. I had just introduced a bug for anonymous users. Should work now.
NetOpWibby 6 years ago

Seems like a neat project.
- thomasfromcdnjs 6 years ago
  
  Thanks! Feedback always welcomed.

ernsheong 6 years ago

You can achieve this using just the browser.

In Chrome Dev Tools, click on the devices button (the icon with the phone and tablet). Using the top-right menu, select "Capture full size screenshot".

Voilà, you now have a full size screenshot that you can convert into PDF.

Incidentally, I am author of https://www.pagedash.com, which is a personal web scrapbook which allows you to capture the current page as HTML and generate links to share with others.

superasn 6 years ago

I only tried it with this page, but it didn't work for me. I got a 110 KB PNG file that is valid but completely blank. Maybe it's buggy.

ZeKZ 6 years ago

I find wkhtmltopdf very difficult to work with; for instance, the official documentation is just a man page [1].

I discovered the WeasyPrint project [2] a few months ago. I find it easier to use, and very powerful when using Python. You can define a custom loader to inject images or styles generated on the fly, for instance.

There are still some missing features compared to wkhtmltopdf, such as defining a custom footer and header, but it's a very promising project.

[1] https://wkhtmltopdf.org/usage/wkhtmltopdf.txt

[2] https://github.com/Kozea/WeasyPrint/

jimnotgym 6 years ago

Since you mention Python, I have found pdfkit[1] to be a pretty good wrapper for wkhtmltopdf. I have a document generation engine that uses it dozens of times a day. Worst part is that wkhtmltopdf in the Ubuntu repos is still compiled (when last checked) without some patch that allows it to run headlessly. I built from source, which was not too difficult.
[1]https://pypi.org/project/pdfkit/
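
For builds that still need an X server, a common workaround is to wrap the call in `xvfb-run`, assuming that tool is installed. The sketch below calls the `wkhtmltopdf` binary directly through `subprocess` rather than going through pdfkit; `build_wkhtmltopdf_command` and `html_to_pdf` are hypothetical helper names.

```python
import subprocess


def build_wkhtmltopdf_command(source, output_path, use_xvfb=False):
    """Return the argument list for converting a URL or HTML file to PDF."""
    cmd = ["wkhtmltopdf", "--quiet", source, output_path]
    if use_xvfb:
        # Assumption: xvfb-run is installed; -a picks a free display number.
        cmd = ["xvfb-run", "-a"] + cmd
    return cmd


def html_to_pdf(source, output_path, use_xvfb=False, timeout=120):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    cmd = build_wkhtmltopdf_command(source, output_path, use_xvfb)
    subprocess.run(cmd, check=True, timeout=timeout)
```

With a build that runs headlessly on its own, `use_xvfb` can stay `False` and the wrapper adds nothing to the command.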

Globz 6 years ago

One of my applications at work takes a user order sheet created through the main app workflow, transposes it to an HTML document, converts that to a PDF with wkhtmltopdf, and dispatches it via email, etc.

I found this setup to be really stable and easy to maintain, so far it has produced around 70k orders per year and has been running for over 4 years now without any hiccups.

Before that I was using phantomjs, but it wasn’t as fast or reliable, for reasons that I can’t quite remember now since I haven’t touched that part of the app in a long time.

All I remember is that wkhtmltopdf was easier to tweak and compose with.

btown 6 years ago

https://prerender.com/ is a great service (fully MIT-licensed at https://github.com/prerender/prerender ) for this type of thing, both for rendering internal pages and for scraping/rendering external sites that rely heavily on client-side code.

liftbigweights 6 years ago

You can also use the PDF printers available in Linux distros and even Windows now.

bramd 6 years ago

I'm still looking for a service like this, but that creates a nicely tagged PDF and conveys the HTML structure in the PDF tags.

Tagged PDFs are a requirement in many processes for accessibility or archival reasons.

gildas 6 years ago

Why not use HTML instead of PDF? I'm the author of an extension that faithfully saves a web page into an HTML file [1]. From my point of view, that should be the best solution for archiving web pages in a file. Votes on HN disagree with me though [2]; I wish I could understand why.
[1] https://github.com/gildas-lormeau/SingleFile
[2] https://news.ycombinator.com/item?id=18243721
- Ibethewalrus 6 years ago
  
  I read recently that PDF is the de facto standard for government.

jotto 6 years ago

Alternatively, if you want a SaaS REST API:

   curl https://service.prerender.cloud/screenshot/https://google.com/ > out.jpg

   curl https://service.prerender.cloud/pdf/https://google.com/ > out.pdf

   curl https://service.prerender.cloud/https://google.com/ > out.html

https://www.prerender.cloud/
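
The same three calls could be sketched in Python with only the standard library. The URL layout is taken from the curl examples above; `build_request_url` and `fetch` are hypothetical helper names, and whether the service needs an API token or rate-limits anonymous requests is not covered here.

```python
import urllib.request

BASE_URL = "https://service.prerender.cloud"

# Path prefix per output type, mirroring the curl examples above.
PREFIXES = {"html": "", "pdf": "pdf/", "screenshot": "screenshot/"}


def build_request_url(target_url, kind="pdf"):
    """Return the service URL for rendering target_url as the given kind."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{BASE_URL}/{PREFIXES[kind]}{target_url}"


def fetch(target_url, kind, output_path, timeout=60):
    """Download the rendered result and write it to output_path."""
    url = build_request_url(target_url, kind)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    with open(output_path, "wb") as out:
        out.write(data)


if __name__ == "__main__":
    print(build_request_url("https://google.com/", "pdf"))
```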

fastball 6 years ago

Why not just

> Print

> Open as PDF

supermatt 6 years ago

To save your microservice having to run a graphical environment and simulate mouse interaction?