For some reason, when clicking this link, I expected an SQL database running inside PDF files. Which, considering that PDFs can embed JavaScript and that Emscripten exists (and that PostScript itself is Turing-complete if you want to go hardcore), may actually be doable.
I'm still waiting for someone to put PDF.js in a PDF so that I can finally read my PDFs while I read my PDFs
I'd prefer if PDFs emulated a virtual machine that ran Windows 3.1 so I can read them in true 1990s style.
You'll have to ask Fabrice Bellard for that
PDF does not implement full PostScript.
I briefly looked at running JavaScript inside PDFs a couple of years ago. I think this is possible under the spec, but no major renderers allow it.
Could be wrong.
PDFs that work in Chrome's PDF renderer:
A calculator:
https://pspdfkit.com/images/blog/2018/how-to-program-a-calcu...
A game of breakout:
https://github.com/osnr/horrifying-pdf-experiments/raw/maste...
The PDF specification is hundreds of pages.
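For anyone who wants to poke at this themselves, a minimal PDF with an /OpenAction JavaScript entry can be hand-rolled to test which viewers actually run it. This is a rough sketch, not a spec-conformant file: it omits the cross-reference table, so only lenient viewers will open it at all, and most renderers ignore the JavaScript action entirely.

```shell
# Hand-roll a minimal PDF whose OpenAction is a JavaScript action.
# The heredoc is quoted ('EOF') so the shell leaves the contents untouched.
cat > js-test.pdf <<'EOF'
%PDF-1.7
1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj
4 0 obj << /S /JavaScript /JS (app.alert\("Hello from inside a PDF"\)) >> endobj
trailer << /Root 1 0 R >>
%%EOF
EOF
```

Opening js-test.pdf in different viewers is a quick way to see which ones tolerate the sloppy structure and which, if any, execute the `app.alert` call.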
Oh wow - I wonder if this was added to Chrome's PDF engine after I looked into it.
Very cool, but a bit concerning at the same time. I have vivid memories of Flash exploits and all the malware that came with them. I know we've moved on since then, but I still don't see any reason to allow dynamic content in a page-layout format.
I wonder if running PouchDB would count... since it's an offline / online database.
I'm wondering about having a PDF whose data could self-update whenever there's an internet connection.
Anyone wanna collaborate on something like this?
I wonder if a PDF like this that showed you how your stocks of choice are doing would be interesting.
What would be the advantage of this over a website/PWA?
That it's more portable? It's more document-like/printable?
Self-contained, no need for internet unless you want it to update, and forward-compatible with future PDF readers.
On Firefox the calculator opens up, but I can't interact with the elements.
The second one is an empty page. I suppose it needs to do some initialization first in order to show the page.
This is amazing.
> I think this is possible under the spec, but no major renderers allow it.
Thank goodness, though I suppose you could take the position that you already have a full programming language in your document so how could this be worse? Though the idea of distributing node with all my documents seems a bit hairy.
If there is one that would, it'd be Adobe Acrobat/Reader, but I don't think even it does.
So the tool extracts text, and then a regex pulls out the relevant fields/values. It seems that pdftotext[1][2] with awk can do the job on your local machine without uploading your docs.
1. brew install pkg-config poppler (on macOS)
2. sudo apt-get install poppler-utils (on Debian/Ubuntu)
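A minimal sketch of that pipeline, once poppler is installed. "Amount Due" is a hypothetical field label here; adjust the pattern to whatever your bill actually says.

```shell
# Convert the PDF to plain text on stdout (-layout preserves the visual
# column layout, which keeps label and value on the same line), then use
# awk to print the last field of any line matching the label.
pdftotext -layout bill.pdf - | awk '/Amount Due/ { print $NF }'
```

For a bill line like `Amount Due   $123.45`, this prints `$123.45` without the document ever leaving your machine.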
Analyzing the text is the problem, not extracting it. Are there any good open-source libs out there?
Sure. But the tool posted here doesn't do that. It merely extracts text, and the "analysis" is a couple of regexes that are tailor-made for that particular PDF. Awk can do that much and a lot more.
If you want to extract tables from a PDF, there's Tabula[1], but it isn't automated to run over the whole PDF - you have to make a manual rectangular selection around the table you want to extract.
1. https://github.com/tabulapdf/tabula
Indeed. Many years ago, I "ran SQL" on a couple decades of Usenet newsgroup data. Extraction and manipulation involved a bunch of grep, sed, tr and awk (and millions of tmp files). But, as with PDFs of utility bills, it was very specific regex.
Hey, Kshitij from Rockset here.
With Rockset you can avoid ETL when extracting and manipulating the data. Also, the main value here is that you can join this data with other data sets in JSON, CSV, XLS or Parquet formats using SQL to help with analysis.
Maybe you could add modules for extracting and manipulating data from popular sources, such as the major social media platforms, Amazon, Craigslist, eBay, and the main search engines.
There are many people who want usable data from such sources. And your service wouldn't be doing any scraping, so you'd probably be OK legally. But IANAL, so do check.
I’ve been impressed with Camelot for PDF tables
https://rockset.com/pricing/ - free tier looks pretty good but I'm not sure I'd be comfortable uploading bills and other documents here.
I'm sure they take 'your privacy and security seriously'
Doubly so after an incident.
but they have [Standard encryption] and [Field masking] on the free tier! surely that means something?
Yeah I was excited by this concept until I realized the link was to a site where I had to give them my PDFs.
I want a simple library I can use to do this locally for security reasons and because it's a rarity when it happens, but a big problem when I do have it.
What do you do if you don't want to upload your stuff to somebody's server?
So it extracts the metadata and converts the PDF to text and puts that automatically in a BigQuery table (with some custom functions).
I was somehow expecting the parser to automatically recognize patterns in the PDF and maybe try to name them (and let you rename them), kind of like what advanced web scrapers do.
If the author is reading, I'd love to see more info on how you trained the system to understand the text content of the PDF. And how regular/structured do the PDFs have to be for it to work?
Not the author, but there is no training here; it's using PDF-to-text conversion (there are various libraries to do that) and then a regex pattern to extract the relevant information.
Since it's regex, only techies would be able to do that, and in that case they can just write their own script instead.
Hey, Kshitij from Rockset here.
You are correct that Rockset is doing text extraction for PDFs, but the main value here is that you can join this data with other data sets in JSON, CSV, XLS or Parquet formats using SQL, without doing any ETL.
This is a confusing feature. Converting PDFs to text is already a trivial technical challenge, so it's not a product differentiator. What is a product differentiator is the converged indexing (index and store all the things) used by Rockset. This PDF feature seems like a distraction.
Also, I uploaded my PG&E bill to Rockset and got an empty result set...maybe I'm using it incorrectly.