For some reason, when clicking this link, I expected an SQL database running inside PDF files. Which, considering that PDFs can embed JavaScript and that Emscripten exists (and that PostScript itself is Turing-complete if you want to go hardcore), may actually be doable.
I'm still waiting for someone to put PDF.js in a PDF so that I can finally read my PDFs while I read my PDFs
I'd prefer if PDFs emulated a virtual machine that ran Windows 3.1 so I can read them in true 1990s style.
You'll have to ask Fabrice Bellard for that
PDF does not implement full PostScript.
I briefly looked at running JavaScript inside PDFs a couple of years ago. I think this is possible under the spec, but no major renderers allow it.
Could be wrong.
PDFs that work in Chrome's PDF renderer:
A calculator:
https://pspdfkit.com/images/blog/2018/how-to-program-a-calcu...
A game of breakout:
https://github.com/osnr/horrifying-pdf-experiments/raw/maste...
The PDF specification is hundreds of pages.
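For anyone who wants to poke at this themselves, a minimal PDF with an /OpenAction JavaScript entry can be hand-rolled to test which viewers actually run it. This is a rough sketch, not a spec-conformant file: it omits the cross-reference table, so only lenient viewers will open it at all, and most renderers ignore the JavaScript action entirely.

```shell
# Hand-roll a minimal PDF whose OpenAction is a JavaScript action.
# The heredoc is quoted ('EOF') so the shell leaves the contents untouched.
cat > js-test.pdf <<'EOF'
%PDF-1.7
1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj
4 0 obj << /S /JavaScript /JS (app.alert\("Hello from inside a PDF"\)) >> endobj
trailer << /Root 1 0 R >>
%%EOF
EOF
```

Opening js-test.pdf in different viewers is a quick way to see which ones tolerate the sloppy structure and which, if any, execute the `app.alert` call.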
Oh wow - I wonder if this was added to Chrome's PDF engine after I looked into it.
Very cool, but a bit concerning at the same time. I have vivid memories of Flash exploits and all the malware that came with them. I know we've moved on since then, but I still don't see any reason to allow dynamic content in a page-layout format.
I wonder if running PouchDB would count... since it's an offline / online database.
I'm wondering about having a PDF whose data could self-update whenever there's an internet connection.
Anyone wanna collaborate on something like this?
I wonder if a PDF like this that showed you how your stocks of choice are doing would be interesting.
What would be the advantage of this over a website/PWA?
That it's more portable? It's more document-like/printable?
Self-contained, no need for internet unless you want it to update, and forward-compatible with future PDF readers.
On Firefox the calculator opens up, but I can't interact with the elements.
The second one is an empty page. I suppose it needs to do some initialization first in order to show the page.
This is amazing.
> I think this is possible under the spec, but no major renderers allow it.
Thank goodness, though I suppose you could take the position that you already have a full programming language in your document so how could this be worse? Though the idea of distributing node with all my documents seems a bit hairy.
If there is one that would, it'd be Adobe Acrobat/Reader, but I don't think even it does.
So the tool extracts text, and then a regex pulls out the relevant fields/values. It seems that pdftotext[1][2] with awk can do the job on your local machine without uploading your docs.
1. brew install pkg-config poppler (on macOS)
2. sudo apt-get install poppler-utils (on Debian/Ubuntu)
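A minimal sketch of that pipeline, once poppler is installed. "Amount Due" is a hypothetical field label here; adjust the pattern to whatever your bill actually says.

```shell
# Convert the PDF to plain text on stdout (-layout preserves the visual
# column layout, which keeps label and value on the same line), then use
# awk to print the last field of any line matching the label.
pdftotext -layout bill.pdf - | awk '/Amount Due/ { print $NF }'
```

For a bill line like `Amount Due   $123.45`, this prints `$123.45` without the document ever leaving your machine.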
Analyzing the text is the problem, not extracting it. Are there any good open-source libs out there?
Sure. But the tool posted here doesn't do that. It merely extracts text, and the "analysis" is a couple of regexes that are tailor-made for that particular PDF. Awk can do that much and a lot more.
If you want to extract tables from a PDF, there's Tabula[1], but it isn't automated to run over the whole PDF - you have to make a manual rectangular selection around the table you want to extract.
1. https://github.com/tabulapdf/tabula
Indeed. Many years ago, I "ran SQL" on a couple decades of Usenet newsgroup data. Extraction and manipulation involved a bunch of grep, sed, tr and awk (and millions of tmp files). But, as with PDFs of utility bills, it was very specific regex.
Hey, Kshitij from Rockset here.
With Rockset you can avoid ETL when extracting and manipulating the data. Also, the main value here is that you can join this data with other data sets in JSON, CSV, XLS or Parquet formats using SQL to help with analysis.
Maybe you could add modules for extracting and manipulating data from popular sources, such as the major social media platforms, Amazon, Craigslist, eBay, and the main search engines.
There are many people who want usable data from such sources. And your service wouldn't be doing any scraping, so you'd probably be OK legally. But IANAL, so do check.
I’ve been impressed with Camelot for PDF tables
https://rockset.com/pricing/ - free tier looks pretty good but I'm not sure I'd be comfortable uploading bills and other documents here.
I'm sure they take 'your privacy and security seriously'
Doubly so after an incident.
but they have [Standard encryption] and [Field masking] on the free tier! surely that means something?
Yeah I was excited by this concept until I realized the link was to a site where I had to give them my PDFs.
I want a simple library I can use to do this locally for security reasons and because it's a rarity when it happens, but a big problem when I do have it.
What do you do if you don't want to upload your stuff to somebody's server?
So it extracts the metadata and converts the PDF to text and puts that automatically in a BigQuery table (with some custom functions).
I was somehow expecting the parser to automatically recognize patterns in the PDF and maybe try to name them (and let you rename them), kind of like what advanced web scrapers do.
If the author is reading, I'd love to see more info on how you trained the system to understand the text content of the PDF. And how regular/structured do the PDFs have to be for it to work?
Not the author, but there is no training here; it's using PDF-to-text conversion (there are various libraries to do that) and then a regex pattern to extract the relevant information.
Since it's regex, only techies would be able to do that, and in that case they can just write their own script instead.
Hey, Kshitij from Rockset here.
You are correct that Rockset is doing text extraction for PDFs, but the main value here is that you can join this data with other data sets in JSON, CSV, XLS or Parquet formats using SQL, without doing any ETL.
This is a confusing feature. Converting PDFs to text is already a trivial technical challenge, so it's not a product differentiator. What is a product differentiator is the converged indexing (index and store all the things) used by Rockset. This PDF feature seems like a distraction.
Also, I uploaded my PG&E bill to Rockset and got an empty result set...maybe I'm using it incorrectly.