Unlocking data from PDFs can be a hellish assignment. Enter Tabula, a free tool that helps users turn that data back into machine-readable formats.
Note: This is the third post in a series based on meetings/talks from Mozilla Festival 2013 in London.
There is so much interest in data. But there are problems too: PDFs, while good at keeping a document in shape, are like poison for data tables. It is very difficult and time-consuming to extract the data into machine-readable formats again. Some data journalists coined the phrase that “PDFs are the place where data goes to die”.
What is the problem here?
One example is the work done by ProPublica in the US on the “Dollars for Docs” project. Here a team of journalists wanted to get a better understanding of how much money pharmaceutical companies pay to doctors. The big barrier was that all the spending data was delivered to the newsroom in PDFs, so the data had to be extracted creatively. But the team (aided by some considerable technical expertise) succeeded. Plus, they wrote a “making of” article about how they used a range of software to get the data into machine-readable formats again. Two words: Not easy.
A novel, open and free approach to extracting data from PDFs
Enter Tabula. It is a free tool that helps with the task of getting data back into machine-readable formats. Besides being free, Tabula is one example of the open source tactics that a number of players are pursuing to support journalistic work.
While Tabula cannot yet solve every problem at once, it provides an excellent starting point and deserves further development. We have started using it quite often inside the Deutsche Welle innovation team. It is super simple to use: You upload a PDF, mark the data and then get the data on your clipboard, ready to be pasted into a spreadsheet program. There might still be minor problems if the PDF table is very complex, but in most cases this is super fast and easy.
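For readers who would rather continue in code than in a spreadsheet, here is a minimal sketch of that last step. It assumes the copied table arrives as tab-separated text, which is how spreadsheet programs usually exchange cells via the clipboard; this is an assumption, not a documented Tabula format. The function names and the sample rows are made up for illustration.

```python
import csv
import io


def clipboard_text_to_rows(text, delimiter="\t"):
    """Split delimited text into rows, trimming whitespace in every cell."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Blank lines come back as empty lists and are skipped here.
    return [[cell.strip() for cell in row] for row in reader if row]


def rows_to_csv(rows):
    """Serialize the rows as CSV text that any spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample standing in for whatever was copied out of the PDF table.
sample = (
    "Doctor\tCompany\tAmount\n"
    "A. Example\tPharmaCo\t1200\n"
    "B. Example\tMediCorp\t450\n"
)

rows = clipboard_text_to_rows(sample)
print(rows_to_csv(rows))
```

From here the rows can be saved to a file or checked for obvious extraction errors before any analysis starts.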
A few questions: Where did this come from? Why would anybody sit down and organize work on such a helpful tool? And how can it be financed? The answer: One of the creators, Manuel Aristarán, is part of a small but growing community of coding journalists (or journalistic coders). He works for La Nacion in Argentina and is currently a 2013 Knight-Mozilla Open News fellow. This particular program is intended to finance and support projects like Tabula. And, as you can see, it is quite effective in doing so. Over the last two years, a team of dedicated editors and coders at La Nacion has managed to transform it into one of the most interesting and pioneering data-driven newsrooms in Latin America. See their “Data blog” here.
Manuel Aristarán's blog: