In my previous post, ‘Your Own…Personal…FOIA: Part I’, I discussed the lead-up to a side project I recently started to solve some of the problems we have in intelligence history research. Namely, the problem of making searchable the crucial documentation we rely on, which has been legitimately stockpiled and released under FOIA or natural declassification processing.
What we end up having here is a classic engineering problem. How can we take hundreds of thousands of flat image documents from a single archive and make them searchable? The obvious answer is Optical Character Recognition, or “OCR”.
OCR is a fun little tool that we are blessed to have here in the modern era that can transform visual text into raw text. It’s super handy if you have, oh, I don’t know…a stack of raw PDFs with no searchable layers. The kind you might find in a FOIA database somewhere, for instance. Well, it just so happens that I have a ton of exactly this kind of document.
This isn’t great for researchers though, because if we can’t search for different words and patterns inside the files, the work of holding the government accountable for what it has done in the past becomes much, much harder.
Where most people would start with a piece of software called “Tesseract”, I started by trying to find a program more specialized for PDF files. The great thing (and the trouble) with Tesseract is that it works on so many different filetypes, which unfortunately makes it a bit of a “jack of all trades, master of none”. I ended up finding a program that runs on top of Tesseract and is built specifically for PDFs! The program is called OCRmyPDF, and it fixes a lot of the issues that I had with the basic installation of Tesseract. Namely, OCRmyPDF can add a searchable OCR layer directly to an existing PDF. That makes maintaining filename integrity (which we need to do for this project in case we need to reference the original files) much easier to script than it was with Tesseract on its own.
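To make the filename point concrete, here is a minimal sketch of OCRing a single PDF into a separate directory under its original name. The function name `ocr_one` and the example paths are hypothetical and only there for illustration; the one thing it relies on is that `ocrmypdf` takes an input path followed by an output path.

```shell
#!/bin/sh
# Sketch: OCR one PDF into a separate directory, keeping its original filename.
# ocr_one and the directory names below are hypothetical, not part of OCRmyPDF.
ocr_one() {
    src=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    # ocrmypdf reads the input path and writes a copy with a searchable
    # text layer to the output path.
    ocrmypdf "$src" "$out_dir/$(basename "$src")"
}

# Example usage (placeholder paths):
# ocr_one scans/memo-001.pdf ocr_output
```

Writing to a separate directory like this leaves the untouched original in place, which may be preferable if you are not yet sure you trust the OCR output.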
So we start with a directory full of PDFs that we want to scan through with OCR and we script up a simple loop to do all the work for us:
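```shell
#!/bin/sh
# Add an OCR text layer to every PDF in the current directory, then move
# each finished file out of the way so it is not processed twice.

# Make sure this value is changed to fit your destination!
dest=/root/OCRd

for f in *.pdf; do
    # Skip the literal "*.pdf" left behind when no PDFs match.
    [ -e "$f" ] || continue
    echo "Running OCR on $f"
    if ocrmypdf -v --deskew --clean --clean-final "$f" "$f"; then
        echo "moving ocr'd file"
        mv "$f" "$dest"/
        echo "finished"
    else
        echo "OCR failed on $f, leaving it in place" >&2
    fi
done
```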
The intent of this script is to run through all of the PDF files in the working directory and, in addition to adding the OCR layer, clean up each document by straightening skewed pages and reducing some of the noise so that they’re a bit more presentable. The text recognition itself is done by Tesseract, so you could script this with Tesseract alone if you really wanted to, especially if you were working with scans in other image formats and just wanted a dump of all of the recognizable words. As far as I can tell, though, the cleanup steps come from OCRmyPDF and its helper tools rather than from Tesseract itself, so you would have to handle those separately.
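If you did want to go the Tesseract-only route for image scans, a minimal sketch might look like the following. The function name `dump_text`, the directory arguments, and the choice of `.png` and `.tif` extensions are assumptions for illustration; the sketch relies only on the standard `tesseract IMAGE OUTPUTBASE` invocation, which writes `OUTPUTBASE.txt`.

```shell
#!/bin/sh
# Sketch: dump the recognizable text of every image in a directory to .txt files.
# dump_text and both directory arguments are hypothetical names.
dump_text() {
    img_dir=$1
    txt_dir=$2
    mkdir -p "$txt_dir" || return 1
    for img in "$img_dir"/*.png "$img_dir"/*.tif; do
        # Skip unmatched glob patterns.
        [ -e "$img" ] || continue
        base=$(basename "$img")
        # tesseract appends ".txt" to the output base name on its own.
        tesseract "$img" "$txt_dir/${base%.*}"
    done
}

# Example usage (placeholder paths):
# dump_text scanned_images text_dump
```

This gives you plain text files rather than searchable PDFs, so it suits keyword dumps better than anything where you need to see the original page.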
So this solution works pretty well: it adds the OCR layer to each file and then moves the completed file to a new directory that you specify, so that you don’t end up reprocessing anything or having to sort out what you have and haven’t done. The script is plain POSIX shell as well, so you can port it to any system with a shell and OCRmyPDF installed, making it accessible to even technically disinclined researchers.
This method of OCR works great if you have access to a cluster of machines that can help you work through a large volume of documents, or if you’d just prefer to run OCR locally rather than using a cloud service to do it for you. It is somewhat cost-effective, but there are more efficient and cheaper ways of bulk processing documents, which I will detail as the series goes on.
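Short of a full cluster, one way to get through a large batch faster on a single machine is to run several OCR jobs side by side. The sketch below is one possible approach, not the setup used for this project: `ocr_parallel`, its arguments, and the default of 4 jobs are hypothetical, and it assumes a `find` and `xargs` that support `-print0`, `-0`, and `-P` (the GNU and BSD versions do).

```shell
#!/bin/sh
# Sketch: OCR every PDF in a directory, running several jobs at once.
# ocr_parallel, its arguments, and the default job count are hypothetical.
ocr_parallel() {
    in_dir=$1
    out_dir=$2
    jobs=${3:-4}
    mkdir -p "$out_dir" || return 1
    # NUL-separated names keep filenames with spaces intact.
    find "$in_dir" -maxdepth 1 -type f -name '*.pdf' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c '
            # $1 is the input PDF, $2 is the output directory.
            ocrmypdf --deskew --clean "$1" "$2/$(basename "$1")"
        ' sh {} "$out_dir"
}

# Example usage (placeholder paths, 2 jobs at a time):
# ocr_parallel incoming ocr_output 2
```

OCRmyPDF already spreads the pages of a single file across CPU cores, so the best job count depends on your hardware and on how many pages each document has; treat the number as something to tune.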
In part III of this series, I’ll go into detail on what I hope to accomplish with this project. Hint: it might have something to do with NLP!