Batch search & replace in PDF files

by levien on Tue 19 May 2009 // Posted in misc // under ubuntu technical howto

The other day I found out I had misspelled a word in a whole batch of automatically generated PDF files. Regenerating all of them would be a lot of work, as the PDF files were plots created using perl/PDL, gnuplot and epstopdf (available in texlive-extra-utils), and the input data was scattered over about 20 different machines. Of course I could have hand-edited all the files using Inkscape, but that would also be a lot of work. Instead, I discovered there's an easy way to automatically search & replace text strings in PDF files on my Linux system, using sed and pdftk.

First, make sure you have pdftk installed. On Ubuntu you can simply do: sudo apt-get install pdftk

Then use a shell script to uncompress the PDF files, replace the text, and recompress them. For instance:

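#!/bin/bash
# Usage: replacepdftext.sh OLDTEXT NEWTEXT FILE.pdf
# OLDTEXT is used as a sed regular expression; avoid "/" in either string.

oldtext=$1
newtext=$2
pdffile=$3

# Keep a backup, then work on an uncompressed copy.
cp "$pdffile" "$pdffile.bak"
pdftk "$pdffile" output "$pdffile.tmp" uncompress
sed -i "s/$oldtext/$newtext/g" "$pdffile.tmp"
# Recompress over the original and remove the temporary file.
pdftk "$pdffile.tmp" output "$pdffile" compress
rm -f "$pdffile.tmp"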
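
Because the script passes the search and replacement strings straight to sed, characters such as ".", "/" or "&" are interpreted by sed rather than matched literally. A minimal sketch of two helper functions that escape them first is shown below; the names escape_sed_pattern and escape_sed_replacement are my own and not part of the original script, and the sketch assumes GNU sed and strings without newlines. Note also that the replacement can only succeed if the word is stored as a literal string in the uncompressed PDF, which depends on how the file was generated.

```shell
#!/bin/bash

# Escape characters that are special in a sed basic regular expression,
# plus the "/" delimiter used by the s/// command in the script.
# (Hypothetical helper; assumes the text contains no newlines.)
escape_sed_pattern() {
    printf '%s' "$1" | sed 's|[][\\/.^$*]|\\&|g'
}

# Escape characters that are special in the replacement part of s///:
# backslash, the "/" delimiter, and "&" (which means "the whole match").
escape_sed_replacement() {
    printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}
```

In the script, the sed line could then become: sed -i "s/$(escape_sed_pattern "$oldtext")/$(escape_sed_replacement "$newtext")/g" "$pdffile.tmp"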

You can easily modify this to run on a whole batch of files. Actually, I just made this as a quick hack, and executed the script using something like:

for i in *.pdf ; do replacepdftext.sh oldword newword "$i" ; done

But I'm sure there's a better way to integrate batch processing in the script itself...
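
One possible way to do that, as a rough sketch rather than a polished tool: take the two strings as the first two arguments and treat everything after them as PDF files, so the shell expands *.pdf for you. The function name replace_pdf_text is hypothetical, and the same caveat applies as before (the old text is used as a sed regular expression, so avoid "/" and other special characters, or escape them first).

```shell
#!/bin/bash
# Hypothetical batch version:
#   replacepdftext.sh OLDTEXT NEWTEXT FILE.pdf [FILE.pdf ...]

# Replace OLDTEXT with NEWTEXT in every PDF listed after the first two
# arguments. Each original is kept as FILE.pdf.bak.
replace_pdf_text() {
    if [ "$#" -lt 3 ]; then
        echo "usage: replace_pdf_text OLDTEXT NEWTEXT FILE.pdf [FILE.pdf ...]" >&2
        return 1
    fi
    local oldtext=$1 newtext=$2
    shift 2

    local pdffile
    for pdffile in "$@"; do
        # Stop at the first file that cannot be copied or processed.
        cp "$pdffile" "$pdffile.bak" || return 1
        pdftk "$pdffile" output "$pdffile.tmp" uncompress || return 1
        sed -i "s/$oldtext/$newtext/g" "$pdffile.tmp"
        pdftk "$pdffile.tmp" output "$pdffile" compress || return 1
        rm -f "$pdffile.tmp"
    done
}

# When the script is run with arguments, pass them straight to the function.
if [ "$#" -gt 0 ]; then
    replace_pdf_text "$@"
fi
```

With this in place, the for loop above collapses to a single call: replacepdftext.sh oldword newword *.pdf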