Recently I tried to open several old MS Word files created on a Macintosh in OpenOffice.org on my Ubuntu machine. The text part of the documents got converted just fine by OpenOffice.org Writer, but the images became rather messed up. Anything that had been imported as a bitmap in the original files just turned out as an empty black or white square. Now the problem was that I needed some of those bitmap images that were in the documents. So I tried opening the files in MS Word 2003 at work, and then saving them again as Word for Windows documents. This fixed most of the conversion problems OpenOffice.org had with the MS Draw vector-graphics in the documents. But MS Word on the PC also couldn’t handle the bitmap images in the Mac documents, and replaced them with the following text:
QuickTime and a TIFF (LZW) decompressor are needed to see this picture.
A quick Google search led me to the following site, which explains the problem and lists several workarounds: http://blog.pclark.net/2004/12/quicktime-and-tiff-lzw-decompressor.html
Apparently MS Office for Mac doesn’t handle pasted bitmap graphics correctly: it embeds them as Macintosh-format TIFF images, which neither OpenOffice.org nor the Windows version of Office can handle.
You can correct the problem manually on a Mac (which I don’t have at home) by editing the image or re-pasting it as PDF. Or you can use the HTML export function in MS Office under Windows (which I also don’t have at home) to extract the images as pcz-files, which you then need to un-gzip by hand and convert with IrfanView.
Both methods assume you have either a Mac or Windows, and own a copy of MS Office. Luckily, after some experimenting I also figured out a way to get the images on my Ubuntu Linux machine using a little shell-magic.
There is a handy tool called wvWare, which can read and convert Word documents. On Ubuntu it lives in the wv package. Install it if you don’t have it yet:
sudo apt-get install wv
By default it will convert the document text to HTML on stdout, and dump the embedded image-data to files. We’re only interested in the pictures so we run it but discard the HTML-output:
wvWare test.doc >/dev/null
This should create a set of files called test0.pict, test1.pict etc. Now check which files are TIFF images:
grep TIFF *.pict
This works because each file starts with some kind of header that contains the string “TIFF”. The header may be an old-style Mac OS resource fork or something Word-specific. Unfortunately it also means that most programs won’t recognise the files as TIFF, so you need to strip the header. I used a hex editor (the one in Midnight Commander) to find out where the actual TIFF data starts: look for the two bytes “II” (Intel byte order) or “MM” (Motorola byte order). In my case the TIFF started at an offset of 283 bytes in most files. Note that tail --bytes=+N starts its output at the Nth byte of the file, counting from 1, so 283 here means the 283rd byte. I extracted the TIFF file using:
tail --bytes=+283 test0.pict >test0.tif
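Before stripping anything, it can help to confirm that a guessed offset really points at a TIFF signature. The following is only a sketch and not part of my original workflow. The function name has_tiff_magic is made up for illustration, and it assumes that a TIFF file always begins with “II” followed by the bytes 2A 00, or “MM” followed by 00 2A. N is counted from 1, the same way tail counts it.

```shell
#!/bin/bash
# has_tiff_magic FILE N  (hypothetical helper, not part of wv or coreutils)
# Succeeds if a TIFF signature starts at the Nth byte of FILE.
# N is 1-based, matching "tail --bytes=+N".
has_tiff_magic() {
    local sig
    # Take the 4 bytes starting at byte N and print them as plain hex digits.
    sig=$(tail -c "+$2" "$1" | head -c 4 | od -An -v -tx1 | tr -d ' \n')
    case "$sig" in
        49492a00|4d4d002a) return 0 ;;  # "II*\0" (Intel) or "MM\0*" (Motorola)
        *) return 1 ;;
    esac
}
```

For example, running has_tiff_magic test0.pict 283 and then has_tiff_magic test0.pict 295 should tell you which of the two offsets applies to that file.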
That’s all. You should now be able to open the TIFF files in GIMP. Alternatively, you can install ImageMagick and convert the files directly to PNG for use in OpenOffice.org or LaTeX:
convert test0.tif test0.png
You can quickly extract all TIFF-images from a given MS Word document using a simple shell-script. Something like:
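#!/bin/bash
# Extract the embedded TIFF images from the Word document given as $1.
wvWare "$1" &>/dev/null
# grep -l prints only the names of the .pict files that contain "TIFF"
for i in $(grep -l TIFF *.pict)
do
    # strip the header (TIFF assumed to start at byte 283), then convert to PNG
    tail --bytes=+283 "$i" >"$i.tif"
    convert "$i.tif" "$i.png"
done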
This is a bit basic but works for me in most cases. Unfortunately the 283-byte offset does not seem to be general so your mileage may vary. For instance I encountered one picture which had an offset of 295 bytes. This should however be easy to detect automatically. I don’t need it anymore, but does anyone feel like writing a slightly more advanced script to handle image extraction automagically? :-)
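As a starting point for such a script, here is a rough sketch of how the offset could be detected automatically. It is untested against real Word documents. The function names find_tiff_offset and extract_tiff are made up, and two things are assumptions on my part: that the TIFF signature (“II” plus the bytes 2A 00, or “MM” plus 00 2A) sits within the first 1024 bytes of the file, and that the first match is the real start of the image rather than a chance occurrence inside the header.

```shell
#!/bin/bash
# find_tiff_offset FILE  (hypothetical helper)
# Prints the 0-based offset of the first TIFF signature found in the
# first 1024 bytes of FILE; fails if there is none.
# The 1024-byte search window is an assumption.
find_tiff_offset() {
    local hex i
    # Dump the start of the file as one long string of hex digits.
    hex=$(od -An -v -tx1 -N 1024 "$1" | tr -d ' \n')
    # Step through it one byte (two hex digits) at a time.
    for ((i = 0; i + 8 <= ${#hex}; i += 2)); do
        case "${hex:i:8}" in
            49492a00|4d4d002a) echo $((i / 2)); return 0 ;;
        esac
    done
    return 1
}

# extract_tiff FILE  (hypothetical helper)
# Writes FILE.tif containing FILE without its header.
extract_tiff() {
    local offset
    offset=$(find_tiff_offset "$1") || return 1
    # tail counts bytes from 1, so add 1 to the 0-based offset.
    tail -c "+$((offset + 1))" "$1" >"$1.tif"
}

# Possible usage after running wvWare on a document:
#   for f in *.pict; do extract_tiff "$f" && convert "$f.tif" "$f.png"; done
```

Files without a TIFF signature are simply skipped, so the separate grep step is no longer needed.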