Converting PDF to HTML for PocketPc
The up-to-date textbooks for courses offered by CSI are only offered in PDF format. I frequently need to convert PDF files to HTML so I can read them on the screen without my eyes going squirrelly. It is generally more comfortable to sit on the couch and read with my PDA than at the PC.
Here is a script to convert PDF files into HTML suitable for viewing in Firefox or with the PocketIe browser on PDAs running Windows Mobile. The HTML produced views nicely with Firefox on the Mac and PC, and will probably render nicely on Palm PDAs. The HTML output can also be used (as long as you don’t add a “-asxhtml” option to the tidy command) with MobiPocket Publisher 3.0 to create a Microsoft Reader (.lit) file. I have not been having much luck reading .lit files since I upgraded my Dell Axim X50 to Windows Mobile 5, so for now I am sticking with HTML.
#!/usr/bin/env sh
# this program converts the arguments, a list of pdf files, to html.
# it uses tidy to ensure the html is conforming and sed to remove the background
# colour (which otherwise makes the document unreadable on a PocketPC) and
# to remove &nbsp; characters, which are rather wasteful of space and interrupt
# the flow.
#
for file in $*
do
    pdftohtml -c -noframes $file
    htmlfile=${file/.pdf/.html}
    tempfile=$(mktemp)
    mv $htmlfile $tempfile
    tidy -ashtml -utf8 -clean $tempfile | sed -e 's;background-color:.*;;g' -e 's#&nbsp;# #g' > $htmlfile
    rm $tempfile
done
I have run this on a Mac; it would probably work on Cygwin as well.
pdftohtml for some reason creates a grey background for documents, which looks horrible on my PDA, so the sed script above removes the grey background from the HTML produced by pdftohtml.
The tidy step removes the inline formatting, to decrease the size of the HTML. The sed script also replaces the &nbsp; entities with plain old whitespace, to allow the text to flow more and decrease the file size.
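To see what the two sed expressions do in isolation, here is a minimal sketch that runs them on a couple of sample lines. The helper name clean_html and the sample input are made up for illustration; the expressions are the ones used in the script, with the second one matching the literal &nbsp; entity.

```shell
#!/bin/sh
# clean_html is a hypothetical helper name, not part of the original script.
# It reads HTML on stdin and applies the same two sed expressions:
#   1. delete everything from "background-color:" to the end of the line
#   2. replace each &nbsp; entity with an ordinary space
clean_html() {
    sed -e 's;background-color:.*;;g' -e 's#&nbsp;# #g'
}

# Sample input only; real pdftohtml/tidy output will look different.
printf '%s\n' 'one&nbsp;two&nbsp;three' | clean_html   # prints: one two three
printf '%s\n' 'body {' '  background-color: #A0A0A0;' '}' | clean_html
```

Note that .* runs to the end of the line, so anything that follows the background-color declaration on the same line is dropped along with it.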
If this looks like gobbledygook to you, you have probably underestimated the skill one seems to need to be able to read PDF documents in a reasonable fashion on a PocketPC. A computer science diploma or degree helps. It would be nicer if you could right-click on a PDF document, select a “make available on a PDA” option, and magically have a legible document available on your PDA, but it doesn’t work like that. PDFs are often hard to read on a PC, let alone a PDA, though Palm users are apparently happier with Documents To Go. You can try Adobe Acrobat or Foxit Reader for your PDA; I didn’t find these workable because they don’t present the text any better for online viewing on PDAs than they do on PCs.
December 13th, 2005 at 7:23 am
Received from Gord:
I’d comment on your blog, but I’d have to create a WordPress account. So you get this in email instead (you can post it if you like). Here are some improvements to your shell script (in red). I assume the first line was for maximum portability, but sh will always be found in /bin/sh, so you don’t need to do the /usr/bin/env hack (which itself isn’t as portable as you think). Your for-line should use the “$@” variable instead of the “$*” variable, double-quoted, so you handle filenames with spaces in them correctly. The rest of the changes to the file are adding double-quotes around possible space-containing variables.
#!/bin/sh
# this program converts the arguments, a list of pdf files, to html.
# it uses tidy to ensure the html is conforming and sed to remove the background
# colour (which otherwise makes the document unreadable on a PocketPC) and
# to remove &nbsp; characters, which are rather wasteful of space and interrupt
# the flow.
#
for file in "$@"
do
    pdftohtml -c -noframes "$file"
    # ${file/.pdf/.html} is a bash-style substitution; a strictly POSIX sh
    # would need ${file%.pdf}.html instead.
    htmlfile="${file/.pdf/.html}"
    tempfile=$(mktemp)
    mv "$htmlfile" "$tempfile"
    tidy -ashtml -utf8 -clean "$tempfile" | sed -e 's;background-color:.*;;g' -e 's#&nbsp;# #g' > "$htmlfile"
    rm "$tempfile"
done
Gord.
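Gord's point about “$@” can be checked with a small sketch. The two function names and the sample filenames below are made up for illustration; each function just counts how many times its loop runs.

```shell
#!/bin/sh
# count_star and count_at are hypothetical helper names for illustration.

# Unquoted $* is split on whitespace, so a filename containing a space
# becomes two separate loop items.
count_star() {
    n=0
    for f in $*; do n=$((n + 1)); done
    echo "$n"
}

# Double-quoted "$@" expands to one word per original argument.
count_at() {
    n=0
    for f in "$@"; do n=$((n + 1)); done
    echo "$n"
}

count_star "my notes.pdf" "book.pdf"   # prints 3
count_at "my notes.pdf" "book.pdf"     # prints 2
```

With the unquoted form, pdftohtml would be handed “my” and “notes.pdf” as two separate files, neither of which exists.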
January 6th, 2006 at 9:15 am
[...] At some point I plan to update the pdf to pda conversion script to convert to plucker as well as or instead of html. [...]
January 22nd, 2006 at 11:16 am
[...] I have been using pdftohtml to convert pdf documents to html, and sometimes then to plucker, so I can use them on my PDA. I have been finding some PDFs just do not convert - most of the text is missing. Same result with pdftotext. [...]