Converting PDF to HTML using pdf2htmlEX
As some of you may know, most scientific literature is published in journals and conferences as self-contained pieces called "papers". Since these papers usually represent an advancement of science in a particular area, it is important to ensure that they are created in a high-quality format that can last for decades, if not centuries. In the pre-computer era, all papers were typewritten and had to be typeset by the publisher before publication in a journal. Reproducing them was not easy before the advent of photocopiers. In modern times, most scientific literature is created on computers and the output is typically a PDF file that adheres to the formatting prescribed by the journal or conference publisher.
My preferred tool for writing papers is LaTeX. It is a typesetting language designed specially for scientific literature, most prominently mathematical literature. Since LaTeX separates semantics from visual representation in a document, it is easy to apply different filters to a LaTeX document and obtain the output in a format of choice. For example, the output can be PostScript, a PDF, an image or even a device-independent file (DVI).
Until a few years ago, most Unix-based systems had a command-line tool (pdf2html) to convert a LaTeX document to HTML. Unfortunately, it has not been maintained for quite some time. As a result, it often does not work, and when it does, it produces broken (circa 2002) HTML. This drawback has irritated me whenever I wanted to quickly post a section of a document (usually a paper) on a web server for editing or review. A few weeks ago, I came across a link to pdf2htmlEX. Could it really convert any PDF to HTML while maintaining the formatting? Impressive. I bookmarked it then and today I finally got down to installing it on my computer.
My computer runs 64-bit Mageia 2.0. It turns out that pdf2htmlEX has not been packaged for Mageia. That's inconvenient but not terrible since the source code of pdf2htmlEX has been released by the authors and therefore I could compile the source code on my system. I read the compilation/installation instructions on the software's website. The installation required the following:
- GCC >= 4.4.6
- libpng and headers
- poppler with xpdf header >= 0.20.0
- poppler-data if you want CJK support
- fontforge (with header files)
- [Optional] ttfautohint
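Before building anything, it can help to see which of these requirements the system already satisfies. The following is a minimal sketch, not part of the pdf2htmlEX instructions: the helper names `version_ge` and `check_module` are made up for illustration, the pkg-config module names (`poppler`, `libpng`) are assumptions that may differ between distributions, and the version comparison relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: report which pdf2htmlEX build prerequisites are visible on this machine.
# Helper names and pkg-config module names are illustrative assumptions.

# version_ge A B: succeeds when version A >= version B (needs GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_module NAME MIN: ask pkg-config whether NAME is installed at >= MIN.
check_module() {
    local name=$1 min=$2 found
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "$name: pkg-config is not available, cannot check"
        return 2
    fi
    if ! found=$(pkg-config --modversion "$name" 2>/dev/null); then
        echo "$name: not found by pkg-config (is the devel package installed?)"
        return 1
    fi
    if version_ge "$found" "$min"; then
        echo "$name: $found (ok, need >= $min)"
    else
        echo "$name: $found (too old, need >= $min)"
        return 1
    fi
}

# GCC: compare the compiler's reported version with the documented minimum.
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpversion)
    if version_ge "$gcc_version" 4.4.6; then
        echo "gcc: $gcc_version (ok, need >= 4.4.6)"
    else
        echo "gcc: $gcc_version (too old, need >= 4.4.6)"
    fi
else
    echo "gcc: not found"
fi

check_module poppler 0.20.0 || true   # minimum taken from the list above
check_module libpng 0 || true         # the list gives no minimum for libpng
```

A "not found by pkg-config" line does not always mean the library is absent; it may only mean that its `.pc` file lives in a directory pkg-config does not search.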
GCC-4.4.6 was luckily installed on my system. libpng was also installed. Needing the headers of a library typically means installing the devel version of that library. It turned out that I could directly install the devel version of lib64png using drakrpm, so that was easy. On Mageia 2.0, the most recent packaged version of poppler is 0.18.4, so the first hurdle had appeared. I downloaded the tarballs of poppler and poppler-data from poppler's website. Although fontforge was already installed on my system, the headers were missing, and drakrpm reported a failure when I searched for the devel version of fontforge. So I downloaded the (stable) tarball for fontforge. Running tar xvf unpacked each of these tarballs into its own directory. Following the instructions for compiling poppler, I got the latest version installed on the system. poppler-data followed. fontforge gave me no trouble either.
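The unpack, configure and install routine described above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact commands I ran: the tarball name is a placeholder, `run` and `build_tarball` are hypothetical helpers, and `--enable-xpdf-headers` is the configure option I would expect poppler's autotools build to need for the "xpdf header" requirement. By default the script only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch of a from-source install into /usr/local.
# Tarball names are placeholders; DRY_RUN=1 (the default) only prints commands.
DRY_RUN=${DRY_RUN:-1}
PREFIX=${PREFIX:-/usr/local}

# run CMD...: print the command, and execute it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# build_tarball TARBALL [CONFIGURE_FLAGS...]: unpack, configure, build, install.
# The body runs in a subshell so the `cd` does not leak into the caller.
build_tarball() (
    tarball=$1; shift
    srcdir=${tarball%.tar.*}          # poppler-0.20.0.tar.gz -> poppler-0.20.0
    run tar xvf "$tarball" &&
    run cd "$srcdir" &&
    run ./configure --prefix="$PREFIX" "$@" &&
    run make &&
    run make install                  # usually needs root for /usr/local
)

# Assumed flag: install the xpdf headers that pdf2htmlEX asks for.
build_tarball poppler-0.20.0.tar.gz --enable-xpdf-headers
```

Setting DRY_RUN=0 makes the same calls actually execute. Not every tarball ships a configure script, so the steps for poppler-data and fontforge may differ; their own README or INSTALL files are the authority.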
When I started compiling pdf2htmlEX, it complained about the lack of poppler-0.20 on the system and quit. Wait! Hadn't I already installed it? Why could it not find it then? The libraries in /usr/local/lib64/ were updated. So what was the problem? As a shortcut, I quickly looked up command-line options for cmake so that cmake would pick poppler from the local source directory instead of the system. That did not work. I edited the CMakeLists.txt file to point to the local source directory. No luck again. Trial and error did not help, so it was Google to the rescue. It turned out that compiling poppler yourself does not set the environment variable PKG_CONFIG_PATH correctly. So running
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
(even though this is a 64-bit system) enabled the compilation to succeed, and I got the pdf2htmlEX executable installed on my system. Note that there was no pkgconfig directory in /usr/local/lib64. On executing pdf2htmlEX, it complained about a missing 'manifest' and quit. Another Google search later, it turned out that I had to run
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
and the program worked flawlessly. pdf2htmlEX should have picked up /usr/local/lib64 but it did not. Clearly, something somewhere is wrong. That investigation is for another day. In the meantime, you can have a look at the output of converting my resume from PDF to HTML.
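The two environment fixes above can be wrapped in a small helper so that they are safe to repeat, for instance from a shell startup file. This is only a sketch: `prepend_path` is a hypothetical function, it needs bash because of the `${!var}` indirection, and unlike the plain PKG_CONFIG_PATH export it keeps any value the variable already had. The directories are the ones that worked on my machine and may differ on yours.

```shell
#!/usr/bin/env bash
# prepend_path VAR DIR: put DIR at the front of the colon-separated list held
# in the variable named VAR, unless DIR is already listed. (Illustrative helper.)
prepend_path() {
    local var=$1 dir=$2
    local current=${!var:-}
    case ":$current:" in
        *":$dir:"*) export "$var" ;;                        # already listed
        *) export "$var=$dir${current:+:$current}" ;;       # add at the front
    esac
}

# Build time: let pkg-config (and therefore cmake) see the new poppler.
prepend_path PKG_CONFIG_PATH /usr/local/lib/pkgconfig
# Run time: let the dynamic loader find the freshly installed libraries.
prepend_path LD_LIBRARY_PATH /usr/local/lib

# Optional diagnostics; each is skipped when the tool is not installed.
if command -v pkg-config >/dev/null 2>&1; then
    pkg-config --modversion poppler 2>/dev/null ||
        echo "poppler is still not visible to pkg-config"
fi
if command -v pdf2htmlEX >/dev/null 2>&1 && command -v ldd >/dev/null 2>&1; then
    # List shared libraries the loader cannot resolve for the executable.
    ldd "$(command -v pdf2htmlEX)" | grep 'not found' ||
        echo "all shared libraries resolved"
fi
```

If the pkg-config line reports 0.20 or later, the cmake step for pdf2htmlEX should be able to find poppler. If the ldd check lists nothing as "not found", the loader can at least resolve every shared library the executable links against.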