Converting PDF to HTML using pdf2htmlEX

As some of you may know, most scientific literature is published in journals and conferences as self-contained pieces called "papers". Since these papers usually represent an advancement of science in a particular area, it is important that they are created in a high-quality format that can last for decades, if not centuries. In the pre-computer era, papers were typewritten and had to be typeset by the publisher before appearing in a journal. Reproducing them was not easy before the advent of photocopiers. In modern times, most scientific literature is created on computers, and the output is typically a PDF file that adheres to the formatting prescribed by the journal or conference publisher.

My preferred way of creating papers is LaTeX, a typesetting language designed specially for scientific literature, most prominently mathematical literature. Since LaTeX separates semantics from visual representation in a document, it is easy to apply different filters to a LaTeX document and obtain the output in a format of choice. For example, the output can be PostScript, a PDF, an image, or even a device-independent (DVI) file.

Until a few years ago, most Unix-based systems had a command line tool (pdf2html) to convert a LaTeX document to HTML. Unfortunately, it has not been maintained for quite some time. As a result, it often does not work, and when it does, it produces broken (circa 2002) HTML. This drawback has irritated me whenever I wanted to quickly post a section of a document (usually a paper) on a web server for editing or review. A few weeks ago, I came across a link to pdf2htmlEX. Could it really convert any PDF to HTML while maintaining the formatting? Impressive. I bookmarked it then, and today I finally got down to installing it on my computer.

My computer runs 64-bit Mageia 2.0. It turns out that pdf2htmlEX has not been packaged for Mageia. That's inconvenient but not terrible since the source code of pdf2htmlEX has been released by the authors and therefore I could compile the source code on my system. I read the compilation/installation instructions on the software's website. The installation required the following:

GCC-4.4.6 was luckily installed on my system. libpng was also installed. Needing the headers of a library typically means installing its devel package, and it turned out that I could directly install the devel version of lib64png using drakrpm. So that was easy. On Mageia 2.0, however, the most recent packaged version of poppler is 0.18.4, which is older than the poppler 0.20 that pdf2htmlEX asks for. So the first hurdle had appeared. I downloaded the tarballs of poppler and poppler-data from poppler's website. Although fontforge was already installed on my system, its headers were missing, and drakrpm reported a failure when I searched for the devel version of fontforge. So I downloaded the (stable) tarball for fontforge. tar xvf unpacked all of these tarballs into their respective directories. Following the instructions for compiling poppler, I got the latest version installed on the system. poppler-data followed. fontforge too gave me no trouble.
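For anyone repeating the tarball builds, the routine for each unpacked source tree can be sketched roughly as follows. This is a minimal sketch and not necessarily the exact commands used here: the function name build_from_source, the MAKE override and the placeholder directory name are assumptions made for illustration, and the configure options each package really needs should be taken from its own build instructions.

```shell
#!/bin/sh
# Sketch of the usual "configure, make, make install" routine for one
# already-unpacked source tree. All names here are illustrative.

# build_from_source DIR [configure options...]
build_from_source() {
    src_dir=$1
    shift
    (
        cd "$src_dir" || exit 1
        # Autotools-style trees ship a ./configure script; trees that only
        # ship a Makefile skip this step.
        if [ -x ./configure ]; then
            ./configure "$@" || exit 1
        fi
        # MAKE can be overridden (for example MAKE=gmake); default is make.
        "${MAKE:-make}" || exit 1
        # Installing under /usr/local usually needs root privileges.
        "${MAKE:-make}" install
    )
}

# Example (the directory name is a placeholder for whatever tar xvf produced):
#   build_from_source ./poppler-X.Y.Z --prefix=/usr/local
```

Running the steps in a subshell keeps the caller's working directory unchanged, and the function stops at the first failing step, so a broken configure run is not followed by a pointless make.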

When I started compiling pdf2htmlEX, it complained about the lack of poppler-0.20 on the system and quit. Wait! Hadn't I already installed it? Why could it not find it then? The libraries in /usr/local/lib64/ were updated. So what was the problem? As a shortcut, I quickly looked up command line options for cmake so that cmake would pick up poppler from the local source directory instead of the system. Did not work. I edited the CMakeLists.txt file to point to the local source directory. No luck again. Trial and error did not help. Google to the rescue. It turned out that compiling poppler yourself does not set the environment variable PKG_CONFIG_PATH correctly. So

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

even though this is a 64-bit system, enabled the compilation to succeed, and I got the pdf2htmlEX executable installed on my system. Note that there was no pkgconfig directory in /usr/local/lib64. On executing pdf2htmlEX, it complained about a missing 'manifest' and quit. Another Google search, and it turned out that I had to

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

and the program worked flawlessly. pdf2htmlEX should have picked up /usr/local/lib64, but it did not. Clearly, something somewhere is wrong. That investigation is for another day. In the meantime, you can have a look at the output of converting my resume from PDF to HTML.
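The two export lines above can be folded into a small snippet for a shell startup file. This is a hedged sketch under stated assumptions: the helper name prepend_path and the PREFIX variable are invented for illustration, and /usr/local is assumed only because that is where the self-built libraries ended up in this post.

```shell
#!/bin/sh
# Sketch: make libraries installed under a prefix visible to pkg-config
# (at build time) and to the dynamic loader (at run time).

# prepend_path VALUE DIR
# Print DIR prepended to the colon-separated list VALUE, without adding a
# duplicate entry or a stray colon when VALUE is empty.
prepend_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${2}${1:+:$1}" ;;
    esac
}

# Assumption: the self-built libraries live under /usr/local.
PREFIX=${PREFIX:-/usr/local}

PKG_CONFIG_PATH=$(prepend_path "${PKG_CONFIG_PATH:-}" "$PREFIX/lib/pkgconfig")
LD_LIBRARY_PATH=$(prepend_path "${LD_LIBRARY_PATH:-}" "$PREFIX/lib")
export PKG_CONFIG_PATH LD_LIBRARY_PATH
```

With these set, running pkg-config --modversion poppler should report the self-built version instead of the packaged 0.18.4, and ldd on the pdf2htmlEX executable shows which shared libraries the loader actually resolves.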