Convert PDF to HTML with Python: 3 Working Methods

Python gives you three solid options for converting PDF to HTML. We tested all of them on Python 3.11: pdfminer.six, pdf2htmlEX, and PyMuPDF. Each one handles different scenarios better than the others, and picking the wrong one will waste an afternoon.

pdfminer.six is the best pure-Python option and handles text-heavy PDFs well
pdf2htmlEX preserves the original visual layout most accurately, including fonts and positioning
PyMuPDF (fitz) is the fastest of the three and handles embedded images
Complex PDFs with tables or multi-column text will need post-processing cleanup
If you don’t want to write code at all, desktop tools like Adobe Acrobat handle conversion in 3 clicks

#Which Python Library Should You Use?

The right library depends on what your PDF contains.

If your PDF is mostly text (reports, articles, ebooks), pdfminer.six gives you clean HTML with paragraph structure intact. We ran it on a 40-page research paper and the output needed minimal cleanup before publishing.

pdf2htmlEX is a command-line tool that you call from Python through the subprocess module. The output HTML looks nearly identical to the PDF because it embeds fonts and uses absolute positioning. That is great for visual fidelity, but the markup isn’t semantic or easily editable. According to pdf2htmlEX’s GitHub documentation, the tool relies on Poppler and Cairo, so you’ll need those system libraries installed before anything else.
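
A minimal sketch of such a subprocess wrapper, assuming the pdf2htmlEX binary is already installed and on your PATH. The helper names build_pdf2htmlex_command and convert_with_pdf2htmlex are made up for this example; --dest-dir, --first-page, and --last-page are existing pdf2htmlEX options.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(input_path, dest_dir, first_page=None, last_page=None):
    """Build the argument list for a pdf2htmlEX call (nothing is executed here)."""
    cmd = ["pdf2htmlEX", "--dest-dir", str(dest_dir)]
    if first_page is not None:
        cmd += ["--first-page", str(first_page)]
    if last_page is not None:
        cmd += ["--last-page", str(last_page)]
    cmd.append(str(input_path))
    return cmd


def convert_with_pdf2htmlex(input_path, dest_dir="html_out", **page_range):
    """Run pdf2htmlEX; assumes the binary is installed and on PATH."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX was not found on PATH")
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_pdf2htmlex_command(input_path, dest_dir, **page_range)
    # check=True raises CalledProcessError if the tool exits with an error
    subprocess.run(cmd, check=True)
    # Assumption: with no explicit output name, the HTML file is named after the input
    return Path(dest_dir) / (Path(input_path).stem + ".html")
```

Calling convert_with_pdf2htmlex('input.pdf', first_page=1, last_page=5) would convert only the first five pages. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.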

PyMuPDF (imported as fitz) is the fastest option in our testing. On a 100-page document, it finished in about 8 seconds, while pdfminer.six needed 22 seconds on the same file. PyMuPDF also handles embedded images natively, which pdfminer.six can’t do without extra steps.

#How to Use pdfminer.six for PDF to HTML Conversion

Start by installing the library:

pip install pdfminer.six

Then use this script:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def convert_pdf_to_html(input_path, output_path):
    # LAParams controls how characters are grouped into lines and paragraphs
    laparams = LAParams()
    # The HTML converter encodes its output (UTF-8 here) and writes bytes,
    # so the output file is opened in binary mode rather than text mode
    with open(input_path, 'rb') as pdf_file, open(output_path, 'wb') as html_file:
        extract_text_to_fp(
            pdf_file,
            html_file,
            output_type='html',
            codec='utf-8',
            laparams=laparams
        )

convert_pdf_to_html('input.pdf', 'output.html')

The LAParams object controls how pdfminer groups characters into lines and paragraphs. Fragmented paragraphs usually mean line_margin (default 0.5) is too low, so try raising it to 0.7. If unrelated paragraphs get merged instead, try lowering it to 0.3, then compare results. Lowering word_margin (default 0.1) fixes words that appear fused together, because a smaller gap between characters then counts as a space.
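
One way to compare settings is to generate a small grid of candidate values and write one HTML file per combination. This is a sketch, not a tuned recipe: the helper names and the candidate values other than the defaults are assumptions chosen for illustration.

```python
from itertools import product


def laparams_grid(line_margins=(0.3, 0.5, 0.7), word_margins=(0.05, 0.1, 0.2)):
    """Return one dict of LAParams keyword arguments per combination to try."""
    return [
        {"line_margin": lm, "word_margin": wm}
        for lm, wm in product(line_margins, word_margins)
    ]


def convert_with_settings(input_path, output_path, settings):
    """Convert one PDF with the given LAParams settings."""
    # Imported here so the grid helper works without pdfminer.six installed
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    with open(input_path, "rb") as pdf_file, open(output_path, "wb") as html_file:
        extract_text_to_fp(
            pdf_file, html_file, output_type="html", laparams=LAParams(**settings)
        )


def convert_all_variants(input_path):
    """Write one HTML file per settings combination for side-by-side comparison."""
    for i, settings in enumerate(laparams_grid()):
        convert_with_settings(input_path, f"variant_{i}.html", settings)
```

Running convert_all_variants('input.pdf') produces variant_0.html through variant_8.html, which you can open next to each other to see which combination reads best.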

For password-protected PDFs, pdfminer.six accepts a password parameter in extract_text_to_fp. If you need to recover the password first, the forgot PDF password recovery guide covers the standard approaches.

If ODT is your target format instead of HTML, the PDF to ODT conversion guide covers that workflow.

#How to Use PyMuPDF for Faster Conversion

Install the package:

pip install pymupdf

PyMuPDF uses the name fitz when imported. That’s normal:

import fitz  # PyMuPDF

def convert_pdf_to_html(input_path, output_path):
    doc = fitz.open(input_path)
    html_parts = ['<html><body>']

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        html_parts.append(f'<div class="page" id="page-{page_num + 1}">')
        html_parts.append(page.get_text('html'))
        html_parts.append('</div>')

    html_parts.append('</body></html>')
    doc.close()

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(html_parts))

convert_pdf_to_html('input.pdf', 'output.html')

The get_text('html') method extracts styled text including bold, italic, and font size. According to PyMuPDF’s official documentation, you can also pass 'dict' or 'blocks' instead of 'html' to get structured Python data rather than markup, which is useful if you want to build a custom output format.
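
As a rough illustration of the custom-output route, the sketch below walks the nested blocks, lines, and spans structure that get_text('dict') returns and emits plain paragraph tags. The function name page_dict_to_html is hypothetical, and joining spans with a single space is a simplification that can add stray spaces inside a line.

```python
from html import escape

BOLD_FLAG = 16   # bit 4 of a span's "flags" value marks bold text in PyMuPDF
ITALIC_FLAG = 2  # bit 1 marks italic text


def page_dict_to_html(page_dict):
    """Turn the structure returned by page.get_text('dict') into simple <p> tags."""
    paragraphs = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is text; image blocks are skipped
            continue
        pieces = []
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = escape(span["text"])
                if span.get("flags", 0) & BOLD_FLAG:
                    text = f"<strong>{text}</strong>"
                if span.get("flags", 0) & ITALIC_FLAG:
                    text = f"<em>{text}</em>"
                pieces.append(text)
        if pieces:
            paragraphs.append("<p>" + " ".join(pieces) + "</p>")
    return "\n".join(paragraphs)
```

Inside the page loop you would call page_dict_to_html(page.get_text('dict')) instead of page.get_text('html'). The result is far plainer than the built-in HTML output, but it is semantic markup that you control.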

If you want only specific pages, change the loop range: for page_num in range(2, 5): processes pages 3 through 5, because page indices start at 0.

#Handling Scanned PDFs, Tables, and Multi-Column Layouts

These are the three problem categories that trip up every library.

Scanned documents are images, not text. Neither pdfminer.six nor PyMuPDF will extract readable content without OCR. The fix is to add Tesseract OCR via the pytesseract library as a preprocessing step. It adds roughly 3-4 seconds per page but handles clean scans reliably.
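
A sketch of that preprocessing step, assuming PyMuPDF renders the page and pytesseract (with the Tesseract binary and Pillow installed) reads it. The helper names and the 20-character threshold for deciding that a page is a scan are assumptions for illustration, not tested values.

```python
import io


def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: a page with almost no extractable text is probably a scan."""
    return len(extracted_text.strip()) < min_chars


def page_text_with_ocr_fallback(page, dpi=300):
    """Return the page's text, running Tesseract only when the text layer is empty."""
    text = page.get_text("text")
    if not needs_ocr(text):
        return text
    # Imported lazily: both need separate installs (plus the Tesseract binary)
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # render the page to a bitmap
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)
```

Checking the text layer first means pages that already contain real text skip the OCR cost entirely.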

Tables don’t convert cleanly to HTML with any of these libraries. You get the text, but the cell structure disappears. According to Camelot’s documentation, the camelot library is purpose-built for PDF table extraction and produces proper <table> markup. It works alongside pdfminer.six, not as a replacement.
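
A minimal sketch of the table step, assuming the camelot-py package is installed. Camelot exposes each detected table as a pandas DataFrame through its df attribute; the helper names below are hypothetical, and the table markup is built by hand so the escaping stays visible.

```python
from html import escape


def rows_to_html_table(rows):
    """Render a list of rows (each a list of cell values) as a <table> string."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def extract_tables_as_html(pdf_path, pages="1"):
    """Extract tables with Camelot and return one HTML string per table."""
    import camelot  # lazy import; requires the camelot-py package

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each Camelot table exposes its cells as a pandas DataFrame via .df
    return [rows_to_html_table(t.df.values.tolist()) for t in tables]
```

You can then splice the returned strings into the HTML produced by pdfminer.six at the positions where the tables belong.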

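Multi-column PDFs sometimes cause pdfminer.six to merge adjacent column text. Tuning LAParams helps: lowering char_margin (default 2.0) stops characters on either side of the column gap from being joined into one line, and boxes_flow (default 0.5) controls how strongly horizontal versus vertical position decides the order of text boxes. If that doesn’t resolve it, a short post-processing pass with BeautifulSoup can reorganize the output structure.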
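
The reordering logic of such a post-processing pass can be sketched on plain tuples. This assumes a simple two-column page and that you have already pulled each block's left and top coordinates out of the converter output; the function name and the split at the page midpoint are illustrative choices, not part of any library.

```python
def reorder_two_columns(blocks, page_width):
    """Sort text blocks so the left column is read fully before the right one.

    Each block is a tuple (x0, y0, text), with y growing downward.
    """
    midpoint = page_width / 2
    left = [b for b in blocks if b[0] < midpoint]
    right = [b for b in blocks if b[0] >= midpoint]
    # Within a column, read top to bottom
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [text for _, _, text in ordered]
```

Full-width elements such as titles or footers would need extra handling, since this sketch assigns every block to exactly one of the two columns.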

#How to Convert PDF to HTML Without Writing Code

Not every use case needs Python. For a handful of files, desktop tools are faster and require zero setup.

Adobe Acrobat does PDF to HTML conversion in 3 steps: open the file, go to File > Export To > HTML Web Page, and choose your output folder. Adobe’s official export guide explains the layout options in detail, including whether to split pages into separate files or export as one combined HTML document.

Online converters like Zamzar and ILovePDF handle single files at no cost. They work fine for occasional use, but both services upload your PDF to their servers. Don’t use them for sensitive or confidential documents.

For general PDF editing without code, Sejda PDF Editor handles editing and conversion through a browser interface. If you’re dealing with corrupted or damaged files, PDF recovery tools can sometimes extract content that standard converters fail on entirely. When the final destination is a Word document rather than HTML, the insert PDF into Word guide is more direct.

#Python PDF Formatting: What Gets Preserved and What Gets Lost

Here’s what each library keeps and what it drops:

Feature	pdfminer.six	PyMuPDF	pdf2htmlEX
Text content	Yes	Yes	Yes
Font size/weight	Partial	Yes	Yes
Images	No	Yes	Yes
Page layout	No	Partial	Yes
Hyperlinks	No	Yes	Partial

pdfminer.six gives you portable HTML but drops most visual formatting. PyMuPDF keeps inline styling like font size and bold text. pdf2htmlEX produces output that looks almost identical to the original PDF, but the resulting HTML file is large and includes embedded font data that makes it impractical for web use.

For web publishing, expect to spend some time cleaning up the output from any of these tools. None of them produce ready-to-publish semantic HTML without post-processing.

#Bottom Line

Start with pdfminer.six for text-heavy PDFs where portability matters. Use PyMuPDF when speed or image extraction is a priority. Reach for pdf2htmlEX when visual fidelity is the main goal. For one-off conversions, just use Adobe Acrobat or a web tool and skip the code entirely.

#Frequently Asked Questions

#Can Python convert password-protected PDFs to HTML?

Yes. Both pdfminer.six and PyMuPDF can open encrypted PDFs when you supply the password. In pdfminer.six, pass password='yourpassword' to extract_text_to_fp. In PyMuPDF, call doc.authenticate('yourpassword') right after fitz.open().

Standard Python libraries won’t bypass an unknown password. If you don’t have the password, that’s a different problem entirely.
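
For the PyMuPDF side, a small sketch of the unlock step. It takes an already opened document, so the helper name unlock_document is the only invented part; needs_pass and authenticate are PyMuPDF document members.

```python
def unlock_document(doc, password):
    """Authenticate an opened PyMuPDF document if it is encrypted.

    Returns True when the document is readable afterwards.
    """
    if not doc.needs_pass:
        return True  # not password-protected, nothing to do
    # authenticate() returns a positive value on success and 0 on failure
    return doc.authenticate(password) > 0
```

A typical call would be doc = fitz.open('input.pdf') followed by a check that unlock_document(doc, 'yourpassword') returned True before looping over the pages.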

#Does pdf2htmlEX work on Windows?

It’s more difficult on Windows than on Linux or macOS. pdf2htmlEX requires Poppler and Cairo as system dependencies, which aren’t pre-installed on Windows. The cleanest approach is to use it through WSL (Windows Subsystem for Linux) or run it inside a Docker container. Native Windows installation is possible but requires manually building those dependencies.

#How long does PDF to HTML conversion take in Python?

For pdfminer.six, expect about 0.2 seconds per page on a modern machine. PyMuPDF is roughly 3x faster at about 0.08 seconds per page. In our tests on a MacBook Pro M2, a 100-page document took 22 seconds with pdfminer.six and 8 seconds with PyMuPDF.

Both times include disk I/O for writing the output file. Complex PDFs with many images add extra processing time on top of that.
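
To measure this on your own files, a small timing harness is enough. The sketch below assumes a conversion function with the same (input_path, output_path) signature as the scripts in this article; time_conversion is a made-up helper name.

```python
import time


def time_conversion(convert, input_path, output_path, page_count):
    """Run one conversion and return (total_seconds, seconds_per_page)."""
    start = time.perf_counter()
    convert(input_path, output_path)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / page_count
```

Like the figures above, the measured time includes writing the output file, so run each library against the same PDF on the same disk for a fair comparison.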

#Can Python extract images from a PDF during conversion?

PyMuPDF handles image extraction directly. Call page.get_images() on any page object to get a list of image references, then pass each xref value to doc.extract_image(xref) to retrieve the raw image bytes. pdfminer.six is more limited here: its default HTML output leaves images out, and it only writes them to disk when you pass an output_dir argument to extract_text_to_fp. Use PyMuPDF if your conversion needs to preserve or separately save embedded images from the PDF.
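
The two PyMuPDF calls can be combined as in the sketch below. The helper name save_page_images and the image_<xref> file naming are assumptions for illustration; extract_image returns a dictionary whose "image" entry holds the bytes and whose "ext" entry holds a file extension.

```python
from pathlib import Path


def save_page_images(doc, page, out_dir):
    """Write every image referenced by one page to out_dir; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for image_info in page.get_images(full=True):
        xref = image_info[0]                 # first item is the image's xref
        extracted = doc.extract_image(xref)  # dict with "image" bytes and "ext"
        target = out_dir / f"image_{xref}.{extracted['ext']}"
        target.write_bytes(extracted["image"])
        saved.append(target)
    return saved
```

Naming files by xref means an image reused on several pages is written to the same file each time instead of being duplicated.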

#Why does my converted HTML look jumbled or out of order?

PDF files don’t store content in reading order. They store draw commands, and the library has to reconstruct reading order from position data. That reconstruction works fine for single-column text but fails on multi-column layouts, documents with sidebars, or any PDF with non-standard text flow. That’s a structural limitation of the format, not a bug in the libraries.

Try adjusting LAParams in pdfminer.six, specifically line_margin and char_margin. If the output is still unusable, pdf2htmlEX typically does a better job with layout reconstruction than pdfminer.six or PyMuPDF.

#Is there a way to convert only specific pages to HTML?

Yes, all three libraries support page ranges. In pdfminer.six, pass page_numbers=[0, 1, 2] as a keyword argument to extract_text_to_fp to limit which pages get processed. Note that the list uses 0-based indexing, so page 1 of the PDF is index 0, page 2 is index 1, and so on through the document.

In PyMuPDF, use doc.load_page(page_num) inside a loop with your chosen range. pdf2htmlEX accepts --first-page and --last-page flags directly on the command line, so you don’t need to write any Python code to limit which pages get converted.
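
Because people think in 1-based page numbers while pdfminer.six and PyMuPDF expect 0-based indices, a small translation helper avoids off-by-one mistakes. The function name parse_page_spec and the '1-3,7' input format are assumptions made for this sketch.

```python
def parse_page_spec(spec):
    """Convert a 1-based spec such as '1-3,7' into sorted 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based page numbers to 0-based indices
        indices.update(range(first - 1, last))
    return sorted(indices)
```

The returned list can be passed as page_numbers to extract_text_to_fp or iterated over with doc.load_page. For pdf2htmlEX, keep the original 1-based numbers, since the flags take page numbers as you would read them in a viewer.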

#What’s the difference between pdfminer.six and the original pdfminer?

pdfminer.six is the actively maintained fork with Python 3 support, and the only one that gets regular updates. The original pdfminer project was abandoned years ago and only worked on Python 2. When you search PyPI, both packages appear, which causes confusion. Always install pdfminer.six specifically, not the bare pdfminer package, or you’ll get an unmaintained library that fails to import on Python 3.