A library for converting HTML into PDFs using ReportLab

XHTML2PDF

The current release of xhtml2pdf is 0.2.5; see the Release Notes for details. As with all open-source software, its use in production depends on many factors, so be aware that you may find issues in some cases.

Big thanks to everyone who has worked on this project so far and to those who help maintain it.

About

xhtml2pdf is an HTML to PDF converter using Python, the ReportLab Toolkit, html5lib and PyPDF2. It supports HTML5 and CSS 2.1 (and some of CSS 3). It is written entirely in pure Python, so it is platform independent.

The main benefit of this tool is that a user with web skills like HTML and CSS is able to generate PDF templates very quickly without learning new technologies.

Documentation

The documentation of xhtml2pdf is available at Read the Docs.

And we could use your help improving it! A good place to start is doc/source/usage.rst.

Installation

This is a typical Python library and can be installed using pip:

pip install xhtml2pdf
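
To confirm that the install succeeded, the standard library can report the installed version. This is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of xhtml2pdf.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None
```

Calling `installed_version("xhtml2pdf")` after the pip command above should return a version string rather than `None`.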

Requirements

Python 2.7+. Only Python 3.4+ is tested and guaranteed to work.

All additional requirements are listed in the requirements.txt file and are installed automatically using the pip install xhtml2pdf method.

Alternatives

You can try WeasyPrint. The codebase is pretty, it has different features and it does a lot of what xhtml2pdf does.

Call for testing

This project is heavily dependent on getting its test coverage up! Furthermore, parts of the codebase could do well with cleanups and refactoring.

If you benefit from xhtml2pdf, perhaps look at the test coverage and identify parts that are yet untouched.

Development environment

  1. If you don't have it, install pip, the Python package installer:

    sudo easy_install pip
    

    For more information about pip refer to http://www.pip-installer.org

  2. We recommend using virtualenv for development. It's great to have a separate environment for each project, keeping the dependencies for multiple projects separated:

    sudo pip install virtualenv
    

    For more information about virtualenv refer to http://www.virtualenv.org

  3. Create a virtualenv for the project. This can be inside the project directory, but must not be placed under version control:

    virtualenv --distribute xhtml2pdfenv --python=python2
    
  4. Activate your virtualenv:

    source xhtml2pdfenv/bin/activate
    

    Later to deactivate it use:

    deactivate
    
  5. The next step will be to install/upgrade dependencies from the requirements.txt file:

    pip install -r requirements.txt
    
  6. Run tests to check your configuration:

    nosetests --with-coverage
    

    You should have a log with the following success status:

    Ran 36 tests in 0.322s
    
    OK
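
For readers on Python 3, steps 3 and 4 above can also be approximated with the standard-library `venv` module instead of virtualenv. This is a sketch under that assumption, not the project's documented workflow; the helper name `create_dev_env` is hypothetical, and it does not reproduce the `--python=python2` option shown above.

```python
import os
import pathlib
import venv


def create_dev_env(env_dir, with_pip=True):
    """Create a virtual environment and return the directory holding its executables."""
    env_path = pathlib.Path(env_dir)
    # venv.create builds the environment; with_pip=True also bootstraps pip into it.
    venv.create(env_path, with_pip=with_pip)
    # Executables live in "Scripts" on Windows and in "bin" elsewhere.
    return env_path / ("Scripts" if os.name == "nt" else "bin")
```

After `create_dev_env("xhtml2pdfenv")`, running the `pip` found in the returned directory with `install -r requirements.txt` corresponds to step 5.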
    

Python integration

Some simple demos of how to integrate xhtml2pdf into a Python program may be found here: test/simple.py
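
As a rough illustration of such an integration, the sketch below renders an HTML string into PDF bytes with `pisa.CreatePDF`. The helper name `html_to_pdf_bytes` and its `create_pdf` parameter are assumptions made for this example (the parameter only exists so the helper can be exercised without xhtml2pdf installed); see test/simple.py for the project's own demos.

```python
import io


def html_to_pdf_bytes(html, create_pdf=None):
    """Render an HTML string and return (pdf_bytes, error_count)."""
    if create_pdf is None:
        # Imported lazily so a stand-in converter can be injected instead.
        from xhtml2pdf import pisa
        create_pdf = pisa.CreatePDF
    buffer = io.BytesIO()
    # CreatePDF writes the PDF into `dest` and returns a status object
    # whose `err` attribute is non-zero when the conversion reported errors.
    status = create_pdf(html, dest=buffer)
    return buffer.getvalue(), status.err
```

A typical call would be `pdf, errors = html_to_pdf_bytes("<h1>Hello</h1>")`, followed by writing `pdf` to a file opened in binary mode.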

Running tests

Two different test suites are available to assert that xhtml2pdf works reliably:

  1. Unit tests. The unit testing framework is currently minimal, but is being improved on a regular basis (contributions welcome). They should run in the expected way for Python's unittest module, i.e.:

    nosetests --with-coverage (or your personal favorite)
    
  2. Functional tests. Thanks to mawe42's super cool work, a full functional test suite is available at testrender/.
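
For contributors looking for a starting point, a new unit test could look roughly like the sketch below. It is an illustration rather than a test taken from the project's suite: the class and function names are hypothetical, and the test is skipped when xhtml2pdf is not importable.

```python
import importlib.util
import io
import unittest

HAS_XHTML2PDF = importlib.util.find_spec("xhtml2pdf") is not None


@unittest.skipUnless(HAS_XHTML2PDF, "xhtml2pdf is not installed")
class CreatePdfSmokeTest(unittest.TestCase):
    def test_minimal_document_renders(self):
        from xhtml2pdf import pisa

        out = io.BytesIO()
        status = pisa.CreatePDF("<p>Hello</p>", dest=out)
        # No conversion errors, and the output carries the PDF file signature.
        self.assertFalse(status.err)
        self.assertTrue(out.getvalue().startswith(b"%PDF"))


def run_smoke_tests():
    """Run the smoke test and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CreatePdfSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because it is a plain `unittest.TestCase`, a test like this should also be collected by nosetests.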

Contact

This project is community-led! Feel free to open up issues on GitHub about new ideas to improve xhtml2pdf.

History

These are the major milestones and the maintainers of the project:

  • 2000-2007, commercial project, spirito.de, written by Dirk Holtwick
  • 2007-2010 Dirk Holtwick (project named "Pisa", project released as GPL)
  • 2010-2012 Dirk Holtwick (project named "xhtml2pdf", changed license to Apache)
  • 2012-2015 Chris Glass (@chrisglass)
  • 2015-2016 Benjamin Bach (@benjaoming)
  • 2016-2018 Sam Spencer (@LegoStormtroopr)
  • 2018-Current Luis Zarate (@luisza)

For more history, see the CHANGELOG.txt file.

License

Copyright 2010 Dirk Holtwick, holtwick.it

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • Problems with some Unicode characters

    Hi, I'm using the latest xhtml2pdf (0.2b1) & reportlab (3.4.0) through django-easy-pdf (0.1.0) on Python 3.6.0 and it's working great for the most part! One problem I am still experiencing, though, is that some Unicode characters are not rendering properly (šŠčČćĆđĐžŽ):

    [Screenshot omitted]

    I'm using the default django-easy-pdf base template and I found that I can somewhat repair things if I override it to declare the html encoding:

    {% block extra_style %}
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    {% endblock %}
    

    Which results in some characters being rendered correctly like Š and Ž, but not all of them (Č, Ć, Đ are still blacked out).

    [Screenshot omitted]

    I tried experimenting with different font declarations (sans-serif, serif, external fonts), but I can't seem to fix this. The characters are never rendered correctly. I don't know if I'm missing some xhtml2pdf / Reportlab setting here. Do you maybe have an idea of a possible solution?
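
One pattern that may be worth checking here (an assumption, not something the report confirms): Š and Ž exist in the Windows-1252 code page, while Č, Ć and Đ do not, which matches the set of characters that stayed blacked out. The sketch below, with the hypothetical helper `split_by_cp1252`, separates the reported characters along that line. If the font actually used for rendering only covers that range, a TrueType font with Latin Extended-A coverage would be needed.

```python
def split_by_cp1252(text):
    """Return (encodable, not_encodable) characters of text with respect to cp1252."""
    encodable, missing = [], []
    for char in text:
        try:
            char.encode("cp1252")
            encodable.append(char)
        except UnicodeEncodeError:
            # The character has no code point in Windows-1252.
            missing.append(char)
    return "".join(encodable), "".join(missing)


# The characters from the report above.
ok, not_ok = split_by_cp1252("šŠčČćĆđĐžŽ")
# ok == "šŠžŽ", not_ok == "čČćĆđĐ"
```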

  • black square box while generating pdf (unicode error)

    A weird problem: while generating a PDF, black square boxes appear in place of Unicode characters. I don't know whether it is a Unicode or a font-face error, or even whether "font-face" and "font-family" are the right way to get Unicode text into the PDF. Am I missing anything? Many thanks.

    Code snippet:

    # -*- coding: utf-8 -*-
    from xhtml2pdf import pisa
    from StringIO import StringIO
    
    source = """<html>
                <style>
                    @font-face {
                    font-family: Mangal;
                    src: url("mangal.ttf");
                    }
    
                    body {
                    font-family: Mangal;
                    }
                </style>
                <body>
                    This is a test <br/>
                           सरल
                </body>
            </html>"""
    
    # Utility function
    def convertHtmlToPdf(source):       
        pdf = StringIO()
        pisaStatus = pisa.CreatePDF(StringIO(source.encode('utf-8')), pdf)
    
        # return True on success and False on errors
        print "Success: ", pisaStatus.err
        return pdf
    
    # Main program
    if __name__=="__main__":
        print pisa.showLogging()
        pdf = convertHtmlToPdf(source)
        fd = open("test.pdf", "w+b")
        fd.write(pdf.getvalue())
        fd.close()
    
  • Twitter-Bootstrap Causes Selector CSSParseError

    Twitter Bootstrap has some pretty gnarly CSS selectors that xhtml2pdf doesn't like.

    Result is:

    Selector Pseudo Function closing ')' not found:: (u':not(', u'[controls]) {\n disp')

      pdf = pisa.pisaDocument(StringIO.StringIO(html.encode("UTF-8")), dest=result, link_callback=fetch_resources )
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/document.py" in pisaDocument
      encoding, context=context, xml_output=xml_output)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/document.py" in pisaStory
      pisaParser(src, context, default_css, xhtml, encoding, xml_output)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/parser.py" in pisaParser
      context.parseCSS()
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/context.py" in parseCSS
      self.css = self.cssParser.parse(self.cssText)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in parse
      src, stylesheet = self._parseStylesheet(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseStylesheet
      src, atResults = self._parseAtKeyword(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseAtKeyword
      src, result = self._parseAtImports(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseAtImports
      stylesheet = self.cssBuilder.atImport(import_, mediums, self)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/css.py" in atImport
      return cssParser.parseExternal(import_)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/context.py" in parseExternal
      result = self.parse(cssFile.getData())
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in parse
      src, stylesheet = self._parseStylesheet(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseStylesheet
      src, ruleset = self._parseRuleset(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseRuleset
      src, selectors = self._parseSelectorGroup(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseSelectorGroup
      src, selector = self._parseSelector(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseSelector
      src, selector = self._parseSimpleSelector(src)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseSimpleSelector
      src, selector = self._parseSelectorPseudo(src, selector)
    File "/Library/Python/2.7/site-packages/xhtml2pdf-0.0.3-py2.7.egg/xhtml2pdf/w3c/cssParser.py" in _parseSelectorPseudo
      raise self.ParseError('Selector Pseudo Function closing \')\' not found', src, ctxsrc)

    Exception Type: CSSParseError at /p/pdf/gd8lx6xbl
    Exception Value: Selector Pseudo Function closing ')' not found:: (u':not(', u'[controls]) {\n disp')
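
Until the parser accepts such selectors, one possible workaround is to strip the offending rules from the stylesheet before handing it to xhtml2pdf. The sketch below is an assumption-laden illustration, not a fix from the project: `drop_not_rules` is a hypothetical helper, it uses a plain regular expression rather than a CSS parser, and it removes a whole rule even when only one selector in a comma-separated group uses `:not(`.

```python
import re

# One rule whose selector text contains ":not(": selector, "{", declarations, "}".
# The leading class excludes ";" so a preceding @import statement is left alone.
_NOT_RULE = re.compile(r"[^{};]*:not\([^{}]*\{[^{}]*\}")


def drop_not_rules(css):
    """Return css with every rule whose selector uses :not(...) removed."""
    return _NOT_RULE.sub("", css)
```

The cleaned stylesheet can then be inlined into the HTML that is passed to `pisa.pisaDocument`.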

  • Now broken with html5lib

    From https://pypi.python.org/pypi/html5lib/0.99999999:

    Move a whole load of stuff (inputstream, ihatexml, trie, tokenizer, utils) to be underscore prefixed to clarify their status as private

    Except https://github.com/xhtml2pdf/xhtml2pdf/blob/master/xhtml2pdf/parser.py#L17:

    from html5lib import treebuilders, inputstream
    

    Current fix:

    • Use `pip install html5lib==1.0b8`
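
As an alternative to pinning the version, calling code can try the old module name first and fall back to the underscore-prefixed one. This is a generic sketch with a hypothetical helper, `import_first`. Note that relying on a private html5lib module is fragile and may break again in later releases.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Try the next candidate name.
            continue
    raise ImportError("none of these modules could be imported: %s" % ", ".join(module_names))
```

For the case above, the call would be `inputstream = import_first("html5lib.inputstream", "html5lib._inputstream")`.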
      
  • Python 3

    I made some changes so that the tests now run in both Python 2 and Python 3. Most of the changes I made were the same as those made by @wylee in #205.

    I also added a file to do Travis CI testing #202, and updated some of the dependencies.

  • Add optional pisaDocument argument to set metadata

    Without this the functionality of pisaDocument would need to be recreated in order to set metadata such as the document author.

    Usage is like so:

    pisaDocument(src=io.StringIO(html), dest=open(output_file, "wb"), context_meta={
                "author": "MyCorp Ltd.",
                "title": "My Document Title",
                "subject": "My Document Subject",
                "keywords": "pdf,documents",
            })
    
  • Python2/Python3 compatibility

    So, I'm close, but for some reason on my install the images in the docs don't show up in Python 2 and are a little smaller in Python 3.

    I'm gonna fix this even if it kills me.

    Todo:

    • [x] Figure out how to render a transparent PDF as white (-flatten doesn't work for multipage PDFs)
    • [ ] Make the images the right size
    • [x] Clean up the string.join issues in reportlab_paragraph
    • [ ] Fix background for tr's
  • Unwanted Helvetica font

    No matter what font I use, Helvetica is always present in the document and it is not embedded, so most printing companies cannot print the document because a font is missing.

  • ZeroDivisionError: float division by zero

    Hi, I get this error while trying to parse an HTML containing the following piece of code. I'm using the latest versions of all packages needed:

    • html5lib-0.90
    • pyPdf-1.13
    • reportlab-2.5
    • xhtml2pdf-0.0.3

    and Python 2.7 (2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)])

    Python Code:

    import cStringIO as StringIO
    from xhtml2pdf import pisa
    ....

    html = '''
    <TABLE BORDER="0" CELLPADDING="2" CELLSPACING="2">
      <TR>
        <TD></TD>
      </TR>
    </TABLE>
    '''
    dest = file('test.pdf', "wb")
    pdf = pisa.CreatePDF(
        StringIO.StringIO(html),
        dest,
        log_warn = 1,
        log_err = 1
    )

    Note: If I put something inside the TD (example: ".... <TD>... some stuff..... </TD>........") or I change the value of the attr cellpadding, it works!!!

    Traceback:

    Traceback (most recent call last):
      File "C:\tmp\test.py", line 95, in <module>
        log_err = 1
      File "C:\Python27\lib\site-packages\xhtml2pdf\document.py", line 131, in pisaDocument
        doc.build(context.story)
      File "C:\Python27\lib\site-packages\reportlab\platypus\doctemplate.py", line 880, in build
        self.handle_flowable(flowables)
      File "C:\Python27\lib\site-packages\reportlab\platypus\doctemplate.py", line 763, in handle_flowable
        if frame.add(f, canv, trySplit=self.allowSplitting):
      File "C:\Python27\lib\site-packages\reportlab\platypus\frames.py", line 174, in _add
        flowable.drawOn(canv, self._x + self._leftExtraIndent, y, _sW=aW-w)
      File "C:\Python27\lib\site-packages\reportlab\platypus\flowables.py", line 108, in drawOn
        self._drawOn(canvas)
      File "C:\Python27\lib\site-packages\reportlab\platypus\flowables.py", line 89, in _drawOn
        self.draw()#this is the bit you overload
      File "C:\Python27\lib\site-packages\reportlab\platypus\tables.py", line 1302, in draw
        self._drawCell(cellval, cellstyle, (colpos, rowpos), (colwidth, rowheight))
      File "C:\Python27\lib\site-packages\reportlab\platypus\tables.py", line 1393, in _drawCell
        w, h = self._listCellGeom(cellval,colwidth,cellstyle,W=W, H=H,aH=rowheight)
      File "C:\Python27\lib\site-packages\xhtml2pdf\xhtml2pdf_reportlab.py", line 710, in _listCellGeom
        return Table._listCellGeom(self, V, w, s, W=W, H=H, aH=aH)
      File "C:\Python27\lib\site-packages\reportlab\platypus\tables.py", line 377, in _listCellGeom
        vw, vh = v.wrapOn(canv, aW, aH)
      File "C:\Python27\lib\site-packages\reportlab\platypus\flowables.py", line 119, in wrapOn
        w, h = self.wrap(aW,aH)
      File "C:\Python27\lib\site-packages\xhtml2pdf\xhtml2pdf_reportlab.py", line 693, in wrap
        return KeepInFrame.wrap(self, availWidth, availHeight)
      File "C:\Python27\lib\site-packages\reportlab\platypus\flowables.py", line 970, in wrap
        W, H = func(s1)
      File "C:\Python27\lib\site-packages\reportlab\platypus\flowables.py", line 951, in func
        W /= x
    ZeroDivisionError: float division by zero

    Thanks for your great job, Shen139

  • Release a new version

    I just upgraded my version with the master branch from github and it fixes a ton of issues in the current 0.0.5 release. Could you release a new version so we can just use pypi?
    Thanks for all the work on this :)

  • make rtl languages from left

    for example Persian text must start from right but your result seems like this Farsi / Persian: .‫یم نم‬ ‫مروخب هشيش درد ساسحا ِنودب مناوت‬ also in PDF separate character

    correct Persian من می نوانم ...

    P.S. Maybe I did not explain the problem well. For example, the correct word is "word" but your result is "dorw".

  • Footer not displayed in 0.2.8 and 0.2.7

    The footer frame is not displayed in pdf file when following the documentation.

    How to reproduce:

    • create a virtualenv with xhtml2pdf library:
    python3 -m venv venv
    ./venv/bin/pip install xhtml2pdf
    
    • add an index.html file containing the example at https://xhtml2pdf.readthedocs.io/en/latest/format_html.html#example-with-2-static-frames-and-1-content-frame
    • ./venv/bin/xhtml2pdf index.html

    The generated index.pdf shows only 'Lyrics-R-Us' and 'To PDF or not to PDF'.

    It may occur in other releases as well; I only checked these two.

  • AttributeError: 'PmlBaseDoc' object has no attribute '_page_count'

    When I add the <pdf:pagecount> html tag within the source_html attribute of pisa.CreatePDF, copied directly from the example, it gives me the error AttributeError: 'PmlBaseDoc' object has no attribute '_page_count'

    FYI, I am on a M1 Mac

  • hindi html page is not converting to pdf as expected.

    Html File

    <html><head>
        <meta charset="utf-8">
    </head>
    <body>
        <p> 
            ट्रांसलेशन मेमोरी पर आधारित यह सिस्टम भारत सरकार के गृह मंत्रालय के अधीन राजभाषा विभाग के लिए विकसित किया गया है। 
            इस सिस्टम 
            के माध्यम से अंग्रेजी से हिंदी तथा हिंदी से अंग्रेजी में अनुवाद संभव है।
            मेटकैट - मशीनी अनुवाद के लिए एक बढ़िया और
             सचमुच काम की साइट । उपयोग करते रहने और स्वयं का अनुवाद मेमोरी डेटाबेस बना कर 
             अधिकाधिक सहूलियत और शुद्धता हासिल करने की सुविधा भी। 
             70 से अधिक फ़ाइल टाइप में काम किया जा सकता है। हिंदी भी पूरी तरह समर्थित। 
             लिंक और फार्मेटिंग आदि भी बनाए रखता है। पूरी तरह निःशुल्क।
            
        </p>
        </body> 
    </html>
    

    Python Code

    f = open("new.pdf", 'wb')
    html = open("new.html", "r")
    print(html)
    pisa.CreatePDF(html, f)
    f.close()

    Output file new.pdf

  • No changelog file

    The README references a changelog file here: https://github.com/xhtml2pdf/xhtml2pdf/blame/master/README.rst#L161 but there is no changelog file in the repository.

  • 'PyPDF3.utils.PdfReadError: file has not been decrypted' when encryption is set

    When creating an encrypted PDF by following the official guide (https://xhtml2pdf.readthedocs.io/en/latest/encryption_and_signatures.html), an exception is raised: 'PyPDF3.utils.PdfReadError: file has not been decrypted'.

    A possible fix is to encrypt the PDF with PyPDF3 just before returning from the pisaDocument function, instead of creating an encrypted PmlBaseDoc object.
