This recipe shows how to convert the text in an HTML document to PDF. It uses the Beautiful Soup and xtopdf Python libraries. Beautiful Soup is a library for parsing HTML and extracting content from it. xtopdf is a library for creating PDF from other formats, including plain text, among others.

"""
HTMLTextToPDF.py
A demo program to show how to convert the text extracted from HTML 
content, to PDF. It uses the Beautiful Soup library, v4, to 
parse the HTML, and the xtopdf library to generate the PDF output.
Beautiful Soup is at: http://www.crummy.com/software/BeautifulSoup/
xtopdf is at: https://bitbucket.org/vasudevram/xtopdf
Guide to using and installing xtopdf: http://jugad2.blogspot.in/2012/07/guide-to-installing-and-using-xtopdf.html
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from bs4 import BeautifulSoup
from PDFWriter import PDFWriter

def usage():
    sys.stderr.write("Usage: python " + sys.argv[0] + " html_file pdf_file\n")
    sys.stderr.write("which will extract only the text from html_file and\n")
    sys.stderr.write("write it to pdf_file\n")

def main():

    # Create some HTML for testing conversion of its text to PDF.
    html_doc = """
    <html>
        <head>
            <title>
            Test file for HTMLTextToPDF
            </title>
        </head>
        <body>
        This is text within the body element but outside any paragraph.
        <p>
        This is a paragraph of text. Hey there, how do you do?
        The quick red fox jumped over the slow blue cow.
        </p>
        <p>
        This is another paragraph of text.
        Don't mind what it contains.
        What is mind? Not matter.
        What is matter? Never mind.
        </p>
        This is also text within the body element but not within any paragraph.
        </body>
    </html>
    """

    pw = PDFWriter("HTMLTextTo.pdf")
    pw.setFont("Courier", 10)
    pw.setHeader("Conversion of HTML text to PDF")
    pw.setFooter("Generated by xtopdf: http://slid.es/vasudevram/xtopdf")
 
    # Use method chaining this time.
    # Name the parser explicitly; otherwise Beautiful Soup picks one and warns.
    for line in BeautifulSoup(html_doc, "html.parser").get_text().split("\n"):
        pw.writeLine(line)
    pw.savePage()
    pw.close()

if __name__ == '__main__':
    main()
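
The usage() function above describes a command line of the form "html_file pdf_file", but main() converts only the built-in test HTML. As a minimal sketch of how the two arguments might be picked up, the helper below checks the argument count and returns the two file names. The name parse_args and the fallback program name are hypothetical, introduced here for illustration; they are not part of the original recipe or of xtopdf.

```python
import sys

def parse_args(argv):
    """Return (html_file, pdf_file) from argv, or None if the count is wrong.

    Hypothetical helper: mirrors the usage() message in the recipe.
    """
    if len(argv) != 3:
        # Fall back to the recipe's file name if argv is empty (assumption).
        prog = argv[0] if argv else "HTMLTextToPDF.py"
        sys.stderr.write("Usage: python " + prog + " html_file pdf_file\n")
        return None
    return argv[1], argv[2]
```

With something like this in place, main() could call parse_args(sys.argv), read the HTML from the first file name, and pass the second one to PDFWriter instead of the fixed "HTMLTextTo.pdf".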

This recipe shows how to convert the text in an HTML document to PDF, using the Beautiful Soup and xtopdf libraries for Python. It can be useful because it provides a quick way of showing the text of an HTML document in PDF format, without the extra work that a more complex solution may require. Depending on the kind of HTML text content, the output may be close to the final result needed, or it may serve as an initial or intermediate draft.
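
When the output is only a draft, one common rough edge is that the extracted text keeps the HTML source's indentation, runs of blank lines, and lines of any length. The sketch below shows one possible way to tidy the text before writing it out, using only the standard library's textwrap module. The helper name prepare_lines and the default width of 80 characters are assumptions made for illustration; the width is not a value taken from xtopdf.

```python
import textwrap

def prepare_lines(text, width=80):
    """Turn extracted text into clean, wrapped lines (hypothetical helper)."""
    lines = []
    previous_blank = True  # start as True so leading blank lines are dropped
    for raw in text.split("\n"):
        stripped = raw.strip()  # remove indentation carried over from the HTML
        if not stripped:
            # Keep at most one blank line between blocks of text.
            if not previous_blank:
                lines.append("")
            previous_blank = True
            continue
        # Wrap long lines so each output line fits within the chosen width.
        lines.extend(textwrap.wrap(stripped, width=width))
        previous_blank = False
    # Drop a trailing blank line, if any.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

In the recipe, the string returned by get_text() could be passed through a helper like this, and each resulting line handed to pw.writeLine() in place of the raw split on newlines.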

1 comment

Vasudev Ram (author) 9 years, 2 months ago

More details and sample output available at this blog post:

http://jugad2.blogspot.in/2015/01/html-text-to-pdf-with-beautiful-soup.html