This is a complete program that reads an HTML document and converts it to plain ASCII text. In the spirit of minimalism, it operates as a standard Unix filter, e.g. htmltotext < foo.html > foo.txt
If the output is going to a terminal, then bold and underline are displayed on the terminal. Italics in HTML are mapped to underlining on the tty. Underlining in HTML is ignored (mostly due to laziness).
#!/usr/bin/env python
# htmltotext
# Python 2 only: the htmllib module does not exist in Python 3.
import sys, os, htmllib, formatter

# Terminal control codes, looked up once via tput.
bold = os.popen('tput bold').read()
underline = os.popen('tput smul').read()
reset = os.popen('tput sgr0').read()

class TtyFormatter(formatter.AbstractFormatter):
    def __init__(self, writer):
        formatter.AbstractFormatter.__init__(self, writer)
        self.fontStack = []
        self.fontState = (0,0)

    def push_font(self, font):
        size, italic, bold, tt = font
        self.fontStack.append((italic, bold))
        self.updateFontState()

    def pop_font(self, *args):
        try: self.fontStack.pop()
        except IndexError: pass
        self.updateFontState()

    def updateFontState(self):
        try: newState = self.fontStack[-1]
        except IndexError: newState = (0,0)
        if self.fontState != newState:
            # Write the codes directly; "print x," would add stray spaces.
            sys.stdout.write(reset)
            if newState[0]: sys.stdout.write(underline)
            if newState[1]: sys.stdout.write(bold)
            self.fontState = newState

myWriter = formatter.DumbWriter()
# Only emit terminal codes when stdout is a tty.
if sys.stdout.isatty():
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)
myParser.feed(sys.stdin.read())
myParser.close()
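Since htmllib is not available in Python 3, here is a rough sketch of the same font-stack idea on top of html.parser. It is not the recipe's code: the names FontTrackingParser, html_to_text, ITALIC_TAGS and BOLD_TAGS are made up for illustration, the tag sets are an assumption, nested tags are assumed to inherit the enclosing style, and there is no word wrapping or whitespace handling like DumbWriter provides.

```python
from html.parser import HTMLParser

# Hypothetical tag sets, chosen for illustration only.
ITALIC_TAGS = {"i", "em"}
BOLD_TAGS = {"b", "strong"}


class FontTrackingParser(HTMLParser):
    """Collect text while tracking a stack of (italic, bold) states."""

    def __init__(self, bold="", underline="", reset=""):
        super().__init__(convert_charrefs=True)
        self.bold = bold
        self.underline = underline
        self.reset = reset
        self.font_stack = []
        self.font_state = (0, 0)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ITALIC_TAGS or tag in BOLD_TAGS:
            # Assumption: a nested tag inherits the enclosing style.
            italic, bold = self.font_stack[-1] if self.font_stack else (0, 0)
            if tag in ITALIC_TAGS:
                italic = 1
            if tag in BOLD_TAGS:
                bold = 1
            self.font_stack.append((italic, bold))
            self.update_font_state()

    def handle_endtag(self, tag):
        if tag in ITALIC_TAGS or tag in BOLD_TAGS:
            if self.font_stack:
                self.font_stack.pop()
            self.update_font_state()

    def update_font_state(self):
        new_state = self.font_stack[-1] if self.font_stack else (0, 0)
        if new_state != self.font_state:
            # Reset first, then switch on whatever the new state needs.
            self.out.append(self.reset)
            if new_state[0]:
                self.out.append(self.underline)  # italics map to underline
            if new_state[1]:
                self.out.append(self.bold)
            self.font_state = new_state

    def handle_data(self, data):
        self.out.append(data)


def html_to_text(html, bold="", underline="", reset=""):
    """Return the text of html, with the given codes around styled runs."""
    parser = FontTrackingParser(bold, underline, reset)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

To use it as a filter, something like sys.stdout.write(html_to_text(sys.stdin.read())) should do; pass the terminal codes only when sys.stdout.isatty() is true, as the recipe does.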
The tput Unix command is used to get the control codes for the terminal. I think it is commonly available, but I haven't run it on many platforms. The basic AbstractFormatter should work everywhere.
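For platforms where tput may be missing, one possible approach is to fall back to an empty string so the output simply comes out unstyled. This is a Python 3 sketch under that assumption; term_code is a hypothetical helper name, not part of the recipe.

```python
import shutil
import subprocess


def term_code(capability):
    """Return the terminal code for a terminfo capability, or '' on failure."""
    # No tput on this system: degrade to unstyled output.
    if shutil.which("tput") is None:
        return ""
    try:
        result = subprocess.run(
            ["tput", capability],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        return ""
    # tput exits non-zero when the lookup fails (for example, TERM is unset).
    if result.returncode != 0:
        return ""
    return result.stdout.decode("ascii", "replace")


bold = term_code("bold")
underline = term_code("smul")
reset = term_code("sgr0")
```

With this, the three codes are always strings, so the rest of the program does not need to care whether tput was found.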
Minor bugfix needed? Shouldn't that be self.fontStack.append((italic, bold))?
Fixed. The append(a,b) syntax used to work, though it probably should have been append((a,b)) from the beginning. In any case, I've fixed the bug. Thanks!
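For anyone wondering about the difference, here is a small Python 3 illustration (the variable names are only for the example): append takes exactly one argument, so the pair has to be wrapped in a tuple.

```python
font_stack = []
italic, bold = 1, 0

# One argument: a single (italic, bold) tuple is pushed onto the stack.
font_stack.append((italic, bold))

# Two arguments: rejected with TypeError, since append() takes one argument.
try:
    font_stack.append(italic, bold)
    two_arg_error = None
except TypeError as exc:
    two_arg_error = exc
```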