
Convert HTML to plain text (Python recipe) by nillgump nillgump
ActiveState Code (http://code.activestate.com/recipes/576657/)

Converts HTML to plain text.

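# coding: utf-8

# HTMLParser lives in html.parser on Python 3 and in HTMLParser on Python 2.
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser


class MyHtmlparser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.lidata = []  # text fragments, in document order
        self.dic = {}     # records each tag seen and its attributes

    def handle_data(self, data):
        self.lidata.append(data)

    def handle_starttag(self, tag, attrs):
        # A later occurrence of a tag overwrites the attributes stored earlier.
        self.dic[tag] = attrs

    def handle_endtag(self, tag):
        pass


mydir = "c://html//"
# No encoding is passed, so Python 3 decodes the file with the platform default.
with open(mydir + "1.htm") as f:
    in_data = f.read()

my = MyHtmlparser()
my.feed(in_data)

# Print every text fragment, then every tag name that was seen.
for i in my.lidata:
    print(i)

for i in my.dic:
    print(i)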
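
The recipe prints each text fragment on its own line, and that includes the contents of <script> and <style> elements. Below is a minimal sketch of one way to return a single plain-text string instead. It assumes Python 3 (where the parser is in html.parser); the names TextExtractor, SKIPPED_TAGS, and html_to_text are hypothetical and not part of the original recipe.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}  # hypothetical choice of tags to drop

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the text of `markup` with whitespace runs collapsed to one space."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return " ".join("".join(parser.chunks).split())


sample = (
    "<html><head><style>p {color: red}</style></head><body><h1>Title</h1>\n"
    "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # prints: Title Hello & world
```

Fragments are joined without a separator, so text from adjacent elements runs together unless the source HTML has whitespace between them. Inserting a newline when a block-level tag closes would be one way to handle that, at the cost of maintaining a list of such tags.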

Tags: html, text

Created by nillgump nillgump on Sat, 21 Feb 2009 (MIT)


Required Modules

(none specified)

Other Information and Tasks

Licensed under the MIT License
Viewed 5202 times
Revision 1
