Strips XML/HTML Tags from string « Python recipes

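Completely gets rid of any tags in XML/HTML input. It gives you the same text minus the tags. The algorithm is rather simple: it scans the text character by character and deletes everything from each left-angle bracket up to and including the next right-angle bracket.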

#!/usr/bin/python

# Routine by Micah D. Cochran
# Submitted on 26 Aug 2005
# This routine may be placed under any Open Source license (GPL, BSD, LGPL, etc.)
# or any proprietary license. Effectively this routine is in the public domain. Please attribute where appropriate.

def strip_ml_tags(in_text):
	"""Description: Removes all HTML/XML-like tags from the input text.
	Inputs: in_text --> string of text
	Outputs: text string without the tags
	
	# doctest unit testing framework

	>>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
	>>> strip_ml_tags(test_text)
	'Keep this Text  KEEP  123'
	"""
	# convert in_text to a mutable object (e.g. list)
	s_list = list(in_text)
	i = 0
	
	while i < len(s_list):
		# iterate until a left-angle bracket is found
		if s_list[i] == '<':
			while i < len(s_list) and s_list[i] != '>':
				# pop everything from the left-angle bracket until the right-angle bracket
				s_list.pop(i)

			# pops the right-angle bracket, too (if the tag was closed at all)
			if i < len(s_list):
				s_list.pop(i)
		else:
			i=i+1
			
	# convert the list back into text
	join_char=''
	return join_char.join(s_list)

if __name__ == '__main__':
	import doctest
	doctest.testmod()

      

This might break not only on badly formed HTML/XML but also on well-formed HTML. I have not explored many of the implications.

I found it worked for a web crawler I created. In that application it got rid of the bulk of the HTML, and I had to do some more filtering afterwards.
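
A few inputs illustrate how bracket-to-bracket stripping can go wrong even when the markup is well formed. The sketch below uses the regular expression `<[^>]*>` as a stand-in for the recipe. This is an assumption: it removes the same bracket-to-bracket spans for closed tags, but it is not the recipe's code, and the helper name `strip_between_brackets` is hypothetical.

```python
import re

def strip_between_brackets(text):
    # Stand-in for the recipe: delete everything from a '<' to the next '>'.
    return re.sub(r'<[^>]*>', '', text)

# A '>' inside a quoted attribute value ends the "tag" too early.
print(repr(strip_between_brackets('<a title="x > y">link</a>')))

# Text that is not markup at all loses content between the operators.
print(repr(strip_between_brackets('if a < b and c > d: pass')))

# Character entities are left untouched, so "&lt;" stays escaped.
print(repr(strip_between_brackets('<p>1 &lt; 2</p>')))
```

Handling quoted attributes, comments, and entities correctly probably calls for a real parser rather than more special cases in the scanner.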

6 comments

Dinu Gherman 18 years, 8 months ago

Why so long? How about this (not much tested, but you get the idea, I suppose)? This uses the unittest module with your own sample text.

import re
import unittest

class TagStrippingTest(unittest.TestCase):
    def test(self):
        "Test replacing HTML-like tags from text."
        inpText = "Keep this Text <remove><me /> KEEP </remove> 123"
        expText = "Keep this Text  KEEP  123"
        # matches only bare tags such as <x>, </x>, <x />; tags with attributes are not matched
        t = re.sub(r"< */? *\w+ */? *>", "", inpText)   ### here's the meat!
        self.assertEqual(t, expText)

Nick Matsakis 18 years, 8 months ago

This could have serious problems with HTML comments. What I use to strip HTML tags is the following pair of regular expressions. This doesn't work on all web pages I've tried and definitely doesn't implement the correct SGML comment syntax (which is very subtle, see the acid2 test for details), but it gets me by in a pinch.

import re
HTMLtag = re.compile('<.*?>')      # Matches HTML tags
HTMLcom = re.compile('<!--.*?-->') # Matches HTML comments
resultstr = HTMLtag.sub('', HTMLcom.sub('', sourcestr))
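
A minimal sketch of this two-pass idea, with the angle brackets written out. Two things here are assumptions that go beyond the comment: the `re.DOTALL` flag, added so that comments and tags spanning several lines still match, and the names `HTML_COMMENT`, `HTML_TAG`, and `strip_comments_then_tags`, which are hypothetical. Like the original, it does not attempt the full SGML comment syntax.

```python
import re

# re.DOTALL lets '.' match newlines, so multi-line comments and tags are caught.
HTML_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
HTML_TAG = re.compile(r'<.*?>', re.DOTALL)

def strip_comments_then_tags(source):
    # Comments go first; otherwise markup inside a comment would leak its text.
    return HTML_TAG.sub('', HTML_COMMENT.sub('', source))

sample = 'Hello <!-- a <b>hidden</b>\ncomment --> <i>world</i>'
print(repr(strip_comments_then_tags(sample)))
print(repr(HTML_TAG.sub('', sample)))  # tags only: comment text survives
```

The second call shows why the order matters: with only the tag pattern applied, the text inside the comment ends up in the output.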

Josiah Carlson 18 years, 8 months ago

Why reinvent the wheel?

>>> test_text = "Keep this Text  KEEP  123"
>>> import HTMLParser
>>> class MLStripper(HTMLParser.HTMLParser):
...     def __init__(self):
...         self.reset()
...         self.fed = []
...     def handle_data(self, d):
...         self.fed.append(d)
...     def get_fed_data(self):
...         return ''.join(self.fed)
...
>>> x = MLStripper()
>>> x.feed(test_text)
>>> x.get_fed_data()
'Keep this Text  KEEP  123'
>>>

Using HTMLParser rather than sgmllib is preferable because it doesn't die on unmatched tags, etc. Also, using this particular module rather than a custom re-based parser will allow you to build applications that do things other than strip HTML/XML/SGML/...
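
The session above targets Python 2, where the module is called `HTMLParser`. A rough Python 3 equivalent might look like the sketch below, assuming the standard `html.parser` module. The `strip_tags` wrapper is a hypothetical convenience function, not part of the comment, and the base-class `__init__` is called explicitly because the Python 3 parser sets up state there that `reset()` alone does not.

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp; in the text
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # called for the text found between tags
        self.fed.append(d)

    def get_fed_data(self):
        return ''.join(self.fed)

def strip_tags(text):
    # hypothetical convenience wrapper, not part of the original comment
    stripper = MLStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_fed_data()

print(repr(strip_tags("Keep this Text <remove><me /> KEEP </remove> 123")))
```

Note that `handle_data` also receives the contents of `script` and `style` elements, so a crawler would likely still need extra filtering.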

grosser.meister.morti 18 years, 8 months ago

Nice but inefficient. Maybe using HTMLParser is the best solution, but if you do not want to use it, I think I know a better way to do this. Your code has a performance of O(n^2) (never access members of a list by their position, use iterators!); the following code has a performance of O(n) (and it shows you the beauty of Python):

def stripTags(s):
    # This list is necessary because chk() would otherwise not know
    # that intag in stripTags() is meant, and not a new intag variable in chk().
    intag = [False]

    def chk(c):
        if intag[0]:
            intag[0] = (c != '>')
            return False
        elif c == '<':
            intag[0] = True
            return False
        return True

    return ''.join(c for c in s if chk(c))
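
The same single-pass idea can be sketched without the one-element list by keeping the flag in a plain loop. This is an illustrative variant, not the commenter's code, and the name `strip_tags_linear` is hypothetical. Each character is visited once and appended to a list at most once, so the work grows linearly with the input length.

```python
def strip_tags_linear(s):
    out = []          # characters that lie outside any tag
    in_tag = False    # True while scanning between '<' and '>'
    for c in s:
        if in_tag:
            # stay inside the tag until the closing bracket is seen
            in_tag = (c != '>')
        elif c == '<':
            in_tag = True
        else:
            out.append(c)
    return ''.join(out)

print(repr(strip_tags_linear("Keep this Text <remove><me /> KEEP </remove> 123")))
```

Unlike the original recipe, an unclosed `<` does not raise an error here; everything after it is simply dropped.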

Josiah Carlson 18 years, 7 months ago

Gah with html escapes. Here's that test with the proper portions of tags escaped.

>>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
>>> import HTMLParser
>>> class MLStripper(HTMLParser.HTMLParser):
...     def __init__(self):
...         self.reset()
...         self.fed = []
...     def handle_data(self, d):
...         self.fed.append(d)
...     def get_fed_data(self):
...         return ''.join(self.fed)
...
>>> x = MLStripper()
>>> x.feed(test_text)
>>> x.get_fed_data()
'Keep this Text  KEEP  123'
>>>

rodrigo culagovski 16 years ago

K.I.S. simple and legible:

import re
def StripHTML (html):
    reg = re.compile(r'


Strips XML/HTML Tags from string (Python recipe) by Micah Cochran
ActiveState Code (http://code.activestate.com/recipes/440481/)
