ActiveState Code

Recipe 440481: Strips XML/HTML Tags from string


Completely removes all tags from XML/HTML input, returning the same text minus the tags. The algorithm is rather simple.

Python
#!/usr/bin/python

# Routine by Micah D. Cochran
# Submitted on 26 Aug 2005
# This routine may be placed under any Open Source license (GPL, BSD, LGPL, etc.)
# or any proprietary license. Effectively this routine is in the public domain. Please attribute where appropriate.

def strip_ml_tags(in_text):
	"""Description: Removes all HTML/XML-like tags from the input text.
	Inputs: in_text --> string of text
	Outputs: text string without the tags
	
	# doctest unit testing framework

	>>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
	>>> strip_ml_tags(test_text)
	'Keep this Text  KEEP  123'
	"""
	# convert in_text to a mutable object (e.g. list)
	s_list = list(in_text)
	i = 0

	while i < len(s_list):
		# iterate until a left-angle bracket is found
		if s_list[i] == '<':
			# pop everything from the left-angle bracket until the right-angle bracket;
			# the bounds check keeps an unterminated tag from raising IndexError
			while i < len(s_list) and s_list[i] != '>':
				s_list.pop(i)

			# pop the right-angle bracket, too, if there is one
			if i < len(s_list):
				s_list.pop(i)
		else:
			i = i + 1

	# convert the list back into text
	return ''.join(s_list)

if __name__ == '__main__':
	import doctest
	doctest.testmod()

Discussion

This might break on badly formed HTML/XML, and it might also break on well-formed HTML. I have not explored many of the implications.

I found it worked for a web crawler I created. In that application it got rid of the bulk of the HTML, though I still had to do some more filtering.
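
The caveat above can be made concrete. The following is a minimal sketch, not part of the original recipe: `naive_strip` is a hypothetical, compact stand-in for the bracket scan in `strip_ml_tags`, and both sample inputs are made up for illustration. It shows two well-formed inputs where dropping everything between a '<' and the next '>' leaves markup behind.

```python
def naive_strip(text):
    """Hypothetical stand-in for the recipe's scan: drop everything
    from each '<' up to and including the next '>'."""
    out = []
    in_tag = False
    for ch in text:
        if in_tag:
            # Still inside a tag: skip characters until the closing '>'.
            in_tag = (ch != '>')
        elif ch == '<':
            in_tag = True
        else:
            out.append(ch)
    return ''.join(out)


# A '>' inside a quoted attribute value ends the "tag" too early.
attr_case = naive_strip('<a title="x > y">link</a>')

# A '>' inside an HTML comment leaves the rest of the comment behind.
comment_case = naive_strip('<!-- a > b -->text')

print(repr(attr_case))     # ' y">link'
print(repr(comment_case))  # ' b -->text'
```

Inputs like these are where a real parser earns its keep; for a crawler that only needs the bulk of the markup gone, the simple scan may still be good enough.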

Comments

  1. At 5:25 p.m. on 26 aug 2005, Dinu Gherman said:

    Why so long? How about this (not much tested, but you get the idea, I suppose)? This uses the unittest module with your own sample text.

    import re
    import unittest

    class TagStrippingTest(unittest.TestCase):
        def test(self):
            "Test replacing HTML-like tags from text."
            inpText = "Keep this Text <remove><me /> KEEP </remove> 123"
            expText = "Keep this Text  KEEP  123"
            t = re.sub(r"< */? *\w+ */?\ *>", "", inpText)   ### here's the meat!
            self.assertEqual(t, expText)
    
  2. At 9:11 p.m. on 26 aug 2005, Nick Matsakis said:

    This could have serious problems with HTML comments. What I use to strip HTML tags is the following pair of regular expressions. This doesn't work on all web pages I've tried and definitely doesn't implement the correct SGML comment syntax (which is very subtle, see the acid2 test for details), but it gets me by in a pinch.

    import re
    HTMLtag = re.compile('<.*?>')      # Matches HTML tags
    HTMLcom = re.compile('<!--.*?-->') # Matches HTML comments
    resultstr = HTMLtag.sub('', HTMLcom.sub('', sourcestr))
    
  3. At 3:34 p.m. on 27 aug 2005, Josiah Carlson said:

    Why reinvent the wheel?

    >>> test_text = "Keep this Text  KEEP  123"
    >>> import HTMLParser
    >>> class MLStripper(HTMLParser.HTMLParser):
    ...     def __init__(self):
    ...         self.reset()
    ...         self.fed = []
    ...     def handle_data(self, d):
    ...         self.fed.append(d)
    ...     def get_fed_data(self):
    ...         return ''.join(self.fed)
    ...
    >>> x = MLStripper()
    >>> x.feed(test_text)
    >>> x.get_fed_data()
    'Keep this Text  KEEP  123'
    >>>
    

    Using HTMLParser rather than sgmllib is preferable because it doesn't die on unmatched tags, etc. Also, using this particular module rather than a custom re-based parser will allow you to build applications that do things other than strip HTML/XML/SGML/...

  4. At 8:38 a.m. on 30 aug 2005, Anonymous said:

    Nice but inefficient. Maybe using HTMLParser is the best solution, but if you don't want to use it, I think I know a better way to do this. Your code has a performance of O(n^2) (never access members of a list by their position, use iterators!); the following code has a performance of O(n) (and it shows you the beauty of Python):

    def stripTags(s):
        # this list is necessary because chk() would otherwise not know
        # that intag in stripTags() is meant, and not a new intag variable in chk().
        intag = [False]
    
        def chk(c):
            if intag[0]:
                intag[0] = (c != '>')
                return False
            elif c == '<':
                intag[0] = True
                return False
            return True
    
        return ''.join(c for c in s if chk(c))
    
  5. At 8:27 p.m. on 27 sep 2005, Josiah Carlson said:

    Gah with html escapes. Here's that test with the proper portions of tags escaped.

    >>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
    >>> import HTMLParser
    >>> class MLStripper(HTMLParser.HTMLParser):
    ...     def __init__(self):
    ...         self.reset()
    ...         self.fed = []
    ...     def handle_data(self, d):
    ...         self.fed.append(d)
    ...     def get_fed_data(self):
    ...         return ''.join(self.fed)
    ...
    >>> x = MLStripper()
    >>> x.feed(test_text)
    >>> x.get_fed_data()
    'Keep this Text  KEEP  123'
    >>>
    
  6. At 6:19 p.m. on 7 apr 2008, rodrigo culagovski said:

    K.I.S. simple and legible:

    import re
    def StripHTML (html):
        reg = re.compile(r'
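
The `MLStripper` class in comment 3 targets the Python 2 `HTMLParser` module. A rough Python 3 equivalent might look like the sketch below, which assumes the standard-library `html.parser` module. `strip_tags` is a hypothetical convenience wrapper that does not appear in the thread, and the second sample input is made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class MLStripper(HTMLParser):
    """Collects only the text nodes that the parser reports."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain
        # characters before they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_fed_data(self):
        return ''.join(self.fed)


def strip_tags(text):
    """Hypothetical wrapper: feed the text and return what was collected."""
    stripper = MLStripper()
    stripper.feed(text)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_fed_data()


print(repr(strip_tags("Keep this Text <remove><me /> KEEP </remove> 123")))
print(repr(strip_tags('<!-- a > b --><a title="x > y">link</a>')))
```

Because the parser understands comments and quoted attribute values, the second input should come out as just the link text, which is the kind of case the character-by-character approaches in this thread can get wrong.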
