Completely gets rid of any tags in XML/HTML input. It gives you the same text minus the tags. The algorithm is rather simple.
```python
#!/usr/bin/python
# Routine by Micah D. Cochran
# Submitted on 26 Aug 2005
# This routine is allowed to be put under any license Open Source (GPL, BSD, LGPL, etc.) License
# or any Proprietary License. Effectively this routine is in public domain. Please attribute where appropriate.

def strip_ml_tags(in_text):
    """Description: Removes all HTML/XML-like tags from the input text.
    Inputs: in_text --> string of text
    Outputs: text string without the tags

    # doctest unit testing framework
    >>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
    >>> strip_ml_tags(test_text)
    'Keep this Text  KEEP  123'
    """
    # convert in_text to a mutable object (e.g. list)
    s_list = list(in_text)
    i = 0

    while i < len(s_list):
        # iterate until a left-angle bracket is found
        if s_list[i] == '<':
            # pop everything from the left-angle bracket until the right-angle bracket;
            # the bounds check keeps an unclosed '<' from raising IndexError
            while i < len(s_list) and s_list[i] != '>':
                s_list.pop(i)
            # pops the right-angle bracket, too
            if i < len(s_list):
                s_list.pop(i)
        else:
            i = i + 1

    # convert the list back into text
    join_char = ''
    return join_char.join(s_list)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
```
This might break on badly formed HTML/XML, and it might also break on well-formed HTML. I have not explored many of the implications.
I found it worked for a web crawler I created. In that application it got rid of the bulk of the HTML, and I had to do some more filtering.
Why so long? How about this (not much tested, but you get the idea, I suppose)? This uses the unittest module with your own sample text.
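The code posted with this comment is not preserved here. As a rough sketch of what a shorter version could look like, the following assumes a single regular expression that treats anything between `<` and the next `>` as a tag. The names `SHORT_TAG_RE`, `strip_tags_short`, and `StripTagsTest` are hypothetical, and this is not necessarily what the commenter wrote.

```python
import re
import unittest

# Assumed pattern: '<', then any run of characters other than '>', then '>'.
SHORT_TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_short(text):
    """Remove every <...> span from text (hypothetical helper)."""
    return SHORT_TAG_RE.sub('', text)


class StripTagsTest(unittest.TestCase):
    def test_sample_text(self):
        # Same sample text as the recipe's doctest.
        sample = "Keep this Text <remove><me /> KEEP </remove> 123"
        self.assertEqual(strip_tags_short(sample),
                         "Keep this Text  KEEP  123")
```

Saved as a module, the test case can be run with `python -m unittest`. Like the recipe, this sketch does nothing special for comments or for a `>` inside an attribute value.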
This could have serious problems with HTML comments. What I use to strip HTML tags is the following pair of regular expressions (replace the square brackets in the regex with angle brackets; I had a bear of a time trying to get it to post correctly with angle brackets). This doesn't work on all web pages I've tried and definitely doesn't implement the correct SGML comment syntax (which is very subtle; see the Acid2 test for details), but it gets me by in a pinch.
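The pair of regular expressions from this comment did not survive on this page. The sketch below shows one plausible two-pass approach, under the assumption that comments are plain `<!-- ... -->` spans (which, as noted above, is not the full SGML comment syntax). The patterns and the names `COMMENT_RE`, `ANY_TAG_RE`, and `strip_comments_and_tags` are illustrative, not the commenter's originals.

```python
import re

# Assumption: a comment is "<!--" up to the first "-->", possibly spanning lines.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# Assumption: a tag is '<' up to the next '>'.
ANY_TAG_RE = re.compile(r'<[^>]*>')


def strip_comments_and_tags(text):
    """Drop comments first, then whatever tags remain (hypothetical helper)."""
    without_comments = COMMENT_RE.sub('', text)
    return ANY_TAG_RE.sub('', without_comments)
```

Removing comments first matters: a comment such as `<!-- a > b -->` would otherwise be cut at its first `>` and leave ` b -->` behind in the output.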
Why reinvent the wheel?
Using HTMLParser rather than sgmllib is preferable because it doesn't die on unmatched tags, etc. Also, using this particular module rather than a custom re-based parser will allow you to build applications that do things other than strip HTML/XML/SGML/...
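For readers on Python 3, where the class lives in `html.parser`, a parser-based stripper might look roughly like the sketch below. The names `TextOnly` and `strip_with_parser` are made up for illustration, and this is not the code originally posted in these comments.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores tags and comments."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.chunks.append(data)


def strip_with_parser(text):
    """Return the text content of text with markup removed (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(text)
    parser.close()  # flush any text still buffered at the end of input
    return ''.join(parser.chunks)
```

Because only `handle_data` is overridden, unmatched tags are simply skipped. Note that, unlike the recipe, this version also decodes character references.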
Nice but inefficient. Maybe using HTMLParser is the best solution, but if you don't want to use it, I think I know a better way to do this. Your code has a performance of O(n^2) (never access members of a list by their position, use iterators!); the following code has a performance of O(n) (and it shows you the beauty of Python):
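The commenter's code is missing from this page. A linear-time version can be sketched as a single pass that tracks whether the current character is inside a tag; the name `strip_tags_linear` is hypothetical. In the recipe, each `s_list.pop(i)` shifts the rest of the list, which is where the quadratic cost comes from; appending to an output list avoids that.

```python
def strip_tags_linear(text):
    """Single pass over text, keeping only characters outside <...> spans."""
    out = []
    in_tag = False
    for ch in text:
        if in_tag:
            # Skip everything up to and including the closing '>'.
            if ch == '>':
                in_tag = False
        elif ch == '<':
            in_tag = True
        else:
            out.append(ch)
    return ''.join(out)
```

With this sketch, an unclosed `<` swallows the rest of the input rather than raising an error.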
Gah with html escapes. Here's that test with the proper portions of tags escaped.
K.I.S. simple and legible: