Remove control character ^M from opened html files (Python recipe) by Liang Guo
ActiveState Code (http://code.activestate.com/recipes/286229/)

I used a URLopener to fetch HTML files from some websites for parsing. However, the returned data had ^M (carriage return) characters everywhere, which was pretty annoying. Before parsing such a file, I want to strip all occurrences of this control character. Of course, I could use dos2unix or a similar tool to do that offline, but I want to do it the Pythonic way.

First, I need to find out which ASCII character '^M' stands for.

>>> import curses.ascii
>>> curses.ascii.ascii('^V^M')  # ^V^M: press Ctrl-V, then Ctrl-M, to enter a literal ^M
'\r'
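
The same answer can be reached without the curses module, which standard Windows builds of Python do not ship. The sketch below relies only on built-ins and on the caret-notation convention that ^X names the control character whose code is that of X with bit 0x40 flipped. The helper name caret_to_char is made up for this illustration and is not part of the recipe.

```python
# Caret notation: ^X stands for the control character whose code is
# ord('X') with bit 0x40 flipped. caret_to_char is a hypothetical helper.
def caret_to_char(letter):
    """Return the control character written as ^<letter>."""
    return chr(ord(letter.upper()) ^ 0x40)

cr = caret_to_char('M')
print(repr(cr))   # shows the escaped form of the character
print(ord(cr))    # shows its numeric ASCII code
```

Under that convention, ^M comes out as the carriage return, ASCII code 13, which Python writes as '\r'.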

Then I can simply replace every '\r' in a string (called text here) with the empty string.

>>> text.replace( '\r', '' )
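
One thing to watch: deleting every '\r' also deletes carriage returns that are not followed by '\n', so text that uses a bare '\r' as its line ending gets its lines joined together. The sketch below contrasts the recipe's approach with a variant that converts both kinds of line ending to '\n'. The function names strip_cr and normalize_newlines and the sample string are assumptions for illustration only.

```python
def strip_cr(text):
    """Remove every carriage return, as the recipe does."""
    return text.replace('\r', '')

def normalize_newlines(text):
    """Turn CRLF and lone CR line endings into LF."""
    # Handle CRLF first so it does not become two newlines.
    return text.replace('\r\n', '\n').replace('\r', '\n')

sample = 'one\r\ntwo\rthree\n'
print(repr(strip_cr(sample)))            # 'two' and 'three' end up joined
print(repr(normalize_newlines(sample)))  # every line stays separate
```

If the data is a bytes object rather than a str, the same calls work with b'\r' and b'\n' literals.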

In my code, I just have this line in the overridden handle_data method of my HTML parser class.

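from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class Stripper( SGMLParser ) :
    ...

    def handle_data( self, data ) :
        # Drop carriage returns before any further processing.
        data = data.replace( '\r', '' )
        ...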
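
The sgmllib module that provides SGMLParser no longer exists in Python 3. A rough equivalent, offered only as a sketch, is to subclass html.parser.HTMLParser and do the same replacement in handle_data. The chunks attribute and the sample markup are assumptions added for illustration; the recipe does not say what its parser does with the cleaned data.

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Collects text content with carriage returns removed."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # hypothetical storage for the cleaned text

    def handle_data(self, data):
        # Same idea as the recipe: drop '\r' before using the data.
        self.chunks.append(data.replace('\r', ''))

parser = Stripper()
parser.feed('<p>first line\r\nsecond line\r\n</p>')
parser.close()
text = ''.join(parser.chunks)
print(repr(text))
```

Doing the replacement inside handle_data keeps the cleanup next to the parsing logic, at the cost of running it once per text chunk instead of once per document.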

      

Tags: text

Created by Liang Guo on Mon, 12 Jul 2004 (PSF)


Licensed under the PSF License
Revision 1