Text generated on different platforms uses different newline conventions, which gets in the way when you compare it. This recipe normalizes any string to use Unix-style newlines.

This code is used in the TestOOB unit testing framework (http://testoob.sourceforge.net).

Python, 3 lines
def _normalize_newlines(string):
    import re
    return re.sub(r'(\r\n|\r|\n)', '\n', string)

I've tested this on POSIX and Windows. Anyone with an old Mac care to try it? :-)


Andreas Kloss 16 years, 4 months ago

Speed up by precompiling the regular expression. At the expense of one more line (and of adding the re module and the compiled regular expression to your namespace), you can gain some speed by pulling almost everything out of the function. On my PC, with the contents of a random Python script as input, it finishes in a third of the time. Of course, this pays off most if you call the function often.

import re
_newlines_re = re.compile(r'(\r\n|\r|\n)')
def _normalize_newlines(string):
    return _newlines_re.sub('\n', string)

wobsta 16 years, 4 months ago

Don't use regular expressions when they aren't really needed. Two replace calls are even better:

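#!/usr/bin/env python
import profile, re, random

# Test input: a random mix of spaces, '\n' and '\r'
s = "".join([random.choice(" \n\r") for i in range(10000)])

def use_re_sub():
    global r1
    for i in range(1000):
        r1 = re.sub(r'(\r\n|\r|\n)', '\n', s)

_newlines_re = re.compile(r'(\r\n|\r|\n)')
def use_re_compile():
    global r2
    for i in range(1000):
        r2 = _newlines_re.sub('\n', s)

def use_replace():
    global r3
    for i in range(1000):
        r3 = s.replace('\r\n', '\n').replace('\r', '\n')

# Profile each variant (run this file as a script); the calls also set
# r1, r2 and r3, which the assert below needs
profile.run("use_re_sub()")
profile.run("use_re_compile()")
profile.run("use_replace()")

# All three approaches must produce the same result
assert r1 == r2 == r3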

The last version is several times faster (of course this also depends on the string you convert).
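
To see how much the input matters, a timing comparison can be sketched with the standard `timeit` module on two kinds of strings. The helper names (`with_re_sub`, `with_compiled`, `with_replace`, `compare`) and the `dense` and `sparse` inputs are assumptions made up for this illustration, and the numbers will differ from machine to machine.

```python
import random
import re
import timeit

_newlines_re = re.compile(r'(\r\n|\r|\n)')

def with_re_sub(s):
    return re.sub(r'(\r\n|\r|\n)', '\n', s)

def with_compiled(s):
    return _newlines_re.sub('\n', s)

def with_replace(s):
    return s.replace('\r\n', '\n').replace('\r', '\n')

def compare(s, number=20):
    """Return {name: seconds} for `number` runs of each approach on s."""
    candidates = {
        "re.sub": with_re_sub,
        "compiled": with_compiled,
        "replace": with_replace,
    }
    return {
        name: timeit.timeit(lambda f=func: f(s), number=number)
        for name, func in candidates.items()
    }

random.seed(0)
# Dense: roughly two thirds of the characters are newline characters.
dense = "".join(random.choice(" \n\r") for _ in range(10000))
# Sparse: ordinary 80-column lines with Windows line endings.
sparse = ("x" * 79 + "\r\n") * 125

for label, text in (("dense", dense), ("sparse", sparse)):
    # All three approaches must agree before their timings are compared.
    assert with_re_sub(text) == with_compiled(text) == with_replace(text)
    print(label, compare(text))
```

The dense input mirrors the random string used above; the sparse one is closer to real text files, where newline characters are a small fraction of the data.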

Ori Peleg (author) 16 years, 4 months ago

Good point. When this function shows up in my profiler I'll probably do this.

Until it does, I prefer the greater readability -- in my eyes -- of not precompiling the expression.

Ori Peleg (author) 16 years, 4 months ago

The regular expression isn't there for special features. It's there for readability.

Replacing (\r\n|\r|\n) with whatever (I arbitrarily chose '\n') sits in my mind fairly well. And I understand that regex at a single glance.

I agree that using two replaces, noting that '\n' need not be replaced with '\n', is both efficient and clever.

I'll probably stay with the regex, though, because I find it easier to understand, and it isn't a performance hit in my application yet.
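
One detail the two-replace version depends on is the order of the calls: '\r\n' has to be handled before the lone '\r'. A small sketch of what goes wrong otherwise, with the function names `crlf_first` and `cr_first` made up for illustration:

```python
def crlf_first(s):
    # Collapse Windows line endings first, then any remaining lone '\r'.
    return s.replace('\r\n', '\n').replace('\r', '\n')

def cr_first(s):
    # Wrong order: each '\r\n' becomes '\n\n' in the first call, so the
    # second call finds nothing left to replace.
    return s.replace('\r', '\n').replace('\r\n', '\n')

text = "a\r\nb"
assert crlf_first(text) == "a\nb"
assert cr_first(text) == "a\n\nb"   # one line break has turned into two
```

The regular expression avoids the same trap by listing `\r\n` before `\r` in the alternation.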

Created by Ori Peleg on Sat, 2 Jul 2005 (PSF)