A regular expression that matches Python string literals. Triple-quoted, unicode, and raw strings are supported.

Python, 55 lines
import re

# A regular expression that matches Python string literals.
# Triple-quoted, unicode, and raw strings are supported.  This
# regular expression should be compiled with the re.VERBOSE flag.
PY_STRING_LITERAL_RE = (r"""
[uU]?[rR]?
  (?:              # Single-quote (') strings
  '''(?:                 # Triple-quoted can contain...
      [^']               | # a non-quote
      \\'                | # a backslashed quote
      '{1,2}(?!')          # one or two quotes
    )*''' |
  '(?:                   # Non-triple quoted can contain...
     [^']                | # a non-quote
     \\'                   # a backslashed quote
   )*'(?!') | """+
r'''               # Double-quote (") strings
  """(?:                 # Triple-quoted can contain...
      [^"]               | # a non-quote
      \\"                | # a backslashed quote
      "{1,2}(?!")          # one or two quotes
    )*""" |
  "(?:                   # Non-triple quoted can contain...
     [^"]                | # a non-quote
     \\"                   # a backslashed quote
   )*"(?!")
)''')

# Example use case:
def replace_identifier(s, old, new):
    """
    Replace any occurrence of the Python identifier `old` with `new` in
    the given string `s` -- but do *not* modify any occurrences of
    `old` that occur inside of string literals or comments.  This
    could be used, e.g., for variable renaming.
    """
    # A regexp that matches comments, strings, and `old`.
    comment_re = r'\#.*'
    regexp = re.compile(r'(?x)%s|%s|(?P<old>\b%s\b)' %
                        (comment_re, PY_STRING_LITERAL_RE, re.escape(old)))

    # A callback used to find the replacement value for each match.
    def repl(match):
        if match.group('old'):
            # We matched `old`:
            return new
        else:
            # We matched a comment or string literal:
            return match.group()

    # Find all regexp matches, and use `repl()` to find the replacement
    # value for each.  Since re.sub only replaces leftmost
    # non-overlapping occurrences, occurrences of `old` inside strings
    # or comments will be matched as part of that string or comment,
    # and so won't be changed.
    return regexp.sub(repl, s)

For quick-and-dirty processing of Python source files, it can be convenient to have a regular expression that matches Python string literals. This recipe provides such a regular expression. The example use case above shows how it could be used to rename a variable in a Python file.
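
As a rough illustration of the renaming use case, the sketch below runs the example function on a made-up two-line program. The pattern is a compact copy of the recipe's (comments trimmed so the snippet runs on its own), the callback is folded into a lambda, and the `source` text and the names `count` and `total` are invented for the demonstration.

```python
import re

# Compact copy of the recipe's pattern (comments trimmed) so that this
# snippet runs on its own.  Compile it with re.VERBOSE or the (?x) flag.
PY_STRING_LITERAL_RE = (r"""
[uU]?[rR]?
  (?:
  '''(?: [^'] | \\' | '{1,2}(?!') )*''' |
  '(?: [^'] | \\' )*'(?!') | """ +
r'''
  """(?: [^"] | \\" | "{1,2}(?!") )*""" |
  "(?: [^"] | \\" )*"(?!")
)''')

def replace_identifier(s, old, new):
    """Rename `old` to `new` in `s`, skipping strings and comments."""
    regexp = re.compile(r'(?x)\#.*|%s|(?P<old>\b%s\b)' %
                        (PY_STRING_LITERAL_RE, re.escape(old)))
    # Keep comments and strings as they are; replace bare identifiers.
    return regexp.sub(
        lambda m: new if m.group('old') else m.group(), s)

# Hypothetical two-line program, used only for illustration.
source = 'count = 1  # count starts at one\nprint("count is", count)\n'
renamed = replace_identifier(source, "count", "total")
print(renamed)
```

Only the two bare uses of `count` should change; the occurrences inside the comment and inside the string literal are left alone.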

Note that if you're going to be doing more complex processing, it might be easier to use the tokenize module to tokenize the source input. (The tokenizer treats each string literal as a single token, even if it is multiline, raw, etc.)
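
For comparison, here is a minimal sketch of the tokenize-based approach, using only the standard library. The helper name `string_literals` and the `sample` text are made up for this example; the point is just that escaped quotes, comments, and multiline raw strings are handled by the tokenizer without any regular expression.

```python
import io
import tokenize

def string_literals(source):
    """Return the text of every string-literal token in `source`."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.STRING]

# Hypothetical sample: an escaped quote, a quote-like comment, and a
# raw triple-quoted string that spans two lines.
sample = "\n".join([
    r'x = "a \"quoted\" word"  # "not a string"',
    "y = r'''raw",
    "multiline'''",
    "",
])

for literal in string_literals(sample):
    print(repr(literal))
```

The text inside the comment is reported as a comment token, so it does not show up in the list of string literals.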

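One warning about this regular expression: In practice, you'll almost always want to add r"\#.*|" to the beginning of it (which makes it also match Python comments). Otherwise, there's nothing to prevent it from finding what looks like the start of a string literal inside a comment. The # must be backslash-escaped because, under re.VERBOSE, an unescaped # starts a regular-expression comment and the prefix would be silently ignored.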
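
The following sketch shows the effect of that prefix on a made-up input whose comment contains an apostrophe. The pattern is a compact copy of the recipe's (comments trimmed), and the # in the prefix is backslash-escaped so that re.VERBOSE does not treat it as the start of a regular-expression comment. The names `strings_only`, `with_comments`, and `source` are invented for the example.

```python
import re

# Compact copy of the recipe's pattern (comments trimmed).
PY_STRING_LITERAL_RE = (r"""
[uU]?[rR]?
  (?:
  '''(?: [^'] | \\' | '{1,2}(?!') )*''' |
  '(?: [^'] | \\' )*'(?!') | """ +
r'''
  """(?: [^"] | \\" | "{1,2}(?!") )*""" |
  "(?: [^"] | \\" )*"(?!")
)''')

# Hypothetical input: the apostrophe in "don't" looks like a string start.
source = "x = 1  # don't panic\ny = 'real'\n"

strings_only = re.compile(PY_STRING_LITERAL_RE, re.VERBOSE)
with_comments = re.compile(r"\#.*|" + PY_STRING_LITERAL_RE, re.VERBOSE)

# Without the prefix, the match starts inside the comment.
print(strings_only.findall(source))

# With the prefix, the comment is consumed first; drop it afterwards.
found = [m for m in with_comments.findall(source) if not m.startswith("#")]
print(found)
```

Without the prefix, the only match runs from the apostrophe in the comment to the opening quote of 'real', so the actual literal is missed. With the prefix, the comment is matched as a whole and 'real' is found.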

3 comments

Christopher Smith 12 years, 1 month ago

This doesn't parse the following (pathological) valid case:

bad = '''def bar(): """ A quoted triple quote is not a closing of this docstring:

print '"""' """ """ # <-- this is the closing quote pass print '"""is here""" and "this"''''

>>> for x in PY_STRING_LITERAL_RE.findall(t):
...     print x
...     print '-'*22
...     
"""
    A quoted triple quote is not a closing
    of this docstring:
    >>> print '"""

' """ """ # <-- this is the closing quote pass

"""is here"""
"this"
Christopher Smith 12 years, 1 month ago

Sorry... this isn't coming through, and I now see that I created something that isn't a valid piece of code.

Alex Stewart 11 years, 4 months ago

This code does not correctly handle backslash-escapes. Consider the case: "Eeek! \"quotes\" in my string!" The above regexp matches this incorrectly as: "Eeek! \"

The (?!') and (?!") lookaheads are also incorrect (and unnecessary), since it is syntactically valid in Python to concatenate string literals by placing one right after another. For example: "foo""bar" == "foobar"

It also doesn't correctly handle a single-quoted string which crosses a newline (which should fail to match, as it's a syntax error, but actually matches without complaint).

An improved version might be:

PY_STRING_LITERAL_RE = (r"""
[uU]?[rR]?
  (?:              # Single-quote (') strings
  '''(?:                 # Triple-quoted can contain...
      [^'\\]             | # a non-quote, non-backslash
      \\.                | # a backslash followed by something
      '{1,2}(?!')          # one or two quotes
    )*''' |
  '(?:                   # Non-triple quoted can contain...
     [^'\n\\]            | # non-quote, non-backslash, non-NL
     \\.                   # a backslash followed by something
  )*' | """+
r'''               # Double-quote (") strings
  """(?:                 # Triple-quoted can contain...
      [^"\\]             | # a non-quote, non-backslash
      \\.                | # a backslash followed by something
      "{1,2}(?!")          # one or two quotes
    )*""" |
  "(?:                   # Non-triple quoted can contain...
     [^"\n\\]            | # non-quote, non-backslash, non-NL
     \\.                   # a backslash followed by something
  )*"
)''')

(Note: I haven't tested that code exhaustively either; those were just the issues I saw at first glance. Language parsing with regular expressions is fraught with peril in general, and it is very difficult to cover all cases correctly...)
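
The three issues raised in this comment can be checked with a small side-by-side sketch. Both patterns below are compact copies (comments trimmed) of the recipe's original and of the improved version; the names `ORIGINAL` and `IMPROVED` and the three test strings are made up for illustration, and this is a spot check rather than an exhaustive test.

```python
import re

# Compact copy of the recipe's original pattern.
ORIGINAL = re.compile(r"""
[uU]?[rR]?
  (?:
  '''(?: [^'] | \\' | '{1,2}(?!') )*''' |
  '(?: [^'] | \\' )*'(?!') | """ +
r'''
  """(?: [^"] | \\" | "{1,2}(?!") )*""" |
  "(?: [^"] | \\" )*"(?!")
)''', re.VERBOSE)

# Compact copy of the improved pattern from the comment above.
IMPROVED = re.compile(r"""
[uU]?[rR]?
  (?:
  '''(?: [^'\\] | \\. | '{1,2}(?!') )*''' |
  '(?: [^'\n\\] | \\. )*' | """ +
r'''
  """(?: [^"\\] | \\. | "{1,2}(?!") )*""" |
  "(?: [^"\n\\] | \\. )*"
)''', re.VERBOSE)

escaped = r'"Eeek! \"quotes\" in my string!"'   # backslash-escaped quotes
adjacent = '"foo""bar"'                         # implicit concatenation
multiline = "'broken\nstring'"                  # newline in a '...' string

# 1. Escaped quotes: the original stops at the first \" it sees.
print(ORIGINAL.search(escaped).group())
print(IMPROVED.search(escaped).group())

# 2. Adjacent literals: the lookahead makes the original miss both.
print(ORIGINAL.findall(adjacent))
print(IMPROVED.findall(adjacent))

# 3. A newline inside a single-quoted string: only the original matches.
print(ORIGINAL.search(multiline) is not None)
print(IMPROVED.search(multiline) is not None)
```

One limitation worth noting: because `.` does not match a newline unless re.DOTALL is set, the improved pattern as written will not match a string that contains a backslash immediately followed by a newline.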

Created by Edward Loper on Fri, 10 Mar 2006 (PSF)