
Update 05/25/2014: Pyminifier 2.0 has been released and now lives on Github: https://github.com/liftoff/pyminifier (docs are here: http://liftoff.github.io/pyminifier/). The code below is very out-of-date but will be left alone for historical purposes.

Python Minifier: Reduces the size of Python code for use on embedded platforms. Performs the following:

  1. Removes docstrings.
  2. Removes comments.
  3. Removes blank lines.
  4. Minimizes code indentation.
  5. Joins multiline pairs of parentheses, braces, and brackets (and removes extraneous whitespace within).
  6. Preserves shebangs and encoding info (e.g. "# -*- coding: utf-8 -*-").
  7. NEW: Optionally, produces a bzip2 or gzip-compressed self-extracting python script containing the minified source for ultimate minification.

Update 09/23/2010: Version 1.4.1: Fixed an indentation bug when operators such as @ and open parens started a line.

Update 09/18/2010: Version 1.4:

  • Added some command line options to save the result to an output file.
  • Added the ability to save the result as a bzip2 or gzip-compressed self-extracting python script (which is kinda neat--try it!).
  • Updated some of the docstrings to provide more examples of what each function does.

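For the curious, the self-extracting trick is tiny. Here's a minimal sketch of the same idea in modern Python 3 (the version in the code below emits a Python 2 `exec` statement; `gz_pack_py3` is an illustrative name, not part of the recipe):

```python
import base64
import zlib

def gz_pack_py3(source):
    """Pack `source` into a zlib-compressed, self-extracting script.

    A Python 3 translation of the gz_pack() idea in the recipe below
    (the original emits a Python 2 `exec` statement); the function name
    is illustrative, not part of the recipe.
    """
    payload = base64.b64encode(zlib.compress(source.encode('utf-8')))
    return ('import zlib, base64\n'
            'exec(zlib.decompress(base64.b64decode(%r)))\n' % payload)

# The packed script rebuilds and runs the original source when executed:
packed = gz_pack_py3('result = 6 * 7')
namespace = {}
exec(packed, namespace)
print(namespace['result'])  # → 42
```

The compressed payload is base64-encoded so it survives as a plain string literal inside an ordinary text file.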
Update 06/02/2010: Version 1.3: Rewrote several functions to use Python's built-in tokenize module (which I just discovered despite it having been in Python since version 2.2). This removed the dependency on pyparsing and improved performance by an order of magnitude. It also fixed some pretty serious bugs in dedent() and reduce_operators().
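The tokenizer-based approach from that rewrite is easy to see in isolation. A minimal sketch in modern Python 3 (the recipe below is Python 2, so the imports differ): generate_tokens() labels every comment and string for us, which is why no hand-rolled parsing (or pyparsing) is needed.

```python
import io
import tokenize

# Minimal sketch of the tokenize-based approach (Python 3 syntax, unlike
# the Python 2 recipe below): every token arrives pre-classified, so
# dropping comments is just a matter of skipping COMMENT tokens.
source = 'x = 1  # a comment\ns = "keep me"\n'
kept = []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.COMMENT:
        continue  # this is what lets us drop comments cleanly
    kept.append((tokenize.tok_name[tok.type], tok.string))

print(kept)
```

Strings survive untouched; deciding which STRING tokens are docstrings is the harder part, handled in remove_comments_and_docstrings() below.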

PLEASE POST A COMMENT IF YOU ENCOUNTER A BUG!

Python, 681 lines
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
#       pyminifier.py
#
#       Copyright 2009 Dan McDougall <YouKnowWho@YouKnowWhat.com>
#
#       This program is free software; you can redistribute it and/or modify
#       it under the terms of the GNU General Public License as published by
#       the Free Software Foundation; Version 3 of the License
#
#       This program is distributed in the hope that it will be useful,
#       but WITHOUT ANY WARRANTY; without even the implied warranty of
#       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#       GNU General Public License for more details.
#
#       You should have received a copy of the GNU General Public License
#       along with this program; if not, the license can be downloaded here:
#
#       http://www.gnu.org/licenses/gpl.html

# Meta
__version__ = '1.4.1'
__license__ = "GNU General Public License (GPL) Version 3"
__version_info__ = (1, 4, 1)
__author__ = 'Dan McDougall <YouKnowWho@YouKnowWhat.com>'

"""
**Python Minifier:**  Reduces the size of (minifies) Python code for use on
embedded platforms.

Performs the following:
     - Removes docstrings.
     - Removes comments.
     - Minimizes code indentation.
     - Joins multiline pairs of parentheses, braces, and brackets (and removes extraneous whitespace within).
     - Preserves shebangs and encoding info (e.g. "# -*- coding: utf-8 -*-").

Various examples and edge cases are sprinkled throughout the pyminifier code so
that it can be tested by minifying itself.  The way to test is thus:

.. code-block:: bash

    $ python pyminifier.py pyminifier.py > minified_pyminifier.py
    $ python minified_pyminifier.py pyminifier.py > this_should_be_identical.py
    $ diff minified_pyminifier.py this_should_be_identical.py
    $

If you get an error executing minified_pyminifier.py or
'this_should_be_identical.py' isn't identical to minified_pyminifier.py then
something is broken.
"""

import sys, re, cStringIO, tokenize
from optparse import OptionParser

# Compile our regular expressions for speed
multiline_quoted_string = re.compile(r'(\'\'\'|\"\"\")')
not_quoted_string = re.compile(r'(\".*\'\'\'.*\"|\'.*\"\"\".*\')')
trailing_newlines = re.compile(r'\n\n')
shebang = re.compile('^#\!.*$')
encoding = re.compile(".*coding[:=]\s*([-\w.]+)")
multiline_indicator = re.compile('\\\\(\s*#.*)?\n')
# The above also removes trailing comments: "test = 'blah \ # comment here"

# These aren't used but they're a pretty good reference:
double_quoted_string = re.compile(r'((?<!\\)".*?(?<!\\)")')
single_quoted_string = re.compile(r"((?<!\\)'.*?(?<!\\)')")
single_line_single_quoted_string = re.compile(r"((?<!\\)'''.*?(?<!\\)''')")
single_line_double_quoted_string = re.compile(r'((?<!\\)""".*?(?<!\\)""")')

def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.

    **Note**: Uses Python's built-in tokenize module to great effect.

    Example:

    .. code-block:: python

        def noop(): # This is a comment
            '''
            Does nothing.
            '''
            pass # Don't do anything

    Will become:

    .. code-block:: python

        def noop():
            pass
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces.  Newlines that end a statement are
                    # NEWLINE and newlines inside of operators are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out

def reduce_operators(source):
    """
    Removes spaces between operators in 'source' and returns the result.

    Example:

    .. code-block:: python

        def foo(foo, bar, blah):
            test = "This is a %s" % foo

    Will become:

    .. code-block:: python

        def foo(foo,bar,blah):
            test="This is a %s"%foo
    """
    io_obj = cStringIO.StringIO(source)
    remove_columns = []
    out = ""
    out_line = ""
    prev_toktype = tokenize.INDENT
    prev_tok = None
    last_lineno = -1
    last_col = 0
    lshift = 1
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out_line += (" " * (start_col - last_col))
        if token_type == tokenize.OP:
            # Operators that begin a line such as @ or open parens should be
            # left alone
            start_of_line_types = [ # These indicate we're starting a new line
                tokenize.NEWLINE, tokenize.DEDENT, tokenize.INDENT]
            if prev_toktype not in start_of_line_types:
                # This is just a regular operator; remove spaces
                remove_columns.append(start_col) # Before OP
                remove_columns.append(end_col+1) # After OP
        if token_string.endswith('\n'):
            out_line += token_string
            if remove_columns:
                for col in remove_columns:
                    col = col - lshift
                    try:
                        # This was really handy for debugging (looks nice, worth saving):
                        #print out_line + (" " * col) + "^"
                        # The above points to the character we're looking at
                        if out_line[col] == " ": # Only if it is a space
                            out_line = out_line[:col] + out_line[col+1:]
                            lshift += 1 # To re-align future changes on this line
                    except IndexError: # Reached the end of the line, no biggie
                        pass
            out += out_line
            remove_columns = []
            out_line = ""
            lshift = 1
        else:
            out_line += token_string
        prev_toktype = token_type
        prev_tok = tok
        last_col = end_col
        last_lineno = end_line
    # This makes sure to capture the last line if it doesn't end in a newline:
    out += out_line
    # The tokenize module doesn't recognize @ sign before a decorator
    return out

# NOTE: This isn't used anymore...  Just here for reference in case someone
# searches the internet looking for a way to remove similarly-styled end-of-line
# comments from non-python code.  It also acts as an edge case of sorts with
# that raw triple quoted string inside the "quoted_string" assignment.
def remove_comment(single_line):
    """
    Removes the comment at the end of the line (if any) and returns the result.
    """
    quoted_string = re.compile(
        r'''((?<!\\)".*?(?<!\\)")|((?<!\\)'.*?(?<!\\)')'''
    )
    # This divides the line up into sections:
    #   Those inside single quotes and those that are not
    split_line = quoted_string.split(single_line)
    # Remove empty items:
    split_line = [a for a in split_line if a]
    out_line = ""
    for section in split_line:
        if section.startswith("'") or section.startswith('"'):
            # This is a quoted string; leave it alone
            out_line += section
        elif '#' in section: # A '#' not in quotes?  There's a comment here!
            # Get rid of everything after the # including the # itself:
            out_line += section.split('#')[0]
            break # No reason to bother the rest--it's all comments
        else:
            # This isn't a quoted string OR a comment; leave it as-is
            out_line += section
    return out_line.rstrip() # Strip trailing whitespace before returning

def join_multiline_pairs(text, pair="()"):
    """
    Finds and removes newlines in multiline matching pairs of characters in
    'text'.  For example, "(.*\n.*), {.*\n.*}, or [.*\n.*]".

    By default it joins parens () but it will join any two characters given via
    the 'pair' variable.

    **Note:** Doesn't remove extraneous whitespace that ends up between the pair.
    Use reduce_operators() for that.

    Example:

    .. code-block:: python

        test = (
            "This is inside a multi-line pair of parentheses"
        )

    Will become:

    .. code-block:: python

        test = (            "This is inside a multi-line pair of parentheses"        )
    """
    # Readability variables
    opener = pair[0]
    closer = pair[1]

    # Tracking variables
    inside_pair = False
    inside_quotes = False
    inside_double_quotes = False
    inside_single_quotes = False
    quoted_string = False
    openers = 0
    closers = 0
    linecount = 0

    # Regular expressions
    opener_regex = re.compile('\%s' % opener)
    closer_regex = re.compile('\%s' % closer)

    output = ""

    for line in text.split('\n'):
        escaped = False
        # First we rule out multi-line strings
        multline_match = multiline_quoted_string.search(line)
        not_quoted_string_match = not_quoted_string.search(line)
        if multline_match and not not_quoted_string_match and not quoted_string:
            if len(line.split('"""')) > 2 or len(line.split("'''")) > 2:
                # This is a single line that uses the triple quotes twice
                # Treat it as if it were just a regular line:
                output += line + '\n'
                quoted_string = False
            else:
                output += line + '\n'
                quoted_string = True
        elif quoted_string and multiline_quoted_string.search(line):
            output += line + '\n'
            quoted_string = False
        # Now let's focus on the lines containing our opener and/or closer:
        elif not quoted_string:
            if opener_regex.search(line) or closer_regex.search(line) or inside_pair:
                for character in line:
                    if character == opener:
                        if not escaped and not inside_quotes:
                            openers += 1
                            inside_pair = True
                            output += character
                        else:
                            escaped = False
                            output += character
                    elif character == closer:
                        if not escaped and not inside_quotes:
                            if openers and openers == (closers + 1):
                                closers = 0
                                openers = 0
                                inside_pair = False
                                output += character
                            else:
                                closers += 1
                                output += character
                        else:
                            escaped = False
                            output += character
                    elif character == '\\':
                        if escaped:
                            escaped = False
                            output += character
                        else:
                            escaped = True
                            output += character
                    elif character == '"' and escaped:
                        output += character
                        escaped = False
                    elif character == "'" and escaped:
                        output += character
                        escaped = False
                    elif character == '"' and inside_quotes:
                        if inside_single_quotes:
                            output += character
                        else:
                            inside_quotes = False
                            inside_double_quotes = False
                            output += character
                    elif character == "'" and inside_quotes:
                        if inside_double_quotes:
                            output += character
                        else:
                            inside_quotes = False
                            inside_single_quotes = False
                            output += character
                    elif character == '"' and not inside_quotes:
                        inside_quotes = True
                        inside_double_quotes = True
                        output += character
                    elif character == "'" and not inside_quotes:
                        inside_quotes = True
                        inside_single_quotes = True
                        output += character
                    elif character == ' ' and inside_pair and not inside_quotes:
                        if not output[-1] in [' ', opener]:
                            output += ' '
                    else:
                        if escaped:
                            escaped = False
                        output += character
                if inside_pair == False:
                    output += '\n'
            else:
                output += line + '\n'
        else:
            output += line + '\n'

    # Clean up
    output = trailing_newlines.sub('\n', output)

    return output

def dedent(source):
    """
    Minimizes indentation to save precious bytes

    Example:

    .. code-block:: python

        def foo(bar):
            test = "This is a test"

    Will become:

    .. code-block:: python

        def foo(bar):
         test = "This is a test"
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    last_lineno = -1
    last_col = 0
    prev_start_line = 0
    indentation = ""
    indentation_level = 0
    for i,tok in enumerate(tokenize.generate_tokens(io_obj.readline)):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        if start_line > last_lineno:
            last_col = 0
        if token_type == tokenize.INDENT:
            indentation_level += 1
            continue
        if token_type == tokenize.DEDENT:
            indentation_level -= 1
            continue
        indentation = " " * indentation_level
        if start_line > prev_start_line:
            out += indentation + token_string
        elif start_col > last_col:
            out += " " + token_string
        else:
            out += token_string
        prev_start_line = start_line
        last_col = end_col
        last_lineno = end_line
    return out

def fix_empty_methods(source):
    """
    Appends 'pass' to empty methods/functions (i.e. where there was nothing but
    a docstring before we removed it =).

    Example:

    .. code-block:: python

        # Note: This triple-single-quote inside a triple-double-quote is also a
        # pyminifier self-test
        def myfunc():
            '''This is just a placeholder function.'''

    Will become:

    .. code-block:: python

        def myfunc(): pass
    """
    def_indentation_level = 0
    output = ""
    just_matched = False
    previous_line = None
    method = re.compile(r'^\s*def\s*.*\(.*\):.*$')
    for line in source.split('\n'):
        if len(line.strip()) > 0: # Don't look at blank lines
            if just_matched == True:
                this_indentation_level = len(line.rstrip()) - len(line.strip())
                if def_indentation_level == this_indentation_level:
                    # This method is empty, insert a 'pass' statement
                    output += "%s pass\n%s\n" % (previous_line, line)
                else:
                    output += "%s\n%s\n" % (previous_line, line)
                just_matched = False
            elif method.match(line):
                def_indentation_level = len(line) - len(line.strip()) # A comment
                just_matched = True
                previous_line = line
            else:
                output += "%s\n" % line # Another self-test
        else:
            output += "\n"
    return output

def remove_blank_lines(source):
    """
    Removes blank lines from 'source' and returns the result.

    Example:

    .. code-block:: python

        test = "foo"

        test2 = "bar"

    Will become:

    .. code-block:: python

        test = "foo"
        test2 = "bar"
    """
    io_obj = cStringIO.StringIO(source)
    source = [a for a in io_obj.readlines() if a.strip()]
    return "".join(source)

def minify(source):
    """
    Removes all docstrings, comments, and blank lines, and minimizes code
    indentation in 'source', then returns the result.
    """
    preserved_shebang = None
    preserved_encoding = None

    # This is for things like shebangs that must be precisely preserved
    for line in source.split('\n')[0:2]:
        # Save the first comment line if it starts with a shebang
        # (e.g. '#!/usr/bin/env python') <--also a self test!
        if shebang.match(line): # Must be first line
            preserved_shebang = line
            continue
        # Save the encoding string (must be first or second line in file)
        if encoding.match(line):
            preserved_encoding = line

    # Remove multilines (e.g. lines that end with '\' followed by a newline)
    source = multiline_indicator.sub('', source)

    # Remove docstrings (Note: Must run before fix_empty_methods())
    source = remove_comments_and_docstrings(source)

    # Remove empty (i.e. single line) methods/functions
    source = fix_empty_methods(source)

    # Join multiline pairs of parens, brackets, and braces
    source = join_multiline_pairs(source)
    source = join_multiline_pairs(source, '[]')
    source = join_multiline_pairs(source, '{}')

    # Remove whitespace between operators:
    source = reduce_operators(source)

    # Minimize indentation
    source = dedent(source)

    # Re-add preserved items
    if preserved_encoding:
        source = preserved_encoding + "\n" + source
    if preserved_shebang:
        source = preserved_shebang + "\n" + source

    # Remove blank lines
    source = remove_blank_lines(source).rstrip('\n') # Stubborn last newline

    return source

def bz2_pack(source):
    "Returns 'source' as a bzip2-compressed, self-extracting python script."
    import bz2, base64
    out = ""
    compressed_source = bz2.compress(source)
    out += 'import bz2, base64\n'
    out += "exec bz2.decompress(base64.b64decode('"
    out += base64.b64encode((compressed_source))
    out += "'))\n"
    return out

def gz_pack(source):
    "Returns 'source' as a gzip-compressed, self-extracting python script."
    import zlib, base64
    out = ""
    compressed_source = zlib.compress(source)
    out += 'import zlib, base64\n'
    out += "exec zlib.decompress(base64.b64decode('"
    out += base64.b64encode((compressed_source))
    out += "'))\n"
    return out

# The test.+() functions below are for testing pyminifier...
def test_decorator(f):
    """Decorator that does nothing"""
    return f

def test_reduce_operators():
    """Test the case where an operator such as an open paren starts a line"""
    (a, b) = 1, 2 # The indentation level should be preserved
    pass

def test_empty_functions():
    """
    This is a test method.
    This should be replaced with 'def empty_method: pass'
    """

class test_class(object):
    "Testing indented decorators"

    @test_decorator
    def foo(self):
        pass

def test_function():
    """
    This function encapsulates the edge cases to prevent them from invading the
    global namespace.
    """
    foo = ("The # character in this string should " # This comment
           "not result in a syntax error") # ...and this one should go away
    test_multi_line_list = [
        'item1',
        'item2',
        'item3'
    ]
    test_multi_line_dict = {
        'item1': 1,
        'item2': 2,
        'item3': 3
    }
    # It may seem strange but the code below tests our docstring removal code.
    test_string_inside_operators = imaginary_function(
        "This string was indented but the tokenizer won't see it that way."
    ) # To understand how this could mess up docstring removal code see the
      # remove_comments_and_docstrings() function starting at this line:
      #     "elif token_type == tokenize.STRING:"
    # This tests remove_extraneous_spaces():
    this_line_has_leading_indentation    = '''<--That extraneous space should be
                                              removed''' # But not these spaces

def main():
    usage = '%prog [options] "<input file>"'
    parser = OptionParser(usage=usage, version=__version__)
    parser.disable_interspersed_args()
    parser.add_option(
        "-o", "--outfile",
        dest="outfile",
        default=None,
        help="Save output to the given file.",
        metavar="<file path>"
    )
    parser.add_option(
        "--bzip2",
        action="store_true",
        dest="bzip2",
        default=False,
        help="bzip2-compress the result into a self-executing python script."
    )
    parser.add_option(
        "--gzip",
        action="store_true",
        dest="gzip",
        default=False,
        help="gzip-compress the result into a self-executing python script."
    )
    options, args = parser.parse_args()
    try:
        source = open(args[0]).read()
    except Exception, e:
        print e
        parser.print_help()
        sys.exit(2)
    # Minify our input script
    result = minify(source)
    # Compress it if we were asked to do so
    if options.bzip2:
        result = bz2_pack(result)
    elif options.gzip:
        result = gz_pack(result)
    # Either save the result to the output file or print it to stdout
    if options.outfile:
        f = open(options.outfile, 'w')
        f.write(result)
        f.close()
    else:
        print result

if __name__ == "__main__":
    main()

I wrote this so I could minimize the size of Python code being run on embedded platforms (e.g. OpenWRT). Minified + zipped modules can save a lot of space when applied to a large number of files. Here's an example of the output when pyminifier minifies itself:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__version__='1.4.1'
__license__="GNU General Public License (GPL) Version 3"
__version_info__=(1,4,1)
__author__='Dan McDougall <YouKnowWho@YouKnowWhat.com>'
import sys,re,cStringIO,tokenize
from optparse import OptionParser
multiline_quoted_string=re.compile(r'(\'\'\'|\"\"\")')
not_quoted_string=re.compile(r'(\".*\'\'\'.*\"|\'.*\"\"\".*\')')
trailing_newlines=re.compile(r'\n\n')
shebang=re.compile('^#\!.*$')
encoding=re.compile(".*coding[:=]\s*([-\w.]+)")
multiline_indicator=re.compile('\\\\(\s*#.*)?\n')
double_quoted_string=re.compile(r'((?<!\\)".*?(?<!\\)")')
single_quoted_string=re.compile(r"((?<!\\)'.*?(?<!\\)')")
single_line_single_quoted_string=re.compile(r"((?<!\\)'''.*?(?<!\\)''')")
single_line_double_quoted_string=re.compile(r"((?<!\\)'''.*?(?<!\\)''')")
def remove_comments_and_docstrings(source):
 io_obj=cStringIO.StringIO(source)
 out=""
 prev_toktype=tokenize.INDENT
 last_lineno=-1
 last_col=0
 for tok in tokenize.generate_tokens(io_obj.readline):
  token_type=tok[0]
  token_string=tok[1]
  start_line,start_col=tok[2]
  end_line,end_col=tok[3]
  ltext=tok[4]
  if start_line>last_lineno:
   last_col=0
  if start_col>last_col:
   out+=(" "*(start_col-last_col))
  if token_type==tokenize.COMMENT:
   pass
  elif token_type==tokenize.STRING:
   if prev_toktype!=tokenize.INDENT:
    if prev_toktype!=tokenize.NEWLINE:
     if start_col>0:
      out+=token_string
  else:
   out+=token_string
  prev_toktype=token_type
  last_col=end_col
  last_lineno=end_line
 return out
def reduce_operators(source):
 io_obj=cStringIO.StringIO(source)
 remove_columns=[]
 out=""
 out_line=""
 prev_toktype=tokenize.INDENT
 prev_tok=None
 last_lineno=-1
 last_col=0
 lshift=1
 for tok in tokenize.generate_tokens(io_obj.readline):
  token_type=tok[0]
  token_string=tok[1]
  start_line,start_col=tok[2]
  end_line,end_col=tok[3]
  ltext=tok[4]
  if start_line>last_lineno:
   last_col=0
  if start_col>last_col:
   out_line+=(" "*(start_col-last_col))
  if token_type==tokenize.OP:
   start_of_line_types=[tokenize.NEWLINE,tokenize.DEDENT,tokenize.INDENT]
   if prev_toktype not in start_of_line_types:
    remove_columns.append(start_col)
    remove_columns.append(end_col+1)
  if token_string.endswith('\n'):
   out_line+=token_string
   if remove_columns:
    for col in remove_columns:
     col=col-lshift
     try:
      if out_line[col]==" ":
       out_line=out_line[:col]+out_line[col+1:]
       lshift+=1
     except IndexError:
      pass
   out+=out_line
   remove_columns=[]
   out_line=""
   lshift=1
  else:
   out_line+=token_string
  prev_toktype=token_type
  prev_token=tok
  last_col=end_col
  last_lineno=end_line
 out+=out_line
 return out
def remove_comment(single_line):
 quoted_string=re.compile( r'''((?<!\\)".*?(?<!\\)")|((?<!\\)'.*?(?<!\\)')'''
 )
 split_line=quoted_string.split(single_line)
 split_line=[a for a in split_line if a]
 out_line=""
 for section in split_line:
  if section.startswith("'")or section.startswith('"'):
   out_line+=section
  elif '#' in section:
   out_line+=section.split('#')[0]
   break
  else:
   out_line+=section
 return out_line.rstrip()
def join_multiline_pairs(text,pair="()"):
 opener=pair[0]
 closer=pair[1]
 inside_pair=False
 inside_quotes=False
 inside_double_quotes=False
 inside_single_quotes=False
 quoted_string=False
 openers=0
 closers=0
 linecount=0
 opener_regex=re.compile('\%s'%opener)
 closer_regex=re.compile('\%s'%closer)
 output=""
 for line in text.split('\n'):
  escaped=False
  multiline_match=multiline_quoted_string.search(line)
  not_quoted_string_match=not_quoted_string.search(line)
  if multiline_match and not not_quoted_string_match and not quoted_string:
   if len(line.split('"""'))>1 or len(line.split("'''"))>1:
    output+=line+'\n'
    quoted_string=False
   else:
    output+=line+'\n'
    quoted_string=True
  elif quoted_string and multiline_quoted_string.search(line):
   output+=line+'\n'
   quoted_string=False
  elif not quoted_string:
   if opener_regex.search(line)or closer_regex.search(line)or inside_pair:
    for character in line:
     if character==opener:
      if not escaped and not inside_quotes:
       openers+=1
       inside_pair=True
       output+=character
      else:
       escaped=False
       output+=character
     elif character==closer:
      if not escaped and not inside_quotes:
       if openers and openers==(closers+1):
        closers=0
        openers=0
        inside_pair=False
        output+=character
       else:
        closers+=1
        output+=character
      else:
       escaped=False
       output+=character
     elif character=='\\':
      if escaped:
       escaped=False
       output+=character
      else:
       escaped=True
       output+=character
     elif character=='"' and escaped:
      output+=character
      escaped=False
     elif character=="'" and escaped:
      output+=character
      escaped=False
     elif character=='"' and inside_quotes:
      if inside_single_quotes:
       output+=character
      else:
       inside_quotes=False
       inside_double_quotes=False
       output+=character
     elif character=="'" and inside_quotes:
      if inside_double_quotes:
       output+=character
      else:
       inside_quotes=False
       inside_single_quotes=False
       output+=character
     elif character=='"' and not inside_quotes:
      inside_quotes=True
      inside_double_quotes=True
      output+=character
     elif character=="'" and not inside_quotes:
      inside_quotes=True
      inside_single_quotes=True
      output+=character
     elif character==' ' and inside_pair and not inside_quotes:
      if not output[-1]in[' ',opener]:
       output+=' '
     else:
      if escaped:
       escaped=False
      output+=character
    if inside_pair==False:
     output+='\n'
   else:
    output+=line+'\n'
  else:
   output+=line+'\n'
 output=trailing_newlines.sub('\n',output)
 return output
def dedent(source):
 io_obj=cStringIO.StringIO(source)
 out=""
 last_lineno=-1
 last_col=0
 prev_start_line=0
 indentation=""
 indentation_level=0
 for i,tok in enumerate(tokenize.generate_tokens(io_obj.readline)):
  token_type=tok[0]
  token_string=tok[1]
  start_line,start_col=tok[2]
  end_line,end_col=tok[3]
  if start_line>last_lineno:
   last_col=0
  if token_type==tokenize.INDENT:
   indentation_level+=1
   continue
  if token_type==tokenize.DEDENT:
   indentation_level-=1
   continue
  indentation=" "*indentation_level
  if start_line>prev_start_line:
   out+=indentation+token_string
  elif start_col>last_col:
   out+=" "+token_string
  else:
   out+=token_string
  prev_start_line=start_line
  last_col=end_col
  last_lineno=end_line
 return out
def fix_empty_methods(source):
 def_indentation_level=0
 output=""
 just_matched=False
 previous_line=None
 method=re.compile(r'^\s*def\s*.*\(.*\):.*$')
 for line in source.split('\n'):
  if len(line.strip())>0:
   if just_matched==True:
    this_indentation_level=len(line.rstrip())-len(line.strip())
    if def_indentation_level==this_indentation_level:
     output+="%s pass\n%s\n"%(previous_line,line)
    else:
     output+="%s\n%s\n"%(previous_line,line)
    just_matched=False
   elif method.match(line):
    def_indentation_level=len(line)-len(line.strip())
    just_matched=True
    previous_line=line
   else:
    output+="%s\n"%line
  else:
   output+="\n"
 return output
def remove_blank_lines(source):
 io_obj=cStringIO.StringIO(source)
 source=[a for a in io_obj.readlines()if a.strip()]
 return "".join(source)
def minify(source):
 preserved_shebang=None
 preserved_encoding=None
 for line in source.split('\n')[0:2]:
  if shebang.match(line):
   preserved_shebang=line
   continue
  if encoding.match(line):
   preserved_encoding=line
 source=multiline_indicator.sub('',source)
 source=remove_comments_and_docstrings(source)
 source=fix_empty_methods(source)
 source=join_multiline_pairs(source)
 source=join_multiline_pairs(source,'[]')
 source=join_multiline_pairs(source,'{}')
 source=reduce_operators(source)
 source=dedent(source)
 if preserved_encoding:
  source=preserved_encoding+"\n"+source
 if preserved_shebang:
  source=preserved_shebang+"\n"+source
 source=remove_blank_lines(source).rstrip('\n')
 return source
def bz2_pack(source):
 import bz2,base64
 out=""
 compressed_source=bz2.compress(source)
 out+='import bz2, base64\n'
 out+="exec bz2.decompress(base64.b64decode('"
 out+=base64.b64encode((compressed_source))
 out+="'))\n"
 return out
def gz_pack(source):
 import zlib,base64
 out=""
 compressed_source=zlib.compress(source)
 out+='import zlib, base64\n'
 out+="exec zlib.decompress(base64.b64decode('"
 out+=base64.b64encode((compressed_source))
 out+="'))\n"
 return out
def test_decorator(f):
 return f
def test_reduce_operators():
 (a,b)=1,2
 pass
def test_empty_functions():pass
class test_class(object):
 @test_decorator
 def foo(self):
  pass
def test_function():
 foo=("The # character in this string should " "not result in a syntax error")
 test_multi_line_list=['item1','item2','item3']
 test_multi_line_dict={'item1':1,'item2':2,'item3':3}
 test_string_inside_operators=imaginary_function("This string was indented but the tokenizer won't see it that way.")
 this_line_has_leading_indentation ='''<--That extraneous space should be
                                              removed'''
def main():
 usage='%prog [options] "<input file>"'
 parser=OptionParser(usage=usage,version=__version__)
 parser.disable_interspersed_args()
 parser.add_option("-o","--outfile",dest="outfile",default=None,help="Save output to the given file.",metavar="<file path>")
 parser.add_option("--bzip2",action="store_true",dest="bzip2",default=False,help="bzip2-compress the result into a self-executing python script.")
 parser.add_option("--gzip",action="store_true",dest="gzip",default=False,help="gzip-compress the result into a self-executing python script.")
 options,args=parser.parse_args()
 try:
  source=open(args[0]).read()
 except Exception,e:
  print e
  parser.print_help()
  sys.exit(2)
 result=minify(source)
 if options.bzip2:
  result=bz2_pack(result)
 elif options.gzip:
  result=gz_pack(result)
 if options.outfile:
  f=open(options.outfile,'w')
  f.write(result)
  f.close()
 else:
  print result
if __name__=="__main__":
 main()

The following is an example of the new self-extracting, compressed python script feature. It is the result of pyminifier compressing itself with the --gzip option:

import zlib, base64
exec zlib.decompress(base64.b64decode('eJzlGmuTm8jxu37FmI0L2EXYWruuUiqzvqp443LFXrtiJ1cpSUchGK04IyAM7MPn++/pngfMANI+crkvscsWmn4/pnum0dGTZw2rnq3T/BnNr0h5W2+LfHJEpsdTEhdJml/OSVNvpn/GlUkYXtGKpUUehoE981/6MxvWsjSmOaOwZr29+Ad5S3NaRRn51KwBQt4LKHHefnrvkn8KevLC0pil+aYAamfmvfRmLgCiBtSoUMibKCcf4jdFcxllGXn1r6L5W15c/7Qtfmwfo9qPi92ZPUl3ZVHVhN0yr6Je/LmuQP13H726+Erz9BudbKpiR4qyLqMKFJLoH8sadPiES9Vk12R1mqU5Df/dFDVNQsaZBBVFGWWaUaeynaWNf78vLfzr2u4kL+rDBJZ/LIjg0/ouPvAvLiODuopQ7GWY02sUz0wGy3yZAxbb0nVkMrd/Plo+8Y//BFCai4DpYBAgFhfzYLVkx85iurz2Vyeu5Wq2pnmSxlFdVAbnJfxxgObIP3Zfc/lJASE97Brn9asny6ULcl+rR7SPAeZBSktR2h2ljVpKSq7nA7jYOh97wOlOSw5ySuiGVHRXXNEQCHY0r1kY5QlwjQUr5rCiqWLqzickLcJi/UvQpqOvHhTOhBRNHVjWhJQVvQohW+vbkgYqa/13F2/OL75MSBaxmmufF8F0Jr/HRRY8n5BNUREgIGlOWrpLvg9rGvIV5ghF/IpGCXJB3QRyqOQtnq/aNekTXJ3hKqujSoj3xCNKRugpQilYz2H4oCAvEJLV9KbmX1/i13SjcTrTTEJtDJM6VFg4UxCOBv46CRyLWMdOizFVGK4raDXTOl/+5eOHD+BMzqWMGEPVs33In7/8/d3FW44LKHpwnvSjw5EOYF2c//T+3cW5QDMtey4XhVW687lyjHYm94DDdOEWTDQ/ynCoJZk9KloTyOK6qXLkLpM6aWIaFiUmTlE9MI3bLZE1u5wFi1WX2fDJJd4jzRU0uChQw0NZn7FtuqmD2f9h/nOqR26Cj584F0FVbERFRByIWT9j2+bpvznH+Hi9eK1GdgeBdoiRGBEgct1MFD8qS3BbZ4V7AEn692RmGCiC5QOQXaf1FnoXtKueq3qbB4lNEUI3zCRYQP3HwAgLuKd57om1urpVmxi4KpELQFsFAURIAbtt0OLMEelEJzmZzVcKXwg5CWZErNCbmJY1eZcn9Oa8qopKcZa1TJQJxW0y8CLfk8TcjUTbRka9Gffb/qKjIDTH9YdUoZ7Wg6Kkd1pHa+MY4r0tnFTQrUfPI99HzxqAPSGQVazMUukeg7fPAYZ4A3kR8dyJeOa3y5gQ0apXABGP0RjPnSb2XG57AfP5jhAZbdmW21HpENsapLrEUs3NPrK5GLE6jiutA1RXVEGyhkL5lYznRCugCxWH+BX6qnTEEemXIs3D7pRZRil0FCyHHj4GlgOxAMbQbaBQB7jGRcdZwdR3LL1pztJEkAd/jUCZdonHh/UW9aNdH6YfIFuYmUFyUSjFsBALffgj2hEXTV7jF4ESVvSS3pgH56fMfiqgriLfhyag4gxYymaJ6SFyB7oZuEvFRhU1yuKopIlSlaCLuYd3UR1vgz13GJ/RqIq3jsxcMrizSPLBep8QMsqUSODsy2v+HpYt3ICpc1VGc85aWWlZkNDu2YygF0yYBXvUckW2S4edBDwj0Td8dTSYWg7fi+xL1VC1eQwIt+Q+/lX7ZShqXEEuaq+L9EwzxGCv0tKrD9N2jtbctlEVxTVFMFFFR8hpQUEgRGpdDZWTidfG09iGXZMTW+eENxRBru1g6Vuix6KVKwFdtMgw2w9RcjdqVgjnPMqK1u2MI6p6EDiyHMAJpMXVS4TpBG1lpIzd4QbTD0qI5tf/vQPt5dLW3CdZPYr1uFL3SIeBTpbNQ9JTZq/YoY59jtBcf2eOSsfR3ALUsX40f5ADR1ugARpthA/ytPLLH
VYYgn5vK0Zb9qPyZe9eN5XQMnLUkRr8gW58tAKmDx6ugE2MbMQSdIdColIK9ovpbJXmC2Diiaq2GsQYYEpyF9x71otxG7rs4gVTUEg+rVTZUg83d/342gfKQ9dg+uqzZs3PW57AcPWTbinvJQlN+H3kMZO+Q1MNfo/qhgS4lOYoKsIzNyfXvocZvaLtDDD15BSE5s2OTz+ce89D/qCByMMmIKMDDG3mNvCEbI9xkddpzjfKPi5ioDHOZTrkooeAWMcDkoFpvTh2YzyN9GQw77tj6gmihzR3zQi1ZOoeH3JB793IN+lNSHdlfRvuaL0tEn1OCPBwND21C84vDQjhV4OuFqCiadEwoaaY/Qnu5huGn5fsGGTA//7x0oF/7ly8/jCuTUKf/sXJuG2Ii6or568AMrTiVVYUlXqbshGTWk7qzutOB8xVKRt3SjDOuVflrKeMj3eW+VP4Zz11DE956lZm1F6N9k6ysWDIRBT+9zlQu9fsMUdZv88PhqC2iZlxV2OrYUm3hBk8wGNV3QLoWJWW06N1FuVfuYwHDrXFgzHd6RVO5rg45FHGrlotLMvHEUjLC9XZpXm6udVUAPvhVH+Fdz75dk/kfrfevtcTgMNpvng+P12pMZJgOIjfUKRyu1k1leADDFrdBAfprJE3i6Kj2l7fr/d7jdai7608Lcbo0OkBSJ69WNn3xPz1N1u3ZPzdSYtgnhcmcmrecyT6V+IPgSeY5CcC3KOXkRwllzCT2vT/yPZQdY2nVZvTkhxTef3tFPwRf9X3k3ilDhBvHTH6w8vuxIMVHDRiqI8QDVi+WjVOSHCy0xgRwUmd12Cn0xsaI8hPaEsvkPz1Dy9xMaGObUn0DsLdSB1noIqr5OLkqVdHuKmX3/ZY+i1L1/cwFdHuspWz2mMsp//DrIWLQB0iY57IzgbtlSibDmGQ74jmRN7aDWbeKdRo/k6iRRe7dtPkfHKM2Bwew4GDCQz+6EBlpXGNvH409eBHC6h+hcNoxnXqSVC8uSKAFzjWly0lR+bEC3sukdM8ti2aLCFwpsLbDjgJ9jniRITdQm+7IRTfsVjgLy6AlwHxMitLWR0s7LSmu5nt8c9T+fnCXg3xoQrWwa8Sfz5TFPNTRTN/8ZukkpNTeftp3Ruku+gyzaOqcyLa1xlzHTF5SKUJWTc1WErbl6EVuS5yuyaMQt9AUFQDwa3PbcNTCNdyG8EDdDUhv+3vJLBt+9V0+gWp6A3clnIKHZsw2BNUOXHdzbPu90dUngRYE9EZo1SErmHRJQ3sp2VVXJJFwX+ew1bEepXm0NTh8JnRM8vG/MKf7AT673ccQcv/9+SviwLtR0uuovKTlEV4qU/BXRUDJ+P+iCpoOR1OlCShEO9Y08LyrOkUtgjKt7wEQhVY2tdNBNHmLdrb0qwMrM/RFZUHEQgDj8ZlekVzboBvedDAoquoCqxXuAAy6+2ZtUf4dP0tLU8tL4rF/YNBRsA9Ds5QShOJoPTg5zipCAdNVRngirSpDopBssN+mmKdaWpMJPEbMMJiqP61v1elS2B7SCMBH1MIIf+VPjInPIxXIJXjH20E5RtZWX5xbOEgCO6zLj+3IY58mXrOP4Cfxw+UJWwmyHJeXQRjXAhRcyTCX5f59AaOW6e8KaLegXmom4jZMVfR585HvhK1bZriO6qRaejoGg1b9Z0WWUOVuYfYG2FhD+LZ19i4yca/rqDIdExghc+SuRPkOVqYLVAmICUM82iHP+kLrDDErRmG+BpbbNL/AKD1gyo='))

Here's the difference in file size:

$ ls -lh *pyminifier.py
-rw-r--r-- 1 riskable riskable 3.8K 2010-09-18 14:55 gzipped_minified_pyminifier.py
-rw-r--r-- 1 riskable riskable  11K 2010-09-18 15:02 minified_pyminifier.py
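
Incidentally, the self-extracting stub is nothing exotic: zlib-compress, base64-encode, and emit a two-line loader. Here is a minimal Python 3 restatement of the gz_pack() idea from the listing above (a sketch only — the original is Python 2 and uses the exec statement):

```python
import base64
import zlib

def gz_pack(source):
    """Pack Python source into a self-extracting script (Python 3 sketch)."""
    payload = base64.b64encode(zlib.compress(source.encode("utf-8"))).decode("ascii")
    return ("import zlib, base64\n"
            "exec(zlib.decompress(base64.b64decode('%s')).decode('utf-8'))\n"
            % payload)

# Executing the packed script rebuilds and runs the original source:
packed = gz_pack("x = 41 + 1")
namespace = {}
exec(packed, namespace)   # namespace["x"] is now 42
```

The stub itself costs only a few dozen bytes, but base64 inflates the compressed payload by roughly 4/3, so this only pays off once the source is more than a few hundred bytes.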

20 comments

Daniel Lepage 12 years, 8 months ago

Discarding all comments removes the encoding. If the target script uses e.g. utf-8 with non-ascii characters the output will generate a syntax error.

Dan McDougall (author) 12 years, 8 months ago

Good catch. I'll update the script to fix that (shouldn't be too hard).

Note: It also messes up quoted # signs like this:

Comment = "#"

...results in Comment = ". Still trying to figure out how to fix that.
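
For reference, the fix that eventually made it into the listing above (remove_comment()) splits each line on quoted-string spans before looking for '#'. A self-contained Python 3 restatement of that idea:

```python
import re

# Matches non-escaped single- or double-quoted strings on one line
quoted_string = re.compile(r'''((?<!\\)".*?(?<!\\)")|((?<!\\)'.*?(?<!\\)')''')

def remove_comment(single_line):
    out_line = ""
    for section in filter(None, quoted_string.split(single_line)):
        if section.startswith(("'", '"')):
            out_line += section                # quoted span: keep any '#' inside
        elif '#' in section:
            out_line += section.split('#')[0]  # real comment: drop the rest
            break
        else:
            out_line += section
    return out_line.rstrip()

print(remove_comment('Comment = "#"  # a real comment'))  # → Comment = "#"
```

It still can't cope with triple-quoted or multi-line strings, which is part of why the 1.3 rewrite leans on the tokenize module instead.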

Akira Fora 12 years, 8 months ago

Good shot. You could save a few more bytes by:

  - transforming multiline instructions into single-line instructions
  - removing spaces around operators

e.g.

    singleLineDocstring = ( method + (QuotedString(quoteChar='"', escChar='\\', multiline=False) | \
        QuotedString(quoteChar="'", escChar='\\', multiline=False)) )

can be transformed into:

    singleLineDocstring=(method+(QuotedString(quoteChar='"',escChar='\\',multiline=False)|QuotedString(quoteChar="'",escChar='\\',multiline=False)))

Akira Fora 12 years, 8 months ago

You can achieve further size reduction by obfuscating the code.
I didn't find any solid estimate of the size reduction that obfuscation brings to Python code, but the Yahoo Developer Network claims that, applied to CSS and JavaScript:
"In a survey of ten top U.S. web sites, minification achieved a 21% size reduction versus 25% for obfuscation." (http://developer.yahoo.com/performance/rules.html)
I'd guess the advantage of obfuscation over minification is even more significant for Python than for CSS and JavaScript, given the respective syntax of these languages.
That said, obfuscators are more complex than minifiers, and obfuscated code is a pain if you have to debug it.

Dan McDougall (author) 12 years, 8 months ago

Yeah, I thought about adding some obfuscation features to this but I changed my mind because it makes debugging nearly impossible, and when you're working on embedded platforms you often can't debug your code anywhere but on the device itself.

Also, I had originally planned to reduce spaces between operators but I never got around to it. It is in the TODO list =)

Dan McDougall (author) 12 years, 8 months ago

UPDATE: It took FOREVER (and a lot of code) to work it all out but I've finally got this joining multiline pairs of parens, braces, and brackets (and removing unnecessary whitespace inside them). So something like this:

myvar = ('test',
    'test2',
)

Becomes...

myvar = ('test','test2',)

I googled and googled but I never did find another example of code that does that. Not even in another language. The hardest part was getting it to ignore things in triple double/single quotes and dealing with escaped characters (especially code that literally has something like "'\'").
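
For comparison, here's a sketch (not the recipe's code) of the same join done with the tokenize module: line breaks inside an open pair arrive as NL tokens rather than NEWLINE, so dropping NL whenever the bracket depth is positive joins the pair, and the tokenizer does all the quoting and escaping bookkeeping for you:

```python
import io
import tokenize

def join_pairs(source):
    """Join multiline (), [], {} pairs onto one logical line (sketch)."""
    depth = 0
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP and tok.string in "([{":
            depth += 1
        elif tok.type == tokenize.OP and tok.string in ")]}":
            depth -= 1
        if tok.type == tokenize.NL and depth > 0:
            continue  # a line break inside an open pair: drop it
        tokens.append((tok.type, tok.string))
    return tokenize.untokenize(tokens)

src = "myvar = ('test',\n    'test2',\n)\n"
joined = join_pairs(src)   # still valid Python, now on one line
```

Note that untokenize() normalizes whitespace rather than preserving it, so the output spacing differs slightly from what pyminifier produces.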

Dan McDougall (author) 12 years, 8 months ago

Almost forgot: I've also got it preserving shebangs (#!/usr/bin/env python) and encoding strings now (per Daniel Lepage's comment).

Dan McDougall (author) 12 years, 7 months ago

Update (rev 10): Fixed some bugs where it wasn't working properly in certain (odd) situations. It also joins multi-lines (i.e. that end in '\') properly again.

jcaballero.hep 12 years, 4 months ago

It doesn't work when the original code includes lines like the following one:

print "this is an example # of line which fails"

pyminifier interprets everything after the '#' character as a comment and, therefore, the rest of the line is not processed correctly.
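
This is exactly the class of bug that pushed the 1.3 rewrite toward Python's built-in tokenize module: the tokenizer knows that a '#' inside a string literal is not a comment. A minimal Python 3 sketch of tokenize-based comment stripping (simpler than the recipe's remove_comments_and_docstrings(), which also strips docstrings):

```python
import io
import tokenize

def strip_comments(source):
    """Drop COMMENT tokens; a '#' inside a string passes through (sketch)."""
    out = []
    last_row, last_col = -1, 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # a genuine comment token: drop it
        srow, scol = tok.start
        if srow > last_row:
            last_col = 0  # new physical line: measure indentation from column 0
        if scol > last_col:
            out.append(" " * (scol - last_col))  # restore skipped whitespace
        out.append(tok.string)
        last_row, last_col = tok.end
    return "".join(out)

src = 'print("this is an example # of line which fails")  # a real comment\n'
print(strip_comments(src))  # the quoted '#' survives; the real comment is gone
```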

robert marshall 12 years, 3 months ago

If I give it a line with redundant brackets:

if (not result):

this gets converted to

if (notresult):

i.e. with no space between the 'not' and the variable

etatS evictA 12 years ago

The problem mentioned by Robert Marshall can be fixed by changing

            elif character == ' ' and inside_pair and not inside_quotes:
                pass

to

            elif character == ' ' and inside_pair and not inside_quotes:
                if output[-1] != ' ':
                    output += ' '

or even better:

            elif character == ' ' and inside_pair and not inside_quotes:
                if not output[-1] in [' ', opener]:
                    output += ' '

jcaballero.hep mentions a problem that I haven't been able to fix and that is a showstopper for me. It appears to be some sort of error in the comment regexp. I would be very glad if someone could solve it.

Also, I didn't try this, but there will be trouble on """ ''' asd ''' """

(which could very reasonably appear, e.g., in doctests)
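
For what it's worth, a real tokenizer copes with that case naturally: the inner ''' is just characters inside the outer """ string, so the whole thing comes back as a single STRING token. A quick Python 3 check:

```python
import io
import tokenize

src = 's = """ \'\'\' asd \'\'\' """\n'
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
strings = [t for t in tokens if t.type == tokenize.STRING]
# One STRING token covering the whole literal, nested ''' included
assert len(strings) == 1
assert "''' asd '''" in strings[0].string
```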

Dan McDougall (author) 11 years, 6 months ago

Updated the code to fix all of the conditions in the comments. Now I need to work on fixing the multi-line joins.

Dan McDougall (author) 11 years, 6 months ago

Forget it, I must be crazy: multi-line joins work fine; they just look weird because the whitespace isn't being removed. I'll fix that next =)

Benoît Ryder 11 years, 4 months ago

When minifying the following code:

def f():
  pass
  (a, b) = 1, 2
  pass

reduce_operators() will return:

def f():
  pass
 (a,b)=1,2
  pass

This will lead to an IndentationError in dedent() due to the space removed before (a,b).

Dan McDougall (author) 11 years, 3 months ago

@Benoit: Interesting edge case... When I run your (perfectly valid) example on my (python 2.6) install I get a syntax error from the tokenize module:

./pyminifier.py /tmp/test.py 
Traceback (most recent call last):
  File "./pyminifier.py", line 557, in <module>
    main()
  File "./pyminifier.py", line 550, in main
    print minify(source)
  File "./pyminifier.py", line 490, in minify
    source = dedent(source)
  File "./pyminifier.py", line 375, in dedent
    for i,tok in enumerate(tokenize.generate_tokens(io_obj.readline)):
  File "/usr/lib/python2.6/tokenize.py", line 346, in generate_tokens
    ("<tokenize>", lnum, pos, line))
  File "<tokenize>", line 3
    (a,b)=1,2
    ^
IndentationError: unindent does not match any outer indentation level

So this actually might be a bug in the tokenize module. I'll have to investigate further. Regardless, you can still get around this issue by using this syntax instead:

def f():
     pass
     a, b = 1, 2
     pass

...which will minify quite well.

Dan McDougall (author) 11 years, 2 months ago

@Benoit: I fixed that "indentation error when operators start a new line" bug in version 1.4.1.

Gareth Rees 10 years, 4 months ago

I found four bugs in version 1.4.1.

  1. Docstring recognition has false positives. Input:

    def foo(): 'a' if True else 'd'

Output has a syntax error:

def foo():
 if True else 'd'

  2. Under some circumstances, a function can end up being removed completely if it's the last function in the file. Input:

    def bar(): 'docstring'

Output: nothing! (Off-by-one error in line handling?)

  3. Blank lines get deleted in triple-quoted strings. Input:

    s = """

    Multi-line string with blank lines

    """

Output:

s="""
Multi-line string with blank lines
"""

  4. Erroneous "pass" appended when the function body is on the same line as the declaration. Input:

    def baz(): return 1

Output:

def baz():return 1 pass

Dan McDougall (author) 10 years, 4 months ago

Thanks for reporting these bugs! Now, on to fixing them...

Steve Isaacson 9 years, 7 months ago

I found a bug in version 1.4.1 in the "fix_empty_methods()" function. Running the following example file through pyminifier:

# Add 2 numbers
def add(a,b): return a + b
# Subtract 2 numbers
def sub(a,b): return a - b
# Multiply 2 numbers
def mult(a,b): return a * b
# Divide 2 numbers
def div(a,b): return a / b
# Test functions
def test(a,b):
    ''' This is a placeholder function '''
def test2(a,b):
    """ This is another placeholder function """

produces:

def add(a,b):return a+b pass
def sub(a,b):return a-b
def mult(a,b):return a*b pass
def div(a,b):return a/b
def test(a,b):pass
def test2(a,b):

Here is a fix for the "fix_empty_methods()" function:

def fix_empty_methods(source):
    def_indentation_level = 0
    output = ""
    just_matched = False
    previous_line = None
    method = re.compile(r'^\s*def\s*.*\(.*\):$')
    for line in source.split('\n'):
        line = line.rstrip()
        if len(line.strip()) > 0: # Don't look at blank lines
            if just_matched == True:
                this_indentation_level = len(line.rstrip()) - len(line.strip())
                if def_indentation_level == this_indentation_level:
                    # This method is empty, insert a 'pass' statement
                    if previous_line:
                        output += "%s pass\n%s" % (previous_line, line)
                    else:
                        output += " pass\n%s" % line
                else:
                    if previous_line:
                        output += "%s\n%s" % (previous_line, line)
                    else:
                        output += "\n%s" % line
                just_matched = False
                # Check to see if the current line matches too
                if method.match(line):
                    def_indentation_level = len(line) - len(line.strip()) 
                    just_matched = True
                    # We already printed this line, so don't print again
                    previous_line = None
                else:
                    output += "\n"
            elif method.match(line):
                def_indentation_level = len(line) - len(line.strip())  # Indentation of the def line
                just_matched = True
                previous_line = line
            else:
                output += "%s\n" % line  # Ordinary line: pass it through
        else:
            if not (just_matched and not previous_line):
                output += "\n"
    return output

Produces:

def add(a,b):return a+b
def sub(a,b):return a-b
def mult(a,b):return a*b
def div(a,b):return a/b
def test(a,b):pass
def test2(a,b):pass

Piyush 7 years, 3 months ago

Hi,

I obfuscated my module, and the obfuscation is successful, great job there.

But I cannot use my module, once obfuscated.

I don't know if I'm making any sense or not.

But my requirement is to obfuscate my python module and deliver it to the customer, and the code should work without de-obfuscation.

Can you please comment on this?

Thanks in advance.