
A regex-based JavaScript compression kludge.

The current version has been tested against MochiKit and json.js, both of which tripped up the previous version (see comments).

Python, 126 lines
'''
a regex-based JavaScript code compression kludge
'''
import re

class JSCompressor(object):

    def __init__(self, compressionLevel=2, measureCompression=False):
        '''
        compressionLevel:
        0 - no compression, script returned unchanged. For debugging only -
            try if you suspect that compression compromises your script
        1 - Strip comments and empty lines, don't change line breaks and indentation (code remains readable)
        2 - Additionally strip insignificant whitespace (code will become quite unreadable)

        measureCompression: append a comment stating the extent of compression
        '''
        self.compressionLevel = compressionLevel
        self.measureCompression = measureCompression

    # a bunch of regexes used in compression
    # first, exempt string and regex literals from compression by transient substitution

    findLiterals = re.compile(r'''
        (\'.*?(?<=[^\\])\')             |       # single-quoted strings
        (\".*?(?<=[^\\])\")             |       # double-quoted strings
        ((?<![\*\/])\/(?![\/\*]).*?(?<![\\])\/) # JS regexes, trying hard not to be tripped up by comments
        ''', re.VERBOSE)

    # literals are temporarily replaced by numbered placeholders

    literalMarker = '@_@%d@_@'                  # temporary replacement
    backSubst = re.compile(r'@_@(\d+)@_@')      # put the string literals back in

    mlc1 = re.compile(r'(\/\*.*?\*\/)')         # /* ... */ comments on single line
    mlc = re.compile(r'(\/\*.*?\*\/)', re.DOTALL)  # real multiline comments
    slc = re.compile(r'\/\/.*')                 # remove single line comments

    collapseWs = re.compile(r'(?<=\S)[ \t]+')   # collapse successive non-leading white space characters into one

    squeeze = re.compile(r'''
        \s+(?=[\}\]\)\:\&\|\=\;\,\.\+])   |     # remove whitespace preceding control characters
        (?<=[\{\[\(\:\&\|\=\;\,\.\+])\s+  |     # ... or following such
        [ \t]+(?=[^$\w\s])                |     # remove spaces or tabs preceding punctuation, but not before whitespace, word characters or $
        (?<=\W)[ \t]+                           # ... or following such
        '''
        , re.VERBOSE | re.DOTALL)

    def compress(self, script):
        '''
        perform compression and return compressed script
        '''
        if self.compressionLevel == 0:
            return script

        lengthBefore = len(script)

        # first, substitute string literals by placeholders to prevent the regexes messing with them
        literals = []

        def insertMarker(mo):
            l = mo.group()
            literals.append(l)
            return self.literalMarker % (len(literals) - 1)

        script = self.findLiterals.sub(insertMarker, script)

        # now, to the literal-stripped carcass, apply some kludgy regexes for deflation...
        script = self.slc.sub('', script)       # strip single line comments
        script = self.mlc1.sub(' ', script)     # replace /* .. */ comments on single lines by space
        script = self.mlc.sub('\n', script)     # replace real multiline comments by newlines

        # remove empty lines and trailing whitespace
        script = '\n'.join([l.rstrip() for l in script.splitlines() if l.strip()])

        if self.compressionLevel == 2:              # squeeze out any dispensable whitespace
            script = self.squeeze.sub('', script)
        elif self.compressionLevel == 1:            # only collapse multiple whitespace characters
            script = self.collapseWs.sub(' ', script)

        # now back-substitute the string and regex literals
        def backsub(mo):
            return literals[int(mo.group(1))]

        script = self.backSubst.sub(backsub, script)

        if self.measureCompression:
            lengthAfter = float(len(script))
            squeezedBy = int(100*(1-lengthAfter/lengthBefore))
            script += '\n// squeezed out %s%%\n' % squeezedBy

        return script


if __name__ == '__main__':
    script = '''


    /* this is a totally useless multiline comment, containing a silly "quoted string",
       surrounded by several superfluous line breaks
     */


    // and this is an equally important single line comment

    sth = "this string contains 'quotes', a /regex/ and a // comment yet it will survive compression";

    function wurst(){           // this is a great function
        var hans = 33;
    }

    sthelse = 'and another useless string';

    function hans(){            // another function
        var   bill   =   66;    // successive spaces will be collapsed into one;
        var bob = 77            // this line break will be preserved b/c of lacking semicolon
        var george = 88;
    }
    '''

    for x in range(1,3):
        print('\ncompression level', x, ':\n--------------')
        c = JSCompressor(compressionLevel=x, measureCompression=True)
        cpr = c.compress(script)
        print(cpr)
        print('length', len(cpr))
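
The placeholder trick at the heart of the recipe can be tried in isolation. The sketch below is a simplified illustration, not the recipe's own code: it assumes only double-quoted strings without escaped quotes, and the names `protect`, `restore`, `STRING_LITERAL`, `PLACEHOLDER` and `LINE_COMMENT` are made up for this example. It shows why literals are swapped out before the comment regex runs: a `//` inside a string would otherwise be taken for the start of a comment.

```python
import re

# Simplifying assumption: only double-quoted strings with no escaped quotes.
STRING_LITERAL = re.compile(r'"[^"\n]*"')
PLACEHOLDER = re.compile(r'@_@(\d+)@_@')
LINE_COMMENT = re.compile(r'//.*')


def protect(source):
    """Replace each string literal with a numbered placeholder."""
    literals = []

    def stash(match):
        literals.append(match.group())
        return '@_@%d@_@' % (len(literals) - 1)

    return STRING_LITERAL.sub(stash, source), literals


def restore(source, literals):
    """Put the stashed literals back in place of their placeholders."""
    return PLACEHOLDER.sub(lambda m: literals[int(m.group(1))], source)


js = 'var url = "http://example.com"; // trailing comment'

# Naive approach: the comment regex cuts the string literal in half.
naive = LINE_COMMENT.sub('', js)

# Protected approach: mask literals, strip comments, then restore literals.
masked, literals = protect(js)
stripped = LINE_COMMENT.sub('', masked).rstrip()
result = restore(stripped, literals)

print(naive)    # var url = "http:
print(result)   # var url = "http://example.com";
```

The same masking step is what keeps regex literals and single-quoted strings out of harm's way in the full recipe, which uses a more elaborate pattern to find them.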

8 comments

Mike Webb 16 years, 9 months ago

Problem compressing MochiKit. It is great to have a compress capability in python. I was trying this and ran into a problem compressing my scripts. I seem to hit a problem in how it compresses MochiKit.Base. I will see if I can track the issue down when I can.

Andreas Gustafsson 16 years, 5 months ago

Buggy, alas. This would be quite useful if it actually worked, but unfortunately it doesn't. For example, compressing http://www.json.org/json2.js at level 2 yields a script with syntax errors.

Andreas Gustafsson 16 years, 5 months ago

An alternative JavaScript compressor. There's another JavaScript compressor at http://www.crockford.com/javascript/jsmin.py.txt. It's working fine for me.

Michael Palmer (author) 15 years, 11 months ago

json.js and mochikit now work. The problem was caused by "quoted phrases" in comments. This is now resolved; json.js and all example pages in the standard download have been tested and work.

Michael Palmer (author) 15 years, 11 months ago

oops. 'standard download' is 'mochikit standard download'.

Robert Ruana 12 years, 2 months ago

In your squeeze regular expression, the clause [ \t]+(?=\W) should be [ \t]+(?=[^\w\s]). Otherwise it will be tripped up by multiple spaces when compressionLevel is set to 2.

For example: return value; Will become: returnvalue;

Multiple spaces will still be shrunk down to one space by the last clause, (?<=\W)[ \t]+, so leading whitespace and redundant spaces are still removed.

Robert Ruana 12 years, 2 months ago

There should be two spaces between "return" and "value" in the above example.

Robert Ruana 12 years, 1 month ago

Actually, the regular expression clause should be [ \t]+(?=[^\$_\w\s]). Otherwise it will choke on variables that start with the dollar sign or underscore.
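
The difference between the clause `[ \t]+(?=\W)` and the replacement suggested in these comments can be checked with a small experiment. The sketch below is an illustration only: it isolates the last two clauses of the squeeze expression rather than running the whole compressor, and the names `loose` and `strict` are made up for this example. Since `\w` already covers the underscore, only `$` needs to be added to the excluded set.

```python
import re

# Last two clauses of the squeeze expression, with the lookahead \W ...
loose = re.compile(r'[ \t]+(?=\W)|(?<=\W)[ \t]+')

# ... and with the lookahead narrowed to exclude whitespace, word chars and $.
strict = re.compile(r'[ \t]+(?=[^$\w\s])|(?<=\W)[ \t]+')

two_spaces = 'return  value;'      # two spaces between the words
dollar_var = 'var $el = 1;'        # identifier starting with a dollar sign

print(loose.sub('', two_spaces))   # returnvalue;
print(strict.sub('', two_spaces))  # return value;
print(loose.sub('', dollar_var))   # var$el=1;
print(strict.sub('', dollar_var))  # var $el=1;
```

With the loose lookahead, the first of two spaces is removed because it is followed by another space, and the second is then removed by the lookbehind clause, which fuses the two words. The narrowed lookahead leaves the first space alone, so exactly one separator survives.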

Created by Michael Palmer on Thu, 13 Jul 2006 (PSF)