A regex-based JavaScript compression kludge.
The current version has been tested against MochiKit and json.js, both of which tripped up the previous version (see comments).
'''
a regex-based JavaScript code compression kludge
'''
import re
class JSCompressor(object):
    def __init__(self, compressionLevel=2, measureCompression=False):
        '''
        compressionLevel:
        0 - no compression, script returned unchanged. For debugging only -
            try if you suspect that compression compromises your script
        1 - Strip comments and empty lines, don't change line breaks and indentation (code remains readable)
        2 - Additionally strip insignificant whitespace (code will become quite unreadable)
        measureCompression: append a comment stating the extent of compression
        '''
        self.compressionLevel = compressionLevel
        self.measureCompression = measureCompression

    # a bunch of regexes used in compression
    # first, exempt string and regex literals from compression by transient substitution
    findLiterals = re.compile(r'''
        (\'.*?(?<=[^\\])\') | # single-quoted strings
        (\".*?(?<=[^\\])\") | # double-quoted strings
        ((?<![\*\/])\/(?![\/\*]).*?(?<![\\])\/) # JS regexes, trying hard not to be tripped up by comments
        ''', re.VERBOSE)
    # literals are temporarily replaced by numbered placeholders
    literalMarker = '@_@%d@_@' # temporary replacement
    backSubst = re.compile(r'@_@(\d+)@_@') # put the string literals back in
    mlc1 = re.compile(r'(\/\*.*?\*\/)') # /* ... */ comments on single line
    mlc = re.compile(r'(\/\*.*?\*\/)', re.DOTALL) # real multiline comments
    slc = re.compile(r'\/\/.*') # remove single line comments
    collapseWs = re.compile(r'(?<=\S)[ \t]+') # collapse successive non-leading white space characters into one
    squeeze = re.compile(r'''
        \s+(?=[\}\]\)\:\&\|\=\;\,\.\+]) | # remove whitespace preceding control characters
        (?<=[\{\[\(\:\&\|\=\;\,\.\+])\s+ | # ... or following such
        [ \t]+(?=\W) | # remove spaces or tabs preceding non-word characters
        (?<=\W)[ \t]+ # ... or following such
        ''', re.VERBOSE | re.DOTALL)

    def compress(self, script):
        '''
        perform compression and return compressed script
        '''
        if self.compressionLevel == 0:
            return script
        lengthBefore = len(script)
        # first, substitute string literals by placeholders to prevent the regexes messing with them
        literals = []
        def insertMarker(mo):
            l = mo.group()
            literals.append(l)
            return self.literalMarker % (len(literals) - 1)
        script = self.findLiterals.sub(insertMarker, script)
        # now, to the literal-stripped carcass, apply some kludgy regexes for deflation...
        script = self.slc.sub('', script) # strip single line comments
        script = self.mlc1.sub(' ', script) # replace /* .. */ comments on single lines by space
        script = self.mlc.sub('\n', script) # replace real multiline comments by newlines
        # remove empty lines and trailing whitespace
        script = '\n'.join([l.rstrip() for l in script.splitlines() if l.strip()])
        if self.compressionLevel == 2: # squeeze out any dispensable whitespace
            script = self.squeeze.sub('', script)
        elif self.compressionLevel == 1: # only collapse multiple whitespace characters
            script = self.collapseWs.sub(' ', script)
        # now back-substitute the string and regex literals
        def backsub(mo):
            return literals[int(mo.group(1))]
        script = self.backSubst.sub(backsub, script)
        if self.measureCompression:
            lengthAfter = float(len(script))
            squeezedBy = int(100*(1-lengthAfter/lengthBefore))
            script += '\n// squeezed out %s%%\n' % squeezedBy
        return script

if __name__ == '__main__':
    script = '''
/* this is a totally useless multiline comment, containing a silly "quoted string",
surrounded by several superfluous line breaks
*/
// and this is an equally important single line comment
sth = "this string contains 'quotes', a /regex/ and a // comment yet it will survive compression";
function wurst(){ // this is a great function
    var hans = 33;
}
sthelse = 'and another useless string';
function hans(){ // another function
    var bill = 66; // successive spaces will be collapsed into one;
    var bob = 77 // this line break will be preserved b/c of lacking semicolon
    var george = 88;
}
'''
    for x in range(1, 3):
        print('\ncompression level', x, ':\n--------------')
        c = JSCompressor(compressionLevel=x, measureCompression=True)
        cpr = c.compress(script)
        print(cpr)
        print('length', len(cpr))
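The placeholder trick that compress() relies on can be looked at in isolation. The following is only a minimal sketch, not the recipe's own code: it protects quoted strings only (the recipe also protects JS regex literals), and the names FIND_STRINGS, protect and restore are made up for this illustration.

```python
import re

# Simplified literal finder: quoted strings only (an assumption made for
# brevity; the recipe's findLiterals also covers JS regex literals).
FIND_STRINGS = re.compile(r'"(?:\\.|[^"\\])*"' + '|' + r"'(?:\\.|[^'\\])*'")
MARKER = '@_@%d@_@'                  # same marker format as the recipe
BACK = re.compile(r'@_@(\d+)@_@')

def protect(script):
    """Replace each string literal by a numbered placeholder."""
    literals = []
    def stash(mo):
        literals.append(mo.group())
        return MARKER % (len(literals) - 1)
    return FIND_STRINGS.sub(stash, script), literals

def restore(script, literals):
    """Put the stashed literals back in place of their placeholders."""
    return BACK.sub(lambda mo: literals[int(mo.group(1))], script)

src = 'x = "a // b";  // comment'
masked, lits = protect(src)
# With the literal masked, a naive comment stripper cannot damage it.
stripped = re.sub(r'//.*', '', masked).rstrip()
print(masked)                   # x = @_@0@_@;  // comment
print(restore(stripped, lits))  # x = "a // b";
```

The point of the detour is that the comment and whitespace regexes never see the contents of a literal, so a "//" inside a string survives.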
Tags: web
Problem compressing MochiKit. It is great to have a compress capability in python. I was trying this and ran into a problem compressing my scripts. I seem to hit a problem in how it compresses MochiKit.Base. I will see if I can track the issue down when I can.
Buggy, alas. This would be quite useful if it actually worked, but unfortunately it doesn't. For example, compressing http://www.json.org/json2.js at level 2 yields a script with syntax errors.
An alternative JavaScript compressor. There's another JavaScript compressor at http://www.crockford.com/javascript/jsmin.py.txt. It's working fine for me.
json.js and mochikit now work. The problem was caused by "quoted phrases" in comments. This is now resolved; json.js and all example pages in the standard download have been tested and work.
oops. 'standard download' is 'mochikit standard download'.
In your squeeze regular expression, the clause
[ \t]+(?=\W)
should be
[ \t]+(?=[^\w\s])
Otherwise it will be tripped up by multiple spaces when compressionLevel is set to 2. For example:
return  value;
will become:
returnvalue;
Multiple spaces will still be shrunk down to one space by the last clause,
(?<=\W)[ \t]+
so leading whitespace and redundant spaces are still removed. (Note the two spaces between "return" and "value" in the example above.)
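The behaviour described in this comment can be reproduced outside the class. This is a small sketch under the assumption that the squeeze pattern is exactly the one in the listing; ORIGINAL and PATCHED are names invented here, and PATCHED differs only in the third clause.

```python
import re

# The squeeze pattern as posted in the recipe.
ORIGINAL = re.compile(r'''
    \s+(?=[\}\]\)\:\&\|\=\;\,\.\+]) |
    (?<=[\{\[\(\:\&\|\=\;\,\.\+])\s+ |
    [ \t]+(?=\W) |
    (?<=\W)[ \t]+
    ''', re.VERBOSE | re.DOTALL)

# Same pattern with the third clause changed as suggested in the comment.
PATCHED = re.compile(r'''
    \s+(?=[\}\]\)\:\&\|\=\;\,\.\+]) |
    (?<=[\{\[\(\:\&\|\=\;\,\.\+])\s+ |
    [ \t]+(?=[^\w\s]) |
    (?<=\W)[ \t]+
    ''', re.VERBOSE | re.DOTALL)

line = 'return  value;'          # two spaces between the words
print(ORIGINAL.sub('', line))    # returnvalue;
print(PATCHED.sub('', line))     # return value;
```

In the original pattern the first space is removed because the character after it (the second space) is a non-word character, and the second space is then removed by the last clause because it follows a space. Excluding whitespace from the lookahead leaves exactly one space behind.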
Actually, the regular expression clause should be
[ \t]+(?=[^\$_\w\s])
Otherwise it will choke on variables that start with the dollar sign or underscore.
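A quick way to compare the two suggested clauses is to plug each into the rest of the squeeze pattern. This is a sketch only; make_squeeze is a hypothetical helper written for this comparison and is not part of the recipe.

```python
import re

def make_squeeze(third_clause):
    """Build the recipe's squeeze pattern with a replaceable third clause."""
    return re.compile(r'''
        \s+(?=[\}\]\)\:\&\|\=\;\,\.\+]) |
        (?<=[\{\[\(\:\&\|\=\;\,\.\+])\s+ |
        ''' + third_clause + r''' |
        (?<=\W)[ \t]+
        ''', re.VERBOSE | re.DOTALL)

first_fix = make_squeeze(r'[ \t]+(?=[^\w\s])')      # earlier suggestion
second_fix = make_squeeze(r'[ \t]+(?=[^\$_\w\s])')  # suggestion above

print(first_fix.sub('', 'var $x = 1;'))    # var$x=1;   (keyword glued to name)
print(second_fix.sub('', 'var $x = 1;'))   # var $x=1;
# The underscore is already matched by \w, so both variants keep this space:
print(first_fix.sub('', 'var _y = 2;'))    # var _y=2;
print(second_fix.sub('', 'var _y = 2;'))   # var _y=2;
```

As far as this sketch shows, it is the dollar sign that makes the difference; listing the underscore explicitly does no harm but does not change the result.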