The Shannon entropy of a string is the minimum average number of bits per symbol required to encode (compress) the string.
# Shannon Entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon Entropy of the string * string length
# FB - 201011291
import math

st = 'aabcddddefffg'  # input string
# st = '00010101011110' # Shannon entropy for this would be 1 bit/symbol

print 'Input string:'
print st
print

stList = list(st)
alphabet = list(set(stList))  # list of symbols in the string
print 'Alphabet of symbols in the string:'
print alphabet
print

# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
    ctr = 0
    for sym in stList:
        if sym == symbol:
            ctr += 1
    freqList.append(float(ctr) / len(stList))
print 'Frequencies of alphabet symbols:'
print freqList
print

# Shannon entropy
ent = 0.0
for freq in freqList:
    ent = ent + freq * math.log(freq, 2)
ent = -ent
print 'Shannon entropy:'
print ent
print 'Minimum number of bits required to encode each symbol:'
print int(math.ceil(ent))
Since Shannon entropy gives the theoretical data compression limit, it can be used to calculate the efficiency percentage of any data compression software (for a particular input data file):

compression efficiency percentage = ((original file size) - (compressed file size)) / ((original file size) - (min file size possible)) * 100

(min file size possible) = SE * (file size) / 8

because the SE value means 'minimum bits per byte-character' of the file, so the product SE * (file size) is in bits and must be divided by 8 to convert to bytes.

(This efficiency calculation is only approximate, because (min file size possible) excludes the size of the decompression key, which in reality is included in the compressed file.)
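The efficiency formula above can be sketched directly in code. This is a minimal illustration (the function names `shannon_entropy` and `compression_efficiency` are mine, not from the original recipe):

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy in bits per symbol of a string or byte sequence."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compression_efficiency(original_size, compressed_size, bits_per_byte):
    """Approximate efficiency percentage per the formulas above.

    The minimum possible size, in bytes, is
    entropy (bits/byte) * original size (bytes) / 8.
    """
    min_size = bits_per_byte * original_size / 8.0
    return (original_size - compressed_size) / (original_size - min_size) * 100.0
```

For example, a 1000-byte file whose entropy is 4 bits/byte can in theory shrink to 500 bytes, so a compressor that reaches exactly 500 bytes scores 100%.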
This code can be used to calculate the Shannon entropy of a file:
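The original file-entropy snippet is not reproduced here; a minimal sketch of the same idea (read the file as bytes, then apply the per-symbol frequency calculation from the recipe above) might look like this:

```python
import math
from collections import Counter

def file_entropy(path):
    """Shannon entropy (bits per byte) of a file's contents."""
    with open(path, 'rb') as f:
        data = f.read()
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```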
Shannon Entropy calculation for the same input string using 2/4/8/16 bits/symbol:
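The multi-bit-width snippet is not shown above; one way to sketch it is to read a binary string in blocks of k bits and treat each block as a single symbol (`block_entropy` is my name for this, not the original recipe's):

```python
import math
from collections import Counter

def block_entropy(bits, block_size):
    """Shannon entropy of a binary string read in blocks of
    `block_size` bits, each block treated as one symbol."""
    blocks = [bits[i:i + block_size]
              for i in range(0, len(bits), block_size)]
    n = len(blocks)
    counts = Counter(blocks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# To apply this to the recipe's character string, first expand it to bits:
# bits = ''.join(format(ord(ch), '08b') for ch in 'aabcddddefffg')
# then evaluate block_entropy(bits, k) for k in (2, 4, 8, 16).
```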
If these lines are added to the end of the above code, it will also calculate the minimum possible string size after maximum theoretical data compression (the result is in bits):
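The appended lines themselves are not shown above; given the entropy `ent` and the string length computed by the recipe, the calculation they most likely performed is simply their product. A self-contained sketch (function names are mine):

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Bits per symbol, as computed by the recipe above."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def min_compressed_size_bits(entropy, num_symbols):
    """Theoretical minimum compressed size of the string, in bits."""
    return entropy * num_symbols

# e.g. for the recipe's input string:
# st = 'aabcddddefffg'
# min_compressed_size_bits(shannon_entropy(st), len(st))
```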
As the above code would show, the maximum possible compressed size of a data string depends on the bits/symbol value chosen. So if a data compression algorithm existed that allowed any bits/symbol value to be used, then the Shannon entropy values for different bits/symbol could be used to choose the bits/symbol value that produces the smallest compressed file size. (But of course this would also increase compression time a lot!)
Shannon Entropy of an image (calculated for red, green, blue separately):
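The image snippet is not reproduced above. The idea is to compute the entropy of each color channel separately; the sketch below takes an iterable of (r, g, b) tuples so it stays self-contained, whereas a real implementation would obtain the pixels from an imaging library (for example PIL's `Image.getdata()`):

```python
import math
from collections import Counter

def channel_entropy(values):
    """Shannon entropy (bits per value) of one color channel."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def rgb_entropy(pixels):
    """Per-channel Shannon entropies of (r, g, b) pixel tuples.

    Returns a (red, green, blue) tuple of entropies.
    """
    reds, greens, blues = zip(*pixels)
    return (channel_entropy(reds),
            channel_entropy(greens),
            channel_entropy(blues))
```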
Shannon Entropy of a file (this code is more efficient than the previous one):
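That more efficient version is also not reproduced above. A plausible sketch of the improvement is to accumulate byte counts chunk by chunk instead of reading the whole file into memory (the function name and chunk size are my choices):

```python
import math
from collections import Counter

def file_entropy_chunked(path, chunk_size=1 << 16):
    """Shannon entropy (bits per byte) of a file, read in fixed-size
    chunks so the whole file never has to be in memory at once."""
    counts = Counter()
    total = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            counts.update(chunk)  # per-byte-value frequency counts
            total += len(chunk)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```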
thank you :)
I'm trying to run the code in the post "Shannon Entropy of a file (this code is more efficient than the previous one)"; however, it gives a syntax error in Python.
What modifications do I need to make for it to scan a specific file? Sorry for the noobish question. Thanks! :)
It is designed to be run from the (Windows/Linux) command prompt.
First install the latest Python 2.x version.
Save that code as a file named file_entropy.py.
Add the path of the Python directory to your Windows/Linux path.
Open a command prompt.
Type:
python [path to file_entropy.py]file_entropy.py [path to the file to be analyzed][name of the file to be analyzed]
Or:
[path to file_entropy.py]file_entropy.py [path to the file to be analyzed][name of the file to be analyzed]
I just ran it again and there is no syntax error.
(If you still get it please post me the exact error message you see.)