http://akiscode.com/articles/sha-1directoryhash.shtml
By definition, a cryptographic hash is "a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value".
Usually these hashes are used on files to "fingerprint" them, but in order to do the same to a directory you have to do something like this:
# http://akiscode.com/articles/sha-1directoryhash.shtml
# Copyright (c) 2009 Stephen Akiki
# MIT License (Means you can do whatever you want with this)
# See http://www.opensource.org/licenses/mit-license.php
# Error Codes:
#   -1 -> Directory does not exist
#   -2 -> General error (see stack traceback)
def GetHashofDirs(directory, verbose=0):
    import hashlib, os
    SHAhash = hashlib.sha1()
    if not os.path.exists(directory):
        return -1

    try:
        for root, dirs, files in os.walk(directory):
            for names in files:
                if verbose == 1:
                    print 'Hashing', names
                filepath = os.path.join(root, names)
                try:
                    f1 = open(filepath, 'rb')
                except:
                    # You can't open the file for some reason; skip it
                    continue

                while 1:
                    # Read the file in small chunks so large files
                    # never have to fit in memory all at once
                    buf = f1.read(4096)
                    if not buf:
                        break
                    SHAhash.update(hashlib.sha1(buf).hexdigest())
                f1.close()

    except:
        import traceback
        # Print the stack traceback
        traceback.print_exc()
        return -2
    return SHAhash.hexdigest()

print GetHashofDirs('My Documents', 1)
It's perfectly possible to do this for arbitrarily big directories while maintaining constant memory usage, provided you stick with one hashing algorithm (IMO SHA-1 is better, and combining the two mainly just disguises MD5's failings with a tiny dash of SHA-1, like MSG on a Chinese take-out meal), like this:
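A minimal sketch of that single-algorithm, constant-memory approach (the hash_directory name and 4096-byte chunk size here are just placeholders, not the commenter's exact code):

import hashlib
import os

def hash_directory(directory, chunk_size=4096):
    # One SHA-1 object for the whole tree; only one chunk is ever
    # held in memory at a time, so memory usage stays constant.
    sha = hashlib.sha1()
    for root, dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, 'rb') as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        sha.update(chunk)
            except IOError:
                # Unreadable file: skip it, as the recipe above does
                continue
    return sha.hexdigest()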
You might also update the hash string for 'empty directory' too. However, do you think it's really a good idea to use a 'magic value' like that? A simple check near the start of the function could avoid it.
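For example, something along these lines (a sketch; the exact check the comment had in mind isn't specified, so raising ValueError is an assumption):

import os

def GetHashofDirs(directory, verbose=0):
    # Fail loudly instead of returning a magic value such as -1
    if not os.path.isdir(directory):
        raise ValueError('not a directory: %r' % directory)
    # ... rest of the function unchanged ...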
You could also do the md5-hashes-into-sha1 thing with constant memory usage: calculate md5 sums and feed them into the sha1 object using update() one at a time.
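A sketch of that variant, assuming the same os.walk() traversal as the recipe above (the function name is a placeholder):

import hashlib
import os

def hash_directory_md5_into_sha1(directory, chunk_size=4096):
    # One MD5 object per file, built up chunk by chunk; each finished
    # MD5 digest is then fed into a single SHA-1 object via update(),
    # so memory use stays constant regardless of file size.
    sha = hashlib.sha1()
    for root, dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            md5 = hashlib.md5()
            try:
                with open(path, 'rb') as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        md5.update(chunk)
            except IOError:
                continue
            # Feed this file's MD5 digest into the running SHA-1
            sha.update(md5.digest())
    return sha.hexdigest()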
Here is a version that won't run out of memory.
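A sketch along those lines (not the commenter's exact code; the directory_checksum name and chunk size are placeholders, while _update_checksum follows the description below):

import hashlib
import os

def directory_checksum(paths, chunk_size=65536):
    # Hash file contents from a list of directories, visiting files
    # in ascending sort order of their normalised paths.
    chksum = hashlib.sha1()

    def _update_checksum(filepath):
        # Feed one file's contents into the running checksum in
        # small chunks so memory usage stays constant.
        with open(filepath, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chksum.update(chunk)

    all_files = []
    for path in paths:
        for root, dirs, files in os.walk(path):
            for name in files:
                all_files.append(os.path.normpath(os.path.join(root, name)))

    for filepath in sorted(all_files):
        _update_checksum(filepath)

    return chksum.hexdigest()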
The checksum generated here is based purely on the contents of files found, processed in ascending sort order using the normalised file paths. This version does not account for changes in file ownership, permissions or stat information (although this might be useful to add in). The value of the checksum will change if renames affect the order in which the files are passed in or processed by the _update_checksum() internal function. This is usually fine for most use cases but be aware that YMMV.
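If stat information were wanted, one way to fold it in might be a small helper like this (purely a sketch; none of these names come from the thread), whose result is fed to the running checksum with update() just before each file's contents:

import os

def _stat_fingerprint(filepath):
    # A string summarising path, permissions and ownership, so that
    # chmod/chown changes also alter the final checksum.
    st = os.stat(filepath)
    return '%s|%o|%d|%d' % (filepath, st.st_mode, st.st_uid, st.st_gid)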
Thanks guys, updated so it doesn't run out of memory.
I think it's too simple: you know something changed, but not what :)
A better approach is calculating a SHA-1 for each file in the directory.
A stable solution is cfv, written in Python:
http://cfv.sf.net
These examples worked great for a file transfer QA project I was working on. David Moss's example (third comment) seemed to work the best for my circumstances. I did notice that if you passed it a single filename in a list, it boinked. I added a few lines of code to re-listify the filename once it was passed in to the _update_checksum function. I tend to do things old-school and not very Pythonically, though. Is there a simpler way to get this to work with a single filename as well as what it already does for directories? My bigger project needs to create a pre-table before file transfers and then be able to do an individual file after each transfer.
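One possible simplification (a sketch building on the directory-walking examples above, not on the commenter's actual code) is to branch on os.path.isfile() when collecting paths, so a bare filename and a directory go through the same code path:

import os

def collect_files(paths):
    # Accept plain filenames and directories in the same list;
    # directories are walked, filenames are taken as-is.
    found = []
    for path in paths:
        if os.path.isfile(path):
            found.append(os.path.normpath(path))
        else:
            for root, dirs, files in os.walk(path):
                for name in files:
                    found.append(os.path.normpath(os.path.join(root, name)))
    return sorted(found)

The checksum function can then simply loop over collect_files(paths) and call _update_checksum() on each entry, whether the caller passed directories, single filenames, or a mix.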