
By definition, a cryptographic hash is "a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value".
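
For example, even a one-character change to the input produces a completely different digest. A quick illustration with Python's hashlib (the strings are just placeholders):

import hashlib

# The same input always produces the same digest...
print hashlib.sha1('some data').hexdigest()
# ...but changing a single character changes the entire digest
print hashlib.sha1('some datum').hexdigest()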

Usually these hashes are used to "fingerprint" individual files, but to do the same for a whole directory you have to do something like this:

# http://akiscode.com/articles/sha-1directoryhash.shtml
# Copyright (c) 2009 Stephen Akiki
# MIT License (Means you can do whatever you want with this)
#  See http://www.opensource.org/licenses/mit-license.php
# Error Codes:
#   -1 -> Directory does not exist
#   -2 -> General error (see stack traceback)

def GetHashofDirs(directory, verbose=0):
  import hashlib, os
  SHAhash = hashlib.sha1()
  if not os.path.exists(directory):
    return -1

  try:
    for root, dirs, files in os.walk(directory):
      for names in files:
        if verbose == 1:
          print 'Hashing', names
        filepath = os.path.join(root, names)
        try:
          f1 = open(filepath, 'rb')
        except IOError:
          # You can't open the file for some reason; skip it
          continue

        while 1:
          # Read the file in small chunks so memory usage stays constant
          buf = f1.read(4096)
          if not buf:
            break
          # Fold the SHA-1 of each chunk into the running directory hash
          SHAhash.update(hashlib.sha1(buf).hexdigest())
        f1.close()

  except:
    import traceback
    # Print the stack traceback
    traceback.print_exc()
    return -2

  return SHAhash.hexdigest()

print GetHashofDirs('My Documents', 1)
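
In practice you would store the digest somewhere and compare it later to detect changes, for example (using the function above, with the same example directory):

baseline = GetHashofDirs('My Documents')

# ... later, after files may have been added, removed or edited ...
if GetHashofDirs('My Documents') != baseline:
  print 'Directory contents have changed'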

6 comments

David Gowers 14 years, 4 months ago

It's perfectly possible to do this for arbitrarily big directories while maintaining constant memory usage, provided you stick with one hashing algorithm (IMO sha1 is better, and combining the two mainly just disguises md5's failings with a tiny dash of sha1, like MSG on a Chinese take-out meal)

like this:

import hashlib # and don't import md5 or sha -- they're deprecated in 2.6

finalhash = hashlib.sha1() # or md5; if you consider it sufficient for the problem.

# ...

# in the body of the loop over files:
#
# get the size of the file before opening it
f1size = os.path.getsize(filepath)

# ... (actually open the file, as before)

while f1.tell() != f1size:
    # update the hash with the contents of the file,
    # 256k at a time
    finalhash.update (f1.read (0x40000))

You might also update the hash string for 'empty directory'. However, do you think it's really a good idea to use a 'magic value' like that? A simple

if not os.path.exists (directory):
    return -3

near the start of the function could avoid it.

David Gowers 14 years, 4 months ago

You could also do the md5-hashes-into-sha1 thing with constant memory usage: calculate md5 sums and feed them into the sha1 object using update() one at a time.
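
Something along these lines, for example (a sketch of that idea: presumably one md5 per file, streamed in 256k chunks, with each hex digest fed into the running sha1; the directory name is just the one from the recipe):

import hashlib, os

finalhash = hashlib.sha1()

for root, dirs, files in os.walk('My Documents'):
    for name in files:
        md5 = hashlib.md5()
        f1 = open(os.path.join(root, name), 'rb')
        while 1:
            buf = f1.read(0x40000)
            if not buf:
                break
            md5.update(buf)                    # constant memory per file
        f1.close()
        finalhash.update(md5.hexdigest())      # one md5 sum at a time into the sha1 object

print finalhash.hexdigest()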

David Moss 14 years, 4 months ago

Here is a version that won't run out of memory.

The checksum generated here is based purely on the contents of files found, processed in ascending sort order using the normalised file paths. This version does not account for changes in file ownership, permissions or stat information (although this might be useful to add in). The value of the checksum will change if renames affect the order in which the files are passed in or processed by the _update_checksum() internal function. This is usually fine for most use cases but be aware that YMMV.

#!/usr/bin/env python
import hashlib
from os.path import normpath, walk, isdir, isfile, dirname, basename, \
    exists as path_exists, join as path_join

def path_checksum(paths):
    """
    Recursively calculates a checksum representing the contents of all files
    found with a sequence of file and/or directory paths.

    """
    if not hasattr(paths, '__iter__'):
        raise TypeError('sequence or iterable expected not %r!' % type(paths))

    def _update_checksum(checksum, dirname, filenames):
        for filename in sorted(filenames):
            path = path_join(dirname, filename)
            if isfile(path):
                print path
                fh = open(path, 'rb')
                while 1:
                    buf = fh.read(4096)
                    if not buf : break
                    checksum.update(buf)
                fh.close()

    chksum = hashlib.sha1()

    for path in sorted([normpath(f) for f in paths]):
        if path_exists(path):
            if isdir(path):
                walk(path, _update_checksum, chksum)
            elif isfile(path):
                _update_checksum(chksum, dirname(path), basename(path))

    return chksum.hexdigest()

if __name__ == '__main__':
    print path_checksum([r'/tmp', '/etc/hosts'])

Stephen Akiki (author) 14 years, 4 months ago

Thanks guys, updated so it doesn't run out of memory.

Denis Barmenkov 14 years, 4 months ago

I think it's too simple: something changed, but you don't know what :)

A better approach is calculating a SHA1 for each file in the directory.

A stable solution is cfv, written in Python:

http://cfv.sf.net
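
A minimal sketch of the per-file idea (not cfv itself, just one SHA1 per file; the function name is only for illustration):

import hashlib, os

def per_file_hashes(directory):
    # Map each file path to its own SHA1 digest
    hashes = {}
    for root, dirs, files in os.walk(directory):
        for name in files:
            filepath = os.path.join(root, name)
            sha = hashlib.sha1()
            f1 = open(filepath, 'rb')
            while 1:
                buf = f1.read(4096)
                if not buf:
                    break
                sha.update(buf)
            f1.close()
            hashes[filepath] = sha.hexdigest()
    return hashes

Diffing the dicts from two runs then tells you exactly which files changed.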

chuck kellum 14 years, 3 months ago

These examples worked great for a file transfer QA project I was working on. David Moss's example (third comment) seemed to work the best for my circumstances. I did notice that if you passed it a single filename in a list, it boinked. I added a few lines of code to re-listify the filename once it is passed to the _update_checksum function. I tend to do things old-school and not very Pythonic, though. Is there a simpler way to get this to work with a single filename as well as what it already does for directories? My bigger project needs to create a pre-table before file transfers and then be able to do an individual file after each transfer.
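
For what it's worth, the "re-listify" guard chuck describes could sit at the top of _update_checksum, since the elif isfile(path) branch passes basename(path) in as a bare string. A sketch against David Moss's version above (reusing his path_join and isfile imports):

def _update_checksum(checksum, dirname, filenames):
    # Accept either a directory listing (a list) or a bare filename string;
    # a lone string would otherwise be iterated character by character.
    if isinstance(filenames, basestring):
        filenames = [filenames]
    for filename in sorted(filenames):
        path = path_join(dirname, filename)
        if isfile(path):
            fh = open(path, 'rb')
            while 1:
                buf = fh.read(4096)
                if not buf:
                    break
                checksum.update(buf)
            fh.close()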