Dupinator -- detect and delete duplicate files « Python recipes

Point this script at a folder or several folders and it will find and delete all duplicate files within the folders, leaving behind the first file found of any set of duplicates. It is designed to handle hundreds of thousands of files of any size at a time and to do so quickly. It was written to eliminate duplicates across several photo libraries that had been shared between users. As the script was a one-off to solve a very particular problem, there are no options nor is it refactoring into any kind of modules or reusable functions.

      #! /usr/bin/python

import os
import sys
import stat
import md5

filesBySize = {}

def walker(arg, dirname, fnames):
    d = os.getcwd()
    os.chdir(dirname)
    try:
        fnames.remove('Thumbs')
    except ValueError:
        pass        
    for f in fnames:
        if not os.path.isfile(f):
            continue
        size = os.stat(f)[stat.ST_SIZE]
        if size < 100:
            continue
        if filesBySize.has_key(size):
            a = filesBySize[size]
        else:
            a = []
            filesBySize[size] = a
        a.append(os.path.join(dirname, f))
    os.chdir(d)

for x in sys.argv[1:]:
    print 'Scanning directory "%s"....' % x
    os.path.walk(x, walker, filesBySize)    

print 'Finding potential dupes...'
potentialDupes = []
potentialCount = 0
trueType = type(True)
sizes = filesBySize.keys()
sizes.sort()
for k in sizes:
    inFiles = filesBySize[k]
    outFiles = []
    hashes = {}
    if len(inFiles) is 1: continue
    print 'Testing %d files of size %d...' % (len(inFiles), k)
    for fileName in inFiles:
        if not os.path.isfile(fileName):
            continue
        aFile = file(fileName, 'r')
        hasher = md5.new(aFile.read(1024))
        hashValue = hasher.digest()
        if hashes.has_key(hashValue):
            x = hashes[hashValue]
            if type(x) is not trueType:
                outFiles.append(hashes[hashValue])
                hashes[hashValue] = True
            outFiles.append(fileName)
        else:
            hashes[hashValue] = fileName
        aFile.close()
    if len(outFiles):
        potentialDupes.append(outFiles)
        potentialCount = potentialCount + len(outFiles)
del filesBySize

print 'Found %d sets of potential dupes...' % potentialCount
print 'Scanning for real dupes...'

dupes = []
for aSet in potentialDupes:
    outFiles = []
    hashes = {}
    for fileName in aSet:
        print 'Scanning file "%s"...' % fileName
        aFile = file(fileName, 'r')
        hasher = md5.new()
        while True:
            r = aFile.read(4096)
            if not len(r):
                break
            hasher.update(r)
        aFile.close()
        hashValue = hasher.digest()
        if hashes.has_key(hashValue):
            if not len(outFiles):
                outFiles.append(hashes[hashValue])
            outFiles.append(fileName)
        else:
            hashes[hashValue] = fileName
    if len(outFiles):
        dupes.append(outFiles)

i = 0
for d in dupes:
    print 'Original is %s' % d[0]
    for f in d[1:]:
        i = i + 1
        print 'Deleting %s' % f
        os.remove(f)
    print

      

The script uses a multipass approach to finding duplicate files. First, it walks all of the directories pass in and groups all files by size. In the next pass, the script walks each set of files of the same size and checksums the first 1024 bytes. Finally, the script walks each set of files that are the same size with the same hash of the first 1024 bytes and checksums each file in its entirety.

The very last step is to walk each set of files of the same length/hash and delete all but the first file in the set.

It ran against a 3.5 gigabyte set of files composed of about 120,000 files, of which there were about 50,000 duplicates, most of which were over 1 megabyte. The total run took about 2 minutes on a 1.33ghz G4 powerbook. Fast enough for me and fast enough without actually optimizing anything beyond the obvious.

Tags: files

15 comments

Martin Blais 19 years, 3 months ago # | flag

Hard links? This is really cool, i was going to write something very similar. My application: I have made backups for the last ten years, and many times complete backups, which have now been copied on a single hdd for safety. Much of these files are identical. I wanted to hardlink them together to save the disk space. I can now just modify your app! thanks,

Drew Perttula 19 years, 3 months ago # | flag

fslint. http://www.iol.ie/~padraiga/fslint/ does this, as well as other useful things (and in shorter code too, I believe). I haven't investigated the various file-compare optimizations of each system.

BTW, a common use of the fslint tools is to find dups on the same filesystem and replace them with hardlinks. If you don't care about the once-identical files being forever identical, you can avoid needless space waste.

thattommyhall ; 16 years, 8 months ago # | flag

Shortcuts In Windows. I love this, freed up 60G from our stuffed file server

import win32com.client

def mkshortcut(source, target):
    shell = win32com.client.Dispatch("WScript.Shell")
    shortcut = shell.CreateShortCut(source)
    shortcut.Targetpath = target
    shortcut.save()

def shortcutise(list_of_duplicates):
    filename = list_of_duplicates[0].split('\\')[-1]
    #the "canonical" file is named the same as the first one in the list
    dupepath = string.join(list_of_duplicates[0].split('\\')[:6],'\\') + '\\DUPLICATED\\'
    #Just takes first 6 parts of the path (maps quite nicely to client folder in our setup, you will need to decide where you want them (may
    if not os.path.isdir(dupepath):
        os.mkdir(dupepath)
    canonical = dupepath + filename
    for i in list_of_duplicates:
        if not os.path.isfile(i):
            continue
            #added in in case duplicate list is calculated first and file no longer exists
        if not os.path.isfile(canonical):
            #creates the canonical file if it does not exist
            print "moving ",i, canonical
            shutil.move(i, canonical)
            print "linking ",i, canonical
            mkshortcut(i + '.lnk', canonical)
            continue
        print "linking ",i, canonical
        mkshortcut(i + '.lnk', canonical)
        print "deleting", i
        os.remove(i)

We used

dupefile = open('duplicates.txt','r').read()
for duplicatelist in dupefile.split('******************\n'):
    duplicatelist = duplicatelist.split('\n')
    shortcutise(duplicatelist)

Where duplicates.txt was created with a slight modification of the main script.

thattommyhall ; 16 years, 8 months ago # | flag

Size of the scan. We freed up 80G on a 1.6TB SAN in about 3 hours, memory usage was fine throughout (I ran another freeware duplicate finder and it crashed twice)

Benjamin Sergeant 16 years, 7 months ago # | flag

73 Gigs freed ! Cool ! I had lots of mp3 duplicated, waiting forever to be correctly tagged so kept in different directories but then merged ..., plus backups ...

[bsergean@lisa1 ~]$ df -h /media # before
Filesystem            Size  Used Avail Use% Mounted on
/dev/md3              287G  285G  2.7G 100% /media
[bsergean@lisa1 ~]$ df -h /media # after
Filesystem            Size  Used Avail Use% Mounted on
/dev/md3              287G  212G   76G  74% /media

Here is the stupid patch to create hardlinks instead of deleting files.

[bsergean@lisa1 bin]$ svn diff find_dup.py
Index: find_dup.py
===================================================================
--- find_dup.py (revision 164)
+++ find_dup.py (working copy)
@@ -96,6 +96,7 @@
     print 'Original is %s' % d[0]
     for f in d[1:]:
         i = i + 1
-        print 'Deleting %s' % f
+        print 'Deleting %s and hardlinking it' % f
         os.remove(f)
+        os.link(d[0],f)
     print

Martin Bene 15 years, 9 months ago # | flag

Just for the record: This script has a very serious bug that leads to data loss: the "Scanning for real dupes..." step traverses potential duplicate sets defined as lists of names with identical file size. if you've got 4 files (a,b,c,d) of n bytes where a=b and c=d, only a will be left and b, c, and d will be deleted, thus losing the contents of the 2nd pair of duplicates.

Paul Rougieux 15 years ago # | flag

This script: doublesdetector.py computes "SHA of files that have the same size, and group files by SHA". It simply finds duplicate files in a directory tree. It doesn't delete the duplicated files but it works fine for me (I use it to delete duplicate photos). It can probably be modified to add a delete files functionality.

Adrian Miatlev 14 years, 3 months ago # | flag

@Paul

Indeed, in doublesdetector.py, just replace lines at the end:

if len(sys.argv)>1 :
    doubles = detectDoubles(" ".join(sys.argv[1:]))
    print 'The following files are identical:'
    print '\n'.join(["----\n%s" % '\n'.join(doubles[filesha]) for filesha in doubles.keys()])
    print '----'
else:
    print message

by the following lines:

if len(sys.argv)>1 :
    doubles = detectDoubles(" ".join(sys.argv[1:]))
    for filesha in doubles.keys():
        print "The following files are identical:\n"
        n = 0
        for filename in doubles[filesha]:
            if n > 0:
                os.remove(filename)
                print "%s (removed)" % filename
            else:
                print filename
            n += 1
        print "\n"
else:
    print message

Benjamin Sergeant 12 years, 9 months ago # | flag

This is super useful but somewhat slow. I'm wondering how we can optimize it like computing the hash in parallel, or if it's just the stat(1) and spinning the disk that takes time.

Sam Brown 12 years, 3 months ago # | flag

@Adrian, your code is very useful and it makes doublesdetector.py do almost exactly what I need. However, my duplicate files are of the form "filename.mp3" and "filename 1.mp3" so this code removes the original and not the duplicate. I tweaked it to remove the file that comes first so that the actual duplicate is removed:

if len(sys.argv)>1 :
    doubles = detectDoubles(" ".join(sys.argv[1:]))
    for filesha in doubles.keys():
        print "The following files are identical:\n"
        n = 1
        for filename in doubles[filesha]:
            if n > 0:
                os.remove(filename)
                print "%s (removed)" % filename
            else:
                print filename
            n -= 1
        print "\n"
else:
    print message

Klovie 12 years, 1 month ago # | flag

I was wondering how can the existing script be used to achieve the following.

The result of the script - (PASSED or FAILED) is formed according to: The result is FAILED when at least one file from the first directory is not bitwise equal to the corresponding file in the second directory or the second directory has no corresponding file. Otherwise test is PASSED.

Thanks

Nick Demou 11 years ago # | flag

@Martin Bene: Tested exactly the scenario you described and found no such bug:

cd /tmp/laal

md5sum *
c1737d69d12635aa8ba43012c29c0bc5  a
c1737d69d12635aa8ba43012c29c0bc5  b
e1d7e0934d499289f2b58995ea8ffc5a  c
e1d7e0934d499289f2b58995ea8ffc5a  d

python ~/bin/delete-duplicates.py /tmp/laal/
/home/ndemou/bin/delete-duplicates.py:10: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Scanning directory "/tmp/laal/"....
b
a
d
c
Finding potential dupes...
Testing 4 files of size 1395...
Found 4 sets of potential dupes...
Scanning for real dupes...
Scanning file "/tmp/laal/b"...
Scanning file "/tmp/laal/a"...
Scanning file "/tmp/laal/d"...
Scanning file "/tmp/laal/c"...
Original is /tmp/laal/b
Deleting /tmp/laal/a
Deleting /tmp/laal/c


md5sum *
c1737d69d12635aa8ba43012c29c0bc5  b
e1d7e0934d499289f2b58995ea8ffc5a  d

(admittedly the messages that are printed may make you think that it's doing something wrong but it's not)

mps205 10 years, 9 months ago # | flag

@Nick Demou, @Martin Bene:

Martin was sort of right, so was Nick. No data will be lost, but when modified for hard linking the links could potentially be to the wrong file. e.g.:

~/tmp$ md5sum *
da3f1eb6d9a517d8feb935177f44fc1e  a
da3f1eb6d9a517d8feb935177f44fc1e  b
da3f1eb6d9a517d8feb935177f44fc1e  c
da3f1eb6d9a517d8feb935177f44fc1e  d
f662f6c5f5e176b8189904e5b4e82e14  e
f662f6c5f5e176b8189904e5b4e82e14  f
f662f6c5f5e176b8189904e5b4e82e14  g
f662f6c5f5e176b8189904e5b4e82e14  h

Leads to...

~/tmp$ md5sum *
da3f1eb6d9a517d8feb935177f44fc1e  a
da3f1eb6d9a517d8feb935177f44fc1e  b
f662f6c5f5e176b8189904e5b4e82e14  c
da3f1eb6d9a517d8feb935177f44fc1e  d
da3f1eb6d9a517d8feb935177f44fc1e  e
da3f1eb6d9a517d8feb935177f44fc1e  f
da3f1eb6d9a517d8feb935177f44fc1e  g
da3f1eb6d9a517d8feb935177f44fc1e  h

My changes:

pickle to save a dump of the duplicates (in case you want to come back and delete later)
Progress indicator during the last hashing step
Hard linking (from Benjamin Sergeant)
Fix the bug related to Martin's post

@@ -4,6 +4,7 @@
 import sys
 import stat
 import md5
+import pickle

 filesBySize = {}

@@ -67,12 +68,14 @@
 print 'Found %d sets of potential dupes...' % potentialCount
 print 'Scanning for real dupes...'

+i=0
 dupes = []
 for aSet in potentialDupes:
+    i+=1
     outFiles = []
     hashes = {}
     for fileName in aSet:
-        print 'Scanning file "%s"...' % fileName
+        print 'Scanning %d/%d "%s"...' % (i, potentialCount, fileName)
         aFile = file(fileName, 'r')
         hasher = md5.new()
         while True:
@@ -85,18 +88,23 @@
         if hashes.has_key(hashValue):
             if not len(outFiles):
                 outFiles.append(hashes[hashValue])
-            outFiles.append(fileName)
+            hashes[hashValue].append(fileName)
         else:
-            hashes[hashValue] = fileName
-    if len(outFiles):
-        dupes.append(outFiles)
+            hashes[hashValue] = [fileName]
+    for k in hashes.keys():
+        if len(hashes[k]) > 1:
+            dupes.append(hashes[k])
+
+dupdump = file("dupedump", "w")
+pickle.dump(dupes, dupdump)
+dupdump.close()

 i = 0
 for d in dupes:
     print 'Original is %s' % d[0]
     for f in d[1:]:
         i = i + 1
-        print 'Deleting %s' % f
+        print 'Deleting/linking %s' % f
         os.remove(f)
+        os.link(d[0],f)
     print

keving 9 years, 7 months ago # | flag

Just use Dublicate Files Deleter!

Huaxia Xia 8 years, 5 months ago # | flag

There is a bug in the block of line 71-92: it doesn't consider multiple groups of files with the same size. For example, if we have files a1, a2, b1 and b2, where a1 = a2 != b1 = b2, but a1.size = b1.size, then we will get a duplicate list [a1, a2, b2].

A corrected version could be:

for aSet in potentialDupes:
    outFiles = {}
    for fileName in aSet:
        print 'Scanning file "%s"...' % fileName
        with file(fileName, 'r') as aFile:
          hasher = hashlib.md5()
          while True:
            r = aFile.read(4096)
            if not len(r):
                break
            hasher.update(r)
        hashValue = hasher.digest()
        outFiles.setdefault(hashValue, []).append(fileName)
    for k, Files in outFiles.iteritems():
        if len(Files) > 1:
          dupes.append(Files)

◄	Python recipes (4591)	►
◄	Bill Bumgarner's recipes (1)	►

Dupinator -- detect and delete duplicate files (Python recipe) by Bill Bumgarner
ActiveState Code (http://code.activestate.com/recipes/362459/)

15 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Dupinator -- detect and delete duplicate files (Python recipe) by Bill Bumgarner ActiveState Code (http://code.activestate.com/recipes/362459/)

15 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Dupinator -- detect and delete duplicate files (Python recipe) by Bill Bumgarner
ActiveState Code (http://code.activestate.com/recipes/362459/)