Point this script at one or more folders and it will find and delete all duplicate files within them, leaving behind the first file found of any set of duplicates. It is designed to handle hundreds of thousands of files of any size at a time, and to do so quickly. It was written to eliminate duplicates across several photo libraries that had been shared between users. As the script was a one-off to solve a very particular problem, there are no options, nor is it refactored into any kind of modules or reusable functions.
Python, 101 lines
The script uses a multipass approach to finding duplicate files. First, it walks all of the directories passed in and groups the files by size. In the next pass, the script walks each set of files of the same size and checksums the first 1024 bytes of each. Finally, the script walks each set of files that have the same size and the same hash of the first 1024 bytes, and checksums each file in its entirety.
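The three passes above can be sketched roughly as follows. This is an illustrative reconstruction, not the original 101-line script; the function name, the SHA-1 choice, and the 64 KB read block are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(*roots, first_chunk=1024):
    """Return lists of paths whose contents are identical.

    A sketch of the multipass approach: group by size, then by a
    hash of the first `first_chunk` bytes, then by a full-file hash.
    """
    # Pass 1: walk every directory and group files by size.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable entry; skip it

    # Pass 2: within each same-size group, hash only the first 1024 bytes.
    by_head = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            with open(path, 'rb') as f:
                head = hashlib.sha1(f.read(first_chunk)).hexdigest()
            by_head[(size, head)].append(path)

    # Pass 3: full-content hash for the files that still collide.
    by_full = defaultdict(list)
    for paths in by_head.values():
        if len(paths) < 2:
            continue
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for block in iter(lambda: f.read(65536), b''):
                    h.update(block)
            by_full[h.hexdigest()].append(path)

    return [paths for paths in by_full.values() if len(paths) > 1]
```

The point of the staging is that the expensive full-file checksum only ever runs on files that already match in both size and leading bytes, which for a typical photo library is a tiny fraction of the total.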
The very last step is to walk each set of files of the same length/hash and delete all but the first file in the set.
It ran against a 3.5 gigabyte set of about 120,000 files, of which about 50,000 were duplicates, most of them over 1 megabyte. The total run took about 2 minutes on a 1.33 GHz G4 PowerBook. Fast enough for me, and fast enough without optimizing anything beyond the obvious.