dupfinder | Python Package Manager Index (PyPM)

How to install dupfinder

Download and install ActivePython
Open Command Prompt
Type pypm install dupfinder

Python 2.7

Python 3.2

Python 3.3

Windows (32-bit) - 1.4.3 - Available
Windows (64-bit) - 1.4.3 - Available
Mac OS X (10.5+) - 1.4.3 - Available
Linux (32-bit) - 1.4.3 - Available
Linux (64-bit) - 1.4.3 - Available

file duplication finder manager

Author

Andriy Mylenkyy

License

GPL

Dependencies

distribute

Imports

dupfinder

Latest release

version 1.4.3 on Jan 5th, 2011

This package is designed to find and manage duplicated files and directories. It contains two utilities:

dupfind - to find duplications
dupmanage - to manage found duplications

DUPFIND UTILITY:

The dupfind utility allows you to find duplicated files and directories in your file system.

Show how the utility finds duplicated files:

By default, the utility identifies duplicated files by their content.

First, create several different files in the current directory.

>>> createFile('tfile1.txt', "A"*10)
>>> createFile('tfile2.txt', "A"*1025)
>>> createFile('tfile3.txt', "A"*2048)

Then create more files in another directory; one of them is identical to a file created above.

>>> mkd("dir1")
>>> createFile('tfile1.txt', "A"*20, "dir1")
>>> createFile('tfile2.txt', "A"*1025, "dir1")
>>> createFile('tfile13.txt', "A"*48, "dir1")

Look at the directory contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 20
F :: tfile13.txt :: 48
F :: tfile2.txt :: 1025

We see that "tfile2.txt" is the same in both directories, while "tfile1.txt" has the same name but differs in size. So the utility must identify only "tfile2.txt" as a duplicated file.

We write the results to the outputf file with the "-o <output file name>" argument and pass testdir as the directory to search for duplications.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...
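The report is plain CSV with the header shown above, so it can be post-processed with the standard csv module. The following is only a sketch: it assumes that rows belonging to the same duplication share the same value in the hash column, and the sample hash, modification values and paths are made up for illustration. group_report_rows is a hypothetical helper, not part of dupfinder.

```python
import csv
import io
from collections import defaultdict


def group_report_rows(report_text):
    """Group report rows by their 'hash' column (assumed to be shared by duplicates)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(report_text)):
        groups[row["hash"]].append(row)
    return dict(groups)


# Hypothetical report content: the hash and modification values are invented.
sample = (
    "hash,size,type,ext,name,directory,modification,operation,operation_data\n"
    "abc123,1025,F,txt,tfile2.txt,/tmp/test/dir1,0,,\n"
    "abc123,1025,F,txt,tfile2.txt,/tmp/test,0,,\n"
)

groups = group_report_rows(sample)
for digest, rows in groups.items():
    print(digest, [row["directory"] + "/" + row["name"] for row in rows])
```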

Show the quick and slow utility modes:

As mentioned above, the utility identifies duplicated files by their content by default. This mode is slow and consumes a lot of system resources, because the file contents have to be read.

However, in most cases the file name and size are enough to identify a duplication. In that case you can use quick mode with the --quick (-q) option.

Now test the previous files in quick mode:

>>> dupfind("-q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode identifies the duplications correctly.

However, there are cases when this mode leads to mistakes. To show this, add a file with the same name and size but different content, and run the utility in both modes:

>>> createFile('tfile000.txt', "First  "*20,)
>>> createFile('tfile000.txt', "Second "*20, "dir1")

Now check the duplication results using the default (non-quick) mode:

>>> dupfind(" -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, the default mode identifies the duplications correctly: the two "tfile000.txt" files are not reported.

Now check for duplications using quick mode:

>>> dupfind(" -q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,140,F,txt,tfile000.txt,.../tmp.../dir1,...
...,140,F,txt,tfile000.txt,.../tmp...,...
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode wrongly reports "tfile000.txt" as a duplication, although the two files differ in content.
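The difference between the two modes can be sketched in a few lines of Python. This is not dupfinder's code: the HISTORY section only states that content comparison uses zlib.crc32 and that quick mode compares by name and size, so the way the keys are built here (content_key, quick_key, find_duplicates) is an assumption made for illustration.

```python
import os
import shutil
import tempfile
import zlib
from collections import defaultdict


def content_key(path, chunk_size=65536):
    """Slow key: file size plus a CRC32 of the whole content, read in chunks."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return (os.path.getsize(path), crc & 0xFFFFFFFF)


def quick_key(path):
    """Quick key: file name and size only; the content is never read."""
    return (os.path.basename(path), os.path.getsize(path))


def find_duplicates(root, key_func):
    """Return groups of paths under root that share the same key."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[key_func(path)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def write_file(path, text):
    with open(path, "w") as fh:
        fh.write(text)


# Rebuild a small part of the test layout used above.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "dir1"))
write_file(os.path.join(root, "tfile2.txt"), "A" * 1025)
write_file(os.path.join(root, "dir1", "tfile2.txt"), "A" * 1025)
write_file(os.path.join(root, "tfile000.txt"), "First  " * 20)
write_file(os.path.join(root, "dir1", "tfile000.txt"), "Second " * 20)

by_content = find_duplicates(root, content_key)
by_quick = find_duplicates(root, quick_key)
print(len(by_content), len(by_quick))
shutil.rmtree(root)
```

In this sketch the quick key groups the two "tfile000.txt" files together because it never looks at their content, which is the same kind of false positive as in the report above.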

Clean up the test:

>>> cleanTestDir()

Show how the utility finds duplicated directories:

The utility treats directories as duplicated when all of their files are duplicated and all of their inner directories are themselves duplicated directories.
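This recursive definition can be illustrated with a directory fingerprint that is built from the entries of a directory but never from the directory's own name. The sketch below is an assumption about how such a comparison could work, not dupfinder's implementation; dir_fingerprint and file_digest are hypothetical helpers, and the choice to include file names and sizes in the fingerprint is not confirmed by the text.

```python
import os
import shutil
import tempfile
import zlib


def file_digest(path):
    """CRC32 of the file content (small files only; read in one go)."""
    with open(path, "rb") as fh:
        return zlib.crc32(fh.read()) & 0xFFFFFFFF


def dir_fingerprint(path):
    """Fingerprint of a directory: its files and, recursively, its inner directories.

    The names of directories are deliberately left out, so two directories
    with different names but identical content get the same fingerprint.
    """
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            entries.append(("D", dir_fingerprint(full)))
        else:
            # Assumption: file name, size and content all take part.
            entries.append(("F", name, os.path.getsize(full), file_digest(full)))
    return tuple(sorted(entries))


def make_tree(top, inner_name):
    """Create top/tfile1.txt and top/<inner_name>/tfile11.txt."""
    os.makedirs(os.path.join(top, inner_name))
    with open(os.path.join(top, "tfile1.txt"), "w") as fh:
        fh.write("A" * 10)
    with open(os.path.join(top, inner_name, "tfile11.txt"), "w") as fh:
        fh.write("B" * 4000)


root = tempfile.mkdtemp()
dir1 = os.path.join(root, "dir1")
dir2 = os.path.join(root, "dir2")
make_tree(dir1, "dir11")
make_tree(dir2, "dir21")  # inner directory has a different name
same = dir_fingerprint(dir1) == dir_fingerprint(dir2)

# Change one byte in an inner file: the directories no longer match.
with open(os.path.join(dir2, "dir21", "tfile11.txt"), "w") as fh:
    fh.write("B" * 3999 + "X")
differs = dir_fingerprint(dir1) != dir_fingerprint(dir2)

print(same, differs)
shutil.rmtree(root)
```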

First compare 2 directories with the same files.

Create directories with the same content.

>>> def mkDir(dpath):
...     mkd(dpath)
...     createFile('tfile1.txt', "A"*10, dpath)
...     createFile('tfile2.txt', "A"*1025, dpath)
...     createFile('tfile3.txt', "A"*2048, dpath)
...
>>> mkDir("dir1")
>>> mkDir("dir2")

Confirm that the directories' contents are really identical

>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

>>> ls("dir2")
=== list dir2 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

Now run the utility and check the result file:

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

Compare 2 directories with the same files and dirs.

Inside the previously created directories, create new directories with the same content but different names.

Directories do not need to have the same name to be interpreted as duplications; they only need identical content.

Add 2 identical directories to the previous ones.

>>> def mkDir1(dpath):
...     mkd(dpath)
...     createFile('tfile11.txt', "B"*4000, dpath)
...     createFile('tfile12.txt', "B"*222, dpath)
...
>>> mkDir1("dir1/dir11")
>>> mkDir1("dir2/dir21")

Note that we added two directories with the same contents but different names. This should not prevent them from being detected as duplications.

>>> def mkDir2(dpath):
...     mkd(dpath)
...     createFile('tfile21.txt', "C"*4096, dpath)
...     createFile('tfile22.txt', "C"*123, dpath)
...     createFile('tfile23.txt', "C"*444, dpath)
...     createFile('tfile24.txt', "C"*555, dpath)
...
>>> mkDir2("dir1/dir22")
>>> mkDir2("dir2/dir22")

Confirm that directories' contents are really identical

>>> ls("dir1")
=== list dir1 directory ===
D :: dir11 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
D :: dir21 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

And the contents of the inner directories:

First subdirectory:

>>> ls("dir1/dir11")
=== list dir1/dir11 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
>>> ls("dir2/dir21")
=== list dir2/dir21 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222

Second subdirectory:

>>> ls("dir1/dir22")
=== list dir1/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
>>> ls("dir2/dir22")
=== list dir2/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555

Now test the utility.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

NOTE:

Duplicated inner directories are excluded from the results:

>>> outputres = open(outputf).read()
>>> "dir1/dir11" in outputres
False
>>> "dir1/dir22" in outputres
False
>>> "dir2/dir21" in outputres
False
>>> "dir2/dir22" in outputres
False
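The exclusion of inner items can be pictured as a filter over the list of found paths: anything that lies inside another reported directory is dropped. The following is a minimal sketch under that assumption; drop_nested is a hypothetical helper and works on path strings only, without touching the file system.

```python
import posixpath


def drop_nested(paths):
    """Keep only the paths that do not lie inside another path from the list."""
    normalized = [posixpath.normpath(p) for p in paths]
    kept = []
    for path in normalized:
        inside_other = any(
            path != other and path.startswith(other + "/")
            for other in normalized
        )
        if not inside_other:
            kept.append(path)
    return kept


found = ["dir1", "dir2", "dir1/dir11", "dir2/dir21", "dir1/dir22", "dir2/dir22"]
print(drop_nested(found))  # prints ['dir1', 'dir2']
```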

The utility accepts more than one directory as arguments:

Use the previous directory structure to show this:

Pass the "dir1/dir11" and "dir2" directories to the utility:

>>> dupfind("-o %(o)s %(dir1-11)s %(dir2)s" % {
...     'o':outputf,
...     'dir1-11': os.path.join(testdir,"dir1/dir11"),
...     'dir2': os.path.join(testdir,"dir2"),})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir11,.../tmp.../dir1,...
...,D,,dir21,.../tmp.../dir2,...

DUPMANAGE UTILITY:

The dupmanage utility allows you to manage duplicated files and directories in your file system using a CSV data file.

The utility uses a CSV-formatted data file to process duplication items. The data file must contain the following columns:

type
name
directory
operation
operation_data

The utility supports 2 types of operations on duplication items:

deleting ("D")
symlinking ("L"), only for UNIX-like systems

operation_data is used only for the symlinking operation and must contain the path to the symlink source item.
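The two operations can be sketched with the standard library as follows. This is an illustration for UNIX-like systems under stated assumptions, not dupmanage's code: apply_operations is a hypothetical function, and it assumes that a symlinking row first removes the duplicated item and then creates a symlink with the same name that points to operation_data.

```python
import csv
import io
import os
import shutil
import tempfile


def apply_operations(csv_text):
    """Apply 'D' (delete) and 'L' (symlink) rows; return the number of items processed."""
    processed = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = os.path.join(row["directory"], row["name"])
        operation = row["operation"]
        if operation not in ("D", "L"):
            continue  # rows without a known operation are left untouched
        # Both operations start by removing the duplicated item.
        if row["type"] == "D":
            shutil.rmtree(target)
        else:
            os.remove(target)
        if operation == "L":
            # Put a symlink to the kept source item in its place.
            os.symlink(row["operation_data"], target)
        processed += 1
    return processed


# Small layout for the demonstration: one duplicated file and one directory.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "dir2"))
os.mkdir(os.path.join(root, "dir3"))
for rel in ("tfile03.txt", os.path.join("dir3", "tfile03.txt")):
    with open(os.path.join(root, rel), "w") as fh:
        fh.write("D" * 100)

manage_data = (
    "type,name,directory,operation,operation_data\n"
    "F,tfile03.txt,%(root)s/dir3,L,%(root)s/tfile03.txt\n"
    "D,dir2,%(root)s,D,\n"
) % {"root": root}

processed = apply_operations(manage_data)
is_link = os.path.islink(os.path.join(root, "dir3", "tfile03.txt"))
dir2_gone = not os.path.exists(os.path.join(root, "dir2"))
print(processed, is_link, dir2_gone)
shutil.rmtree(root)
```

Because both operations remove data, a real tool would normally validate every row before touching the file system; the sketch skips that for brevity.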

Show how the utility manages duplications:

To show this, use the previous directory structure and add several more duplications:

Create a file in the root directory and the same file in another directory.

>>> createFile('tfile03.txt', "D"*100)
>>> mkd("dir3")
>>> createFile('tfile03.txt', "D"*100, "dir3")

Look at the directory contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir2 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100

>>> ls("dir3")
=== list dir3 directory ===
F :: tfile03.txt :: 100

We already know the previous duplications, so now we create a CSV-formatted data file to manage them.

>>> manage_data = """type,name,directory,operation,operation_data
... F,tfile03.txt,%(testdir)s/dir3,L,%(testdir)s/tfile03.txt
... D,dir2,%(testdir)s,D,
... """ % {'testdir': testdir}
>>> createFile('manage.csv', manage_data)

Now call the utility and check the resulting directory content:

>>> manage_path = os.path.join(testdir, 'manage.csv')
>>> dupmanage("%s -v" % manage_path)
[...
[...]: Symlink .../tfile03.txt item to .../dir3/tfile03.txt
[...]: Remove .../dir2 directory
[...]: Processed 2 items

Review directory content:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100

>>> ls("dir3")
=== list dir3 directory ===
L :: tfile03.txt :: ...

HISTORY:

1.4.3

Commented out the output_format option of the dupfinder utility, which is useless for now

1.4.2

Refactored content comparison to use the zlib.crc32 function to calculate the file content digest, which speeds up the algorithm
Fixed some bugs

1.4

Updated file duplication finding: added the option to compare files by content and made this variant the default one
Added -q (--quick) option to use quick file comparison (by name and size)
Added tests for quick/not-quick duplication finding

1.2

Added dupmanage utility to manage duplications
Added tests for dupmanage utility

1.0

Tests for dupfinder utility added

0.8

Refactoring classes: remove DupFilter, move filtering into DupOut class
Force implicit hiding of the inner content of duplicated directories

0.7

Refactoring utility into classes
Fix bugs with bad files processing
Fix bug with size calculation

0.5

Refactoring inner finding algorithm
Implemented the option to remove the inner content of duplicated directories from the result report

0.3

Files finder implemented
Output in csv format
Added filters by size

0.1

Initial release
