Tuesday, January 3, 2012

Detect duplicate files in Linux or cygwin

While backing up some pictures and movies, I wanted to detect duplicate files and delete them. I found this post, but it doesn't run efficiently on large file sets, since it md5sums every file regardless of whether any other file even has the same size. A stat call is much quicker than computing an md5sum: stat only reads the file's metadata, whereas md5sum has to read the entire file off the disk. So I only want to md5sum files whose length matches some other file's length, since only files of equal length can possibly be duplicates of one another.
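
For instance (big_movie.avi here is just a hypothetical file name), getting the size touches only metadata, while the checksum has to read every byte:

stat --printf='%s\n' big_movie.avi     # reads only the file's metadata: effectively instant
md5sum big_movie.avi                   # has to read the whole file off the disk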

That leads to the following "one-liner", which only md5sums the files whose length is identical to some other file's in the set:
find . -type f -exec stat --printf='%32s ' {} \; -print |\
    sort -rn |\
    uniq -d -w32 --all-repeated=separate |\
    awk '(NF > 0){
        system( \
            "md5sum \"`echo \"" $0 "\"|\
            sed -r \"s/^ *[0-9]+ *//\" `\" |\
            cut -c 1-32 | tr -d [:space:] " );
        printf " %32s %s", $1, $2 ;\
        for (i = 3; i <= NF; i++) printf $i " "; \
        printf "\n";\
    }' |\
    sort -r |\
    uniq -d -w65 --all-repeated=separate  |\
    awk '{for (i = 3; i <= NF; i++) printf $i " ";print "";}'
What each stage of the pipeline does:

 # find/stat ..... print the size of every file, to find duplicated file sizes
 # sort -rn ...... sort to put duplicate file sizes together (for uniq)
 # uniq -d -w32 .. print all files that have a file size equal to some other file's
 # awk/md5sum .... for the remaining files, insert the md5sum next to the file size,
 #                 doing the md5sum while allowing for spaces in the file names
 #                 (more than 1 consecutive space in a file name is not allowed)
 # printf ........ print the file size with a fixed width for the later comparison, and also print the file name
 # sort -r ....... re-sort to catch duplicates of _different_ files with _identical_ sizes
 # uniq -d -w65 .. compare the file size and md5sum for every file and print multiples
 # final awk ..... get rid of the file size and md5sum and simply print the duplicate file names by themselves

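If you prefer something easier to read (or have file names the one-liner's quoting can't handle), here is a rough sketch of the same size-first, md5sum-second idea written as a small script. It assumes GNU find/stat/md5sum and bash 4 for the associative array (count_by_size is just a name I made up), and file names containing newlines will still confuse the final sort/uniq:

#!/bin/bash
# Pass 1: count how many files share each size (stat only, no checksums yet).
declare -A count_by_size
while IFS= read -r -d '' f; do
    size=$(stat --printf='%s' "$f")
    count_by_size[$size]=$(( ${count_by_size[$size]:-0} + 1 ))
done < <(find . -type f -print0)

# Pass 2: md5sum only the files whose size is shared, then group identical
# (md5sum, size) pairs exactly like the uniq -w65 stage above.
while IFS= read -r -d '' f; do
    size=$(stat --printf='%s' "$f")
    if [ "${count_by_size[$size]}" -gt 1 ]; then
        sum=$(md5sum "$f" | cut -c 1-32)
        printf '%s %32s %s\n' "$sum" "$size" "$f"
    fi
done < <(find . -type f -print0) | sort | uniq -d -w65 --all-repeated=separate
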
If you're analyzing a ton of files, you may want to see progress as it happens. By its nature, sort can't produce any output until it has read all of its input lines. We can intercept and log the intermediate output, so you can watch the progress, by adding tee to the command:

find . -type f -exec stat --printf='%32s ' {} \; -print |\
    tee find_stat.log |\
    sort -rn |\
    uniq -d -w32 --all-repeated=separate |\
    awk '(NF > 0){
        system( \
            "md5sum \"`echo \"" $0 "\"|\
            sed -r \"s/^ *[0-9]+ *//\" | tee -a f.log`\" |\
            cut -c 1-32 | tr -d [:space:] " );
        printf " %32s %s", $1, $2 ;\
        for (i = 3; i <= NF; i++) printf $i " "; \
        printf "\n";\
    }' |\
    tee md5sum_generation.log |\
    sort -r |\
    uniq -d -w65 --all-repeated=separate  |\
    awk '{for (i = 3; i <= NF; i++) printf $i " ";print "";}' | \
    tee repeated_files.log


Then, in a separate window from the above command, you can run tail:
tail -f find_stat.log
or
tail -f md5sum_generation.log

3 comments:

ZAPATA said...

This question will probably not sound very smart to you (I'm not that firm with bash and programming under Linux so far), but how do I automatically remove all the found dupes, or at least move them into a specific directory?
So far I tried simply adding rm -f "{$3}" at the end of your script, but that didn't work out the way I wanted it to.

ZAPATA said...

To be more specific, I'm not experienced with programming at all :-) That's exactly why I'd like to understand how I can edit your script to automatically remove the duplicates. I'd be very thankful for any suggestions!

Zapata

LDiracDelta said...

Hey Zapata,
Congratulations on being the first person ever to comment on my blog. Sorry I haven't responded to you for half a year.

You have that list of files in repeated_files.log that looks like:


./.cpan/build/MP3-Tag-1.12-I3DkIc/lib/MP3/Tag/Cue.pm
./.cpan/build/MP3-Tag-1.12-HZFD2g/lib/MP3/Tag/Cue.pm

./.cpan/build/MP3-Tag-1.12-I3DkIc/t/mp3tag.t
./.cpan/build/MP3-Tag-1.12-HZFD2g/t/mp3tag.t

./.cpan/build/MP3-Tag-1.12-I3DkIc/examples/eat_wav_mp3_header.pl
./.cpan/build/MP3-Tag-1.12-HZFD2g/examples/eat_wav_mp3_header.pl


You could do something like deleting all files except the first file in every group:


rm -f `cat repeated_files.log | awk 'BEGIN {skip=1;} /^$/ { skip=1; next; } {if (skip==1) { skip=0; next;} print }'`
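
That works as long as the file names contain no spaces, because the backticks split the file list on whitespace. If your names might contain spaces, a variant of the same idea (the same awk filter, but feeding rm one name per line) is a bit safer; either way, it's worth eyeballing repeated_files.log before deleting anything:

awk 'BEGIN {skip=1} /^$/ {skip=1; next} {if (skip==1) {skip=0; next} print}' repeated_files.log |\
    while IFS= read -r f; do rm -f -- "$f"; done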