We come to the following "one-liner", which md5sums only the files that have the same length as some other file on the system:
find . -type f -exec stat --printf='%32s ' {} \; -print |  # print the size of every file, to find duplicated file sizes
sort -rn |                                                  # sort to put duplicate file sizes together (for uniq)
uniq -d -w32 --all-repeated=separate |                      # print all files that have file sizes equal to any other file
awk '(NF > 0){
    # for the remaining files, insert the md5sum, doing the md5sum while allowing
    # for spaces in the file names
    # (more than 1 consecutive space in a file name is not allowed)
    system("md5sum \"`echo \"" $0 "\" | sed -r \"s/^ *[0-9]+ *//\" `\" | cut -c 1-32 | tr -d [:space:]");
    # print out the file size with a fixed width for future comparisons and also print the file name
    printf " %32s %s", $1, $2;
    for (i = 3; i <= NF; i++) printf $i " ";
    printf "\n";
}' |
sort -r |                                                   # re-sort to catch duplicates of _different_ files with _identical_ sizes
uniq -d -w65 --all-repeated=separate |                      # compare the file size and md5sum for every file and print multiples
awk '{for (i = 3; i <= NF; i++) printf $i " "; print "";}'  # get rid of the file size and md5sum and simply print out the duplicate file names
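If you want to convince yourself it behaves as advertised before running it on a large tree, a quick throwaway test is easy to set up (the directory and file names below are invented purely for illustration):

mkdir -p /tmp/dupe-test && cd /tmp/dupe-test
echo "same content" > a.txt
echo "same content" > b.txt               # deliberate duplicate of a.txt
echo "something else entirely" > c.txt    # unique size, so it is never even md5summed
# running the one-liner above from /tmp/dupe-test should print ./a.txt and ./b.txt as one duplicate group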
If you have a ton of files that you're analyzing, you may want to see progress happening. By its nature, sort can't produce any output until it has read all of its input lines. By splicing tee into the command, we can intercept and log the intermediate output so you can watch the progress:
find . -type f -exec stat --printf='%32s ' {} \; -print |
tee find_stat.log |
sort -rn |
uniq -d -w32 --all-repeated=separate |
awk '(NF > 0){
    system("md5sum \"`echo \"" $0 "\" | sed -r \"s/^ *[0-9]+ *//\" | tee -a f.log`\" | cut -c 1-32 | tr -d [:space:]");
    printf " %32s %s", $1, $2;
    for (i = 3; i <= NF; i++) printf $i " ";
    printf "\n";
}' |
tee md5sum_generation.log |
sort -r |
uniq -d -w65 --all-repeated=separate |
awk '{for (i = 3; i <= NF; i++) printf $i " "; print "";}' |
tee repeated_files.log
Then, in a separate window from the above command, you can run tail:
tail -f find_stat.log

or
tail -f md5sum_generation.log
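If you would rather see a rough count of how many files have been processed so far, something along these lines should also work (it assumes you run it from the same directory the log files are being written to):

watch -n 5 'wc -l find_stat.log md5sum_generation.log'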
3 comments:
This question will probably not sound very smart to you, as I'm not that firm with bash and programming under Linux so far, but how do I automatically remove all the found dupes, or at least move them into a specific directory?

So far I tried to simply add a rm -f "{$3}" at the end of your script, but that didn't work out the way I wanted it to.

To be more specific, I'm not experienced with programming at all :-) That's exactly why I'd like to understand how I can edit your script to automatically remove the duplicates. I'd be very thankful for any suggestions!
Zapata
Hey Zapata,
Congratulations on being the first person ever to comment on my blog. Sorry I haven't responded to you for half a year.
You have that list of files in repeated_files.log that looks like:
./.cpan/build/MP3-Tag-1.12-I3DkIc/lib/MP3/Tag/Cue.pm
./.cpan/build/MP3-Tag-1.12-HZFD2g/lib/MP3/Tag/Cue.pm
./.cpan/build/MP3-Tag-1.12-I3DkIc/t/mp3tag.t
./.cpan/build/MP3-Tag-1.12-HZFD2g/t/mp3tag.t
./.cpan/build/MP3-Tag-1.12-I3DkIc/examples/eat_wav_mp3_header.pl
./.cpan/build/MP3-Tag-1.12-HZFD2g/examples/eat_wav_mp3_header.pl
You could do something like deleting all files except the first file in every group:
rm -f `cat repeated_files.log | awk 'BEGIN {skip=1;} /^$/ { skip=1; next; } {if (skip==1) { skip=0; next;} print }'`
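And if you would rather move the duplicates into a directory instead of deleting them outright, here is a minimal sketch along the same lines (the duplicates directory name is just an example, and note that each log line carries a trailing space left by the printf above, which has to be trimmed off):

mkdir -p ./duplicates
awk 'BEGIN {skip=1;} /^$/ { skip=1; next; } {if (skip==1) { skip=0; next;} print }' repeated_files.log |
while IFS= read -r file; do
    file="${file% }"                                  # strip the trailing space left on each log line
    mv -v --backup=numbered "$file" ./duplicates/     # --backup keeps files that share a base name from clobbering each other
done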