So I ran into a little problem the other day, when I spliced together two folders of wallpaper and ended up with a bunch of files that were probably the same.
Rather than riffle through them one by one and delete the duplicates, I dug around a little on the Internet and found fdupes.
Simple enough: fdupes compares file sizes, then MD5 sums, then does a byte-by-byte comparison, and where it finds matching files, it spits out their names.
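For the curious, here is a rough sketch of that size, then MD5, then byte-by-byte idea in Python. It is only an illustration of the approach and not fdupes' actual code; the names `md5_of` and `find_duplicates` are made up for this example.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return a list of groups (lists of paths) whose contents are identical."""
    # Pass 1: bucket regular files by size; a file with a unique size
    # cannot have a duplicate.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Pass 2: bucket the survivors by MD5 sum.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        for candidates in by_hash.values():
            # Pass 3: confirm byte by byte, in case two different files
            # happen to share a hash.
            remaining = sorted(candidates)
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(same)
                remaining = [p for p in rest if p not in same]
    return groups
```

Calling `find_duplicates("wallpaper")` on a folder (the folder name is just a placeholder) returns one list per set of identical files. Nothing is deleted; the sketch only reports.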
To add to the fun though, you can trigger an interactive mode, and fdupes will eradicate the ones that you decree.
Exactly what I needed to solve my problem: I get a short list of files that are most likely identical, I pick the one I don’t want, and fdupes takes care of the rest.
For every task, there is a perfect tool. 😈
I generally recommend rdfind over fdupes. In addition to listing & deleting duplicate files, it also has the ability to replace the duplicates with symbolic or hard links.
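To make the hard link idea concrete, here is a minimal Python sketch of swapping a duplicate for a hard link to the copy you keep. It is not how rdfind does it internally; the function name `replace_with_hard_link` and the `.linktmp` suffix are invented for illustration, and it assumes both files live on the same filesystem.

```python
import filecmp
import os


def replace_with_hard_link(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if the contents match.

    Returns True if a link was made, False if the two paths already
    point at the same file.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already hard-linked (or the same path)
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("files differ; refusing to link")
    # Create the link under a temporary name first (hypothetical suffix),
    # then rename it over the duplicate so the path never goes missing.
    temp = duplicate + ".linktmp"
    os.link(keep, temp)
    os.replace(temp, duplicate)
    return True
```

After the call both names refer to the same data on disk, so the space used by the duplicate is freed while both paths keep working.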
rdfind lacks an interactive mode, so for deleting a small number of duplicate files fdupes’ interactive mode might be better. On the other hand, if you are expecting too many duplicates to use the interactive mode while retaining your sanity, rdfind’s man page clearly describes how it ranks the files to determine which one to keep, whereas fdupes only states that it will keep the “first” one it finds, without any explanation of the order it is going to use.
rdfind will also outperform fdupes in many cases, as it uses some heuristics to avoid fully reading files that it determines cannot be duplicates (unique file size, unique start or end of file). Here are the timings I just got running both over a kernel tree I had lying around; rdfind took about half the time fdupes did:
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ time fdupes -r .
…
2.35user 9.86system 5:04.65elapsed 4%CPU (0avgtext+0avgdata 11368maxresident)k
539040inputs+0outputs (1major+29787minor)pagefaults 0swaps
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ time rdfind .
…
1.50user 3.49system 2:38.17elapsed 3%CPU (0avgtext+0avgdata 12540maxresident)k
488496inputs+88outputs (11major+5220minor)pagefaults 0swaps
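The pruning heuristics mentioned above (unique size, unique start or end of file) can be sketched roughly like this in Python. This is an illustration of the idea only, not rdfind's code; `read_edge`, `prune`, `candidate_duplicates` and the `edge_bytes` parameter are all names assumed for the example.

```python
import os
from collections import defaultdict


def read_edge(path, count, from_end=False):
    """Read `count` bytes from the start (or the end) of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        if from_end:
            handle.seek(max(0, size - count))
        return handle.read(count)


def prune(paths, key):
    """Keep only the paths whose key is shared with at least one other path."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [p for group in buckets.values() if len(group) > 1 for p in group]


def candidate_duplicates(paths, edge_bytes=64):
    """Cheaply discard files that cannot be duplicates of anything else."""
    # Step 1: a file with a unique size has no duplicate.
    survivors = prune(paths, os.path.getsize)
    # Step 2: among same-sized files, a unique start rules a file out.
    survivors = prune(
        survivors,
        lambda p: (os.path.getsize(p), read_edge(p, edge_bytes)),
    )
    # Step 3: the same for the end of the file.
    survivors = prune(
        survivors,
        lambda p: (
            os.path.getsize(p),
            read_edge(p, edge_bytes),
            read_edge(p, edge_bytes, from_end=True),
        ),
    )
    return sorted(survivors)
```

What comes back is only a list of candidates: two files that differ somewhere in the middle would still survive, so a full checksum or byte-by-byte comparison is still needed afterwards. The saving is that most files get ruled out after reading a few bytes rather than the whole thing.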
That sounds interesting. I’ll add it to the R section. Thanks! 🙂