Finding and Removing Duplicate Files
I had a clever idea a couple of months ago: to write a blog post detailing how to recursively find duplicate files in a folder. My technique was good enough: track file sizes, find files that had the same file size and MD5 hash, and display the resulting list. It wasn’t foolproof, but it showed some thought. After spending a little too much time on the post, I realized I had never checked CPAN. Of course there is already a module to handle that exact task.
So here is my problem. I have — let’s see —
$ find ~/Sync -type f | wc -l
44388
I have 44,388 files in my Sync folder.
I organized my home machines recently. When I say “organized” I mean that everything got swept into my ~/Sync folder to deal with later. The refuse of several years squirreling files into random locations is now sitting in that single folder.
Well, now it is time to clean that single folder up. I want to find and delete duplicate files. I planned to focus on image files, but File::Find::Duplicates makes it easier to find all duplicates.
The module provides the find_duplicate_files subroutine, which finds the duplicate files in a list of folders.
First tell me how many sets of duplicates I have.
This will tell me how much work is ahead of me.
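The listing below is a rough sketch of what a counting script could look like. It is not the original count-dupes.pl: it swaps File::Find::Duplicates for the core modules File::Find and Digest::MD5 and follows the size-then-MD5 technique described at the top of the post. The name duplicate_sets and the choice to read the folder from the command line are assumptions made for illustration.

```perl
#!/usr/bin/env perl
# Sketch of a duplicate counter using core modules only.
use strict;
use warnings;
use File::Find  ();
use Digest::MD5 ();

# Return a list of array refs. Each array ref holds the sorted paths
# of files that share both a size and an MD5 digest.
sub duplicate_sets {
    my @dirs = @_;

    # Pass 1: group regular files by size. A file with a unique size
    # cannot have a duplicate, so it never needs to be hashed.
    my %by_size;
    File::Find::find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                return if -l $_;        # ignore symbolic links
                return unless -f $_;    # regular files only
                my $size = ( stat _ )[7];
                push @{ $by_size{$size} }, $_;
            },
        },
        @dirs
    );

    # Pass 2: within each size group, split the files by MD5 digest.
    my @sets;
    for my $paths ( grep { @$_ > 1 } values %by_size ) {
        my %by_digest;
        for my $path (@$paths) {
            open my $fh, '<:raw', $path or next;    # skip unreadable files
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_digest{$digest} }, $path;
        }
        push @sets, map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
    }
    return @sets;
}

# Hypothetical command-line handling: the folder is the first argument.
if (@ARGV) {
    my $dir  = $ARGV[0];
    my @sets = duplicate_sets($dir);
    printf "Found %d sets of duplicates in %s\n", scalar @sets, $dir;
}
```

With this sketch the folder has to be named explicitly, as in perl count-dupes.pl ~/Sync. Grouping by size first keeps the expensive hashing limited to files that could plausibly match.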
$ perl count-dupes.pl
Found 3465 sets of duplicates in /Users/brian/Sync
Removing the files was easy, but it rattled my nerves.
I fought the temptation to add progress bars or anything like that. Focus on getting the job done. I can add work if I end up revisiting this task later.
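A removal script might look roughly like the sketch below. It is a stand-in built on core modules, not the original remove-dupes.pl, and it makes one assumption the post does not spell out: in each set of duplicates, the path that sorts first is kept and the rest are deleted. The names duplicate_sets and remove_duplicates are invented for this example.

```perl
#!/usr/bin/env perl
# Sketch of a duplicate remover using core modules only.
use strict;
use warnings;
use File::Find  ();
use Digest::MD5 ();

# Group files by size, then by MD5 digest. Returns array refs of
# sorted paths; each array ref is one set of identical files.
sub duplicate_sets {
    my @dirs = @_;
    my %by_size;
    File::Find::find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                return if -l $_;        # ignore symbolic links
                return unless -f $_;    # regular files only
                push @{ $by_size{ ( stat _ )[7] } }, $_;
            },
        },
        @dirs
    );

    my @sets;
    for my $paths ( grep { @$_ > 1 } values %by_size ) {
        my %by_digest;
        for my $path (@$paths) {
            open my $fh, '<:raw', $path or next;    # skip unreadable files
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_digest{$digest} }, $path;
        }
        push @sets, map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
    }
    return @sets;
}

# Keep the first path of every set (an assumption of this sketch),
# delete the others, and return how many files were removed.
sub remove_duplicates {
    my @dirs    = @_;
    my $deleted = 0;
    for my $set ( duplicate_sets(@dirs) ) {
        my ( $keep, @extras ) = @$set;
        for my $path (@extras) {
            if ( unlink $path ) { $deleted++ }
            else                { warn "Could not delete $path: $!\n" }
        }
    }
    return $deleted;
}

# Hypothetical command-line handling: the folder is the first argument.
if (@ARGV) {
    printf "Deleted %d files.\n", remove_duplicates( $ARGV[0] );
}
```

Because unlink is permanent, it is worth trying a script like this on a copy of the folder first, or replacing the unlink call with a print until the list looks right.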
$ perl remove-dupes.pl
Deleted 3509 files.
I removed a lot of files. Are there still any duplicates?
$ perl count-dupes.pl
Found 0 sets of duplicates in /Users/brian/Sync
Thing is, I suspect that my Sync directory contains many empty directories.
About Those Directories
File::Find::Rule::DirectoryEmpty helps with exactly that problem. It extends the useful File::Find::Rule module to simplify finding files with characteristics you define.
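For readers without those CPAN modules installed, the same query can be approximated with core File::Find. The sketch below is a stand-in rather than the original find-leaves.pl. The helpers is_empty_dir and empty_dirs are hypothetical names, and treating an unreadable directory as non-empty is an assumption.

```perl
#!/usr/bin/env perl
# Sketch of an empty-directory finder using core modules only.
use strict;
use warnings;
use File::Find ();

# True when $dir holds no entries other than "." and "..".
sub is_empty_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or return 0;    # unreadable: treat as not empty
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry eq '.' or $entry eq '..';
        closedir $dh;
        return 0;                        # found a real entry
    }
    closedir $dh;
    return 1;
}

# Return the sorted paths of all empty directories under @dirs.
sub empty_dirs {
    my @dirs = @_;
    my @empty;
    File::Find::find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                return if -l $_;        # ignore symbolic links
                return unless -d $_;    # directories only
                push @empty, $_ if is_empty_dir($_);
            },
        },
        @dirs
    );
    return sort @empty;
}

# Hypothetical command-line handling: the folder is the first argument.
if (@ARGV) {
    my @empty = empty_dirs( $ARGV[0] );
    printf "%d empty directories\n", scalar @empty;
}
```

A directory that contains only other empty directories is not counted here, since it still has entries. That is the reason one pass is not enough when it comes to deleting.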
$ perl find-leaves.pl
2904 empty directories
Yow. I can delete those directories, but then there could be parent directories that are now empty, and then grandparent directories, and then —
You know what? Just keep looking and deleting until there are no more empty directories.
I like a little logging on each pass so that I know what my program is seeing.
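The look-delete-repeat loop could be sketched as follows. Again this uses core File::Find in place of File::Find::Rule::DirectoryEmpty, so it is not the original remove-leaves.pl. The name remove_empty_dirs is made up for the example, and two behaviors are assumptions: the top folder itself is never deleted, and the loop stops if a pass cannot remove anything.

```perl
#!/usr/bin/env perl
# Sketch of a repeated empty-directory sweep using core modules only.
use strict;
use warnings;
use File::Find ();

# Return the paths of all empty directories under $root.
sub empty_dirs {
    my ($root) = @_;
    my @empty;
    File::Find::find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                return if -l $_;        # ignore symbolic links
                return unless -d $_;    # directories only
                opendir my $dh, $_ or return;
                my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
                closedir $dh;
                push @empty, $_ unless @entries;
            },
        },
        $root
    );
    return @empty;
}

# Delete empty directories pass by pass until none remain.
# Returns the total number of directories removed.
sub remove_empty_dirs {
    my ($root) = @_;
    $root =~ s{(?<=.)/+\z}{};    # drop trailing slashes so $root compares equal below
    my $deleted = 0;
    while (1) {
        # Never delete the top folder itself (an assumption of this sketch).
        my @empty = grep { $_ ne $root } empty_dirs($root);
        last unless @empty;
        printf "Found %d empty directories\n", scalar @empty;

        # rmdir only succeeds on empty directories; count the successes.
        my $removed = grep { rmdir $_ } @empty;
        last unless $removed;    # nothing could be removed, so stop looping
        $deleted += $removed;
    }
    return $deleted;
}

# Hypothetical command-line handling: the folder is the first argument.
if (@ARGV) {
    printf "%d empty folders deleted\n", remove_empty_dirs( $ARGV[0] );
}
```

Each pass exposes a new layer of parents that have just become empty, so the counts shrink from one pass to the next until a pass finds nothing.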
$ perl remove-leaves.pl
Found 2904 empty directories
Found 529 empty directories
Found 29 empty directories
Found 5 empty directories
3467 empty folders deleted
I might dig in later to actually organize the remaining files. I may even automate it with some Perl. This is good enough for today, though.
$ find ~/Sync/ -type f | wc -l
40880
Now I have 40,880 files in my ~/Sync folder. Maybe I should have counted directories too.