A problem that I have been battling is how disorganized my photos and videos are. I have multiple backups taken from various devices, which has left me with multiple copies of the same images and videos. Recently, I started thinking about how to organize them, with the following goals:
- Remove duplicates
- Organize by year and by event
- Remove images that are out of focus or not relevant
After manually going through a few folders, I realized this was going to be an onerous task, so I started thinking about writing a program to identify the duplicates. I had recently watched a great video by David Beazley on Python generators, and it was exactly what I needed.
So I created this program that groups the duplicates together and prints a list for review.
import os
import time
import fnmatch
from collections import defaultdict

# http://szudzik.com/ElegantPairing.pdf
# This is used to combine size_of_file_in_bytes and last_modified into one number
# associated uniquely with both, using a pairing function.
def elegant_pairing_fn(x, y):
    if x == max(x, y):
        return (x * x) + x + y
    else:
        return (y * y) + x

# Generates a sequence of file names matching the given search pattern, starting from the top directory.
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Generates a sequence of dictionaries with detailed file information.
def gen_stat(filelist):
    for name in filelist:
        dstat = dict(zip(osstatcolnames, os.stat(name)))
        dstat['filename'] = name
        dstat['unique_id'] = elegant_pairing_fn(dstat['size_of_file_in_bytes'], dstat['last_modified'])
        yield dstat

osstatcolnames = ('protection_bits', 'inode_number', 'device', 'number_of_hard_links',
                  'user_id_of_owner', 'group_id_of_owner', 'size_of_file_in_bytes',
                  'last_accessed', 'last_modified', 'created')

t0 = time.time()
jpgfiles = gen_find("*.jpg", "C:\\Users\\Shantanu\\Pictures")
jpgfilestats = gen_stat(jpgfiles)

# Now group the file names that share the same unique_id value.
# The algorithm assumes that the combination of file size and last modified date is unique per image file,
# i.e. it is almost impossible for two different images to have the same size and last modified date.
# This is a safe assumption as long as the files have not been modified using photo editing software.
filecount = 0
jpgfilegroups = defaultdict(list)
for l in jpgfilestats:
    filecount += 1
    jpgfilegroups[l['unique_id']].append(l['filename'])

# Print groups with more than one file name in them.
# We do not care to review files that appear only once in our stash.
count = 0
groupcount = 0
for k, v in jpgfilegroups.items():
    groupcount += 1
    if len(v) > 1:
        count += 1
        print(count, k, v)

print("Total files processed: ", filecount)
print("Total file groups processed: ", groupcount)
print("Total time taken: ", time.time() - t0)
On my laptop, the program processed ~11K files in 2.8 seconds and came up with 2689 groups of possible duplicates. Manually reviewing those 2689 groups is a much more manageable task than going through every folder by hand.
The algorithm assumes that the combination of file size and last modified date is unique per image file. That is, it is almost impossible for two different images to have the same size and the same last modified date. This is a safe assumption as long as the files have never been modified with photo editing software (which is conveniently true in my case).
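For files that might have been edited, a content hash is a stronger check. This is not part of the script above, just a sketch of how a flagged group could be confirmed:

import hashlib

# Sketch of a follow-up check (not in the script above): confirm that the files
# in a candidate group are byte-for-byte identical by hashing their contents.
def content_digest(filename, chunksize=65536):
    h = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunksize), b''):
            h.update(chunk)
    return h.hexdigest()

# A group is a set of true duplicates only if every file hashes to the same value,
# e.g. len({content_digest(f) for f in v}) == 1 for a group v from jpgfilegroups.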
I learnt some new tools as a result of this exercise:
- A practical use of Python generators for scratching my own itch.
- The Python os module and its file-system functions, such as os.walk and os.stat
- The mathematical concept of 'pairing functions' to uniquely represent two numbers as one (a small worked example follows after this list).
- EDIT 20141020: I just realized that jpgfilegroups is actually a hash table using chaining for collision resolution.
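For the curious, here is a tiny worked illustration of the pairing function, using made-up values rather than anything from my photo collection:

# Worked example of elegant_pairing_fn from the listing above (made-up values).
# For x = 3, y = 5: y is the larger, so the result is y*y + x     = 25 + 3     = 28.
# For x = 5, y = 3: x is the larger, so the result is x*x + x + y = 25 + 5 + 3 = 33.
# Distinct pairs of non-negative integers always map to distinct results.
print(elegant_pairing_fn(3, 5))   # 28
print(elegant_pairing_fn(5, 3))   # 33

In the script, x is the file size in bytes and y is the last modified timestamp, so two files land in the same group only when both values match.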