
"Taming the Photo Chaos: Building a Smart Duplicate Finder in Python"
I’ve preserved all the original photos I’ve taken since I started photography in college—not just the JPEGs, but every single RAW file. These endlessly multiplying photos have always been a massive headache—well, technically, they used to be. Like many photographers, I am a textbook digital hoarder.
The root of the problem was my constant switching between photo management workflows. In the beginning, I’d just dump every SD card into a hard drive without a second thought. This left me with a mess of similarly named files and chaotic folder structures. Finding a specific photo? Good luck.
Later, I migrated to Apple Photos. While its features were polished and it was a huge step up from having no organization at all, I never grew comfortable with my library being controlled by a closed ecosystem. Even accessing my original files became a hurdle. The breaking point came when a macOS update suddenly dropped proper support for Photos libraries on external HDDs, leaving my entire collection effectively trapped inside the app.
Finally, I settled on a Lightroom Classic "folder-based" workflow, categorizing everything by date. Even without Lightroom, the files remain perfectly organized in the file system. Knowing I’m no longer locked into a specific ecosystem finally brought me some peace of mind.
But all these past migrations left me with multiple overlapping libraries. Apple Photos had renamed my files to cryptic hash-like strings as it copied them in, making them completely unrecognizable, and my library had grown to roughly three times its original size. I once tried reviewing everything folder by folder, date by date, but I soon realized this was hopeless. Manually reviewing more than 100,000 photos would be a waste of my life.
I decided to build something more robust: a dramatically faster workflow to rescue me from this duplicate hell. And as a software engineer, there's only one logical solution: spend 20 hours writing a Python script to save myself 5 hours of manual work.
Welcome to Part 1. Today, we're going to catch the easy ones: the exact duplicates.
The Illusion of a Duplicate
Finding duplicates sounds easy. Just write a script to look for files with the same name, right?
Wrong.
When you migrate between different systems (like Apple Photos, Lightroom, or just copying from SD cards), the file names are the first thing to get butchered. IMG_1234.JPG on an SD card becomes IMG_1234 (1).JPG when you copy it twice, and as I mentioned, Apple Photos might rename it to something completely cryptic like A83J92K-19XJ.JPG.
We can't trust the names. We have to look at the actual content of the file. We need to check the bytes.
Enter Cryptographic Hashing (SHA-256)
If we want to know whether two files are exactly the same, we could compare them byte by byte. But doing that across thousands of files means comparing every file against every other file: on the order of n² pairwise comparisons, each one re-reading full file contents. That's hopelessly inefficient.
Instead, we use a concept called Hashing.
Think of a hash as a digital fingerprint for a file. We feed the bytes of a file into a mathematical algorithm, and it spits out a fixed-length string of characters. For this project, I used SHA-256 (Secure Hash Algorithm 256-bit).
The most important rule of cryptographic hashing is this: If even a single byte in a 5MB file changes, the entire hash changes completely. If two files have the exact same SHA-256 hash, you can be 99.99999999% certain they are the exact same file, regardless of their file names.
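To make that concrete, here's a tiny standalone demo (not part of the scanner itself): two inputs that differ by a single byte produce digests with no resemblance to each other.
```python
import hashlib

# Flip one byte ("A" -> "B") and watch the entire digest change.
digest_a = hashlib.sha256(b"the exact same photo bytes...A").hexdigest()
digest_b = hashlib.sha256(b"the exact same photo bytes...B").hexdigest()

print(digest_a)              # 64 hex characters
print(digest_b)              # 64 hex characters, nothing in common with the first
print(digest_a == digest_b)  # False
```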
The Pitfall of Exact Hashing
But there is a major catch. What happens if the image looks identical, but the file is technically different?
When a camera takes a photo, it writes metadata (like GPS coordinates, camera model, and the date taken) directly into the file's bytes (usually the EXIF header). Because SHA-256 blindly calculates the hash of every single byte in the file, if an app modifies the metadata—like tagging a face or rotating the image—without touching a single pixel of the actual picture, the SHA-256 hash completely changes. The script will think they are two entirely different files.
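You can reproduce this pitfall in a few lines. The sketch below isn't from my scanner; it assumes Pillow is installed and uses PNG text chunks as a stand-in for EXIF (they're much easier to write programmatically). The pixels are identical in both files, but the fingerprints diverge.
```python
import hashlib
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Save the same 1x1 image twice: once bare, once with a metadata chunk.
img = Image.new("RGB", (1, 1), "red")
img.save("plain.png")

meta = PngInfo()
meta.add_text("Comment", "tagged by some photo app")
img.save("tagged.png", pnginfo=meta)

# Identical pixels, different bytes -> different fingerprints.
print(file_sha256("plain.png") == file_sha256("tagged.png"))  # False
```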
The Code: Catching the Exact Copies
Python makes hashing incredibly easy with its built-in hashlib library. Here is the exact function I wrote for my scanner:
```python
import hashlib

def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read the file in 1MB chunks
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```
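Using it is a one-liner; the filename below is just a placeholder for any real path:
```python
# Hypothetical usage: any file path works, photo or not.
fingerprint = _sha256("IMG_1234.JPG")
print(fingerprint)       # a 64-character hex string
print(len(fingerprint))  # 64
```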
The Memory Trick
Did you notice the iter(lambda: f.read(1 << 20), b"") line? That's a crucial trick.
Some of the "photos" in my library are actually 2GB 4K videos. If you do a naive f.read() on a 2GB file, Python loads the entire 2GB into your computer's RAM, and a few of those in a row is enough to bring your machine to its knees.
By reading the file in chunks of 1 << 20 bytes (1 MiB, i.e. 1,048,576 bytes), we only ever hold one chunk in memory at a time. The hash updates progressively, and memory usage stays flat at roughly 1MB no matter how large the file is.
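If the two-argument form of iter() looks unfamiliar, here's a minimal illustration using an in-memory buffer (io.BytesIO standing in for a real file handle): iter(callable, sentinel) keeps calling the callable until it returns the sentinel.
```python
import io

# iter(callable, sentinel) calls the callable repeatedly and stops
# the moment it returns the sentinel -- here, the empty bytes object b"".
fake_file = io.BytesIO(b"abcdefghij")
chunks = list(iter(lambda: fake_file.read(4), b""))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```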
The "Aha!" Moment
Once we have this function, the rest is simple grouping. We scan every file, calculate its SHA-256 hash, and put it in a dictionary where the key is the hash and the value is a list of files.
```python
from collections import defaultdict
# A simplified version of the logic
sha_map = defaultdict(list)

for file in files:
    hash_value = _sha256(file)
    sha_map[hash_value].append(file)

# If a hash has more than one file, we've found a duplicate.
exact_dupes = {k: v for k, v in sha_map.items() if len(v) > 1}
```
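From there it's easy to turn the map into a report. This summary snippet is my own illustration rather than the scanner's exact output code; it treats the first copy in each group as the keeper and everything else as reclaimable space:
```python
import os

total_wasted = 0
for hash_value, paths in exact_dupes.items():
    # Keep the first copy; every other path in the group is redundant.
    total_wasted += sum(os.path.getsize(p) for p in paths[1:])
    print(f"{len(paths)} copies of {hash_value[:12]}...: {paths}")

print(f"~{total_wasted / 1024**3:.2f} GB reclaimable")
```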
When I ran this initial pass over my 100,000 files, the results were incredibly satisfying. I found gigabytes of identical photos and videos hiding behind completely different file names. The script ruthlessly grouped them together.
But my celebration was cut short.
The Cliffhanger
I was scrolling through the results, feeling like a genius, when I noticed something missing.
I had a folder of photos from a family vacation. I knew for a fact I had duplicates, because my mom had texted me some of the photos she took, and I had also saved the same shots from a shared iCloud album.
To my human eyes, Mom_Text_01.JPG and iCloud_Share_01.JPG were the exact same photo. But my script didn't catch them.
Why? Because when a photo goes through an app like LINE or WhatsApp, it gets recompressed: the pixels are re-encoded, and the resulting bytes are different. And remember the golden rule of SHA-256? If a single byte changes, the entire hash changes.
My script was looking for identical bytes, but I needed it to look for identical images. I needed to teach Python how to see like a human.
Next time in Part 2: Seeing Like a Human (The Magic of Perceptual Hashing).