Building a Photo Tagging System with CLIP
My photography gallery has 370 images spanning about a decade of shooting. Landscapes, urban decay, wildlife, portraits, and a surprising number of cat photos. For years they sat in a flat grid with no way to filter or browse by subject, which meant anyone visiting the page was essentially scrolling through a shuffled pile of everything I've ever pointed a camera at. I wanted category tags, but manually tagging 370 photos sounded like a weekend I'd never get back and would inevitably lead to inconsistency anyway (is a snowy field "landscape" or "winter"? Both? Do I care enough to decide 370 times?).
So I did what any self-respecting data scientist would do: I made the computer do it.
Quick jargon guide
- CLIP: an AI model that compares a picture with a text description and scores how well they match.
- Embedding space: the number-based map the AI uses internally, where similar images and similar phrases end up closer together.
- Zero-shot: using the AI as-is, without first retraining it on my own photo collection.
- Threshold: the cut-off score that decides whether a tag is added or left out.
- Multi-label tagging: letting one image have several tags at once, rather than forcing it into just one category.
The Approach
OpenCLIP is an open-source implementation of OpenAI's CLIP model. The core idea is elegant: it maps images and text into the same embedding space, so you can describe what you're looking for in plain English and the model will tell you how well each image matches. No training on your own data, no labelling, no fine-tuning. You just... describe the categories and let it figure it out.
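For a sense of what that looks like in code, here is a minimal sketch using the open_clip_torch package. The function names and the `laion2b_s34b_b79k` checkpoint are illustrative assumptions, not necessarily what the actual tagging script uses.

```python
# Sketch only: load_clip and score_image are hypothetical helper names.
# Requires open_clip_torch (which pulls in torch) and pillow.


def load_clip(model_name="ViT-B-32", pretrained="laion2b_s34b_b79k"):
    """Load an OpenCLIP model, its image preprocessing and its tokenizer.

    The pretrained tag is an assumed choice; weights download on first use.
    """
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()
    tokenizer = open_clip.get_tokenizer(model_name)
    return model, preprocess, tokenizer


def score_image(image_path, prompts, model, preprocess, tokenizer):
    """Return the cosine similarity between one image and each text prompt."""
    import torch
    from PIL import Image

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    tokens = tokenizer(prompts)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(tokens)
    # Normalise to unit length so the dot product is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)
    return dict(zip(prompts, similarities.tolist()))
```

Calling `score_image("photo.jpg", ["a photograph of an animal or bird"], *load_clip())` would return one score per prompt, where `photo.jpg` stands in for any local image file.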
I defined 10 categories, each with 3-4 descriptive text prompts to give the model a richer understanding of what I meant. Here is one representative prompt per category:
- Landscape: "a wide landscape photograph of hills and countryside"
- Urban: "a photograph of a city street or urban scene"
- Architecture: "a photograph focused on a building or architectural structure"
- Abandoned: "a photograph of an abandoned or derelict building"
- Wildlife: "a photograph of an animal or bird"
- Nature: "a close-up photograph of trees, flowers, or plants"
- Portrait: "a portrait photograph of a person"
- B&W: "a black and white photograph"
- Silhouette: "a silhouette photograph with a backlit subject"
- Winter: "a winter photograph with snow on the ground"
Using multiple prompts per category improves accuracy because the text embeddings for a category are averaged into a single vector, which gives a more robust representation than any single phrase could. It's a simple trick but it makes a noticeable difference, especially for ambiguous categories like "nature" vs "landscape" where the boundary is fuzzy.
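The averaging step can be sketched in plain Python. The 3-dimensional vectors below are made-up stand-ins for real CLIP embeddings, and the helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def category_embedding(prompt_embeddings):
    """Average the unit-length prompt embeddings, then re-normalise the mean."""
    unit = [normalize(v) for v in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Toy vectors standing in for the text embeddings of three "landscape" prompts.
landscape_prompts = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.2],
]
landscape = category_embedding(landscape_prompts)

image = [0.85, 0.2, 0.1]  # toy image embedding
print(round(cosine(image, landscape), 3))
```

Re-normalising after taking the mean keeps every category vector at unit length, so scores stay comparable across categories regardless of how many prompts each one has.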
Results
The ViT-B-32 model processed all 370 images in 44 seconds on CPU. No GPU needed, no API calls, zero cost. I ran this on my laptop while making tea.
Each image gets multi-label tags (a photo can be both "landscape" and "winter"), with per-category similarity thresholds tuned to get roughly 1.3 tags per image on average. I went through a couple of iterations on the thresholds: set them too low and everything got tagged "nature", set them too high and most images only got one tag. The output is a simple JSON object mapping each filename to an array of tags:
{
  "1": ["nature", "silhouette"],
  "15": ["abandoned", "bw"],
  "50": ["architecture", "bw", "urban"],
  "300": ["wildlife"]
}
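Thresholding itself is a small amount of code. In this sketch the threshold values, the scores, and the helper names are all invented for illustration. It shows per-category cut-offs producing the filename-to-tags mapping, plus the mean tag count used as the tuning target.

```python
import json

# Hypothetical per-category thresholds (illustrative values only).
THRESHOLDS = {"landscape": 0.24, "winter": 0.25, "bw": 0.26}
DEFAULT_THRESHOLD = 0.25


def assign_tags(scores, thresholds, default=DEFAULT_THRESHOLD):
    """Return every category whose score clears its own threshold, sorted by name."""
    return sorted(c for c, s in scores.items() if s >= thresholds.get(c, default))


def mean_tags_per_image(tags):
    """Average number of tags per image, the figure the thresholds are tuned against."""
    return sum(len(t) for t in tags.values()) / len(tags)


# Made-up similarity scores for three images.
scores_by_file = {
    "1": {"landscape": 0.27, "winter": 0.26, "bw": 0.18},
    "2": {"landscape": 0.21, "winter": 0.19, "bw": 0.29},
    "3": {"landscape": 0.25, "winter": 0.20, "bw": 0.22},
}

tags = {name: assign_tags(s, THRESHOLDS) for name, s in scores_by_file.items()}
print(json.dumps(tags, indent=2))
print(mean_tags_per_image(tags))
```

Because each category is compared against its own cut-off, raising one threshold (say, for an over-eager "nature") doesn't affect how readily the other tags are assigned.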
Accuracy
Spot-checking against my own judgement, I'd estimate 85-90% accuracy. The model handles clear cases well: a sheep in a field is "wildlife", a monochrome street scene is "bw" + "urban", a derelict building is "abandoned". Where it struggles is with quirky or abstract shots. A painted dalmatian traffic cone on a city street got tagged "architecture", which I can't even be mad about because it technically is sitting next to a building. A lensball photo got tagged "portrait", presumably because the model detected what it thought was a face. It wasn't wrong exactly, it was just working with a different definition of "portrait" than I had in mind.
The nice thing about this approach is that re-running the whole pipeline is trivial. Tweak a threshold, adjust a prompt, run again in under a minute. If I wanted to add a "creative/abstract" category to catch those edge cases, it would take about two minutes of work.
Integration
The gallery page loads the tags JSON via JavaScript and renders filter buttons above the photo grid. Click "🐾 Wildlife (78)" and the gallery filters instantly, with Isotope handling the layout animation. It's satisfying in a way that only a well-animated filter transition can be.
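The number on each button is just a tally over the tags JSON. The gallery itself runs in JavaScript, so the Python below is only a sketch of the same tally, handy as a sanity check after a re-run. The `tag_counts` helper is a hypothetical name, and the sample data is the excerpt shown earlier.

```python
from collections import Counter


def tag_counts(tags):
    """Count how many images carry each tag."""
    return Counter(tag for image_tags in tags.values() for tag in image_tags)


sample = {
    "1": ["nature", "silhouette"],
    "15": ["abandoned", "bw"],
    "50": ["architecture", "bw", "urban"],
    "300": ["wildlife"],
}

counts = tag_counts(sample)
for tag, n in counts.most_common():
    print(f"{tag} ({n})")
```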
The tagging script lives in the repo at scripts/tag_photos.py, so if I add new photos I just re-run it and push. The whole thing is reproducible by anyone with the same dependencies installed (open_clip_torch, pillow, tqdm).
What I'd Do Differently
- A larger CLIP model (ViT-L-14) would improve accuracy on ambiguous images, at the cost of slower inference. Whether that trade-off is worth it for 370 images is debatable, but for a larger collection it probably would be.
- Adding a "creative/abstract" category would catch the lensball shots and experimental compositions that currently get shoehorned into whatever category the model finds least implausible.
- A manual override JSON that merges with the auto-generated tags would let me fix the obvious misses without re-running the whole pipeline. Sometimes you just want to tell the computer "no, that's a traffic cone, not architecture" and move on with your life.
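The override idea could be as small as the sketch below. The override format (per-file "add" and "remove" lists) and the `apply_overrides` name are assumptions made for illustration, since nothing like this exists in the pipeline yet.

```python
def apply_overrides(auto_tags, overrides):
    """Merge manual corrections into auto-generated tags.

    overrides maps a filename to {"add": [...], "remove": [...]}; both keys
    are optional. Files without an override keep their automatic tags.
    """
    merged = {}
    for name, tags in auto_tags.items():
        rule = overrides.get(name, {})
        kept = set(tags) - set(rule.get("remove", []))
        merged[name] = sorted(kept | set(rule.get("add", [])))
    return merged


auto_tags = {
    "50": ["architecture", "bw", "urban"],
    "300": ["wildlife"],
}
# "No, that's a traffic cone, not architecture."
overrides = {"50": {"remove": ["architecture"]}}

print(apply_overrides(auto_tags, overrides))
```

Keeping the overrides in their own file means a full re-run can regenerate the automatic tags without wiping out manual fixes.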
The full code is on GitHub, and you can see the result on the Photography page.
Common questions
Did this need a GPU?
No. I ran it on CPU and processed all 370 images in 44 seconds. A GPU would help at larger scale, but it was unnecessary for this gallery.
Why not just tag the photos manually?
Manual tagging would have taken much longer and would likely have been inconsistent over time. The model gave me a fast first pass that I can rerun whenever I add more photos.
How accurate was it in practice?
Roughly 85 to 90 percent by my own spot checks. It handled obvious cases well and mostly struggled on weird, abstract, or borderline compositions.
Could this work for someone else's image collection?
Yes, as long as the categories are described clearly. The prompts and thresholds would need tuning for a different style of photography, but the overall approach is reusable.


