Building a Photo Tagging System with CLIP

10 April 2026 · data-science python photography

My photography gallery has 370 images spanning a decade of shooting — landscapes, urban decay, wildlife, portraits, and everything in between. For years they sat in a flat grid with no way to filter or browse by subject. I wanted category tags, but manually tagging 370 photos sounded like a weekend I'd never get back.

Enter OpenCLIP — an open-source implementation of OpenAI's CLIP model that maps images and text into the same embedding space. The idea: describe each category in plain English, encode both the descriptions and the photos, then match them by cosine similarity.
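The matching step boils down to cosine similarity between unit-length embedding vectors. As a minimal sketch — with tiny made-up 3-d vectors standing in for the real 512-d ViT-B-32 embeddings that OpenCLIP's `encode_image`/`encode_text` would produce — it looks like this:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for CLIP embeddings (real ones are 512-d and unit-normalized)
image_emb = [0.8, 0.1, 0.2]
prompt_embs = {
    "landscape": [0.9, 0.0, 0.1],
    "portrait": [0.0, 1.0, 0.0],
}

# The best-matching category is simply the highest cosine score
best = max(prompt_embs, key=lambda c: cosine(image_emb, prompt_embs[c]))
```

Because CLIP embeddings are normalized before comparison, the cosine reduces to a plain dot product in practice.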

The Approach

I defined 10 categories, each with 3-4 descriptive text prompts:

  • Landscape — "a wide landscape photograph of hills and countryside"
  • Urban — "a photograph of a city street or urban scene"
  • Architecture — "a photograph focused on a building or architectural structure"
  • Abandoned — "a photograph of an abandoned or derelict building"
  • Wildlife — "a photograph of an animal or bird"
  • Nature — "a close-up photograph of trees, flowers, or plants"
  • Portrait — "a portrait photograph of a person"
  • B&W — "a black and white photograph"
  • Silhouette — "a silhouette photograph with a backlit subject"
  • Winter — "a winter photograph with snow on the ground"

Using multiple prompts per category improves accuracy — the text embeddings are averaged per category, giving a more robust representation than any single phrase.
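The averaging step above can be sketched as follows — a toy version of prompt ensembling, where the mean of several prompt embeddings is renormalized so that cosine similarity against it stays a plain dot product (vector values here are illustrative, not real CLIP outputs):

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit length."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def category_embedding(prompt_embs):
    """Average several prompt embeddings into one category vector,
    then renormalize the mean back to unit length."""
    dim = len(prompt_embs[0])
    mean = [sum(v[i] for v in prompt_embs) / len(prompt_embs) for i in range(dim)]
    return normalize(mean)

# Two toy prompt embeddings for one category
cat_vec = category_embedding([[1.0, 0.0], [0.0, 1.0]])
```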

Results

The ViT-B-32 model processed all 370 images in 44 seconds on CPU. No GPU needed, no API calls, zero cost.

Each image gets multi-label tags (a photo can be both "landscape" and "winter"), with per-category similarity thresholds tuned to get roughly 1.3 tags per image on average. The output is a simple JSON mapping filename to tag array:

{
  "1": ["nature", "silhouette"],
  "15": ["abandoned", "bw"],
  "50": ["architecture", "bw", "urban"],
  "300": ["wildlife"]
}
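The multi-label thresholding that produces this JSON can be sketched as below. The threshold values and category names here are illustrative placeholders, not the tuned values from the actual script — the point is that each category gets its own cutoff, and every category that clears its cutoff becomes a tag:

```python
import json

# Hypothetical per-category thresholds, tuned until the average
# image ends up with roughly 1.3 tags
THRESHOLDS = {"landscape": 0.24, "wildlife": 0.27, "bw": 0.22}

def tags_for(scores, thresholds, default=0.25):
    """Keep every category whose similarity clears its own threshold."""
    return sorted(c for c, s in scores.items() if s >= thresholds.get(c, default))

# Example similarity scores for one image
scores = {"landscape": 0.31, "wildlife": 0.12, "bw": 0.25}
print(json.dumps({"1": tags_for(scores, THRESHOLDS)}))
```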

Accuracy

Spot-checking against my own judgement, I'd estimate 85-90% accuracy. The model handles clear cases well — a sheep in a field is "wildlife", a monochrome street scene is "bw" + "urban". Edge cases like a painted traffic cone on a street getting tagged "architecture" are the main failure mode.

The beauty of the approach is that re-running is trivial. Tweak a threshold, adjust a prompt, run again in under a minute.

Integration

The tags JSON feeds into the gallery page via JavaScript. Filter buttons with emoji and counts sit above the photo grid — click "🐾 Wildlife (78)" and the gallery filters instantly. The existing Isotope layout handles the animation.

The tagging script lives in the repo at scripts/tag_photos.py — if I add new photos, I just re-run it.

What I'd Do Differently

  • A larger CLIP model (ViT-L-14) would improve accuracy on ambiguous images, at the cost of slower inference.
  • Adding a "creative/abstract" category would catch the lensball shots and experimental compositions that currently get shoehorned into other categories.
  • A manual override JSON that merges with the auto-generated tags would let me fix the obvious misses without re-running the whole pipeline.
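That last idea is small enough to sketch here. Assuming an overrides file with hypothetical per-image `add`/`remove` lists (this is not code from the actual repo), the merge could look like:

```python
def merge_tags(auto_tags, overrides):
    """Apply manual add/remove overrides on top of auto-generated tags.

    auto_tags: {"50": ["architecture", "bw"], ...}
    overrides: {"50": {"remove": ["architecture"], "add": ["urban"]}, ...}
    """
    merged = {k: list(v) for k, v in auto_tags.items()}
    for image, fix in overrides.items():
        tags = set(merged.get(image, []))
        tags |= set(fix.get("add", []))
        tags -= set(fix.get("remove", []))
        merged[image] = sorted(tags)
    return merged
```

Keeping the overrides in their own file means re-running the CLIP pipeline never clobbers the manual fixes.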

The full code is on GitHub, and you can see the result on the Photography page.
