Technical Journal — Files System (08/02)

Focus: EXIF re-extraction, colour profile normalization, and making the system observable and consistent across CLI, jobs, and APIs.


What I did

  • Shipped a bulk EXIF re-extract API
    • Built /api/v1/exif/reextract-bulk to accept a list of image UUIDs and re-run EXIF extraction at scale.
    • Ensured it reuses the same extraction pipeline as the CLI command instead of duplicating logic.
    • Made the endpoint behave like other v1 APIs (removed “internal-only” restrictions, aligned auth/middleware).
  • Hardened colour profile handling
    • Fixed a real production edge case:
      • When ICC-header.ColorSpaceData exists and trims to “RGB”, it must override ExifIFD.ColorSpace (even if EXIF says sRGB).
    • The normalized colour profile now correctly reports:
      • space: “RGB”
      • source: “icc_header”
    • Raw EXIF remains untouched; only derived data is affected.
  • Fixed a production TypeError
    • Found and resolved a strict return-type bug in ColorProfileFromMetadata.
    • The extractor previously assumed a colour space always exists; real images proved otherwise.
    • Updated logic to safely handle missing ICC + EXIF colour space without throwing.
    • Result: no more 500s when images legitimately have no colour space data.
  • Unified the main EXIF extractor job
    • Updated the existing EXIF extraction job to use the same shared service as:
      • CLI re-extract
      • Bulk re-extract API
    • This removed silent divergence between “initial ingest” and “manual re-extract”.
  • Added a read-only bulk EXIF fetch endpoint
    • Implemented a new endpoint with the same payload as reextract-bulk, but read-only.
    • Purpose: fetch persisted EXIF + normalized colour profile directly from DB.
    • Useful for inspection, debugging, and client-side validation without mutation.
  • Improved observability without log pollution
    • Continued using bordered-investigation.log for:
      • EXIF edge cases
      • missing colour space diagnostics
      • local path / metadata investigations
    • Kept production logs clean and meaningful.
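
The colour profile rule and the missing-data fix described above can be sketched roughly as follows. This is an illustration in Python, not the production code: the flat metadata dict, the ColourProfile type, the normalize_colour_profile name, and the "exif" source label are all assumptions made for the example. Only the ICC-over-EXIF rule for “RGB” and the "icc_header" source value come from the notes above. How an ICC value other than “RGB” should be treated is not stated in the journal; the sketch simply falls through to EXIF in that case.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ColourProfile:
    # Both fields are optional so that "no colour space data" is a valid result
    # rather than a type error.
    space: Optional[str]
    source: Optional[str]


def _clean(value: Any) -> Optional[str]:
    """Return a trimmed string, or None for missing or blank values."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def normalize_colour_profile(metadata: Mapping[str, Any]) -> ColourProfile:
    """Derive a colour profile without modifying the raw metadata.

    Hypothetical helper: the key names mirror the tags mentioned in the
    journal, but the flat-dict layout is an assumption.
    """
    icc = _clean(metadata.get("ICC-header.ColorSpaceData"))
    if icc == "RGB":
        # ICC header wins, even if EXIF claims sRGB.
        return ColourProfile(space="RGB", source="icc_header")

    exif = _clean(metadata.get("ExifIFD.ColorSpace"))
    if exif is not None:
        # Assumed fallback; the "exif" label is illustrative.
        return ColourProfile(space=exif, source="exif")

    # Neither ICC nor EXIF colour space is present: return an empty profile
    # instead of raising.
    return ColourProfile(space=None, source=None)
```

Because the function only reads from the mapping, the raw EXIF stays untouched, and an image with no colour space data yields an empty profile instead of an exception.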

What I learned

  • Type systems don’t protect you from reality
    • Strict return types are only correct if your data model matches the real world.
    • EXIF data is messy, optional, and inconsistent, and the code must reflect that.
  • Single source of truth matters more than speed
    • The biggest long-term win was forcing CLI, job, and API to share the same extractor.
    • Any duplication here would silently rot over time.
  • “Read” endpoints are as important as “write” endpoints
    • Being able to fetch raw EXIF + derived state from DB is critical for debugging and trust.
    • Mutation-only APIs make systems opaque and stressful to operate.
  • Observability needs intent, not volume
    • A dedicated investigation log is far more valuable than spamming production logs.
    • Knowing where to log is just as important as knowing what to log.
  • Colour management is full of traps
    • ICC headers can be more authoritative than EXIF tags.
    • Normalization rules must be explicit, documented, and tested, because unstated assumptions will fail.
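
To make the "single source of truth" point concrete, here is a minimal Python sketch of one shared service sitting behind the CLI, the job, and the bulk API. Every name in it (ExifReextractService, the three entry-point functions, the "uuids" payload key, the result shape) is hypothetical and chosen for illustration; the real system's types and signatures are not shown in these notes.

```python
from typing import Callable, Dict, Iterable, List, Mapping

# An extractor takes an image UUID and returns its EXIF as a mapping.
Extractor = Callable[[str], Mapping[str, object]]


class ExifReextractService:
    """The only place that knows how to re-extract EXIF (illustrative)."""

    def __init__(self, extract: Extractor) -> None:
        self._extract = extract

    def reextract(self, uuids: Iterable[str]) -> Dict[str, dict]:
        results: Dict[str, dict] = {}
        for uuid in uuids:
            try:
                results[uuid] = {"ok": True, "exif": dict(self._extract(uuid))}
            except Exception as exc:
                # One bad image should not abort the whole batch.
                results[uuid] = {"ok": False, "error": str(exc)}
        return results


# Thin entry points: each adapts its own input shape, then delegates.
def cli_reextract(service: ExifReextractService, argv: List[str]) -> Dict[str, dict]:
    return service.reextract(argv)


def api_reextract_bulk(service: ExifReextractService, payload: Mapping[str, List[str]]) -> Dict[str, dict]:
    return service.reextract(payload["uuids"])  # "uuids" key is assumed


def ingest_job(service: ExifReextractService, uuid: str) -> dict:
    return service.reextract([uuid])[uuid]
```

With this shape, a change to extraction behaviour lands in one class, and the entry points cannot drift apart because they contain no extraction logic of their own.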

Overall:

Today was about turning EXIF handling from a “best effort” feature into a reliable, inspectable, and repeatable system. Less magic, more truth.