Migrating 95GB of Product Images to Cloudflare R2
How we matched 15,000 scattered product images to 44,000 InFlow inventory items by barcode and pushed everything through a CSV-only import pipeline.
Crosby Street Studios sells rugs and home decor out of NYC. Their InFlow inventory tracks 43,926 SKUs. Of those, 5,263 had product images attached. The other 38,663 were text rows in a system where buyers click thumbnails to decide what to order.
The images existed. They were sitting on an FTP server a previous contractor had filled, plus a directory of files scraped from supplier sites a year or two earlier. We counted 15,463 image files spread across both sources, averaging about 2.9 per product where coverage existed. None of it was wired to InFlow.
This post is about the pipeline we wrote to move that 95GB into Cloudflare R2, match every image to a barcode in the catalog, and push the associations into InFlow through the only door it offers: a CSV file.
95 GB
Image data on the FTP server and scraped supplier directories
The constraint that shaped everything
InFlow has no API for bulk image assignment. None. The only supported way to attach an image to a product is through their CSV import, and the column format is rigid: SKU on one side, a publicly-accessible image URL on the other. Not a file upload. A URL. The URL has to resolve, has to return an image, and has to stay reachable for the product card to keep working.
Once we understood that, the rest of the architecture wrote itself. Every image needed a public home before InFlow could ingest it, which meant uploading 95GB to object storage was step one, not step three. We picked Cloudflare R2 because the egress is free and the URL structure is predictable. Then we needed a way to map filenames on disk to SKUs in InFlow, because nothing in the FTP tree was named by SKU.
What we had going for us: every product in InFlow has a barcode, and most of the image filenames or parent directories had a barcode embedded somewhere in the path. Whoever organized the FTP server originally had used barcodes as the de facto product key. The matching problem was therefore an extraction problem first, then a normalization problem, then a lookup.
What the FTP tree actually looked like
We wrote a scanner first, before any matching logic, because we needed to see the shape of the data. The scanner walked the FTP server, pulled file size, modification time, and full path, and ran a regex over the path to pick out barcode candidates. It was a Python script that stored its results in a SQLite database. Roughly 23 minutes to crawl, give or take, depending on how the FTP server felt that day.
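The candidate-extraction half of that scanner can be sketched roughly like this. This is a minimal sketch, not the production code: the regex (an optional short alphabetic prefix followed by 10-13 digits), the table schema, and the function names are all illustrative assumptions.

```python
import re
import sqlite3

# Assumed barcode shape: up to three letters of prefix, then 10-13 digits.
# The real pattern depended on Crosby Street's actual barcode schemes.
BARCODE_RE = re.compile(r"\b[A-Za-z]{0,3}\d{10,13}\b")

def extract_candidates(path: str) -> list[str]:
    """Pull every barcode-looking token out of a full file path,
    so a barcode in either the folder name or the filename is caught."""
    return BARCODE_RE.findall(path)

def init_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS images (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime REAL,
            candidates TEXT  -- comma-separated barcode candidates
        )
    """)
    return conn

def record(conn: sqlite3.Connection, path: str, size: int, mtime: float) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)",
        (path, size, mtime, ",".join(extract_candidates(path))),
    )
```

Rows where `candidates` holds two distinct values (barcode in folder name and a different one in the filename) are exactly the ambiguous cases to flag for manual review, and a folder like MISC_REORDER_2023 simply produces an empty candidate list.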
The directory structure was inconsistent in the way old contractor work usually is. Some products had their own folder, named with a barcode. Some folders contained mixed products and the barcode was only in the filename. Some files had the barcode in both places, sometimes matching, sometimes not. A handful of folders had names like MISC_REORDER_2023 with no barcode anywhere and a dozen images inside, which the team eventually identified by sight.
After the scan we had 15,463 image rows in SQLite, each with at least one barcode candidate extracted from the path. About 200 had ambiguous candidates (two different barcodes in folder name and filename). Those got flagged for manual review and held back from the first matching pass.
The barcode matching pass
The InFlow product export was a 43,926-row CSV with SKU, barcode, product name, and a few category fields. Loading that into a dict keyed by barcode took about a second. The matching loop was a few dozen lines.
The naive version matched maybe 60% of the 15,463 images. The other 40% failed because the barcodes on disk didn't look exactly like the barcodes in InFlow, even though they referred to the same product. There were five categories of mismatch we had to handle:
- Dashes versus spaces. Some filenames used 123-456-789 where InFlow stored 123 456 789.
- Stripped leading zeros. A filename of 1234567890 existed in InFlow as 01234567890.
- Old barcode format. Crosby Street had migrated to a new barcode scheme in 2024, but the FTP files predated it. We needed the historical mapping table from their team.
- Case sensitivity in the alphanumeric prefixes some product lines used.
- Trailing whitespace and zero-width characters in the InFlow export, which we didn't catch until we ran a hex dump on a few rows that should have matched but weren't.
Each of these got its own normalization step. Lowercase everything, strip non-alphanumeric, pad to consistent length, try the lookup against the new barcode table first and the old one second. The order mattered, because some products had a number in the new system that collided with a different product's number in the old system. New-first, old-as-fallback was the rule.
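Boiled down, the normalization and lookup order can be sketched as follows. Illustrative only: the 13-character pad width and the dict shapes are assumptions, and the historical mapping from Crosby Street's team is modeled as a simple old-to-new dict.

```python
import re
import unicodedata

def normalize(raw: str) -> str:
    """Canonical barcode form: lowercase, alphanumeric only,
    zero-padded to a fixed width (13 here, an assumption)."""
    # Drop zero-width and other Unicode format characters (category Cf),
    # the kind of debris the hex dump turned up in the InFlow export.
    cleaned = "".join(c for c in raw if unicodedata.category(c) != "Cf")
    cleaned = re.sub(r"[^0-9a-z]", "", cleaned.strip().lower())
    return cleaned.zfill(13)

def match(raw: str, new_table: dict, old_to_new: dict):
    """New-first, old-as-fallback lookup. Returns a SKU or None.
    The order matters: some old-scheme numbers collide with
    different products' new-scheme numbers."""
    key = normalize(raw)
    if key in new_table:
        return new_table[key]
    if key in old_to_new:  # fall back to the historical barcode scheme
        return new_table.get(old_to_new[key])
    return None
```

One dict lookup per rule-order, so the whole pass over 15,463 images stays effectively instant; all the cost is in getting `normalize` right.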
After all the normalization, 15,282 of the 15,463 images matched cleanly. That's 98.8%. The remaining 181 went into a spreadsheet for the Crosby Street team to walk through. Most turned out to be products that had been discontinued and removed from InFlow but still had files on the FTP server. A few were genuine new additions where the barcode existed nowhere in the export.
15,282
Images matched to a SKU on the first clean pass
Uploading to R2 without losing your mind
95GB across consumer-grade internet, going to a service in another data center, takes a while and breaks at least once. The first version of our uploader ran the matched images through boto3 against R2's S3-compatible endpoint, sequentially. After about four hours it died on a transient SSL error, no progress saved, and we got to start over.
The second version had a checkpoint file. Every successful upload wrote one line to a JSONL log: barcode, sequence number, R2 key, file size, completion timestamp. On startup the uploader read the log into memory and skipped anything already processed. We added a graceful shutdown handler so a Ctrl+C wouldn't corrupt the in-progress upload but would let the current one finish writing its checkpoint line.
The third version added concurrency. Eight parallel uploads, with the checkpoint file protected by a simple file lock. That brought wall-clock time for the remaining files down to about 9 hours. Total upload across all three versions and several network drops was somewhere around 14 hours of actual transfer time, spread across three days.
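The checkpoint-and-resume core looks roughly like this. A sketch under stated assumptions: class and field names are made up, the upload callable is injected so the example stays network-free, and the production version's cross-process file lock and SIGINT handler are omitted (a plain threading lock stands in here).

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

class CheckpointedUploader:
    """Resumable uploader: every finished file appends one JSONL line,
    and on restart anything already logged is skipped."""

    def __init__(self, checkpoint_path: str, upload_fn, workers: int = 8):
        self.path = Path(checkpoint_path)
        self.upload_fn = upload_fn  # e.g. a boto3 upload wrapped for R2
        self.workers = workers
        self.lock = threading.Lock()
        self.done = set()
        if self.path.exists():  # replay the checkpoint log on startup
            for line in self.path.read_text().splitlines():
                self.done.add(json.loads(line)["key"])

    def _one(self, key: str, local_file: str) -> None:
        if key in self.done:
            return  # already uploaded in a previous run
        self.upload_fn(local_file, key)
        with self.lock:  # serialize checkpoint writes across workers
            with self.path.open("a") as f:
                f.write(json.dumps({"key": key}) + "\n")
            self.done.add(key)

    def run(self, jobs: list[tuple[str, str]]) -> None:
        """jobs: (r2_key, local_path) pairs."""
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            list(pool.map(lambda job: self._one(*job), jobs))
```

The checkpoint file is append-only on purpose: a crash mid-write loses at most the in-flight line, and the worst case on restart is re-uploading one file rather than corrupting the log.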
Each image got a deterministic R2 key: {barcode}/{sequence_number}.{extension}. So the third image for barcode 0123456789012 (a JPG) would land at 0123456789012/3.jpg. The sequence number came from the order of files within a product folder, sorted lexically. This made it possible to inspect coverage per product with a single aws s3 ls against the R2 bucket. It also made the URL pattern trivial to template into the InFlow CSV.
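The key builder is small enough to show in full. A sketch: the base URL is a placeholder, and lowercasing the extension is our assumption about how to keep path casing consistent.

```python
from pathlib import Path

def r2_key(barcode: str, sequence: int, filename: str) -> str:
    """Deterministic object key: {barcode}/{sequence_number}.{extension}.
    The extension is lowercased so every URL uses one consistent casing."""
    ext = Path(filename).suffix.lstrip(".").lower() or "jpg"
    return f"{barcode}/{sequence}.{ext}"

def image_url(base: str, barcode: str, sequence: int, filename: str) -> str:
    """Template the public URL for the InFlow CSV; base is the bucket's
    public hostname (placeholder here)."""
    return f"{base.rstrip('/')}/{r2_key(barcode, sequence, filename)}"
```

Because the key is a pure function of barcode, sequence, and extension, the CSV generator never needs a lookup table: it recomputes the URL from the same inputs the uploader used.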
The image browser nobody asked for
Crosby Street's team wanted to spot-check the matches before any of this hit InFlow. Importing 15,000 wrong image associations would have been worse than importing none, because the product cards would then show genuinely incorrect pictures and undoing them is a row-by-row operation. Reasonable ask.
We built a Cloudflare Worker that read the R2 bucket and rendered an HTML page per product. No framework, no build step, just Worker code that fetched the list of files for a barcode prefix and templated them into an <img> grid. Pagination by barcode, search by SKU, basic HTTP auth in front because the catalog is private. The whole Worker is maybe 200 lines including the auth middleware.
It took a few hours to write. The Crosby Street team spent about a week clicking through the matches. They flagged around 40 corrections, most of them cases where a barcode had been reused across product generations and the image we'd matched was the older version. We patched those by hand in the upload manifest, re-ran the affected R2 keys, and moved on.
The Worker is still running, mostly because it's now the easiest way for Crosby Street's buyers to look up images by SKU when they're not at their inventory terminal. Cost is whatever the free tier covers, which for their traffic is everything.
Generating the import CSVs
InFlow's CSV import has two constraints that mattered. First, a row limit per batch. They don't publish a hard number, but we tested with 1,000 and got timeouts, and with 500 it ran fine but slowly. We settled on 850 as the batch size, which was comfortably under the timeout threshold and processed in about 3 to 5 minutes per batch.
Second, the column format is exact. SKU first, image URL second, no header row, UTF-8, Unix line endings, and the URL has to be a direct link to the image, not a redirect. We learned the redirect rule the hard way after the first batch silently dropped about 30 rows where R2 had returned a 301 because we'd generated URLs with inconsistent path casing.
The CSV generator was a small script that read the matched-images table, grouped by SKU, and wrote one row per image up to InFlow's per-product image cap. We ran it 18 times to generate the 18 batch files we needed for the full 15,282-image set. Each batch went into the InFlow web UI manually. Crosby Street's ops lead clicked through them over an afternoon.
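The batching logic amounts to this. A sketch, assuming rows arrive already grouped by SKU and capped at InFlow's per-product image limit upstream; the output filename pattern is illustrative.

```python
import csv

BATCH_SIZE = 850  # found by trial; comfortably under InFlow's import timeout

def write_batches(rows, out_dir: str) -> list[str]:
    """rows: iterable of (sku, image_url) pairs. Writes headerless
    UTF-8 CSV files with Unix line endings, BATCH_SIZE rows each."""
    paths, batch = [], []

    def flush():
        if not batch:
            return
        path = f"{out_dir}/batch_{len(paths) + 1:02d}.csv"
        # newline="" hands line-ending control to the csv writer,
        # which we pin to "\n" for Unix line endings.
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f, lineterminator="\n").writerows(batch)
        paths.append(path)
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            flush()
    flush()  # write the final partial batch
    return paths
```

At 850 rows per file, 15,282 matched images come out to 18 batch files, the last one partial.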
After all 18 batches: image coverage on the active catalog moved from 12% to roughly 40%.
12% coverage: 5,263 of 43,926 products had images
40% coverage: 15,282 images attached, 18 CSV imports
The remaining 60% of products genuinely don't have source images. That's not a pipeline gap. That's a photography gap, and it's the next project, with a different vendor.
The R2 key structure paid for itself twice
The {barcode}/{sequence_number}.{extension} layout was the smallest design decision in the project and the one that saved the most time downstream.
When the team flagged 40 corrections, we identified the wrong files instantly because the R2 prefix was the barcode. When we built the Worker browser, the routing was a one-liner because the URL pattern matched the bucket prefix. When we generated the CSV, the URL templating was string interpolation, no lookup table needed. When InFlow's import dropped rows on the redirect issue, we found the bad keys with a single bucket list.
A different naming scheme would have worked. UUIDs would have worked. A hash of the original filename would have worked. None of them would have made the four downstream uses as cheap. Predictable keys are a kind of self-documentation.
What we'd do differently
A few things we noticed after the dust settled.
The barcode normalization pass should have been written as a single function with a clear contract, not as a chain of regex substitutions sprinkled through the matching loop. We refactored it once during the project and we'll refactor it again the next time we touch this codebase.
The Worker browser should have been built first, not last. A visual check of the matches before the upload would have caught the 40 correction cases before we spent transfer time on them. R2 egress is free, but the FTP read time wasn't.
The InFlow batch size of 850 was found by trial. We could have asked InFlow's support team for the actual limit. We didn't, and probably should have. Saving 90 minutes of testing isn't a lot, but it adds up across projects.
The hardest part of this project wasn't the upload, the matching, or the R2 integration. It was that the only door into InFlow is a CSV file with specific column names in a specific order, and every architectural decision had to flow from that.
What was actually built
The pipeline, end to end:
- FTP scanner. Python script, walked the server, wrote file metadata and barcode candidates to SQLite.
- Barcode matcher. Loaded the InFlow product export into a dict, applied five normalization rules, produced a matched-images table with a 98.8% hit rate.
- R2 uploader. Concurrent upload with checkpoint-based resumability and graceful shutdown. About 14 hours total transfer time.
- Worker image browser. ~200 lines of Cloudflare Worker code, basic HTTP auth, used by the client for verification and now for ongoing SKU lookup.
- CSV generator. Produced 18 InFlow-compatible batch files of ~850 rows each.
- Manual import. 18 imports in InFlow's web UI, ~3-5 minutes each.
End state: image coverage moved from 12% to 40%, zero files lost, every R2 key predictable from the barcode, and a Worker browser that the client uses every week.
The codebase is a few hundred lines of Python and a Worker file. Most of the time was spent on the matching edge cases and the upload reliability work, which is usually how these projects go. The interesting part of a data migration is rarely the part that moves the data.
Stack: Python 3.11 for the ETL, SQLite for intermediate state, boto3 against the R2 S3-compatible endpoint, Cloudflare Workers for the image browser, InFlow web UI for the final imports. Code runs locally for the migration, Worker deployed to Cloudflare's edge.
Want to build something like this?
We design and ship AI products, automation systems, and custom software.
Get in touch