<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://blog.nigiva.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.nigiva.com/" rel="alternate" type="text/html" /><updated>2026-05-12T18:36:44+02:00</updated><id>https://blog.nigiva.com/feed.xml</id><title type="html">Nigiva</title><subtitle>Thoughts on machine learning, deep learning, and software engineering</subtitle><author><name>Nigiva</name><email>blog@nigiva.com</email></author><entry><title type="html">Data URLs versus rotating presigned HTTPS: latency and cache in multimodal chat APIs</title><link href="https://blog.nigiva.com/2026/05/10/data-vs-presigned-url-llm-images.html" rel="alternate" type="text/html" title="Data URLs versus rotating presigned HTTPS: latency and cache in multimodal chat APIs" /><published>2026-05-10T00:00:00+02:00</published><updated>2026-05-10T00:00:00+02:00</updated><id>https://blog.nigiva.com/2026/05/10/data-vs-presigned-url-llm-images</id><content type="html" xml:base="https://blog.nigiva.com/2026/05/10/data-vs-presigned-url-llm-images.html"><![CDATA[<div class="callout tldr">
  <span class="callout-icon" aria-hidden="true">
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M16 4h2a2 2 0 0 1 2 2v14a2 2 0 0 1-2 2H6a2 2 0 0 1-2-2V6a2 2 0 0 1 2-2h2" /><rect width="8" height="4" x="8" y="2" rx="1" ry="1" /><path d="M9 14h6" /><path d="M9 18h6" /><path d="M9 22h6" /></svg>
  </span>
  <div class="callout-body">
    <div class="callout-title">TL;DR</div>
    <div class="callout-md">
<p>On every tested model, mean <code class="language-plaintext highlighter-rouge">data:</code> completion latency beat fresh presigned HTTPS (<strong>+13%</strong>, <strong>+23%</strong>, <strong>+39%</strong> presigned slowdown, by model).</p>

<p>Caching tracks the <strong>image</strong>, not the <strong>URL text</strong>, for Gemini, OpenAI, and Anthropic here. On every stack, rotating presigned URLs did not wipe the cache merely because the link string changed.</p>

    </div>
  </div>
</div>

<h2 id="from-habit-to-hypothesis">From habit to hypothesis</h2>

<p>I got used to presigned HTTPS URLs for multimodal payloads. Stored chat JSON stayed small because I did not embed full image bytes in every message. The provider fetches the PNG over HTTPS on their side. I assumed I was moving traffic off my laptops and servers.</p>

<p>A teammate asked a simple follow-up: aside from RAM and log size, does that pattern really make completions faster? This post is what I measured.</p>

<p>My guess was that presigned would win on elapsed time, because a download at the provider should beat me re-sending Base64 on every turn. I had no data, only a gut feeling.</p>

<p>That picture weakens when you remember requests already leave my machines on fast datacenter uplinks. OVH, AWS, and similar hosts are built to accept uploads. Assuming the provider's fetch always beats one more POST with the image inlined is a rough shortcut in that setup. Slow home upload, mobile networks, or strict bandwidth caps can still change the tradeoff. So I ran a benchmark.</p>

<h2 id="ocr-as-the-benchmark-task">OCR as the benchmark task</h2>

<p>OCR was a good fit. Current multimodal models already do well on clean synthetic screenshots. Outputs are nearly repeatable. I can generate endless labeled pairs in code instead of tuning hand-written creative prompts.</p>

<p>The synthetic pages stick to one recipe: five OCR lines of about ten words each, light backgrounds, fixed layout.</p>

<p>The setup ties each page to fixed ground-truth text baked into the image. Each assistant reply gets compared to that target string with <code class="language-plaintext highlighter-rouge">rapidfuzz</code>, yielding a similarity score from <strong>0</strong> (no overlap) through <strong>1</strong> (exact transcript). <code class="language-plaintext highlighter-rouge">temperature</code> is <strong>0</strong>. Replies stay inside one fenced Markdown code block.</p>
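<p>The scoring step can be sketched in a few lines. This uses the standard-library <code>difflib</code> as a stand-in for the post's <code>rapidfuzz</code> scorer (the exact rapidfuzz scorer is not named above); both map onto the same 0-to-1 scale.</p>

```python
from difflib import SequenceMatcher


def transcript_score(reply: str, target: str) -> float:
    """Similarity between an assistant transcript and the baked-in ground truth.

    Returns 0.0 (no overlap) through 1.0 (exact transcript), matching the
    scale described above. difflib stands in for rapidfuzz here; swap in a
    rapidfuzz scorer for the real harness.
    """
    return SequenceMatcher(None, reply, target).ratio()
```

<p>An exact transcript scores <strong>1.0</strong>; a transcript with only casing drift lands strictly between 0 and 1.</p>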

<p>Images live in Cloudflare R2 for <code class="language-plaintext highlighter-rouge">presigned</code>. <code class="language-plaintext highlighter-rouge">data:</code> reads files from disk and sends Base64.</p>
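<p>The <code>data:</code> lane boils down to one encoding step; a minimal sketch (the helper name is mine):</p>

```python
import base64


def to_data_url(png_bytes: bytes) -> str:
    # Inline the PNG so the multimodal POST carries the image bytes itself.
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


# PNG magic bytes stand in for a real rendered page here.
sample = b"\x89PNG\r\n\x1a\n" + bytes(16)
url = to_data_url(sample)
```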

<p>Main numbers rotate presigned URLs on every replay (including older history rows) so logs never reuse stale URL strings by accident. I also ran <code class="language-plaintext highlighter-rouge">presigned</code> with one fixed HTTPS URL whenever the same PNG comes back.</p>

<p>Latency versus <code class="language-plaintext highlighter-rouge">data:</code> looked the same in both setups, so rewriting the signature bought no latency in what I captured. Nothing in these traces suggested multimodal backends reuse prior HTTPS fetches just because the PNG URL matched an earlier request; completions still behaved as if vendors pay for the HTTPS pull plus decode every time, not as if a matching URL unlocks a warmed shortcut.</p>

<p>I use three vendors' smaller multimodal models (Gemini, GPT, Claude; exact names below). Larger tiers should behave in the same ballpark, but this run does not prove that for every model.</p>

<h2 id="avoid-mixing-cached-work-between-lanes">Avoid mixing cached work between lanes</h2>

<p>To compare <code class="language-plaintext highlighter-rouge">data:</code> with <code class="language-plaintext highlighter-rouge">presigned</code> without one side reusing image cache hits meant for the other, I generate paired PNG layouts (A: white background, dark text; B: off-white background, charcoal text). Same wording, slightly different RGB, different hashes and bucket keys, hard to spot by eye. <code class="language-plaintext highlighter-rouge">data</code> or <code class="language-plaintext highlighter-rouge">presigned</code> receives A or B at random each time.</p>
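<p>Schematically, the pairing works because any RGB nudge changes the rendered bytes and therefore the hash. The palette values below are illustrative, and hashing a string stands in for hashing the real PNG raster:</p>

```python
import hashlib

# Hypothetical palette pair: same wording, slightly different RGB per variant.
PALETTES = {
    "a": {"background": (255, 255, 255), "ink": (25, 25, 25)},   # white / dark
    "b": {"background": (250, 249, 244), "ink": (54, 54, 58)},   # off-white / charcoal
}


def page_digest(words: str, variant: str) -> str:
    # Stand-in for hashing the rendered PNG: identical wording with a nudged
    # palette still yields a different digest, so lanes never share bytes.
    palette = PALETTES[variant]
    raster = f"{words}|{palette['background']}|{palette['ink']}".encode()
    return hashlib.sha256(raster).hexdigest()
```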

<p>Example from <strong>series 6</strong>, <strong>turn 6</strong> in the generator (<code class="language-plaintext highlighter-rouge">a</code> / <code class="language-plaintext highlighter-rouge">b</code> files). Same transcript; two skins so bytes never match between lanes while the page still reads the same to a human.</p>

<p><img src="/assets/images/posts/2026-05-10-data-vs-presigned/serie_006_turn_06a.png" alt="OCR synthetic page, pair variant a (series 6, turn 6)." class="align-center" />
<em>Variant <code class="language-plaintext highlighter-rouge">a</code> in the generator: higher-contrast white backdrop and dark type.</em></p>

<p><img src="/assets/images/posts/2026-05-10-data-vs-presigned/serie_006_turn_06b.png" alt="OCR synthetic page, pair variant b (series 6, turn 6)." class="align-center" />
<em>Variant <code class="language-plaintext highlighter-rouge">b</code> in the generator: slightly tinted backdrop and softened ink so the raster hash diverges.</em></p>

<p>For <code class="language-plaintext highlighter-rouge">presigned</code>, histories use a fresh signed GET URL whenever an image appears, including turns already stored earlier in the thread.</p>
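<p>That rewrite can be sketched like this. The message shape (OpenAI-style content parts carrying a bucket <code>key</code>) and the <code>sign</code> callable are assumptions for illustration, not the harness's actual types:</p>

```python
from copy import deepcopy


def resign_history(messages: list, sign) -> list:
    """Copy a chat history, giving every image part a fresh signed GET URL.

    Older turns are rewritten too, so no stale URL string survives a replay.
    `sign(key)` is a hypothetical presigner (e.g. wrapping an S3/R2 SDK call).
    """
    out = deepcopy(messages)
    for message in out:
        parts = message.get("content")
        if not isinstance(parts, list):
            continue
        for part in parts:
            if isinstance(part, dict) and part.get("type") == "image_url":
                part["image_url"]["url"] = sign(part["image_url"]["key"])
    return out
```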

<h2 id="method-shape-and-timer">Method shape and timer</h2>

<p>The benchmark harness ran on an OVH VPS in a datacenter. I did not drive it from my home machine on purpose: residential upload is often the weak link for large multimodal payloads, and I wanted timings closer to what you get from a production-style host on a datacenter link. Presigned objects live in Cloudflare R2, so the vendor fetches PNGs from R2 over HTTPS while the API client runs on that same VPS. That way both <code class="language-plaintext highlighter-rouge">data:</code> reads from local disk and <code class="language-plaintext highlighter-rouge">presigned</code> fetch paths are not dominated by consumer broadband jitter or caps.</p>

<p>Runs use series and turn depth.</p>

<p>Ten independent chats (series) with unrelated text. Inside each, ten turns indexed <strong>0</strong> through <strong>9</strong>; calls run strictly one after another.</p>

<p>Turn <strong>0</strong>: one PNG plus a short OCR instruction. Turns <code class="language-plaintext highlighter-rouge">&gt; 0</code>: send full history plus one new PNG.</p>
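<p>In request terms, one timed call can be assembled like this. The OCR brief wording and helper name are mine; the content-part shape follows the common multimodal chat format:</p>

```python
def build_turn(history: list, png_ref: str, turn_index: int) -> list:
    """Messages for one timed call.

    Turn 0 sends a single PNG plus a short OCR instruction; later turns send
    the full prior history plus one new PNG. `png_ref` is either a data: URL
    or a presigned HTTPS URL, so both lanes share this code path.
    """
    user_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe every line of this page."},
            {"type": "image_url", "image_url": {"url": png_ref}},
        ],
    }
    return history + [user_message] if turn_index > 0 else [user_message]
```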

<p><code class="language-plaintext highlighter-rouge">total_time_ms</code> runs from starting the multimodal request until the last assistant token arrives. Signing, disk read, and Base64 encoding (when used) finish before that timer starts.</p>
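<p>A sketch of that timer boundary; <code>complete</code> is a hypothetical client callable that returns once the last assistant token arrives:</p>

```python
import base64
import time


def timed_call(png_bytes: bytes, complete):
    # Disk read, Base64 encoding (and, on the presigned lane, URL signing)
    # all finish BEFORE the timer starts.
    payload = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    start = time.perf_counter()
    reply = complete(payload)  # request out .. last assistant token in
    total_time_ms = (time.perf_counter() - start) * 1000.0
    return reply, total_time_ms
```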

<p>Replay outline (one series, one delivery path):</p>

<pre><code class="language-mermaid">flowchart LR
  subgraph warmup [Before timed calls]
    R2[R2 PNG objects uploaded]
    IDX[Index rows per serie turn method]
  end
  subgraph per_series_per_method [Replay one series]
    T0["Turn 0: user PNG plus OCR brief"]
    T0 --&gt; A0["Assistant OCR"]
    A0 --&gt; T1["Turn 1: full history plus new PNG"]
    T1 --&gt; A1["Assistant OCR"]
    A1 --&gt; Tn["Turns 2 through 9 repeat pattern"]
  end
  warmup --&gt; per_series_per_method
</code></pre>

<p><strong>Wire formats</strong></p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Payload shape</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">data</code></td>
      <td><code class="language-plaintext highlighter-rouge">data:image/png;base64,...</code> inside the multimodal POST</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">presigned</code></td>
      <td>Fresh Cloudflare R2 HTTPS GET URL for each emission and replay</td>
    </tr>
  </tbody>
</table>

<p>PNG canvas <strong>1920 x 1080</strong>. Model identifiers in logs: <code class="language-plaintext highlighter-rouge">gemini-flash-2.5</code> (<code class="language-plaintext highlighter-rouge">flash-2.5</code>), <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code>, <code class="language-plaintext highlighter-rouge">claude-haiku-4.5</code> (<code class="language-plaintext highlighter-rouge">haiku-4.5</code>).</p>

<h3 id="operational-notes">Operational notes</h3>

<div class="callout note">
  <span class="callout-icon">💡</span>
  <div class="callout-body">
    <div class="callout-title">Note</div>
    <ul>
<li>OpenAI and Gemini had prompt caching enabled with vendor defaults in this harness. Claude requires explicit <code>cache_control</code> on the multimodal payloads you intend to cache; without it, Anthropic prompt caching stays off.</li>
      <li>Vendors expose cached-input totals under different shapes; compare rows cautiously rather than stacking them blindly.</li>
      <li>Where Claude caching was active, ephemeral TTL stayed at <strong>5 minutes</strong>.</li>
      <li>Some stacks prefetch HTTPS-linked PNG bytes on the client before timing. That work stays outside <code>total_time_ms</code>.</li>
    </ul>
  </div>
</div>
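<p>Because those cached-input totals live under different field names, the harness has to normalize them before comparing rows. A sketch of that normalization; the field names below reflect the public usage shapes as I understand them, so verify against current vendor docs before trusting them:</p>

```python
def cached_input_tokens(vendor: str, usage: dict) -> int:
    """Pull the cached-input counter out of a vendor-specific usage dict."""
    if vendor == "openai":      # chat completions usage block
        return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    if vendor == "anthropic":   # messages API usage block
        return usage.get("cache_read_input_tokens", 0)
    if vendor == "gemini":      # REST usageMetadata block
        return usage.get("cachedContentTokenCount", 0)
    raise ValueError(f"unknown vendor: {vendor}")
```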

<p>This benchmark compares inlined PNG bytes in <code class="language-plaintext highlighter-rouge">data:</code> requests with presigned HTTPS <code class="language-plaintext highlighter-rouge">GET</code>s routed through Cloudflare R2.</p>

<h2 id="question-1-same-accuracy-on-both-lanes">Question 1: Same accuracy on both lanes?</h2>

<p>If <code class="language-plaintext highlighter-rouge">rapidfuzz</code> scores disagree between the two lanes, the latency numbers are pointless.</p>

<p><strong>Result:</strong> mean score <strong>1.0</strong> (100 transcripts per model and method). Choosing <code class="language-plaintext highlighter-rouge">data:</code> or <code class="language-plaintext highlighter-rouge">presigned</code> did not change OCR accuracy here.</p>

<h2 id="question-2-does-url-rotation-break-cache-counters">Question 2: Does URL rotation break cache counters?</h2>

<p>Side question: every presigned URL string is new each time. Do vendors treat the image as brand new and zero the cached-input totals they expose?</p>

<p>Across replay with fresh signatures, including older turns, <code class="language-plaintext highlighter-rouge">claude-haiku-4.5</code> still showed equal mean cached-input token counts for <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">presigned</code>. That lines up with the <strong>image</strong> driving those counters, not the URL alone, inside each completion payload.</p>

<p>Separate runs reused one stable presigned HTTPS string per PNG, without issuing a fresh signature when the same image reappeared in history. Cached-input counters moved the same way as in the rotated-URL batches, which suggests ordinary static CDN links fall under the same story: rewriting the HTTPS string does not, by itself, clear those counters.</p>

<table>
  <thead>
    <tr>
      <th>Combination</th>
      <th style="text-align: right">Cached input tokens (<code class="language-plaintext highlighter-rouge">mean</code>)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>gemini-flash-2.5 / <code class="language-plaintext highlighter-rouge">data</code></td>
      <td style="text-align: right">964</td>
    </tr>
    <tr>
      <td>gemini-flash-2.5 / <code class="language-plaintext highlighter-rouge">presigned</code></td>
      <td style="text-align: right">823</td>
    </tr>
    <tr>
      <td>gpt-5.4-mini / <code class="language-plaintext highlighter-rouge">data</code></td>
      <td style="text-align: right">8474</td>
    </tr>
    <tr>
      <td>gpt-5.4-mini / <code class="language-plaintext highlighter-rouge">presigned</code></td>
      <td style="text-align: right">8950</td>
    </tr>
    <tr>
      <td>claude-haiku-4.5 / <code class="language-plaintext highlighter-rouge">data</code></td>
      <td style="text-align: right">7175</td>
    </tr>
    <tr>
      <td>claude-haiku-4.5 / <code class="language-plaintext highlighter-rouge">presigned</code></td>
      <td style="text-align: right">7175</td>
    </tr>
  </tbody>
</table>

<p>Claude shows the same mean cached-input count for <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">presigned</code>. I thought Gemini and OpenAI would too. They do not. I only set explicit <code class="language-plaintext highlighter-rouge">cache_control</code> for Claude here. For Gemini and OpenAI I left vendor defaults. Something in those defaults likely treats <code class="language-plaintext highlighter-rouge">data:</code> and presigned paths differently. I do not know every knob on their side.</p>

<p>I first ran Claude without setting <code class="language-plaintext highlighter-rouge">cache_control</code>. That was a mistake: Anthropic only enables this cache path when you mark it explicitly. Many <code class="language-plaintext highlighter-rouge">data:</code> requests came back as <code class="language-plaintext highlighter-rouge">BadRequestError</code> with a PNG download timeout message from Anthropic's side. My client retries each call up to three times, which sometimes unstuck things, but mostly it looked like <code class="language-plaintext highlighter-rouge">data:</code> image sends freezing and then succeeding late. Turning <code class="language-plaintext highlighter-rouge">cache_control</code> on flattened the failures and trimmed spend. Numbers in the main table still skew faster on <code class="language-plaintext highlighter-rouge">data:</code>. During that bad window, <code class="language-plaintext highlighter-rouge">presigned</code> survived more cleanly, which matters if you optimize for uptime. None of those failure runs feed the published means above.</p>
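<p>The retry behavior mentioned above can be sketched as a plain wrapper; the backoff shape is my choice, the post only states "up to three times":</p>

```python
import time


def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    """Retry a completion call, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
```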

<h2 id="question-3-which-path-finishes-faster">Question 3: Which path finishes faster?</h2>

<p>Each cell pools 100 timed calls (10 series, 10 turns). Totals come from <code class="language-plaintext highlighter-rouge">summary_by_model_method</code>.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align: right">Data mean (<code class="language-plaintext highlighter-rouge">ms</code>)</th>
      <th style="text-align: right">Presigned mean (<code class="language-plaintext highlighter-rouge">ms</code>)</th>
      <th style="text-align: right">Presigned slowdown</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>gemini-flash-2.5</td>
      <td style="text-align: right">2832</td>
      <td style="text-align: right">3200</td>
      <td style="text-align: right"><strong>+13%</strong></td>
    </tr>
    <tr>
      <td>gpt-5.4-mini</td>
      <td style="text-align: right">1854</td>
      <td style="text-align: right">2280</td>
      <td style="text-align: right"><strong>+23%</strong></td>
    </tr>
    <tr>
      <td>claude-haiku-4.5</td>
      <td style="text-align: right">2689</td>
      <td style="text-align: right">3749</td>
      <td style="text-align: right"><strong>+39%</strong></td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">presigned</code> means the vendor pulls the PNG over HTTPS. <code class="language-plaintext highlighter-rouge">data:</code> means you already inlined the PNG in the HTTPS POST.</p>
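<p>The slowdown column is just the ratio of the two means, expressed as a percentage over 100%; recomputing it from the table:</p>

```python
def presigned_slowdown(data_ms: float, presigned_ms: float) -> str:
    # Relative slowdown of the presigned mean over the data: mean.
    return f"+{round(presigned_ms / data_ms * 100 - 100)}%"


for model, data_ms, presigned_ms in [
    ("gemini-flash-2.5", 2832, 3200),
    ("gpt-5.4-mini", 1854, 2280),
    ("claude-haiku-4.5", 2689, 3749),
]:
    print(model, presigned_slowdown(data_ms, presigned_ms))
```

<p>This reproduces the <strong>+13%</strong> / <strong>+23%</strong> / <strong>+39%</strong> column above.</p>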

<h2 id="figure-latency-vs-turn-index">Figure: latency vs turn index</h2>

<p>Averages across ten chats per plotted turn. Solid blue is <code class="language-plaintext highlighter-rouge">data</code>. Dashed red is <code class="language-plaintext highlighter-rouge">presigned</code>. Markers: circle <code class="language-plaintext highlighter-rouge">gemini-flash-2.5</code>, square <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code>, triangle <code class="language-plaintext highlighter-rouge">claude-haiku-4.5</code>.</p>

<p><img src="/assets/images/posts/2026-05-10-data-vs-presigned/latency_by_turn.png" alt="Mean completion latency for gemini-flash-2.5, gpt-5.4-mini, and claude-haiku-4.5. Data URLs versus presigned HTTPS. Ten conversational turns." class="align-center" /></p>

<p>Individual turns swing a lot (queueing, longer context). The averaged table still shows <code class="language-plaintext highlighter-rouge">presigned</code> slower. Gemini can dip below <code class="language-plaintext highlighter-rouge">data:</code> on early turns, so I rely on the means and the chart together.</p>

<h2 id="what-caught-me-off-guard">What caught me off guard</h2>

<p>I expected a tie or a <code class="language-plaintext highlighter-rouge">presigned</code> win because my day job habits favor small logs and less RAM, not milliseconds. Mean <code class="language-plaintext highlighter-rouge">total_time_ms</code> rose about <strong>13%</strong> to <strong>40%</strong> when the vendor had to GET from R2 instead of parsing inline Base64 (model dependent), on the same OVH VPS with matching upload and download speed.</p>

<p>Cached-input counts also did not match my neat guess.</p>

<h2 id="closing-take">Closing take</h2>

<p>If you only care about shortest median completion time on this OCR setup, pick <code class="language-plaintext highlighter-rouge">data:</code>. Presigned still wins when upload is slow, when you reuse the same images in long chats, or when you need less RAM or smaller stored chats. Those goals sit next to latency; they do not replace it.</p>

<p>I only wired this benchmark through OpenAI, Google Gemini, and Anthropic. I still expect the same directional story on other multimodal hosts, but treat that as guesswork until somebody runs the same split on their APIs.</p>]]></content><author><name>Nigiva</name><email>blog@nigiva.com</email></author><category term="ml" /><category term="multimodal" /><category term="benchmarking" /><category term="llm" /><category term="agents" /><summary type="html"><![CDATA[Benchmarks whether sending PNG bytes inline as Base64 data URLs or as Cloudflare R2 presigned HTTPS URLs changes multimodal completion latency. Three LLMs on synthetic OCR.]]></summary></entry><entry><title type="html">Cobjectric: Measuring Parsing Quality for Structured Data</title><link href="https://blog.nigiva.com/2026/05/09/cobjectric-metrics-for-complex-objects.html" rel="alternate" type="text/html" title="Cobjectric: Measuring Parsing Quality for Structured Data" /><published>2026-05-09T00:00:00+02:00</published><updated>2026-05-09T00:00:00+02:00</updated><id>https://blog.nigiva.com/2026/05/09/cobjectric-metrics-for-complex-objects</id><content type="html" xml:base="https://blog.nigiva.com/2026/05/09/cobjectric-metrics-for-complex-objects.html"><![CDATA[<div class="callout tldr">
  <span class="callout-icon" aria-hidden="true">
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M16 4h2a2 2 0 0 1 2 2v14a2 2 0 0 1-2 2H6a2 2 0 0 1-2-2V6a2 2 0 0 1 2-2h2" /><rect width="8" height="4" x="8" y="2" rx="1" ry="1" /><path d="M9 14h6" /><path d="M9 18h6" /><path d="M9 22h6" /></svg>
  </span>
  <div class="callout-body">
    <div class="callout-title">TL;DR</div>
    <div class="callout-md">
<p><strong>Cobjectric</strong> scores structured objects with <strong><code class="language-plaintext highlighter-rouge">compute_fill_rate</code></strong>, <strong><code class="language-plaintext highlighter-rouge">compute_fill_rate_accuracy</code></strong>, and <strong><code class="language-plaintext highlighter-rouge">compute_similarity</code></strong>.</p>

<p>It started as a benchmark harness for <strong>curriculum vitae (CV) parsing</strong>, but the same recipe generalizes to <strong>API payloads</strong>, <strong>configs</strong>, <strong>migration QA</strong>, and any nested <strong><code class="language-plaintext highlighter-rouge">dict</code> / JSON</strong> you care about.</p>

<p>For Specs, pandas export, and API tables, read <strong><a href="https://cobjectric.nigiva.com/">docs</a></strong>.</p>

    </div>
  </div>
</div>

<h2 id="where-this-came-from">Where this came from</h2>

<p>I built Cobjectric because I kept comparing <strong>parsed CVs</strong> against a schema and a labeled extract. The painful bit is rarely strict equality everywhere. Outputs are <strong>almost right</strong>: extra spaces, different casing, harmless punctuation, <strong>lists in another order</strong>, or a field present on one side but missing on the other.</p>

<p>That pattern is not <strong>CV-specific</strong>. Once you model your payload as a <strong><code class="language-plaintext highlighter-rouge">BaseModel</code></strong>, you get repeatable metrics you can log, aggregate, and compare across prompts or pipelines.</p>

<h2 id="fill-rate-how-complete-is-one-object">Fill rate: how complete is one object?</h2>

<p>Fill rate answers a simple question: <strong>which fields look filled vs missing</strong> for a single instance?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">cobjectric</span> <span class="kn">import</span> <span class="n">BaseModel</span>


<span class="k">class</span> <span class="nc">Person</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">age</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">email</span><span class="p">:</span> <span class="nb">str</span>


<span class="n">person</span> <span class="o">=</span> <span class="n">Person</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">John Doe</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">age</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">person</span><span class="p">.</span><span class="nf">compute_fill_rate</span><span class="p">()</span>
<span class="nf">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">age</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">email</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
1.0
0.0
0.667
</code></pre></div></div>

<p>You land around <strong>66.7%</strong> mean completeness when <strong>2 / 3</strong> fields are present.</p>

<div class="callout note">
  <span class="callout-icon">📝</span>
  <div class="callout-body">
    <div class="callout-title">Note</div>
    <p>
      Think of per-field scores as <strong>1.0</strong> when the field is present and valid,
      and <strong>0.0</strong> when it is missing or fails validation.
      If you need weighted summaries, Spec weights apply here too.
    </p>
  </div>
</div>

<h2 id="fill-rate-accuracy-did-we-miss-the-same-fields">Fill rate accuracy: did we miss the same fields?</h2>

<p>Fill rate accuracy compares <strong>two objects</strong>, but still focuses on <strong>presence</strong>, not semantic equality.
That is useful when you want to know whether your extractor <strong>skipped the same sections</strong> as your reference label.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">got</span> <span class="o">=</span> <span class="n">Person</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">({</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">John</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">age</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">})</span>
<span class="n">expected</span> <span class="o">=</span> <span class="n">Person</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Jane</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">age</span><span class="sh">"</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">email</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">jane@example.com</span><span class="sh">"</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">accuracy</span> <span class="o">=</span> <span class="n">got</span><span class="p">.</span><span class="nf">compute_fill_rate_accuracy</span><span class="p">(</span><span class="n">expected</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">accuracy</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">accuracy</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">age</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">accuracy</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">email</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">accuracy</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
1.0
0.0
0.667
</code></pre></div></div>

<p>Here <strong>66.7%</strong> means <strong>2 / 3</strong> fields share the same filled-or-missing pattern (both sides have <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">age</code>, only <code class="language-plaintext highlighter-rouge">expected</code> has <code class="language-plaintext highlighter-rouge">email</code>).</p>
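<p>The presence-agreement idea is easy to restate outside the library. This plain-dict sketch is not Cobjectric's internals, just the scoring rule in miniature:</p>

```python
def fill_rate_accuracy(got: dict, expected: dict, fields: list) -> float:
    """Score 1.0 per field when both sides agree on filled vs missing,
    0.0 when one side has the field and the other does not.
    Values are ignored at this stage; only presence matters."""
    scores = [
        1.0 if (field in got) == (field in expected) else 0.0
        for field in fields
    ]
    return sum(scores) / len(scores)
```

<p>For the example above (<code>name</code> and <code>age</code> on both sides, <code>email</code> only on <code>expected</code>) this yields 2 / 3 ≈ 0.667.</p>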

<h2 id="similarity-near-matches-for-noisy-text">Similarity: near matches for noisy text</h2>

<p>When both sides have text, you usually care about <strong>near matches</strong>, not character-by-character identity. Think casing edits, spacing, light paraphrases, or abbreviations, not only literal typos.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">cobjectric</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="n">cobjectric.specs</span> <span class="kn">import</span> <span class="n">TextSpec</span>


<span class="k">class</span> <span class="nc">Article</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">title</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">(</span><span class="n">scorer</span><span class="o">=</span><span class="sh">"</span><span class="s">WRatio</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">content</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">(</span><span class="n">scorer</span><span class="o">=</span><span class="sh">"</span><span class="s">WRatio</span><span class="sh">"</span><span class="p">)</span>


<span class="n">reference</span> <span class="o">=</span> <span class="n">Article</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Introduction to Machine Learning</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="sh">"</span><span class="s">Machine learning is a subset of artificial intelligence.</span><span class="sh">"</span>
        <span class="p">),</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">parsed</span> <span class="o">=</span> <span class="n">Article</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Introduction to machine learning</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Machine learning is a subset of AI.</span><span class="sh">"</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">similarity</span> <span class="o">=</span> <span class="n">parsed</span><span class="p">.</span><span class="nf">compute_similarity</span><span class="p">(</span><span class="n">reference</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">similarity</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">title</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">similarity</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">content</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">similarity</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
0.8735294117647059
0.9367647058823529
</code></pre></div></div>

<p>That is the practical win: casing changes can score <strong>100%</strong>, while light paraphrases still land around <strong>87.4%</strong> on <code class="language-plaintext highlighter-rouge">content</code>, so the overall score stays near <strong>93.7%</strong>.</p>

<p>Other slots should stay <strong>exact</strong> once normalization runs: IDs, enums, fixed taxonomy labels, SKUs. <strong><code class="language-plaintext highlighter-rouge">KeywordSpec</code></strong> uses <strong>exact similarity</strong> on those strings (with preprocessing such as stripping whitespace and optional int-to-string coercion), so you do not get partial fuzzy credit when the value must match.</p>
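<p>To picture that "equal or wrong" contract, here is a minimal stdlib-only sketch of keyword-style similarity. This is my own illustration of the idea, not Cobjectric's internal code, and the <code>coerce_int</code> flag is a hypothetical stand-in for the optional int-to-string coercion mentioned above:</p>

```python
def keyword_similarity(parsed, reference, coerce_int=True):
    """Exact match after light normalization: 1.0 or 0.0, no partial credit."""
    def normalize(value):
        # Optionally coerce ints so 42 and "42" compare equal.
        if coerce_int and isinstance(value, int):
            value = str(value)
        # Strip surrounding whitespace; the token itself stays untouched.
        return value.strip() if isinstance(value, str) else value

    return 1.0 if normalize(parsed) == normalize(reference) else 0.0


print(keyword_similarity("  SKU-123 ", "SKU-123"))  # 1.0
print(keyword_similarity(42, "42"))                 # 1.0
print(keyword_similarity("SKU-123", "sku-124"))     # 0.0: no fuzzy credit
```

<p>The point of the sketch is the cliff: a one-character drift on an ID scores <strong>0.0</strong>, exactly what you want when the value is a code, not prose.</p>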

<div class="callout tip">
  <span class="callout-icon">💡</span>
  <div class="callout-body">
    <div class="callout-title">Tip</div>
    <p>
      Use <code>TextSpec</code> for free-form prose so normalization (case, spacing, accents) and RapidFuzz-backed similarity stay consistent. Tune <code>scorer</code> (for example <code>WRatio</code>) when you need stricter or looser fuzzy behavior.
    </p>
    <p>
      Use <code>KeywordSpec</code> when the contract is effectively "equal or wrong": matching normalized tokens must score <strong>1.0</strong>, anything else scores <strong>0.0</strong>.
      For fully custom rules you can still attach <code>similarity_func</code> or use helpers such as <code>exact_similarity</code> from <code>cobjectric.similarity</code>; see <a href="https://cobjectric.nigiva.com/similarity/">Similarity</a> and <a href="https://cobjectric.nigiva.com/specs/">Pre-defined Specs</a>.
    </p>
  </div>
</div>

<h2 id="lists-match-items-even-when-order-shifts">Lists: match items even when order shifts</h2>

<p>Models often emit <strong>arrays</strong> of nested objects. Pairwise index alignment works when order is stable.
When it is not, <strong><code class="language-plaintext highlighter-rouge">ListCompareStrategy.OPTIMAL_ASSIGNMENT</code></strong> finds a strong one-to-one pairing.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">cobjectric</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Spec</span><span class="p">,</span> <span class="n">ListCompareStrategy</span>
<span class="kn">from</span> <span class="n">cobjectric.specs</span> <span class="kn">import</span> <span class="n">KeywordSpec</span>


<span class="k">class</span> <span class="nc">Skill</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">KeywordSpec</span><span class="p">()</span>
    <span class="n">level</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">KeywordSpec</span><span class="p">()</span>


<span class="k">class</span> <span class="nc">Developer</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">skills</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Skill</span><span class="p">]</span> <span class="o">=</span> <span class="nc">Spec</span><span class="p">(</span>
        <span class="n">list_compare_strategy</span><span class="o">=</span><span class="n">ListCompareStrategy</span><span class="p">.</span><span class="n">OPTIMAL_ASSIGNMENT</span>
    <span class="p">)</span>


<span class="n">reference</span> <span class="o">=</span> <span class="n">Developer</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">skills</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Python</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Expert</span><span class="sh">"</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">JavaScript</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Intermediate</span><span class="sh">"</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">SQL</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Advanced</span><span class="sh">"</span><span class="p">},</span>
        <span class="p">]</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">parsed</span> <span class="o">=</span> <span class="n">Developer</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">skills</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">JavaScript</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Intermediate</span><span class="sh">"</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">SQL</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Advanced</span><span class="sh">"</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Python</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">level</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Expert</span><span class="sh">"</span><span class="p">},</span>
        <span class="p">]</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">similarity</span> <span class="o">=</span> <span class="n">parsed</span><span class="p">.</span><span class="nf">compute_similarity</span><span class="p">(</span><span class="n">reference</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">similarity</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
</code></pre></div></div>

<p><strong>100%</strong> here means every aligned pair matches on structured fields, even though the incoming list was rotated.</p>

<div class="callout warning">
  <span class="callout-icon">⚠️</span>
  <div class="callout-body">
    <div class="callout-title">Warning</div>
    <p>
      Default <strong>pairwise</strong> alignment compares index <code>i</code> on both sides.
      If your generator shuffles sections (skills, roles, bullet lists), pairwise similarity will look unfairly bad even when the content is right.
      Reach for <strong>Levenshtein</strong> when order is mostly stable but items insert or drop,
      or <strong>optimal assignment</strong> when order is unreliable (SciPy required for that strategy).
    </p>
  </div>
</div>
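<p>To see why the strategy choice matters, here is a tiny stdlib-only illustration of pairwise versus optimal assignment. It brute-forces permutations instead of calling SciPy's solver, so it is only a sketch for short lists, not how the library computes the pairing:</p>

```python
from itertools import permutations


def pairwise_score(parsed, reference, sim):
    """Compare index i to index i on both sides; rotation is punished."""
    return sum(sim(p, r) for p, r in zip(parsed, reference)) / len(reference)


def optimal_assignment_score(parsed, reference, sim):
    """Best one-to-one pairing; brute force stands in for SciPy here."""
    return max(
        sum(sim(p, r) for p, r in zip(perm, reference)) / len(reference)
        for perm in permutations(parsed)
    )


exact = lambda a, b: 1.0 if a == b else 0.0
reference = ["Python", "JavaScript", "SQL"]
rotated = ["JavaScript", "SQL", "Python"]

print(pairwise_score(rotated, reference, exact))            # 0.0
print(optimal_assignment_score(rotated, reference, exact))  # 1.0
```

<p>Same content, same items: pairwise sees three mismatches and reports <strong>0.0</strong>, while the best one-to-one pairing recovers <strong>1.0</strong>, which is the rotated-skills result from the example above.</p>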

<h2 id="case-study-a-cv-shaped-schema-with-all-three-metrics">Case study: a CV-shaped schema with all three metrics</h2>

<p>This mirrors my original use case: nested <strong><code class="language-plaintext highlighter-rouge">Experience</code></strong> rows plus fuzzy <strong><code class="language-plaintext highlighter-rouge">summary</code></strong> text. Assume <strong><code class="language-plaintext highlighter-rouge">Experience</code></strong>, <strong><code class="language-plaintext highlighter-rouge">CV</code></strong>, <strong><code class="language-plaintext highlighter-rouge">reference_cv</code></strong>, and <strong><code class="language-plaintext highlighter-rouge">llm_cv</code></strong> match the expanded snippet below.</p>


<p>On this toy pair, <strong>completeness lines up</strong>, but <strong>wording still drifts</strong>, which is exactly when you want all three APIs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fill_only</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_fill_rate</span><span class="p">()</span>
<span class="n">presence_match</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_fill_rate_accuracy</span><span class="p">(</span><span class="n">reference_cv</span><span class="p">)</span>
<span class="n">value_match</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_similarity</span><span class="p">(</span><span class="n">reference_cv</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">fill_only</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
<span class="nf">print</span><span class="p">(</span><span class="n">presence_match</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
<span class="nf">print</span><span class="p">(</span><span class="n">value_match</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
1.0
0.9755555555555555
</code></pre></div></div>

<p><strong>Readout:</strong> fill rate is <strong>100%</strong> because the model output is fully populated. Fill-rate accuracy is also <strong>100%</strong> because the same slots are filled on both sides. Similarity is <strong>~97.6%</strong> because names and summaries are close, not identical, even when companies and descriptions line up.</p>

<details>
<summary>Show full CV example (models, payloads, all metrics)</summary>
<div class="details-content">

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">cobjectric</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Spec</span><span class="p">,</span> <span class="n">ListCompareStrategy</span>
<span class="kn">from</span> <span class="n">cobjectric.specs</span> <span class="kn">import</span> <span class="n">KeywordSpec</span><span class="p">,</span> <span class="n">TextSpec</span>


<span class="k">class</span> <span class="nc">Experience</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">company</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">KeywordSpec</span><span class="p">()</span>
    <span class="n">title</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">()</span>
    <span class="n">description</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">(</span><span class="n">scorer</span><span class="o">=</span><span class="sh">"</span><span class="s">WRatio</span><span class="sh">"</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">CV</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">()</span>
    <span class="n">summary</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="nc">TextSpec</span><span class="p">(</span><span class="n">scorer</span><span class="o">=</span><span class="sh">"</span><span class="s">WRatio</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">experiences</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Experience</span><span class="p">]</span> <span class="o">=</span> <span class="nc">Spec</span><span class="p">(</span>
        <span class="n">list_compare_strategy</span><span class="o">=</span><span class="n">ListCompareStrategy</span><span class="p">.</span><span class="n">OPTIMAL_ASSIGNMENT</span>
    <span class="p">)</span>


<span class="n">reference_cv</span> <span class="o">=</span> <span class="n">CV</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Jean-Pierre Dupont</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">summary</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="sh">"</span><span class="s">Senior Software Engineer with 10 years of experience </span><span class="sh">"</span>
            <span class="sh">"</span><span class="s">in Python and ML.</span><span class="sh">"</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">experiences</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="sh">"</span><span class="s">company</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">TechCorp</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Senior Software Engineer</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
                    <span class="sh">"</span><span class="s">Led development of ML pipelines. </span><span class="sh">"</span>
                    <span class="sh">"</span><span class="s">Managed team of 5 engineers.</span><span class="sh">"</span>
                <span class="p">),</span>
            <span class="p">}</span>
        <span class="p">],</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">llm_cv</span> <span class="o">=</span> <span class="n">CV</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Jean Pierre Dupont</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">summary</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="sh">"</span><span class="s">Senior Software Engineer with 10+ years experience </span><span class="sh">"</span>
            <span class="sh">"</span><span class="s">in Python &amp; ML</span><span class="sh">"</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">experiences</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="sh">"</span><span class="s">company</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">TechCorp</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Senior Software Engineer</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
                    <span class="sh">"</span><span class="s">Led development of ML pipelines.  </span><span class="sh">"</span>
                    <span class="sh">"</span><span class="s">Managed team of 5 engineers.</span><span class="sh">"</span>
                <span class="p">),</span>
            <span class="p">}</span>
        <span class="p">],</span>
    <span class="p">}</span>
<span class="p">)</span>

<span class="n">fill_only</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_fill_rate</span><span class="p">()</span>
<span class="n">presence_match</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_fill_rate_accuracy</span><span class="p">(</span><span class="n">reference_cv</span><span class="p">)</span>
<span class="n">value_match</span> <span class="o">=</span> <span class="n">llm_cv</span><span class="p">.</span><span class="nf">compute_similarity</span><span class="p">(</span><span class="n">reference_cv</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="n">fill_only</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
<span class="nf">print</span><span class="p">(</span><span class="n">presence_match</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
<span class="nf">print</span><span class="p">(</span><span class="n">value_match</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">value_match</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">value_match</span><span class="p">.</span><span class="n">fields</span><span class="p">.</span><span class="n">experiences</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">fields</span><span class="p">.</span><span class="n">description</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">value_match</span><span class="p">.</span><span class="nf">mean</span><span class="p">())</span>
</code></pre></div>    </div>

    <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
1.0
0.9444444444444444
0.9333333333333332
1.0
0.9755555555555555
</code></pre></div>    </div>

  </div>
</details>

<h2 id="beyond-hiring-cvs">Beyond hiring CVs</h2>

<p>Once the metrics exist, you can reuse them anywhere structured output shows up:</p>

<ul>
  <li><strong>API contracts</strong>: did we return the same shaped object across versions?</li>
  <li><strong>LLM benchmarks</strong>: swap prompts or models, keep the schema fixed, log means.</li>
  <li><strong>Data quality</strong>: measure completeness before pushing rows downstream.</li>
  <li><strong>Migration QA</strong>: compare legacy vs new serializers field by field.</li>
</ul>

<h2 id="built-in-specs-quick-map">Built-in Specs (quick map)</h2>

<p>If you want batteries-included normalizers and similarity defaults, Cobjectric ships <strong><code class="language-plaintext highlighter-rouge">KeywordSpec</code></strong>, <strong><code class="language-plaintext highlighter-rouge">TextSpec</code></strong>, <strong><code class="language-plaintext highlighter-rouge">NumericSpec</code></strong>, <strong><code class="language-plaintext highlighter-rouge">BooleanSpec</code></strong>, and <strong><code class="language-plaintext highlighter-rouge">DatetimeSpec</code></strong>.</p>

<details>
<summary>Expand Spec cheat sheet</summary>
<div class="details-content">

    <table>
      <thead>
        <tr>
          <th>Spec</th>
          <th>Good for</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><code class="language-plaintext highlighter-rouge">KeywordSpec</code></td>
          <td>IDs, enums, codes (<code class="language-plaintext highlighter-rouge">strip</code>, optional int-to-string coercion)</td>
        </tr>
        <tr>
          <td><code class="language-plaintext highlighter-rouge">TextSpec</code></td>
          <td>Long prose with normalization + RapidFuzz similarity</td>
        </tr>
        <tr>
          <td><code class="language-plaintext highlighter-rouge">NumericSpec</code></td>
          <td>JSON number quirks + tolerant similarity</td>
        </tr>
        <tr>
          <td><code class="language-plaintext highlighter-rouge">BooleanSpec</code></td>
          <td>Loose truthy parsing</td>
        </tr>
        <tr>
          <td><code class="language-plaintext highlighter-rouge">DatetimeSpec</code></td>
          <td>ISO-ish timestamps with optional tolerance</td>
        </tr>
      </tbody>
    </table>

    <p>For field-level weights, custom normalizers, and aggregation helpers, read <strong><a href="https://cobjectric.nigiva.com/specs/">Pre-defined Specs</a></strong> and <strong><a href="https://cobjectric.nigiva.com/field_specs/">Field Specifications</a></strong>.</p>

  </div>
</details>
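<p>As a mental model for the tolerance-style specs in that table, here is a hand-rolled sketch of a tolerant datetime comparison. This is my illustration of the idea, not <code>DatetimeSpec</code>'s actual signature or scoring curve:</p>

```python
from datetime import datetime, timedelta


def datetime_similarity(parsed, reference, tolerance=timedelta(minutes=1)):
    """Score 1.0 when timestamps fall within the tolerance window, else 0.0."""
    delta = abs(parsed - reference)
    return 1.0 if delta <= tolerance else 0.0


ref = datetime.fromisoformat("2026-05-10T12:00:00")
close = datetime.fromisoformat("2026-05-10T12:00:30")
far = datetime.fromisoformat("2026-05-10T13:00:00")

print(datetime_similarity(close, ref))  # 1.0: within the 1-minute window
print(datetime_similarity(far, ref))    # 0.0
```

<p>The design question a tolerance answers is "how much clock noise is acceptable before the field counts as wrong" and that threshold belongs in the spec, not in per-test assertions.</p>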

<div class="callout note">
  <span class="callout-icon">📝</span>
  <div class="callout-body">
    <div class="callout-title">Limits</div>
    <p>
      Fuzzy scores depend on RapidFuzz and your chosen <code>scorer</code>, so pin versions when you compare runs over time.
      <strong>Optimal assignment</strong> for lists needs SciPy.
      Similarity returns <strong>0.0</strong> when one side is missing while the other is filled, even if the gap is only on one nested field.
      When you need guarantees beyond strings and tolerances, pair these metrics with schema validation or task-specific checks.
    </p>
  </div>
</div>
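<p>The missing-versus-filled rule from the Limits note is easy to reproduce in a sketch. This is a plain function of my own, not the library's API; the both-missing branch returning <code>None</code> (excluded from the mean) is an assumption for illustration:</p>

```python
def field_similarity(parsed, reference, sim):
    """One-sided gap -> 0.0 before any fuzzy comparison runs."""
    if parsed is None and reference is None:
        return None  # nothing to compare; assumed excluded from the mean
    if parsed is None or reference is None:
        return 0.0   # filled on one side only is a hard miss
    return sim(parsed, reference)


exact = lambda a, b: 1.0 if a == b else 0.0
print(field_similarity(None, "filled", exact))  # 0.0
print(field_similarity("x", "x", exact))        # 1.0
```

<p>That hard <strong>0.0</strong> is why a single missing nested field can pull a whole object's mean down noticeably.</p>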

<h2 id="where-to-go-next">Where to go next</h2>

<p>Install from PyPI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cobjectric
</code></pre></div></div>

<p>Docs live at <strong><a href="https://cobjectric.nigiva.com/">cobjectric.nigiva.com</a></strong> (quick start, list strategies, pandas export). Source and issues are on <strong><a href="https://github.com/nigiva/cobjectric">GitHub</a></strong>.</p>]]></content><author><name>Nigiva</name><email>blog@nigiva.com</email></author><category term="project" /><category term="python" /><category term="ml" /><category term="tooling" /><summary type="html"><![CDATA[A Python library for fill rate, fill-rate accuracy, and fuzzy similarity on structured payloads such as CV parses or LLM JSON.]]></summary></entry><entry><title type="html">Hello world</title><link href="https://blog.nigiva.com/2026/05/01/hello-world.html" rel="alternate" type="text/html" title="Hello world" /><published>2026-05-01T00:00:00+02:00</published><updated>2026-05-01T00:00:00+02:00</updated><id>https://blog.nigiva.com/2026/05/01/hello-world</id><content type="html" xml:base="https://blog.nigiva.com/2026/05/01/hello-world.html"><![CDATA[<div class="callout tldr">
  <span class="callout-icon" aria-hidden="true">
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M16 4h2a2 2 0 0 1 2 2v14a2 2 0 0 1-2 2H6a2 2 0 0 1-2-2V6a2 2 0 0 1 2-2h2" /><rect width="8" height="4" x="8" y="2" rx="1" ry="1" /><path d="M9 14h6" /><path d="M9 18h6" /><path d="M9 22h6" /></svg>
  </span>
  <div class="callout-body">
    <div class="callout-title">TL;DR</div>
    <div class="callout-md">
<p>This blog is my <strong>public lab notebook</strong>. I'm aiming for evidence: questions, controlled setups or benchmarks, and honest conclusions about what works <strong>where</strong>.</p>

<p>Posts reflect <strong>my</strong> views and <strong>my</strong> hardware unless I say otherwise.</p>

<p>Read <strong><a href="/about/">About</a></strong> for background, keys, and how email works here.</p>

    </div>
  </div>
</div>

<h2 id="disclaimers">Disclaimers</h2>

<p><strong>What's here is my opinion</strong>, not my employers', clients', or sponsors'. When something overlaps with work I'll still separate personal takes from official positions.</p>

<p><strong>Experiments run on my dime</strong>: my hardware, cloud credits I pay for, and code I own unless a note explicitly says I got sponsored GPU time or borrowed gear. When that happens I'll flag it up front.</p>

<p>If you spot a mistake or disagree with an argument, email <strong><a href="mailto:blog@nigiva.com">blog@nigiva.com</a></strong>. That's the inbox I actually read for serious replies.</p>

<h2 id="the-mantra-this-site-keeps-coming-back-to">The mantra this site keeps coming back to</h2>

<p>I care about answers you can <strong>actually stress-test</strong>.</p>

<p>When it's realistic I write from something close to the <strong>scientific method</strong>: nail down the claim, freeze the scenario, compare options with <strong>benchmarks</strong> or reproducible demos, then say clearly <strong>what's better here</strong> (not universal slogans). When evidence is thin I'll say so and treat the piece as a notebook entry, not a final verdict.</p>

<p>That's a big part of why I publish: writing forces assumptions, harness design, and failure modes out into the open.</p>

<p><strong>Replication</strong> fits in the same picture: document procedures clearly enough that people can rerun them. When calibration matters, redo external papers or older versions of my own notes. <strong>Same curve or a different one</strong>, both teach you something.</p>

<h2 id="why-bother-publishing">Why bother publishing</h2>

<p>Writing keeps me honest. Turning fuzzy intuition into prose pushes me toward clearer assumptions, tighter experiments, and explicit limits.</p>

<p>This site is also <strong>external memory</strong>. Over a year I ship tons of fragments that never leave local repos or notebooks. Putting some of that in public helps me see what actually moved instead of feeling stuck.</p>

<p>Friends sometimes ask how I think about a topic. Instead of repeating long chats I can send <strong>one link</strong> that already has trade-offs, failures, or partial wins.</p>

<p>Publishing also lets me <strong>stress-test ideas</strong>. Some posts back hunches with numbers; others will age badly and deserve updates. Both are useful.</p>

<p>For <strong>opinion-heavy posts</strong>, keeping a dated version in public helps me stay <strong>honest about drift</strong>: I can point to what I believed earlier, admit when something aged poorly, show where I changed my mind, and explain <strong>why</strong> instead of pretending my takes were always consistent.</p>

<h2 id="what-youll-find">What you'll find</h2>

<ul>
  <li><strong>ML engineering</strong> notes.</li>
  <li><strong>Agentic systems</strong> experiments.</li>
  <li><strong>Reverse-engineering</strong> curiosity, written responsibly.</li>
  <li><strong>Art-adjacent</strong> tangents when they cross tooling.</li>
  <li><strong>Projects</strong> I'm working on.</li>
  <li>And maybe more… 😉</li>
</ul>

<div class="callout info">
  <span class="callout-icon">ℹ️</span>
  <div class="callout-body">
    <div class="callout-title">Info</div>
    <p>
      Want plain Markdown instead of HTML? Swap <strong><code>.html</code></strong> for
      <strong><code>.md</code></strong> on the URL path.
      For this note that's <a href="/2026/05/01/hello-world.md"><code>/2026/05/01/hello-world.md</code></a>
      (front matter included). Handy if you want to edit offline or paste into an agent.
    </p>
  </div>
</div>]]></content><author><name>Nigiva</name><email>blog@nigiva.com</email></author><category term="blog" /><category term="meta" /><summary type="html"><![CDATA[Why this blog exists, the evidence-first mindset, disclaimers, and what to expect.]]></summary></entry></feed>