Shipping LLM features without melting your infrastructure

2026-06-16 · AI · 1 min read · @TechScribeWire

The demo always works. You wire up a model, paste in a prompt, and the magic happens on the first try. Then you ship it to real users and discover that “magic” has a p99 latency of nine seconds and a monthly bill that looks like a typo.

Treat the model like a flaky network dependency

The single most useful mental shift is to stop treating the model as a function call and start treating it as a remote service that is occasionally slow, occasionally wrong, and occasionally down. That means timeouts, retries with backoff, and a fallback path that degrades gracefully instead of spinning forever.

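// timeout(ms) resolves to null after ms, so a slow model call loses the race.
const timeout = (ms) => new Promise((resolve) => setTimeout(() => resolve(null), ms));

const result = await Promise.race([
  callModel(prompt),
  timeout(4000), // null here means: give up and show a cached/simpler response
]);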
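
The race above covers the timeout. Retries with backoff and a fallback can be layered on top, roughly like the sketch below. The names `ModelCall`, `withTimeout`, and `callWithFallback` are hypothetical, and the defaults (3 attempts, 4000 ms, 250 ms base delay) are placeholders to tune, not recommendations.

```typescript
// Hypothetical signature: any function that takes a prompt and resolves to text.
type ModelCall = (prompt: string) => Promise<string>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Rejects after `ms`, so a hung call counts as a failed attempt.
function withTimeout(work: Promise<string>, ms: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

async function callWithFallback(
  call: ModelCall,
  prompt: string,
  fallback: string,
  opts = { attempts: 3, timeoutMs: 4000, baseDelayMs: 250 },
): Promise<string> {
  for (let attempt = 0; attempt < opts.attempts; attempt++) {
    try {
      return await withTimeout(call(prompt), opts.timeoutMs);
    } catch {
      if (attempt === opts.attempts - 1) break; // out of attempts
      // Exponential backoff with jitter: wait up to baseDelayMs * 2^attempt.
      await sleep(Math.random() * opts.baseDelayMs * 2 ** attempt);
    }
  }
  return fallback; // degrade gracefully instead of surfacing the error
}
```

Note that the timeout only stops waiting for the call; it does not cancel the underlying request, so a client that supports cancellation should be wired in separately.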

Cache aggressively, at every layer

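Most production prompts are far less unique than they feel. Normalize inputs (trim whitespace, and fold case where that is safe) and cache completions by a hash of the normalized prompt. Even a 20% hit rate means one request in five never reaches the model, which cuts both cost and tail latency.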
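
A minimal sketch of that idea, assuming Node.js and an in-memory `Map` standing in for whatever shared cache you actually run. The helper names (`normalizePrompt`, `cacheKey`, `cachedCompletion`) are made up for illustration, and the normalization rules are an assumption: lowercasing is only safe if your prompts are case-insensitive.

```typescript
import { createHash } from "node:crypto";

// Collapse case and whitespace so trivially different prompts share one key.
// Assumption: prompts are case-insensitive; drop toLowerCase() if they are not.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

function cacheKey(prompt: string): string {
  return createHash("sha256").update(normalizePrompt(prompt)).digest("hex");
}

// In-memory cache for illustration; a shared store would replace this Map.
const completions = new Map<string, string>();

async function cachedCompletion(
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(prompt);
  const hit = completions.get(key);
  if (hit !== undefined) return hit; // cache hit: no model call, no tokens spent
  const completion = await callModel(prompt);
  completions.set(key, completion);
  return completion;
}
```

This sketch has no expiry or size limit, so a production version would need both, plus the model name and any sampling parameters folded into the key.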

Stream tokens to hide latency

Users forgive slowness they can watch. Streaming does not make a four-second completion finish any sooner, but it feels far faster because the first token can arrive within a few hundred milliseconds.
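
To make that concrete, here is a rough sketch of a consumer that forwards tokens as they arrive and records time to first token separately from total time. `fakeTokenStream` is a stand-in for a real streaming client, and `renderStream` is a hypothetical name; real SDKs expose their own stream types.

```typescript
// Stand-in for a streaming model client: yields one token every `delayMs`.
// A real SDK would expose its own async iterable or event stream.
async function* fakeTokenStream(
  tokens: string[],
  delayMs: number,
): AsyncGenerator<string> {
  for (const token of tokens) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}

// Forwards each token to `onToken` as soon as it arrives and reports timing.
async function renderStream(
  stream: AsyncIterable<string>,
  onToken: (token: string) => void,
): Promise<{ text: string; firstTokenMs: number; totalMs: number }> {
  const start = Date.now();
  let firstTokenMs = -1; // stays -1 if the stream is empty
  let text = "";
  for await (const token of stream) {
    if (firstTokenMs < 0) firstTokenMs = Date.now() - start;
    text += token;
    onToken(token); // e.g. append to the UI or write to the HTTP response
  }
  return { text, firstTokenMs, totalMs: Date.now() - start };
}
```

Tracking `firstTokenMs` alongside `totalMs` is the point of the design: the first number is closer to what users perceive, the second is what you pay for.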

Budget tokens like you budget money

Set hard max-token caps per request.
Trim context windows ruthlessly; more context is not free.
Log token counts per feature so you can see which flows are expensive.
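
The trimming and logging items above might look something like this sketch. Everything here is illustrative: `estimateTokens` uses a crude "about four characters per token" guess that you should replace with your provider's tokenizer, and `trimContext` and `recordUsage` are hypothetical helpers. The per-request cap itself is a request parameter whose name varies by provider, so it is not shown.

```typescript
// Rough estimate only (assumes ~4 characters per token); swap in your
// provider's tokenizer for real accounting.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps the newest messages that fit in `budget` tokens, dropping older ones.
function trimContext(messages: string[], budget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i]);
    if (used + cost > budget) break; // everything older is dropped too
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

// Running token totals keyed by feature name.
const tokensByFeature = new Map<string, number>();

function recordUsage(
  feature: string,
  promptTokens: number,
  completionTokens: number,
): void {
  const total = promptTokens + completionTokens;
  tokensByFeature.set(feature, (tokensByFeature.get(feature) ?? 0) + total);
}
```

In practice you would call `recordUsage` with the token counts your provider reports for each response and ship the totals to your metrics system rather than keep them in process memory.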

None of this is glamorous, but it is the difference between an AI feature that survives contact with real traffic and one that gets rolled back on a Friday night.

#LLMs #production #latency #cost
