Shipping LLM features without melting your infrastructure
2026-06-16 · AI · 1 min read · @TechScribeWire
The demo always works. You wire up a model, paste in a prompt, and the magic happens on the first try. Then you ship it to real users and discover that “magic” has a p99 latency of nine seconds and a monthly bill that looks like a typo.
Treat the model like a flaky network dependency
The single most useful mental shift is to stop treating the model as a function call and start treating it as a remote service that is occasionally slow, occasionally wrong, and occasionally down. That means timeouts, retries with backoff, and a fallback path that degrades gracefully instead of spinning forever.
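// callModel and timeout are assumed helpers defined elsewhere;
// timeout(ms) is expected to settle after ms milliseconds.
const result = await Promise.race([
  callModel(prompt),
  timeout(4000), // give up and show a cached/simpler response
]);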
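The snippet above covers the timeout; retries and the fallback path can be sketched in the same style. This is a minimal sketch, not a definitive implementation: `withTimeout`, `callWithFallback`, and the option names (`attempts`, `timeoutMs`, `baseDelayMs`) are hypothetical, and `callModel` and `fallback` are whatever functions your app already has.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
// The timer is cleared either way so it cannot keep the process alive.
// Note: the underlying request is not cancelled here; a real client
// would also abort it (for example with an AbortSignal).
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: try the model a few times, then degrade to `fallback`.
async function callWithFallback(callModel, prompt, fallback, options = {}) {
  const { attempts = 3, timeoutMs = 4000, baseDelayMs = 250 } = options;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await withTimeout(callModel(prompt), timeoutMs);
    } catch (err) {
      if (attempt === attempts - 1) break; // out of retries
      // Exponential backoff with jitter: wait between half of and the full
      // delay, where the delay doubles each attempt (250 ms, 500 ms, ...).
      const delay = baseDelayMs * 2 ** attempt;
      await sleep(delay / 2 + Math.random() * (delay / 2));
    }
  }
  return fallback(prompt); // cached or simpler response
}
```

Worst case, a request waits roughly `attempts × timeoutMs` plus the backoff delays before the fallback kicks in, so interactive paths probably want a low attempt count.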
Cache aggressively, at every layer
Most production prompts are far less unique than they feel. Normalize inputs and cache completions by a hash of the normalized prompt. Even a 20% hit rate means roughly one in five requests skips the model call entirely, which cuts cost and takes those requests off the slow path.
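Here is one way the normalize-then-hash idea might look. Treat it as a sketch under assumptions: `normalizePrompt`, `hashKey`, and `cachedCompletion` are hypothetical names, the in-memory `Map` stands in for whatever cache you actually run, and lowercasing is only safe if your prompts are not case-sensitive. The hash is a small FNV-1a-style function chosen to keep the example dependency-free; a real system would more likely use a cryptographic hash such as SHA-256.

```javascript
// Collapse case and whitespace so trivially different prompts share a key.
// Assumption: prompts are not case-sensitive for this feature.
function normalizePrompt(prompt) {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// 32-bit FNV-1a-style hash over UTF-16 code units, returned as 8 hex chars.
function hashKey(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

const cache = new Map(); // stand-in for a shared cache

async function cachedCompletion(prompt, callModel) {
  const normalized = normalizePrompt(prompt);
  const key = hashKey(normalized);
  const entry = cache.get(key);
  // Compare the stored prompt as well, so a hash collision
  // can never return a completion for a different prompt.
  if (entry && entry.normalized === normalized) return entry.completion;
  const completion = await callModel(normalized);
  cache.set(key, { normalized, completion });
  return completion;
}
```

Two prompts that differ only in spacing or capitalization now trigger a single model call. An expiry policy is left out here, but anything whose answer can go stale needs one.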
Stream tokens to hide latency
Users forgive slowness they can watch. Streaming the response makes a four-second completion feel far faster because the first token can arrive in a few hundred milliseconds, long before the full response is done.
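To make the difference concrete, here is a rough sketch that forwards tokens as they arrive and records time to first token alongside total time. Everything here is an assumption for illustration: `fakeModelStream` is a hypothetical stand-in for a real streaming client, and `streamToClient` and `onToken` are made-up names for whatever pushes text to your UI.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a streaming model client: yields one token at a time.
async function* fakeModelStream(tokens, delayMs) {
  for (const token of tokens) {
    await sleep(delayMs);
    yield token;
  }
}

// Forwards each token to `onToken` as soon as it arrives and
// records how long the first token and the full response took.
async function streamToClient(stream, onToken) {
  const start = Date.now();
  let firstTokenMs = null;
  let text = "";
  for await (const token of stream) {
    if (firstTokenMs === null) firstTokenMs = Date.now() - start;
    text += token;
    onToken(token);
  }
  return { text, firstTokenMs, totalMs: Date.now() - start };
}
```

Tracking `firstTokenMs` separately from `totalMs` is the useful part: it is the number users actually feel, so it is worth putting on a dashboard next to p99.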
Budget tokens like you budget money
- Set hard max-token caps per request.
- Trim context windows ruthlessly; more context is not free.
- Log token counts per feature so you can see which flows are expensive.
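The three rules above could be wired together roughly like this. It is a sketch, not a prescription: `estimateTokens` uses a crude four-characters-per-token guess instead of the model's real tokenizer, and `trimContext`, `buildRequest`, `recordUsage`, and the `maxOutputTokens` / `maxContextTokens` field names are all hypothetical.

```javascript
// Crude estimate (about 4 characters per token).
// Assumption: replace with the model's real tokenizer in production.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keeps the most recent messages that fit the budget, dropping the oldest first.
function trimContext(messages, maxContextTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i]);
    if (used + cost > maxContextTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return { messages: kept, tokens: used };
}

// Builds a request with a hard output cap and a trimmed context.
function buildRequest(feature, messages, limits) {
  const context = trimContext(messages, limits.maxContextTokens);
  return {
    feature,
    messages: context.messages,
    maxOutputTokens: limits.maxOutputTokens, // hard cap per request
    promptTokens: context.tokens,
  };
}

const usageByFeature = new Map();

// Accumulates token counts per feature so expensive flows stand out.
function recordUsage(feature, promptTokens, completionTokens) {
  const total = usageByFeature.get(feature) ??
    { requests: 0, promptTokens: 0, completionTokens: 0 };
  total.requests += 1;
  total.promptTokens += promptTokens;
  total.completionTokens += completionTokens;
  usageByFeature.set(feature, total);
  return total;
}
```

Dropping the oldest messages first is only one trimming policy; summarizing old turns or pinning a system message are alternatives, depending on the feature.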
None of this is glamorous, but it is the difference between an AI feature that survives contact with real traffic and one that gets rolled back on a Friday night.