Exact-match caches miss paraphrased prompts, so repetitive intent still triggers full-price LLM calls even when user meaning is identical.
Wrap completion calls once, compute local prompt embeddings with ONNX, then perform cosine similarity lookup before forwarding misses to the provider.
- Sub-10ms lookup overhead on CPU.
- In-process architecture with no external orchestration layer.
- 40–70% savings potential for repetitive support/FAQ workloads.