How We Cut AI Infrastructure Cost by 93% in Production
Azure AI Foundry pricing at scale will surprise most organisations. Here is the exact architecture decision that changed everything.
When you process a few hundred documents a month through a managed LLM API, the cost is negligible. When you scale to 5,000+ documents per hour, the economics break entirely.
That was the situation at a major UK life insurer. Their AI infrastructure bill was tracking toward £2M annually and growing linearly with volume. The CFO flagged it. The CTO needed a solution that didn't compromise on performance or data governance.
The problem with the obvious answers
The simple responses — switch to a cheaper model, reduce usage, negotiate a volume discount — didn't hold up:
- Cheaper managed models introduced accuracy risk on underwriting decisions carrying significant liability
- Reducing usage meant the business value case collapsed
- Volume discounts don't change the fundamental economics at scale
The real answer was structural, not transactional.
The hybrid routing architecture
The insight was that not all LLM requests are equal. Some require low latency and high accuracy — they must use the best available model. Others are high-volume, tolerant of slightly higher latency, and don't justify managed API pricing.
We built an intelligent routing layer that classifies each request at inference time and directs it to the optimal endpoint:
- Latency-sensitive, low-volume queries → Azure AI Foundry (GPT-4o)
- High-volume batch processing → Self-hosted GPU cluster (Llama 3.x, Qwen 2.5 on AKS with KAITO operator)
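A minimal sketch of what such a routing layer can look like. The endpoint names, the `Request` fields, and the token threshold are illustrative assumptions, not the patented implementation; the point is simply that the decision is a cheap, per-request classification made before any model is called:

```python
from dataclasses import dataclass

# Hypothetical endpoint identifiers, not real deployment names
MANAGED_ENDPOINT = "azure-ai-foundry-gpt4o"
SELF_HOSTED_ENDPOINT = "aks-kaito-llama3"

@dataclass
class Request:
    latency_sensitive: bool  # must it return fast with the best model?
    batch: bool              # submitted as part of a bulk document job?
    est_tokens: int          # rough size estimate for the request

def route(req: Request, batch_token_threshold: int = 2_000) -> str:
    """Classify a request at inference time and pick the cost-optimal endpoint."""
    if req.latency_sensitive:
        # Accuracy- and latency-critical work stays on the managed frontier model
        return MANAGED_ENDPOINT
    if req.batch or req.est_tokens >= batch_token_threshold:
        # High-volume, latency-tolerant work goes to the self-hosted GPU cluster
        return SELF_HOSTED_ENDPOINT
    return MANAGED_ENDPOINT
```

In practice the classification signal can come from the calling application (job type, SLA tier) rather than from inspecting the payload, which keeps the router stateless and fast.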
The self-hosted cluster runs on NC48ads A100 v4 nodes within the organisation's own Azure estate. No data leaves the perimeter. Full FCA compliance maintained.
The numbers
The cost reduction was 93%. Not 20%, not 40% — 93%. Annualised saving: over £1.8M. Throughput achieved: 47,637 requests per hour.
The routing architecture is now patent-pending. It has been adopted as the organisation's standard for all LLM deployments.
What this means for your organisation
If you are running AI workloads at scale on managed APIs, you almost certainly have a cost problem you haven't fully quantified yet. The break-even point for self-hosted infrastructure is lower than most engineering teams assume — typically around 50,000 tokens per day.
The calculation isn't complicated. The architecture isn't exotic. What's missing, usually, is someone who has done it in a regulated production environment and can de-risk the transition.