AI Cost Optimisation · March 2026 · 7 min read

How We Cut AI Infrastructure Cost by 93% in Production

Azure AI Foundry pricing at scale will surprise most organisations. Here is the exact architecture decision that changed everything.

When you process a few hundred documents a month through a managed LLM API, the cost is negligible. When you scale to 5,000+ documents per hour, the economics break entirely.

That was the situation at a major UK life insurer. Their AI infrastructure bill was tracking toward £2M annually and growing linearly with volume. The CFO flagged it. The CTO needed a solution that didn't compromise on performance or data governance.

The problem with the obvious answers

The obvious responses — switch to a cheaper model, reduce usage, negotiate a volume discount — didn't hold up. Cheaper managed models introduced accuracy risk on underwriting decisions carrying significant liability. Reducing usage collapsed the business value case. Volume discounts don't change the fundamental economics at scale.

The real answer was structural, not transactional.

The hybrid routing architecture

The insight was that not all LLM requests are equal. Some require low latency and high accuracy. Others are high-volume, tolerant of slightly higher latency, and don't justify managed API pricing.

We built an intelligent routing layer that classifies each request at inference time:

- Latency-sensitive, low-volume queries → Azure AI Foundry (GPT-4o)
- High-volume batch processing → Self-hosted GPU cluster (Llama 3.x, Qwen 2.5 on AKS with KAITO operator)
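The routing rule above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the `Request` fields, endpoint names, and the classification signal are assumptions for the purposes of the example.

```python
from dataclasses import dataclass

# Hypothetical endpoint identifiers, for illustration only.
MANAGED_ENDPOINT = "azure-ai-foundry/gpt-4o"
SELF_HOSTED_ENDPOINT = "aks-kaito/llama-3"

@dataclass
class Request:
    tokens: int              # estimated size of the request
    latency_sensitive: bool  # does a user wait on this response?
    batch: bool              # part of a bulk processing job?

def route(req: Request) -> str:
    """Classify a request at inference time and pick a backend.

    Interactive, latency-sensitive traffic goes to the managed API;
    anything batch-shaped goes to the self-hosted GPU cluster.
    """
    if req.latency_sensitive and not req.batch:
        return MANAGED_ENDPOINT
    return SELF_HOSTED_ENDPOINT
```

In a real deployment the classifier would sit in front of both backends as a thin proxy, and the routing signal could come from the calling workload (known batch jobs) rather than per-request inspection.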

The self-hosted cluster runs on NC48ads A100 v4 nodes within the organisation's own Azure estate. No data leaves the perimeter. Full FCA compliance maintained.

The numbers

The cost reduction was 93%. Annualised saving: over £1.8M. Throughput achieved: 47,637 requests per hour. The routing architecture is now patent-pending and adopted as the organisation's standard for all LLM deployments.

What this means for your organisation

If you are running AI workloads at scale on managed APIs, you almost certainly have a cost problem you haven't fully quantified yet. The break-even point for self-hosted infrastructure is lower than most engineering teams assume — typically around 50,000 tokens per day.
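The break-even arithmetic is straightforward to run for your own workload. The sketch below is a deliberately simplified model — it treats self-hosted cost as a fixed daily GPU spend and ignores self-hosted per-token marginal cost, engineering time, and committed-use discounts; the input figures you feed it are yours, not values from this case study.

```python
def breakeven_tokens_per_day(fixed_daily_gpu_cost: float,
                             managed_price_per_1k_tokens: float) -> float:
    """Daily token volume at which self-hosting matches the managed API.

    Simplifying assumption: self-hosted cost is a flat daily GPU bill,
    and managed cost scales linearly with tokens processed.
    """
    return fixed_daily_gpu_cost / managed_price_per_1k_tokens * 1000

# Example with placeholder inputs (not quoted prices):
# a £240/day GPU node vs a managed rate of £0.005 per 1k tokens.
volume = breakeven_tokens_per_day(240.0, 0.005)
```

Above that daily volume, the flat GPU bill beats the metered API; below it, managed pricing wins. The exercise is worth doing with your actual contract rates before any architectural decision.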

The architecture isn't exotic. What's missing, usually, is someone who has done it in a regulated production environment and can de-risk the transition.

Dealing with this in your organisation?

Book a 30-minute call. No pitch — a direct conversation about your specific situation.
