LLM Gateways: Building Runtime Resilience Against API Downtime

Learn how LLM gateways protect AI applications from provider outages using automatic failover, retries, and load balancing, with config examples using LiteLLM.

It's 2am and your AI feature just stopped working. Not because your code broke, but because OpenAI (or Anthropic, or whoever you call) is having a bad night. Your users see errors. Your on-call engineer gets paged. And there's nothing you can do but wait.

This happens more often than people admit. LLM providers are still maturing their infrastructure, and outages, rate limit spikes, and slow responses are part of the deal. If your app calls only one provider, that provider's bad day becomes your bad day too.

The fix isn't to pick a "more reliable" provider. It's to stop depending on just one. That's exactly what an LLM gateway does, and this post breaks down how.

What Is an LLM Gateway?

An LLM gateway sits between your app and the AI providers you use (OpenAI, Anthropic, Google, and others).

Instead of calling each provider directly, your app talks to the gateway. The gateway decides which model actually handles the request.

Your App
   │
   ▼
LLM Gateway
   │
   ├──► OpenAI
   ├──► Anthropic
   ├──► Google Vertex
   └──► Azure OpenAI

This single layer can handle:

  • Routing requests to the right provider
  • Retrying failed calls automatically
  • Switching to a backup model if one provider goes down
  • Caching repeated responses
  • Tracking cost and usage across providers

The result: your app code stays simple, and outages stop being your problem to solve manually.

Why API Downtime Is a Real Risk

Provider outages aren't rare edge cases anymore. As more companies build on LLMs, the failure modes show up constantly in production:

  • Rate limits: You hit a request cap during a traffic spike.
  • 5xx errors: The provider's servers fail temporarily.
  • Timeouts: Requests take too long and your app gives up.
  • Regional outages: A whole data center region goes dark.

Teams calling a provider directly often see error rates climb into the double digits during rough patches. One fallback test found that without protection, error rates sat around 12 to 15 percent, and dropped to under 2 percent once automatic fallbacks were added.

That gap is the difference between "AI feature works" and "AI feature is down."
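
A rough way to see why a fallback can close that gap: a request only fails outright when the primary and the backup both fail. The sketch below assumes the two fail independently (a shared-provider outage would break that assumption) and uses 13 percent as an illustrative per-provider error rate picked from the middle of the range above. The function name is made up for this example.

```python
def combined_error_rate(*rates: float) -> float:
    """Chance that every provider in the chain fails, assuming independent failures."""
    result = 1.0
    for rate in rates:
        result *= rate
    return result


primary = 0.13  # illustrative error rate for the primary provider
backup = 0.13   # assumed equal for the backup; real numbers will differ

both_fail = combined_error_rate(primary, backup)
print(f"{both_fail:.2%}")  # 1.69%
```

Under those assumptions the combined error rate lands below 2 percent, which is in line with the drop described above. Correlated failures push the real number higher.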

How Gateways Build Resilience

A gateway doesn't prevent provider outages. It makes sure your app survives them. Here's how.

1. Automatic Failover

When your primary model fails or hits a rate limit, the gateway reroutes the request to a backup model. Your app gets a normal response and never needs to know anything went wrong.

Example using LiteLLM (an open-source gateway), defined in a config file:

yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/your-deployment-name
      api_base: https://your-azure-endpoint.com
      api_key: your-azure-api-key
      rpm: 6

  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-ca
      api_base: https://your-canada-endpoint.com
      api_key: your-azure-api-key
      rpm: 6

router_settings:
  fallbacks:
    - gpt-3.5-turbo: ["gpt-4"]

If gpt-3.5-turbo fails, the request automatically retries against gpt-4. Your app code doesn't change at all.
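
To make the idea concrete, here is a minimal sketch of what a fallback chain does conceptually. It is not LiteLLM's implementation. The provider functions are stand-ins that simulate a failing primary and a healthy backup, and every name in it (`call_with_fallbacks`, `AllProvidersFailed`, and so on) is hypothetical.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when the primary and every fallback have failed."""


def call_with_fallbacks(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would only catch retryable errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Simulates a provider outage.
    raise ConnectionError("503 from provider")


def healthy_backup(prompt: str) -> str:
    # Simulates a working backup model.
    return f"echo: {prompt}"


served_by, answer = call_with_fallbacks(
    "hello",
    [("gpt-3.5-turbo", flaky_primary), ("gpt-4", healthy_backup)],
)
print(served_by, answer)  # gpt-4 echo: hello
```

The caller only sees the final answer. Returning the name of the model that served the request is a design choice that makes fallbacks visible in logs.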

2. Retry Policies

Not every failure needs a full fallback. Some just need a quick retry, especially for things like timeouts or rate limits that often resolve themselves in seconds.

yaml
router_settings:
  retry_policy:
    TimeoutErrorRetries: 3
    RateLimitErrorRetries: 3
    InternalServerErrorRetries: 4

This lets you define a separate retry count for each error type, like authentication errors, timeouts, rate limits, or server errors. You're not retrying blindly. You're matching the response to the type of failure.
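
As an illustration of matching retries to the failure type, the sketch below keeps a separate retry budget per exception class. It is a conceptual model rather than LiteLLM's code: the exception classes are local stand-ins, and the exponential backoff between attempts is an assumption added for the example.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider 429 response."""


class InternalServerError(Exception):
    """Stand-in for a provider 5xx response."""


class AuthenticationError(Exception):
    """Stand-in for a 401; retrying a bad key never helps."""


# Retry budget per error type, mirroring the YAML above.
RETRY_POLICY = {
    TimeoutError: 3,
    RateLimitError: 3,
    InternalServerError: 4,
    AuthenticationError: 0,
}


def call_with_retries(fn, policy=RETRY_POLICY, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying each error type up to its own budget with exponential backoff."""
    used = {}  # retries consumed so far, keyed by exception type
    while True:
        try:
            return fn()
        except Exception as exc:
            kind = type(exc)
            if used.get(kind, 0) >= policy.get(kind, 0):
                raise  # budget exhausted, or an error type we never retry
            sleep(base_delay * 2 ** used.get(kind, 0))
            used[kind] = used.get(kind, 0) + 1


calls = {"n": 0}


def flaky():
    # Fails twice with a rate limit, then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError("429")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, calls["n"])  # ok 3
```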

3. Load Balancing Across Deployments

If you have multiple deployments of the same model (say, Azure OpenAI in two different regions), the gateway can spread traffic across them.

yaml
router_settings:
  routing_strategy: least-busy

If one region slows down or goes offline, traffic shifts to the healthy region automatically. Cooldowns apply to individual deployments, not entire model groups, so the router isolates failures to specific deployments while keeping healthy alternatives available.
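
Here is a small sketch of how least-busy routing and per-deployment cooldowns can fit together. The class names, the 30-second cooldown, and the deployment names are assumptions made for illustration, not LiteLLM internals. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    in_flight: int = 0           # requests currently being served
    cooldown_until: float = 0.0  # timestamp before which this deployment is skipped


class LeastBusyRouter:
    def __init__(self, deployments, cooldown_seconds=30.0):
        self.deployments = list(deployments)
        self.cooldown_seconds = cooldown_seconds

    def pick(self, now: float) -> Deployment:
        """Return the deployment with the fewest in-flight requests that is not cooling down."""
        available = [d for d in self.deployments if d.cooldown_until <= now]
        if not available:
            raise RuntimeError("all deployments are cooling down")
        return min(available, key=lambda d: d.in_flight)

    def mark_failed(self, deployment: Deployment, now: float) -> None:
        """Put one deployment on cooldown; the others are untouched."""
        deployment.cooldown_until = now + self.cooldown_seconds


east = Deployment("azure-east", in_flight=5)
west = Deployment("azure-west", in_flight=2)
router = LeastBusyRouter([east, west])

first = router.pick(now=0.0).name    # west is less busy
router.mark_failed(west, now=0.0)    # west starts failing
second = router.pick(now=1.0).name   # west is cooling down, so east takes over
third = router.pick(now=31.0).name   # cooldown expired, west is eligible again
print(first, second, third)  # azure-west azure-east azure-west
```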

4. Health-Aware Routing

Good gateways don't wait for a request to fail before reacting. They route traffic away from unhealthy deployments before users even hit errors, based on ongoing latency and error tracking.
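
One simple way to approximate this is a rolling window of recent outcomes per deployment. The sketch below tracks error rate only (latency tracking would follow the same pattern). The thresholds and the `HealthTracker` name are assumptions for the example, not settings from any particular gateway.

```python
from collections import deque


class HealthTracker:
    """Keeps the last `window` outcomes per deployment and flags high error rates."""

    def __init__(self, window=20, max_error_rate=0.5, min_samples=5):
        self.window = window
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.outcomes = {}  # deployment name -> deque of booleans (True means success)

    def record(self, name: str, ok: bool) -> None:
        self.outcomes.setdefault(name, deque(maxlen=self.window)).append(ok)

    def error_rate(self, name: str) -> float:
        results = self.outcomes.get(name)
        if not results:
            return 0.0
        return results.count(False) / len(results)

    def is_healthy(self, name: str) -> bool:
        results = self.outcomes.get(name, ())
        if len(results) < self.min_samples:
            return True  # too little data to judge, so keep routing to it
        return self.error_rate(name) <= self.max_error_rate

    def healthy(self, names):
        """Filter a candidate list down to deployments that look healthy."""
        return [n for n in names if self.is_healthy(n)]


tracker = HealthTracker()
for _ in range(6):
    tracker.record("openai", ok=False)     # simulated run of failures
    tracker.record("anthropic", ok=True)   # simulated run of successes

candidates = tracker.healthy(["openai", "anthropic", "vertex"])
print(candidates)  # ['anthropic', 'vertex']
```

A deployment with no history ("vertex" here) stays eligible, which avoids starving new deployments of traffic.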

Setting Up a Basic Gateway

Here's a minimal working example using LiteLLM's proxy server.

Step 1: Create your config file

yaml
# config.yaml
model_list:
  - model_name: primary-model
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY

  - model_name: backup-model
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - primary-model: ["backup-model"]
  num_retries: 2
  timeout: 30

Step 2: Run the gateway

You can run it with Docker, mounting the config file and passing in your API keys as environment variables:

bash
# Quote the volume path so the mount still works if the directory name contains spaces.
docker run \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  -e OPENAI_API_KEY=your-key \
  -e ANTHROPIC_API_KEY=your-key \
  -p 4000:4000 \
  docker.litellm.ai/berriai/litellm:main-latest \
  --config /app/config.yaml

Step 3: Call it like a normal OpenAI client

Your app just points at the local proxy URL instead of OpenAI's URL directly:

python
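import openai

# Point the standard OpenAI client at the gateway instead of OpenAI's API.
client = openai.OpenAI(
    api_key="anything",  # placeholder; only checked if you configure key auth on the proxy
    base_url="http://localhost:4000",  # the port published by the Docker command above
)

# "primary-model" is the alias defined in config.yaml, not a provider model name.
response = client.chat.completions.create(
    model="primary-model",
    messages=[{"role": "user", "content": "Write a short poem"}],
)

print(response.choices[0].message.content)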

If primary-model is down, the gateway silently retries with backup-model. Your code never changes.

When Do You Actually Need One?

You don't need a gateway just because you call an LLM API. A widely cited rule of thumb is that once you call more than one model provider, or spend more than a few hundred dollars a month on API calls, a gateway starts paying for itself through caching, cheaper routing, and avoided downtime; below that, a direct SDK call is usually enough.

Situation                       | Direct SDK Call      | LLM Gateway
--------------------------------|----------------------|----------------------
Single provider, low volume     | Fine                 | Optional
Multiple providers              | Hard to manage       | Built for this
Need automatic failover         | Manual code required | Built in
Need cost tracking across teams | Manual logging       | Centralized dashboard
Production app with real users  | Risky                | Recommended
Side project or prototype       | Fine                 | Probably overkill

You don't have to build this yourself. Several mature options exist:

Gateway               | Best For
----------------------|----------------------------------------------------
LiteLLM               | Open source, self-hosted, large provider support
Bifrost               | High performance (built in Go), low latency at scale
Cloudflare AI Gateway | No infrastructure setup, edge network
Kong AI Gateway       | Teams already using Kong for API management
Vercel AI Gateway     | Teams already deployed on Vercel/Next.js

An LLM gateway unifies multiple model providers behind a single API, enabling policy enforcement, automatic failover, load balancing, semantic caching, usage governance, and centralized observability. Most of the well-known options cover these same basics; the differences come down to hosting model, latency, and ecosystem fit.

Common Mistakes to Avoid

  • Picking a backup from the same provider. If your backup runs on the same provider as your primary, a provider-wide outage takes both down.
  • No timeout settings. Without timeouts, a slow request can hang your app instead of triggering a fallback.
  • Ignoring cost differences between fallback models. A backup model might be more expensive. Know the cost tradeoff before you ship it.
  • Skipping monitoring. If your gateway swallows every failure silently, you may not notice a provider is degraded until costs or latency spike.
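
Two of these mistakes can be caught before you ship with a quick check over the config. The sketch below is a hypothetical lint function, not a LiteLLM feature. It assumes the config has already been loaded into a Python dict with the same shape as the YAML in this post, that model strings use a "provider/model" prefix, and that each model name maps to a single deployment.

```python
def provider_of(model: str):
    """Return the prefix of a "provider/model" string, or None if there is no prefix."""
    return model.split("/", 1)[0] if "/" in model else None


def lint_config(config: dict) -> list:
    """Return warnings for a missing timeout and for same-provider fallbacks."""
    problems = []
    settings = config.get("router_settings", {})
    providers = {
        entry["model_name"]: provider_of(entry["litellm_params"]["model"])
        for entry in config.get("model_list", [])
    }

    if "timeout" not in settings:
        problems.append("no timeout: a slow request can hang instead of falling back")

    for rule in settings.get("fallbacks", []):
        for primary, backups in rule.items():
            primary_provider = providers.get(primary)
            same = [b for b in backups if providers.get(b) == primary_provider]
            # Only warn when the provider is known and every backup shares it.
            if primary_provider and backups and len(same) == len(backups):
                problems.append(
                    f"{primary}: every fallback is on provider '{primary_provider}'"
                )
    return problems


# Hypothetical config: both models sit on Azure and no timeout is set.
config = {
    "model_list": [
        {"model_name": "primary-model", "litellm_params": {"model": "azure/gpt-4-east"}},
        {"model_name": "backup-model", "litellm_params": {"model": "azure/gpt-4-west"}},
    ],
    "router_settings": {"fallbacks": [{"primary-model": ["backup-model"]}]},
}

problems = lint_config(config)
for line in problems:
    print(line)
```

Running a check like this in CI is cheap, and it flags a risky config long before an outage does.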

Wrapping Up

API downtime isn't a hypothetical. It's a recurring cost of doing business with LLM providers. The good news is you don't have to absorb that cost directly.

An LLM gateway gives you automatic failover, smart retries, and load balancing without rewriting your application logic every time a provider has a bad day. Start small: one primary model, one backup, a sane retry policy. Expand from there as your traffic and provider list grow.

Q&A

1. Is an LLM gateway the same as an API gateway?

They're related but not identical. A general API gateway manages any backend service. An LLM gateway is built specifically for AI traffic patterns, like token-based costs, model-specific rate limits, and provider-specific failure types.

2. Do I need a gateway if I only use one provider?

Not necessarily for failover, but you'll still get value from retries, timeouts, and usage tracking. True resilience against outages requires at least one backup provider.

3. What happens to in-flight requests during a failover?

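The gateway resends the failed request to the backup model. The original request isn't recovered; it's retried fresh, so make sure anything the request triggers (tool calls, database writes, emails) is idempotent where possible. Otherwise a retry can run the same side effect twice.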

4. Does using a gateway add latency?

A small amount, since requests pass through an extra hop. Well-built gateways add only single-digit milliseconds in most cases, which is negligible compared to LLM response times.

5. Can I run a gateway myself, or do I need a hosted service?

Both options exist. LiteLLM, for example, can run self-hosted via Docker. Cloudflare and Vercel offer managed versions with no infrastructure setup.

6. How do I decide which model becomes the fallback?

Pick a model from a different provider so a single outage can't take down both. Match capability as closely as possible so output quality doesn't change.

7. What's the difference between a retry and a fallback?

A retry tries the same model again, useful for short-lived issues like timeouts. A fallback switches to a different model entirely, useful when the primary is fully down.

8. Will a gateway protect me from rate limits?

Yes, partially. Gateways can route around rate-limited deployments and distribute load across multiple API keys or regions to avoid hitting limits in the first place.
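
As a rough illustration of spreading load to stay under limits, the sketch below tracks a requests-per-minute budget for each API key and always picks the key with the most headroom. The fixed one-minute window, the `RpmBudget` class, and the key names are all assumptions for the example. Real gateways may count usage differently.

```python
class RpmBudget:
    """Tracks requests per key inside fixed one-minute windows."""

    def __init__(self, limits: dict):
        self.limits = limits  # key name -> allowed requests per minute
        self.window = None    # index of the current minute
        self.used = {name: 0 for name in limits}

    def acquire(self, now: float):
        """Return the key with the most remaining budget, or None if all are exhausted."""
        minute = int(now // 60)
        if minute != self.window:
            # A new minute started, so reset every counter.
            self.window = minute
            self.used = {name: 0 for name in self.limits}
        name = max(self.limits, key=lambda n: self.limits[n] - self.used[n])
        if self.used[name] >= self.limits[name]:
            return None  # every key is at its limit; the caller should wait or fall back
        self.used[name] += 1
        return name


budget = RpmBudget({"key-a": 2, "key-b": 1})
picks = [budget.acquire(now=t) for t in (0.0, 1.0, 2.0, 3.0)]
after_reset = budget.acquire(now=60.0)
print(picks, after_reset)  # ['key-a', 'key-a', 'key-b', None] key-a
```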

9. Does adding a gateway lock me into one tool?

Most gateways, LiteLLM included, expose an OpenAI-compatible API. Your app keeps using the same client code, so switching gateways later usually means changing a base URL and some config, not a full rewrite.

10. Is this overkill for a small side project?

Probably. If you're calling one provider at low volume, a direct SDK call is simpler and works fine. Add a gateway once you have real users depending on uptime.


References

  1. LiteLLM Getting Started Documentation - https://docs.litellm.ai/docs/

  2. LiteLLM Routing & Load Balancing Documentation - https://docs.litellm.ai/docs/routing

  3. Top 5 LLM Gateways for 2026: A Comprehensive Comparison - https://www.getmaxim.ai/articles/top-5-llm-gateways-for-2026-a-comprehensive-comparison/

  4. LLM Gateway Architecture: 2026 Engineering Reference - https://www.digitalapplied.com/blog/llm-gateway-architecture-2026-engineering-reference

  5. LiteLLM Fallback Configuration: Reduce API Errors by 90% - https://markaicode.com/tutorial/litellm-fallback-configuration/

Tags

LLM Gateways, API Downtime, Resilience

Made with ❤️ by Mun Bock Ho

Copyright ©️ 2026