You’re juggling Ollama for local models, Azure AI Foundry for production, LM Studio for experimentation, and maybe OpenAI as a fallback. Each has its own SDK, authentication pattern, and quirks. Your codebase is littered with conditional logic: “If local, use this URL; if Azure, use that SDK; if OpenRouter, do something else entirely.”
This is the multi-provider chaos problem, and it’s exhausting.
Enter LiteLLM—a unified API gateway that sits between your application and every AI provider you use. It speaks one language (the OpenAI API format) while routing requests to 100+ providers behind the scenes. Whether you’re hitting GPT-4 in Azure, Llama 3 in Ollama, or Mistral on your LM Studio instance, your code stays the same.
In this post, we’ll deploy LiteLLM locally using Docker Compose with a PostgreSQL database for observability and cost tracking. We’ll explore the problems it solves, the tradeoffs you’ll face, and how it compares to cloud alternatives like OpenRouter.
The Problem: Provider Fragmentation#
Let’s say you’re building an AI-powered feature. Your requirements are:
- Local development: Use Ollama models (free, fast, private).
- Production: Use Azure OpenAI for compliance and enterprise SLAs.
- Experimentation: Try new models from Anthropic, Cohere, or Hugging Face.
Without a gateway, you end up with provider-specific branches scattered through your code: one path builds an Ollama request, another wires up the Azure SDK, a third calls OpenAI directly, each with its own auth and error handling.
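A hypothetical sketch of that branching in C# (the endpoints and provider names here are illustrative, not from any real codebase):

```csharp
// Illustrative only: each provider needs its own endpoint, auth scheme, and payload shape.
HttpRequestMessage BuildRequest(string provider) => provider switch
{
    "ollama" => new HttpRequestMessage(HttpMethod.Post,
        "http://localhost:11434/api/chat"),                 // local, no auth
    "azure" => new HttpRequestMessage(HttpMethod.Post,
        "https://my-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01"), // api-key header
    "openai" => new HttpRequestMessage(HttpMethod.Post,
        "https://api.openai.com/v1/chat/completions"),      // bearer token
    _ => throw new ArgumentException($"Unknown provider: {provider}")
};
```

And that's before the payload shapes, retry logic, and error types start diverging too.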
This is brittle. Every time you add a provider, you add conditional branches, new SDKs, and new failure modes.
The Solution: LiteLLM as a Unified Gateway#
LiteLLM acts as a reverse proxy and translation layer. You send requests in the OpenAI format, and LiteLLM handles the provider-specific translation.
The architecture is simple: your application talks only to LiteLLM, and LiteLLM talks to every provider.
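Conceptually, the request flow looks like this:

```
Your App ──(OpenAI format)──► LiteLLM Gateway ──► Ollama (local)
                                    │
                                    ├──► Azure OpenAI
                                    ├──► OpenAI / Anthropic / Cohere …
                                    └──► LM Studio
```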
Your application code collapses to a single OpenAI-style call against the LiteLLM endpoint, with the target provider encoded in the model name.
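For example, using the official OpenAI .NET client pointed at the gateway (sk-1234 is a placeholder master key, set in the LiteLLM config):

```csharp
using System.ClientModel;
using OpenAI.Chat;

// The standard OpenAI client, pointed at LiteLLM instead of api.openai.com.
var client = new ChatClient(
    model: "ollama/llama3",                      // the provider prefix routes the request
    credential: new ApiKeyCredential("sk-1234"), // LiteLLM master key (placeholder)
    options: new OpenAI.OpenAIClientOptions { Endpoint = new Uri("http://localhost:4000") });

ChatCompletion completion = client.CompleteChat("Why is the sky blue?");
Console.WriteLine(completion.Content[0].Text);
```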
Notice the model name: ollama/llama3. LiteLLM uses provider prefixes to route the request. Change ollama/llama3 to azure/gpt-4, and your code doesn’t change—LiteLLM handles everything.
Deploying LiteLLM Locally with Docker Compose#
Let’s set up a production-grade local deployment with:
- LiteLLM: The gateway.
- PostgreSQL: To store request logs, costs, and usage analytics.
- Persistent storage: So your data survives restarts.
Step 1: Create the Docker Compose File#
Create a file named docker-compose.yml that defines two services: the LiteLLM proxy and a PostgreSQL database backed by a persistent volume.
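A minimal sketch of such a file (image tags, credentials, and volume names are illustrative; pin versions and use real secrets in practice):

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml"]
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      DATABASE_URL: "postgresql://llmproxy:dbpassword@db:5432/litellm"
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: dbpassword
      POSTGRES_DB: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data   # survives container restarts

volumes:
  postgres_data:
```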
Step 2: Create the LiteLLM Configuration File#
Create litellm-config.yaml in the same directory. This file declares every model LiteLLM can route to, along with its provider-specific parameters.
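A sketch covering the local and Azure models discussed above (the os.environ/ syntax tells LiteLLM to read a value from an environment variable at startup):

```yaml
model_list:
  - model_name: ollama/llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://host.docker.internal:11434   # Ollama running on the Docker host
  - model_name: azure/gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY

general_settings:
  master_key: sk-1234   # placeholder; use a strong secret
```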
Step 3: Set Environment Variables#
Create a .env file for sensitive credentials such as your Azure and OpenAI API keys, and add it to .gitignore so secrets never land in version control.
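Something along these lines (every value here is a placeholder):

```
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_KEY=your-azure-key
OPENAI_API_KEY=sk-your-openai-key
LITELLM_MASTER_KEY=sk-1234
```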
Then update your docker-compose.yml so the litellm service loads the .env file.
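Adding an env_file entry to the litellm service is enough:

```yaml
services:
  litellm:
    env_file:
      - .env
```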
Step 4: Start the Stack#
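With both files in place, bring everything up in the background (assuming Docker Compose v2):

```shell
docker compose up -d
```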
LiteLLM will start on http://localhost:4000. The database will be initialized automatically.
Check the logs to confirm the proxy and the database came up cleanly.
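Follow the gateway's logs while it boots:

```shell
docker compose logs -f litellm
```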
Watch for startup output confirming that the proxy is listening on port 4000 and has connected to the database.
Step 5: Test It#
Send a chat completion request with curl to confirm that routing works end to end.
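The request uses the standard OpenAI chat format, authenticated with the master key (sk-1234 is the placeholder from the config above):

```shell
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "ollama/llama3",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```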
If you have Ollama running locally with the llama3 model, you’ll get a response routed through LiteLLM.
Using LiteLLM from .NET#
Here’s how to integrate it into your .NET application. Because LiteLLM exposes the OpenAI wire format, any OpenAI-compatible client, or plain HTTP, works unchanged.
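A minimal sketch using plain HttpClient, so no provider SDK is involved at all (the model name and key are placeholders):

```csharp
using System.Net.Http.Headers;
using System.Net.Http.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:4000") };
http.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", "sk-1234"); // LiteLLM master key (placeholder)

// The payload is identical for every provider; only the model name changes.
var response = await http.PostAsJsonAsync("/v1/chat/completions", new
{
    model = "ollama/llama3",
    messages = new[] { new { role = "user", content = "Summarize LiteLLM in one sentence." } }
});

Console.WriteLine(await response.Content.ReadAsStringAsync());
```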
To switch providers, just change the model name. No SDK changes, no conditional logic.
Observability and Cost Tracking#
One of LiteLLM’s best features is built-in observability. Every request is logged to the PostgreSQL database.
To view your costs and usage, open LiteLLM’s built-in web UI at http://localhost:4000/ui.
You’ll see:
- Total requests per model
- Token usage (input/output)
- Cost estimates (based on provider pricing)
- Latency metrics
This is invaluable for:
- Debugging which models are being called
- Identifying expensive queries
- Optimizing prompt lengths
You can also query the PostgreSQL database directly.
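For example, spend per model from LiteLLM's request log table (the table and column names follow LiteLLM's default schema; verify them against your deployed version):

```sql
SELECT model,
       COUNT(*)          AS requests,
       SUM(total_tokens) AS tokens,
       SUM(spend)        AS total_cost
FROM "LiteLLM_SpendLogs"
GROUP BY model
ORDER BY total_cost DESC;
```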
The Tradeoffs#
LiteLLM is powerful, but it’s not a silver bullet. Here are the considerations:
Pros:#
- Unified Interface: One API for all providers. Massively simplifies your code.
- Cost Visibility: Track spending across providers in a single dashboard.
- Self-Hosted: Your data stays local. No third-party tracking.
- Fallback & Load Balancing: Configure fallback models if your primary fails (e.g., Ollama → Azure).
- Observability: Request logs, latency metrics, and error rates baked in.
Cons:#
- Single Point of Failure: If LiteLLM goes down, all your AI calls fail. (Mitigate with health checks and restarts.)
- Latency Overhead: Adds ~10-50ms per request (negligible for most use cases).
- Configuration Complexity: The model_list YAML can get large if you support many models.
- No Built-in Caching: Unlike some gateways, LiteLLM doesn’t cache responses by default (you’ll need Redis or a custom solution).
LiteLLM vs. OpenRouter vs. Other Gateways#
Let’s compare LiteLLM to its alternatives:
OpenRouter#
- What It Is: A cloud-hosted AI gateway (openrouter.ai).
- Pros:
- Zero setup. Just sign up and get an API key.
- Built-in credit system and cost tracking.
- Access to 100+ models from OpenAI, Anthropic, Google, open-source, and more.
- Cons:
- Cloud-only. Your requests go through OpenRouter’s servers (privacy concern for sensitive data).
- Cost markup: OpenRouter adds a small fee on top of provider costs.
- No local model support: Can’t route to your Ollama or LM Studio instances.
- When to Use: Quick prototypes, non-sensitive data, or when you want zero DevOps overhead.
LiteLLM (Self-Hosted)#
- Pros:
- Full control. Your data never leaves your network.
- Supports local models (Ollama, LM Studio).
- Free (except infrastructure costs).
- Cons:
- Requires Docker and basic DevOps knowledge.
- You manage uptime, updates, and backups.
- When to Use: Privacy-sensitive applications, hybrid local/cloud workflows, or enterprise environments.
Portkey.ai#
- What It Is: A cloud AI gateway with advanced features (caching, prompt management, A/B testing).
- Pros: Enterprise-grade features like semantic caching, prompt versioning, and analytics.
- Cons: Cloud-only; the advanced features require paid plans.
AI Gateway (Cloudflare)#
- What It Is: Cloudflare’s free AI gateway for rate limiting and caching.
- Pros: Free, built into Cloudflare’s edge network (low latency).
- Cons: Limited to Cloudflare-hosted models and OpenAI/Anthropic proxying.
Summary#
| Feature | LiteLLM (Self-Hosted) | OpenRouter | Portkey.ai | Cloudflare AI Gateway |
|---|---|---|---|---|
| Self-Hosted | ✅ | ❌ | ❌ | ❌ |
| Local Models | ✅ | ❌ | ❌ | ❌ |
| Cost Tracking | ✅ | ✅ | ✅ | ✅ |
| Built-in Caching | ❌ | ❌ | ✅ | ✅ |
| Prompt Versioning | ❌ | ❌ | ✅ | ❌ |
| Zero DevOps | ❌ | ✅ | ✅ | ✅ |
The verdict: Use LiteLLM if you need to consolidate local and cloud models with full data control. Use OpenRouter if you want a cloud solution with zero setup.
How LiteLLM Fits into a Self-Hosted AI Developer Workflow#
Here’s a typical workflow for a developer using LiteLLM:
1. Local Development: Use Ollama models (ollama/llama3) for rapid iteration. Free, fast, private. LiteLLM routes all requests to http://localhost:11434.
2. Staging: Switch to a mid-tier cloud model (azure/gpt-35-turbo) for more accurate testing. LiteLLM routes to your Azure OpenAI instance.
3. Production: Deploy with a premium model (azure/gpt-4 or openai/gpt-4o). LiteLLM tracks costs and usage in the database.
4. Experimentation: Test new providers (Anthropic Claude, Google Gemini) by adding them to litellm-config.yaml. No code changes required.
The beauty: Your application code never changes. You swap models by updating a YAML file.
Practical Use Cases for LiteLLM#
Here are some real-world scenarios where LiteLLM shines:
1. Multi-Tenant SaaS with Per-Customer Models#
Your SaaS app lets customers choose their AI provider (for compliance or cost reasons). Customer A wants Azure, Customer B wants AWS Bedrock, Customer C wants local Ollama.
Solution: Use LiteLLM with virtual keys. Each customer gets a key tied to a specific model in the config.
2. Cost Optimization with Fallbacks#
You want to use a cheap local model (ollama/phi-3) for simple queries and fall back to azure/gpt-4 for complex ones.
Solution: Configure LiteLLM’s fallback logic so failed calls to the cheap model are retried on the stronger one.
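One plausible shape for that config, using LiteLLM's fallbacks setting (check the current docs for the exact schema):

```yaml
litellm_settings:
  fallbacks:
    - ollama/phi-3:
        - azure/gpt-4   # retried here if the local model errors out
```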
3. Development/Production Parity#
Devs use Ollama locally; production uses Azure. With LiteLLM, both environments use the same code—just different config files.
4. Prompt Experimentation Across Providers#
You’re A/B testing whether GPT-4 or Claude 3.5 Sonnet performs better for summarization. Route 50% of traffic to each via LiteLLM’s load balancing.
Advanced Tips#
Enable Caching for Repeated Queries#
LiteLLM doesn’t cache responses out of the box, but you can enable Redis-backed caching in the config.
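A sketch of the cache settings, assuming a Redis instance reachable from the gateway (for example, a redis service added to the same compose file):

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis      # service name or hostname of your Redis instance
    port: 6379
```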
Monitor with Prometheus#
LiteLLM exposes Prometheus metrics at http://localhost:4000/metrics. Scrape these for uptime and latency monitoring.
Use Virtual Keys for Team Management#
Create API keys for different teams or projects, each with its own budget.
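Keys are minted through the proxy's /key/generate endpoint; the budget and team name below are illustrative:

```shell
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"max_budget": 50.0, "team_id": "search-team"}'
```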
Getting Started Checklist#
- Install Docker and Docker Compose
- Create docker-compose.yml and litellm-config.yaml
- Add your provider API keys to a .env file
- Run docker compose up -d
- Test with curl or your .NET app
- Access the UI at http://localhost:4000/ui to monitor usage
Conclusion#
LiteLLM is a game-changer for developers juggling multiple AI providers. It unifies Ollama, Azure AI Foundry, OpenAI, Anthropic, and dozens of others behind a single, OpenAI-compatible API. With Docker Compose and PostgreSQL, you get observability, cost tracking, and full data control—all running on your own hardware.
If you’re tired of maintaining provider-specific SDKs and conditionals, give LiteLLM a try. Your codebase (and your sanity) will thank you.
Further Reading#
- LiteLLM Official Documentation - Comprehensive guides and API reference
- LiteLLM GitHub Repository - Source code and issue tracker
- LiteLLM Proxy Server Docs - Detailed proxy setup guide
- Supported Providers - Complete list of 100+ supported AI providers
- OpenRouter Documentation - Alternative cloud gateway
- Ollama Documentation - Running local models
- Azure AI Foundry - Microsoft’s AI platform
- LiteLLM Docker Images - Official Docker images
- LiteLLM Cost Tracking Tutorial - Setting up budgets and alerts
- Langfuse Integration - Advanced observability with Langfuse
