LiteLLM: Your Self-Hosted AI Gateway for Local and Cloud Models


Unify Azure AI Foundry, Ollama, LM Studio, and 100+ AI providers behind a single API with LiteLLM. Learn to deploy it locally with Docker Compose and PostgreSQL.


Author: Chris Malpass

You’re juggling Ollama for local models, Azure AI Foundry for production, LM Studio for experimentation, and maybe OpenAI as a fallback. Each has its own SDK, authentication pattern, and quirks. Your codebase is littered with conditional logic: “If local, use this URL; if Azure, use that SDK; if OpenRouter, do something else entirely.”

This is the multi-provider chaos problem, and it’s exhausting.

Enter LiteLLM—a unified API gateway that sits between your application and every AI provider you use. It speaks one language (the OpenAI API format) while routing requests to 100+ providers behind the scenes. Whether you’re hitting GPT-4 in Azure, Llama 3 in Ollama, or Mistral on your LM Studio instance, your code stays the same.

In this post, we’ll deploy LiteLLM locally using Docker Compose with a PostgreSQL database for observability and cost tracking. We’ll explore the problems it solves, the tradeoffs you’ll face, and how it compares to cloud alternatives like OpenRouter.

The Problem: Provider Fragmentation

Let’s say you’re building an AI-powered feature. Your requirements are:

  1. Local development: Use Ollama models (free, fast, private).
  2. Production: Use Azure OpenAI for compliance and enterprise SLAs.
  3. Experimentation: Try new models from Anthropic, Cohere, or Hugging Face.

Without a gateway, your code looks like this:

if (environment == "local")
{
    // Ollama-specific setup
    var ollamaClient = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
}
else if (environment == "azure")
{
    // Azure OpenAI SDK
    var azureClient = new OpenAIClient(new Uri(azureEndpoint), new AzureKeyCredential(apiKey));
}
else if (environment == "anthropic")
{
    // Anthropic SDK with different request format
    var anthropicClient = new AnthropicClient(apiKey);
}

This is brittle. Every time you add a provider, you add conditional branches, new SDKs, and new failure modes.

The Solution: LiteLLM as a Unified Gateway

LiteLLM acts as a reverse proxy and translation layer. You send requests in the OpenAI format, and LiteLLM handles the provider-specific translation.

Architecture:

Your .NET App → LiteLLM (http://localhost:4000) → Ollama, Azure, OpenAI, Anthropic, etc.

Your application code becomes:

// One client, one format, any provider
var client = new OpenAIClient(new Uri("http://localhost:4000/v1"), new AzureKeyCredential("your-litellm-key"));

var response = await client.GetChatCompletionsAsync(
    deploymentOrModelName: "ollama/llama3", // or "azure/gpt-4", "openai/gpt-4o", etc.
    new ChatCompletionsOptions
    {
        Messages = { new ChatRequestUserMessage("Explain quantum computing.") }
    });

Notice the model name: ollama/llama3. LiteLLM uses provider prefixes to route the request. Change ollama/llama3 to azure/gpt-4, and your code doesn’t change—LiteLLM handles everything.
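The prefix routing is easiest to see in the raw request shape: the JSON body LiteLLM accepts is identical across providers, and only the model string changes. A minimal Python sketch (model names taken from this post's config):

```python
# Build the OpenAI-format chat body LiteLLM accepts; only "model" varies.
def chat_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

local = chat_payload("ollama/llama3", "Explain quantum computing.")
cloud = chat_payload("azure/gpt-4", "Explain quantum computing.")

# Everything except the routing prefix is provider-agnostic.
assert local["messages"] == cloud["messages"]
```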

Deploying LiteLLM Locally with Docker Compose

Let’s set up a production-grade local deployment with:

  • LiteLLM: The gateway.
  • PostgreSQL: To store request logs, costs, and usage analytics.
  • Persistent storage: So your data survives restarts.

Step 1: Create the Docker Compose File

Create a file named docker-compose.yml:

version: '3.8'

services:
  # PostgreSQL database for LiteLLM's observability and analytics
  postgres:
    image: postgres:16-alpine
    container_name: litellm-db
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm_user
      POSTGRES_PASSWORD: your_secure_password_here
    volumes:
      - litellm-db-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm_user -d litellm"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - litellm-network

  # LiteLLM proxy server
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    environment:
      # Database connection
      DATABASE_URL: "postgresql://litellm_user:your_secure_password_here@postgres:5432/litellm"
      
      # Optional: Set a master key for API authentication
      LITELLM_MASTER_KEY: "sk-1234-your-master-key"
      
      # Optional: Enable detailed logging
      LITELLM_LOG: "INFO"
      
      # Store the config file path
      LITELLM_CONFIG_PATH: /app/config.yaml
    volumes:
      # Mount your LiteLLM configuration file
      - ./litellm-config.yaml:/app/config.yaml
      # Optional: Persist logs
      - ./logs:/app/logs
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - litellm-network
    restart: unless-stopped

volumes:
  litellm-db-data:

networks:
  litellm-network:
    driver: bridge

Step 2: Create the LiteLLM Configuration File

Create litellm-config.yaml in the same directory:

model_list:
  # Azure OpenAI models
  - model_name: azure/gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: https://your-resource.openai.azure.com
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-15-preview"
  
  - model_name: azure/gpt-35-turbo
    litellm_params:
      model: azure/gpt-35-turbo
      api_base: https://your-resource.openai.azure.com
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-15-preview"
  
  # Ollama local models
  - model_name: ollama/llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://host.docker.internal:11434  # Access host's Ollama from Docker
  
  - model_name: ollama/mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://host.docker.internal:11434
  
  # LM Studio models (if running locally)
  - model_name: lmstudio/local-model
    litellm_params:
      model: openai/local-model  # LM Studio uses OpenAI-compatible API
      api_base: http://host.docker.internal:1234/v1
  
  # OpenAI direct (optional)
  - model_name: openai/gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

# Optional: cost tracking and budgets
litellm_settings:
  # Optional: also send traces to Langfuse; request/response logs are
  # written to PostgreSQL automatically via DATABASE_URL
  success_callback: ["langfuse"]

  # Budget alerts (in USD)
  max_budget: 100
  budget_duration: 30d  # 30-day rolling window

# Optional: Set up user/team management
general_settings:
  master_key: "sk-1234-your-master-key"
  database_url: "postgresql://litellm_user:your_secure_password_here@postgres:5432/litellm"
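As the model_list grows, a quick script can sanity-check it before you deploy. Here is a sketch that mirrors the entries above as plain Python data (in practice you would load the YAML itself) and groups them by the provider prefix LiteLLM routes on:

```python
# Mirror of the model_list names above, grouped by routing prefix.
model_names = [
    "azure/gpt-4", "azure/gpt-35-turbo",
    "ollama/llama3", "ollama/mistral",
    "lmstudio/local-model", "openai/gpt-4o",
]

def group_by_provider(names):
    groups = {}
    for name in names:
        provider, _, _ = name.partition("/")
        groups.setdefault(provider, []).append(name)
    return groups

for provider, models in group_by_provider(model_names).items():
    print(f"{provider}: {len(models)} model(s)")
```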

Step 3: Set Environment Variables

Create a .env file for sensitive credentials (add this to .gitignore):

AZURE_API_KEY=your_azure_openai_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key

Update your docker-compose.yml to load the .env file:

litellm:
  # ... existing config ...
  env_file:
    - .env

Step 4: Start the Stack

docker-compose up -d

LiteLLM will start on http://localhost:4000. The database will be initialized automatically.

Check the logs:

docker-compose logs -f litellm

You should see:

INFO: LiteLLM Proxy running on http://0.0.0.0:4000

Step 5: Test It

Use curl to send a request:

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234-your-master-key" \
  -d '{
    "model": "ollama/llama3",
    "messages": [{"role": "user", "content": "What is LiteLLM?"}]
  }'

If you have Ollama running locally with the llama3 model, you’ll get a response routed through LiteLLM.

Using LiteLLM from .NET

Here’s how to integrate it into your .NET application:

using Azure.AI.OpenAI;
using Azure;

// Configure the OpenAI client to point at LiteLLM
var client = new OpenAIClient(
    new Uri("http://localhost:4000/v1"),
    new AzureKeyCredential("sk-1234-your-master-key"));

// Use any model configured in LiteLLM
var response = await client.GetChatCompletionsAsync(
    deploymentOrModelName: "ollama/llama3", // Switch to "azure/gpt-4" for production
    new ChatCompletionsOptions
    {
        Messages =
        {
            new ChatRequestSystemMessage("You are a helpful AI assistant."),
            new ChatRequestUserMessage("Explain Docker Compose in one sentence.")
        },
        Temperature = 0.7f,
        MaxTokens = 150
    });

Console.WriteLine(response.Value.Choices[0].Message.Content);

To switch providers, just change the model name. No SDK changes, no conditional logic.
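LiteLLM can also fall back between models server-side, but the same idea is easy to sketch client-side. In this hypothetical Python version, call_fn stands in for the actual HTTP call to the gateway:

```python
# Try each model in order; return the first successful response.
def complete_with_fallback(call_fn, models, prompt):
    last_error = None
    for model in models:
        try:
            return model, call_fn(model, prompt)
        except Exception as e:
            last_error = e  # try the next model in the list
    raise RuntimeError(f"all models failed: {last_error}")

# Usage with a fake transport: the local model "fails", Azure answers.
def fake_call(model, prompt):
    if model == "ollama/llama3":
        raise ConnectionError("Ollama not running")
    return f"{model} says hi"

model, text = complete_with_fallback(fake_call, ["ollama/llama3", "azure/gpt-4"], "ping")
print(model)  # azure/gpt-4
```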

Observability and Cost Tracking

One of LiteLLM’s best features is built-in observability. Every request is logged to the PostgreSQL database.

View your costs and usage:

LiteLLM includes a web UI. Access it at:

http://localhost:4000/ui

You’ll see:

  • Total requests per model
  • Token usage (input/output)
  • Cost estimates (based on provider pricing)
  • Latency metrics

This is invaluable for:

  • Debugging which models are being called
  • Identifying expensive queries
  • Optimizing prompt lengths

You can also query the database directly:

-- Connect to the database
docker exec -it litellm-db psql -U litellm_user -d litellm

-- Query the spend logs (table name from LiteLLM's schema; quoted because
-- it is mixed-case)
SELECT model, COUNT(*), SUM(spend)
FROM "LiteLLM_SpendLogs"
GROUP BY model;

The Tradeoffs

LiteLLM is powerful, but it’s not a silver bullet. Here are the considerations:

Pros:

  1. Unified Interface: One API for all providers. Massively simplifies your code.
  2. Cost Visibility: Track spending across providers in a single dashboard.
  3. Self-Hosted: Your data stays local. No third-party tracking.
  4. Fallback & Load Balancing: Configure fallback models if your primary fails (e.g., Ollama → Azure).
  5. Observability: Request logs, latency metrics, and error rates baked in.

Cons:

  1. Single Point of Failure: If LiteLLM goes down, all your AI calls fail. (Mitigate with health checks and restarts.)
  2. Latency Overhead: Adds ~10-50ms per request (negligible for most use cases).
  3. Configuration Complexity: The model_list YAML can get large if you support many models.
  4. No Built-in Caching: Unlike some gateways, LiteLLM doesn’t cache responses by default (you’ll need Redis or a custom solution).

LiteLLM vs. OpenRouter vs. Other Gateways

Let’s compare LiteLLM to its alternatives:

OpenRouter

  • What It Is: A cloud-hosted AI gateway (openrouter.ai).
  • Pros:
    • Zero setup. Just sign up and get an API key.
    • Built-in credit system and cost tracking.
    • Access to 100+ models from OpenAI, Anthropic, Google, open-source, and more.
  • Cons:
    • Cloud-only. Your requests go through OpenRouter’s servers (privacy concern for sensitive data).
    • Cost markup: OpenRouter adds a small fee on top of provider costs.
    • No local model support: Can’t route to your Ollama or LM Studio instances.
  • When to Use: Quick prototypes, non-sensitive data, or when you want zero DevOps overhead.

LiteLLM (Self-Hosted)

  • Pros:
    • Full control. Your data never leaves your network.
    • Supports local models (Ollama, LM Studio).
    • Free (except infrastructure costs).
  • Cons:
    • Requires Docker and basic DevOps knowledge.
    • You manage uptime, updates, and backups.
  • When to Use: Privacy-sensitive applications, hybrid local/cloud workflows, or enterprise environments.

Portkey.ai

  • What It Is: A cloud AI gateway with advanced features (caching, prompt management, A/B testing).
  • Pros: Enterprise-grade features like semantic caching, prompt versioning, and analytics.
  • Cons: Cloud-only, paid plans only for advanced features.

AI Gateway (Cloudflare)

  • What It Is: Cloudflare’s free AI gateway for rate limiting and caching.
  • Pros: Free, built into Cloudflare’s edge network (low latency).
  • Cons: Limited to Cloudflare-hosted models and OpenAI/Anthropic proxying.

Summary

Feature             LiteLLM (Self-Hosted)   OpenRouter   Portkey.ai   Cloudflare AI Gateway
Self-Hosted         Yes                     No           No           No
Local Models        Yes                     No           No           No
Cost Tracking       Yes                     Yes          Yes          Yes
Built-in Caching    No                      No           Yes          Yes
Prompt Versioning   No                      No           Yes          No
Zero DevOps         No                      Yes          Yes          Yes

The verdict: Use LiteLLM if you need to consolidate local and cloud models with full data control. Use OpenRouter if you want a cloud solution with zero setup.

How LiteLLM Fits into a Self-Hosted AI Developer Workflow

Here’s a typical workflow for a developer using LiteLLM:

  1. Local Development:

    • Use Ollama models (ollama/llama3) for rapid iteration. Free, fast, private.
    • LiteLLM routes all requests to http://localhost:11434.
  2. Staging:

    • Switch to a mid-tier cloud model (azure/gpt-35-turbo) for more accurate testing.
    • LiteLLM routes to your Azure OpenAI instance.
  3. Production:

    • Deploy with a premium model (azure/gpt-4, openai/gpt-4o).
    • LiteLLM tracks costs and usage in the database.
  4. Experimentation:

    • Easily test new providers (Anthropic Claude, Google Gemini) by adding them to litellm-config.yaml.
    • No code changes required.

The beauty: Your application code never changes. You swap models by updating a YAML file.

Practical Use Cases for LiteLLM

Here are some real-world scenarios where LiteLLM shines:

1. Multi-Tenant SaaS with Per-Customer Models

Your SaaS app lets customers choose their AI provider (for compliance or cost reasons). Customer A wants Azure, Customer B wants AWS Bedrock, Customer C wants local Ollama.

Solution: Use LiteLLM with virtual keys. Each customer gets a key tied to a specific model in the config.
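The per-tenant lookup is straightforward on the application side. A sketch with hypothetical tenants, keys, and model names (real keys would come from LiteLLM's /key/generate endpoint):

```python
# Hypothetical tenant registry: each customer gets a virtual key and a model.
TENANT_CONFIG = {
    "customer-a": {"api_key": "sk-tenant-a", "model": "azure/gpt-4"},
    "customer-b": {"api_key": "sk-tenant-b", "model": "bedrock/claude-3"},
    "customer-c": {"api_key": "sk-tenant-c", "model": "ollama/llama3"},
}

def request_headers_and_model(tenant: str):
    cfg = TENANT_CONFIG[tenant]
    headers = {"Authorization": f"Bearer {cfg['api_key']}"}
    return headers, cfg["model"]

headers, model = request_headers_and_model("customer-c")
print(model)  # ollama/llama3
```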

2. Cost Optimization with Fallbacks

You want to use a cheap local model (ollama/phi-3) for simple queries and fall back to azure/gpt-4 for complex ones.

Solution: Configure LiteLLM’s fallback logic:

model_list:
  - model_name: smart-model
    litellm_params:
      model: ollama/phi-3

litellm_settings:
  # Retry against azure/gpt-4 (defined elsewhere in model_list) on error
  fallbacks: [{"smart-model": ["azure/gpt-4"]}]

3. Development/Production Parity

Devs use Ollama locally; production uses Azure. With LiteLLM, both environments use the same code—just different config files.

4. Prompt Experimentation Across Providers

You’re A/B testing whether GPT-4 or Claude 3.5 Sonnet performs better for summarization. Route 50% of traffic to each via LiteLLM’s load balancing.
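A simple way to get a stable 50/50 split is to hash the user ID: each user always lands on the same arm, which keeps conversations consistent across requests. A Python sketch (model names illustrative):

```python
import hashlib

# Deterministic A/B assignment: same user, same arm, every request.
def assign_model(user_id: str,
                 arms=("openai/gpt-4o", "anthropic/claude-3-5-sonnet")):
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(arms)
    return arms[bucket]

print(assign_model("user-42"))
```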

Advanced Tips

Enable Caching for Repeated Queries

LiteLLM doesn’t cache by default, but it supports Redis-backed response caching. Add a Redis container and point LiteLLM at it:

# In docker-compose.yml
redis:
  image: redis:7-alpine
  ports:
    - "6379:6379"

# In litellm-config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
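If you prefer to keep caching in the application layer instead, the check-the-cache-first (cache-aside) pattern is easy to sketch. A dict stands in for Redis here; in practice you would swap in redis-py's get/setex:

```python
import hashlib, json

cache = {}  # stand-in for Redis

def cached_completion(call_fn, model, prompt):
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key in cache:
        return cache[key]          # cache hit: no gateway call
    result = call_fn(model, prompt)
    cache[key] = result            # in Redis: SETEX with a TTL
    return result

calls = []
def fake_call(model, prompt):
    calls.append(model)  # record how often the "gateway" is hit
    return "answer"

cached_completion(fake_call, "ollama/llama3", "What is LiteLLM?")
cached_completion(fake_call, "ollama/llama3", "What is LiteLLM?")
print(len(calls))  # 1 — the second request was served from cache
```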

Monitor with Prometheus

LiteLLM exposes Prometheus metrics at http://localhost:4000/metrics. Scrape these for uptime and latency monitoring.

Use Virtual Keys for Team Management

Create API keys for different teams or projects, each with its own budget:

curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-1234-your-master-key" \
  -d '{"team_id": "engineering", "max_budget": 50}'

Getting Started Checklist

  • Install Docker and Docker Compose
  • Create docker-compose.yml and litellm-config.yaml
  • Add your provider API keys to a .env file
  • Run docker-compose up -d
  • Test with curl or your .NET app
  • Access the UI at http://localhost:4000/ui to monitor usage

Conclusion

LiteLLM is a game-changer for developers juggling multiple AI providers. It unifies Ollama, Azure AI Foundry, OpenAI, Anthropic, and dozens of others behind a single, OpenAI-compatible API. With Docker Compose and PostgreSQL, you get observability, cost tracking, and full data control—all running on your own hardware.

If you’re tired of maintaining provider-specific SDKs and conditionals, give LiteLLM a try. Your codebase (and your sanity) will thank you.

Further Reading