Hybrid AI: Combining Local and Cloud Models in .NET


Explore a hybrid AI architecture in .NET that uses local ONNX models for speed and privacy, while leveraging powerful cloud-based LLMs for complex reasoning.


912 words · 5 mins

Chris Malpass

Author

When building AI-powered applications, developers often face a difficult choice: run models locally for speed and privacy, or use powerful cloud-based models for state-of-the-art performance?

The answer is often both.

A hybrid AI architecture combines the best of both worlds. It uses small, efficient local models for routine tasks and intelligently escalates to larger, more capable cloud models when necessary. This pattern is perfect for creating responsive, cost-effective, and powerful AI experiences in .NET applications.

The Hybrid AI Pattern

Imagine a customer support chatbot in a desktop application. The goal is to answer user queries quickly and cheaply, while still being able to handle complex questions.

  1. Step 1: Local First (The “Fast Path”)

    • The user asks a question.
    • The application first uses a local, lightweight model (e.g., an ONNX-based sentence-transformer) to perform a semantic search over a local FAQ database.
    • If a high-confidence match is found, the answer is provided instantly. This is fast, free (no API calls), and completely private.
  2. Step 2: Cloud Escalation (The “Smart Path”)

    • If the local search returns a low-confidence result, the application escalates the query to a powerful cloud-based LLM like GPT-4.
    • The LLM receives the user’s question along with the context retrieved from the local search.
    • The LLM generates a more nuanced and comprehensive answer.

This approach ensures that simple, common questions are handled efficiently, while complex, novel queries get the attention of a more powerful reasoning engine.
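The fast path above can be sketched with a toy in-memory FAQ. In a real application the embeddings would come from an ONNX sentence-transformer; here they are hand-made 3-dimensional vectors, and the `FindBestMatch` helper is a hypothetical stand-in for a proper vector search:

```csharp
using System;
using System.Linq;

// Toy FAQ: each entry pairs an answer with a (normally model-generated) embedding.
var faq = new (string Answer, double[] Embedding)[]
{
    ("You can reset your password in Account settings.", new double[] { 0.9, 0.1, 0.0 }),
    ("Invoices are emailed on the 1st of each month.",   new double[] { 0.1, 0.9, 0.2 }),
};

// Pretend this is the embedding of the user's question.
double[] query = { 0.85, 0.15, 0.05 };

var (answer, confidence) = FindBestMatch(query, faq);
Console.WriteLine($"{answer} (confidence {confidence:F2})");

// Hypothetical helper: brute-force nearest neighbour by cosine similarity.
(string Answer, double Confidence) FindBestMatch(
    double[] q, (string Answer, double[] Embedding)[] entries)
{
    return entries
        .Select(e => (e.Answer, Confidence: CosineSimilarity(q, e.Embedding)))
        .OrderByDescending(x => x.Confidence)
        .First();
}

// Cosine similarity of two equal-length vectors.
double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}
```

With real sentence-transformer embeddings the vectors would have hundreds of dimensions, but the matching logic is the same: the highest cosine score wins, and that score doubles as the confidence used by the router below.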

The Tools

  • .NET 8+
  • ONNX Runtime: For running the local semantic search model.
  • A Vector Store: To hold the FAQ embeddings.
  • An LLM SDK: Such as Azure.AI.OpenAI to connect to cloud models.

Example: The Router Logic

The core of a hybrid system is the “router”—a piece of logic that decides whether to use the local model or escalate to the cloud.

using System;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;

public class HybridAIService
{
    private readonly LocalSearchService _localSearch;
    private readonly OpenAIClient _cloudClient;
    private const double ConfidenceThreshold = 0.85;

    public HybridAIService(string onnxModelPath, string openAIKey)
    {
        _localSearch = new LocalSearchService(onnxModelPath);
        _cloudClient = new OpenAIClient(openAIKey);
    }

    public async Task<string> GetAnswerAsync(string userQuestion)
    {
        // 1. Try the local-first approach
        var localResult = _localSearch.FindBestMatch(userQuestion);

        // 2. The Router Logic
        if (localResult.Confidence >= ConfidenceThreshold)
        {
            Console.WriteLine("INFO: Answered using local model.");
            return localResult.Answer;
        }
        else
        {
            Console.WriteLine("INFO: Low local confidence. Escalating to cloud model.");
            // 3. Escalate to the cloud
            return await GetCloudAnswerAsync(userQuestion, localResult.Answer);
        }
    }

    private async Task<string> GetCloudAnswerAsync(string question, string retrievedContext)
    {
        var prompt = $@"
            Based on the following context, please provide a comprehensive answer to the user's question.
            If the context is not sufficient, use your general knowledge.

            Context: ""{retrievedContext}""

            Question: ""{question}""
        ";

        var chatCompletionsOptions = new ChatCompletionsOptions()
        {
            DeploymentName = "gpt-4", // Or your preferred model
            Messages =
            {
                new ChatRequestSystemMessage("You are a helpful assistant."),
                new ChatRequestUserMessage(prompt),
            }
        };

        // This uses the Azure.AI.OpenAI 1.x ChatCompletions pattern;
        // the 2.x releases expose a different, ChatClient-based API.
        Response<ChatCompletions> response = await _cloudClient.GetChatCompletionsAsync(chatCompletionsOptions);
        return response.Value.Choices.First().Message.Content;
    }
}

// Dummy implementation for LocalSearchService to make the code compile
public class LocalSearchService
{
    public LocalSearchService(string model) { }
    
    // Returns a tuple with named elements for clarity
    public (string Answer, double Confidence) FindBestMatch(string query)
    {
        // In a real app, this would perform a vector search using ONNX Runtime or a local vector DB.
        // For this example, we simulate a match.
        if (query.Contains("password reset", StringComparison.OrdinalIgnoreCase))
        {
            return ("You can reset your password in the Account settings.", 0.9);
        }
        return ("Sorry, I'm not sure how to help with that.", 0.5);
    }
}

How the Router Works

  • Confidence Score: The local semantic search doesn’t just return an answer; it returns a confidence score (e.g., the cosine similarity of the vectors).
  • Threshold: We define a ConfidenceThreshold. If the score is above this value, we trust the local answer.
  • Escalation: If the score is below the threshold, we make the API call to the cloud model, passing the best local result as context to help the LLM.
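One practical detail worth noting: raw cosine similarity ranges from -1 to 1, so if your search returns it unscaled you may want to clamp or rescale it before comparing against a threshold like 0.85. A minimal sketch (the linear rescaling here is an assumption for illustration, not part of the design above):

```csharp
using System;

// Map cosine similarity in [-1, 1] onto a [0, 1] confidence score.
double ToConfidence(double cosine) => Math.Clamp((cosine + 1.0) / 2.0, 0.0, 1.0);

Console.WriteLine(ToConfidence(0.7));   // somewhat below a 0.85 threshold
Console.WriteLine(ToConfidence(-1.0));  // worst case maps to 0
```

In practice, sentence-transformer embeddings of related texts tend to produce positive similarities, so many systems simply clamp negatives to zero; whichever scale you choose, calibrate the threshold against real queries.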

Benefits of the Hybrid Approach

  1. Responsiveness: Users get instant answers for common questions, improving the user experience.
  2. Cost-Effectiveness: Reduces the number of expensive API calls to cloud services. You only pay for the queries that truly require advanced reasoning.
  3. Privacy: Sensitive data can be processed locally without ever leaving the user’s machine. Only the escalated, low-confidence queries are sent to the cloud.
  4. Offline Capability: The “fast path” can work even when the user is offline, providing a baseline level of functionality.

Conclusion

The hybrid AI pattern offers a pragmatic and powerful way to build intelligent .NET applications. By combining the strengths of local and cloud models, you can create systems that are fast, efficient, private, and smart. Instead of choosing between local or cloud, choose the architecture that lets you use the right tool for the right job.
