Building a RAG System with ONNX Runtime and C#

Learn how to build a Retrieval-Augmented Generation (RAG) pipeline in C# using the ONNX Runtime for local, private, and efficient semantic search.

Chris Malpass · 1174 words · 6 mins

Large Language Models (LLMs) are powerful, but they have a critical limitation: their knowledge is frozen at the time of training. They don’t know about your private documents or recent events. The solution is Retrieval-Augmented Generation (RAG), a technique that enhances LLM responses with information retrieved from your own data sources.

While many RAG systems rely on Python and heavy frameworks, you can build a highly efficient RAG pipeline right in .NET using the ONNX Runtime. This approach is perfect for desktop applications, edge devices, or keeping your data private.

The RAG Workflow

A RAG system has two main stages:

  1. Indexing (Offline):

    • Load your documents (text files, PDFs, etc.).
    • Chunk the documents into manageable pieces.
    • Use an embedding model to convert each chunk into a numerical vector.
    • Store these vectors in a Vector Database.
  2. Querying (Online):

    • Take a user’s question and convert it into a vector using the same embedding model.
    • Perform a similarity search in the Vector Database to find the most relevant document chunks.
    • Construct a prompt that includes the user’s question and the retrieved chunks.
    • Send this augmented prompt to an LLM to generate the final answer.
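
The chunking step in the indexing stage is not implemented in the code samples later in this post. A minimal character-window chunker with overlap (the class name and default sizes here are illustrative, not from any library) could look like:

```csharp
using System;
using System.Collections.Generic;

public static class Chunker
{
    // Split text into fixed-size character chunks with overlap, so content
    // cut at a chunk boundary still appears intact in the neighboring chunk.
    public static List<string> ChunkText(string text, int chunkSize = 500, int overlap = 50)
    {
        if (chunkSize <= overlap)
            throw new ArgumentException("chunkSize must be larger than overlap");

        var chunks = new List<string>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(chunkSize, text.Length - start);
            chunks.Add(text.Substring(start, length));
        }
        return chunks;
    }
}
```

Production chunkers usually split on sentence or paragraph boundaries instead of raw character counts, but the overlap idea is the same.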

Tools for the Job

  • .NET 8+
  • Microsoft.ML.OnnxRuntime: To run our embedding model locally.
  • Microsoft.ML.Tokenizers: The official .NET library for tokenizing text, essential for preparing input for the model.
  • A Sentence-Transformer Model in ONNX format: Models like all-MiniLM-L6-v2 are excellent for this. You can find them on the Hugging Face Hub.
  • A Vector Store: For this example, we’ll build a simple in-memory store, but you could use a dedicated library like Microsoft.KernelMemory or a database like ChromaDB.

Step 1: The ONNX Embedding Service

First, let’s create a service that can take text and turn it into a vector using an ONNX model. You’ll need to download the model.onnx file from a sentence-transformer repository on Hugging Face.

using System;
using System.IO;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;
using System.Numerics.Tensors; // TensorPrimitives (System.Numerics.Tensors NuGet package, .NET 8+)

public class EmbeddingService : IDisposable
{
    private readonly InferenceSession _session;
    private readonly Tokenizer _tokenizer;

    public EmbeddingService(string modelPath, string tokenizerPath)
    {
        _session = new InferenceSession(modelPath);
        // Load the tokenizer. For BERT-style sentence-transformers models,
        // BertTokenizer expects the vocab.txt file shipped alongside model.onnx.
        // Adjust if your model uses a different tokenizer format (e.g., BPE).
        _tokenizer = BertTokenizer.Create(tokenizerPath);
    }

    public float[] GetEmbedding(string text)
    {
        // 1. Tokenize the input ([CLS] and [SEP] are added by default)
        var tokenIds = _tokenizer.EncodeToIds(text);

        // 2. Create input tensors
        // Most BERT-based models expect input_ids, attention_mask, and token_type_ids
        var inputIds = new DenseTensor<long>(tokenIds.Select(x => (long)x).ToArray(), new[] { 1, tokenIds.Count });
        var attentionMask = new DenseTensor<long>(Enumerable.Repeat(1L, tokenIds.Count).ToArray(), new[] { 1, tokenIds.Count });
        var tokenTypeIds = new DenseTensor<long>(Enumerable.Repeat(0L, tokenIds.Count).ToArray(), new[] { 1, tokenIds.Count });

        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids", inputIds),
            NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask),
            NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeIds)
        };

        // 3. Run inference
        using var results = _session.Run(inputs);
        
        // 4. Extract the embeddings
        // Usually the last hidden state or a specific pooling layer. 
        // For sentence-transformers, we often take the mean pooling or the CLS token.
        // This example assumes the model returns a 'last_hidden_state' and we take the first token (CLS) for simplicity.
        // Check your specific model's output names!
        var output = results.First().AsTensor<float>();
        
        // Extract the embedding for the first token (CLS token)
        // Shape is [BatchSize, SequenceLength, HiddenSize] -> [1, SeqLen, 384]
        var embedding = new float[output.Dimensions[2]];
        for (int i = 0; i < embedding.Length; i++)
        {
            embedding[i] = output[0, 0, i];
        }

        // 5. Normalize
        // Calculate the Euclidean norm (L2 norm) of the embedding vector
        float norm = TensorPrimitives.Norm(embedding);
        
        // Divide each element by the norm to get a unit vector
        TensorPrimitives.Divide(embedding, norm, embedding);
        
        return embedding;
    }

    public void Dispose() => _session?.Dispose();
}

Note: The Microsoft.ML.Tokenizers library is evolving. Ensure you check the documentation for the specific version you are using. Also, different models have different input requirements (some don’t need token_type_ids) and output structures. Always inspect your ONNX model using a tool like Netron to verify input/output names and shapes.
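
The CLS-token shortcut above is simpler than what sentence-transformers models such as all-MiniLM-L6-v2 actually use, which is attention-mask-weighted mean pooling over the last hidden state. A minimal sketch of that pooling (the class and method names are illustrative) over a `[seqLen, hiddenSize]` hidden state:

```csharp
using System;

public static class Pooling
{
    // Mean pooling over the sequence dimension of a [seqLen, hiddenSize]
    // last_hidden_state, counting only tokens where the attention mask is 1
    // so padding does not dilute the average.
    public static float[] MeanPool(float[][] lastHiddenState, long[] attentionMask)
    {
        int seqLen = lastHiddenState.Length;
        int hidden = lastHiddenState[0].Length;
        var pooled = new float[hidden];
        float maskSum = 0;

        for (int t = 0; t < seqLen; t++)
        {
            if (attentionMask[t] == 0) continue; // skip padding tokens
            maskSum++;
            for (int i = 0; i < hidden; i++)
                pooled[i] += lastHiddenState[t][i];
        }

        for (int i = 0; i < hidden; i++)
            pooled[i] /= Math.Max(maskSum, 1e-9f);

        return pooled;
    }
}
```

In `GetEmbedding`, you would loop over all sequence positions of the output tensor instead of just index 0, then normalize the pooled vector as before.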

Step 2: In-Memory Vector Store

Next, a simple class to hold our document chunks and their vectors.

using System.Collections.Generic;
using System.Linq;
using System.Numerics.Tensors; // TensorPrimitives (System.Numerics.Tensors NuGet package, .NET 8+)

public class VectorStore
{
    private readonly List<(string Text, float[] Vector)> _vectors = new();

    public void Add(string text, float[] vector)
    {
        _vectors.Add((text, vector));
    }

    public string FindMostSimilar(float[] queryVector)
    {
        if (!_vectors.Any()) return "No documents in store.";

        var bestMatch = _vectors
            .Select(v => new
            {
                Text = v.Text,
                // High-performance Cosine Similarity using .NET 8 TensorPrimitives
                Similarity = TensorPrimitives.CosineSimilarity(v.Vector, queryVector)
            })
            .OrderByDescending(x => x.Similarity)
            .FirstOrDefault();

        return bestMatch?.Text ?? "No similar documents found.";
    }
}
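
Because `GetEmbedding` L2-normalizes its output, cosine similarity over the stored vectors reduces to a plain dot product: cos(a, b) = (a · b) / (|a| |b|), and both norms are 1. A quick self-contained check with `TensorPrimitives` (the sample vectors here are made up for illustration):

```csharp
using System;
using System.Numerics.Tensors;

float[] a = { 1f, 0f };
float[] b = { 0.6f, 0.8f }; // already unit length: 0.36 + 0.64 = 1

// For unit vectors these two calls agree, so Dot is a cheap substitute
// when you know your embeddings are normalized.
float cos = TensorPrimitives.CosineSimilarity(a, b);
float dot = TensorPrimitives.Dot(a, b);

Console.WriteLine($"{cos} {dot}"); // both 0.6 (within float precision)
```

This is why many vector stores normalize at insert time and then rank by dot product alone.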

Step 3: Putting It All Together

Now, we can orchestrate the RAG pipeline.

public static void Main(string[] args)
{
    // You can download these files from the Hugging Face model page (e.g., sentence-transformers/all-MiniLM-L6-v2)
    var modelPath = "model.onnx";
    var tokenizerPath = "vocab.txt"; // BERT-style models ship a vocab.txt alongside model.onnx
    
    using var embeddingService = new EmbeddingService(modelPath, tokenizerPath);
    var vectorStore = new VectorStore();

    // --- Indexing Stage ---
    var documents = new[]
    {
        "The capital of France is Paris.",
        "Photosynthesis is the process used by plants to convert light into energy.",
        "The .NET Framework was first released in 2002."
    };

    Console.WriteLine("Indexing documents...");
    foreach (var doc in documents)
    {
        var vector = embeddingService.GetEmbedding(doc);
        vectorStore.Add(doc, vector);
    }
    Console.WriteLine("Indexing complete.");

    // --- Querying Stage ---
    var userQuestion = "When was .NET released?";
    Console.WriteLine($"\nUser Question: {userQuestion}");

    var queryVector = embeddingService.GetEmbedding(userQuestion);
    var retrievedContext = vectorStore.FindMostSimilar(queryVector);

    Console.WriteLine($"Retrieved Context: {retrievedContext}");

    // --- Augment and Generate Stage ---
    var prompt = $@"
        Context: {retrievedContext}

        Question: {userQuestion}

        Answer based only on the context provided:
    ";

    Console.WriteLine("\n--- Augmented Prompt ---");
    Console.WriteLine(prompt);

    // Send this prompt to your chosen LLM (e.g., via OpenAI's API)
    // var answer = llmClient.Generate(prompt);
    // Console.WriteLine($"\nFinal Answer: {answer}");
}

Conclusion

Building a RAG pipeline in C# with ONNX is not only possible but also incredibly powerful. It allows you to create private, efficient, and domain-specific AI applications without relying on external services for the core semantic search functionality. As the .NET AI ecosystem continues to grow, expect the tokenization and vector database steps to become even easier to implement.
