Updated on Feb 10, 2025

A Guide to Building a RAG AI Assistant Using DeepSeek-R1

Learn to build an AI assistant using DeepSeek-R1’s reasoning model with an agentic RAG architecture for insightful responses over large knowledge bases.


In this tutorial, you'll learn how to build an AI assistant that uses DeepSeek-R1's powerful reasoning model to provide relevant and insightful responses. Our guide will leverage an agentic reasoning architecture, grounded in a RAG model, to create an AI assistant capable of accessing and reasoning over large knowledge bases.

In this blog, you will learn:

  • How to deploy DeepSeek-R1 using Ollama.
  • How to integrate the vector database Qdrant with a movie database for efficient vector search.
  • How to generate and manage embeddings for movie titles and summaries.
  • How to implement a flexible workflow using LangGraph to handle different query types (RAG-based or generic).
  • How to leverage Ollama's AI models to generate user-friendly, human-like responses.

By the end of this tutorial, you'll be equipped with the knowledge to build your own AI-driven assistant that can intelligently respond to movie-related queries, whether about specific movie details or general recommendations.

Ready to dive in? Let's get started!

DeepSeek RAG Workflow



Understanding DeepSeek

DeepSeek R1 represents a groundbreaking shift in reasoning-driven AI, leveraging pure reinforcement learning (RL) to cultivate advanced reasoning abilities in Large Language Models (LLMs). It’s part of a broader family of models, including DeepSeek-R1-Zero and a range of distilled variants built on Qwen and Llama, each tailored for tasks ranging from raw reasoning to conversational AI.

Architecture and Core Innovations

DeepSeek R1 is architected around a Transformer model enhanced with Mixture of Experts (MoE), which selectively activates a subset of its parameters during inference. This approach is both computationally efficient and scalable, allowing the model to tackle complex reasoning tasks without the immense computational burden typical of traditional LLMs. Let’s break it down:

  • Mixture of Experts (MoE):
    • Sparse Activation: Only a fraction of the model’s parameters are activated per inference, optimizing both speed and resource usage.
    • Dynamic Routing: A token-wise gating mechanism directs tokens to the most relevant expert layers, enhancing the model’s adaptability and precision.
  • Training Paradigm:
    • Group Relative Policy Optimization (GRPO):
      • An innovative RL algorithm replacing the traditional critic model, GRPO uses group-based scores to estimate the advantage (see the sketch after this list), significantly reducing memory and computational demands.
      • This method fosters emergent reasoning behaviors, such as self-verification and reflection, allowing the model to iteratively refine its responses.
    • Multi-Stage Training Pipeline:
      • Begins with a cold-start fine-tuning phase, leveraging a small dataset to establish baseline reasoning capabilities.
      • Followed by RL-focused fine-tuning, concentrating on reasoning-intensive tasks like coding and logic, while incorporating language consistency rewards to ensure coherence and readability.
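
To make the GRPO idea concrete, here is a simplified sketch of the group-relative advantage it uses (based on the published GRPO formulation; DeepSeek’s exact training setup may differ in detail). For each prompt, the policy samples a group of G responses and each response i receives a reward r_i; the advantage is then estimated from the group’s own statistics rather than from a learned critic:

A_i = (r_i − mean(r_1, …, r_G)) / std(r_1, …, r_G)

These group-relative advantages feed a clipped, PPO-style policy update, which is what lets GRPO drop the separate value model and save memory.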

Variants and Their Focus

  • DeepSeek-R1-Distill-Qwen-32B:
    This powerhouse model distills the capabilities of the massive Qwen-32B, striking a balance between scale and performance. It excels at complex text generation tasks, making it a top choice for applications requiring nuanced and context-rich outputs. Its large parameter count enables it to handle intricate queries and generate highly coherent and relevant responses.
  • DeepSeek-R1-Distill-Qwen-1.5B:
    A more compact yet powerful version, this model is fine-tuned to deliver high-quality outputs with reduced computational requirements. Ideal for scenarios where efficiency is crucial, it maintains strong performance across diverse benchmarks, making it versatile for both general and specialized tasks.
  • DeepSeek-R1-Distill-Llama-70B:
    Leveraging the capabilities of Llama-70B, this variant is optimized for deep reasoning and creative tasks. Its focus on extensive data distillation allows it to generate sophisticated and creative content, suitable for research, creative writing, and complex problem-solving.
  • DeepSeek-V3:
    Representing the latest in the DeepSeek lineup, this version brings improved text generation abilities with a focus on accuracy and contextual understanding. It’s particularly well-suited for real-time applications and dynamic content creation, thanks to its updated training paradigms and enhanced data processing techniques.
  • DeepSeek-LLM-67B-Chat:
    Designed specifically for engaging and nuanced conversational experiences, this model integrates advanced RLHF (Reinforcement Learning from Human Feedback) techniques. It’s perfect for chatbots and virtual assistants, as it delivers human-like dialogue and adapts seamlessly to varying conversational contexts.

By innovatively combining reinforcement learning, efficient architecture, and strategic fine-tuning, DeepSeek-R1 pushes the boundaries of what’s achievable in reasoning-focused AI, paving the way for more intelligent, accessible, and versatile AI solutions.




Prerequisites

Set Up DeepSeek-R1 Locally with Ollama

Ollama simplifies running LLMs locally by handling model downloads, quantization, and execution seamlessly.

Step 1: Install Ollama

First, download and install Ollama from the official website.

Step 2: Download and Run DeepSeek-R1

Let’s test the setup and download our model. Launch the terminal and type the following command.

ollama run deepseek-r1:Xb

Ollama offers a range of DeepSeek-R1 models, spanning from 1.5B parameters to the full 671B-parameter model. The 671B model is the original DeepSeek-R1, while the smaller models are distilled versions based on Qwen and Llama architectures. If your hardware cannot handle the 671B model, you can run a smaller version using the command above, replacing the X with the parameter size you want (1.5b, 7b, 8b, 14b, 32b, 70b, 671b).

For instance, if you want to test it locally, you can run this command below:

ollama run deepseek-r1:1.5b

With this flexibility, you can use DeepSeek-R1's capabilities even if you don’t have a supercomputer.

Step 3: Run DeepSeek-R1 in the Background

To run DeepSeek-R1 continuously and serve it via an API, start the Ollama server:

ollama serve

This will make the model available for integration with other applications.
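
For example, once the server is running you can confirm it is reachable by calling Ollama’s REST API directly (Ollama listens on port 11434 by default; the deepseek-r1:1.5b tag below assumes you pulled that variant):

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:1.5b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'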

Handling “Port Busy” Configuration for Ollama

If you encounter a "Port Busy" error when trying to run Ollama, you may need to change the port where Ollama is being served. Follow these steps to resolve the issue:

Set a New Port for Ollama: Run the following command to configure Ollama to use a different port (use export in a Bash terminal; on Windows, use set instead):

export OLLAMA_HOST=127.0.0.1:11435

Start the Ollama Server: Once the new port is set, start the Ollama service with this command:

ollama serve

This will bind Ollama to the new port (11435) and resolve any port conflicts that might have occurred.




Implementation of the Agentic RAG System Using DeepSeek R1

Now that we have DeepSeek-R1 running, we can use it in our code. The Ollama Python library provides a clean interface to a wide range of open-source models, making it easy to integrate LLM inference and embedding generation.

Step 1 - Install Ollama Python Library

We can run Ollama from any integrated development environment (IDE) of your choice. You can install the Ollama Python package using the following command:

!pip install ollama

Step 2 - Test the Chat Completion

Once the Ollama library is installed, we can test chat completion using the ollama.chat function.

import ollama
response = ollama.chat(
    model="deepseek-r1",
    messages=[
        {"role": "user", "content": "Explain Newton's second law of motion"},
    ],
)
print(response["message"]["content"])

The ollama.chat() function takes the model name and a user prompt, processing it as a conversational exchange. The script then extracts and prints the model's response.

The model generates this response:

<think>
Okay, so I need to explain Newton's Second Law of Motion. Hmm, where do I start? I remember from school that Newton was an important scientist who contributed a lot to physics. The second law sounds like it's about how things move when forces are applied.

Let me think... Newton's laws were the big three, right? First about inertia, then acceleration with force, and maybe third about action and reaction. So I should probably focus on the second one this time.

Alright, so from what I remember, Newton's Second Law is something like F equals mass times acceleration. But how does that work exactly? Force equals mass multiplied by acceleration. That means if you have a bigger force, it will cause more acceleration on an object with less mass or more on one with more mass?

Wait, let me get this straight. If I push a box harder, it moves faster. So the same force applied over a larger mass would give a smaller acceleration. Makes sense because less mass means less resistance to movement.

But how does that relate to everyday situations? Like when you jump, your foot exerts a force on the ground, and the ground applies an equal and opposite force pushing you up. That's the reaction part of Newton's third law, right?

And then in the car, if I push back on the dashboard with my hand (which has less mass than the vehicle), the car moves forward because it has more inertia. So F equals m times a comes into play there too.

I should also think about how this relates to acceleration and force. The greater the force applied, the faster something accelerates. And if you have a larger mass, even with the same force, it won't accelerate as much. That seems counterintuitive at first because I thought bigger objects move slower when pushed.

But in reality, because objects have more inertia (mass), they resist changes in motion more, so for the same force, acceleration is lower. So if you have a car and someone gets into it suddenly, the force from the seatbelt or something has to push all those passengers forward, which increases their acceleration beyond what an empty car would do.

Wait, maybe I'm mixing this up. If you apply the brakes suddenly, the momentum of the vehicle goes into your body, right? So that's why it's hard to stop without craning your neck. That must be related to force and inertia again.

I should also consider how this applies in more complex scenarios, like rockets or something. Newton's laws definitely explain how a rocket expels gas, which creates thrust, causing acceleration in the opposite direction.

Oh, I think it's important to mention that mass is a scalar quantity here, so it doesn't have direction. Only force and acceleration vectors matter. The law works regardless of the frame of reference because it's an absolute statement about how forces affect accelerations.

Wait, but what if something isn't moving? Like if I'm sitting in a train and I push the back seat with my hands, but the train is moving at constant speed. The force on my body from my hands doesn't cause acceleration relative to the train. So that's why it's tricky sometimes--relative vs absolute.

So Newton's Second Law has implications beyond just everyday situations. It explains how things like rockets work and how forces affect movement in physics experiments, as well as even how cars behave when you press the brakes.

I should make sure I explain the difference between force and acceleration clearly. Force is what causes the change in motion (acceleration), while mass affects that change. So if I have a heavy object with less force applied, it won't accelerate much because of its inertia.

Also, maybe give some examples beyond cars and rockets to show how this law applies everywhere. Like how a basketball rolls faster on ice than on grass--ice has more friction, which is mass, so the basketball (lower mass) moves faster despite being pushed with the same force by my foot.

I should also touch on units here because I often get confused with Newton's laws and other formulas. Force in newtons (N), mass in kilograms (kg), acceleration in meters per second squared (m/s²). So F = ma is straightforward as long as you remember the units correspond correctly.

But maybe some people find it confusing to visualize how all these variables interact. Maybe draw a free-body diagram or something if I can, but since I'm just explaining the law conceptually, perhaps not necessary right now.

Wait, what's acceleration? It's the rate of change of velocity. So whether speed is constant (no acceleration) or changes direction (centripetal acceleration), all that matters is how fast it's changing in each moment.

So putting it all together: Newton's Second Law relates force and mass to acceleration. The greater the net force, the greater the acceleration for a given mass. And if you have more mass, even with the same force, less acceleration happens because of inertia. This law explains why things move faster when pushed harder, how rockets work, and applies to so many other areas.

I think that covers it. Maybe add some historical context--like how Newton himself formulated the third law, which states that every action has an equal and opposite reaction, but that's not directly part of this explanation.
</think>

Newton's Second Law of Motion is a fundamental principle in physics that describes the relationship between force, mass, and acceleration. It states that the net force acting on an object is equal to the product of its mass and its acceleration (F = ma). Here's a structured explanation:

1. **Understanding the Components**:
   - **Force (F)**: This is the external influence causing change in motion.
   - **Mass (m)**: The quantity that measures resistance to acceleration, essentially how "heavy" an object is.
   - **Acceleration (a)**: The rate at which velocity changes over time.

2. **Mathematical Representation**:
   The law is expressed as F = ma. This equation shows that a greater force results in more acceleration for the same mass, and a larger mass results in less acceleration for the same force.

3. **Interpretation of Terms**:
   - **Force**: Involves vectors, meaning direction matters (e.g., pushing forward or backward).
   - **Acceleration**: Also involves vectors, indicating both speed and direction changes.
   - **Mass**: A scalar quantity independent of direction, affecting resistance to motion.

4. **Historical Context**:
   Sir Isaac Newton formulated the law in his "Principia Mathematica," along with three other fundamental laws, which together describe motion and forces comprehensively.

5. **Applications Beyond Simple Situations**:
   - **Rockets**: Thrust from exhaust gases causes acceleration opposite to the direction of movement.
   - **Cars**: Seatbelts create an unbalanced force that imparts momentum forward.

6. **Implications in Various Scenarios**:
   - **Sports and Gymnastics**: Pushing against resistance leads to acceleration.
   - ** thrown objects**: The force of launch causes them to accelerate through space.

7. **Free-Body Diagrams**:
   A diagram illustrating the forces acting on an object, showing how net force determines acceleration.

In summary, Newton's Second Law explains that changes in motion are due to external forces, with mass and acceleration being directly related through F = ma, applicable across various physical phenomena.

So, we now have a working chat model with DeepSeek-R1. Let’s proceed to the next step, where we will use the embedding model nomic-embed-text to generate vector embeddings from our data and store them in the Qdrant vector search engine.

Step 3 - Install Embedding Model Using Ollama

We will use the open-source embedding model nomic-embed-text, which ranks highly on the MTEB leaderboard on Hugging Face.

Here’s how you can pull the model:

ollama pull nomic-embed-text

Now, we can use it like this:

ollama.embeddings(model='nomic-embed-text', prompt='The sky is blue because of rayleigh scattering')

Step 4 - Setting Up Qdrant Using Docker

To set up Qdrant locally and run it on port 6333 using Docker, follow these steps:

Pull the Qdrant Docker Image: First, pull the latest Qdrant image from Docker Hub by running the following command:

docker pull qdrant/qdrant

Run Qdrant on Port 6333: Once the image is downloaded, start a Qdrant container and bind it to port 6333 on your localhost. Run the following command:

docker run -p 6333:6333 qdrant/qdrant
  • This will bind Qdrant's default port (6333) to port 6333 on your machine.

You can then head to http://<your_machine_ip>:6333/dashboard to find the Qdrant web UI (make sure port 6333 is open in your firewall). If you are running locally, that is simply:

 http://localhost:6333/dashboard

Step 5 - Install Libraries

Let’s now install the key libraries we will use: LangGraph, LangChain, and the Qdrant client. LangChain provides a wide range of tools to build LLM applications, whereas LangGraph helps with building AI agents. We could have also used CrewAI instead of LangGraph, as it is equally powerful.

pip install langgraph qdrant-client langchain ollama requests

Step 6 - Import all Libraries

Let’s import the libraries so we can proceed with the code: 

import os
import json
import uuid
import numpy as np
import ollama
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from dotenv import load_dotenv

from langgraph.graph.message import add_messages
from langgraph.graph import StateGraph
from typing_extensions import TypedDict

You may need to use load_dotenv to load your environment variables:

load_dotenv()

Step 7 - Connect with Qdrant

We had Qdrant running before, so we can now connect to it. 

# Qdrant connection setup
qdrant_client = QdrantClient(host="localhost", port=6333)
# Create or connect to collection in Qdrant
COLLECTION_NAME = "movie_embeddings"
VECTOR_SIZE = 768  # our embedding model (nomic-embed-text) dimension

Qdrant has multitenancy support, and if you are building a SaaS application, you might want to leverage it. Multitenancy allows you to use the same collection for many customers, and segregate them by a tenant id. 
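
For example, a minimal payload-based multitenancy sketch might look like the following (the tenant_id field name and customer_42 value are illustrative assumptions, not part of this tutorial's dataset): store a tenant identifier in each point's payload, then filter on it at query time.

from qdrant_client import models

tenant_filter = models.Filter(
    must=[
        models.FieldCondition(key="tenant_id", match=models.MatchValue(value="customer_42"))
    ]
)

# Pass tenant_filter as query_filter when searching, e.g.:
# qdrant_client.query_points(
#     collection_name=COLLECTION_NAME,
#     query=query_vector,
#     query_filter=tenant_filter,
#     limit=5,
# )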

Note that we will use 768 as the number of dimensions for our generated vectors, which is the dimension that nomic-embed-text generates. 

Step 8 - State Class Definition

In LangGraph, you maintain state between the various workflow nodes. Let’s create a TypedDict class for the State that we plan to maintain:

class State(TypedDict):
    query: str
    answer: str

Step 9 - Embedding Generation and Storage

Let’s now write a function to generate embeddings. 

def generate_embedding(text, model="nomic-embed-text"):
    text = text.replace("\n", " ")
    return ollama.embeddings(model=model, prompt=text).embedding

We will also write a function that takes the data as input, generates embeddings with the above function, and then saves them in Qdrant.

def update_vectordb(data: list):
    points = []
    status = False

    # Create collection if not exists
    if not qdrant_client.collection_exists(collection_name=COLLECTION_NAME):
        qdrant_client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )

    # Loop through the data and generate embeddings
    for item in data:
        insertion_string = item["title"] + " " + item["summary"] + " " + str(item["year"]) + " " + item["plot"] + " " + item["genre"]

        # Generate an embedding for the combined movie text
        vectorized_item = generate_embedding(insertion_string)
        point_id = str(uuid.uuid4())  # Unique ID for each point
        metadata = {
            "title": item["title"],
            "year": item["year"],
            "genre": item["genre"],
            "plot": item["plot"],
            "summary": item["summary"]
        }

        # Create PointStruct for Qdrant
        point = PointStruct(
            id=point_id,
            vector=vectorized_item,
            payload=metadata
        )
        points.append(point)

    # Insert points into Qdrant
    if points:
        qdrant_client.upsert(collection_name=COLLECTION_NAME, points=points)
        print(f"Successfully inserted {len(points)} embeddings into collection {COLLECTION_NAME}")
        status = True
    else:
        print("No valid content found.")
   
    return status

The update_vectordb function updates the Qdrant vector database with movie-related data.

How It Works:

  • Check for Collection: Ensures the Qdrant collection exists.
  • Embedding Generation: Creates a vector embedding from each movie’s combined title, summary, year, plot, and genre.
  • Unique Point Creation: Assigns a UUID and stores metadata (title, year, genre, plot, summary).
  • Insertion into Qdrant: Uses upsert to add or update embeddings and metadata.

This function keeps the vector database up to date, enabling efficient search and retrieval.

Step 10 - Vector Search Function

Let’s write another function to query Qdrant using a query embedding. This function will be useful later when we are stitching together the workflow. 

# Function to query the vector database in Qdrant
def query_qdrant(query: str, limit=5):
    # Generate query vector
    query_vector = generate_embedding(query)

    # Query Qdrant
    result = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=limit,
        with_vectors=False
    )
    return result.points

The query_qdrant function searches a Qdrant vector database for results similar to your query.

How It Works:

  • Embedding Generation: Converts your query into a vector using generate_embedding.
  • Querying Qdrant: Uses qdrant_client.query_points to perform a similarity search, retrieving the closest matches.
  • Returning Results: Returns the most relevant points based on cosine similarity.

This function powers efficient semantic search, enabling accurate retrieval of relevant data.

Step 11 - Load Dataset

We will now load the data from a JSON file
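
The load_movies_from_json helper used below isn't defined elsewhere in this post, so here is a minimal sketch, assuming movies.json holds a JSON array of objects with title, summary, year, genre, and plot keys:

def load_movies_from_json(json_file: str) -> list:
    # Read a JSON array of movie records (title, summary, year, genre, plot)
    with open(json_file, "r", encoding="utf-8") as f:
        return json.load(f)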

# Path to your movies.json file
movies_json_file = "movies.json"

# Load the data from the JSON file
movie_data = load_movies_from_json(movies_json_file)

# Update Qdrant with the loaded movie data
update_vectordb(movie_data)

…and then upsert the data into Qdrant.

Step 12 - Prepare Context from Search Result

One mistake that many developers make is that they attempt to generate embeddings from the user query directly, without interpreting it. This has numerous problems. 

For instance, if the user is looking for a movie released in 2024 and you perform a naive semantic search, your results will include movies from other years as well. The way to solve this is to use the LLM to break the query down into a search string plus metadata, and then perform a semantic search that leverages metadata filters (e.g., year: 2024). This leads to far more accurate results. You can improve upon this tutorial using that tactic.

Now, let’s create a function that stitches together the query result and creates an LLM context from it: 

def generate_movie_context(query_result):
    context = "Based on your query, here are some relevant movies:\n\n"
   
    for i, result in enumerate(query_result):
        movie = result.payload
        title = movie.get('title', 'Unknown')
        year = movie.get('year', 'N/A')
        genre = movie.get('genre', 'Unknown')
        plot = movie.get('plot', 'No plot available.')
        summary = movie.get('summary', 'No summary available.')
       
        context += f"Movie {i+1}:\n"
        context += f"Title: {title} ({year})\n"
        context += f"Genre: {genre}\n"
        context += f"Plot: {plot}\n"
        context += f"Summary: {summary}\n\n"
   
    return context

Step 13 - Using DeepSeek-R1 for Contextual Responses

We can now put together the ‘generation’ function of our RAG system.

def deepseek_r1_rag(context, query):
    try:
        # Build a more clear and specific system message to guide the response
        system_message = (
            "You are an intelligent assistant who generates human-friendly, clear, and relevant answers."
            " You will respond based on the given context and query. Make sure to give precise, natural-sounding responses."
            " Use the context information provided to enhance the accuracy and relevance of the answer."
            "Respond in 100 words"
        )

       
        # Sending the context and query to the Ollama API for response generation
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[
                {"role": "user", "content": system_message},
                {"role": "user", "content": f"Context: {context}\nQuery: {query}"},
            ]
        )
        return response["message"]["content"]

    except Exception as e:
        return f"Error occurred: {str(e)}"

The deepseek_r1_rag function generates responses using the DeepSeek-R1 model in a RAG setup. It sets a system message to ensure concise, contextually relevant replies within 100 words. The function processes the user's query and retrieved context, sending them to ollama.chat(). This enables DeepSeek-R1 to generate accurate, context-aware answers.

Step 14 - Using DeepSeek-R1 for Generic Answer Generation

We will also create a function where we directly pass the user query to the LLM without fetching context. This will serve as our fallback when a query doesn’t match the knowledge base closely enough to justify retrieval.

def deepseek_r1_llm(query):
    try:
        # Build a more clear and specific system message to guide the response
        system_message = (
            "You are an intelligent assistant who generates human-friendly, clear, and relevant answers."
            " You will respond based on the given context and query. Make sure to give precise, natural-sounding responses."
            " Answer the query given"
            "Respond in 100 words"
        )

       
        # Sending the context and query to the Ollama API for response generation
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[
                {"role": "user", "content": system_message},
                {"role": "user", "content": f"Query: {query}"},
            ]
        )
        return response["message"]["content"]

    except Exception as e:
        return f"Error occurred: {str(e)}"

We can now create our retrieval-augmented generation workflow using DeepSeek-R1.

Step 15 - LangGraph Workflow Node Using RAG

The following code shows how to create the workflow function using RAG context.

def workflow_with_rag(state: State):
    query = state["query"]
    result = query_qdrant(query)
    context = generate_movie_context(result)
   
    response = deepseek_r1_rag(context, query)

    if not response:
        return {"answer": "No response generated"}
   
    return {"answer": response}

This function implements a retrieval-augmented generation (RAG) workflow for processing movie-related queries. The function first retrieves relevant movie data from Qdrant by querying the vector database. It then generates a movie-specific context using the retrieved data. Finally, the context and query are sent to the DeepSeek-R1 model for response generation. If no response is generated, it returns a fallback message indicating no response. Otherwise, the function returns the generated answer.

Step 16 - LangGraph Workflow Node for Generic Query Processing

We will also create a workflow node that doesn’t use RAG. This is useful in situations where the user’s query falls outside the purview of the stored knowledge base:

def workflow_without_rag(state: State):
    query = state["query"]
    response = deepseek_r1_llm(query)
   
    if not response:
        return {"answer": "No response generated"}
   
    return {"answer": response}

Step 17 - Router Node for Dynamic Workflow Selection

We will also need a node that routes the workflow based on the query.

# Router Node
def route_workflow(state: State):
    query = state["query"]
    results = query_qdrant(query)
    # ScoredPoint exposes the similarity score directly via .score
    query_result_score = results[0].score if results else 0.0
   
    if query_result_score > 0.5:
        print("Node Chosen -----> RAG")
        response = workflow_with_rag(state)
    else:
        print("Node Chosen -----> GENERIC")
        response = workflow_without_rag(state)
   
    # Ensure state always has "answer"
    state["answer"] = response.get("answer", "No answer")
    return state

This function is designed to route the user's query to either a RAG-based or a generic response workflow based on the similarity score of the query result. It first calculates the similarity score from the query result obtained from the Qdrant vector database. If the score is greater than 0.5, it routes the query to the RAG workflow, leveraging context-based generation. If the score is below the threshold, the query is processed using a generic workflow. Finally, the function ensures that the state always contains an "answer", whether from RAG or the fallback method.

Code Overview

  • Query Evaluation: Retrieves similarity score for the query from Qdrant.
  • Dynamic Routing: Chooses the RAG workflow or generic workflow based on the similarity score.
  • State Update: Ensures the state has the "answer" field before returning the result.

Step 18 - Build and Compile LangGraph Workflow

# Create LangGraph
graph_builder = StateGraph(State)

# Add workflows as nodes
graph_builder.add_node("router", route_workflow)

# Set entry and finish points
graph_builder.set_entry_point("router")
graph_builder.set_finish_point("router")

# Compile the graph
graph = graph_builder.compile()

This code snippet demonstrates how to create and configure a LangGraph for a dynamic workflow system. It initializes a StateGraph instance and adds the route_workflow function as a node in the graph, which is responsible for determining the appropriate response workflow based on query similarity. The entry and finish points are both set to the "router" node, ensuring that the graph starts and ends with this decision point. Finally, the graph is compiled, making it ready for execution in the workflow system.

Code Overview

  • Graph Creation: Initializes a LangGraph (StateGraph) using the State type.
  • Node Addition: Adds route_workflow as a node, allowing dynamic routing based on query context.
  • Entry and Finish Points: Configures both the entry and finish points to the "router" node.

  • Graph Compilation: Compiles the graph, making it ready for execution in the LangGraph workflow.
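
Before wiring up the interactive loop, you can sanity-check the compiled graph with a single call (a minimal usage sketch; the query string is just an example):

result = graph.invoke({"query": "Recommend an animated multiverse movie", "answer": ""})
print(result["answer"])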

Step 19 - Run the Movie Knowledge Assistant Agent

# Run the Agent
def run_agent():
    print("Welcome to the Movie Knowledge Assistant! Type 'quit' to exit.")
    while True:
        user_input = input("User: ")
        if user_input.lower() in ["quit", "exit"]:
            print("Goodbye!")
            break

        state_input = {"query": user_input, "answer": "None"}
        # Process the query through LangGraph
        for event in graph.stream(state_input):
            # Each streamed event maps node names to their returned state updates
            for value in event.values():
                print("Assistant Response ->: ", value.get("answer", "No answer"))

This code enables you to interact with the Movie Knowledge Assistant agent through a text-based interface. The agent processes queries in a loop, providing real-time responses. It listens for input, runs it through the compiled LangGraph workflow, and returns an answer. The session runs until you type “quit” or “exit.”

How It Works:

  • Continuous Input Loop: The agent keeps running until you exit.
  • State Setup: Your query is stored in a state_input dictionary along with a placeholder answer.
  • Graph Processing: The query is processed through LangGraph’s stream method, returning an assistant response if available.

This provides an intuitive way to interact with the AI-powered movie assistant.

Step 20 - Test the Workflow

# Start the agent
if __name__ == "__main__":
    run_agent()

Let’s now test our workflow. 

Results

Let’s try a query: 

Multiverse adventure with Spider-Man

You will be able to follow DeepSeek-R1’s ‘thought process’.

This is the context that the vector search fetched:

{
  "relevant_results": [
    {
      "id": "7dedd85c-70aa-46d4-a4a2-7061a2eddf92",
      "score": 0.79638296,
      "title": "Spider-Man: Across the Spider-Verse",
      "year": 2023,
      "genre": "Animation, Action, Adventure",
      "plot": "Miles Morales journeys across the multiverse to team up with other Spider-Men to face a new, greater threat.",
      "summary": "Miles Morales must team up with different versions of Spider-Man across the multiverse to stop a new, dangerous adversary."
    },
    {
      "id": "e6a4d365-36c3-4f33-80a2-cbcb5e935a29",
      "score": 0.63772786,
      "title": "Ant-Man and The Wasp: Quantumania",
      "year": 2023,
      "genre": "Action, Adventure, Comedy",
      "plot": "Scott Lang and Hope van Dyne venture into the Quantum Realm where they encounter new threats and mysteries.",
      "summary": "Ant-Man and the Wasp must navigate through an unknown realm, facing dangers and challenges that put their world at risk."
    },
    {
      "id": "f4c28bf6-d23d-48fd-bac2-d64cb26c5fca",
      "score": 0.6265694,
      "title": "The Marvels",
      "year": 2023,
      "genre": "Action, Adventure, Fantasy",
      "plot": "Captain Marvel, Monica Rambeau, and Kamala Khan team up to battle a new intergalactic threat.",
      "summary": "Three Marvel heroes unite to fight a powerful force, navigating their powers and teaming up to save the universe."
    },
    {
      "id": "8ac0c395-9f6b-4814-b111-0e86777aeb57",
      "score": 0.57232714,
      "title": "Guardians of the Galaxy Vol. 3",
      "year": 2023,
      "genre": "Action, Adventure, Comedy",
      "plot": "The Guardians must protect Rocket from a powerful villain while dealing with their own personal conflicts.",
      "summary": "The Guardians of the Galaxy face new challenges as they fight to save one of their own while grappling with personal demons."
    }
  ]
}

This will generate a response text like this:

Assistant Response ->:  <think>
Okay, so I need to come up with an answer that fits within 100 words for this query. Let me see what information is available.

The user wants a movie about a multiverse adventure with Spider-Man. Looking at the context provided:

Movie 1 is Spider-Man: Across the Spider-Verse (2023). It's set in a multiverse and involves Miles Morales teaming up with others to tackle a new threat. So this seems like a perfect fit.

The other movies are more about Ant-Man, Marvels, and Guardians of the Galaxy, which don't involve Spider-Man. So I can't pick those unless they're still relevant, but I think Movie 1 is the one that fits best.

I should make sure to mention all key points: Spider-Man's journey across a multiverse, different versions from various teams, facing new threats, and the summary provided. Also, it's worth mentioning that each team member has their own identities within the multiverse.

So structuring the answer to include the movie title, genre, plot elements related to the multiverse and superhero adventures, and summarize the summary given.
</think>

Spider-Man: Across the Spider-Verse (2023) is a stunning adventure where Miles Morales teams up with other superheroes across multiple multiverses to face new challenges. Based on the provided context, this film features a dynamic story that spans multiple universes, showcasing Spider-Man's journey and the heroes' teamwork.

That completes the basic RAG workflow architecture. Now, let’s leverage the Structured Output functionality that Ollama provides.




Structured Output with Ollama Using Python

Structured outputs are highly beneficial because they let us plug the AI’s responses into a wider workflow. Ollama supports this by letting you constrain a model’s output to a JSON schema.

Now, using the Ollama Python library, pass the schema as a JSON object to the format parameter, either as a dictionary or, preferably, by using Pydantic to serialize the schema with model_json_schema().

from pydantic import BaseModel

class Movie(BaseModel):
  name: str
  cast: list
  budget: str


try:
    response = ollama.chat(
        model="deepseek-r1:1.5b",
        messages=[
            {
            'role': 'user',
            'content': 'Tell me about interstellar cast, budget and name',
            }
        ],
        format=Movie.model_json_schema(),
    )
    movie = Movie.model_validate_json(response.message.content)
    print(movie)


except Exception as e:
    print(f"Error occurred: {str(e)}")

Results

{
  "name": "Interstellar Cast",
  "cast": [
    {
      "name": "J. J. Cross",
      "year": 2014,
      "city": "New York",
      "country": "United States"
    },
    {
      "name": "Nate Frakes",
      "year": 2017,
      "city": "Los Angeles",
      "country": "United States"
    },
    {
      "name": "Ethan Hawke",
      "year": 2018,
      "city": "Los Angeles",
      "country": "United States"
    },
    {
      "name": "Kathleen O'Hara",
      "year": 2017,
      "city": "San Francisco",
      "country": "United States"
    },
    {
      "name": "Cristian Schaefer",
      "year": 2015,
      "city": "Los Angeles",
      "country": "United States"
    }
  ],
  "budget": "$69 million"
}

When to Use Structured Output?

As discussed earlier, you should extract structured metadata from the user's query before performing a search in Qdrant. An extract_query_metadata function can pull structured metadata (e.g., year, genre) out of the user's natural-language query. Once extracted, you can improve search filtering by combining vector search (using embeddings) with structured filtering (metadata-based filtering in Qdrant).

How to Use It in Qdrant Search?

Step 1: Extract Metadata from the Query

Use the extract_query_metadata function to get both the search query and the metadata filters.
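
The post doesn't define extract_query_metadata, so here is a minimal sketch built on the structured-output pattern shown above (the QueryMetadata schema, its field names, and the deepseek-r1:1.5b tag are assumptions you may want to adjust):

from typing import Optional
from pydantic import BaseModel
import ollama

class QueryMetadata(BaseModel):
    query: str                  # the semantic part of the user's request
    year: Optional[int] = None  # optional filters the LLM may extract
    genre: Optional[str] = None

def extract_query_metadata(user_query: str):
    try:
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[{
                "role": "user",
                "content": (
                    "Extract the core search text and any year or genre filters "
                    f"from this movie query: {user_query}"
                ),
            }],
            format=QueryMetadata.model_json_schema(),
        )
        parsed = QueryMetadata.model_validate_json(response.message.content)
        metadata = {k: v for k, v in [("year", parsed.year), ("genre", parsed.genre)] if v is not None}
        return {"query": parsed.query, "metadata": metadata}
    except Exception as e:
        print(f"Error occurred: {str(e)}")
        return None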

query = "Find AI movies from 1980"
structured_query = extract_query_metadata(query)

if structured_query:
    search_query = structured_query["query"]
    filters = structured_query["metadata"]
else:
    search_query = query
    filters = {}

Step 2: Perform a Filtered Search in Qdrant

If you're using Qdrant, you can combine:

  • Vector Search (based on embeddings)
  • Metadata Filtering (like year == 1980)

from qdrant_client import QdrantClient, models

# Connect to Qdrant (or reuse the qdrant_client created earlier)
client = QdrantClient("http://localhost:6333")

# Define the search request
search_results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=generate_embedding(search_query),  # Convert the query text to a vector
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="year",
                match=models.MatchValue(value=filters.get("year"))  # Apply year filter
            )
        ]
    ) if "year" in filters else None,
    limit=5  # Return top 5 results
)

# Print the search results
print(search_results)

Why Use Metadata Filtering in Qdrant?

  1. Vector Search Alone Isn't Enough
    • Searching only with embeddings might return irrelevant results (e.g., AI movies from different years).
    • Metadata filtering ensures the results match exact conditions (like year, genre, or actor).
  2. Combining Embeddings + Metadata Improves Precision
    • Embeddings help find semantically similar movies.
    • Metadata filters ensure results meet specific criteria (e.g., only movies from 1980).
  3. Example of Combining Both
    • Query: "Find AI movies from 1980"
    • Extracted Query: "Find AI movies"
    • Metadata: {"year": 1980}
    • Qdrant Search:
      • Vector search retrieves AI-related movies.
      • Filtering ensures only 1980 movies are shown.


Conclusion

Building an agentic AI assistant using DeepSeek-R1 is a powerful way to create intelligent, context-aware, and reasoning-driven AI applications. This guide walked through the core architecture of DeepSeek-R1, covering its Mixture of Experts (MoE) structure, reinforcement learning techniques (GRPO), and multi-stage training pipeline, all of which contribute to its superior reasoning capabilities.

The tutorial provided a step-by-step approach to deploying DeepSeek-R1 using Ollama, integrating Qdrant for vector search, and implementing a retrieval-augmented generation (RAG) model to enhance response accuracy. It also detailed how to structure AI workflows with LangGraph, dynamically routing queries between RAG-based contextual answers and general LLM responses based on query similarity.

Through structured output handling, metadata filtering, and efficient embedding management, the AI assistant can deliver more precise and relevant answers, particularly in specialized domains like movie recommendations. By leveraging DeepSeek-R1’s advanced reasoning, combined with vector databases and structured query filtering, developers can build robust, scalable AI assistants capable of processing complex, domain-specific queries while maintaining data privacy and efficiency.

This guide equips you with all the necessary tools to develop your own AI assistant, whether for knowledge retrieval, interactive chatbots, or domain-specific applications. With agentic AI, the future of autonomous, self-improving AI systems is closer than ever—offering new possibilities for intelligent automation, personalized AI, and deeper reasoning capabilities in real-world use cases.




Next Steps 

Launching an AI-powered application or building an AI feature doesn’t require massive upfront investment or a dedicated internal team. Superteams.ai enables businesses to start with a focused, cost-effective proof-of-concept—using your existing data—to validate ROI before scaling. 

Whether you’re struggling with low accuracy in current LLM implementations or have no AI expertise in-house, our pre-vetted engineers handle the heavy lifting: from data cleaning and pipeline design to precision tuning and deployment. Once our work is complete, we transfer the know-how to your team, with documentation and a working setup.

Ready to get started?

Let’s discuss your data, goals, and challenges. In 30 minutes, we’ll outline a roadmap to build an AI system that delivers accurate, reliable, and actionable results—not hallucinations.

Request a meeting now:

Book a Discovery Call | Get a Demo

