Learn to build an AI assistant using DeepSeek-R1’s reasoning model with an agentic RAG architecture for insightful responses over large knowledge bases.
In this tutorial, you'll learn how to build an AI assistant that uses DeepSeek-R1's powerful reasoning model to provide relevant and insightful responses. The guide uses an agentic architecture grounded in retrieval-augmented generation (RAG) to create an AI assistant capable of accessing and reasoning over large knowledge bases.
By the end of this tutorial, you'll be equipped with the knowledge to build your own AI-driven assistant that can intelligently respond to movie-related queries, whether about specific movie details or general recommendations.
Ready to dive in? Let's get started!
DeepSeek-R1 represents a groundbreaking shift in reasoning-driven AI, using large-scale reinforcement learning (RL) to cultivate advanced reasoning abilities in Large Language Models (LLMs). It is part of a broader family of models that includes DeepSeek-R1-Zero, trained with pure RL, and a set of distilled variants based on Qwen and Llama, covering tasks from raw reasoning to conversational use.
DeepSeek-R1 is architected around a Transformer enhanced with Mixture of Experts (MoE), which selectively activates a subset of its parameters during inference. This approach is both computationally efficient and scalable, allowing the model to tackle complex reasoning tasks without the immense computational burden of a dense LLM of comparable size.
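To make the MoE idea concrete, here is a toy sketch of top-k expert routing in plain NumPy. It only illustrates the concept of activating a few experts per token; it is not DeepSeek-R1's actual routing code, and every name and shape below is illustrative.

import numpy as np

def moe_layer(x, experts, gate_weights, top_k=2):
    # Toy Mixture-of-Experts routing: score all experts, run only the best top_k
    logits = x @ gate_weights                         # one gating logit per expert
    chosen = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                          # softmax over the chosen experts only
    # Only the selected experts execute; the rest of the parameters stay idle
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

rng = np.random.default_rng(0)
hidden, num_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(hidden, hidden)): v @ W for _ in range(num_experts)]
gate_weights = rng.normal(size=(hidden, num_experts))
print(moe_layer(rng.normal(size=hidden), experts, gate_weights).shape)  # (16,)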
By innovatively combining reinforcement learning, efficient architecture, and strategic fine-tuning, DeepSeek-R1 pushes the boundaries of what’s achievable in reasoning-focused AI, paving the way for more intelligent, accessible, and versatile AI solutions.
Ollama simplifies running LLMs locally by handling model downloads, quantization, and execution seamlessly.
Step 1: Install Ollama
First, download and install Ollama from the official website.
Step 2: Download and Run DeepSeek-R1
Let’s test the setup and download our model. Ollama offers a range of DeepSeek-R1 models, spanning from 1.5B parameters to the full 671B-parameter model. The 671B model is the original DeepSeek-R1, while the smaller models are distilled versions based on Qwen and Llama architectures. Launch the terminal and run the following command, replacing X with the parameter size you want (1.5b, 7b, 8b, 14b, 32b, or 70b; use 671b for the full model only if your hardware supports it):
ollama run deepseek-r1:Xb
For instance, if you want to test it locally, you can run this command below:
ollama run deepseek-r1:1.5b
With this flexibility, you can use DeepSeek-R1's capabilities even if you don’t have a supercomputer.
Step 3: Run DeepSeek-R1 in the Background
To run DeepSeek-R1 continuously and serve it via an API, start the Ollama server:
ollama serve
This will make the model available for integration with other applications.
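With the server running, other applications can talk to the model over Ollama's HTTP API, which listens on port 11434 by default. Here is a minimal sketch using the requests library (installed later in this tutorial); it calls the standard /api/generate endpoint, but double-check the response fields against your Ollama version.

import requests

# Minimal sketch: query the local Ollama server over HTTP (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:1.5b",
        "prompt": "In one sentence, what is retrieval-augmented generation?",
        "stream": False,  # ask for a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])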
If you encounter a "Port Busy" error when trying to run Ollama, you may need to change the port where Ollama is being served. Follow these steps to resolve the issue:
Set a New Port for Ollama: Run the following command in your terminal to configure Ollama to listen on a different port (on Windows cmd, use set instead of export):
export OLLAMA_HOST=127.0.0.1:11435
Start the Ollama Server: Once the new port is set, start the Ollama service with this command:
ollama serve
This will bind Ollama to the new port (11435) and resolve any port conflicts that might have occurred.
Now that we have DeepSeek-R1 running, we can use it in our code. The Ollama Python library provides a convenient interface to a wide range of open-source models, making it straightforward to integrate LLM inference and embedding generation.
We can run Ollama in any integrated development environment (IDE) of choice. You can install the Ollama Python package using the following code:
!pip install ollama
Once the Ollama library is installed, we can test chat completion using the ollama.chat function.
import ollama

response = ollama.chat(
    model="deepseek-r1:1.5b",  # use whichever tag you pulled earlier
    messages=[
        {"role": "user", "content": "Explain Newton's second law of motion"},
    ],
)

print(response["message"]["content"])
The ollama.chat() function takes the model name and a user prompt, processing it as a conversational exchange. The script then extracts and prints the model's response.
The model generates this response:
<think>
Okay, so I need to explain Newton's Second Law of Motion. Hmm, where do I start? I remember from school that Newton was an important scientist who contributed a lot to physics. The second law sounds like it's about how things move when forces are applied.
Let me think... Newton's laws were the big three, right? First about inertia, then acceleration with force, and maybe third about action and reaction. So I should probably focus on the second one this time.
Alright, so from what I remember, Newton's Second Law is something like F equals mass times acceleration. But how does that work exactly? Force equals mass multiplied by acceleration. That means if you have a bigger force, it will cause more acceleration on an object with less mass or more on one with more mass?
Wait, let me get this straight. If I push a box harder, it moves faster. So the same force applied over a larger mass would give a smaller acceleration. Makes sense because less mass means less resistance to movement.
But how does that relate to everyday situations? Like when you jump, your foot exerts a force on the ground, and the ground applies an equal and opposite force pushing you up. That's the reaction part of Newton's third law, right?
And then in the car, if I push back on the dashboard with my hand (which has less mass than the vehicle), the car moves forward because it has more inertia. So F equals m times a comes into play there too.
I should also think about how this relates to acceleration and force. The greater the force applied, the faster something accelerates. And if you have a larger mass, even with the same force, it won't accelerate as much. That seems counterintuitive at first because I thought bigger objects move slower when pushed.
But in reality, because objects have more inertia (mass), they resist changes in motion more, so for the same force, acceleration is lower. So if you have a car and someone gets into it suddenly, the force from the seatbelt or something has to push all those passengers forward, which increases their acceleration beyond what an empty car would do.
Wait, maybe I'm mixing this up. If you apply the brakes suddenly, the momentum of the vehicle goes into your body, right? So that's why it's hard to stop without craning your neck. That must be related to force and inertia again.
I should also consider how this applies in more complex scenarios, like rockets or something. Newton's laws definitely explain how a rocket expels gas, which creates thrust, causing acceleration in the opposite direction.
Oh, I think it's important to mention that mass is a scalar quantity here, so it doesn't have direction. Only force and acceleration vectors matter. The law works regardless of the frame of reference because it's an absolute statement about how forces affect accelerations.
Wait, but what if something isn't moving? Like if I'm sitting in a train and I push the back seat with my hands, but the train is moving at constant speed. The force on my body from my hands doesn't cause acceleration relative to the train. So that's why it's tricky sometimes--relative vs absolute.
So Newton's Second Law has implications beyond just everyday situations. It explains how things like rockets work and how forces affect movement in physics experiments, as well as even how cars behave when you press the brakes.
I should make sure I explain the difference between force and acceleration clearly. Force is what causes the change in motion (acceleration), while mass affects that change. So if I have a heavy object with less force applied, it won't accelerate much because of its inertia.
Also, maybe give some examples beyond cars and rockets to show how this law applies everywhere. Like how a basketball rolls faster on ice than on grass--ice has more friction, which is mass, so the basketball (lower mass) moves faster despite being pushed with the same force by my foot.
I should also touch on units here because I often get confused with Newton's laws and other formulas. Force in newtons (N), mass in kilograms (kg), acceleration in meters per second squared (m/s²). So F = ma is straightforward as long as you remember the units correspond correctly.
But maybe some people find it confusing to visualize how all these variables interact. Maybe draw a free-body diagram or something if I can, but since I'm just explaining the law conceptually, perhaps not necessary right now.
Wait, what's acceleration? It's the rate of change of velocity. So whether speed is constant (no acceleration) or changes direction (centripetal acceleration), all that matters is how fast it's changing in each moment.
So putting it all together: Newton's Second Law relates force and mass to acceleration. The greater the net force, the greater the acceleration for a given mass. And if you have more mass, even with the same force, less acceleration happens because of inertia. This law explains why things move faster when pushed harder, how rockets work, and applies to so many other areas.
I think that covers it. Maybe add some historical context--like how Newton himself formulated the third law, which states that every action has an equal and opposite reaction, but that's not directly part of this explanation.
</think>
Newton's Second Law of Motion is a fundamental principle in physics that describes the relationship between force, mass, and acceleration. It states that the net force acting on an object is equal to the product of its mass and its acceleration (F = ma). Here's a structured explanation:
1. **Understanding the Components**:
- **Force (F)**: This is the external influence causing change in motion.
- **Mass (m)**: The quantity that measures resistance to acceleration, essentially how "heavy" an object is.
- **Acceleration (a)**: The rate at which velocity changes over time.
2. **Mathematical Representation**:
The law is expressed as F = ma. This equation shows that a greater force results in more acceleration for the same mass, and a larger mass results in less acceleration for the same force.
3. **Interpretation of Terms**:
- **Force**: Involves vectors, meaning direction matters (e.g., pushing forward or backward).
- **Acceleration**: Also involves vectors, indicating both speed and direction changes.
- **Mass**: A scalar quantity independent of direction, affecting resistance to motion.
4. **Historical Context**:
Sir Isaac Newton formulated the law in his "Principia Mathematica," along with three other fundamental laws, which together describe motion and forces comprehensively.
5. **Applications Beyond Simple Situations**:
- **Rockets**: Thrust from exhaust gases causes acceleration opposite to the direction of movement.
- **Cars**: Seatbelts create an unbalanced force that imparts momentum forward.
6. **Implications in Various Scenarios**:
- **Sports and Gymnastics**: Pushing against resistance leads to acceleration.
- ** thrown objects**: The force of launch causes them to accelerate through space.
7. **Free-Body Diagrams**:
A diagram illustrating the forces acting on an object, showing how net force determines acceleration.
In summary, Newton's Second Law explains that changes in motion are due to external forces, with mass and acceleration being directly related through F = ma, applicable across various physical phenomena.
So, we now have a working chat model with DeepSeek-R1. Let’s proceed to the next step, where we will use the nomic-embed-text embedding model to generate vector embeddings from our data and store them in the Qdrant vector search engine.
We will use the open-source embedding model nomic-embed-text, which ranks well on the MTEB leaderboard on Hugging Face.
Here’s how you can pull the model:
ollama pull nomic-embed-text
Now, we can use it like this:
ollama.embeddings(model='nomic-embed-text', prompt='The sky is blue because of rayleigh scattering')
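Since we will later size our Qdrant collection to match these vectors, it is worth confirming the embedding dimension with a quick check (nomic-embed-text is documented to produce 768-dimensional vectors):

# Sanity check: confirm the vector length before creating the collection
emb = ollama.embeddings(model="nomic-embed-text",
                        prompt="The sky is blue because of rayleigh scattering")
print(len(emb["embedding"]))  # expect 768 for nomic-embed-text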
To set up Qdrant locally and run it on port 6333 using Docker, follow these steps:
Pull the Qdrant Docker Image: First, pull the latest Qdrant image from Docker Hub by running the following command:
docker pull qdrant/qdrant
Run Qdrant on Port 6333: Once the image is downloaded, start a Qdrant container and bind it to port 6333 on your localhost. Run the following command:
docker run -p 6333:6333 qdrant/qdrant
You can then head to http://localhost:6333/dashboard (or http://<your_machine_ip>:6333/dashboard from another machine) to open the Qdrant web UI. If you are accessing it remotely, make sure port 6333 is open in your firewall.
Let’s now install the key libraries we will use: LangGraph and LangChain. LangChain provides a wide range of tools to build LLM applications, whereas LangGraph helps with building AI agents. We could have also used CrewAI instead of LangGraph, as it is equally powerful.
pip install langgraph qdrant-client langchain ollama requests python-dotenv
Let’s import the libraries so we can proceed with the code:
import os
import json
import uuid
import numpy as np
import ollama
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from dotenv import load_dotenv
from langgraph.graph.message import add_messages
from langgraph.graph import StateGraph
from typing_extensions import TypedDict
You may need to use load_dotenv to load your environment variables:
load_dotenv()
We had Qdrant running before, so we can now connect to it.
# Qdrant connection setup
qdrant_client = QdrantClient(host="localhost", port=6333)
# Create or connect to collection in Qdrant
COLLECTION_NAME = "movie_embeddings"
VECTOR_SIZE = 768  # nomic-embed-text embedding dimension
Qdrant has multitenancy support, and if you are building a SaaS application, you might want to leverage it. Multitenancy allows you to use the same collection for many customers, and segregate them by a tenant id.
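Here is a minimal sketch of that pattern: tag every point's payload with a tenant identifier when you upsert it, then filter on that field at query time. The tenant_id field name and customer_a value are illustrative placeholders.

from qdrant_client import models

# Sketch of payload-based multitenancy: every point is upserted with a
# "tenant_id" in its payload, and every search filters on that field.
query_vector = ollama.embeddings(model="nomic-embed-text",
                                 prompt="example tenant query")["embedding"]

hits = qdrant_client.query_points(
    collection_name=COLLECTION_NAME,
    query=query_vector,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id",
                                    match=models.MatchValue(value="customer_a"))]
    ),
    limit=5,
)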
Note that we will use 768 as the number of dimensions for our generated vectors, which is the dimension that nomic-embed-text generates.
In LangGraph, you have to maintain state between the various workflow nodes. Let’s create a TypedDict for the state that we plan to maintain:
class State(TypedDict):
    query: str
    answer: str
Let’s now write a function to generate embeddings.
def generate_embedding(text, model="nomic-embed-text"):
    text = text.replace("\n", " ")
    return ollama.embeddings(model=model, prompt=text).embedding
We will also write a function that takes the data as input, generates embeddings with the above function, and then saves them in Qdrant.
def update_vectordb(data: list):
    points = []
    status = False

    # Create collection if it does not exist
    if not qdrant_client.collection_exists(collection_name=COLLECTION_NAME):
        qdrant_client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )

    # Loop through the data and generate embeddings
    for item in data:
        insertion_string = (
            item["title"] + " " + item["summary"] + " " + str(item["year"])
            + " " + item["plot"] + " " + item["genre"]
        )

        # Generate embedding for the combined movie text
        vectorized_item = generate_embedding(insertion_string)
        point_id = str(uuid.uuid4())  # Unique ID for each point

        metadata = {
            "title": item["title"],
            "year": item["year"],
            "genre": item["genre"],
            "plot": item["plot"],
            "summary": item["summary"]
        }

        # Create PointStruct for Qdrant
        point = PointStruct(
            id=point_id,
            vector=vectorized_item,
            payload=metadata
        )
        points.append(point)

    # Insert points into Qdrant
    if points:
        qdrant_client.upsert(collection_name=COLLECTION_NAME, points=points)
        print(f"Successfully inserted {len(points)} embeddings into collection {COLLECTION_NAME}")
        status = True
    else:
        print("No valid content found.")

    return status
The update_vectordb function keeps the Qdrant vector database up to date with movie-related data, enabling efficient search and retrieval later on.
Let’s write another function to query Qdrant using a query embedding. This function will be useful later when we are stitching together the workflow.
# Function to query the vector database in Qdrant
def query_qdrant(query: str, limit=5):
    # Generate query vector
    query_vector = generate_embedding(query)

    # Query Qdrant
    result = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=limit,
        with_vectors=False
    )
    return result.points
The query_qdrant function embeds the query and searches the Qdrant collection for the most similar movies, powering the semantic-retrieval step of our pipeline.
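Once the movie data has been ingested (next step), you can sanity-check retrieval directly; the query string here is just an example:

# Quick check: inspect the top matches and their similarity scores
for point in query_qdrant("space exploration drama", limit=3):
    print(round(point.score, 3), point.payload["title"])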
We will now load the data from a JSON file…
# Path to your movies.json file
movies_json_file = "movies.json"
# Load the data from the JSON file
movie_data = load_movies_from_json(movies_json_file)
# Update Qdrant with the loaded movie data
update_vectordb(movie_data)
…and then upsert the data into Qdrant.
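The load_movies_from_json helper used above isn't defined elsewhere in this tutorial. A minimal version, assuming movies.json contains a list of objects with the title, summary, year, plot, and genre fields that update_vectordb expects, could look like this:

def load_movies_from_json(file_path: str) -> list:
    # Load a list of movie dicts (title, summary, year, plot, genre) from a JSON file
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)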
One mistake that many developers make is that they attempt to generate embeddings from the user query directly, without interpreting it. This has numerous problems.
For instance, if the user is looking for a movie released in 2024 and you perform a naive semantic search, your results will include movies from other years as well. The way to solve this is to use the LLM to break the query down into a search string and metadata, and then perform a semantic search that applies metadata filters (e.g., year: 2024). This yields far more accurate results. You can improve upon this tutorial using that tactic.
Now, let’s create a function that stitches together the query result and creates an LLM context from it:
def generate_movie_context(query_result):
    context = "Based on your query, here are some relevant movies:\n\n"

    for i, result in enumerate(query_result):
        movie = result.payload
        title = movie.get('title', 'Unknown')
        year = movie.get('year', 'N/A')
        genre = movie.get('genre', 'Unknown')
        plot = movie.get('plot', 'No plot available.')
        summary = movie.get('summary', 'No summary available.')

        context += f"Movie {i+1}:\n"
        context += f"Title: {title} ({year})\n"
        context += f"Genre: {genre}\n"
        context += f"Plot: {plot}\n"
        context += f"Summary: {summary}\n\n"

    return context
We can now put together the ‘generation’ function of our RAG system.
def deepseek_r1_rag(context, query):
    try:
        # Build a clear, specific system message to guide the response
        system_message = (
            "You are an intelligent assistant who generates human-friendly, clear, and relevant answers."
            " You will respond based on the given context and query. Make sure to give precise, natural-sounding responses."
            " Use the context information provided to enhance the accuracy and relevance of the answer."
            " Respond in 100 words."
        )

        # Send the context and query to the Ollama API for response generation
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[
                {"role": "user", "content": system_message},
                {"role": "user", "content": f"Context: {context}\nQuery: {query}"},
            ]
        )
        return response["message"]["content"]
    except Exception as e:
        return f"Error occurred: {str(e)}"
The deepseek_r1_rag function generates responses using the DeepSeek-R1 model in a RAG setup. It sets a system message to ensure concise, contextually relevant replies within 100 words. The function processes the user's query and retrieved context, sending them to ollama.chat(). This enables DeepSeek-R1 to generate accurate, context-aware answers.
We will also create a function where we directly pass the user query to the LLM without fetching context. This function will be our fallback function in case the LLM decides not to fetch data from the vector store.
def deepseek_r1_llm(query):
    try:
        # Build a clear, specific system message to guide the response
        system_message = (
            "You are an intelligent assistant who generates human-friendly, clear, and relevant answers."
            " You will respond based on the given query. Make sure to give precise, natural-sounding responses."
            " Answer the query given."
            " Respond in 100 words."
        )

        # Send the query to the Ollama API for response generation
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[
                {"role": "user", "content": system_message},
                {"role": "user", "content": f"Query: {query}"},
            ]
        )
        return response["message"]["content"]
    except Exception as e:
        return f"Error occurred: {str(e)}"
We can now create our retrieval-augmented generation workflow using DeepSeek-R1.
The following code shows how to create the workflow function using RAG context.
def workflow_with_rag(state: State):
    query = state["query"]
    result = query_qdrant(query)
    context = generate_movie_context(result)
    response = deepseek_r1_rag(context, query)

    if not response:
        return {"answer": "No response generated"}
    return {"answer": response}
This function implements a retrieval-augmented generation (RAG) workflow for processing movie-related queries. The function first retrieves relevant movie data from Qdrant by querying the vector database. It then generates a movie-specific context using the retrieved data. Finally, the context and query are sent to the DeepSeek-R1 model for response generation. If no response is generated, it returns a fallback message indicating no response. Otherwise, the function returns the generated answer.
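You can also exercise this node on its own before wiring up the graph; for example:

# Direct call, bypassing LangGraph (handy for debugging the RAG path)
print(workflow_with_rag({"query": "Multiverse adventure with Spider-Man", "answer": ""})["answer"])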
We will also create a workflow node that doesn’t use RAG. This is useful when the user’s query falls outside the scope of the stored knowledge base:
def workflow_without_rag(state: State):
    query = state["query"]
    response = deepseek_r1_llm(query)

    if not response:
        return {"answer": "No response generated"}
    return {"answer": response}
We will also need a node that routes the workflow based on the query.
# Router Node
def route_workflow(state: State):
    query = state["query"]

    # The top result's similarity score decides whether RAG context is worth using
    results = query_qdrant(query)
    query_result_score = results[0].score if results else 0.0

    if query_result_score > 0.5:
        print("Node Chosen -----> RAG")
        response = workflow_with_rag(state)
    else:
        print("Node Chosen -----> GENERIC")
        response = workflow_without_rag(state)

    # Ensure state always has "answer"
    state["answer"] = response.get("answer", "No answer")
    return state
This function is designed to route the user's query to either a RAG-based or a generic response workflow based on the similarity score of the query result. It first calculates the similarity score from the query result obtained from the Qdrant vector database. If the score is greater than 0.5, it routes the query to the RAG workflow, leveraging context-based generation. If the score is below the threshold, the query is processed using a generic workflow. Finally, the function ensures that the state always contains an "answer", whether from RAG or the fallback method.
Code Overview
# Create LangGraph
graph_builder = StateGraph(State)
# Add workflows as nodes
graph_builder.add_node("router", route_workflow)
# Set entry and finish points
graph_builder.set_entry_point("router")
graph_builder.set_finish_point("router")
# Compile the graph
graph = graph_builder.compile()
This code snippet demonstrates how to create and configure a LangGraph for a dynamic workflow system. It initializes a StateGraph instance and adds the route_workflow function as a node in the graph, which is responsible for determining the appropriate response workflow based on query similarity. The entry and finish points are both set to the "router" node, ensuring that the graph starts and ends with this decision point. Finally, the graph is compiled, making it ready for execution in the workflow system.
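Before building the interactive loop, you can run the compiled graph once with invoke(); the query below is just an example:

# Single-shot invocation of the compiled graph
result = graph.invoke({"query": "Recommend an animated multiverse movie", "answer": ""})
print(result["answer"])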
# Run the Agent
def run_agent():
    print("Welcome to the Movie Knowledge Assistant! Type 'quit' to exit.")

    while True:
        user_input = input("User: ")
        if user_input.lower() in ["quit", "exit"]:
            print("Goodbye!")
            break

        state_input = {"query": user_input, "answer": "None"}

        # Process the query through LangGraph
        for event in graph.stream(state_input):
            for value in event.values():
                print("Assistant Response ->: ", value.get("answer", "No answer"))
This code lets you interact with the Movie Knowledge Assistant through a text-based interface. The agent processes queries in a loop using LangGraph workflows, providing real-time responses: it listens for input, streams it through the compiled graph, and prints the answer. The session runs until you type “quit” or “exit.”
This provides an intuitive way to interact with the AI-powered movie assistant.
# Start the agent
if __name__ == "__main__":
    run_agent()
Let’s now test our workflow.
Let’s try a query:
Multiverse adventure with Spider-Man
You will be able to follow DeepSeek-R1’s ‘thought process’.
This is the context that the vector search fetched:
{
"relevant_results": [
{
"id": "7dedd85c-70aa-46d4-a4a2-7061a2eddf92",
"score": 0.79638296,
"title": "Spider-Man: Across the Spider-Verse",
"year": 2023,
"genre": "Animation, Action, Adventure",
"plot": "Miles Morales journeys across the multiverse to team up with other Spider-Men to face a new, greater threat.",
"summary": "Miles Morales must team up with different versions of Spider-Man across the multiverse to stop a new, dangerous adversary."
},
{
"id": "e6a4d365-36c3-4f33-80a2-cbcb5e935a29",
"score": 0.63772786,
"title": "Ant-Man and The Wasp: Quantumania",
"year": 2023,
"genre": "Action, Adventure, Comedy",
"plot": "Scott Lang and Hope van Dyne venture into the Quantum Realm where they encounter new threats and mysteries.",
"summary": "Ant-Man and the Wasp must navigate through an unknown realm, facing dangers and challenges that put their world at risk."
},
{
"id": "f4c28bf6-d23d-48fd-bac2-d64cb26c5fca",
"score": 0.6265694,
"title": "The Marvels",
"year": 2023,
"genre": "Action, Adventure, Fantasy",
"plot": "Captain Marvel, Monica Rambeau, and Kamala Khan team up to battle a new intergalactic threat.",
"summary": "Three Marvel heroes unite to fight a powerful force, navigating their powers and teaming up to save the universe."
},
{
"id": "8ac0c395-9f6b-4814-b111-0e86777aeb57",
"score": 0.57232714,
"title": "Guardians of the Galaxy Vol. 3",
"year": 2023,
"genre": "Action, Adventure, Comedy",
"plot": "The Guardians must protect Rocket from a powerful villain while dealing with their own personal conflicts.",
"summary": "The Guardians of the Galaxy face new challenges as they fight to save one of their own while grappling with personal demons."
}
]
}
This will generate a response text like this:
Assistant Response ->: <think>
Okay, so I need to come up with an answer that fits within 100 words for this query. Let me see what information is available.
The user wants a movie about a multiverse adventure with Spider-Man. Looking at the context provided:
Movie 1 is Spider-Man: Across the Spider-Verse (2023). It's set in a multiverse and involves Miles Morales teaming up with others to tackle a new threat. So this seems like a perfect fit.
The other movies are more about Ant-Man, Marvels, and Guardians of the Galaxy, which don't involve Spider-Man. So I can't pick those unless they're still relevant, but I think Movie 1 is the one that fits best.
I should make sure to mention all key points: Spider-Man's journey across a multiverse, different versions from various teams, facing new threats, and the summary provided. Also, it's worth mentioning that each team member has their own identities within the multiverse.
So structuring the answer to include the movie title, genre, plot elements related to the multiverse and superhero adventures, and summarize the summary given.
</think>
Spider-Man: Across the Spider-Verse (2023) is a stunning adventure where Miles Morales teams up with other superheroes across multiple multiverses to face new challenges. Based on the provided context, this film features a dynamic story that spans multiple universes, showcasing Spider-Man's journey and the heroes' teamwork.
That completes the basic RAG workflow architecture. Now, let’s leverage the structured-output functionality that Ollama provides.
Structured outputs are highly beneficial because they let us use AI as part of a wider workflow: Ollama makes the LLM’s output steerable by constraining it to a schema.
Now, using the Ollama Python library, pass the schema as a JSON object to the format parameter, either as a dictionary or, preferably, by using Pydantic to serialize the schema with model_json_schema().
from pydantic import BaseModel

class Movie(BaseModel):
    name: str
    cast: list
    budget: str

try:
    response = ollama.chat(
        model="deepseek-r1:1.5b",
        messages=[
            {
                'role': 'user',
                'content': 'Tell me about Interstellar: its name, cast, and budget',
            }
        ],
        format=Movie.model_json_schema(),
    )
    movie = Movie.model_validate_json(response.message.content)
    print(movie)
except Exception as e:
    print(f"Error occurred: {str(e)}")
Results
{
"name": "Interstellar Cast",
"cast": [
{
"name": "J. J. Cross",
"year": 2014,
"city": "New York",
"country": "United States"
},
{
"name": "Nate Frakes",
"year": 2017,
"city": "Los Angeles",
"country": "United States"
},
{
"name": "Ethan Hawke",
"year": 2018,
"city": "Los Angeles",
"country": "United States"
},
{
"name": "Kathleen O'Hara",
"year": 2017,
"city": "San Francisco",
"country": "United States"
},
{
"name": "Cristian Schaefer",
"year": 2015,
"city": "Los Angeles",
"country": "United States"
}
],
"budget": "$69 million"
}
As discussed earlier, it pays to run an extract_query_metadata step before performing a search in Qdrant. This step extracts structured metadata (e.g., year, genre) from the user's natural-language query. Once extracted, you can use this metadata to improve search quality by combining vector search (using embeddings) with structured filtering (metadata-based filtering in Qdrant).
Step 1: Extract Metadata from the Query
Use the extract_query_metadata function to get both the search query and the metadata filters.
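The extract_query_metadata function isn't defined earlier in this tutorial; here is a minimal sketch that uses Ollama's structured outputs, with an assumed schema of a query string plus optional year and genre filters:

from typing import Optional
from pydantic import BaseModel

class QueryMetadata(BaseModel):
    year: Optional[int] = None
    genre: Optional[str] = None

class StructuredQuery(BaseModel):
    query: str
    metadata: QueryMetadata

def extract_query_metadata(user_query: str):
    # Ask the LLM to split a natural-language request into search text plus filters
    try:
        response = ollama.chat(
            model="deepseek-r1:1.5b",
            messages=[{
                "role": "user",
                "content": ("Split this movie request into a semantic search query and any "
                            f"metadata filters (year, genre): {user_query}"),
            }],
            format=StructuredQuery.model_json_schema(),
        )
        parsed = StructuredQuery.model_validate_json(response.message.content)
        # Drop filters the model left empty
        return {"query": parsed.query,
                "metadata": parsed.metadata.model_dump(exclude_none=True)}
    except Exception as e:
        print(f"Metadata extraction failed: {e}")
        return None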
query = "Find AI movies from 1980"
structured_query = extract_query_metadata(query)

if structured_query:
    search_query = structured_query["query"]
    filters = structured_query["metadata"]
else:
    search_query = query
    filters = {}
Step 2: Perform a Filtered Search in Qdrant
In Qdrant, you can combine the vector search with a metadata filter:
from qdrant_client import QdrantClient, models

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

# Define the search request: vector similarity plus an optional year filter
search_results = client.search(
    collection_name="movie_embeddings",
    query_vector=generate_embedding(search_query),  # Convert query text to a vector
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="year",
                match=models.MatchValue(value=filters.get("year"))  # Apply year filter
            )
        ]
    ) if "year" in filters else None,
    limit=5  # Return top 5 results
)

# Print the search results
print(search_results)
Building an agentic AI assistant using DeepSeek-R1 is a powerful way to create intelligent, context-aware, and reasoning-driven AI applications. This guide walked through the core architecture of DeepSeek-R1, covering its Mixture of Experts (MoE) structure, reinforcement learning techniques (GRPO), and multi-stage training pipeline, all of which contribute to its superior reasoning capabilities.
The tutorial provided a step-by-step approach to deploying DeepSeek-R1 using Ollama, integrating Qdrant for vector search, and implementing a retrieval-augmented generation (RAG) model to enhance response accuracy. It also detailed how to structure AI workflows with LangGraph, dynamically routing queries between RAG-based contextual answers and general LLM responses based on query similarity.
Through structured output handling, metadata filtering, and efficient embedding management, the AI assistant can deliver more precise and relevant answers, particularly in specialized domains like movie recommendations. By leveraging DeepSeek-R1’s advanced reasoning, combined with vector databases and structured query filtering, developers can build robust, scalable AI assistants capable of processing complex, domain-specific queries while maintaining data privacy and efficiency.
This guide equips you with all the necessary tools to develop your own AI assistant, whether for knowledge retrieval, interactive chatbots, or domain-specific applications. With agentic AI, the future of autonomous, self-improving AI systems is closer than ever—offering new possibilities for intelligent automation, personalized AI, and deeper reasoning capabilities in real-world use cases.
Launching an AI-powered application or building an AI feature doesn’t require massive upfront investment or a dedicated internal team. Superteams.ai enables businesses to start with a focused, cost-effective proof-of-concept—using your existing data—to validate ROI before scaling.
Whether you’re struggling with low accuracy in current LLM implementations or have no AI expertise in-house, our pre-vetted engineers handle the heavy lifting: from data cleaning and pipeline design to precision tuning and deployment. Once the work is complete, we transfer the know-how to your team, along with documentation and a working setup.
Ready to get started?
Let’s discuss your data, goals, and challenges. In 30 minutes, we’ll outline a roadmap to build an AI system that delivers accurate, reliable, and actionable results—not hallucinations.
Request a meeting now: