This blog explores ways to enhance LLM app accuracy, the role of data ingestion, and how LLMs can improve data quality for better performance.
Recently we worked on an LLM-powered AI assistant, where we had to deal with a number of different data formats. The goal was deceptively simple: build an assistant that can reason over data from PDF documents (reports with graphs, charts), surveys (question-answer text), the company’s SQL database (PostgreSQL), and the NoSQL database (MongoDB).
The company had tried to throw everything into a vector database using a RAG framework, hoping the assistant would give highly accurate results. When the results fell short of expectations, they engaged us to build a proof-of-concept demo on a subset of their data.
What had gone wrong? The answer was simple: the quality of any LLM-powered assistant is highly dependent on the accuracy of the data retrieval. This is the data that the LLM uses as its context to generate responses. If you want accuracy, you have to focus on data quality.
In this blog, we will break down the different approaches you can take to improve the accuracy and quality of your LLM-powered applications. We will also show how data ingestion plays a key role in this, and how you can use LLMs more effectively to clean up your data.
Before we begin, let’s understand the LLM context window and why the retrieval step is important.
RAG enhances LLMs by grounding responses in retrieved context. The typical RAG workflow is: ingest and index your data, retrieve the pieces relevant to the user’s query, add them to the prompt, and have the LLM generate a response grounded in that context.
However, a common question that many developers ask is - can’t you simply use a large context window LLM and throw all data at it? You can, but there are some caveats.
Large language models (LLMs) have a context window length. This defines the total number of tokens (short pieces of text) you can send in a single prompt. Gemini 1.5 Pro has one of the largest context windows at 2 million tokens. With other models, such as OpenAI’s GPT-4o or Cohere’s Command-R, the context window is much smaller (around 128K tokens).
Also, remember that LLMs are stateless – they don’t remember history on their own. This means that if you want the LLM to also use conversation memory, you have to additionally add it to the prompt along with the context. This limits the context window available in many applications.
Additionally, LLMs suffer from the ‘Lost in the Middle’ problem, where the model may lose information that is present in the middle of long contexts. This means that the context you present to the LLM for generation should be accurate and high quality. And this, in turn, means that if you have a large volume of data, you need high quality retrieval approaches to fetch relevant information for LLM generation.
The quality of Retrieval-Augmented Generation (RAG) systems therefore depends heavily on data. You have to ensure high-quality data in your data stores and, if required, perform pre-processing to improve data accuracy.
Key Insight: RAG’s accuracy depends on retrieval precision and contextual relevance. Poor retrieval = hallucinated answers.
Many developers associate RAG exclusively with vector stores or semantic search. That view is too narrow. RAG systems can use any of the following kinds of retrieval to prompt the LLM:
In semantic retrieval, you use dense vector embeddings to perform search in a dataset. Vector embeddings are numerical representations of data generated by AI models (embedding models). You convert your dataset and your query into embedding vectors, and then search through the underlying store to find vectors closest to the query vector. You then use the data retrieved to prompt the LLM.
For example, if you have a dataset on football news, with text such as “Messi scored a fabulous goal in the World Cup Final 2024”, semantic search would be able to find data using a query string like “top moments in football”, as these strings and the query vector are likely to be close to each other in vector space.
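Here is a minimal sketch of that flow, assuming the sentence-transformers library and an in-memory list of documents (in production you would store the embeddings in a vector database such as Qdrant or Weaviate):

from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; all-MiniLM-L6-v2 is a small, common default
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Messi scored a fabulous goal in the World Cup Final 2024",
    "The central bank raised interest rates by 25 basis points",
]

# Embed the documents and the query into the same vector space
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode("top moments in football", convert_to_tensor=True)

# Cosine similarity: the football sentence scores highest, so it becomes the LLM's context
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_match = documents[int(scores.argmax())]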
Semantic search is typically what you would use when you have unstructured data in the form of PDFs or text documents. If you have images, you have to use a vision language model (like PaliGemma or Llama 3.2 Vision) to describe the image as text. If you have audio, you should use a speech-to-text model like Whisper to transcribe it, and then convert the transcript into embeddings.
You can also retrieve results using keyword search, and use the resulting data to prompt the LLM. Keyword search algorithms like BM25, implemented by search engines such as ElasticSearch, are highly capable of finding documents based on keywords, prefixes, or phrases. These engines can even leverage synonyms or similar words to find results.
Keyword search as a retrieval technique is highly useful when your data contains domain vocabulary, such as stock ticker symbols (e.g., META, AAPL, or NVDA). Modern vector search engines also support this kind of matching in the form of sparse vectors, where you go through a similar embedding-creation process as with dense vectors.
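As an illustration, here is a minimal keyword-retrieval sketch using the rank_bm25 package (an assumed choice for the example; a search engine like ElasticSearch gives you the same capability at scale):

from rank_bm25 import BM25Okapi

documents = [
    "NVDA rallied after strong GPU demand",
    "META announced a new open-weight model",
    "AAPL unveiled its latest iPhone lineup",
]

# BM25 works on tokenized text; a simple lowercase split is enough for a sketch
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "nvda gpu demand".lower().split()
scores = bm25.get_scores(query_tokens)   # one relevance score per document
top_doc = documents[scores.argmax()]     # the exact ticker match wins, which is what you want here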
On the other hand, if you want to query your existing search engine directly, you can also use LLMs to generate exact queries. For example, if you have movie data in ElasticSearch, you can ask the LLM to generate an exact query to find movies about ‘Artificial Intelligence’, and it will produce something like:
{
  "query": {
    "match": {
      "plot": {
        "query": "artificial intelligence",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can then run this query against ElasticSearch and retrieve data for RAG.
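A sketch of running that generated query with the official Elasticsearch Python client could look like this (the movies index and the local cluster URL are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The LLM-generated query clause from above
query_body = {
    "match": {
        "plot": {
            "query": "artificial intelligence",
            "fuzziness": "AUTO",
        }
    }
}

response = es.search(index="movies", query=query_body, size=5)
retrieved_docs = [hit["_source"] for hit in response["hits"]["hits"]]  # context for the LLM prompt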
The big difference between sparse vector search and plain keyword search is that sparse vectors also leverage the underlying similarity between terms to retrieve results.
Most applications have their data tucked away in SQL databases like PostgreSQL, MySQL, or MariaDB. LLMs are becoming increasingly good at translating natural language questions into SQL queries, provided you supply the table schema as part of the context.
In other words, you can use LLMs to generate SQL queries, and then use those queries to perform retrieval from your database. You can then present the resulting data to the LLM for generation. For example, if you have a table schema:
CREATE TABLE stock_prices (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10),
    date DATE,
    open_price DECIMAL(10,2),
    close_price DECIMAL(10,2),
    high_price DECIMAL(10,2),
    low_price DECIMAL(10,2),
    volume BIGINT
);
You then present the above schema and ask the LLM to generate SQL for a request like:
“Generate an SQL query to find the best-performing stock between NVIDIA (NVDA) and Meta (META) over the last 6 months based on percentage gain.”
The LLM may generate a query like:
WITH price_changes AS (
    SELECT
        symbol,
        (MAX(close_price) - MIN(open_price)) / MIN(open_price) * 100 AS percentage_gain
    FROM stock_prices
    WHERE date >= CURRENT_DATE - INTERVAL '6 months'
      AND symbol IN ('NVDA', 'META')
    GROUP BY symbol
)
SELECT symbol, percentage_gain
FROM price_changes
ORDER BY percentage_gain DESC
LIMIT 1;
You can then programmatically execute the query and retrieve the right results.
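Here is a minimal sketch of that execution step, assuming PostgreSQL with the psycopg2 driver (connection details are placeholders):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="marketdata", user="app", password="secret")

# The LLM-generated SQL from above
generated_sql = """
    WITH price_changes AS (
        SELECT symbol,
               (MAX(close_price) - MIN(open_price)) / MIN(open_price) * 100 AS percentage_gain
        FROM stock_prices
        WHERE date >= CURRENT_DATE - INTERVAL '6 months' AND symbol IN ('NVDA', 'META')
        GROUP BY symbol
    )
    SELECT symbol, percentage_gain FROM price_changes ORDER BY percentage_gain DESC LIMIT 1;
"""

with conn.cursor() as cur:
    cur.execute(generated_sql)
    rows = cur.fetchall()

# Format the rows as plain text and include them in the LLM prompt as retrieved context
context = "\n".join(str(row) for row in rows)

In production, execute LLM-generated SQL with a read-only database user and validate the query before running it.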
Knowledge graph retrieval, a technique that has come to be known as GraphRAG, is another powerful approach to retrieving data. Knowledge graphs store information in the form of entities (nodes) and relationships (edges) between them, and both nodes and edges can have properties associated with them. In graph databases such as Neo4j, you retrieve data using Cypher queries.
Once again, you use the LLM to generate Cypher statements from your dataset, store the resulting graph in a graph database, and then query it when you want to retrieve data for generation. For example, an LLM can convert a text string like “Taylor Swift's Eras Tour concluded after 21 months with a record-breaking gross of over $2 billion” into Cypher like this:
MERGE (t:Artist {name: "Taylor Swift"})
MERGE (tour:Tour {name: "Eras Tour"})
SET tour.duration_months = 21, tour.gross = 2000000000, tour.currency = "USD", tour.status = "concluded", tour.record = "record-breaking"
MERGE (t)-[:PERFORMED]->(tour)
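At query time, the flow runs in reverse: the LLM translates the user’s question into a Cypher MATCH query, which you then execute against the graph. Here is a sketch using the official Neo4j Python driver (the connection details and the question are illustrative):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher the LLM might generate for "How much did the Eras Tour gross?"
cypher = """
MATCH (a:Artist {name: "Taylor Swift"})-[:PERFORMED]->(tour:Tour {name: "Eras Tour"})
RETURN tour.gross AS gross, tour.duration_months AS months
"""

with driver.session() as session:
    records = [record.data() for record in session.run(cypher)]

driver.close()
# records, e.g. [{"gross": 2000000000, "months": 21}], become the context for generation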
You can also use your data in NoSQL databases like MongoDB to perform retrieval. The approach here is similar to how we used SQL or Cypher queries to retrieve data.
Suppose you have news articles in your MongoDB collection, like:
news_article = {
    "symbol": "NVDA",
    "title": "NVIDIA Dips 10% After DeepSeek Announcement",
    "content": "DeepSeek AI announced its latest AI model DeepSeek-R1, and it caused ripples across the AI community, crashing NVIDIA stock.",
    "date": "2025-01-27",
    "tags": ["AI", "Stock Market", "NVIDIA", "GPUs"]
}
During the retrieval step in RAG, you can use the LLM to generate query keywords from the user’s question, and then use them to perform the search:
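# Note: the $text operator requires a text index, e.g. collection.create_index([("title", "text"), ("content", "text")])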
query = "NVIDIA"
results = collection.find({"$text": {"$search": query}}).limit(3)
Then use the results to prompt the LLM for generation.
You can use a similar approach if your data is in key-value stores, like Redis or DynamoDB.
For time-sensitive applications (e.g., stock prices, sensor readings), time-series databases like InfluxDB or OpenTSDB can be used for retrieval. LLMs can generate queries that extract trends, patterns, and anomalies over time, enhancing contextual retrieval for RAG-based applications.
Here’s an example:
For a user query - “Identify temperature anomalies in warehouse sensors over the past week” - the LLM generated the following InfluxDB query in the Flux language (note that anomalyDetection() is not a built-in Flux function; in practice it would be a custom function, task, or contributed package):
from(bucket: "sensor_data")
    |> range(start: -7d)
    |> filter(fn: (r) => r._measurement == "temperature" and r._field == "value")
    |> aggregateWindow(every: 1h, fn: mean)
    |> anomalyDetection(
        method: "mean_std",
        threshold: 3.0  // Detect values 3 standard deviations from the mean
    )
This query may return a record like:
{
  "timestamp": "2024-05-10T14:00:00Z",
  "sensor_id": "WH1-TEMP-002",
  "value": 42.5,
  "anomaly_score": 3.2,
  "metadata": {
    "location": "Warehouse A",
    "threshold": "Normal range: 18°C-25°C"
  }
}
In the generation step, the LLM will use the above data to generate a response like:
“Sensor WH1-TEMP-002 in Warehouse A recorded an anomalous temperature of 42.5°C on May 10th, exceeding the normal range by 3.2 standard deviations. Recommend immediate inspection.”
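Whatever the retrieval source, the generation step looks roughly the same: format the retrieved records as context and prompt the LLM. Here is a sketch using the OpenAI Python client (any chat-capable LLM can be swapped in; the model name is an assumption):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved_record = {
    "timestamp": "2024-05-10T14:00:00Z",
    "sensor_id": "WH1-TEMP-002",
    "value": 42.5,
    "metadata": {"location": "Warehouse A", "threshold": "Normal range: 18°C-25°C"},
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. If the context is insufficient, say so."},
        {"role": "user", "content": "Context:\n" + json.dumps(retrieved_record, indent=2)
            + "\n\nQuestion: Identify temperature anomalies in warehouse sensors over the past week."},
    ],
)
print(response.choices[0].message.content)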
Key Insight: RAG can use different retrieval techniques. Tailor your retrieval based on where you have your data.
To maximize retrieval accuracy and minimize LLM hallucinations, data quality must be prioritized at every stage—from ingestion to retrieval. As you saw above, your approach to data cleanup or preprocessing will depend on the type of data you have.
Below are some actionable tactics you can use to refine your data pipeline, tailored to common formats like PDFs, surveys, SQL/NoSQL databases, and knowledge graphs.
Goal: Eliminate noise and standardize formats to ensure consistency.
Tactics:
Example:
For PDF reports with charts, extract only the chart title, axis labels, and conclusions—and summarize the image using vision models. This reduces noise and focuses on human-interpretable insights.
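One way to do this, assuming a vision-capable model behind the OpenAI chat API (the file name and prompt are illustrative), is to send the chart image with a tightly scoped prompt:

import base64
from openai import OpenAI

client = OpenAI()

with open("q4_revenue_chart.png", "rb") as f:  # a chart image extracted from the PDF
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this chart: return only the title, axis labels, and the main conclusion."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
chart_summary = response.choices[0].message.content  # stored alongside the report text for retrieval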
Goal: Condense data into meaningful chunks to avoid overwhelming the LLM.
Tactics:
Example:
In MongoDB news articles, extract entities like symbol: NVDA, event: DeepSeek-R1 launch, and impact: stock dip 10% to create concise, searchable metadata.
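A sketch of that extraction step, using an LLM constrained to return JSON (the model name, prompt, and field names are illustrative):

import json
from openai import OpenAI

client = OpenAI()

article = ("NVIDIA Dips 10% After DeepSeek Announcement. DeepSeek AI announced its latest AI model "
           "DeepSeek-R1, and it caused ripples across the AI community, crashing NVIDIA stock.")

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": "Extract symbol, event, and impact from this article as a JSON object:\n" + article,
    }],
)
metadata = json.loads(response.choices[0].message.content)
# e.g. {"symbol": "NVDA", "event": "DeepSeek-R1 launch", "impact": "stock dip 10%"}
# Store this metadata alongside the article document to enable filtered retrieval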
Goal: Attach contextual tags to enable hybrid filtering (semantic + metadata).
Tactics:
Example:
A PDF chart titled “Q4 Revenue Growth” could be tagged with section=financials, chart_type=bar, and metric=revenue. This allows queries like:
vector_search(query="Q4 performance") + filter(section="financials")
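Concretely, with a vector database that supports payload filtering (Qdrant is assumed here; most vector stores offer an equivalent), the hybrid query might look like this:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
model = SentenceTransformer("all-MiniLM-L6-v2")

hits = client.search(
    collection_name="reports",    # hypothetical collection of tagged PDF chunks
    query_vector=model.encode("Q4 performance").tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="financials"))]
    ),
    limit=5,
)
# Only chunks tagged section=financials are scored semantically against the query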
Goal: Align data with retrieval methods (SQL, Graph, etc.) to simplify querying.
Tactics:
Example:
For GraphRAG, preprocess news articles into nodes (Company: NVIDIA, Event: Stock Dip) and edges (CAUSED_BY → DeepSeek-R1 Launch).
Goal: Balance context preservation with retrieval efficiency.
Tactics:
Example:
A 100-page PDF report, for example, is better split along its own structure (executive summary, individual sections, appendix tables) than at arbitrary fixed-size intervals.
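Here is a minimal structure-aware chunking sketch in plain Python, splitting on markdown-style headings (the section names and text are illustrative):

import re

report_text = """# Executive Summary
Revenue grew year over year...
# Q4 Revenue Growth
The bar chart shows...
# Appendix
Detailed tables..."""

# Split at headings so each chunk keeps a full section together,
# instead of cutting mid-paragraph at a fixed token count
sections = re.split(r"(?m)^(?=# )", report_text)
chunks = [
    {"heading": s.splitlines()[0].lstrip("# ").strip(), "text": s.strip()}
    for s in sections if s.strip()
]
# Each chunk is embedded and stored with its heading as metadata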
Goal: Fill gaps in sparse or unbalanced datasets.
Tactics:
Example:
If your SQL database lacks data for “Meta stock performance in 2023,” use the LLM to simulate plausible entries for testing retrieval robustness.
Goal: Continuously refine data quality through feedback loops.
Tactics:
By treating data quality as a dynamic, domain-specific challenge—not a checkbox—you ensure your RAG system retrieves exactly what the LLM needs, nothing more, nothing less.
The choice of LLM (e.g., GPT-4, Claude, Llama, DeepSeek, Qwen, Mistral, or Gemini) plays a role in the quality of generated responses. Not all LLMs are created equal: some excel at code, while others excel at reasoning over images. Mix LLMs in your pipeline if required.
You should also remember that the underlying architecture of your data pipeline is equally critical. Even the most advanced LLMs will produce low-quality or hallucinated outputs if fed poorly processed, irrelevant, or contextually fragmented data.
Naive approaches, such as relying on generic text chunking libraries (e.g., splitting documents at fixed token intervals) or ignoring domain-specific data structures, often result in fragmented, low-relevance context.
For instance, automatically chunking a technical research paper into 500-token segments might break apart equations, diagrams, and their explanations, rendering the retrieved context useless for answering precise questions.
Similarly, using a general-purpose embedding model for medical data, without fine-tuning on clinical terminology, can degrade retrieval accuracy by up to 40%. Check the MTEB leaderboard for the latest open fine-tuned embedding models, or create your own.
Domain-Specific Customization:
Hybrid Retrieval:
Feedback Loops:
Key Insight: A state-of-the-art LLM is only as good as the data it receives. Invest in an architecture that treats data preprocessing, retrieval, and context formatting as first-class citizens.
Superteams.ai specializes in assembling pre-vetted AI engineering teams to tackle the end-to-end challenges of developing precise, production-ready LLM applications. Unlike generic AI vendors, we focus first on data quality and domain-specific architecture, ensuring retrieval pipelines, chunking strategies, and context formatting align with your unique data and use case.
When we collaborate, our first goal is to transform messy, unstructured data (PDFs, databases, multimedia) into clean, semantically rich inputs optimized for LLMs, avoiding the pitfalls of one-size-fits-all solutions. Once data cleanup is complete, we put together a workflow and create a FastAPI-powered API that your application can use.
Here’s how we engage with SaaS companies and enterprises:
Discovery & Data Audit:
Proof-of-Concept (PoC) Development:
Iterative Optimization:
Scaling and Deployment:
Knowledge Transfer:
Launching an AI-powered application or building an AI feature doesn’t require massive upfront investment or a dedicated internal team. Superteams.ai enables businesses to start with a focused, cost-effective proof-of-concept—using your existing data—to validate ROI before scaling.
Whether you’re struggling with low accuracy in your current LLM implementation or have no AI expertise in-house, our pre-vetted engineers handle the heavy lifting: from data cleaning and pipeline design to precision tuning and deployment. Once our work is complete, we transfer the know-how to your team, along with documentation and a working setup.
Ready to get started?
Let’s discuss your data, goals, and challenges. In 30 minutes, we’ll outline a roadmap to build an AI system that delivers accurate, reliable, and actionable results—not hallucinations.
Request a Meeting Now:
Book a Discovery Call | Email: founders@superteams.ai