This blog explores ways to enhance LLM app accuracy, the role of data ingestion, and how LLMs can improve data quality for better performance.
Recently we worked on an LLM-powered AI assistant, where we had to deal with a number of different data formats. The goal was deceptively simple: build an assistant that can reason over data from PDF documents (reports with graphs, charts), surveys (question-answer text), the company’s SQL database (PostgreSQL), and the NoSQL database (MongoDB).
The company had tried to throw everything into a vector database using a RAG framework, hoping the assistant would give highly accurate results. When the results fell short of expectations, they engaged us to build a proof-of-concept demo on a subset of their data.
What had gone wrong? The answer was simple: the quality of any LLM-powered assistant is highly dependent on the accuracy of the data retrieval. This is the data that the LLM uses as its context to generate responses. If you want accuracy, you have to focus on data quality.
In this blog, we will break down the different approaches you can take to improve the accuracy and quality of your LLM-powered applications. We will also show how data ingestion plays a key role in this, and how you can use LLMs more effectively to clean up your data.
Before we begin, let’s understand the LLM context window and why the retrieval step is important.
RAG enhances LLMs by grounding responses in retrieved context. The typical RAG workflow is: ingest and index your data, retrieve the pieces relevant to the user’s query, add them to the prompt, and have the LLM generate a response grounded in that context.
However, a common question that many developers ask is - can’t you simply use a large context window LLM and throw all data at it? You can, but there are some caveats.
Large language models (LLMs) have a context window length. This defines the total number of tokens (short pieces of text) you can send in a single prompt. Gemini 1.5 Pro has one of the largest context windows at 2 million tokens. With other models, such as OpenAI’s GPT-4o or Cohere’s Command-R, the context window is much smaller (around 128K tokens).
Also, remember that LLMs are stateless – they don’t remember history on their own. This means that if you want the LLM to also use conversation memory, you have to additionally add it to the prompt along with the context. This limits the context window available in many applications.
Additionally, LLMs suffer from the ‘Lost in the Middle’ problem, where the model may lose information that is present in the middle of long contexts. This means that the context you present to the LLM for generation should be accurate and high quality. And this, in turn, means that if you have a large volume of data, you need high quality retrieval approaches to fetch relevant information for LLM generation.
The quality of Retrieval-Augmented Generation (RAG) systems therefore depends heavily on data. You have to ensure high-quality data in your data stores and, if required, perform pre-processing to improve data accuracy.
Key Insight: RAG’s accuracy depends on retrieval precision and contextual relevance. Poor retrieval = hallucinated answers.
Many developers associate RAG exclusively with vector stores or semantic search. That view is too narrow. RAG systems can use any of the following kinds of retrieval to prompt the LLM:
In semantic retrieval, you use dense vector embeddings to perform search in a dataset. Vector embeddings are numerical representations of data generated by AI models (embedding models). You convert your dataset and your query into embedding vectors, and then search through the underlying store to find vectors closest to the query vector. You then use the data retrieved to prompt the LLM.
For example, if you have a dataset on football news, with text such as “Messi scored a fabulous goal in the World Cup Final 2024”, semantic search would be able to find data using a query string like “top moments in football”, as these strings and the query vector are likely to be close to each other in vector space.
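Here is a minimal sketch of that flow, assuming the sentence-transformers library and an in-memory list of documents (in production you would store the embeddings in a vector database such as Qdrant or Weaviate):

from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; all-MiniLM-L6-v2 is a small, common default
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Messi scored a fabulous goal in the World Cup Final 2024",
    "The central bank raised interest rates by 25 basis points",
]

# Embed the documents and the query into the same vector space
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode("top moments in football", convert_to_tensor=True)

# Cosine similarity: the football sentence scores highest, so it becomes the LLM's context
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_match = documents[int(scores.argmax())]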
Semantic search is typically what you would use when you have unstructured data in the form of PDFs or text documents. If you have images, you have to use a vision language model (like PaliGemma or Llama 3.2 Vision) to describe the image as text. If you have audio, you should use a speech-to-text model like Whisper to transcribe it, and then convert the transcript into embeddings.
You can also retrieve results using keyword search, and use the resulting data to prompt the LLM. Keyword search algorithms like BM25, implemented by search engines such as ElasticSearch, are highly capable of finding documents based on keywords, prefixes, or phrases. These engines can even leverage synonyms or similar words to find results.
Keyword search as a retrieval technique is highly useful when your data contains domain vocabulary, such as stock ticker symbols (e.g., META, AAPL, or NVDA). Modern vector search engines also support this kind of matching in the form of sparse vectors, where you go through a similar embedding-creation process as with dense vectors.
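As an illustration, here is a minimal keyword-retrieval sketch using the rank_bm25 package (an assumed choice for the example; a search engine like ElasticSearch gives you the same capability at scale):

from rank_bm25 import BM25Okapi

documents = [
    "NVDA rallied after strong GPU demand",
    "META announced a new open-weight model",
    "AAPL unveiled its latest iPhone lineup",
]

# BM25 works on tokenized text; a simple lowercase split is enough for a sketch
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "nvda gpu demand".lower().split()
scores = bm25.get_scores(query_tokens)   # one relevance score per document
top_doc = documents[scores.argmax()]     # the exact ticker match wins, which is what you want here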
On the other hand, if you want to query your existing search engine directly, you can also use LLMs to generate exact queries. For example, if you have movie data in ElasticSearch, you can ask the LLM to generate an exact query to find movies about ‘Artificial Intelligence’, and it will produce something like:
{
  "query": {
    "match": {
      "plot": {
        "query": "artificial intelligence",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can then run this query against ElasticSearch and retrieve data for RAG.
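A sketch of running that generated query with the official Elasticsearch Python client could look like this (the movies index and the local cluster URL are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The LLM-generated query clause from above
query_body = {
    "match": {
        "plot": {
            "query": "artificial intelligence",
            "fuzziness": "AUTO",
        }
    }
}

response = es.search(index="movies", query=query_body, size=5)
retrieved_docs = [hit["_source"] for hit in response["hits"]["hits"]]  # context for the LLM prompt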
The big difference between sparse vector search and plain keyword search is that sparse vectors also leverage the underlying similarity between terms to retrieve results.
Most applications have their data tucked away in SQL databases like PostgreSQL, MySQL, or MariaDB. LLMs are becoming increasingly good at translating natural language questions into SQL queries, provided you supply the table schema as part of the context.
In other words, you can use LLMs to generate SQL queries, and then use those queries to perform retrieval from your database. You can then present the resulting data to the LLM for generation. For example, if you have a table schema:
CREATE TABLE stock_prices (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10),
    date DATE,
    open_price DECIMAL(10,2),
    close_price DECIMAL(10,2),
    high_price DECIMAL(10,2),
    low_price DECIMAL(10,2),
    volume BIGINT
);
You then present the above schema and ask the LLM to generate SQL for a request like:
“Generate an SQL query to find the best-performing stock between NVIDIA (NVDA) and Meta (META) over the last 6 months based on percentage gain.”
The LLM may generate a query like:
WITH price_changes AS (
    SELECT
        symbol,
        (MAX(close_price) - MIN(open_price)) / MIN(open_price) * 100 AS percentage_gain
    FROM stock_prices
    WHERE date >= CURRENT_DATE - INTERVAL '6 months'
      AND symbol IN ('NVDA', 'META')
    GROUP BY symbol
)
SELECT symbol, percentage_gain
FROM price_changes
ORDER BY percentage_gain DESC
LIMIT 1;
You can then programmatically execute the query and retrieve the right results.
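Here is a minimal sketch of that execution step, assuming PostgreSQL with the psycopg2 driver (connection details are placeholders):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="marketdata", user="app", password="secret")

# The LLM-generated SQL from above
generated_sql = """
    WITH price_changes AS (
        SELECT symbol,
               (MAX(close_price) - MIN(open_price)) / MIN(open_price) * 100 AS percentage_gain
        FROM stock_prices
        WHERE date >= CURRENT_DATE - INTERVAL '6 months' AND symbol IN ('NVDA', 'META')
        GROUP BY symbol
    )
    SELECT symbol, percentage_gain FROM price_changes ORDER BY percentage_gain DESC LIMIT 1;
"""

with conn.cursor() as cur:
    cur.execute(generated_sql)
    rows = cur.fetchall()

# Format the rows as plain text and include them in the LLM prompt as retrieved context
context = "\n".join(str(row) for row in rows)

In production, execute LLM-generated SQL with a read-only database user and validate the query before running it.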
Knowledge graph retrieval, a technique that has come to be known as GraphRAG, is another powerful approach to retrieving data. Knowledge graphs store information in the form of entities (nodes) and relationships (edges) between them, and both nodes and edges can have properties associated with them. In graph databases such as Neo4j, you retrieve data using Cypher queries.
Once again, you use the LLM to generate Cypher statements from your dataset, store the resulting graph in a graph database, and then query it when you want to retrieve data for generation. For example, an LLM can convert a text string like “Taylor Swift's Eras Tour concluded after 21 months with a record-breaking gross of over $2 billion” into Cypher like this:
MERGE (t:Artist {name: "Taylor Swift"})
MERGE (tour:Tour {name: "Eras Tour"})
SET tour.duration_months = 21, tour.gross = 2000000000, tour.currency = "USD", tour.status = "concluded", tour.record = "record-breaking"
MERGE (t)-[:PERFORMED]->(tour)
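At query time, the flow runs in reverse: the LLM translates the user’s question into a Cypher MATCH query, which you then execute against the graph. Here is a sketch using the official Neo4j Python driver (the connection details and the question are illustrative):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher the LLM might generate for "How much did the Eras Tour gross?"
cypher = """
MATCH (a:Artist {name: "Taylor Swift"})-[:PERFORMED]->(tour:Tour {name: "Eras Tour"})
RETURN tour.gross AS gross, tour.duration_months AS months
"""

with driver.session() as session:
    records = [record.data() for record in session.run(cypher)]

driver.close()
# records, e.g. [{"gross": 2000000000, "months": 21}], become the context for generation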
You can also use your data in NoSQL databases like MongoDB to perform retrieval. The approach here is similar to how we used SQL or Cypher queries to retrieve data.
Suppose you have news articles in your MongoDB collection, like:
news_article = {
    "symbol": "NVDA",
    "title": "NVIDIA Dips 10% After DeepSeek Announcement",
    "content": "DeepSeek AI announced its latest AI model DeepSeek-R1, and it caused ripples across the AI community, crashing NVIDIA stock.",
    "date": "2025-01-27",
    "tags": ["AI", "Stock Market", "NVIDIA", "GPUs"]
}
During the retrieval step in RAG, you can use the LLM to generate query keywords from the user’s question, and then use them to perform the search:
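# Note: the $text operator requires a text index, e.g. collection.create_index([("title", "text"), ("content", "text")])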
query = "NVIDIA"
results = collection.find({"$text": {"$search": query}}).limit(3)
Then use the results to prompt the LLM for generation.
You can use a similar approach if your data is in key-value stores, like Redis or DynamoDB.
For time-sensitive applications (e.g., stock prices, sensor readings), time-series databases like InfluxDB or OpenTSDB can be used for retrieval. LLMs can generate queries that extract trends, patterns, and anomalies over time, enhancing contextual retrieval for RAG-based applications.
Here’s an example:
For a user query - “Identify temperature anomalies in warehouse sensors over the past week” - the LLM generated the following InfluxDB query in the Flux language (note that anomalyDetection() is not a built-in Flux function; in practice it would be a custom function, task, or contributed package):
from(bucket: "sensor_data")
    |> range(start: -7d)
    |> filter(fn: (r) => r._measurement == "temperature" and r._field == "value")
    |> aggregateWindow(every: 1h, fn: mean)
    |> anomalyDetection(
        method: "mean_std",
        threshold: 3.0  // Detect values 3 standard deviations from the mean
    )
This query may return a record like:
{
  "timestamp": "2024-05-10T14:00:00Z",
  "sensor_id": "WH1-TEMP-002",
  "value": 42.5,
  "anomaly_score": 3.2,
  "metadata": {
    "location": "Warehouse A",
    "threshold": "Normal range: 18°C-25°C"
  }
}
In the generation step, the LLM will use the above data to generate a response like:
“Sensor WH1-TEMP-002 in Warehouse A recorded an anomalous temperature of 42.5°C on May 10th, exceeding the normal range by 3.2 standard deviations. Recommend immediate inspection.”
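Whatever the retrieval source, the generation step looks roughly the same: format the retrieved records as context and prompt the LLM. Here is a sketch using the OpenAI Python client (any chat-capable LLM can be swapped in; the model name is an assumption):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved_record = {
    "timestamp": "2024-05-10T14:00:00Z",
    "sensor_id": "WH1-TEMP-002",
    "value": 42.5,
    "metadata": {"location": "Warehouse A", "threshold": "Normal range: 18°C-25°C"},
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. If the context is insufficient, say so."},
        {"role": "user", "content": "Context:\n" + json.dumps(retrieved_record, indent=2)
            + "\n\nQuestion: Identify temperature anomalies in warehouse sensors over the past week."},
    ],
)
print(response.choices[0].message.content)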
Key Insight: RAG can use different retrieval techniques. Tailor your retrieval based on where you have your data.
To maximize retrieval accuracy and minimize LLM hallucinations, data quality must be prioritized at every stage—from ingestion to retrieval. As you saw above, your approach to data cleanup or preprocessing will depend on the type of data you have.
Below are some actionable tactics you can use to refine your data pipeline, tailored to common formats like PDFs, surveys, SQL/NoSQL databases, and knowledge graphs.
Goal: Eliminate noise and standardize formats to ensure consistency.
Tactics:
Example:
For PDF reports with charts, extract only the chart title, axis labels, and conclusions—and summarize the image using vision models. This reduces noise and focuses on human-interpretable insights.
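One way to do this, assuming a vision-capable model behind the OpenAI chat API (the file name and prompt are illustrative), is to send the chart image with a tightly scoped prompt:

import base64
from openai import OpenAI

client = OpenAI()

with open("q4_revenue_chart.png", "rb") as f:  # a chart image extracted from the PDF
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this chart: return only the title, axis labels, and the main conclusion."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
chart_summary = response.choices[0].message.content  # stored alongside the report text for retrieval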
Goal: Condense data into meaningful chunks to avoid overwhelming the LLM.
Tactics:
Example:
In MongoDB news articles, extract entities like symbol: NVDA, event: DeepSeek-R1 launch, and impact: stock dip 10% to create concise, searchable metadata.
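A sketch of that extraction step, using an LLM constrained to return JSON (the model name, prompt, and field names are illustrative):

import json
from openai import OpenAI

client = OpenAI()

article = ("NVIDIA Dips 10% After DeepSeek Announcement. DeepSeek AI announced its latest AI model "
           "DeepSeek-R1, and it caused ripples across the AI community, crashing NVIDIA stock.")

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": "Extract symbol, event, and impact from this article as a JSON object:\n" + article,
    }],
)
metadata = json.loads(response.choices[0].message.content)
# e.g. {"symbol": "NVDA", "event": "DeepSeek-R1 launch", "impact": "stock dip 10%"}
# Store this metadata alongside the article document to enable filtered retrieval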
Goal: Attach contextual tags to enable hybrid filtering (semantic + metadata).
Tactics:
Example:
A PDF chart titled “Q4 Revenue Growth” could be tagged with section=financials, chart_type=bar, and metric=revenue. This allows queries like:
vector_search(query="Q4 performance") + filter(section="financials")
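Concretely, with a vector database that supports payload filtering (Qdrant is assumed here; most vector stores offer an equivalent), the hybrid query might look like this:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
model = SentenceTransformer("all-MiniLM-L6-v2")

hits = client.search(
    collection_name="reports",    # hypothetical collection of tagged PDF chunks
    query_vector=model.encode("Q4 performance").tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="financials"))]
    ),
    limit=5,
)
# Only chunks tagged section=financials are scored semantically against the query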
Goal: Align data with retrieval methods (SQL, Graph, etc.) to simplify querying.
Tactics:
Example:
For GraphRAG, preprocess news articles into nodes (Company: NVIDIA, Event: Stock Dip) and edges (CAUSED_BY → DeepSeek-R1 Launch).
Goal: Balance context preservation with retrieval efficiency.
Tactics:
Example:
A 100-page PDF report, for example, is better split along its own structure (executive summary, individual sections, appendix tables) than at arbitrary fixed-size intervals.
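Here is a minimal structure-aware chunking sketch in plain Python, splitting on markdown-style headings (the section names and text are illustrative):

import re

report_text = """# Executive Summary
Revenue grew year over year...
# Q4 Revenue Growth
The bar chart shows...
# Appendix
Detailed tables..."""

# Split at headings so each chunk keeps a full section together,
# instead of cutting mid-paragraph at a fixed token count
sections = re.split(r"(?m)^(?=# )", report_text)
chunks = [
    {"heading": s.splitlines()[0].lstrip("# ").strip(), "text": s.strip()}
    for s in sections if s.strip()
]
# Each chunk is embedded and stored with its heading as metadata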
Goal: Fill gaps in sparse or unbalanced datasets.
Tactics:
Example:
If your SQL database lacks data for “Meta stock performance in 2023,” use the LLM to simulate plausible entries for testing retrieval robustness.
Goal: Continuously refine data quality through feedback loops.
Tactics:
By treating data quality as a dynamic, domain-specific challenge—not a checkbox—you ensure your RAG system retrieves exactly what the LLM needs, nothing more, nothing less.
The choice of LLM (e.g., GPT-4, Claude, Llama, DeepSeek, Qwen, Mistral, or Gemini) plays a role in the quality of generated responses. Not all LLMs are created equal: some excel at code, while others excel at reasoning over images. Mix LLMs in your pipeline if required.
You should also remember that the underlying architecture of your data pipeline is equally critical. Even the most advanced LLMs will produce low-quality or hallucinated outputs if fed poorly processed, irrelevant, or contextually fragmented data.
Naive approaches, such as relying on generic text chunking libraries (e.g., splitting documents at fixed token intervals) or ignoring domain-specific data structures, often result in fragmented, low-relevance context.
For instance, automatically chunking a technical research paper into 500-token segments might break apart equations, diagrams, and their explanations, rendering the retrieved context useless for answering precise questions.
Similarly, using a general-purpose embedding model for medical data, without fine-tuning on clinical terminology, can degrade retrieval accuracy by up to 40%. Check the MTEB leaderboard for the latest open fine-tuned embedding models, or create your own.
Domain-Specific Customization:
Hybrid Retrieval:
Feedback Loops:
Key Insight: A state-of-the-art LLM is only as good as the data it receives. Invest in an architecture that treats data preprocessing, retrieval, and context formatting as first-class citizens.
Superteams.ai specializes in assembling pre-vetted AI engineering teams to tackle the end-to-end challenges of developing precise, production-ready LLM applications. Unlike generic AI vendors, we focus first on data quality and domain-specific architecture, ensuring retrieval pipelines, chunking strategies, and context formatting align with your unique data and use case.
When we collaborate, our first goal is to transform messy, unstructured data (PDFs, databases, multimedia) into clean, semantically rich inputs optimized for LLMs, avoiding the pitfalls of one-size-fits-all solutions. Once data cleanup is complete, we put together a workflow and create a FastAPI-powered API that your application can use.
Here’s how we engage with SaaS companies and enterprises:
Discovery & Data Audit:
Proof-of-Concept (PoC) Development:
Iterative Optimization:
Scaling and Deployment:
Knowledge Transfer:
Launching an AI-powered application or building an AI feature doesn’t require massive upfront investment or a dedicated internal team. Superteams.ai enables businesses to start with a focused, cost-effective proof-of-concept—using your existing data—to validate ROI before scaling.
Whether you’re struggling with low accuracy in your current LLM implementation or have no AI expertise in-house, our pre-vetted engineers handle the heavy lifting: from data cleaning and pipeline design to precision tuning and deployment. Once our work is complete, we transfer the know-how to your team, along with documentation and a working setup.
Ready to get started?
Let’s discuss your data, goals, and challenges. In 30 minutes, we’ll outline a roadmap to build an AI system that delivers accurate, reliable, and actionable results—not hallucinations.
Request a Meeting Now:
Book a Discovery Call | Email: founders@superteams.ai