Academy
Updated on
Nov 25, 2024

Crafting an AI-Powered Document Query Interface Using Langchain: A Step-by-Step Guide

This guide explains why document querying systems matter, covers real-life challenges in law, healthcare, and academia, and discusses the implementation of a query


Introduction

As data volumes continue to grow, the ability to efficiently extract relevant information from very long documents becomes increasingly critical. Thanks to advances in Artificial Intelligence (AI) and Natural Language Processing (NLP), we now have powerful tools to accomplish this task. This article aims to guide you through the process of creating a document query interface using the latest NLP techniques.

Understanding Natural Language Processing (NLP)

Prior to delving into the mechanics of crafting a document query interface using Large Language Models (LLMs) and Natural Language Processing (NLP), let's take a moment to unpack the concept of NLP and its fundamental principles. Consider owning a toy robot. It responds to commands by carrying out tasks for you, such as picking up items or creating images. But what if you could communicate with your robot in your normal language, just like you would with a close friend, and it would still understand you? That is the main goal of natural language processing! It's comparable to teaching your toy robot to speak your language and engage in conversation. 

Let's now delve a little further. Consider the times you have used voice commands to send a text message from your smartphone or to ask a digital assistant a question. How does your phone understand what you're saying? That's where Natural Language Processing comes in. Natural language processing is a field that combines computer science, artificial intelligence, and linguistics. Its goal is to enable computers to understand human languages. It is the technology that powers spell checkers, speech recognition systems, and machine translation. NLP enables your phone to understand the words you say, the context in which they are said, and even the tone of your voice. It functions as your phone's brain, deciphering your words and providing relevant responses. 

Understanding the Problem Statement

Having established a foundational understanding of Natural Language Processing, we can turn to the specific problem at hand: document querying. When a document holds a large amount of content, manually searching it for a particular piece of information is tedious and time-consuming. The solution is to use Artificial Intelligence to build a document query interface. Such a system lets users query the document database in natural language, much as they would ask another person.

The Importance of Natural Language Search in Document Analysis

The Role & Benefits of NLP in Document Search

The time and effort needed for manual searches are greatly reduced thanks to modern NLP approaches that enable machines to comprehend and answer complex inquiries. NLP offers several benefits in document analysis, including efficiency, precision, and scalability.

  • Efficiency: NLP algorithms can process large volumes of text quickly and accurately. This is because they are able to understand the context and semantics of the text, which allows them to extract the relevant information more efficiently than traditional methods.
  • Precision: NLP algorithms can provide highly relevant search results. This is because they are able to understand the context of the search query, which allows them to return results that are more likely to be relevant to the user's needs.
  • Scalability: NLP can handle increasing amounts of data without a significant drop in performance. This is because NLP algorithms are typically designed to be scalable, meaning that they can be easily adapted to handle larger and larger datasets.

Real-World Applications

Document querying with NLP has real-life applications in a variety of fields, including law, academia, and healthcare. These fields deal with large amounts of unstructured text data, which can be difficult to search using traditional methods. NLP-powered search can help to overcome these challenges by providing more relevant and accurate results.

For example, 

  1. NLP can be used to search legal documents for specific terms or phrases. This can be helpful for lawyers who need to quickly find relevant information in a large corpus of legal text.
  2. NLP can also be used to search academic research papers for keywords or concepts. This can help researchers to find relevant papers more easily and efficiently.
  3. In healthcare, NLP can be used to search patient records for specific medical terms or conditions. This can help doctors and nurses to quickly find the information they need to provide better care for their patients.

Overall, NLP-powered search can bring transformative improvements to a variety of fields that deal with large amounts of unstructured text data. By making it easier to find relevant information, NLP can help to improve productivity, efficiency, and accuracy.

Steps to Building an AI-Powered Document Query Interface

Having comprehended the workings of Natural Language Processing, its merits, and the real-life implications for document querying, let's shift our attention to our core project - Document Querying. 

Project Background

In the context of our tutorial, we'll employ a document which contains biology notes in PDF format as a case study for executing queries. Please note, this version of our program is tailored to handle PDFs specifically, but with minor adjustments in the code, other document formats can be easily accommodated as well. It's now time to delve into the rich details of the development and implementation process.

Note:

  • Store your OPENAI_API_KEY in a .env file in the same directory as your script. We will use dotenv to load the API key. This keeps the key out of your source code, so it is not exposed when you share the code.
  • You can then continue with the following code in another .py file.
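
To make the role of the .env file concrete, here is a minimal sketch of the idea behind loading it: read KEY=VALUE lines and copy them into the process environment without overwriting variables that are already set. This is an illustration only, not how the dotenv package is implemented; the function names and the placeholder key are assumptions made for this example.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_into_environ(text, environ=None):
    """Copy parsed values into environ, keeping any value that is already set."""
    environ = os.environ if environ is None else environ
    for key, value in parse_env_text(text).items():
        environ.setdefault(key, value)
    return environ


# Hypothetical .env contents; the key is a placeholder, not a real key.
sample = '# secrets\nOPENAI_API_KEY="sk-placeholder"\n'
env = load_into_environ(sample, environ={})
print(env["OPENAI_API_KEY"])
```

In the real application, load_dotenv() does this work for you, and the OpenAI classes read OPENAI_API_KEY from the environment.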

Project Implementation

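To get started, we need to install some dependencies first (with pip, the packages are python-dotenv, streamlit, PyPDF2, langchain, openai, and faiss-cpu). We will import dotenv to load environment variables, streamlit to build the web application, and PyPDF2 to read the PDF files. We then import several modules from the langchain library, which provide tools for text splitting, creating embeddings, similarity search, loading the question-answering chain, and using the OpenAI language model. Note that these import paths match the LangChain releases available when this guide was written; newer releases move several of them into separate packages such as langchain-community and langchain-openai.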

from dotenv import load_dotenv
import streamlit as st

from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

from langchain.callbacks import get_openai_callback

Once the necessary libraries and dependencies have been installed, we can set up the Streamlit application. The main() function shown below first loads the environment variables using load_dotenv(). Then, the Streamlit page configuration and header are set.

def main():
    load_dotenv()

    st.set_page_config(page_title="Ask PDF")
    st.header("Ask PDF")

In simpler terms, the main() function is the entry point for the Streamlit application. It loads the environment variables, which store configuration settings such as the API key. Then, it sets the Streamlit page configuration and header. Here, set_page_config() sets the title shown in the browser tab, while st.header() displays the heading at the top of the page.

Next, inside main(), we will create a file upload interface in the Streamlit application using the st.file_uploader() function. This function provides a widget that lets users select a PDF file from their computer and upload it to the application. The uploaded file is returned as an in-memory, file-like object (or None if nothing has been uploaded yet), which can be passed directly to PdfReader.

    # upload file (this and the following snippets continue inside main())
    pdf = st.file_uploader("Upload your PDF", type="pdf")

Once a PDF file is uploaded, the code extracts the text from each page and accumulates it in a variable called text. The text is then split into smaller chunks using the CharacterTextSplitter class, which was imported from the langchain library. With the settings below, the text is split on newlines into chunks of at most about 1,000 characters, with up to 200 characters of overlap between consecutive chunks, so that a passage cut at a chunk boundary still appears intact in one of them. Chunking keeps each piece small enough to embed and lets the application later pass only the relevant pieces to the language model.

    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)

        text = ""
        for page in pdf_reader.pages:
            # guard against pages that yield no text (e.g. scanned images)
            text += page.extract_text() or ""

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,     # target maximum characters per chunk
            chunk_overlap=200,   # characters shared by neighbouring chunks
            length_function=len,
        )
        chunks = text_splitter.split_text(text)
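
To see what chunk_size and chunk_overlap mean in practice, the following is a simplified sketch of overlapping chunking using a plain sliding window over characters. It is not the CharacterTextSplitter algorithm, which first splits on the separator and then merges the pieces so that chunk boundaries fall on newlines; the function name chunk_text is an assumption made for this illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Return overlapping character windows over text (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
for chunk in chunk_text(sample, chunk_size=12, chunk_overlap=4):
    print(len(chunk), chunk)
```

Each window starts 8 characters after the previous one, so the last 4 characters of one chunk are repeated as the first 4 characters of the next.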

The flowchart below provides a visual representation of the process of extracting text from PDF files, splitting the text into chunks, converting the chunks into embeddings, and creating a knowledge base. The flowchart is easy to follow and provides a clear overview of the process.

After the text has been split into chunks, the chunks are converted into embeddings using the OpenAIEmbeddings class, which calls OpenAI's embedding API. Embeddings are numerical vectors that represent text, so the similarity between two pieces of text can be measured by how close their vectors are. The embeddings are then stored in a FAISS index, which serves as our knowledge base. FAISS is a library for efficient similarity search that finds the chunks of text most similar to a given query. A knowledge base of this kind can support tasks such as answering questions, generating summaries, and finding related documents. This code belongs inside the same if block as the previous snippet, because chunks only exists once a file has been uploaded.

        # create embeddings (still inside the "if pdf is not None" block)
        embeddings = OpenAIEmbeddings()
        knowledge_base = FAISS.from_texts(chunks, embeddings)
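
The idea behind similarity search over embeddings can be sketched with a brute-force nearest-neighbour lookup. The example below assumes Euclidean distance and uses made-up three-dimensional vectors; real embeddings have far more dimensions, and FAISS uses optimized index structures rather than a Python loop. The function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: l2_distance(query_vec, pair[1]),
    )
    return [chunk for chunk, _ in ranked[:k]]


# Made-up vectors for illustration; these are not real embeddings.
chunks = ["DNA replication", "Genome annotation", "Cell membranes"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]
query = [0.85, 0.25, 0.05]
print(nearest_chunks(query, vectors, chunks, k=2))
```

The query vector lies closest to the second vector, so "Genome annotation" is ranked first, followed by "DNA replication".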

With the knowledge base in place, the application can take user input and provide responses. Users type a question about the document into a text box. When a question is entered, a similarity search is performed on the knowledge base to retrieve the chunks of the uploaded document that are most relevant to it. The load_qa_chain() function then builds a question-answering chain around the OpenAI language model; with chain_type="stuff", all retrieved chunks are placed into a single prompt together with the question. The model's response is displayed in the Streamlit application, and the get_openai_callback() context manager reports token usage for the call.

        # show user input
        user_question = st.text_input("Ask a question about your PDF:")
        if user_question:
            # retrieve the chunks most similar to the question
            docs = knowledge_base.similarity_search(user_question)

            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs,
                                     question=user_question)
                print(cb)  # token usage and cost, printed to the console

            st.write(response)

In simpler terms, the code will first perform a similarity search on the knowledge base to find the information that is most relevant to the user's question. Then, the question-answering model will be loaded and used to generate a response to the user's question using the language model. The response will then be displayed on the Streamlit application.
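
As a rough illustration of what a "stuff" style chain does, the sketch below joins every retrieved chunk into one context block and places the question after it. The prompt wording and the function name build_stuff_prompt are assumptions made for this example; they are not LangChain's actual prompt template.

```python
def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, standing in for similarity search results.
retrieved = [
    "Genome annotation identifies the locations of genes in a genome.",
    "Phylogeny describes the evolutionary relationships among organisms.",
]
print(build_stuff_prompt(retrieved, "What is Genome Annotation?"))
```

Because every retrieved chunk goes into a single prompt, this approach is simple but only works while the combined chunks fit within the model's context window, which is one reason the chunk size is kept small.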

Finally, the script calls main() when it is executed directly. Note that a Streamlit application is started with the streamlit command rather than the plain Python interpreter, for example streamlit run app.py if the script is saved as app.py (a placeholder name; use whatever you called your file).

To recap, main() is the entry point for the application. It loads the environment variables, sets the Streamlit page configuration and header, creates the file upload interface, extracts text from the PDF, splits the text into chunks, converts the chunks into embeddings, builds the knowledge base, and handles user questions and responses.

Once the application is running, the user can upload a PDF file, ask questions about it, and view the responses generated by the question-answering chain.

if __name__ == '__main__':
    main()

Project Results

The following three images show the results of our project. They demonstrate the ability of our application to extract text from PDF files, split the text into chunks, convert the chunks into embeddings, create a knowledge base, and answer questions about the documents.

The first image shows the Streamlit interface of our application. The second image shows the results of uploading a PDF file to the application and asking the question 'What is Genome Annotation?'. The third image shows the results of asking the question 'Explain Phylogeny'.

Image 1.

This image shows the Streamlit interface that we created in the tutorial. As you can see, there is an option to upload a file from your local storage. This allows users to upload PDF files to the application.

Image 2.

This image shows that a PDF file has been uploaded to the application. The user has asked the question 'What is Genome Annotation?' and the application has responded with a definition. This shows that the application is able to answer questions about the documents that have been uploaded.

Image 3.

This image shows that the application has responded to another query. The user has asked the question 'Explain Phylogeny' and the application has responded with a brief explanation. This shows that the application is able to answer a variety of questions about information present in the uploaded document.

Limitations

  • Privacy Concerns: Users must upload their documents to be processed, and the extracted text and questions are sent to OpenAI's API for embedding and answering. This could raise privacy and security concerns, especially with sensitive documents.
  • PDF Format Limitation: As currently implemented, this project only supports the querying of PDF documents. While this limitation can be overcome with modifications in the code, the effort to do so may be non-trivial and require additional libraries.
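
One possible way to structure support for additional formats is to dispatch on the file extension, as in the sketch below. This is only an illustration under stated assumptions: the EXTRACTORS registry and the function names are hypothetical, and only plain-text formats are handled here using the standard library.

```python
import os


def extract_plain_text(data):
    """Decode a plain-text upload, replacing any undecodable bytes."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry: file extension -> function taking bytes, returning str.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".md": extract_plain_text,
}


def extract_text(filename, data):
    """Pick an extractor based on the file extension, or fail clearly."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return extractor(data)


print(extract_text("notes.TXT", b"Phylogeny is the study of lineages."))
```

In the application, the existing PdfReader loop could be wrapped in a function and registered under ".pdf", and the type argument of st.file_uploader() would need to list the extra extensions. Formats such as Word documents would require additional libraries.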

Conclusion

In this tutorial, we have shown how to extract text from PDF files, split the text into chunks, convert the chunks into embeddings, create a knowledge base, and answer questions about the documents. We have also shown how to use Streamlit to create a user-friendly interface for the application.
