This guide discusses the importance of document querying systems, covers real-life challenges in the fields of law, healthcare, and academia, and walks through the implementation of a query interface for PDF documents.
As data volumes continue to grow, the ability to efficiently extract relevant information from very long documents becomes increasingly critical. Thanks to advances in Artificial Intelligence (AI) and Natural Language Processing (NLP), we now have powerful tools to accomplish this task. This article aims to guide you through the process of creating a document query interface using the latest NLP techniques.
Prior to delving into the mechanics of crafting a document query interface using Large Language Models (LLMs) and Natural Language Processing (NLP), let's take a moment to unpack the concept of NLP and its fundamental principles. Consider owning a toy robot. It responds to commands by carrying out tasks for you, such as picking up items or creating images. But what if you could communicate with your robot in your normal language, just like you would with a close friend, and it would still understand you? That is the main goal of natural language processing! It's comparable to teaching your toy robot to speak your language and engage in conversation.
Let's now delve a little further. Consider the times you have used voice commands to send a text message from your smartphone or to ask a digital assistant a question. How does your phone understand what you're saying? That's where Natural Language Processing comes in. Natural language processing is a field that combines computer science, artificial intelligence, and linguistics. Its goal is to enable computers to understand human languages. It is the technology that powers spell checkers, speech recognition systems, and machine translation. NLP enables your phone to understand the words you say, the context in which they are said, and even the tone of your voice. It functions as your phone's brain, deciphering your words and providing relevant responses.
Having established a foundational understanding of Natural Language Processing, it's time to delve deeper into our specific issue at hand - document querying. We'll dissect the problem statement, comprehend its intricacies, and explore how NLP can effectively address this challenge. In scenarios involving a substantial quantity of content in a document, manually searching through to locate particular pieces of information becomes not only tedious but also a considerable drain on time. The solution? Harnessing the power of Artificial Intelligence to construct a document query interface. Such a system would facilitate users in querying the document database in their natural language, thereby mirroring the ease of human-to-human conversation.
The time and effort needed for manual searches are greatly reduced thanks to modern NLP approaches that enable machines to comprehend and answer complex inquiries. NLP offers several benefits in document analysis, including efficiency, precision, and scalability.
Document querying with NLP has real-life applications in a variety of fields, including law, academia, and healthcare. These fields deal with large amounts of unstructured text data, which can be difficult to search using traditional methods. NLP-powered search can help to overcome these challenges by providing more relevant and accurate results.
For example, lawyers can quickly locate relevant clauses across lengthy contracts and case files, researchers can pull specific findings from large collections of papers, and clinicians can retrieve key details from extensive patient records and medical literature.
Overall, NLP-powered search can bring transformative improvements to a variety of fields that deal with large amounts of unstructured text data. By making it easier to find relevant information, NLP can help to improve productivity, efficiency, and accuracy.
Having comprehended the workings of Natural Language Processing, its merits, and the real-life implications for document querying, let's shift our attention to our core project - Document Querying.
In the context of our tutorial, we'll employ a document which contains biology notes in PDF format as a case study for executing queries. Please note, this version of our program is tailored to handle PDFs specifically, but with minor adjustments in the code, other document formats can be easily accommodated as well. It's now time to delve into the rich details of the development and implementation process.
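Although the tutorial code below handles PDFs only, the adjustment needed for other formats is small. As an illustration, here is a minimal, hypothetical helper (not part of the tutorial code) showing how a plain-text upload could be handled alongside a PDF; the extract_text name and the .txt branch are assumptions for illustration.
import PyPDF2

def extract_text(uploaded_file):
    # The PDF branch mirrors the tutorial code further below
    if uploaded_file.name.lower().endswith(".pdf"):
        reader = PyPDF2.PdfReader(uploaded_file)
        return "".join(page.extract_text() or "" for page in reader.pages)
    # Streamlit uploads are file-like objects, so plain text can be read directly
    return uploaded_file.read().decode("utf-8")
The rest of the pipeline (chunking, embedding, and querying) would stay exactly the same.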
Note: The application calls the OpenAI API, so you will need an OpenAI API key stored in a .env file in the project directory (as OPENAI_API_KEY); load_dotenv() reads it into the environment when the app starts.
To get started, we first need to install the required packages, for example with pip: python-dotenv, streamlit, PyPDF2, langchain, and openai (plus faiss-cpu for the similarity search). We then import dotenv, which is used to load environment variables, streamlit for building the web application, and PyPDF2 for reading PDF files. Finally, several modules from the langchain library are imported, which include tools for text splitting, creating embeddings, similarity search, loading the question-answering chain, and using the OpenAI language model.
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
Once the necessary libraries and dependencies have been installed, we can set up the Streamlit application. The main() function loads the environment variables using the load_dotenv() function. Then, the Streamlit page configuration and header are set.
def main():
    load_dotenv()
    st.set_page_config(page_title="Ask PDF")
    st.header("Ask PDF")
In simpler terms, the main() function is the entry point for the Streamlit application. It loads the environment variables, which are used to store configuration settings for the application. Then, it sets the Streamlit page configuration and header. The page configuration controls the appearance of the application, while the header displays the title of the application.
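One practical note: the OpenAI wrappers used later read the OPENAI_API_KEY environment variable, so if the .env file is missing the app will only fail when the first request is made. As an optional, hypothetical addition (not part of the original code), you could check for the key right after setting up the page:
import os

def main():
    load_dotenv()
    st.set_page_config(page_title="Ask PDF")
    st.header("Ask PDF")
    # Optional sanity check: stop early with a clear message if the key is missing,
    # instead of failing later when the first OpenAI request is made.
    if not os.getenv("OPENAI_API_KEY"):
        st.error("OPENAI_API_KEY is not set. Add it to your .env file.")
        st.stop()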
Next, we will create a file upload interface in the Streamlit application, allowing users to upload a PDF file. The interface is created with the st.file_uploader() function, which provides a widget for selecting a file from the user's computer. Streamlit holds the uploaded file in memory as a file-like object, which we can pass directly to the PDF reader in the next step.
    # upload file
    pdf = st.file_uploader("Upload your PDF", type="pdf")
Once a PDF file is uploaded, the code extracts its text and stores it in a variable called text. The text is then split into smaller pieces using the CharacterTextSplitter class imported from the langchain library, which cuts the text into chunks of a specified length with some overlap. Working with chunks keeps each piece small enough to embed and to fit into the language model's context window, while the overlap preserves context across chunk boundaries.
    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)
The flowchart below provides a visual overview of the process: extracting text from the PDF, splitting the text into chunks, converting the chunks into embeddings, and creating a knowledge base.
After the text has been split into chunks, the chunks are converted into embeddings using the OpenAIEmbeddings library. Embeddings are numerical representations of text that can be used to measure the similarity between different pieces of text. The embeddings are then used to create a knowledge base using the FAISS library. FAISS is a library for efficient similarity search that can be used to find the most similar chunks of text to a given query. The knowledge base can be used to perform a variety of tasks, such as answering questions, generating summaries, and finding related documents.
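To make "measuring similarity" concrete, here is a small standalone sketch (not part of the application) that embeds two sentences with OpenAIEmbeddings and compares them using cosine similarity; it assumes your OpenAI API key is configured as described above, and the example sentences are arbitrary.
import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Turn two short texts into numerical vectors
vec_a = np.array(embeddings.embed_query("Genome annotation identifies genes in a DNA sequence."))
vec_b = np.array(embeddings.embed_query("What is genome annotation?"))

# Cosine similarity: values closer to 1.0 mean the texts are more semantically similar
similarity = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {similarity:.3f}")
In the application itself we never compute this by hand; FAISS stores the chunk vectors and performs the search for us: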
        # create embeddings
        embeddings = OpenAIEmbeddings()
        knowledge_base = FAISS.from_texts(chunks, embeddings)
Once the knowledge base has been created, the code will be written to take user input and provide responses. In the application, users will be able to ask questions about the document. If a question is asked, a similarity search will be performed on the knowledge base to find relevant information in the document uploaded earlier. The load_qa_chain() function will be used to load a question-answering model, which will then generate a response to the user's question. The response will be displayed on the Streamlit application.
        # show user input
        user_question = st.text_input("Ask a question about your PDF:")
        if user_question:
            docs = knowledge_base.similarity_search(user_question)

            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=user_question)
                print(cb)

            st.write(response)
In simpler terms, the code will first perform a similarity search on the knowledge base to find the information that is most relevant to the user's question. Then, the question-answering model will be loaded and used to generate a response to the user's question using the language model. The response will then be displayed on the Streamlit application.
Finally, if the script is executed directly, the main() function will be run. This function launches the Streamlit application and allows the user to interact with it.
The main() function is the entry point for the application. It loads the environment variables, sets the Streamlit page configuration and header, creates the file upload interface, extracts text from the uploaded PDF, splits the text into chunks, converts the chunks into embeddings, builds the knowledge base, and handles user questions by retrieving relevant chunks and generating responses.
Once the main() function has been run, the Streamlit application will be launched and the user will be able to interact with it. The user will be able to upload PDF files, ask questions about the documents, and view the responses generated by the question-answering model.
if __name__ == '__main__':
    main()
The following three images show the results of our project. They demonstrate the ability of our application to extract text from PDF files, split the text into chunks, convert the chunks into embeddings, create a knowledge base, and answer questions about the documents.
The first image shows the Streamlit interface of our application. The second image shows the results of uploading a PDF file to the application and asking the question 'What is Genome Annotation?'. The third image shows the results of asking the question 'Explain Phylogeny'.
Image 1.
This image shows the Streamlit interface that we created in the tutorial. As you can see, there is an option to upload a file from your local storage. This allows users to upload PDF files to the application.
Image 2.
This image shows that a PDF file has been uploaded to the application. The user has asked the question 'What is Genome Annotation?' and the application has responded with a definition. This shows that the application is able to answer questions about the documents that have been uploaded.
Image 3.
This image shows the application responding to another query. The user has asked 'Explain Phylogeny' and the application has responded with a brief explanation, showing that it can answer a variety of questions about the information present in the uploaded document.
In this tutorial, I have shown how to extract text from PDF files, split the text into chunks, convert the chunks into embeddings, create a knowledge base, and answer questions about the documents. I have also shown how to use Streamlit to create a user-friendly interface for my application.