A quick guide to building a RAG system locally using the Llama3-instruct model through Ollama.
In this blog, we will look at what RAG is and how it works with LLM models. RAG is essentially a way to extend an LLM's knowledge. There are two common ways to adapt a model's knowledge for specific tasks.
The first method is fine-tuning. In this process, the LLM is retrained on new data so that it can perform domain-specific tasks. This method requires expertise in preparing the training dataset, running the training process, and evaluating the model, and it updates the model's weights.
The other method is building a RAG system, which is a much simpler process. It does not require deep expertise in LLM training; it simply integrates a vector database with the LLM. In this method, the LLM generates content from both its own training data and the documents retrieved from the vector database via similarity search.
Retrieval Augmented Generation
RAG stands for Retrieval-Augmented Generation. It is a process of optimizing the output of a large language model so that it references an authoritative knowledge base outside its training data before generating a response. Large language models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences.
With RAG, the model's input is augmented with additional context related to the query. In this process, the data used to extend the model's knowledge is embedded with an embedding model, and the resulting vectors (numerical representations of the text) are stored in a vector database.
The vector database is integrated with the LLM: when a query is prompted, the model generates its answer from both its training data and the documents retrieved from the vector database. A RAG system is also a convenient way to keep the model's knowledge up to date.
RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so that it remains relevant, accurate, and useful in various contexts.
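Conceptually, the whole retrieval step fits in a few lines. The sketch below is illustrative Python pseudocode: the names embed, vector_store, and llm are placeholders for whatever embedding model, vector store, and LLM you end up using, not a specific library API.
# a minimal sketch of the RAG flow (illustrative, not tied to a specific library)
def answer_with_rag(query, embed, vector_store, llm, k=3):
    query_vector = embed(query)                      # embed the user query
    chunks = vector_store.search(query_vector, k=k)  # similarity search for the top-k chunks
    context = "\n\n".join(chunks)                    # collect the retrieved text
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                               # the LLM answers with the extra context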
Ollama
Ollama is a free and open-source tool that lets anyone run open LLMs locally on their own machine, making it easier than ever to leverage the power of LLMs for various applications and use cases while keeping everything private.
Its command-line interface (CLI) lets you conveniently download LLMs and run them locally. With a couple of commands, you can download models like Llama, Gemma, and more.
Ollama can be downloaded from the official website: https://ollama.com
After installing Ollama for your operating system, check the installation by running the following command in the terminal:
ollama
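Running ollama with no arguments prints the CLI help text. Two other standard subcommands are handy for checking that everything is in place:
# print the installed version
ollama --version
# list the models downloaded locally (empty until you pull one)
ollama list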
Let’s start the Tutorial
As in every coding tutorial, the first step is setting up an environment with the required dependencies; here, the necessary packages are installed into the environment. Google Colab is used to run the code.
!pip install langchain-community==0.2.4 langchain==0.2.3 faiss-cpu==1.8.0 unstructured==0.14.5 unstructured[pdf]==0.14.5 transformers==4.41.2 sentence-transformers==3.0.1
Then the required modules are imported from the installed packages. The 'os' library is used to interact with the operating system, such as file and directory manipulation, process management, and environment variable access. The 'Ollama' class is used to connect the downloaded Llama3-instruct model to our environment.
The 'UnstructuredFileLoader' is used to load an unstructured file (here, a PDF) and extract its text. 'FAISS' stands for Facebook AI Similarity Search; it performs similarity searches on high-dimensional vectors and provides various indexing and search algorithms to quickly find the nearest neighbors of an input vector.
'HuggingFaceEmbeddings' uses an embedding model to convert text into vector representations. The 'CharacterTextSplitter' module is used to split the unstructured data into chunks, which are groups of characters; chunking is the preprocessing step before embedding.
'RetrievalQA' is used to create a chain, which retrieves the relevant data when the LLM is invoked.
import os
from langchain_community.llms import Ollama
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
The Google Colab free tier doesn't provide terminal access, so the colab-xterm extension is used to open a terminal. The terminal is required to run Ollama and serve the Llama3-instruct model.
!pip install colab-xterm
%load_ext colabxterm
%xterm
In the terminal, the following commands install Ollama inside the Colab environment, start the Ollama server, and download the LLM locally:
# install ollama inside the Colab session
curl -fsSL https://ollama.com/install.sh | sh
# start the ollama server in the background
ollama serve &
# download the model
ollama pull llama3:instruct
After the model has downloaded successfully, the Llama3-instruct model is loaded into our environment through the Ollama class, and the temperature is set to zero to make the model's output more deterministic.
# loading the LLM
llm = Ollama(
    model="llama3:instruct",
    temperature=0
)
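As an optional sanity check (not part of the original walkthrough), you can invoke the loaded model directly before building the RAG chain to confirm that Ollama is reachable:
# optional: confirm the model responds before building the RAG chain
print(llm.invoke("In one sentence, what is retrieval augmented generation?"))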
After that, the unstructured PDF file is loaded into our environment to extract the data from it.
# loading the document
loader = UnstructuredFileLoader("/content/attention_all_you_need.pdf")
documents = loader.load()
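To verify that the loader worked, you can optionally inspect how many documents were returned and preview the extracted text:
# optional: inspect what the loader extracted
print(len(documents))                   # number of loaded Document objects
print(documents[0].page_content[:300])  # first 300 characters of the extracted text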
The text is split into chunks using the newline character as the separator, with each chunk at most 1000 characters long. The chunk overlap is the number of characters that overlap between consecutive chunks; here, 200 characters of each chunk overlap with the next one.
The unstructured text extracted from the PDF file is split using the CharacterTextSplitter.
# create document chunks
text_splitter = CharacterTextSplitter(separator="\n",
                                      chunk_size=1000,
                                      chunk_overlap=200)
text_chunks = text_splitter.split_documents(documents)
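Optionally, you can check how many chunks the splitter produced:
# optional: check how many chunks were produced
print(len(text_chunks))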
The Hugging Face embedding model is used to embed the split text.
At the end, a FAISS index is created from the document chunks using those embeddings, and the index is stored in the knowledge_base variable.
# loading the vector embedding model
embeddings = HuggingFaceEmbeddings()
knowledge_base = FAISS.from_documents(text_chunks, embeddings)
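Before wiring the index into a QA chain, you can query it directly with the vector store's similarity_search method; the query string below is just an example:
# optional: query the FAISS index directly
results = knowledge_base.similarity_search("What is multi-head attention?", k=3)
for doc in results:
    print(doc.page_content[:200])  # preview of each matching chunk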
Then the RetrievalQA chain is initiated. This chain is used to retrieve data from the knowledge_base variable, which contains the vector index built from the document text.
# retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=knowledge_base.as_retriever()
)
The QA chain is now created. It utilizes two components: the LLM and the knowledge_base index that stores the vectors of the document text. This application doesn't use an external vector database server to store the embeddings; instead, the FAISS index is kept in memory in a single variable.
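If you want the index to outlive the current session, the LangChain FAISS wrapper can also save it to disk and reload it later. The folder name below is an arbitrary example, and recent versions ask you to explicitly allow deserialization when loading:
# optional: persist the in-memory FAISS index and reload it later
knowledge_base.save_local("faiss_index")
knowledge_base = FAISS.load_local("faiss_index", embeddings,
                                  allow_dangerous_deserialization=True)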
The QA chain is invoked with a question, and the relevant retrieved text is used to generate the answer.
question = "What is this document about?"
response = qa_chain.invoke({"query": question})
print(response["result"])
#output
"""Based on the provided context, it appears that this document is about attention mechanisms in neural networks, specifically in the context of natural language processing and machine translation. The document includes references to various research papers and provides visualizations of attention mechanisms at work, highlighting their ability to capture long-distance dependencies and perform tasks such as anaphora resolution."""
Here is another example:
question = "What is the architecture discussed in the model?"
response = qa_chain.invoke({"query": question})
print(response["result"])
#output
"""Based on the provided context, it appears that the architecture being discussed is a sequence-to-sequence learning model with attention mechanisms. This is evident from references [35], [38], and the attention visualizations presented in Figures 3-5.
In particular, the model seems to be using encoder-decoder architectures with self-attention mechanisms, as described in [35] and [38]. The attention visualizations show how different heads in the attention mechanism are attending to specific parts of the input sentence, highlighting the ability of the model to capture long-distance dependencies and perform tasks such as anaphora resolution.
It's also worth noting that the model is likely using a neural machine translation system, given the references to Google's neural machine translation system [38] and the discussion of attention mechanisms in the context of neural machine translation."""
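If you also want to see which chunks the answer was grounded in, the chain can be built with return_source_documents=True so the retrieved chunks are returned alongside the answer; this is an optional variation on the chain above:
# optional: also return the retrieved source chunks
qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm,
    retriever=knowledge_base.as_retriever(),
    return_source_documents=True
)
response = qa_chain_with_sources.invoke({"query": "What is this document about?"})
print(response["result"])
for doc in response["source_documents"]:
    print(doc.page_content[:150])  # preview of each retrieved chunk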
Thanks for Reading!