Interested in a hands-on learning experience for developing LLM applications?
Join our LLM Bootcamp today and Get 30% Off for a Limited Time!

embedding

In this blog, we are enhancing our Language Model (LLM) experience by adopting the Retrieval-Augmented Generation (RAG) approach! Let’s explore RAG in LLM for enhanced results!

We’ll explore the fundamental architecture of RAG conceptually and delve deeper by implementing it through the Lang Chain orchestration framework and leveraging an open-source model from Hugging Face for both question-answering and text embedding. 

So, let’s get started! 

Common Hallucinations in Large Language Models  

The most common problem faced by state-of-the-art LLMs is that they produce inaccurate or hallucinated responses. This mostly occurs when prompted with information not present in their training set, despite being trained on extensive data.

 

llm bootcamp banner

 

This discrepancy between the general knowledge embedded in the LLM’s weights and newer information can be bridged using RAG. The solution provided by RAG eliminates the need for computationally intensive and expertise-dependent fine-tuning, offering a more flexible approach to adapting to evolving information.

 

Read more about: AI hallucinations and risks associated with large language models

 

AI hallucinations
AI hallucinations

What is RAG? 

Retrieval Augmented Generation involves enhancing the output of Large Language Models (LLMs) by providing them with additional information from an external knowledge source.

 

Explore LLM context augmentation techniques like RAG and fine-tuning in detail with out podcast now!

 

This method aims to improve the accuracy and contextuality of LLM-generated responses while minimizing factual inaccuracies. RAG empowers language models to sidestep the need for retraining, facilitating access to the most up-to-date information to produce trustworthy outputs through retrieval-based generation. 

The Architecture of RAG Approach

 

RAG in LLM - Elevate Your Large Language Models Experience | Data Science Dojo

Figure from Lang chain documentation

Prerequisites for Code Implementation 

1. HuggingFace Account and LLAMA2 Model Access:

  • Create a Hugging Face account (free sign-up available) to access open-source Llama 2 and embedding models. 
  • Request access to LLAMA2 models using this form (access is typically granted within a few hours). 
  • After gaining access to Llama 2 models, please proceed to the provided link, select the checkbox to indicate your agreement to the information, and then click ‘Submit’. 

2. Google Colab Account:

  • Create a Google account if you don’t already have one. 
  • Use Google Colab for code execution. 

3. Google Colab Environment Setup:

  • In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4 for faster execution of code. 

4. Library and Dependency Installation:

  • Install necessary libraries and dependencies using the following command: 

 

5. Authentication with HuggingFace:

  • Integrate your Hugging Face token into Colab’s environment:

 

 

  • When prompted, enter your Hugging Face token obtained from the “Access Token” tab in your Hugging Face settings. 

A 5-Step Guide to Implement RAG in LLM

Step 1: Document Loading 

Loading a document refers to the process of retrieving and storing data as documents in memory from a specified source. This process is typically facilitated by document loaders, which provide a “load” method for accessing and loading documents into the memory. 

Lang chain has number of document loaders in this example we will be using “WebBaseLoader” class from the “langchain.document_loaders” module to load content from a specific web page.

 

 
The code extracts content from the web page “https://lilianweng.github.io/posts/2023-06-23-agent/“. BeautifulSoup (`bs4`) is employed for HTML parsing, focusing on elements with the classes “post-content”, “post-title”, and “post-header.” The loaded content is stored in the variable `docs`. 

Step 2: Document Transformation – Splitting/Chunking Document 

After loading the data, it can be transformed to fit the application’s requirements or to extract relevant portions. This involves splitting lengthy documents into smaller chunks that are compatible with the model and produce accurate and clear results.

Lang Chain offers various text splitters, in this implementation we chose the “RecursiveCharacterTextSplitter” for generic text processing.

 

 

The code breaks documents into chunks of 1000 characters with a 200-character overlap. This chunking is employed for embedding and vector storage, enabling more focused retrieval of relevant content during runtime.

The recursive splitter ensures chunks maintain contextual integrity by using common separators, like new lines, until the desired chunk size is achieved. 

Step 3: Storage in Vector Database 

After extracting text chunks, we store and index them for future searches using the RAG application. A common approach involves embedding the content of each split and storing these embeddings in a vector store. 

When searching, we embed the search query and perform a similarity search to identify stored splits with embeddings most similar to the query embedding. Cosine similarity, which measures the angle between embeddings, is a simple similarity measure. 

Using the Chroma vector store and open source “HuggingFaceEmbeddings” in Lang chain, we can embed and store all document splits in a single command. 

Text Embedding:

Text embedding converts textual data into numerical vectors that capture the semantic meaning of the text. This enables efficient identification of similar text pieces. An embedding model, which is a variant of Language Models (LLMs) specifically designed for this purpose. 

 Lang Chain’s Embeddings class facilitates interaction with various text embedding models. While any model can be used, we opted for “HuggingFaceEmbeddings”. 

 

 

This code initializes an instance of the HuggingFaceEmbeddings class, configuring it with an open-source pre-trained model located at “sentence-transformers/all-MiniLM-l6-v2“. By doing this text embedding is created for converting textual data into numerical vectors. 

 

How generative AI and LLMs work

 

Vector Stores:

Vector stores are specialized databases designed to efficiently store and search for high-dimensional vectors, such as text embeddings. They enable the retrieval of the most similar embedding vectors based on a given query vector. Lang Chain integrates with various vector stores, and we are using the “Chroma” vector store for this task.

 

 

This code utilizes the Chroma class to create a vector store (vectorstore) from the previously split documents (splits) using the specified embeddings (embeddings). The Chroma vector store facilitates efficient storage and retrieval of document vectors for further processing. 

Step 4: Retrieval of Text Chunks 

After storing the data, preparing the LLM model, and constructing the pipeline, we need to retrieve the data. Retrievers serve as interfaces that return documents based on a query. 

Retrievers cannot store documents; they can only retrieve them. Vector stores form the foundation of retrievers. Lang Chain offers a variety of retriever algorithms, here is the one we implement. 

 

 

Step 5: Generation of Answer with RAG Approach 

Preparing the LLM Model:

In the context of Retrieval Augmented Generation (RAG), an LLM model plays a crucial role in generating comprehensive and informative responses to user queries. By leveraging its ability to process and understand natural language, the LLM model can effectively combine retrieved documents with the given query to produce insightful and relevant outputs.

 

 

These lines import the necessary libraries for handling pre-trained models and tokenization. The specific model “meta-llama/Llama-2-7b-chat-hfis chosen for its question-answering capabilities.

 

 

This code defines a transformer pipeline, which encapsulates the pre-trained HuggingFace model and its associated configuration. It specifies the task as “text-generation” and sets various parameters to optimize the pipeline’s performance. 

 

 

This line creates a Lang Chain pipeline (HuggingFace Pipeline) that wraps the transformer pipeline. The model_kwargs parameter adjusts the model’s “temperature” to control its creativity and randomness. 

Retrieval QA Chain:

To combine question-answering with a retrieval step, we employ the RetrievalQA chain, which utilizes a language model and a vector database as a retriever. By default, we process all data in a single batch and set the chain type to “stuff” when interacting with the language model. 

 

 

This code initializes a RetrievalQA instance by specifying a chain type (“stuff”), a HuggingFacePipeline (llm), and a retriever (retriever-initialize previously in the code from vectorstore). The return_source_documents parameter is set to True to include source documents in the output, enhancing contextual information retrieval.

Finally, we call this QA chain with the specific question we want to ask.

 

 

The result will be: 

 

 

We can print source documents to see which document chunks the model used to generate the answer to this specific query.

 

 

In this output, only 2 out of 4 document contents are shown as an example, that were retrieved to answer the specific question. 

Conclusion 

In conclusion, by embracing the Retrieval-Augmented Generation (RAG) approach, we have elevated our Language Model (LLM) experience to new heights.

Through a deep dive into the conceptual foundations of RAG and practical implementation using the Lang Chain orchestration framework, coupled with the power of an open-source model from Hugging Face, we have enhanced the question-answering capabilities of LLMs.

 

Explore a hands-on curriculum that helps you build custom LLM applications!

 

This journey exemplifies the seamless integration of innovative technologies to optimize LLM capabilities, paving the way for a more efficient and powerful language processing experience. Cheers to the exciting possibilities that arise from combining innovative approaches with open-source resources! 

December 6, 2023

Transformers have revolutionized natural language processing with their use of self-attention mechanisms. In this post, we will study the key components of transformers to understand how they have become the basis of the state of the art in different tasks.  

 

Introduction: Attention is all you need 

The Transformer architecture was first introduced in the 2017 paper “Attention is All You Need” by researchers at Google. Unlike previous sequence models such as RNNs, Transformer relies entirely on self-attention to model dependencies in sequential data like text.   

 

Large language models knowledge test

 

Remarkably, this simple change led to major improvements in machine translation quality over existing methods. Since then, Transformers have been applied successfully to diverse NLP tasks like text generation, summarization, and question-answering. Their versatility has even led to applications in computer vision 

 

Large language model bootcamp

 

But what exactly is self-attention and why is it so effective? Let’s explore this. 

The limitations of Recurrent Neural Networks – RNNs   

Recurrent neural networks (RNNs) used to be the dominant approach for modeling sequences. An RNN processes textual data incrementally, maintaining a “memory” of the previous context. For example, to predict the next word in a sentence, an RNN model would incorporate information about all the preceding words.  

However, RNNs have certain limitations. They process data sequentially, making parallelization difficult. More critically, they struggle to learn long-range dependencies because the information gets diluted over many steps of time. Attention mechanisms were proposed to mitigate this issue. 

Learn to build LLM applications

Why use a transformer model?  

The transformer architecture has enabled the development of new models that can be trained on large datasets and significantly outperform recurrent neural networks like LSTMs. These new models are utilized for tasks like sequence classification, question answering, language modeling, named entity recognition, summarization, and translation.  

Let’s examine the key components of transformers to understand how they have become the foundation for state-of-the-art performance on different NLP tasks.  

Transformer design  

A transformer consists of an encoder and a decoder. The encoder’s role is to encode the inputs (i.e. sentences) into a state, often containing multiple tensors. This state is then passed to the decoder to generate the outputs.

In machine translation, the encoder converts a source sentence, e.g. “Hello world“, into a state, such as a vector, that captures its semantic meaning.

The decoder then utilizes this state to produce the translated target sentence, e.g. “Bonjour le monde.” Both the encoder and decoder primarily employ Multi-Head Attention and Feedforward Networks, which are the focus of this article.   

 

Transformer model architecture

Key transformer components  

1. Input embedding  

Embedding aims to create a vector representation of words where words with similar meanings will be close in terms of Euclidean distance. For instance, the words “bathroom” and “shower” are related to the same concept, so their word vectors are close in Euclidean space as they convey similar meanings.  

For the encoder, the authors opted for an embedding size of 512 (i.e. each word is represented by a 512-dimensional vector).  

  

  Input embedding

 

2. Positional encoding  

The position of a word plays a crucial role in understanding the sequence we want to model.  

Therefore, we add positional information about the word’s location in the sequence to its vector. The authors used the following sinusoidal. 

Position encoding   

 

We will explain positional encoding in more detail with an example.  

  Position encoding example

  

We note the position of each word in the sequence.  

We define dmodel = 512, which represents the size of the embedding vector of each word (i.e. the vector dimension). We can now rewrite the two positional encoding equations as:  

 

two positional encoding equations

 


We can see that the wavelength (i.e. frequency) λt decreases as the dimension increases, this forms a progression along the wave from 2pi to 10000.2pi.  

  

  wavelength

 

In this model, the absolute positional information of a word in a sequence is added directly to its initial vector. For this, the positional encoding must have the same size dmodel as the initial word vector.  


3.
Attention mechanism  

  • Scaled Dot-Product Attention  

  Scaled Dot-Product Attention

  

Let’s explain the attention mechanism. The key goal of attention is to estimate the relative relevance of the keywords compared to the query word for the same entity. For this, the attention mechanism takes a query vector Q representing a word, the keys K comprising all other words in the sentence, and values V representing the word vectors.  

In our case, V = Q (for the two self-attention layers). In other words, the attention mechanism provides the significance of a word in a given sentence.  

 

attention mechanism

  

When we compute the normalized dot product between the query and the keys, we get a tensor that represents the relative importance of each other word for the query. To go deeper into mathematics, we can try to understand why the authors used a dot product to calculate the relation between two words.  

 

Get Started with Generative AI                                    

 

A word is represented by a vector in Euclidian space, in this case, a vector of size 512.   

When computing the dot product between Q and KT, we calculate the product between Q’s orthogonal projection onto K. In other words, we estimate the alignment between the query and keyword vectors, returning a weight for each word in the sentence.  

We then normalize by dk to counteract large Q and K magnitudes which can push the softmax function into regions with tiny gradients. The softmax function regularizes the terms and rescales them between 0 and 1 (i.e., converts the dot product to a probability distribution), with the goal of normalizing all weights between 0 and 1.  

  softmax function

Finally, we multiply the weights (i.e., importance) by the values V to reduce irrelevant words and focus on the most significant words.  

    

Attention mechanism (2)

 

 

  • Multi-Head Attention  

  Multi-Head Attention

 

The key idea is that attention is applied multiple times in parallel on different projections of the input queries, keys, and values. This allows the model to learn different types of dependencies between the input words.  

  

The input queries (Q), keys (K), and values (V) are each linearly projected h times into smaller subspaces. For example, h=8 times into 64-dimensional spaces.  

Attention is then applied in each of these h projected subspaces in parallel, yielding h different attention outputs.  

 

Attention mechanism and transformers - LLM
Attention mechanism and transformers – LLM Bootcamp Data Science Dojo

 

These h outputs are concatenated and linearly projected again to get the final values. The projections allow the model to focus on different positional and semantic relationships between words since each projected subspace captures different information.  

Doing this in parallel (multi-head) instead of sequentially improves efficiency.  

The projection matrices are learned during training to discover the most useful projections. So, in summary, multi-head attention applies the attention mechanism in multiple parallel subspaces to learn different types of dependencies between words in an efficient way.  

  

Let’s dive into the mechanics of encoder-decoder architecture.  

Transformer model architecture   

 

In this section, we’ll explain how the encoder and decoder work together to translate an English sentence into a French one, step by step.  

1. Encoder  

Encoder

  • Convert a sequence of tokens to a sequence of vectors by using embeddings.    

Positional encoding

 

 

  • Add position information in each word vector.  

 

The key advantage of recurrent neural networks is their knack for understanding relationships between sequences and remembering information. On the other hand, Transformers employ positional encoding to factor in where words are in a sequence.  

  • Apply Multi-Head Attention  

Apply Multi-Head Attention   

  • Use Feed Forward Network  

 

2. Decoder  

  • Utilize embeddings to transform a French sentence into vectors.   

  decoder French

 

  • Add positional details within each word vector.    

Positional encoding French   

  • Apply Multi-Head Attention  

  Apply Multi-Head Attention French

 

  • Apply Feed Forward Network  

 

  • Apply Multi-Head Attention to the encoder output.  

Multi-Head Attention - encoder output
 

We can observe that the Transformer combines the encoder’s output with the decoder’s input. This enables it to discern the relationship between the vectors that encode the English and French sentences.  

  • Apply the Feed Forward Network again.  
  • Compute the probability for the next word by using linear + SoftMax block. The decoder returns the highest probability as the next word at the output.  

  Linear and SoftMax block

In our case, the next word after “Je” is “suis” 

 

Final thoughts 

The transformer model outperforms all the models on different benchmarks also there was no difference seen between the translation provided by the algorithm and by humans.   

Transformers are a major advance in NLP, they exceed RNN by having a lower training cost allowing to train models on larger corpora. Even today, transformers remain the basis of state-of-the-art models such as BERT, Roberta, XLNET, and GPT.  

 

 

References: 

https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 

https://github.com/hkproj/transformer-from-scratch-notes 

http://jalammar.github.io/illustrated-transformer/ 

October 18, 2023

Related Topics

Statistics
Resources
rag
Programming
Machine Learning
LLM
Generative AI
Data Visualization
Data Security
Data Science
Data Engineering
Data Analytics
Computer Vision
Career
AI