
Retrieval-augmented generation (RAG) has reshaped LLM-powered search by retrieving relevant context dynamically at query time.

In a RAG system, chunk size is one of the main factors governing efficiency and retrieval quality. How do you identify the most effective chunk size for efficient retrieval? This is where the assessment provided by the LlamaIndex Response Evaluation tool becomes invaluable.

In this article, we will provide a comprehensive walkthrough, enabling you to discern the ideal chunk size through the powerful features of LlamaIndex’s Response Evaluation module. 


Why chunk size matters in a RAG system

Selecting the appropriate chunk size is a crucial determination that holds sway over the effectiveness and precision of a RAG system in various ways: 

 

Pertinence and detail:

Opting for a smaller chunk size, such as 256, results in more granular segments. However, this granularity brings the risk that pivotal information might not appear among the top retrieved segments.

On the contrary, a chunk size of 512 is likely to encompass all vital information within the leading chunks, ensuring that responses to inquiries are readily accessible. To navigate this challenge, we will employ the faithfulness and relevance metrics.

These metrics gauge the absence of ‘hallucinations’ and the ‘relevancy’ of responses concerning the query and the contexts retrieved, respectively. 

 


Generation time for responses:

As the chunk size increases, so does the volume of information passed to the LLM to generate a response. While this can provide a more comprehensive context, it may also slow the system down. Ensuring that the added depth doesn’t compromise the system’s responsiveness is pivotal.

Ultimately, finding the ideal chunk size boils down to striking a balance: capturing all crucial information while maintaining operational speed. It’s essential to test different sizes thoroughly to discover a setup that suits the specific use case and dataset. 
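To make the trade-off concrete, here is a minimal sketch of fixed-size chunking. It splits on whitespace as a stand-in for a real tokenizer (an assumption made for simplicity; LlamaIndex measures chunk size in tokens), and the function name chunk_text is hypothetical rather than part of any library.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Words stand in for tokens here; this is an illustration, not LlamaIndex's splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With the same text, a chunk size of 256 produces roughly twice as many, shorter chunks as a chunk size of 512, which is why the answer to a question is more likely to be split across chunks at the smaller size.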

Why evaluation? 

The discussion surrounding evaluation in the field of NLP has been contentious, particularly with the advancements in NLP methodologies.

Traditional evaluation metrics like BLEU or F1 are increasingly considered unreliable for assessing generative models because they correlate poorly with human judgments.

As a result, the landscape of evaluation practices continues to shift, emphasizing the need for cautious application. 

In this blog, our focus will be on configuring the gpt-3.5-turbo model to serve as the central tool for evaluating the responses in our experiment.

To facilitate this, we establish two key evaluators, the faithfulness evaluator and the relevance evaluator, utilizing the service context. This approach aligns with the evolving standards of LLM evaluation, reflecting the need for more sophisticated and reliable evaluation mechanisms. 

 

Faithfulness evaluator: This evaluator helps determine whether the response was hallucinated by checking whether the response from a query engine is supported by any of the source nodes. 

Relevancy evaluator: This evaluator is crucial for gauging whether the query was effectively addressed by the response and examines whether the response, combined with source nodes, matches the query. 

In order to determine the appropriate chunk size, we will calculate metrics such as average response time, average faithfulness, and average relevancy across different chunk sizes.  

 

 

Downloading dataset 

We will be using the IRS armed forces tax guide for this experiment. 

  • mkdir is used to make a folder. Here we are making a folder named dataset in the root directory. 
  • wget command is used for non-interactive downloading of files from the web. It allows users to retrieve content from web servers, supporting various protocols like HTTP, HTTPS, and FTP. 
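For readers who prefer to stay in Python, the two shell steps above can be sketched with the standard library. This is a minimal sketch: the function name download_file is hypothetical, and the URL of the tax guide is deliberately left as a parameter for you to supply.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, dest_dir: str = "dataset") -> Path:
    """Create dest_dir if needed and download url into it, returning the local path."""
    folder = Path(dest_dir)
    folder.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p dataset`
    # Use the last path segment of the URL as the file name.
    filename = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    target = folder / filename
    urlretrieve(url, target)  # non-interactive download, similar in spirit to wget
    return target
```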

 

 

Load dataset 

  • SimpleDirectoryReader class will help us to load all the files in the dataset directory. 
  • document[0:10] represents that we will only be loading the first 10 pages of the file for the sake of simplicity. 

 

 

Defining question bank 

These questions will help us to evaluate metrics for different chunk sizes. 

 

 

 

Establishing evaluators  

This code initializes an OpenAI language model (gpt-3.5-turbo) with temperature=0 and instantiates evaluators for measuring faithfulness and relevancy, using the ServiceContext module with default configurations. 

 

 

Main evaluator method 

We will be evaluating each chunk size based on 3 metrics. 

  1. Average Response Time 
  2. Average Faithfulness 
  3. Average Relevancy 

 


 

  • The function evaluator takes two parameters, chunkSize and questionBank. 
  • It first initializes an OpenAI language model (llm) with the model set to gpt-3.5-turbo. 
  • Then, it creates a serviceContext using the ServiceContext.from_defaults method, specifying the language model (llm) and the chunk size (chunkSize). 
  • The function uses the VectorStoreIndex.from_documents method to create a vector index from a set of documents, with the service context specified. 
  • It builds a query engine (queryEngine) from the vector index. 
  • The total number of questions in the question bank is determined and stored in the variable totalQuestions. 

Next, the function initializes variables for tracking various metrics: 

  • totalResponseTime: Tracks the cumulative response time for all questions. 
  • totalFaithfulness: Tracks the cumulative faithfulness score for all questions. 
  • totalRelevancy: Tracks the cumulative relevancy score for all questions. 

Then, for each question in the question bank: 

  • It records the start time before querying the queryEngine for a response to the current question. 
  • It calculates the elapsed time for the query by subtracting the start time from the current time. 
  • The function evaluates the faithfulness of the response using faithfulnessLLM.evaluate_response and stores the result in the faithfulnessResult variable. 
  • Similarly, it evaluates the relevancy of the response using relevancyLLM.evaluate_response and stores the result in the relevancyResult variable. 
  • The function accumulates the elapsed time, faithfulness result, and relevancy result in their respective total variables. 

After evaluating all the questions, the function computes the averages of the three metrics by dividing each total by totalQuestions. 
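The timing and scoring loop described above can be sketched as follows. This is a simplified outline, not the original notebook code: it takes an already-built query engine instead of building the index itself, and it assumes only that the engine has a query method and that each evaluator has an evaluate_response method returning an object with a boolean passing attribute. The snake_case names are ours.

```python
import time


def evaluate_chunk_size(query_engine, question_bank, faithfulness_eval, relevancy_eval):
    """Return (avg_response_time, avg_faithfulness, avg_relevancy) over question_bank."""
    total_questions = len(question_bank)
    if total_questions == 0:
        raise ValueError("question_bank is empty")

    total_response_time = 0.0
    total_faithfulness = 0.0
    total_relevancy = 0.0

    for question in question_bank:
        # Time only the query itself, not the evaluation calls.
        start = time.perf_counter()
        response = query_engine.query(question)
        total_response_time += time.perf_counter() - start

        # Assumption: each result exposes a boolean `.passing` flag.
        faithfulness = faithfulness_eval.evaluate_response(response=response)
        relevancy = relevancy_eval.evaluate_response(query=question, response=response)
        total_faithfulness += float(faithfulness.passing)
        total_relevancy += float(relevancy.passing)

    return (
        total_response_time / total_questions,
        total_faithfulness / total_questions,
        total_relevancy / total_questions,
    )
```

Because the function depends only on these method names, it can be exercised with small stub objects before spending API calls on real evaluators.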

 

 

 

Testing different chunk sizes 

To find the best chunk size for our data, we define a list of chunk sizes and iterate over it, computing the average response time, average faithfulness, and average relevancy for each size with the evaluator method. We then convert the results into a Pandas DataFrame to view them as a readable table. 
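As a rough sketch of that sweep, the helper below calls an evaluation function once per chunk size and collects one row per size. The names sweep_chunk_sizes and format_table are hypothetical, and the evaluate argument is assumed to be any callable that returns the three averages for a given chunk size. The plain-text table is a dependency-free stand-in for the DataFrame.

```python
def sweep_chunk_sizes(chunk_sizes, evaluate):
    """Call evaluate(chunk_size) for each size and collect one row of metrics per size."""
    rows = []
    for size in chunk_sizes:
        avg_time, avg_faithfulness, avg_relevancy = evaluate(size)
        rows.append({
            "chunk_size": size,
            "avg_response_time": avg_time,
            "avg_faithfulness": avg_faithfulness,
            "avg_relevancy": avg_relevancy,
        })
    return rows


def format_table(rows):
    """Render the rows as a fixed-width text table (a stand-in for a DataFrame)."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = ["  ".join(f"{h:>18}" for h in headers)]
    for row in rows:
        cells = []
        for h in headers:
            value = row[h]
            # Floats get three decimals; integers such as chunk_size are printed as-is.
            cells.append(f"{value:>18.3f}" if isinstance(value, float) else f"{value:>18}")
        lines.append("  ".join(cells))
    return "\n".join(lines)
```

If pandas is installed, passing the same list of dictionaries to pandas.DataFrame(rows) yields the equivalent table.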

 

 

From the illustration, it is evident that the chunk size of 128 exhibits the highest average faithfulness and relevancy while maintaining the second-lowest average response time. 

Use LlamaIndex to construct a RAG system 

Identifying the best chunk size for a RAG system depends on a combination of intuition and empirical data. By utilizing LlamaIndex’s Response Evaluation module, we can experiment with different sizes and make well-informed decisions.

When constructing a RAG system, it is crucial to remember that the chunk size plays a pivotal role. Therefore, it is essential to invest the necessary time to thoroughly evaluate and fine-tune the chunk size for optimal outcomes. 

 

You can find the complete code here 

With the introduction of LLaMA v1, we witnessed a surge in customized models like Alpaca, Vicuna, and WizardLM. This surge motivated various businesses to launch their own foundational models, such as OpenLLaMA, Falcon, and XGen, with licenses suitable for commercial purposes. LLaMA 2, the latest release, now combines the strengths of both approaches, offering an efficient foundational model with a more permissive license. 

 

In the first half of 2023, the software landscape underwent a significant transformation with the widespread adoption of APIs like OpenAI API to build infrastructures based on Large Language Models (LLMs). Libraries like LangChain and LlamaIndex played crucial roles in this evolution.  


 

As we move into the latter part of the year, fine-tuning or instruction tuning of these models is becoming standard practice in the LLMOps workflow. This trend is motivated by several factors, including:

 

  • Potential cost savings 
  • The capacity to handle sensitive data 
  • The opportunity to develop models that can outperform well-known models like ChatGPT and GPT-4 in specific tasks. 

Fine-tuning: 

Fine-tuning methods refer to various techniques used to enhance the performance of a pre-trained model by adapting it to a specific task or domain. These methods are valuable for optimizing a model’s weights and parameters to excel in the target task. Here are different fine-tuning methods: 

  • Supervised Fine-Tuning: This method involves further training a pre-trained language model (LLM) on a specific downstream task using labeled data. The model’s parameters are updated to excel in this task, such as text classification, named entity recognition, or sentiment analysis. 

 

  • Transfer Learning: Transfer learning involves repurposing a pre-trained model’s architecture and weights for a new task or domain. Typically, the model is initially trained on a broad dataset and is then fine-tuned to adapt to specific tasks or domains, making it an efficient approach. 

 

  • Sequential Fine-tuning: Sequential fine-tuning entails the gradual adaptation of a pre-trained model on multiple related tasks or domains in succession. This sequential learning helps the model capture intricate language patterns across various tasks, leading to improved generalization and performance. 

 

  • Task-specific Fine-tuning: Task-specific fine-tuning is a method where the pre-trained model undergoes further training on a dedicated dataset for a particular task or domain. While it demands more data and time than transfer learning, it can yield higher performance tailored to the specific task. 

 

  • Multi-task Learning: Multi-task learning involves fine-tuning the pre-trained model on several tasks simultaneously. This strategy enables the model to learn and leverage common features and representations across different tasks, ultimately enhancing its ability to generalize and perform well. 

 

  • Adapter Training: Adapter training entails training lightweight modules that are integrated into the pre-trained model. These adapters allow for fine-tuning on specific tasks without interfering with the original model’s performance on other tasks. This approach maintains efficiency while adapting to task-specific requirements. 

 

Why fine-tune LLM? 

 

[Figure: Fine-tuning LLM. Source: DeepLearningAI]

 

The figure discusses the allocation of AI tasks within organizations, taking into account the amount of available data. On the left side of the spectrum, having a substantial amount of data allows organizations to train their own models from scratch, albeit at a high cost.

Alternatively, if an organization possesses a moderate amount of data, it can fine-tune pre-existing models to achieve excellent performance. For those with limited data, the recommended approach is in-context learning, specifically through techniques like retrieval augmented generation using general models.

However, our focus will be on the fine-tuning aspect, as it offers a favorable balance between accuracy, performance, and speed compared to larger, more general models. 

 

[Figure: Pre-trained LLM. Source: Intuitive Tutorials]

 

Why LLaMA 2? 

Before we dive into the detailed guide, let’s take a quick look at the benefits of Llama 2. 

 


 

  • Diverse range: Llama 2 comes in various sizes, from 7 billion to a massive 70 billion parameters. It shares a similar architecture with Llama 1 but boasts improved capabilities.
  • Extensive training data: This model has been trained on a massive dataset of 2 trillion tokens, demonstrating its vast exposure to a wide range of information. 
  • Enhanced context: With an extended context length of 4,096 tokens, the model can better understand and generate extensive content. 
  • Grouped query attention (GQA): GQA improves inference scalability by letting groups of query heads share the same key and value heads, which shrinks the key-value cache that must be kept for previous tokens. In Llama 2 it is used in the larger (34B and 70B) variants. 
  • Performance excellence: Llama 2 models consistently outperform their predecessors, particularly the Llama 2 70B version. They excel in various benchmarks, competing strongly with models like Llama 1 65B and even Falcon models. 
  •  Open source vs. closed source LLMs: When compared to models like GPT-3.5 or PaLM (540B), Llama 2 70B demonstrates impressive performance. While there may be a slight gap in certain benchmarks when compared to GPT-4 and PaLM-2, the model’s potential is evident. 

Parameter efficient fine-tuning (PEFT) 

Parameter Efficient Fine-Tuning involves adapting pre-trained models to new tasks while making minimal changes to the model’s parameters. This is especially important for large neural network models like BERT, GPT, and similar ones. Let’s delve into why PEFT is so significant:

  • Reduced overfitting: Limited datasets can be problematic. Making too many parameter adjustments can lead to model overfitting. PEFT allows us to strike a balance between the model’s flexibility and tailoring it to new tasks. 
  • Faster training: Making fewer parameter changes results in fewer computations, which in turn leads to faster training sessions. 
  • Resource efficiency: Training deep neural networks requires substantial computational resources. PEFT minimizes the computational and memory demands, making it more practical to deploy in resource-constrained environments.  
  • Knowledge preservation: Extensive pretraining on diverse datasets equips models with valuable general knowledge. PEFT ensures that this wealth of knowledge is retained when adapting the model to new tasks. 


 

PEFT technique 

The most popular PEFT technique is LoRA. Let’s see what it offers:

  • LoRA 

LoRA, or Low-Rank Adaptation, represents a groundbreaking advancement in the realm of large language models. At the beginning of the year, these models seemed accessible only to wealthy companies. However, LoRA has changed the landscape. 

LoRA has made the use of large language models accessible to a wider audience. Its low-rank adaptation approach has significantly reduced the number of trainable parameters by up to 10,000 times. This results in:  

  • A threefold reduction in GPU requirements, which is typically a major bottleneck. 
  • Comparable, if not superior, performance even without fine-tuning the entire model. 

In traditional fine-tuning, we modify the existing weights of a pre-trained model using new examples, which means learning an update matrix of the same size as each weight matrix. LoRA instead keeps the pre-trained weights frozen and relies on rank factorization: the update is expressed as the product of two much smaller matrices which, when multiplied together, approximate the full-size update. 

To illustrate, imagine a 1000×1000 matrix with 1,000,000 parameters. Through rank factorization, if the rank is, for instance, five, we could have two matrices, one sized 1000×5 and the other 5×1000. Together they hold just 10,000 parameters, a 100-fold reduction. 
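The arithmetic in this example is easy to check in a few lines. The helper names below are ours, chosen for illustration only.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a full-size update matrix of shape d_out x d_in."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in


full = full_update_params(1000, 1000)
low_rank = lora_update_params(1000, 1000, rank=5)
print(full, low_rank, full // low_rank)  # prints: 1000000 10000 100
```

Doubling the rank doubles the number of trainable parameters, so the rank is the main knob for trading adapter capacity against memory.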

In recent days, researchers have introduced an extension of LoRA known as QLoRA. 

  • QLoRA 

QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques. 
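To give a feel for what 4-bit quantization does, here is a deliberately simplified sketch. It is not NF4: it uses a uniform grid of integer levels with one absmax scale per block, whereas NF4 uses non-uniform levels designed for normally distributed weights, and Double Quantization additionally quantizes the per-block scales. The function names and the default block size are our own choices for illustration.

```python
def quantize_4bit(weights, block_size=64):
    """Quantize floats to signed 4-bit codes in [-7, 7], one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 7 if absmax > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_4bit(codes, scales, block_size=64):
    """Map each code back to a float using the scale of the block it came from."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]
```

Each weight is stored as one of a handful of integer levels plus a shared scale per block, so the reconstruction error per weight is at most half a quantization step.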

 

Quantization + LoRA

Environment setup 

About dataset 

 The dataset has undergone special processing to ensure a seamless match with Llama 2’s prompt format, making it ready for training without the need for additional modifications. 

 

 

Since the data has already been adapted to Llama 2’s prompt format, it can be directly employed to tune the model for particular applications.

Configuring the model and tokenizer 

We start by specifying the pre-trained Llama 2 model and naming the fine-tuned version we will produce, “llama-2-7b-enhanced”. We then load the tokenizer and make slight adjustments to ensure compatibility with half-precision floating-point (fp16) operations. Working with fp16 can offer various advantages, including reduced memory usage and faster model training. However, not all operations work seamlessly with this lower-precision format, which is why the tokenizer settings are adjusted before training. 

Next, we load the pre-trained Llama 2 model with our quantization configuration. We then deactivate caching and set the pretraining tensor-parallelism parameter (pretraining_tp).

To shrink the model’s memory footprint, we employ 4-bit quantization configured through BitsAndBytesConfig. Quantization involves representing the model’s weights in a way that consumes less memory.

The configuration mentioned here uses the ‘nf4’ type for quantization. You can experiment with different quantization types to explore potential performance variations. 

 

 

Quantization configuration 

In the context of training a machine learning model using Low-Rank Adaptation (LoRA), several parameters play a significant role. Here’s a simplified explanation of each: 

Parameters specific to LoRA:

  • Dropout Rate (lora_dropout): This parameter represents the probability that the output of each neuron is set to zero during training. It is used to prevent overfitting, which occurs when the model becomes too tailored to the training data. 

 

  • Rank (r): The rank sets the inner dimension of the two smaller matrices into which each weight update is decomposed. This decomposition reduces computational demands and memory usage. Lower ranks mean fewer trainable parameters and faster training but may limit the model’s performance. The original LoRA paper suggests starting with a rank of 8, but for QLoRA, a rank of 64 is recommended. 

 

  • Lora_alpha: This parameter controls the scaling of the low-rank update: the update is multiplied by lora_alpha / r before it is added to the output of the frozen weights. It’s like finding the right balance between the original model and the low-rank approximation. Higher values make the update more influential during fine-tuning, which affects performance but not computational cost. 

 

By adjusting these parameters, particularly lora_alpha and r, you can observe how the model’s performance and resource utilization change. This allows you to fine-tune the model for your specific task and find the optimal configuration. 
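To see how r and lora_alpha interact, here is a minimal pure-Python sketch of a LoRA forward pass for a single layer, using the lora_alpha / r scaling from the LoRA paper. Dropout is omitted for brevity, and the helper names matvec and lora_forward are ours; real implementations operate on tensors rather than nested lists.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x).

    W is the frozen weight matrix (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the small trainable matrices.
    """
    scaling = lora_alpha / r
    base = matvec(W, x)               # output of the frozen pre-trained layer
    update = matvec(B, matvec(A, x))  # low-rank correction, computed via the r-dim bottleneck
    return [b + scaling * u for b, u in zip(base, update)]
```

When B is all zeros, which is how LoRA adapters are typically initialized, the layer reproduces the pre-trained output exactly; training then moves the output away from it at a rate governed by lora_alpha / r.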

 

 

You can find the code of this notebook here.

Conclusion 

I asked both the fine-tuned and the base (not fine-tuned) LLaMA 2 models about a university, and the fine-tuned model provided the correct result. The base model had no knowledge of the subject, so it hallucinated a response. 

[Figure: Response from the base (not fine-tuned) model]

[Figure: Response from the fine-tuned model]

Before we understand LlamaIndex, let’s step back a bit. Imagine a futuristic landscape where machines possess an extraordinary ability to understand and produce human-like text effortlessly. LLMs have made this vision a reality. Armed with a vast ocean of training data, these marvels of innovation have become the crown jewels of the tech world.

There is no denying that LLMs (Large Language Models) are currently the talk of the town! Trained on massive datasets, LLMs have revolutionized text generation and reasoning and have been making waves in the tech community.

One particular LLM has emerged as a true superstar. Back in November 2022, ChatGPT, an LLM developed by OpenAI, attracted a staggering one million users within 5 days of its beta launch.

[Figure: ChatGPT Sprints to One Million Users. Source: Statista]

When researchers and developers saw these stats, they started thinking about how best to feed or augment these LLMs with their own private data. They considered several options:

  1. Fine-tune your own LLM. You adapt an existing LLM by training it on your data. However, this is very costly and time-consuming.
  2. Combine all the documents into a single large prompt. This might be possible now that some models have token limits as high as 100k. However, this approach could result in slower processing times and higher computational costs.
  3. Instead of inputting all the data, selectively provide relevant information in the LLM prompt. Choose the useful bits for each query instead of including everything.

Option 3 appears to be both relevant and feasible, but it requires the development of a specialized toolkit. Recognizing this need, efforts have already begun to create the necessary tools.
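The third approach, selectively providing relevant information, can be sketched in a few lines. This toy version ranks chunks by simple word overlap with the query, which is an assumption made to keep the example dependency-free; real toolkits typically use embedding similarity. The function names are hypothetical.

```python
def overlap_score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk (case-insensitive)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Keep the top_k highest-scoring chunks and place them before the question."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Only the selected chunks reach the LLM, so the prompt stays short no matter how large the document collection grows.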

Introducing LlamaIndex

LangChain is a recently launched toolkit for building LLM applications. LlamaIndex is built on top of LangChain to provide a central interface to connect your LLMs with external data.

Key Components of LlamaIndex:

The key components of LlamaIndex are as follows

  • Data Connectors: The data connector, known as the Reader, collects data from various sources and formats, converting it into a straightforward document format with textual content and basic metadata.
  • Data Index: It is a data structure facilitating efficient retrieval of pertinent information in response to user queries. At a broad level, Indices are constructed using Documents and serve as the foundation for Query Engines and Chat Engines, enabling seamless interactions and question-and-answer capabilities based on the underlying data. Internally, Indices store data within Node objects, which represent segments of the original documents.
  • Retrievers: Retrievers play a crucial role in obtaining the most pertinent information based on user queries or chat messages. They can be constructed based on Indices or as standalone components and serve as a fundamental element in Query Engines and Chat Engines for retrieving contextually relevant data.
  • Query Engines: A query engine is a versatile interface that enables users to pose questions regarding their data. By accepting natural language queries, the query engine provides comprehensive and informative responses.
  • Chat Engines: A chat engine serves as an advanced interface for engaging in interactive conversations with your data, allowing for multiple exchanges instead of a single question-and-answer format. Similar to ChatGPT but enhanced with access to a knowledge base, the chat engine maintains a contextual understanding by retaining the conversation history and can provide answers that consider the relevant past context.

Difference between query engine and chat engine:

It is important to note that there is a significant distinction between a query engine and a chat engine. Although they may appear similar at first glance, they serve different purposes:

A query engine operates as an independent system that handles individual questions over the data without maintaining a record of the conversation history.

On the other hand, a chat engine is designed to keep track of the entire conversation history, allowing users to query both the data and previous responses. This functionality resembles ChatGPT, where the chat engine leverages the context of past exchanges to provide more comprehensive and contextually relevant answers.
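The distinction can be illustrated with two toy classes. These are not LlamaIndex's implementations: the class names are hypothetical, and answer_fn is an assumed callable that stands in for the retrieval and LLM call, taking a message and a history list.

```python
class ToyQueryEngine:
    """Stateless: every question is answered on its own."""

    def __init__(self, answer_fn):
        self.answer_fn = answer_fn

    def query(self, question):
        # No history is kept, so the answer function always sees an empty list.
        return self.answer_fn(question, history=[])


class ToyChatEngine:
    """Stateful: earlier turns are passed along with each new message."""

    def __init__(self, answer_fn):
        self.answer_fn = answer_fn
        self.history = []

    def chat(self, message):
        reply = self.answer_fn(message, history=list(self.history))
        self.history.append((message, reply))  # remember this turn for later ones
        return reply
```

Calling query twice gives two independent answers, while calling chat twice lets the second answer draw on the first exchange.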

  • Customization: LlamaIndex offers customization options where you can modify the default settings, such as the utilization of OpenAI’s text-davinci-003 model. Users have the flexibility to customize the underlying language model (LLM) and other settings used in LlamaIndex, with support for various integrations and LangChain’s LLM modules.
  • Analysis: LlamaIndex offers a diverse range of analysis tools for examining indices and queries. These tools include features for analyzing token usage and associated costs. Additionally, LlamaIndex provides a Playground module, which presents a visual interface for analyzing token usage across different index structures and evaluating performance metrics.
  • Structured Outputs: LlamaIndex offers an assortment of modules that empower language models (LLMs) to generate structured outputs. These modules are available at various levels of abstraction, providing flexibility and versatility in producing organized and formatted results.
  • Evaluation: LlamaIndex provides essential modules for assessing the quality of both document retrieval and response synthesis. These modules enable the evaluation of “hallucination,” which refers to situations where the generated response does not align with the retrieved sources. A hallucination occurs when the model generates an answer without effectively grounding it in the given contextual information from the prompt.
  • Integrations: LlamaIndex offers a wide array of integrations with various toolsets and storage providers. These integrations encompass features such as utilizing vector stores, integrating with ChatGPT plugins, compatibility with Langchain, and the capability to trace with Graphsignal. These integrations enhance the functionality and versatility of LlamaIndex by allowing seamless interaction with different tools and platforms.
  • Callbacks: LlamaIndex offers a callback feature that assists in debugging, tracking, and tracing the internal operations of the library. The callback manager allows for the addition of multiple callbacks as required. These callbacks not only log event-related data but also track the duration and frequency of each event occurrence. Moreover, a trace map of events is recorded, providing valuable information that callbacks can utilize in a manner that best suits their specific needs.
  • Storage: LlamaIndex offers a user-friendly interface that simplifies the process of ingesting, indexing, and querying external data. By abstracting away complexities, LlamaIndex allows users to query their data with just a few lines of code. Behind the scenes, LlamaIndex provides the flexibility to customize storage components for different purposes. This includes document stores for storing ingested documents (represented as Node objects), index stores for storing index metadata, and vector stores for storing embedding vectors. The document and index stores utilize a shared key-value store abstraction, providing a common framework for efficient storage and retrieval of data.

Now that we have explored the key components of LlamaIndex, let’s delve into its operational mechanisms and understand how it functions.

How LlamaIndex works:

To begin, the first step is to import the documents into LlamaIndex, which provides various pre-existing readers for sources like databases, Discord, Slack, Google Sheets, Notion, and the one we will utilize today, the Simple Directory Reader, among others.

You can check for more here: Llama Hub (llama-hub-ui.vercel.app)

Once the documents are loaded, LlamaIndex proceeds to parse them into nodes, which are essentially segments of text. Subsequently, an index is constructed to enable quick retrieval of relevant data when querying the documents. The index can be stored in different formats, but we will opt for a Vector Store as it is typically the most useful when querying text documents without specific limitations.
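To show the idea behind a vector store index without any external services, here is a toy in-memory version. It is only a conceptual sketch: the bag-of-words vectors stand in for learned embeddings, and the names embed, cosine, and ToyVectorIndex are hypothetical rather than LlamaIndex APIs.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Stores each node next to its vector and retrieves the most similar nodes."""

    def __init__(self, nodes):
        self.entries = [(node, embed(node)) for node in nodes]

    def retrieve(self, query: str, top_k: int = 1):
        query_vector = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True)
        return [node for node, _ in ranked[:top_k]]
```

The vectors are computed once at indexing time, so each query only needs to embed the question and compare it against the stored vectors.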

LlamaIndex is built upon LangChain, which serves as the foundational framework for a wide range of LLM applications. While LangChain provides the fundamental building blocks, LlamaIndex is specifically designed to streamline the workflow described above.

Here is an example code showcasing the utilization of the SimpleDirectoryReader data loader in LlamaIndex, along with the integration of the OpenAI language model for natural language processing.

Installing the necessary libraries required to run the code.


Importing openai library and setting the secret API (Application Programming Interface) key.


Importing the SimpleDirectoryReader class from llama_index library and loading the data from it.


Importing SimpleNodeParser class from llama_index and parsing the documents into nodes – basically in chunks of text.


Importing the VectorStoreIndex class from llama_index to create an index from the chunks of text, so that each time a query is placed, only the relevant data is sent to OpenAI. This keeps costs down.

Conclusion:

LlamaIndex, built on top of Langchain, offers a powerful toolkit for integrating external data with LLMs. By parsing documents into nodes, constructing an efficient index, and selectively querying relevant information, LlamaIndex enables cost-effective exploration of text data.

The provided code example demonstrates the utilization of LlamaIndex’s data loader and query engine, showcasing its potential for next-generation text exploration. For the notebook of the above code, refer to the source code available here.