
retrieval augmented generation

Huda Mahmood
| February 1

Retrieval augmented generation (RAG) has improved how large language models (LLMs) function. It enables generative AI to create more coherent and contextually relevant content. Let’s take a deeper look at RAG.


What is retrieval augmented generation?


RAG is an AI framework, rooted in natural language processing (NLP), that retrieves information from an external knowledge base and supplies it to an LLM. It integrates retrieval-based and generation-based approaches so that the model’s output is grounded in that knowledge base.


A retrieval augmented generation model accesses a large pre-existing pool of knowledge to improve the quality of LLM-generated responses. It ensures that the information is more accurate and up-to-date by combining factual data with contextually relevant information.


By combining vector databases and LLMs, the retrieval model has set a standard for how generative AI searches and navigates data. It has become one of the most used techniques in LLM applications.
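The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It assumes a toy bag-of-words "embedding" in place of a real embedding model and an in-memory list in place of a vector database; the names `embed`, `retrieve`, and `build_prompt` are hypothetical, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Augment the question with retrieved context before it goes to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


knowledge_base = [
    "RAG combines a retriever with a text generator.",
    "FAISS is a library for similarity search over dense vectors.",
    "Chunk size controls how documents are split before indexing.",
]

prompt = build_prompt("What does FAISS do?", knowledge_base, top_k=1)
print(prompt)
```

In a real system the prompt would be passed to the generation model, and the ranking would come from a vector database rather than a sorted Python list.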


An example illustrating retrieval augmentation – Source: LinkedIn


Benefits of RAG


Beyond improving LLM responses, retrieval augmented generation offers multiple benefits to the generative AI efforts of an organization.


Improved contextual awareness


The retrieval component allows access to a large knowledge base, enabling the model to generate contextually relevant information. Due to improved awareness of the context, the output generated is more coherent and appropriate.


Enhanced accuracy


An LLM using a retrieval model can produce accurate results with proper attribution, including citations of relevant sources. Access to a large and accurate database makes it more likely that the generated results are factually correct.


Adaptability to dynamic knowledge


The knowledge base of a retrieval model is regularly updated to ensure access to the latest information. The system integrates new information without retraining the entire program, ensuring quick adaptability. It enables the generative models to access the latest statistics and research.


Resource efficiency


Retrieval mechanisms enable the model to retrieve information from a large information base. The contextual relevance of the data enhances the accuracy of the results, making the process resource-efficient. It makes handling of large data volumes easier and makes the system cost-efficient.


Increased developer control


Developers use a retrieval augmented generation model to control the information base of an LLM. They can adapt the data to the changing needs of the user. Moreover, they can also restrict the accessibility of the knowledge base, giving them control of data authorization.
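One way to picture this kind of control is a filter that runs before retrieval. The sketch below is only an illustration under assumed names: the `Document` class, its `allowed_roles` field, and the role labels are hypothetical and not part of any particular RAG framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_roles: frozenset  # hypothetical access-control metadata


def visible_documents(documents, user_role):
    """Keep only the documents the given role may read, before any retrieval."""
    return [doc for doc in documents if user_role in doc.allowed_roles]


knowledge_base = [
    Document("Public product FAQ.", frozenset({"customer", "employee"})),
    Document("Internal pricing guidelines.", frozenset({"employee"})),
]

customer_view = visible_documents(knowledge_base, "customer")
print([doc.text for doc in customer_view])
```

Because the filter is applied to the knowledge base rather than to the model, changing who can see what does not require touching the LLM itself.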




Frameworks for retrieval augmented generation


A RAG system combines a retrieval model with a generation model. Developers use frameworks and libraries available online to implement the required retrieval system. Let’s take a look at some of the common resources used for it.


Hugging Face Transformers


It is a popular library of pre-trained models for different tasks. It includes retrieval models like Dense Passage Retrieval (DPR) and generation models like GPT. The library allows these components to be integrated into a unified retrieval augmented generation model.


Facebook AI Similarity Search (FAISS)


FAISS is used for similarity search and clustering of dense vectors. It plays a crucial role in building the retrieval component of a system, and it is preferred where vector similarity is central to how the system works.
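To make "similarity search over dense vectors" concrete, here is a brute-force nearest-neighbor search in plain Python. It is a sketch of the underlying operation only and does not use the FAISS API; FAISS performs this kind of search with optimized exact and approximate indexes. The function names and the sample vectors are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index_vectors, query, k):
    """Return (distance, position) pairs for the k stored vectors closest to the query."""
    scored = [(l2_distance(vec, query), pos) for pos, vec in enumerate(index_vectors)]
    return sorted(scored)[:k]


stored = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
]

nearest = search(stored, query=[0.9, 1.2], k=2)
print([position for _, position in nearest])
```

The brute-force version compares the query with every stored vector, which is exactly the cost that a dedicated similarity search library is designed to reduce on large collections.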


PyTorch and TensorFlow


These are commonly used deep learning frameworks that offer immense flexibility in building RAG models. They enable the developers to create retrieval and generation models separately. Both models can then be integrated into a larger framework to develop a RAG model.




Haystack


It is a Python framework that is built on Elasticsearch. It is suited to building end-to-end conversational AI systems. The components of the framework are used for storage of information, retrieval models, and generation models.




Use cases of RAG


Some common use cases and real-world applications are listed below.

Content creation


It primarily deals with writing articles and blogs. It is one of the most common uses of LLMs, where retrieval models help generate coherent and relevant content. It can lead to personalized results for users that include real-time trends and relevant contextual information.


Real-time commentary


A retriever uses APIs to connect real-time information updates with an LLM. It is used to create a virtual commentator which can be integrated further to create text-to-speech models. IBM used this mechanism during the US Open 2023 for live commentary.


Question answering system


Question answering through retrieval augmented generation – Source: Medium


The ability of LLMs to generate contextually relevant content enables the retrieval model to function as a question-answering machine. It can retrieve factual information from an extensive knowledge base to create a comprehensive answer.


Language translation


Translation is a tricky process. A retrieval model can detect the context of phrases and words, enabling the generation of relevant translations. Access to external databases helps keep the results accurate and fluent for the users. The extensive information available on idioms and phrases in multiple languages is what makes this use case of the retrieval model possible.


Educational assistance


The application of a retrieval model in the educational arena is an extension of question answering systems, applied specifically to the educational queries of users. In answering questions and generating academic content, the system can create more comprehensive results with contextually relevant information.


Future of RAG


The integration of retrieval and generation models in LLMs is expected to grow in the future. The current trends indicate their increasing use in technological applications. Some common areas of future development of RAG include:


  • Improved architecture – the development of retrieval and generation models will result in the innovation of neural network architectures


  • Enhanced conversational agents – improved adaptation of knowledge base into retrieval model databases will result in more sophisticated conversational agents that can adapt to domain-specific information in an improved manner


  • Integration with multimodal information – including different types of information, including images and audio, can result in contextually rich responses that encompass a diverse range of media


  • Increased focus on ethical concerns – since data privacy and ethics are becoming increasingly important in today’s digital world, the development of retrieval models will also focus more on mitigating biases and addressing ethical concerns



Hence, retrieval augmented generation is an important aspect of large language models within the arena of generative AI. It has improved the overall content processing and promises an improved architecture of LLMs in the future.

Muhammad Jan
| October 31

RAG integration revolutionized search with LLM, boosting dynamic retrieval.

Within the implementation of a RAG system, a pivotal factor governing its efficiency and performance lies in the determination of the optimal chunk size. How does one identify the most effective chunk size for seamless and efficient retrieval? This is precisely where the comprehensive assessment provided by the LlamaIndex Response Evaluation tool becomes invaluable.

In this article, we will provide a comprehensive walkthrough, enabling you to discern the ideal chunk size through the powerful features of LlamaIndex’s Response Evaluation module. 


Why chunk size matters in a RAG system

Selecting the appropriate chunk size is a crucial determination that holds sway over the effectiveness and precision of a RAG system in various ways: 




Pertinence and detail:

Opting for a smaller chunk size, such as 256, results in more detailed segments. However, this heightened detail brings the potential risk that pivotal information might not be included in the top retrieved segments.

By contrast, a chunk size of 512 is likely to encompass all vital information within the leading chunks, ensuring that responses to inquiries are readily accessible. To navigate this trade-off, we will employ the faithfulness and relevancy metrics.

These metrics gauge the absence of ‘hallucinations’ and the ‘relevancy’ of responses concerning the query and the contexts retrieved, respectively. 
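The effect of chunk size on granularity can be seen with a small splitting sketch. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the model's tokenizer), and `split_into_chunks` is a hypothetical helper, not a LlamaIndex function.

```python
def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A synthetic 1024-token document, used only to compare chunk counts.
document = " ".join(f"word{i}" for i in range(1024))

for size in (256, 512):
    chunks = split_into_chunks(document, size)
    print(f"chunk size {size}: {len(chunks)} chunks")
```

Halving the chunk size doubles the number of segments for the same document, so each retrieved segment is more specific but covers less of the surrounding text.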



Generation time for responses:

With an increase in the chunk size, the volume of information directed into the LLM for generating a response also increases. While this can guarantee a more comprehensive context, it might potentially decelerate the system. Ensuring that the added depth doesn’t compromise the system’s responsiveness is pivotal.

Ultimately, finding the ideal chunk size boils down to achieving a delicate equilibrium: capturing all crucial information while maintaining operational speed. It’s essential to conduct comprehensive testing with different sizes to discover a setup that aligns with the unique use case and dataset requirements. 

Why evaluation? 

The discussion surrounding evaluation in the field of NLP has been contentious, particularly with the advancements in NLP methodologies.

Traditional evaluation techniques like BLEU or F1 correspond only loosely to human evaluations, which makes them unreliable for assessing current models.

As a result, the landscape of evaluation practices continues to shift, emphasizing the need for cautious application. 

In this blog, our focus will be on configuring the gpt-3.5-turbo model to serve as the central tool for evaluating the responses in our experiment.

To facilitate this, we establish two key evaluators, the faithfulness evaluator and the relevance evaluator, utilizing the service context. This approach aligns with the evolving standards of LLM evaluation, reflecting the need for more sophisticated and reliable evaluation mechanisms. 


Faithfulness evaluator: This evaluator is instrumental in determining whether the response was hallucinated. It checks if the response from a query engine corresponds with any source nodes. 

Relevancy evaluator: This evaluator is crucial for gauging whether the query was effectively addressed by the response and examines whether the response, combined with source nodes, matches the query. 

In order to determine the appropriate chunk size, we will calculate metrics such as average response time, average faithfulness, and average relevancy across different chunk sizes.  



Downloading dataset 

We will be using the IRS armed forces tax guide for this experiment. 

  • mkdir is used to make a folder. Here we are making a folder named dataset in the root directory. 
  • wget command is used for non-interactive downloading of files from the web. It allows users to retrieve content from web servers, supporting various protocols like HTTP, HTTPS, and FTP. 
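The article performs this step with the shell commands mkdir and wget. For readers who prefer to stay in Python, a rough standard-library equivalent is sketched below. The `download` helper is hypothetical, and the URL in the commented example is a placeholder, not the real address of the tax guide.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url, folder="dataset"):
    """Create the folder if needed and save the file found at url inside it."""
    target_dir = Path(folder)
    # Plays the role of `mkdir`: create the folder, ignoring it if it already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path segment of the URL as the file name.
    filename = url.rstrip("/").split("/")[-1] or "download"
    destination = target_dir / filename
    # Plays the role of `wget`: fetch the file without any interaction.
    urlretrieve(url, destination)
    return destination


# Example call (placeholder URL, shown as a comment so nothing is fetched here):
# download("https://example.com/armed-forces-tax-guide.pdf")
```

The function returns the path of the saved file, which can then be handed to whatever loader reads the dataset directory.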



Load dataset 

  • SimpleDirectoryReader class will help us to load all the files in the dataset directory. 
  • document[0:10] means that we load only the first 10 pages of the file for the sake of simplicity. 



Defining question bank 

These questions will help us to evaluate metrics for different chunk sizes. 




Establishing evaluators  

This code initializes an OpenAI language model (gpt-3.5-turbo) with temperature=0 and instantiates evaluators for measuring faithfulness and relevancy, utilizing the ServiceContext module with default configurations. 



Main evaluator method 

We will be evaluating each chunk size based on 3 metrics. 

  1. Average Response Time 
  2. Average Faithfulness 
  3. Average Relevancy 




  • The function evaluator takes two parameters, chunkSize and questionBank. 
  • It first initializes an OpenAI language model (llm) with the model set to gpt-3.5-turbo. 
  • Then, it creates a serviceContext using the ServiceContext.from_defaults method, specifying the language model (llm) and the chunk size (chunkSize). 
  • The function uses the VectorStoreIndex.from_documents method to create a vector index from a set of documents, with the service context specified. 
  • It builds a query engine (queryEngine) from the vector index. 
  • The total number of questions in the question bank is determined and stored in the variable totalQuestions. 

Next, the function initializes variables for tracking various metrics: 

  • totalResponseTime: Tracks the cumulative response time for all questions. 
  • totalFaithfulness: Tracks the cumulative faithfulness score for all questions. 
  • totalRelevancy: Tracks the cumulative relevancy score for all questions. 

Then, for each question in the question bank: 

  • It records the start time before querying the queryEngine for a response to the current question. 
  • It calculates the elapsed time for the query by subtracting the start time from the current time. 
  • The function evaluates the faithfulness of the response using faithfulnessLLM.evaluate_response and stores the result in the faithfulnessResult variable. 
  • Similarly, it evaluates the relevancy of the response using relevancyLLM.evaluate_response and stores the result in the relevancyResult variable. 
  • The function accumulates the elapsed time, faithfulness result, and relevancy result in their respective total variables. 

After evaluating all the questions, the function computes the averages by dividing each total by totalQuestions. 
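The loop described above can be sketched without any LLM dependency by passing the query and evaluation steps in as plain functions. This is an illustrative stand-in, not the article's code: `evaluate_chunk_size` and the fake callables are hypothetical, and the sketch assumes each evaluator yields a pass/fail verdict that can be counted as 1 or 0. In the article's setup, these callables would wrap the query engine and the two evaluators.

```python
import time


def evaluate_chunk_size(questions, answer, is_faithful, is_relevant):
    """Return (avg_response_time, avg_faithfulness, avg_relevancy).

    answer(question) -> response
    is_faithful(response) -> bool
    is_relevant(question, response) -> bool
    """
    if not questions:
        raise ValueError("question bank is empty")

    total_response_time = 0.0
    total_faithfulness = 0
    total_relevancy = 0

    for question in questions:
        # Time only the query itself, not the evaluation calls.
        start = time.perf_counter()
        response = answer(question)
        total_response_time += time.perf_counter() - start

        # Count each pass as 1 and each fail as 0.
        total_faithfulness += int(is_faithful(response))
        total_relevancy += int(is_relevant(question, response))

    total_questions = len(questions)
    return (
        total_response_time / total_questions,
        total_faithfulness / total_questions,
        total_relevancy / total_questions,
    )


# Fake callables so the sketch runs without an LLM or an API key.
def fake_answer(question):
    return "answer to: " + question


avg_time, avg_faithfulness, avg_relevancy = evaluate_chunk_size(
    ["q1", "q2", "q3", "q4"],
    answer=fake_answer,
    is_faithful=lambda response: response.endswith(("q1", "q2", "q3")),
    is_relevant=lambda question, response: question in response,
)
print(avg_faithfulness, avg_relevancy)
```

Injecting the callables keeps the averaging logic separate from the model calls, which also makes it easy to check the bookkeeping with cheap fakes before spending API calls.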




Testing different chunk sizes 

To find the best chunk size for our data, we define a list of chunk sizes, loop through it, and use the evaluator method to compute the average response time, average faithfulness, and average relevancy for each size. We then convert the results into a Pandas DataFrame so that they are easy to read. 
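A sweep of this kind might look like the sketch below. The evaluator here is a placeholder that returns made-up numbers purely so the code runs; they are not the article's measurements. The selection rule (highest combined faithfulness and relevancy, ties broken by lower response time) is also an assumption, and the rows are kept as plain dictionaries, which `pandas.DataFrame` accepts directly if a table view is wanted.

```python
def sweep(chunk_sizes, evaluate):
    """Run evaluate(chunk_size) for every size and collect one result row per size."""
    rows = []
    for size in chunk_sizes:
        response_time, faithfulness, relevancy = evaluate(size)
        rows.append(
            {
                "chunk_size": size,
                "response_time": response_time,
                "faithfulness": faithfulness,
                "relevancy": relevancy,
            }
        )
    return rows


def pick_best(rows):
    """Prefer the highest faithfulness + relevancy; break ties by lower response time."""
    return max(
        rows,
        key=lambda row: (row["faithfulness"] + row["relevancy"], -row["response_time"]),
    )


def placeholder_evaluate(chunk_size):
    """Made-up numbers for illustration only, NOT results from the experiment."""
    made_up = {
        128: (1.0, 0.5, 0.5),
        256: (2.0, 1.0, 1.0),
        512: (3.0, 1.0, 1.0),
    }
    return made_up[chunk_size]


rows = sweep([128, 256, 512], placeholder_evaluate)
for row in rows:
    print(row)

best = pick_best(rows)
print("best chunk size under this rule:", best["chunk_size"])
```

With a real evaluator in place of `placeholder_evaluate`, the same loop produces the per-size averages that the comparison is based on.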



From the illustration, it is evident that the chunk size of 128 exhibits the highest average faithfulness and relevancy while maintaining the second-lowest average response time. 

Use LlamaIndex to construct a RAG system 

Identifying the best chunk size for a RAG system depends on a combination of intuition and empirical data. By utilizing LlamaIndex’s Response Evaluation module, we can experiment with different sizes and make well-informed decisions.

When constructing a RAG system, it is crucial to remember that the chunk size plays a pivotal role. Therefore, it is essential to invest the necessary time to thoroughly evaluate and fine-tune the chunk size for optimal outcomes. 


You can find the complete code here