Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Transformers have revolutionized natural language processing with their use of self-attention mechanisms. In this post, we will study the key components of transformers to understand how they have become the basis of the state of the art in different tasks.  


Introduction: Attention is all you need 

The Transformer architecture was first introduced in the 2017 paper “Attention is All You Need” by researchers at Google. Unlike previous sequence models such as RNNs, Transformer relies entirely on self-attention to model dependencies in sequential data like text.   


Large language models knowledge test


Remarkably, this simple change led to major improvements in machine translation quality over existing methods. Since then, Transformers have been applied successfully to diverse NLP tasks like text generation, summarization, and question-answering. Their versatility has even led to applications in computer vision 


Large language model bootcamp


But what exactly is self-attention and why is it so effective? Let’s explore this. 

The limitations of Recurrent Neural Networks – RNNs   

Recurrent neural networks (RNNs) used to be the dominant approach for modeling sequences. An RNN processes textual data incrementally, maintaining a “memory” of the previous context. For example, to predict the next word in a sentence, an RNN model would incorporate information about all the preceding words.  

However, RNNs have certain limitations. They process data sequentially, making parallelization difficult. More critically, they struggle to learn long-range dependencies because the information gets diluted over many steps of time. Attention mechanisms were proposed to mitigate this issue. 

Learn to build LLM applications

Why use a transformer model?  

The transformer architecture has enabled the development of new models that can be trained on large datasets and significantly outperform recurrent neural networks like LSTMs. These new models are utilized for tasks like sequence classification, question answering, language modeling, named entity recognition, summarization, and translation.  

Let’s examine the key components of transformers to understand how they have become the foundation for state-of-the-art performance on different NLP tasks.  

Transformer design  

A transformer consists of an encoder and a decoder. The encoder’s role is to encode the inputs (i.e. sentences) into a state, often containing multiple tensors. This state is then passed to the decoder to generate the outputs.

In machine translation, the encoder converts a source sentence, e.g. “Hello world“, into a state, such as a vector, that captures its semantic meaning.

The decoder then utilizes this state to produce the translated target sentence, e.g. “Bonjour le monde.” Both the encoder and decoder primarily employ Multi-Head Attention and Feedforward Networks, which are the focus of this article.   


Transformer model architecture

Key transformer components  

1. Input embedding  

Embedding aims to create a vector representation of words where words with similar meanings will be close in terms of Euclidean distance. For instance, the words “bathroom” and “shower” are related to the same concept, so their word vectors are close in Euclidean space as they convey similar meanings.  

For the encoder, the authors opted for an embedding size of 512 (i.e. each word is represented by a 512-dimensional vector).  


  Input embedding


2. Positional encoding  

The position of a word plays a crucial role in understanding the sequence we want to model.  

Therefore, we add positional information about the word’s location in the sequence to its vector. The authors used the following sinusoidal. 

Position encoding  


We will explain positional encoding in more detail with an example.  

  Position encoding example


We note the position of each word in the sequence.  

We define dmodel = 512, which represents the size of the embedding vector of each word (i.e. the vector dimension). We can now rewrite the two positional encoding equations as:  


two positional encoding equations


We can see that the wavelength (i.e. frequency) λt decreases as the dimension increases, this forms a progression along the wave from 2pi to 10000.2pi.  




In this model, the absolute positional information of a word in a sequence is added directly to its initial vector. For this, the positional encoding must have the same size dmodel as the initial word vector.  

Attention mechanism  

  • Scaled Dot-Product Attention  

  Scaled Dot-Product Attention


Let’s explain the attention mechanism. The key goal of attention is to estimate the relative relevance of the keywords compared to the query word for the same entity. For this, the attention mechanism takes a query vector Q representing a word, the keys K comprising all other words in the sentence, and values V representing the word vectors.  

In our case, V = Q (for the two self-attention layers). In other words, the attention mechanism provides the significance of a word in a given sentence.  


attention mechanism


When we compute the normalized dot product between the query and the keys, we get a tensor that represents the relative importance of each other word for the query. To go deeper into mathematics, we can try to understand why the authors used a dot product to calculate the relation between two words.  


Get Started with Generative AI                                    


A word is represented by a vector in Euclidian space, in this case, a vector of size 512.   

When computing the dot product between Q and KT, we calculate the product between Q’s orthogonal projection onto K. In other words, we estimate the alignment between the query and keyword vectors, returning a weight for each word in the sentence.  

We then normalize by dk to counteract large Q and K magnitudes which can push the softmax function into regions with tiny gradients. The softmax function regularizes the terms and rescales them between 0 and 1 (i.e., converts the dot product to a probability distribution), with the goal of normalizing all weights between 0 and 1.  

 softmax function

Finally, we multiply the weights (i.e., importance) by the values V to reduce irrelevant words and focus on the most significant words.  


Attention mechanism (2)



  • Multi-Head Attention  

  Multi-Head Attention


The key idea is that attention is applied multiple times in parallel on different projections of the input queries, keys, and values. This allows the model to learn different types of dependencies between the input words.  


The input queries (Q), keys (K), and values (V) are each linearly projected h times into smaller subspaces. For example, h=8 times into 64-dimensional spaces.  

Attention is then applied in each of these h projected subspaces in parallel, yielding h different attention outputs.  


Attention mechanism and transformers - LLM
Attention mechanism and transformers – LLM Bootcamp Data Science Dojo


These h outputs are concatenated and linearly projected again to get the final values. The projections allow the model to focus on different positional and semantic relationships between words since each projected subspace captures different information.  

Doing this in parallel (multi-head) instead of sequentially improves efficiency.  

The projection matrices are learned during training to discover the most useful projections. So, in summary, multi-head attention applies the attention mechanism in multiple parallel subspaces to learn different types of dependencies between words in an efficient way.  


Let’s dive into the mechanics of encoder-decoder architecture.  

Transformer model architecture  


In this section, we’ll explain how the encoder and decoder work together to translate an English sentence into a French one, step by step.  

1. Encoder  


  • Convert a sequence of tokens to a sequence of vectors by using embeddings.    

Positional encoding



  • Add position information in each word vector.  


The key advantage of recurrent neural networks is their knack for understanding relationships between sequences and remembering information. On the other hand, Transformers employ positional encoding to factor in where words are in a sequence.  

  • Apply Multi-Head Attention  

Apply Multi-Head Attention  

  • Use Feed Forward Network  


2. Decoder  

  • Utilize embeddings to transform a French sentence into vectors.   

  decoder French


  • Add positional details within each word vector.    

Positional encoding French  

  • Apply Multi-Head Attention  

  Apply Multi-Head Attention French


  • Apply Feed Forward Network  


  • Apply Multi-Head Attention to the encoder output.  

Multi-Head Attention - encoder output

We can observe that the Transformer combines the encoder’s output with the decoder’s input. This enables it to discern the relationship between the vectors that encode the English and French sentences.  

  • Apply the Feed Forward Network again.  
  • Compute the probability for the next word by using linear + SoftMax block. The decoder returns the highest probability as the next word at the output.  

  Linear and SoftMax block

In our case, the next word after “Je” is “suis” 


Final thoughts 

The transformer model outperforms all the models on different benchmarks also there was no difference seen between the translation provided by the algorithm and by humans.   

Transformers are a major advance in NLP, they exceed RNN by having a lower training cost allowing to train models on larger corpora. Even today, transformers remain the basis of state-of-the-art models such as BERT, Roberta, XLNET, and GPT.  







October 18, 2023

Large language models (LLMs) like GPT-3 and GPT-4. revolutionized the landscape of NLP. These models have laid a strong foundation for creating powerful, scalable applications. However, the potential of these models isaffected by the quality of the prompt. This highlights the importance of prompt engineering.



Furthermore, real-world NLP applications often require more complexity than a single ChatGPT session can provide. This is where LangChain comes into play! 



Get more information on Large Language models and its applications and tools by clicking below:

Large language model bootcamp


Harrison Chase’s brainchild, LangChain, is a Python library designed to help you leverage the power of LLMs to build custom NLP applications. As of May 2023, this game-changing library has already garnered almost 40,000 stars on GitHub. 



Interested in learning about Large Language Models and building custom ChatGPT like applications for your business? Click below

Learn More                  


This comprehensive beginner’s guide provides a thorough introduction to LangChain, offering a detailed exploration of its core features. It walks you through the process of building a basic application using LangChain and shares valuable tips and industry best practices to make the most of this powerful framework. Whether you’re new to Language Learning Models (LLMs) or looking for a more efficient way to develop language generation applications, this guide serves as a valuable resource to help you leverage the capabilities of LLMs with LangChain. 

Overview of LangChain modules 

These modules are essential for any application using the Language Model (LLM).


LangChain offers standardized and adaptable interfaces for each module. Additionally, LangChain provides external integrations and even ready-made implementations for seamless usage. Let’s delve deeper into these modules. 

Overview of LangChain Modules
Overview of LangChain Modules


LLM is the fundamental component of LangChain. It is essentially a wrapper around a large language model that helps use the functionality and capability of a specific large language model. 


As stated earlier, LLM (Language Model) serves as the fundamental unit within LangChain. However, in line with the “LangChain” concept, it offers the ability to link together multiple LLM calls to address specific objectives. 

For instance, you may have a need to retrieve data from a specific URL, summarize the retrieved text, and utilize the resulting summary to answer questions. 

On the other hand, chains can also be simpler in nature. For instance, you might want to gather user input, construct a prompt using that input, and generate a response based on the constructed prompt. 


Large language model bootcamp



Prompts have become a popular modeling approach in programming. It simplifies prompt creation and management with specialized classes and functions, including the essential PromptTemplate. 


Document loaders and Utils 

LangChain’s Document Loaders and Utils modules simplify data access and computation. Document loaders convert diverse data sources into text for processing, while the utils module offers interactive system sessions and code snippets for mathematical computations. 

Vector stores 

The widely used index type involves generating numerical embeddings for each document using an embedding model. These embeddings, along with the associated documents, are stored in a vector store. This vector store enables efficient retrieval of relevant documents based on their embeddings. 


LangChain offers a flexible approach for tasks where the sequence of language model calls is not deterministic. Its “Agents” can act based on user input and previous responses. The library also integrates with vector databases and has memory capabilities to retain the state between calls, enabling more advanced interactions. 


Building our App 

Now that we’ve gained an understanding of LangChain, let’s build a PDF Q/A Bot app using LangChain and OpenAI. Let me first show you the architecture diagram for our app and then we will start with our app creation. 


QA Chatbot Architecture
QA Chatbot Architecture


Below is an example code that demonstrates the architecture of a PDF Q&A chatbot. This code utilizes the OpenAI language model for natural language processing, the FAISS database for efficient similarity search, PyPDF2 for reading PDF files, and Streamlit for creating a web application interface.


The chatbot leverages LangChain’s Conversational Retrieval Chain to find the most relevant answer from a document based on the user’s question. This integrated setup enables an interactive and accurate question-answering experience for the users. 

Importing necessary libraries 

Import Statements: These lines import the necessary libraries and functions required to run the application. 

  • PyPDF2: Python library used to read and manipulate PDF files. 
  • langchain: a framework for developing applications powered by language models. 
  • streamlit: A Python library used to create web applications quickly. 
Importing necessary libraries
Importing necessary libraries

If the LangChain and OpenAI are not installed already, you first need to run the following commands in the terminal. 

Install LangChain


Setting openAI API key 

You will replace the placeholder with your OpenAI API key which you can access from OpenAI API. The above line sets the OpenAI API key, which you need to use OpenAI’s language models. 

Setting OpenAI API Key

Streamlit UI 

These lines of code create the web interface using Streamlit. The user is prompted to upload a PDF file.

Streamlit UI
Streamlit UI

Reading the PDF file 

If a file has been uploaded, this block reads the PDF file, extracts the text from each page, and concatenates it into a single string. 

Reading the PDF File
Reading the PDF File

Text splitting 

Language Models are often limited by the amount of text that you can pass to them. Therefore, it is necessary to split them up into smaller chunks. It provides several utilities for doing so. 

Text Splitting 
Text Splitting

Using a Text Splitter can also help improve the results from vector store searches, as eg. smaller chunks may sometimes be more likely to match a query. Here we are splitting the text into 1k tokens with 200 tokens overlap. 


Here, the OpenAIEmbeddings function is used to download embeddings, which are vector representations of the text data. These embeddings are then used with FAISS to create an efficient search index from the chunks of text.  


Creating conversational retrieval chain 

The chains developed are modular components that can be easily reused and connected. They consist of predefined sequences of actions encapsulated in a single line of code. With these chains, there’s no need to explicitly call the GPT model or define prompt properties. This specific chain allows you to engage in conversation while referencing documents and retains a history of interactions. 

Creating Conversational Retrieval Chain
Creating Conversational Retrieval Chain

Streamlit for generating responses and displaying in the App 

This block prepares a response that includes the generated answer and the source documents and displays it on the web interface. 

Streamlit for Generating Responses and Displaying in the App
Streamlit for Generating Responses and Displaying in the App

Let’s run our App 

QA Chatbot
QA Chatbot

Here we uploaded a PDF, asked a question, and got our required answer with the source document. See, that is how the magic of LangChain works.  

You can find the code for this app on my GitHub repository LangChain-Custom-PDF-Chatbot.

Build your own conversational AI applications 

Concluding the journey! Mastering LangChain for creating a basic Q&A application has been a success. I trust you have acquired a fundamental comprehension of LangChain’s potential. Now, take the initiative to delve into LangChain further and construct even more captivating applications. Enjoy the coding adventure.


Learn to build custom large language model applications today!                                                

May 22, 2023