fbpx
Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Regular Expressions

Data Science Dojo
Guest Author

In the second article of this chatbot series, learn how to build a rule-based chatbot and discuss their business applications.

Chatbots have surged in popularity, becoming pivotal in text-based customer interactions, especially in support services. It’s expected that nearly 25% of customer service operations will use them by 2020.

In the first part of A Beginners Guide to Chatbotswe discussed what chatbots were, their rise to popularity, and their use cases in the industry. We also saw how the technology has evolved over the past 50 years.




 

In this second part of the series, we’ll take you through the process of building a simple Rule-based chatbot in Python. Before we start with the tutorial, we need to understand the different types of chatbots and how they work.

Types of chatbots

Chatbots can be classified into two different types based on how they are built:

Rule-based Chatbots

Rule-based chatbots are pretty straightforward. They use a predefined response database and follow set rules to determine appropriate replies. Although they can’t generate answers independently, their effectiveness hinges on the depth of the response database and the efficiency of their rules.

The simplest form of Rule-based Chatbots has one-to-one tables of inputs and their responses. These bots are extremely limited and can only respond to queries if they are an exact match with the inputs defined in their database.

AI-based chatbots

With the rise in the use of machine learning in recent years, a new approach to building chatbots has emerged. With the use of artificial intelligence, creating extremely intuitive and precise chatbots tailored to specific purposes has become possible.

Unlike their rule-based kin, AI-based chatbots are based on complex machine-learning models that enable them to self-learn.

 

Libraries for building a rule-based chatbot

 

Now that we’re familiar with how chatbots work, we’ll be looking at the libraries that will be used to build our simple Rule-based Chatbot.

Natural Language Toolkit (NLTK)

 

Natural Language Toolkit is a Python library that makes it easy to process human language data. It provides easy-to-use interfaces to many language-based resources such as the Open Multilingual Wordnet, and access to a variety of text-processing libraries.

Regular Expression (RegEx) in Python

 

regular expression is a special sequence of characters that helps you search for and find patterns of words/sentences/sequences of letters in sets of strings, using a specialized syntax. They are widely used for text searching and matching in UNIX.

 

Python includes support for regular expression through the re package.

Want to upgrade your Python abilities? Check out Data Science Dojo’s Introduction to Python for Data Science.

Building a rule-based chatbot

 

This very simple rule-based chatbot will work by searching for specific keywords in user input. The keywords will help determine the desired action of the user (user’s intent). Once the intent is identified, the bot will pick out an appropriate response.

 

Role of intent in a rule-based chatbot
Role of intent in a rule-based chatbot

The list of keywords and a dictionary of responses will be built up manually based on the specific use case for the chatbot.

We’ll be designing a very simple chatbot for a bank that can respond to greetings (Hi, Hello, etc.) and answer questions about the bank’s hours of operation.

A flow of how the chatbot would process inputs is shown below:

 

Python rule-based chatbot process
The flow of input requests by a chatbot

 

We will be following the steps below to build our chatbot

 

  1. Importing Dependencies
  2. Building the Keyword List
  3. Building a dictionary of Intents
  4. Defining a dictionary of responses
  5. Matching Intents and Generating Responses

Importing dependencies

 

First, we will import needed the packages/libraries. The re package handles regular expressions in Python. We’ll also use WordNet from NLTK, a lexical database that defines semantic relationships between words, to build a dictionary of synonyms for our keywords. This will expand our list of keywords without manually introducing every possible word a user could use.

 

# Importing modules
import re
from nltk.corpus import wordnet

Building a list of keywords

 

Once we have imported our libraries, we’ll need to build up a list of keywords that our chatbot will look for. This list can be as exhaustive as you want. The more keywords you have, the better your chatbot will perform.

 

As discussed previously, we’ll be using WordNet to build up a dictionary of synonyms to our keywords. For details about how WordNet is structured, visit their website.

 

Code:

# Building a list of Keywords
list_words=['hello','timings']
list_syn={}
for word in list_words:
    synonyms=[]
    for syn in wordnet.synsets(word):
        for lem in syn.lemmas():
            # Remove any special characters from synonym strings
            lem_name = re.sub('[^a-zA-Z0-9 \n\.]', ' ', lem.name())
            synonyms.append(lem_name)
    list_syn[word]=set(synonyms)
print (list_syn)

 

Output:

 

hello
{'hello', 'howdy', 'hi', 'hullo', 'how do you do'}
timings
{'time', 'clock', 'timing'}

 

Here, we first defined a list of words list_words that we will be using as our keywords. We used WordNet to expand our initial list with synonyms of the keywords. This list of keywords is stored in list_syn.

 

New keywords can simply be added to list_words. The chatbot will automatically pull their synonyms and add them to the keywords dictionary. You can also edit list_syn directly if you want to add specific words or phrases that you know your users will use.

Building a dictionary of intents

 

Once our keywords list is complete, we need to build up a dictionary that matches our keywords to intents. We also need to reformat the keywords in a special syntax that makes them visible to Regular Expression’s search function.

 

Code:

# Building dictionary of Intents & Keywords
keywords={}
keywords_dict={}
# Defining a new key in the keywords dictionary
keywords['greet']=[]
# Populating the values in the keywords dictionary with synonyms of keywords formatted with RegEx metacharacters 
for synonym in list(list_syn['hello']):
    keywords['greet'].append('.*\\b'+synonym+'\\b.*')

# Defining a new key in the keywords dictionary
keywords['timings']=[]
# Populating the values in the keywords dictionary with synonyms of keywords formatted with RegEx metacharacters 
for synonym in list(list_syn['timings']):
    keywords['timings'].append('.*\\b'+synonym+'\\b.*')
for intent, keys in keywords.items():
    # Joining the values in the keywords dictionary with the OR (|) operator updating them in keywords_dict dictionary
    keywords_dict[intent]=re.compile('|'.join(keys))
print (keywords_dict)

 

Output:

{'greet': re.compile('.*\\bhello\\b.*|.*\\bhowdy\\b.*|.*\\bhi\\b.*|.*\\bhullo\\b.*|.*\\bhow-do-you-do\\b.*'), 'timings': re.compile('.*\\btime\\b.*|.*\\bclock\\b.*|.*\\btiming\\b.*')}

 

The updated and formatted dictionary is stored in keywords_dict. The intent is the key and the string of keywords is the value of the dictionary.

 

Let’s look at one key-value pair of the keywords_dict dictionary to understand the syntax of Regular Expression;

{'greet': re.compile('.*\\bhullo\\b.*|.*\\bhow-do-you-do\\b.*|.*\\bhowdy\\b.*|.*\\bhello\\b.*|.*\\bhi\\b.*')
 Regular Expression uses specific patterns of special Meta-Characters to search for strings or sets of strings in an expression.

 

Since we need our chatbot to search for specific words in larger input strings we use the following sequences of meta-characters:

 

.*\\bhullo\\b.*

 

In this specific sequence, the keyword (hullo) is encased between a \b sequence. This tells the RegEx Search function that the search parameter is the keyword (hullo).

 

The first sequence \bhullo\b is encased between a period star .* sequence. This sequence tells the RegEx Search function to search the entire input string from beginning to end for the search parameter (hullo).

 

In the dictionary, multiple such sequences are separated by the OR | operator. This operator tells the search function to look for any of the mentioned keywords in the input string.

 

More details about Regular Expression and its syntax can be found here. You can add as many key-value pairs to the dictionary as you want to increase the functionality of the chatbot.

Defining responses

 

The next step is defining responses for each intent type. This part is very straightforward. The responses are described in another dictionary with the intent being the key.

 

We’ve also added a fallback intent and its response. This is a fail-safe response in case the chatbot is unable to extract any relevant keywords from the user input.

 

Code:

# Building a dictionary of responses
responses={
    'greet':'Hello! How can I help you?',
    'timings':'We are open from 9AM to 5PM, Monday to Friday. We are closed on weekends and public holidays.',
    'fallback':'I dont quite understand. Could you repeat that?',
}

Matching intents and generating responses

 

Now that we have the back end of the chatbot completed, we’ll move on to taking input from the user and searching the input string for our keywords.

 

We use the RegEx Search function to search the user input for keywords stored in the value field of the keywords_dict dictionary.  If you recall, the values in the keywords_dict dictionary were formatted with special sequences of meta-characters.

 

RegEx’s search function uses the sequences to compare the character patterns in the keywords with the input string. If a match is found, the current intent is selected and used as a key to the responses dictionary to select the correct response.

 

Code:

print ("Welcome to MyBank. How may I help you?")
# While loop to run the chatbot indefinetely
while (True):  
    # Takes the user input and converts all characters to lowercase
    user_input = input().lower()
    # Defining the Chatbot's exit condition
    if user_input == 'quit': 
        print ("Thank you for visiting.")
        break    
    matched_intent = None 
    for intent,pattern in keywords_dict.items():
        # Using the regular expression search function to look for keywords in user input
        if re.search(pattern, user_input): 
            # if a keyword matches, select the corresponding intent from the keywords_dict dictionary
            matched_intent=intent  
    # The fallback intent is selected by default
    key='fallback' 
    if matched_intent in responses:
        # If a keyword matches, the fallback intent is replaced by the matched intent as the key for the responses dictionary
        key = matched_intent
    # The chatbot prints the response that matches the selected intent
    print (responses[key]) 

 

The chatbot picked the greeting from the first user input (‘Hi’) and responded according to the matched intent.

The same happened when it located the word (‘time’) in the second user input.

The third user input (‘How can I open a bank account’) didn’t have any keywords that were present in Bankbot’s database so it went to its fallback intent.

You can add multiple keywords/phrases/sentences and intents to build a robust chatbot for human interaction.

Building chatbot in Python – the next step

 

This blog was a hands-on introduction to building a basic chatbot in Python. We only worked with 2 intents in this tutorial for simplicity. You can easily expand the functionality of this chatbot by adding more keywords, intents, and responses.

 

Building a rule-based chatbot is a laborious process, especially in a business environment where it requires increased complexity. It makes rule-based chatbots impractical for enterprises due its limited conversational capabilities. AI-based Chatbots are a more practical solution for real-world scenarios.

 

In the next blog in the series, we’ll learn how to build a simple AI-based Chatbot in Python.

Do you want to learn more about machine learning and its applications? Check out Data Science Dojo’s online data science certificate program!

Data Science Dojo
DSD staffer
| April 28

A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about Its common uses in this regex 101 guide.

What is a regular expression?

A regular expression, or regex for short, is perhaps the most common thing that every data scientist must deal with multiple times in their career, and the frustration is natural because, at a vast majority of universities, this skill is not taught unless you have taken some hard-core Computer Science theory course. Even at times, trying to solve a problem using regular expression can cause multiple issues, which is summed beautifully in this meme:

regular expressions meme, Data Science Humor
Regular Expressions Meme

 

Making sense and using them can be quite daunting for beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are used to match strings in the text that we want to find. Consider the following scenario:

You are interested in counting the number of occurrences of Pandas in a journal related to endangered species to see how much focus is on this species. You write an algorithm that calculates the occurrences of the word ‘panda.’

 

Regular-Expressions infographic

 

However, as you might have noticed, your algorithm will miss the words ‘Panda’ and ‘Pandas.’ In this case, you might argue that a simple if-else condition will also count these words. But imagine, while converting the journal alphabets of every word is randomly capitalized or converted into a lower-case letter. Now, there are the following possibilities

  • PaNda
  • pAnDaS
  • PaNdA

And may more variations as well. Now you must write a lot of if-else conditions and even must write nested if-else conditions as well. What if I tell you that you can do this in one line of code using regular expressions? First, we need to learn some basics before coming back to solve the problem ourselves.

PRO TIP: Join our data science bootcamp program today to enhance your data science knowledge!

Square Brackets ([])

The name might sound scary, but it is nothing but the symbol: []. Some people also refer to square brackets as character class – a regular expression jargon word that means that it will match any character inside the bracket. For instance:

Pattern Matches
[Pp]enguin Penguin, penguin
[0123456789] (This will match any digit)
[0oO] 0, o, O

Disjunction (|)

The pipe symbol means nothing but either ‘A’ or ‘B’, and it is helpful in cases where you want to select multiple strings simultaneously. For instance:

Pattern Matches
A|B|C A, B, C
Black|White Black, White
[Bb]lack|[Ww]hite Black, black, White, white

Question Mark (?)

The question mark symbol means that the character it comes after is optional. For instance:

Pattern Matches
Ab?c Ac, Abc
Colou?r Color, Colour

Asterisk (*)

The asterisk symbol matches with 0 or more occurrences of the earlier character or group. For instance:

Pattern Matches
Sh* (0 or more of earlier character h)

S, Sh, Shh, Shhh.

(banana)* (0 or more of earlier banana. This will also match with nothing, but most regex engines will ignore it or give you a warning in that case)

banana, bananabanana, bananabananabanana.

Plus (+)

The plus symbol means to match with one or more occurrences of the earlier character or group. For instance:

Pattern Matches
Sh+ (1 or more of earlier character h)

Sh, Shh, Shhh.

(banana)+ (1 or more of the earlier banana)

banana, bananabanana, bananabananabanana.

Difference between Asterisk (*) and Plus(+)

The difference between the asterisk confuses many people; even the experts sometimes must look at the internet for their differences. However, there is an effortless way to remember the distinction between them.

Imagine you have a number 1, and you multiply it with 0:

1*0 = 0 or more occurrences of earlier character or group.

Now suppose that you have the same number 1, and you add it with 0:1+0 = 1 or more occurrences of an earlier character or group.

It is that simple when you try to understand things intuitively.

Negation (^)

Negation has two everyday use cases:

1. Inside square brackets, it will search for the negation of whatever is inside the brackets. For instance:

Pattern Matches
[^Aa] It will match with anything that is not A or a
[^0123456789] It will match anything that is not a digit

2. It can also be used as an anchor to search for expressions at the start of the line(s) only. For instance:

Pattern Matches
^Apple It will match with every Apple that is at the start of any line in the text
^(Apple|Banana) It will match with every Apple and Banana that is at the start of any line in the text

Dollar ($)

A dollar is used to search for expressions at the end of the line. For instance:

Pattern Matches
$[0123456789] It will match with any digit at the end of any line in the text.
$([Pp]anda) It will match with every Panda and panda at the end of any line in the text.

Conclusion

This article covered some of the very basic types of regular expression by using a story-telling approach. Regular expressions are indeed very hard to understand but if one develops an intuitive understanding then they can be easily learned.

Related Topics

Statistics
Resources
Programming
Machine Learning
LLM
Generative AI
Data Visualization
Data Security
Data Science
Data Engineering
Data Analytics
Computer Vision
Career
Artificial Intelligence