Regular Expressions

In the second article of this chatbot series, learn how to build a rule-based chatbot and explore its business applications.

Chatbots have become extremely popular in recent years and their use in the industry has skyrocketed. They have found a strong foothold in almost every task that requires text-based public dealing. They have become so critical in the support industry, for example, that almost 25% of all customer service operations are expected to use them by 2020.

In the first part of A Beginners Guide to Chatbots, we discussed what chatbots are, their rise to popularity, and their use cases in the industry. We also saw how the technology has evolved over the past 50 years.

In this second part of the series, we’ll be taking you through how to build a simple Rule-based chatbot in Python. Before we start with the tutorial, we need to understand the different types of chatbots and how they work.

Types of chatbots

Chatbots can be classified into two different types, based on how they are built:

Rule-based Chatbots

Rule-based chatbots are pretty straightforward. They are provided with a database of responses and a set of rules that help them pick an appropriate response from that database. They cannot generate their own answers, but with an extensive database of answers and smartly designed rules, they can be very productive and useful.

The simplest form of rule-based chatbot has a one-to-one table of inputs and their responses. These bots are extremely limited and can only respond to queries that exactly match the inputs defined in their database.
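A one-to-one table like this can be sketched in a few lines of Python. This is only an illustration: the table contents, the reply function, and the fallback text are made up for this example.

```python
# Hypothetical one-to-one table: exact input -> canned response
RESPONSES = {
    "hi": "Hello! How can I help you?",
    "what are your hours?": "We are open from 9AM to 5PM, Monday to Friday.",
}

def reply(user_input):
    """Return the canned response for an exact match (ignoring case and outer spaces)."""
    return RESPONSES.get(user_input.strip().lower(), "Sorry, I don't understand.")

print(reply("Hi"))        # exact match: the greeting response
print(reply("Hi there"))  # no exact match: the fallback text
```

Because "Hi there" is not an exact key in the table, the bot cannot answer it, which is the limitation described above.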

AI-based chatbots

With the rise in the use of machine learning in recent years, a new approach to building chatbots has emerged. Using artificial intelligence, it has become possible to create extremely intuitive and precise chatbots tailored to specific purposes.

Unlike their rule-based kin, AI-based chatbots are built on complex machine learning models that enable them to self-learn.

Now that we’re familiar with how chatbots work, we’ll be looking at the libraries that will be used to build our simple Rule-based Chatbot.

Natural Language Toolkit (NLTK)

Natural Language Toolkit is a Python library that makes it easy to process human language data. It provides easy-to-use interfaces to many language-based resources such as the Open Multilingual Wordnet, as well as access to a variety of text-processing libraries.

Regular Expression (RegEx) in Python

A regular expression is a special sequence of characters that uses a specialized syntax to help you search for patterns of words, sentences, or sequences of letters in sets of strings. Regular expressions are widely used for text searching and matching in UNIX.

Python includes support for regular expressions through the re module.
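As a quick, minimal sketch of what the re module does, the snippet below looks for times such as 9AM in a sentence. The sample sentence and the pattern are assumptions chosen for illustration and are not part of the chatbot built later.

```python
import re

text = "Our branch opens at 9AM and closes at 5PM."

# re.search returns the first match (or None if nothing matches)
match = re.search(r"\d+[AP]M", text)
first_time = match.group() if match else None

# re.findall returns every non-overlapping match as a list of strings
all_times = re.findall(r"\d+[AP]M", text)

print(first_time)  # 9AM
print(all_times)   # ['9AM', '5PM']
```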

Building a chatbot

This very simple rule-based chatbot will work by searching for specific keywords in the inputs given by a user. The keywords will be used to understand what action the user wants to take (the user’s intent). Once the intent is identified, the bot will pick out a response appropriate to that intent.

 

[Figure: Keyword intent]

The list of keywords the bot will be searching for and the dictionary of responses will be built up manually based on the specific use case for the chatbot.

We’ll be designing a very simple chatbot for a bank. The bot will be able to respond to greetings (Hi, Hello, etc.) and answer questions about the bank’s hours of operation.

A flow of how the chatbot would process inputs is shown below:

[Figure: Flow of how the chatbot will process inputs]

We will be following the steps below to build our chatbot:

  1. Importing Dependencies
  2. Building the Keyword List
  3. Building a dictionary of Intents
  4. Defining a dictionary of responses
  5. Matching Intents and Generating Responses

Importing dependencies

The first thing we’ll need to do is import the packages/libraries we’ll be using. re is the module that handles regular expressions in Python. We’ll also be using WordNet from NLTK. WordNet is a lexical database that defines semantic relationships between words. We’ll use WordNet to build up a dictionary of synonyms for our keywords. This will help us expand our list of keywords without having to manually add every possible word a user could use.

# Importing modules
import re
from nltk.corpus import wordnet

Building a list of keywords

Once we have imported our libraries, we’ll need to build up a list of keywords that our chatbot will look for. This list can be as exhaustive as you want. The more keywords you have, the better your chatbot will perform.

As discussed previously, we’ll be using WordNet to build up a dictionary of synonyms to our keywords. For details about how WordNet is structured, visit their website.

Code:

# Building a list of keywords
list_words = ['hello', 'timings']
list_syn = {}
for word in list_words:
    synonyms = []
    for syn in wordnet.synsets(word):
        for lem in syn.lemmas():
            # Replace any special characters in the synonym with spaces
            lem_name = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', lem.name())
            synonyms.append(lem_name)
    # Storing the synonyms as a set removes duplicates
    list_syn[word] = set(synonyms)
    # Print each keyword followed by its set of synonyms
    print(word)
    print(list_syn[word])

Output:

hello
{'hello', 'howdy', 'hi', 'hullo', 'how do you do'}
timings
{'time', 'clock', 'timing'}

Here, we first defined a list of words, list_words, that we will use as our keywords. We then used WordNet to expand this initial list with synonyms of each keyword. The resulting dictionary of keywords and their synonyms is stored in list_syn.

New keywords can simply be added to list_words. The chatbot will automatically pull their synonyms and add them to the keywords dictionary. You can also edit list_syn directly if you want to add specific words or phrases that you know your users will use.
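Editing list_syn directly could look like the sketch below. To keep it runnable without WordNet, list_syn is re-created here from the values shown in the Output above, and the extra phrases are hypothetical examples of words your users might type.

```python
# Stand-in for the list_syn dictionary built above (values copied from the Output)
list_syn = {
    "hello": {"hello", "howdy", "hi", "hullo", "how do you do"},
    "timings": {"time", "clock", "timing"},
}

# Hypothetical extra words and phrases that WordNet did not return
list_syn["hello"].update({"hey", "good morning"})
list_syn["timings"].update({"hours", "opening hours", "open"})

print(sorted(list_syn["hello"]))
print(sorted(list_syn["timings"]))
```

Because the values are sets, adding a phrase that is already present has no effect, so the update can be rerun safely.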

Building a dictionary of intents

Once our keywords list is complete, we need to build a dictionary that maps our keywords to intents. We also need to wrap the keywords in regular expression metacharacters so that the re module’s search function can match them.

Code:

# Building dictionary of Intents & Keywords
keywords={}
keywords_dict={}
# Defining a new key in the keywords dictionary
keywords['greet']=[]
# Populating the values in the keywords dictionary with synonyms of keywords formatted with RegEx metacharacters 
for synonym in list(list_syn['hello']):
    keywords['greet'].append('.*\\b'+synonym+'\\b.*')

# Defining a new key in the keywords dictionary
keywords['timings']=[]
# Populating the values in the keywords dictionary with synonyms of keywords formatted with RegEx metacharacters 
for synonym in list(list_syn['timings']):
    keywords['timings'].append('.*\\b'+synonym+'\\b.*')
for intent, keys in keywords.items():
    # Joining the values in the keywords dictionary with the OR (|) operator updating them in keywords_dict dictionary
    keywords_dict[intent]=re.compile('|'.join(keys))
print (keywords_dict)

Output:

{'greet': re.compile('.*\\bhello\\b.*|.*\\bhowdy\\b.*|.*\\bhi\\b.*|.*\\bhullo\\b.*|.*\\bhow do you do\\b.*'), 'timings': re.compile('.*\\btime\\b.*|.*\\bclock\\b.*|.*\\btiming\\b.*')}

The updated and formatted dictionary is stored in keywords_dict. The intent is the key, and the compiled pattern that combines all of its keywords is the value.

Let’s look at one key-value pair of the keywords_dict dictionary to understand the regular expression syntax:

{'greet': re.compile('.*\\bhullo\\b.*|.*\\bhow do you do\\b.*|.*\\bhowdy\\b.*|.*\\bhello\\b.*|.*\\bhi\\b.*')}

Regular expressions use special characters, called metacharacters, to describe the patterns to search for in a string.

Since we need our chatbot to search for specific words in larger input strings, we use the following sequence of metacharacters:

.*\\bhullo\\b.*

In this sequence, the keyword (hullo) is enclosed between two \b sequences. \b matches a word boundary, so the pattern matches hullo only as a whole word and not as part of a longer word. The backslash is doubled (\\b) because the pattern is written as an ordinary Python string.

The \bhullo\b sequence is in turn enclosed between two period-star (.*) sequences. Each .* matches any run of characters other than a newline, so the keyword can appear anywhere in the input string. Strictly speaking, re.search already scans the whole input string, so the .* wrappers are not required here.

In the dictionary, multiple such sequences are separated by the OR (|) operator. This operator tells the search function to look for any of the listed keywords in the input string.
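The effect of \b is easiest to see with a short keyword such as hi. The following is a minimal sketch; the test sentences are made up for illustration.

```python
import re

# \b marks a word boundary, so "hi" must appear as a whole word
whole_word = re.compile(r"\bhi\b")
# Without \b, "hi" also matches inside longer words such as "this"
anywhere = re.compile(r"hi")
# The form used in the tutorial, with .* on both sides
wrapped = re.compile(".*\\bhi\\b.*")

greeting_found = whole_word.search("hi there") is not None      # True
false_alarm = whole_word.search("this is it") is not None       # False
loose_match = anywhere.search("this is it") is not None         # True
wrapped_found = wrapped.search("well, hi there") is not None    # True

print(greeting_found, false_alarm, loose_match, wrapped_found)
```

With re.search, the pattern with and without the surrounding .* finds the same inputs; the difference is only in how much of the input the match object covers.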

More details about regular expressions and their syntax can be found in the documentation for Python’s re module.

You can add as many key-value pairs to the dictionary as you want to increase the functionality of the chatbot.
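When many intents are added, the repeated loops can be folded into a small helper. The sketch below is one possible way to do it, not the tutorial's own code: build_intent_pattern and the 'balance' intent are hypothetical, and the helper uses re.escape so that keywords containing regex metacharacters are matched literally.

```python
import re

def build_intent_pattern(synonyms):
    """Compile one pattern that matches any of the given words or phrases as whole words."""
    # re.escape protects keywords that contain regex metacharacters;
    # sorting only makes the resulting pattern reproducible
    alternatives = "|".join(re.escape(s) for s in sorted(synonyms))
    return re.compile(r"\b(?:" + alternatives + r")\b")

keywords_dict = {
    "greet": build_intent_pattern({"hello", "hi", "howdy"}),
    "timings": build_intent_pattern({"time", "timing", "clock"}),
    # Hypothetical third intent, not part of the tutorial
    "balance": build_intent_pattern({"balance", "how much money"}),
}

print(bool(keywords_dict["balance"].search("how much money do i have")))
print(bool(keywords_dict["greet"].search("this is odd")))
```

Each new intent then needs only one extra line in keywords_dict, plus a matching entry in the responses dictionary.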

Defining responses

The next step is defining responses for each intent type. This part is very straightforward. The responses are described in another dictionary with the intent being the key.

We’ve also added a fallback intent and its response. This is a fail-safe response in case the chatbot is unable to extract any relevant keywords from the user input.

Code:

# Building a dictionary of responses
responses={
    'greet':'Hello! How can I help you?',
    'timings':'We are open from 9AM to 5PM, Monday to Friday. We are closed on weekends and public holidays.',
    'fallback':"I don't quite understand. Could you repeat that?",
}

Matching intents and generating responses

Now that we have the back-end of the chatbot completed, we’ll move on to taking input from the user and searching the input string for our keywords.

We use the RegEx Search function to search the user input for keywords stored in the value field of the keywords_dict dictionary.  If you recall, the values in the keywords_dict dictionary were formatted with special sequences of meta-characters. RegEx’s search function uses those sequences to compare the patterns of characters in the keywords with patterns of characters in the input string.

If a match is found, the current intent gets selected and is used as the key to the responses dictionary to select the correct response.

Code:

print ("Welcome to MyBank. How may I help you?")
# While loop to run the chatbot indefinitely
while (True):  
    # Takes the user input and converts all characters to lowercase
    user_input = input().lower()
    # Defining the Chatbot's exit condition
    if user_input == 'quit': 
        print ("Thank you for visiting.")
        break    
    matched_intent = None 
    for intent,pattern in keywords_dict.items():
        # Using the regular expression search function to look for keywords in user input
        if re.search(pattern, user_input): 
            # if a keyword matches, select the corresponding intent from the keywords_dict dictionary
            matched_intent=intent  
    # The fallback intent is selected by default
    key='fallback' 
    if matched_intent in responses:
        # If a keyword matches, the fallback intent is replaced by the matched intent as the key for the responses dictionary
        key = matched_intent
    # The chatbot prints the response that matches the selected intent
    print (responses[key]) 

In a sample conversation, the chatbot picked up the greeting in the first user input (‘Hi’) and responded according to the matched intent. The same happened when it located the word ‘time’ in the second user input. The third user input (‘How can I open a bank account’) didn’t contain any keywords present in Bankbot’s database, so the chatbot went to its fallback intent.

You can add as many keywords/phrases/sentences and intents as you want to make sure your chatbot is robust when talking to an actual human.
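To check the matching logic without typing into a console, the same steps can be wrapped in a function. This is a sketch under a few assumptions: get_response is a hypothetical name, the patterns are shortened stand-ins for keywords_dict, and the function returns on the first matching intent, whereas the loop above keeps the last one.

```python
import re

# Shortened stand-ins for the dictionaries built in the tutorial
keywords_dict = {
    "greet": re.compile(r"\b(?:hello|hi|howdy)\b"),
    "timings": re.compile(r"\b(?:time|timing|clock)\b"),
}
responses = {
    "greet": "Hello! How can I help you?",
    "timings": "We are open from 9AM to 5PM, Monday to Friday.",
    "fallback": "I don't quite understand. Could you repeat that?",
}

def get_response(user_input):
    """Return the response for the first intent whose pattern matches the input."""
    text = user_input.lower()
    for intent, pattern in keywords_dict.items():
        if pattern.search(text):
            return responses[intent]
    # No keyword matched, so use the fallback response
    return responses["fallback"]

for line in ["Hi", "What time do you open?", "How can I open a bank account"]:
    print(get_response(line))
```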

Conclusion

This blog was a hands-on introduction to building a very simple rule-based chatbot in Python. We only worked with two intents in this tutorial for simplicity. You can easily expand the functionality of this chatbot by adding more keywords, intents, and responses.

As we saw, building a rule-based chatbot is a laborious process. In a business environment, a chatbot could be required to handle many more intents, depending on the tasks it is supposed to undertake.

In such a situation, rule-based chatbots become very impractical, as maintaining the rule base would become extremely complex. In addition, the chatbot would be severely limited in its conversational capabilities, as it is nearly impossible to describe exactly how a user will interact with the bot.

AI-based Chatbots are a much more practical solution for real-world scenarios. In the next blog in the series, we’ll be looking at how to build a simple AI-based Chatbot in Python.

DSD staffer
| April 28, 2022

A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about its common uses in this regex 101 guide.

[Figure: Regular Expressions Infographic]

What is a regular expression?

A regular expression, or regex for short, is something that almost every data scientist must deal with many times in their career. The frustration it causes is natural: at the vast majority of universities, this skill is not taught unless you have taken a hard-core computer science theory course. At times, even trying to solve a problem with regular expressions can cause multiple issues, which is summed up beautifully in this meme:

[Figure: Regular Expressions Meme]

Making sense of regular expressions and using them can be quite daunting for beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are used to match the strings that we want to find in a text. Consider the following scenario:

You are interested in counting the number of occurrences of pandas in a journal about endangered species, to see how much focus there is on this species. You write an algorithm that counts the occurrences of the word ‘panda.’ However, as you might have noticed, your algorithm will miss the words ‘Panda’ and ‘Pandas.’ In this case, you might argue that a simple if-else condition would also count these words. But imagine that, while the journal was being converted, the letters of every word were randomly changed to upper or lower case. Now, there are the following possibilities:

  • PaNda
  • pAnDaS
  • PaNdA

And many more variations as well. Now you must write a lot of if-else conditions, and even nested if-else conditions. What if I tell you that you can do this in one line of code using regular expressions? First, we need to learn some basics before coming back to solve the problem ourselves.

Square Brackets ([])

The name might sound scary, but it is nothing but the symbol: []. Square brackets are also called a character class, which is regular expression jargon for a pattern that matches any single character listed inside the brackets. For instance:

Pattern         Matches
[Pp]enguin      Penguin, penguin
[0123456789]    Any single digit
[0oO]           0, o, O
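The patterns in the table can be tried out with Python's re module. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# [Pp] matches either an upper-case or a lower-case p, but the rest must be lower case
penguins = re.findall(r"[Pp]enguin", "Penguin, penguin and PENGUIN")

# A class listing all ten digits matches any single digit
digits = re.findall(r"[0123456789]", "Room 42B")

# Zero, lower-case o, or upper-case O
zero_like = re.findall(r"[0oO]", "O0o!")

print(penguins)   # ['Penguin', 'penguin']
print(digits)     # ['4', '2']
print(zero_like)  # ['O', '0', 'o']
```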

Disjunction (|)

The pipe symbol means nothing but either ‘A’ or ‘B’, and it is helpful in cases where you want to select multiple strings simultaneously. For instance:

Pattern              Matches
A|B|C                A, B, C
Black|White          Black, White
[Bb]lack|[Ww]hite    Black, black, White, white
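A short sketch of disjunction using Python's re module, with sample strings invented for this example:

```python
import re

# Either A, B, or C
letters = re.findall(r"A|B|C", "ABCD")

# Disjunction combined with square brackets from the previous section
colours = re.findall(r"[Bb]lack|[Ww]hite", "Black cat, white dog, black bird, grey fox")

print(letters)  # ['A', 'B', 'C']
print(colours)  # ['Black', 'white', 'black']
```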

Question Mark (?)

The question mark symbol means that the character it comes after is optional. For instance:

Pattern    Matches
Ab?c       Ac, Abc
Colou?r    Color, Colour
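Square brackets and the question mark are already enough to sketch the panda-counting problem from the introduction in a single pattern. The snippet below assumes Python's re module, and the journal text is a made-up stand-in. The re.IGNORECASE flag shown at the end is a Python feature that achieves the same result and is not covered in this article.

```python
import re

spellings = re.findall(r"Colou?r", "Color and Colour")
optional_b = re.findall(r"Ab?c", "Ac Abc Abbc")

# Made-up journal text with randomly capitalized letters
journal = "PaNda numbers rose. pAnDaS eat bamboo. A PaNdA slept."

# Each letter may be upper or lower case; the final s is optional
panda_count = len(re.findall(r"[Pp][Aa][Nn][Dd][Aa][Ss]?", journal))

# Alternative: let the engine ignore case instead of spelling out every class
panda_count_flag = len(re.findall(r"pandas?", journal, flags=re.IGNORECASE))

print(spellings)                      # ['Color', 'Colour']
print(optional_b)                     # ['Ac', 'Abc']
print(panda_count, panda_count_flag)  # 3 3
```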

Asterisk (*)

The asterisk symbol matches 0 or more occurrences of the preceding character or group. For instance:

Pattern      Matches
Sh*          S, Sh, Shh, Shhh (0 or more of the preceding character h)
(banana)*    banana, bananabanana, bananabananabanana (0 or more of the preceding group banana; this also matches the empty string)

Plus (+)

The plus symbol matches 1 or more occurrences of the preceding character or group. For instance:

Pattern      Matches
Sh+          Sh, Shh, Shhh (1 or more of the preceding character h)
(banana)+    banana, bananabanana, bananabananabanana (1 or more of the preceding group banana)

Difference between Asterisk (*) and Plus (+)

The difference between the asterisk and the plus confuses many people; even experts sometimes have to look it up on the internet. However, there is an effortless way to remember the distinction between them.

Imagine you have the number 1, and you multiply it by 0:

1*0 = 0, so the asterisk means 0 or more occurrences of the preceding character or group.

Now suppose that you have the same number 1, and you add 0 to it:

1+0 = 1, so the plus means 1 or more occurrences of the preceding character or group.

It is that simple when you try to understand things intuitively.
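The difference shows up as soon as the repeated part is missing. The following sketch assumes Python's re module; re.fullmatch succeeds only if the whole string matches the pattern.

```python
import re

# "S" alone has zero h's: fine for *, not enough for +
star_accepts_s = re.fullmatch(r"Sh*", "S") is not None   # True
plus_accepts_s = re.fullmatch(r"Sh+", "S") is not None   # False

star_hits = re.findall(r"Sh*", "S Sh Shhh")
plus_hits = re.findall(r"Sh+", "S Sh Shhh")

# The same holds for groups: * accepts the empty string, + does not
star_accepts_empty = re.fullmatch(r"(banana)*", "") is not None  # True
plus_accepts_empty = re.fullmatch(r"(banana)+", "") is not None  # False

print(star_hits)  # ['S', 'Sh', 'Shhh']
print(plus_hits)  # ['Sh', 'Shhh']
```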

Negation (^)

Negation has two everyday use cases:

1. Inside square brackets, it will search for the negation of whatever is inside the brackets. For instance:

Pattern          Matches
[^Aa]            It will match with anything that is not A or a
[^0123456789]    It will match anything that is not a digit

2. It can also be used as an anchor to search for expressions at the start of the line(s) only. For instance:

Pattern            Matches
^Apple             It will match with every Apple that is at the start of any line in the text
^(Apple|Banana)    It will match with every Apple and Banana that is at the start of any line in the text
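Both uses of the caret can be sketched with Python's re module. One caveat specific to Python (and an assumption about which engine you use): by default ^ anchors only at the very start of the text, and the re.MULTILINE flag is needed for it to match at the start of every line. The sample text is made up.

```python
import re

# Inside square brackets: anything that is NOT listed
not_a = re.findall(r"[^Aa]", "Banana")
not_digit = re.findall(r"[^0123456789]", "a1b2")

text = "Apple pie\nBanana bread\nGreen Apple"

# As an anchor: a non-capturing group (?:...) keeps findall returning the whole match
start_of_text = re.findall(r"^(?:Apple|Banana)", text)
start_of_lines = re.findall(r"^(?:Apple|Banana)", text, flags=re.MULTILINE)

print(not_a)           # ['B', 'n', 'n']
print(not_digit)       # ['a', 'b']
print(start_of_text)   # ['Apple']
print(start_of_lines)  # ['Apple', 'Banana']
```

The Apple on the last line is never matched because it is not at the start of its line.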

Dollar ($)

A dollar is used to search for expressions at the end of the line. It is placed after the expression it anchors. For instance:

Pattern          Matches
[0123456789]$    It will match with any digit at the end of any line in the text.
([Pp]anda)$      It will match with every Panda and panda at the end of any line in the text.
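The dollar sign goes after the expression it anchors, as in this sketch using Python's re module. As with the caret, Python needs the re.MULTILINE flag for $ to match at the end of every line; without it, $ matches only at the end of the text. The sample text is invented for illustration.

```python
import re

text = "Order 66\nBamboo for the panda\nPanda 7"

# A digit at the end of any line
line_end_digits = re.findall(r"[0123456789]$", text, flags=re.MULTILINE)

# Panda or panda at the end of any line
line_end_pandas = re.findall(r"[Pp]anda$", text, flags=re.MULTILINE)

# Without MULTILINE, only the end of the whole text counts
text_end_digits = re.findall(r"[0123456789]$", text)

print(line_end_digits)  # ['6', '7']
print(line_end_pandas)  # ['panda']
print(text_end_digits)  # ['7']
```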

Conclusion

This article covered some of the most basic building blocks of regular expressions using a story-telling approach. Regular expressions can be hard to understand at first, but once you develop an intuitive understanding of them, they can be learned easily.
