fbpx
Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Regular expression 101 – A beginner’s guide

Data Science Dojo
DSD staffer

April 28

A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about Its common uses in this regex 101 guide.

What is a regular expression?

A regular expression, or regex for short, is perhaps the most common thing that every data scientist must deal with multiple times in their career, and the frustration is natural because, at a vast majority of universities, this skill is not taught unless you have taken some hard-core Computer Science theory course. Even at times, trying to solve a problem using regular expression can cause multiple issues, which is summed beautifully in this meme:

regular expressions meme, Data Science Humor
Regular Expressions Meme

 

Making sense and using them can be quite daunting for beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are used to match strings in the text that we want to find. Consider the following scenario:

You are interested in counting the number of occurrences of Pandas in a journal related to endangered species to see how much focus is on this species. You write an algorithm that calculates the occurrences of the word ‘panda.’

 

Regular-Expressions infographic

 

However, as you might have noticed, your algorithm will miss the words ‘Panda’ and ‘Pandas.’ In this case, you might argue that a simple if-else condition will also count these words. But imagine, while converting the journal alphabets of every word is randomly capitalized or converted into a lower-case letter. Now, there are the following possibilities

  • PaNda
  • pAnDaS
  • PaNdA

And may more variations as well. Now you must write a lot of if-else conditions and even must write nested if-else conditions as well. What if I tell you that you can do this in one line of code using regular expressions? First, we need to learn some basics before coming back to solve the problem ourselves.

PRO TIP: Join our data science bootcamp program today to enhance your data science knowledge!

Square Brackets ([])

The name might sound scary, but it is nothing but the symbol: []. Some people also refer to square brackets as character class – a regular expression jargon word that means that it will match any character inside the bracket. For instance:

Pattern Matches
[Pp]enguin Penguin, penguin
[0123456789] (This will match any digit)
[0oO] 0, o, O

Disjunction (|)

The pipe symbol means nothing but either ‘A’ or ‘B’, and it is helpful in cases where you want to select multiple strings simultaneously. For instance:

Pattern Matches
A|B|C A, B, C
Black|White Black, White
[Bb]lack|[Ww]hite Black, black, White, white

Question Mark (?)

The question mark symbol means that the character it comes after is optional. For instance:

Pattern Matches
Ab?c Ac, Abc
Colou?r Color, Colour

Asterisk (*)

The asterisk symbol matches with 0 or more occurrences of the earlier character or group. For instance:

Pattern Matches
Sh* (0 or more of earlier character h)

S, Sh, Shh, Shhh.

(banana)* (0 or more of earlier banana. This will also match with nothing, but most regex engines will ignore it or give you a warning in that case)

banana, bananabanana, bananabananabanana.

Plus (+)

The plus symbol means to match with one or more occurrences of the earlier character or group. For instance:

Pattern Matches
Sh+ (1 or more of earlier character h)

Sh, Shh, Shhh.

(banana)+ (1 or more of the earlier banana)

banana, bananabanana, bananabananabanana.

Difference between Asterisk (*) and Plus(+)

The difference between the asterisk confuses many people; even the experts sometimes must look at the internet for their differences. However, there is an effortless way to remember the distinction between them.

Imagine you have a number 1, and you multiply it with 0:

1*0 = 0 or more occurrences of earlier character or group.

Now suppose that you have the same number 1, and you add it with 0:1+0 = 1 or more occurrences of an earlier character or group.

It is that simple when you try to understand things intuitively.

Negation (^)

Negation has two everyday use cases:

1. Inside square brackets, it will search for the negation of whatever is inside the brackets. For instance:

Pattern Matches
[^Aa] It will match with anything that is not A or a
[^0123456789] It will match anything that is not a digit

2. It can also be used as an anchor to search for expressions at the start of the line(s) only. For instance:

Pattern Matches
^Apple It will match with every Apple that is at the start of any line in the text
^(Apple|Banana) It will match with every Apple and Banana that is at the start of any line in the text

Dollar ($)

A dollar is used to search for expressions at the end of the line. For instance:

Pattern Matches
$[0123456789] It will match with any digit at the end of any line in the text.
$([Pp]anda) It will match with every Panda and panda at the end of any line in the text.

Conclusion

This article covered some of the very basic types of regular expression by using a story-telling approach. Regular expressions are indeed very hard to understand but if one develops an intuitive understanding then they can be easily learned.

DSD Sign
Written by DSD staffer
Interested in writing for us? Apply here: Submit your guest post with us
Newsletters | Data Science Dojo
Up for a Weekly Dose of Data Science?

Subscribe to our weekly newsletter & stay up-to-date with current data science news, blogs, and resources.

Data Science Dojo | data science for everyone

Discover more from Data Science Dojo

Subscribe to get the latest updates on AI, Data Science, LLMs, and Machine Learning.