For a hands-on learning experience to develop Agentic AI applications, join our Agentic AI Bootcamp today. Early Bird Discount

Blog Regular expression 101 – A beginner’s guide

Regular expression 101 – A beginner’s guide

Published June 10, 2022

Data Science

Data Science Dojo Staff

Want to Build AI agents that can reason, plan, and execute autonomously?

A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about Its common uses in this regex 101 guide.

What is a Regular Expression?

A regular expression, or regex for short, is perhaps the most common thing that every data scientist must deal with multiple times in their career, and the frustration is natural because, at a vast majority of universities, this skill is not taught unless you have taken some hard-core Computer Science theory course.

Dive deep into the debate of data science vs computer science

Even at times, trying to solve a problem using regular expression can cause multiple issues, which is summed beautifully in this meme:

regular expressions meme, Data Science Humor — Regular Expressions Meme

Making sense and using them can be quite daunting for beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are used to match strings in the text that we want to find. Consider the following scenario:

You are interested in counting the number of occurrences of Pandas in a journal related to endangered species to see how much focus is on this species. You write an algorithm that calculates the occurrences of the word ‘panda.’

However, as you might have noticed, your algorithm will miss the words ‘Panda’ and ‘Pandas.’ In this case, you might argue that a simple if-else condition will also count these words. But imagine, while converting the journal alphabets of every word is randomly capitalized or converted into a lower-case letter. Now, there are the following possibilities

PaNda
pAnDaS
PaNdA

And may more variations as well. Now you must write a lot of if-else conditions and even must write nested if-else conditions as well. What if I tell you that you can do this in one line of code using regular expressions? First, we need to learn some basics before coming back to solve the problem ourselves.

PRO TIP: Join our data science bootcamp today to enhance your data science knowledge!

Square Brackets ([])

The name might sound scary, but it is nothing but the symbol: []. Some people also refer to square brackets as character class – a regular expression jargon word that means that it will match any character inside the bracket. For instance:

Pattern	Matches
[Pp]enguin	Penguin, penguin
[0123456789]	(This will match any digit)
[0oO]	0, o, O

Disjunction (|)

The pipe symbol means nothing but either ‘A’ or ‘B’, and it is helpful in cases where you want to select multiple strings simultaneously. For instance:

Pattern	Matches
A\|B\|C	A, B, C
Black\|White	Black, White
[Bb]lack\|[Ww]hite	Black, black, White, white

Question Mark (?)

The question mark symbol means that the character it comes after is optional. For instance:

Pattern	Matches
Ab?c	Ac, Abc
Colou?r	Color, Colour

Asterisk (*)

The asterisk symbol matches with 0 or more occurrences of the earlier character or group. For instance:

Pattern

Matches

Sh*

(0 or more of earlier character h)

S, Sh, Shh, Shhh.

(banana)*

(0 or more of earlier banana. This will also match with nothing, but most regex engines will ignore it or give you a warning in that case)

banana, bananabanana, bananabananabanana.

Plus (+)

The plus symbol means to match with one or more occurrences of the earlier character or group. For instance:

Pattern

Matches

Sh+

(1 or more of earlier character h)

Sh, Shh, Shhh.

(banana)+

(1 or more of the earlier banana)

banana, bananabanana, bananabananabanana.

Difference between Asterisk (*) and Plus(+)

The difference between the asterisk confuses many people; even the experts sometimes must look at the internet for their differences. However, there is an effortless way to remember the distinction between them.

Imagine you have a number 1, and you multiply it with 0:

1*0 = 0 or more occurrences of earlier character or group.

Now suppose that you have the same number 1, and you add it with 0:1+0 = 1 or more occurrences of an earlier character or group.

It is that simple when you try to understand things intuitively.

Negation (^)

Negation has two everyday use cases:

1. Inside square brackets, it will search for the negation of whatever is inside the brackets. For instance:

Pattern	Matches
[^Aa]	It will match with anything that is not A or a
[^0123456789]	It will match anything that is not a digit

2. It can also be used as an anchor to search for expressions at the start of the line(s) only. For instance:

Pattern	Matches
^Apple	It will match with every Apple that is at the start of any line in the text
^(Apple\|Banana)	It will match with every Apple and Banana that is at the start of any line in the text

Dollar ($)

A dollar is used to search for expressions at the end of the line. For instance:

Pattern	Matches
$[0123456789]	It will match with any digit at the end of any line in the text.
$([Pp]anda)	It will match with every Panda and panda at the end of any line in the text.

Conclusion

This article covered some of the very basic types of regular expression by using a story-telling approach. Regular expressions are indeed very hard to understand but if one develops an intuitive understanding then they can be easily learned.

Subscribe to our newsletter

Monthly curated AI content, Data Science Dojo updates, and more.

Bootcamps

Courses

Case Studies

Reviews

Consulting

Case studies

Community

Company