Regex cheat sheet for Python

I am writing a program that will find all the sex and sires in a raw data file and get rid of characters we don’t need anymore.

Character Classes

Character classes are a small group of shorthand codes that you write with the escape character \. Each one is a way of catching every character of a certain type.

For instance, instead of going [A-Za-z0-9_] you can just do \w and that will catch all the different word characters. There are a handful of them:

Escape character    Use                    Example
\w                  Word characters        a-z, A-Z, 0-9, _
\W                  Non-word characters    !@#$%^&*()
\d                  Digits                 0-9
\D                  Non-digits             Anything that is not 0-9
\s                  Whitespace             space, \t, \n, \r
\S                  Non-whitespace         Opposite of above
\b                  Word boundary          The position right before the first or right after the last letter of a word
\B                  Non-word boundary      Opposite of above: any position that is not at the edge of a word

The lowercase and uppercase versions are exact opposites. \d finds all digits (0-9) and \D finds everything that is not a digit.
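Here is a small sketch of those classes in action using Python's built-in re module. The sample string is made up for illustration. The patterns are written as raw strings (r"...") so Python passes the backslashes through to the regex engine untouched.

```python
import re

# Made-up sample text, only for illustration
sample = "cat concat 42 cats_7"

digits = re.findall(r"\d", sample)               # every single digit
words = re.findall(r"\w+", sample)               # runs of word characters
spaces = re.findall(r"\s", sample)               # each whitespace character
starts_with_cat = re.findall(r"\bcat", sample)   # "cat" at the start of a word
inside_word = re.findall(r"\Bcat", sample)       # "cat" that is NOT at the start of a word
whole_word = re.findall(r"\bcat\b", sample)      # "cat" as a complete word

print(digits)           # ['4', '2', '7']
print(words)            # ['cat', 'concat', '42', 'cats_7']
print(starts_with_cat)  # ['cat', 'cat']  (from "cat" and "cats_7")
print(inside_word)      # ['cat']         (from "concat")
print(whole_word)       # ['cat']
```

Note that \b and \B never match a character themselves. They only check the position between two characters.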

Special characters

Special characters are where things get helpful. I don’t want to type out \w\w\w\w to find every 4-letter word. Seems silly.

Special character   Use                                                                  Example
^                   Beginning of a string                                                ^\d finds a digit at the start of a string
$                   End of a string; matches the thing to its left only at the very end  \w$ finds when I don’t end a line with punctuation
.                   Any character besides \n                                             a.c finds abc, a-c, a c
\                   Escape character                                                     \. finds a literal period
|                   OR                                                                   A|B finds A OR B
{N}                 Exact number of times the thing to its left needs to be found        \w{4} finds a word character 4 times in a row
{N,}                Same as above, but N or more times                                   \w{4,} finds 4 or more word characters in a row
{N,X}               Finds the thing to its left N through X times                        \w{4,6} finds 4 to 6 word characters in a row
*                   Finds 0 or more times                                                \w* finds a word character 0 or more times
+                   Finds 1 or more times                                                \d+ finds a digit 1 or more times
?                   Makes the thing to its left optional (0 or 1 times), like the () around an area code: sometimes they’re there, other times they aren’t    \(?\d{3}\)? finds the area code with optional ()
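A quick sketch of the special characters from the table, again with Python's re module. The strings and the phone numbers are made up for illustration. The \b on each side of the word patterns is my addition so that \w{4} only counts whole words instead of grabbing the first 4 characters of a longer one.

```python
import re

# Made-up sample text, only for illustration
text = "Call before noon today"

exactly_four = re.findall(r"\b\w{4}\b", text)     # words with exactly 4 characters
four_or_more = re.findall(r"\b\w{4,}\b", text)    # words with 4 or more characters
four_to_five = re.findall(r"\b\w{4,5}\b", text)   # words with 4 to 5 characters

print(exactly_four)  # ['Call', 'noon']
print(four_or_more)  # ['Call', 'before', 'noon', 'today']
print(four_to_five)  # ['Call', 'noon', 'today']

# ^ pins the match to the start, ? makes each parenthesis optional
area_code = re.compile(r"^\(?\d{3}\)?")
with_parens = area_code.search("(555) 123-4567").group()
without_parens = area_code.search("555 123-4567").group()

print(with_parens)     # (555)
print(without_parens)  # 555

# $ pins the match to the end: does the line end in a word character?
ends_clean = re.search(r"\w$", "no full stop here") is not None
ends_punct = re.search(r"\w$", "ends with a full stop.") is not None

print(ends_clean)  # True
print(ends_punct)  # False

# | means OR
either = re.findall(r"A|B", "ABCABC")
print(either)  # ['A', 'B', 'A', 'B']
```

Without the ^ in front, \(?\d{3}\)? would happily match any 3 digits in a row, including the first 3 digits of 4567, so anchoring matters when you only want the area code.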

Sets

Sets use brackets and are used for groups of characters. A set matches one character out of the group. An example would be [A-Z]. That would be any single uppercase letter A-Z.

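You also don’t need to repeat letters inside a set. Listing a letter once is enough, no matter how many times it shows up in the text. For instance, if I wanted a set that covers the lowercase letters and the space in “James Carney” it would look like this:

[jamescrny\s]

Carney looks a little weird because it has some letters that appear in James so they don’t need to be repeated. I also added \s because there is a space between James and Carney. Keep in mind that the set only matches one character at a time, so it needs a + after it to match a run of them. It is also case sensitive (the capital J and C are not in it), and it will match any mix of those letters, not just the name. To find the exact name, search for the literal text James Carney instead.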
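To see what that set really does, here is a short check in Python. The second test string ("sesame marry") is just something I made up to show that the set matches any mix of its letters.

```python
import re

name = "James Carney"

# On its own the set matches ONE character at a time (and skips the capital J and C)
one_at_a_time = re.findall(r"[jamescrny\s]", name)
print(one_at_a_time)  # ['a', 'm', 'e', 's', ' ', 'a', 'r', 'n', 'e', 'y']

# Adding + matches runs of those characters
runs = re.findall(r"[jamescrny\s]+", name)
print(runs)  # ['ames ', 'arney']

# Ignoring case lets the capitals through, so the whole name comes back
ignore_case = re.findall(r"[jamescrny\s]+", name, re.IGNORECASE)
print(ignore_case)  # ['James Carney']

# But the set is not specific to the name: any mix of those letters matches
also_matches = re.fullmatch(r"[jamescrny\s]+", "sesame marry") is not None
print(also_matches)  # True

# For the exact name, a literal pattern is the safer choice
exact = re.search(r"James Carney", name).group()
print(exact)  # James Carney
```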

Set layout          Purpose
[ ]                 Regular set. Finds any one of the characters between the brackets
[milk]              Finds any one of those letters (m, i, l or k), but not the whole word in one go
[a-z]               Finds one lowercase letter a-z
[-a]                Finds - or ‘a’; the - is treated as a plain character since it is at the beginning of the set
[^anything]         The ^ at the start excludes every character in the brackets
[ ]?  [ ]*  [ ]+    A set followed by a special character; the special character determines how many characters you’re looking for
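Here is a rough sketch of the set layouts above, plus one way a negated set could be used to get rid of characters we don't need. The raw string and the list of characters to keep are assumptions made up for this example, not the real data file.

```python
import re

# [milk] matches any ONE of m, i, l, k at a time
print(re.findall(r"[milk]", "silk"))          # ['i', 'l', 'k']

# [a-z] is a range; + turns single characters into runs
print(re.findall(r"[a-z]+", "Hello World"))   # ['ello', 'orld']

# A - at the beginning of the set is just a plain hyphen
print(re.findall(r"[-a]", "a-b"))             # ['a', '-']

# ^ at the start of the set flips it: everything EXCEPT a-z
print(re.findall(r"[^a-z]", "ab1 c"))         # ['1', ' ']

# Hypothetical raw line, invented for illustration
raw = "sire: #A-17*; sex: M!!"

# Keep word characters, whitespace, colon, semicolon and hyphen; drop the rest
cleaned = re.sub(r"[^\w\s:;-]", "", raw)
print(cleaned)  # sire: A-17; sex: M
```

The hyphen sits at the very end of the negated set so it is read as a plain character and not as a range.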


