Regex cheat sheet for Python
I am writing a program that will find all the sex and sires in a raw data file and get rid of characters we don’t need anymore.
Character Classes
Character classes are shorthand escape sequences: a backslash \ followed by a letter. Each one catches every character of a certain type.
For instance, instead of typing [A-Za-z0-9_] you can just use \w, which catches all the word characters. There are a handful of them:
| Escape character | Use | Example |
|---|---|---|
| \w | Word characters | a-z, A-Z, 0-9, _ |
| \W | Non-word characters | !@#$%^&*() |
| \d | Digits | 0-9 |
| \D | Non-digits | Anything except 0-9 |
| \s | Whitespace | space, \t, \n, \r |
| \S | Non-whitespace | Anything except the above |
| \b | Word boundary | The position at the start or end of a word; it matches no character itself |
| \B | Non-word boundary | Any position that is not at the start or end of a word |
The lowercase and uppercase versions of each class are exact opposites. \d finds all digits (0-9) and \D finds everything that is not a digit.
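Here is a minimal sketch of the classes above using Python's built-in re module. The sample text is made up for illustration; the real raw data file may look different.

```python
import re

# Made-up sample line; the real raw data file may look different.
text = "Lot 42: sire_7 born 2021\tOK!"

# \d matches one digit at a time.
digits = re.findall(r"\d", text)

# \w+ grabs runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \s matches whitespace, including the tab.
spaces = re.findall(r"\s", text)

# \D is the opposite of \d, so removing every \D leaves only the digits.
digits_only = re.sub(r"\D", "", text)

# \b matches a position, not a character, so it pins "sire" to a whole word.
whole_word = re.findall(r"\bsire\b", "sire sires desire")

print(digits, words, spaces, digits_only, whole_word)
```

Note that the patterns are written as raw strings (r"..."), so Python does not try to interpret the backslashes itself.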
Special characters
Special characters are where things get helpful. I don’t want to type out \w\w\w\w to find every 4 letter word. Seems silly.
| Special Character | Use | Example |
|---|---|---|
| ^ | Beginning of a string | ^\d matches a digit at the start of the string |
| $ | End of a string; whatever is to its left must sit at the very end | \w$ finds when I don’t end a line with punctuation |
| . | Any character except \n | a.c matches abc, a-c, and a c |
| \ | Escape character; makes a special character literal | \. matches a literal period |
| \| | OR | A\|B matches A or B |
| {N} | Exact number of times the thing to its left needs to be found | \w{4} Finds a word character 4 times in a row |
| {N,} | Same as above, but N or more times | \w{4,} Finds 4 or more word characters in a row |
| {N,X} | Finds the thing to its left N through X times | \w{4,6} Finds runs of 4 to 6 word characters |
| * | Finds something 0 or more times | \w* Finds a word character 0 or more times |
| + | Finds something 1 or more times | \d+ Finds a digit 1 or more times |
| ? | Makes the thing to its left optional (0 or 1 times). Like the () around the area code: sometimes they’re there, other times they aren’t | \(?\d{3}\)? Finds the area code with optional () |
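The special characters above can be tried out with a short sketch like this one. All of the sample strings are made up for illustration. One thing worth noticing: \w{4} on its own also matches the first four characters of a longer word, so the sketch wraps it in \b to catch only whole 4 letter words.

```python
import re

# {N}: exactly four word characters, pinned to whole words with \b.
four_letter = re.findall(r"\b\w{4}\b", "that is a long word list")

# {N,X}: whole words that are 4 to 6 characters long.
four_to_six = re.findall(r"\b\w{4,6}\b", "cat horse elephant sheep goat")

# |: match either alternative.
either = re.findall(r"sire|dam", "sire: Max, dam: Bella")

# ^ and $: anchor to the start and end of the string.
starts_with_digit = re.search(r"^\d", "42 head") is not None
no_end_punct = re.search(r"\w$", "no punctuation") is not None
has_end_punct = re.search(r"\w$", "ends with a period.") is not None

# * accepts zero digits, + needs at least one.
star_matches_empty = re.fullmatch(r"\d*", "") is not None
plus_matches_empty = re.fullmatch(r"\d+", "") is not None

# ?: the parentheses around the area code are optional.
phones = re.findall(r"\(?\d{3}\)? \d{3}-\d{4}",
                    "(555) 123-4567 or 555 987-6543")

print(four_letter, four_to_six, either, phones)
```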
Sets
Sets use brackets and match any one character out of a group of characters. An example would be [A-Z]. That would match any single uppercase letter A-Z.
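A set matches one character at a time, so letters never need to be repeated inside it; add + after the set to match one or more characters in a row. For instance, if I wanted to find “James Carney” it would look like this:
[JamesCrny\s]+
Carney looks a little weird because it shares some letters with James, so they aren’t repeated. The capital J and C are listed because sets are case-sensitive, and \s covers the space between James and Carney. Keep in mind that this matches any mix of those characters in any order, not only the exact name.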
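As a rough sketch of how a name set behaves in Python: a set on its own matches a single character, a set with + matches a run of allowed characters in any order, and a literal pattern is what matches the exact name. The variable names here are just for illustration.

```python
import re

# A set with no quantifier matches one character. The lowercase set skips
# the capital "J", so the first hit in "James Carney" is the "a".
single = re.search(r"[jamescrny\s]", "James Carney").group()

# With capitals in the set and a +, the whole name matches...
loose = re.compile(r"[JamesCrny\s]+")
matches_name = loose.fullmatch("James Carney") is not None

# ...but so does any other arrangement of the same characters.
matches_swapped = loose.fullmatch("Carney James") is not None

# A literal pattern only accepts the exact name.
exact = re.compile(r"James\s+Carney")
exact_name = exact.fullmatch("James Carney") is not None
exact_swapped = exact.fullmatch("Carney James") is not None

print(single, matches_name, matches_swapped, exact_name, exact_swapped)
```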
| Set layout | Purpose |
|---|---|
| [ ] | Regular set. Matches any one character listed between the brackets |
| [milk] | Matches any one of the letters m, i, l, or k, not the whole word “milk” |
| [a-z] | Matches one lowercase letter a-z |
| [-a] | Matches - or ‘a’; the - is literal because it is at the beginning of the set |
| [^anything] | A ^ at the start of the set excludes every character in the brackets |
| []?*+ | A set followed by ?, *, or +; the special character determines how many characters from the set you’re looking for |
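Each row of the table can be checked with a short sketch like the one below. The sample strings are made up, and the last line is only one possible way to strip unwanted characters, assuming you want to keep letters, digits, and spaces.

```python
import re

# [milk] matches one letter at a time, not the word "milk".
milk = re.findall(r"[milk]", "silk")

# [a-z]+ matches runs of lowercase letters, so the capitals are skipped.
lower = re.findall(r"[a-z]+", "Hello World")

# A - at the start of a set is a literal hyphen.
dash_or_a = re.findall(r"[-a]", "a-b-c")

# ^ at the start of a set means "anything except these".
not_digits = re.findall(r"[^0-9]", "a1b2")

# A set followed by + matches one or more characters from the set.
numbers = re.findall(r"[0-9]+", "tag 12 and 3456")

# Assumed cleanup rule: drop everything except letters, digits, and spaces.
cleaned = re.sub(r"[^A-Za-z0-9 ]", "", "sire: Max!")

print(milk, lower, dash_or_a, not_digits, numbers, cleaned)
```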